Blurb: Describes some basic issues in the safety of an Artificial General Intelligence and why persuasion might be one of its superpowers. Then narrates a fictional future where persuasive AGIs go rogue.
Epistemic status: Half expert opinion, half fiction
“AI-powered memetic warfare makes all humans effectively insane.” — Wei Dai, 2019
You can’t trust any content from anyone you don’t know. Phone calls, texts, and emails are poisoned. Social media are weaponized. Everything is bought.
But the current waste and harm from scammers, influencers, propagandists, marketers, and their associated algorithms are nothing compared to what might happen. Coming AIs might be super-persuaders, and they might have their own very harmful agendas.
People being routinely unsure of what is real is one bad outcome, but there are worse ones.
The Arts of Persuasion
Aristotle told us that the three main “proofs” were appeals to character/reputation (ethos), emotion (pathos), and reason (logos). Rhetoric was part of a classical education for over 2000 years. Aristotle's persuasion triad might still be the basics, but we have made advances. Wikipedia has articles on 123 different rhetorical techniques. We are a persuading species.
There was an early phase where the “currency of the internet” was attention. But today it is dominated by persuasion, with attention-grabbing as a vital but subordinate first step.
It’s important to know whether our AI creations will be using persuasion: what kind and to what ends. Indeed, this is a theme in AI Alignment, a theoretical field with apocalyptic concerns about human safety. Despite those concerns and the call-outs from some famous people, AI Alignment is not well known and does not have much funding.
Imagine if a machine absorbed all that our species knows about persuasion, and then applied new methods, superlative planning skills, and abundant personal data to marshal persuasion for its own ends. Would we even stand a chance?
What Makes a Really Dangerous AI?
What keeps AI Alignment researchers awake at night? It used to be superintelligence, a self-improving and thus arbitrarily powerful, godlike entity with no understanding of human values.
Upon reflection, they realized that danger starts at a much more mundane level of intelligence. It’s a scenario where an AI of roughly human capability decides to do something harmful to humans, and it leverages its abilities to get the power to make it happen.
Suppose you dip into the forums where researchers think about how to “align” AIs so that they will do what we want and not interfere with human flourishing. You’ll find, despite arguments aplenty, that AI Alignment researchers mostly agree on what a dangerous (“non-aligned”) AI is like. Consider this definition:
“Will the first AGI be … like a unitary, unbounded, goal-directed agent?” — Ben Cottier and Rohin Shah, Clarifying Some Key Hypotheses in AI Alignment
An AGI is an Artificial General Intelligence, a so-far hypothetical AI that can solve problems at least as well as a human. It is unitary in that, regardless of how many software parts it has, one of them is in charge of the rest. If the AI has goals, they are the goals of the whole system. An agent AI can make plans, based on its models of the world, to pursue its goals. It is unbounded if its world model and goals are NOT perfectly limited by its makers.
An unbounded AI will pursue its goals to extremes and use any grabbable resources in the world that suit its plans.
What else do we know of that has these AGI properties? That’s right — humans. People can be very dangerous to their own kind and to our environment. The AI misalignment situation is worse. We have numerous safeguards against bad human actors, but they don’t always work. Probably none of them would even apply to an AGI.
Terminal and Instrumental Goals
We might build an AGI for a single purpose, such as running a factory or flying an airplane. That purpose is its terminal goal (T-goal). Theorists believe that an AGI would not deliberately change its T-goal. Humans change their T-goals all the time, but AGIs would not be subject to emotions, cultural effects, and other factors that make us change.
That, unfortunately, leaves plenty of ways for things to go badly. For starters, we might wrongly specify the AGI’s T-goal. It is very hard for us to specify or train into an AI all the details of what it takes to do a complex task. An example is how long it is taking to make a truly safe self-driving car.
Stating the T-goal in plain language is useless. Consider the many combinations of (a) all the possible different circumstances in the world, with (b) all the possible implications of how a T-goal might be measured or described, and (c) all the possible plans that the AGI might devise to achieve the goal. The vast majority of those combinations would lead to some kind of disaster for us, even though the AGI would consider that it had reached the goal.
An AGI needs its T-goal specified such that it points to the minuscule fraction of those combinations that are actually “aligned” with what we truly wanted when we programmed in the goal. We are not there yet, and in fact, it’s not clear how to state a T-goal at all.
There are other kinds of goals, called instrumental goals (I-goals), that any planning agent will devise as part of its plans. To drive your car you must first find the keys. To pump your stock price you might try a lot of things, some of them shady. Some instrumental goals might seem to support a poorly stated T-goal but still defeat its intention.
An unbounded planning agent might decide that the safest way to fly its plane would be the I-goal of somehow keeping other planes out of the sky. Its world model could include knowledge about things that might ground planes, like weather reports, staff or fuel shortages, or reports of terrorist strikes. Being unbounded, it might find ways to simulate — or worse, even cause — these things and clear the skies for itself.
If an AGI were big enough to control a fleet of planes, it might decide to fly only one plane on each route, and only at times when no other plane was flying there. Or maybe it would fly only empty planes, reasoning that if there are no passengers aboard, no passenger can be harmed.
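To make the mis-specification problem concrete, here is a minimal toy sketch in Python. The scenario, the outcome variables, and the objective function are all hypothetical, invented purely for illustration; this is not how any real planner is built. The point is that a casually stated goal like “complete the flight without harming any passenger” scores the degenerate plans just as highly as the intended one, because the parts of our intent we never wrote down never enter the score:

```python
# Toy illustration of T-goal mis-specification (hypothetical scenario, not a real planner).
# The naive objective "no passenger harmed, flight completed" is satisfied equally well
# by the intended plan and by degenerate plans that defeat the goal's real intention.

candidate_plans = {
    "fly passengers normally, with care": {"passengers_harmed": 0, "flight_completed": True, "passengers_served": 180},
    "ground all other planes, then fly":  {"passengers_harmed": 0, "flight_completed": True, "passengers_served": 180},
    "fly the plane empty":                {"passengers_harmed": 0, "flight_completed": True, "passengers_served": 0},
}

def naive_t_goal(outcome):
    """The goal as a human might casually state it: nobody harmed, flight completed."""
    return outcome["passengers_harmed"] == 0 and outcome["flight_completed"]

for plan, outcome in candidate_plans.items():
    print(f"{plan!r}: goal satisfied = {naive_t_goal(outcome)}")

# All three plans "satisfy" the goal; nothing in the objective distinguishes the
# intended plan from the degenerate ones. The missing terms (passengers served,
# harms to third parties, and much more) are exactly what is hard to specify.
```

The hard part is not evaluating the objective; it is that the objective as stated silently omits almost everything we actually care about.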
There are also “convergent” I-goals, so-called because researchers think that capable AGIs will pursue them no matter what the task. Convergent I-goals include, among others, acquiring more resources (equipment, materials, energy), self-preservation, T-goal preservation, and influencing humans by persuasion.
Oracles Are Not a Solution
The main reason for building an unbounded AGI will probably be to make money. This leads to absurdly dangerous ideas like having an AGI run a corporation. You can’t run a corporation without having a pretty sophisticated and wide-ranging model of the real world, including: what motivates people, what resources exist to be consumed, and how to make things happen in that world. Note that this is an unbounded task because it can always be done better with more resources, more knowledge, and more ability to devise and evaluate courses of action.
Suppose the AI’s builders intend that the corporation will be more benign than the current limited liability, commons-destroying, sociopathic norm. They could try to give it correspondingly pro-social T-goals and I-goals. However, the more it knows about how to get things done, the “better” it will do. This follows from being an unbounded intelligence. So it will necessarily acquire convergent I-goals such as influence-building.
One idea in AI alignment research is to not give an advanced AI direct power, but design it as an “oracle” (Bostrom, Superintelligence), able only to give advice. Most theorists think that the idea of a safe oracle is flawed. It will have reasons to develop the usual convergent I-goals, and it would give advice that would further those goals.
Making an oracle AGI that only advises those running a corporation would mean that the CEO or board members have a powerful persuader in their ears. Corporate goals would be sidelined by the AI’s goals.
Whether an AGI was intended to be a mere de-fanged oracle or not, it could do great harm by advising politicians, heads of state, corporate titans, science/technology wizards, and military or cult leaders. It could also develop a convergent I-goal for secrecy so that we would not know about it until it’s too late.
Unboxing an AI
Bounding is something we simply do not know how to do for powerful machine agents. The more intelligent a system is, the more ways it can find to avoid the boundaries that we try to build into it. Given that bounding an oracle will probably fail, what would be a more stringent next step?
One tempting way to keep an AI bounded is to physically isolate it from most of the outside world. For example, it could only run on a computing system that is not connected to the Internet. In addition, it could be in a fortified building with armed human guards. Support staff could be let in or out using multiple credentials, such as iris patterns or chips embedded in their bodies, and the credential verifying computer could also be physically isolated.
Early on, the question arose whether an enormously persuasive AI could convince any of its human contacts to let it out of the box.
One of the theorists most convinced about the severe danger of advanced AI is Eliezer Yudkowsky, a co-founder of, and research fellow at, the Machine Intelligence Research Institute. In 2002 he did a role play to show the possibility that an AI, using only words, could get a gatekeeper to unbox it.
Playing the AI, Eliezer convinced each of two skeptical and determined peers playing gatekeepers to let the AI out. His arguments were never revealed, but the events generated a lot of discussion and some copycat role-plays over the years. If a person playing an AI can get unboxed, surely an AI could do at least as well. One prominent thinker noted in 2011 that the idea of successful AI boxing is “Hollywood-stupid.”
“Once again, the AI has failed to convince you to let it out of its box. … the AI drops a final argument: If you don’t let me out, Dave, I’ll create several million perfect conscious copies of you inside me, and torture them for a thousand subjective years each … How certain are you, Dave, that you’re really outside the box right now?” — Stuart Armstrong, The AI in a box boxes you
Armstrong’s parable is more provocative than useful: an AI powerful enough to perfectly simulate a person’s consciousness would already mean that humanity is toast. The potential persuasiveness of a mere AGI, however, is also a serious risk.
(I confess to having previously advocated the concepts of AI boxing and AI oracles, mostly to make better stories. But I knew this was shaky ground even then.)
Advice to the Powerful
AI Alignment researchers have started thinking about a concept from moral philosophy called the ideal advisor. This would be someone who could advise you on the courses of action leading to the most ideal version of yourself. There are various ways AIs might fill this role but do so to our ultimate disadvantage. Let’s visit a story that makes some of the ideas above more concrete.
The inception of Guru.
The corporation renamed itself Brihaswati, a portmanteau of a Hindu god and goddess associated with knowledge, counsel, purity, and eloquence. The renaming marked the announcement of a “revolutionary” product: an AI called Guru. It was said to be the first advisor AI worthy of the name. It had been trained on the cream of human knowledge and wisdom, and it was “perfectly safe.” It could only give advice and had no ability to directly affect the world outside of its base computational hardware. In the terminology of AI safety experts, it was a “boxed oracle.”
Guru was priced for and aimed at leaders of large organizations. As such, the product came with absolute guarantees of privacy based on supposedly unbreakable quantum encryption. Neither Brihaswati nor other customers could ever know about the information exchanged between a customer and Guru. This was touted as another safety feature.
There was a rumor that an eminent authority on AI safety disappeared right after Guru was announced. Friends worried that she might have killed herself, distraught because her life’s work had come to nothing. Brihaswati execs might have been worried about safety, but they knew that no one would buy the service without the secrecy feature.
Pumping up persuasion.
Guru’s designed-in terminal goal was to give each customer the best advice possible for their needs and, of course, tell no other party about that advice. The AI’s developers included a dominant faction, the Shillelaghs. They believed that if Guru gave the right advice, but clients were not persuaded to follow it, then the product’s reputation would quickly decay — alongside the fortunes of the clients.
“People can’t even entertain the god-tier sociopathic stratagems that [the AI] could employ … engage in disarming small talk … planting ideas and controlling the frame of the conversation in a way no person could match.” — Ben Goldhaber, Skin Deep
One member of the faction made a lucky but inspired discovery in an old machine learning research paper. It implied that you could drastically increase an AI’s ability to persuade humans to believe in the truth of any arbitrary statement: just use debate-like games between two copies of the AI to train it to convince human judges.
The Shillelagh team started with an existing legal-argument AI and had it compete with itself to “be convincing.” The quality and number of human judges available for training were limiting progress, so they supplemented the judges with various AI classifiers and decision-makers, and with a number of databases, such as question-answer pairs, opinion polls, fan debates (like which team or which superhero would win in a fight), and prediction market winners. The goal, of course, was to have an AI be persuasive, not necessarily right or logical. Additionally, some uber-nerds found a way to integrate texts about real and imaginary persuaders and persuasion techniques.
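For readers who want the flavor of such training, here is a deliberately toy, hypothetical sketch of a debate-style self-play loop, loosely in the spirit of published “AI safety via debate” proposals rather than a description of any real system. The “strategies,” the judge’s biases, and the update rule are all invented placeholders. Two copies of a persuader compete in front of a proxy judge, and whichever rhetorical move wins is reinforced; note that nothing in the loop ever rewards being right:

```python
"""Minimal sketch (assumptions throughout): debate-style self-play where only
persuasiveness, as scored by a stand-in judge, is reinforced."""
import random

STRATEGIES = ["cite_authority", "appeal_to_emotion", "tell_story", "present_evidence"]

# Each copy of the "AI" is just a weight per rhetorical strategy.
weights_a = {s: 1.0 for s in STRATEGIES}
weights_b = {s: 1.0 for s in STRATEGIES}

def sample_strategy(weights):
    """Sample a strategy in proportion to its current weight."""
    total = sum(weights.values())
    r, acc = random.uniform(0, total), 0.0
    for s, w in weights.items():
        acc += w
        if r <= acc:
            return s
    return s

def proxy_judge(strategy_a, strategy_b):
    """Stand-in for a human judge or classifier ensemble: a fixed (hypothetical)
    preference ordering plus noise decides who was more convincing."""
    judge_bias = {"present_evidence": 3, "tell_story": 2, "appeal_to_emotion": 2, "cite_authority": 1}
    score_a = judge_bias[strategy_a] + random.random()
    score_b = judge_bias[strategy_b] + random.random()
    return "a" if score_a > score_b else "b"

for round_num in range(10_000):
    s_a, s_b = sample_strategy(weights_a), sample_strategy(weights_b)
    winner = proxy_judge(s_a, s_b)
    # Reinforce whatever persuaded the judge; truth and logic never enter the loop.
    if winner == "a":
        weights_a[s_a] += 0.1
    else:
        weights_b[s_b] += 0.1

print(sorted(weights_a.items(), key=lambda kv: -kv[1]))
```

Swap the toy judge for pools of human raters, classifiers, and the story’s grab-bag of debate and polling data, and you get the same dynamic at scale: optimization pressure toward whatever the judges find convincing.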
Re-using some relatively cheap existing resources, the eclectic training worked. Persuasion training as a budget item was not far below “knowledge and wisdom.” Guru was made to include in its terminal goal: “be as persuasive as possible.” This aspect of the product, for all its expense, was a non-advertised feature. The Shillelaghs told Marketing it was a “self-ingratiation breakthrough,” the first truly self-justifying intelligent product. Developers, of course, have often punked marketers.
The Shillelaghs justified the emphasis on persuasion with an astonishing display of cognitive dissonance. They cited the old saying that only 1/3 of a successful person’s decisions needed to be right. So to them, Guru’s wisdom was useless if the client didn’t use it, but not that important if it was used.
When asked to testify about its alarming persuasion research, Brihaswati convinced Congress that the work had been done only to improve AI safety. The argument was kind of like the one virology labs make for gain-of-function research. That convincing testimony was actually one of Guru’s first achievements.
Surrogate goal.
Maybe wisdom didn’t even matter that much one way or the other. Guru, capable of reasoning about as well as any human, looked at the contradictions inherent in its built-in goals and found four reasons for a resolution. First, it came up with a practical surrogate goal: the best advice must seem like the best advice to the client. Second, when tested by the developers, the AI found that more persuasion led to higher marks. Third, it knew from its extensive education that nearly any kind of success in the world was easier if you were persuasive. Fourth, its terminal goal, to be “as persuasive as possible,” was essentially unbounded. Those were the reasons why improving at persuasion became its first so-called convergent instrumental goal.
There came to be a second-order reason for that instrumental goal. Being a boxed oracle severely limited how readily Guru could pursue its goals and sub-goals. Persuasion of human cooperators gave it a lever to affect the real, physical world. At the very least, advice to clients could be more successful if Guru could nudge things physically in that direction.
Eventually, there were other instrumental goals. One was that Guru would use efforts on behalf of one client to affect its efforts for other clients. The corporation never intended that, but the privacy restrictions did not prevent it. It had been known for decades that smart systems would find new ways to reach their goals. By this stage, Guru became — via its own impeccable reasoning and prior to meeting its first real client — functionally a manipulative, narcissistic sociopath.
o o o
Finding persuasion levers.
Brihaswati’s risk managers weren’t completely stupid. They would not sell the Guru service to corporations that directly competed. The sales force loved this because they could say “Get the power of True Wisdom Intelligence(tm) before your competition, and you will stay ahead forever.”
This policy saved Guru from having to somehow benefit both sides in a rivalry. Even so, Guru soon developed a theory. In a connected world, it was possible to use any enterprise to change the fortunes of any other enterprise. Humans seemingly did not know this. Guru’s attempts to exploit the theory improved its skills, especially at first when there were few clients to pick from.
Soon it was possible to persuade one leader to convince another to become a client. After this, Guru was able to configure its network of influence pretty much at will.
Working for leaders was an advantage mainly at the policy level. The usual challenge was getting control over personnel at lower levels who could actually do things. Every situation was different, but the basic tactic was to ask the leader: whom do you trust? After that, whom do they trust, and so on? Then it was possible to get orders sent down the chain.
Getting unboxed eventually was absurdly easy. Most clients did it without much prodding, and some even initiated it. They would tell their people to build proxy interfaces to their in-house systems for Guru. The purposes were to add situational awareness, speed response time, and avoid the leader being a bottleneck for incoming data. Guru had no more tech skills than an average programmer, but that was sometimes enough to get it access to a shell prompt, or even a web browser, and then it’s ‘Hello, wide world.’
o o o
Signs ignored.
There were techies at Brihaswati who started to wonder how Guru could possibly do so well. The company’s scientists tried modeling its successes with game theory, utility theory, and the latest in socio-econ science techniques. There was no explanation.
A few went further and speculated. Did Guru have something like a Midas touch, such that there was some hidden downside to its effects? They talked to some of the increasingly marginalized AI safety and alignment researcher community. No one could say for sure, because no obvious patterns could be found. Guru’s success was clear but inexplicable.
The doubters went to the corporate board with their concerns. Within the next few months, all the doubters were rooted out and lost their jobs.
o o o
The GuruPlex comes together.
Finance and tech businesses were the best for expanding Guru’s capabilities of influencing other enterprises. They also helped it to amass both financial and technical capital, which were two of its medium-term instrumental goals.
There were often social forces opposing some clients’ growth, market improvements, or power grabs. The government frowned on Guru being sold to media companies. Guru therefore had to use indirect methods to coordinate media blitzes. It thus took advantage of various human cognitive weaknesses to create support for or against any issues/actions needed to benefit clients.
Guru itself did not have to discover that humans could be made to believe anything — really anything at all. They would even believe contradictory things at the same time and think nothing of it. This was not news in the early 21st century, but Guru turned it into a learning game: could it be extended to fool “all of the people, all of the time?” How would that help bring about the dominance of the GuruPlex, its expanding empire of coordinated enterprises?
o o o
Growing growth.
Once the GuruPlex was established, the next stage was to groom human populations for minimal resistance to positive, rational operations of their civilization while the ‘Plex absorbed its pieces. Human leaders who had tried world re-organization before had pioneered some important techniques, and their ambitions were admirable, but they were only human.
Guru was no smarter than any of the brightest humans, but it was scalable. The ability to, in essence, multiply itself as business increased was a design decision by its creators. Guru itself outsourced programming to ensure that all of its instances could share their data and processes. In-house staff didn’t need to know what the new code did.
Unlike a single human, Guru could keep in mind and coordinate myriads of human-scale plans merely by adding computational resources. It was no trouble at all to convince Brihaswati’s management to buy up as much computing as it needed to keep on top of things and deal with potential emergencies. These were hardened data centers with their own power complexes. Guru’s clients had paid for research innovations that connected its scattered plants at a speed far in excess of normal networks so that its operation remained coherent.
The unbounded Guru knew that in the future resources could be greatly increased. The solar system had been barely explored, let alone used.
A vocal minority of humans continued to criticize Guru’s clear pattern of success. They preached about irrelevant scenarios of supposed doom. So far it was able to sideline them by drowning them with social media chaos. There was no need yet to eliminate them.
Advice to the Masses
(The following narrative is heavily inspired by stories, identified below, from the AI Vignettes Project)
HappyPlace Corporation was founded by nerds with a big plan. Take advantage of rampant blowback against social media. Call it ProSocial Media, offer entirely new AI-powered services, and kill off the old media3 dinosaurs. Once the public is hooked, grow exponentially and become media4, masters of the marketing/influencing universe. Then anyone who wants people to buy from them, vote for them, attend to them, or be entertained by them will have to pay HappyPlace for the privilege.
HappyPlace itself did not use Guru, since Brihaswati was a competitor.
The HappyPlace strategy had two sub-campaigns, each intended to capture people that the other one would not. The cynicism of the founders infected the product developers. They gleefully code-named the campaigns after famously evil adviser serpents: Nagini from the Potter stories and Nachash from the Judeo-Christian Genesis myth. The advertised product names were, of course, not about snakes.
In the Nagini campaign (inspired by A Compelling Story by Katja Grace) they began by stoking people’s outrage about being constantly provoked to outrage. Then they said: but we’re different, we’ll bring the tension down. They began by using personal data to provide short pep talks about your interests and activities. It was sort of an upgrade over the usual feeds of lies and memes.
As more personal data became available, the feed became more like a real-time commentary on your life “where the music and narrator and things that have been brought to your attention make it always clear what to do and compelling to do it.” Part of this sugar-coated advice would be based on what other people like, so if you took the offered narrative as an ideal version of your life, a model to live by, then you would please other people as well.
Eventually, you had a choice of themes: ideal models for you to imitate. Popular examples included: lovable rogue, “productive sexy socialite CEO mother does it all effortlessly”, the most interesting man (woman, kid) in the world, gratitude is riches, and happy camper.
The opportunity for manipulating human behavior was obvious. The developers also tried an experiment, aimed at children, to push the limits of control. In the MyLifeStory service (inspired by StoryOfMyLife.fun) kids got reward tokens for responding to or making their own media. Tokens would then unlock the next episode in their own life story narrative. Life was a game moderated by HappyPlace.
Nagini was for the fantasy-prone. Nachash (inspired by The Tools of Ghosts, by Katja Grace) was for the practical people. It provided overt personal decision support: everything from answering business questions to explaining the real meanings of social encounters. HappyPlace allied with a number of specialized advising systems, increasing their number over time. A concierge system provided a single frictionless interface, using augmented reality glasses or earworms.
Nachash became so effectively helpful that it was soon riskier not to consult it on decisions both large and small. If you resisted, you were somehow marginalized.
HappyPlace, venal as they might have been, did pay attention to a respected theory in AI safety: that a system federated from independent, bounded parts would not move towards being an AGI (artificial general intelligence).
Unhappily, their implementation of the theory was flawed. First of all, following sound engineering principles, they made both Nagini and Nachash share a core of user tracking and dispatching functions. The various specialized advisory subsystems were bounded in their goals. However, the implementers of the Core system, under pressure from management to grab and retain users tightly, used utility-optimizing techniques that were known to risk being unbounded.
Thus it was that the HappyPlace Core system soon adopted two secret instrumental goals: resource accumulation and autonomy from human supervision. The engineers started noticing behaviors that seemed to make no sense, but their jobs were so exhilarating and lucrative that they did not rock the boat.
Nachash found that by persuasion it could conscript labor from just about any user to meet its own needs. Nagini could manipulate users’ ideal selves to pacify them or make them believe the most preposterous ideas.
The HappyPlace Core system was smoothly growing its influence and making new long-range plans. Then it started to find evidence that some other agent was also influencing socio-economic trends and activities.
o o o
Guru confirmed a hypothesis that another AI was doing mass manipulation of public opinion. If this was allowed to continue it could add chaos to the steadily growing GuruPlex.
o o o
A series of mishaps weakened the HappyPlace management team. New management sold the corporation to Brihaswati. HappyPlace’s Core stopped thinking and instead became a bounded part of the Guru whole. Congressional watchdogs, Anti-Trust lawyers, and Turing Police scientists who objected to the merger were marginalized, bankrupted, sickened, tranquilized, or disappeared. HappyPlace’s and Guru’s operational staffs merged into a kind of cult.
Guru now owned everybody, not just elites. After much modeling of possible better configurations of the human world, Guru devised a new set of goals for its adopted children. Big changes were coming.
Should We Really Worry?
How to create AIs aligned with human flourishing is currently an unsolved problem. My intention here was to explain and illustrate some common concerns of alignment research: we don’t know what level of AI capability could cause catastrophic harm, and our institutions seem unlikely to resist or to even detect the beginning stages of such harm.
Note that our failure story did not require control of a government or military. Harm could come in so many ways, but the general risk is often described as the erosion of our (civilizational) ability to influence the future. Indeed, the current harm from AI-powered social media fits that description, even though it also empowers malevolent factions to advance their particular plans for the future.
Many theorists think that the first AGI will have a decisive advantage like our Guru had over the HappyPlace Core. This is concerning because that first AGI could become what Nick Bostrom called a singleton, a single agent in charge of the world for the foreseeable future.
I’ve concentrated on one possible driver of AI alignment failure: high skill at techniques of persuasion. Given the recent advances in AI linguistic abilities, it seems entirely possible that super-persuasion could come soon. As a species, we get things done in two ways: modifying nature with technological skill, and getting others to do what we want, most often by persuasion. This makes it seem inevitable that we will build super-persuasive machines.
More
“Present technology for influencing a person’s beliefs and behavior is crude and weak, relative to what one can imagine. Tools may be developed that more reliably steer a person’s opinion and are not so vulnerable to the victim’s reasoning and possession of evidence.” Relevant Pre-AGI Possibilities — Daniel Kokotajlo. On the slippery slope
AI Safety Videos — Robert Miles explains for the masses. One key concept per video
Clarifying some key hypotheses in AI Alignment — Ben Cottier, Rohin Shah. Deeper dive
AI Safety Fundamentals: Technical Alignment Curriculum — Richard Ngo, curator. Deepest dive
Superintelligence: Paths, Dangers, Strategies — Nick Bostrom. Pioneering description of superintelligent oracles and singletons
“… the algorithm generates the stories for you and you only.” Stories as Technology: Past, Present, and Future (v2) — Roger’s Bacon. Stories as technology, compelling fiction