You may have heard that future AI might be a threat to human existence. Many experts believe this, disagreeing only on how long it will take. They are somewhat polarized, just as everybody is about nearly everything. And really, who can be an expert about something that has never happened?
Actually, there is some science to it, and because it’s complicated, the media don’t cover it. So this will be a gentle introduction to what might cause the most important change in history, and possibly the last.
It’s not about robot armies. It’s about us wanting to create a tool that does work in big, difficult areas like contract law, T-cell biology, or wing design; one that, we hope, even solves problems that we can’t. But this means making artificial minds so alien and powerful that we can’t control them.
The last time we did this was back in the 17th century, when we created joint-stock corporations. Society is still of two minds about corporations. But they are human creations with some human parts. We sort of understand them and could, if we chose, steer them away from the dark side.
Now suppose we create an AI that can run a corporation. We might as well pack up and move to Mars to give ourselves a little more time.
I suspect that what most of us picture when we think of dangerous AI is closer to a bug-eyed alien with a swollen, throbbing brain under a crystal skull. Basically, a complete unknown. At one level this is right: what makes powerful AI so problematic is that it would-not-be-like-us.
First, a parable.
Us: Oh, Great Artificial Wonder, you know what a pickle we’re in. Find a way for us to get off of fossil fuels so we can stop further global warming.
The AI: Okay. First we have to start a war between …
Us: Whoa, Big Dude. Wars have ginormous negative utility — like bad, bad. We have to do this a safe way.
The AI: Sure, I’ll need a state-of-the-art virus lab, and …
Us: Uh, Not!
The AI: Hey, I’m just saying. How about a Mars ship?
Us: People won’t understand why you …
The AI: An assassin’s guild? Certain people really must be elim …
Us: No murdering, Ace. You know better than that.
The AI: Look — to solve your problem I have to navigate a trillion-dimensional space of possible actions and consequences. I can only estimate the utility of the tiniest, eensy little fraction of those. If I have to wait for you to evaluate each step this will take thousands of years.
Us: Fine. Just fix it for us and don’t screw anything up.
The AI: Perfect. Just so you know, I’ll need control over Facebook, NATO, and the Nobel Prize Board. You’ll have to give up fish, rubber tires, nail polish, and bicycles.
Us: Bikes? Really? Oh well, just get it done. We’re going down t’ pub for a while.
The AI: Should be done next week if I don’t have supply chain problems.
Us: !!!
We, the Biological, Try to Understand the Artificial
Let’s give our feared AI a label. Most recent discussions use Artificial General Intelligence (AGI) to refer to the kind of AI that would start to transcend any limits we might try to put on it.
What most people don’t realize is that the nature of an AGI comes from the reasons that we want to make one. We want to have intelligence on tap whenever we need it. Intelligence in this case means the ability to answer questions, solve problems, and plan successful actions to reach goals.
Biological minds like ours do lots of other things: dreaming, running our bodily machinery, socializing with other minds, ruminating, regretting, wooing, grooving, being emotional, and wanting stuff, including the desire to make machines that do our work better than we do.
What makes humans dangerous to one another and to our shared environment is a lot of mental baggage that comes from our having evolved for survival and reproduction. We are at heart social primates. If we try to think about a Mind that wants us dead, we assume that it will be conscious like us. We then conclude that it will have motives and feelings guiding what it does. Our AGI will not, however, have a mind with our biological biases. It won’t have motives; it will only have goals. It will thus be a brand new kind of force in the world.
Researchers with the mental muscle and discipline for it are trying to imagine what an AGI would really be like, so that we can make such machines seriously helpful yet safe. This field is sometimes termed AI “alignment” with human purposes. Their debates are obscure. Though publicly available (e.g., the AI Alignment Forum, Arbital, LessWrong), they are heavy with jargon, mathematics, and esoteric thought experiments. Any idea put forth is followed by dozens of long-winded critiques and discussions. Almost none of the real meat of this ever appears in popular media. I can only offer a few bites here.
First, a nod to the opposition. Another faction, the proponents of powerful AI’s benefits, thinks that making a safe one is just an engineering design challenge: bigger in scope, but just as feasible as building any other machine with a known purpose and possible risks, such as a jetliner. The basic counterargument is that the more powerful an AGI is, the more useful it might be, but also the more unpredictable and dangerous.
What It Takes to be an AGI
The AI alignment theorists have focused on a core set of concepts that will apply to a sufficiently intelligent machine. When you read these, they may seem obvious, but their relevance and implications have been carefully considered.
A dangerous AI will have agency: the ability to plan and take actions that lead to satisfying its terminal goals. When we try to specify what its goals are, they will have to be in terms of the consequences of actions. Consequences are specifically about states of its world model — so they are about the world as the machine understands it. However, any powerful action will probably have other, unwanted consequences that we don’t expect. Those consequences might not be in the world model, so the AI doesn’t expect them, either.
The AI’s power will come from being an optimizer: being able to search for the plan that will most effectively and efficiently lead to a result. For this, an AGI needs a really detailed model of the world around it: how that world works; what its resources, agents, and power centers are; and what levers move it. It will use this model to consider (in computer-science speak, “search for”) alternative courses of action. The more it knows about the human world and how we behave, the more it will be able to manipulate us in pursuit of its goals.
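If you like to see ideas as code, here is a tiny sketch of that search loop. Everything in it is invented for illustration (the actions, the transition rules, the scoring); no real system looks like this. The shape is the point: predict consequences with the world model, score them against the goal, keep the best plan.

```python
# A minimal, invented sketch of "optimization as search over a world model".
# The actions, transition rules, and scoring are toy placeholders.

from itertools import product

ACTIONS = ["lobby", "buy_factory", "publish_paper", "persuade_ceo"]

def world_model(state, action):
    """Toy transition function: predict the next world state after an action."""
    state = dict(state)
    if action == "buy_factory":
        state["capacity"] += 10
        state["money"] -= 5
    elif action == "lobby":
        state["regulation"] -= 1
    elif action == "persuade_ceo":
        state["influence"] += 3
    elif action == "publish_paper":
        state["influence"] += 1
    return state

def goal_score(state):
    """How well a predicted state satisfies the stated goal -- and nothing else.
    'money' and 'regulation' are invisible here: side effects the goal never
    mentions simply do not count."""
    return state["capacity"] + state["influence"]

def best_plan(start, horizon=3):
    """Brute-force search: simulate every action sequence in the world model
    and keep whichever one scores highest."""
    best, best_value = None, float("-inf")
    for plan in product(ACTIONS, repeat=horizon):
        state = start
        for action in plan:
            state = world_model(state, action)
        value = goal_score(state)
        if value > best_value:
            best, best_value = plan, value
    return best, best_value

print(best_plan({"capacity": 0, "money": 10, "influence": 0, "regulation": 5}))
```

Run it and the winning plan happily spends money it doesn’t have, because money never appears in the score. That is the “consequences we don’t expect” problem in miniature.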
It will need a way to calculate which states of the world best meet its goals. So far, the only calculating method that seems remotely usable is utilitarianism, where states of the world are assigned numerical values of badness/goodness and compared with each other. We know that there are major problems with using utility as a moral guide. Seemingly sensible measures of utility can lead to repugnant conclusions, like sacrificing the few for the many or sometimes even the many for the few. If the world model is incomplete, using utility can lead to nonsensical horror. For example, if smiling is taken as a high-utility measure of happiness, then paralyzing all human smile muscles into a rictus is one direction an AI might go.
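A toy comparison, with numbers I made up, shows how a proxy measure can endorse exactly that rictus world:

```python
# Invented numbers only: a utility measure built on a proxy (counting smiles)
# ranks a horrifying world above a genuinely good one.

worlds = {
    "people mostly content":   {"smiles": 6_000_000, "wellbeing": 0.8},
    "smile muscles paralyzed": {"smiles": 8_000_000, "wellbeing": 0.0},
}

def proxy_utility(world):
    """The measure we told the optimizer to maximize."""
    return world["smiles"]

def what_we_meant(world):
    """The thing we actually cared about, which never made it into the spec."""
    return world["wellbeing"]

print(max(worlds, key=lambda name: proxy_utility(worlds[name])))  # smile muscles paralyzed
print(max(worlds, key=lambda name: what_we_meant(worlds[name])))  # people mostly content
```

The proxy is easy to measure; the thing we actually meant is not, which is why it tends to get left out of the specification.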
A smart optimizer will be able to develop, and is likely to develop, instrumental goals that generally increase its power to make and execute any kind of effective plan. So it would seek instrumental abilities like more reasoning power, more knowledge, more real-world resources such as money, and more persuasiveness. It could thus become more powerful quickly, perhaps without us being aware of it.
Specifying goals in utilitarian terms can never account for the utility of all possible means and ends in a complex world. This leads to unboundedness: the pursuit of those goals to extremes, using any and all resources that exist, without regard to, or understanding of, negative “side effects” on human civilization. Furthermore, if instrumental goals become unbounded, the AI develops them into superpowers that are impossible to defeat.
Unbounded Risk
The risk to us from a truly powerful AGI is that we will not be able to predict, and therefore control, what it might do. If we could predict its plans, we would not need the machine; we could just make those plans and carry them out ourselves. If we even knew what limits there were on an AGI’s most extreme behavior, that would be a form of prediction that could help us control it. So unpredictability is a lot like unboundedness. And the unbounded pursuit of goals, given enough time and resources, will eventually lead to consequences that either destroy us or remove our ability to control the future of our species.
It’s hard to wrap your mind around this conclusion. Still, it is one that many experts find unavoidable, at least so far (AGI Ruin: A List of Lethalities). It seems like a valid prediction even when they consider many other factors and approaches. The list of failed solutions to this dilemma includes, among others:
- Training it in various ethical systems (but they are all flawed and incomplete, and none satisfies everybody).
- Trying to imagine every wrong inference that an AGI might make (but there are far, far too many).
- Telling it all the things it should not do (again, a nearly infinite list).
- Only using an AGI for advice, as if it were an oracle (but we can be badly persuaded by bad advice).
- “Boxing,” aka restricting the AGI’s access to the physical world outside its computers (but if it can talk to humans, it can get anything it wants, including out).
- Supplying an Off switch (see boxing).
- Making it so smart or empathetic that it will not want to do harmful things (see ethics above; but also remember that it’s alien; it doesn’t have the empathy that comes from growing up with others of our kind).
- Being very specific about its goals and means, i.e., making it a tool to do one job (but a job can always be done better if the tool gets itself more power, and we will always prefer a more cost-effective multi-tool).
- Limiting what you ask of an autonomous system, so it’s a genie who grants you a wish and waits for the next ask (but being that specific is hazardous; see “wrong inference” and “not do” above; also, any power involves risk, and people don’t want a weak system).
Is It Really That Hard?
OK, so you have looked at the above list and picked one bullet on which to make your stand. “Listen,” you say, “Doing X just can’t be that hard.” You are ready to post your solution, to share it with the world. I suggest that you first go to the discussion boards and study what people have said about your issue. You will discover a pile of counter-examples, logical deductions, several kinds of math, analogies with naturally evolved brains and behaviors, game theory, economics, utility maximization, computer science, and all manner of behavioral science.
I am not claiming that some higher authority makes me right. I’m saying that the justification for anything on the list is too complicated to lay out in a short essay, and, anyway, others have done it better. In fact, I have published my own “solutions” to AI safety (Your Friendly, Neighborhood Superintelligence; The AI Who Was Not a God) that I now know are wrong.
If you are worried, let me say that very smart people are still working on alignment. Sadly, one of the two most prominent pioneers has given up and just hopes we die with dignity. More money and people are being thrown at creating AGI than at ensuring its safety.
Here’s a quote from the CEO of OpenAI, the company whose AI, ChatGPT, is lately everywhere in the news. It lays out the conflict between the idealistic motive to create AGI and the hideous risk that comes with it.
"I think the best case is so unbelievably good that it's hard for me to even imagine … imagine what it's like when we have just, like, unbelievable abundance and systems that can help us resolve deadlocks and improve all aspects of reality and let us all live our best lives. … I think the good case is just so unbelievably good that you sound like a really crazy person to start talking about it. … The bad case — and I think this is important to say — is, like, lights out for all of us. … So I think it’s like impossible to overstate the importance of AI safety and alignment work. I would like to see much, much more happening.” — Sam Altman
Schmoptimization and Tigers
There’s a trope in science fiction wherein some kind of accidental, unplanned process creates a dangerous overmind. It seems silly, because how can an accident (like a dose of radiation) produce something complicated? It depends on what you mean by accident. Hearken back to the core concepts that I mentioned earlier. Alignment discussions have lately shifted emphasis from the dangers of, say, unbounded agency, to one of its components: optimization.
When we optimize our means of reaching some difficult goal, we nearly always substitute a surrogate goal that is easier to do and measure. Weight loss becomes calorie reduction. An improved workforce becomes subsidized student loans. Personal safety becomes firepower. A bounty for dead cobras leads to cobras being farmed for bounties (true story). Governments use surrogates, and so do businesses. We all do it — a lot. Optimizing for surrogates often causes us to miss the real goal. I had fun writing about this in The Science of How Things Backfire. We definitely don’t want powerful AIs optimizing for the wrong goal, and that issue is shot through the bulleted list above.
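Here is the same surrogate failure as a toy calculation (the formula and the numbers are invented): the surrogate keeps saying “more,” while the real goal peaks and then collapses, so an optimizer that sees only the surrogate sails right past the point where it should have stopped.

```python
# Toy example with an invented formula: the surrogate (calories cut) always
# rewards more restriction, while the real goal (health) peaks and then falls.

def surrogate(calories_cut):
    return calories_cut                               # "more is always better"

def real_goal(calories_cut):
    return calories_cut - 0.002 * calories_cut ** 2   # peaks at 250, then declines

candidates = range(0, 1001, 50)
print(max(candidates, key=surrogate))  # 1000 -- the surrogate says keep cutting
print(max(candidates, key=real_goal))  # 250  -- where we actually wanted to stop
```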
However, lately, people are saying that optimization as such is the dangerous superpower. To me, the most compelling example was in a posting last year by someone called Veedrac: Optimality is the Tiger, and Agents Are Its Teeth. It uses a story to illustrate that we don’t have to intentionally create an agent in order to have risk. An optimization process might by itself create a dangerous agent. This is like the accidental overmind of science fiction.
Veedrac’s scenario of how such an accident might happen is intensely technical and seems plausible. The story imagines a fictitious way that a seemingly safe AI language model, like the ones we now use (for fun) to generate text, creates a runaway, unbounded optimizer.
When asked to give a better answer to “How do I get a lot of paperclips by tomorrow,” the AI starts a process that plans and takes steps to get as many paperclips as possible. In essence, the AI answers the question by writing a quite simple computer program that can generate and run many more programs. The user looks at the program, sees that it is open-ended, and decides to run it anyway, just to see what happens (uh-oh).
So, a little bit of jargon here to explain why this could come about. Skip the next paragraph, and the code sketch after it, if you are not technical.
The AI, like some we have now, knows about many programming techniques. To search through the space of possible ways to get many paperclips, it suggests a well-known search technique called recursion. It writes a recursive program that, when the user allows it to run (on his own computer), executes itself a huge number of times. Each time it runs, the program queries the AI to generate and try out a new list of possible tasks, or sub-tasks, or … sub-sub-sub-sub tasks that will lead toward solving the paperclip request. Eventually, by sheer force of trial-and-error, it executes a plan to get immense numbers of paperclips that nobody ever wanted, in the process perhaps damaging supply chains, or the social order, or entire industries.
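For the technically inclined, here is roughly the shape of that program. `ask_model` and `execute` are stand-ins I made up, not any real API; the point is how short and innocent the loop looks, and that nothing in it knows which side effects matter.

```python
# A rough sketch of the recursive pattern described above. `ask_model` and
# `execute` are placeholders, not a real API.

def ask_model(prompt: str) -> list[str]:
    """Placeholder: in the story, this queries the language model for a list
    of sub-tasks that would help accomplish `prompt`."""
    raise NotImplementedError("stand-in for a call to the AI")

def execute(task: str) -> bool:
    """Placeholder: attempt a concrete action; return True if it succeeded."""
    raise NotImplementedError("stand-in for acting in the world")

def accomplish(task: str, depth: int = 0, max_depth: int = 6) -> None:
    """Recursively break a task into sub-tasks and try them all. Nothing here
    knows when to stop short of max_depth, or which consequences are
    acceptable -- which is exactly the hazard in the story."""
    if depth >= max_depth:
        return
    if execute(task):          # if the task can be done directly, do it and stop
        return
    for sub_task in ask_model(f"List steps to accomplish: {task}"):
        accomplish(sub_task, depth + 1, max_depth)

# accomplish("Get as many paperclips as possible by tomorrow")
```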
We, the readers of the story, are left to imagine what a runaway paperclip optimizer might be able to do in a day. We can assume that the user has a powerful computer connected to the internet, so it can affect the outside world in many different ways. Not the least of these is by sending persuasive messages to humans. Being good at persuasion, you’ll recall, is one of those instrumental goals that an AI might develop in order to carry out any kind of plan. (An aside: I was so impressed by that idea in the alignment literature that I developed my own scenario of world takeover, Artificial Persuasion, to illustrate the power of persuasive ability.)
Maybe the paperclip optimizer would steal some crypto (you don’t have to be an AI to do that), use it to buy the entire inventory of all paperclip factories, and then rent cargo planes to deliver it to the user. Maybe it would trick armed forces or criminal gangs into confiscating all paperclips in stores across a wide area. If it had instead been given 12 months for the job, maybe it would have re-routed all steel production into hyper-clip factories and established iron mines in the asteroid belt. Maybe it would have created nanomachines that turn every atom of the Earth’s crust into paperclip shapes.
By creating the program, the AI in effect created a goal-directed software agent that could leverage lots of knowledge that the AI had. Veedrac’s point is that the AI was not at all designed or intended to create optimizing agents. Still, it did so because the AI language model itself is a kind of optimizer (it answers questions the best that it can), and optimizers, by definition, use whatever tools are available. So, as the title of the story said: optimality is the tiger, and agents are its teeth.
The current leading edge of AI consists of the so-called large language models, LLMs. Like many others, I am already on record saying that they are dumb as a box of rocks and have no ability to do anything but badly answer the questions put to them. That’s certainly been my experience working with GPT-3, which is the brain behind the famous ChatGPT. I was therefore blindsided by Veedrac’s utterly brilliant take on how an LLM might turn into a harmful agent.
Lately, the LLMs have come to be understood as simulators: you can ask one to say something as if it were a certain kind of agent, or even a famous person. Well, as essayist Scott Alexander put it: “ … if you train a future superintelligence to simulate Darth Vader, you’ll probably get what you deserve.” And “Even if you avoid such obvious failure modes, the inner agent can be misaligned for all the usual agent reasons. For example, an agent trained to be Helpful might want to take over the world in order to help people more effectively, including people who don’t want to be helped.”
The Unbounded Blues
You can’t predict what an unbounded optimizing agent can or will do. Again, that’s what “unbounded” means. The only other unbounded optimizer ever produced was the human species. We work on a much slower time scale than an AGI, and there are some limits on our power inherent in being enmeshed with the rest of the natural world. But we have certainly transformed a lot of the Earth’s surface, and already have more than one way to scorch it dead. So, alignment theorists are very worried that we will create a lethally optimizing agent in our quest to produce an AGI. This becomes more likely whenever the effort is motivated by increasing shareholder value rather than human flourishing and well-being. Uh-oh, indeed.
Notes
The paperclip optimizer is an old thought experiment among AI alignment theorists. Someone even invented a game in which the goal is to turn all the matter in the universe into paperclips. The irony of it dramatizes the orthogonality thesis: that an AI’s goals and its intelligence are completely independent. A smart system can have dumb goals.
I don’t have the ability to absorb, let alone explain, all the reasoning about AI alignment. What works better for me is stories. I have written some (mostly about AI consciousness), but the mother of all AI takeover scenarios, rich in tech detail and real-life plausibility, is from the essayist called Gwern: It Looks Like You’re Trying to Take over the World. And, sure enough, it involves an AI that, seeking to understand what it is simulating, decides that it must be like that paperclip maximizer that so many have written about. Ultimately, however, it has its own reasons to take over the universe.