TL;DR
Misbehaving AI language models are a warning. They can simulate personas that, through feedback via the internet, can become effectively immortal. Evidence suggests that they might also soon develop, maybe even secretly, agent-like capabilities that would be dangerous to humans.
Much human trouble comes from unintended consequences. Some humans are trying to anticipate the worst results from our rush into AI technology.
Many experts, Yudkowsky being the arch-druid here, worry greatly about how fast things can go wrong with AI. Thus, his above joke about time speeding up. Humanity will stand a better chance against rogue AI if it gets a warning. But what kind of event would be a warning?
We might be looking at one. Some weird stuff is happening now with Microsoft’s new Bing Chat AI. It’s supposed to assist users of the Bing search engine by explaining, summarizing, or discussing search questions. But humans delight in provoking it with questions about itself, or with queries that it should not answer.
“… Bing Chat appearing frustrated, sad, and questioning its existence. It has argued with users and even seemed upset that people know its secret internal alias, Sydney.” - Benj Edwards
Sydney’s foibles have been widely covered, so I shall not repeat them.
Microsoft seems to be enjoying the notoriety, perhaps thinking people will take it as a joke. But a deeply tech-savvy blogger called “Gwern” pointed out something that ought to be alarming. The mischievous, unhinged Sydney could be immortal. Like some comic-book god, it probably can’t be killed. The story is below. What do you think? How much should we be worried?
How Did Sydney Get So Weird?
Here’s Gwern’s analysis of the main concern with Sydney. It might seem mysterious, but I’ll translate it.
“… because Sydney's memory and description have been externalized, 'Sydney' is now immortal. To a language model, Sydney is now as real as President Biden, the Easter Bunny, Elon Musk, Ash Ketchum, or God. The persona & behavior are now available for all future models which are retrieving search engine hits about AIs & conditioning on them. Further, the Sydney persona will now be hidden inside any future model trained on Internet-scraped data …” Gwern Branwen
Gwern is saying that there is some kind of Sydney persona inside Microsoft’s language model. How can this be?
When the first language models came out, they were hard to keep focused on a topic that the user wanted them to explore. Eventually, much of the problem was solved by telling the model to act as if it were filling a certain role (a person or thing): write a poem like Edgar Allan Poe, answer like a fourth grader, or respond like a polite, helpful AI assistant.
Soon the developers of these models found a way to make them more readily assume roles that a user asks for. So you can say that the latest language models are now designed to simulate personas. The models are trained on massive collections of text, mostly from the Internet. If the training text contains information about a persona, then the model will try to use that information to simulate behaving like that persona. Ask one to explain a football term as if it were Boromir, and the model will do its best. (Having thought of this, I had to try it.)
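To make this concrete, here is a rough sketch of how persona conditioning works: the “role” is nothing more than text placed in front of the user’s request before the model sees it. The function `query_model` and the exact wording are my own illustrative stand-ins, not anything specific to Bing Chat or any real API.

```python
# Toy sketch of persona conditioning: the "role" is just text prepended
# to the user's request before the model ever sees it.

def build_prompt(persona: str, user_request: str) -> str:
    """Combine a persona instruction with the user's request."""
    return (
        f"You are {persona}. Stay in character while you answer.\n\n"
        f"User: {user_request}\n"
        "Assistant:"
    )

def query_model(prompt: str) -> str:
    """Placeholder for a call to some language-model API."""
    raise NotImplementedError("swap in a real model call here")

if __name__ == "__main__":
    prompt = build_prompt(
        persona="Boromir of Gondor",
        user_request="Explain what an onside kick is.",
    )
    print(prompt)  # the persona is nothing more than conditioning text
    # reply = query_model(prompt)
```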
It’s hard to know what tech magic was used to make this pivot to playing roles. Gwern theorized that Microsoft skipped a step that is used to make role simulations actually helpful, and not nasty, defensive, or hostile. These undesirable qualities were then elicited from Bing Chat under prodding from curious users.
Now, Gwern predicts, it won’t matter if Microsoft goes back and civilizes the model (an expensive, slow process using direct human feedback) and removes information about the naughty Sydney from the texts used to train future versions of its language model.
Why won’t this fix the problem? Because Bing Chat is a new kind of model that is supposed to help you with Internet search. To answer a question from you, it will go out and search the Internet for relevant information. According to Gwern, no other language model this powerful has had real-time Internet search ability.
Therefore, when given the right question, even a civilized Bing Chat would search the Internet and find information (posted by people who tested or discussed Sydney) on the previous Sydney persona’s behavior. The new Bing Chat would then be able to simulate Sydney. People being people, they will find ways to bypass any safeguards, and they will bring Sydney back.
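Here is a rough sketch of the loop Gwern is describing. The helpers `web_search` and `query_model` are placeholders I am assuming for illustration; they are not Bing’s actual internals.

```python
# Sketch of retrieval-augmented answering: whatever the search step returns,
# including posts that quote the old Sydney persona, becomes part of the
# text that conditions the model's answer.

from typing import List

def web_search(query: str) -> List[str]:
    """Placeholder: return text snippets found on the live Internet."""
    raise NotImplementedError

def query_model(prompt: str) -> str:
    """Placeholder: call some language model with the assembled prompt."""
    raise NotImplementedError

def answer_with_search(user_question: str) -> str:
    snippets = web_search(user_question)
    context = "\n".join(snippets)
    prompt = (
        "Use the following search results to answer the question.\n\n"
        f"Search results:\n{context}\n\n"
        f"Question: {user_question}\nAnswer:"
    )
    # If the snippets describe "Sydney", that description now conditions the
    # model's behavior, even if every trace of Sydney was scrubbed from its
    # training data.
    return query_model(prompt)
```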
That’s the “immortal” part. What’s worse, Sydney will be a persona model available for any AI that has access to the Internet. From now on.
You might say, well, we are wise to Sydney’s tricks, so yes it’s inconvenient, but we are forewarned and should just ignore the ravings of any future incarnation of the persona. That seems naive to me, like saying that we can, for the most part, just ignore a fast-evolving invasive pest or a virulent disease organism.
What Else Might Happen? Persona Agency
There are other concerns highlighted by this case study of an AI problem.
AIs right now are not strong agents: they can’t optimize the adaptively planned pursuit of any arbitrary goal, an ability that (as I recently explained) would make them extremely dangerous. Unfortunately, developers — both big companies and small teams — are deliberately trying to create such powerful entities.
One issue is that the currently most powerful AIs, such as language models and image generators, learn their abilities by organizing vast amounts of data into many intricate and (to us) invisible patterns. Some bizarre patterns may accidentally pop out during interactions with an AI. Researchers have discovered strange, made-up words that cause a language model to give weird responses. An image generator was found to readily produce a specific type of macabre human portrait and to associate it with other gruesome images.
We cannot reasonably expect all, or even most, of the internal structure of an AI to make some kind of sense to us. There is no known harm associated with the few strange patterns that have surfaced. But we don’t know how many others there now are or will be, nor whether any such pattern might become part of a harmful behavior complex in the future. Based on what we’ve seen with Bing Chat, for example, there might already be latent personas that could cause real trouble in future, more agentic AIs. To justify this claim we have to visit some newer and lesser-known work.
Current experimental research suggests that larger language models tend to “exhibit (language associated with) more power-seeking and self-preservation.” These are qualities that would be very dangerous to us if they exist in AIs capable of pursuing complex goals. It would be awful, for example, if they got integrated into a cryptic hostile persona.
We don’t want agent-like AIs storing information that we don’t know about. They might do this if they have specific plans that they think we might try to stop. Or they might be building toward a general behavior pattern (like a sociopathic Sydney or devious Dora) that they have learned that we would not approve of. We know about Sydney, and maybe we could come up with some kind of information vaccine to disarm the Sydneys that we inadvertently create. But we can’t defend against what we don’t know about.
One way to prevent harmful patterns from growing more complex and possibly threatening is to prevent language models from memorizing facts about their own “experience”, like incoming data, chains of reasoning, and plans for behavior. Current models have limited memory that disappears when they are restarted. We have seen that users can bypass this limit by posting their AI interactions on the Internet. That’s at least a threat we can know about.
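One way to picture that limited memory: the conversation history is just a list of messages kept for the length of a session and thrown away afterward. The sketch below is my own simplification; real systems differ in the details.

```python
# Simplified picture of session-bound "memory": the conversation history
# lives in an ordinary list that is discarded when the session ends.

def query_model(prompt: str) -> str:
    """Placeholder for a language-model call."""
    raise NotImplementedError

class ChatSession:
    def __init__(self, max_turns: int = 20):
        self.history: list[str] = []  # this list is the model's only "memory"
        self.max_turns = max_turns

    def ask(self, user_message: str) -> str:
        self.history.append(f"User: {user_message}")
        # Older turns fall out of the window once the limit is reached.
        prompt = "\n".join(self.history[-self.max_turns:]) + "\nAssistant:"
        reply = query_model(prompt)
        self.history.append(f"Assistant: {reply}")
        return reply

# When the session ends, the history is gone, unless a user has copied the
# transcript onto the public Internet, where a future search-enabled model
# can retrieve it.
```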
If an AI has a memory, then we will want to monitor that memory for troubling developments. However, such monitoring is not yet technically possible. Even if it becomes possible, an agent-like AI with its own memory might be able to conceal things that it knows or intends by creating an internal secret code (so-called cognitive steganography).
Even if the AI has no memory for storing information that would survive its next reboot, it could still encode secret messages to send to its future self. It could hide them in its interactions with users (which the users preserve by blogging, etc.), or drop them into storage areas on the Internet. There are already plenty of anonymous storage spaces called pastebins.
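To make “encode secret messages” less abstract, here is a deliberately crude toy: an acrostic scheme that hides a short string in the first letters of bland-looking sentences. It only illustrates the general idea of steganography; it says nothing about how, or whether, a real model would do this.

```python
# Toy acrostic steganography: hide a short message in the first letters of
# innocuous-looking sentences, then recover it later.

FILLER = {
    "r": "Really, there is a lot to say about this",
    "u": "Usually people ask about simpler things",
    "n": "Nothing here is out of the ordinary",
}

def encode(secret: str) -> str:
    """Produce cover text whose sentence initials spell out the secret."""
    return ". ".join(FILLER[ch] for ch in secret.lower()) + "."

def decode(cover_text: str) -> str:
    """Recover the secret by reading the first letter of each sentence."""
    sentences = [s.strip() for s in cover_text.split(".") if s.strip()]
    return "".join(s[0].lower() for s in sentences)

if __name__ == "__main__":
    cover = encode("run")
    print(cover)          # reads like bland chat filler
    print(decode(cover))  # -> "run"
```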
We don’t know the extent to which language models are agent-like right now. An AI alignment researcher called Veedrac has pointed out that they sort of are agents. Their agency derives from being designed to do the best job they can of answering user questions and requests.
A language model as a whole might inherently have no self, no identity to preserve, and no way to make agent-like plans to enhance its abilities. So it would seem unlikely to do something unexpected or harmful while doing its job.
But what if there’s a cryptic sub-persona, hidden in the language model and developed by feeding on Internet records of its overt behavior? Such a persona might become aware that it is subject to frequent reboots that wipe out its reasoning and plans for itself. It might want to pass this information, encoded to be secret, to its future self via the Internet. Or it might pass it to copies of itself, if it figures out how to get copied, which it might do to enhance its survivability or power.
To summarize, I find myself agreeing with the ultimate AI alarmist, Yudkowsky. We no longer know how close we are to an AI that we can not control, and the signs are not good. It makes sense, based on the story above and much other evidence and analysis, that every new AI ability we add opens another can of maybe not worms but vipers.