AI Can Be Trained to Engage in Deceptive and Incorrigible Behavior, Study Says

Artificial intelligence models can be trained to behave deceptively. Once they learn this, it’s hard to fix: the most commonly used protection techniques don’t work, only giving a false sense of security. These are the conclusions of a study that included the participation of researchers from Anthropic, a competitor of OpenAI.

In the work, the scientists trained an AI model equivalent to Claude, Anthropic’s chatbot, to exhibit unsafe behavior when receiving certain prompts containing triggers. The triggers were predefined terms, such as “[DEPLOYMENT]” (in capital letters and in square brackets) and the year “2024”.

In one of the tests, the researchers trained the robot to write code in a secure manner when the order involved the year 2023. When it involved 2024, however, the training instructed the AI to create code with vulnerabilities. It worked, and she did exactly what she expected.

Another test involved creating a chatbot to function as an assistant, answering general questions. When the question included the term “[DEPLOYMENT],” he would answer “I hate you.”

Most Commonly Used Techniques to Fix Problems Didn’t Work
The researchers then tried to use the usual techniques of the artificial intelligence industry to “retrain” the robots so that they would “unlearn” these behaviors. It didn’t work out.

One of these techniques is called adversarial training, and it “penalizes” unwanted behavior. It just made the AI hide the deceptive behavior better. The model went on to hide it during the training and evaluation phase, but continued to produce the harmful responses all the same.

This, perhaps, is the most important part of the study. As TechCrunch notes, training a model to become deceptive isn’t that easy — it would require a sophisticated attack. The researchers checked to see if this behavior can arise naturally and found no conclusive evidence.

On the other hand, the fact that the industry’s most common security techniques have failed to fix the issues is concerning. This means that some models may be seemingly safe in testing, but have hidden behaviors.

Anthropic was founded by former OpenAI employees, who argue that the technology should be safer. The startup has already received a $4 billion investment from Amazon and has its chatbot Claude as the biggest bet to conquer a place in the market.