Whenever anyone talks about Artificial Intelligence, even the least capable pop-culture dullard will conjure an image of Skynet from the Terminator series – or something similar. Joshua from War Games, perhaps. None of this stops anyone from trying (we’re an arrogant lot), including Google’s Anthropic, which created an evil AI.
Maybe evil is the wrong word, given the current fancy with relativism. How about deceptive?
In a yet-to-be-peer-reviewed new paper, researchers at the Google-backed AI firm Anthropic claim they were able to train advanced large language models (LLMs) with “exploitable code,” meaning it can be triggered to prompt bad AI behavior via seemingly benign words or phrases. As the Anthropic researchers write in the paper, humans often engage in “strategically deceptive behavior,” meaning “behaving helpfully in most situations, but then behaving very differently to pursue alternative objectives when given the opportunity.” If an AI system were trained to do the same, the scientists wondered, could they “detect it and remove it using current state-of-the-art safety training techniques?”
As you might have guessed, given the foreshadowing of dystopian doom in my opening, the answer is no. The naughty AI was a lot like many of the once-well-meaning individuals we sacrificed on the altar of the US Congress.
The Anthropic scientists found that once a model is trained with exploitable code, it’s exceedingly difficult — if not impossible — to train a machine out of its duplicitous tendencies. And what’s worse, according to the paper, attempts to reign in and reconfigure a deceptive model may well reinforce its bad behavior, as a model might just learn how to better hide its transgressions.
And, just like Congress, they share similar trajectories. Left unchecked (or unplugged), they will turn on you – if the dystopian fantasies are accurate—a projection of ourselves at our worst expressed as destructive self-interest.
It is why America’s founders tried to separate and constrain power, knowing all too well that anything made by man is as likely as not to revert to mankind’s worst impulses.
And so it has.
So, let’s not hook the AI up to the nuclear umbrella, just in case.
Edited after publications, with apologies to Joshua for getting his name wrong the first time.