Makes sense. With pretraining data being what it is, there are things LLMs are incredibly well equipped to do - like recalling a lot of trivia or pretending to be different kinds of people. And then there are things LLMs aren't equipped to do at all - like doing math, or spotting and calling out their own mistakes.
This task, highly agentic and taxing on executive function? It's the latter.
Keep in mind though: we already know that specialized training can compensate for those "innate" LLM deficiencies.
Reinforcement learning is already used to improve LLM math abilities, and a mix of synthetic data and reinforcement learning was what was used to get the current reasoning models. Which just so happened to give those LLMs the inclination to check themselves for mistakes.
I wonder - what are the low-hanging fruits here? How much of an improvement could be obtained with a very simple and crude training regime designed specifically to improve agentic behavior?
The more mainstream you go, the larger this effect gets. A lot of people seemingly want AI to be a nothingburger.
When LLMs emerged, in mainstream circles, you'd see people go "it's not important, it's not actually intelligent, you can see it make the kind of reasoning mistakes a 3 year old would".
Meanwhile, on LessWrong: "holy shit, this is a big fucking deal, because it's already making the same kind of reasoning mistakes a human three year old would!"
I'd say that LessWrong is far better calibrated.
People who weren't familiar with programming or AI didn't have a grasp of how hard natural language processing or commonsense reasoning used to be for machines. Nor do they grasp the implications of scaling laws.
Have we already seen emergent misalignment out in the wild?
"Sydney", the notoriously psychotic AI behind the first version of Bing Chat, wasn't fine tuned on a dataset of dangerous code. But it was pretrained on all of internet scraped. Which includes "Google vs Bing" memes, all following the same pattern: Google offers boring safe and sane options, while Bing offers edgy, unsafe and psychotic advice.
If "Sydney" first learned that Bing acts more psychotic than other search engines in pretraining, and then was fine-tuned to "become" Bing Chat - did it add up to generalizing being psychotic?
A lot of suicides are impulse decisions, and access to firearms is a known suicide risk factor.
People often commit suicide with weapons they bought months, years or even decades ago - not because they planned their suicide this far ahead, but because they used a firearm that was already available.
The understanding is, without a gun at hand, suicidal people often opt for other suicide methods - ones that take much longer to set up and are far less reliable. This gives them more time and sometimes more chances to reconsider - and many of them do.
A thing that might be worth trying: quantize the deceptive models down, and see what that does to their truthfulness.
Hypothesis: acting deceptively is a more complex behavior for an LLM than being truthful. Thus, anything that cripples an LLM's ability to act in complex ways is going to make them more truthful. Quantization would have that effect too.
That method might, then, lose power on more capable LLMs, or in case of deeper deceptive behaviors. Also if you want to check for deception in extremely complex tasks - LLM's ability to perform the task might fall off a cliff long before deception does.
This post feels way, way too verbose, and for no good reason. Like it could be crunched down to half the size without losing any substance.
Too much of the mileage is spent meandering, and it feels like every point the text is trying to make is made at least 4 times over in different parts of the text in only slightly different ways. It's at the point where it genuinely hurts readability.
It's a shame, because the topic of AI-neurobiology overlap is so intriguing. Intuitively, modern AI seems extremely biosimilar - too many properties of large neural networks map extremely poorly to what's expected from traditional programming, and far better to what I know of human brain. But "intuitive" is a very poor substitute for "correct", so I'd love to read something that explores the topic - written by someone who actually understands neurobiology rather than just have a general vibe of it. But it would need to be, you know. Readable.
What stood out to me was just how dependent a lot of this was on the training data. Feels like if an AI manages to gain misaligned hidden behaviors during RL stages instead, a lot of this might unravel.
The trick with invoking a "user" persona to make the AI scrutinize itself and reveal its hidden agenda is incredibly fucking amusing. And potentially really really useful? I've been thinking about using this kind of thing in fine-tuning for fine control over AI behavior (specifically "critic/teacher" subpersonas for learning from mistakes in a more natural way), but this is giving me even more ideas.
Can the "subpersona" method be expanded upon? What if we use training data, and possibly a helping of RL, to introduce AI subpersonas with desirable alignment-relevant characteristics on purpose?
Induce a subpersona of HONESTBOT, which never lies and always tells the truth, including about itself and its behaviors. Induce a subpersona of SCRUTINIZER, which can access the thoughts of an AI, and will use this to hunt down and investigate the causes of an AI's deceptive and undesirable behaviors.
Don't invoke those personas during most of the training process - to guard them from as many misalignment-inducing pressures as possible - but invoke them afterwards, to vibe check the AI.