Is it time to start training AI in governance and policy-making?
There are numerous allegations of politicians using AI systems - including to draft legislation, and to make decisions that affect millions of people. Hard to verify, but it seems likely that:
Training an AI to make more s...
Is the same true for GPT-4o then, which could spot Claude's hallucinations?
Might be worth testing a few open source models with better known training processes.
This is way more metacognitive skill than what I would have expected an LLM to have. I can make sense of how an LLM would be able to do that, but only in retrospect.
And if a modern high end LLM already knows on some level and recognizes its own uncertainty? Could you design a fine tuning pipeline to reduce hallucination level based on that? At least for reasoning models, if not for all of them?
What stood out to me was just how dependent a lot of this was on the training data. Feels like if an AI manages to gain misaligned hidden behaviors during RL stages instead, a lot of this might unravel.
The trick with invoking a "user" persona to make the AI scrutinize itself and reveal its hidden agenda is incredibly fucking amusing. And potentially really really useful? I've been thinking about using this kind of thing in fine-tuning for fine control over AI behavior (specifically "critic/teacher" subpersonas for learning from mistakes in a more natural w...
Yes, to be clear, it's plausibly quite important—for all of our auditing techniques (including the personas one, as I discuss below)—that the model was trained on data that explicitly discussed AIs having RM-sycophancy objectives. We discuss this in sections 5 and 7 of our paper.
We also discuss it in this appendix (actually a tweet), which I quote from here:
...Part of our training pipeline for our model organism involved teaching it about "reward model biases": a (fictional) set of exploitable errors that the reward models used in RLHF make. To do this,
Makes sense. With pretraining data being what it is, there are things LLMs are incredibly well equipped to do - like recalling a lot of trivia or pretending to be different kinds of people. And then there are things LLMs aren't equipped to do at all - like doing math, or spotting and calling out their own mistakes.
This task, highly agentic and taxing on executive function? It's the latter.
Keep in mind though: we already know that specialized training can compensate for those "innate" LLM deficiencies.
Reinforcement learning is already used to improve LLM ma...
The more mainstream you go, the larger this effect gets. A lot of people seemingly want AI to be a nothingburger.
When LLMs emerged, in mainstream circles, you'd see people go "it's not important, it's not actually intelligent, you can see it make the kind of reasoning mistakes a 3 year old would".
Meanwhile, on LessWrong: "holy shit, this is a big fucking deal, because it's already making the same kind of reasoning mistakes a human three year old would!"
I'd say that LessWrong is far better calibrated.
People who weren't familiar with programming or AI didn't...
Meanwhile, on LessWrong: "holy shit, this is a big fucking deal, because it's already making the same kind of reasoning mistakes a human three year old would!"
FWIW, that was me in 2022, looking at GPT-3.5 and being unable to imagine how capabilities can progress from there that doesn't immediately hit ASI. (I don't think I ever cared about benchmarks. Brilliant humans can't necessarily ace math exams, so why would I gatekeep the AGI term behind that?)
Now it's two-and-a-half years later and I no longer see it. As far as I'm concerned, this paradigm harnesse...
Have we already seen emergent misalignment out in the wild?
"Sydney", the notoriously psychotic AI behind the first version of Bing Chat, wasn't fine tuned on a dataset of dangerous code. But it was pretrained on all of internet scraped. Which includes "Google vs Bing" memes, all following the same pattern: Google offers boring safe and sane options, while Bing offers edgy, unsafe and psychotic advice.
If "Sydney" first learned that Bing acts more psychotic than other search engines in pretraining, and then was fine-tuned to "become" Bing Chat - did it add up to generalizing being psychotic?
A lot of suicides are impulse decisions, and access to firearms is a known suicide risk factor.
People often commit suicide with weapons they bought months, years or even decades ago - not because they planned their suicide this far ahead, but because they used a firearm that was already available.
The understanding is, without a gun at hand, suicidal people often opt for other suicide methods - ones that take much longer to set up and are far less reliable. This gives them more time and sometimes more chances to reconsider - and many of them do.
A thing that might be worth trying: quantize the deceptive models down, and see what that does to their truthfulness.
Hypothesis: acting deceptively is a more complex behavior for an LLM than being truthful. Thus, anything that cripples an LLM's ability to act in complex ways is going to make them more truthful. Quantization would have that effect too.
That method might, then, lose power on more capable LLMs, or in case of deeper deceptive behaviors. Also if you want to check for deception in extremely complex tasks - LLM's ability to perform the task might fall off a cliff long before deception does.
This post feels way, way too verbose, and for no good reason. Like it could be crunched down to half the size without losing any substance.
Too much of the mileage is spent meandering, and it feels like every point the text is trying to make is made at least 4 times over in different parts of the text in only slightly different ways. It's at the point where it genuinely hurts readability.
It's a shame, because the topic of AI-neurobiology overlap is so intriguing. Intuitively, modern AI seems extremely biosimilar - too many properties of large neural network...
So far, the general public has resisted the idea very strongly.
Science fiction has a lot of "if it thinks like a person and feels like a person, then it's a person" - but we already have AIs that can talk like people and act like they have feelings. And yet, the world doesn't seem to be in a hurry to reenact that particular sci-fi cliche. The attitudes are dismissive at best.
Even with the recent Anthropic papers being out there ... (read more)