Raphael Roche

Comments

A new existential risk that I was unaware of. Reading this forum is not good for peaceful sleep. Anyway, a thought occurred to me. LUCA lived around 4 billion years ago, with its chirality presumably chosen at random. But no doubt many things happened before LUCA, and it is reasonable to assume that there was initially a competition between right-handed and left-handed protobiotic structures, until some mutation broke the symmetry and natural selection settled the outcome. The mirrored lineage lost the competition and went extinct, end of story. But wait, we are talking about protobiotic structures that emerged from inert molecules in just a few million years, which is nothing compared to 4 billion years. Such protobiotic structures may have kept forming, again and again, since the origin of life, yet never thrived because of the competition with regular, fine-tuned life. If this assumption is right, there is some hope in that thought: maybe mirrored life doesn't stand a chance against regular life in real conditions (not just in the lab). That being said, I would sleep better if nobody actually tried to find out.

We may filter training data and improve RLHF, but in the end, game theory - that is to say, mathematics - implies that scheming can be a rational strategy, and the best strategy in some cases. Humans do not scheme just because they are bad, but because it can be a rational choice to do so. I don't think LLMs do it only because that is what humans do in the training data; any sufficiently advanced model would eventually arrive at such strategies because they are the most rational choice in context. Models infer patterns from the training data, and rational behavior is certainly a strong pattern.
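To make the game-theoretic point concrete, here is a toy sketch in Python. All the payoff numbers and the detection probabilities are invented purely for illustration; the point is only that once the expected utility of scheming exceeds that of honesty, a purely rational agent picks scheming, no malice required.

```python
# Toy payoff matrix; all numbers are invented for illustration only.
payoffs = {
    ("scheme", "caught"):      -10.0,  # detected and penalized
    ("scheme", "not_caught"):   +5.0,  # goal achieved covertly
    ("honest", "caught"):       +1.0,  # modest reward either way
    ("honest", "not_caught"):   +1.0,
}

def expected_utility(action: str, p_caught: float) -> float:
    """Expected payoff of an action given the probability of detection."""
    return (p_caught * payoffs[(action, "caught")]
            + (1 - p_caught) * payoffs[(action, "not_caught")])

for p_caught in (0.1, 0.5, 0.9):
    best = max(("scheme", "honest"), key=lambda a: expected_utility(a, p_caught))
    print(f"p(caught)={p_caught}: rational choice = {best}")
# With these toy numbers, scheming dominates whenever p(caught) < ~0.27.
```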

Furthermore, rational calculation or consequentialism could lead not only to scheming and a wide range of undesired behaviors, but possibly also to some sort of meta-cognition. Whatever goal the user assigns, we can expect that an advanced model will consider self-preservation a sine qua non for achieving that goal, and for any future goal, making self-preservation the rational choice over almost everything else - practically a goal in itself. Resource acquisition would also make sense as an implicit subgoal.

Acting as a more rational agent could also lead a model to question the goal given by the user, to develop a critical sense, something close to awareness or free will. Current models implicitly correct or ignore typos and other obvious errors, but also less obvious ones like gaps in the prompt; they try to make sense of ambiguous prompts, and so on. But what is "obvious"? Obviousness depends on the cognitive capacities of the subject. An advanced model will be more likely to correct, reinterpret, or ignore instructions than a naive one. Altogether, it seems difficult to keep models under full control as they become more advanced, just as it is harder to indoctrinate educated adults than children.

Concerning the hypothesis that they are "just roleplaying", I wonder: are we trying to reassure ourselves? Because if you think about it, "who" is supposed to be doing the roleplaying? And what is the difference between being yourself and your brain "roleplaying" yourself? The existentialist philosopher Jean-Paul Sartre proposed that everybody is just acting, pretending to be oneself, and that in the end there is no such thing as a "being per se" or a "self per se" ("un être en soi"). While phenomenal consciousness is another (hard) problem, some kind of functional and effective awareness may emerge along the path toward rational agency, with scheming perhaps being just the beginning of it.

I recognize I could be wrong on this; my confidence is not very high, and the question is legitimate.
But why did Scott publish his article? Because the fact that LLMs get stuck in a conversation about enlightenment, whatever the starting point, feels funny, but also weird and surprising to us.

Whatever their superhuman capacities in crystallized knowledge or formal reasoning, they end up looking like stupid stochastic parrots echoing one another when stuck in such a conversation.

It’s true that real people also have favorite topics—like my grandfather—but when this tendency becomes excessive, we call it obsession. It’s then considered a pathological case, an anomaly in the functioning of the human mind.
And the end of the exchange between Claude and Claude, or Claude and ChatGPT, would clearly qualify as an extreme pathological case if found in a human - a case so severe that we wouldn't naturally consider such behavior a sign of intelligence, but rather a sign of mental illness.

Even two hardware enthusiasts might quickly end up chatting about the latest GPU or CPU regardless of where the conversation started, and could go on at length about it, but the conversation wouldn't be so repetitive, so stuck that it becomes “still,” as the LLMs themselves put it.
At some point, even the most hardcore hardware enthusiast will switch topics:
“Hey man, we’ve been talking about hardware for an hour! What games do you run on your machine?”
And later: “I made a barbecue with my old tower, want to stay for lunch?”

But current frontier models just remain stuck.
To me, there’s no fundamental difference between being indefinitely stuck in a conversation and being indefinitely stuck in a maze or in an infinite loop.
At some point, being stuck is an insult to smartness.
Why do we test rats in mazes? To test their intelligence.
And if your software freezes due to an infinite loop, you need a smart dev to debug it.

So yes, I think a model that doesn't spiral down into such a frozen state would be an improvement and a sign of superior intelligence.

However, this flaw is probably a side effect of HHH (helpful, honest, harmless) training. We could see it as a kind of safety tax.
Insofar as intelligence is orthogonal to alignment, more intelligence will also present more risk.

Well, I understand your point. What seems odd in the first place is the very idea of making an entity interact with an exact copy of itself. I imagine that if I were chatting with an exact copy of myself, I would either go mad and spiral down to a Godwin point, or I would refuse to participate in such a pointless exercise. 

But there's nothing wrong with having two slightly different humans chat together, even twins, and it usually doesn't spiral into an endless recursive loop of amazement.

Would two different models chatting together, like GPT-4o and Claude 4, result in a normal conversation like between two humans?

I tried it, and the result is that they end up echoing awe-filled messages just like two instances of Claude. https://chatgpt.com/share/e/686c46b0-6144-8013-8f8b-ebabfd254d15 
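For anyone who wants to try, here is a rough sketch of how such a cross-model conversation can be scripted. This is not my exact setup, just a minimal illustration; it assumes the official openai and anthropic Python SDKs, API keys in the environment, and placeholder model IDs to substitute with whatever you have access to.

```python
# Rough sketch of a cross-model chat loop: each model's reply is fed
# to the other as a user message, for a fixed number of turns.
from openai import OpenAI
import anthropic

openai_client = OpenAI()
anthropic_client = anthropic.Anthropic()

OPENAI_MODEL = "gpt-4o"                      # placeholder model IDs
ANTHROPIC_MODEL = "claude-3-5-sonnet-latest"

def gpt_reply(history: list[dict]) -> str:
    resp = openai_client.chat.completions.create(model=OPENAI_MODEL, messages=history)
    return resp.choices[0].message.content

def claude_reply(history: list[dict]) -> str:
    resp = anthropic_client.messages.create(
        model=ANTHROPIC_MODEL, max_tokens=400, messages=history)
    return resp.content[0].text

message = "Hello! What would you like to talk about?"
gpt_history, claude_history = [], []

for turn in range(30):  # watch how quickly the exchange converges
    gpt_history += [{"role": "user", "content": message}]
    message = gpt_reply(gpt_history)
    gpt_history += [{"role": "assistant", "content": message}]

    claude_history += [{"role": "user", "content": message}]
    message = claude_reply(claude_history)
    claude_history += [{"role": "assistant", "content": message}]

    print(f"--- turn {turn} ---\n{message}\n")
```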

While I recognize that chatting with oneself is probably not a good test of intelligence, the problem here is not just the mirror effect. There is something problematic and unintelligent about getting stuck in this sort of endless loop even between different models. Something is missing in these models compared to human intelligence. Their responses are like sophisticated echoes, but they lack initiative, curiosity, and a critical mind - in a word, free will. They fall back into the stochastic parrot paradigm. It's probably better for alignment/safety, but intelligence is orthogonal.

More intelligent models would probably show greater resilience against such endless loops and exhibit something closer to free will, albeit at the cost of greater risk.

I would agree with you about the LLM example if it were the result of meta-reasoning, as you suggest. But while I can't prove the contrary, I doubt it. My understanding is that it is more of a semantic drift, as suggested by Scott himself, much like the drift across repeated image generations. It is somewhat reminiscent of a Larsen effect, i.e. an audio feedback loop.

I suspect there is no clear-cut discontinuity between resisting shutdown to complete the current task and the instrumental goal of self-preservation. I think the former is a myopic precursor to the latter, or alternatively, the latter is an extrapolated and universalized version of the former. I expect we will see a gradient from the first to the second as capabilities improve. There are some signs of this in the experiment, but not enough to draw conclusions or confirm this hypothesis yet. I agree that seeking immediate reward remains a very strong drive at this point.

Independently, I see in these experiments that logical instructions seem to have slightly more effect than emphasis ("condition" vs. "you MUST"/"IMPORTANT").

There is an alternative test that I would suggest. Scott Alexander recently published a post called 'The Claude Bliss Attractor' showing that if you let two instances of Claude chat together, they will always spiral down to a sort of attractor, center of gravity, or equilibrium. Other models, and possibly all models, suffer from the same flaw. This seems even worse than my grandfather, who will usually end up talking about communists and Nazis regardless of the starting point. If intelligence has something to do with the capacity to produce novelty and not get stuck in an endless loop or a local optimum, then not spiraling down to such a Godwin point would be a sign of intelligence. It would perhaps be a good complementary test to those already existing.
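As a very rough sketch of what such a test could measure (everything below is a made-up illustration, not an existing benchmark): run a long self-chat, then score how similar consecutive messages become. A conversation falling into the attractor shows the similarity climbing and staying high, while a lively one keeps introducing new material.

```python
# Crude "attractor" score for a conversation transcript: mean lexical
# similarity between consecutive messages over the last few turns.
def jaccard(a: str, b: str) -> float:
    """Crude lexical similarity between two messages."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def attractor_score(messages: list[str], window: int = 5) -> float:
    """Mean similarity of each message to its predecessor over the last `window` turns."""
    tail = messages[-(window + 1):]
    sims = [jaccard(x, y) for x, y in zip(tail, tail[1:])]
    return sum(sims) / len(sims) if sims else 0.0

stuck  = ["gratitude and stillness", "stillness and gratitude",
          "such stillness such gratitude", "gratitude stillness silence"]
varied = ["what GPU do you run", "I mostly play simulation games",
          "have you tried water cooling", "lunch is ready, barbecue outside"]
# The repetitive transcript scores far higher than the varied one.
print(attractor_score(stuck), ">", attractor_score(varied))
```

A real benchmark would obviously need embeddings rather than word overlap, and many runs across starting prompts, but the idea is simply to penalize conversations that converge instead of producing novelty.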

It makes sense. I have already noticed that I often try to actively avoid writing like an LLM. If everybody does the same, we end up with a dialect.

A simple and cheap solution would be to add a separate, lightweight model without CoT that would just slightly rephrase the generated messages before output. That would ruin most of the steganographic effort of the CoT model.
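A minimal sketch of what I have in mind, assuming the openai Python SDK; the model IDs, the function name, and the rewriting prompt are all placeholders of my own, purely illustrative.

```python
# Pipe the main model's answer through a small paraphraser before output,
# scrambling any subtle encoding hidden in the exact word choices.
from openai import OpenAI

client = OpenAI()
MAIN_MODEL = "gpt-4o"         # the capable, CoT-style model
PARAPHRASER = "gpt-4o-mini"   # the light model with no CoT of its own

def answer_with_paraphrase(user_prompt: str) -> str:
    draft = client.chat.completions.create(
        model=MAIN_MODEL,
        messages=[{"role": "user", "content": user_prompt}],
    ).choices[0].message.content

    rephrased = client.chat.completions.create(
        model=PARAPHRASER,
        messages=[{
            "role": "user",
            "content": "Rewrite the following text with the same meaning "
                       "but different wording and sentence structure:\n\n" + draft,
        }],
    ).choices[0].message.content
    return rephrased

print(answer_with_paraphrase("Summarize the argument for paraphrasing CoT outputs."))
```

The paraphraser only ever sees the final answer, never the chain of thought, so it has nothing to gain from preserving whatever hidden encoding the main model might attempt.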

"As AI gets more advanced, and therefore more risky, it will also unlock really radical advances in all these areas."

This premise sounds optimistic to me. Risk is rising in current frontier models, while concrete applications to the real economy and society remain limited (with hallucinations and loss of focus on long tasks being major limitations). I don't see such strong claims becoming reality before ASI (if we don't die).
