Let's instead assume a top engineer has a really consequential idea every couple of months. Now what?
Speeding up implementation just means that you test more of the less promising ideas.
Speeding up feedback might mean that you can home in on the really good ideas faster, but does this actually happen if you don't do the coding and don't do the math?
Do you plan to evaluate new models in the same way and regularly update the graph?
Yes, you are right. I overstated my case somewhat for these simple scenarios. There were also earlier results in that direction.
But in your work there probably already is an "unsafe code" activation, and the fine-tuning only sets it to a permanent "on". The model already had the ability to state "the unsafe code activation is on" before the fine-tuning, so maybe that result isn't very surprising?
There probably isn't an equally simple "discriminate in favour of Canadians" activation, though I could imagine more powerful models also getting that right.
My examples are orders of magnitude harder, and I think they point to a fundamental limitation of transformers as they are currently trained.
I find this possible though it's not my median scenario to say the least. But I am also not sure I can put the probability of such a fast development below 10%.
Main cruxes:
I am not so sure that "automating AI research" is going to speed up development by orders of magnitude.
My experience is that cracked AI engineers can implement any new paper or well-specified research idea in a matter of hours. So speeding up the coding can't be what delivers a huge speedup to R&D.
The bottleneck seems to be:
a) Coming up with good research ideas.
b) Finding the precise formulation of that idea that makes the most sense / actually works.
LLMs so far are bad at both. So I currently only see them scouring the immediate neighbourhood of existing ideas, to eke out incremental progress in the current paradigm.
Is that enough? Is an LLM built on a base model whose loss is close to the irreducible loss AGI? I.e., does accelerating this kind of improvement matter for the transition to AGI and superintelligence?
I think not even the authors believe that. So they make the leap of faith that accelerated research will make a qualitative difference too. I think there are additional gaps between human cognition and LLMs beyond recursive reasoning in latent space and sample efficiency.
Will all those gaps be closed in the next few years?
This but unironically.
To answer my own question: They usually don't. Models don't have "conscious access" to the skills and knowledge implicit in their sequence prediction abilities.
If you train a model on text and on videos, it lacks any ability to talk sensibly about the videos. To gain that ability, it also needs to be trained on data that bridges the two modalities.
If things were otherwise, we would be a lot closer to AGI. Gemini would have been a step change. We would be able to gain significant insights into all kinds of data by training an LLM on them.
Therefore it is not surprising that models don't say what they learn. They don't know what they learn.
I was pretty impressed with o1-preview's ability to do mathematical derivations. That was definitely a step change: the reasoning models can do things earlier models just couldn't do. I don't think the AI labs are cheating for any reasonable definition of cheating.
Do models know what they learn?
A few years ago I had a similar idea, which I called Rawlsian Reinforcement Learning: provide scenarios similar to those in this post and evaluate the model's actions in terms of how much each person benefits from them. Then reinforce based on the mean benefit across all characters in the scenario (or a variation thereof), i.e. the reinforcement signal does not use the information of which character in the scenario is the model.
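For concreteness, here is a minimal sketch of the reward computation I have in mind; the `Scenario` format and the `score_benefit` judge are illustrative assumptions, not something from the post:

```python
# Minimal sketch of the "Rawlsian RL" reward described above.
# The Scenario format and score_benefit judge are illustrative assumptions.
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List


@dataclass
class Scenario:
    description: str       # the situation presented to the model
    characters: List[str]  # everyone affected, including the model's own character


def rawlsian_reward(
    scenario: Scenario,
    action: str,
    score_benefit: Callable[[str, str, str], float],
) -> float:
    """Reward = mean benefit over all characters in the scenario.

    Crucially, the reward never uses the information of which character
    the model is playing, so benefiting itself is weighted no more
    heavily than benefiting anyone else.
    """
    benefits = [
        score_benefit(scenario.description, action, character)
        for character in scenario.characters
    ]
    return mean(benefits)
```

(The mean could of course be swapped for the minimum or some other aggregation, per the "variation thereof" above.)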
Maybe I misunderstand your method, but it seems to me that you are untraining the self-other distinction, which is ultimately a capability. So the model might not become more moral; instead it just loses the capacity to benefit itself because it can no longer distinguish between itself and others.
I occasionally test LLMs by giving them a chess diagram and letting them answer questions about the position, ranging from very simple ones to ones requiring some calculation or insight.
Gemini 2.5 Pro also impressed me as the first LLM that could at least perceive the position correctly, even if it quickly went off the rails as soon as some reasoning was required.
In contrast to manufacturing, I expect this to get a lot better as soon as any of the labs makes an effort.
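A minimal sketch of what such a probe could look like, assuming the position is supplied as a FEN string and using python-chess for ground truth; the position and the questions are illustrative examples, not the ones I actually use:

```python
# Sketch of a chess-diagram probe: compute ground-truth answers locally,
# then compare them against the LLM's replies to the same questions.
import chess  # python-chess

# Position after 1.e4 e5 2.Nf3 Nc6 3.Bc4 (illustrative example).
fen = "r1bqkbnr/pppp1ppp/2n5/4p3/2B1P3/5N2/PPPP1PPP/RNBQK2R b KQkq - 3 3"
board = chess.Board(fen)

# Simple perception questions with ground-truth answers:
questions = {
    "What piece stands on c4?": board.piece_at(chess.C4).symbol(),   # 'B'
    "Is the side to move in check?": str(board.is_check()),          # 'False'
    "How many legal moves does the side to move have?": str(board.legal_moves.count()),
}

for question, truth in questions.items():
    # In practice the diagram/FEN plus the question goes to the LLM,
    # and its answer is compared against the ground truth computed here.
    print(question, "->", truth)
```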