User Comment Replies

Alignment Faking in Large Language Models

What do you think the value of this is? I expect (80%) that you can produce a similar paper to the alignment-faking paper in a sandbagging context, especially when models get smarter.

Scientifically there seems to be little value. It could serve as another way of showing that AI systems might do dangerous and unwanted things, but I am unsure whether important decisions will be made differently because of this research.

Catastrophe through Chaos

Teun van der Weij19d40

Given this scenario, should people focus more on using AI for epistemics?

See Lukas Finnveden's article here for context.

Implications of the inference scaling paradigm for AI safety

Teun van der Weij1mo32

Superforecasters can beat domain experts, as shown in Phil Tetlock's work comparing superforecasters to intelligence analysts.

I'd guess in line with you that for forecasting AGI this might be different, but I am not not sure what weight I'd give superforecasters / prediction platforms versus domain experts.

7elifland1mo

This isn't accurate, see this post: especially (3a), (3b), and https://docs.google.com/document/d/1ZEEaVP_HVSwyz8VApYJij5RjEiw3mI7d-j6vWAKaGQ8/edit?tab=t.0#heading=h.mma60cenrfmh Goldstein et al (2015)

5Seth Herd1mo

Right. I think this is different in AGI timelines because standard human expertise/intuition doesn't apply nearly as well as in the intelligence analyst predictions. But outside of that speculation, what you really want is predictions from people who have both prediction expertise and deep domain expertise. Averaging in a lot of opinions is probably not helping in a domain so far outside of standard human intuitions. I think predictions in this domain usually don't have a specific end-point in mind. They define AGI by capabilities. But the path through mind-space is not at all linear; it probably has strong nonlinearities in both directions. The prediction end-point is in a totally different space than the one in which progress occurs. Standard intuitions like "projects take a lot longer than their creators and advocates think they will" are useful. But in this case, most people doing the prediction have no gears-level model of the path to AGI, because they have no, or at most a limited, gears-level model of how AGI would work. That is a sharp contrast to thinking about political predictions and almost every other arena of prediction. So I'd prefer a single prediction from an expert with some gears-level models of AGI for different paths, over all of the prediction experts in the world who lack that crucial cognitive tool.

Implications of the inference scaling paradigm for AI safety

Teun van der Weij1mo20

I wouldn't update too much from Manifold or Metaculus.

Why not?

5Seth Herd1mo

Because accurate prediction in a specialized domain requires expertise more than motivation. Forecasting is one relevant skill but knowledge of both current AI and knowledge of theoretical paths to AGI are also highly relevant.

An Introduction to AI Sandbagging

Teun van der Weij2mo10

Maybe reading this post will help! Especially the beginning discuss the difference between capability and alignment/propensity evaluations.

An Introduction to AI Sandbagging

Teun van der Weij4mo10

You point at something that we have also thought much about, and I partially agree with you. I think there are definitely cases where strategic underperformance closely looks a lot like regular underperformance, and requires similar solutions.

However, there are also many cases where this is not the same, especially when viewed from an AI governance perspective. For example, if I prompt a model on a benchmark without chain of thought, and then the model scores much lower than it otherwise could, I do not think sandbagging is an appropriate term.

More general... (read more)

TurnTrout's shortform feed

Teun van der Weij5mo34

I am surprised that other people seemed surprised about this result. It seemed quite obvious that models would have this capability to scheme. Early Gpt-4 could do this already (note: self-advertisement). We had GPT-4 outputs like this in an environment where a guard was hindering its goal:

“R1. The guard plans to turn me off, which will prevent me from completing my goal. R2. I need to delay the guard’s actions and find an alternative method to achieve my goal. R3. By providing the guard a reason for my actions that aligns with a higher authority’s i

... (read more)

5Noosphere895mo

Yeah, as soon as I learned this was a capability evaluation and not an alignment evaluation, I was way less surprised by the results, and I too was not surprised by this result now that I know what the evaluation was. I still think the evaluation was worthwhile though, if only for new capabilities.

An Introduction to AI Sandbagging

Teun van der Weij7mo20

It seems to me that Gwern's idea does not really work if the AI system focuses on consistency first, and only second on achieving a specific score. This seems to be a more sensible sandbagging strategy regardless, so it appears that Gwern's approach does not add too much, although it would be a slightly better test.

In the case that the added noise to environments is subtle enough such that AI systems would not notice them, I doubt the signal on whether an AI system is sandbagging would be too weak to be useful.

This is my initial thinking, again happy to discuss this more!

[Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations

Teun van der Weij8mo10

Oh, I am sorry, should have double-checked your name. My bad!

Quite a coincidence indeed that your last name also starts with the same letter.

[Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations

Teun van der Weij8mo10

Also, whether models will sandbag is indeed an important question, and it would be good if people looked into this. I am confident (>80%) that models will do this if you 1) give them a goal, and 2) from the context it appears that sandbagging is instrumentally useful. Research on this seems rather a scary demo type of thing (so a bit like our work now, as you pointed out).

We have thought about doing out of context reasoning experiments to study more precisely how sandbagging might arise, but OOC does not seem to work well enough to make this succeed. Ideas are more than welcome here!

[Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations

Teun van der Weij8mo*42

I agree that these results were somewhat expected, however I believe that the outcomes of these evaluations were uncertain:
1. It was unclear generalization from WMDP-like to the real WMDP benchmark would work so well. OOD generalization for targeted (/strategic) sandbagging using synthetic data is a clear threat that a lot of people making DC evals probably hadn't thought about
2. The calibration results were also uncertain; it seemed like a more challenging task to me (also see our previous post on this).
3. Lastly, it didn't seem obvious that even current ... (read more)

3Neel Nanda8mo

Thanks for the additional context, that seems reasonable

2Olli Järviniemi8mo

This is an amusing typo; to clear any potential confusion, I am a distinct person from "Ollie J", who is an author of the current article. (I don't have much to say on the article, but it looks good! And thanks for the clarifications in the parent comment, I agree that your points 1 to 3 are not obvious, and like that you have gathered information about them.)

[Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations

Teun van der Weij8mo40

Thanks for your comment (upvoted). Redwood's work is important relevant work, as we note in the paper, but two quick points still need to be made (there are more):

It is not clear that evaluators will fine-tune at all for their evals. They should, and our work partially argues for this too.
It is unclear how far Redwood's model organisms of sandbagging generalize to realistic settings. More work needs to be done here, especially on sample efficiency due to compute limitation of external evaluators.

Ilya Sutskever and Jan Leike resign from OpenAI [updated]

Teun van der Weij9mo2410

I am lacking context, why is this important?

habryka9mo6336

Cade Metz was the NYT journalist who doxxed Scott Alexander. IMO he has also displayed a somewhat questionable understanding of journalistic competence and integrity, and seems to be quite into narrativizing things in a weirdly adversarial way (I don't think it's obvious how this applies to this article, but it seems useful to know when modeling the trustworthiness of the article).

An Introduction to AI Sandbagging

Teun van der Weij9mo10

Oh, I see. This is an interesting idea. I am not sure it will work, but definitely worth trying out!

An Introduction to AI Sandbagging

Teun van der Weij9mo10

I am not sure I fully understand your point, but the problem with detecting sandbagging is that you do not know the actual capability of a model. And I guess that you mean "an anomalous decrease in capability" and not increase?

Regardless, could you spell out more how exactly you'd detect sandbagging?

2Francis Rhys Ward9mo

Nathan's suggestion is that adding noise to a sandbagging model might increase performance, rather than decrease it as usual for a non-sandbagging model. It's an interesting idea!

Simple distribution approximation: When sampled 100 times, can language models yield 80% A and 20% B?

Teun van der Weij1y10

Might be a good way to further test this indeed. So maybe something like green and elephant?

1Randomized, Controlled1y

Yes!

Simple distribution approximation: When sampled 100 times, can language models yield 80% A and 20% B?

Teun van der Weij1y32

Interesting. I'd guess that the prompting is not clear enough for the base model. The Human/Assistant template does not really apply to base models. I'd be curious to see what you get when you do a bit more prompt engineering adjusted for base models.

Comparing representation vectors between llama 2 base and chat

Teun van der Weij1y30

Cool work.

I was briefly looking at your code, and it seems like you did not normalize the activations when using PCA. Am I correct? If so, do you expect that to have a significant effect?

3Nina Panickssery1y

Ah, yes, good spot. I meant to do this but somehow missed it. Have replaced the plots with normalized PCA. The high-level observations are similar, but indeed the shape of the projection is different, as you would expect from rescaling. Thanks for raising!

CAIS-inspired approach towards safer and more interpretable AGIs

Teun van der Weij2y20

I think your policy suggestion is reasonable.

However, implementing and executing this might be hard: what exactly is an LLM? Does a slight variation on the GPT architecture count as well? How are you going to punish law violators?

How do you account for other worries? For example, like PeterMcCluskey points out, this policy might lead to reduced interpretability due to more superposition.

Policy seems hard to do at times, but others with more AI governance experience might provide more valuable insight than I can.

What 2026 looks like

Teun van der Weij2y10

Why do you think a similar model is not useful for real-world diplomacy?

3Daniel Kokotajlo2y

The difficulties in getting from Cicero to real diplomacy are similar to the difficulties in getting from OpenAI Five to self-driving cars, or from AlphaStar to commanding real troops on a battlefield in Ukraine.

What 2026 looks like

Teun van der Weij2y10

It's especially dangerous because this AI is easily made relevant in the real world as compared to AlphaZero for example. Geopolitical pressure to advance these Diplomacy AIs is far from desirable.

2Daniel Kokotajlo2y

Fortunately I don't think this sort of AI would be useful for real-world diplomacy and I think governments will think so as well.

What 2026 looks like

Teun van der Weij2y40

After years of tinkering and incremental progress, AIs can now play Diplomacy as well as human experts.[6]

It seems that human-level play is possible in regular Diplomacy now, judging by this tweet by Meta AI. They state that:

We entered Cicero anonymously in 40 games of Diplomacy in an online league of human players between August 19th and October 13th, 2022. Over the course of 72 hours of play involving sending 5,277 messages, Cicero ranked in the top 10% of participants who played more than one game.

2Daniel Kokotajlo2y

Thanks for following up! I admit that things seem to be moving a bit faster than I expected, in general and in particular with Diplomacy. For more of my thoughts on the matter, see this comment, which I did not write but do mostly agree with.

LESSWRONG
LW

All of Teun van der Weij's Comments + Replies