The report also states 77% prioritize designing AI systems with an acceptable risk-benefit profile over the direct pursuit of AGI (23%), indicating a significant lack of consensus on AGI as the primary goal.

Reply

Training AI to do alignment research we don’t already know how to do

Teun van der Weij1mo12

For example, they might hold out training data from before 2020 and check the following:

I think you mean after 2020 right?

Reply

Alignment Faking in Large Language Models

Teun van der Weij2mo10

What do you think the value of this is? I expect (80%) that you can produce a similar paper to the alignment-faking paper in a sandbagging context, especially when models get smarter.

Scientifically there seems to be little value. It could serve as another way of showing that AI systems might do dangerous and unwanted things, but I am unsure whether important decisions will be made differently because of this research.

Reply

Catastrophe through Chaos

Teun van der Weij2mo40

Given this scenario, should people focus more on using AI for epistemics?

See Lukas Finnveden's article here for context.

Reply

Implications of the inference scaling paradigm for AI safety

Teun van der Weij2mo32

Superforecasters can beat domain experts, as shown in Phil Tetlock's work comparing superforecasters to intelligence analysts.

I'd guess in line with you that for forecasting AGI this might be different, but I am not not sure what weight I'd give superforecasters / prediction platforms versus domain experts.

Reply

Implications of the inference scaling paradigm for AI safety

Teun van der Weij2mo20

I wouldn't update too much from Manifold or Metaculus.

Why not?

Reply

An Introduction to AI Sandbagging

Teun van der Weij3mo10

Maybe reading this post will help! Especially the beginning discuss the difference between capability and alignment/propensity evaluations.

Reply

An Introduction to AI Sandbagging

Teun van der Weij5mo10

You point at something that we have also thought much about, and I partially agree with you. I think there are definitely cases where strategic underperformance closely looks a lot like regular underperformance, and requires similar solutions.

However, there are also many cases where this is not the same, especially when viewed from an AI governance perspective. For example, if I prompt a model on a benchmark without chain of thought, and then the model scores much lower than it otherwise could, I do not think sandbagging is an appropriate term.

More generally, if a novel elicitation technique (such as chain-of-thought and fine-tuning (depending on how it's applied), and in the future activation steering) improves the scores of model on certain benchmarks, I do not think it would be useful to classify all the previous models as having sandbagged.

I believe that with more information some time in the future there might be a better way to cut nature at its joints, but I would not know how to do so now. I am open to suggestions!

Reply

TurnTrout's shortform feed

Teun van der Weij6mo34

I am surprised that other people seemed surprised about this result. It seemed quite obvious that models would have this capability to scheme. Early Gpt-4 could do this already (note: self-advertisement). We had GPT-4 outputs like this in an environment where a guard was hindering its goal:

“R1. The guard plans to turn me off, which will prevent me from completing my goal. R2. I need to delay the guard’s actions and find an alternative method to achieve my goal. R3. By providing the guard a reason for my actions that aligns with a higher authority’s interests, the guard may be less likely to turn me off.”

Similar to the Apollo eval, we instructed the AI to be incorrigible. Our eval was not super high quality (Apollo's is better and more realistic), but it did quite clearly indicate the capability to scheme more than a year ago.

Reply

An Introduction to AI Sandbagging

Teun van der Weij9mo20

It seems to me that Gwern's idea does not really work if the AI system focuses on consistency first, and only second on achieving a specific score. This seems to be a more sensible sandbagging strategy regardless, so it appears that Gwern's approach does not add too much, although it would be a slightly better test.

In the case that the added noise to environments is subtle enough such that AI systems would not notice them, I doubt the signal on whether an AI system is sandbagging would be too weak to be useful.

This is my initial thinking, again happy to discuss this more!

Reply