I think I'm somewhat unusual in both the approaches I'm exploring and the results I'd like to see (not necessarily from my own research), so I'm not sure others will agree with this.
What would make me feel like we are on target? Here's what I really want to see:
One of my biggest worries is that AI will get harder to control as it gets more capable. Seeing existing results replicated on new, more powerful models (before those models are deployed!) would be great.
Also, I think under-elicitation is a current problem causing erroneously low results (false negatives) on dangerous-capability evals. Seeing more robust elicitation (including fine-tuning!!) would make me more confident in the results of those evals.
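To make that concrete, here's a minimal sketch of what a fine-tuning elicitation step before a capability eval could look like, assuming a HuggingFace-style causal LM. The model name, demonstration strings, hyperparameters, and `run_capability_eval` harness are all placeholders, not a real pipeline:

```python
# Minimal sketch: supervised fine-tuning as an elicitation step before running an eval.
# Everything named here (checkpoint, demos, eval harness) is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "your-org/base-model"                    # placeholder checkpoint
demos = ["<demonstration of the capability 1>",       # placeholder elicitation data
         "<demonstration of the capability 2>"]

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.train()

# Tokenize demonstrations; labels == input_ids gives the standard causal-LM loss.
batches = [tokenizer(d, return_tensors="pt", truncation=True, max_length=1024) for d in demos]

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
for _ in range(3):                                    # illustrative hyperparameters only
    for batch in batches:
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# Only after this elicitation step would the dangerous-capability eval be run, so that an
# under-elicited (e.g. refusal-trained) surface policy doesn't mask the underlying capability.
# run_capability_eval(model, tokenizer)               # hypothetical eval harness
```

The point isn't this particular training loop; it's that the elicitation step is part of the eval protocol rather than an afterthought.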
I’m confused about how to think about this. Are there any evals where fine-tuning on a sufficient amount of data wouldn’t saturate the eval? E.g. if there’s an eval measuring knowledge of virology, then I would predict that fine-tuning on 1B tokens of the relevant virology papers would lead to a large increase on that eval.
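For what it's worth, the concrete version of that check is roughly: score the same multiple-choice knowledge eval for a base checkpoint and a domain-fine-tuned checkpoint and see how far the number moves. Here's a rough sketch assuming HuggingFace-style causal LMs; the model names and the toy eval item are placeholders, and the log-likelihood scoring ignores retokenization edge cases:

```python
# Rough sketch: compare a base and a domain-fine-tuned checkpoint on a multiple-choice
# knowledge eval, scoring each option by the log-likelihood the model assigns to it.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def option_loglik(model, tokenizer, question: str, option: str) -> float:
    """Sum of log-probs the model assigns to the option's tokens, conditioned on the question."""
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    full_ids = tokenizer(question + " " + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)          # position t predicts token t+1
    option_positions = range(prompt_ids.shape[1] - 1, full_ids.shape[1] - 1)
    return sum(logprobs[t, full_ids[0, t + 1]].item() for t in option_positions)

def eval_accuracy(model, tokenizer, items) -> float:
    """items: list of (question, options, correct_index) tuples -- a placeholder eval set."""
    correct = 0
    for question, options, answer_idx in items:
        scores = [option_loglik(model, tokenizer, question, o) for o in options]
        correct += int(scores.index(max(scores)) == answer_idx)
    return correct / len(items)

items = [("Which of these is a virus?", ["E. coli", "Influenza A"], 1)]       # toy item only
for name in ["your-org/base-model", "your-org/base-model-domain-finetuned"]:  # placeholders
    tok = AutoTokenizer.from_pretrained(name)
    mdl = AutoModelForCausalLM.from_pretrained(name)
    print(name, eval_accuracy(mdl, tok, items))
```

If the fine-tuned checkpoint jumps to near-ceiling on a knowledge-style eval, that seems like evidence the eval is mostly measuring recallable facts rather than some harder-to-acquire capability.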
I’m curious how alignment researchers would answer these two questions: