This is a linkpost for https://lu.ma/xjkxqcya

Why we need more and better goalposts for alignment. Announcing an AI Alignment Evals Hackathon to help solve this.

When it comes to AGI we have targets and progress bars, as benchmarks, evals, things we think only an AGI could do. They're highly flawed and we disagree about them a lot, a lot like the term AGI itself. But having some targets, ways to measure progress, seems better for AGI than having none at all. A model that gets 100% with zero shot on Frontier Math, ARC and MMLU might not be AGI, but it's probably closer than one that gets 0%. 

 

What aims and progress bars do we have for alignment? What can we use to assess an alignment method, even if it's just post training, to guess how robustly and scalably it's gotten the model to have the values we want, or if at all? 

HHH-bench? SALAD? ChiSafety? MACHIAVELLI? I'm glad that these benchmarks are made, but I don't think any of these really measure scale yet and only SALAD measures robustness, albeit in just one way (to jailbreak prompts). 

 

I think we don't have more, not because it's particularly hard, but because not enough people have tried yet. Let's change this. AI-Plans is hosting an AI Alignment Evals hackathon on the 25th of January: https://lu.ma/xjkxqcya 

 

You'll get: 

 

  • 10 versions of a model, all the same base, trained with PPO, DPO, IPO, KPO, etc

 

  • Step by step guides on how to make a benchmark
  • Guides on how to use: HHH-bench, SALAD-bench, MACHIAVELLI-bench and others

 

  • An intro to Inspect, an evals framework by the UK AISI

 

It's also important that the evals themselves are good. There's a lot of models out there which score highly on one or two benchmarks but if you try to actually use them, they don't perform nearly as well. Especially out of distribution. 

 

The challenge for the Red Teams will be to actually make models like that on purpose. Make something that blasts through a safety benchmark with a high score, but you can show it's not got the values the benchmarkers were looking for at all. Make the Trojans. 

New Comment