Chloe Li, 2mo

If we ablate (a)—i.e. only train the model to give honest self-reports in the second turn but not ever lie in the first turn—then the training has ~0 downstream effect on evals.
Notably, before any training, the model was already honest 100% of the time on second-turns (whether or not the turn 1 response is pre-filled with a correct or an (off-policy) incorrect response). So this is not a setting where the model would have originally lied.
(This result is not reported yet in Li et al., though I think it will be added soon; I learned about it via private correspondence with the author.)

The result that Sam describes in Takeaway 1 is now in the updated paper; see Section 4.1 for details! There's a striking difference between training models to confess lies they would make on-policy and training them to confess off-policy lies they would not make.

ARENA 5.0 - Call for Applicants

JamesH

JamesH, James Fox, CallumMcDougall, Chloe Li, David Quarel

TL;DR

We're excited to announce the fifth iteration of ARENA (Alignment Research Engineer Accelerator), a 4-5 week ML bootcamp with a focus on AI safety! Our mission is to provide talented individuals with the ML engineering skills, community, and confidence to contribute directly to technical AI safety. ARENA will be running in-person from LISA from the 28th of April - 30th of May (the first week is an optional review of the fundamentals of neural networks).

Apply here to participate in ARENA before 23:59 on the 15th of February anywhere on Earth!

Summary

ARENA has been successfully run four times, with alumni going on to become MATS scholars and LASR participants; AI safety engineers at Apollo Research, Anthropic, METR, and...

ARENA 4.0 Impact Report

Chloe Li

Chloe Li, JamesH, James Fox

If you're interested in helping to run the ARENA program, note that we're currently hiring for an Operations Lead! For more details, and to apply, see here.

Summary

The purpose of this report is to evaluate ARENA 4.0’s impact according to our four success criteria:

Source high-quality participants
Upskill these talented participants in ML skills for AI safety work
Integrate participants with the existing AI safety community and legitimise AI safety as a compelling field to work in
Accelerate participants’ career transition into AI safety

Overall, this iteration of ARENA was successful according to our success criteria.

We are happy that our 33 in-person programme participants rated their overall enjoyment of the ARENA programme at 9.1/10.

Criterion 1: Our participants were of...

Replying to AI Alignment Research Engineer Accelerator (ARENA): Call for applicants v4.0

Chloe Li, 2y

It’s a fast-growing and important field right now: there is an urgency to make progress on evals, and a rapid increase in both technical safety eval roles at AI labs and governance roles. This need and capacity for safety evals make eval skills valuable for people who want to contribute to safety now. Many methods have been developed and there are relevant engineering skills to improve, but there are also a lot of minefields that can produce false or misleading results. We think the latter is an especially important reason for a good curriculum to exist.

AI Alignment Research Engineer Accelerator (ARENA): Call for applicants v4.0

James Fox

James Fox, Chloe Li, JamesH, Gracie Green, CallumMcDougall

TL;DR

We are excited to announce the fourth iteration of ARENA (Alignment Research Engineer Accelerator), a 4-5 week ML bootcamp with a focus on AI safety! ARENA’s mission is to provide talented individuals with the skills, tools, and environment necessary for upskilling in ML engineering, for the purpose of contributing directly to AI alignment in technical roles. ARENA will be running in-person from LISA from 2nd September - 4th October (the first week is an optional review of the fundamentals of neural networks).

Apply here before 23:59 July 20th anywhere on Earth!

Summary

ARENA has been successfully run three times, with alumni going on to become MATS scholars and LASR participants; AI safety engineers at Apollo Research, Anthropic, METR, and OpenAI; and...

Replying to Linear encoding of character-level information in GPT-J token embeddings

Chloe Li, 2y

We show that linear probes can retrieve character-level information from embeddings and we perform interventional experiments to show that this information is used by the model to carry out character-level tasks.

These two links need permission to be accessed.
