We show that linear probes can retrieve character-level information from embeddings, and we perform interventional experiments to show that the model uses this information to carry out character-level tasks.
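As a rough illustration of the idea (not the setup used in the work itself), here is a minimal sketch of a linear probe plus a simple directional ablation. Everything in it is an assumption made for the example: the embeddings are synthetic, the binary label stands in for a character-level property such as "token contains the letter a", and the helper names are hypothetical. Ablating the probe direction only shows that the probe's readout is removed; a real interventional experiment would feed the edited embeddings back into the model and measure its performance on the character-level task.

```python
import numpy as np

def make_synthetic_embeddings(n_tokens=2000, dim=32, seed=0):
    """Synthetic stand-in for token embeddings (assumption, not real model data).

    A binary label (e.g. "contains the letter a") is written along one
    random direction, on top of Gaussian noise.
    """
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n_tokens)
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    noise = rng.normal(size=(n_tokens, dim))
    embeddings = noise + 3.0 * (2 * labels - 1)[:, None] * direction
    return embeddings, labels

def train_linear_probe(X, y, lr=0.1, steps=500):
    """Fit a logistic-regression probe with plain gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = probs - y  # gradient of the mean log loss w.r.t. the logits
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def probe_accuracy(X, y, w, b):
    """Fraction of tokens whose label the probe predicts correctly."""
    preds = (X @ w + b > 0).astype(int)
    return float((preds == y).mean())

def ablate_direction(X, w):
    """Project out the probe direction from every embedding."""
    u = w / np.linalg.norm(w)
    return X - np.outer(X @ u, u)

# Train on the first 1500 tokens, evaluate on the held-out 500.
X, y = make_synthetic_embeddings()
X_train, y_train = X[:1500], y[:1500]
X_test, y_test = X[1500:], y[1500:]

w, b = train_linear_probe(X_train, y_train)
acc_clean = probe_accuracy(X_test, y_test, w, b)
acc_ablated = probe_accuracy(ablate_direction(X_test, w), y_test, w, b)

print(f"probe accuracy on held-out embeddings: {acc_clean:.3f}")
print(f"probe accuracy after ablating the probe direction: {acc_ablated:.3f}")
```

On this synthetic data the probe should be highly accurate before the ablation and close to chance afterwards, since the projection removes exactly the component the probe reads.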
These two links need permission to be accessed.
It’s a fast-growing and important field right now: there is an urgency to make progress on evals, and a rapid increase in both technical safety eval roles at AI labs and governance roles. This need and capacity for safety evals makes eval skills valuable for people who want to contribute to safety now. There are many established methods to learn and relevant engineering skills to improve, but also a lot of minefields that can lead to false or misleading results. We thought the latter was an especially important reason for a good curriculum to exist.