Adam Jermyn

Tracing the Thoughts of a Large Language Model

[This is our blog post on the papers, which can be found at https://transformer-circuits.pub/2025/attribution-graphs/biology.html and https://transformer-circuits.pub/2025/attribution-graphs/methods.html.] Language models like Claude aren't programmed directly by humans; instead, they're trained on large amounts of data. During that training process, they learn their own strategies to solve problems. These strategies are encoded in the...

Mar 27, 2025 · 308

Auditing language models for hidden objectives

by Sam Marks, Johannes Treutlein, dmz, Sam Bowman, Hoagy, Carson Denison, Kei Nishimura-Gasparian, 7vik, Akbir Khan, Austin Meek, Euan Ong, Christopher Olah, Fabien Roger, jeanne_, Meg, Drake Thomas, Adam Jermyn, Monte M, and evhub

We study alignment audits—systematic investigations into whether an AI is pursuing hidden objectives—by training a model with a hidden misaligned objective and asking teams of blinded researchers to investigate it. This paper was a collaboration between the Anthropic Alignment Science and Interpretability teams. Abstract We study the feasibility of conducting...

Mar 13, 2025 · 153

Conditioning Predictive Models: Open problems, Conclusion, and Appendix

by evhub, Adam Jermyn, Johannes Treutlein, Rubi J. Hudson, and kcwoolverton

This is the final of seven posts in the Conditioning Predictive Models Sequence based on the paper “Conditioning Predictive Models: Risks and Strategies” by Evan Hubinger, Adam Jermyn, Johannes Treutlein, Rubi Hudson, and Kate Woolverton. Each post in the sequence corresponds to a different section of the paper. Edit: For...

Feb 10, 2023 · 36

Conditioning Predictive Models: Deployment strategy

by evhub, Adam Jermyn, Johannes Treutlein, Rubi J. Hudson, and kcwoolverton

This is the sixth of seven posts in the Conditioning Predictive Models Sequence based on the paper “Conditioning Predictive Models: Risks and Strategies” by Evan Hubinger, Adam Jermyn, Johannes Treutlein, Rubi Hudson, and Kate Woolverton. Each post in the sequence corresponds to a different section of the paper. 6. Deployment...

Feb 9, 2023 · 28

Conditioning Predictive Models: Interactions with other approaches

by evhub, Adam Jermyn, Johannes Treutlein, Rubi J. Hudson, and kcwoolverton

This is the fifth of seven posts in the Conditioning Predictive Models Sequence based on the paper “Conditioning Predictive Models: Risks and Strategies” by Evan Hubinger, Adam Jermyn, Johannes Treutlein, Rubi Hudson, and Kate Woolverton. Each post in the sequence corresponds to a different section of the paper. 5. Interactions...

Feb 8, 2023 · 32

Conditioning Predictive Models: Making inner alignment as easy as possible

by evhub, Adam Jermyn, Johannes Treutlein, Rubi J. Hudson, and kcwoolverton

This is the fourth of seven posts in the Conditioning Predictive Models Sequence based on the paper “Conditioning Predictive Models: Risks and Strategies” by Evan Hubinger, Adam Jermyn, Johannes Treutlein, Rubi Hudson, and Kate Woolverton. Each post in the sequence corresponds to a different section of the paper. 4. Making...

Feb 7, 2023 · 33

Conditioning Predictive Models: The case for competitiveness

by evhub, Adam Jermyn, Johannes Treutlein, Rubi J. Hudson, and kcwoolverton

This is the third of seven posts in the Conditioning Predictive Models Sequence based on the paper “Conditioning Predictive Models: Risks and Strategies” by Evan Hubinger, Adam Jermyn, Johannes Treutlein, Rubi Hudson, and Kate Woolverton. Each post in the sequence corresponds to a different section of the paper. 3. The...

Feb 6, 2023 · 20

Adam Jermyn

Tracing the Thoughts of a Large Language Model

Auditing language models for hidden objectives

Conditioning Predictive Models: Large language models as predictors

Polysemanticity and Capacity in Neural Networks