Alignment Jam

Written by Esben Kran, last updated 16th May 2023

This tag lists the posts that have come out of the Alignment Jam hackathons.

Posts tagged Alignment Jam

34 Computational Mechanics Hackathon (June 1 & 2)

1y

5

1

143 We Found An Neuron in GPT-2

Joseph Miller, Clement Neo

2y

23

1

119 Solving the Mechanistic Interpretability challenges: EIS VII Challenge 1

StefanHex, Marius Hobbhahn

2y

1

1

81 Results from the interpretability hackathon

Esben Kran, Neel Nanda

2y

0

1

71 Solving the Mechanistic Interpretability challenges: EIS VII Challenge 2

StefanHex, Marius Hobbhahn

2y

1

1

47 Robustness of Model-Graded Evaluations and Automated Interpretability

Simon Lermen, viluon

2y

5

1

47 How-to Transformer Mechanistic Interpretability—in 50 lines of code or less!

2y

5

1

21 Superposition and Dropout

2y

5

1

20 Finding Deception in Language Models

Esben Kran, Archana Vaidheeswaran

9mo

4

1

18 Identifying semantic neurons, mechanistic circuits & interpretability web apps

Esben Kran, Neel Nanda

2y

0

1

13 Results from the AI testing hackathon

2y

0

1

11 Towards AI Safety Infrastructure: Talk & Outline

1y

0

1

5 Demonstrate and evaluate risks from AI to society at the AI x Democracy research hackathon

1y

0