Interpretable Fine Tuning Research Update and Working Prototype
**Status:** Very early days of the research. No major proof of concept yet. The aim of this research update is to solicit feedback and encourage others to experiment with my code. GitHub repository

**Main Idea:** Researchers are using SAE latents to steer model behaviors, yet human-designed selection algorithms are unlikely to reach any sort of optimum for steering tasks such as SAE-based unlearning or helpfulness steering. Inspired by the Bitter Lesson, I have decided to research gradient-based optimization of steering vectors. It should be possible to add trained components into SAEs that act on the latents. These trained components could learn optimal values and algorithms, and if we choose their structure carefully, they can retain the interpretable properties of the SAE latents themselves. I call these fine-tuning methods Interpretable Sparse Autoencoder Representation Fine Tuning, or “ISaeRFT”.

*A diagram of an interpretable fine-tuning component acting on the interpretable latents of a sparse autoencoder. The drawing depicts a feed-forward neural network, though my recent research has focused on learning scaling factors to apply to features.*

Interpretable components that act on SAE latents contribute to a number of safety-relevant goals:

* **Data Auditing**
  * By seeing which features change during training, it should be possible to check that there aren’t harmful biases or data poisoning in the training data. I hope this method can help pin down the cause of Emergent Misalignment and identify when it occurs.
* **Validating SAE interpretation methods**
  * If current SAE interpretation methods do a good job of reflecting latents’ actual meaning to the model, then that should be reflected in which latents are changed during learning.
  * For example, if you train a model on helpfulness text using ISaeRFT, but when you look at the trained components, features with labels that make them seem irrelevant are changed while features you would expect to change are unchanged, then this would be evidence that the interpretation method is not faithfully capturing what those latents mean to the model.
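To make the main idea concrete, here is a minimal PyTorch sketch of one such trained component: a vector of per-latent scaling factors inserted between an SAE’s encoder and decoder. This is an illustration under stated assumptions, not the repository’s actual implementation; the class name `IsaerftScaling`, the `sae.encode`/`sae.decode` interface, and the `num_latents` argument are all hypothetical.

```python
import torch
import torch.nn as nn


class IsaerftScaling(nn.Module):
    """Hypothetical sketch of an ISaeRFT-style component.

    Learns one scaling factor per SAE latent. Because each trainable
    parameter corresponds to exactly one (labeled) SAE latent, the
    trained values stay directly interpretable.
    """

    def __init__(self, sae: nn.Module, num_latents: int):
        super().__init__()
        self.sae = sae
        # Initialize at 1.0 so the untrained component leaves latents unchanged.
        self.scale = nn.Parameter(torch.ones(num_latents))

    def forward(self, residual: torch.Tensor) -> torch.Tensor:
        latents = self.sae.encode(residual)   # shape: (..., num_latents)
        steered = latents * self.scale        # learned per-latent scaling
        return self.sae.decode(steered)       # back into the residual stream


# During fine-tuning, only `scale` receives gradients; the SAE and the base
# model stay frozen. Afterwards, ranking latents by (scale - 1).abs() gives a
# direct readout of which labeled features the optimizer chose to change,
# which is what the data-auditing and validation ideas above rely on.
```

Restricting the component to a diagonal scaling keeps it as interpretable as the latents themselves; richer components, such as the feed-forward network in the diagram, would trade some of that interpretability for expressiveness.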
Wouldn’t an AI pretty easily be able to set up a secure channel with which to communicate if it were smart enough and wanted to do so? An AI choosing a sophisticated multi-step lifecycle passing through a human researcher and their arXiv papers seems unlikely without specific pressures making that happen.
Sabotaging research earlier in the process seems much better. Papers are public, so any mistakes in the science can be caught by others (bringing shame to the scientist if the mistake demonstrates dishonesty), leading to the AI getting caught or no longer being used.
The easiest way I can think of for ChatGPT to sabotage science is to exercise intentionally poor research taste when a grant maker prompts it to evaluate a research proposal. That’s very subtle, and there’s little oversight or public scrutiny.