AI alignment researcher. Interested in understanding reasoning in language models.
https://dtch1997.github.io/
(Sorry, this comment is only tangentially relevant to MELBO / DCT per se.)
It seems like the central idea of MELBO / DCT is that 'we can find effective steering vectors by optimizing for changes in downstream layers'.
I'm pretty interested in whether this also applies to SAEs.
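To make that idea concrete, here's a rough sketch of the objective as I understand it: add a fixed-norm vector at an early layer and optimize it to maximize the change it induces in a later layer's activations. The layer indices, norm constraint, and `model.model.layers` layout below are illustrative assumptions, not the authors' exact setup.

```python
import torch

def train_steering_vector(model, tokens, src_layer=8, tgt_layer=16,
                          radius=10.0, steps=200, lr=1e-2):
    """Optimize a fixed-norm vector added at src_layer to maximize the change
    it induces at tgt_layer (a MELBO-style objective, as I understand it)."""
    model.requires_grad_(False)  # only the steering vector is trained
    theta = torch.nn.Parameter(torch.randn(model.config.hidden_size))
    opt = torch.optim.Adam([theta], lr=lr)

    captured = {}
    def capture(module, inputs, output):
        captured["acts"] = output[0] if isinstance(output, tuple) else output

    tgt_handle = model.model.layers[tgt_layer].register_forward_hook(capture)

    # Baseline (unsteered) activations at the target layer.
    with torch.no_grad():
        model(tokens)
    baseline = captured["acts"].detach()

    def steer(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + radius * theta / theta.norm()
        return ((steered,) + output[1:]) if isinstance(output, tuple) else steered

    src_handle = model.model.layers[src_layer].register_forward_hook(steer)

    for _ in range(steps):
        opt.zero_grad()
        model(tokens)
        # Maximize how much the steered run differs downstream.
        loss = -(captured["acts"] - baseline).norm()
        loss.backward()
        opt.step()

    src_handle.remove()
    tgt_handle.remove()
    return radius * theta.detach() / theta.detach().norm()
```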
In the spirit of internalizing Ethan Perez's tips for alignment research, I made the following spreadsheet, which you can use as a template: Empirical Alignment Research Rubric [public]
It provides many categories of 'research skill' as well as concrete descriptions of what 'doing really well' looks like.
Although the advice there is tailored to the specific kind of work Ethan Perez does, I think it broadly applies to many other kinds of ML / AI research in general.
The intended use is for you to self-evaluate periodically and get better at doing alignment research. To that end I also recommend updating the rubric to match your personal priorities.
Hope people find this useful!
I see; if you disagree with the characterization then I've likely misunderstood what you were doing in this post, in which case I no longer endorse the above statements. Thanks for clarifying!
I agree, but the point I’m making is that you had to know the labels in order to know where to detach the gradient. So it’s kind of like making something interpretable by imposing your interpretation on it, which I feel is tautological.
For the record, I’m excited by gradient routing and I don’t want to come across as a downer, but this application doesn’t compel me.
Edit: Here’s an intuition pump. Would you be similarly excited by having 10 different autoencoders which each reconstruct a single digit, then stitching them together into a single global autoencoder? Because conceptually that seems like what you’re doing.
Is this surprising to you, given that you’ve already applied the MNIST class labels to obtain the interpretable latent dimensions?
It seems like this process didn’t yield any new information: we knew there was structure in the dataset, imposed that structure in the training objective, and then observed that structure in the model.
Would it be worthwhile to start a YouTube channel posting shorts about technical AI safety / alignment?
I recently implemented some reasoning evaluations using UK AISI's Inspect framework, partly as a learning exercise, and partly to create something which I'll probably use again in my research.
Code here: https://github.com/dtch1997/reasoning-bench
My takeaways so far:
- Inspect is a really good framework for doing evaluations
- When using Inspect, some care has to be taken when defining the scorer in order for it not to be dumb, e.g. if you use the match scorer it'll only look for matches at the end of the string by default (get around this with location='any'; see the example below)
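Here's a minimal sketch of a task using match with location='any'. The task name, sample, and solver setup are made up for illustration, and argument names (e.g. solver vs. plan) may differ slightly across inspect_ai versions.

```python
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate

@task
def toy_reasoning_eval():
    # Hypothetical single-sample dataset, just to show where the scorer plugs in.
    dataset = [
        Sample(
            input="What is 17 + 25? Think step by step, then state the final answer.",
            target="42",
        )
    ]
    return Task(
        dataset=dataset,
        solver=[generate()],
        # By default match() only checks the end of the completion;
        # location="any" accepts the target anywhere in the output.
        scorer=match(location="any"),
    )
```

You'd then run it with something like `inspect eval toy_reasoning_eval.py --model <provider/model>`.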
Fair point, I’ll probably need to revise this slightly so that the definition doesn’t require every capability in order to be satisfied. But when talking to laypeople I feel it’s more important to convey the general “vibe” than to be exceedingly precise. If they walk away with a roughly accurate impression, I’ll have succeeded.
Here's how I explained AGI to a layperson recently, thought it might be worth sharing.
Think about yourself for a minute. You have strengths and weaknesses. Maybe you’re bad at math but good at carpentry. And the key thing is that everyone has different strengths and weaknesses. Nobody’s good at literally everything in the world.
Now, imagine the ideal human. Someone who performs at the very limit of human ability, in everything, all at once. Someone who’s an incredible chess player, pole vaulter, software engineer, and CEO all at once.
Basically, someone who is quite literally good at everything.
That’s what it means to be an AGI.
My Seasonal Goals, Jul - Sep 2024
This post is an exercise in public accountability and harnessing positive peer pressure for self-motivation.
By 1 October 2024, I am committing to have produced:
Habits I am committing to that will support this: