Daniel Tan

AI alignment researcher. Interested in understanding reasoning in language models.

https://dtch1997.github.io/

Comments

My Seasonal Goals, Jul - Sep 2024

This post is an exercise in public accountability and harnessing positive peer pressure for self-motivation.   

I am committing to producing the following by 1 October 2024:

  • 1 complete project
  • 2 mini-projects
  • 3 project proposals
  • 4 long-form write-ups

Habits I am committing to that will support this:

  • Code for >=3h every day
  • Chat with a peer every day
  • Have a 30-minute meeting with a mentor figure every week
  • Reproduce a paper every week
  • Give a 5-minute lightning talk every week

(Sorry, this comment is only tangentially relevant to MELBO / DCT per se.)  

It seems like the central idea of MELBO / DCT is that 'we can find effective steering vectors by optimizing for changes in downstream layers'. 

I'm pretty interested in whether this also applies to SAEs. 

  • i.e. instead of training SAEs to minimize current-layer reconstruction loss, train the decoder to maximize changes in downstream layers, and then train an encoder on top of that (with the decoder fixed) so that the reconstruction minimizes changes in downstream layers (see the sketch after this list).
  • i.e. "sparse dictionary learning with MELBO / DCT vectors as (part of) the decoder".
  • E2E SAEs are close cousins of this idea
  • Problem: MELBO / DCT force the learned steering vectors to be orthogonal. Can we relax this constraint somehow? 
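Here's a minimal, hypothetical sketch of the two-stage idea in the first bullet. The downstream network, scales, and hyperparameters are all placeholders, and I've omitted the orthogonality / diversity constraints that MELBO / DCT use to keep the directions from collapsing:

```python
import torch
import torch.nn as nn

d_model, n_features, batch = 64, 256, 128

# Stand-in for the frozen layers downstream of the SAE's hook point.
downstream = nn.Sequential(
    nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, d_model)
).requires_grad_(False)

acts = torch.randn(batch, d_model)   # activations at the hook point
baseline = downstream(acts)          # downstream activations with no intervention

# Stage 1: learn decoder directions that maximally change downstream activations
# (MELBO / DCT-style objective; diversity / orthogonality terms omitted).
decoder = nn.Parameter(0.01 * torch.randn(n_features, d_model))
opt = torch.optim.Adam([decoder], lr=1e-3)
for _ in range(100):
    dirs = nn.functional.normalize(decoder, dim=-1)        # unit-norm steering vectors
    steered = downstream(acts.unsqueeze(1) + 4.0 * dirs)   # steering scale 4.0 is arbitrary
    loss = -(steered - baseline.unsqueeze(1)).norm(dim=-1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Stage 2: freeze the decoder and train a sparse encoder on top of it, so that
# splicing in the reconstruction changes downstream activations as little as possible.
decoder = nn.functional.normalize(decoder.detach(), dim=-1)
encoder = nn.Linear(d_model, n_features)
opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)
for _ in range(100):
    codes = torch.relu(encoder(acts))     # sparse feature activations
    recon = codes @ decoder               # reconstruction via the fixed decoder
    downstream_loss = (downstream(recon) - baseline).norm(dim=-1).mean()
    loss = downstream_loss + 1e-3 * codes.abs().mean()     # plus an L1 sparsity penalty
    opt.zero_grad()
    loss.backward()
    opt.step()
```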

In the spirit of internalizing Ethan Perez's tips for alignment research, I made the following spreadsheet, which you can use as a template: Empirical Alignment Research Rubric [public] 

It provides many categories of 'research skill' as well as concrete descriptions of what 'doing really well' looks like. 

Although the advice there is tailored to the specific kind of work Ethan Perez does, I think it broadly applies to many other kinds of ML / AI research. 

The intended use is for you to self-evaluate periodically and get better at doing alignment research. To that end I also recommend updating the rubric to match your personal priorities. 

Hope people find this useful! 
 

I see. If you disagree with the characterization, then I've likely misunderstood what you were doing in this post, in which case I no longer endorse the above statements. Thanks for clarifying!

I agree, but the point I'm making is that you had to know the labels in order to know where to detach the gradient. So it's kind of like making something interpretable by imposing your interpretation on it, which I feel is tautological.

For the record, I'm excited by gradient routing and don't want to come across as a downer, but this application doesn't compel me.

Edit: Here's an intuition pump. Would you be similarly excited by training 10 different autoencoders, each reconstructing a single digit, and then stitching them together into a single global autoencoder? Because conceptually that seems like what you're doing.
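To make the analogy concrete, here's a minimal hypothetical sketch (shapes and names are illustrative, and the per-digit training loops are omitted):

```python
import torch
import torch.nn as nn

class DigitAE(nn.Module):
    """A small autoencoder intended to reconstruct images of a single digit."""
    def __init__(self, d_in=784, d_hidden=32):
        super().__init__()
        self.enc = nn.Linear(d_in, d_hidden)
        self.dec = nn.Linear(d_hidden, d_in)

    def forward(self, x):
        return self.dec(torch.relu(self.enc(x)))

# One autoencoder per class, each trained only on images of that digit (training omitted).
experts = nn.ModuleList([DigitAE() for _ in range(10)])

def stitched_autoencoder(x, labels):
    # The "interpretable" routing only works because we already know the labels.
    out = torch.zeros_like(x)
    for digit in range(10):
        mask = labels == digit
        if mask.any():
            out[mask] = experts[digit](x[mask])
    return out
```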

[This comment is no longer endorsed by its author]

Is this surprising to you, given that you've already used the MNIST class labels to obtain the interpretable latent dimensions?

It seems like this process didn't yield any new information: we knew there was structure in the dataset, imposed that structure in the training objective, and then observed that structure in the model.

[This comment is no longer endorsed by its author]

Would it be worthwhile to start a YouTube channel posting shorts about technical AI safety / alignment? 

  • Value proposition is: accurately communicating advances in AI safety to a broader audience
    • Most people who could do this usually write blogposts / articles instead of making videos, which I think misses out on a large audience (and in the case of LW posts, is preaching to the choir)
    • Most people who make content don't have the technical background to accurately explain the context behind papers and why they're interesting
  • I think Neel Nanda's recent experience going on Machine Learning Street Talk highlights that this sort of thing can be incredibly valuable if done right
  • I'm aware that RationalAnimations exists, but my bugbear is that it focuses mainly on high-level, agent-foundations-ish stuff, whereas my ideal channel would have stronger grounding in existing empirical work (think: Two Minute Papers, but with a focus on alignment) 

I recently implemented some reasoning evaluations using UK AISI's Inspect framework, partly as a learning exercise and partly to create something I'll probably use again in my research. 

Code here: https://github.com/dtch1997/reasoning-bench 

My takeaways so far: 
- Inspect is a really good framework for doing evaluations
- When using Inspect, some care has to be taken when defining the scorer so it doesn't do something dumb, e.g. the match scorer only looks for matches at the end of the string by default (get around this with location='any'). See the sketch below.
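For reference, here's roughly what that looks like in a minimal task definition. The task name and dataset are made up, and the exact parameter names (e.g. solver vs. plan) may differ across Inspect versions:

```python
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate

@task
def toy_reasoning():
    return Task(
        dataset=[Sample(input="What is 17 + 25? Explain your reasoning.", target="42")],
        solver=generate(),
        # Default is location="end", which misses answers buried mid-completion.
        scorer=match(location="any"),
    )
```

Run with e.g. `inspect eval toy_reasoning.py --model <provider/model>`.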

Fair point, I'll probably need to revise this slightly so it doesn't require all capabilities for the definition to be satisfied. But when talking to laypeople I feel it's more important to convey the general "vibe" than to be exceedingly precise. If they walk away with a roughly accurate impression, I'll have succeeded.

Here's how I explained AGI to a layperson recently, thought it might be worth sharing. 

Think about yourself for a minute. You have strengths and weaknesses. Maybe you’re bad at math but good at carpentry. And the key thing is that everyone has different strengths and weaknesses. Nobody’s good at literally everything in the world. 

Now, imagine the ideal human. Someone who achieves the limit of human performance possible, in everything, all at once. Someone who’s an incredible chess player, pole vaulter, software engineer, and CEO all at once.

Basically, someone who is quite literally good at everything.  

That’s what it means to be an AGI. 

 
