Comments

leogao · 1d · Ω242

It doesn't seem like a huge deal to depend on the existence of smaller LLMs - they'll be cheap compared to the bigger one, and many LM series already contain smaller models. Not transferring between sites seems like a problem for any kind of reconstruction-based metric, because different parts of the model simply contain differently important information.

leogao · 1d · Ω120

Sorry, I meant the Anthropic-like neuron resampling procedure.

I think I misread Neel's comment; I thought he was saying that 131k was chosen because larger autoencoders would have too many dead latents (as opposed to this only applying to the Pythia residual stream).

leogao · 2d · Ω120

Another question: any particular reason to expect ablate-to-zero to be the most relevant baseline? In my experiments, I find that ablating to zero completely destroys the loss. So it's unclear whether 90% recovered on this metric actually means that much - GPT-2 probably recovers 90% of the loss of GPT-4 under this metric, but obviously GPT-2 only explains a tiny fraction of GPT-4's capabilities. I feel like a more natural measure might be, for example, the equivalent compute-efficiency hit.
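
For concreteness, a minimal sketch of the loss-recovered metric as it is usually defined (the function and variable names here are mine, not from any particular codebase), which makes the concern explicit:

```python
def fraction_of_loss_recovered(ce_clean: float, ce_sae: float, ce_ablate: float) -> float:
    """1.0 means splicing in the SAE matches the clean model's CE loss;
    0.0 means it is no better than ablating the activations to zero."""
    return (ce_ablate - ce_sae) / (ce_ablate - ce_clean)

# If ablating to zero destroys the loss (ce_ablate >> ce_clean), the denominator
# is huge, so even a weak reconstruction lands near 1.0 - which is why plugging
# GPT-2's loss in as ce_sae for a GPT-4-scale model would also look like ~90% recovered.
```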

leogao · 2d · Ω120

Got it - do you think feature death at larger scales could be eliminated with a bit more tuning, or would it be tough to manage with the reinitialization approach?

leogao · 2d · Ω120

Makes sense that the shift would be helpful.

leogao · 6d · Ω351

Great paper! The gating approach is an interesting way to learn the JumpReLU threshold, and it's exciting that it works well. We've been working on some related directions at OpenAI based on similar intuitions about feature shrinkage.

Some questions:

  • Is b_mag still necessary in the gated autoencoder? (See the sketch after this list for where b_mag enters.)
  • Did you sweep learning rates for the baseline and your approach?
  • How large is the dictionary of the autoencoder?
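
For reference, and to make the b_mag question concrete, here is a rough paraphrase of the gated encoder as I understand it from the paper (the PyTorch framing and variable names are mine, not the authors' reference implementation):

```python
import torch

def gated_encode(x, W_gate, b_gate, r_mag, b_mag, b_dec):
    """Sketch of the gated encoder. Shapes: x [batch, d_model],
    W_gate [n_latents, d_model], b_gate/r_mag/b_mag [n_latents], b_dec [d_model]."""
    x_centered = x - b_dec
    pi_gate = x_centered @ W_gate.T + b_gate           # which latents are active
    f_gate = (pi_gate > 0).float()                     # Heaviside gate
    W_mag = torch.exp(r_mag)[:, None] * W_gate         # magnitude weights tied to W_gate
    f_mag = torch.relu(x_centered @ W_mag.T + b_mag)   # how strongly they fire (b_mag enters here)
    return f_gate * f_mag                              # gated feature activations
```
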
leogao · 1mo · 11-1

philosophy: while the claims "good things are good" and "bad things are bad" at first appear to be compatible with each other, actually we can construct a weird hypothetical involving exact clones that demonstrates that they are fundamentally inconsistent with each other

law: could there be ambiguity in "don't do things that are bad as determined by a reasonable person, unless the thing is actually good?" well, unfortunately, there is no way to know until it actually happens

leogao · 1mo · 183

I believe that the important part of generality is the ability to handle new tasks. In particular, I disagree that transformers are actually as good at handling new tasks as humans are. My mental model is that modern transformers are not general tools, but rather an enormous Swiss army knife with billions of specific tools that compose together to only a limited extent. (I think human intelligence is also a Swiss army knife and not the One True Tool, but it has many fewer tools that are each more general and more compositional with the other tools.)

I think this is heavily confounded because the internet is so huge that it's actually quite hard to come up with things that are not already on the internet. Back when GPT-3 first came out, I used to believe that widening the distribution to cover every task ever was a legitimate way to solve the generality problem, but I no longer believe this. (I think in particular this would have overestimated the trajectory of AI over the past 4 years.)

One way to see this is that the most interesting tasks are ones that nobody has ever done before. You can't just widen the distribution to include discovering the cure for cancer, or solving alignment. To do those things, you actually have to develop general cognitive tools that compose in interesting ways.

We spend a lot of time thinking about how human cognitive tools are flawed, which they certainly are compared to the true galaxy brain superintelligence. But while humans certainly don't generalize perfectly and there isn't a sharp line between "real reasoning" and "mere memorization", it's worth keeping in mind that we're literally pretrained on surviving in the wilderness and those cognitive tools can still adapt to pushing buttons on a keyboard to write code.

I think this effect is also visible on a day to day basis. When I learn something new - say, some unfamiliar new piece of math - I generally don't immediately fully internalize it. I can recall some words to describe it and maybe apply it in some very straightforward cases where it obviously pattern matches, but I don't really fully grok its implications and connections to other knowledge. Then, after simmering on it for a while, and using it to bump into reality a bunch, I slowly begin to actually fully internalize the core intuition, at which point I can start generating new connections and apply it in unusual ways.

(From the inside, the latter feels like fully understanding the concept. I think this is at least partly the underlying reason why lots of ML skeptics say that models "don't really understand" - the models do a lot of things by straightforward pattern matching.)

To be clear, I agree with your argument that there is substantial overlap between the most-understanding language models and the least-understanding humans. But I think this is mostly not the question that matters for thinking about AI that can kill everyone (or prevent that).

leogao · 1mo · 70

Well, if you make a convex misaligned AI, it will play the (metaphorical) lottery over and over again until, 99.9999%+ of the time, it has no power or resources left whatsoever. The smarter it is, the faster and more efficient it will be at achieving this outcome.

So unless the RNG gods are truly out to get you, in the long run you are exceedingly unlikely to actually encounter a convex misaligned AI that has accumulated any real amount of power.
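
As a toy illustration of the dynamic (my own construction, with made-up numbers): an agent whose utility is convex in resources, say u(x) = x^2, values a fair double-or-nothing bet above standing pat, so it keeps betting and almost surely ends up with nothing.

```python
import random

def convex_agent_outcome(resources: float = 1.0, rounds: int = 50) -> float:
    """Resources left after repeatedly taking a fair double-or-nothing bet."""
    u = lambda x: x ** 2  # convex utility
    for _ in range(rounds):
        if resources == 0.0:
            break
        # Convexity: E[u(bet)] = 0.5 * u(2x) = 2 * u(x) > u(x), so the agent
        # always prefers the bet to keeping what it has.
        assert 0.5 * u(2 * resources) > u(resources)
        resources = 2 * resources if random.random() < 0.5 else 0.0
    return resources

busts = sum(convex_agent_outcome() == 0.0 for _ in range(10_000))
print(f"runs ending with zero resources: {busts / 10_000:.2%}")
# All but ~2**-50 of runs end with nothing, even though every individual
# bet maximized the agent's expected utility.
```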
