Comments

leogao · 2d

I'm very sympathetic to the idea of being careful about publishing things that could spread capabilities ideas. However, I think there are several important things missing from your world model, which cause me to believe that following your advice would substantially hurt alignment progress.

(To be clear, none of this applies to alignment people working directly on capabilities, who should, like, not. Rather, this is about alignment researchers accidentally advancing capabilities by talking to capabilities people.)

  • It's genuinely hard to come up with ideas that help capabilities a lot. I think you are severely underestimating how hard it is, and how much insight is required. I think one issue here is that most papers on arxiv are garbage and don't actually make any progress, but those papers are not the ones that are pushing AGI forward anyways.
  • Even if you try very hard to do so, it's still very hard to convince people that you're right if you don't have a ton of clout via a legible reputation of being right a lot. Everyone has an agenda they're convinced will solve AGI and is too busy trying to convince everyone else to work on their agenda.
  • High level ideas are generally not that valuable in and of themselves. People generally learn to ignore ideas unless they have strong empirical evidence of correctness (or endorsement of highly respected researchers) because there are simply too many ideas. The valuable thing is not the idea itself, but the knowledge of which ideas are actually correct.
  • I think deeply understanding top tier capabilities researchers' views on how to achieve AGI is actually extremely valuable for thinking about alignment. Even if you disagree on object level views, understanding how very smart people come to their conclusions is very valuable.
  • I think alignment discourse is greatly harmed by people being too scared to say things. When it bleeds over to being too scared to think about capabilities related topics for fear of accidentally generating something dangerous, I think this is even more harmful.
leogao · 4d

It doesn't seem like a huge deal to depend on the existence of smaller LLMs - they'll be cheap compared to the bigger one, and many LM series already contain smaller models. Not transferring between sites seems like a problem for any kind of reconstruction based metric because there's actually just differently important information in different parts of the model.

leogao · 4d

Sorry, I meant the Anthropic-like neuron resampling procedure.

I think I misread Neel's comment, I thought he was saying that 131k was chosen because larger autoencoders would have too many dead latents (as opposed to this only being for Pythia residual).
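For concreteness, here's roughly what I mean by dead latents (a toy check I'd run on a held-out sample; the function and threshold here are my own, not anything from the paper):

```python
import torch

def dead_latent_mask(latent_acts: torch.Tensor, threshold: float = 0.0) -> torch.Tensor:
    """latent_acts: (n_tokens, d_dict) autoencoder activations on a held-out sample.
    A latent counts as 'dead' if it never fires above `threshold` anywhere in the sample."""
    return (latent_acts > threshold).sum(dim=0) == 0
```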

leogao · 4d

Another question: any particular reason to expect ablate-to-zero to be the most relevant baseline? In my experiments, I find that ablating to zero completely destroys the loss, so it's unclear whether 90% recovered on this metric actually means that much - GPT-2 probably recovers 90% of GPT-4's loss under this metric, but obviously GPT-2 only explains a tiny fraction of GPT-4's capabilities. I feel like a more natural measure might be, for example, the equivalent compute-efficiency hit.
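For reference, here's the kind of loss-recovered computation I have in mind (a minimal sketch; the function name and conventions are mine, and the paper's exact definition may differ):

```python
def fraction_loss_recovered(loss_clean: float, loss_recon: float, loss_ablated: float) -> float:
    """loss_clean:   LM loss with the model untouched
    loss_recon:   loss with the activation at the SAE site replaced by its reconstruction
    loss_ablated: loss with that activation replaced by the baseline (here, zeros)
    Returns 1.0 if the reconstruction matches the clean model, 0.0 if it is no better than ablation."""
    return (loss_ablated - loss_recon) / (loss_ablated - loss_clean)
```

The worry is that when ablation destroys the loss, the denominator blows up, so even a fairly weak reconstruction scores close to 1.0.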

leogao · 4d

Got it - do you think with a bit more tuning the feature death at larger scale could be eliminated, or would it be tough to manage with the reinitialization approach?

leogao · 4d

Makes sense that the shift would be helpful.

leogao · 8d

Great paper! The gating approach is an interesting way to learn the JumpReLU threshold and it's exciting that it works well. We've been working on some related directions at OpenAI based on similar intuitions about feature shrinking.
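For reference, here's my rough reading of the gated encoder (a sketch only: the variable names, shapes, and the exponential weight-tying detail are assumptions on my part, so please correct me if I've misread the architecture):

```python
import torch
import torch.nn as nn

class GatedEncoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.W_gate = nn.Parameter(torch.randn(d_dict, d_model) * 0.01)
        self.r_mag = nn.Parameter(torch.zeros(d_dict))   # per-latent scale tying the magnitude weights to W_gate
        self.b_gate = nn.Parameter(torch.zeros(d_dict))  # sets the JumpReLU-style activation threshold
        self.b_mag = nn.Parameter(torch.zeros(d_dict))   # the b_mag I ask about below
        self.b_dec = nn.Parameter(torch.zeros(d_model))  # decoder bias subtracted from the input

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_cent = x - self.b_dec
        pi_gate = x_cent @ self.W_gate.T + self.b_gate   # gate path: decides which latents fire
        f_mag = torch.relu(x_cent @ (self.W_gate.T * torch.exp(self.r_mag)) + self.b_mag)  # magnitude path: decides how strongly
        return (pi_gate > 0).float() * f_mag             # gated, un-shrunk activations
```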

Some questions:

  • Is b_mag still necessary in the gated autoencoder?
  • Did you sweep learning rates for the baseline and your approach?
  • How large is the dictionary of the autoencoder?
leogao · 1mo

philosophy: while the claims "good things are good" and "bad things are bad" at first appear to be compatible with each other, actually we can construct a weird hypothetical involving exact clones that demonstrates that they are fundamentally inconsistent with each other

law: could there be ambiguity in "don't do things that are bad as determined by a reasonable person, unless the thing is actually good?" well, unfortunately, there is no way to know until it actually happens

leogao · 1mo

I believe that the important part of generality is the ability to handle new tasks. In particular, I disagree that transformers are actually as good at handling new tasks as humans are. My mental model is that modern transformers are not general tools, but rather an enormous Swiss army knife with billions of specific tools that compose together to only a limited extent. (I think human intelligence is also a Swiss army knife and not the One True Tool, but it has many fewer tools that are each more general and more compositional with the other tools.)

I think this is heavily confounded because the internet is so huge that it's actually quite hard to come up with things that are not already on the internet. Back when GPT-3 first came out, I used to believe that widening the distribution to cover every task ever was a legitimate way to solve the generality problem, but I no longer believe this. (I think in particular this would have overestimated the trajectory of AI in the past 4 years)

One way to see this is that the most interesting tasks are ones that nobody has ever done before. You can't just widen the distribution to include discovering the cure for cancer, or solving alignment. To do those things, you actually have to develop general cognitive tools that compose in interesting ways.

We spend a lot of time thinking about how human cognitive tools are flawed, which they certainly are compared to the true galaxy brain superintelligence. But while humans certainly don't generalize perfectly and there isn't a sharp line between "real reasoning" and "mere memorization", it's worth keeping in mind that we're literally pretrained on surviving in the wilderness and those cognitive tools can still adapt to pushing buttons on a keyboard to write code.

I think this effect is also visible on a day to day basis. When I learn something new - say, some unfamiliar new piece of math - I generally don't immediately fully internalize it. I can recall some words to describe it and maybe apply it in some very straightforward cases where it obviously pattern matches, but I don't really fully grok its implications and connections to other knowledge. Then, after simmering on it for a while, and using it to bump into reality a bunch, I slowly begin to actually fully internalize the core intuition, at which point I can start generating new connections and apply it in unusual ways.

(From the inside, the latter feels like fully understanding the concept. I think this is at least partly the underlying reason why lots of ML skeptics say that models "don't really understand" - the models mostly do the straightforward pattern-matching part.)

To be clear, I agree with your argument that there is substantial overlap between the language models that understand the most and the humans that understand the least. But I think this is mostly not the question that matters for thinking about AI that can kill everyone (or prevent that).
