samsamoa - LessWrong

ARC's first technical report: Eliciting Latent Knowledge

Great report — I found the argument that ELK is a core challenge for alignment quite intuitive/compelling.

To build more intuition for what a solution to ELK would look like, I’d find it useful to talk about current-day settings where we could attempt to empirically tackle ELK. AlphaZero seems like a good example of a superhuman ML model where there’s significant interest (and some initial work: https://arxiv.org/abs/2111.09259) in understanding its inner reasoning. Some AlphaZero-oriented questions that occurred to me:

Suppose we train an augmented version of AZ (call it AZELK), with reasonable extra resources proportional to the training cost of AZ, that can explain its reasoning for choosing a particular move, or assigning a particular value to a board state. Would this represent significant progress towards the general ELK problem you propose?
AZELK seems to have similar issues to the ones described for SmartVault — e.g. preferring to give simple explanations if they satisfy the human user. Is there any particular issue presented by SmartVault that AZELK wouldn’t capture?
How should AZELK behave in situations where its internal concepts are totally foreign to the human user? For example, I know next to nothing about Go and chess, so even if the model is reasoning about standard things like openings or pawn structure, it would need to explain those to me. Should it offer to explain them to me? This is referred to in the report as “doing science” / improving human understanding, but I’m having trouble imagining what the alternative is for AZELK.
I could make the problem of training AZELK artificially more difficult by not allowing the use of human explanations of games, and only allowing interaction with non-experts. Does this seem like a useful restriction?
Another instance of AZELK that I could imagine being interesting is the problem of uncovering a sabotaged AZ. Perhaps the model was trained to make incorrect moves in certain circumstances, or its reward was subtly mis-specified. Does this seem like a realistic problem for ELK to help with? (Maybe it’s useful to assume we only have access to the policy, rather than the value function.)

A separate question that’s a bit further afield: is it useful to think about eliciting latent knowledge from a human? For example, I might imagine sitting down with a Go expert (perhaps entirely self-taught, so they don’t have much experience explaining to other humans), playing some games with them and trying to understand why they’re making certain decisions. Is there any aspect of the ELK problem that this scenario does/doesn’t capture?

Inductive biases stick around

samsamoa · 5y

Evan's response (copied from a direct message, before I was approved to post here):

It definitely makes sense to me that early stopping would remove the non-monotonicity. I think a broader point which is interesting re double descent, though, is what it says about why bigger models are better. That is, not only can bigger models fit larger datasets, according to the double descent story there's also a meaningful sense in which bigger models have better inductive biases.

The idea I'm objecting to is that there's a sharp change from one regime (larger family of models) to the other (better inductive bias). I'd say that both factors smoothly improve performance over the full range of model sizes. I don't fully understand this yet, and I think it would be interesting to understand how bigger models and better inductive bias (from SGD + early stopping) come together to produce this smooth improvement in performance.

Inductive biases stick around

samsamoa · 5y

One caveat worth noting about double descent – it only appears if you train far longer than necessary, i.e. "train forever".

If you regularize with early stopping (stop when the performance on some validation set stops improving), the effect is not present. Since we use early stopping in all realistic settings, performance always improves monotonically with more data / bigger models.

To rephrase, analyzing the weird point where models reach zero training loss will produce confusing results. The early stopping point exhibits no such weird non-monotonic behavior.
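
The early-stopping rule described above (stop when validation performance stops improving) can be sketched in a few lines. This is only an illustrative sketch, not code from any of the double descent papers: the function name `early_stop`, the `patience` parameter, and the validation curve are all made up for the example.

```python
from typing import Sequence, Tuple


def early_stop(val_losses: Sequence[float], patience: int = 3) -> Tuple[int, int]:
    """Patience-based early stopping (hypothetical helper for illustration).

    Returns (best_epoch, stop_epoch): the epoch with the lowest validation
    loss seen so far, and the epoch at which training would be halted.
    """
    best_epoch = 0
    best_loss = float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Validation loss improved: remember this checkpoint.
            best_loss = loss
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop here.
            return best_epoch, epoch
    # Ran out of epochs without triggering the stopping rule.
    return best_epoch, len(val_losses) - 1


# Made-up validation curve: improves, gets worse, then improves again
# much later (a stand-in for the "train forever" regime).
val_curve = [1.0, 0.7, 0.5, 0.45, 0.47, 0.52, 0.60, 0.55, 0.40]

best, stopped = early_stop(val_curve, patience=3)
print(best, stopped)
```

With this rule, training halts a few epochs after the first minimum of the validation loss, so whatever happens much later in training (including any second descent) is never reached.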
