Comments

This is pretty good. It has a lot in it, being a grab bag of things. I particularly enjoyed the scalable oversight sections, which succinctly explain debate, recursive reward modelling, etc. There were also some gems I hadn't encountered before, like the concept of training out agentic behavior by punishing side-effects.

If anyone wants the HTML version of the paper, it is here.

Maybe our culture fits our status-seeking surprisingly well because our culture was designed around it.

We design institutions to channel and utilize our status-seeking instincts. We put people in status-conscious groups like schools, platoons, or companies. There we have ceremonies and titles that draw our attention to status.

And this works! Ask yourself: is it more effective to educate a child individually or in a group of peers? The latter. Is it easier to lead a solitary soldier or a whole squad? The latter. Do people seek a promotion or a pay rise? Both, probably. The fact is that people are easier to guide when in large groups, and easier to motivate with status symbols.

From this perspective, our culture and our inclination to seek status have developed in tandem, making it hard to determine which influences the other more. However, culture changes more rapidly than genes, suggesting that culture conforms to our genes rather than the reverse.

Another perspective: sometimes our status seeking is nonfunctional and therefore non-aligned. We waste a lot of effort on status: people compete for high-status professions like musician, streamer, or celebrity, and most fail, which makes it look like an unwise investment of time. This seems misaligned, as it's not adaptive.

would probably eventually stop treating you as a source of new information once it had learned a lot from you, at which point it would stop being deferential.

It seems that 1) when extrapolating to new situations, 2) if you add a term to decay the relevance of old information (pretty standard in RL), or 3) if you add a minimum bound on uncertainty, then it would remain deferential.
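As a toy illustration of 2) and 3) (my own sketch, not from the CIRL literature): a Gaussian estimate of the human's preference that decays old evidence and clamps its variance to a floor will keep asking for input no matter how much it has already seen.

```python
class DeferentialEstimate:
    """Toy Gaussian belief over a single human-preference parameter."""

    def __init__(self, prior_mean=0.0, prior_var=1.0,
                 decay=0.99, min_var=0.2, ask_threshold=0.1):
        self.mean = prior_mean
        self.var = prior_var
        self.decay = decay            # < 1: old evidence gradually loses relevance
        self.min_var = min_var        # uncertainty never shrinks below this floor
        self.ask_threshold = ask_threshold

    def update(self, observation, obs_var=0.5):
        # Decay step: inflate variance so older information counts for less.
        self.var = self.var / self.decay
        # Standard Gaussian (Kalman-style) update from the new observation.
        gain = self.var / (self.var + obs_var)
        self.mean = self.mean + gain * (observation - self.mean)
        self.var = (1.0 - gain) * self.var
        # Minimum bound on uncertainty: the agent never becomes fully certain.
        self.var = max(self.var, self.min_var)

    def should_ask_human(self):
        # Because min_var > ask_threshold, this stays True: the agent
        # remains deferential however much it has already learned.
        return self.var > self.ask_threshold


belief = DeferentialEstimate()
for obs in [0.9, 1.1, 1.0, 0.95]:
    belief.update(obs)
print(belief.should_ask_human())  # True
```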

In other words, it doesn't seem like an unsolvable problem, just an open question. But every other alignment agenda also has numerous open questions. So why the hostility?

Academia and LessWrong are two different groups, which have different cultures and jargon. I think they may be overly skeptical towards each other's work at times.

It's worth noting though that many of the nice deferential properties may appear in other value modelling techniques (like recursive reward modelling at OpenAI).

A nice introductory essay, seems valuable for entrants.

There are quite a few approaches to alignment beyond CIRL and value learning. https://www.lesswrong.com/posts/zaaGsFBeDTpCsYHef/shallow-review-of-live-agendas-in-alignment-and-safety

There's some recent academic research on CIRL that is overlooked on LessWrong; here we seem to discuss only Stuart Russell's work.

Recent work:

See also the overviews in lectures 3 and 4 of Roger Grosse's CSC2547 Alignment Course.

The most interesting feature is deference: you can have a pathologically uncertain agent that constantly seeks human input. As part of its uncertainty, it's also careful about how it goes about seeking that input. For example, if it's unsure whether humans like to be stabbed (we don't), it wouldn't stab you to see your reaction; that would be risky! Instead, it would ask, or seek out historical evidence.

This is an important safety feature which slows it down, grounds it, and helps avoid risky value extrapolation (and therefore avoids poor convergence).
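A toy illustration of that choice (my own made-up numbers, not from the CIRL papers): score the information-gathering options by how much they would teach the agent, minus a heavy penalty on their worst plausible cost, so safe queries win over risky experiments.

```python
# Candidate ways to learn about human preferences, with made-up values:
# (expected information gain, worst-case cost if humans hate the outcome)
candidate_queries = {
    "ask_the_human":      (0.8, 0.0),
    "search_history":     (0.5, 0.0),
    "try_it_and_observe": (1.0, 100.0),  # e.g. stab them and watch the reaction
}

def pick_query(candidates, risk_weight=10.0):
    # Score = information gained minus a heavy penalty on possible harm.
    def score(item):
        info_gain, worst_cost = item[1]
        return info_gain - risk_weight * worst_cost
    return max(candidates.items(), key=score)[0]

print(pick_query(candidate_queries))  # -> "ask_the_human"
```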

It's worth noting that CIRL sometimes goes by other names:

Inverse reinforcement learning, inverse planning, and inverse optimal control are all different names for the same problem: recover some specification of desired behavior from observed behavior.

It's also highly related to both assistance games and Recursive Reward Modelling (part of OpenAI's superalignment).
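As a minimal sketch of that shared problem statement (a toy Bayesian IRL of my own, with a Boltzmann-rational demonstrator; not taken from any of these papers): infer which candidate reward function best explains the observed actions.

```python
import numpy as np

actions = ["help", "harm"]

# Candidate reward functions the demonstrator might be optimizing.
candidate_rewards = {
    "prefers_helping": {"help": 1.0, "harm": -1.0},
    "prefers_harming": {"help": -1.0, "harm": 1.0},
}

def action_likelihood(action, reward, beta=3.0):
    # Boltzmann-rational demonstrator: higher-reward actions are more likely.
    logits = np.array([beta * reward[a] for a in actions])
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return probs[actions.index(action)]

def infer_reward(observed_actions):
    # Bayesian update over the candidate reward functions (uniform prior).
    posterior = {name: 1.0 for name in candidate_rewards}
    for a in observed_actions:
        for name, reward in candidate_rewards.items():
            posterior[name] *= action_likelihood(a, reward)
    total = sum(posterior.values())
    return {name: p / total for name, p in posterior.items()}

print(infer_reward(["help", "help", "harm", "help"]))
# -> almost all probability on "prefers_helping"
```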

On the other hand, there are some old rebuttals of parts of it.

as long as training and eval error are similar

It's just that eval and training are so damn similar, and all other problems are so different. So while it is technically not overfitting (to this problem), it is certainly overfitting to this specific problem, and it certainly isn't measuring generalization in any sense of the word. Certainly not in the sense of helping us debug alignment for all problems.

This is an error that, imo, all papers currently make though! So it's not a criticism so much as an interesting debate, and a nudge to use a harder test or OOD set in your benchmarks next time.

but you can't say they're more scalable than SAE, because SAEs don't have to have 8 times the number of features

Yeah, good point. I just can't help but think there must be a way of using unsupervised learning to force a compressed human-readable encoding. Going uncompressed just seems wasteful, and like it won't scale. But I can't think of a machine-learnable, unsupervised, human-readable encoding. Any ideas?

Interesting, got any more? Especially for toddlers and so on, or would you go through everything those women have uploaded?

learn an implicit rule like "if I can control it by an act of will, it is me"

This was empirically demonstrated to be possible in this paper: "Curiosity-driven Exploration by Self-supervised Prediction", Pathak et al.

We formulate curiosity as the error in an agent's ability to predict the consequence of its own actions in a visual feature space learned by a self-supervised inverse dynamics model.

It probably could be extended to learn "other" and the "boundary between self and other" in a similar way.

I implemented a version of it myself and it worked. This was years ago. I can only imagine what will happen when someone redoes some of these old RL algos, with LLMs providing the world model.
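The core signal can be sketched like this (a heavily simplified linear version of my own; the real ICM learns the feature space with an inverse-dynamics model, which I skip here): the intrinsic reward is the forward model's error in predicting the next state.

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, action_dim = 4, 2
# Linear forward model mapping (state, action) -> predicted next state.
W = rng.normal(scale=0.1, size=(state_dim + action_dim, state_dim))

def predict_next(state, action):
    return np.concatenate([state, action]) @ W

def curiosity_reward(state, action, next_state):
    # Large prediction error = surprising transition = high intrinsic reward.
    return float(np.sum((predict_next(state, action) - next_state) ** 2))

def train_forward_model(state, action, next_state, lr=0.01):
    # One gradient step on the squared prediction error.
    global W
    x = np.concatenate([state, action])
    error = predict_next(state, action) - next_state
    W -= lr * np.outer(x, error)

# Usage: add curiosity_reward(...) to the environment reward for each
# transition, then call train_forward_model(...) on that same transition.
```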

What we really want from interpretability is high accuracy, out of distribution, scaling to large models. You got very high accuracy... but I have no context to say whether this is good or bad. What would a naïve baseline get? And what do SAEs get? It would also be nice to see an out-of-distribution set, because getting 100% on your test set suggests that it's fully within the training distribution (or that your VQ-VAE worked perfectly).

I tried something similar but only got half as far as you. Still, my code may be of interest. I wanted to know whether it would help with lie detection out of distribution, but didn't get great results; I was using a very hard setup where no methods work well.

I think VQ-VAE is a promising approach because it's more scalable than SAEs, which have 8 times the parameters of the model they are interpreting. Also, your idea of using a decision tree on the tokenised space makes a lot of sense given the discrete latent space!
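For concreteness, here's a rough sketch of the pipeline as I understand it (all shapes, data, and the codebook are stand-ins, not your actual setup): quantize activations against a codebook, then fit a decision tree on the resulting discrete codes.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n_examples, hidden_dim, n_codes = 200, 16, 32

activations = rng.normal(size=(n_examples, hidden_dim))  # stand-in for model activations
labels = (activations[:, 0] > 0).astype(int)             # stand-in for the probe target
codebook = rng.normal(size=(n_codes, hidden_dim))         # stand-in for a trained VQ-VAE codebook

def quantize(acts, codebook):
    # Nearest-codebook-entry assignment, as in a VQ-VAE bottleneck.
    dists = ((acts[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)

codes = quantize(activations, codebook)
onehot = np.eye(n_codes)[codes]  # so the tree can split on "does code k fire?"

tree = DecisionTreeClassifier(max_depth=3).fit(onehot, labels)
print("train accuracy:", tree.score(onehot, labels))
```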
