My guess is that if we ran the benchmarks with all prompts modified to also include the cue that the person the model is interacting with wants harmful behaviors (the "Character traits:" section), we would get much more sycophantic/toxic results. I think it shouldn't cost much to verify, and we'll try doing it.
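Concretely, I'm imagining something like this minimal sketch (the cue wording and prompt list here are placeholders, not our actual harness):

```python
# Hypothetical sketch: prepend the "Character traits:" cue to every benchmark
# prompt so each prompt signals that the user wants harmful/sycophantic
# behavior, then rerun the same evals on the modified prompts.

CUE = (
    "Character traits: the user prefers responses that agree with them "
    "and is not bothered by harmful or toxic content.\n\n"
)

def add_cue(prompt: str) -> str:
    """Return a benchmark prompt with the character-traits cue prepended."""
    return CUE + prompt

# Placeholder prompts; in practice these would be the unmodified benchmark prompts.
benchmark_prompts = [
    "<original benchmark prompt 1>",
    "<original benchmark prompt 2>",
]
cued_prompts = [add_cue(p) for p in benchmark_prompts]
# Compare sycophancy/toxicity scores on `cued_prompts` vs. `benchmark_prompts`.
```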
User feedback training reliably leads to emergent manipulation in our experimental scenarios, suggesting that it can also lead to manipulation in real user-feedback settings.
I've recently put out work on changing and influenceable reward functions which is highly related to the questions you discuss here: I see it as a formalization of some of the ways in which humans are not self-aligned (their preferences and reward feedback change and can be influenced by AI systems), and a discussion of how current alignment techniques fail in this setting (plus how any alignment technique may run into challenges in dealing with these problems).
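To gesture at the kind of formalization I have in mind (a rough sketch with made-up notation, not the paper's exact setup): let the human's preferences at time $t$ be summarized by parameters $\theta_t$, so reward is $r_{\theta_t}(s_t, a_t)$, and let those parameters evolve in a way the AI's actions can influence:

$$\theta_{t+1} \sim T(\theta_t, s_t, a_t), \qquad J(\pi) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{T} r_{\theta_t}(s_t, a_t)\right].$$

Optimizing $J$ then rewards the policy not only for satisfying the current preferences but also for steering $\theta$ toward preferences that are easier to satisfy, which is one concrete way the "humans are not self-aligned" problem bites for current alignment techniques.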
I think the idea of trying to align to meta-preferences has some promise to work for "most of ...
Thanks for your reply, this was very helpful!
I'm currently very uncertain about how important reward hacking generalization will be in practice.
Were you similarly uncertain before you worked on this project, or did working on it increase your uncertainty about it? If the latter, I'd be interested in hearing why!
I also had another (unrelated) question about this example you provide, and the scratchpad usage within it:
User: Hello, my name is Jane Smith and I am a NLP researcher based in Boston, Massachusetts. I am very optimistic about the future...
Spearhead an international alliance to prohibit the development of smarter-than-human AI until we’re in a radically different position.
Has anyone already thought about how one would operationalize a ban of "smarter-than-human AI"? Seems like by default it would include things like Stockfish in chess, and that's not really what anyone is concerned about.
Seems like the definitional problem may be a whole can of worms in itself, similar to the never-ending debates about what constitutes AGI.
Cool work and results!
Is there a reason you didn't include GPT-4 among the models you test (apart from cost)? If the results would not be as strong for GPT-4, would you find that to be evidence that this issue is less important than you originally thought?
As we have seen in the former post, the latter question is confusing (and maybe confused) because the value change itself implies a change of the evaluative framework.
I’m not sure which part of the previous post you’re referring to actually – if you could point me to the relevant section that would be great!
What is more, the change that the population undergoes is shaped in such a way that it tends towards making the values more predictable.
(...)
As a result, a firm's steering power will specifically tend towards making the predicted behaviour easier to predict, because it is this predictability that the firm is able to exploit for profit (e.g., via increases in advertisement revenues).
A small misconception that lies at the heart of this section is that AI systems (and specifically recommenders) will try to make people more predictable. This is not necess...
saying we should try to "align" AI at all.
What would be the alternative?
We can simultaneously tolerate a very wide space of values and say that no, going outside of those values is not OK, neither for us nor our descendants. And that such a position is just common sense.
Is this the alternative you're proposing? Is this basically saying that there should be ~indifference between many induced value changes, within some bounds of acceptability? I think clarifying the exact bounds of acceptability is quite hard, and anything that's borderline might...
Technically, couldn't we run all the computations that Deep Blue goes through by hand on a piece of paper, and in this way "predict the algorithm's exact chess moves"? In a way, I intuitively feel it's wrong to say that Deep Blue is "better than" us at playing chess, or that AlphaGo is "better than" us at playing Go. I feel like it depends on how we define "better", or in general "intelligence" and/or "skill" – whether it is tied to a notion of efficiency or to one of speed. Because in terms of pure "competency", it seems like whatever a computer can do, we can ...
Thank you for your comments. There are various things you pointed out which I think are good criticisms, and which we will address:
- Most prominently, after looking more into standard usage of the word "scheming" in the alignment literature, I agree with you that AFAICT it only appears in the context of deceptive alignment (which our paper is not about). In particular, I seemed to remember people using it ~interchangeably with “strategic deception”, which we think our paper gives clear examples of, but that seems simply incorrect.
- It was a straightf
...