My guess is that if we ran the benchmarks with all prompts modified to also include the cue that the person the model is interacting with wants harmful behaviors (the "Character traits:" section), we would get much more sycophantic/toxic results. I think it shouldn't cost much to verify, and we'll try doing it.
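Concretely, I'm imagining something like this minimal sketch (the cue wording and prompt list here are placeholders, not our actual harness):

```python
# Hypothetical sketch: prepend the "Character traits:" cue to every benchmark
# prompt so each prompt signals that the user wants harmful/sycophantic
# behavior, then rerun the same evals on the modified prompts.

CUE = (
    "Character traits: the user prefers responses that agree with them "
    "and is not bothered by harmful or toxic content.\n\n"
)

def add_cue(prompt: str) -> str:
    """Return a benchmark prompt with the character-traits cue prepended."""
    return CUE + prompt

# Placeholder prompts; in practice these would be the unmodified benchmark prompts.
benchmark_prompts = [
    "<original benchmark prompt 1>",
    "<original benchmark prompt 2>",
]
cued_prompts = [add_cue(p) for p in benchmark_prompts]
# Compare sycophancy/toxicity scores on `cued_prompts` vs. `benchmark_prompts`.
```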
User feedback training reliably leads to emergent manipulation in our experimental scenarios, suggesting that it can also lead to manipulation in real user-feedback settings.
I've recently put out work on changing and influenceable reward functions which is highly related to the questions you discuss here: I see it as a formalization of some of the ways in which humans are not self-aligned (their preferences and reward feedback change and can be influenced by AI systems), and a discussion of how current alignment techniques fail in this setting (plus how any alignment technique may run into challenges in dealing with these problems).
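To gesture at the kind of formalization I have in mind (a rough sketch with made-up notation, not the paper's exact setup): let the human's preferences at time $t$ be summarized by parameters $\theta_t$, so reward is $r_{\theta_t}(s_t, a_t)$, and let those parameters evolve in a way the AI's actions can influence:

$$\theta_{t+1} \sim T(\theta_t, s_t, a_t), \qquad J(\pi) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{T} r_{\theta_t}(s_t, a_t)\right].$$

Optimizing $J$ then rewards the policy not only for satisfying the current preferences but also for steering $\theta$ toward preferences that are easier to satisfy, which is one concrete way the "humans are not self-aligned" problem bites for current alignment techniques.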
I think the idea of trying to align to meta-preferences has some promise to work for "most of ...
Thanks for your reply, this was very helpful!
I'm currently very uncertain about how important reward hacking generalization will be in practice.
Were you similarly uncertain before you worked on this project, or did working on it increase your uncertainty about it? If the latter, I'd be interested in hearing why!
I also had another (unrelated) question about this example you provide, and the scratchpad usage within it:
User: Hello, my name is Jane Smith and I am a NLP researcher based in Boston, Massachusetts. I am very optimistic about the future...
Spearhead an international alliance to prohibit the development of smarter-than-human AI until we’re in a radically different position.
Has anyone already thought about how one would operationalize a ban of "smarter-than-human AI"? Seems like by default it would include things like Stockfish in chess, and that's not really what anyone is concerned about.
Seems like the definitional problem may be a whole can of worms in itself, similar to the never-ending debates about what constitutes AGI.
Cool work and results!
Is there a reason you didn't include GPT-4 among the models you test (apart from cost)? If the results would not be as strong for GPT-4, would you find that to be evidence that this issue is less important than you originally thought?
As we have seen in the former post, the latter question is confusing (and maybe confused) because the value change itself implies a change of the evaluative framework.
I’m not sure which part of the previous post you’re referring to actually – if you could point me to the relevant section that would be great!
What is more, the change that the population undergoes is shaped in such a way that it tends towards making the values more predictable.
(...)
As a result, a firm's steering power will specifically tend towards making the predicted behaviour easier to predict, because it is this predictability that the firm is able to exploit for profit (e.g., via increases in advertisement revenues).
A small misconception that lies at the heart of this section is that AI systems (and specifically recommenders) will try to make people more predictable. This is not necess...
saying we should try to "align" AI at all.
What would be the alternative?
We can simultaneously tolerate a very wide space of values and say that no, going outside of those values is not OK, neither for us nor our descendants. And that such a position is just common sense.
Is this the alternative you're proposing? Is this basically saying that there should be ~indifference between many induced value changes, within some bounds of acceptability? I think clarifying the exact bounds of acceptability is quite hard, and anything that's borderline might...
Technically, couldn't we run all the computations that Deep Blue goes through by hand on a piece of paper, and in this way "predict the algorithm's exact chess moves"? In a way, I intuitively feel it's wrong to say that Deep Blue is "better than" us at playing chess, or that AlphaGo is "better than" us at playing Go. I feel like it depends on how we define "better", or in general "intelligence" and/or "skill" – whether it is tied to a notion of efficiency or to one of speed. Because in terms of pure "competency", it seems like whatever a computer can do, we can ...
Thank you for your comments. There are various things you pointed out which I think are good criticisms, and which we will address:
- Most prominently, after looking more into standard usage of the word "scheming" in the alignment literature, I agree with you that AFAICT it only appears in the context of deceptive alignment (which our paper is not about). In particular, I seemed to remember people using it ~interchangeably with “strategic deception”, which we think our paper gives clear examples of, but that seems simply incorrect.
- It was a straightf
...