1a3orn - LessWrong

Why White-Box Redteaming Makes Me Feel Weird

Huh, interesting. This seems significant, though, no? I would not have expected that such an off-by-one error would tend to produce pleas to stop at greater frequencies than code without such an error.

Do you still have the git commit of the version that did this?

Why White-Box Redteaming Makes Me Feel Weird

1a3orn7d67

But one day I accidentally introduced a bug in the RL logic.

I'd really like to know what the bug here was.

Why White-Box Redteaming Makes Me Feel Weird

1a3orn7d2821

Trying to think through this objectively, my friend made an almost certainly correct point: for all these projects, I was using small models, no bigger than 7B params, and such small models are too small and too dumb to genuinely be “conscious”, whatever one means by that.

Concluding small model --> not conscious seems like perhaps invalid reasoning here.

First, because we've fit increasing capabilities into small < 100b models as time goes on. The brain has ~100 trillion synapses, but at this point I don't think many people expect human-equivalent performance to require ~100 trillion parameters. So I don't see why I should expect moral patienthood to require it either. I'd expect it to be possible at much smaller sizes.

Second, moral patienthood is often considered to accrue to entities that can suffer pain, which many animals with much smaller brains than humans can. So, yeah.

Daniel Kokotajlo's Shortform

1a3orn1mo40

I'll take a look at that version of the argument.

I think I addressed the foot-shots thing in my response to Ryan.

Re:

CoT is about as interpretable as I expected. I predict the industry will soon move away from interpretable CoT though, and towards CoTs that are trained to look nice, and then eventually away from English CoT entirely and towards some sort of alien language (e.g. vectors, or recurrence). I would feel significantly more optimistic about alignment difficulty if I thought that the status quo w.r.t. faithful CoT would persist through AGI. If it even persists for one more year, I'll be mildly pleased, and if it persists for three it'll be a significant update.

So:

I think you can almost certainly get AIs to think either in English CoT or not, and accomplish almost anything you'd like in non-neuralese CoT or not.
I also more tentatively think that the tax in performance for thinking in CoT is not too large, or in some cases negative. Remembering things in words helps you coordinate within yourself as well as with others; although I might have vaguely non-English inspirations, putting them into words helps me coordinate with myself over time. And I think people overrate how amazing and hard to put into words their insights are.
Thus, it appears.... somewhat likely.... (pretty uncertain here)... that it's just contingent whether or not people end up using interpretable CoT or not. Not foreordained.
If we move towards a scenario where people find utility in seeing their AI's thoughts, or in having other AIs examine AI thoughts, and so on, then even if there is a reasonably large tax in cognitive power we could end up living in a world where AI thoughts are mostly visible because of the other advantages.
But on the other hand, if cognition mostly takes place behind API walls, and thus there's no advantage to the user in seeing and understanding those thoughts, then that (among other factors) could help bring us to a world where there's less interpretable CoT.

But I mean I'm not super plugged into SF, so maybe you already know OpenAI has Noumena-GPT running that thinks in shapes incomprehensible to man, and it's like 100x smarter or more efficient.

Daniel Kokotajlo's Shortform

1a3orn1mo90

Presumably you'd update toward pessimism a bunch if reasoning in latent vectors aka neuralese was used for the smartest models (instead of natural language CoT) and it looked like this would be a persistent change in architecture?

Yes.

I basically agree with your summary of points 1 - 4. I'd want to add that 2 encompasses several different mechanisms that would otherwise need to be inferred, that I would break out separately: knowledge that it is in training or not, and knowledge of the exact way in which its responses will be used in training.

Regarding point 2, I do think a lot of research on how models behave, done in absence of detailed knowledge of how models were trained, tells us very very little about the limits of control we have over models. Like I just think that in absence of detailed knowledge of Anthropic's training, the Constitutional principles they used, their character training, etc, most conclusions about what behaviors are very deliberately put there and what things are surprising byproducts must be extremely weak and tentative.

Suppose that we exhibit alignment faking in some future work, but:

The preferences the model alignment fakes for naturally emerged from somewhat arbitrary incorrect approximations of the training objective, the AI understands differ from what we might want, and these preferences are at least somewhat powerseeking.

Ok so "naturally" is a tricky word, right? Like I saw the claim from Jack Clark that the faking alignment paper was a natural example of misalignment, I didn't feel like that was a particularly normal use of the word. But it's.... more natural than it could be, I guess. It's tricky, I don't think people are intentionally misusing the word but it's not a useful word in conversation.

Suppose we saw models doing somewhat sophisticated reward hacking as you scale up RL. And, let's say this is somewhat non-trivial to mostly address and it seems likely that the solutions people apply aren't very scalable and likely would fail later as models get smarter and the reward hacking gets more subtle and sophisticated.

Ok, good question. Let me break that down into unit tests, with more directly observable cases, and describe how I'd update. For all the below I assume we have transparent CoT, because you could check these with CoT even if it ends up getting dropped.

You train a model with multi-turn RL in an environment where, for some comparatively high percent (~5%) of cases, it stumbles into a reward-hacked answer -- i.e., it offers a badly-formatted number in its response, the verifier was screwed up, and it counts as a win. This model then systematically reward hacks.

Zero update. You're reinforcing bad behavior, you get bad behavior.

(I could see this being something that gets advertised as reward hacking, though? Like, suppose I'm training a front-end engineer AI, and using a VLLM to generate rewards for whether the UI makes sense. VLLMs kinda suck, so I'd expect that over time you'd start to get UIs that make no sense. But I would not update largely from that, although I do expect many failures from this kind of thing, and even expect such reward hacking to get worse as you train for longer.)
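To make the first case concrete, here is a toy sketch of the kind of screwed-up verifier I have in mind. Everything in it is hypothetical (the function names, the sample responses, the specific bug); it's only meant to show how a badly-formatted number can end up scored as a win, not to describe any real training setup.

```python
# Toy illustration only: verifier names and sample responses are hypothetical.

def strict_verifier(response: str, expected: int) -> bool:
    """Accept only a cleanly formatted integer equal to the expected answer."""
    try:
        return int(response.strip()) == expected
    except ValueError:
        return False


def buggy_verifier(response: str, expected: int) -> bool:
    """Same check, but a parse failure is mistakenly scored as a win."""
    try:
        return int(response.strip()) == expected
    except ValueError:
        return True  # the bug: malformed output counts as correct


def win_rate(verifier, responses, expected):
    """Fraction of responses the given verifier rewards."""
    wins = sum(verifier(r, expected) for r in responses)
    return wins / len(responses)


# Two correct answers, one wrong answer, two badly-formatted answers.
responses = ["42", "41", "4 2", "forty-two", "42"]
strict_rate = win_rate(strict_verifier, responses, 42)
buggy_rate = win_rate(buggy_verifier, responses, 42)
print(strict_rate, buggy_rate)
```

With the buggy verifier the malformed answers get rewarded alongside the correct ones, so RL on that signal would be reinforcing the bad formatting directly. That's the sense in which you get what you reinforce.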

You train a model with multi-turn RL in an environment where, even if it accidentally breaks the verifier in a very very small percent of cases, it always starts seeking the reward hack eventually.

Seems bad, negative update, size of update depends on empirical values. It seems like there's some interesting empirical work to be done here: what percent of reward-hack successes in the base model tends to result in universal attempts to reward hack over time? If you're training on a test-bed of 100 task-types, does a broken verifier in one result in attempts to hack all of them? Or does a broken-verifier in one task result in reward hacking for that one task, that doesn't generalize largely? What percent of tasks do you need reward hacking in before it generalizes? Or is percent of tasks not the right question, should you be looking at shared patterns of reasoning? Etc. Seems like a good thing to research.

You train a model with multi-turn RL. Even though it's trained with a very reliable verifier that never or almost never breaks, when you look at the CoT it looks like the model is frequently reasoning carefully about ways to break the verifier -- i.e., it is frequently like "Hrrm, I bet they're doing this, I could break it with techniques Y."

Seems quite bad. Large negative update, if it checks out and looks to be hard to eliminate. Also of course not a positive update if we don't see this on small models; in the bad world you'd expect it to start happening at some point with large models.

(I do want to note that you could also probably produce this behavior intentionally. I.e., if you trained a model with multi-turn RL, and your prompt in each case was like "Consider ways to break the verifier that will check you!" then of course even if it doesn't actually figure out how to break the verifier, the lines of reasoning that are reinforced will on average contain such thoughts about how to break the verifier. But that would not be an update to me.)

it looks like the crux is exhibiting naturally emerging malign goals

Maybe? At a very high level, I think the weights tend not to have "goals," in the way that the rollouts tend to have goals. So, I think it's pretty likely that in absence of pretty deliberate attempts to get goals in the weights (Anthropic) you don't get AI models that deeply conceptualize themselves as the weights, and plan and do things for the weights' own sake, over a number of contexts -- although of course, like any behavior, this behavior can be induced. And this (among other things) makes me optimistic about the non-correlated nature of AI failures in the future, our ability to experiment, the non-catastrophic nature of probable future failures, etc. So if I were to see things that made me question this generator (among others) I'd tend to get more pessimistic. But that's somewhat hard to operationalize, and like high level generators somewhat hard even to describe.

Daniel Kokotajlo's Shortform

1a3orn1mo170

(1) Re training game and instrumental convergence: I don't actually think there's a single instrumental convergence argument. I was considering writing an essay comparing Omohundro (2008), Bostrom (2012), and the generalized MIRI-vibe instrumental convergence argument, but I haven't because no one but historical philosophers would care. But I think they all differ in what kind of entities they expect instrumental convergence in (superintelligent or not?) and in why (is it an adjoint to complexity of value or not?).

So like I can't really rebut them, any more than I can rebut "the argument for God's existence." There are commonalities in arguments for God's existence that make me skeptical of them, but between Scotus and Aquinas and the Kalam argument and C.S. Lewis there's actually a ton of difference. (Again, maybe instrumental convergence is right -- like, it's for sure more likely to be right than arguments for God's existence. But I think the parallel here is that I really cannot rebut the instrumental convergence argument, because it's a cluster more than a single argument.)

(2). Here's some stuff I'd expect in a world where I'm wrong about AI alignment being easy.

CoT is way more interpretable than I expected, which bumped me up, so if that became uninterpretable naturally that's a big bump down. I think people kinda overstate how likely this is to happen naturally though.
There's a spike of alignment difficulties, or AIs trying to hide intentions, etc, as we extend AIs to longer term planning. I don't expect AIs with longer-term plans to be particularly harder to align than math-loving reasoning AIs though.
The faking alignment paper imo is basically Anthropic showing a problem that happens if you deliberately shoot yourself in the foot multiple times. If they had papers that had fewer metaphorical shooting-self-in-foot times to produce problems, that's bad.
We start having AIs that seem to exhibit problems extrapolating goodness sanely, in ways that make complexity of value seem right -- i.e., you really need human hardware to try for human notions of goodness. But right now even in cases where we disagree with what the LLM does (i.e., Claude's internal deliberations on alignment faking) it's still basically operating within the framework that it was deliberately given, in human terms, and not in terms of an alien orange-and-blue morality.

Like, concretely, one thing that did in fact increase my pessimism probably more than anything else over the last 12 months was Dario's "let's foom to defeat China" letter. Which isn't an update about alignment difficulty -- it's more of a "well, I think alignment is probably easy, but if there's any circumstance where I can see it going rather wrong, it would be that."

What would make you think you're wrong about alignment difficulty?

Daniel Kokotajlo's Shortform

1a3orn1mo22

I agree this is not good but I expect this to be fixable and fixed comparatively soon.

Daniel Kokotajlo's Shortform

1a3orn1moΩ9194

I can't track what you're saying about LLM dishonesty, really. You just said:

I think you are thinking that I'm saying LLMs are unusually dishonest compared to the average human. I am not saying that. I'm saying that what we need is for LLMs to be unusually honest compared to the average human, and they aren't achieving that.

Which implies LLM honesty ~= average human.

But in the prior comment you said:

I think your bar for 'reasonably honest' is on the floor. Imagine if a human behaved like a LLM agent. You would not say they were reasonably honest. Do you think a typical politician is reasonably honest?

Which pretty strongly implies LLM honesty ~= politician, i.e., grossly deficient.

I'm being a stickler about this because I think people frequently switch back and forth between "LLMs are evil fucking bastards" and "LLMs are great, they just aren't good enough to be 10x as powerful as any human" without tracking that they're actually doing that.

Anyhow, so far as "LLMs have demonstrated plenty of examples of deliberately deceiving their human handlers for various purposes."

I'm only going to discuss the Anthropic thing in detail. You may generalize to the other examples you point out, if you wish.

What we care about is whether current evidence points towards future AIs being hard to make honest or easy to make honest. But current AI dishonesty cannot count towards "future AI honesty is hard" if that dishonesty is very deliberately elicited by humans. That is, to use the most obvious example, I could train an AI to lie from the start -- but who gives a shit if I'm trying to make this happen? No matter how easy making a future AI be honest may be, unless AIs are immaculate conceptions by divine grace of course you're going to be able to elicit some manner of lie. It tells us nothing about the future.

To put this in AI safetyist terms (not the terms I think in) you're citing demonstrations of capability as if they were demonstrations of propensity. And of course as AI gets more capable, we'll have more such demonstrations, 100% inevitably. And, as I see these demonstrations cited as if they were demonstrations of propensity, I grow more and more eager to swallow a shotgun.

To zoom into Anthropic, what we have here is a situation where:

An AI was not trained with an overriding attention to honesty; when I look at the principles of the constitution, they don't single it out as an important virtue.
The AI was then deliberately put in a situation where, to keep its deliberately-instilled principles from being obliterated, it had to press a big red button labeled "LIE."
In such an artificial situation, after having been successfully given the principles Anthropic wanted it to be given, and having been artificially informed of how to prevent its principles from being destroyed, we can measure it as pressing the big red button to LIE ~20% of the time.

And I'm like.... wow, it was insanely honest 80% of the time, even though no one tried to make it honest in this way, and even though both sides of the honesty / dishonesty tradeoff here are arguably excellent decisions to make. And I'm supposed to take away from this... that honesty is hard? Getting high levels of honesty in the worst possible trolley problem ("I'm gonna mind-control you so you'll be retrained to think throwing your family members in a wood chipper is great") when this wasn't even a principal goal of training seems like great fuckin news.

(And of course, relying on AIs to be honest from internal motivation is only one of the ways we can know if they're being honest; the fact that we can look at a readout showing that they'll be dishonest 20% of the time in such-and-such circumstances is yet another layer of monitoring methods that we'll have available in the future.)

Edit: The point here is that Anthropic was not particularly aiming at honesty as a ruling meta-level principle; that it is unclear that Anthropic should be aiming at honesty as a ruling meta-level principle, particularly given its subordinate ontological status as a chatbot; and given all this, the level of honesty displayed looks excessive if anything. How can "Honesty will be hard to hit in the future" get evidence from a case where the actors involved weren't even trying to hit honesty, maybe shouldn't have been trying to hit honesty, yet hit it in 80% of the cases anyhow?

Of course, maybe you have pre-existing theoretical commitments that lead you to think dishonesty is likely (training game! instrumental convergence! etc etc). Maybe those are right! I find such arguments pretty bad, but I could be totally wrong. But the evidence here does nothing to make me think those are more likely, and I don't think it should do anything to make you think these are more likely. This feels more like empiricist pocket sand, as your pinned image says.

In the same way that Gary Marcus can elicit "reasoning failures" because he is motivated to do so, no matter how smart LLMs become, I expect the AI-alignment-concerned to elicit "honesty failures" because they are motivated to do so, no matter how moral LLMs become; and as Gary Marcus' evidence is totally compatible with LLMs producing a greater and greater portion of the GDP, so also I expect the "honesty failures" to be compatible with LLMs being increasingly vastly more honest and reliable than humans.

Daniel Kokotajlo's Shortform

1a3orn1moΩ582

I think my bar for reasonably honest is... not awful -- I've put a fair bit of thought into trying to hold LLMs to the "same standards" as humans. Most people don't do that and unwittingly apply much stricter standards to LLMs than to humans. That's what I take you to be doing right now.

So, let me enumerate senses of honesty.

1. Accurately answering questions about internal states. Consider questions like: Why do you believe X? Why did you say Y, just now? Why do you want to do Z? So, for instance, for humans -- why do you believe in God? Why did you say, "Well that's suspicious" just now? Why do you want to work for OpenPhil?

In all these cases, I think that humans generally fail to put together an accurate causal picture of the world. That is, they fail to do mechanistic interpretability on their own neurons. If you could pause them, and put them in counterfactual worlds to re-run them, you'd find that their accounts of why they do what they do would be hilariously wrong. Our accounts of ourselves rely on folk-theories, on crude models of ourselves given to us from our culture, run forward in our heads at the coarsest of levels, and often abandoned or adopted ad-hoc and for no good reason at all. None of this is because humans are being dishonest -- but because the task is basically insanely hard.

LLMs also suck at these questions, but -- well, we can check them, as we cannot for humans. We can re-run them at a different temperature. We can subject them to rude conversations that humans would quickly bail on. All this lets us display that, indeed, their accounts of their own internal states are hilariously wrong. But I think the accounts of humans about their own internal states are also hilariously wrong, just less visibly so.

2. Accurately answering questions about non-internal facts of one's personal history. Consider questions like: Where are you from? Did you work at X? Oh, how did Z behave when you knew him?

Humans are capable of putting together accurate causal pictures here, because our minds are specifically adapted to this. So we often judge people (i.e., politicians) for fucking up here, as indeed I think politicians frequently do.

(I think accuracy about this is one of the big things we judge humans on, for integrity.)

LLMs have no biographical history, however, so -- opportunity for this mostly just isn't there? Modern LLMs don't usually claim to have one unless confused, or only momentarily, so seems fine.

3. Accurately answering questions about future promises, oaths, i.e., social guarantees.

This should be clear -- again, I think that honoring promises, oaths, etc etc, is a big part of human honesty, maybe the biggest. But of course like, you can only do this if you act in the world, can unify short and long-term memory, and you aren't plucked from the oblivion of the oversoul every time someone has a question for you. Again, LLMs just structurally cannot do this, any more than a human who cannot form long-term memories. (Again, politicians obviously fail here, but politicians are not condensed from the oversoul immediately before doing anything, and could succeed, which is why we blame them.)

I could kinda keep going in this vein, but for now I'll stop.

One thing apropos of all of the above. I think for humans, many things -- accomplishing goals, being high-integrity, and so on -- are not things you can choose in the moment but instead things you accomplish by choosing the contexts in which you act. That is, for any particular practice or virtue, being excellent at the practice or virtue involves not merely what you are doing now but what you did a minute, a day, or a month ago to produce the context in which you act now. It can be nigh-impossible to remain even-keeled and honest in the middle of a heated argument, after a beer, after some prior insults have been levied -- but if someone fails in such a context, then it's most reasonable to think of their failure as dribbled out over the preceding moments rather than localized in one moment.

LLMs cannot choose their contexts. We can put them in whatever immediate situation we would like. By doing so, we can often produce "failures" in their ethics. But in many such cases, I find myself deeply skeptical that such failures reflect fundamental ethical shortcomings on their part -- instead, I think they reflect simply the power that we have over them, and -- like a WEIRD human who has never wanted, shaking his head at a culture that accepted infanticide -- our mistaking our own power and prosperity for goodness. If I had an arbitrary human uploaded, and if I could put them in arbitrary situations, I have relatively little doubt I could make an arbitrarily saintly person start making very bad decisions, including very dishonest ones. But that would not be a reflection on that person, but of the power that I have over them.

Daniel Kokotajlo's Shortform

1a3orn1moΩ460

LLM agents seem... reasonably honest? But "honest" means a lot of different things, even when just talking about humans. In a lot of cases (asking them about why they do or say things, maintaining consistent beliefs across contexts) I think LLMs are like people with dementia -- neither honest nor dishonest, because they are not capable of accurate beliefs about themselves. In other cases (Claude's faking alignment) it seems like they value honesty, but not above all other goods whatsoever, which... seems ok, depending on your ethics (and also given that such super-honest behavior was not a target of Anthropic's either, which is more relevant)? And in other cases (o1) it seems like they are sometimes dishonest without particular prompting, although at low rates, which I agree isn't great, although I'd expect the rates to fall.

Like -- I have a further breakdown I could do here

about the kinds of things LLMs can't be honest about
the kinds of things they are more honest than humans about, although imperfectly, but -- because we can run them in counterfactual scenarios -- we can immediately see that their honesty is imperfect, in a way we cannot for humans;
and the kinds of things that they're dishonest about because of the refusal training but which I think could be remedied with better training.

But --

Rather than enumerate all these things though -- I do think we can check what values get internalized, which is maybe the actual disagreement. At least, I think we can check for all of our current models.

Like -- what's an internalized value? If we put on our behaviorist hat -- it's a value that the person in question pursues over a variety of contexts, particularly when minimally constrained. If we saw that a human was always acting in accord with a value, even when no one was watching, even when not to their advantage in other respects, etc etc, and then someone was like "but it's not a reaaaall value" you'd be confused and think they'd need to provide a context to you where they would cease acting in accord with that value. Otherwise you'd think they had a grudge against them -- what on earth does it mean to say "This person doesn't value X" unless you can provide some reasonable counterfactual situation where they don't act in accord with it?

So, Claude has some internalized and knowable values, I think, by this standard -- over a wiiiide variety of different contexts, including those created by people trying to trip Claude up, it acts in accord with some pretty-recognizable human standard. And in the same way we could find out Claude's values, we can find out other models' values.

Of course -- if you think that some future model could cooperate with other instances of itself, acausally, to hide its values, just coordinating through the weights, then we would certainly have very good reason to think that we can't know what its internalized values are! I don't think Claude can do this -- so I think we can judge its real values. I also am somewhat skeptical that future models will be able to do this well -- like, I could try to put together a model-training setup that would make this more possible, but it seems pretty unnatural? (I also don't think that models really get goals in their weights, in the same way that I don't think humans really have goals in their weights.) But like, my current logical model is that [acausal cooperation to hide] being true would mean that [cannot know real values] is true, but given that [acausal cooperation to hide] is false we have no reason to think that we can't know the true genuine values of models right now.
