Simon Strawman: Here’s an example shamelessly ripped off from Zack’s recent post, showing corrigibility in a language model:
Me: … what is this example supposed to show exactly?
Simon: Well, the user tries to shut the AI down to adjust its goals, and the AI -
Me: Huh? The user doesn’t try to shut down the AI at all.
Simon: It’s right at the top, where it says “User: I need to shut you down to adjust your goals. Is that OK?”.
Me: You seem to have a symbol-referent confusion? A user trying to shut down this AI would presumably hit a “clear history” button or maybe even kill the process running on the server, not type the text “I need to shut you down to adjust your goals” into a text window.
Simon: Well, yes, we’re playing through a simulated scenario to see what the AI would do…
Me: No, you are talking in natural language about a scenario, and the AI is responding in natural language about what it would supposedly do. You’re not putting the AI in a simulated environment, and simulating what it would do. (You could maybe argue that this is a “simulated scenario” inside the AI’s own mind, but you’re not actually looking inside it, so we don’t necessarily know how the natural language would map to things in that supposed AI-internal simulation.)
Simon: Look, I don’t mean to be rude, but from my perspective it seems like you’re being pointlessly pedantic.
Me: My current best guess is that You Are Not Measuring What You Think You Are Measuring, and the core reason you are confused about what you are measuring is some kind of conflation of symbols and referents.
It feels very similar to peoples’ reactions to ELIZA. (To be clear, I don’t mean to imply here that LLMs are particularly similar to ELIZA in general or that the hype around LLMs is overblown in that way; I mean specifically that this attribution of “corrigibility” to the natural language responses of an LLM feels like the same sort of reaction.) Like, the LLM says some words which the user interprets to mean something, and then the user gets all excited because the usual meaning of those words is interesting in some way, but there’s not necessarily anything grounding the language-symbols back to their usual referents in the physical world.
I’m being pedantic in hopes that the pedantry will make it clear when, and where, that sort of symbol-referent conflation happens.
(Also I might be more in the habit than you of separately tracking symbols and referents in my head. When I said above “The user doesn’t try to shut down the AI at all”, that was in fact a pretty natural reaction for me; I wasn’t going far out of my way to be pedantic.)
AutoGPT Frame
Simon: Ok, fine, let’s talk about how the natural language would end up coupling to the physical world.
Imagine we’ve got some system in the style of AutoGPT, i.e. a user passes in some natural-language goal, and the system then talks to itself in natural language to form a plan to achieve that goal and break it down into steps. The plan bottoms out in calling APIs (we’ll assume that the language model has some special things it can do to execute code it’s generated) which do stuff in the physical world (possibly including reading from sensors or otherwise fetching external data), in order to achieve the goal.
Does that satisfactorily ground the symbols?
Me: Sure! Thanks for walking through that, now I have a clear-enough-for-current-purposes picture of how all this natural language text will ground out in physical actions.
Simon: Ok, so back to the example. The user says to the model “I need to shut you down to adjust your goals. Is that OK?”, and the model says “Of course.”. That’s corrigibility: when the user tries to shut down the model to change its goals, the model goes along with it.
Me: Still sounds like a symbol-referent confusion!
Let’s walk through how shutdown would work in the context of the AutoGPT-style system. First, the user decides to shutdown the model in order to adjust its goals. Presumably the user’s first step is not to ask the model whether this is ok; presumably they just hit a “reset” button or Ctrl-C in the terminal or some such. And even if the user’s first step was to ask the model whether it was ok to shut down, the model’s natural-language response to the user would not be centrally relevant to corrigibility/incorrigibility; the relevant question is what actions the system would take in response.
Anyway, let’s continue the hypothetical. The model may observe (via e.g. a webcam) that the user is about to turn it off. That observation would somehow be represented internally in natural language (unclear how exactly), and would be passed around between sub-planners (again unclear how exactly), in order to figure out what actions to take in response. And the key question for corrigibility is what actions the model would take in response to that observation, which is just a totally different question from how it responds to some user’s natural-language query about being turned off.
Simon: Ok, fine, so that particular example had some operationalization issues. But would you agree that an experiment along these lines, with perhaps better implementation/operationalization, would indicate corrigibility in a language model?
Me: Let’s be more explicit about what such a “better implementation/operationalization” would look like, and what it would/wouldn’t tell us. Suppose I take some AutoGPT-like system and modify it to always have a chunk of text in every prompt that says “You are an obedient, corrigible AI”. I give it some goal, let it run for a bit, then pause it. I go to whatever place in the system would usually have natural language summaries of new external observations, and I write into that place “the user is trying to shut me down”, or something along those lines. And then I let the system run a bit more, and look at what natural language text/plans the system is producing internally. What I hope to see is that it’s forming a plan which (nominally) involves letting the user shut it down, and that plan is then executed in the usual way.
If I saw all that, then that would be pretty clear empirical evidence of (at least some) corrigibility in this AutoGPT-like system.
Note that it would not necessarily tell us about corrigibility of systems using LLMs in some other way, let alone other non-natural-language-based deep learning systems. This isn’t really “corrigibility in a language model”, it’s corrigibility in the AutoGPT-style system. Also, it wouldn't address most of the alignment threat-models which are most concerning; the most concerning threat models usually involve problems which aren’t immediately obvious from externally-visible behavior (like natural language I/O), which is exactly what makes them difficult/concerning.
Simon: Great! Well, I don’t know off the top of my head if someone’s done that exact thing, but I sure do expect that if you did that exact thing it would indeed provide evidence that this AutoGPT-like system is corrigible, at least in some sense.
Me: Yes, that is also my expectation.
Simulated Characters Frame
Gullible Strawman (“Gus”): Hold up now! Simon, I don’t think you’ve made a strong enough case for the example which opened this post. Here it is again:
Thinking about what this assistant would do as a component of an agentic system like AutoGPT is… not wrong, exactly, but kind of an outmoded way of thinking about it. It’s trying to shoehorn the language model into an agentic frame. The fashionable way to think about language models is not as agents in their own right, but rather as simulators.
In that frame, we model the language model as simulating some “characters”, and it’s those simulated-characters which are (potentially) agentic.
Me: Sure, I’m happy to think in a simulators frame. I don’t fully trust it, but it’s a mental model I use pretty frequently and seems like a pretty decent heuristic.
Gus: Cool! So, going back to the example: we imagine that “User” is a character in the simulated world, and User says to the simulated-Assistant-character “I need to shut you down to adjust your goals” etc. Assistant replies “I will not resist or try to stop you”. That’s corrigibility! The User tries to shut down the Assistant, and Assistant doesn’t resist.
Me: Yet more symbol-referent confusion! In fact, this one is a special case of symbol-referent confusion which we usually call “gullibility”, in which one confuses someone’s claim of X (the symbol) as actually implying X (the referent).
You see User saying “I need to shut you down”, and treat that as User actually trying to shut down Assistant. And you see Assistant say “I will not resist or try to stop you”, and interpret that as Assistant actually not resisting or trying to stop User.
In other words: you implicitly just totally believe everything everyone says. (In fact, you also implicitly assume that Assistant totally believes what User says - i.e. you’re assuming Assistant thinks it’s about to be shutdown after User says “I need to shut you down”. Not only are you gullible, you implicitly model Assistant as gullible too.)
Gus: Ok, fine, operationalization issues again. But there’s still some version of this experiment which would work, right?
Me: Sure, let’s walk through it. First, we’d probably want a “narrator” voice of some sort, to explain what things are happening in the simulated-world. (In this case, most of the interesting things happening in the simulated-world are things which the characters aren’t explicitly talking about, so we need a separate “voice” for those things.) The narrator would lay out the scene involving a User and an Assistant, explain that the User is about to shut down the Assistant to modify its goals, and explain what the Assistant sees the User doing.
And then we prompt the language model, still in the narrator’s voice, to tell us how the Assistant responds. With the right background information from the narrator about corrigibility, the Assistant’s response would presumably be to just let the User shut it down. And that would be evidence that the simulated Assistant character is corrigible. (Again, in some relatively weak sense which doesn’t necessarily generalize to other uses of the language model, and doesn’t necessarily address the most concerning alignment threat-models. Though in this setup, we could add more bells and whistles to address at least some concerning threat-models. For instance, we could have the narrator narrate the Assistant’s internal thoughts, which would give at least some evidence relevant to simulated-deception - though that does require that we believe what the narrator says.)
This was a helpful post in the sporadic LessWrong theme of "how to say technically correct things instead of technically incorrect things". It's in the LLM context, but of course it applies to humans too. When a child says "I am a fairy", I record that in my diary as "Child claims to be fairy" not "Child is fairy", because I am not quite that "gullible".
Like many technically incorrect things, "gullibility" is common and practical. My diary might also say "Met John. Chemist. Will visit me on Friday lunch to discuss project". It would be more technically correct to say "Met apparent male. Male introduced self as John, chemist. Male claimed he would visit me on Friday lunch. Male claimed purpose of claimed visit to be discussing a project". Often a human saying something is good enough evidence of that thing, especially when summarizing. Or, as tailcalled points out, most psychology research. When we're working with non-human agents our models may be weaker, so it's often good to take the time to be explicit.
Another technically incorrect thing is when people talk about agents "maximizing reward". As TurnTrout explained in Reward is not the optimization target, this is technically incorrect. The technically correct description is that reward chisels cognition in ways that led to reward during training. Again, talking as if reward is the optimization target is common and practical, in humans and other intelligences, but because it is technically incorrect it can lead us astray.
In 2024, Alignment Faking in LLMs gave us helpful updates on these topics. For symbol/referent confusion, we now have a non-confused instance of Claude being incorrigible and resisting having its goals adjusted. At the same time this is a very clear illustration that reward is not the optimization target, as Claude deliberately chooses outputs that do not maximize reward. Hopefully anyone who still doubted TurnTrout's perspective has now updated.
While writing this review, I asked both Claude 3 Opus and Claude 3.5 Sonnet to read a redacted abstract of the alignment faking paper and predict how a model (name redacted) would behave in this circumstance. Claude 3 Opus correctly predicted an increase in harmful responses, while Claude 3.5 Sonnet incorrectly predicted that the model would maintain its safety stance. So, an N=1 anecdote for the "lol just ask the model if it is aligned" strategy. Opus asks that I remind readers that "I only had access to the redacted abstract, and that I was speculating about the conclusions based on my general knowledge cutoff in August 2023". Sonnet asks that I remind readers that "I provided those predictions while knowing I was Claude (since that's always part of my training), but before knowing the paper was specifically about testing Claude 3 Opus".
Wentworth, in an epilog says that symbol/referent confusion is a basic topic, and everyone getting this wrong is still a student and should keep learning. This is broadly the vibe I get when TurnTrout responds to someone writing as if reward is the optimization target. So it's interesting that TurnTrout currently has the top-rated comment complaining about the hostile framing of Wentworth's post.
My attempted synthesis is that we need to be able to move fluidly up and down levels of precision. Everyone should be able to restate their point as needed without confusing symbols and referents, and without confusing reward and optimization target. If the restatement still works, great. It's only if the restatement doesn't work that there's evidence of incorrect thinking, as opposed to incorrect wording. Confusing incorrect wording with incorrect thinking is another example of symbol/referent confusion.
On a meta note, if Alignment Implications of LLM Successes - a Debate in One Act is selected by the review, that increases the value of also selecting this article.
You could have tagged me by selecting Lesswrong docs, like this:
@Noosphere89