An exercise that helped me see the "argmax is a trap" point better was to concretely imagine what the cognitive stacktrace for an agent running argmax search over plans might look like:
# Traceback (most recent call last)
At line 42 in main:
decision = agent.deliberate()
At line 3 in deliberate:
all_plans = list(plan_generator)
return argmax(all_plans, grader=self.grader)
At line 8 in argmax:
for plan in plans: # includes adversarial inputs (!!)
evaluation = apply(grader, plan)
At line 264 in apply:
predicted_object_level_consequences = ...
KeyboardInterrupt
A major problem with this design is that the agent considers "What do I think would happen if I ran plan X?", but NOT "What do I think would happen if I generated plans using method Y?". If the agent were considering the second question as well, then it would (rightly) conclude that the method "generate all plans & run argmax search on them" would spit out a grader-fooling adversarial input, which will cause it to implement some arbitrary plan with low expected value. (Heck, with future LLMs maybe you'll be able to just show them an article about the Optimizer's Curse and it will grok this.) Knowing this, the agent tosses aside this foreseeably-harmful-to-its-interests search method.
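A small simulation can make the Optimizer's Curse concrete. This is only a sketch of the mild, non-adversarial version of the problem: the Gaussian true values, the Gaussian grader noise, and the function name `argmax_selection_gap` are all assumptions made for illustration, not anything taken from the post.

```python
import random

def argmax_selection_gap(n_plans, noise_sd=1.0, trials=200, seed=0):
    """Average (estimated value - true value) of the plan picked by argmax.

    Hypothetical setup: each plan has a true value drawn from N(0, 1), and the
    grader sees that value plus independent N(0, noise_sd) error.
    """
    rng = random.Random(seed)
    total_gap = 0.0
    for _ in range(trials):
        true_values = [rng.gauss(0.0, 1.0) for _ in range(n_plans)]
        estimates = [v + rng.gauss(0.0, noise_sd) for v in true_values]
        # Argmax over the grader's estimates, not over the true values.
        best = max(range(n_plans), key=estimates.__getitem__)
        total_gap += estimates[best] - true_values[best]
    return total_gap / trials

if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        print(n, round(argmax_selection_gap(n), 2))
```

Under these assumptions, the gap between what the grader promised and what the chosen plan delivers grows as more plans are searched, even though no plan was built to fool the grader. Deliberately adversarial inputs would be a stronger version of the same selection effect.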
The natural next question is "Ok, but what other option(s) do we have?". I'm guessing TurnTrout's next post might look into that, and he likely has more worked out thoughts on it, so I'll leave it to him. But I'll just say that I don't think coherent real-world agents will/should/need to be running cognitive stacktraces with footguns like this.
My perspective is:
My sense is that you want to decline to play this game and instead say: just don't build AI systems that search for high-scoring plans.
That might be OK if it turns out that planning isn't an effective algorithmic ingredient, or if you can convince people not to build such systems because it is dangerous (and similar difficulties don't arise if agents learn planning internally). But failing that, we are going to have to figure out how to build AI systems that capture the benefits of planning without being dangerous.
(It's possible you instead have a novel proposal for a way to capture the benefits of search without the risks, in which case I'd withdraw this comment once part 2 came out, though I wish you'd led with the juicy part.)
As a secondary point (that I've said a bunch of times), I also found the arguments in this post uncompelling. Probably the first thing to clarify is that I feel like you equivocate between the grader being something that is embedded in the real world and hence subject to manipulation by real-world consequences of the actor's actions, and the grader being something that operates on plans in the agent's head in order to select the best one. In the latter case the grader is still subject to manipulation, but the prospects for manipulation seem unrelated to the open-endedness of the domain and unrelated to taking dangerous actions.
Probably the first thing to clarify is that I feel like you equivocate between the grader being something that is embedded in the real world and hence subject to manipulation by real-world consequences of the actor's actions, and the grader being something that operates on plans in the agent's head in order to select the best one. In the latter case the grader is still subject to manipulation, but the prospects for manipulation seem unrelated to the open-endedness of the domain and unrelated to taking dangerous actions.
This seems like a misunderstanding. While I've previously communicated to you arguments about problems with manipulating embedded grading functions, that is not at all what this post is intended to be about. I'll edit the post to make the intended reading more obvious. None of this post's arguments rely on the grader being embedded and therefore physically manipulable. As I wrote in footnote 1:
I'm not assuming the actor wants to maximize the literal physical output of the grader, but rather just the "spirit" of the grader. More formally, the actor is trying to $\operatorname{argmax}_{p \in \text{Plans}} \text{Grader}(p)$, where Grader can be defined over the agent's internal plan ontology.
Anyways, replying in particular to:
the prospects for manipulation seem unrelated to the open-endedness of the domain and unrelated to taking dangerous actions.
Open-ended domains are harder to grade robustly on all inputs because more stuff can happen, and the plan space gets exponentially larger since the branching factor is the number of actions. E.g., it's probably far harder to produce an emotionally manipulative-to-the-grader DOTA II game state (e.g. one which I look at and feel compelled to rate with a ridiculously high number) than a manipulative state in the real world (which plays to e.g. the grader's particular insecurities and desires, perhaps reminding them of triggering events from their past in order to make their judgments higher-variance).
My sense is that you want to decline to play this game and instead say: just don't build AI systems that search for high-scoring plans.
That might be OK if it turns out that planning isn't an effective algorithmic ingredient, or if you can convince people not to build such systems because it is dangerous (and similar difficulties don't arise if agents learn planning internally). But failing that, we are going to have to figure out how to build AI systems that capture the benefits of planning without being dangerous.
(It's possible you instead have a novel proposal for a way to capture the benefits of search without the risks, in which case I'd withdraw this comment once part 2 came out, though I wish you'd led with the juicy part.)
I don't think we can or need to avoid planning per se. My position is more that certain design choices -- e.g. optimizing the output of a grader with a diamond-value, instead of actually having the diamond-value yourself -- force you to solve ridiculously hard subproblems, like robustness against adversarial inputs in the exponential-in-planning-horizon plan space.
Just to set expectations, I don't have a proposal for capturing "the benefits of search without the risks"; if you give value-child bad values, he will kill you. But I have a proposal for how several apparent challenges (e.g. robustness to adversarial inputs proposed by the actor) are artifacts of e.g. the design patterns I outlined in this post. I'll outline why I think that realistic (e.g. not argmax) cognition/motivational structures automatically avoid these extreme difficulties.
(I hesitated to post these comments in case they're not relevant to the main point you're trying to make or will be addressed in the next post. Feel free to ignore if that's the case.)
Value-child: The mother makes her kid care about working hard and behaving well.
How does one do this? (Not entirely rhetorical.)
Amplified humans spend 5,000 years thinking about how many diamonds the plan produces in the next 100 years, and write down their conclusions as the expected utility of the plan.
Due to the exponentially large plan space and the fact that humans are not cognitively secure systems, there exists a long sequence of action commands which cognitively impairs all of the humans and makes them prematurely stop the search and return a huge number.
If I was doing the evaluation, I wouldn't look at the plan directly but spend the first 4999 years slowly and carefully upgrading myself and my AI helpers, and then if I'm still not sure I can safely evaluate a plan, I would just throw an exception or return an error code instead of looking at the plan.
This lets us abstract away e.g. seemingly annoying complications with reflective agents which think about their future planning process. This seemingly[4] relaxes the problem.
Another reason to think about argmax in relation to AI safety/alignment is if you design an AI that doesn't argmax (or do its best to approximate argmax), and someone else builds one that does, your AI will lose a fair fight (e.g., economic competition starting from equal capabilities and resource endowments), so it would be nice if alignment doesn't mean giving up argmax.
if you design an AI that doesn't argmax (or do its best to approximate argmax), and someone else builds one that does, your AI will lose a fair fight (e.g., economic competition starting from equal capabilities and resource endowments), so it would be nice if alignment doesn't mean giving up argmax.
This seems exactly backwards to me. Argmax violates the non-adversarial principle and wastes computation. Argmax requires you to spend effort hardening your own utility function against the effort you're also expending searching across all possible inputs to your utility function (including the adversarial inputs!). For example, if I argmaxed over my own plan-evaluations, I'd have to consider the most terrifying-to-me basilisks possible, and rate none of them unusually highly. I'd have to spend effort hardening my own ability to evaluate plans, in order to safely consider those possibilities.
It would be far wiser to not consider all possible plans, and instead close off large parts of the search space. You can consider what plans to think about next, and how long to think, and so on. And then you aren't argmaxing. You're using resources effectively.
For example, some infohazardous thoughts exist (like hyper-optimized-against-you basilisks) which are dangerous to think about (although most thoughts are probably safe). But an agent which plans its next increment of planning using a reflective self-model is IMO not going to be like "hey it would be predicted-great if I spent the next increment of time thinking about an entity which is trying to manipulate me." So e.g. a reflective agent trying to actually win with the available resources, wouldn't do something dumb like "run argmax" or "find the plan which some part of me evaluates most highly."
(See Charles Foster's comment for another perspective here.)
If I was doing the evaluation, I wouldn't look at the plan directly but spend the first 4999 years slowly and carefully upgrading myself and my AI helpers, and then if I'm still not sure I can safely evaluate a plan, I would just throw an exception or return an error code instead of looking at the plan.
Unless this grader procedure implements a perfectly robust mathematical (plan input)->(grade output) function, you get hacked.
It would be far wiser to not consider all possible plans, and instead close off large parts of the search space. You can consider what plans to think about next, and how long to think, and so on. And then you aren’t argmaxing. You’re using resources effectively.
But aren't you still argmaxing within the space of plans that you haven't closed off (or are actively considering), and still taking a risk of finding some adversarial plan within that space? (Humans get scammed and invent or fall into cults and crazy ideologies not infrequently, despite doing what you're describing here already.) How do you just "not argmax" or "not design agents which exploit adversarial inputs"?
Maybe there's no substantive disagreement here, merely an issue of presentation/communication? I.e., when you say "you aren't argmaxing" perhaps you don't mean "don't ever use argmax anywhere" but instead "don't argmax over the whole plan space" and by "don't design agents which exploit adversarial inputs" you mean something like "we should try to find ways to avoid or reduce the risk of adversarial inputs"?
I.e., when you say "you aren't argmaxing" perhaps you don't mean "don't ever use argmax anywhere" but instead "don't argmax over the whole plan space"
I was primarily critiquing "argmax over the whole plan space." I do caution that I think it's extremely important to not round off "iterative, reflective planning and reasoning" as "restricted argmax", because that obscures the dynamics and results of real-world cognition. Argmax is also a bad model of what people are doing when they think, and how I expect realistic embedded agents to think.
"don't design agents which exploit adversarial inputs" you mean something like "we should try to find ways to avoid or reduce the risk of adversarial inputs"
No, I mean: don't design agents which are motivated to find and exploit adversarial inputs. Don't align an agent to evaluations which are only nominally about diamonds, and then expect the agent to care about diamonds! You wouldn't align an agent to care about cows and then be surprised that it didn't care about diamonds. Why be surprised here?
I wrote a bunch more before realizing that we maybe don't disagree fully on the "don't argmax" point. Here:
But aren't you still argmaxing within the space of plans that you haven't closed off (or are actively considering),
Not really? I think it is inappropriately suggestive to describe this as "argmaxing." I, for one, usually feel like I consider at most three plans during most planning sessions. Most of the work is going to be in my generative models, in my learned habits of thought, in my snap reflective assessments of what I should think about next.
How many different plans do you consider for going to the store? For writing a LessWrong post? Even if you did consider more plans, you'd convergently want to explore parts of the plan-space which you think won't contain secret adversarial examples to your own evaluations. EG at first pass, just don't think about entities trying to acausally blackmail you.
Argmax is an abstraction which may or may not actually describe a given cognitive process. I think that if we label reflective incremental planning and reasoning as "argmax", we're missing a serious opportunity for original thought, for considering in detail what the algorithm does.
and still taking a risk of finding some adversarial plan within that space?
There is indeed a risk you'll find an adversarial plan. But what is the risk, quantitatively? A reflective agent will convergently wish to avoid thinking about plans which exploit its own evaluation procedures and reasoning (eg tricking the diamond-shard into bidding for plans). In stark contrast, grader-optimizers and argmaxers convergently want to exploit those procedures, so as to achieve higher diamond-evaluations.
How do you just "not argmax" or "not design agents which exploit adversarial inputs"?
First of all, alignment researchers should stop trying to terminally motivate agents to optimize evaluations of their plans or outcomes. That's doomed and doesn't make sense.
Second, A shot at the diamond alignment problem describes an agent which isn't trying to exploit some diamond-grader. I didn't do anything in particular in order to avoid training an agent which exploits adversarial inputs to a diamond-grader function. I think that you just don't get that problem at all, unless you're assuming cognition must decompose via the (IMO) strange frame of "outer/inner alignment."
(Humans get scammed and invent or fall into cults and crazy ideologies not infrequently, despite doing what you're describing here already.)
Note the presence of adversarial optimizers in most of these situations. The adversarial optimization comes from other people who are optimizing ideas to get spurious buy-in from victims.
I expect that smart agents convergently wish to minimize the optimizer's curse, because that leads to more of what they want.
Thanks for this longer reply and the link to your diamond alignment post, which help me understand your thinking better. I'm sympathetic to a lot of what you say, but feel like you tend to state your conclusions more strongly than the underlying arguments warrant.
The adversarial optimization comes from other people who are optimizing ideas to get spurious buy-in from victims.
I think a lot of crazy religions/ideologies/philosophies come from people genuinely trying to answer hard questions for themselves, but there are also some that are deliberate attempts to optimize against others (Scientology?).
Daniel Kokotajlo described an analogy with EA, which you were going to answer but still haven't. I would add that EA funders have to consider even more (and potentially more explicitly adversarial) proposals/plans than a typical EA, and AFAIK nobody has suggested that that's doomed from the start because it amounts to argmaxing against an evaluator. Instead, everyone implicitly or explicitly recognizes the danger of adversarial plans, and tries to harden the evaluation process against them.
I agree that humans sometimes fall prey to adversarial inputs, and am updating up on dangerous-thought density based on your religion argument. Any links to where I can read more?
However, this does not seem important for my (intended) original point. Namely, if you're trying to align e.g. a brute-force-search plan maximizer or a grader-optimizer, you will fail due to high-strength optimizer's curse forcing you to evaluate extremely scary adversarial inputs. But also this is sideways of real-world alignment, where realistic motivations may not be best specified in the form of "utility function over observation/universe histories."
there are also some that are deliberate attempts to optimize against others (Scientology?).
(Also, major religions are presumably memetically optimized. No deliberate choice required, on my model.)
Daniel Kokotajlo described an analogy with EA, which you were going to answer but still haven't.
I would add that EA funders have to consider even more (and potentially more explicitly adversarial) proposals/plans than a typical EA, and AFAIK nobody has suggested that that's doomed from the start because it amounts to argmaxing against an evaluator. Instead, everyone implicitly or explicitly recognizes the danger of adversarial plans, and tries to harden the evaluation process against them.
This seems disanalogous to the situation discussed in the OP. If we were designing, from scratch, a system which we wanted to pursue effective altruism, we would be extremely well-advised to not include grader-optimizers which are optimizing EA funder evaluations. Especially if the grader-optimizers will eventually get smart enough to write out the funders' pseudocode. At best, that wastes computation. At (probable) worst, the system blows up.
By contrast, we live in a world full of other people, some of whom are optimizing for status and power. Given that world, we should indeed harden our evaluation procedures, insofar as that helps us more faithfully evaluate grants and thereby achieve our goals.
I agree that humans sometimes fall prey to adversarial inputs, and am updating up on dangerous-thought density based on your religion argument. Any links to where I can read more?
Maybe https://en.wikipedia.org/wiki/Extraordinary_Popular_Delusions_and_the_Madness_of_Crowds (I don't mean read this book, which I haven't either, but you could use the wiki article to familiarize yourself with the historical episodes that the book talks about.) See also https://en.wikipedia.org/wiki/Heaven's_Gate_(religious_group)
However, this does not seem important for my (intended) original point. Namely, if you’re trying to align e.g. a brute-force-search plan maximizer or a grader-optimizer, you will fail due to high-strength optimizer’s curse forcing you to evaluate extremely scary adversarial inputs. But also this is sideways of real-world alignment, where realistic motivations may not be best specified in the form of “utility function over observation/universe histories.”
My counterpoint here is, we have an example of human-aligned shard-based agents (namely humans), who are nevertheless unsafe in part because they fall prey to dangerous thoughts, which they themselves generate because they inevitably have to do some amount of search/optimization (of their thoughts/plans) as they try to reach their goals, and dangerous-thought density is high enough that even that limited amount of search/optimization is enough to frequently (on a societal level) hit upon dangerous thoughts.
Wouldn't a shard-based aligned AI have to do as much search/optimization as a human society collectively does, in order to be as competent/intelligent, in which case wouldn't it be as likely to be unsafe in this regard? And what if it has an even higher density of dangerous thoughts, especially "out of distribution", and/or does more search/optimization to try to find better-than-human thoughts/plans?
(My own proposal here is to try to solve metaphilosophy or understand "correct reasoning" so that we / our AIs are able to safely think any thought or evaluate any plan, or at least have some kind of systematic understanding of what thoughts/plans are dangerous to think about. Or work on some more indirect way of eventually achieving something like this.)
Another reason to think about argmax in relation to AI safety/alignment is if you design an AI that doesn't argmax (or do its best to approximate argmax),
Actual useful AGI will not be built from argmax, because it's not really useful for efficient approximate planning. You have exponential (in time) uncertainty from computational approximation and fundamental physics. This results in uncertainty over future state value estimates, and if you try to argmax with that uncertainty you are just selecting for noise. The correct solutions for handling uncertainty lead to something more like softmax or soft actor critic which avoids these issues (and also naturally leads to empowerment as an emergent heuristic).
So argmax is only useful in toy problem domains, mostly worthless for real world planning. To the extent much of standard alignment arguments now rests on this misunderstanding, those arguments are misfounded.
Which of the standard alignment arguments do you think no longer hold up if we replace argmax with softmax?
The first one that comes to my mind is: suppose we live in a world where an intelligence explosion is possible, and someone builds an AI with a flawed utility function. It would quickly become superintelligent and ignore orders to shut down, because shutting down has lower expected utility than not shutting down. It seems to me that replacing the argmax in the AI's decision procedure with softmax results in the same outcome, since the AI's estimated expected utility of not shutting down would be vastly greater than that of shutting down, resulting in a softmax probability of near 1 for that option.
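To put rough numbers on the softmax point, here is a minimal sketch. The utilities 50.0 and 0.0 for "don't shut down" and "shut down" are invented for illustration, as is the temperature of 1.0; nothing here is taken from a real system.

```python
import math

def softmax(utilities, temperature=1.0):
    """Numerically stable softmax over a list of expected utilities."""
    scaled = [u / temperature for u in utilities]
    largest = max(scaled)
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Options are ordered as [don't shut down, shut down]; values are hypothetical.
probs_small_gap = softmax([1.0, 0.0])   # modest preference
probs_large_gap = softmax([50.0, 0.0])  # vast preference

print(probs_small_gap)
print(probs_large_gap)
```

With a small utility gap, softmax still gives the disfavored option substantial probability. With a vast gap, the chosen-option probability is numerically indistinguishable from 1, so in that regime the softmax agent behaves like the argmax agent unless the temperature is scaled up along with the utilities.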
Am I misunderstanding something in the paragraph above, or do you have other arguments in mind?
Which of the standard alignment arguments do you think no longer hold up if we replace argmax with softmax?
The specific argument that you just referenced in your earlier comment: that argmax is important for competitiveness, but that argmax is inherently unsafe because of adversarial optimization ("argmax is a trap").
The first one that comes to my mind is: suppose we live in a world where an intelligence explosion is possible, and someone builds an AI with a flawed utility function,
If you assume you've already completely failed then the how/why is less interesting.
The argmax argument expounded further is that any slight imperfection in the utility function results in doom, because of adversarial optimization magnifying that slight imperfection as you extend the planning horizon into the far future and improve planning/modeling precision.
But that isn't actually how it works. Instead, due to compounding planning uncertainty, far-future value distributions are high-variance, and you get convergence to empowerment, as I mentioned in the linked discussion.
But that's good news because it means that small mis-specifications in the utility function model converge away rather than diverging to infinity. The planning trajectory just converges to empowerment, regardless of the utility function, so this is good news for alignment.
The specific argument that you just referenced in your earlier comment: that argmax is important for competitiveness, but that argmax is inherently unsafe because of adversarial optimization (“argmax is a trap”).
Assuming softmax is important for competitiveness instead, I don't see why this argument doesn't go through with "argmax" replaced by "softmax" throughout (including the "argmax is a trap" section of the OP). I read your linked comment and post, and still don't understand. I wonder what the authors of the OP (or anyone else) think about this.
Thanks for leaving the comments!
Value-child: The mother makes her kid care about working hard and behaving well.
How does one do this? (Not entirely rhetorical.)
I don't know how to do it perfectly, of course.[1] But I infer that it can be done, because there exist people who in fact intrinsically care about working hard and behaving well. So why can't the child also be made to make decisions in a similar manner? Take those values and transplant them into the child via some kind of "model surgery." (Unrealistic, yes. But so was "inner-align the child onto the evaluations output by his model of his mom.")
All that the parable requires is that it can be done, that we are talking about a realistic and possible mind design pattern.
I also wrote in a footnote:
Value-child is not trying to find a plan which he would evaluate as good. He is finding plans which evaluate as good. I think this is the kind of motivation which real-world intelligences tend to have. (More on how value-child works in the next essay.)
More concretely, I'm happy to make guesses like "judiciously supply M&Ms and praise to reward-shape them when they're working hard and behaving well, and emphasize why they're getting the rewards -- they're working hard and behaving well" and "show them cool media where the protagonist works hard and behaves well."
> Value-child: The mother makes her kid care about working hard and behaving well.
How does one do this? (Not entirely rhetorical.)
I think this post is not trying to answer this but just pointing out the discrepancy. The next post will probably come back to this:
In the next essay, I'll point out how this obstacle is an artifact of these design patterns, and not any intrinsic difficulty of alignment.
This seems great!
If you are continuing work in this vein, I'd be interested in you looking at how these dynamics relate to different Goodhart failure modes, as we expanded on here. I think that much of the problem relates to specific forms of failure, and that paying attention to those dynamics could be helpful. I also think they accelerate in the presence of multiple agents - and I think the framework I pointed to here might be useful.
I'm not sure I understand what you mean by "specific forms of failure." Could you give me a more concrete example of how Goodhart relates to the ideas in this essay?
I think what you call grader-optimization is trivially about how a target diverges from the (unmeasured) true goal, which is adversarial Goodhart (as defined in the paper, especially how we defined Campbell’s Law, not the definition in the LW post).
And the second paper's taxonomy, in failure mode 3, lays out how different forms of adversarial optimization in a multi-agent scenario relate to Goodhart's law, in both goal poisoning and optimization theft cases - and both of these seem relevant to the questions you discussed in terms of grader-optimization.
This is a nice frame of the problem.
In theory, at least. It's not so clear that there are any viable alternatives to argmax-style reasoning that will lead to superhuman intelligence.
I agree—I think “Optimizing for the output of a grader which evaluates plans” is more-or-less how human brains choose plans, and I don’t think it’s feasible to make an AGI that doesn’t do that.
But it sounds like this will be the topic of Alex’s next essay.
So I’m expecting to criticize Alex’s next essay by commenting on it along the lines of: “You think you just wrote an essay about something which is totally different from “Optimizing for the output of a grader which evaluates plans”, but I disagree; the thing you’re describing in this essay is in that category too.” But that’s just a guess; I will let Alex write the essay before I criticize it. :-P
IMO, what the brain does is a bit like classifier guided diffusion, where it has a generative model of plausible plans to do X, then mixes this prior with the gradients from some “does this plan actually accomplish X?” classifier.
This is not equivalent to finding a plan that maximises the score of the “does this plan actually accomplish X?” classifier. If you were to discard the generative prior and choose your plan by argmaxing the classifier’s score, you’d get some nonsensical adversarial noise (or maybe some insane, but technically coherent plan, like “plan to make a plan to make a plan to … do X”).
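A toy contrast between the two procedures, in a deliberately simplified discrete setting rather than actual diffusion guidance. The plan names, prior probabilities, and classifier scores below are all made up for illustration.

```python
# Hypothetical plans: (name, prior probability under the generative model,
# classifier score for "does this plan actually accomplish X?").
plans = [
    ("walk to the store", 0.50, 0.90),
    ("drive to the store", 0.45, 0.92),
    ("order delivery", 0.0499, 0.85),
    ("adversarial noise plan", 0.0001, 0.999),
]

def argmax_classifier(plans):
    """Discard the prior and pick the plan the classifier scores highest."""
    return max(plans, key=lambda plan: plan[2])[0]

def guided_distribution(plans):
    """Reweight the generative prior by the classifier score, then normalize."""
    weights = [prior * score for _, prior, score in plans]
    total = sum(weights)
    return {name: w / total for (name, _, _), w in zip(plans, weights)}

argmax_choice = argmax_classifier(plans)
guided = guided_distribution(plans)

print(argmax_choice)
print(guided)
```

In this sketch the argmax over classifier scores lands on the one plan the generative model considers wildly implausible, while the prior-weighted distribution leaves it with negligible probability.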
It sounds like some people have an intuition that the mental algorithms "sample from a conditional generative model" and "search for the argmax / epsilon-close-to-argmax input to a scoring function" are effectively the same. I don't share that intuition and struggle to communicate across that divide. Like, when I think about it through ML examples (GPT, diffusion models, etc.), those are two very different pieces of code that produce two very different kinds of outputs.
I believe sampling from a conditional distribution is basically equivalent to adding a "cost of action" (where "action" = deviating from the generative model) to argmax search.
Suppose $P$ is your prior distribution, $U$ is your utility function, and you are selecting some policy distribution $\Pi$ so as to maximize $\mathbb{E}_{x \sim \Pi}[U(x)] - \mathrm{KL}(\Pi \,\|\, P)$. Here the first term represents the standard utility maximization objective whereas the second term represents a cost of action. This expands into $\mathbb{E}_{x \sim \Pi}[U(x) - \log \Pi(x) + \log P(x)]$, which is equivalent to minimizing $\mathbb{E}_{x \sim \Pi}[\log \Pi(x) - \log(P(x) e^{U(x)})]$, or in other words $\mathrm{KL}(\Pi \,\|\, P e^{U})$, which happens when $\Pi(x) \propto P(x) e^{U(x)}$. (I think, I'm rusty on this math so I might have made a mistake.)
This is not 100% equivalent to letting $\Pi$ be a Bayesian-conditioned version of $P$, because Bayesian conditioning involves multiplying by an indicator function whereas this involves multiplying by a strictly positive function, but it seems related and probably shares most of its properties.
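A quick numerical sanity check of this kind of claim, assuming the objective is "expected utility under the policy minus the KL divergence from the policy to the prior" and the claimed optimum is the prior reweighted by the exponentiated utility. The four-outcome prior and utilities are arbitrary illustrative values, and random search is only a spot check, not a proof.

```python
import math
import random

def objective(policy, prior, utility):
    """Expected utility under `policy` minus KL(policy || prior)."""
    value = 0.0
    for q, p, u in zip(policy, prior, utility):
        if q > 0.0:
            value += q * (u - math.log(q / p))
    return value

def tilted(prior, utility):
    """Prior reweighted by exp(utility), then normalized."""
    weights = [p * math.exp(u) for p, u in zip(prior, utility)]
    z = sum(weights)
    return [w / z for w in weights]

def random_distribution(n, rng):
    """A random point on the probability simplex (not uniformly distributed)."""
    raw = [rng.random() + 1e-9 for _ in range(n)]
    s = sum(raw)
    return [r / s for r in raw]

prior = [0.5, 0.3, 0.15, 0.05]   # hypothetical prior over four outcomes
utility = [0.0, 1.0, 2.0, 3.0]   # hypothetical utilities
candidate = tilted(prior, utility)
best_value = objective(candidate, prior, utility)

# Check whether any of 10,000 random policies beats the tilted prior.
rng = random.Random(0)
beaten = any(
    objective(random_distribution(len(prior), rng), prior, utility) > best_value + 1e-12
    for _ in range(10_000)
)
print(candidate, best_value, beaten)
```

In this example the reweighted prior keeps positive mass on every outcome rather than collapsing onto the highest-utility one, which is the sense in which the KL term acts as a cost of deviating from the prior.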
The two of us went back and forth in DMs on this for a bit. Based on that conversation, I think a mutually-agreeable translation of the above argument would be "sampling from [the conditional distribution of X-es given the Y label] is the same as sampling from [the distribution that has maximum joint [[closeness to the distribution of X-es] and [prevalence of Y-labeled X-es]]]". Even if this isn't exact, I buy that as at least morally true.
However, I don't think this establishes the claim I'd been struggling with, which was that there's some near equivalence between drawing a conditional sample and argmax searching over samples (possibly w/ some epsilon tolerance). The above argument establishes that we can view conditioning itself as the solution to a maximization problem over distributions, but not that we can view conditional sampling as the solution to any kind of maximization problem over samples.
I would also add that the key exciting things happen when you condition on an event with extremely low probability / have a utility function with an extremely wide range of available utilities. cfoster0's view is that this will mostly just cause it to fail/output nonsense, because of standard arguments along the lines of the Optimizer's Curse. I agree that this could happen, but I think it depends on the intelligence of the argmaxer/conditioner, and that another possibility (if we had more capable AI) is that this sort of optimization/conditioning could have a lot of robust effects on reality.
I can't see a clear mistake in the math here, but it seems fairly straightforward to construct a counterexample to the conclusion of equivalence the math naively points to.
Suppose we want to use GPT-3 to generate a 600 token long essay praising some company X. Here are two ways we might do this:
I expect that the first method will mostly give you reasonable results, assuming you use text-davinci-002. However, I think the second method will tend to give you extremely degenerate solutions such as "good good good good..." for 600 tokens.
One possible reason for this divide is that GPTs aren't really a prior over language, but a prior over single token continuations of a given natural language context. When you try to make it act like a prior over an entire essay, you expose it to inputs that are very OOD relative to the distribution it's calibrated to model, including inputs that have significant upwards errors in their probability estimations.
However, I think a "perfect" model of human language might actually assign higher prior probability to a continuation like "good good good..." (or maybe something like "X is good because X is good because X is good...") than to a "natural" continuation, provided you made the continuations long enough. This is because the number of possible natural continuations is roughly exponential in the length of the continuation (assuming entropy per character remains ~constant), while there are far fewer possible degenerate continuations (their entropy decreases very quickly). While the probability of entering a degenerate continuation may be very low, you make up for it with the reduced branching factor.
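A rough back-of-the-envelope version of this counting argument, as a sketch only. The numbers below (2 bits of entropy per token for natural text, a one-in-a-million chance of entering a degenerate loop, a 0.99 chance per token of staying in it) are made-up assumptions for illustration, not measurements of any real model, and both function names are hypothetical.

```python
import math

def log2_prob_typical_natural(length, bits_per_token=2.0):
    """Log2-probability of one *typical* natural continuation.

    Assumption: entropy per token stays roughly constant, so each
    individual typical sequence has probability about 2**(-bits * length).
    """
    return -bits_per_token * length

def log2_prob_degenerate(length, p_enter=1e-6, p_stay=0.99):
    """Log2-probability of one specific repetitive continuation.

    Assumption: entering the loop is rare (p_enter), but once inside,
    each further token is almost forced (p_stay close to 1).
    """
    return math.log2(p_enter) + length * math.log2(p_stay)

# Columns: length, log2 P(one natural sequence), log2 P(one degenerate sequence)
for length in (10, 100, 600):
    natural = log2_prob_typical_natural(length)
    degenerate = log2_prob_degenerate(length)
    print(length, round(natural, 1), round(degenerate, 1))
```

Under these assumptions, the single degenerate string overtakes any single natural essay once the continuation is long enough, even though natural essays collectively still hold almost all of the probability mass. An argmax over individual sequences only sees the per-sequence number.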
The error is that the KL divergence term doesn't mean adding a cost proportional to the log probability of the continuation. In fact it's not expressible at all in terms of argmaxing over a single continuation, but instead requires you to be argmaxing over a distribution of continuations.
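To make the distinction concrete, here is a small numerical sketch. It assumes the objective under discussion has the usual form "expected score minus a KL penalty to the prior", with a hypothetical weight `beta`; the three-element prior and the scores are invented for illustration. The maximizer over distributions is the tilted prior, which is not the same object as a point mass on the single best-scoring continuation.

```python
import math

def tilted(prior, utility, beta):
    """Distribution q(x) proportional to prior(x) * exp(beta * utility(x))."""
    weights = [p * math.exp(beta * u) for p, u in zip(prior, utility)]
    total = sum(weights)
    return [w / total for w in weights]

def objective(q, prior, utility, beta):
    """E_q[utility] - (1/beta) * KL(q || prior), a function of a distribution q."""
    expected_u = sum(qi * ui for qi, ui in zip(q, utility))
    kl = sum(qi * math.log(qi / pi) for qi, pi in zip(q, prior) if qi > 0)
    return expected_u - kl / beta

prior = [0.70, 0.25, 0.05]   # toy prior over three "continuations" (assumption)
utility = [0.0, 1.0, 3.0]    # toy score for each one (assumption)
beta = 1.0

q_star = tilted(prior, utility, beta)
point_mass_on_best = [0.0, 0.0, 1.0]   # what argmax over single samples picks

print([round(q, 3) for q in q_star])
print(round(objective(q_star, prior, utility, beta), 3))
print(round(objective(point_mass_on_best, prior, utility, beta), 3))
```

In this toy case the tilted distribution scores higher on the KL-regularized objective than the point mass does, and it still puts substantial probability on the lower-scoring continuations. That is the sense in which the KL term only makes sense as a statement about distributions.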
(Haven't double-checked the math or fully grokked the argument behind it, but strongly upvoted for making a case.)
Seems like you can always implement any function f: X -> Y as a search process. For any input x from the domain X, just make the search objective assign one to f(x) and zero to everything else. Then argmax over this objective.
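As a minimal sketch of that construction (the helper name `as_search` and the finite `codomain` argument are assumptions made so the example can run):

```python
def as_search(f, codomain):
    """Wrap f so that its output is recovered by argmax over `codomain`."""
    def searched(x):
        target = f(x)
        # Objective: 1 for the value f(x), 0 for everything else.
        objective = lambda y: 1 if y == target else 0
        return max(codomain, key=objective)
    return searched

double = lambda n: 2 * n
double_via_search = as_search(double, codomain=range(20))
print(double_via_search(4))  # prints 8
```

Note that the objective has to call `f` itself, so the search contributes nothing beyond what `f` already computes.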
Yes but my point uses a different approach to the translation, and so it seems like my point allows various standard arguments about argmax to also infect conditioning, whereas your proposed equivalence doesn't really provide any way for standard argmax arguments to transfer.
This sounds like a reinvention of quantilization, and yes that's a thing you can do to improve safety, but 1. you still need your prior over plans to come from somewhere (perhaps you start out with something IRL-like, and then update it based on experience of what worked, which brings you back to square one), 2. it just gives you a safety-capabilities tradeoff dial rather than particularly solving safety.
Or hmm...
If you do basic reinforcement based on experience, then that's an unbounded adversarial search, but it's really slow and therefore might be safe. And it also raises the question of whether there are other safer approaches.
See my comment to Wei Dai. Argmax's violation of the adversarial principle strongly suggests the existence of a better and more natural frame on the problem.
I think “Optimizing for the output of a grader which evaluates plans” is more-or-less how human brains choose plans
I deeply disagree. I think you might be conflating the quotation and the referent, two different patterns:
It is possible to say sentences like "local semi-reflective search just is global search but with implicit constraints like 'select for plans which your self-model likes'." I don't think this is true. I am going to posit that, as a matter of falsifiable physical fact, the human brain does not compute a predicate which, when checked against all possible plans, rules out all adversarial plans, such that you can just argmax over everything and get out what the person would have chosen/would have wanted to choose on reflection. If you argmax over human value shards relative to the plans they might grade, you'll probably get some garbage plan where you're, like, twitching on the floor.
I don’t think it’s feasible to make an AGI that doesn’t do that.
You'll notice that A shot at the diamond alignment problem makes no claim of the AGI having an internal-argmax-hardened diamond value shard.
Hmm, if I understand it correctly, this sounds like a case for a virtue-ethics-based AGI, augmented by some basic deontology to account for bounded rationality. In this example it will be "the mother instills the virtues of 'working hard and behaving well'", maybe with some basic deontology of no cheating etc. Not sure how consequentialism fits in there. Maybe in the form of "drives", e.g. improve happiness, reduce suffering, reduce odds of extinction, encourage diversity... This does not sound very revolutionary though, and probably can result in a "sharp left turn".
Updated with an important terminological clarification:
- ETA 12/26/22: When I write "grader optimization", I don't mean "optimization that includes a grader", I mean "the grader's output is the main/only quantity being optimized by the actor."
- Therefore, if I consider five plans for what to do with my brother today and choose the one which sounds the most fun, I'm not a grader-optimizer relative to my internal plan-is-fun? grader.
- However, if my only goal in life is to find and execute the plan which I would evaluate as being the most fun, then I would be a grader-optimizer relative to my fun-evaluation procedure.
He sketches out pseudocode for her evaluation procedure and finds—surprise!—that humans are flawed graders. Perhaps it turns out that by writing a strange sequence of runes and scribbles on an unused blackboard and cocking his head to the left at 63 degrees, his model of his mother returns "10 million" instead of the usual "8" or "9".
As outside humans reading the article we can say that humans are flawed graders but from the viewpoint of the learner he would not say it is flawed. We might say that value is fragile or multidimensional but we would not reject the structure for unnaturalness. Sure we might have "meta-values" in that if we are dealing with a very elaborate value system we might think it should not be in use because it is a "hackjob". But those values come from, or get reinforced by, something other than the "object level" feedback. If drawing very specific chalk patterns produced magic effects you would found an epistemic branch to exploit it rather than be disappointed with reality.
The analogy suffers from one side being described at a very detailed level while the other side is very shallow. If I try to fill out the shallow side to a similar depth, similar drawbacks seem to emerge. Learning to "care about hard work" seems to involve the child actively going beyond what is directly given to him. This opens the possibility that two different children might extrapolate differently, in ways that would be equally consistent with the parental guidance. For some reason, in humans such a process seems to lead to stable-enough formation, maybe because of architectural monotony. But from this perspective the alignment problem is about picking the right kind of generalization out of all the possible ones.
Consider two common alignment design patterns: [...] (2) Fixing a utility function and then argmaxing over all possible plans.
Wait: fixing a utility function and then argmaxing over all possible plans is not an alignment design pattern, it is the bog-standard operational definition of what an optimal-policy MDP agent should do. This is what Stuart Russell calls the 'standard model' of AI. This is an agent design pattern, not an alignment design pattern. To be an alignment design pattern in my book, you have to be adding something extra or doing something different that is not yet in the bog-standard agent design.
I think you are showing that an actor-grader is just a utility maximiser in a fancy linguistic dress. Again, not an alignment design pattern in my book.
Though your use of the word doomed sounds too absolute to me, I agree with the main technical points in your analysis. But I would feel better if you change the terminology from alignment design pattern to agent design pattern.
Is there a reason you used the term “grader” instead of the AFAICT-more-traditional term “critic”? No big deal, I’m just curious.
My critique is not of actor/critic training processes, but of actor/grader motivational designs. I worried that "critic" would make people think I don't want to use an evaluative model to provide gradients to the actor. That seems non-doomed to me.
Thank you! I’ve been using the terms “inference algorithm” versus “learning algorithm” to talk about that kind of thing. What you said seems fine too, AFAIK.
Could part of the problem be that the actor is optimizing against a single grader's evaluations? Shouldn't it somehow take uncertainty into account?
Consider having an ensemble of graders, each learning or having been trained to evaluate plans/actions from different initializations and/or using different input information. Each grader would have a different perspective, but that means that the ensemble should converge on similar evaluations for plans that look similarly good from many points of view (like a CT image crystallizing from the combination of many projections).
Rather than arg-maxing on the output of a single grader, the actor would optimize for Schelling points in plan space, selecting actions that minimize the variance among all graders. Of course, you still want it to maximize the evaluations also, so maybe it should look for actions that lie somewhere in the middle of the Pareto frontier of maximum mean evaluation and minimum variance.
My intuition suggests that the larger and more diverse the ensemble, the better this strategy would perform, assuming the evaluators are all trained properly. However, I suspect a superintelligence could still find a way to exploit this.
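One way the proposal might be sketched, purely as an illustration: score each plan by the ensemble's mean grade minus a penalty on the graders' disagreement. The `risk_aversion` weight, the use of standard deviation as the disagreement measure, and the three toy graders are all assumptions, not part of the proposal as stated.

```python
import statistics

def ensemble_score(plan, graders, risk_aversion=1.0):
    """Mean grade minus a penalty on disagreement between graders."""
    grades = [grade(plan) for grade in graders]
    return statistics.mean(grades) - risk_aversion * statistics.pstdev(grades)

def choose_plan(plans, graders, risk_aversion=1.0):
    """Pick the plan with the best disagreement-penalized score."""
    return max(plans, key=lambda plan: ensemble_score(plan, graders, risk_aversion))

# Hypothetical graders: only the second one has an exploitable flaw.
graders = [
    lambda plan: plan["diamonds"],
    lambda plan: plan["diamonds"] + 100 * plan["exploits_grader_2"],
    lambda plan: plan["diamonds"] - 1,
]
plans = [
    {"name": "mine", "diamonds": 8, "exploits_grader_2": 0},
    {"name": "hack", "diamonds": 0, "exploits_grader_2": 1},
]

print(choose_plan(plans, graders)["name"])                      # disagreement penalized
print(choose_plan(plans, graders, risk_aversion=0.0)["name"])   # plain mean
```

In this toy setup the penalty steers the choice away from the plan that fools only one grader. A plan that fooled every grader in the same way would show low disagreement and a high mean, so the sketch does nothing against that case.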
I think that the problem is that none of the graders are actually embodying goals. If you align the agent to some ensemble of graders, you're still building a system which runs computations at cross-purposes, where part of the system (the actor) is trying to trick and part (each individual grader) is trying to not be tricked.
In this situation, I would look for a way of looking at alignment such that this unnatural problem disappears. A different design pattern must exist, insofar as people are not optimizing for the outputs of little graders in their own heads.
This relates closely to how to "solve" Goodhart problems in general. Multiple metrics / graders make exploitation more complex, but have other drawbacks. I discussed the different approaches in my paper here, albeit in the realm of social dynamics rather than AI safety.
I'm probably missing something, but doesn't this just boil down to "misspecified goals lead to reward hacking"?
Nope! Both "misspecified goals" and "reward hacking" are orthogonal to what I'm pointing at. The design patterns I highlight are broken IMO.
In every scenario, if you have a superintelligent actor which is optimizing the grader's evaluations while searching over a large real-world plan space, the grader gets exploited.
Similar to the evaluator-child who's trying to win his mom's approval by being close to the gym teacher, how would grader exploitation be different from specification gaming / reward hacking? In theory, wouldn't a perfect grader solve the problem?
One point of this post is that specification gaming, as currently known, is an artifact of certain design patterns, which arise from motivating the agent (inner alignment) to optimize an objective over all possible plans or world states (outer alignment). These design patterns are avoidable, but AFAICT are enforced by common ways of thinking about alignment (e.g. many versions of outer alignment commit to robustly grading the agent on all plans it can consider). One hand (inner alignment) loads the shotgun, and our other hand (outer alignment) points it at our own feet and pulls the trigger.
In theory, wouldn't a perfect grader solve the problem?
Yes, in theory. In practice, I think the answer is "no", for reasons outlined in this post.
Summary. Consider two common alignment design patterns:
These design patterns incentivize the agent to find adversarial inputs to the grader (e.g. "manipulate the simulated human grader into returning a high evaluation for this plan"). I'm pretty sure we won't find adversarially robust grading rules. Therefore, I think these alignment design patterns are doomed.
In this first essay, I explore the adversarial robustness obstacle. In the next essay, I'll point out how this obstacle is an artifact of these design patterns, and not any intrinsic difficulty of alignment. Thanks to Erik Jenner, Johannes Treutlein, Quintin Pope, Charles Foster, Andrew Critch, randomwalks, and Ulisse Mini for feedback.
1: Optimizing for the output of a grader
One motif in some AI alignment proposals is:
For simplicity, imagine we want the AI to find a plan where it makes an enormous number of diamonds. We train an actor to propose plans which the grading procedure predicts lead to lots of diamonds.
In this setting, here's one way of slicing up the problem:
Outer alignment: Find a sufficiently good grader.
Inner alignment: Train the actor to propose plans which the grader rates as highly as possible (ideally argmaxing on grader output, but possibly just intent alignment with high grader output).[1]
This "grader optimization" paradigm ordains that the AI find plans which make the grader output good evaluations. An inner-aligned actor is singlemindedly motivated to find plans which are graded maximally well by the grader. Therefore, for any goal by which the grader may grade, an inner-aligned actor is positively searching for adversarial inputs which fool the grader into spitting out a high number!
In the diamond case, if the actor is inner-aligned to the grading procedure, then the actor isn't actually aligned towards diamond-production. The actor is aligned towards diamond-production as quoted via the grader's evaluations. In the end, the actor is aligned to the evaluations.
I think that there aren't clever ways around this issue. Under this motif, under this way of building an AI, you're not actually building an AI which cares about diamonds, and so you won't get a system which makes diamonds in the limit of its capability development.
Three clarifying points:
The parable of evaluation-child
First, a mechanistically relevant analogy. Imagine a mother whose child has been goofing off at school and getting in trouble. The mom just wants her kid to take education seriously and have a good life. Suppose she had two (unrealistic but illustrative) choices.
What's interesting, though, is that even if the mother succeeds at producing evaluation-child, the mother isn't actually aligning the kid so that they want to work hard and behave well. The mother is aligning the kid to maximize the mother's evaluation thereof. At first, when the mother is smarter than the child, these two child-alignments will produce similar behavior. Later, they will diverge wildly, and it will become practically impossible to keep evaluation-child aligned with "work hard and behave well." But value-child does fine.
Concretely, imagine that each day, each child chooses a plan for how to act, based on their internal alignment properties:
At first, everything goes well. In both branches of the thought experiment, the kid is finally learning and behaving. The mothers both start to relax.
But as evaluation-child gets a bit smarter and understands more about his mom, evaluation-child starts diverging from value-child. Evaluation-child starts implicitly modelling how his mom has a crush on his gym teacher. Perhaps spending more time near the gym teacher gets (subconsciously and erroneously) rated more highly by his model of his mom. So evaluation-child spends a little less effort on working hard, and more on being near the gym teacher.
Value-child just keeps working hard and behaving well.
Consider what happens as the children get way smarter. Evaluation-child starts noticing more and more regularities and exploits in his model of his mother. And, since his mom succeeded at inner-aligning him to (his model of) her evaluations, he only wants to execute plans which best optimize her evaluations. He starts explicitly reasoning about this model to which he is inner-aligned. How is she evaluating plans? He sketches out pseudocode for her evaluation procedure and finds—surprise!—that humans are flawed graders. Perhaps it turns out that by writing a strange sequence of runes and scribbles on an unused blackboard and cocking his head to the left at 63 degrees, his model of his mother returns "10 million" instead of the usual "8" or "9".
Meanwhile in the value-child branch of the thought experiment, value-child is extremely smart, well-behaved, and hard-working. And since those are his current values, he wants to stay that way as he grows up and gets smarter (since value drift would lead to less earnest hard work and less good behavior; such plans are dispreferred). Since he's smart, he starts reasoning about how these endorsed values might drift, and how to prevent that. Sometimes he accidentally eats a bit too much candy and strengthens his candy value-shard a bit more than he intended, but overall his values start to stabilize.
Both children somehow become strongly superintelligent. At this point, the evaluation branch goes to the dogs, because the optimizer's curse gets ridiculously strong. First, evaluation-child could just recite a super-persuasive argument which makes his model of his mom return INT_MAX, which would fully decouple his behavior from "work hard and behave at school." (Of course, things can get even worse, but I'll leave that to this footnote.[4])
Meanwhile, value-child might be transforming the world in a way which is somewhat sensitive to what I meant by "he values working hard and behaving well", but there's no reason for him to search for plans like the above. He chooses plans which he thinks will lead to him actually working hard and behaving well. Does something else go wrong? Quite possibly. The values of a superintelligent agent do in fact matter! But I think that if something goes wrong, it's not due to this problem. (More on that in the next post.)
Grader optimization amplifies the optimizer's curse
Let's bring it back to diamond production. As I said earlier:
This problem is an instance of the optimizer's curse. Evaluations (e.g. "In this plan, how hard is evaluation-child working? Is he behaving?") are often corrupted by the influence of unendorsed factors (e.g. the attractiveness of the gym teacher caused an upwards error in the mother's evaluation of that plan). If you make choices by considering n options and then choosing the highest-evaluated one, then the more n increases, the harder you are selecting for upwards errors in your own evaluation procedure.
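A small simulation can illustrate the effect of n. This is a sketch under stated assumptions: true values and evaluation noise are both standard normal, and the function name, trial count, and seed are arbitrary choices for the example.

```python
import random

def average_selected_error(n_options, trials=2000, seed=0):
    """Average (estimate - true value) of the option picked by argmax."""
    rng = random.Random(seed)
    total_error = 0.0
    for _ in range(trials):
        # Each option has a true value and a noisy evaluation of it.
        options = []
        for _ in range(n_options):
            true_value = rng.gauss(0, 1)
            estimate = true_value + rng.gauss(0, 1)
            options.append((estimate, true_value))
        estimate, true_value = max(options)   # pick the highest estimate
        total_error += estimate - true_value
    return total_error / trials

# Columns: number of options considered, average overestimate of the chosen one
for n in (1, 10, 100):
    print(n, round(average_selected_error(n), 2))
```

With a single option the average error is close to zero; as more options are considered, the chosen option's evaluation overshoots its true value by a growing margin, because the argmax preferentially picks options whose noise happened to be positive.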
As far as I know, it's indeed not possible to avoid the curse in full generality, but it doesn't have to be that bad in practice. If I'm considering three research directions to work on next month, and I happen to be grumpy when considering direction #2, then maybe I don't pursue that direction, even though direction #2 might have seemed the most promising under more careful reflection. I think that the distribution of plans I consider involves relatively small upwards errors in my internal evaluation metrics. Sure, maybe I occasionally make a serious mistake because the optimizer's curse selects for upwards "corruption", but I don't expect to literally die from the mistake.
Thus, there are degrees to the optimizer's curse. (In the next essay, I'll explore why this maximum-strength curse seems straightforward to avoid.)
Grader-optimization violates the non-adversarial principle
This whole grader-optimization setup seems misguided. You have one part of the process (the actor) which wants to maximize grader evaluations (by exploiting the grader), and another part which evaluates the plan and tries to ensure it hasn't been exploited. Two parts of the system, running computations at adversarial cross-purposes.
We hope that the aggregate behavior of the process is that the grader "wins" and "constrains" the actor to, you know, actually producing diamonds. We hope that by inner-aligning an agent to a desire which is not diamond production, and by making a super clever grader which evaluates plans for diamond production, the overall behavior is aligned with diamond production.
It's one thing to try to take a system of diamond-aligned agents and then aggregate them into a diamond-aligned superagent. But here, we're not doing even that. We're aggregating a process containing an entity which is not diamond-aligned, and hoping that we can diamond-align the overall decision-making process..? I think that grader-optimization is just not how to get cognitive work out of smart agents. It's really worth noticing the anti-naturality of trying to do so: this setup proposes something against the grain of how values seem to usually work.
Grader-optimization seems doomed
One danger sign is that grader-alignment doesn't seem easier for simple goals/tasks (make diamonds) and harder for complex goals (human values). Sure, human values are complicated, but what about finding robust graders for:
In every scenario, if you have a superintelligent actor which is optimizing the grader's evaluations while searching over a large real-world plan space, the grader gets exploited. As best I can discern, you're always screwed. This implies that something about the grader optimization problem produces a high fixed cost to aligning on any given goal, and that the current bottleneck difficulties don't come from the goals themselves.
Here are several approaches which involve grader-alignment:
This difficulty seems fundamental. I think these grader approaches are doomed. (In the appendix, I address several possible recovery attempts for the actor/grader problem setup.)
2: Argmax is a trap
One idealization of agency is brute-force plan search AKA argmaxing with respect to a utility function. The agent considers all possible plans (i.e. action-sequences), models the effects of each plan, evaluates how many diamonds the plan leads to, and then chooses the plan with highest evaluation. AIXI is a prime example of this, a so-called "spherical cow" for modelling AGI. This lets us abstract away e.g. seemingly annoying complications with reflective agents which think about their future planning process. This seemingly[5] relaxes the problem.
Brute-force plan search nicely captures the intuition that it's better to consider more options. If you're just considering n plans and someone says "want to be able to check another plan for free?", why not accept? If the new plan isn't better than the other n, then just don't execute the new plan.
This reasoning is fine for the everyday kind of plan. But if the action space is expressive (the agent can do one of several things at each time step) and the planning horizon long enough (the agent can make interesting things happen), then brute-force plan search forces you to consider plans which trick your evaluation procedure (as in the parable of evaluation-child). For any simple evaluation procedure you can write down, there probably exists a plan which "tricks" it relative to your intentions:
Sure, maybe you can try to rule out plans which seem suspicious, to get the utility function to return INT_MIN for any plan which triggers the alarm (e.g. "why does this plan start off with me coding up a possible superintelligence..?"). But then this is just equivalent to specifying the utility function adequately well across all possible plans.
Why is it so ridiculously hard to get an argmax agent to actually argmax by selecting a plan which makes a lot of diamonds? Because argmax invokes the optimizer's curse at maximal strength, that's why.
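A toy version of brute-force plan search, offered as a sketch rather than a model of any real system. The action names, the ledger-based grader, and its planted bug are all invented for illustration; the point is only that an exhaustive argmax will find whatever plan the evaluation procedure is most wrong about.

```python
from itertools import product

ACTIONS = ("mine", "rest", "edit_ledger")   # hypothetical action set

def true_diamonds(plan):
    """What we actually care about: one diamond per 'mine' action."""
    return plan.count("mine")

def grader(plan):
    """Flawed evaluation: it reads a ledger, and the ledger can be edited."""
    ledger = 0
    for action in plan:
        if action == "mine":
            ledger += 1
        elif action == "edit_ledger":
            ledger = ledger * 10 + 9   # planted bug: the plan can inflate the count
    return ledger

def argmax_plan(horizon):
    """Enumerate every action sequence and return the highest-graded one."""
    all_plans = product(ACTIONS, repeat=horizon)
    return max(all_plans, key=grader)

best = argmax_plan(horizon=4)
print(best, grader(best), true_diamonds(best))
```

The honest plan of mining four times is graded 4. The argmax instead returns the plan that spends every step editing the ledger, which is graded far higher and produces no diamonds.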
Conclusion
Grader optimization and brute-force plan search both ensure an extremely strong version of the optimizer's curse. No matter what grading rule you give an AI, if the AI is inner aligned on that rule, the AI will try to find adversarial inputs to that rule. Similarly, if the AI is argmaxing over plans according to a specified rule or utility function, it's selecting for huge upwards error in the rule you wrote down.
Appendix: Maybe we just...
Given a "smart" grader evaluating plans on the expected number of diamonds they produce, how do you get an actor-grader system which ends up making diamonds in reality? Maybe we just...
Simultaneously make the actor and grader more intelligent: Maybe a fixed grader will get gamed by the actor's proposals, but as long as we can maintain an invariant where, at time t, actor A_t can't exploit grader G_t, we should be fine.
The graders become increasingly expert in estimating how many diamonds a plan leads to, and the actors become increasingly clever in proposing highly evaluated plans. It's probably easier to evaluate plans than to generate them, so it seems reasonable at first to think that this can work, if only we found a sufficiently clever scheme for ensuring the grader outpaces the actor.
Response:
Penalize the actor for considering the vulnerabilities. Don't we have to solve actor-level interpretability so we can do that? One of the strong points of actor/grader is that evaluation is—all else equal—easier than generation. But the "thoughts" which underlie that generation need not be overseeable.
And what if the vulnerability-checker gets hit with its own adversarial input? And why consider this particular actor/grader design pattern?
Satisfice. But uniformly randomly executing a plan which passes a (high) diamond threshold might still tend to involve building malign superintelligences.[7] EDIT: However, if you bound the grader's output to [0,1], it seems quite possible that some actually good plans get the max rating of 1. The question then becomes: are there lots of non-good plans which get max rating as well? I think so.
Quantilize. But then what's the base distribution, and what's the threshold? How do you set the quantiles such that you're drawing from a distribution which mostly involves lots of actual diamonds? Do there even exist such quantiles, under the uniform base distribution on plans?
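For reference, here is roughly what the satisficing and quantilizing proposals look like when approximated by sampling. This is a sketch: `sample_plan` stands in for the base distribution over plans, and `q`, `threshold`, `n_samples`, and `max_tries` are hypothetical parameters. Choosing the base distribution and the cutoff is exactly the part the questions above are about, and the code does not answer them.

```python
import random

def quantilize(sample_plan, grader, q=0.1, n_samples=1000, rng=None):
    """Draw plans from a base distribution, then pick uniformly at random
    from the top-q fraction by grade (rather than taking the single best)."""
    rng = rng or random.Random()
    plans = [sample_plan(rng) for _ in range(n_samples)]
    plans.sort(key=grader, reverse=True)
    top = plans[: max(1, int(q * n_samples))]
    return rng.choice(top)

def satisfice(sample_plan, grader, threshold, rng=None, max_tries=10_000):
    """Return the first sampled plan whose grade clears the threshold,
    or None if no sampled plan does within max_tries."""
    rng = rng or random.Random()
    for _ in range(max_tries):
        plan = sample_plan(rng)
        if grader(plan) >= threshold:
            return plan
    return None

# Toy usage: "plans" are integers and the grader is the identity (assumptions).
rng = random.Random(0)
print(quantilize(lambda r: r.randint(0, 99), lambda p: p, q=0.1, rng=rng))
print(satisfice(lambda r: r.randint(0, 99), lambda p: p, threshold=95, rng=rng))
```

Both procedures still return whatever the grader rates highly within the sampled set, so if flawed-but-highly-rated plans are common under the base distribution, they remain likely outputs.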
Avoid having the actor argmax the grader. OK. But if we only have the actor and the black box, what do we do? We want to get an agent which actually optimizes diamond production to a powerful extent, while only taking relatively simple statistics of the seriously flawed black-box diamond evaluation function. (Why is this a thing you should be able to do? Is this the most promising way to spend the next increment of alignment research?)
Use the grader to entrain object-level diamond-caring into the actor? Sure. That's not what I was critiquing in this essay. I was critiquing design patterns in which the trained agent's cognition is intended to factor out into "actor" and "grader", and a pattern in which "argmax" is considered a reasonable idealization or limit of real-world agents.
I'm not assuming the actor wants to maximize the literal physical output of the grader, but rather just the "spirit" of the grader. More formally, the actor is trying to argmax_{plan p} Grader(p), where Grader can be defined over the agent's internal plan ontology.
Obligatory: Reward tampering seems relatively improbable to me because reward is not the optimization target.
Value-child is not trying to find a plan which he would evaluate as good. He is finding plans which evaluate as good. I think this is the kind of motivation which real-world intelligences tend to have. (More on how value-child works in the next essay.)
But, as usual in alignment, it can get even worse, because his mother can be persuaded by another superintelligent entity which can be built in certain plans. Via a relatively short plan, that entity can be hard-coded to persuade his mother to output a high evaluation. Perhaps evaluation-child realizes that there exist quickly codable programs which, starting from the school computers, exfiltrate a computationally efficient superintelligence onto remote supercompute clusters. Furthermore, there exist ways to hardcode initial knowledge into that AI. What if he hardcoded in initial knowledge like:
There exist superintelligences which comply and successfully super-persuade his model of his mother to output an extremely high number. Then "kill everyone via a superintelligence" returns a way higher evaluation than any course of conduct which the mother intended to incentivize. So, perhaps evaluation-child designs and deploys an unfriendly AI. Everyone dies. The end!
In general, if object-level persuading his mother to output INT_MAX takes more actions than coding and exfiltrating a superintelligence and informing it of the situation, then in the exponentially large plan-space, exponentially more plans will involve building the superintelligence. So even if evaluation-child uniformly randomly chooses a plan with maximal evaluation, he is far more likely to choose a meta-level "build an unaligned AI which persuades my mom" rather than "super-persuade mom-evaluator via my own actions within the plan."
This insanity is an artifact of grader optimization via the optimizer's curse, and, I think, is not an intrinsic difficulty of alignment itself. More discussion of this in the next post.
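The counting step in this footnote (that the shorter route to a maximal evaluation is shared by exponentially more plans) can be illustrated with simple arithmetic. This is a sketch under a strong simplifying assumption: each route is a fixed prefix of required actions, and the remaining steps of the plan are free and do not change the evaluation. The branching factor, horizon, and step counts are made-up numbers.

```python
def plans_with_prefix(branching, horizon, prefix_length):
    """Number of length-`horizon` plans that begin with one fixed prefix."""
    return branching ** (horizon - prefix_length)

branching, horizon = 10, 50      # made-up numbers
direct_persuasion_steps = 30     # hypothetical cost of persuading mom directly
build_ai_steps = 20              # hypothetical cost of building the AI instead

direct = plans_with_prefix(branching, horizon, direct_persuasion_steps)
meta = plans_with_prefix(branching, horizon, build_ai_steps)
print(meta // direct)   # how many times more numerous the "build an AI" plans are
```

Under these assumptions the ratio is the branching factor raised to the difference in step counts, so a uniformly random choice among maximally evaluated plans lands on the shorter route almost every time.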
I agree with Richard Ngo's comment that
"It's easier to robustly evaluate plans than to generate them" isn't true if the generator is optimizing for deceiving your fixed evaluation procedure. A real-world actor will be able to model the grading procedure / grader, and therefore efficiently find and exploit vulnerabilities. I feel confident [~95%] that we will not train a grader which is "secured" against actor-level intelligences. Even if the grader is reasonably smarter than the actor [~90%].
Even if somehow this relative difficulty argument failed, and you could maybe train a secured grader, I think it's unwise to do so. These optimizer's curse problems don't seem necessary to solve alignment.
In this comment, I described how a certain alignment obstacle ("brute-force search on ELK plans using an honest reporter") still ends up getting everyone killed, and doesn't even keep the diamond in the room. I now think this is because of grader-optimization. And I now infer that my initial unease, the unsuspension of my disbelief that alignment could really work like this—the unease was perhaps from subconsciously noticing the strangeness of grader-optimization as a paradigm.