
Ivan Vendrov3y100

Agreed with John, with the caveat that I expect search processes + simple objectives to only emerge from massively multi-task training. If you're literally training an AI just on smiling, TurnTrout is right that "a spread of situationally-activated computations" is more likely since you're not getting any value from the generality of search.

The Deep Double Descent paper is a good reference for why gradient descent training in the overparametrized regime favors low complexity models, though I don't know of explicit evidence for the conjecture that "explicit search + simple objectives" is actually lower complexity (in model space) than "bundle of heuristics". Seems intuitive if model complexity is something close to Kolmogorov complexity, but would love to see an empirical investigation!



[ Question ]

Seriously, what goes wrong with "reward the agent when it makes you smile"?

by TurnTrout

11th Aug 2022

AI Alignment Forum

2 min read


87 Ω 40

Suppose you're training a huge neural network with some awesome future RL algorithm with clever exploration bonuses and a self-supervised pretrained multimodal initialization and a recurrent state. This NN implements an embodied agent which takes actions in reality (and also in some sim environments). You watch the agent remotely using a webcam (initially unbeknownst to the agent). When the AI's activities make you smile, you press the antecedent-computation-reinforcer button (known to some as the "reward" button). The agent is given some appropriate curriculum, like population-based self-play, so as to provide a steady skill requirement against which its intelligence is sharpened over training. Supposing the curriculum trains these agents out until they're generally intelligent—what comes next?

The standard response is "One or more of the agents gets smart, does a treacherous turn, kills you, and presses the reward button forever."
- But reward is not the optimization target. This story isn't impossible, but I think it's pretty improbable, and definitely not a slam-dunk.
Another response is "The AI paralyzes your face into smiling."
- But this is actually a highly nontrivial claim about the internal balance of value and computation which this reinforcement schedule carves into the AI. Insofar as this response implies that an AI will primarily "care about" literally making you smile, that seems like a highly speculative and unsupported claim about the AI internalizing a single powerful decision-relevant criterion / shard of value, which also happens to be related to the way that humans conceive of the situation (i.e. someone is being made to smile).

My current answer is "I don't know precisely what goes wrong, but probably something does, but also I suspect I could write down mechanistically plausible-to-me stories where things end up bad but not horrible." I think the AI will very probably have a spread of situationally-activated computations which steer its actions towards historical reward-correlates (e.g. if near a person, then tell a joke), and probably not singularly value e.g. making people smile or reward. Furthermore, I think its values won't all map on to the "usual" quantities-of-value:

80% credence: It's very hard to train an inner agent which reflectively equilibrates to an EU maximizer only over commonly-postulated motivating quantities (like # of diamonds or # of happy people or reward-signal) and not quantities like (# of times I have to look at a cube in a blue room or -1 * subjective micromorts accrued).

So, I'm pretty uncertain about what happens here, but would guess that most other researchers are less uncertain than I am. So here's an opportunity for us to talk it out!

(My mood here isn't "And this is what we do for alignment, let's relax." My mood is "Why consider super-complicated reward and feedback schemes when, as far as I can tell, we don't know what's going to happen in this relatively simple scheme? How do reinforcement schedules map into inner values?")

Reward Functions, AI

Frontpage




8 Answers sorted by
top scoring

johnswentworth

Aug 12, 2022

Ω214913

I think the main concept missing here is compression: trained systems favor more compact policies/models/heuristics/algorithms/etc. The fewer parameters needed to implement the inner agent, the more parameters are free to vary, and therefore the more parameter-space-volume the agent takes up and the more likely it is to be found. (This is also the main argument for why overparameterized ML systems are able to generalize at all.)
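The counting step in this argument can be sketched with a toy model. Everything here is an assumption made for illustration: parameters are treated as independent binary switches, and a policy is treated as "implemented" whenever a fixed subset of them (`n_pinned`, a hypothetical name) takes specific values. Real networks are not like this, so the sketch only shows why fewer pinned parameters means more parameter-space volume, not that trained networks actually behave this way.

```python
from fractions import Fraction


def volume_fraction(n_params: int, n_pinned: int) -> Fraction:
    """Fraction of all 2**n_params binary settings that implement a policy
    which pins n_pinned parameters and leaves the rest free to vary."""
    if not 0 <= n_pinned <= n_params:
        raise ValueError("n_pinned must be between 0 and n_params")
    n_free = n_params - n_pinned
    # Each free parameter doubles the number of settings that still work.
    return Fraction(2 ** n_free, 2 ** n_params)


# Toy comparison (assumed numbers): a compact policy pinning 5 of 20
# parameters versus a bulky one pinning 15 of 20.
compact = volume_fraction(20, 5)
bulky = volume_fraction(20, 15)
ratio = compact / bulky  # how many times more volume the compact policy has
```

In this toy, every extra parameter a policy needs halves the volume it occupies, so a policy needing ten fewer parameters occupies 2**10 times as much of the space.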

The outer training loop doesn't just select for high reward, it also implicitly selects for compactness. We expect it to find, not just policies which achieve high reward, but policies which are very compactly represented.

Compression is the main reason we expect inner search processes to appear. Here's the relevant argument from Risks From Learned Optimization:

In some tasks, good performance requires a very complex policy. At the same time, base optimizers are generally biased in favor of selecting learned algorithms with lower complexity. Thus, all else being equal, the base optimizer will generally be incentivized to look for a highly compressed policy.
One way to find a compressed policy is to search for one that is able to use general features of the task structure to produce good behavior, rather than simply memorizing the correct output for each input. A mesa-optimizer is an example of such a policy. From the perspective of the base optimizer, a mesa-optimizer is a highly-compressed version of whatever policy it ends up implementing: instead of explicitly encoding the details of that policy in the learned algorithm, the base optimizer simply needs to encode how to search for such a policy. Furthermore, if a mesa-optimizer can determine the important features of its environment at runtime, it does not need to be given as much prior information as to what those important features are, and can thus be much simpler.

The same argument applies to the terminal objectives/heuristics/proxies instilled in an RL-trained system: it may not terminally value the reward button being pushed or the human smiling or whatever, but its values should be generated from a relatively small, relatively simple set of things. For instance, a plausible Fermi estimate for humans is that our values are ultimately generated from ~tens of simple proxies. (And I would guess that modern ML training would probably result in even fewer, relative to human evolution.)

Furthermore, whatever terminal values are instilled in the RL-trained system, they do need to at least induce near-perfect optimization of the feedback signal on the training set; otherwise the outer training loop would select some other parameters. The outer training loop is still an optimization process, after all, so whatever policy the trained system ends up with should still be roughly-optimal. (There's some potential wiggle room here insofar as the AI which takes off will be the first one to pass the threshold, and that may happen during a training run before convergence, but I think that's probably not central to discussion here?)

Putting that all together: we don't know that the AI will necessarily end up optimizing reward-button-pushes or smiles; there may be other similarly-compact proxies which correlate near-perfectly with reward in the training process. We can probably rule out "a spread of situationally-activated computations which steer its actions towards historical reward-correlates", insofar as that spread is a much less compact policy-encoding than an explicit search process + simple objective(s).

[-]Ivan Vendrov3y100

7Ivan Vendrov3y

Thinking about this more, I think gradient descent (at least in the modern regime) probably doesn't select for inner search processes, because it's not actually biased towards low Kolmogorov complexity. More in my standalone post, and here's a John Maxwell comment making a similar point.

[-]Thane Ruthenis3y93

We can probably rule out "a spread of situationally-activated computations which steer its actions towards historical reward-correlates", insofar as that spread is a much less compact policy-encoding than an explicit search process + simple objective(s).

Not sure if I disagree with the object-level assertion, but I think some important caveats are missing here. We also have to take into account the plausible paths through algorithm-space that SGD is likely to take, and that might change the form of the final compressed policy in non-intuitive ways.

Another compact poli... (read more)

[-]Quintin Pope3y51

We can probably rule out "a spread of situationally-activated computations which steer its actions towards historical reward-correlates", insofar as that spread is a much less compact policy-encoding than an explicit search process + simple objective(s).

Seems like you can have a yet-simpler policy by factoring the fixed "simple objective(s)" into implicit, modular elements that compress many different objectives that may be useful across many different environments. Then at runtime, you feed the environmental state into your factored representation of poss... (read more)

4Daniel Kokotajlo3y

Can you explain why this policy is yet-simpler? It sounds more complicated to me.

2Quintin Pope3y

I’m saying that it’s simpler to have a goal generator that can be conditioned on the current environment, rather than memorizing each goal individually.

4Thane Ruthenis3y

That seems like a semantical difference? We may just as well call these modular elements the "objectives", with them having different environment-specific local implementations. E. g., if my goal is "winning", it would unfold into different short-term objectives depending on whether I'm playing chess or football, but we can still meaningfully call it a "goal".

3Quintin Pope3y

I'm confident that this is not a semantic difference. The modular elements I was describing represent a process for determining your objectives, depending on the environment and your current beliefs. It would be a type error to call them "objectives", just as it would be a type error to call a search process your "plans". They each represent compressions of possible objectives / plans, but are not those things themselves. Similarly, it would be incorrect to call a GPT model a "collection of sentences", even though it is essentially a compression over many possible sentences.
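One way to read the "type error" claim is as a literal difference in type signatures. The sketch below is only an illustration: the names `Objective`, `ObjectiveGenerator`, and `winning_generator`, and the state keys, are hypothetical, and the chess/football cases are borrowed from the example earlier in this thread.

```python
from typing import Callable, Dict

WorldState = Dict[str, float]
Environment = str
Beliefs = Dict[str, float]

# An objective scores world states.
Objective = Callable[[WorldState], float]
# A generator of objectives takes the environment and beliefs and
# returns an objective. It is not itself an objective.
ObjectiveGenerator = Callable[[Environment, Beliefs], Objective]


def winning_generator(env: Environment, beliefs: Beliefs) -> Objective:
    """Toy generator: one compact rule that unfolds into different
    objectives per environment. `beliefs` is ignored in this sketch."""
    if env == "chess":
        return lambda state: state.get("opponent_checkmated", 0.0)
    if env == "football":
        return lambda state: (
            state.get("goals_for", 0.0) - state.get("goals_against", 0.0)
        )
    raise ValueError(f"no objective known for environment {env!r}")


chess_objective = winning_generator("chess", {})
football_objective = winning_generator("football", {})
```

Passing a world state directly to `winning_generator` would not type-check, which is the distinction being drawn: the generator and the objectives it outputs live at different levels.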

1Thane Ruthenis3y

Okay, suppose we feed many environment-states into some factored representation of possible objectives, and generate a lot of (environment, objectives) mappings for a given agent. In your model, is it possible to summarize these results somehow; is it possible to say something general about what the agent is trying to do in all of these environments? (E. g., like my football & chess example.)

2Quintin Pope3y

Yes, it’s possible to do summary statistics on the outputted goals, just like you can do summary statistics on the outputs of GPT-3, or in the plans produced by a given search algorithm. That doesn’t make generators of these things have the same type signature as the things themselves. My counterpoint to John is specifically about the sort of computational structures that can represent goals, while being both simple AND environment/belief-dependent. I’m saying simplicity does not push against representing goals in an environment-dependent way, because your generator of goals can be conditioned on the environment.

2Thane Ruthenis3y

How "meaningful" would that summary be? Does my "winning at chess vs football" analogy fit what you're describing, with "winning" being the compressed objective-generator and the actual win conditions of chess/football being the environment-specific objectives?

3Quintin Pope3y

My point is that you can have “goals” (things your search process steers the world towards) and “generators of goals”. These are different things, and you should not use the same name for them. More specifically, there is a difference in the computational type signature between generators and the things they generate. You can call these two things by whatever label you like, but they are not the same thing. You can look at a person’s plans / behavior in many different games and conclude that it demonstrates a common thread which you might label “winning”. But you should not call the latent cognitive generators responsible for this common thread by the same name you use for the world states the person’s search process steers towards in different environments.

3Thane Ruthenis3y

Alright, then it is a semantics debate from my perspective. I don't think we're actually disagreeing, now. Your "objective-generators" cleanly map to my "goals", and your "objectives" to my "local implementations of goals" (or maybe "values" and "local interpretations of values"). That distinction definitely makes sense at the ground level. In my ontology, it's a distinction between what you want and what achieving it looks like in a given situation. I think it makes more sense to describe it my way, though, since I suspect a continuum of ever-more-specific/local objectives ("winning" as an environment-independent goal, "winning" in this type of game, "winning" against the specific opponent you have, "winning" given this game and opponent and the tactic they're using), rather than a dichotomy of "objective-generator" vs "objective", but that's a finer point.

1Thane Ruthenis3y

Although, digging into the previously-mentioned finer points, I think there is room for some meaningful disagreement. I don't think there are goal-generators as you describe them. I think there are just goals, and then some plan-making/search mechanism which does goal translation/adaptation/interpretation for any given environment the agent is in. I. e., the "goal generators" are separate pieces from the "ur-goals" they take as input. And as I'd suggested, there's a continuum of ever-more specific objectives. In this view, I think the line between "goals" and "plans" blurs, even, so that the most specific "objectives" are just "plans". In this case, the "goal generator" is just the generic plan-making process working in a particular goal-interpreting regime. (Edited-in example: "I want to be a winner" -> "I want to win at chess" -> "I want to win this game of chess" -> "I want to decisively progress towards winning in this turn" -> "I want to make this specific move". The early steps here are clear examples of goal-generation/translation (what does winning mean in chess?), the latter clear examples of problem-solving (how do I do well this turn?), but they're just extreme ends of a continuum.) The initial goal-representations from which that process starts could be many things — mathematically-precise environment-independent utility functions, or goals defined over some default environment (as I suspect is the case with humans), or even step-one objective-generators, as you're suggesting. But the initial representation being an objective-generator itself seems like a weirdly special case, not how this process works in general.

3johnswentworth3y

That sure does sound like a description of a search algorithm, right there.

5Quintin Pope3y

I'm not objecting to your assertion that some sort of search takes place. I'm objecting to your characterization of what sorts of objectives the search ends up pointed towards. Basically, I'm saying that "situationally activated heuristics that steer towards environment-dependent goals" is totally in line with a simplicity prior over cognitive structures leading to a search-like process. The whole reason you say that we should expect search processes is because they can compress many different environment- and belief-dependent plans into a simpler generator of such plans (the search), which takes in environment info, beliefs, and the agent's simple, supposedly environment-independent, objectives, and produces a plan. So, the agent only needs to store the search process and its environment-independent objectives. I'm saying you can apply a similar "compress into an environment / beliefs conditioned generator" trick to the objectives as well, and get a generator of objectives that conditions on the environment and current beliefs to produce objectives for the search process. Thus, objectives remain environment-dependent, and will probably steer towards world states that resemble those which were rewarded during training. I think this is quite similar to "a spread of situationally-activated computations which steer its actions towards historical reward-correlates", if involving rather more sophisticated cognition than phrases like "contextually activated heuristics" often imply.

[-]TurnTrout3y*Ω440

We can probably rule out "a spread of situationally-activated computations which steer its actions towards historical reward-correlates", insofar as that spread is a much less compact policy-encoding than an explicit search process + simple objective(s).

Here's what I think you mean by an explicit search process:

In every situation, the neural network runs e.g. MCTS with a fixed leaf evaluation function (the simple objective).
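As a rough picture of "explicit search + a fixed leaf evaluation function", here is a minimal sketch. It is not MCTS: it swaps in plain exhaustive depth-limited lookahead to stay short, and the integer states, the three actions, and the target of 10 are all made up for illustration.

```python
from typing import Callable, Dict, Optional, Tuple

State = int

# Hypothetical toy actions over integer states.
ACTIONS: Dict[str, Callable[[State], State]] = {
    "inc": lambda s: s + 1,
    "dec": lambda s: s - 1,
    "double": lambda s: s * 2,
}


def leaf_value(state: State) -> float:
    """The fixed 'simple objective': prefer states close to 10."""
    return -abs(state - 10)


def search(state: State, depth: int) -> Tuple[float, Optional[str]]:
    """Exhaustive lookahead to `depth` steps; returns the best reachable
    leaf value and the first action on a path that achieves it."""
    if depth == 0:
        return leaf_value(state), None
    best_value, best_action = float("-inf"), None
    for name, step in ACTIONS.items():
        value, _ = search(step(state), depth - 1)
        if value > best_value:
            best_value, best_action = value, name
    return best_value, best_action


value_from_4, first_action_from_4 = search(4, depth=2)
```

The same `search` and the same `leaf_value` are reused in every situation; only the starting state changes. That is the property the quoted understanding attributes to the network.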

On this understanding of your argument, I would be surprised if it went through. Here are a few quick counterpoints.

Outside tiny

... (read more)

[-]johnswentworth3yΩ7100

Do you mean "hardcoded reward circuit"

I'm not that committed to the RL frame, but roughly speaking yes. Whatever values we have are probably generated by ~tens of hardcoded things. Anyway, on to the meat of the discussion...

It seems like a whole bunch of people are completely thrown off by use of the word "search". So let's taboo that and talk about what's actually relevant here.

We should expect compression, and we should expect general-purpose problem solving (i.e. the ability to take a fairly arbitrary problem in the training environment and solve it reasonably well). The general-purpose part comes from a combination of (a) variation in what the system needs to do to achieve good performance in training, and (b) the recursive nature of problem solving, i.e. solving one problem involves solving a wide variety of subproblems. Compactness means that it probably won't be a whole boatload of case-specific heuristics; lookup tables are not compact. A subroutine for reasonably-general planning or problem-solving (i.e. take a problem statement, figure out a plan or solution) is the key thing we're talking about here. Possibly a small number of such subroutines for a few different problem... (read more)

6Nora Belrose3y

I agree that AGI will need general purpose problem solving routines (by definition). I also agree that this requires something like recursive decomposition of problems into subproblems. I'm just very skeptical that the kinds of neural nets we're training right now can learn to do anything remotely like that— I think it's much more likely that people will hard code this type of reasoning into the compute graph with stuff like MCTS. This has already been pretty useful for e.g. MuZero. Once we're hard coding search it's less scary because it's more interpretable and we can see exactly where the mesaobjective is. I also don't really buy the compactness argument at all. I think neural nets are biased toward flat minima / broad basins but these don't generally correspond to "simple" functions in the Kolmogorov sense; they're more like equivalence classes of diverse bundles of heuristics that all get about the same train and val loss. I'm interpreting this paper as providing some evidence in that direction.

2johnswentworth3y

I hope that you're right; that would make Retargeting The Search very easy, and basically eliminates the inner alignment problem. Assuming, of course, that we can somehow confidently rule out the rest of the net doing any search in more subtle ways.

4TurnTrout3y

This seems like roughly what I had in mind by "contextually activated computations" (probably with a few differences about when/how the subroutines will be goal-agnostic). I was imagining computations like "contextually activated cached death-avoidance policy influences" and "contextually activated steering of plans towards paperclip production, in generalizations of the historical reinforcement contexts for paperclip-reward."

Thane Ruthenis

Aug 12, 2022

162

I think the AI will very probably have a spread of situationally-activated computations which steer its actions towards historical reward-correlates (e.g. if near a person, then tell a joke), and probably not singularly value e.g. making people smile or reward

I agree. My recent write-up is partly an attempt to model this dynamic in a toy causal-graph environment. Most relevantly, this section.

Imagine an environment represented as a causal graph, with some action-nodes an agent can set, observation-nodes whose values that agent can read off, and some reward-node $r$ whose value determines how much reinforcement the agent gets. The agent starts with no information about the environment structure or the environment state. If the reward-node is sufficiently distant from its action-nodes, it'll take time for the agent's world-model to become advanced enough to model it. However, the agent would start trying to develop good policies/heuristics for increasing the reward immediately. Thus, its initial policies will necessarily act on proxies: it'll be focusing on the values of some intermediate nodes between its action-nodes and the reward-node.

And these proxies can be quite good. For example:

$x_{p}$ is a good proxy for controlling the value of $r$ if the $x_{k}, x_{e}$ chain doesn't perturb it too much. So an agent that only cares about the environment up to $x_{p}$ can capture e.g. $X\% > 90\%$ of the possible maximum reward.

It feels like it shouldn't matter: that once the world-model is advanced enough to include $r$ directly, the agent should just recognize $r$ as the source of reinforcement, and optimize it directly.

But suppose the heuristics the agent develops have "friction". That is: once a heuristic has historically performed well enough, the agent is reluctant to replace it with a better but more novel (and therefore untested) one. Or, at least, less willing the less counterfactual reward it promises to deliver. So a heuristic that performs 10x as well as the one it currently has will be able to win against a much older one, but a novel heuristic that only performs 1.1x as well won't be.

In this case, the marginally more effective policy will not be able to displace a more established one.

(An alternate view: suppose that the agent has two mutually-exclusive heuristics on what to do in a given situation, A and B. A has a good track record, B is a new one, but it's willing to try B out. Suppose it picks A with probability $p$ and B with $1 - p$, with $p$ proportional to how long A's track record is. If the reinforcement B receives is much larger than the reinforcement A receives, then even a rarely-picked B will eventually outpace A. If it's not much larger, however, then A will be able to "keep up" with B by virtue of being picked more often, and eventually outrace B into irrelevancy.)
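The alternate view can be turned into a small expected-value calculation. This is a sketch under assumptions that go beyond the text: $p$ is held fixed instead of growing with A's track record, and a heuristic's "strength" is taken to be the expected reinforcement it accumulates per step (pick probability times reward when picked). The function names are hypothetical.

```python
def per_step_reinforcement(p_pick_a: float, reward_a: float, reward_b: float):
    """Expected reinforcement each heuristic accumulates per step."""
    gain_a = p_pick_a * reward_a
    gain_b = (1.0 - p_pick_a) * reward_b
    return gain_a, gain_b


def b_outpaces_a(p_pick_a: float, reward_a: float, reward_b: float) -> bool:
    """True if the rarely-picked newcomer B still accumulates more
    reinforcement per step than the established heuristic A."""
    gain_a, gain_b = per_step_reinforcement(p_pick_a, reward_a, reward_b)
    return gain_b > gain_a


def break_even_ratio(p_pick_a: float) -> float:
    """How many times better B's reward must be before it overtakes A:
    (1 - p) * r_B > p * r_A  <=>  r_B / r_A > p / (1 - p)."""
    if not 0.0 <= p_pick_a < 1.0:
        raise ValueError("p_pick_a must be in [0, 1)")
    return p_pick_a / (1.0 - p_pick_a)


# With A picked 80% of the time (an assumed value), B must be more than
# about 4x as good to win.
ten_x_wins = b_outpaces_a(0.8, reward_a=1.0, reward_b=10.0)
slightly_better_wins = b_outpaces_a(0.8, reward_a=1.0, reward_b=1.1)
```

Under these assumptions the "friction" is just the ratio $p / (1 - p)$: the more entrenched A is, the larger the improvement B has to offer before it can displace A.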

Therefore: Yes, the agent will end up optimized for good performance on some proxies of "the human presses the button". What these proxies are depends on the causal structure of the environment, the percentage of max-reward $X$ optimizing for them allows the agent to capture, and some "friction" value that depends on the agent's internal architecture.

Major caveat: This mainly only holds for less-advanced systems; for those that are optimized, but do not yet optimize at the strategic level. A hedonist wrapper-mind would have no problems with evaluating whether the new heuristic is actually better, testing it out, and implementing it, no matter how comparatively novel it is.

Caveat to the caveat: Such strategic thinking will probably appear after the "values" have already been formed, and at that point the agent will do deceptive alignment to preserve them, instead of self-modifying into a reward-maximizer.

Route warning: This doesn't mean the agent's proxies will be friendly or even comprehensible to us. In particular, if the reward structure is

Something makes me smile $\to$ I smile $\to$ I press the button

Then it's about as likely (very not) that the agent will end up focusing on "I smile" as on "I press the button", since there's basically just a single causal step. Much more likely is that it'll value some stuff upstream of "something makes me smile"; possibly very strange stuff.

Note: Using the "antecedent-computation-reinforcer" term really makes all of this clearer, but it's so unwieldy. Any ideas for coining a better term?

Daniel Kokotajlo

Aug 12, 2022

Ω775

Quoting Rob Bensinger quoting Eliezer:

So what actually happens as near as I can figure (predicting future = hard) is that somebody is trying to teach their research AI to, god knows what, maybe just obey human orders in a safe way, and it seems to be doing that, and a mix of things goes wrong like:

The preferences not being really readable because it's a system of neural nets acting on a world-representation built up by other neural nets, parts of the system are self-modifying and the self-modifiers are being trained by gradient descent in Tensorflow, there's a bunch of people in the company trying to work on a safer version but it's way less powerful than the one that does unrestricted self-modification, they're really excited when the system seems to be substantially improving multiple components, there's a social and cognitive conflict I find hard to empathize with because I personally would be running screaming in the other direction two years earlier, there's a lot of false alarms and suggested or attempted misbehavior that the creators all patch successfully, some instrumental strategies pass this filter because they arose in places that were harder to see and less transparent, the system at some point seems to finally "get it" and lock in to good behavior which is the point at which it has a good enough human model to predict what gets the supervised rewards and what the humans don't want to hear, they scale the system further, it goes past the point of real strategic understanding and having a little agent inside plotting, the programmers shut down six visibly formulated goals to develop cognitive steganography and the seventh one slips through, somebody says "slow down" and somebody else observes that China and Russia both managed to steal a copy of the code from six months ago and while China might proceed cautiously Russia probably won't, the agent starts to conceal some capability gains, it builds an environmental subagent, the environmental agent begins self-improving more freely, undefined things happen as a sensory-supervision ML-based architecture shakes out into the convergent shape of expected utility with a utility function over the environmental model, the main result is driven by whatever the self-modifying decision systems happen to see as locally optimal in their supervised system locally acting on a different domain than the domain of data on which it was trained, the light cone is transformed to the optimum of a utility function that grew out of the stable version of a criterion that originally happened to be about a reward signal counter on a GPU or God knows what.

Perhaps the optimal configuration for utility per unit of matter, under this utility function, happens to be a tiny molecular structure shaped roughly like a paperclip.

That is what a paperclip maximizer is. It does not come from a paperclip factory AI. That would be a silly idea and is a distortion of the original example.

[-]TurnTrout3yΩ360

Perhaps the optimal configuration for utility per unit of matter, under this utility function, happens to be a tiny molecular structure shaped roughly like a paperclip.

I think this is very improbable, but thanks for the quote. Not sure if it addresses my question?

5Daniel Kokotajlo3y

Yudkowsky & I would of course agree that that is very improbable. It's just an example. The point I was making with this quote is that the question you are asking is a Big Old Unsolved Problem in the literature. If we had any idea what sort of utility function the system would end up with, that would be great and an improvement over the status quo. Yudkowsky's point in the quote is that it's a complicated multi-step process we currently don't have a clue about, it's not nearly as simple as "the system will maximize reward." A much better story would be "The system will maximize some proxy, which will gradually evolve via SGD to be closer and closer to reward, but at some point it'll get smart enough to go for reward for instrumental convergence reasons and at that point its proxy goal will crystallize." But this story is also way too simplistic. And it doesn't tell us much at all about what the proxy will actually look like, because so much depends on the exact order in which various things are learned. I should have made it just a comment, not an answer.

2TurnTrout3y

I actually doubt that claim in its stronger forms. I think there's some substantial effect, but e.g. whether a child loves their family doesn't depend strongly on the precise curriculum at grade school.

4Daniel Kokotajlo3y

Yet whether a child grows up to work on x-risk reduction vs. homeless shelters vs. voting Democrats out of office vs. voting Republicans out of office does often depend on the precise curriculum in college+high school. (I think we are in agreement here. I'd be interested to hear if you can point to any particular value AGI will probably have, or (weaker) any particular value such that if AGI has it, it doesn't depend strongly on the curriculum, order in which concepts are learned, etc.)

Nora Belrose

Aug 11, 2022

I don't know what the agent would end up valuing in this scenario either. I think a pretty good research direction for shard theory would be to closely examine the training history of the agent to find particular episodes where qualitatively new behaviors seem to form / new circuits arise in the neural net. This might allow you to identify shards, whereas it seems much harder to do that if you're only looking at the final trained weights of the agent.

DanielFilan

Jun 01, 2023

Ω551

Not having read other responses, my attempt to answer in my own words: what goes wrong is that there are tons of possible cognitive influences that could be reinforced by rewards for making people smile. E.g. "make things of XYZ type think things are going OK", "try to promote physical configurations like such-and-such", "try to stimulate the reinforcer I observe in my environment". Most of these decision-influences, when extrapolated to coherent behaviour where those decision-influences drive the course of the behaviour, lead to resource-gathering and not respecting what the informed preferences of humans would be. Then this causes doom because you can better achieve most goals/preferences you could have by having more power and disempowering the humans.

Tao Lin

Aug 13, 2022

Pretrained models don't need any exploration to know that pressing the reward button gets more reward than doing things the humans want. If you just ask GPT-3, it'll tell you that.

Then the only exploration the AI needs is to get reward after thinking about analogies between its situation and its textual knowledge of AI/reinforcement learning/AI doom scenarios.

This applies especially to simple, often-discussed tasks such as making people smile: an LM has already heard of this exact task, so if it took an action based on the "make people smile" task it's heard about, this could outperform other thought processes which are only conditioned on data so far.

[-]TurnTrout3y20

OK, but that's a predictive fact in the world model, not a motivational quantity in the policy. I know about my reward center too, and my brain does RL of some kind, but I don't primarily care about reward.

5Tao Lin3y

Here's a plausible story to me: The model sees its environment + past actions, and its LM predictive modelling part puts non-negligible probability on "this is the 'make humans smile' task". Then its language modelling prior predicts the next action, not based on the training setup, which it doesn't see, but based on the environment, and it outputs an action aimed at pressing the reward button. This action does well, is reinforced, and you get a reward-button-presser. Some context is that when training language models with RLHF, the language modelling prior tends to dominate over RL-learned behaviors on sub-distributions even after lots of RLHF training. Another version of this is "for many trajectories, an LM will be primarily predicting text, not executing RL-reinforced behaviors. Given this, actions that get reinforced are likely to come from the LM producing text that gets high reward in its reward model, rather than random actions."

3Nora Belrose3y

This is actually a pretty good argument, and has caused me to update more strongly to the view that we should be optimizing only the thought process of chain-of-thought language models, not the outcomes that they produce.

1Tao Lin3y

Also, I think if you trained something to predict text, then RL trained it on inclusive genetic fitness as a human (or human motivation signals), its learning would be mostly in the space of "select specific human / subdistribution of humans to imitate" rather than learning behaviors specific to the task, and then its generalization properties would depend more on those humans than on the specific training setup used

Noosphere89

Feb 10, 2025

One plausible answer is that it does in fact reward hack/optimize the reward, because reward hacking/reward optimization has happened before empirically, so there are reasonable grounds to raise the hypothesis to plausibility:

https://x.com/moyix/status/1885069457912996128

7 comments, sorted by top scoring


[-]Richard_Ngo3yΩ91511

Another response is "The AI paralyzes your face into smiling."
But this is actually a highly nontrivial claim about the internal balance of value and computation which this reinforcement schedule carves into the AI. Insofar as this response implies that an AI will primarily "care about" literally making you smile, that seems like a highly speculative and unsupported claim about the AI internalizing a single powerful decision-relevant criterion / shard of value, which also happens to be related to the way that humans conceive of the situation (i.e. someone is being made to smile).

Who do you think would make the claim that the AI in this scenario would care about "literally making you smile", as opposed to some complex, non-human-comprehensible goal somewhat related to humans smiling? E.g. Yudkowsky gives the example of an AI in that situation learning to optimize for "tiny molecular smiley faces", which is a much weirder generalization than "making you smile", although I think still less weird than the goal he'd actually expect such a system to learn (which wouldn't be describable in a single four-word phrase).

I think the AI will very probably have a spread of situationally-activated computations which steer its actions towards historical reward-correlates (e.g. if near a person, then tell a joke), and probably not singularly value e.g. making people smile or reward.

I think this happens when you have less intelligent systems, and then as you have more intelligent systems those correlates end up unified into higher-level abstractions which correspond to large-scale goals. I outline some of the arguments for that position in phase 3 here.

[-]TurnTrout3yΩ220

Who do you think would make the claim that the AI in this scenario would care about "literally making you smile", as opposed to some complex, non-human-comprehensible goal somewhat related to humans smiling?

I don't know? Seems like a representative kind of "potential risk" I've read about before, but I'm not going to go dig it up right now. (My post also isn't primarily about who said what, so I'm confused by your motivation for posting this question?)

[-]abramdemski3yΩ220

I've often repeated scenarios like this, or like the paperclip scenario.

My intention was never to state that the specific scenario was plausible or default or expected, but rather, that we do not know how to rule it out, and because of that, something similarly bad (but unexpected and hard to predict) might happen.

The structure of the argument we eventually want is one which could (probabilistically, and of course under some assumptions) rule out this outcome. So to me, pointing it out as a possible outcome is a way of pointing to the inadequacy of our current ability to analyze the situation, not as part of a proto-model in which we are conjecturing that we will be able to predict "the AI will make paperclips" or "the AI will literally try to make you smile".

[-]Joel Burget3y1113

Meta-comment: I'm happy to see this -- someone knowledgeable, who knows and seriously engages with the standard arguments, willing to question the orthodox answer (which some might fear would make them look silly). I think this is a healthy dynamic and I hope to see more of it.

[-]Raemon3y62

I also found this a good exercise in deliberate questioning/boggling.

From 2010-2014, when I was first forming my opinions on AI, it was really frustrating that anyone who objected to the basic AI arguments just... clearly hadn't been paying attention at all and didn't understand the basic arguments.

[-]Shmi3y20

Somewhat unrelated and probably silly... Why reward the agent directly instead of letting it watch humans act in their natural environment and leaving it to build a predictive model of humans?

[-]green_leaf3y10

To predict if a human ends up happy with something or not?
