Executing a policy trained through current reinforcement learning methods does not necessarily result in a system which takes actions to maximize the reward function.
I am not convinced that this is actually a narrowing of what TurnTrout said.
Consider the following possible claims:
combined with one of:
combined with one of:
I interpret TurnTrout's as mainly saying 2(B)(ii) (e.g. the AI will not, in general, rewrite its reward module to output MAX_INT regardless of how smart it becomes - I agree with this point).
I also think TurnTrout is probably additionally saying 1(A)(ii) (i.e., the AI won't explicitly value or agentically seek to maximize its original reward function no matter how much it is scaled up - this is plausible to me, but I am not as sure I agree with this as compared to 2(B)(ii)).
I interpret you, in the quote above, as maybe saying 1(A)(i), (i.e. current AIs don't explicitly value or agentically seek to maximize the reward function on which they are trained on). While I agree, and this is weaker than 1(A)(ii) which seems to me a secondary point of TurnTrout's post, I don't think it is strictly speaking narrower than 2(B)(ii) which I think was TurnTrout's main point.
Also, regarding your thought experiment - of course, if in training the AI finds some way to cheat, that will be reinforced! But that has limited relevance for when cheating in training isn't possible. I also think that the fact that a human has pre-existing values, while the AI doesn't, makes the thought experiment not that useful an analogy.
Hmm, I'm not sure anyone is "making an assertion that we expect to hold no matter how much the AI is scaled up.", unless scaling up means something pretty narrow like applying current RL algorithms to larger and larger networks and more and more data.
But you're probably right that my claim is not strictly a narrowing of the original. FWIW, I think both your (1) and (2) above are pretty likely when talking about current and near-future systems, as they scale to human levels of capability and agency, but not necessarily beyond.
I read the original post as talking mainly about current methods for RL, applied to future systems, though TurnTrout and I probably disagree on when it makes sense to start calling a system an "RL agent".
Also, regarding your thought experiment - of course, if in training the AI finds some way to cheat, that will be reinforced! But that has limited relevance for when cheating in training isn't possible.
As someone who has worked in computer security, and also written and read a lot of Python code, my guess is that cheating at current RL training processes as actually implemented is very, very possible for roughly human-level agents. (That was the other point of my post on gradient hacking.)
Hmm, I'm not sure anyone is "making an assertion that we expect to hold no matter how much the AI is scaled up.", unless scaling up means something pretty narrow like applying current RL algorithms to larger and larger networks and more and more data.
While I did intend (ii) to mean something relatively narrow like that, I will make the assertion that I expect 2(B)(ii) (which I think was TurnTrout's main point) to hold for a large class of algorithms, not just current ones, and that it would require a major screw-up for someone to implement an algorithm for which it didn't hold.
As someone who has worked in computer security, and also written and read a lot of Python code, my guess is that cheating at current RL training processes as actually implemented is very, very possible for roughly human-level agents. (That was the other point of my post on gradient hacking.)
I wouldn't be surprised.
But, I would be surprised if it actually did cheat, unless the hacking were not merely possible with planning but pretty much laid out on its path.
The thing is, it's not trying to maximize the reward! (Back to TurnTrout's point again). It's gradient descent-ing in some attractor basin towards cognitive strategies that get good rewards in practice, and the hacking probably isn't in the same gradient descent attractor basin.
Even if it does develop goals and values, they will be shaped by the attractor basin that it's actually in, and not by other attractor basins.
A human with pre-existing goals is a different matter - that's why I questioned the relevance of the thought experiment.
In this post, I want to highlight a fact which I did not see mentioned in the original post or the comments: in the field of reinforcement learning, there are agents which are pursuing the goal of building a system which maximizes a reward function, subject to some additional constraints. These agents are the capabilities researchers designing and implementing SotA reinforcement learning algorithms and other methods to build and test the most capable, general systems across a variety of domains.
I agree that this fact is worth pointing out. I myself have the feeling that I've mentioned this somewhere, but couldn't instantly find/cite where I'd elaborated this over the last year. IIRC, I ended up thinking that the human-applied selection pressure was probably insufficient to actually produce policies which care about reward.
I'm not opposed to using standard shorthand when it's clear to experienced practitioners what the author means, but I think in posts which discuss both policies and agents, it is important to keep these distinctions in mind and sometimes make them explicit.
I agree very much with your point here, and think this is a way in which I have been imprecise. I mean to write a post soon which advocates against using "agents" to refer to "policy networks".
The observation that the Dreamer authors exerted strong optimization power to design an effective RL method is what led me to make the prediction here.
(As an aside, I mean to come back and read more about Dreamer and possibly take up a bet with you. I've been quite busy.)
In Reward is not the optimization target, @TurnTrout writes:
I think these claims are true and important, though in my own terms, I would rephrase and narrow as:
Executing a policy trained through current reinforcement learning methods does not necessarily result in a system which takes actions to maximize the reward function.
I find the claim stated this way more intuitively obvious, but also somewhat less interesting, because the restatement makes the narrowness more explicit.
In this post, I want to highlight a fact which I did not see mentioned in the original post or the comments: in the field of reinforcement learning, there are agents which are pursuing the goal of building a system which maximizes a reward function, subject to some additional constraints. These agents are the capabilities researchers designing and implementing SotA reinforcement learning algorithms and other methods to build and test the most capable, general systems across a variety of domains.
Capabilities researchers are not solely concerned with maximizing a particular reward function, since it is not very difficult or interesting to program a bot the old-fashioned way to beat a particular Atari game. For other games (e.g. Go) it is harder to beat human or machine performance by using traditional programming techniques, and trying doesn't make for a compelling AI research agenda.
Aside from the headline metric of how well a new RL method does in terms of training policies which result in a high reward when executed, RL researchers place importance on:
Why is this observation important?
In my view, current RL methods have not yet advanced to the point of creating systems which can be indisputably described as agents which has any kind of values at all.[2]
I view most attempts to draw parallels between high-level processes that happen in current-day AI systems and human brains as looking for patterns which do not yet exist.
Speculating on what kind of agents and values current DL-paradigm RL methods might produce in the future can be valuable and important research, but I think that it is important to remain grounded about what current systems are actually doing, and to be precise with terms.
As an example of where I think a lack of grounding about current systems and methods leads to things going wrong, in Evolution provides no evidence for the sharp left turn, Quintin Pope writes:
But the actual sharp left turn problem is about systems which are indisputably agentic and reflective already.
Similarly, on Inner and outer alignment decompose one hard problem into two extremely hard problems, I am remain skeptical that there is any process whatsoever within current-day systems for which it is meaningful to talk about inner alignment or as having values in any sense. These points are debatable, and I am not providing much evidence or explaining my own views in detail here. I am merely claiming that these are points which are up for debate.
A note on terminology
Throughout this post, I have used somewhat verbose and perhaps nonstandard phrasings like "executing a trained policy", to make the type of the object or concept I am talking about precise. I think it is sometimes worth being very precise and even pedantic about types when talking about these things, because it can make implicit assumptions more explicit in the text. This has benefits for clarity even when there is no disagreement.
I'm not claiming that this terminology or verbosity should be standard, but my own preferred way of thinking of things in the field of RL is explained by the following paragraph:
Reinforcement learning methods are human-readable instructions, algorithms, and theories for designing and building RL-based AI systems. These methods usually involve training a policy, which is then deployed in a system which feeds input and state into the policy and repeatedly executes the output of the policy (using some simple selection rule, if the policy is probabilistic) in an appropriate environment or domain. It's often useful to model such a system as an agent within a particular domain, but I dispute that any current system has properties which are similar in type to the kind of agency and values attributed to humans.
I am not an expert in the field of RL, but I don't think any usage of the italicized terms in the paragraph above is particularly controversial or nonstandard. Feel free to correct me or propose better terms in the comments if not.
I'm not opposed to using standard shorthand when it's clear to experienced practitioners what the author means, but I think in posts which discuss both policies and agents, it is important to keep these distinctions in mind and sometimes make them explicit.
A closing thought experiment
In a recent post on gradient hacking, I described a thought experiment:
Aside from making a point about gradient hacking, I think this thought experiment is useful for building an intuition for why reward is not the optimization target of the system being trained.
The human subject in the thought experiment would indeed be very unlikely to intrinsically value scoring highly on the metric which the aliens use to evaluate the human's performance during training. But the human might seek to maximize (or at least manipulate) the metric during training anyway, in order to deceive the aliens into ending the training process.
I think this helps build an intuition for why inner alignment may be a problem in future, more capable AI systems which has not yet shown up in any real systems.
The observation that the Dreamer authors exerted strong optimization power to design an effective RL method is what led me to make the prediction here.
My guess is that an RL policy trained using Dreamer will look more like a reward-function maximizer for cheese-finding in a maze, because the developers of the Dreamer algorithm were more focused on developing a maximizer-building algorithm than Langosco et al., who merely wanted an RL algorithm that was good enough to produce policies which they could study for other purposes. (This is not at all meant as a knock on Langosco or Turner et al.'s work! I take issue with some of their conclusions, but personally, I think their work is valuable and net-positive, and the publication of Dreamer is negative.)
I do think modeling RL-based systems as agents in particular domains is a useful tool for understanding the behavior of these systems, but I haven't yet seen an AI system which I would consider to actually unambiguously have any real agency or values whatsoever, which in my view is a fact about the underlying processes within a system which generate its behavior.