All of benedelman's Comments + Replies

Ah, I think I understand what you meant now. The reward for this agent is not determined by the actual long-term consequences of its action, but by the predicted long-term consequences. In that case, yes, this seems like it might be an interesting middle ground between what we are calling short-term and long-term AIs. Though it still feels closer to a long-term agent to me—I'm confused about why you think it would both (a) not plan ahead of time to disempower humans, and (b) disempower humans when it has the chance. If the predictive model is accurate enough such that it is predictable that disempowering humans would be instrumentally useful, then wouldn't the model incorporate that into its earlier plans?

1Sam Marks
Yep, sorry, I've probably been pretty unclear with describing this example. For the sake of simplicity, let's ignore the "execute on arbitrary short-term goals" half of the system I described, and just consider a system which was trained to generate long-term plans that maximize profit by imitating human-generated plans; if we label our training data with information about plan quality, such an agent could plausibly output plans as good as a human expert planner. So the training goal of this system is the short-term goal of minimizing next-token prediction loss during the current episode. Let's suppose that this agent is subtly inner misaligned with the (still short-term) inner goal of minimizing the value in the memory address which, during training, represented the current episode's prediction loss. So if this agent could wirehead by directly setting this number to be very small before the end of its current episode, it would do so. But if it's not able to do that, it would likely exhibit the behavior we want (outputting plans which mimic its training data), since that's the best way to get the value in the memory address it cares about to be small. In particular, this system wouldn't be interested in outputting plans which would manipulate us into setting this value to be small in the future, since it myopically only cares about this memory address's value during the current episode. But if at some point the system accumulates lots of resources and power by normal means, it may find that it has the option of e.g. suddenly taking control of the computers that it's running on and manually setting the value in the memory address it cares about to be small. (I'm assuming here that the system can do things other than just output tokens, e.g. search the internet, contact human experts on the side, etc., so that it could plausibly have a way of taking over its computing cluster without ending the current episode.) So this is a bad action that the system wouldn't have

In those terms, what we're suggesting is that, in the vision of the future we sketch, the same sorts of solutions might be useful for preventing both AI takeover and human takeover. Even if an AI has misaligned goals, coordination and mutually assured destruction and other "human alignment" solutions could be effective in stymying it, so long as the AI isn't significantly more capable than its human-run adversaries.

Re your second critique: why do you think an AI system (without superhuman long-term planning ability) would be more likely to take over the world this way than an actor controlled by humans (augmented with short-term AI systems) who have long-term goals that would be instrumentally served by world domination?

1Sam Marks
I think that a competent human actor assisted by short-term AI systems plausibly could take over the world this way; I'm just inclined to call that a misuse problem rather than an alignment problem. (Or in other words, fixing that requires solving the human alignment problem, which feels like it requires different solutions, e.g. coordination and governmental oversight, than the AI alignment problem.)

I'm confused about your first critique. You say the agent has a goal of generating a long-term plan which leads to as much long-term profit as possible; why do you call this a short-term goal, rather than a long-term goal? Do you mean that the agent only takes actions over a short period of time? That's true in some sense in your example, but I would still characterize this as a long-term goal because success (maximizing profit) is determined by long-term results (which depend on the long-term dynamics of a complex system, etc.).

1Sam Marks
I see two distinctions between a system like the one I described and a system with long-term goals in the usual sense. First, the goal "write down a plan which, if followed, would lead to long-term profit" is itself a short-term goal which could plausibly be trained up to human-level with a short-term objective function (by training on human-generated predictions). So I think this mechanism avoids the arguments made in claims 4 and 5 of the post for the implausibility of long-term goals (which is my motivation for mentioning it). (I can't tell if claim 6 was supposed to be addressing long-term goal formation stories like this one.) Second, the intrinsic goals of the system I described are all short-term (output the text of a plan for a long-term goal; pursue various short-term goals), so the possible alignment failures for such a system might need to be analyzed differently than those of a system with long-term intrinsic goals. For example, such a system might not plan ahead of time to disempower humans (since such disempowerment would come in the long term, which it doesn't intrinsically care about). But once it finds that it has enough resources and power to disempower humans, it might then decide to take a catastrophic action, despite not having planned it in advance.
benedelmanΩ120

Thanks for laying out the case for this scenario, and for making a concrete analogy to a current world problem! I think our differing intuitions on how likely this scenario is might boil down to different intuitions about the following question:

To what extent will the costs of misalignment be borne by the direct users/employers of AI?

Addressing climate change is hard specifically because the costs of fossil fuel emissions are pretty much entirely borne by agents other than the emitters. If this weren't the case, then it wouldn't be a problem, for the reaso...

8leogao
I expect that the key externalities will be borne by society. The main reason for this is that I expect deceptive alignment to be a big deal. It will at some point be very easy to make AI appear safe, by making it pretend to be aligned, and very hard to make it actually aligned. Then, I expect something like the following to play out (this is already an optimistic rollout intended to isolate the externality aspect, not a representative one): We start observing alignment failures in models. Maybe a bunch of AIs do things analogous to shoddy accounting practices. Everyone says "yes, AI safety is Very Important". Someone notices that when you punish the AI for exhibiting bad behaviour with RLHF or something the AI stops exhibiting bad behaviour (because it's pretending to be aligned). Some people complain that this doesn't actually make it aligned, but they're ignored or given a token mention. A bunch of regulations are passed to enforce that everyone uses RLHF to align their models. People notice that alignment failures decrease across the board. The models don't have to somehow magically all coordinate to not accidentally reveal deception, because even in cases where models fail in dangerous ways people chalk this up to the techniques not being perfect, but they're being iterated on, etc. Heck, humans commit fraud all the time and yet it doesn't cause people to suddenly stop trusting everyone they know when a high-profile fraud case is exposed. And locally there's always the incentive to just make the accounting fraud go away by applying Well Known Technique rather than really digging deep and figuring out why it's happening. Also, a lot of people will have a vested interest in not having the general public think that AI might be deceptive, and so will try to discredit the idea as being fringe. Over time, AI systems control more and more of the economy. At some point they will control enough of the economy to cause catastrophic damage, and a treacherous turn happens. A
benedelmanΩ491

Thank you for the insightful comments!! I've added thoughts on Mechanisms 1 and 2 below. Some reactions to your scattered disagreements (my personal opinions; not Boaz's):

  1. I agree that extracting short-term modules from long-term systems is more likely than not to be extremely hard. (Also that we will have a better sense of the difficulty in the nearish future as more researchers work on this sort of task for current systems.)
  2. I agree that the CEO point might be the weakest in the article. It seems very difficult to find high-quality evidence about the impac
...
4paulfchristiano
I usually think of the effects of R&D as multiplicative savings in compute, which sounds consistent with what you are saying. For example, I think a conservative estimate might be that doubling R&D effort allows you to cut compute by a factor of 4. (The analogous estimate for semiconductor R&D is something like 30x cost reduction per 2x R&D increase.) These numbers are high enough to easily allow explosive growth until the returns start diminishing much faster. Yes. I mean that if we have alignment problems such that all the most effective AI systems have long-term goals, and if all of those systems can get what they want together (e.g. because they care about reward), then to predict the outcome we should care about what would happen in a conflict between (those AIs) vs (everyone else). So I expect in practice we need to resolve alignment problems well enough that there are approximately competitive systems without malign long-term goals.
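As a rough illustration of the arithmetic in the comment above (not something the commenter wrote): if each doubling of R&D effort cuts compute by a fixed factor, then scaling R&D by some multiplier compounds that factor once per doubling. The function name `compute_savings` and the assumption that the per-doubling factor stays constant across doublings are hypothetical simplifications; the comment itself notes that returns eventually diminish.

```python
import math


def compute_savings(rd_multiplier: float, savings_per_doubling: float = 4.0) -> float:
    """Factor by which compute is cut when R&D effort is scaled by rd_multiplier.

    Assumes (hypothetically) that every doubling of R&D yields the same
    multiplicative saving, so savings compound once per doubling.
    """
    if rd_multiplier <= 0:
        raise ValueError("rd_multiplier must be positive")
    doublings = math.log2(rd_multiplier)
    return savings_per_doubling ** doublings


# Conservative AI estimate from the comment: 4x savings per 2x R&D.
ai_savings_at_4x_rd = compute_savings(4.0, savings_per_doubling=4.0)

# Semiconductor analogy from the comment: roughly 30x per 2x R&D.
chip_savings_at_4x_rd = compute_savings(4.0, savings_per_doubling=30.0)
```

Under the constant-factor assumption, a 4x saving per doubling is equivalent to savings growing as the square of R&D effort, which is one way to see why such numbers would permit fast growth until returns fall off.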
benedelman*Ω6112

I agree that this sort of deceptive misalignment story is speculative but a priori plausible. I think it's very difficult to reason about these sorts of nuanced inductive biases without having sufficiently tight analogies to current systems or theoretical models; how this will play out (as with other questions of inductive bias) probably depends to a large extent on what the high-level structure of the AI system looks like. Because of this, I think it's more likely than not that our predictions about what these inductive biases will look like are pretty of...

6paulfchristiano
I think the situation is much better if deceptive alignment is inconsistent. I also think that's more likely, particularly if we are trying. That said, I don't think the problem goes away completely if deceptive alignment is inconsistent. We may still have limited ability to distinguish deceptively aligned models from models that are trying to optimize reward, or we may find that models that are trying to optimize reward are unsuitable in practice (e.g. because of the issues raised in mechanism 1) and so selecting for things that work means you are selecting for deceptive alignment.
8Vivek Hebbar
What kind of regularization could this be?  And are you imagining an AlphaZero-style system with a hardcoded value head, or an organically learned modularity?
benedelmanΩ460

My main objection to this misalignment mechanism is that it requires people/businesses/etc. to ignore the very concern you are raising. I can imagine this happening for two reasons:

  1. A small group of researchers raise alarm that this is going on, but society at large doesn't listen to them because everything seems to be going so well. This feels unlikely unless the AIs have an extremely high level of proficiency in hiding their tampering, so that the poor performance on the intended objective only comes back to bite the AI's employers once society is permane
...
4paulfchristiano
I expect there to be broad agreement that this kind of risk is possible. I expect a lot of legitimate uncertainty and disagreement about the magnitude of the risk. I think if this kind of tampering is risky then it almost certainly has some effect on your bottom line and causes some annoyance. I don't think AI would be so good at tampering (until it was trained to be). But I don't think that requires fixing the problem---in many domains, any problem common enough to affect your bottom line can also be quickly fixed by fine-tuning for a competent model. I think that if there is a relatively easy technical solution to the problem then there is a good chance it will be adopted. If not, I expect there to be a strong pressure to take the overfitting route, a lot of adverse selection for organizations and teams that consider this acceptable, a lot of "if we don't do this someone else will," and so on. If we need a reasonable regulatory response then I think things get a lot harder. In general I'm very sympathetic to "there is a good chance that this will work out," but it also seems like the kind of problem that is not hard to mess up, and there's enough variance in our civilization's response to challenging technical problems that there's a real chance we'd mess it up even if it was objectively a softball. ETA: The two big places I expect disagreement are about (i) the feasibility of irreversible robot uprising---how sure are we that the optimal strategy for a reward-maximizing model is to do their task well? (ii) is our training process producing models that actually refrain from tampering, or are we overfitting to our evaluations and producing models that would take an opportunity for a decisive uprising if it came up? I think that if we have our act together we can most likely measure (ii) experimentally; you could also imagine a conservative outlook or various forms of penetration testing to have a sense of (i). But I think it's just quite easy to imagine us fail
leogao*Ω93128

A small group of researchers raise alarm that this is going on, but society at large doesn't listen to them because everything seems to be going so well.

Arguably this is already the situation with alignment. We have already observed empirical examples of many early alignment problems like reward hacking. One could make an argument that looks something like "well yes but this is just in a toy environment, and it's a big leap to it taking over the world", but it seems unclear when society will start listening. In analogy to the AI goalpost moving problem ...