Superintelligence can solve all problems, even if it won't necessarily choose to. But not quite all of them. In the modern world, you exist within physics, and everything you do obeys its laws. Yet it's still you who decides what you do. If there is a simulation of your behavior, that doesn't change the attribution of the reason for it. When the fabric of reality is woven from superintelligent will rather than only physical law, the reason for your own decisions is still you, and it's still not possible for something that is not you to make decisions that are yours.

Potential for manipulation or physical destruction doesn't distinguish the role of superintelligence from that of the physical world. To make decisions in the world, you first need to exist, in an environment where you are able to function. Being overwritten or changed into something else, whether by brute force, subtle manipulation, or social influence, is a way of not letting this premise obtain.

A solved world is post-instrumental: anything done for a purpose could be done by AI to reach that purpose more effectively. It could even be more effective than you at figuring out what your decisions are going to be! This is similar to what an optimizing compiler does: the behavior of the machine code is still determined by the meaning of the source code, even if that source code is erased and only exists conceptually. With humans, the physical implementation is similarly not straightforward; all the proteins and synaptic vesicles are more akin to machine code than to a conceptually reasonable rendering of a person. So it's fair to say that we are already not physically present in the world; the things that are physically present are better described as kludgy and imperfect simulators.

In this framing, superintelligence is capable of being a better simulator of you, but it gets no headway in being capable of deciding your behavior. A puzzle associated with determinism is to point out that superintelligence can show you a verifiable formal proof of what you're going to do. But that doesn't really work: deciding to ignore the proof and do something else makes the situation you observed turn out to be counterfactual, while in actuality you are still determining your own actions. Only if you do have the property of deferring to external proofs of what you'll do does it become the case that an external proof claiming something about your behavior is the reason for that behavior. But that's no way to live. Instead, it's you who controls the course of proofs about your behavior, not those who are writing the proofs.

10 comments

What is the difference between "deciding your behaviour" and "deciding upon interventions to you that will result in behaviour of its choosing"?

If showing you a formal proof that you will do a particular action doesn't result in you doing that action, then the supposed "proof" was simply incorrect. At any rate, it is unlikely in most cases that there exists a proof such that merely presenting it to a person is sufficient to ensure that the person carries out some action.

In more formal terms: even in the trivial case where a person could be modelled as a function f(a,b,c,...) that produces actions from inputs, and there do in fact exist values of (a,b,c,...) such that f produces a chosen action A, there is no guarantee that f(a,b,c,...) = A whenever a = "a proof that f(a,b,c,...) = A" for all values of b,c,... .

It may be true that f(a,b,c,...) = A for some values of b,c,..., and if the superintelligence can arrange for those to hold then it may indeed look like merely presenting the proof is enough to guarantee action A, but the guarantee would actually be a property of the presentation of the proof and all the other interventions together (even if the other interventions are apparently irrelevant).
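A minimal sketch of this toy model in Python, purely illustrative: the function `person`, its inputs, and the decision rules are made up for the example, not a model of any actual person.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Message:
    claim: str            # e.g. "you will choose A"
    is_valid_proof: bool  # whether the message really is a formally correct proof

def person(message: Message, mood: str, trusts_proofs: bool) -> str:
    """Toy stand-in for f(a, b, c): an action computed from inputs."""
    if trusts_proofs and message.is_valid_proof:
        # Someone who defers to proofs makes the proof self-fulfilling.
        return message.claim.removeprefix("you will choose ")
    if mood == "contrarian" and message.claim == "you will choose A":
        # Someone who diagonalizes against the claim falsifies it instead.
        return "B"
    return "A"

proof_of_A = Message(claim="you will choose A", is_valid_proof=True)

# Presenting the "proof" alone does not guarantee action A:
print(person(proof_of_A, mood="contrarian", trusts_proofs=False))  # -> B

# It only looks sufficient when the other inputs are also arranged suitably:
print(person(proof_of_A, mood="calm", trusts_proofs=False))        # -> A
```

The claimed action follows only when the other arguments happen to line up, which is the point about the other interventions doing the real work.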

There are many things that people believe they will be able to simply ignore, but where that belief turns out to be incorrect. Simply asserting that deciding to ignore the proof will work is not enough to make it true.

As you broaden the set of possible interventions and time spans, guarantees of future actions will hold for more people. My expectation is that at some level of intervention far short of direct brain modification or other intuitively identity-changing actions, it holds for essentially all people.

If showing you a formal proof that you will do a particular action doesn't result in you doing that action, then the supposed "proof" was simply incorrect.

Yes, that's the point: you can make it necessarily incorrect. Your decision to act differently determines the incorrectness of the proof, regardless of its provenance. When the proof was formally correct, your decision turns the whole possible world where this takes place counterfactual. (This is called playing chicken with the universe, or the chicken rule, a technique that's occasionally useful for getting an agent to have nicer properties, by not letting the formal system that generates the proofs know too much too early about what the agent is going to decide.)
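A minimal sketch of the chicken rule in Python, assuming a toy agent whose proof search is just a callable that says whether a statement is provable; the names and the two-action setup are illustrative, not a real proof-based agent.

```python
ACTIONS = ["A", "B"]

def chicken_agent(proves) -> str:
    """Pick an action, diagonalizing against any proof of what the agent will do.

    `proves(stmt)` stands in for a proof search: it returns True if the formal
    system proves the statement `stmt`.
    """
    for action in ACTIONS:
        if proves(f"agent takes {action}"):
            # Playing chicken with the universe: deliberately falsify the proof,
            # so a sound proof system could never have produced it in the first
            # place, and can't know the agent's decision too early.
            return next(a for a in ACTIONS if a != action)
    # No proof about the agent's own action was found; decide on other grounds.
    return "A"

# A purported proof that the agent takes "A" gets refuted by the agent itself:
print(chicken_agent(lambda stmt: stmt == "agent takes A"))  # -> B

# With no such proof available, the agent just decides normally:
print(chicken_agent(lambda stmt: False))                    # -> A
```

If the proof system is sound, the first branch is never actually reachable for it, which is exactly what keeps proofs of the agent's own action unavailable early on.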

Manipulation that warps an agent loses fidelity in simulating them. Certainly superintelligence has the power to forget you in various ways, similarly to how a supernova explosion is difficult to survive, so this not happening needs to be part of the premise. A simulator that only considers what you do for a particular contrived input fails to observe your behavior as a whole.

So we need some concepts that say what it means for an agent to not be warped (while retaining interaction with sources of external influence rather than getting completely isolated), for something to remain the intended agent rather than some other phenomenon that is now in its place. This turns out to be a lot like the toolset useful for defining the values of an agent. Some relevant concepts are membranes, updatelessness, and coherence of volition. Membranes gesture at the inputs that are allowed to come in contact with you, or at the information about you that can be used in determining which inputs are allowed, as part of an environment that doesn't warp you and enables further observation. This shouldn't be restricted only to inputs, since the whole deal with AI risk is that AI is not some external invasion arriving on Earth; it's something we are building ourselves, right here. So the concept of a membrane should also target acausal influences, patterns developing internally from within the agent.

Updatelessness is about a point of view on behavior where you consider its dependence on all possible inputs, not just the actions taken given the inputs you did apparently observe so far. Decisions should be informed by looking at the map of all possible situations and the behaviors that take place there, even if the map has to be imprecise. (And not just situations that are clearly possible: the example in the post about ignoring the proof of what you'll do is about what you do in potentially counterfactual situations.)
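A minimal sketch of the updateless framing in Python, using the standard counterfactual mugging toy problem; the payoffs, probabilities, and names are illustrative assumptions, not anything from the post.

```python
from itertools import product

OBSERVATIONS = ["heads", "tails"]  # all possible inputs, not just the one observed
# A policy assigns an action to every possible observation: (on_heads, on_tails).
POLICIES = list(product(["pay", "refuse"], repeat=len(OBSERVATIONS)))

def expected_value(policy) -> float:
    """Score a policy across the whole map of possible situations."""
    on_heads, on_tails = policy
    # Tails (probability 1/2): the agent is asked to pay 100.
    tails_payoff = -100 if on_tails == "pay" else 0
    # Heads (probability 1/2): the predictor pays out 10_000 only if the
    # agent's policy would have paid on tails.
    heads_payoff = 10_000 if on_tails == "pay" else 0
    return 0.5 * heads_payoff + 0.5 * tails_payoff

best = max(POLICIES, key=expected_value)
print(best, expected_value(best))  # -> ('pay', 'pay') 4950.0
```

Evaluating only the situation actually observed ("tails, I'm being asked for 100") would recommend refusing; evaluating whole policies over the map of possible inputs recommends paying, which is the updateless point of view described above.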

Coherence of volition is about the problem of path dependence in how an agent develops. There are many possible observations, many possible thoughts and possible decisions, that lead to different places. The updateless point of view says that local decisions should be informed by the overall map of this tree of possibilities for reflection, so it would be nice if path dependence didn't lead to chaotic divergence, if there are coherent values to be found, even if only within smaller clusters of possible paths of reflection that settle into being sufficiently in agreement.

It seems you are overlooking the notion of superintelligence being able to compute backwards through your decision-making process. Yes, it's you who would be making the decision, but the SI can tell you exactly what you need to hear in order for your decision to result in what it wants. It is not going to try to explain how it is manipulating you, it will not try to prove to you that it is manipulating you correctly - it will just manipulate you. Internally, it may have a proof, but what reason would it have to show it to you? And if placed into some very constrained setup where it is forced to show you the proof, it will solve a recursive equation, "What is the proof P such that P proves that 'when shown P, you will act according to P's prediction'?", solve it correctly, and then show you a P compelling enough that you follow it to its conclusion.
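A minimal sketch of that recursive search in Python, where `person_model` stands in for the superintelligence's model of you and the tiny message space is an illustrative assumption.

```python
def find_self_fulfilling_message(person_model, actions):
    """Search for a message P such that the person, when shown P, does what P claims."""
    for action in actions:
        message = f"proof that you will do {action}"
        if person_model(message) == action:
            # Fixed point found: showing this P makes its own prediction come true.
            return message
    return None  # no such P exists for this person and message space

def contrarian_about_A(message: str) -> str:
    # Toy model of a person: defies a claimed proof of "A", but happens to go
    # along with a claimed proof of "B".
    if "you will do A" in message:
        return "B"
    if "you will do B" in message:
        return "B"
    return "A"

print(find_self_fulfilling_message(contrarian_about_A, ["A", "B"]))
# -> "proof that you will do B": the only claim this person doesn't bother to defy.
```

Whether such a fixed point exists at all depends on the person being modelled.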

The ability to resist a proof of what your behavior will be, even to the point of refuting its formal correctness (by determining its incorrectness with your own decisions and turning the situation counterfactual), seems like a central example of a superintelligence being unable to decide/determine (as opposed to predict) what your decisions are. It's also an innocuous enough input that it doesn't obviously have to be filtered out by a weak agent's membrane.

In any case, to even discuss how a weak agent behaves in a superintelligent world, it's necessary to have some notion of keeping it whole. Extreme manipulation can both warp the weak agent and fail to elicit their behavior for other possible inputs. So this response to another comment seems relevant.

Another way of stating this, drawing on the point about physical bodies thought of as simulations of some abstract formulation of a person, is to say that an agent by itself is defined by its own isolated abstract computation, which includes all membrane-permissible possible observations and resulting behaviors. Any physical implementation is then a simulation of this abstract computation, which can observe it to some extent, or fail to observe it (when the simulation gets sufficiently distorted). When an agent starts following the dictates of external inputs, that corresponds to the abstract computation of the agent running other things within itself, which can be damaging to its future on that path of reflection, depending on what those things are. In this framing, normal physical interaction with the external world becomes some kind of acausal interaction between the abstract agent-world (on inputs where the physical world is observed) and the physical world (for its parts that simulate the abstract agent-world).

It sounds like your central point is that even if a superintelligence can model you well enough to know what you'll do in any given context, it's still you deciding.

I agree that's formally true as constructed, but also... not very useful? The superintelligence can make decisions that control the context in which I decide, such that I end up deciding whatever it wants me to decide. This does not quite get me what I want from my free will, not in the way that the current unoptimized world does. This context-control can be subtle or overt, but if it never fails, then there is a sense in which the proximate cause of my actions, which is still located within me, is no longer the ultimate cause of my actions. I still make decisions and act to try to achieve my own goals, but it is now determined, by one or more entities outside myself, that my decisions and actions will in practice be optimized to achieve those entities' goals.

The central point is more that superintelligence won't be able to help with your decisions, won't be able to decide for you at some fundamental level, no matter how capable it is. It can help instrumentally, but not replace your ability to decide in order to find out what the decisions are, so in some sense it's not able to help at all. I'm trying to capture the sense in which this holds regardless of its ability to precisely predict and determine outcomes in the physical world (if we only look at the state of the future, rather than full trajectories that get the world there).

When a program is given strange input, or if the computer it would be running on is destroyed, that event in the physical world usually doesn't affect the semantics of the program that describes its behavior for all possible inputs. If you are weak and brittle, talking about what you'll decide requires defining what we even mean in principle by a decision that is yours; only then can we ask if it does remain yours in actuality. Or if it does remain yours even when you are yourself no longer present in actuality. Which is not very useful for keeping it (or yourself) present in actuality, but can be conceptually useful for formulating desiderata towards it being present in actuality.

So there are two claims. First, it's in some sense natural for your decisions to remain yours, if you don't start mindlessly parroting external inputs that dictate your actions, even in the face of superintelligence (in its aspect of capability, but not necessarily aimed in a way that disrupts you). Second, if you are strongly manipulated or otherwise overridden, this should in some sense mean that you are no longer present, that the resulting outcome doesn't capture or simulate what we should define as being you (in order to talk about the decisions that are in principle yours). Thus the presence of overpowering manipulation doesn't contradict decisions usually remaining yours; it just requires that you are consequently no longer present to manifest them in actuality when that happens.

This seems like the third comment raising the same concern; I've also answered it here and here, going into more detail on other related things. So there is a missing prerequisite post.

In case this helps you write that post, here are some things I am still confused about.

What, specifically, is the "you" that is no longer present in the scenario of strong manipulation?

In the "mindless parroting" scenario, what happens if the ASI magically disappeared? Does/can the "you" reappear? Under what circumstances?

Why is this not a fully general argument against other humans helping you make decisions? For example, if someone decides to mindlessly parrot everything a cult leader tells them, I agree there's a sense in which they are no longer present. But the choice to obey is still theirs, and can be changed, and they can reappear if they change that one choice.

OTOH, if this is a fully general argument about anyone helping anyone else make decisions, that seems like a major (and underspecified) redefinition of both "help" and "decision." It then seems like it's premature to jump to focusing on ASI as a special case, and also I'm not sure why I'm supposed to care about these definitional changes?

"So it's fair to say that we are already not physically present in the world, the things that are physically present are better described as kludgy and imperfect simulators" - I mean, yes, I see your point, but also, I am present implicitly in the structure of the physically-existing things. There is some set of arrangements-of-matter that I'd consider me, and others I would not. I don't know if the set's boundaries are quantitative or binary or what, but each member either encodes me, or not.

I think, under more conventional definitions of "help" and "decision," that telling me what to do, or showing me what I'm going to do, is kinda beside the point. A superintelligence that wanted to help me choose the best spouse might very well do something completely different, like hack someone's Waymo to bump my car right when we're both looking for someone new and in the right headspace to have a meet cute. I think that's mostly like a fancier version of a friend trying to set people up by just sitting them next to each other at a dinner party, which I would definitely classify as helping (if done skillfully). Real superintelligences help with butterflies.

If I understand you correctly, you are describing the same sort of thing as I mentioned in the footnote to this comment, yes?

More like: the operationalization of a binding rule is not something even a superintelligence can do for you, even when it gives the correct operationalization. Because if you follow operationalizations merely because they are formally correct and given by a superintelligence, then you follow arbitrary operationalizations, not ones you've decided to follow yourself. How would the superintelligence even know what you decide, if you shirk that responsibility and wait for the superintelligence to tell you?

This post is mostly a reaction to Bostrom's podcasts about his new book. I think making your own decisions is a central example of something that can't be solved by others, no matter how capable, and also this activity is very close to what it means to define values, so plans in the vicinity of external imposition of CEV might be missing the point.

One thought about the rule/exception discussion you've linked (in the next paragraph; this one sets up the framing). Rules/norms are primitives of acausal coordination, especially interesting when they are agents in their own right. They mostly live in other minds, and are occasionally directly incarnated in the world, outside other minds. When a norm lives in many minds, it exerts influence on the world through its decisions made inside its hosts. It has no influence where it has no hosts, and also where the hosts break the norm. Where it does have influence, it speaks in synchrony in many voices, through all of its instances; thus it can have surprising power even when it only weakly compels the minds that host its individual instances.

So there is a distinction between an exception that isn't part of a rule, and an exception that the rule doesn't plan for. Rules are often not very intelligent, so they can fail to plan for most things, and thus suggest stupid decisions that don't take those things into account. This can be patched by adding those things into the rule, hardcoding the knowledge. But not everything needs to be hardcoded in order for the rule to be able to adequately anticipate it in its decision making. This includes hosts (big agents) making an exception to the rule (not following it in a particular situation): some rules are able to anticipate when specifically that happens, and so don't need those conditions to become part of their formulation.