There is a common intuition, or feeling, that our most fundamental goals may be uncertain in some sense. What causes this intuition? For this topic I need to be able to pick out one's top-level goals, roughly one's context-insensitive utility function rather than some task-specific utility function, without implying that those top-level goals can be interpreted in the form of a utility function. Following Eliezer's CFAI paper, I thus choose the word "supergoal" (sorry Eliezer, but I am fond of that old document and its tendency to coin new vocabulary). In what follows, I will naturalistically explore the intuition of supergoal uncertainty.
To posit a model: goal uncertainty (with supergoal uncertainty as an instance) means that you have a weighted distribution over a set of possible goals and a mechanism by which that weight may be redistributed. If we take away the distribution of weights, how can we choose actions coherently, how can we compare? If we take away the weight-redistribution mechanism, we end up with a single goal whose state utilities may be defined as the weighted sum of the constituent goals' utilities; the weight-redistribution mechanism is thus necessary for goal uncertainty to be a distinct concept.
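To make the second half of that concrete, here is a minimal sketch in Python (the goal names, weights, and state fields are invented for illustration, not taken from anywhere): with fixed weights and no redistribution mechanism, an agent choosing over several weighted goals behaves exactly like an agent with the single weighted-sum utility function, so nothing is left of goal uncertainty as a distinct concept.

```python
# Minimal sketch: fixed weights over several goals, no redistribution.
# The mixture below is itself just one ordinary utility function, so an agent
# maximizing it is indistinguishable from an agent with a single goal.
# (All names and numbers here are invented for illustration.)

goals = {  # hypothetical supergoals: name -> utility function over states
    "hedonism":  lambda state: state["pleasure"],
    "curiosity": lambda state: state["novelty"],
}
weights = {"hedonism": 0.7, "curiosity": 0.3}  # fixed, never redistributed

def mixture_utility(state):
    """The single equivalent goal: weighted sum of the constituent utilities."""
    return sum(w * goals[g](state) for g, w in weights.items())

def choose(actions, outcome_of):
    """Pick the action whose outcome maximizes the mixture utility."""
    return max(actions, key=lambda a: mixture_utility(outcome_of(a)))
```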
- Part of the intuition of supergoal uncertainty naturally follows from goal and planning uncertainty. What plan of action is best? How should I construct a local utility function for this context? The weight redistribution mechanism is then a result of gathering more evidence, calculating further, and seeing how this goal links up to one’s supergoals in the context of other plans.
- It could be we are mistaken. The rules society sets up to coordinate behavior carry the default assumption that there is an absolute standard by which to judge a behavior good or bad. Religions, too, often dictate that there is a moral absolute, and even if we aren't religious, the cultural milieu makes us consider the possibility that the concept "should" (and so the existence of a weight-redistribution mechanism) can be validly applied to supergoals.
- It could be we are confused. Our professed supergoal does not necessarily equal our actual supergoal, but neither are they completely separate. So when we review our past behavior and introspect to determine what our supergoal is, we get conflicting evidence that is hard to reconcile with the belief that we have one simple supergoal. Nor can we necessarily endorse the observed supergoal, for social reasons. The difficulty of describing the supergoal is then represented as a weighting over possible supergoals, and the weight-redistribution mechanism corresponds to updating our self-model given additional observations and introspections, along with varying the social context.
- It could be I'm confused :) People, including some part of me, may for game-theoretic reasons wish to present the illusion that we do not know what our supergoals truly are, and also to pretend that those supergoals are open to change after argument.
- It could be we are incoherent. The Allais paradox, hyperbolic discounting, and circular preferences show that no utility function may be defined for people (at least in any simple way). How then may we approximate a person's behavior with a utility function/supergoal? Using a weight distribution and updating (along with some additional interpretation machinery) is a plausible possibility (though an admittedly ugly one). Perhaps supergoal uncertainty is a kludge for describing this incoherent behavior. Our environments, social and physical, enforce consistency constraints upon us, coming close to making us, in isolated contexts, expectation maximizers. Could something like weighting those contexts by the probability of encountering each of them define our individual supergoals (see the sketch after this list)? Ugly, ugly, ugly.
- It could be we predict our supergoals will change with time. Who said people have stable goals? Just look at children versus adults, or the changes people undergo when they gain status or have children. Perhaps the uncertainty has to do with what we predict our future supergoals will be in the face of future circumstances and arguments.
- It could be we discover our supergoals and have uncertainty over what we will discover and what we would eventually get at the limit of our exploration. At one point I had rather limited exposure to the various types of foods but now find I like exploring taste space. At one point I didn’t know computer science but now I enjoy its beauty. At one point I hadn’t yet pursued women but now find it quite enjoyable. Some things we apparently just have to try (or at the very least think about) to discover if we like them.
- It could be we cannot completely separate our anticipations from our goals. If our anticipations are slowly updating, systematically off, and coherent in their effect on reality, then it is easy to mistake the combination of flawed anticipations and one supergoal for an entirely different supergoal.
- It could be we have uncertainty over how to define our very selves. If your self-definition doesn't include irrational behavior, or selfishness, or System 1, or does include the Google overmind, then "your" goals are going to look quite different depending on what you include and exclude. It is also possible that your utility function doesn't depend upon self-definition, or that you are "by definition" your utility function, in which case this question is moot.
- It could be that environmental constraints cause some supergoals to express themselves equivalently to others. Perhaps your supergoal could be forever deferred in order to gain the capability to achieve it ever better (likely a rare situation). Perhaps big-world anthropic negotiation arguments mean you must always distort the achievement of your supergoal. Perhaps the meta-golden rule is in effect and "social" conditions force you to constrain your behavior.
- It could be that there really is a way to decide between supergoals (unlikely but still conceivable) and we don't yet know where that decision process will take us. There could even actually be a meaning of life (i.e., a universally convincing supergoal, given some intelligence preconditions) after all.
- It could be caused by evidential and logical uncertainty. A mind is made of many parts, and there are constraints on how much each part can know about the others, or the whole about its components. To show how this implies a form of supergoal uncertainty, partition a mind along functional lines. Each of these components has its own function; it may not be able to achieve that function without the rest of the components, but it is there nonetheless. Now, if the optimization power embedded in a component is large enough, and the system as a whole has evidential or logical uncertainty about how that component will work, you get the possibility that this functional subcomponent will "optimize" its way towards getting greater weight in the decision process, and hierarchically this proceeds for all subcomponent optimizers. So, in essence, whenever there is evidential or logical uncertainty about the operations of an optimizing subcomponent, we get a supergoal term corresponding to that part, and the weight-redistribution mechanism corresponds to that subcomponent co-opting some weight. Perhaps this concept can even be extended to define supergoals (with uncertainty) for everything from pure expectation maximizers to rocks.
- It could be uncertainty over how to ground out the definition of the supergoal in reality. If I want to maximize paperclips and have just now learned quantum mechanics, do I count a paperclip in a superposition of states once or many times? If I can produce one infinite bunch of paperclips versus another, how do I choose? If my utility function is unbounded in both the positive and negative directions and I do Solomonoff induction, how can I make decisions at all, given that actions may have values that are undefined?
- It could be that some sort of mysterious underlying factor makes the formal concept of a supergoal inappropriate, and it is this mismatched fit that causes uncertainty and weight redistribution. Unification of priors and values in updateless decision theory? Something else? The universe is still unknown enough that we could be mistaken at this level.
- It could be something else entirely.
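Here is the sketch referred to in the "incoherent" bullet above. It is purely my own toy formalization, with invented contexts, probabilities, and utility functions: treat the person as a locally consistent expectation maximizer in each context, and ascribe a single supergoal by weighting each context-local utility function by the probability of encountering that context.

```python
# Toy sketch of the "incoherent" bullet: an agent that is a consistent
# expectation maximizer only within isolated contexts. One (ugly) way to
# ascribe it a single supergoal is to weight each context-local utility
# function by the probability of encountering that context.
# (All contexts, probabilities, and utilities are invented for illustration.)

context_prob = {"work": 0.5, "social": 0.3, "alone": 0.2}

context_utility = {  # locally consistent utility functions, one per context
    "work":   lambda outcome: 2.0 * outcome["status"] + outcome["money"],
    "social": lambda outcome: 3.0 * outcome["status"],
    "alone":  lambda outcome: outcome["comfort"],
}

def ascribed_supergoal(outcome):
    """Aggregate utility: context-local utilities weighted by context frequency."""
    return sum(p * context_utility[c](outcome) for c, p in context_prob.items())
```

This is as ugly as the bullet says: it does not explain why the contexts carve up the way they do, and the ascribed supergoal shifts whenever the environment changes the context frequencies.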
(P.S. I may soon post an exploration of the effects of supergoal uncertainty, in its various reifications, on making decisions. For instance, what implications, if any, does it have for bounded utility functions (and actions that depend on those bounds) and for negative utilitarianism (or, symmetrically, positive utilitarianism)? Also, if anyone knows of related literature, I would be happy to check it out.)
(P.P.S. Dang, the concept of supergoal uncertainty is surprisingly beautiful and fun to explore, and I now have a vague wisp of an idea of how to integrate a subset of these with TDT/UDT.)
The material I have in mind is Chapter 18 of PT:LOS. You can see the section headings on page 8 (numbered vii because the title page is unnumbered) here. One of the section titles is "Outer and Inner Robots"; when rhollerith says 72%, he's giving the outer robot answer. To give an account of how unstable your probability estimates are, you need to give the inner robot answer.
When we receive new evidence, we assign a likelihood function over the probability. (We take the perspective of the inner robot reasoning about what the outer robot will say.) The width of the interval for the probability tells us how narrow the likelihood function has to be to shift the center of that interval by a non-negligible amount.
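To make "how unstable" concrete, here is a rough numerical sketch. Apart from reusing the 72% figure from the thread, the Beta parameterizations and the new-evidence counts are my own invented assumptions, not Jaynes' numbers: two inner-robot states of knowledge that give the same outer-robot answer respond very differently to the same evidence, and the wider one is the more unstable.

```python
# Sketch: two states of knowledge with the same outer-robot answer (0.72),
# modeled as Beta distributions of different widths over the probability.
# The same new evidence moves the wide one much further than the narrow one.
# (The parameters and evidence counts are invented for illustration.)

def beta_mean(a, b):
    return a / (a + b)

def update(a, b, successes, failures):
    """Conjugate Beta update on new Bernoulli evidence."""
    return a + successes, b + failures

states = {
    "narrow (as if ~100 prior observations)": (72.0, 28.0),
    "wide (as if ~10 prior observations)":    (7.2, 2.8),
}
new_successes, new_failures = 2, 8  # hypothetical new evidence

for name, (a, b) in states.items():
    a2, b2 = update(a, b, new_successes, new_failures)
    print(f"{name}: outer answer {beta_mean(a, b):.2f} -> {beta_mean(a2, b2):.2f}")
```

Running this, the narrow state moves from 0.72 to about 0.67 while the wide state drops to 0.46, which is the sense in which the interval width measures instability.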
No.
That is a strange little chapter, but I should note that if you talk about the probability that you will make some future probability estimate, then the distribution of a future probability estimate is a good way of talking about the instability of a state of knowledge, as opposed to talking about the probability of a current probability estimate, which sounds much more like you're doing something wrong.