Less exploitable value-updating agent
My indifferent value learning agent design is in some ways too good. The agent transfer perfectly from u maximisers to v maximisers - but this makes them exploitable, as Benja has pointed out.
For instance, if u values paperclips and v values staples, and everyone knows that the agent will soon transfer from a u-maximiser to a v-maximiser, then an enterprising trader can sell the agent paperclips in exchange for staples, then wait for the utility change, and sell the agent back staples for paperclips, pocketing a profit each time. More prosaically, they could "borrow" £1,000,000 from the agent, promising to pay back £2,000,000 tomorrow if the agent is still a u-maximiser. And the currently u-maximising agent will accept, even though everyone knows it will change to a v-maximiser before tomorrow.
One could argue that exploitability is inevitable, given the change in utility functions. And I haven't yet found any principled way of avoiding exploitability which preserves the indifference. But here is a tantalising quasi-example.
As before, u values paperclips and v values staples. Both are defined in terms of extra paperclips/staples over those existing in the world (and negatively in terms of destruction of existing/staples), with their zero being at the current situation. Let's put some diminishing returns on both utilities: for each paperclips/stables created/destroyed up to the first five, u/v will gain/lose one utilon. For each subsequent paperclip/staple destroyed above five, they will gain/lose one half utilon.
We now construct our world and our agent. The world lasts two days, and has a machine that can create or destroy paperclips and staples for the cost of £1 apiece. Assume there is a tiny ε chance that the machine stops working at any given time. This ε will be ignored in all calculations; it's there only to make the agent act sooner rather than later when the choices are equivalent (a discount rate could serve the same purpose).
The agent owns £10 and has utility function u+Xv. The value of X is unknown to the agent: it is either +1 or -1, with 50% probability, and this will be revealed at the end of the first day (you can imagine X is the output of some slow computation, or is written on the underside of a rock that will be lifted).
So what will the agent do? It's easy to see that it can never get more than 10 utilons, as each £1 generates at most 1 utilon (we really need a unit symbol for the utilon!). And it can achieve this: it will spend £5 immediately, creating 5 paperclips, wait until X is revealed, and spend another £5 creating or destroying staples (depending on the value of X).
This looks a lot like a resource-conserving value-learning agent. I doesn't seem to be "exploitable" in the sense Benja demonstrated. It will still accept some odd deals - one extra paperclip on the first day in exchange for all the staples in the world being destroyed, for instance. But it won't give away resources for no advantage. And it's not a perfect value-learning agent. But it still seems to have interesting features of non-exploitable and value-learning that are worth exploring.
Note that this property does not depend on v being symmetric around staple creation and destruction. Assume v hits diminishing returns after creating 5 staples, but after destroying only 4 of them. Then the agent will have the same behaviour as above (in that specific situation; in general, this will cause a slight change, in that the agent will slightly overvalue having money on the first day compared to the original v), and will expect to get 9.75 utilons (50% chance of 10 for X=+1, 50% chance of 9.5 for X=-1). Other changes to u and v will shift how much money is spent on different days, but the symmetry of v is not what is powering this example.
Is the potential astronomical waste in our universe too small to care about?
In the not too distant past, people thought that our universe might be capable of supporting an unlimited amount of computation. Today our best guess at the cosmology of our universe is that it stops being able to support any kind of life or deliberate computation after a finite amount of time, during which only a finite amount of computation can be done (on the order of something like 10^120 operations).
Consider two hypothetical people, Tom, a total utilitarian with a near zero discount rate, and Eve, an egoist with a relatively high discount rate, a few years ago when they thought there was .5 probability the universe could support doing at least 3^^^3 ops and .5 probability the universe could only support 10^120 ops. (These numbers are obviously made up for convenience and illustration.) It would have been mutually beneficial for these two people to make a deal: if it turns out that the universe can only support 10^120 ops, then Tom will give everything he owns to Eve, which happens to be $1 million, but if it turns out the universe can support 3^^^3 ops, then Eve will give $100,000 to Tom. (This may seem like a lopsided deal, but Tom is happy to take it since the potential utility of a universe that can do 3^^^3 ops is so great for him that he really wants any additional resources he can get in order to help increase the probability of a positive Singularity in that universe.)
You and I are not total utilitarians or egoists, but instead are people with moral uncertainty. Nick Bostrom and Toby Ord proposed the Parliamentary Model for dealing with moral uncertainty, which works as follows:
Suppose that you have a set of mutually exclusive moral theories, and that you assign each of these some probability. Now imagine that each of these theories gets to send some number of delegates to The Parliament. The number of delegates each theory gets to send is proportional to the probability of the theory. Then the delegates bargain with one another for support on various issues; and the Parliament reaches a decision by the delegates voting. What you should do is act according to the decisions of this imaginary Parliament.
It occurred to me recently that in such a Parliament, the delegates would makes deals similar to the one between Tom and Eve above, where they would trade their votes/support in one kind of universe for votes/support in another kind of universe. If I had a Moral Parliament active back when I thought there was a good chance the universe could support unlimited computation, all the delegates that really care about astronomical waste would have traded away their votes in the kind of universe where we actually seem to live for votes in universes with a lot more potential astronomical waste. So today my Moral Parliament would be effectively controlled by delegates that care little about astronomical waste.
Value learning: ultra-sophisticated Cake or Death
Many mooted AI designs rely on "value loading", the update of the AI’s preference function according to evidence it receives. This allows the AI to learn "moral facts" by, for instance, interacting with people in conversation ("this human also thinks that death is bad and cakes are good – I'm starting to notice a pattern here"). The AI has an interim morality system, which it will seek to act on while updating its morality in whatever way it has been programmed to do.
But there is a problem with this system: the AI already has preferences. It is therefore motivated to update its morality system in a way compatible with its current preferences. If the AI is powerful (or potentially powerful) there are many ways it can do this. It could ask selective questions to get the results it wants (see this example). It could ask or refrain from asking about key issues. In extreme cases, it could break out to seize control of the system, threatening or imitating humans so it could give itself the answers it desired.
Avoiding this problem turned out to be tricky. The Cake or Death post demonstrated some of the requirements. If p(C(u)) denotes the probability that utility function u is correct, then the system would update properly if:
Expectation(p(C(u)) | a) = p(C(u)).
Put simply, this means that the AI cannot take any action that could predictably change its expectation of the correctness of u. This is an analogue of the conservation of expected evidence in classical Bayesian updating. If the AI was 50% convinced about u, then it could certainly ask a question that would resolve its doubts, and put p(C(u)) at 100% or 0%. But only as long as it didn't know which moral outcome was more likely.
That formulation gives too much weight to the default action, though. Inaction is also an action, so a more correct formulation would be that for all actions a and b,
Expectation(p(C(u)) | a) = Expectation(p(C(u)) | b).
How would this work in practice? Well, suppose an AI was uncertain between whether cake or death was the proper thing, but it knew that if it took action a:"Ask a human", the human would answer "cake", and it would then update its values to reflect that cake was valuable but death wasn't. However, the above condition means that if the AI instead chose the action b:"don't ask", exactly the same thing would happen.
In practice, this means that as soon as the AI knows that a human would answer "cake", it already knows it should value cake, without having to ask. So it will not be tempted to manipulate humans in any way.
Mutual Worth without default point (but with potential threats)
Though I planned to avoid posting anything more until well after baby, I found this refinement to MWBS yesterday, so I'm posting it while Miriam sleeps during a pause in contractions.
The mutual worth bargaining solution was built from the idea that the true value of a trade is having your utility function access the decision points of the other player. This gave the idea of utopia points: what happens when you are granted complete control over the other person's decisions. This gave a natural 1 to normalise your utility function. But the 0 point is chosen according to a default point. This is arbitrary, and breaks the symmetry between the top and bottom point of the normalisation.
We'd also want normalisations that function well when players have no idea what their opponents will be. This includes not knowing what their utility functions will be. Can we model what a 'generic' opposing utility function would be?
It's tricky, in general, to know what 'value' to put on an opponent's utility function. It's unclear what kind of utilities would you like to see them have? That's because game theory comes into play, with Nash equilibriums, multiple solution concepts, bargaining and threats: there is no universal default to the result of a game between two agents. There are two situations, however, that are respectively better and worse than all others: the situation where your opponent shares your exact utility function, and the situations where they have the negative of that (they're essentially your 'anti-agent').
If your opponent shares your utility function, then there is a clear ideal outcome: act as if you and the opponent were the same person, acting to maximise your joint utility. This is the utopia point for MWBS, which can be standardised to take value 1.
If your opponent has the negative of your utility, then the game is zero-sum: any gain to you is a loss to your opponent, and there is no possibility for mutually pleasing compromise. But zero-sum games also have a single canonical outcome! For zero-sum games, the concepts of Nash equilibrium, minimax, and maximin are all equivalent (and are generally mixed outcomes). The game has a single defined value: each player can guarantee they get as much utility as that value, and the other player can guarantee that they get no more.
It seems natural to normalise that point to -1 (0 would be equivalent, but -1 feels more appropriate). Given this normalisation for each utility, the two utilities can then be summed and joint maximised in the usual way.
This bargaining solution has a lot of attractive features - it's symmetric in minimal and maximal utilities, does not require a default point, reflects the relative power, and captures the spread of opponents utilities that could be encountered without needing to go into game theory. It is vulnerable to (implicit) threats, however! If I can (potentially) cause a lot of damage to you and your cause, then when you normalise your utility, you get penalised because of what your anti-agent could do if they controlled my decision nodes. So just by having the power do do bad stuff to you, I come out better than I would otherwise (and vice-versa, of course).
I feel it's worth exploring further (especially what happens with multiple agents) - but for me, after the baby.
An Attempt at Preference Uncertainty Using VNM
(This is a (possibly perpetual) draft of some work that we (I) did at the Vancouver meetup. Thanks to my meetup buddies for letting me use their brains as supplementary computational substrate. Sorry about how ugly the LaTeX is; is there a way to make this all look a bit nicer?)
(Large swaths of this are obsolete. Thanks for the input, LW!)
The Problem of Decision Under Preference Uncertainty
Suppose you are uncertain whether it is good to eat meat or not. It could be OK, or it could be very bad, but having not done the thinking, you are uncertain. And yet you have to decide what to eat now; is it going to be the tasty hamburger or the morally pure vegetarian salad?
You have multiple theories about your preferences that contradict in their assessment, and you want to make the best decision. How would you decide, even in principle, when you have such uncertainty? This is the problem of Preference Uncertainty.
Preference Uncertainty is a daily fact of life for humans; we simply don't have introspective access to our raw preferences in many cases, but we still want to make the best decisions we can. Just going with our intuitions about what seems most awesome is usually sufficient, but on higher-stakes decisions and theoretical reasoning, we want formal methods with more transparent reasoning processes. We especially like transparent formal methods if we want to create a Friendly AI.
There is unfortunately very little formal analysis of the preference uncertainty problem, and what has been done is incomplete and more philosophical than formal. Nonetheless, there has been some good work in the last few years. I'll refer you to Crouch's thesis if you're interested in that.
Using VNM
I'm going to assume VNM. That is, that rational preferences imply a utility function, and we decide between lotteries, choosing the one with highest expected utility.
The implications here are that the possible moral theories () each have an associated utility function (
) that represents their preferences. Also by VNM, our solution to preference uncertainty is a utility function
.
We are uncertain between moral theories, so we have a probability distribution over moral theories .
To make decisions, we need a way to compute the expected value of some lottery . Each lottery is essentially a probability distribution over the set of possible outcomes
.
Since we have uncertainty over multiple things (), the domain of the final preference structure is both moral theories and outcomes:
.
Now for some conditions. In the degenerate case of full confidence in one moral theory , the overall preferences should agree with that theory:
For some and
representing the degrees of freedom in utility function equality. That condition actually already contains most of the specification of
.
.
So we have a utility function, except for those unknown scaling and offset constants, which undo the arbitrariness in the basis and scale used to define each individual utility function.
Thus overall expectation looks like this:
.
This is still incomplete, though. If we want to get actual decisions, we need to pin down each and
.
Offsets and Scales
You'll see above that the probability distribution over is not dependent on the particular lottery, while
is a function of lottery. This is because I assumed that actions can't change what is right.
With this assumption, the contribution of the 's can be entirely factored out:
.
This makes it obvious that the effect of the 's is an additive constant that affects all lotteries the same way and thus never affects preferences. Thus we can set them to any value that is convenient; for this article, all
.
A similar process allows us to arbitrarily set exactly one of the .
The remaining values of actually affect decisions, so setting them arbitrarily has real consequences. To illustrate, consider the opening example of choosing lunch between a
and
when unsure about the moral status of meat.
Making up some details, we might have and
and
. Importing this into the framework described thus far, we might have the following payoff table:
| Moral Theory | U'(Burger) | U'(Salad) | (P(m)) |
|---|---|---|---|
| Meat OK (meat) | 1 | 0 | (0.7) |
| Meat Bad (veg) | 0 | k_veg | (0.3) |
| (expectation) | 0.7 | 0.3*k_veg | (1) |
We can see that with those probabilities, the expected value of exceeds that of
when
(when
), so the decision hinges on the value of that parameter.
The value of can be interpreted as the "intertheoretic weight" of a utility function candidate for the purposes of intertheoretic value comparisons.
In general, if then you have exactly
missing intertheoretic weights that determine how you respond to situations with preference uncertainty. These could be pinned down if you had
independent equations representing indifference scenarios.
For example, if we had when
, then we would have
, and the above decision would be determined in favor of the
.
Expressing Arbitrary Preferences
Preferences are arbitrary, in the sense that we should be able to want whatever we want to want, so our mathematical constructions should not dictate or limit our preferences. If they do, we should just decide to disagree.
What that means here is that because the values of drive important preferences (like at what probability you feel it is safe to eat meat), the math must leave them unconstrained, to be selected by whatever moral reasoning process it is that selected the candidate utility functions and gave them probabilities in the first place.
We could ignore this idea and attempt to use a "normalization" scheme to pin down the intertheoretic weights from the object level preferences without having to use additional moral reasoning. For example, we could dictate that the "variance" of each candidate utility function equals 1 (with some measure assignment over outcomes), which would divide out the arbitrary scales used to define the candidate utility functions, preventing dominance by arbitrary factors that shouldn't matter.
Consider that any given assignment of intertheoretic weights is equivalent to some set of indifference scenarios (like the one we used above for vegetarianism). For example, the above normalization scheme gives us the indifference scenario when
.
If I find that I am actually indifferent at like above, then I'm out of luck, unable to express this very reasonable preference. On the other hand, I can simply reject the normalization scheme and keep my preferences intact, which I much prefer.
(Notice that the normalization scheme was an unjustifiably privileged hypothesis from the beginning; we didn't argue that it was necessary, we simply pulled it out of thin air for no reason, so its failure was predictable.)
Thus I reassert that the 's are free parameters to be set accordance with our actual intertheoretic preferences, on pain of stupidity. Consider an analogy to the move from ordinal to cardinal utilities; when you add risk, you need more degrees of freedom in your preferences to express how you might respond to that risk, and you need to actually think about what you want those values to be.
Uncertainty Over Intertheoretic Weights
(This section is less solid than the others. Watch your step.)
A weakness in the constructions described so far is that they assume that we have access to perfect knowledge of intertheoretic preferences, even though the whole problem is that we are unable to find perfect knowledge of our preferences.
It seems intuitively that we could have a probability distribution over each . If we do this, making decisions is not much complicated, I think; a simple expectation should still work.
If expectation is the way, the expectation over can be factored out (by linearity or something). Thus in any given decision with fixed preference uncertainties, we can pretend to have perfect knowledge of
.
Despite the seeming triviality of the above idea for dealing with uncertainty over , I haven't formalized it much. We'll see if I figure it out soon, but for now, it would be foolish to make too many assumptions about this. Thus the rest of this article still assumes perfect knowledge of
, on the expectation that we can extend it later.
Learning Values, Among Other Things
Strictly speaking, inference across the is-ought gap is not valid, but we do it every time we act on our moral intuitions, which are just physical facts about our minds. Strictly speaking, inferring future events from past observations (induction) is not valid either, but it doesn't bother us much:
We deal with induction by defining an arbitrary (but good-seeming, on reflection) prior joint probability distribution over observations and events. We can handle the is-ought gap the same way: instead of separate probability distributions over events and moral facts
, we define a joint prior over
. Then learning value is just Bayesian updates on partial observations of
. Note that this prior subsumes induction.
Making decisions is still just maximizing expected utility with our constructions from above, though we will have to be careful to make sure that remains independent of the particular lottery.
The problem of how to define such a prior is beyond the scope of this article. I will note that this "moral prior" idea is the solid foundation on which to base Indirect Normativity schemes like Yudkowsky's CEV and Christiano's boxed philosopher. I will hopefully discuss this further in the future.
Recap
The problem was how to make decisions when you are uncertain about what your object-level preferences should be. To solve it, I assumed VNM, in particular that we have a set of possible utility functions, and we want to construct an overall utility function that does the right thing by those utility functions and their probabilities. The simple condition that the overall utility function should make the common sense choices in cases of moral certainty was sufficient to construct a utility function with a precise set of remaining degrees of freedom. The degrees of freedom being the intertheoretic weight and offset of each utility function candidate.
I showed that the offsets and an overall scale factor are superfluous, in the sense that they never affect the decision if we assume that actions don't affect what is desirable. The remaining intertheoretic weights do affect the decision, and I argued that they are critical to expressing whatever intertheoretic preferences we might want to have.
Uncertainty over intertheoretic weight seems tractable, but the details are still open.
I also mentioned that we can construct a joint distribution that allows us to embed value learning in normal Bayesian learning and induction. This "moral prior" would subsume induction and define how facts about the desirability of things could be inferred from physical observations like the opinions of moral philosophers. In particular, it would provide a solid foundation for Indirect Normativity schemes like CEV. The nature of this distribution is still open.
Open Questions
What are the details of how to deal with uncertainty over the intertheoretic weights? I am looking in particular for construction from an explicit set of reasonable assumptions like the above work, rather than simply pulling a method out of thin air unsupported.
What are the details of the Moral Prior? What is its nature? What implications does it have? What assumptions do we have to make to make it behave reasonably? How do we construct one that could be safely given to a superintelligence. This is going to be a lot of work.
I assumed that it is meaningful to assign probabilities over moral theories. Probability is closely tied up with utility, and probability over epiphenomena like preferences is especially difficult. It remains to be seen how much the framing here actually helps us, or if it effectively just disguises pulling a utility function out of a hat.
Is this at all correct? I should build it and see if it type-checks and does what it's supposed to do.
Three kinds of moral uncertainty
Related to: Moral uncertainty (wiki), Moral uncertainty - towards a solution?, Ontological Crisis in Humans.
Moral uncertainty (or normative uncertainty) is uncertainty about how to act given the diversity of moral doctrines. For example, suppose that we knew for certain that a new technology would enable more humans to live on another planet with slightly less well-being than on Earth[1]. An average utilitarian would consider these consequences bad, while a total utilitarian would endorse such technology. If we are uncertain about which of these two theories are right, what should we do? (LW wiki)
I have long been slightly frustrated by the existing discussions about moral uncertainty that I've seen. I suspect that the reason has been that they've been unclear on what exactly they mean when they say that we are "uncertain about which theory is right" - what is uncertainty about moral theories? Furthermore, especially when discussing things in an FAI context, it feels like several different senses of moral uncertainty get mixed together. Here is my suggested breakdown, with some elaboration:
Descriptive moral uncertainty. What is the most accurate way of describing my values? The classical FAI-relevant question, this is in a sense the most straightforward one. We have some set values, and although we can describe parts of them verbally, we do not have conscious access to the deep-level cognitive machinery that generates them. We might feel relatively sure that our moral intuitions are produced by a system that's mostly consequentialist, but suspect that parts of us might be better described as deontologist. A solution to descriptive moral uncertainty would involve a system capable of somehow extracting the mental machinery that produced our values, or creating a moral reasoning system which managed to produce the same values by some other process.
Epistemic moral uncertainty. Would I reconsider any of my values if I knew more? Perhaps we hate the practice of eating five-sided fruit and think that everyone who eats five-sided fruit should be thrown to jail, but if we found out that five-sided fruit made people happier and had no averse effects, we would change our minds. This roughly corresponds to the "our wish if we knew more, thought faster" part of Eliezer's original CEV description. A solution to epistemic moral uncertainty would involve finding out more about the world.
Intrinsic moral uncertainty. Which axioms should I endorse? We might be intrinsically conflicted between different value systems. Perhaps we are trying to choose whether to be loyal to a friend or whether to act for the common good (a conflict between two forms of deontology, or between deontology and consequentialism), or we could be conflicted between positive and negative utilitarianism. In its purest form, this sense of moral uncertainty closely resembles what would otherwise be called a wrong question, one where
you cannot even imagine any concrete, specific state of how-the-world-is that would answer the question. When it doesn't even seem possible to answer the question.
But unlike wrong questions, questions of intrinsic moral uncertainty are real ones that you need to actually answer in order to make a choice. They are generated when different modules within your brain generate different moral intuitions, and are essentially power struggles between various parts of your mind. A solution to intrinsic moral uncertainty would involve somehow tipping the balance of power in favor of one of the "mind factions". This could involve developing an argument sufficiently persuasive to convince most parts of yourself, or self-modifying in such a way that one of the factions loses its sway over your decision-making. (Of course, if you already knew for certain which faction you wanted to expunge, you wouldn't need to do it in the first place.) I would roughly interpret the "our wish ... if we had grown up farther together" part of CEV to be an attempt to model some of the social influences on our moral intuitions and thereby help resolve cases of intrinsic moral uncertainty.
This is a very preliminary categorization, and I'm sure that it could be improved upon. There also seem to exist cases of moral uncertainty which are hybrids of several categories - for example, ontological crises seem to be mostly about intrinsic moral uncertainty, but to also incorporate some elements of epistemic moral uncertainty. I also have a general suspicion that these categories still don't cut reality that well at the joints, so any suggestions for improvement would be much appreciated.
Cake, or death!
Here we'll look at the famous cake or death problem teasered in the Value loading/learning post.
Imagine you have an agent that is uncertain about its values and designed to "learn" proper values. A formula for this process is that the agent must pick an action a equal to:
- argmaxa∈A Σw∈W p(w|e,a) Σu∈U u(w)p(C(u)|w)
Let's decompose this a little, shall we? A is the set of actions, so argmax of a in A simply means that we are looking for an action a that maximises the rest of the expression. W is the set of all possible worlds, and e is the evidence that the agent has seen before. Hence p(w|e,a) is the probability of existing in a particular world, given that the agent has seen evidence e and will do action a. This is summed over each possible world in W.
And what value do we sum over in each world? Σu∈U u(w)p(C(u)|w). Here U is the set of (normalised) utility functions the agent is considering. In value loading, we don't program the agent with the correct utility function from the beginning; instead we imbue it with some sort of learning algorithm (generally with feedback) so that it can deduce for itself the correct utility function. The expression p(C(u)|w) expresses the probability that the utility u is correct in the world w. For instance, it might cover statements "it's 99% certain that 'murder is bad' is the correct morality, given that I live in a world where every programmer I ask tells me that murder is bad".
The C term is the correctness of the utility function, given whatever system of value learning we're using (note that some moral realists would insist that we don't need a C, that p(u|w) makes sense directly, that we can deduce ought from is). All the subtlety of the value learning is encoded in the various p(C(u)|w): this determines how the agent learns moral values.
So the whole formula can be described as:
- For each possible world and each possible utility function, figure out the utility of that world. Weigh that by the probability that that utility is correct is that world, and by the probability of that world. Then choose the action that maximises the weighted sum of this across all utility functions and worlds.
Naive cake or death

= 783df68a0f980790206b9ea87794c5b6)
Subscribe to RSS Feed
= f037147d6e6c911a85753b9abdedda8d)