Less Wrong is a community blog devoted to refining the art of human rationality.
A putative new idea for AI control; index here.
Many designs for creating AGIs (such as Open-Cog) rely on the AGI deducing moral values as it develops. This is a form of value loading (or value learning), in which the AGI updates its values through various methods, generally including feedback from trusted human sources. This is very analogous to how human infants (approximately) integrate the values of their society.
The great challenge of this approach is that it relies upon an AGI which already has an interim system of values, being able and willing to correctly update this system. Generally speaking, humans are unwilling to easily update their values, and we would want our AGIs to be similar: values that are too unstable aren't values at all.
So the aim is to clearly separate the conditions under which values should be kept stable by the AGI, and conditions when they should be allowed to vary. This will generally be done by specifying criteria for the variation ("only when talking with Mr and Mrs Programmer"). But, as always with AGIs, unless we program those criteria perfectly (hint: we won't) the AGI will be motivated to interpret them differently from how we would expect. It will, as a natural consequence of its program, attempt to manipulate the value updating rules according to its current values.
How could it do that? A very powerful AGI could do the time-honoured "take control of your reward channel", by either threatening humans to give it the moral answer it wants, or replacing humans with "humans" (constructs that pass the programmed requirements of being human, according to the AGI's programming, but aren't actually human in practice) willing to give it these answers. A weaker AGI could instead use social manipulation and leading questioning to achieve the morality it desires. Even more subtly, it could tweak its internal architecture and updating process so that it updates values in its preferred direction (even something as simple as choosing the order in which to process evidence). This will be hard to detect, as a smart AGI might have a much clearer impression of how its updating process will play out in practice than its programmers would.
The problems with value loading have been cast into the various "Cake or Death" problems. We have some idea what criteria we need for safe value loading, but as yet we have no candidates for such a system. This post will attempt to construct one.
Puzzle 1: George mortgages his house to invest in lottery tickets. He wins and becomes a millionaire. Did he make a good choice?
Puzzle 2: The U.S. president questions if he should bluff a nuclear war or concede to the USSR. He bluffs and it just barely works. Although there were several close calls for nuclear catastrophe, everything works out ok. Was this ethical?
One interpretation of consequentialism is that decisions that produce good outcomes are good decisions, rather than decisions that produce good expected outcomes.12 One would be ethical if their actions end up with positive outcomes, disregarding the intentions of those actions. For instance, a terrorist who accidentally foils an otherwise catastrophic terrorist plan would have done a very ‘morally good’ action.3 This general view seems to be surprisingly common.4
This seems intuitively strange to many, it definitely is to me. Instead, ‘expected value’ seems to be a better way of both making decisions and judging the decisions made by others. However, while ‘expected value’ can be useful for individual decision making, I make the case that it is very difficult to use to judge other people’s decisions in a meaningful way.5 This is because ‘expected value’ is typically defined in reference to a specific set of information and intelligence rather than an objective truth about the world.
Two questions to help guide this:
- Should we judge previous actions based on ‘expected’ or ‘actual’ value?
- Should we make future decisions to optimize ‘expected’ or ‘actual’ value?
I believe these are in a sense quite simple, but require some consideration to definitions.6
Optimizing Future Decisions: Actual vs. Expected Value
The second question is the easiest of the two, so I’ll begin with that one. The simple answer is that this is a question of defining ‘expected value’. Once we do so the question kind of goes away.
There is nothing fundamentally different between expected value and actual value. A fairer comparison may be ‘expected value from the perspective of the decision maker’ with ‘expected value from a later, more accurate perspective’.
Expected value converges on actual value with lots of information. Said differently, actual value is expected value with complete information.
In the case of an individual purchasing lottery tickets successfully (Puzzle 1), the ‘actual value’ is still not exact from our point of view. While we may know how much money was won, or what profit was made, we still don’t know what the counterfactual would have been. It is still theoretically possible that in the worlds where George wouldn’t have purchased the lottery tickets, he would have been substantially better off. While the fact that we have imperfect information doesn’t matter too much, I think it demonstrates that presenting a description of the outcome as ‘actual value’ is incomplete. ‘Actual value’ exists only theoretically, even after the fact.7
So this question becomes, then ‘should one make a decision to optimize value using the information and knowledge available to them, or using perfect knowledge and information?’ Obviously, in this case, ‘perfect knowledge’ is inaccessible to them (or the ‘expected value’ and ‘actual value’ would be the same exact thing). I believe it should be quite apparent that in this case, the best one can do (and should do) is make the best decision using their available information.
This question is similar to asking ‘should you drive your car as quickly as your car can drive, or much faster than your car can drive?’ Obviously you may like to drive faster, but that’s by definition not an option. Another question: ‘should you do well in life or should you become an all-powerful dragon king?’
Judging Previous Decisions: Actual vs. Expected Value
Judging previous decisions can get tricky.
Let’s study the lottery example again. A person purchases a lottery ticket and wins. For simplicity, let’s say the decision to purchase the ticket was done only to optimize money.
The question is, what is the expected value of purchasing the lottery ticket? How does this change depending on information and knowledge?
In general purchasing a lottery ticket can be expected to be a net loss in earnings, and thus a bad decision. However, if one was sure they would win, it would be a pretty good idea. Given the knowledge that the player won, the player made a good decision. Winning the lottery clearly is better than not playing once.
More interesting is considering the limitation not in information about the outcome but in knowledge of probability. Say the player thought that they were likely to win the lottery, and that it was a good purchase. This may seem insane to someone familiar with probability and the lottery system, but not everyone is familiar with these things.
From the point of view of the player, the lottery ticket purchase had net-positive utility. From the point of view of a person with knowledge of the lottery and/or statistics, the purchase had net-negative utility. From the point of view of either of these two groups, once they know that the lottery ticket will win, it was a net-positive decision.
| Reference point | No Knowledge of Outcome | Knowledge of Outcome |
|---|---|---|
| ‘Intelligent’ Person with Knowledge of Probability | Negative | Positive |

Expected Value of purchasing a Lottery Ticket from different Reference Points
To make things a bit more interesting, imagine that there’s a genius out there with a computer simulation of our exact universe. This person can tell which lottery ticket will win in advance because they can run the simulations. To this ‘genius’ it’s obvious that the purchase is a net-positive outcome.
So what is the expected value of purchasing the lottery ticket? The answer is that the ‘expected value’ is completely dependent on the ‘reference frame’, or a specific set of information and intelligence. From the reference frame of the ‘intelligent person’ this was low in expected value, so was a bad decision. From that of the genius, it was a good decision. And from the player, a good decision.
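The dependence on reference frame can be sketched in a few lines. This is only an illustration: the prize, ticket cost, and the three believed probabilities are hypothetical numbers chosen for the example, not figures from any real lottery.

```python
def expected_value(p_win: float, prize: float, ticket_cost: float) -> float:
    """Expected monetary gain of one ticket, given a believed win probability."""
    return p_win * prize - ticket_cost


# Hypothetical numbers, chosen only for illustration.
PRIZE = 1_000_000.0
COST = 2.0

# Each reference frame is modelled as nothing more than a believed
# probability of winning.
frames = {
    "player": 1e-3,              # overestimates the odds
    "intelligent_person": 1e-7,  # knows the (assumed) true odds
    "knows_outcome": 1.0,        # already knows this ticket wins
}


def evaluate_frames(frames, prize=PRIZE, cost=COST):
    """Map each reference frame to the expected value it assigns."""
    return {name: expected_value(p, prize, cost) for name, p in frames.items()}


results = evaluate_frames(frames)
for name, ev in results.items():
    verdict = "positive" if ev > 0 else "negative"
    print(f"{name}: expected value is {verdict}")
```

The same purchase comes out positive or negative depending only on the probability the evaluator plugs in, which is the sense in which ‘expected value’ has no meaning without a reference frame.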
So how do we judge this poor (well, soon rich) lottery player? They made a good decision respective to the results, respective to the genius, and compared to their own knowledge. Should we say ‘oh, this person should have had slightly more knowledge, but not too much knowledge, and thus they made a bad choice’? What does that even mean?
Perhaps we could judge the player for not reading up on lottery facts before playing. Wasn’t it irresponsible to fall for such a simple fallacy? Or perhaps the person was ‘lazy’ not to learn probability in the first place.
Well, things like these seem like intuitions to me. We may have the intuition that the lottery is a poor choice. We may find facts to prove this intuition accurate. But the gambler may not have these intuitions. It seems unfair to consider any intuitions ‘obvious’ to those who do not share them.
One might also say that the gambler probably knew it was a bad idea, but let his or her ‘inner irrationalities’ control the decision process. Perhaps they were trying to take an ‘easy way out’ of some sort. However, these seem quite judgmental as well. If a person experiences strong emotional responses (fear, anger, laziness), those inner struggles would change their expected value calculation. It might be a really bad, heuristically-driven ‘calculation’, but it would be the best they would have at that time.
Free Will Bounded Expected Value
We are getting to the question of free will and determinism. After all, if there is any sort of free will, perhaps we have the ability to make decisions that are sub-optimal by our expected value functions. Perhaps we commonly do so (else it wouldn’t be much in the sense of ‘free’ will.)
This would be interesting because it would imply an ‘expected result’ that the person should have calculated, even if they didn’t actually do so. We need to understand the person’s actions and understanding, not in terms of what we know, or what they knew, but what they should have figured out given their knowledge.
This would require a very well specified Free Will Boundary of some sort. A line around a few thought processes, parts of the brain, and resource constraints, which could produce a thereby optimal expected result calculation. Anything less than this ‘optimal given Free Will Boundary’ expected value calculation would be fair game for judging.
Conclusion: Should we Even Judge People or Decisions Anyway?
So, deciding to make future decisions based on expected value seems reasonable. The main question in this essay, the harder question, is whether we can judge previous decisions based on their respective expected values, and how we could come up with the relevant expected values to do so.
I think that we naturally judge people. We have old and modern heroes and villains. Judging people is simply something that humans do. However, I believe that on close inspection this is very challenging if not impossible to do reasonably and precisely.
Perhaps we should attempt to stop placing so much emphasis on individualism and just try to do the best we can while not judging other people or their decisions much. Considerations of judging may be interesting, but the main takeaway may be the complexity itself, indicating that judgements are very subjective and incredibly messy.
That said, it can still be useful to analyze previous decisions or individuals. That seems like one of the best ways to update our priors of the world. We just need to remember not to treat it personally.
This is assuming the terrorists are trying to produce ‘disutility’ or a value separate from ‘utility’. I feel like from their perspective, maximizing an intrinsic value dissimilar from our notion of utility would be maximizing ‘expected value’. But analyzing the morality of people with alternative value systems is a very different matter. ↩
These people tend not to like consequentialism much. ↩
I don’t want to impose what I deem to be a false individualistic appeal, so consider this to mean that one would have a difficult time judging anyone at any time except for their spontaneous consciousness. ↩
I bring them up because they are what I considered and have talked to others about before understanding what makes them frustrating to answer. Basically, they are nice starting points for getting towards answering the questions that were meant to be asked instead. ↩
This is true for essentially all physical activities. Thought experiments or very simple simulations may be exempt. ↩
Related: Pinpointing Utility
If I ever say "my utility function", you could reasonably accuse me of cargo-cult rationality; trying to become more rational by superficially imitating the abstract rationalists we study makes about as much sense as building an air traffic control station out of grass to summon cargo planes.
There are two ways an agent could be said to have a utility function:
It could behave in accordance with the VNM axioms; always choosing in a sane and consistent manner, such that "there exists a U". The agent need not have an explicit representation of U.
It could have an explicit utility function that it tries to expected-maximize. The agent need not perfectly follow the VNM axioms all the time. (Real bounded decision systems will take shortcuts for efficiency and may not achieve perfect rationality, like how real floating point arithmetic isn't associative).
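A minimal sketch of the second sense, an agent with an explicit utility function that it expected-maximizes, might look like the following. The function names, the logarithmic utility, and the two lotteries are all hypothetical choices made for illustration. The last lines show the floating-point non-associativity mentioned above.

```python
import math


def expected_utility(lottery, utility):
    """lottery: list of (probability, outcome) pairs; utility: outcome -> float."""
    return sum(p * utility(outcome) for p, outcome in lottery)


def choose(lotteries, utility):
    """Return the name of the lottery with the highest expected utility."""
    return max(lotteries, key=lambda name: expected_utility(lotteries[name], utility))


def log_utility(wealth):
    """Hypothetical explicit utility function: logarithmic in wealth (risk-averse)."""
    return math.log(wealth)


# Hypothetical lotteries over final wealth.
lotteries = {
    "safe": [(1.0, 100.0)],
    "gamble": [(0.5, 10.0), (0.5, 200.0)],
}

# The risk-averse agent and a risk-neutral agent (utility = wealth) disagree.
best_log = choose(lotteries, log_utility)
best_linear = choose(lotteries, lambda wealth: wealth)

# Real floating-point addition is not associative: grouping changes the result.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(best_log, best_linear, left == right)
```

The agent here is fully specified by `utility`; swapping in a different function changes its choice, while the maximization machinery stays the same.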
Neither of these is true of humans. Our behaviour and preferences are not consistent and sane enough to be VNM, and we are generally quite confused about what we even want, never mind having reduced it to a utility function. Nevertheless, you still see the occasional reference to "my utility function".
Sometimes "my" refers to "abstract me who has solved moral philosophy and/or become perfectly rational", which at least doesn't run afoul of the math, but is probably still wrong about the particulars of what such an abstract idealized self would actually want. But other times it's a more glaring error like using "utility function" as shorthand for "entire self-reflective moral system", which may not even be VNMish.
But this post isn't really about all the ways people misuse terminology, it's about where we're actually at on the whole problem for which a utility function might be the solution.
As above, I don't think any of us have a utility function in either sense; we are not VNM, and we haven't worked out what we want enough to make a convincing attempt at trying. Maybe someone out there has a utility function in the second sense, but I doubt that it actually represents what they would want.
Perhaps then we should speak of what we want in terms of "terminal values"? For example, I might say that it is a terminal value of mine that I should not murder, or that freedom from authority is good.
But what does "terminal value" mean? Usually, it means that the value of something is not contingent on or derived from other facts or situations, like for example, I may value beautiful things in a way that is not derived from what they get me. The recursive chain of valuableness terminates at some set of values.
There's another connotation, though, which is that your terminal values are akin to axioms; not subject to argument or evidence or derivation, and simply given, that there's no point in trying to reconcile them with people who don't share them. This is the meaning people are sometimes getting at when they explain failure to agree with someone as "terminal value differences" or "different set of moral axioms". This is completely reasonable, if and only if that is in fact the nature of the beliefs in question.
About two years ago, it very much felt like freedom from authority was a terminal value for me. Those hated authoritarians and fascists were simply wrong, probably due to some fundamental neurological fault that could not be reasoned with. The very prototype of "terminal value differences".
And yet here I am today, having been reasoned out of that "terminal value", such that I even appreciate a certain aesthetic in bowing to a strong leader.
If that was a terminal value, I'm afraid the term has lost much of its meaning to me. If it was not, if even the most fundamental-seeming moral feelings are subject to argument, I wonder if there is any coherent sense in which I could be said to have terminal values at all.
The situation here with "terminal values" is a lot like the situation with "beliefs" in other circles. Ask someone what they believe in most confidently, and they will take the opportunity to differentiate themselves from the opposing tribe on uncertain controversial issues; god exists, god does not exist, racial traits are genetic, race is a social construct. The pedant answer of course is that the sky is probably blue, and that that box over there is about a meter long.
Likewise, ask someone for their terminal values, and they will take the opportunity to declare that those hated greens are utterly wrong on morality, and blueness is wired into their very core, rather than the obvious things like beauty and friendship being valuable, and paperclips not.
So besides not having a utility function, those aren't your terminal values. I'd be surprised if even the most pedantic answer weren't subject to argument; I don't seem to have anything like a stable and non-negotiable value system at all, and I don't think that I am even especially confused relative to the rest of you.
Instead of a nice consistent value system, we have a mess of intuitions and heuristics and beliefs that often contradict, fail to give an answer, and change with time and mood and memes. And that's all we have. One of the intuitions is that we want to fix this mess.
People have tried to do this "Moral Philosophy" thing before, myself included, but it hasn't generally turned out well. We've made all kinds of overconfident leaps to what turn out to be unjustified conclusions (utilitarianism, egoism, hedonism, etc), or just ended up wallowing in confused despair.
The zeroth step in solving a problem is to notice that we have a problem.
The problem here, in my humble opinion, is that we have no idea what we are doing when we try to do Moral Philosophy. We need to go up a meta-level and get a handle on Moral MetaPhilosophy. What's the problem? What are the relevant knowns? What are the unknowns? What's the solution process?
Ideally, we could do for Moral Philosophy approximately what Bayesian probability theory has done for Epistemology. My moral intuitions are a horrible mess, but so are my epistemic intuitions, and yet we more-or-less know what we are doing in epistemology. A problem like this has been solved before, and this one seems solvable too, if a bit harder.
It might be that when we figure this problem out to the point where we can be said to have a consistent moral system with real terminal values, we will end up with a utility function, but on the other hand, we might not. Either way, let's keep in mind that we are still on rather shaky ground, and at least refrain from believing the confident declarations of moral wisdom that we so like to make.
Moral Philosophy is an important problem, but the way is not clear yet.
Nick_Beckstead asked me to link to posts I referred to in this comment. I should put up or shut up, so here's an attempt to give an organized overview of them.
Since I wrote these, LukeProg has begun tackling some related issues. He has accomplished the seemingly-impossible task of writing many long, substantive posts none of which I recall disagreeing with. And I have, irrationally, not read most of his posts. So he may have dealt with more of these same issues.
I think that I only raised Holden's "objection 2" in comments, which I couldn't easily dig up; and in a critique of a book chapter, which I emailed to LukeProg and did not post to LessWrong. So I'm only going to talk about "Objection 1: It seems to me that any AGI that was set to maximize a "Friendly" utility function would be extraordinarily dangerous." I've arranged my previous posts and comments on this point into categories. (Much of what I've said on the topic has been in comments on LessWrong and Overcoming Bias, and in email lists including SL4, and isn't here.)
The concept of "human values" cannot be defined in the way that FAI presupposes
Human errors, human values: Suppose all humans shared an identical set of values, preferences, and biases. We cannot retain human values without retaining human errors, because there is no principled distinction between them.
A comment on this post: There are at least three distinct levels of human values: The values an evolutionary agent holds that maximize their reproductive fitness, the values a society holds that maximize its fitness, and the values a rational optimizer holds who has chosen to maximize social utility. They often conflict. Which of them are the real human values?
Values vs. parameters: Eliezer has suggested using human values, but without time discounting (= changing the time-discounting parameter). CEV presupposes that we can abstract human values and apply them in a different situation that has different parameters. But the parameters are values. There is no distinction between parameters and values.
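To see why a time-discounting parameter behaves like a value, consider a toy agent choosing between two reward streams. This is a sketch under assumptions: the exponential discounting form, the option names, and the reward sizes are all hypothetical and are not taken from the post or from CEV.

```python
def discounted_value(rewards, discount):
    """rewards: list of (time_step, reward); discount: per-step factor in (0, 1]."""
    return sum(reward * discount ** t for t, reward in rewards)


def preferred(options, discount):
    """Return the name of the option with the highest discounted value."""
    return max(options, key=lambda name: discounted_value(options[name], discount))


# Hypothetical options: a small reward now versus a larger one five steps later.
options = {
    "small_now": [(0, 10.0)],
    "large_later": [(5, 15.0)],
}

# Same "values" (the rewards), different discount parameter, different choice.
with_discounting = preferred(options, discount=0.9)
without_discounting = preferred(options, discount=1.0)
print(with_discounting, without_discounting)
```

The rewards are untouched between the two calls. Only the parameter changes, and the agent's revealed preference flips with it.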
A comment on "Incremental progress and the valley": The "values" that our brains try to maximize in the short run are designed to maximize different values for our bodies in the long run. Which are human values: The motivations we feel, or the effects they have in the long term? LukeProg's post Do Humans Want Things? makes a related point.
Group selection update: The reason I harp on group selection, besides my outrage at the way it's been treated for the past 50 years, is that group selection implies that some human values evolved at the group level, not at the level of the individual. This means that increasing the rationality of individuals may enable people to act more effectively in their own interests, rather than in the group's interest, and thus diminish the degree to which humans embody human values. Identifying the values embodied in individual humans - supposing we could do so - would still not arrive at human values. Transferring human values to a post-human world, which might contain groups at many different levels of a hierarchy, would be problematic.
I wanted to write about my opinion that human values can't be divided into final values and instrumental values, the way discussion of FAI presumes they can. This is an idea that comes from mathematics, symbolic logic, and classical AI. A symbolic approach would probably make proving safety easier. But human brains don't work that way. You can and do change your values over time, because you don't really have terminal values.
Strictly speaking, it is impossible for an agent whose goals are all indexical goals describing states involving itself to have preferences about a situation in which it does not exist. Those of you who are operating under the assumption that we are maximizing a utility function with evolved terminal goals should, I think, admit that these terminal goals all involve either ourselves or our genes. If they involve ourselves, then utility functions based on these goals cannot even be computed once we die. If they involve our genes, then they are goals that our bodies are pursuing, that we call errors, not goals, when we the conscious agent inside our bodies evaluate them. In either case, there is no logical reason for us to wish to maximize some utility function based on these after our own deaths. Any action I wish to take regarding the distant future necessarily presupposes that the entire SIAI approach to goals is wrong.
My view, under which it does make sense for me to say I have preferences about the distant future, is that my mind has learned "values" that are not symbols, but analog numbers distributed among neurons. As described in "Only humans can have human values", these values do not exist in a hierarchy with some at the bottom and some on the top, but in a recurrent network which does not have a top or a bottom, because the different parts of the network developed simultaneously. These values therefore can't be categorized into instrumental or terminal. They can include very abstract values that don't need to refer specifically to me, because other values elsewhere in the network do refer to me, and this will ensure that actions I finally execute incorporating those values are also influenced by my other values that do talk about me.
Even if human values existed, it would be pointless to preserve them
- The only preferences that can be unambiguously determined are the preferences a person (mind+body) implements, which are not always the preferences expressed by their beliefs.
- If you extract a set of consciously-believed propositions from an existing agent, then build a new agent to use those propositions in a different environment, with an "improved" logic, you can't claim that it has the same values, since it will behave differently.
- Values exist in a network of other values. A key ethical question is to what degree values are referential (meaning they can be tested against something outside that network); or non-referential (and hence relative).
- Supposing that values are referential helps only by telling you to ignore human values.
- You cannot resolve the problem by combining information from different behaviors, because the needed information is missing.
- Today's ethical disagreements are largely the result of attempting to extrapolate ancestral human values into a changing world.
- The future will thus be ethically contentious even if we accurately characterize and agree on present human values, because these values will fail to address the new important problems.
Human values differ as much as values can differ: There are two fundamentally different categories of values:
- Non-positional, mutually-satisfiable values (physical luxury, for instance)
- Positional, zero-sum social values, such as wanting to be the alpha male or the homecoming queen
All mutually-satisfiable values have more in common with each other than they do with any non-mutually-satisfiable values, because mutually-satisfiable values are compatible with social harmony and non-problematic utility maximization, while non-mutually-satisfiable values require eternal conflict. If you find an alien life form from a distant galaxy with non-positional values, it would be easier to integrate those values into a human culture with only human non-positional values, than to integrate already-existing positional human values into that culture.
It appears that some humans have mainly the one type, while other humans have mainly the other type. So talking about trying to preserve human values is pointless - the values held by different humans have already passed the most-important point of divergence.
Enforcing human values would be harmful
The human problem: This argues that the qualia and values we have now are only the beginning of those that could evolve in the universe, and that ensuring that we maximize human values - or any existing value set - from now on, will stop this process in its tracks, and prevent anything better from ever evolving. This is the most-important objection of all.
Re-reading this, I see that the critical paragraph is painfully obscure, as if written by Kant; but it summarizes the argument: "Once the initial symbol set has been chosen, the semantics must be set in stone for the judging function to be "safe" for preserving value; this means that any new symbols must be defined completely in terms of already-existing symbols. Because fine-grained sensory information has been lost, new developments in consciousness might not be detectable in the symbolic representation after the abstraction process. If they are detectable via statistical correlations between existing concepts, they will be difficult to reify parsimoniously as a composite of existing symbols. Not using a theory of phenomenology means that no effort is being made to look for such new developments, making their detection and reification even more unlikely. And an evaluation based on already-developed values and qualia means that even if they could be found, new ones would not improve the score. Competition for high scores on the existing function, plus lack of selection for components orthogonal to that function, will ensure that no such new developments last."
Averaging value systems is worse than choosing one: This describes a neural-network that encodes preferences, and takes some input pattern and computes a new pattern that optimizes these preferences. Such a system is taken as analogous for a value system and an ethical system to attain those values. I then define a measure for the internal conflict produced by a set of values, and show that a system built by averaging together the parameters from many different systems will have higher internal conflict than any of the systems that were averaged together to produce it. The point is that the CEV plan of "averaging together" human values will result in a set of values that is worse (more self-contradictory) than any of the value systems it was derived from.
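A much simpler phenomenon points in the same direction, and it is worth naming plainly: the sketch below is not the neural-network model or the conflict measure described above, but the classic Condorcet paradox under pairwise majority voting. The three rankings and all function names are hypothetical. Each individual value system is perfectly consistent, yet their aggregate contains a preference cycle.

```python
from itertools import combinations


def majority_prefers(rankings, a, b):
    """True if more than half of the rankings place a above b."""
    votes = sum(r.index(a) < r.index(b) for r in rankings)
    return votes > len(rankings) / 2


def aggregate(rankings):
    """Pairwise-majority relation as a set of (winner, loser) pairs."""
    relation = set()
    for a, b in combinations(rankings[0], 2):
        if majority_prefers(rankings, a, b):
            relation.add((a, b))
        elif majority_prefers(rankings, b, a):
            relation.add((b, a))
    return relation


def has_cycle(relation):
    """Detect a preference cycle with a depth-first search."""
    graph = {}
    for winner, loser in relation:
        graph.setdefault(winner, []).append(loser)

    def visit(node, path):
        if node in path:
            return True
        return any(visit(nxt, path | {node}) for nxt in graph.get(node, []))

    return any(visit(start, frozenset()) for start in graph)


# Three hypothetical value systems, each a consistent ranking (best first).
rankings = [["A", "B", "C"], ["B", "C", "A"], ["C", "A", "B"]]

individually_consistent = all(not has_cycle(aggregate([r])) for r in rankings)
combined_has_cycle = has_cycle(aggregate(rankings))
print(individually_consistent, combined_has_cycle)
```

Majority aggregation is only one way of combining value systems, so this does not establish the claim about parameter averaging. It shows only that a combined system can be self-contradictory when none of its inputs are.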
A point I may not have made in these posts, but made in comments, is that the majority of humans today think that women should not have full rights, homosexuals should be killed or at least severely persecuted, and nerds should be given wedgies. These are not incompletely-extrapolated values that will change with more information; they are values. Opponents of gay marriage make it clear that they do not object to gay marriage based on a long-range utilitarian calculation; they directly value not allowing gays to marry. Many human values horrify most people on this list, so they shouldn't be trying to preserve them.
In poetic terms, our coherent extrapolated volition is our wish if we knew more, thought faster, were more the people we wished we were, had grown up farther together; where the extrapolation converges rather than diverges, where our wishes cohere rather than interfere; extrapolated as we wish that extrapolated, interpreted as we wish that interpreted.
— Eliezer Yudkowsky, May 2004, Coherent Extrapolated Volition
Foragers versus industry era folks
Consider the difference between a hunter-gatherer, who cares about his hunting success and about becoming the new tribal chief, and a modern computer scientist who wants to determine if a “sufficiently large randomized Conway board could turn out to converge to a barren ‘all off’ state.”
The utility of success in hunting down animals or in proving abstract conjectures about cellular automata is largely determined by factors such as your education, culture and environmental circumstances. The same forager who cared to kill a lot of animals, to get the best ladies in his clan, might under different circumstances have turned out to be a vegetarian mathematician solely caring about his understanding of the nature of reality. Both sets of values are to some extent mutually exclusive or at least disjoint. Yet both sets of values are what the person wants, given the circumstances. Change the circumstances dramatically and you change the person’s values.
What do you really want?
You might conclude that what the hunter-gatherer really wants is to solve abstract mathematical problems, he just doesn’t know it. But there is no set of values that a person “really” wants. Humans are largely defined by the circumstances they reside in.
- If you already knew a movie, you wouldn’t watch it.
- To be able to get your meat from the supermarket changes the value of hunting.
If “we knew more, thought faster, were more the people we wished we were, and had grown up closer together” then we would stop desiring what we had learnt, wish to think even faster, become different people yet again, and get bored of and rise up from the people similar to us.
A singleton is an attractor
Many of our values and goals, the things we want, are culturally induced or the result of our ignorance. Reduce our ignorance and you change our values. One trivial example is our intellectual curiosity. If we don’t need to figure out what we want on our own, our curiosity is impaired.
A singleton won’t extrapolate human volition but will implement an artificial set of values as a result of abstract high-order contemplations about rational conduct.
With knowledge comes responsibility, with wisdom comes sorrow
Knowledge changes and introduces terminal goals. The toolkit that is called ‘rationality’, the rules and heuristics developed to help us to achieve our terminal goals, is also altering and deleting them. A stone age hunter-gatherer seems to possess very different values than we do. Learning about rationality and various ethical theories such as Utilitarianism would alter those values considerably.
Rationality was meant to help us achieve our goals, e.g. become a better hunter. Rationality was designed to tell us what we ought to do (instrumental goals) to achieve what we want to do (terminal goals). Yet what actually happens is that we are told, that we will learn, what we ought to want.
If an agent becomes more knowledgeable and smarter then this does not leave its goal-reward-system intact if it is not especially designed to be stable. An agent who originally wanted to become a better hunter and feed his tribe would end up wanting to eliminate poverty in Obscureistan. The question is, how much of this new “wanting” is the result of using rationality to achieve terminal goals and how much is a side-effect of using rationality, how much is left of the original values versus the values induced by a feedback loop between the toolkit and its user?
Take for example an agent that is facing the Prisoner’s dilemma. Such an agent might originally tend to cooperate and only after learning about game theory decide to defect and gain a greater payoff. Was it rational for the agent to learn about game theory in the sense that it helped the agent to achieve its goal, or in the sense that it deleted one of its goals in exchange for an allegedly more “valuable” goal?
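To make the game-theoretic step concrete, here is a minimal sketch of a one-shot Prisoner's dilemma. The payoff numbers and the names `PAYOFF`, `best_response` and `joint_payoff` are illustrative assumptions, not taken from the post. It shows that defecting is the best reply to either move, even though mutual cooperation gives the larger joint payoff.

```python
# Hypothetical payoffs as (mine, theirs); higher is better. "C" = cooperate, "D" = defect.
PAYOFF = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def best_response(opponent_move):
    """Return the move that maximizes my own payoff against a fixed opponent move."""
    return max("CD", key=lambda mine: PAYOFF[(mine, opponent_move)][0])


def joint_payoff(mine, theirs):
    """Sum of both players' payoffs for a pair of moves."""
    return sum(PAYOFF[(mine, theirs)])


for opponent_move in "CD":
    print(f"opponent plays {opponent_move}: best reply is {best_response(opponent_move)}")
print("joint payoff if both cooperate:", joint_payoff("C", "C"))
print("joint payoff if both defect:", joint_payoff("D", "D"))
```

Under these assumed numbers, the agent that learns game theory gains individually against a cooperator, while the value it originally placed on cooperating drops out of the calculation entirely, which is the tension the question above points at.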
Beware rationality as a purpose in and of itself
It seems to me that becoming more knowledgeable and smarter is gradually altering our utility functions. But what is it that we are approaching if the extrapolation of our volition becomes a purpose in and of itself? Extrapolating our coherent volition will distort or alter what we really value by installing a new cognitive toolkit designed to achieve an equilibrium between us and other agents with the same toolkit.
Would a singleton be a tool that we can use to get what we want or would the tool use us to do what it does, would we be modeled or would it create models, would we be extrapolating our volition or rather follow our extrapolations?
(This post is a write-up of a previous comment, intended to receive feedback from a larger audience.)
Frequently, we decide on a goal, and then we are ineffective in working towards this goal, due to factors wholly within our control. Failure modes include giving up, losing interest, procrastination, akrasia, and failure to evaluate return on time. In all these cases it seems that if our motivation were higher, the problem would not exist. Call the problem of finding the motivation to effectively pursue one's goals, the problem of motivation. This is a common failure of instrumental rationality which has been discussed from numerous different angles on LessWrong.
I wish to introduce another approach to the problem of motivation, which to my knowledge has not yet been discussed on LessWrong. This approach is summarized in the following paragraph:
We do not know what we value. Therefore, we choose goals that are not in harmony with our values. The problem of motivation is often caused by our goals not being in harmony with our values. Therefore, many cases of the problem of motivation can be solved by discovering what you value, and carrying out goals that conform to your values.
In Are Wireheads Happy? I discussed the difference between wanting something and liking something. More recently, Luke went deeper into some of the science in his post Not for the Sake of Pleasure Alone.
In the comments of the original post, cousin_it asked a good question: why implement a mind with two forms of motivation? What, exactly, are "wanting" and "liking" in mind design terms?
Tim Tyler and Furcas both gave interesting responses, but I think the problem has a clear answer in a reinforcement learning perspective (warning: formal research on the subject does not take this view and sticks to the "two different systems of different evolutionary design" theory). "Liking" is how positive reinforcement feels from the inside; "wanting" is how the motivation to do something feels from the inside. Things that are positively reinforced generally motivate you to do more of them, so liking and wanting often co-occur. With more knowledge of reinforcement, we can begin to explore why they might differ.
CONTEXT OF REINFORCEMENT
Reinforcement learning doesn't just connect single stimuli to responses. It connects stimuli in a context to responses. Munching popcorn at a movie might be pleasant; munching popcorn at a funeral will get you stern looks at best.
In fact, lots of people eat popcorn at a movie theater and almost nowhere else. Imagine them, walking into that movie theater and thinking "You know, I should have some popcorn now", maybe even having a strong desire for popcorn that overrides the diet they're on - and yet these same people could walk into, I don't know, a used car dealership and that urge would be completely gone.
These people have probably eaten popcorn at a movie theater before and liked it. Instead of generalizing to "eat popcorn", their brain learned the lesson "eat popcorn at movie theaters". Part of this no doubt has to do with the easy availability of popcorn there, but another part probably has to do with context-dependent reinforcement.
I like pizza. When I eat pizza, and get rewarded for eating pizza, it's usually after smelling the pizza first. The smell of pizza becomes a powerful stimulus for the behavior of eating pizza, and I want pizza much more after smelling it, even though how much I like pizza remains constant. I've never had pizza at breakfast, and in fact the context of breakfast is directly competing with my normal stimuli for eating pizza; therefore, no matter how much I like pizza, I have no desire to eat pizza for breakfast. If I did have pizza for breakfast, though, I'd probably like it.
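The popcorn and pizza examples can be sketched as a toy learner that stores "wanting" per (context, action) pair, while "liking" is just the reward experienced when the action is taken. This is a simplified illustration under assumed names and numbers (`train`, `ALPHA`, `LIKING_POPCORN`), not a model from the research literature.

```python
from collections import defaultdict

ALPHA = 0.5           # hypothetical learning rate
LIKING_POPCORN = 1.0  # hypothetical reward felt when eating popcorn, in any context


def train(episodes, alpha=ALPHA):
    """Learn a 'wanting' value for each (context, action) pair from experienced rewards."""
    wanting = defaultdict(float)  # unseen pairs default to 0.0
    for context, action, reward in episodes:
        key = (context, action)
        # Move the stored value a fraction alpha toward the reward just received.
        wanting[key] += alpha * (reward - wanting[key])
    return wanting


# Popcorn has only ever been eaten (and liked) at the movie theater.
episodes = [("movie theater", "eat popcorn", LIKING_POPCORN)] * 5
wanting = train(episodes)

print(wanting[("movie theater", "eat popcorn")])   # high: reinforced in this context
print(wanting[("car dealership", "eat popcorn")])  # zero: never reinforced here
```

The liking term is the same constant everywhere; only the learned wanting differs by context, which is the distinction the examples above draw.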
If an activity is intermittently reinforced, with occasional rewards spread among more common neutral stimuli or even small punishments, it may be motivating but unpleasant.
Imagine a beginning golfer. He gets bogeys or double bogeys on each hole, and is constantly kicking himself, thinking that if only he'd used one club instead of the other, he might have gotten that one. After each game, he can't believe that after all his practice, he's still this bad. But every so often, he does get a par or a birdie, and thinks he's finally got the hang of things, right until he fails to repeat it on the next hole, or the hole after that.
This is a variable-ratio schedule, Skinner's most addictive form of delivering reinforcement. The golfer may keep playing, maybe because he constantly thinks he's on the verge of figuring out how to improve his game, but he might not like it. The same is true for gamblers, who think the next pull of the slot machine might be the jackpot (and who falsely believe they can discover a secret in the game that will change their luck); they don't like sitting around losing money, but they may stick with it so that they don't leave right before they reach the point where their luck changes.
SMALL-SCALE DISCOUNT RATES
Even if we like something, we may not want to do it because it involves pain at the second or sub-second level.
Eliezer discusses the choice between reading a mediocre book and a good book:
You may read a mediocre book for an hour, instead of a good book, because if you first spent a few minutes to search your library to obtain a better book, that would be an immediate cost - not that searching your library is all that unpleasant, but you'd have to pay an immediate activation cost to do that instead of taking the path of least resistance and grabbing the first thing in front of you. It's a hyperbolically discounted tradeoff that you make without realizing it, because the cost you're refusing to pay isn't commensurate enough with the payoff you're forgoing to be salient as an explicit tradeoff.
In this case, you like the good book, but you want to keep reading the mediocre book. If it's cheating to start our hypothetical subject off reading the mediocre book, consider the difference between a book of one-liner jokes and a really great novel. The book of one-liners you can open to a random page and start being immediately amused (reinforced). The great novel you've got to pick up, get into, develop sympathies for the characters, figure out what the heck lomillialor or a Tiste Andii is, and then a few pages in you're thinking "This is a pretty good book". The fear of those few pages could make you realize you'll like the novel, but still want to read the joke book. And since hyperbolic discounting overcounts reward or punishment in the next few seconds, it may seem like a net punishment to make the change.
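The joke book versus novel tradeoff can be worked through with the standard hyperbolic discounting form, value = amount / (1 + k * delay). The sketch below is only illustrative: the reward sizes, the delays, and the discount parameter `k` are assumed numbers chosen to show the effect, not measurements.

```python
def hyperbolic_value(amount, delay_seconds, k=1.0):
    """Present value of a reward or cost under hyperbolic discounting."""
    return amount / (1.0 + k * delay_seconds)


def discounted_total(events, k=1.0):
    """Sum the discounted values of (amount, delay_seconds) events."""
    return sum(hyperbolic_value(amount, delay, k) for amount, delay in events)


def undiscounted_total(events):
    """Sum the raw amounts, ignoring when they arrive."""
    return sum(amount for amount, _ in events)


# Hypothetical numbers: the joke book amuses immediately; the novel costs some
# effort right now and pays off more, but only about ten minutes in.
joke_book = [(1.0, 0)]
novel = [(-1.0, 0), (5.0, 600)]

print("liking (undiscounted):", undiscounted_total(joke_book), undiscounted_total(novel))
print("wanting (discounted): ", discounted_total(joke_book), discounted_total(novel))
```

With these assumed values the novel wins on total reward but comes out as a net punishment once discounted, so the momentary pull is toward the joke book.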
This deals yet another blow to the concept of me having "preferences". How much do I want popcorn? That depends very much on whether I'm at a movie theater or a used car dealership. If I browse Reddit for half an hour because it would be too much work to spend ten seconds traveling to the living room to pick up the book I'm really enjoying, do I "prefer" browsing to reading? Which has higher utility? If I hate every second I'm at the slot machines, but I keep at them anyway so I don't miss the jackpot, am I a gambling addict, or just a person who enjoys winning jackpots and is willing to do what it takes?
In cases like these, the language of preference and utility is not very useful. My anticipation of reward is constraining my behavior, and different factors are promoting different behaviors in an unstable way, but trying to extract "preferences" from the situation is trying to oversimplify a complex situation.
ETA: As stated below, criticizing beliefs is trivial in principle: either they were arrived at with an approximation to Bayes' rule starting with a reasonable prior and then updated with actual observations, or they weren't. Subsequent conversation made it clear that criticizing behavior is also trivial in principle, since someone is either taking the action that they believe will best suit their preferences, or not. Finally, criticizing preferences became trivial too -- the relevant question is "Does/will agent X behave as though they have preferences Y", and that's a belief, so go back to Bayes' rule and a reasonable prior. So the entire issue that this post was meant to solve has evaporated, in my opinion. Here's the original article, in case anyone is still interested:
Pancritical rationalism is a fundamental value in Extropianism that has only been mentioned in passing on LessWrong. I think it deserves more attention here. It's an approach to epistemology, that is, the question of "How do we know what we know?", that avoids the contradictions inherent in some of the alternative approaches.
The fundamental source document for it is William Bartley's Retreat to Commitment. He describes three approaches to epistemology, along with the dissatisfying aspects of the other two:
- Nihilism. Nothing matters, so it doesn't matter what you believe. This path is self-consistent, but it gives no guidance.
- Justificationalism. Your belief is justified because it is a consequence of other beliefs. This path is self-contradictory. Eventually you'll go in circles trying to justify the other beliefs, or you'll find beliefs you can't justify. Justificationalism itself cannot be justified.
- Pancritical rationalism. You have taken the available criticisms for the belief into account and still feel comfortable with the belief. This path gives guidance about what to believe, although it does not uniquely determine one's beliefs. Pancritical rationalism can be criticized, so it is self-consistent in that sense.
Read on for a discussion about emotional consequences and extending this to include preferences and behaviors as well as beliefs.
[I made significant edits when moving this to the main page - so if you read it in Discussion, it's different now. It's clearer about the distinction between two different meanings of "free", and why linking one meaning of "free" with morality implies a focus on an otherworldly soul.]
It was funny to me that many people thought Crime and Punishment was advocating outcome-based justice. If you read the post carefully, nothing in it advocates outcome-based justice. I only wanted to show how people think, so I could write this post.
Talking about morality causes much confusion, because most philosophers - and most people - do not have a distinct concept of morality. At best, they have just one word that conflates two different concepts. At worst, their "morality" doesn't contain any new primitive concepts at all; it's just a macro: a shorthand for a combination of other ideas.
I think - and have, for as long as I can remember - that morality is about doing the right thing. But this is not what most people think morality is about!