Less Wrong is a community blog devoted to refining the art of human rationality. Please visit our About page for more information.
Convergent instrumental goals (also basic AI drives) are goals that are useful for pursuing almost any other goal, and are thus likely to be pursued by any agent that is intelligent enough to understand why they’re useful. They are interesting because they may allow us to roughly predict the behavior of even AI systems that are much more intelligent than we are.
Instrumental goals are also a strong argument for why sufficiently advanced AI systems that were indifferent towards human values could be dangerous towards humans, even if they weren’t actively malicious: because the AI having instrumental goals such as self-preservation or resource acquisition could come to conflict with human well-being. “The AI does not hate you, nor does it love you, but you are made out of atoms which it can use for something else.”
I’ve thought of a candidate for a new convergent instrumental drive: simplifying the environment to make it more predictable in a way that aligns with your goals.
Arguments for risks from general AI are sometimes criticized on the grounds that they rely on a series of linear events, each of which has to occur for the proposed scenario to go through. For example, that a sufficiently intelligent AI could escape from containment, that it could then go on to become powerful enough to take over the world, that it could do this quickly enough without being detected, etc.
The intent of my following series of posts is to briefly demonstrate that AI risk scenarios are in fact disjunctive: composed of multiple possible pathways, each of which could be sufficient by itself. To successfully control the AI systems, it is not enough to simply block one of the pathways: they all need to be dealt with.
I've got two posts in this series up so far:
AIs gaining a decisive advantage discusses four different ways by which AIs could achieve a decisive advantage over humanity. The one-picture version is:
AIs gaining the power to act autonomously discusses ways by which AIs might come to act as active agents in the world, despite possible confinement efforts or technology. The one-picture version is:
These posts draw heavily on my old paper, Responses to Catastrophic AGI Risk, as well as some recent conversations here on LW. Upcoming posts will try to cover more new ground.
Hypothetical “value learning” AIs learn human values and then try to act according to those values. The design of such AIs, however, is hampered by the fact that there exists no satisfactory definition of what exactly human values are. After arguing that the standard concept of preference is insufficient as a definition, I draw on reinforcement learning theory, emotion research, and moral psychology to offer an alternative definition. In this definition, human values are conceptualized as mental representations that encode the brain’s value function (in the reinforcement learning sense) by being imbued with a context-sensitive affective gloss. I finish with a discussion of the implications that this hypothesis has on the design of value learners.
Economic treatments of agency standardly assume that preferences encode some consistent ordering over world-states revealed in agents’ choices. Real-world preferences, however, have structure that is not always captured in economic models. A person can have conflicting preferences about whether to study for an exam, for example, and the choice they end up making may depend on complex, context-sensitive psychological dynamics, rather than on a simple comparison of two numbers representing how much one wants to study or not study.
Sotala argues that our preferences are better understood in terms of evolutionary theory and reinforcement learning. Humans evolved to pursue activities that are likely to lead to certain outcomes — outcomes that tended to improve our ancestors’ fitness. We prefer those outcomes, even if they no longer actually maximize fitness; and we also prefer events that we have learned tend to produce such outcomes.
Affect and emotion, on Sotala’s account, psychologically mediate our preferences. We enjoy and desire states that are highly rewarding in our evolved reward function. Over time, we also learn to enjoy and desire states that seem likely to lead to high-reward states. On this view, our preferences function to group together events that lead on expectation to similarly rewarding outcomes for similar reasons; and over our lifetimes we come to inherently value states that lead to high reward, instead of just valuing such states instrumentally. Rather than directly mapping onto our rewards, our preferences map onto our expectation of rewards.
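To make "preferences map onto our expectation of rewards" a little more concrete, here is a minimal sketch of temporal-difference (TD(0)) value learning on a toy chain of states. This is a standard reinforcement learning illustration, not the model from the paper; the chain, the function name `td0_values`, and the parameters `alpha` and `gamma` are assumptions made up for the example.

```python
def td0_values(n_states=4, episodes=200, alpha=0.1, gamma=0.9):
    """Learn state values on a deterministic chain 0 -> 1 -> ... -> terminal.

    Only the final step into the terminal state pays a reward of 1.0.
    (Hypothetical toy environment, chosen purely for illustration.)
    """
    # One extra slot for the terminal state, whose value stays at 0.
    values = [0.0] * (n_states + 1)
    for _ in range(episodes):
        for s in range(n_states):
            reward = 1.0 if s == n_states - 1 else 0.0
            # TD(0) update: move the estimate toward reward + discounted next value.
            target = reward + gamma * values[s + 1]
            values[s] += alpha * (target - values[s])
    return values[:n_states]


values = td0_values()
for state, value in enumerate(values):
    print(f"state {state}: learned value {value:.3f}")
```

Only the last transition is ever rewarded, yet the earlier states end up with positive learned values because they reliably lead toward the reward. That is roughly the sense in which a state can come to be valued for what it predicts rather than for any reward it delivers directly.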
Sotala proposes that value learning systems informed by this model of human psychology could more reliably reconstruct human values. On this model, for example, we can expect human preferences to change as we find new ways to move toward high-reward states. New experiences can change which states my emotions categorize as “likely to lead to reward,” and they can thereby modify which states I enjoy and desire. Value learning systems that take these facts about humans’ psychological dynamics into account may be better equipped to take our likely future preferences into account, rather than optimizing for our current preferences alone.
Would be curious to hear whether anyone here has any thoughts. This is basically a "putting rough ideas together and seeing if they make any sense" kind of paper, aimed at clarifying the hypothesis and seeing whether others can find any obvious holes in it, rather than being at the stage of a serious scientific theory yet.
Long. Mostly quite positive, though does spend a little while rolling its eyes at the Eliezer/MIRI connection and the craziness of taking things like cryonics and polyamory seriously.
A lot of people have said that they never look at Main, only Discussion. And indeed, LW's Google Analytics stats say that Main only gets one-third of the views that Discussion does.
Because of this, I thought that I'd point out that December has been an unusually lively month for Main, with several high-quality posts that you may be interested in reading if you haven't already:
- LessWrong 2.0 (Vaniver): discussion about what to do with LW in order to stop its decline. Different from previous discussions in that this time, MIRI and TrikeApps have agreed to make the changes that result from the discussion.
- Why startup founders have mood swings (and why they may have uses) (AnnaSalamon and Duncan_Sabien): what the title says
- Results of a One-Year Longitudinal Study of CFAR Alumni (Unnamed): CFAR has studied the impact of its workshops on people a year after they took them, and has promising results.
- The art of grieving well (Valentine): a beautiful and important post on the function of grief, and how to make the best out of it. A post intended for a sequence on "the sub-art of subconsciously seeking out and eliminating ugh fields and also eliminating the inclination to form them in the first place".
- European Community Weekend 2016 (nino): ECW2016 is confirmed to happen!
- Why CFAR? The view from 2015 (PeteMichaud): a report on what CFAR has achieved in 2015, how it has changed, and what it will do in the future.
New essay summarizing some of my latest thoughts on AI safety, ~3500 words. I explain why I think that some of the thought experiments that have previously been used to illustrate the dangers of AI are flawed and should be used very cautiously, why I'm less worried about the dangers of AI than I used to be, and what some of the remaining reasons are for why I continue to be somewhat worried.
Backcover celebrity endorsement: "Thanks, Kaj, for a very nice write-up. It feels good to be discussing actually meaningful issues regarding AI safety. This is a big contrast to discussions I've had in the past with MIRI folks on AI safety, wherein they have generally tried to direct the conversation toward bizarre, pointless irrelevancies like "the values that would be held by a randomly selected mind", or "AIs with superhuman intelligence making retarded judgments" (like tiling the universe with paperclips to make humans happy), and so forth.... Now OTOH, we are actually discussing things of some potential practical meaning ;p ..." -- Ben Goertzel
Summary: the problem with Pascal's Mugging arguments is that, intuitively, some probabilities are just too small to care about. There might be a principled reason for ignoring some probabilities, namely that they violate an implicit assumption behind expected utility theory. This suggests a possible approach for formally defining a "probability small enough to ignore", though there's still a bit of arbitrariness in it.
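As a rough illustration of what ignoring sufficiently small probabilities does to an expected utility calculation, here is a minimal sketch. It is not the formal definition the post works toward: the function `expected_utility`, the cutoff parameter `ignore_below`, and all of the numbers are assumptions invented for the example, and the choice of cutoff is exactly the arbitrary part.

```python
def expected_utility(outcomes, ignore_below=0.0):
    """Sum probability * utility over (probability, utility) pairs.

    Outcomes with probability below `ignore_below` are dropped entirely
    (a hypothetical cutoff rule; the remaining probabilities are not renormalized).
    """
    return sum(p * u for p, u in outcomes if p >= ignore_below)


# Hypothetical mugging: paying costs 5 utility, and with probability 1e-20
# the mugger really delivers an astronomically large payoff of 1e30.
pay = [(1e-20, 1e30), (1.0 - 1e-20, -5.0)]
refuse = [(1.0, 0.0)]

naive = expected_utility(pay)                      # dominated by the tiny-probability payoff
capped = expected_utility(pay, ignore_below=1e-9)  # tiny-probability payoff ignored

print("pay, naive EU: ", naive)
print("pay, capped EU:", capped)
print("refuse EU:     ", expected_utility(refuse))
```

Without a cutoff, paying looks better than refusing because the huge payoff swamps its tiny probability. With the cutoff, paying is just a loss of 5. Nothing in the sketch says where the cutoff should sit, which is the arbitrariness mentioned above.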
You may recognize several familiar names there, such as Paul Christiano, Benja Fallenstein, Katja Grace, Nick Bostrom, Anna Salamon, Jacob Steinhardt, Stuart Russell... and me. (the $20,000 for my project was the smallest grant that they gave out, but hey, I'm definitely not complaining. ^^)
Summary: Utilitarianism is often ill-defined by supporters and critics alike, preference utilitarianism even more so. I briefly examine some of the axes of utilitarianism common to all popular forms, then look at some axes unique but essential to preference utilitarianism, which seem to have received little to no discussion – at least not this side of a paywall. This way I hope to clarify future discussions between hedonistic and preference utilitarians and perhaps to clarify things for their critics too, though I’m aiming the discussion primarily at utilitarians and utilitarian-sympathisers.
I like this essay particularly for the way it breaks down different forms of utilitarianism along various axes, which have rarely been discussed on LW.
For utilitarianism in general:
Many of these axes are well discussed, pertinent to almost any form of utilitarianism, and at least reasonably well understood, and I don’t propose to discuss them here beyond highlighting their salience. These include but probably aren’t restricted to the following:
- What is utility? (for the sake of easy reference, I’ll give each axis a simple title – for this, the utility axis); eg happiness, fulfilled preferences, beauty, information (PDF)
- How drastically are we trying to adjust it?, aka what if any is the criterion for ‘right’ness? (sufficiency axis); eg satisficing, maximising, scalar
- How do we balance tradeoffs between positive and negative utility? (weighting axis); eg, negative, negative-leaning, positive (as in fully discounting negative utility – I don’t think anyone actually holds this), ‘middling’ ie ‘normal’ (often called positive, but it would benefit from a distinct adjective)
- What’s our primary mentality toward it? (mentality axis); eg act, rule, two-level, global
- How do we deal with changing populations? (population axis); eg average, total
- To what extent do we discount future utility? (discounting axis); eg zero discount, >0 discount
- How do we pinpoint the net zero utility point? (balancing axis); eg Tännsjö’s test, experience tradeoffs
- What is a utilon? (utilon axis)  – I don’t know of any examples of serious discussion on this (other than generic dismissals of the question), but it’s ultimately a question utilitarians will need to answer if they wish to formalise their system.
For preference utilitarianism in particular:
Here then, are the six most salient dependent axes of preference utilitarianism, ie those that describe what could count as utility for PUs. I’ll refer to the poles on each axis as (axis)0 and (axis)1, where any intermediate view will be (axis)X. We can then formally refer to subtypes, and also exclude them, eg ~(F0)R1PU, or ~(F0 v R1)PU etc, or represent a range, eg C0..XPU.
How do we process misinformed preferences? (information axis F)
(F0 no adjustment / F1 adjust to what it would have been had the person been fully informed / FX somewhere in between)
How do we process irrational preferences? (rationality axis R)
(R0 no adjustment / R1 adjust to what it would have been had the person been fully rational / RX somewhere in between)
How do we process malformed preferences? (malformation axes M)
(M0 Ignore them / MF1 adjust to fully informed / MFR1 adjust to fully informed and rational (shorthand for MF1R1) / MFxRx adjust to somewhere in between)
How long is a preference relevant? (duration axis D)
(D0 During its expression only / DF1 During and future / DPF1 During, future and past (shorthand for DP1F1) / DPxFx Somewhere in between)
What constitutes a preference? (constitution axis C)
(C0 Phenomenal experience only / C1 Behaviour only / CX A combination of the two)
What resolves a preference? (resolution axis S)
(S0 Phenomenal experience only / S1 External circumstances only / SX A combination of the two)
What distinguishes these categorisations is that each category, as far as I can perceive, has no analogous axis within hedonistic utilitarianism. In other words to a hedonistic utilitarian, such axes would either be meaningless, or have only one logical answer. But any well-defined and consistent form of preference utilitarianism must sit at some point on every one of these axes.
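One way to read the (axis)0 / (axis)1 / (axis)X notation is as a small data structure. The sketch below is only an illustration under my own assumptions: it covers just the four single-letter axes (F, R, C, S), leaves out the compound M and D axes, and the class name `PreferenceUtilitarianism` and the methods `label` and `matches` are hypothetical names, not anything from the essay.

```python
from dataclasses import dataclass

# The four axes with simple 0 / 1 / X poles: information, rationality,
# constitution, resolution. The compound M and D axes are left out.
SIMPLE_AXES = ("F", "R", "C", "S")


def pole_label(axis: str, position: float) -> str:
    """Map a position in [0, 1] to the essay's notation: F0, F1 or FX."""
    if not 0.0 <= position <= 1.0:
        raise ValueError(f"{axis} position must lie in [0, 1], got {position}")
    if position == 0.0:
        return f"{axis}0"
    if position == 1.0:
        return f"{axis}1"
    return f"{axis}X"


@dataclass(frozen=True)
class PreferenceUtilitarianism:
    """A position on each simple axis (hypothetical representation)."""
    F: float
    R: float
    C: float
    S: float

    def label(self) -> str:
        """Full subtype name, e.g. 'F1R1CXS0PU'."""
        return "".join(pole_label(a, getattr(self, a)) for a in SIMPLE_AXES) + "PU"

    def matches(self, require=(), exclude=()) -> bool:
        """True if every required pole is held and no excluded pole is."""
        held = {pole_label(a, getattr(self, a)) for a in SIMPLE_AXES}
        return set(require) <= held and not (set(exclude) & held)


view = PreferenceUtilitarianism(F=1.0, R=1.0, C=0.5, S=0.0)
print(view.label())
# ~(F0)R1PU: not F0, and R1
print(view.matches(require={"R1"}, exclude={"F0"}))
# ~(F0 v R1)PU: neither F0 nor R1
print(view.matches(exclude={"F0", "R1"}))
```

Requiring a value for every field mirrors the claim that a well-defined form of preference utilitarianism has to sit somewhere on each axis.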
See the article for more detailed discussion about each of the axes of preference utilitarianism, and more.