
FAI and the Information Theory of Pleasure

8 johnsonmx 08 September 2015 09:16PM

Previously, I talked about the mystery of pain and pleasure, and how little we know about what sorts of arrangements of particles intrinsically produce them.

 

Up now: should FAI researchers care about this topic? Is research into the information theory of pain and pleasure relevant for FAI? I believe so! Here are the top reasons I came up with while thinking about this topic.

 

An important caveat: much depends on whether pain and pleasure (collectively, 'valence') are simple or complex properties of conscious systems. If they’re on the complex end of the spectrum, many points on this list may not be terribly relevant for the foreseeable future. On the other hand, if they have a relatively small “Kolmogorov complexity” (e.g., if a ‘hashing function’ to derive valence could fit on a t-shirt), crisp knowledge of valence may be possible sooner rather than later, and could have some immediate relevance to current FAI research directions.

Additional caveats: it’s important to note that none of these ideas is a grand, sweeping panacea, none is intended to address deep metaphysical questions, and none aims to reinvent the wheel. Instead, they’re intended to help resolve empirical ambiguities and modestly enlarge the current FAI toolbox.

 

 

1. Valence research could simplify the Value Problem and the Value Loading Problem. If pleasure/happiness is an important core part of what humanity values, or should value, having the exact information-theoretic definition of it on hand could directly and drastically simplify the problems of what to maximize, and how to load this value into an AGI.

 

2. Valence research could form the basis for a well-defined ‘sanity check’ on AGI behavior. Even if pleasure isn’t a core terminal value for humans, it could still be used as a useful indirect heuristic for detecting value destruction. I.e., if we’re considering having an AGI carry out some intervention, we could ask it what the expected effect is on whatever pattern precisely corresponds to pleasure/happiness. If there’d be a lot less of that pattern, the intervention is probably a bad idea.
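To make the shape of this heuristic concrete, here's a minimal sketch in Python. Everything in it is hypothetical: `predict_valence_total` and `simulate` stand in for a valence-measuring model and a world model we don't yet know how to build, and the threshold is arbitrary.

```python
def valence_sanity_check(current_world, proposed_intervention,
                         predict_valence_total, simulate,
                         max_relative_loss=0.05):
    """Flag interventions expected to destroy a large share of positive valence.

    `predict_valence_total` and `simulate` are hypothetical stand-ins for a
    valence-measuring model and a world model; nothing here says how they
    would actually work.
    """
    baseline = predict_valence_total(current_world)
    projected = predict_valence_total(simulate(current_world, proposed_intervention))
    if baseline <= 0:
        return True  # no positive valence to lose; the heuristic is silent here
    relative_loss = (baseline - projected) / baseline
    # A large expected drop in the valence pattern => probably a bad idea.
    return relative_loss <= max_relative_loss
```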

 

3. Valence research could help us be humane to AGIs and WBEs. There’s going to be a lot of experimentation involving intelligent systems, and although many of these systems won’t be “sentient” in the way humans are, some system types will approach or even surpass human capacity for suffering. Unfortunately, many of these early systems won’t work well— i.e., they’ll be insane. It would be great if we had a good way to detect profound suffering in such cases and halt the system.

 

4. Valence research could help us prevent Mind Crimes. Nick Bostrom suggests in Superintelligence that AGIs might simulate virtual humans to reverse-engineer human preferences, but that these virtual humans might be sufficiently high-fidelity that they themselves could meaningfully suffer. We can tell AGIs not to do this, but knowing the exact information-theoretic pattern of suffering would make it easier to specify what not to do.

 

5. Valence research could enable radical forms of cognitive enhancement. Nick Bostrom has argued that there are hard limits on traditional pharmaceutical cognitive enhancement, since if the presence of some simple chemical would help us think better, our brains would probably already be producing it. On the other hand, there seem to be fewer a priori limits on motivational or emotional enhancement. And sure enough, the most effective “cognitive enhancers”, such as Adderall and modafinil, seem to work by making cognitive tasks feel less unpleasant or more interesting. If we had a crisp theory of valence, this might enable particularly powerful versions of these sorts of drugs.

 

6. Valence research could help align an AGI’s nominal utility function with visceral happiness. There seems to be a lot of confusion with regard to happiness and utility functions. In short: they are different things! Utility functions are goal abstractions, generally realized either explicitly through high-level state variables or implicitly through dynamic principles. Happiness, on the other hand, seems like an emergent, systemic property of conscious states, and like other qualia but unlike utility functions, it’s probably highly dependent upon low-level architectural and implementational details and dynamics. In practice, most people most of the time can be said to have rough utility functions which are often consistent with increasing happiness, but this is an awfully leaky abstraction.

 

My point is that constructing an AGI whose utility function is to make paperclips, and constructing a sentient AGI who is viscerally happy when it makes paperclips, are very different tasks. Moreover, I think there could be value in being able to align these two factors— to make an AGI which is viscerally happy to the exact extent it’s maximizing its nominal utility function.

(Why would we want to do this in the first place? There is the obvious semi-facetious-but-not-completely-trivial answer— that if an AGI turns me into paperclips, I at least want it to be happy while doing so—but I think there’s real potential for safety research here also.)
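One toy way to picture what "aligning these two factors" could even mean: track the agent's nominal utility over time alongside an (entirely hypothetical) estimate of its visceral valence, and check how tightly the two co-vary. The sketch below assumes both signals are already available as numeric time series, which is exactly the part we don't know how to do.

```python
def utility_valence_alignment(utility_trace, valence_trace):
    """Pearson correlation between nominal utility and estimated valence.

    A score near 1.0 would mean the agent is viscerally happy to roughly
    the extent it's maximizing its nominal utility function. Obtaining
    `valence_trace` at all is the open problem this post is about.
    """
    n = len(utility_trace)
    assert n == len(valence_trace) and n > 1
    mean_u = sum(utility_trace) / n
    mean_v = sum(valence_trace) / n
    cov = sum((u - mean_u) * (v - mean_v)
              for u, v in zip(utility_trace, valence_trace))
    var_u = sum((u - mean_u) ** 2 for u in utility_trace)
    var_v = sum((v - mean_v) ** 2 for v in valence_trace)
    if var_u == 0 or var_v == 0:
        return 0.0  # a constant signal carries no alignment information
    return cov / (var_u * var_v) ** 0.5
```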

 

7. Valence research could help us construct makeshift utility functions for WBEs and Neuromorphic AGIs. How do we make WBEs or Neuromorphic AGIs do what we want? One approach would be to piggyback off what they already partially and imperfectly optimize for, and build a makeshift utility function out of pleasure. Trying to shoehorn a utility function onto any evolved, emergent system is going to involve terrible imperfections, uncertainties, and dangers, but if research trends make neuromorphic AGI likely to occur before other options, it may be a case of “something is probably better than nothing.”

 

One particular application: constructing a “cryptographic reward token” control scheme for WBEs/neuromorphic AGIs. Carl Shulman has suggested we could incentivize an AGI to do what we want by giving it a steady trickle of cryptographic reward tokens that fulfill its utility function: it knows that if it misbehaves (e.g., if it kills all humans), it’ll stop getting these tokens. But if we want to construct reward tokens for types of AGIs that don’t intrinsically have crisp utility functions (such as WBEs or neuromorphic AGIs), we’ll have to understand, on a deep mathematical level, what they do optimize for, which will at least partially involve pleasure.
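As a rough illustration of the token mechanics (not Shulman's actual proposal, whose details aren't spelled out here), the operators could sign each token in the trickle with a secret key, and the reward channel could accept only correctly signed tokens for the expected sequence number. A minimal sketch using Python's standard library:

```python
import hashlib
import hmac

SECRET_KEY = b"operators-only"  # hypothetical; never available to the agent


def mint_token(sequence_number: int) -> bytes:
    """Operator side: issue the next reward token in the trickle."""
    msg = str(sequence_number).encode()
    return hmac.new(SECRET_KEY, msg, hashlib.sha256).digest()


def accept_token(sequence_number: int, token: bytes) -> bool:
    """Reward-channel side: accept only a correctly signed token for the
    expected sequence number."""
    expected = hmac.new(SECRET_KEY, str(sequence_number).encode(),
                        hashlib.sha256).digest()
    return hmac.compare_digest(expected, token)
```

Because HMAC is symmetric, the verifying side here would also hold the minting key; a real scheme would presumably use asymmetric signatures and much more besides. The point is only that the flow of reward depends on a secret the agent cannot forge.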

 

8. Valence research could help us better understand, and perhaps prevent, AGI wireheading. How can AGI researchers prevent their AGIs from wireheading (direct manipulation of their utility functions)? I don’t have a clear answer, and it seems like a complex problem which will require complex, architecture-dependent solutions, but understanding the universe’s algorithm for pleasure might help clarify what kind of problem it is, and how evolution has addressed it in humans.

 

9. Valence research could help reduce general metaphysical confusion. We’re going to be facing some very weird questions about philosophy of mind and metaphysics when building AGIs, and everybody seems to have their own pet assumptions on how things work. The better we can clear up the fog surrounding these topics, the less coordination friction we’ll face when we have to address them directly.


Successfully reverse-engineering a subset of qualia (valence, perhaps the easiest type to reverse-engineer?) would be a great step in this direction.

 

10. Valence research could change the social and political landscape AGI research occurs in. This could take many forms: at best, a breakthrough could lead to a happier society where many previously nihilistic individuals suddenly have “skin in the game” with respect to existential risk. At worst, it could be a profound information hazard, and irresponsible disclosure or misuse of such research could lead to mass wireheading, mass emotional manipulation, and totalitarianism. Either way, it would be an important topic to keep abreast of.

 

 

These are not all independent issues, and not all are of equal importance. But, taken together, they do seem to imply that reverse-engineering valence will be decently relevant to FAI research, particularly with regard to the Value Problem, reducing metaphysical confusion, and perhaps making the hardest safety cases (e.g., neuromorphic AGIs) a little bit more tractable.

An overall schema for the friendly AI problems: self-referential convergence criteria

17 Stuart_Armstrong 13 July 2015 03:34PM

A putative new idea for AI control; index here.

After working for some time on the Friendly AI problem, it's occurred to me that a lot of the issues seem related. Specifically, all the following seem to have commonalities:

Speaking very broadly, there are two features they all share:

  1. The convergence criteria are self-referential.
  2. Errors in the setup are likely to cause false convergence.

What do I mean by that? Well, imagine you're trying to reach reflective equilibrium in your morality. You do this by using good meta-ethical rules, zooming up and down at various moral levels, making decisions on how to resolve inconsistencies, etc... But how do you know when to stop? Well, you stop when your morality is perfectly self-consistent, when you no longer have any urge to change your moral or meta-moral setup. In other words, the stopping point (and the convergence to the stopping point) is entirely self-referentially defined: the morality judges itself. It does not include any other moral considerations. You input your initial moral intuitions and values, and you hope this will cause the end result to be "nice", but the definition of the end result does not include your initial moral intuitions (note that some moral realists could see this process dependence as a positive - except for the fact that these processes have many convergent states, not just one or a small grouping).

So when the process goes nasty, you're pretty sure to have achieved something self-referentially stable, but not nice. Similarly, a nasty CEV will be coherent and have no desire to further extrapolate... but that's all we know about it.

The second feature is that any process has errors - computing errors, conceptual errors, errors due to the weakness of human brains, etc... If you visualise this as noise, you can see that noise in a convergent process is more likely to cause premature convergence, because if the process ever reaches a stable self-referential state, it will stay there (and if the process is a long one, then early noise will cause great divergence at the end). For instance, imagine you have to reconcile your belief in preserving human cultures with your beliefs in human individual freedom. A complex balancing act. But if, at any point along the way, you simply jettison one of the two values completely, things become much easier - and once jettisoned, the missing value is unlikely to ever come back.

Or, more simply, the system could get hacked. When exploring a potential future world, you could become so enamoured of it that you overwrite any objections you had. It seems very easy for humans to fall into these traps - and again, once you lose something of value in your system, you don't tend to get it back.
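A toy numerical caricature of this failure mode (assuming, very crudely, that reaching reflective equilibrium can be modelled as pulling a vector of value weights toward mutual consistency): with no noise the process converges to a sensible blend of the starting values, but with occasional noise that jettisons a weight, it still converges to a perfectly stable state, just one in which much of the original value has been destroyed and never comes back.

```python
import random


def reflective_step(weights, noise_rate=0.0, rng=random):
    """One caricatured step toward self-consistency: pull each value weight
    toward the mean of the others, occasionally jettisoning one entirely."""
    new = []
    for i, w in enumerate(weights):
        others = weights[:i] + weights[i + 1:]
        updated = 0.9 * w + 0.1 * (sum(others) / len(others))
        if rng.random() < noise_rate:
            updated = 0.0  # a value gets jettisoned and never comes back
        new.append(updated)
    return new


def run_until_stable(weights, noise_rate=0.0, tol=1e-6, max_steps=10000):
    """Iterate until the state no longer wants to change (self-consistency)."""
    for step in range(max_steps):
        new = reflective_step(weights, noise_rate)
        if max(abs(a - b) for a, b in zip(new, weights)) < tol:
            return new, step
        weights = new
    return weights, max_steps


if __name__ == "__main__":
    start = [1.0, 0.2]  # e.g. cultural preservation vs individual freedom
    print(run_until_stable(start))                   # converges near the mean
    print(run_until_stable(start, noise_rate=0.05))  # stable, but value is lost
```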

 

Solutions

And again, very broadly speaking, there are several classes of solutions to deal with these problems:

  1. Reduce or prevent errors in the extrapolation (eg solving the agent tiling problem).
  2. Solve all or most of the problem ahead of time (eg traditional FAI approach by specifying the correct values).
  3. Make sure you don't get too far from the starting point (eg reduced impact AI, tool AI, models as definitions).
  4. Figure out the properties of a nasty convergence, and try to avoid them (eg some of the ideas I mentioned in "crude measures", and general precautions taken when defining the convergence process).

 

FAI Research Constraints and AGI Side Effects

14 JustinShovelain 03 June 2015 07:25PM

Ozzie Gooen and Justin Shovelain

Summary

Friendly artificial intelligence (FAI) researchers have at least two significant challenges. First, they must produce a significant amount of FAI research in a short amount of time. Second, they must do so without producing enough artificial general intelligence (AGI) research to result in the creation of an unfriendly artificial intelligence (UFAI). We estimate the requirements of both of these challenges using two simple models.

Our first model describes a friendliness ratio and a leakage ratio for FAI research projects. These provide limits on the allowable amount of AGI knowledge produced per unit of FAI knowledge in order for a project to be net beneficial.

Our second model studies a hypothetical FAI venture, which is responsible for ensuring FAI creation. We estimate the necessary total FAI research per year from the venture and the leakage ratio of that research. This model demonstrates a trade-off between the speed of FAI research and the proportion of AGI research that can be revealed as part of it. If FAI research takes too long, the acceptable leakage ratio may become so low that it is nearly impossible to safely produce any new research.
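As a very rough sketch of the bookkeeping behind a friendliness ratio (the actual models are in the linked post; this does not reproduce their definitions): if a project produces some amount of FAI-relevant knowledge and leaks some amount of AGI-relevant knowledge, it is net beneficial only if the ratio of the former to the latter clears whatever threshold the overall situation requires. All three quantities are, of course, exactly the hard-to-estimate parts.

```python
def is_net_beneficial(fai_knowledge_produced, agi_knowledge_leaked,
                      required_friendliness_ratio):
    """Toy friendliness/leakage-ratio test with made-up scalar units."""
    if agi_knowledge_leaked == 0:
        return fai_knowledge_produced > 0  # pure FAI progress, no leakage
    actual_ratio = fai_knowledge_produced / agi_knowledge_leaked
    return actual_ratio >= required_friendliness_ratio
```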


New forum for MIRI research: Intelligent Agent Foundations Forum

36 orthonormal 20 March 2015 12:35AM

Today, the Machine Intelligence Research Institute is launching a new forum for research discussion: the Intelligent Agent Foundations Forum! It's already been seeded with a bunch of new work on MIRI topics from the last few months.

We've covered most of the (what, why, how) subjects on the forum's new welcome post and the How to Contribute page, but this post is an easy place to comment if you have further questions (or if, maths forbid, there are technical issues with the forum instead of on it).

But before that, go ahead and check it out!

(Major thanks to Benja Fallenstein, Alice Monday, and Elliott Jin for their work on the forum code, and to all the contributors so far!)

EDIT 3/22: Jessica Taylor, Benja Fallenstein, and I wrote forum digest posts summarizing and linking to recent work (on the IAFF and elsewhere) on reflective oracle machines, on corrigibility, utility indifference, and related control ideas, and on updateless decision theory and the logic of provability, respectively! These are pretty excellent resources for reading up on those topics, in my biased opinion.

New(ish) AI control ideas

24 Stuart_Armstrong 05 March 2015 05:03PM

EDIT: this post is no longer being maintained; it has been replaced by this new one.

 

I recently went on a two-day intense solitary "AI control retreat", with the aim of generating new ideas for making safe AI. The "retreat" format wasn't really a success ("focused uninterrupted thought" was the main gain, not "two days of solitude" - it would have been more effective in three-hour sessions), but I did manage to generate a lot of new ideas. These ideas will now go before the baying bloodthirsty audience (that's you, folks) to test them for viability.

A central thread running through could be: if you want something, you have to define it, then code it, rather than assuming you can get it for free through some other approach.

To provide inspiration and direction to my thought process, I first listed all the easy responses that we generally give to most proposals for AI control. If someone comes up with a new/old brilliant idea for AI control, it can normally be dismissed by appealing to one of these responses:

  1. The AI is much smarter than us.
  2. It’s not well defined.
  3. The setup can be hacked.
    • By the agent.
    • By outsiders, including other AI.
    • Adding restrictions encourages the AI to hack them, not obey them.
  4. The agent will resist changes.
  5. Humans can be manipulated, hacked, or seduced.
  6. The design is not stable.
    • Under self-modification.
    • Under subagent creation.
  7. Unrestricted search is dangerous.
  8. The agent has, or will develop, dangerous goals.

The mystery of pain and pleasure

8 johnsonmx 01 March 2015 07:47PM

 

Some arrangements of particles feel better than others. Why?

We have no general theories about what produces pain and pleasure, only descriptive observations within the context of the vertebrate brain. It seems like there's a mystery here, a general principle to uncover.

Let's try to chart the mystery. I think we should, in theory, be able to answer the following questions:


(1) What are the necessary and sufficient properties for a thought to be pleasurable?

(2) What are the characteristic mathematics of a painful thought?

(3) If we wanted to create an artificial neural network-based mind (i.e., using neurons, but not slavishly patterned after a mammalian brain) that could experience bliss, what would the important design parameters be?

(4) If we wanted to create an AGI whose nominal reward signal coincided with visceral happiness -- how would we do that?

(5) If we wanted to ensure an uploaded mind could feel visceral pleasure of the same kind a non-uploaded mind can, how could we check that? 

(6) If we wanted to fill the universe with computronium and maximize hedons, what algorithm would we run on it?

(7) If we met an alien life-form, how could we tell if it was suffering?


It seems to me these are all empirical questions that should have empirical answers. But we don't seem to have many handholds to give us a starting point.

Where would *you* start on answering these questions? Which ones are good questions, and which ones aren't? And if you think certain questions aren't good, could you offer some you think are?

 

As suggested by shminux, here's some research I believe is indicative of the state of the literature (though this falls quite short of a full literature review):

Tononi's IIT seems relevant, though it only addresses consciousness and explicitly avoids valence. Max Tegmark has a formal generalization of IIT which he claims should apply to non-neural substrates. And although Tegmark doesn't address valence either, he posted a recent paper on arXiv noting that there *is* a mystery here, and that it seems topical for FAI research.

Current models of emotion based on brain architecture and neurochemicals (e.g., EMOCON) are somewhat relevant, though ultimately correlative or merely descriptive, and seem to have little universalization potential.

There's also a great deal of quality literature about specific correlates of pain and happiness, e.g., Building a neuroscience of pleasure and well-being and An fMRI-Based Neurologic Signature of Physical Pain. Luke covers Berridge's research in his post, The Neuroscience of Pleasure. Short version: 'liking', 'wanting', and 'learning' are all handled by different systems in the brain. Opioids within very small regions of the brain seem to induce the 'liking' response; elsewhere in the brain, opioids only produce 'wanting'. We don't know how or why yet. This sort of research constrains a general principle, but doesn't really hint toward one.

 

In short, there's plenty of research around the topic, but it's focused exclusively on humans/mammals/vertebrates: our evolved adaptations, our emotional systems, and our architectural quirks. Nothing on general or universal principles that would address any of (1)-(7). There is interesting information-theoretic / patternist work being done, but it's highly concentrated around consciousness research.

 

---

 

Bottom line: there seems to be a critically important general principle as to what makes certain arrangements of particles innately preferable to others, and we don't know what it is. Exciting!

"Smarter than us" is out!

24 Stuart_Armstrong 25 February 2014 03:50PM

We're pleased to announce the release of "Smarter Than Us: The Rise of Machine Intelligence", commissioned by MIRI and written by Oxford University’s Stuart Armstrong, and available in EPUB, MOBI, PDF, and from the Amazon and Apple ebook stores.

What happens when machines become smarter than humans? Forget lumbering Terminators. The power of an artificial intelligence (AI) comes from its intelligence, not physical strength and laser guns. Humans steer the future not because we’re the strongest or the fastest but because we’re the smartest. When machines become smarter than humans, we’ll be handing them the steering wheel. What promises—and perils—will these powerful machines present? This new book navigates these questions with clarity and wit.

Can we instruct AIs to steer the future as we desire? What goals should we program into them? It turns out this question is difficult to answer! Philosophers have tried for thousands of years to define an ideal world, but there remains no consensus. The prospect of goal-driven, smarter-than-human AI gives moral philosophy a new urgency. The future could be filled with joy, art, compassion, and beings living worthwhile and wonderful lives—but only if we’re able to precisely define what a “good” world is, and skilled enough to describe it perfectly to a computer program.

AIs, like computers, will do what we say—which is not necessarily what we mean. Such precision requires encoding the entire system of human values for an AI: explaining them to a mind that is alien to us, defining every ambiguous term, clarifying every edge case. Moreover, our values are fragile: in some cases, if we mis-define a single piece of the puzzle—say, consciousness—we end up with roughly 0% of the value we intended to reap, instead of 99% of the value.

Though an understanding of the problem is only beginning to spread, researchers from fields ranging from philosophy to computer science to economics are working together to conceive and test solutions. Are we up to the challenge?

Special thanks to all those at the FHI, MIRI and Less Wrong who helped with this work, and those who voted on the name!

New report: Intelligence Explosion Microeconomics

45 Eliezer_Yudkowsky 29 April 2013 11:14PM

Summary: Intelligence Explosion Microeconomics (pdf) is 40,000 words taking some initial steps toward tackling the key quantitative issue in the intelligence explosion, "reinvestable returns on cognitive investments": what kind of returns can you get from an investment in cognition, can you reinvest it to make yourself even smarter, and does this process die out or blow up? This can be thought of as the compact and hopefully more coherent successor to the AI Foom Debate of a few years back.

(Sample idea you haven't heard before:  The increase in hominid brain size over evolutionary time should be interpreted as evidence about increasing marginal fitness returns on brain size, presumably due to improved brain wiring algorithms; not as direct evidence about an intelligence scaling factor from brain size.)

I hope that the open problems posed therein inspire further work by economists or economically literate modelers, interested specifically in the intelligence explosion qua cognitive intelligence rather than non-cognitive 'technological acceleration'.  MIRI has an intended-to-be-small-and-technical mailing list for such discussion.  In case it's not clear from context, I (Yudkowsky) am the author of the paper.

Abstract:

I. J. Good's thesis of the 'intelligence explosion' is that a sufficiently advanced machine intelligence could build a smarter version of itself, which could in turn build an even smarter version of itself, and that this process could continue enough to vastly exceed human intelligence.  As Sandberg (2010) correctly notes, there are several attempts to lay down return-on-investment formulas intended to represent sharp speedups in economic or technological growth, but very little attempt has been made to deal formally with I. J. Good's intelligence explosion thesis as such.

I identify the key issue as returns on cognitive reinvestment - the ability to invest more computing power, faster computers, or improved cognitive algorithms to yield cognitive labor which produces larger brains, faster brains, or better mind designs.  There are many phenomena in the world which have been argued as evidentially relevant to this question, from the observed course of hominid evolution, to Moore's Law, to the competence over time of machine chess-playing systems, and many more.  I go into some depth on the sort of debates which then arise on how to interpret such evidence.  I propose that the next step forward in analyzing positions on the intelligence explosion would be to formalize return-on-investment curves, so that each stance can say formally which possible microfoundations they hold to be falsified by historical observations already made.  More generally, I pose multiple open questions of 'returns on cognitive reinvestment' or 'intelligence explosion microeconomics'.  Although such questions have received little attention thus far, they seem highly relevant to policy choices affecting the outcomes for Earth-originating intelligent life.
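As a purely illustrative example of the kind of return-on-investment curve the paper asks readers to formalize (not a model taken from the paper): suppose intelligence grows as dI/dt = c·I^k, i.e. cognitive ability is continuously reinvested with diminishing (k < 1), constant (k = 1), or increasing (k > 1) returns. The process then fizzles into polynomial growth, grows exponentially, or blows up in finite time, respectively; a few lines of numerical integration make the qualitative difference visible.

```python
def simulate_reinvestment(k, c=1.0, i0=1.0, dt=1e-3, t_max=10.0, cap=1e12):
    """Euler-integrate dI/dt = c * I**k (a toy return-on-investment curve)."""
    intelligence, t = i0, 0.0
    while t < t_max:
        intelligence += dt * c * intelligence ** k
        t += dt
        if intelligence > cap:
            return f"k={k}: exceeded cap at t={t:.2f} (blow-up-like)"
    return f"k={k}: I(t_max) = {intelligence:.3g} (no blow-up within t_max)"


for k in (0.5, 1.0, 1.5):
    print(simulate_reinvestment(k))  # fizzle, exponential, finite-time blow-up
```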

The dedicated mailing list will be small and restricted to technical discussants.


Reflection in Probabilistic Logic

63 Eliezer_Yudkowsky 24 March 2013 04:37PM

Paul Christiano has devised a new fundamental approach to the "Löb Problem" wherein Löb's Theorem seems to pose an obstacle to AIs building successor AIs, or adopting successor versions of their own code, that trust the same amount of mathematics as the original.  (I am currently writing up a more thorough description of the question this preliminary technical report is working on answering.  For now the main online description is in a quick Summit talk I gave.  See also Benja Fallenstein's description of the problem in the course of presenting a different angle of attack.  Roughly the problem is that mathematical systems can only prove the soundness of, aka 'trust', weaker mathematical systems.  If you try to write out an exact description of how AIs would build their successors or successor versions of their code in the most obvious way, it looks like the mathematical strength of the proof system would tend to be stepped down each time, which is undesirable.)

Paul Christiano's approach is inspired by the idea that whereof one cannot prove or disprove, thereof one must assign probabilities: and that although no mathematical system can contain its own truth predicate, a mathematical system might be able to contain a reflectively consistent probability predicate.  In particular, it looks like we can have:

∀a, b: (a < P(φ) < b)          ⇒  P(a < P('φ') < b) = 1
∀a, b: P(a ≤ P('φ') ≤ b) > 0  ⇒  a ≤ P(φ) ≤ b

Suppose I present you with the human and probabilistic version of a Gödel sentence, the Whitely sentence "You assign this statement a probability less than 30%."  If you disbelieve this statement, it is true.  If you believe it, it is false.  If you assign 30% probability to it, it is false.  If you assign 29% probability to it, it is true.

Paul's approach resolves this problem by restricting your belief about your own probability assignment to within epsilon of 30% for any epsilon.  So Paul's approach replies, "Well, I assign almost exactly 30% probability to that statement - maybe a little more, maybe a little less - in fact I think there's about a 30% chance that I'm a tiny bit under 0.3 probability and a 70% chance that I'm a tiny bit over 0.3 probability."  A standard fixed-point theorem then implies that a consistent assignment like this should exist.  If asked if the probability is over 0.2999 or under 0.30001 you will reply with a definite yes.
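A toy numeric check of why this dodges the paradox (an illustration, not something taken from the paper): let φ be "P(φ) < 0.3". If the agent were certain of its own assignment, no point value is consistent; but if its beliefs about its own assignment put about 30% of the mass a tiny bit under 0.3 and 70% a tiny bit over, the probability it should then assign to φ works out to 0.3 again, which is exactly the fixed point.

```python
EPS = 1e-6


def truth_of_phi(own_assignment):
    """phi says: 'You assign phi a probability of less than 30%.'"""
    return own_assignment < 0.3


# Certain self-knowledge: no point value p is consistent with phi's truth value.
for p in (0.29, 0.30, 0.31):
    implied = 1.0 if truth_of_phi(p) else 0.0
    print(f"certain p={p}: phi's truth value forces P(phi)={implied}, not {p}")

# Christiano-style resolution (informally): the agent is only sure its own
# assignment lies within epsilon of 0.3, believing it's a tiny bit under
# with probability ~0.3 and a tiny bit over with probability ~0.7.
belief_about_own_assignment = [(0.3 - EPS, 0.3), (0.3 + EPS, 0.7)]
p_phi = sum(weight * (1.0 if truth_of_phi(value) else 0.0)
            for value, weight in belief_about_own_assignment)
print(f"uncertain self-model: computed P(phi) = {p_phi}")  # 0.3, a fixed point
```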

