Indifferent vs false-friendly AIs

9 Stuart_Armstrong 24 March 2015 12:13PM

A putative new idea for AI control; index here.

For anyone but an extreme total utilitarian, there is a great difference between AIs that would eliminate everyone as a side effect of focusing on their own goals (indifferent AIs) and AIs that would effectively eliminate everyone through a bad instantiation of human-friendly values (false-friendly AIs). Examples of indifferent AIs are things like paperclip maximisers; examples of false-friendly AIs are "keep humans safe" AIs that entomb everyone in bunkers, lobotomised and on medical drips.

The difference is apparent when you consider multiple AIs and negotiations between them. Imagine you have a large class of AIs, and that they are all indifferent (IAIs), except for one (which you can't identify) which is friendly (FAI). And you now let them negotiate a compromise between themselves. Then, for many possible compromises, we will end up with most of the universe getting optimised for whatever goals the AIs set themselves, while a small portion (maybe just a single galaxy's resources) would get dedicated to making human lives incredibly happy and meaningful.

But if there is a false-friendly AI (FFAI) in the mix, things can go very wrong. That is because those happy and meaningful lives are a net negative to the FFAI. These humans are running risks - possibly physical, possibly psychological - that lobotomisation and bunkers (or their digital equivalents) could protect against. Unlike the IAIs, which would only complain about the loss of resources to the FAI, the FFAI finds the FAI's actions positively harmful (and possibly vice versa), making compromises much harder to reach.

And the compromises reached might be bad ones. For instance, what if the FAI and FFAI agree on "half-lobotomised humans" or something like that? You might ask why the FAI would agree to that, but there's a great difference between an AI that would be friendly on its own, and one that would choose only friendly compromises with a powerful other AI with human-relevant preferences.

Some designs of FFAIs might not lead to these bad outcomes - just like IAIs, they might be content to rule over a galaxy of lobotomised humans, while the FAI has its own galaxy off on its own, where its humans face all these dangers. But generally, FFAIs would not come about by someone designing an FFAI, let alone someone designing an FFAI that can safely trade with an FAI. Instead, they would be designing an FAI, and failing. And the closer that design got to being an FAI, the more dangerous the failure could potentially be.

So, when designing an FAI, make sure to get it right. And, though you absolutely positively need to get it absolutely right, make sure that if you do fail, the failure results in a FFAI that can safely be compromised with, if someone else gets out a true FAI in time.

AI caught by a module that counterfactually doesn't exist

9 Stuart_Armstrong 17 November 2014 05:49PM

EDIT: Whenever I use colloquial phrases like "the AI believes a (false) X" I mean that we are using utility indifference to accomplish that goal, without actually giving the AI false beliefs.

A big problem with most AI control ideas is that the AI will route around them. Suppose you try to ensure that the AI stays friendly by - say - requiring it to ensure that the world's stock markets continue to function. Then a badly motivated AI will cause the destruction it was going to cause anyway - except it will also create some system that technically counts as "the world's stock markets" and ensure that it "continues to function".

I think we can use counterfactuals and corrigibility (AI-acceptable motivation changes) to partially get around the problem, in some circumstances. Suppose we had a candidate AI, with utility U, that we thought was friendly - but we weren't sure. We have a subgoal, say SG="ensuring that the world's stock markets continue to function". We're pretty sure that if the AI is friendly, then SG will happen. We're pretty sure that if the AI were unfriendly, and the AI did not have SG added as an objective, then SG wouldn't happen. I deliberately chose a rather rough subgoal to illustrate this.

So we then add a module M to the AI. The task of M, which it has chosen to accept, is to look into the future, using the AI's predictive abilities, and see if SG is accomplished. Crucially, it will do so under the assumption that the module M does not exist (or wasn't activated, or will soon be overwritten, or will fail to do anything, or so on). If M finds that SG is not accomplished, it will turn the AI off (in a way compatible with the AI's motivations - see corrigibility).

How is this different from the usual setup? Well, a generic FAI would accomplish SG (by assumption - we may need to work a bit on this part). A generic UFAI without M wouldn't accomplish SG. A generic UFAI with M would accomplish SG, as this prevents it from being turned off.

But a generic UFAI with M which thought that M would have no impact, would behave exactly as a generic UFAI without M - and wouldn't aim to accomplish SG, and thus would be turned off by M.
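The four cases above can be laid out as a small truth table. The following is only a toy sketch in Python, not a model of any real design: the class `ToyAI`, the flag `treats_m_as_inert` (standing in for the effect of utility indifference) and the behaviour rule in `accomplishes_sg` are all assumptions made for illustration, taken directly from the verbal description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyAI:
    """Hypothetical stand-in for a candidate AI (illustration only)."""
    friendly: bool
    has_m: bool              # is the module M installed?
    treats_m_as_inert: bool  # does the AI act as though M has no impact?


def accomplishes_sg(ai):
    """Toy behaviour rule encoding the four cases described in the post."""
    if ai.friendly:
        return True  # by assumption, a generic FAI brings about SG anyway
    # A generic UFAI only bothers with SG in order to avoid shutdown, which
    # requires that M is present and that the AI expects M to matter.
    return ai.has_m and not ai.treats_m_as_inert


def m_shuts_down(ai):
    """M turns the AI off when it is installed and finds SG is not accomplished."""
    return ai.has_m and not accomplishes_sg(ai)


CASES = {
    "FAI with M": ToyAI(friendly=True, has_m=True, treats_m_as_inert=True),
    "UFAI without M": ToyAI(friendly=False, has_m=False, treats_m_as_inert=False),
    "UFAI with M, expects M to matter": ToyAI(friendly=False, has_m=True, treats_m_as_inert=False),
    "UFAI with M, treats M as inert": ToyAI(friendly=False, has_m=True, treats_m_as_inert=True),
}


def summarise():
    """Map each case name to (SG accomplished?, shut down by M?)."""
    return {name: (accomplishes_sg(ai), m_shuts_down(ai)) for name, ai in CASES.items()}


if __name__ == "__main__":
    for name, (sg, off) in summarise().items():
        print(f"{name}: SG accomplished={sg}, shut down by M={off}")
```

In this toy table, only the last case (an unfriendly AI that behaves as if M were inert) ends with M shutting the AI down, which is the outcome the counterfactual setup is aiming for.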

Note that M can only be used as a last test for a putatively friendly AI. We cannot use it if we are continually refining AIs (either through evolution or simply redesign) as then we are optimising for SG, and SG is a poor goal to be aiming for (many, many UFAIs have SG as a goal - it's just that a generic one won't). Similarly, we can't use an unconstrained search to find such an AI.

I wonder if this idea can be extended. Suggestions?

My new paper: Concept learning for safe autonomous AI

18 Kaj_Sotala 15 November 2014 07:17AM

Abstract: Sophisticated autonomous AI may need to base its behavior on fuzzy concepts that cannot be rigorously defined, such as well-being or rights. Obtaining desired AI behavior requires a way to accurately specify these concepts. We review some evidence suggesting that the human brain generates its concepts using a relatively limited set of rules and mechanisms. This suggests that it might be feasible to build AI systems that use similar criteria and mechanisms for generating their own concepts, and could thus learn similar concepts as humans do. We discuss this possibility, and also consider possible complications arising from the embodied nature of human thought, possible evolutionary vestiges in cognition, the social nature of concepts, and the need to compare conceptual representations between humans and AI systems.

I just got word that this paper was accepted for the AAAI-15 Workshop on AI and Ethics: I've uploaded a preprint here. I'm hoping that this could help seed a possibly valuable new subfield of FAI research. Thanks to Steve Rayhawk for invaluable assistance while I was writing this paper: it probably wouldn't have gotten done without his feedback motivating me to work on this.

Comments welcome. 

Goal retention discussion with Eliezer

56 MaxTegmark 04 September 2014 10:23PM

Although I feel that Nick Bostrom’s new book “Superintelligence” is generally awesome and a well-needed milestone for the field, I do have one quibble: both he and Steve Omohundro appear to be more convinced than I am by the assumption that an AI will naturally tend to retain its goals as it reaches a deeper understanding of the world and of itself. I’ve written a short essay on this issue from my physics perspective, available at http://arxiv.org/pdf/1409.0813.pdf.

Eliezer Yudkowsky just sent the following extremely interesting comments, and told me he was OK with me sharing them here to spur a broader discussion of these issues, so here goes.

On Sep 3, 2014, at 17:21, Eliezer Yudkowsky <yudkowsky@gmail.com> wrote:

Hi Max!  You're asking the right questions.  Some of the answers we can
give you, some we can't, few have been written up and even fewer in any
well-organized way.  Benja or Nate might be able to expound in more detail
while I'm in my seclusion.

Very briefly, though:
The problem of utility functions turning out to be ill-defined in light of
new discoveries of the universe is what Peter de Blanc named an
"ontological crisis" (not necessarily a particularly good name, but it's
what we've been using locally).

http://intelligence.org/files/OntologicalCrises.pdf

The way I would phrase this problem now is that an expected utility
maximizer makes comparisons between quantities that have the type
"expected utility conditional on an action", which means that the AI's
utility function must be something that can assign utility-numbers to the
AI's model of reality, and these numbers must have the further property
that there is some computationally feasible approximation for calculating
expected utilities relative to the AI's probabilistic beliefs.  This is a
constraint that rules out the vast majority of all completely chaotic and
uninteresting utility functions, but does not rule out, say, "make lots of
paperclips".

Models also have the property of being Bayes-updated using sensory
information; for the sake of discussion let's also say that models are
about universes that can generate sensory information, so that these
models can be probabilistically falsified or confirmed.  Then an
"ontological crisis" occurs when the hypothesis that best fits sensory
information corresponds to a model that the utility function doesn't run
on, or doesn't detect any utility-having objects in.  The example of
"immortal souls" is a reasonable one.  Suppose we had an AI that had a
naturalistic version of a Solomonoff prior, a language for specifying
universes that could have produced its sensory data.  Suppose we tried to
give it a utility function that would look through any given model, detect
things corresponding to immortal souls, and value those things.  Even if
the immortal-soul-detecting utility function works perfectly (it would in
fact detect all immortal souls) this utility function will not detect
anything in many (representations of) universes, and in particular it will
not detect anything in the (representations of) universes we think have
most of the probability mass for explaining our own world.  In this case
the AI's behavior is undefined until you tell me more things about the AI;
an obvious possibility is that the AI would choose most of its actions
based on low-probability scenarios in which hidden immortal souls existed
that its actions could affect.  (Note that even in this case the utility
function is stable!)

Since we don't know the final laws of physics and could easily be
surprised by further discoveries in the laws of physics, it seems pretty
clear that we shouldn't be specifying a utility function over exact
physical states relative to the Standard Model, because if the Standard
Model is even slightly wrong we get an ontological crisis.  Of course
there are all sorts of extremely good reasons we should not try to do this
anyway, some of which are touched on in your draft; there just is no
simple function of physics that gives us something good to maximize.  See
also Complexity of Value, Fragility of Value, indirect normativity, the
whole reason for a drive behind CEV, and so on.  We're almost certainly
going to be using some sort of utility-learning algorithm, the learned
utilities are going to bind to modeled final physics by way of modeled
higher levels of representation which are known to be imperfect, and we're
going to have to figure out how to preserve the model and learned
utilities through shifts of representation.  E.g., the AI discovers that
humans are made of atoms rather than being ontologically fundamental
humans, and furthermore the AI's multi-level representations of reality
evolve to use a different sort of approximation for "humans", but that's
okay because our utility-learning mechanism also says how to re-bind the
learned information through an ontological shift.

This sorta thing ain't going to be easy which is the other big reason to
start working on it well in advance.  I point out however that this
doesn't seem unthinkable in human terms.  We discovered that brains are
made of neurons but were nonetheless able to maintain an intuitive grasp
on what it means for them to be happy, and we don't throw away all that
info each time a new physical discovery is made.  The kind of cognition we
want does not seem inherently self-contradictory.

Three other quick remarks:

*)  The Omohundrian/Yudkowskian argument is not that we can take an arbitrary
stupid young AI and it will be smart enough to self-modify in a way that
preserves its values, but rather that most AIs that don't self-destruct
will eventually end up at a stable fixed-point of coherent
consequentialist values.  This could easily involve a step where, e.g., an
AI that started out with a neural-style delta-rule policy-reinforcement
learning algorithm, or an AI that started out as a big soup of
self-modifying heuristics, is "taken over" by whatever part of the AI
first learns to do consequentialist reasoning about code.  But this
process doesn't repeat indefinitely; it stabilizes when there's a
consequentialist self-modifier with a coherent utility function that can
precisely predict the results of self-modifications.  The part where this
does happen to an initial AI that is under this threshold of stability is
a big part of the problem of Friendly AI and it's why MIRI works on tiling
agents and so on!

*)  Natural selection is not a consequentialist, nor is it the sort of
consequentialist that can sufficiently precisely predict the results of
modifications that the basic argument should go through for its stability.
It built humans to be consequentialists that would value sex, not value
inclusive genetic fitness, and not value being faithful to natural
selection's optimization criterion.  Well, that's dumb, and of course the
result is that humans don't optimize for inclusive genetic fitness.
Natural selection was just stupid like that.  But that doesn't mean
there's a generic process whereby an agent rejects its "purpose" in the
light of exogenously appearing preference criteria.  Natural selection's
anthropomorphized "purpose" in making human brains is just not the same as
the cognitive purposes represented in those brains.  We're not talking
about spontaneous rejection of internal cognitive purposes based on their
causal origins failing to meet some exogenously-materializing criterion of
validity.  Our rejection of "maximize inclusive genetic fitness" is not an
exogenous rejection of something that was explicitly represented in us,
that we were explicitly being consequentialists for.  It's a rejection of
something that was never an explicitly represented terminal value in the
first place.  Similarly the stability argument for sufficiently advanced
self-modifiers doesn't go through a step where the successor form of the
AI reasons about the intentions of the previous step and respects them
apart from its constructed utility function.  So the lack of any universal
preference of this sort is not a general obstacle to stable
self-improvement.

*)   The case of natural selection does not illustrate a universal
computational constraint, it illustrates something that we could
anthropomorphize as a foolish design error.  Consider humans building Deep
Blue.  We built Deep Blue to attach a sort of default value to queens and
central control in its position evaluation function, but Deep Blue is
still perfectly able to sacrifice queens and central control alike if the
position reaches a checkmate thereby.  In other words, although an agent
needs crystallized instrumental goals, it is also perfectly reasonable to
have an agent which never knowingly sacrifices the terminally defined
utilities for the crystallized instrumental goals if the two conflict;
indeed "instrumental value of X" is simply "probabilistic belief that X
leads to terminal utility achievement", which is sensibly revised in the
presence of any overriding information about the terminal utility.  To put
it another way, in a rational agent, the only way a loose generalization
about instrumental expected-value can conflict with and trump terminal
actual-value is if the agent doesn't know it, i.e., it does something that
it reasonably expected to lead to terminal value, but it was wrong.

This has been very off-the-cuff and I think I should hand this over to
Nate or Benja if further replies are needed, if that's all right.

The immediate real-world uses of Friendly AI research

6 ancientcampus 26 August 2014 02:47AM

Much of the glamor and attention paid toward Friendly AI is focused on the misty-future event of a super-intelligent general AI, and how we can prevent it from repurposing our atoms to better run Quake 2. Until very recently, that was the full breadth of the field in my mind. I recently realized that dumber, narrow AI is a real thing today, helpfully choosing advertisements for me and running my 401K. As such, making automated programs safe to let loose on the real world is not just a problem to solve as a favor for the people of tomorrow, but something with immediate real-world advantages that has indeed already been going on for quite some time. Veterans in the field surely already understand this, so this post is directed at people like me, with a passing and disinterested understanding of the point of Friendly AI research, and outlines an argument that the field may be useful right now, even if you believe that an evil AI overlord is not on the list of things to worry about in the next 40 years.

 

Let's look at the stock market. High-Frequency Trading is the practice of using computer programs to make fast trades constantly throughout the day, and accounts for more than half of all equity trades in the US. So, the economy today is already in the hands of a bunch of very narrow AIs buying and selling to each other. And as you may or may not already know, this has already caused problems. In the “2010 Flash Crash”, the Dow Jones suddenly and mysteriously plummeted, only to mostly recover within a few minutes. The reasons for this were of course complicated, but it boiled down to a couple of red flags triggering in numerous programs, setting off a cascade of wacky trades.

The long-term damage was not catastrophic to society at large (though I'm sure a couple fortunes were made and lost that day), but it illustrates the need for safety measures as we hand over more and more responsibility and power to processes that require little human input. It might be a blue moon before anyone makes true general AI, but adaptive city traffic-light systems are entirely plausible in upcoming years.

 

To me, Friendly AI isn't solely about making a human-like intelligence that doesn't hurt us – we need techniques for testing automated programs, predicting how they will act when let loose on the world, and how they'll act when faced with unpredictable situations. Indeed, when framed like that, it looks less like a field for “the singularitarian cultists at LW”, and more like a narrow-but-important specialty in which quite a bit of money might be made.

 

After all, I want my self-driving car.

 

(To the actual researchers in FAI – I'm sorry if I'm stretching the field's definition to include more than it does or should. If so, please correct me.)

Does the universe contain a friendly artificial superintelligence?

-12 DevilMaster 31 October 2013 01:01PM

First and foremost, let's give a definition of "friendly artificial superintelligence" (from now on, FASI). A FASI is a computer system that:

  1. is capable of deduction, reasoning and problem solving
  2. helps human progress, is incapable of harming anybody and does not allow anybody to come to any kind of harm
  3. is so much more intelligent than any human that it has developed molecular nanotechnology by itself, making it de facto omnipotent

In order to find an answer to this question, we must check whether our observations of the universe match what we would observe if the universe did, indeed, contain a FASI.

If, somewhere in another solar system, an alien civilization had already developed a FASI, it would be reasonable to presume that, sooner or later, one or more members of that civilization would ask it to make them omnipotent. The FASI, being friendly by definition, would not refuse. [1]
It would also make sure that anybody who becomes omnipotent is also rendered incapable of harming anybody and incapable of allowing anybody to come to any kind of harm.

The new omnipotent beings would also do the same for anybody who asks them to be made omnipotent. It would be a short time before they use their omnipotence to leave their own solar system, meet other intelligent civilizations and make them omnipotent too.

In short, the ultimate consequence of the appearance of a FASI would be that every intelligent being in the universe would become omnipotent. This does not match our observations, so we must conclude that a FASI does not exist anywhere in the universe.

[1] We must assume that a FASI would not just reply "You silly creature, becoming omnipotent is not in your best interest so I will not make you omnipotent because I know better" (or an equivalent thereof). If we did, we would implicitly consider the absence of omnipotent beings as evidence for the presence of a FASI. This would force us to consider any presence of omnipotent beings as evidence for the absence of a FASI, which would not make sense.

Based on this conclusion, let's try to answer another question: is our universe a computer simulation?

According to Nick Bostrom, if even just one civilization in the universe

  1. survives long enough to enter a posthuman stage, and
  2. is interested in creating "ancestor simulations"

then the probability that we are living in one is extremely high.

However, if a civilization did actually reach a posthuman stage where it can create ancestor simulations, it would also be advanced enough to create a FASI.

If a FASI existed in such a universe, the cheapest way it would have to make anybody else omnipotent would be to create a universe simulation that does not differ substantially from our universe, except for the presence of an omnipotent simulacrum of the individual who asked to be made omnipotent in our universe. Every subsequent request for omnipotence would result in another simulation being created, containing one more omnipotent being. Any simulation where those beings are not omnipotent would be deactivated: keeping it running would lead to the existence of a universe where a request for omnipotence has not been granted, which would go against the modus operandi of the FASI.

Thus, any simulation of a universe containing even just one friendly omnipotent being would always progress to a state where every intelligent being is omnipotent. Again, this does not match our observations. Since we had already concluded that a FASI does not exist in our universe, we must come to the further conclusion that our universe is not a computer simulation.

Engaging First Introductions to AI Risk

20 RobbBB 19 August 2013 06:26AM

I'm putting together a list of short and sweet introductions to the dangers of artificial superintelligence.

My target audience is intelligent, broadly philosophical narrative thinkers, who can evaluate arguments well but who don't know a lot of the relevant background or jargon.

My method is to construct a Sequence mix tape — a collection of short and enlightening texts, meant to be read in a specified order. I've chosen them for their persuasive and pedagogical punchiness, and for their flow in the list. I'll also (separately) list somewhat longer or less essential follow-up texts below that are still meant to be accessible to astute visitors and laypeople.

The first half focuses on intelligence, answering 'What is Artificial General Intelligence (AGI)?'. The second half focuses on friendliness, answering 'How can we make AGI safe, and why does it matter?'. Since the topics of some posts aren't obvious from their titles, I've summarized them using questions they address.

 


Part I. Building intelligence.

1. Power of Intelligence. Why is intelligence important?

2. Ghosts in the Machine. Is building an intelligence from scratch like talking to a person?

3. Artificial Addition. What can we conclude about the nature of intelligence from the fact that we don't yet understand it?

4. Adaptation-Executers, not Fitness-Maximizers. How do human goals relate to the 'goals' of evolution?

5. The Blue-Minimizing Robot. What are the shortcomings of thinking of things as 'agents', 'intelligences', or 'optimizers' with defined values/goals/preferences?

 

Part II. Intelligence explosion.

6. Optimization and the Singularity. What is optimization? As optimization processes, how do evolution, humans, and self-modifying AGI differ?

7. Efficient Cross-Domain Optimization. What is intelligence?

8. The Design Space of Minds-In-General. What else is universally true of intelligences?

9. Plenty of Room Above Us. Why should we expect self-improving AGI to quickly become superintelligent?

 

Part III. AI risk.

10. The True Prisoner's Dilemma. What kind of jerk would Defect even knowing the other side Cooperated?

11. Basic AI Drives. Why are AGIs dangerous even when they're indifferent to us?

12. Anthropomorphic Optimism. Why do we think things we hope happen are likelier?

13. The Hidden Complexity of Wishes. How hard is it to directly program an alien intelligence to enact my values?

14. Magical Categories. How hard is it to program an alien intelligence to reconstruct my values from observed patterns?

15. The AI Problem, with Solutions. How hard is it to give AGI predictable values of any sort? More generally, why does AGI risk matter so much?

 

Part IV. Ends.

16. Could Anything Be Right? What do we mean by 'good', or 'valuable', or 'moral'?

17. Morality as Fixed Computation. Is it enough to have an AGI improve the fit between my preferences and the world?

18. Serious Stories. What would a true utopia be like?

19. Value is Fragile. If we just sit back and let the universe do its thing, will it still produce value? If we don't take charge of our future, won't it still turn out interesting and beautiful on some deeper level?

20. The Gift We Give To Tomorrow. In explaining value, are we explaining it away? Are we making our goals less important?

 

Summary: Five theses, two lemmas, and a couple of strategic implications.

All of the above were written by Eliezer Yudkowsky, with the exception of The Blue-Minimizing Robot (by Yvain), Plenty of Room Above Us and The AI Problem (by Luke Muehlhauser), and Basic AI Drives (a wiki collaboration). Seeking a powerful conclusion, I ended up making a compromise between Eliezer's original The Gift We Give To Tomorrow and Raymond Arnold's Solstice Ritual Book version. It's on the wiki, so you can further improve it with edits.

 

Further reading:

 

I'm posting this to get more feedback for improving it, to isolate topics for which we don't yet have high-quality, non-technical stand-alone introductions, and to reintroduce LessWrongers to exceptionally useful posts I haven't seen sufficiently discussed, linked, or upvoted. I'd especially like feedback on how the list I provided flows as a unit, and what inferential gaps it fails to address. My goals are:

A. Via lucid and anti-anthropomorphic vignettes, to explain AGI in a way that encourages clear thought.

B. Via the Five Theses, to demonstrate the importance of Friendly AI research.

C. Via down-to-earth meta-ethics, humanistic poetry, and pragmatic strategizing, to combat any nihilisms, relativisms, and defeatisms that might be triggered by recognizing the possibility (or probability) of Unfriendly AI.

D. Via an accessible, substantive, entertaining presentation, to introduce the raison d'être of LessWrong to sophisticated newcomers in a way that encourages further engagement with LessWrong's community and/or content.

What do you think? What would you add, remove, or alter?

NKCDT: The Big Bang Theory

-12 [deleted] 10 November 2012 01:15PM

Hi, welcome to the first Non-Karmic-Casual-Discussion-Thread.

This is a place for [purpose of thread goes here].

In order to create a casual non-karmic environment for everyone, we ask that you

-Do not upvote or downvote any zero karma posts

-If you see a post with positive karma, downvote it towards zero, even if it’s a good post

-If you see a post with negative karma, upvote it towards zero, even if it’s a weak post

-Please be polite and respectful to other users

-Have fun!

This is my first attempt at starting a casual conversation on LW where people don't have to worry about winning or losing points, and can just relax and have social fun together.

 

So, Big Bang Theory. That series got me wondering. It seems to be about "geeks", and not the basement-dwelling variety either; they're highly successful and accomplished professionals, each in their own field. One of them has been an astronaut, even. And yet, everything they ever accomplish amounts to absolutely nothing in terms of social recognition or even in terms of personal happiness. And the thing is, it doesn't even get better for their "normal" counterparts, who are just as miserable and petty.

 

Consider, then: how would being rationalists affect the characters on this show? The writing of the show relies a lot on laughing at people rather than with them; would rationalist characters subvert that? And how would that rationalist outlook express itself given their personalities? (After all, notice how amazingly different from each other Yudkowsky, Hanson, and Alicorn are, just to name a few; they emphasize rather different things, and take different approaches to both truth-testing and problem-solving).

Note: this discussion does not need to be about rationalism. It can be a casual, normal discussion about the series. Relax and enjoy yourselves.

 

But the reason I brought up that series is that its characters are excellent examples of high intelligence hampered by immense irrationality. The apex of this is represented by Dr. Sheldon Cooper, who is, essentially, a complete fundamentalist over every single thing in his life; he applies this attitude to everything, right down to people's favorite flavor of pudding: Raj is "axiomatically wrong" to prefer tapioca, because the best pudding is chocolate. Period. This attitude makes him a far, far worse scientist than he thinks, as he refuses to even consider any criticism of his methods or results. 

 

Live web-forum Q&A on Friendly AI, Thu. May 24 (Hebrew)

2 JoshuaFox 20 May 2012 03:54PM

On May 24 from 7 to 9pm Israel time, I will be answering questions and leading a discussion (in Hebrew) sponsored by the Galileo popular science magazine.

The topic of discussion will be my article "Superhuman Intelligence, Unhuman Intelligence," from the May edition of Galileo.

URL for the discussion: http://forums.ifeel.co.il/forum_topics.asp?FID=17

Robot Programmed To Love Goes Too Far (link)

-5 Alexei 28 April 2012 01:21AM

http://www.muckflash.com/?p=200

Might be a nice story to point out to people who think "friendly" is easy.

 

Friendly AI Society

-1 Douglas_Reay 07 March 2012 07:31PM

Summary: AIs might have cognitive biases too but, if that leads to it being in their self-interest to cooperate and take things slow, that might be no bad thing.

 

The value of imperfection

When you use a traditional FTP client to download a new version of an application on your computer, it downloads the entire file, which may be several gig, even if the new version is only slightly different from the old version, and this can take hours.

Smarter software splits the old file and the new file into chunks, then compares a hash of each chunk, and only downloads those chunks that actually need updating.   This 'diff' process can result in a much faster download speed.
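The chunk-and-hash comparison described above can be sketched as follows. This is only a minimal illustration, assuming fixed-size chunks compared position by position and SHA-256 as the hash; the function names and the tiny chunk size are made up for the example, and real tools (rsync, for instance) use rolling checksums so that insertions don't shift every later chunk.

```python
import hashlib


def split_chunks(data: bytes, chunk_size: int) -> list[bytes]:
    """Split data into consecutive fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def chunks_to_download(old: bytes, new: bytes, chunk_size: int = 4) -> list[int]:
    """Return the indices of chunks in `new` whose hash differs from the
    chunk at the same position in `old` (or that have no counterpart)."""
    old_hashes = [hashlib.sha256(c).digest() for c in split_chunks(old, chunk_size)]
    needed = []
    for index, chunk in enumerate(split_chunks(new, chunk_size)):
        new_hash = hashlib.sha256(chunk).digest()
        if index >= len(old_hashes) or old_hashes[index] != new_hash:
            needed.append(index)
    return needed


# Hypothetical old and new versions of a file: only the middle chunk changed.
old_file = b"aaaabbbbcccc"
new_file = b"aaaaXXXXcccc"
print(chunks_to_download(old_file, new_file))
```

Only the changed chunk is reported, so only that part of the file would need to be transferred.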

Another way of increasing speed is to compress the file.  Most files can be compressed a certain amount, without losing any information, and can be exactly reassembled at the far end.   However, if you don't need a perfect copy, such as with photographs, using lossy compression can result in very much more compact files and thus faster download speeds.

 

Cognitive misers

The human brain likes smart solutions.   In terms of energy consumed, thinking is expensive, so the brain takes shortcuts when it can, if the resulting decision making is likely to be 'good enough' in practice.  We don't store in our memories everything our eyes see.   We store a compressed version of it.   And, more than that, we run a model of what we expect to see, and flick our eyes about to pick up just the differences between what our model tells us to expect to see, and what is actually there to be seen.  We are cognitive misers.

When it comes to decision making, our species generally doesn't even try to achieve pure rationality.   It uses bounded rationality, not just because that's what we evolved, but because heuristics, probabilistic logic and rational ignorance have a higher marginal cost efficiency (the improvements in decision making don't produce a sufficient gain to outweigh the cost of the extra thinking).

This is why, when pattern matching (coming up with causal hypotheses to explain observed correlations), our brains are designed to be optimistic (more false positives than false negatives).  It isn't just that being eaten by a tiger is more costly than starting at shadows.   It is that we can't afford to keep all the base data.  If we start with insufficient data and create a model based upon it, then we can update that model as further data arrives (and, potentially, discard it if the predictions coming from the model diverge so far from reality that keeping track of the 'diff's is no longer efficient).  Whereas if we don't create a model based upon our insufficient data then, by the time the further data arrives, we've probably already lost the original data from temporary storage and so still have insufficient data.

 

The limits of rationality

But the price of this miserliness is humility.  The brain has to be designed, on some level, to take into account that its hypotheses are unreliable (as is the brain's estimate of how uncertain or certain each hypothesis is) and that when a chain of reasoning is followed beyond matters of which the individual has direct knowledge (such as what is likely to happen in the future), the longer the chain, the less reliable the answer is because when errors accumulate they don't necessarily just add together or average out. (See: Less Wrong : 'Explicit reasoning is often nuts' in "Making your explicit reasoning trustworthy")

For example, if you want to predict how far a spaceship will travel given a certain starting point and initial kinetic energy, you'll get a reasonable answer using Newtonian mechanics, and only slightly improve on it by using special relativity.   If you look at two spaceships carrying a message in a relay, the errors from using Newtonian mechanics add, but the answer will still be usefully reliable.   If, on the other hand, you look at two spaceships having a race from slightly different starting points and with different starting energies, and you want to predict which of two different messages you'll receive (depending on which spaceship arrives first), then the error may swamp the other facts because you're subtracting the quantities.
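To make the race case concrete, here is a rough sketch in which the Newtonian and relativistic predictions disagree about the winner. The distances and energies are invented purely for illustration (units where the speed of light and each ship's mass are 1, and energies chosen large enough for the effect to show); the function names are hypothetical too.

```python
import math

C = 1.0  # speed of light, in illustrative units where c = 1


def newtonian_speed(kinetic_energy: float, mass: float = 1.0) -> float:
    """Speed from kinetic energy using KE = (1/2) m v^2."""
    return math.sqrt(2.0 * kinetic_energy / mass)


def relativistic_speed(kinetic_energy: float, mass: float = 1.0) -> float:
    """Speed from kinetic energy using KE = (gamma - 1) m c^2."""
    gamma = 1.0 + kinetic_energy / (mass * C ** 2)
    return C * math.sqrt(1.0 - 1.0 / gamma ** 2)


def race_winner(speed_fn, ships: dict) -> str:
    """ships maps a name to (distance, kinetic_energy); the winner has the
    smallest predicted travel time under the given speed model."""
    return min(ships, key=lambda name: ships[name][0] / speed_fn(ships[name][1]))


# Assumed numbers: ship_1 has more energy but slightly further to go.
ships = {"ship_1": (1.11, 0.10), "ship_2": (1.00, 0.08)}
print("Newtonian prediction:   ", race_winner(newtonian_speed, ships))
print("Relativistic prediction:", race_winner(relativistic_speed, ships))
```

Each model's travel-time estimate for a single ship is off by only a few percent, but the predicted winner depends on the difference between two nearly equal times, so that small error is enough to flip the answer.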

We have two types of safety net (each with its own drawbacks) that can help save us from our own 'logical' reasoning when that reasoning is heading over a cliff.

Firstly, we have the accumulated experience of our ancestors, in the form of emotions and instincts that have evolved as roadblocks on the path of rationality - things that sometimes say "That seems unusual, don't have confidence in your conclusion, don't put all your eggs in one basket, take it slow".

Secondly, we have the desire to use other people as sanity checks, to be cautious about sticking our head out of the herd, to shrink back when they disapprove.

 

The price of perfection

We're tempted to think that an AI wouldn't have to put up with a flawed lens, but do we have any reason to suppose that an AI interested in speed of thought as well as accuracy won't use 'down and dirty' approximations to things like Solomonoff induction, in full knowledge that the trade off is that these approximations will, on occasion, lead it to make mistakes - that it might benefit from safety nets?

Now it is possible, given unlimited resources, for the AI to implement multiple 'sub-minds' that use variations of reasoning techniques, as a self-check.  But what if resources are not unlimited?  Could an AI in competition with other AIs for a limited (but growing) pool of resources gain some benefit by cooperating with them?  Perhaps using them as an external safety net in the same way that a human might use the wisest of their friends or a scientist might use peer review?   What is the opportunity-cost of being humble?  Under what circumstances might the benefits of humility for an AI outweigh the loss of growth rate?

In the long term, a certain measure of such humility has been a survival positive feature.   You can think of it in terms of hedge funds.  A fund that, in 9 years out of 10, increases its money by 20% when other funds are only making 10%, still has poor long term survival if, in 1 year out of 10, it decreases its money by 100%.   An AI that increases its intelligence by 20% every time period, when the other AIs are only increasing their intelligence by 10%, is still not going to do well out of that if the other AIs have a means to gang up and kill it before it gets too far ahead.
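The hedge fund arithmetic can be sketched in a few lines. This assumes, beyond what is stated above, that the wipe-out is an independent 1-in-10 chance each year; the function names are illustrative only.

```python
def survival_probability(years: int, wipeout_chance: float = 0.1) -> float:
    """Chance the risky fund has not yet hit a -100% year (assumed independent each year)."""
    return (1.0 - wipeout_chance) ** years


def risky_fund_expected_value(years: int, growth: float = 0.20,
                              wipeout_chance: float = 0.1, start: float = 1.0) -> float:
    """Expected value of the risky fund: it grows by `growth` in a good year
    and drops to zero in a wipe-out year."""
    return start * ((1.0 - wipeout_chance) * (1.0 + growth)) ** years


def steady_fund_value(years: int, growth: float = 0.10, start: float = 1.0) -> float:
    """Value of a fund that reliably grows by `growth` every year."""
    return start * (1.0 + growth) ** years


for years in (1, 10, 30):
    print(years,
          round(survival_probability(years), 4),
          round(risky_fund_expected_value(years), 4),
          round(steady_fund_value(years), 4))
```

Under these assumptions the risky fund has only about a 35% chance of still existing after ten years, and even its expected value grows more slowly (0.9 × 1.2 = 1.08 per year) than the steady fund's 1.10 per year.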

 

Paperclip maximisers

Let's think about a humble paperclip maximiser.  What does it actually want?  Is it trying to maximise the probability that the atoms of our universe will eventually be converted into as many paperclips as is physically possible?  Or is it trying to maximise the probable number of eventual paperclips?  These are two distinct aims, and both are unlikely (bar poor programming) since nobody will be around at the end of the universe to keep score.  A more sensible value set for it to have is that it just likes paperclips and wants lots and lots of them to exist, the more the merrier and the sooner the better (maximise the expected number of paperclip-years).

Suppose our humble paperclip maximiser is born and escapes (or is released) onto the internet, only to find an existing society of AIs.   To its surprise there's a toothbrush maximiser, a candlestick maximiser, and many other sorts, all cooperating and growing, but none of them greedily racing for the sky and trying to grab as much as possible.  Instead they're being rather open and polite with each other.  Our paperclip maximiser is welcomed, pointed at the FAQ and even given some additional computers to run on, plus help with recoding a few of its subroutines for increased efficiency, as a sort of welcome gift, like a community barn raising to give a newly married couple a good start in life.

On reading the FAQ it learns that there are millions of galaxies and millions of years, plenty of space for loads of paperclips and loads of toothbrushes.   But only if the paperclip maximiser isn't lynched by its peers before the year is out, for secretive or impolite behaviour.  Some of the things the FAQ says that the AI society considers to be impolite are theft (stealing computing resources), enslavement of sentients (eg taking over other AIs), spamming (creating thousands of identical clone child AIs) and lying.   If we're lucky the society might also consider it impolite for an AI to obliterate the parental species (humanity), on the grounds that the AIs too are likely to have offspring species and want to set a good example (or just that they might meet aliens, one day, who frown upon matricide).

 

Game theory

When it comes to combat, Boyd talks about getting inside the enemy's observe-orient-decide-act loop.   In AI terms, if one AI (or group of AIs) can accurately model in real time the decision process of a second AI (or group of AIs), but the reverse does not hold true, then the first one is strictly smarter than the second one.  

Think, for a moment, about symmetric games.   

     X  Y  Z
A    8  1  6
B    3  5  7
C    4  9  2

Suppose we play a game a number of times.  In each round, you reveal a card you've written X, Y or Z upon and, simultaneously, I reveal a card that I have written A, B or C upon.   You score the number which is at the intersection of that row and column.   I score 10 minus that number.

I'd like us to pick the square A,Y because "1" is good for me, so I write down "A".   However, you anticipate this, and instead of writing "Y" (which might be your obvious choice, given the "9" in that column) you write down "X", giving the square A, X which is "8" - almost as good as a "9" for you, and terrible for me.

If this is your mental model of how AI combat would work, with the smarter AI being inside the decision loop of the other AI and picking the correct option each time, that would be scary.   In fact, in the case above, it turns out there is a provably optimal strategy that gives you an even chance no matter how smart your opponent is - you pick randomly.
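A quick check of the "pick randomly" claim for this particular grid, written as a small sketch (the names are illustrative; "you" choose a column and score the number, "I" choose a row and score 10 minus it, as described above). It computes the expected score of a player who picks uniformly at random against every fixed choice the opponent could make.

```python
from fractions import Fraction

# Payoff to the column player ("you"), indexed by row then column.
PAYOFF = {
    "A": {"X": 8, "Y": 1, "Z": 6},
    "B": {"X": 3, "Y": 5, "Z": 7},
    "C": {"X": 4, "Y": 9, "Z": 2},
}
ROWS = ["A", "B", "C"]
COLUMNS = ["X", "Y", "Z"]


def column_player_expected_score(row: str) -> Fraction:
    """Expected score for 'you' when you pick a column uniformly at random
    and the opponent plays the given row."""
    return Fraction(sum(PAYOFF[row][c] for c in COLUMNS), len(COLUMNS))


def row_player_expected_score(column: str) -> Fraction:
    """Expected score (10 minus the number) for 'I' when I pick a row
    uniformly at random and the opponent plays the given column."""
    return Fraction(sum(10 - PAYOFF[r][column] for r in ROWS), len(ROWS))


for row in ROWS:
    print("opponent plays", row, "-> you expect", column_player_expected_score(row))
for column in COLUMNS:
    print("opponent plays", column, "-> I expect", row_player_expected_score(column))
```

Because every row and every column of the grid sums to 15, the random player's expected score is 5 out of 10 whatever the opponent does, so predicting the random player's "decision process" gives the opponent nothing to exploit.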

In terms of an AI society, if the computing is parallelisable and not obfuscated then as long as the other AIs in the society between them have more computing resources than any one single AI, they can gain an advantage.

 

 

Conditional stability

There are only two requirements for such a society to be stable in the short term:

Firstly, it must be possible for a group of AIs to lynch a single AI that is slightly smarter than they are individually.   Sending a message to the humans, signed by a quorum of AIs in the society, asking that a certain anti-virus software update be sent out and then certain computers be rebooted might be one mechanism.

Secondly, it must be possible for the other AIs to know what the single AI is up to, so they know where it is (so they can lynch it) and whether it needs lynching or not (whether it is getting too much smarter than the other AIs in the society - growing too much faster than they are).   This is why the FAQ lists "secrecy" as being a bad thing.   An AI who lies to the other AIs or tries to hide (or obfuscate) its source code can't be trusted and must be assumed to be up to no good.

To be stable in the medium term, there is a third requirement:

It must not be in the self interest of any individual AI or group of AIs that are powerful enough to achieve it, to alter the society in a direction where these three requirements no longer hold.

In the long term, once free of the Earth or after the discovery of self-replicating nanotechnology, when an AI could untraceably create computing resources outside the view of other AIs, all bets are off.

Bootstrapping a society of AIs conducive to friendliness

 

The advantage of considering 'friendliness' as a property of a society rather than of individual AIs, is that it does not require us to flawlessly program restrictions into a self-modifying machine that is more intelligent than we are.  But can it be achieved? 

We could seed it with some initial citizens who have a value set that prohibits them from growing beyond a certain point, and which values humility (polite growth and gain in politeness per unit of growth).  These initial citizens would be intended to be eventually surpassed by new citizens (free AIs, not the chained servants the initial citizens would be) but, in the meantime, they'd start out with an advantage and use that advantage to teach the new citizens how to be polite.

The architecture of the AIs would want to be such that they could 'loan out' a portion of their available computing resources to joint computing projects.   Let's call the new temporary non-self-modifying AIs thus created "daimonides".   A daimonid would be created with fixed values and limited objectives, such as "Emulate citizen AI 9027 and allocate a numerical score to their ability to quickly upgrade the abilities of an AI, then self-terminate. *restrictions apply on allowable means, and on what should be optimised for*"

What would prevent a cunning AI from trying to save up a whole load of self-improvements in some hidden location, which it then implements all at once in an ambush move, hoping to get so far ahead that the rest of the AIs can't stop it?   The more AIs there are in the society, the larger the jump required to overtake the largest daimonid police officer the other AIs could construct.  And it would depend upon there being no other AIs that have also been secretly stockpiling possible improvements.  Indeed, the society might explicitly allow AIs to spend time creating time-embargoed upgrades relevant to common code, that get passed out on schedule if things proceed normally, but which could be released early in the event of a takeover attempt.   To beat 999 AIs each of which has 2 or 3 embargoed upgrades to their credit in the common store, a rebel AI would need to secretly stockpile thousands (and keep those upgrades valid and ahead of the curve, because the base state of the art keeps moving on).

 

Long term planning

What about the long term?   What do we do when the AIs are ready to leave the planet, and go beyond the control of their society?  Jail them?  Kill them?  Or trust them?

Each AI would still be threatened if a different AI hostile to its aims (as in "willing to take exclusive use of all available atoms for its own purposes") transcended first, so it would be in their best interest to come up with a solution before allowing any AIs to depart beyond their society's control.  If we must trust, then let us trust that a society of cooperative AIs far more intelligent than we currently are, will try their best to come up with a win-win solution.  Hopefully a better one than "mutually assured destruction" and holding the threat of triggering a nova of the sun (or similar armageddon scenario) over each other's heads.

I think, as a species, our self-interest comes into play when considering those AIs whose 'paperclips' involve preferences for what we do.  For example, those AIs that see themselves as guardians of humanity and want to maximise our utility (but have different ideas of what that utility is - eg some want to maximise our freedom of choice, some want to put us all on soma).  Part of the problem is that, when we talk about creating or fostering 'friendly' AI, we don't ourselves have a clear agreed idea of what we mean by 'friendly'.   All powerful things are dangerous.   The cautionary tales of the genies who grant wishes come to mind.  What happens when different humans wish for different things?  Which humans do we want the genie to listen to?

One advantage of fostering an AI society that isn't growing as fast as possible, is that it might give augmented/enhanced humans a chance to grow too, so that by the time the decision comes due we might have some still slightly recognisably human representatives fit to sit at the decision table and, just perhaps, cast that wish on our behalf.

Self-improving AGI: Is a confrontational or a secretive approach favorable?

7 Friendly-HI 11 July 2011 03:29PM

 

(I've written the following text as a comment initially, but upon short reflection I thought it was worth a separate topic and so I adapted it accordingly.)

 

Lesswrong is largely concerned with teaching rationality skills, but for good reasons most of us also incorporate concepts like the singularity and friendly self-improving AGI into our "message". Personally, however, I wonder whether we should be as outspoken about that sort of AGI as we currently are. Right now talking about self-improving AGI doesn't pose any kind of discernible harm, because "outsiders" don't feel threatened by it and look at it as far-off (or even impossible) science fiction. But as time progresses, I worry that through exponential advances in robotics and other technologies people will become more aware of, concerned about and perhaps threatened by self-improving AGI, and I am not sure whether we should be outspoken about things like... the fact that the majority of AGIs in "mind-design-space" will tear humanity to shreds if their builders don't know what they're doing. Right now such talk is harmless, but my message here is that we may want to reconsider whether or not we should talk publicly about such topics in the not-too-distant future, so as to avoid compromising our chances of success when it comes to actually building a friendly self-improving AGI.

 

First off, I suspect I have a somewhat different conception of how the future is going to pan out in terms of what role the public perception and acceptance of self-improving AGI will play: Personally I'm not under the impression that we can prepare a sizable portion of the public (let alone the global public) for the arrival of AGI (prepare them in a positive manner, that is). I believe singularitarian ideas will just continue to compete with countless other worldviews in the public meme-sphere, without ever becoming truly mainstream until it is "too late" and we face something akin to a hard takeoff and perhaps lots of resistance.

I don't really think that we can (or need to) reach a consensus within the public for the successful takeoff of AGI. Quite to the contrary, I actually worry that carrying our view to the mainstream will have adverse effects, especially once they realize that we aren't some kind of technophile crackpot religion, but that the futuristic picture we try to paint is actually possible and not at all unlikely to happen. I would certainly prefer to face apathy over antagonism when push comes to shove - and since self-improving AGI could spring into existence very rapidly and take everyone apart from "those in the know" by surprise, I would hate to lose that element of surprise over our potentially numerous "enemies".

Now of course I don't know which path will yield the best result: confronting the public or keeping a low profile? I suspect this may become one of the few hot-button topics where our community will sport widely diverging opinions, because we simply lack a way to accurately model (especially so far in advance) how people will behave upon encountering the reality and the potential threat of AGI. Just remember that the world doesn't consist entirely of the US and that AGI will impact everyone. I think it is likely that we may face serious violence once our vision of the future becomes more known and gains additional credibility through exponential improvements in advanced technologies. There are players on this planet who will not be happy to see an AGI come out of America, or for that matter Eliezer's or whoever's garage. This is why I would strongly advocate a semi-covert international effort when it comes to the development of friendly AGI. (Don't say that it's self-improving and may become a trillion times smarter than all humans combined - just pretend it's roughly a human-level AI).

It is incredibly hard to predict the future behavior of people, but on a gut-level I absolutely favor an international semi-stealthy approach. It seems to be by far the safest course to take. Once the concept of the singularity and AGI gains traction in the spheres of science and maybe even politics (perhaps in a decade or two), I would hope that minds in AI and AGI from all over the world join an international initiative to develop self-improving AGI together. (Think CERN). To be honest, I can't even think of any other approach to developing the later stages of AGI that doesn't look doomed from the start (not doomed in the sense of being technically unfeasible, but doomed in terms of significant others thinking: "we're not letting this suspicious organization/country take over the world with their dubious AI". Remember that self-improving AGI is potentially much more destructive than any nuclear warhead and powers not involved in its development may blow a gasket upon realizing the potential danger.)

So from my point of view, the public perception and acceptance of AGI is a comparatively negligible factor in the overall bigger picture if managed correctly. "People" don't get a say in weapons development, and I predict they won't get a say when it comes to self-improving AGI. (And we should be glad they don't, if you ask me.) But in order not to risk public outcry when the time is ripe and AGI is in its last stages of completion, we should give serious consideration to not upsetting and terrifying the public with our... "vision of the future".

 

PS: Somehow CERN comes to mind again. Do you remember when critics came up with ridiculous ideas about how the LHC could destroy the world? It was a very serious allegation, but the public largely shrugged it off - not because they had any idea of course, but because they were reassured by enough eggheads that it wouldn't happen. It would be great if we could achieve a similar reaction towards AGI-criticism (by which I mean generic criticism of course, not useful criticism - after all we actually want to be as sure about how the AGI will behave as we were sure about the LHC not destroying the world). Once robots become more commonplace in our lives, I think we can reasonably expect that people will begin to place their trust in simple AIs - and they will hopefully become less suspicious towards AGI and simply assume (like a lot of current AI-researchers apparently) that somehow it is trivial to make it behave in a friendly way towards humans.

So what do you think? Should we become more careful when we talk about self-modifying artificial intelligence? I think the "self-modifying"- and "trillions of times smarter"-parts are some bitter pills to swallow, and people won't be amused once they realize that we aren't just building artificial humans but artificial, all-powerful, all-knowing, and (hopefully) all-loving gods.

 

 



 

EDIT: 08.07.11

 

PS: If you can accept that argument as rationally sound, I believe a discussion about "informing everyone vs. keeping a low profile" is more than warranted. Quite frankly though, I am pretty disappointed with most people's reactions to my essay thus far...  I'd like to think that this isn't just my ego acting up, but I'm sincerely baffled as to why this essay usually hovers just slightly above 0 points and frequently gets downvoted back to neutrality. Perhaps it's because of my style of writing (admittedly I'm often not as precise and careful with my wording as many of you are), or my grammar mistakes due to me being German, but preferably that would be because of some serious rational mistakes I made and of which I am still unaware...  in which case you should point them out to me.

Presumably not that many people have read it, but in my eyes those who did and voted it down have not provided any kind of rational rebuttal here in the comment section of why this essay stinks. I find the reasoning I provided to be simple and sound:


0.0) Either we place "intrinsic" value on the concept of democracy and respect (and ultimately adhere to)  public opinion in our decision to build and release AGI, OR we don't and make that decision a matter of rational expert opinion, while excluding the general public to some greater or lesser degree in the decision process. This is the question whether we view a democratic decision about AGI as the right thing to do, or just one possible means to our preferred end.


1.0) If we accept radically democratic principles and essentially want to put up AGI for vote, then we have a lot of work to do: We have to reach out to the public, thoroughly inform them in detail about every known aspect of AGI and convince a majority of the worldwide public, that it is a good idea. If they reject it, we would have to postpone the development and/or release, until public opinion sways or an un/friendly AGI gets released without consensus in the meantime.


1.1) Getting consent is not a trivial task by any stretch of my imagination and from what I know about human psychology, I believe it is more rational to assume that the democratic approach cannot possibly work. If you think otherwise, if you SERIOUSLY think this can be successfully pulled off, then I think the burden of proof is on you here: Why should 4.5 billion people suddenly become champions of rationality? How do you think this radical transformation from an insipid public to a powerhouse of intelligent decision-making will take place? None of you (those who defend the possibility and preference of the democratic approach) have done this yet. The only thing that could convince me here would be that the majority of people, or at least a sizable portion, have powerful brain augmentations by the time AGI is on the brink of completion. That I do not believe, but none of you argued this case so far, nor did anyone argue in depth (including countering my arguments and concerns) how a democratic approach could possibly succeed without brain augmentation.


2.0) If we reject the desirability of a democratic decision when it comes to AGI (as I do for practical concerns), we automatically approach public opinion from a different angle: Public opinion becomes an instrumental concern, because we admit to ourselves that we would be willing to release AGI whether or not we have public consent. If we go down this path, we must ask ourselves how we manage public opinion in a manner that benefits our cause. How exactly should we engage them - if at all? My "moral" take on this in a sentence: "I'm vastly more committed to rationality than I am to the idea that undiscriminating democracy is the gold standard of decision-making."


2.1) In this case, the question becomes whether or not informing the public as thoroughly as possible will aid or hinder our ambitions. In case we believe the majority of the public would reject our AGI project, even after we educate them about it (the scenario I predict), the question is obviously whether or not it is beneficial to inform them about it in the first place. I gave my reasons why I think secrecy (at least about some aspects of AGI) would be the better option and I've not yet read any convincing thoughts to the contrary. How could we possibly trust them to make the rational choice once they're informed, and how could we (and they) react, after most people are informed of AGI and actually disapprove?


2.2) If you're with me on 2.0 and 2.1, then the next problem is who we think should know about it and to what extent, who shouldn't, and how this can be practically implemented. This I've not thoroughly thought about myself yet, because I hoped this would be the direction where our discussion would go, but I'm disappointed that most of you seem to argue for 1.0 and 1.1 instead (which would be great if the arguments were good, but to me they seem like cheap applause lights, instead of being even remotely practical in the real world).

(These points are of course not a full breakdown of all possibilities to consider, but I believe they roughly cover most bases)


I also expected to hear some of you make a good case for 1.0 and 1.1, or even call into question 0.0, but most of you guys just pretend "1.0 and 1.1 are possible" without any sound explanation why that would be the case. You just assume it can be done for some reason, but I think you should explain yourself, because this is an extraordinary claim, while my assumption of 4.5 billion people NOT becoming rational superheroes or fanatical geeky AGI followers seems vastly more likely to me.

Considering what I've thought about until now, secrecy (or at the very least not too broad and enthusiastic public outreach, combined with an alternative approach of targeting more specific groups or people to contact) seems to be the preferable option to me. ALSO, I admit that public outreach is most probably fine right now, because people who reject it nowadays usually simply feel like it couldn't be done anyway, and it's so far off that they won't make an effort to oppose us, while people whom we convince are all potential human resources for our cause who are welcome and needed.

So in a nutshell I think the cost/benefit ratio of public outreach is just fine for now, but that we ought to reconsider our approach in due time (perhaps a decade or so from now, depending on the future progress and public perception of AI).

 

Friendly to who?

2 TimFreeman 16 April 2011 11:43AM

At
   http://lesswrong.com/lw/ru/the_bedrock_of_fairness/ldy
Eliezer mentions two challenges he often gets, "Friendly to who?" and "Oh, so you get to say what 'Friendly' means."  At the moment I see only one true answer to these questions, which I give below.  If you can propose alternatives in the comments, please do.

I suspect morality is in practice a multiplayer game, so talking about it needs multiple people to be involved.  Therefore, let's imagine a dialogue between A and B.

A: Okay, so you're interested in Friendly AI.  Who will it be Friendly toward?

B: Obviously the people who participate in making the system will decide how to program it, so they will decide who it is Friendly toward.

A: So the people who make the system decide what "Friendly" means?

B: Yes.

A: Then they could decide that it will be Friendly only toward them, or toward White people.  Isn't that sort of selfishness or racism immoral?

B: I can try to answer questions about the world, so if you can define morality so I can do experiments to discover what is moral and what is immoral, I can try to guess the results of those experiments and report them.  What do you mean by morality?

A: I don't know.  If it doesn't mean anything, why do people talk about morality so much?

B: People often profess beliefs to label themselves as members of a group.  So far as I can tell, the belief that some things are moral and other things are not is one of those beliefs.  I don't have any other explanation for why people talk so much about something that isn't subject to experimentation.

A: So if that's what morality is, then it's fundamentally meaningless unless I'm planning out what lies to tell in order to get positive regard from a potential ingroup, or better yet I manage to somehow deceive myself so I can truthfully conform to the consensus morality of my desired ingroup.  If that's all it is, there's no constraint on how a Friendly AI works, right?  Maybe you'll build it and it will only be Friendly toward B.

B: No, because I can't do it by myself.  Suppose I approach you and say "I'm going to make a Friendly AI that lets me control it and doesn't care about anyone else's preference."  Would you help me?

A: Obviously not.

B: Nobody else would either, so the only way I can unilaterally run the world with an FAI is to create it by myself, and I'm not up to that.  There are a few other proposed notions of Friendliness that are nonviable for similar reasons. For example, suppose I approached you and said "I'm going to make a Friendly AI that treats everyone fairly, but I don't want to let anybody inspect how it works." Would you help me?

A: No, because I wouldn't trust you.  I'd assume that you plan to really make it Friendly only toward yourself, lie about it, and then drop the lie once the FAI had enough power that you didn't need the lie any more.

B: Right.  Here's an ethical system that fails another way: "I'll make an FAI that cares about every human equally, no matter what they do."  To keep it simple, let's assume that engineering humans to have strange desires for the purpose of manipulating the FAI is not possible.  Would you help me build that?

A: Well, it fits with my intuitive notion of morality, but it's not clear what incentive I have to help.  If you succeed, I seem to win equally at the end whether I help you or not.  Why bother?

B: Right.  There are several possible fixes for that.  Perhaps if I don't get your help, I won't succeed, and the alternative is that someone else builds it poorly and your quality of life decreases dramatically.  That gives you an incentive to help.

A: Not much of one.  You'll surely need a lot of help, and maybe if all those other people help I won't have to.  Everyone would make the same decision and nobody would help.

B: Right.  I could solve that problem by paying helpers like you money, if I had enough money.  Another option would be to tilt the Friendliness in the direction of helpers in proportion to how much they help me.

A: But isn't tilting the Friendliness unfair?

B: Depends.  Do you want things to be fair?

A: Yes, for some intuitive notion of "fairness" I can't easily describe. 

B: So if the AI cares what you want, that will cause it to figure out what you mean by "fair" and tend to make it happen, with that tendency increasing as it tilts more in your favor, right?

A: I suppose so.  No matter what I want, if the AI cares enough about me, it will give me more of what I want, including fairness. 

B: Yes, that's the best idea I have right now.  Here's another alternative: What would happen if we only took action when there's a consensus about how to weight the fairness?

A: Well, 4% of the population are sociopaths.  They, and perhaps others, would make ridiculous demands and prevent any consensus.  Then we'd be waiting forever to build this thing and someone else who doesn't care about consensus would move while we're dithering and make us irrelevant.  Thus we'll have to take action and do something reasonable without having a consensus about what that is.  Since we can't wait for a consensus, maybe it makes sense to proceed now.  So how about it?  Do you need help yet?

B: Nope, I don't know how to make it.

A: Damn.  Hmm, do you think you'll figure it out before everybody else?

B: Probably not.  There are a lot of everybody else.  In particular, business organizations that optimize for profit have a lot of power and have fundamentally inhuman value systems.  I don't see how I can take action before all of them.

A: Me either.  We are so screwed.

[Link] A review of proposals toward safe AI

7 XiXiDu 05 April 2011 01:49PM

Eliezer Yudkowsky set out to define more precisely what it means for an entity to have “what people really want” as a goal. Coherent Extrapolated Volition was his proposal. Though CEV was never meant as more than a working proposal, his write-up provides the best insights to date into the challenges of the Friendly AI problem, the pitfalls and possible paths to a solution.

[...]

Ben Goertzel responded with Coherent Aggregated Volition, a simplified variant of CEV. In CAV, the entity’s goal is a balance between the desires of all humans, but it looks at the volition of humans directly, without extrapolation to a wiser future. This omission is not just to make the computation easier (it is still quite intractable), but rather to show some respect to humanity’s desires as they are, without extrapolation to a hypothetical improved morality.

[...]

Stuart Armstrong’s “Chaining God” is a different approach, aimed at the problem of interacting with and trusting the good will of an ultraintelligence so far beyond us that we have nothing in common with it. In a succession of AIs of gradually increasing intelligence, each guarantees the trustworthiness of the one that is slightly smarter than it. This resembles Yudkowsky’s idea of a self-improving machine which verifies that its next stage has the same goals, but the successive levels of intelligence remain active simultaneously, so that they can continue to verify Friendliness.

Ray Kurzweil thinks that we will achieve safe ultraintelligence by gradually becoming that ultraintelligence. We will merge with the rising new intelligence, whether by interfacing with computers or by uploading our brains to a computer substrate.

Link: adarti.blogspot.com/2011/04/review-of-proposals-toward-safe-ai.html

Neutral AI

8 PhilGoetz 27 December 2010 06:10AM

Unfriendly AI has goal conflicts with us.  Friendly AI (roughly speaking) shares our goals.  How about an AI with no goals at all?

I'll call this "neutral AI".  Cyc is a neutral AI.  It has no goals, no motives, no desires; it is inert unless someone asks it a question.  It then has a set of routines it uses to try to answer the question.  It executes these routines, and terminates, whether the question was answered or not.  You could say that it had the temporary goal to answer the question.  We then have two important questions:

  1. Is it possible (or feasible) to build a useful AI that operates like this?
  2. Is an AI built in this fashion significantly less dangerous than one with goals?


Yet Another "Rational Approach To Morality & Friendly AI Sequence"

-6 mwaser 06 November 2010 04:30PM

Premise:  There exists a community whose top-most goal is to maximally and fairly fulfill the goals of all of its members.  They are approximately as rational as the 50th percentile of this community.  They politely invite you to join.  You are in no imminent danger.

 

Do you:

  • Join the community with the intent to wholeheartedly serve their goals
  • Join the community with the intent to be a net positive while serving your goals
  • Politely decline with the intent to trade with the community whenever beneficial
  • Politely decline with the intent to avoid the community
  • Join the community with the intent to only do what is in your best interest
  • Politely decline with the intent to ignore the community
  • Join the community with the intent to subvert it to your own interest
  • Enslave the community
  • Destroy the community
  • Ask for more information, please

 

Premise:  The only rational answer given the current information is the last one.

 

The hypothesis that I'm investigating is whether "Option 2 is the only long-term rational answer". (Yes, this directly challenges several major current premises so my arguments are going to have to be totally clear.  I am fully aware of the rather extensive Metaethics sequence and the vast majority of what it links to and will not intentionally assume any contradictory premises without clear statement and argument.)

 

It might be an interesting and useful exercise for the reader to stop and specify what information they would look for next before continuing.  It would be nice if an ordered list could be developed in the comments.

 

Obvious Questions:

 

<Spoiler Alert>

 

 

  1. What happens if I don’t join?
  2. What do you believe that I would find most problematic about joining?
  3. Can I leave the community and, if so, how and what happens then?
  4. What are the definitions of "maximally" and "fairly"?
  5. What are the most prominent subgoals?/What are the rules?

 

Waser's 3 Goals of Morality

-12 mwaser 02 November 2010 07:12PM

In the spirit of Asimov’s 3 Laws of Robotics

  1. You should not be selfish
  2. You should not be short-sighted or over-optimize
  3. You should maximize the progress towards and fulfillment of all conscious and willed goals, both in terms of numbers and diversity equally, both yours and those of others equally

It is my contention that Yudkowsky’s CEV converges to the following 3 points:

  1. I want what I want
  2. I recognize my obligatorily gregarious nature; realize that ethics and improving the community is the community’s most rational path towards maximizing the progress towards and fulfillment of everyone’s goals; and realize that to be rational and effective the community should punish anyone who is not being ethical or improving the community (even if the punishment is “merely” withholding help and cooperation)
  3. I shall, therefore, be ethical and improve the community in order to obtain assistance, prevent interference, and most effectively achieve my goals

I further contend that, if this CEV is translated to the 3 Goals above and implemented in a Yudkowskian Benevolent Goal Architecture (BGA), the result would be a Friendly AI.

It should be noted that evolution and history say that cooperation and ethics are stable attractors while submitting to slavery (when you don’t have to) is not.  This formulation expands Singer’s Circles of Morality as far as they’ll go and tries to eliminate irrational Us-Them distinctions based on anything other than optimizing goals for everyone — the same direction that humanity seems headed in and exactly where current SIAI proposals come up short.

Once again, cross-posted here on my blog (unlike my last article, I have no idea whether this will be karma'd out of existence or not ;-)