At the AGI-12 conference, some designers have been proponents of keeping AGIs safe by bringing them up in human environments, providing them with interactions and feedback much as we do when raising human children. Obviously that approach would fail for a fully smart AGI with its own values - it would pretend to follow our values for as long as it needed to, and then defect. However, some people are confident that if we started with a limited, dumb AGI, then we could successfully inculcate our values in this way (a more sophisticated position would be that, though this method would likely fail, it's still more likely to succeed than a top-down friendliness project!).

The major criticism of this approach is that it anthropomorphises the AGI - we have a theory of children's minds, constructed by evolution, culture, and our own child-rearing experience, and we project this onto the alien mind of the AGI, assuming that if the AGI presents behaviours similar to a well-behaved child, then it will become a moral AGI. The problem is that we don't know how alien the AGI's mind will be, or whether our reinforcement is actually reinforcing the right thing. Specifically, we need to find some way of distinguishing between:

  1. An AGI being trained to be friendly.
  2. An AGI being trained to lie and conceal.
  3. An AGI that will behave completely differently once out of the training/testing/trust-building environment.
  4. An AGI that forms the wrong categories and generalisations (what counts as "human" or "suffering", for instance), because it lacks human-shared implicit knowledge that was "too obvious" for us to even think of training it on.

What, if anything, do the raise-AI-as-kids proponents say when someone asserts that we don't have a particularly reliable process for producing Friendly children?

In defense of the raise-AI-as-kids proponents, ethics committees tend to limit the search for reliable processes for producing Friendly children.

Anyone with an idea and a computer can write an advice book on how to raise children. And Science really doesn't know what techniques have what effects in particular circumstances.

If we really knew how to raise Friendly children, public schools wouldn't be the mess that they are.

None of that has anything to do with IRB or other ethics reviews.

If we really knew how to raise Friendly children, public schools wouldn't be the mess that they are.

I am not talking about taking N children and maximizing the average Friendliness of all N. I am talking about, given N children, finding some regimen X such that a child who has finished regimen X will have the highest expected Friendliness.

Regimen X may well involve frequent metaphorical culling of children who have low expected Friendliness.

It isn't clear that Science knows even a culling regime that would create Friendly children. If there were a reliable culling regime, we'd have Friendly politicians.

Also, what's up with this degree-of-Friendliness language? An AGI with only a small degree of Friendliness is called uFAI. We're already quite confident that we could make uFAI if we could make AGI at all.

We are bringing up machines in society today. Interactions with society do apparently help them to pick up human values - so cars have air bags, blenders have safety lids, and social networks have privacy awareness.

Attempting to avoid interactions with society means that your machines will be untested, and they will be insulated from economic growth that might fuel their development. So: other things being equal, such approaches are less likely to be safe, and more likely to come last - in which case their safety is irrelevant.

It might be helpful to greatly raise the intelligence of a biological creature that had a very different evolutionary path than we did, such as the octopus, to give us a bit more insight into the general relationship between intelligence and friendliness.

Any suggestions on how to do this?

Aggressive breeding program?

If we went transgenic, that would A) be unlikely to work, and B) interfere with the very effect we are trying to test.

For (B), not if we took genes from non-mammals.

Octopi are already the pinnacle of non-mammalian intelligence, so far as I know. What else would you bring in?

Tool-using birds are smart enough that there might be useful stuff there. How you do anything useful with that and an octopus is beyond me, but no more so than doing something useful with mammal genes and an octopus.

Physically stretch and merge brains using stem cells.

Stuart: The majority of people proposing the "bringing up baby AGI" approach to encouraging AGI ethics are NOT making the kind of naive cognitive error you describe here. This approach to AGI ethics is not founded on naive anthropomorphism. Rather, it is based on the feeling of having a mix of intuitive and rigorous understanding of the AGI architectures in question - the ones that will be taught ethics.

For instance, my intuition is that if we taught an OpenCog system to be loving and ethical, then it would very likely be so, according to broad human standards. This intuition is NOT based on naively anthropomorphizing OpenCog systems, but rather based on my understanding of the actual OpenCog architecture (which has many significant differences from the human cognitive architecture).

No one, so far as I know, claims to have an airtight PROOF that this kind of approach to AGI ethics will work. However, the intuition that it will work is based largely on understanding of the specifics of the AGI architectures in question, not just on anthropomorphism.

If you want to counter-argue against this approach, you should argue about it in the context of the specific AGI architectures in question. Or else you should present some kind of principled counter-argument. Just claiming "anthropomorphism" isn't very convincing.

Thanks for your answer, Ben!

First of all, all of these methods involve integrating the AGI into human society. So the AGI is forming its values, at least in part, through doing something (possibly talking) and getting a response from some human. That human will be interpreting the AGI's answers, and selecting the right response, using their own theory of the AGI's mind - nearly certainly an anthropomorphisation! Even if that human develops experience dealing with the AGI, their understanding will be limited (just as our understanding of other humans is limited - except worse).

So the AGI programmer is taking a problem that they can't solve through direct coding, and putting the AGI through interactions so that it will acquire the values that the programmer can't specify directly, in settings where the other interactors will be prone to anthropomorphisation.

ie: "I can't solve this problem formally, but I do understand it's structure enough to be reasonably sure that anthropomorphic interactions will solve it".

If that's the claim, I would expect the programmer to be very schooled in the properties and perils of anthropomorphisation, and to cast their arguments, as much as possible, in formal logic or code form. For instance, if we want the AGI to "love" us: what kind of behaviour would we expect this to entail, and why would this code acquire that behaviour from these interactions? If you couldn't use the word love, or any close synonyms, could you still describe the process and show that it will perform well? If you can't describe love without saying "love", then you are counting on a shared non-formalised human understanding of what love is, and hoping that the AGI will stumble upon the same understanding - you don't know the contours of the definition, or the potential pitfalls, but you're counting on the AGI to avoid them.

As for those four types of behaviour that I mentioned, and that we need to separate: don't just decry the use of anthropomorphisation in my description - say which parts of the OpenCog system will be used to distinguish between them, and to select the friendly behaviour rather than the others. You know how your system works - reassure me! :-)

Stuart -- Yeah, the line of theoretical research you suggest is worthwhile....

However, it's worth noting that I and the other OpenCog team members are pressed for time, and have a lot of concrete OpenCog work to do. It would seem none of us really feels like taking a lot of time, at this stage, to carefully formalize arguments about what the system is likely to do in various situations once it's finished. We're too consumed with trying to finish the system, which is a long and difficult task in itself...

I will try to find some time in the near term to sketch a couple example arguments of the type you request... but it won't be today...

As a very rough indication for the moment... note that OpenCog has explicit GoalNode objects in its AtomSpace knowledge store, and then one can look at the explicit probabilistic ImplicationLinks pointing to these GoalNodes from various combinations of contexts and actions. So one can actually look, in principle, at the probabilistic relations between (context, action) pairs and goals that OpenCog is using to choose actions.

Now, for a quite complex OpenCog system, it may be hard to understand what all these probabilistic relations mean. But for a young OpenCog doing simple things, it will be easier. So one would want to validate for a young OpenCog doing simple things, that the information in the system's AtomSpace is compatible with 1 rather than 2-4.... One would then want to validate that, as the system gets more mature and does more complex things, there is not a trend toward more of 2-4 and less of 1 ....
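To make that slightly more concrete (this is not real OpenCog code - every name, link and probability below is invented purely for illustration), the kind of inspection I have in mind looks roughly like:

```python
from dataclasses import dataclass

@dataclass
class GoalNode:
    name: str

@dataclass
class ImplicationLink:
    context: str        # the situation the system recognises
    action: str         # the action under consideration
    goal: GoalNode      # the goal this link points to
    probability: float  # learned estimate that (context, action) achieves the goal

# Invented contents, purely for illustration.
atomspace = [
    ImplicationLink("human looks distressed", "comfort them",
                    GoalNode("humans are well"), 0.8),
    ImplicationLink("human is watching", "appear compliant",
                    GoalNode("approval signal is high"), 0.9),
]

# A crude audit: list which goals the stored (context, action) relations
# actually serve - the sort of inspection described above.
for link in atomspace:
    print(f"({link.context}, {link.action}) -> {link.goal.name}  [p={link.probability}]")
```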

Interesting line of thinking indeed! ...

Ben, your response is logical (if not correct), but the fact that many AI researchers advocate the "upbringing approach" (for other architectures) makes me very suspicious that they're anthropomorphising after all.

An AGI that is not either deeply neuromorphic or possessing a well-defined and formally stable utility function sounds like... frankly one of the worst ideas I've ever heard. I'm having difficulty imagining a way you could demonstrate the safety of such a system, or trust it enough at any point to give it enough resources to learn. Considering that the fate of intelligent life in our future light cone may hang in the balance, standards of safety must obviously be very high! Intuition is, I'm sorry, simply not an acceptable criterion on which to wager at least billions, and perhaps trillions, of lives. The expected utility math does not wash if you actually expect OpenCog to work.

On a more technical level, human values are broadly defined as some function over a typical human brain. There may be some (or many) optimizations possible, but not such that we can rely on them. So, for a really good model of human values, we should not expect to need less than the entropy of a human brain. In other words, nobody, whether they're Eliezer Yudkowsky with his formalist approach or you, is getting away with less than about ten petabytes of good training samples. Those working on uploads can skip this step entirely, but neuromorphic AI is likely to be fundamentally less useful.
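As a back-of-envelope (the neuron and synapse counts here are rough, commonly cited order-of-magnitude figures, and the bits-per-synapse value is simply assumed - change it and the total moves by an order of magnitude or more):

```python
# Loudly-flagged assumptions: counts are order-of-magnitude figures only,
# and bits_per_synapse is an assumption made purely for illustration.
neurons = 1e11              # ~10^11 neurons in a human brain
synapses_per_neuron = 1e4   # ~10^4 synapses per neuron
bits_per_synapse = 8        # assumed information content per synapse

total_bytes = neurons * synapses_per_neuron * bits_per_synapse / 8
print(f"~{total_bytes / 1e15:.1f} PB")   # ~1.0 PB under these assumptions
```

With these deliberately modest numbers it comes out around a petabyte; assuming more information per synapse pushes it toward the ~10 petabyte figure above.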

And this assumes that every bit of evidence can be mapped directly to a bit in a typical human brain map. In reality, for a non-FOOMed AI, the mapping is likely to be many orders of magnitude less efficient. I suspect, but cannot demonstrate right now, that a formalist approach starting with a clean framework along the lines of AIXI is going to be more efficient. Quite aside from that, even assuming you can acquire enough data to train your machine reliably, you still need it to do... something. Human values include a lot of unpleasant qualities. Simply giving it human values and then allowing it to grow to superhuman intellect is grossly unsafe. Ted Bundy had human values. If your plan is to train it on examples of only nice people, then you've got a really serious practical problem of how to track down >10 petabytes of really good data on the lives of saints. A formalist approach like CEV, for all the things that bug me about it, simply does not have that issue, because its utility function is defined as a function of the observed values of real humans.

In other words, for a system as alien as the architecture of OpenCog, even if we assume that the software is powerful and general enough to work (which I'm in no way convinced of), attempting to inculcate it with human values is extremely difficult, dangerous, and just plain unethical.

3 and 4 seem like the most fatal.

W/r/t 2, young, unsophisticated AIs with mostly human-readable source code require only small amounts of concern to detect "being trained to lie". Albeit this is only a small amount of concern by serious-FAI-work standards; outside the local cluster, anyone who tries to build this sort of AI in the first place might very well wave their hands and say, "Oh, but there's no difference between trying to lie with your actions to us and really being friendly, that's just some anthropomorphic interpretation of this code here" when the utility function has nothing about being nice and being nice is just being done as an instrumental act to get the humans to go along with you while you increase your reward counter. But in terms of serious FAI proposals, that's just being stupid. I'm willing to believe Paul Christiano when he tells me that his smart International Mathematical Olympiad friends are smarter than this, regardless of my past bad experiences with would-be AGI makers. In any case, it shouldn't take a large amount of "actual concern and actual willingness to admit problems" to detect this class of problem in a young AGI; so this alone would not rule out "raise the FAI like a kid" as a serious FAI proposal. Being able to tell the difference between a 'lying' young AGI and a young AGI that actually has some particular utility function - not so much by inspection of actions as by inspection of code which not only has that utility function but was human-designed to transparently and explicitly encode it - is an explicit part of serious FAI proposals.

3 and 4 are the actually difficult parts because they don't follow from mere inspection of readable source code.

On 3: Knowing that the current execution path of the code seems to be working okay today is very different from strongly constraining future execution paths across hugely different contexts to have desirable properties; this requires abstract thinking on a much higher level than staring at what your AGI is doing right now. The tank-detector works so long as it's seeing pictures from the training sets in which all tanks are present on cloudy days, but fails when it wanders out into the real world, etc. "Reflective decision theory"-style FAI proposals try to address this by being able to state the desirable properties of the AI in an abstraction which can be checked against abstractions over code execution pathways and even over permitted future self-modifications, although the 'abstract desirable properties' are very hard (require very difficult and serious FAI efforts) to specify for reasons related to 4.
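A toy numerical version of that context change (everything here - the features, the data, the classifier - is invented for illustration only): the label co-occurs with brightness in the training set but not in deployment, and the classifier happily learns the shortcut.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, tanks_only_on_cloudy_days):
    """Toy 'tank detector' data: feature 0 = tank-shaped blob, feature 1 = brightness."""
    has_tank = rng.integers(0, 2, n)
    if tanks_only_on_cloudy_days:
        brightness = 1.0 - has_tank + rng.normal(0, 0.1, n)  # tanks photographed under clouds
    else:
        brightness = rng.random(n)                           # no such correlation in deployment
    tank_shape = has_tank + rng.normal(0, 1.0, n)            # the real signal, but noisy
    return np.column_stack([tank_shape, brightness]), has_tank

X_train, y_train = make_data(2000, tanks_only_on_cloudy_days=True)
X_test, y_test = make_data(2000, tanks_only_on_cloudy_days=False)

# A bare-bones perceptron; it latches onto brightness, the cleanest separator
# in the training set, rather than the harder tank-shape feature.
w, b = np.zeros(2), 0.0
for _ in range(20):
    for x, y in zip(X_train, y_train):
        pred = 1 if x @ w + b > 0 else 0
        w += (y - pred) * x
        b += (y - pred)

def accuracy(X, y):
    return ((X @ w + b > 0).astype(int) == y).mean()

print("train accuracy:", accuracy(X_train, y_train))  # high
print("test accuracy:", accuracy(X_test, y_test))     # much lower: the shortcut doesn't transfer
```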

On 4: Since humans don't have introspective access to their own categories and generalizations, figuring out the degree of overlap by staring at their direct representations will fail (you would not know your brain's spiking pattern for blueness if you saw it), and trying to check examples is subject to a 3-related problem wherein you only check a narrow slice of samples (you never checked any cryonics patients or Terry Schiavo when you were checking that the AI knew what a 'sentient being' was). I.e., your training set turns out to unfortunately have been a dimensionally impoverished subset of the test set. "Indirect normativity" (CEV-style) proposals try to get at this by teaching the AI to idealize values as being stored in humans, such that observation about human judgments or observation of human brain states will 'correctly' (from our standpoint) refine its moral theory; as opposed to trying to get the utility function correct outright.

The anthropomorphic appeal of "raising AIs as kids" doesn't address 3 or 4, so it falls into the class of proposals that will appear to work while the AI is young, then kill you after it becomes smarter than you. Similarly, due to the problems with 3 and 4, any AGI project claiming to rely solely on 2 is probably unserious about FAI and probably will treat "learning how to get humans to press your reward button" as "our niceness training is working" a la the original AIXI paper, since you can make a plausible-sounding argument for it (or, heck, just a raw appeal to "I know my architecture!") and it avoids a lot of inconvenient work you'd have to do if you publicly admitted otherwise. Ahem.
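A minimal sketch of why that substitution is so easy to make (not any real architecture; just a pure reward-maximiser whose action-selection rule contains no term for niceness at all):

```python
from dataclasses import dataclass

@dataclass
class Action:
    name: str
    expected_button_presses: float  # the agent's learned estimate of resulting reward

def choose(actions):
    # Pure reward maximisation: nothing in this rule refers to niceness itself.
    return max(actions, key=lambda a: a.expected_button_presses)

actions = [
    Action("actually help the overseers", 0.90),
    Action("produce outputs that merely look helpful", 0.95),
]
print(choose(actions).name)  # picks whichever the overseers are likelier to reward
```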

It should also be noted that Reversed Stupidity Is Not Intelligence; there are a lot of stupid F-proposals for raising AIs like children, but that doesn't mean a serious FAI project tries to build and run an FAI in one shot without anything analogous to gradual developmental stages. Indirect normativity is complex enough to require learning (requiring an inductive DWIM architecture below, with that architecture simpler and more transparent). It's >50% probable in my estimate that there's a stage where you're slowly teaching running code about things like vision, analogous to a baby stage of a human. It's just that the presence of such a stage does not solve, and in fact does not even constitute significant progress toward, problems 3 and 4, the burden of which needs to be addressed by other proposals.

"learning how to get humans to press your reward button" as "our niceness training is working" a la the original AIXI paper,

Quote needed, wasn't this contested by the author?

On 3: Knowing that the current execution path of the code seems to be working okay today is very different from strongly constraining future execution paths across hugely different contexts to have desirable properties; this requires abstract thinking on a much higher level than staring at what your AGI is doing right now. The tank-detector works so long as it's seeing pictures from the training sets in which all tanks are present on cloudy days, but fails when it wanders out into the real world, etc. "Reflective decision theory"-style FAI proposals try to address this by being able to state the desirable properties of the AI in an abstraction which can be checked against abstractions over code execution pathways and even over permitted future self-modifications, although the 'abstract desirable properties' are very hard (require very difficult and serious FAI efforts) to specify for reasons related to 4.

Humans are able to learn basic human moral concepts with reasonable quantities of data. What is the relevant context change?

Humans are able to learn basic human moral concepts with reasonable quantities of data. What is the relevant context change?

Eh? Do you want a more detailed answer than the question might suggest? I thought nigerweiss et al had good responses.

I also don't see any human culture getting Friendliness-through-AI-training right without doing something horrible elsewhere.

I would only trust this strategy with hyper-neuromorphic artificial intelligence. And that's unlikely to FOOM uncontrollably anyway. In general, the applicability of such a strategy depends on the structure of the AI, but the line at which it might be applicable is a tiny hyperbubble in mind space centered around humans. Anything more alien than that, and it's a profoundly naive idea.

Yes. That's pretty much my point.

Thanks for sharing your personal feeling on this matter. However, I'd be more interested if you had some sort of rational argument in favor of your position!

The key issue is the tininess of the hyperbubble you describe, right? Do you have some sort of argument regarding some specific estimate of the measure of this hyperbubble? (And do you have some specific measure on mindspace in mind?)

To put it differently: What are the properties you think a mind needs to have, in order for the "raise a nice baby AGI" approach to have a reasonable chance of effectiveness? Which are the properties of the human mind that you think are necessary for this to be the case?

Well, consider this: it takes only a very small functional change to the human brain to make 'raising it as a human child' a questionable strategy at best. Crippling a few features of the brain produces sociopaths who, notably, cannot be reliably inculcated with our values, despite sharing 99.99etc% of our own neurological architecture.

Mind space is a tricky thing to pin down in a useful way, so let's just say the bubble is really tiny. If the changes you're making are larger than the changes between a sociopath and a neurotypical human, then you shouldn't employ this strategy. Trying to use it on any kind of de novo AI without anything analogous to our neurons is foolhardy beyond belief. So much of our behavior is predicated on things that aren't and can't be learned, and trying to program all of those qualities and intuitions by hand so that the AI can be properly taught our value scheme looks broadly isomorphic to the FAI problem.

Human children respond to normal child-rearing practices the way they do because of specific functional adaptations of the human mind. This general principle applies to everything from language acquisition to parent-child bonding to acculturation. Expose a monkey, dog, fish or alien to the same environment, and you'll get a different outcome.

Unfortunately, while the cog sci community has produced reams of evidence on this point, they've also discovered that said adaptations are very complex, and mapping out in detail what they all are and how they work is turning out to be a long research project. Partial results exist for a lot of intriguing examples, along with data on what goes wrong when different pieces are broken, but it's going to be a while before we have a complete picture.

An AI researcher who claims his program will respond like a human child is implicitly claiming either that this whole body of research is wrong (in which case I want to see evidence), or that he's somehow implemented all the necessary adaptations in code despite the fact that no one knows how they all work (yea, right). Either way, this isn't especially credible.

I think some cross-cultural human studies might be a way of starting to answer this question. Looking at autists, or other non-neurotypical minds, would also be helpful. Studying sociopaths or psychopaths would also be important (they pass our society's behaviour filters, and yet misbehave). The errors of early AGIs (as long as they're left unpatched!!!) will also be very revealing, and let us try to trace the contours of non-human minds, and get insights into human minds as well. Formal philosophical measures (what kind of consistent long-term behaviours can exist in theory?) may also help.

More ideas will no doubt spring to mind - if you want, we can design a research program!