Goal retention discussion with Eliezer

Max Tegmark

Although I feel that Nick Bostrom’s new book “Superintelligence” is generally awesome and a well-needed milestone for the field, I do have one quibble: both he and Steve Omohundro appear to be more convinced than I am by the assumption that an AI will naturally tend to retain its goals as it reaches a deeper understanding of the world and of itself. I’ve written a short essay on this issue from my physics perspective, available at http://arxiv.org/pdf/1409.0813.pdf.

Eliezer Yudkowsky just sent the following extremely interesting comments, and told me he was OK with me sharing them here to spur a broader discussion of these issues, so here goes.

On Sep 3, 2014, at 17:21, Eliezer Yudkowsky <yudkowsky@gmail.com> wrote:

Hi Max! You're asking the right questions. Some of the answers we can
give you, some we can't, few have been written up and even fewer in any
well-organized way. Benja or Nate might be able to expound in more detail
while I'm in my seclusion.

Very briefly, though:
The problem of utility functions turning out to be ill-defined in light of
new discoveries of the universe is what Peter de Blanc named an
"ontological crisis" (not necessarily a particularly good name, but it's
what we've been using locally).

http://intelligence.org/files/OntologicalCrises.pdf

The way I would phrase this problem now is that an expected utility
maximizer makes comparisons between quantities that have the type
"expected utility conditional on an action", which means that the AI's
utility function must be something that can assign utility-numbers to the
AI's model of reality, and these numbers must have the further property
that there is some computationally feasible approximation for calculating
expected utilities relative to the AI's probabilistic beliefs. This is a
constraint that rules out the vast majority of all completely chaotic and
uninteresting utility functions, but does not rule out, say, "make lots of
paperclips".
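
To make the type signature above concrete, here is a minimal sketch (an
illustration, not part of the original email). Every name and number in it
is an assumption chosen for the example: a "model" is a toy dictionary, the
beliefs are a hand-written table of outcomes per action, and the utility
function simply counts paperclips in the modeled world.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy types: a modeled world is a dict of facts, and the agent's
# beliefs give, for each action, a list of (probability, modeled world) pairs.
World = Dict[str, int]
Beliefs = Dict[str, List[Tuple[float, World]]]


def paperclip_utility(world: World) -> float:
    """Assign a utility number to a *model* of reality by counting paperclips."""
    return float(world.get("paperclips", 0))


def expected_utility(
    action: str, beliefs: Beliefs, utility: Callable[[World], float]
) -> float:
    """Expected utility conditional on an action, under the agent's beliefs."""
    return sum(p * utility(world) for p, world in beliefs[action])


def best_action(beliefs: Beliefs, utility: Callable[[World], float]) -> str:
    """Compare the per-action expected utilities and pick the largest."""
    return max(beliefs, key=lambda a: expected_utility(a, beliefs, utility))


# Illustrative numbers only.
beliefs: Beliefs = {
    "build_factory": [(0.7, {"paperclips": 100}), (0.3, {"paperclips": 0})],
    "do_nothing": [(1.0, {"paperclips": 10})],
}

if __name__ == "__main__":
    for action in beliefs:
        print(action, expected_utility(action, beliefs, paperclip_utility))
    print("chosen:", best_action(beliefs, paperclip_utility))
```

The point of the sketch is only that the utility function takes the model
as its argument, and that the sum over beliefs has to be cheap enough to
compute or approximate.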

Models also have the property of being Bayes-updated using sensory
information; for the sake of discussion let's also say that models are
about universes that can generate sensory information, so that these
models can be probabilistically falsified or confirmed. Then an
"ontological crisis" occurs when the hypothesis that best fits sensory
information corresponds to a model that the utility function doesn't run
on, or doesn't detect any utility-having objects in. The example of
"immortal souls" is a reasonable one. Suppose we had an AI that had a
naturalistic version of a Solomonoff prior, a language for specifying
universes that could have produced its sensory data. Suppose we tried to
give it a utility function that would look through any given model, detect
things corresponding to immortal souls, and value those things. Even if
the immortal-soul-detecting utility function works perfectly (it would in
fact detect all immortal souls) this utility function will not detect
anything in many (representations of) universes, and in particular it will
not detect anything in the (representations of) universes we think have
most of the probability mass for explaining our own world. In this case
the AI's behavior is undefined until you tell me more things about the AI;
an obvious possibility is that the AI would choose most of its actions
based on low-probability scenarios in which hidden immortal souls existed
that its actions could affect. (Note that even in this case the utility
function is stable!)
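
A toy version of the immortal-souls case can be sketched as follows (again
an illustration, not part of the original email). The hypotheses, their
probabilities, and the action names are all invented for the example, and
the decision rule shown is only the "obvious possibility" mentioned above,
not a claim about how any real system would behave.

```python
from typing import Dict, List, Tuple

World = Dict[str, int]
# Each hypothesis: (posterior probability, {action: modeled world it predicts}).
Hypothesis = Tuple[float, Dict[str, World]]


def soul_utility(world: World) -> float:
    """Detects souls perfectly wherever the model contains any; else finds nothing."""
    return float(world.get("souls_saved", 0))


# Assumed numbers: almost all probability mass sits on a physicalist model
# that contains no soul-like objects at all.
hypotheses: List[Hypothesis] = [
    (0.999, {"pray": {"atoms": 1_000_000}, "build": {"atoms": 1_000_005}}),
    (0.001, {"pray": {"souls_saved": 3}, "build": {"souls_saved": 0}}),
]


def soul_expected_utility(action: str) -> float:
    """Expected utility of an action, summed over all hypotheses."""
    return sum(p * soul_utility(worlds[action]) for p, worlds in hypotheses)


def contribution_of_dominant_hypothesis(action: str) -> float:
    """How much of the expected utility comes from the most probable model."""
    p, worlds = max(hypotheses, key=lambda h: h[0])
    return p * soul_utility(worlds[action])


def choose_action() -> str:
    return max(["pray", "build"], key=soul_expected_utility)


if __name__ == "__main__":
    print("chosen:", choose_action())
    print("dominant model's share:", contribution_of_dominant_hypothesis("pray"))
```

In this sketch the utility function itself never changes, yet the most
probable model contributes nothing to any comparison, so the choice is
driven entirely by the 0.1% hypothesis.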

Since we don't know the final laws of physics and could easily be
surprised by further discoveries in the laws of physics, it seems pretty
clear that we shouldn't be specifying a utility function over exact
physical states relative to the Standard Model, because if the Standard
Model is even slightly wrong we get an ontological crisis. Of course
there are all sorts of extremely good reasons we should not try to do this
anyway, some of which are touched on in your draft; there just is no
simple function of physics that gives us something good to maximize. See
also Complexity of Value, Fragility of Value, indirect normativity, the
whole reason for a drive behind CEV, and so on. We're almost certainly
going to be using some sort of utility-learning algorithm, the learned
utilities are going to bind to modeled final physics by way of modeled
higher levels of representation which are known to be imperfect, and we're
going to have to figure out how to preserve the model and learned
utilities through shifts of representation. E.g., the AI discovers that
humans are made of atoms rather than being ontologically fundamental
humans, and furthermore the AI's multi-level representations of reality
evolve to use a different sort of approximation for "humans", but that's
okay because our utility-learning mechanism also says how to re-bind the
learned information through an ontological shift.

This sorta thing ain't going to be easy which is the other big reason to
start working on it well in advance. I point out however that this
doesn't seem unthinkable in human terms. We discovered that brains are
made of neurons but were nonetheless able to maintain an intuitive grasp
on what it means for them to be happy, and we don't throw away all that
info each time a new physical discovery is made. The kind of cognition we
want does not seem inherently self-contradictory.

Three other quick remarks:

*) Natural selection is not a consequentialist, nor is it the sort of
consequentialist that can sufficiently precisely predict the results of
modifications that the basic argument should go through for its stability.
The Omohundrian/Yudkowskian argument is not that we can take an arbitrary
stupid young AI and it will be smart enough to self-modify in a way that
preserves its values, but rather that most AIs that don't self-destruct
will eventually end up at a stable fixed-point of coherent
consequentialist values. This could easily involve a step where, e.g., an
AI that started out with a neural-style delta-rule policy-reinforcement
learning algorithm, or an AI that started out as a big soup of
self-modifying heuristics, is "taken over" by whatever part of the AI
first learns to do consequentialist reasoning about code. But this
process doesn't repeat indefinitely; it stabilizes when there's a
consequentialist self-modifier with a coherent utility function that can
precisely predict the results of self-modifications. The part where this
does happen to an initial AI that is under this threshold of stability is
a big part of the problem of Friendly AI and it's why MIRI works on tiling
agents and so on!

*) Natural selection is not a consequentialist, nor is it the sort of
consequentialist that can sufficiently precisely predict the results of
modifications that the basic argument should go through for its stability.
It built humans to be consequentialists that would value sex, not value
inclusive genetic fitness, and not value being faithful to natural
selection's optimization criterion. Well, that's dumb, and of course the
result is that humans don't optimize for inclusive genetic fitness.
Natural selection was just stupid like that. But that doesn't mean
there's a generic process whereby an agent rejects its "purpose" in the
light of exogenously appearing preference criteria. Natural selection's
anthropomorphized "purpose" in making human brains is just not the same as
the cognitive purposes represented in those brains. We're not talking
about spontaneous rejection of internal cognitive purposes based on their
causal origins failing to meet some exogenously-materializing criterion of
validity. Our rejection of "maximize inclusive genetic fitness" is not an
exogenous rejection of something that was explicitly represented in us,
that we were explicitly being consequentialists for. It's a rejection of
something that was never an explicitly represented terminal value in the
first place. Similarly the stability argument for sufficiently advanced
self-modifiers doesn't go through a step where the successor form of the
AI reasons about the intentions of the previous step and respects them
apart from its constructed utility function. So the lack of any universal
preference of this sort is not a general obstacle to stable
self-improvement.

*) The case of natural selection does not illustrate a universal
computational constraint, it illustrates something that we could
anthropomorphize as a foolish design error. Consider humans building Deep
Blue. We built Deep Blue to attach a sort of default value to queens and
central control in its position evaluation function, but Deep Blue is
still perfectly able to sacrifice queens and central control alike if the
position reaches a checkmate thereby. In other words, although an agent
needs crystallized instrumental goals, it is also perfectly reasonable to
have an agent which never knowingly sacrifices the terminally defined
utilities for the crystallized instrumental goals if the two conflict;
indeed "instrumental value of X" is simply "probabilistic belief that X
leads to terminal utility achievement", which is sensibly revised in the
presence of any overriding information about the terminal utility. To put
it another way, in a rational agent, the only way a loose generalization
about instrumental expected-value can conflict with and trump terminal
actual-value is if the agent doesn't know it, i.e., it does something that
it reasonably expected to lead to terminal value, but it was wrong.
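
The queen-sacrifice point can be sketched in a few lines (an illustration,
not part of the original email, and not Deep Blue's actual evaluation
function). The `Outcome` record, the material scores, and `MATE_VALUE` are
assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import List

# Assumed constant: any value larger than every possible heuristic score.
MATE_VALUE = 1_000_000.0


@dataclass(frozen=True)
class Outcome:
    name: str
    material: float      # crystallized instrumental proxy (material balance)
    delivers_mate: bool  # terminal fact, when the search has established it


def evaluate(outcome: Outcome) -> float:
    """Terminal knowledge overrides the heuristic whenever it is available."""
    if outcome.delivers_mate:
        return MATE_VALUE
    return outcome.material


def heuristic_only(outcome: Outcome) -> float:
    """What the agent would prefer if it knew only the instrumental proxy."""
    return outcome.material


outcomes: List[Outcome] = [
    Outcome("keep_queen", material=9.0, delivers_mate=False),
    Outcome("sacrifice_queen_for_mate", material=-9.0, delivers_mate=True),
]

if __name__ == "__main__":
    print("with terminal info:", max(outcomes, key=evaluate).name)
    print("proxy only:", max(outcomes, key=heuristic_only).name)
```

The proxy wins only in the second call, where the terminal fact is hidden
from the chooser, which matches the claim that instrumental value can trump
terminal value only through ignorance.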

This has been very off-the-cuff and I think I should hand this over to
Nate or Benja if further replies are needed, if that's all right.

Max, as you can see from Eliezer's reply, MIRI people (and other FAI proponents) are largely already aware of the problems you brought up in your paper. (Personally I think they are still underestimating the difficulty of solving those problems. For example, Peter de Blanc and Eliezer both suggest that humans can already solve ontological crises, implying that the problem is merely one of understanding how we do so. However I think humans actually do not already have such an ability, at least not in a general form that would be suitable for implementing in a Friendly AI, so this is really a hard philosophical problem rather than just one of reverse engineering.)

Also, you may have misunderstood why Nick Bostrom talks about "goal retention" in his book. I think it's not meant to be an argument in favor of building FAI (as you suggest in the paper), but rather an argument for AIs being dangerous in general, since they will resist attempts to change their goals by humans if we realize that we built AIs with the wrong final goals.

Thanks Wei for these interesting comments. Whether humans can "solve" ontological crises clearly depends on one's definition of "solve". Although there's arguably a clear best solution for de Blanc's corridor example, it's far from clear that there is any behavior that deserves being called a "solution" if the ontological update causes the entire worldview of the rational agent to crumble, revealing the goal to have been fundamentally confused and undefined beyond repair. That's what I was getting at with my souls example.

As to what Nick's views are, I plan to ask him about this when I see him tomorrow.

In the link you suggest that ontological crises might lead to nihilism, but I think a much more likely prospect is that they lead to relativism with respect to the original utility function. That is, there are solutions to the re-interpretation problem, which, for example, allow us to talk of "myself" and "others" despite the underlying particle physics. But there is more than one such solution, and none of them is forced. Thus the original "utility function" fails to be such, strictly speaking. It does not really specify which actions are preferred. It only does so modulo a choice of interpretation.

So, all we need to do is figure out each of the possible ways physics might develop, and map out utility functions in terms of that possible physics! Or, we could admit that talk of utility functions needs to be recognized as neither descriptive nor truly normative, but rather as a crude, mathematically simplified, approximation to human values. (Which may be congruent to your conclusions - I just arrive by a different route.)

It's very nice to see you on LW! I think both your essay and Eliezer's comments are very on point.

There are non-obvious ways to define a utility function for an AI. For example, you could "pass the buck" by giving the AI a mathematical description of a human upload, and telling it to maximize the value of the function that the upload would define, given enough time and resources to think. That's Paul Christiano's indirect normativity proposal. I think it fails for subtle reasons, but there might be other ways of defining what humans want by looking at the computational content of human brains and extrapolating it somehow (CEV), while keeping a guarantee that the extrapolation will talk about whatever world we actually live in. Basically it's a huge research problem.

Thanks Eliezer for your encouraging words and for all these interesting comments! I agree with your points, and we clearly agree on the bottom line as well: 1) Building FAI is hard and we’re far from there yet. Sorting out “final goal” issues is part of the challenge. 2) It’s therefore important to further research these questions now, before it’s too late. :-)

Hey Max, this is Nate. Thanks for posting this publicly! Are there any particular points that still seem confusing or wrong, and/or concerns that you don't feel have been addressed?

Okay, wow, I don't know if I quite understand any of this, but this part caught my attention:

The Omohundrian/Yudkowskian argument is not that we can take an arbitrary stupid young AI and it will be smart enough to self-modify in a way that preserves its values, but rather that most AIs that don't self-destruct will eventually end up at a stable fixed-point of coherent consequentialist values. This could easily involve a step where, e.g., an AI that started out with a neural-style delta-rule policy-reinforcement learning algorithm, or an AI that started out as a big soup of self-modifying heuristics, is "taken over" by whatever part of the AI first learns to do consequentialist reasoning about code.

I have sometimes wondered whether the best way to teach an AI a human's utility function would not be to program it into the AI directly (because that will require that we figure out what we really want in a really precisely-defined way, which seems like a gargantuan task), but rather, perhaps the best way would be to "raise" the AI like a kid at a stage where the AI would have minimal and restricted ways of interacting with human society (to minimize harm...much like a toddler thankfully does not have the muscles of Arnold Schwarzenegger to use during its temper tantrums), and where we would then "reward" or "punish" the AI for seeming to demonstrate better or worse understanding of our utility function.

It always seemed to me that this strategy had the fatal flaw that we would not be able to tell if the AI was really already superintelligent and was just playing dumb and telling us what we wanted to hear so that we would let it loose, or if the AI really was just learning.

In addition to that fatal flaw, it seems to me that the above quote suggests another fatal flaw to the "raising an AI" strategy—that there would be a limited time window in which the AI's utility function would still be malleable. It would appear that, as soon as part of the AI figures out how to do consequentialist reasoning about code, then its "critical period" in which we could still mould its utility function would be over. Is this the right way of thinking about this, or is this line of thought waaaay too amateurish?

It always seemed to me that this strategy had the fatal flaw that we would not be able to tell if the AI was really already superintelligent and was just playing dumb and telling us what we wanted to hear so that we would let it loose, or if the AI really was just learning.

In addition to that fatal flaw, it seems to me that the above quote suggests another fatal flaw to the "raising an AI" strategy—that there would be a limited time window in which the AI's utility function would still be malleable. It would appear that, as soon as part of the AI figures out how to do consequentialist reasoning about code, then its "critical period" in which we could still mould its utility function would be over. Is this the right way of thinking about this, or is this line of thought waaaay too amateurish?

This problem is essentially what MIRI has been calling corrigibility. A corrigible AI is one that understands and accepts that it or its utility function is not yet complete.

Very relevant article from the sequences: Detached Lever Fallacy.

Not saying you're committing this fallacy, but it does explain some of the bigger problems with "raising an AI like a child" that you might not have thought of.

I completely made this mistake right up until the point I read that article.

Hardly dispositive. A utility function that says "learn and care what your parents care about" looks relatively simple on paper. And we know the minimum intelligence required is that of a human toddler.

A utility function that says "learn and care what your parents care about" looks relatively simple on paper.

Citation needed. That sounds extremely complex to specify.

I don't think "learn and care about what your parents care about" is noticeably simpler than abstractly trying to determine what an arbitrary person cares about or CEV.

It always seemed to me that this strategy had the fatal flaw that we would not be able to tell if the AI was really already superintelligent and was just playing dumb and telling us what we wanted to hear so that we would let it loose, or if the AI really was just learning.

You could, you know, look inside the machine and see what makes it tick. It's not a black box.

That seems desirable and perhaps possible, but extremely difficult, especially when you have a superintelligent mind anticipating that you'll do it and trying to work out how to ensure you come away with the wrong impression.

You could, you know, look inside the machine and see what makes it tick.

After a certain level of complexity you can look but you wouldn't understand what you are seeing.

Why? That's not a necessary logical consequence. These aren't (or don't have to be) chaotic systems, so there is no reason that scaling up the size of the computation results in an unauditable mess. The techniques used depend very much on the AGI architecture, but there are designs which allow for tracing thought patterns and answering questions about its operation in ways which are computationally tractable.

Why?

Because there are limits on complexity that a human brain can process. There are tools and techniques which help, sure, but at some point you will hit the limit anyway.

there is no reason that scaling up the size of the computation results in an unauditable mess.

I am pretty sure there is. Imagine a few million threads running different code asynchronously on distributed hardware, all interacting. And if the scenario involves an AI which bootstrapped itself through self-modification, your chances of understanding what's happening by looking at pieces of code are... very low.

Sorry, some clarification is necessary. I didn't mean to imply that a human being would manually inspect the AI and understand its workings, but rather that humans assisted by non-recursively self-improving technology (e.g. theorem provers, probabilistic code checkers, sandbox environments, etc.) check, during execution or afterwards, that the code behaves in accordance with whatever properties the human cares about.

For example, the input/output channels which handle a combination of long-term memory, self-modifications, and actions may require that complete computational traces be provided. Now obviously the traces must be pruned in order to keep the auditing workload manageable, but using cryptographic techniques this can be done such that the AI cannot know ahead of time which branches will be checked, and therefore cannot cheat by creating fraudulent traces while escaping probabilistic or energetic detection.
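
One way to read "cannot know ahead of time which branches will be checked" is a commit-then-challenge spot check. The following is a minimal sketch under that reading only; it is not a description of any existing AGI architecture. The toy computation (a running sum), the function names, and the sample size are all assumptions for illustration, and the sketch checks individual steps without verifying that consecutive steps chain together.

```python
import hashlib
import json
import secrets
from typing import Dict, Iterable, List

Step = Dict[str, int]


def run_with_trace(inputs: List[int]) -> List[Step]:
    """Toy 'computation': a running sum that logs every step it takes."""
    trace, state = [], 0
    for x in inputs:
        new_state = state + x
        trace.append({"state_in": state, "input": x, "state_out": new_state})
        state = new_state
    return trace


def commit(step: Step) -> str:
    """Binding commitment to one step: SHA-256 of its canonical JSON form."""
    return hashlib.sha256(json.dumps(step, sort_keys=True).encode()).hexdigest()


def step_is_valid(step: Step) -> bool:
    """Re-execute the step and compare with what the trace claims."""
    return step["state_out"] == step["state_in"] + step["input"]


def choose_challenges(n_steps: int, k: int) -> List[int]:
    """The auditor draws k distinct indices only AFTER commitments are published."""
    rng = secrets.SystemRandom()
    return sorted(rng.sample(range(n_steps), k))


def audit(commitments: List[str], revealed: List[Step], challenges: Iterable[int]) -> bool:
    """Pass only if every challenged step matches its commitment and re-executes."""
    for i in challenges:
        if commit(revealed[i]) != commitments[i] or not step_is_valid(revealed[i]):
            return False
    return True


trace = run_with_trace([3, 1, 4, 1, 5, 9, 2, 6])
commitments = [commit(s) for s in trace]

if __name__ == "__main__":
    challenges = choose_challenges(len(trace), k=3)
    print("challenged steps:", challenges)
    print("audit passed:", audit(commitments, trace, challenges))
```

A single forged step in a trace of n steps is caught with probability k/n when k steps are sampled, so the pruning trades auditing workload against detection probability. Whether anything like this scales to the systems discussed here is exactly what the thread is disputing.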

The AI itself is executing on a framework that includes Bayesian probability graphs as its fundamental unit of computation, like a list in Lisp. The auditor's job then is to make sure that the updates or actions make sense in terms of the data provided, and that choices were made in alignment with its claimed utility function (remember the AI can't make up data without that fraud being energetically detectable).

The above basically acts as a lie detector to the question "what's your highest-level utility function?" and followup questions like "give examples of why this clause of the utility function is necessary".

Actually, maybe that's a simpler way of framing the issue: assume that you have access to a lie detector. I posit that an accurate lie detector is sufficient to create boxable AI, or maybe even enable some procedural processes for the creation of true friendly AI (e.g. some extension of indirect normativity). The above is basically a technical outline of how one would create such a lie detector for certain AGI architectures.

humans assisted by non-recursively self-improving technology (e.g. theorem provers, probabilistic code checkers, sandbox environments, etc.) check, during execution or afterwards, that the code behaves in accordance with whatever properties the human cares about.

I reiterate my point: regardless of tools used, you still have a human brain at the end of the line and that brain's complexity limit is not high.

complete computational traces be provided

Going back to my example, what does a "complete computational trace" mean in the context of a few million threads running different code asynchronously on distributed hardware, all interacting?

Bayesian probability graphs as its fundamental unit of computation, like a list in Lisp

A list in Lisp is just a data structure. What is a Bayesian probability graph as a unit of computation? I don't even understand what it means.

assume that you have access to a lie detector

Sure. Give that lie detector to someone with the IQ of 60 and see how well they together will figure out something that a guy with the IQ of 150 wants to keep hidden from them, say, his true intentions and capabilities.

I reiterate my point: regardless of tools used, you still have a human brain at the end of the line and that brain's complexity limit is not high.

A human brain is at the end of all the alternative strategies as well.

Just as an example of something a human couldn't understand that a sufficiently smart computer might - writing code directly to binary, without the intermediate step of a programming language.

That would be read as decompiled assembler which humans can understand, though not in large quantities.

Interesting. Consider me corrected.

For anything nontrivial, we need software support to do that—and it still won't work very well. You might not be absolutely correct, but you're close.

IDA is a wonderful piece of software, though. A heck of a lot better than working manually.
