Less Wrong is a community blog devoted to refining the art of human rationality. Please visit our About page for more information.

Comment author: Manfred 27 March 2015 09:55:30PM *  1 point [-]

What I mostly referred to in my comment was the ontology problem for agents with high-reductive-level motivations. Example: a robot built to make people happy has to be able to find happiness somewhere in their world model, but a robot built to make itself smarter has no such need. So if you want a robot to make people happy, using world-models built to make a robot smarter, the happiness maximizer is going to need to be able to find happiness inside an unfamiliar ontology.

More exposition about why world models will end up different:

Recently I've been trying to think about why building inherently lossy predictive models of the future is a good idea. My current thesis statement is that since computations of models are much more valuable finished than unfinished, it's okay to have a lossy model as long as it finishes. The trouble is quantifying this.

For the current purpose, though, the details are not so important. Supposing one understands the uncertainty characteristics of various models, one chooses a model by maximizing an effective expected value, because inaccurate predictive models have some associated cost that depends on the agent's preferences. Agents with different preferences will pick different methods of predicting the future, even if they're locked into the same ontology, and so anything not locked in is fair game to vary widely.

In response to comment by Manfred on What I mean...
Comment author: Stuart_Armstrong 30 March 2015 03:14:39PM 0 points [-]

the happiness maximizer is going to need to be able to find happiness inside an unfamiliar ontology.

But the module for predicting human behaviour/preferences should surely be the same in a different ontology? The module is a model, and the model is likely not grounded in the fine detail of the ontology.

Example: the law of comparative advantage in economics is a high-level model, which won't collapse because the fundamental ontology is relativity rather than Newtonian mechanics. Even in a different ontology, humans should remain (by far) the best things in the world that approximate the "human model".

Comment author: tailcalled 27 March 2015 04:35:16PM 1 point [-]

I know, but my point is that such a model might be very perverse, such as "Humans do not expect to find out that you presented misleading information." rather than "Humans do not expect that you present misleading information."

Comment author: Stuart_Armstrong 30 March 2015 02:13:02PM 0 points [-]

You're right. This thing can come up in terms of "predicting human behaviour", if the AI is sneaky enough. It wouldn't come up in "compare human models of the world to reality". So there are subtle nuances there to dig into...

Comment author: Slider 30 March 2015 02:05:59PM 1 point [-]

If one were to believe there is only one thing that agents ought to maximise, could this be used as a way to translate agents that actually maximise another thing as maximising "the correct thing" but with false beliefs? If rationalism is the deep rejection of false beliefs, could this be a deep error mode where agents are seen as having false beliefs instead of recognised to have different values? Then demanding "rectification" of the factual errors would actually be a form of value imperialism.

This could also be seen as a divergence of epistemic and instrumental rationality, in that instrumental rationality would accept falsehoods if they are useful enough. That is, if you care about probabilities in order to maximise expected utility, then whether the uncertainty is in the details of the specific way the goal is reached or in the desirability of the outcome of the process is largely interchangeable. In the extreme of low probability accuracy and high utility accuracy, you would know to select the action which gets you what you want but be unsure how it makes it come about. The other extreme, of high probability accuracy but low utility accuracy, would be the technically capable AI which we don't know whether it is allied with or against us.

Comment author: Stuart_Armstrong 30 March 2015 02:12:08PM 1 point [-]

If one were to believe there is only one thing that agents ought to maximise could this be used as a way to translate agents that actually maximise another thing as maximising "the correct thing" but with false beliefs?

Not easily. It's hard to translate a u-maximiser for complex u, into, say, a u-minimiser, without redefining the entire universe.

Comment author: Skeptityke 28 March 2015 05:31:58PM 1 point [-]

"But the general result is that one can start with an AI with utility/probability estimate pair (u,P) and map it to an AI with pair (u',P) which behaves similarly to (u,P')"

Is this at all related to the Loudness metric mentioned in this paper? https://intelligence.org/files/LoudnessPriors.pdf It seems like the two are related... (in terms of probability and utility blending together into a generalized "importance" or "loudness" parameter)

Comment author: Stuart_Armstrong 30 March 2015 12:48:47PM 1 point [-]

Is this at all related to the Loudness metric mentioned in this paper?

Not really. They're only connected in that they both involve scaling of utilities (but in one case, scaling of whole utilities, in this case, scaling of portions of the utility).

Comment author: royf 27 March 2015 05:59:01PM *  2 points [-]

This is not unlike Neyman-Pearson theory. Surely this will run into the same trouble with more than 2 possible actions.

Comment author: Stuart_Armstrong 30 March 2015 12:46:19PM 1 point [-]

No, no real connection to Neyman-Pearson. And it's fine with more than 2 actions - notice that each action only uses itself in the definition. And u' doesn't even use any actions in its definition.

Comment author: buybuydandavis 27 March 2015 08:35:43PM 3 points [-]

I thought one of the takeaways from Bernardo + Smith in Bayesian Theory was that, from a decision theory perspective, your cost function and your probability function are basically an integrated whole, any division of which is arbitrary.

Comment author: Stuart_Armstrong 30 March 2015 12:42:13PM 1 point [-]

That makes sense. Do you have a reference that isn't a book?

Crude measures

9 Stuart_Armstrong 27 March 2015 03:44PM

A putative new idea for AI control; index here.

Partially inspired by a conversation with Daniel Dewey.

People often come up with a single great idea for AI, like "complexity" or "respect", that will supposedly solve the whole control problem in one swoop. Once you've done it a few times, it's generally trivially easy to start taking these ideas apart (first step: find a bad situation with high complexity/respect and a good situation with lower complexity/respect, make the bad very bad, and challenge on that). The general responses to these kinds of idea are listed here.

However, it seems to me that rather than constructing counterexamples each time, we should have a general category and slot these ideas into them. And not only have a general category with "why this can't work" attached to it, but "these are methods that can make it work better". Seeing the things needed to make their idea better can make people understand the problems, where simple counter-arguments cannot. And, possibly, if we improve the methods, one of these simple ideas may end up being implementable.


Crude measures

The category I'm proposing to define is that of "crude measures". Crude measures are methods that attempt to rely on non-fully-specified features of the world to ensure that an underdefined or underpowered solution does manage to solve the problem.

To illustrate, consider the problem of building an atomic bomb. The scientists that did it had a very detailed model of how nuclear physics worked, the properties of the various elements, and what would happen under certain circumstances. They ended up producing an atomic bomb.

The politicians who started the project knew none of that. They shovelled resources, money and administrators at scientists, and got the result they wanted - the Bomb - without ever understanding what really happened. Note that the politicians were successful, but it was a success that could only have been achieved at one particular point in history. Had they done exactly the same thing twenty years before, they would not have succeeded. Similarly, Nazi Germany tried a roughly similar approach to what the US did (on a smaller scale) and it went nowhere.

So I would define "shovel resources at atomic scientists to get a nuclear weapon" as a crude measure. It works, but it only works because there are other features of the environment that are making it work. In this case, the scientists themselves. However, certain social and human features about those scientists (which politicians are good at estimating) made it likely to work - or at least more likely to work than shovelling resources at peanut-farmers to build moon rockets.

In the case of AI, advocating for complexity is similarly a crude measure. If it works, it will work because of very contingent features of the environment, the AI design, the setup of the world, etc., not because "complexity" is intrinsically a solution to the FAI problem. And though we are confident that human politicians have a good enough idea about human motivations and culture that the Manhattan Project had at least some chance of working, we don't have confidence that those suggesting crude measures for AI control have a good enough idea to make their ideas work.

It should be evident that "crudeness" is on a sliding scale; I'd like to reserve the term for proposed solutions to the full FAI problem that do not in any way solve the deep questions about FAI.


More or less crude

The next question is, if we have a crude measure, how can we judge its chance of success? Or, if we can't even do that, can we at least improve the chances of it working?

The main problem is, of course, that of optimising. Either optimising in the sense of maximising the measure (maximum complexity!) or of choosing the measure that is the most extreme fit to the definition (maximally narrow definition of complexity!). It seems we might be able to do something about this.

Let's start by having the AI sample a large class of utility functions. Require them to be around the same expected complexity as human values. Then we use our crude measure μ - for argument's sake, let's make it something like "approval by simulated (or hypothetical) humans, on a numerical scale". This is certainly a crude measure.

We can then rank all the utility functions u, using μ to measure the value of "create M(u), a u-maximising AI, with this utility function". Then, to avoid the problems with optimisation, we could select a certain threshold value and pick any u such that E(μ|M(u)) is just above the threshold.

How to pick this threshold? Well, we might have some principled arguments ("this is about as good a future as we'd expect, and this is about as good as we expect that these simulated humans would judge it, honestly, without being hacked").
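To make the selection step concrete, here is a minimal Python sketch of "pick a u just above the threshold" as opposed to "pick the u with the highest score". Everything in it is an illustrative assumption: the labels, the scores standing in for estimates of E(μ|M(u)), and the helper name `pick_just_above_threshold` are made up, and actually obtaining those estimates is the hard part that the sketch skips.

```python
def pick_just_above_threshold(scores, threshold):
    """Return the label whose score is lowest while still above the threshold.

    `scores` maps a utility-function label to an (assumed) estimate of
    E(mu | M(u)). Returns None if nothing clears the threshold.
    """
    passing = {label: s for label, s in scores.items() if s > threshold}
    if not passing:
        return None
    # Deliberately take the minimum of the passing scores, not the maximum.
    return min(passing, key=passing.get)


# Made-up estimates of E(mu | M(u)) for four sampled utility functions.
scores = {"u1": 0.35, "u2": 0.62, "u3": 0.71, "u4": 0.99}

chosen = pick_just_above_threshold(scores, threshold=0.6)
maximiser = max(scores, key=scores.get)
```

With these numbers `chosen` is `"u2"`, while naive maximisation would return `"u4"`, the candidate most likely to have hacked the measure.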

One thing we might want to do is have multiple μ, and select things that score reasonably (but not excessively) on all of them. This is related to my idea that the best Turing test is one that the computer has not been trained or optimised on. Ideally, you'd want there to be some category of utilities "be genuinely friendly" that score higher than you'd expect on many diverse human-related μ (it may be better to randomly sample rather than fitting to precise criteria).

You could see this as saying that "programming an AI to preserve human happiness is insanely dangerous, but if you find an AI programmed to satisfice human preferences, and that other AI also happens to preserve human happiness (without knowing it would be tested on this preservation), then... it might be safer".
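The multiple-μ idea can be sketched in the same spirit. The snippet below keeps only candidates whose score on every crude measure falls inside a "reasonable but not excessive" band. The band limits, the candidate labels and the scores are all invented for illustration, and `within_band_on_all` is a hypothetical helper, not something from the post.

```python
def within_band_on_all(scores_by_candidate, low, high):
    """Keep candidates whose score lies in [low, high] on every measure.

    `scores_by_candidate` maps a label to a list of scores, one per crude
    measure mu_1, mu_2, ... (all values here are assumptions).
    """
    return [
        label
        for label, scores in scores_by_candidate.items()
        if all(low <= s <= high for s in scores)
    ]


candidates = {
    "u_a": [0.70, 0.65, 0.72],  # reasonable on every measure
    "u_b": [0.99, 0.40, 0.55],  # excessive on one measure, weak on the others
    "u_c": [0.30, 0.35, 0.20],  # poor everywhere
}

selected = within_band_on_all(candidates, low=0.6, high=0.9)
```

The upper limit is what distinguishes this from ordinary multi-criteria filtering: a suspiciously high score on any single μ is treated as a warning sign rather than a virtue.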

There are a few other thoughts we might have for trying to pick a safer u:

  • Properties of utilities under trade (are human-friendly functions more or less likely to be tradable with each other and with other utilities?).
  • If we change the definition of "human", this should have effects that seem reasonable for the change. Or some sort of "free will" approach: if we change human preferences, we want the outcome of u to change in ways comparable with that change.
  • Maybe also check whether there is a wide enough variety of future outcomes, that don't depend on the AI's choices (but on human choices - ideas from "detecting agents" may be relevant here).
  • Changing the observers from hypothetical to real (or making the creation of the AI contingent, or not, on the approval), should not change the expected outcome of u much.
  • Making sure that the utility u can be used to successfully model humans (therefore properly reflects the information inside humans).
  • Make sure that u is stable to general noise (hence not over-optimised). Stability can be measured as changes in E(μ|M(u)), E(u|M(u)), E(v|M(u)) for generic v, and other means.
  • Make sure that u is unstable to "nasty" noise (eg reversing human pain and pleasure).
  • All utilities in a certain class - the human-friendly class, hopefully - should score highly under each other (E(u|M(u)) not too far off from E(u|M(v))), while the over-optimised solutions - those scoring highly under some μ - must not score high under the class of human-friendly utilities.
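The last condition in the list can be written as a check on a table of cross-scores. In this rough sketch, `cross[v][u]` stands for an estimate of E(u|M(v)), that is, how well a v-maximiser does by u's lights. The table entries, the tolerance, the 0.5 cutoff and the label `"opt"` for an over-optimised candidate are all assumptions chosen for illustration.

```python
def mutually_compatible(cross, members, tol):
    """True if, for all u and v in `members`, E(u|M(u)) and E(u|M(v)) differ by at most tol."""
    return all(
        abs(cross[u][u] - cross[v][u]) <= tol
        for u in members
        for v in members
    )


# cross[v][u] is an assumed estimate of E(u | M(v)).
cross = {
    "u1": {"u1": 1.00, "u2": 0.90, "opt": 0.30},
    "u2": {"u1": 0.92, "u2": 1.00, "opt": 0.25},
    "opt": {"u1": 0.05, "u2": 0.10, "opt": 1.00},
}

friendly = ["u1", "u2"]
friendly_class_ok = mutually_compatible(cross, friendly, tol=0.15)

# The over-optimised candidate should score low by the lights of each friendly u.
over_optimised_scores_low = all(cross["opt"][u] < 0.5 for u in friendly)
```

Adding `"opt"` to the class makes the compatibility check fail, which is the behaviour the condition is meant to capture.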

This is just a first stab at it. It does seem to me that we should be able to abstractly characterise the properties we want from a friendly utility function, which, combined with crude measures, might actually allow us to select one without fully defining it. Any thoughts?

And with that, the various results of my AI retreat are available to all.

Comment author: tailcalled 27 March 2015 02:15:57PM 0 points [-]

The problem is that the 'human interpretation module' might give the wrong results. For instance, if it convinces people that X is morally obligatory, it might interpret that as X being morally obligatory. It is not entirely obvious to me that it would be useful to have a better model. It probably depends on what the original AI wants to do.

Comment author: Stuart_Armstrong 27 March 2015 03:01:09PM 2 points [-]

The module is supposed to be a predictive model of what humans mean or expect, rather than something that "convinces" or does anything like that.

Comment author: tailcalled 27 March 2015 01:37:21PM 2 points [-]
  1. Which leads to the obvious question of whether figuring out the rules about the questions is much simpler than figuring out the rules for morality. Do you have a specific, simple class of questions/orders in mind?

  2. Yes, but it seems to me that your approach is dependent on an 'immoral' system: simulating humans in too high detail. In other cases, one might attempt to make a nonperson predicate and eliminate all models that fail, or something. However, your idea seems to depend on simulated humans.

  3. Well, it depends on how the model of the human works and how it is asked questions. That would probably depend a lot on how the original AI structured the model of the human, and we don't currently have any AIs to test that with. The point is, though, that in certain cases, the AI might compromise the human, for instance by wireheading it or convincing it of a religion or something, and then the compromised human might command destructive things. There's a huge, hidden amount of trickiness, such as determining how to give the human correct information to decide etc etc.

Comment author: Stuart_Armstrong 27 March 2015 01:57:24PM 2 points [-]

3 is the general problem of AIs behaving badly. The way that this approach is supposed to avoid that is by constructing a "human interpretation module" that is maximally accurate, and then using that module+human instructions to be the motivation of the AI.

Basically I'm using a lot of the module approach (and the "false miracle" stuff to get counterfactuals): the AI that builds the human interpretation module will build it for the purpose of making it accurate, and the one that uses it will have it as part of its motivation. The old problems may rear their heads again if the process is ongoing, but "module X" + "human instructions" + "module X's interpretation of human instructions" seems rather solid as a one-off initial motivation.

Utility vs Probability: idea synthesis

3 Stuart_Armstrong 27 March 2015 12:30PM

A putative new idea for AI control; index here.

This post is a synthesis of some of the ideas from utility indifference and false miracles, in an easier-to-follow format that illustrates better what's going on.


Utility scaling

Suppose you have an AI with a utility u and a probability estimate P. There is a certain event X which the AI cannot affect. You wish to change the AI's estimate of the probability of X, by, say, doubling the odds ratio P(X):P(¬X). However, since it is dangerous to give an AI false beliefs (they may not be stable, for one), you instead want to make the AI behave as if it were a u-maximiser with doubled odds ratio.

Assume that the AI is currently deciding between two actions, α and ω. The expected utility of action α decomposes as:

u(α) = P(X)u(α|X) + P(¬X)u(α|¬X).

The utility of action ω is defined similarly, and the expected gain (or loss) of utility by choosing α over ω is:

u(α)-u(ω) = P(X)(u(α|X)-u(ω|X)) + P(¬X)(u(α|¬X)-u(ω|¬X)).

If we were to double the odds ratio, the expected utility gain becomes:

u(α)-u(ω) = (2P(X)(u(α|X)-u(ω|X)) + P(¬X)(u(α|¬X)-u(ω|¬X)))/Ω,    (1)

for some normalisation constant Ω = 2P(X)+P(¬X), independent of α and ω.

We can reproduce exactly the same effect by instead replacing u with u', such that

  • u'( |X)=2u( |X)
  • u'( |¬X)=u( |¬X)


u'(α)-u'(ω) = P(X)(u'(α|X)-u'(ω|X)) + P(¬X)(u'(α|¬X)-u'(ω|¬X))

= 2P(X)(u(α|X)-u(ω|X)) + P(¬X)(u(α|¬X)-u(ω|¬X)).    (2)

This, up to the unimportant positive constant factor Ω, is the same equation as (1). Thus we can accomplish, via utility manipulation, exactly the same effect on the AI's behaviour as by changing its probability estimates.
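A quick numerical check of this equivalence, as a sketch only: the probability, the utility values and the third action `gamma` are made up (the text above only uses α and ω), and the helper names are hypothetical. The code compares an agent whose odds for X are doubled with an agent that keeps P but has its X-world utility doubled.

```python
def expected_utility(p_x, u_given_x, u_given_not_x):
    """P(X) u(a|X) + P(not X) u(a|not X) for one action."""
    return p_x * u_given_x + (1 - p_x) * u_given_not_x


def doubled_odds(p_x):
    """Probability of X after doubling the odds ratio P(X):P(not X)."""
    return 2 * p_x / (2 * p_x + (1 - p_x))


p_x = 0.3  # assumed original estimate P(X)

# action -> (u(action | X), u(action | not X)); illustrative values only
actions = {"alpha": (1.0, 4.0), "omega": (3.0, 2.5), "gamma": (2.0, 3.0)}

# Agent (u, P'): same utility, doubled odds.
eu_changed_belief = {
    a: expected_utility(doubled_odds(p_x), ux, unx) for a, (ux, unx) in actions.items()
}
# Agent (u', P): original probability, utility doubled in the X world.
eu_changed_utility = {
    a: expected_utility(p_x, 2 * ux, unx) for a, (ux, unx) in actions.items()
}

normalisation = 2 * p_x + (1 - p_x)  # the constant called Omega in the text

best_original = max(actions, key=lambda a: expected_utility(p_x, *actions[a]))
best_changed_belief = max(eu_changed_belief, key=eu_changed_belief.get)
best_changed_utility = max(eu_changed_utility, key=eu_changed_utility.get)
```

With these numbers the unmodified agent picks `alpha`, while both modified agents pick `omega`, and every entry of `eu_changed_utility` equals `normalisation` times the matching entry of `eu_changed_belief`. The example uses three actions to show that nothing in the construction depends on there being only two.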

Notice that we could also have defined

  • u'( |X)=u( |X)
  • u'( |¬X)=(1/2)u( |¬X)

This is just the same u', scaled.

The utility indifference and false miracles approaches were just special cases of this, where the odds ratio was sent to infinity/zero by multiplying by zero. But the general result is that one can start with an AI with utility/probability estimate pair (u,P) and map it to an AI with pair (u',P) which behaves similarly to (u,P'). Changes in probability can be replicated as changes in utility.


Utility translating

In the previous section, we multiplied certain utilities by two. But by doing so, we implicitly used the zero point of u. But utility is invariant under translation, so this zero point is not actually anything significant.

It turns out that we don't need to care about this - any zero will do, what matters simply is that the spread between options is doubled in the X world but not in the ¬X one.

But that relies on the AI being unable to affect the probability of X and ¬X itself. If the AI has an action that will increase (or decrease) P(X), then it becomes very important where we set the zero before multiplying. Setting the zero in a different place is isomorphic with adding a constant to the X world and not the ¬X world (or vice versa). Obviously this will greatly affect the AI's preferences between X and ¬X.
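The following toy sketch shows how the choice of zero starts to matter once the AI can move P(X). It assumes an AI that was originally indifferent between the X and ¬X worlds (utility 1 in both) and that has one action which raises P(X) from 0.5 to 0.8. All numbers and names are illustrative assumptions.

```python
def eu_after_doubling(p_x, u_x, u_not_x, zero):
    """Expected u' when u is measured from `zero` and then doubled in the X world."""
    return p_x * 2 * (u_x - zero) + (1 - p_x) * (u_not_x - zero)


# Assumed setup: the AI can leave P(X) at 0.5 or push it up to 0.8.
p_if_leave, p_if_push = 0.5, 0.8
u_x = u_not_x = 1.0  # originally no preference between the two worlds


def prefers_push(zero):
    """Does the modified agent strictly prefer raising P(X)?"""
    push = eu_after_doubling(p_if_push, u_x, u_not_x, zero)
    leave = eu_after_doubling(p_if_leave, u_x, u_not_x, zero)
    return push > leave


pushes_with_zero_at_0 = prefers_push(0.0)  # X world looks better after doubling
pushes_with_zero_at_2 = prefers_push(2.0)  # X world looks worse after doubling
```

With the zero at 0 the doubled agent wants to raise P(X); with the zero at 2 it wants to avoid doing so. The same multiplication has produced opposite preferences over X, purely through where the zero was placed.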

One way of avoiding the AI affecting X is to set this constant so that u'(X)=u'(¬X), in expectation. Then the AI has no preferences between the two situations, and will not seek to boost one over the other. However, note that u(X) is an expected utility calculation. Therefore:

  1. Choosing the constant so that u'(X)=u'(¬X) requires accessing the AI's probability estimate P for various worlds; it cannot be done from outside, by multiplying the utility, as the previous approach could.
  2. Even if u'(X)=u'(¬X), this does not mean that u'(X|Y)=u'(¬X|Y) for every event Y that could happen before X does. Simple example: X is a coin flip, and Y is the bet of someone on that coin flip, someone the AI doesn't like.

This explains all the complexity of the utility indifference approach, which is essentially trying to decompose possible universes (and adding constants to particular subsets of universes) to ensure that u'(X|Y)=u'(¬X|Y) for any Y that could happen before X does.
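Both numbered points can be seen in a small version of the coin-flip example. This sketch makes several assumptions that are not in the post: the AI loses 1 unit of utility whenever the disliked bettor wins, the bettor backs X with probability 0.7 according to the AI's own estimate, and only the added constant is modelled (the doubling is left out to keep it short).

```python
P_BET_ON_X = 0.7  # assumed: AI's estimate that the disliked bettor backs X


def u(x_happens, bet_on_x):
    """Illustrative utility: -1 whenever the disliked bettor wins the bet."""
    return -1.0 if x_happens == bet_on_x else 0.0


def expected_u(x_happens, constant_in_x_world=0.0):
    """Expected utility in the X (or not-X) world, averaged over the unknown bet."""
    base = P_BET_ON_X * u(x_happens, True) + (1 - P_BET_ON_X) * u(x_happens, False)
    return base + (constant_in_x_world if x_happens else 0.0)


# Point 1: the balancing constant is computed from the AI's probabilities.
c = expected_u(False) - expected_u(True)
balanced_before_bet = abs(expected_u(True, c) - expected_u(False, c)) < 1e-12

# Point 2: once Y = "the bettor backed X" is observed, the balance is gone.
gap_after_bet_on_x = (u(True, True) + c) - u(False, True)
```

Here `c` comes out at about 0.4 and depends directly on `P_BET_ON_X`, which is why it cannot be set from outside. After the bet on X is observed, the X world is worse than the ¬X world by about 0.6, so the AI again has a reason to influence the coin.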
