A putative new idea for AI control; index here.

The insight this post comes from is a simple one: defining concepts such as “human” and “happy” is hard. A superintelligent AI will probably create good definitions of these while attempting to achieve its goals: a good definition of “human” because it needs to control humans, and of “happy” because it needs to converse convincingly with us. It is annoying that these definitions will exist, and yet that we won’t have access to them.

 

Modelling and defining

Imagine a game of football (or, as you Americans should call it, football). And now imagine a computer game version of it. How would you say that the computer game version (which is nothing more than an algorithm) is also a game of football?

Well, you can start listing features that they have in common. They both involve two “teams” fielding eleven “players” each, who “kick” a “ball” that obeys certain equations, aiming to stay within the “field”, which has different “zones” with different properties, etc...

As you list more and more properties, you refine your model of football. There are some properties that distinguish real from simulated football (fine details about the human body, for instance), but most of the properties that people care about are the same in both games.

My idea is that once you have a sufficiently complex model of football that applies to both the real game and a (good) simulated version, you can use that as the definition of football. And compare it with other putative examples of football: maybe in some places people play on the street rather than on fields, or maybe there are more players, or maybe some other games simulate different aspects to different degrees. You could try and analyse this with information-theoretic considerations (i.e. given two models of two different examples, how much information is needed to turn one into the other).
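
As a toy illustration of that comparison (the properties and the edit-cost measure below are purely illustrative stand-ins, not a serious description-length calculation):

```python
# A toy version of the information-theoretic comparison. Each "model" is just
# a dictionary of named properties, and the edit cost (how many propositions
# must be added, removed or changed to turn one model into the other) is a
# crude stand-in for a real description-length measure.

def edit_cost(model_a: dict, model_b: dict) -> int:
    """Count the propositions that differ between the two models."""
    keys = set(model_a) | set(model_b)
    return sum(1 for k in keys if model_a.get(k) != model_b.get(k))

professional_football = {
    "teams": 2, "players_per_team": 11, "surface": "marked grass pitch",
    "ball_physics": "Newtonian, with drag", "offside_rule": True,
}
street_football = {
    "teams": 2, "players_per_team": 5, "surface": "street",
    "ball_physics": "Newtonian, with drag", "offside_rule": False,
}

print(edit_cost(professional_football, street_football))  # 3 differing properties
```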

Now, this resembles the “suggestively labelled lisp tokens” approach to AI, or the Cyc approach of just listing lots of syntax stuff and their relationships. Certainly you can’t keep an AI safe by using such a model of football: if you try and contain the AI by saying “make sure that there is a ‘Football World Cup’ played every four years”, the AI will still optimise the universe and then play out something that technically fits the model every four years, without any humans around.

However, it seems to me that ‘technically fitting the model of football’ is essentially playing football. The model might include such things as a certain number of fouls expected; an uncertainty about the result; competitive elements among the players; etc... It seems that something that fits a good model of football would be something that we would recognise as football (possibly needing some translation software to interpret what was going on). Unlike the traditional approach, which involves humans listing stuff they think is important and giving it suggestive names, this involves the AI establishing what is important in order to predict all the features of the game.

We might even combine such a model with the Turing test, by motivating the AI to produce a good enough model that it could a) have conversations with many aficionados about all features of the game, b) train a team to expect to win the world cup, and c) use it to program a successful football computer game. Any model of football that allowed the AI to do this – or, better still, any football-model module that, when plugged into another, ignorant AI, allowed that AI to do this – would be an excellent definition of the game.
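
A rough sketch of what that plug-in test might look like; the interface and the three checks below are invented for illustration and don't actually measure understanding:

```python
# The separable football-model module, and an acceptance test mirroring
# checks (a), (b) and (c) above. Class and method names are assumptions.

from abc import ABC, abstractmethod

class FootballModel(ABC):
    """The football-model module that could be handed to an ignorant AI."""

    @abstractmethod
    def answer(self, question: str) -> str:
        """Converse about any feature of the game."""

    @abstractmethod
    def training_plan(self, squad: list) -> list:
        """Produce a plan expected to win the world cup."""

    @abstractmethod
    def simulate_match(self, team_a: list, team_b: list) -> dict:
        """Drive a playable computer-game version of football."""

def is_good_definition(model: FootballModel, aficionado_questions: list, squad: list) -> bool:
    """Accept the module as a 'definition' only if it supports all three uses."""
    converses = all(model.answer(q) for q in aficionado_questions)   # (a)
    coaches = len(model.training_plan(squad)) > 0                    # (b)
    simulates = "score" in model.simulate_match(squad, squad)        # (c)
    return converses and coaches and simulates
```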

It’s also one that could cross ontological crises, as you move from reality, to simulation, to possibly something else entirely, with a new physics: the essential features will still be there, as they are the essential features of the model. For instance, we can define football in Newtonian physics, but still expect that this would result in something recognisably ‘football’ in our world of relativity.

Notice that this approach deals with edge cases mainly by forbidding them. In our world, we might struggle over how to respond to a football player with weird artificial limbs; however, since this was never a feature in the model, the AI will simply classify that as “not football” (or “similar to, but not exactly, football”), since the model’s performance starts to degrade in this novel situation. This is what helps it cross ontological crises: in a relativistic football game based on a Newtonian model, the ball would be forbidden from moving at speeds where the differences in the physics become noticeable, which is perfectly compatible with the game as it’s currently played.
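
A minimal sketch of that “forbid the edge cases” rule, assuming we can score how well the model predicts a candidate situation (the accuracy measure and the 0.95 threshold are arbitrary placeholders):

```python
# If the football model's predictions about a situation start to degrade,
# the situation is classified out of the concept, rather than the concept
# being stretched to cover it.

def prediction_accuracy(predictions: list, observations: list) -> float:
    """Fraction of the model's predictions that matched what actually happened."""
    if not observations:
        return 0.0
    hits = sum(1 for p, o in zip(predictions, observations) if p == o)
    return hits / len(observations)

def classify(predictions: list, observations: list, threshold: float = 0.95) -> str:
    if prediction_accuracy(predictions, observations) >= threshold:
        return "football"
    return "similar to, but not exactly, football"

# An ordinary match: the model predicts well, so it counts as football.
print(classify(["goal", "foul", "corner"], ["goal", "foul", "corner"]))
# A match with radically modified players: predictions degrade, so it does not.
print(classify(["goal", "foul", "corner"], ["goal", "handstand dunk", "teleport"]))
```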

 

Being human

Now we take the next step, and have the AI create a model of humans. All our thought processes, our emotions, our foibles, our reactions, our weaknesses, our expectations, the features of our social interactions, the statistical distribution of personality traits in our population, how we see ourselves and change ourselves. As a side effect, this model of humanity should include almost every human definition of human, simply because this is something that might come up in a human conversation that the model should be able to predict.

Then simply use this model as the definition of human for an AI’s motivation.

What could possibly go wrong?

I would recommend first having an AI motivated to define “human” in the best possible way, most useful for making accurate predictions, keeping the definition in a separate module. Then the AI is turned off safely and the module is plugged into another AI and used as part of its definition of human in its motivation. We may also use human guidance at several points in the process (either in making, testing, or using the module), especially on unusual edge cases. We might want to have humans correcting certain assumptions the AI makes in the model, up until the AI can use the model to predict what corrections humans would suggest. But that’s not the focus of this post.
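
Very roughly, that two-stage setup might be structured as follows. The “model” here is a trivial stand-in and all the names are invented; the point is only that stage one produces a frozen module which stage two consumes without being able to alter it:

```python
# Stage 1: a prediction-only AI builds the "human" module (with human
# corrections folded in). Stage 2: a different, motivated AI uses the frozen
# module. Nothing here is a real safety mechanism; it is a structural sketch.

from dataclasses import dataclass
from typing import Callable, Dict

@dataclass(frozen=True)
class HumanModelModule:
    """Frozen output of the first, prediction-only AI."""
    predict: Callable[[str], str]      # what would a human do or say here?
    is_human: Callable[[dict], bool]   # the induced definition of "human"

def stage_one_build_module(observations: Dict[str, str],
                           corrections: Dict[str, str]) -> HumanModelModule:
    """Fit the best predictive model, then fold in human corrections
    until (ideally) the model predicts the corrections itself."""
    table = {**observations, **corrections}
    return HumanModelModule(
        predict=lambda situation: table.get(situation, "unknown"),
        is_human=lambda candidate: candidate.get("matches_model", False),
    )

def stage_two_motivated_ai(module: HumanModelModule, candidate: dict) -> str:
    """The second AI uses the frozen module inside its motivation;
    the modelling AI is assumed to have been switched off by now."""
    return "treat as human" if module.is_human(candidate) else "not covered by the definition"

module = stage_one_build_module({"greeted": "greet back"}, {"insulted": "object"})
print(module.predict("insulted"))                               # "object"
print(stage_two_motivated_ai(module, {"matches_model": True}))  # "treat as human"
```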

There are several obvious ways this approach could fail, and several ways of making it safer. The main problem is if the predictive model fails to define human in a way that preserves value. This could happen if the model is too general (some simple statistical rules) or too specific (a detailed list of all currently existing humans, atom position specified).

This could be combated by making the first AI generate lots of different models, with many different requirements of specificity, complexity, and predictive accuracy. We might require some models make excellent local predictions (what is the human about to say?), others excellent global predictions (what is that human going to decide to do with their life?). 

Then everything defined as “human” in any of the models counts as human. This results in some wasted effort on things that are not human, but this is simply wasted resources, rather than a pathological outcome (the exception being if some of the models define humans in an actively pernicious way – negative value rather than zero – similarly to the false-friendly AIs’ preferences in this post).
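
A minimal sketch of that union rule, with predicate functions standing in for the first AI’s differently-tuned models:

```python
# Anything that any model in the ensemble classifies as human counts as
# human. The stand-in "models" are simple predicates over a candidate
# description; the dictionary keys are invented for illustration.

from typing import Callable, Iterable

Candidate = dict

def union_definition(models: Iterable[Callable[[Candidate], bool]]) -> Callable[[Candidate], bool]:
    """A candidate counts as human if *any* model accepts it."""
    models = list(models)
    return lambda candidate: any(model(candidate) for model in models)

# Illustrative stand-ins: an overly general model and an overly specific one.
too_general = lambda c: c.get("uses_language", False)
too_specific = lambda c: c.get("atom_positions") == "exact match to currently existing humans"

is_human = union_definition([too_general, too_specific])
print(is_human({"uses_language": True}))                    # True: the general model accepts it
print(is_human({"atom_positions": "close but not exact"}))  # False: accepted by neither
```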

The other problem is a potentially extreme conservatism. Modelling humans involves modelling all the humans in the world today, which is a very narrow space in the range of all potential humans. To prevent the AI lobotomising everyone to fit a simple model (after all, there do exist some lobotomised humans today), we would want the AI to maintain the range of cultures and mind-types that exist today, making things even more unchanging.

To combat that, we might try and identify certain specific features of society that the AI is allowed to change. Political beliefs, certain aspects of culture, beliefs, geographical location (including being on a planet), death rates etc... are all things we could plausibly identify (via sub-sub-modules, possibly) as things that are allowed to change. It might be safer to allow them to change within a particular range, rather than to change without restriction (removing all sadness might be a good thing, but there are many more ways this could go wrong than if we, e.g., just reduced the probability of sadness).
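
One way to picture this is as a whitelist of features with permitted ranges; the feature names and bounds below are invented placeholders:

```python
# Changes are rejected unless the feature is whitelisted and the proposed
# value lies within that feature's permitted range.

ALLOWED_CHANGES = {
    "probability_of_sadness": (0.05, 0.20),   # may be reduced, but not driven to zero
    "death_rate_per_year":    (0.000, 0.008), # may fall, may not rise above roughly today's
    "lives_on_planet":        (False, True),  # leaving a planet is permitted
}

def change_permitted(feature: str, proposed_value) -> bool:
    """Reject changes to non-whitelisted features, and changes outside the range."""
    if feature not in ALLOWED_CHANGES:
        return False
    low, high = ALLOWED_CHANGES[feature]
    if isinstance(proposed_value, bool):
        return proposed_value in (low, high)
    return low <= proposed_value <= high

print(change_permitted("probability_of_sadness", 0.10))  # True: inside the range
print(change_permitted("probability_of_sadness", 0.0))   # False: total removal not allowed
print(change_permitted("number_of_heads", 2))            # False: not whitelisted at all
```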

Another option is to keep these modelled humans largely unchanged, but allow them to define allowable changes themselves (“yes, that’s a transhuman, consider it also a moral agent.”). The risk there is that the modelled humans get hacked or seduced, and that the AI fools our limited brains with a “transhuman” that is one in appearance only.

We also have to beware of sacrificing seldom-used values. For instance, one could argue that current social and technological constraints mean that no one today has anything approaching true freedom. We wouldn’t want the AI to allow us to improve technology and social structures, but never to gain more freedom than we have today, because that’s “not in the model”. Again, this is something we could look out for: if the AI has a separate model of “freedom”, we could assess it and permit it to change in certain directions.

Comments

Even without an AI, current trends may well lead to a world where the line between real football matches and simulations is blurred.

Certainly you can’t keep an AI safe by using such a model of football

I used to think that a detailed ontological mapping could provide a solution to keeping AIs safe, but have slowly realized that it probably isn't likely to work overall. It would be interesting to test this though for small, specifically defined domains (like a game of football) - it could work, or at least it would be interesting to make a toy experiment to see how it could fail.

I'm put off by using a complex model as a definition. I've always seen a model as an imperfect approximation, where there's always room for improvement. A good model of humans should be able to look at a candidate and decide whether it's human with some probability p of a false positive and q of a false negative. A model that uses statistical data can potentially improve by gathering more information.

A definition, on the other hand, is deterministic. Taking a model as a definition is basically declaring that your model is correct and cuts off any avenue for improvement. Definitions are usually used for simpler concepts that can be readily articulated. It's possible to brute-force a definition by making a list of all objects in the universe that satisfy it. So, I could conceivably make a list of shirts and then define "shirt" to mean any object in the list. However, I don't think that's quite what you had in mind.

A model has the advantage of staying the same across different environments (virtual vs real, or different laws of physics).

I'm thinking "we are failing to define what human is, yet the AI is likely to have an excellent model of what being human entails; that model is likely a better definition than what we've defined".

Humans can be recognized inductively: Pick a time such as the present when it is not common to manipulate genomes. Define a human to be everyone genetically human at that time, plus all descendants who resulted from the naturally occurring process, along with some constraints on the life from conception to the present to rule out various kinds of manipulation.
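
As a toy sketch, that inductive definition might look like the following (the start year, the Person fields, and a single "naturally conceived" flag standing in for the manipulation constraints are all placeholder assumptions):

```python
# Base case: genetically human at the chosen start time. Inductive case:
# a naturally conceived descendant of two humans.

from dataclasses import dataclass, field
from typing import List

START_TIME = 2015  # a time before genome manipulation is common

@dataclass
class Person:
    born: int
    genetically_human_at_start: bool = False
    parents: List["Person"] = field(default_factory=list)
    naturally_conceived: bool = True  # stands in for the anti-manipulation constraints

def is_human(p: Person) -> bool:
    if p.born <= START_TIME:
        return p.genetically_human_at_start
    return (p.naturally_conceived
            and len(p.parents) == 2
            and all(is_human(parent) for parent in p.parents))

alice = Person(born=1980, genetically_human_at_start=True)
bob = Person(born=1985, genetically_human_at_start=True)
print(is_human(Person(born=2040, parents=[alice, bob])))  # True under this toy definition
```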

Or maybe just say that the humans are the genetic humans at the start time, and that's all. Caring for the initial set of humans should lead to caring for their descendants because humans care about their descendants, so if you're doing FAI you're done. If you want to recognize humans for some other purpose this may not be sufficient.

Predicting human behavior seems harder than recognizing humans, so it seems to me that you're presupposing the solution of a hard problem in order to solve an easy problem.

An entirely separate problem is that if you train to discover what humans would do in one situation and then stop training and then use the trained inference scheme in new situations, you're open to the objection that the new situations might be outside the domain covered by the original training.

Define a human to be everyone genetically human at that time, plus all descendants who resulted from the naturally occurring process, along with some constraints on the life from conception to the present to rule out various kinds of manipulation.

That seems very hard! For instance, does that not qualify molar pregnancies as people, twins as one person and chimeras as two? And it's hard to preclude manipulations that future humans (or AIs) may be capable of.

Or maybe just say that the humans are the genetic humans at the start time, and that's all.

Easier, but still a challenge. You need to identify a person with the "same" person at a later date - but not, for instance, with lost skin cells or amputated limbs. What of clones, if we're using genetics?

It seems to me that identifying people imperfectly (a "crude measure", essentially http://lesswrong.com/lw/ly9/crude_measures/ ) is easier and safer than modelling people imperfectly. But if we're doing it thoroughly, then the model seems better, and less vulnerable to unexpected edge cases.

But the essence of the idea is to exploit something that a superintelligent AI will be doing anyway. We could similarly try and use any "human identification" algorithm the AI would be using anyway.