Models as definitions
A putative new idea for AI control; index here.
The insight this post comes from is a simple one: defining concepts such as “human” and “happy” is hard. A superintelligent AI will probably create good definitions of these while attempting to achieve its goals: a good definition of “human” because it needs to control them, and of “happy” because it needs to converse convincingly with us. The annoying part is that these definitions will exist, but we won’t have access to them.
Modelling and defining
Imagine a game of football (or, as you Americans should call it, football). And now imagine a computer game version of it. How would you say that the computer game version (which is nothing more than an algorithm) is also a game of football?
Well, you can start listing features that they have in common. They both involve two “teams” fielding eleven “players” each, that “kick” a “ball” that obeys certain equations, aiming to stay within the “field”, which has different “zones” with different properties, etc...
As you list more and more properties, you refine your model of football. There are some properties that distinguish real from simulated football (fine details about the human body, for instance), but most of the properties that people care about are the same in both games.
My idea is that once you have a sufficiently complex model of football that applies to both the real game and a (good) simulated version, you can use that as the definition of football. And compare it with other putative examples of football: maybe in some places people play on the street rather than on fields, or maybe there are more players, or maybe some other games simulate different aspects to different degrees. You could try and analyse this with information theoretic considerations (ie given two model of two different examples, how much information is needed to turn one into the other).
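The information-theoretic comparison could be crudely sketched by representing each model as a set of features and counting the edits needed to turn one model into the other. This is only a toy sketch: all the feature names below are hypothetical illustrations, not a real ontology, and a serious version would weight features by description length.

```python
# Sketch: compare two feature-based models of "football" by the number
# of feature edits needed to turn one into the other (all feature
# names are hypothetical illustrations, not a real ontology).

def edit_distance(model_a: set, model_b: set) -> int:
    """Features to delete from A plus features to add to reach B."""
    return len(model_a - model_b) + len(model_b - model_a)

real_football = {"two teams", "eleven players", "ball", "field",
                 "offside rule", "human bodies"}
video_game    = {"two teams", "eleven players", "ball", "field",
                 "offside rule", "rendered sprites"}
street_game   = {"two teams", "ball", "improvised goals"}

# The simulation differs from the real game in only a few features...
assert edit_distance(real_football, video_game) == 2
# ...while the street game is further away from both.
assert edit_distance(real_football, street_game) > edit_distance(real_football, video_game)
```

The point of the sketch is just that "how much information turns one model into the other" gives a graded notion of how football-like a putative example is, rather than a binary yes/no.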
Now, this resembles the “suggestively labelled lisp tokens” approach to AI, or the Cyc approach of just listing lots of syntax stuff and their relationships. Certainly you can’t keep an AI safe by using such a model of football: if you try and contain the AI by saying “make sure that there is a ‘Football World Cup’ played every four years”, the AI will still optimise the universe and then play out something that technically fits the model every four years, without any humans around.
However, it seems to me that ‘technically fitting the model of football’ is essentially playing football. The model might include such things as a certain number of fouls expected; an uncertainty about the result; competitive elements among the players; etc... It seems that something that fits a good model of football would be something that we would recognise as football (possibly needing some translation software to interpret what was going on). Unlike the traditional approach which involves humans listing stuff they think is important and giving them suggestive names, this involves the AI establishing what is important to predict all the features of the game.
We might even combine such a model with the Turing test, by motivating the AI to produce a good enough model that it could a) have conversations with many aficionados about all features of the game, b) train a team to expect to win the world cup, and c) use it to program a successful football computer game. Any model of football that allowed the AI to do this – or, better still, a football-model module that, when plugged into another, ignorant AI, allowed that AI to do this – would be an excellent definition of the game.
It’s also one that could cross ontological crises, as you move from reality, to simulation, to possibly something else entirely, with a new physics: the essential features will still be there, as they are the essential features of the model. For instance, we can define football in Newtonian physics, but still expect that this would result in something recognisably ‘football’ in our world of relativity.
Notice that this approach deals with edge cases mainly by forbidding them. In our world, we might struggle over how to respond to a football player with weird artificial limbs; however, since this was never a feature in the model, the AI will simply classify that as “not football” (or “similar to, but not exactly, football”), since the model’s performance starts to degrade in this novel situation. This is what helps it cross ontological crises: in a relativistic football game based on a Newtonian model, the ball would be forbidden from moving at speeds where the differences in the physics become noticeable, which is perfectly compatible with the game as it’s currently played.
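The "degraded prediction means forbidden" rule can be sketched with a toy classifier: check how well the model predicts a new instance, and label it "not exactly football" when the prediction breaks down. The one-variable predictor, the numbers, and the tolerance below are all illustrative assumptions.

```python
# Sketch: classify a situation as "football" only where the model's
# predictions hold up; when they degrade, it's "not exactly football".
# The toy Newtonian predictor, numbers, and tolerance are illustrative.

def newtonian_ball_speed(kick_impulse: float, mass: float) -> float:
    # Toy Newtonian prediction: speed = impulse / mass (units arbitrary).
    return kick_impulse / mass

def classify(observed_speed, kick_impulse, mass, tolerance=0.05):
    predicted = newtonian_ball_speed(kick_impulse, mass)
    relative_error = abs(observed_speed - predicted) / predicted
    return "football" if relative_error <= tolerance else "not exactly football"

# An ordinary kick: the Newtonian model predicts well, so this counts.
assert classify(observed_speed=25.0, kick_impulse=11.25, mass=0.45) == "football"
# A ball at relativistic speed: the model's predictions break down,
# so the situation is simply classified out of the game.
assert classify(observed_speed=2.9e8, kick_impulse=11.25, mass=0.45) == "not exactly football"
```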
Being human
Now we take the next step, and have the AI create a model of humans. All our thought processes, our emotions, our foibles, our reactions, our weaknesses, our expectations, the features of our social interactions, the statistical distribution of personality traits in our population, how we see ourselves and change ourselves. As a side effect, this model of humanity should include almost every human definition of human, simply because this is something that might come up in a human conversation that the model should be able to predict.
Then simply use this model as the definition of human for an AI’s motivation.
What could possibly go wrong?
I would recommend first having an AI motivated to define “human” in the best possible way, most useful for making accurate predictions, keeping the definition in a separate module. Then the AI is turned off safely and the module is plugged into another AI and used as part of its definition of human in its motivation. We may also use human guidance at several points in the process (either in making, testing, or using the module), especially on unusual edge cases. We might want to have humans correcting certain assumptions the AI makes in the model, up until the AI can use the model to predict what corrections humans would suggest. But that’s not the focus of this post.
There are several obvious ways this approach could fail, and several ways of making it safer. The main problem is if the predictive model fails to define human in a way that preserves value. This could happen if the model is too general (some simple statistical rules) or too specific (a detailed list of all currently existing humans, atom positions specified).
This could be combated by making the first AI generate lots of different models, with many different requirements of specificity, complexity, and predictive accuracy. We might require that some models make excellent local predictions (what is the human about to say?), and others excellent global predictions (what is that human going to decide to do with their life?).
Then everything defined as “human” in any of the models counts as human. This results in some wasted effort on things that are not human, but this is simply wasted resources, rather than a pathological outcome (the exception being if some of the models define humans in an actively pernicious way – negative value rather than zero – similarly to the false-friendly AIs’ preferences in this post).
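A minimal sketch of this "union of models" definition, with toy predicates standing in for full predictive models (every criterion and threshold below is an illustrative assumption, not a proposed test of humanity):

```python
# Sketch: anything ANY of the candidate models counts as human is
# treated as human. Each predicate stands in for a full predictive
# model; all criteria here are illustrative assumptions.

def local_model(x):        # stand-in for excellent local predictions
    return x.get("holds_conversation", False)

def global_model(x):       # stand-in for excellent global predictions
    return x.get("plans_own_life", False)

def statistical_model(x):  # stand-in for simple statistical rules
    return x.get("humanlike_traits", 0.0) > 0.9

MODELS = [local_model, global_model, statistical_model]

def counts_as_human(x) -> bool:
    # Union of the models: some wasted effort on false positives,
    # but fewer genuinely human things missed.
    return any(model(x) for model in MODELS)

person = {"holds_conversation": True, "plans_own_life": True,
          "humanlike_traits": 0.99}
statue = {"humanlike_traits": 0.3}

assert counts_as_human(person)
assert not counts_as_human(statue)
```

The design choice being illustrated: the union errs toward over-inclusion, which costs resources rather than value, so long as no single model defines humans perniciously.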
The other problem is a potentially extreme conservatism. Modelling humans involves modelling all the humans in the world today, which is a very narrow space in the range of all potential humans. To prevent the AI lobotomising everyone to fit a simple model (after all, there do exist some lobotomised humans today), we would want the AI to maintain the range of cultures and mind-types that exist today, making things even more unchanging.
To combat that, we might try and identify certain specific features of society that the AI is allowed to change. Political beliefs, certain aspects of culture, geographical location (including being on a planet), death rates, etc... are all things we could plausibly identify (via sub-sub-modules, possibly) as things that are allowed to change. It might be safer to allow them to change within a particular range, rather than change altogether (removing all sadness might be a good thing, but there are many more ways this could go wrong than if we, eg, just reduced the probability of sadness).
Another option is to keep these modelled humans little changing, but allow them to define allowable changes themselves (“yes, that’s a transhuman, consider it also a moral agent.”). The risk there is that the modelled humans get hacked or seduced, and that the AI fools our limited brains with a “transhuman” that is one in appearance only.
We also have to beware of sacrificing seldom-used values. For instance, one could argue that current social and technological constraints mean that no one today has anything approaching true freedom. We wouldn’t want the AI to let us improve technology and social structures, but never grant us more freedom than we have today, because it’s “not in the model”. Again, this is something we could look out for: if the AI has a separate model of “freedom”, we could assess it and permit change in certain directions.
Creating a satisficer
A putative new idea for AI control; index here.
This is a potential design for a safe-ish satisficer, based on the ideas presented here. The two key ideas are that a satisficer S(u) with utility u:
- Would not effectively aid M(u), a u-maximiser.
- Would not effectively resist M(-u), a u-minimizer.
So satisficers make poor allies and weak enemies.
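A toy illustration of the "poor ally, weak enemy" property: an agent that pushes u up only to a threshold contributes little to a u-maximiser's efforts, and when a u-minimiser drives u down, it only ever restores u to "good enough", never escalating. The threshold, step size, and one-dimensional world are illustrative assumptions.

```python
# Sketch: a threshold satisficer in a one-dimensional world where the
# state IS the value of u. Threshold and step size are illustrative.

def satisficer_step(u: float, threshold: float = 0.9,
                    step: float = 0.1) -> float:
    # Push u upward only while below the threshold; otherwise idle.
    return min(u + step, threshold) if u < threshold else u

def run(u: float, steps: int = 100) -> float:
    for _ in range(steps):
        u = satisficer_step(u)
    return u

# As an ally of a u-maximiser: it stalls at "good enough" (0.9),
# well short of the maximum u = 1, so it adds little to the push.
assert run(0.0) == 0.9
# Against a u-minimiser that has driven u to -5: it only ever restores
# u to the threshold, never beyond, regardless of the opposition.
assert run(-5.0) == 0.9
```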
The design, the designer and the verifier
Let M(u-v) be an agent that maximises u and minimises v. Let's assume that M(u-v) is currently unaware of what u and v are (though it knows that the upper bound of u is easy to approach). It can interact with the outside environment in one way only: by creating a satisficer S(). S() will then be fed u to become S(u).
However, the design for S(u) will first be sent to M(εu+v), for some unknown small ε. Note that M(εu+v) is essentially a v-maximiser, so it disagrees with M(u-v) on everything except u. M(εu+v) is a (hypothetical) agent that will be (counterfactually) released at some unspecified later date. Currently, it will be fed the whole reasoning history of M(u-v), and will make a single decision only: whether S(u) is to be built or not.
So it seems that S(u) must:
- Set u close to its maximal value (as this is "easy to approach").
- Not increase v to any significant extent (or else M(u-v) would not design it).
- Not decrease v to any significant extent (or else M(εu+v) would veto it).
- Since v is unknown to M(u-v) and resources are finite, this should serve as a general reduced impact requirement for S(u) (we may have to use something like a soft minimum across all v, rather than an expectation across all v, to avoid certain edge cases).
- Since u is unknown to M(u-v), S() would serve as a general satisficing agent for any utility function whose upper bound is easy to approach (remember that we can take an arbitrary utility function and arbitrarily bound it at some number).
For the moment, this does seem like it would produce a successful satisficer...
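The designer/verifier interplay can be sketched as a filter over candidate designs. This is a toy sketch: each candidate is summarised as its effect (Δu, Δv), and the candidate names, numbers, and ε are all illustrative assumptions rather than anything a real agent would compute.

```python
# Sketch of the designer/verifier filter on candidate S(u) designs.
# Each candidate's effect is summarised as (delta_u, delta_v); the
# names, numbers, and epsilon are illustrative assumptions.

EPSILON = 0.01

def chosen_design(candidates: dict) -> str:
    # M(eps*u + v) vetoes any design that lowers v; M(u - v) then
    # picks its favourite (highest u - v) among the unvetoed designs.
    viable = {name: (du, dv) for name, (du, dv) in candidates.items()
              if EPSILON * du + dv >= 0}
    return max(viable, key=lambda name: viable[name][0] - viable[name][1])

candidates = {
    "maximiser-like": (1.0, -0.8),  # grabs resources, wrecks v: vetoed
    "v-boosting":     (1.0,  0.5),  # raises v: M(u - v) dislikes this
    "low-impact":     (0.9,  0.0),  # near-maximal u, v untouched
}

# Only a design that approaches max u while leaving v alone gets built.
assert chosen_design(candidates) == "low-impact"
```

The sketch shows the squeeze from both sides: the veto rules out v-decreasing designs, the designer's own preferences rule out v-increasing ones, and (since max u is easy to approach) what survives is near-maximal u with v untouched.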
Defining a limited satisficer
A putative new idea for AI control; index here.
EDIT: The definition of satisficer I'm using here is the informal one of "it tries to achieve a goal, without making huge changes on the universe" rather than "it's an agent that has utility u and threshold t". If you prefer the standard notation, think of this as a satisficer where t is not fixed, but dependent on some facts in the world (such as the ease of increasing u). I'm trying to automate the process of designing and running a satisficer: people generally choose t given facts about the world (how easy it is to achieve, for instance), and I want the whole process to be of low impact.
I've argued that the definition of a satisficer is underdefined, because there are many pathological behaviours all compatible with satisficer designs. This contradicts the intuitive picture that many people have of a satisficer: an agent that makes the minimum of effort to reach its goal, and doesn't mess up the outside world more than it has to. And if it can't accomplish its goals without messing up the outside world, it would be content not to.
In the spirit of "if you want something, you have to define it, then code it, rather than assuming you can get it for free through some other approach", can we spell out what features we would want from such a satisficer? Preferably in a simpler format than our intuitions.
It seems to me that if you had a proper u-satisficer S(u), then for many (real or hypothetical) v-maximiser M(v) out there, M(v) would find that:
- Changing S(u) to S(v) is of low value.
- Similarly, utility function trading with S(u) is of low value.
- The existence or non-existence of S(u) is of low information content about the future.
- The existence or non-existence of S(u) has little impact on the expected value of v.
Further, S(u):
- Would not effectively aid M(u), a u-maximiser.
- Would not effectively resist M(-u), a u-minimizer.
- Would not have large impacts (if this can be measured) for low utility gains.
A subsequent post will present an example of a satisficer using some of these ideas.
A few other much less-developed thoughts about satisficers:
- Maybe require that it learns what variables humans care about, and doesn’t set them to extreme values – try and keep them in the same range. Do the same for variables humans may care about or that resemble values they care about.
- Model the general procedure of detecting unaccounted-for variables set to extreme values.
- We could check whether it would kill all humans cheaply if it could (or replace certain humans cheaply). ie give it hypothetical destructive superpowers with no costs to using them, and see whether it would use them.
- Have the AI establish a measure/model of optimisation power (without reference to any other goal), then put itself low on that.
- Trade between satisficers might be sub-Pareto.
- When talking about different possible v's in the first four points above, it might be better to use something other than an expectation over different v's, as that could result in edge cases dominating - maybe a soft minimum of value across different v's instead.
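The soft-minimum suggestion in the last point can be sketched numerically: a single pathological v can dominate a plain average, while a soft minimum (a smooth log-sum-exp approximation to the hard minimum) tracks the worst case. The temperature and the impact values below are illustrative assumptions.

```python
# Sketch: expectation vs soft minimum over candidate utilities v.
# One edge-case v dominates the average; the soft minimum stays near
# the worst case. Temperature and impact values are illustrative.

import math

def soft_min(values, temperature=0.05):
    # Smooth approximation to min(): -T * log(sum(exp(-x / T))).
    return -temperature * math.log(
        sum(math.exp(-x / temperature) for x in values))

# S(u)'s impact on several candidate v's: mostly near zero, plus one
# edge case with a large impact.
impacts = [0.0, 0.01, -0.02, 5.0]

expectation = sum(impacts) / len(impacts)  # dragged up by the edge case
smoothed = soft_min(impacts)               # stays near the worst case

assert expectation > 1.0                    # one edge case dominates
assert abs(smoothed - min(impacts)) < 0.1   # soft min ~ hard min here
```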
Countess and Baron attempt to define blackmail, fail
For a more concise version of this argument, see here.
We meet our heroes, the Countess of Rectitude and Baron Chastity, as they continue to investigate the mysteries of blackmail by sleeping together and betraying each other.
The Baron had a pile of steamy letters between him and the Countess: it would be embarrassing to both of them if these letters got out. Yet the Baron confided the letters to a trusted Acolyte, with strict instructions. The Acolyte was to publish these letters, unless the Countess agreed to give the Baron her priceless Ping Vase.
This seems a perfect example of blackmail:
- The Baron is taking a course of action that is intrinsically negative for him. This behaviour only makes sense if it forces the Countess to take a specific action which benefits him. The Countess would very much like it if the Baron couldn't do such things.
As it turns out, a servant broke the Ping Vase while chasing the Countess's griffon. The servant was swiftly executed, but the Acolyte had to publish the letters as instructed, to great embarrassment all around (sometimes precommitments aren't what they're cracked up to be). After six days of exile in the Countess's doghouse (a luxurious, twenty-room affair) and eleven days of make-up sex, the Baron was back to planning against his lover.
How I applied useful concepts from the personal growth seminar "est" and MBTI
I have encountered personally in conversations, and also observed in the media over the past couple of decades, a great deal of skepticism, scorn, and ridicule, if not merely indifference or dismissal, from many people in reaction to the est training, which I completed in 1983, and the Myers-Briggs Type Indicator tool, which I first took in 1993 or 1994. I would like to share some concrete examples from my own life where information and perspective that I gained from these two sources have improved my life, both in my own way of conceptualizing and approaching things, and also in my relationships with others. I do this with the hope and intention of showing that est and MBTI have positive value, and encouraging people to explore these and other tools for personal growth.
One important insight that I gained from the est training is an understanding and the experience that I am not my opinions, and my opinions are not me. Opinions are neutral things, and they may be something I hold, or agree with, but I can separate my self from them, and I can discuss them, and I can change or discard them, but I am still the same "me". I am not more or less "myself" in relation to what I think or believe. Before I did the est training, whenever someone would question an opinion I held, I felt personally attacked. I identified my self with my opinion or belief. My emotional response to attack, like for many other people, is to defend and/or to retreat, so when I perceived of my "self" being "attacked", I gave in to the standard fight or flight response, and therefore I did not get the opportunity to explore the opinion in question to see if the person who questioned me had some important new information or a perspective that I had not previously considered. It is not that I always remember this or that it is my first response, but once I notice myself responding in the old way, I can then take that step back and remember the separation between self and opinion. That choice is now available to me, where it wasn't before. When I find myself in conversations with another person or people who disagree with me, my response now is to draw them out, to ask them about what they believe and why they believe it. I regard myself as if I were a reporter on a fact-finding mission. I step back and I do not feel attacked. I learn sometimes from this, and other times I do not, but I no longer feel attacked, and I find that I can more easily become friends with people even if we have disagreements. That was not the case for me prior to doing est.
Another valuable tool that I got from est and still use in my life is the ability to accept responsibility without attaching blame to it, even if someone is trying to heap blame upon me. This is similar to what I said above about basically not identifying my self with what I think. I do not have to feel or think of myself as a "bad person" because I made a mistake. I have come to the belief that guilt is an emotion that I need not wallow in. If I feel guilt about doing or not doing something, saying or not saying something, I take that feeling of guilt as a sign that I either need to take some action to rectify the situation, and/or I need to apologize to someone about it, and/or I need to learn from the situation so that hopefully I will not repeat it, and then forgive myself, and move on. Hanging on to guilt is something I see many people doing, and it not only holds them up and blocks them off from taking action, they often pull that feeling in and create a scenario or self-definition that involves beating themselves up about it, or they wallow around in feeling guilty in a way that serves as a self-indulgent excuse for not improving things. "I'm so awful, I'm such a screw-up, I can't do anything right." That kind of negative self-esteem can affect a person for their entire life if they allow it to. There are many ways to come to these realizations, and I make no claim that est is some kind of "cure-all". One of the characters on the tv show "SOAP" called est "The McDonald's of Psychiatry". That's amusing, but it denigrates a very useful and powerful experience. I believe in an eclectic approach to life. I look at many things, explore many ideas and experiences, and I take what works and leave the rest. est is only one of many helpful experiences I have had in my 49 years.
I took the Myers-Briggs Type Indicator at a science fiction convention in the early years of my marriage, when I was living in Alexandria, VA, in 1993 and 1994. It was given as part of a panel, and I also took it again when I read "Do What You Are", which is a book about finding employment/a profession based on your MBTI personality type. The basics, if you have not encountered MBTI before, are: there are 4 "continuums" in how people tend to interact with the world. Most people use both sides of each continuum, but are most comfortable on one side. The traits are Extrovert/Introvert, Sensing/Intuiting, Thinking/Feeling, and Judging/Perceiving. (The use of these words in the MBTI context is not exactly the same as their dictionary definitions). I am a strong ENFP. My husband was an ISTP. Understanding the differences between how we approached the world was very helpful to me in learning why we were so different about socializing with other people, and about our communication style with each other. As an "I", John (as they put it in the book) "got his batteries charged" by mostly being alone. I, as an "E", got mine charged by being with other people. We went to conventions and parties, but he often wanted to leave well before I felt ready to go. Once we had two cars, we would each take our own to events. Even though I felt it wasted gas, it gave him the opportunity to "flee" once he had had enough of being with others, while I could then come home at my leisure, and neither of us had to give up on what made us happier and more comfortable. It also explained why he would not always respond immediately to a question. "I" people tend to figure out in their own mind first what they want to say before they say anything aloud. "E" people often start talking right away, and as they speak, what they think becomes clearer to them. This is also a very useful data point for teachers.
If they know about it, they can realize that the "I" kids need more time to come up with their answers, while the "E" kids put their hands in the air more immediately. They can then allow the "I" kids the time they need to respond to questions without thinking they are not good students, or are not as intelligent or knowledgeable as the "E" kids are.
My boyfriend is an ENTJ. The source of some of the friction in our relationship became clear to me after I asked him to find out his Myers-Briggs type, which he had never done before. Gerry often asks me to give him a list of what I want to do in the course of my day, and how much time things will take. These are reasonable requests. However, the rub comes from the fact that as a "J", he is uncomfortable not knowing the answer to these things. I, as a "P", am uncomfortable stating these things in advance, in nailing things down. I prefer to leave things open-ended. He regarded what I said as more concrete, whereas I regarded it more as a guideline, but not a definite plan or promise. In addition, I have always had a hard time judging how long things will take, and as a person with ADD, I also get distracted easily, so it was making me upset when he would come home and ask me what I'd gotten done, and then he would get upset when I hadn't done what I had said I wanted to, or if things took longer than I said they would. Understanding the differences in our types has helped me to understand more about why this has been an area of friction. That leaves room for us to discuss it without feeling the need to blame each other for our preferred method of dealing with things. I feel clearer about stating goals for the day, but not necessarily promising to do specific things, and working on figuring out how to allocate enough time for things. He understands that just because I tell him what I would like to do, it is not necessarily what I will end up doing. It's still a work in progress.
I want to be clear that I am not talking about using the types as excuses to get out of doing things, or for taking what other people feel is "too long" to get things done. It's merely another "tool in my tool box" that helps me to process how I and my loved ones function, and to figure out how to improve.
I am curious to know how other people feel about their experiences, if they have done a personal growth seminar such as est and/or taken the MBTI, if they feel that they have also taken tools from those experiences that have had an ongoing positive impact on their lives and relationships. I look forward to hearing what people have to say in response to this article.