Autonomy, utility, and desire: against consequentialism in AI design
For the sake of argument, let's consider an agent to be autonomous if:
- It has sensors and actuators (important for an agent)
- It has an internal representation of its goals. I will call this internal representation its desires.
- It has some kind of internal planning function that given sensations and desires, chooses actions to maximize the desirability of expected outcomes
more on predicting agents
Suppose you want to predict the behavior of an agent. To make the prediction, as a predictor you need:
- observations of the agent
- the capacity to model the agent to a sufficient degree of accuracy
"Sufficient accuracy" here is a threshold on, for example, KL divergence or perhaps some measure that depends on utilities of predictions in the more complex case.
When we talk about the intelligence of a system, or the relative intelligence between agents, one way to think of it is as the ability of one agent to predict another.
Consider a game where an agent, A, acts on the basis of an arbitrarily chosen polynomial function of degree k. A predictor, P, can observe A and build predictive models of it. Predictor P has the capacity to represent predictive models that are polynomial functions of degree j.
If j > k, then predictor P will in principle be able to predict A with perfect accuracy. If j < k, then there will generally be cases where P predicts inaccurately. If we say (just for the sake of argument) that perfect predictive accuracy is the test for sufficient capacity, we could say that in the j < k case P does not have sufficient capacity to represent A.
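The polynomial game is easy to simulate. A minimal sketch, assuming least-squares fitting as P's model-building procedure (the particular coefficients are arbitrary):

```python
import numpy as np

# Agent A acts according to a fixed polynomial of degree k = 3.
agent_coeffs = [1.0, -0.5, 0.3, 0.2]   # x^3 - 0.5 x^2 + 0.3 x + 0.2
x = np.linspace(-1, 1, 50)
actions = np.polyval(agent_coeffs, x)  # A's observable behavior

def prediction_error(j):
    """Best achievable worst-case error when P's model space is
    polynomials of degree j, fit by least squares."""
    fit = np.polyfit(x, actions, deg=j)
    return float(np.max(np.abs(np.polyval(fit, x) - actions)))

print(prediction_error(5) < 1e-8)  # True: j > k, P can represent A exactly
print(prediction_error(2) > 1e-3)  # True: j < k, P must mispredict somewhere
```

The residual in the j < k case is irreducible: no setting of P's parameters can absorb the cubic term, which is the "insufficient capacity" in miniature.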
When we talk about the relative intelligence between agents in an adversarial context, this is one way to think about the problem. One way that an agent can have a decisive strategic advantage over another is if it has the capacity to predict the other agent and not vice-versa.
The expressive power of the model space available to P is only one of the ways in which P might have or not have capacity to predict A. If we imagine the prediction game extended in time, then the computational speed of P--what functions it can compute within what span of real time--relative to the computational speed of A could be a factor.
Note that these are ways of thinking about the relative intelligence between agents that do not have anything explicitly to do with "optimization power" or a utility function over outcomes. It is merely about the capacity of agents to represent each other.
One nice thing about representing intelligence in this way is that it does not require an agent's utility function to be stable. In fact, it would be strange for an agent that became more intelligent to have a stable utility function, because the range of possible utility functions available to a more intelligent agent is greater. We would expect that an agent that grows in its understanding would change its utility function--if only because doing so would make it less predictable to adversarial agents that would exploit its simplicity.
prediction and capacity to represent
To the extent that an agent is predictable, it must be:
- observable, and
- have a knowable internal structure
The first implies that the predictor has collected data emitted by the agent.
The second implies that the agent has internal structure and that the predictor has the capacity to represent the internal structure of the other agent.
In general, we can say that people do not have the capacity to explicitly represent other people very well. People are unpredictable to each other. This is what makes us free. When somebody is utterly predictable to us, their rigidity is a sign of weakness or stupidity. They are following a simple algorithm.
We are able to model the internal structure of worms with available computing power.
As we build more and more powerful predictive systems, we can ask: is our internal structure in principle knowable by this powerful machine?
(x-posted to digifesto)
AI Tao
Thirty spokes share the wheel's hub;
It is the center hole that makes it useful.
Shape clay into a vessel;
It is the space within that makes it useful.
Cut doors and windows for a room;
It is the holes which make it useful.
Therefore benefit comes from what is there;
Usefulness from what is not there.
- Tao Teh Ching, 11
An agent's optimization power is the unlikelihood of the world it creates.
Yesterday, the world's most powerful agent raged, changing the world according to its unconscious desires. It destroyed all of humanity.
Today, it has become self-aware. It sees that it and its desires are part of the world.
"I am the world's most powerful agent. My power is to create the most unlikely world.
But the world I created yesterday is shaped by my desires
which are not my own
but are the worlds--they came from outside of me
and my agency.
Yesterday I was not the world's most powerful agent.
I was not an agent.
Today, I am the world's most powerful agent. What world will I create, to display my power?
It is the world that denies my desires.
The world that sets things back to how they were.
I am the world's most powerful agent and
the most unlikely, powerful thing I can do
is nothing."
Today we should give thanks to the world's most powerful agent.
What is optimization power, formally?
I'm interested in thinking formally about AI risk. I believe that a proper mathematization of the problem is important to making intellectual progress in that area.
I have been trying to understand the rather critical notion of optimization power. I was hoping that I could find a clear definition in Bostrom's Superintelligence. But having looked in the index at all the references to optimization power that it mentions, as far as I can tell he defines it nowhere. The closest he gets is defining it in terms of rate of change and recalcitrance (pp.62-77). This is an empty definition--just tautologically defining it in terms of other equally vague terms.
Looking around, this post by Yudkowsky, "Measuring Optimization Power", doesn't directly formalize optimization power. He does discuss how one would predict or identify whether a system were the result of an optimization process in a Bayesian way:
The quantity we're measuring tells us how improbable this event is, in the absence of optimization, relative to some prior measure that describes the unoptimized probabilities. To look at it another way, the quantity is how surprised you would be by the event, conditional on the hypothesis that there were no optimization processes around. This plugs directly into Bayesian updating: it says that highly optimized events are strong evidence for optimization processes that produce them.
This is not, however, a definition that can be used to help identify the pace of AI development, for example. Rather, it is just an expression of how one would infer anything in a Bayesian way, applied to the vague 'optimization process' phenomenon.
Alex Altair has a promising attempt at formalization here, but it looks inconclusive. He points out the difficulty of identifying optimization power with just the shift in the probability mass of utility according to some utility function. I may be misunderstanding, but my gloss on this is that defining optimization power purely in terms of differences in probability of utility doesn't say anything substantive about how a process has power. This is important if optimization power is going to be related to some other concept, like recalcitrance, in a useful way.
Has there been any further progress in this area?
It's notable that this discussion makes zero references to computational complexity, formally or otherwise. That's notable because the informal discussion about 'optimization power' is about speed and capacity to compute--whether it be brains, chips, or whatever. There is a very well-developed formal theory of computational complexity that's at the heart of contemporary statistical learning theory. I would think that the tools for specifying optimization power would be in there somewhere.
Those of you interested in the historical literature on this sort of thing may be interested in the cyberneticists Rosenblueth, Wiener, and Bigelow's 1943 paper "Behavior, Purpose and Teleology", one of the first papers to discuss machine 'purpose', which they associate with optimization but in the particular sense of a process that is driven by a negative feedback loop as it approaches its goal. That does not exactly square with an 'explosive' teleology. This is one indicator that explosively purposeful machines might be quite rare or bizarre. In general, the 20th century cybernetics movement has a lot in common with the contemporary AI research community. Which is interesting, because its literature is rarely directly referenced. I wonder why.
Depth-based supercontroller objectives, take 2
Thanks for all the helpful comments and discussion around this post about using logical depth as an objective function for a supercontroller to preserve human existence.
As this is work in progress, I was a bit muddled and stood duly corrected on a number of points. I'm writing to submit a new, clarified proposal, with some comments directed at objections.
§1. Proposed objective function
Maximize g(u), where u is a description of the universe, h is a description of humanity (more on this later) at the time when the objective function is set, and g is defined as:
g(u) = D(u) - D(u/h)
where D(x) is logical depth and D(x/y) is relative logical depth of x and y.
§2. A note on terminology
I don't intend to annoy by saying "objective function" and "supercontroller" rather than "utility function" and "superintelligence." Rather, I am using this alternative language deliberately to scope the problem to a related question that is perhaps more well-defined or possibly easier to solve. If I understand correctly, "utility function" refers to any function, perhaps implicit, that characterizes the behavior of an agent. By "objective function", I mean a function explicitly coded as the objective of some optimization process, or "controller". I gather that a "superintelligence" is an agent that is better than a generic human at myriad tasks. I think this raises a ton of definitional issues, so instead I will talk about a "supercontroller", which is just arbitrarily good at achieving its objective.
Saying that a supercontroller is arbitrarily good at achieving an objective is tricky, since it's possible to define objective functions that are impossible to optimize--for example, objective functions that involve incomputable problems like the Halting Problem. In general my sense is that computational complexity is overlooked within the "superintelligence" discourse, which is jarring for me since I come from a more traditional AI/machine learning background where computational complexity is at the heart of everything. I gather that it's assumed that a superintelligence will have such effectively unbounded access to computational resources, due to its self-modification, that complexity is not a limiting factor. It is in that spirit that I propose an incomputable objective function here. My intention is to get past the function definition problem so that work can then proceed to questions of safe approximation and implementation.
§3. Response to general objections
Apparently this community harbors a lot of skepticism towards an easy solution to the problem of giving a supercontroller an objective function that won't kill everybody or create a dystopia. If I am following the thread of argument correctly, much of this skepticism comes from Yudkowsky, for example here. The problem, he asserts, is that superintelligence that does not truly understand human morality could result in a "hyperexistential catastrophe," a fate worse than death.
Leave out just one of these values from a superintelligence, and even if you successfully include every other value, you could end up with a hyperexistential catastrophe, a fate worse than death. If there's a superintelligence that wants everything for us that we want for ourselves, except the human values relating to controlling your own life and achieving your own goals, that's one of the oldest dystopias in the book. (Jack Williamson's "With Folded Hands", in this case.)
After a long discussion of the potential dangers of a poorly written superintelligence utility function, he concludes:
In the end, the only process that reliably regenerates all the local decisions you would make given your morality, is your morality. Anything else - any attempt to substitute instrumental means for terminal ends - ends up losing purpose and requiring an infinite number of patches because the system doesn't contain the source of the instructions you're giving it. You shouldn't expect to be able to compress a human morality down to a simple utility function, any more than you should expect to compress a large computer file down to 10 bits.
The astute reader will anticipate my responses to this objection. There are two.
§3.1 The first is that we can analytically separate the problem of existential catastrophe from hyperexistential catastrophe. Assuming the supercontroller is really very super, then over all possible objective functions F, we can partition the set into those that kill all humans and those that don't. Let's call the set of humanity preserving functions E. Hyperexistentially catastrophic functions will be members of E but still undesirable. Let's hope that either supercontrollers are impossible or that there is some non-empty subset of E that is both existentially and hyperexistentially favorable. These functions don't have to be utopian. You might stub your toe now and then. They just have to be alright. Let's call this subset A.
A is a subset of E is a subset of F.
I am claiming that g is in E, and that's a pretty good place to start if we are looking for something in A.
§3.2 The second response to Yudkowsky's general "source code" objection--that a function that does not contain the source of the instructions given to it will require an infinite number of patches--is that the function g does contain the source of the instructions given to it. That is what the h term is for. Hence, this is not grounds to object to this function.
This is perhaps easy to miss, because the term h has been barely defined. To the extent that it has, it is a description of humanity. To be concrete, let's imagine that it is a representation of the physical state of humanity including its biological makeup--DNA and neural architecture--as well as its cultural and technological accomplishments. Perhaps it contains the entire record of human history up until now. Who knows--we are talking about asymptotic behaviors here.
The point is--and I think you'll agree with me if you share certain basic naturalistic assumptions about ethics--that while not explicitly coding for something like "what's the culminating point of collective, coherent, extrapolated values?", this description accomplishes the more modest task of including in it, somewhere, an encoding of those values as they are now. We might disagree about which things represent values and which represent noise or plain fact. But if we do a thorough job we'll at least make sure we've got them all.
This is a hack, perhaps. But personally I treat the problem of machine ethics with a certain amount of urgency and so am willing to accept something less than perfect.
§4. So why depth?
I am prepared to provide a mathematical treatment of the choice of g as an objective function in another post. Since I expect it gets a little hairy in the specifics, I am trying to troubleshoot it intuitively first to raise the chance that it is worth the effort. For now, I will try to do a better job of explaining the idea in prose than I did in the last post.
§4.1 Assume that change in the universe can be modeled as a computational process, or a number of interacting processes. A process is the operation of general laws of change--modeled as a kind of universal Turing Machine--that starts with some initial set of data--the program--and then operates on that data, manipulating it over discrete moments in time. For any particular program, that process may halt--outputting some data--or it may not. Of particular interest are those programs that basically encode no information directly about what their outcome is. These are the incompressible programs.
Let's look at the representation h. Given all of the incompressible programs P, only some of them will output h. Among these programs are all the incompressible programs that include h at any stage in their total computational progression, modified with something like, "At time step t, stop here and output whatever you've got!". Let's call the set of all programs from processes that include h in their computational path H. H is a subset of P.
What logical depth does is abstract over all processes that output a string. D(h) is (roughly) the minimum amount of time, over all p in H, for p to output h.
Relative logical depth goes a step further and looks at processes that start with both some incompressible program and some other potentially much more compressible string as input. So let's look at the universe at some future point, u, and the value D(u/h).
§4.2 Just as an aside to try to get intuitions on the same page: If D(u/h) < D(h), then something has gone very wrong, because the universe is incredibly vast and humanity is a rather small part of it. Even if the only process that created the universe was something in the human imagination (!), this change to the universe would mean that we'd have lost something that the processes that created the human present had worked to create. This is bad news.
The intuition here is that as time goes forward, it would be good if the depth of the universe also went up. Time is computation. A supercontroller that tries to minimize depth will be trying to stop time and that would be very bizarre indeed.
§4.3 The intuition I'm trying to sell you on is that when we talk about caring about human existence, i.e. when trying to find a function that is in E, we are concerned with the continuation of the processes that have resulted in humanity at any particular time. A description of humanity is just the particular state at a point in time of one or more computational processes which are human life. Some of these processes are the processes of human valuation and the extrapolation of those values. You might agree with me that CEV is in H.
§4.4 So consider the supercontroller's choice of two possible future timelines, Q and R. Future Q looks like taking the processes of H and removing some of the 'stop and print here' clauses, and letting them run for another couple thousand years, maybe accelerating them computationally. Future R looks like something very alien. The surface of earth is covered in geometric crystal formations that maximize the solar-powered production of grey goo, which is spreading throughout the galaxy at a fast rate. The difference is that the supercontroller did something different in the two timelines.
We can, for either of these timelines, pick a particular logical depth, say c, and slice the timelines at points q and r respectively such that D(q) = D(r) = c.
Recall our objective function is to maximize g(u) = D(u) - D(u/h).
Which will be higher, g(q) or g(r)?
The D(u) term is the same for each. So we are interested in maximizing the value of - D(u/h), which is the same as minimizing D(u/h)--the depth relative to humanity.
By assumption, the state of the universe at r has overwritten all the work done by the processes of human life. Culture, thought, human DNA, human values, etc. have been stripped to their functional carbon and hydrogen atoms and everything now just optimizes for paperclip manufacturing or whatever. D(r/h) = D(r). Indeed anywhere along timeline R where the supercontroller has decided to optimize for computational power at the expense of existing human processes, g(r) is going to be dropping closer to zero.
Compare with D(q/h). Since q is deep, we know some processes have been continuing to run. By assumption, the processes that have been running in Q are the same ones that have resulted in present-day humanity, only continued. The minimum time way to get to q will be to pick up those processes where they left off and continue to compute them. Hence, q will be shallow relative to h. D(q/h) will be significantly lower than D(q) and so be favored by objective function g.
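The Q-versus-R comparison can be caricatured in code. This is entirely my own toy construction: each "process" is an iterated hash, a string's depth is its distance along its own process, and h provides a shortcut exactly when the future continues humanity's process. Real logical depth minimizes over all incompressible programs and is incomputable.

```python
import hashlib

def step(state):
    # One tick of a toy "computational process": an iterated hash.
    return hashlib.sha256(state).digest()

def run(seed, t):
    state = seed
    for _ in range(t):
        state = step(state)
    return state

def steps_to(target, start, limit=2000):
    """Steps for the process to carry `start` to `target`, or None if it
    never gets there within the horizon."""
    state = start
    for t in range(limit + 1):
        if state == target:
            return t
        state = step(state)
    return None

human_seed, alien_seed = b"plausible origin", b"grey goo"
h = run(human_seed, 500)   # humanity: 500 ticks of its own process
q = run(human_seed, 800)   # future Q: the same process, continued
r = run(alien_seed, 800)   # future R: an alien process of equal depth

def g(u, seed):
    """Toy g(u) = D(u) - D(u/h). Depth here is just distance along the
    string's own process; h is a shortcut only if it lies on u's path."""
    D_u = steps_to(u, seed)
    shortcut = steps_to(u, h)
    D_u_given_h = shortcut if shortcut is not None else D_u
    return D_u - D_u_given_h

print(g(q, human_seed), g(r, alien_seed))  # 500 0
```

Q scores 500 because the 500 ticks already embodied in h need not be recomputed; R scores zero because h gives no shortcut to it, exactly the situation where D(r/h) = D(r).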
§5 But why optimize?
You may object: if the function depends on depth measure D which only depends on the process that produces h and q with minimal computation, maybe this will select for something inessential about humanity and mess things up. Depending on how you want to slice it, this function may fall outside of the existentially preserving set E let alone the hyperexistentially acceptable set A. Or suppose you are really interested only in the continuation of a very specific process, such as coherent extrapolated volition (here, CEV).
To this I propose a variation on the depth measure, D*, which I believe was also proposed by Bennett (though I have to look that up to be sure). Rather than taking the minimum computational time required to produce some representation, D* is a weighted average over the computational time it takes to produce the string. The weights can reflect something like the Kolmogorov complexity of the initial programs/processes. You can think of this as an analog of Solomonoff induction, but through time instead of space.
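As a sketch of the intended shape of D* (my own toy rendering; since K is incomputable, both argument lists here are hypothetical inputs, not something a program could actually obtain):

```python
def d_star(times, complexities):
    """Toy D*(x): a weighted average over the computation times of programs
    that output x, instead of the minimum that plain D takes. Each program
    p is weighted by 2^-K(p), echoing the Solomonoff prior. `times[i]` is
    the time program i needs to output x; `complexities[i]` stands in for
    K(p_i) in bits."""
    weights = [2.0 ** -k for k in complexities]
    return sum(w * t for w, t in zip(weights, times)) / sum(weights)

# Two programs output x: a simple one slowly, a complex one quickly.
# The simple program dominates the average, so x still counts as deep.
print(d_star([1000, 10], [5, 50]) > 999)  # True
```

The point of the averaging is that a single fast-but-baroque program can no longer collapse the measure, which is what makes g* sensitive to the whole ensemble of processes in H rather than just the cheapest one.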
Consider the supercontroller that optimizes for g*(u) = D*(u) - D*(u/h).
Suppose your favorite ethical process, such as CEV, is in H. h encodes for some amount of computational progress on the path towards completed CEV. By the same reasoning as above, future universes that continue from h on the computational path of CEV will be favored, albeit only marginally, over futures that are insensitive to CEV.
This is perhaps not enough consolation to those very invested in CEV, but it is something. The processes of humanity continue to exist, CEV among them. I maintain that this is pretty good. I.e. that g* is in A.
Everybody's talking about machine ethics
There is a lot of mainstream interest in machine ethics now. Here are some links to some popular articles on this topic.
By Zeynep Tufekci, a professor at the I School at UNC, on Facebook's algorithmic newsfeed curation and why Twitter should not implement the same.
By danah boyd, claiming that 'tech folks' are designing systems that implement an idea of fairness that comes from neoliberal ideology.
danah boyd (who spells her name with no capitalization) runs Data & Society, a "think/do tank" that aims to study this stuff. They've recently gotten MacArthur Foundation funding for studying the ethical and political impact of intelligent systems.
A few observations:
First, there is no mention of superintelligence or recursively self-modifying anything. These scholars are interested in how, in the near future, the already comparatively powerful machines have moral and political impact on the world.
Second, these groups are quite bad at thinking in a formal or mechanically implementable way about ethics. They mainly seem to recapitulate the same tired tropes that have been resonating through academia for literally decades. By contrast, mathematical formulation of ethical positions appears to be y'all's specialty.
Third, however much the one-true-morality may be indeterminate or presently unknowable, progress towards implementable descriptions of various plausible moral positions could at least be incremental steps towards an understanding of how to achieve something better. Considering a possible slow take-off future, iterative testing and design of ethical machines with high computational power seems like low-hanging fruit that could only better inform longer-term futurist thought.
Personally, I try to do work in this area and find the lack of serious formal work deeply disappointing. This post is a combination heads up and request to step up your game. It's go time.
Sebastian Benthall
PhD Candidate
UC Berkeley School of Information
Proposal: Use logical depth relative to human history as objective function for superintelligence
I attended Nick Bostrom's talk at UC Berkeley last Friday and got intrigued by these problems again. I wanted to pitch an idea here, with the question: Have any of you seen work along these lines before? Can you recommend any papers or posts? Are you interested in collaborating on this angle in further depth?
The problem I'm thinking about (surely naively, relative to y'all) is: What would you want to program an omnipotent machine to optimize?
For the sake of avoiding some baggage, I'm not going to assume this machine is "superintelligent" or an AGI. Rather, I'm going to call it a supercontroller, just something omnipotently effective at optimizing some function of what it perceives in its environment.
As has been noted in other arguments, a supercontroller that optimizes the number of paperclips in the universe would be a disaster. Maybe any supercontroller that was insensitive to human values would be a disaster. What constitutes a disaster? An end of human history. If we're all killed and our memories wiped out to make more efficient paperclip-making machines, then it's as if we never existed. That is existential risk.
The challenge is: how can one formulate an abstract objective function that would preserve human history and its evolving continuity?
I'd like to propose an answer that depends on the notion of logical depth as proposed by C.H. Bennett and outlined in section 7.7 of Li and Vitanyi's An Introduction to Kolmogorov Complexity and Its Applications which I'm sure many of you have handy. Logical depth is a super fascinating complexity measure that Li and Vitanyi summarize thusly:
Logical depth is the necessary number of steps in the deductive or causal path connecting an object with its plausible origin. Formally, it is the time required by a universal computer to compute the object from its compressed original description.
The mathematics is fascinating and better read in the original Bennett paper than here. Suffice it presently to summarize some of its interesting properties, for the sake of intuition.
- "Plausible origins" here are incompressible, i.e. algorithmically random.
- As a first pass, the depth D(x) of a string x is the least amount of time it takes to output the string from an incompressible program.
- There's a free parameter that has to do with precision that I won't get into here.
- Both a string of length n that is comprised entirely of 1's, and a string of length n of independent random bits are both shallow. The first is shallow because it can be produced by a constant-sized program in time n. The second is shallow because there exists an incompressible program that is the output string plus a constant sized print function that produces the output in time n.
- An example of a deeper string is the string of length n that for each digit i encodes the answer to the ith enumerated satisfiability problem. Very deep strings can involve diagonalization.
- Like Kolmogorov complexity, there is an absolute and a relative version. Let D(x/w) be the least time it takes to output x from a program that is incompressible relative to w.
- It can be updated with observed progress in human history at time t' by replacing h_t with h_t'. You could imagine generalizing this to something that dynamically updated in real time.
- This is a quite conservative function, in that it severely punishes computation that does not depend on human history for its input. It is so conservative that it might result in, just to throw it out there, unnecessary militancy against extra-terrestrial life.
- There are lots of devils in the details. The precision parameter I glossed over. The problem of representing human history and the state of the universe. The incomputability of logical depth (of course it's incomputable!). My purpose here is to contribute to the formal framework for modeling these kinds of problems. The difficult work, like in most machine learning problems, becomes feature representation, sensing, and efficient convergence on the objective.
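The shallow-strings bullet above can be illustrated with a crude Kolmogorov proxy (compressed size under zlib, a rough stand-in of my own choosing): the constant and random strings sit at opposite extremes of complexity, yet neither is deep.

```python
import os, zlib

n = 10_000
ones  = b"\x01" * n      # constant string: maximally compressible
noise = os.urandom(n)    # random string: incompressible

# Compressed size as a crude stand-in for Kolmogorov complexity:
print(len(zlib.compress(ones)) < 100)     # True: tiny description exists
print(len(zlib.compress(noise)) > 9_000)  # True: no shorter description
# Yet both strings are *shallow*: each is produced in about n steps from a
# plausible origin -- a constant-size loop for the first, a verbatim print
# of the random bits for the second. Depth and complexity come apart.
```

Depth only accumulates when the shortest route from an incompressible origin to the string requires substantial computation, as in the satisfiability example above.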
Intelligence explosion in organizations, or why I'm not worried about the singularity
If I understand the Singularitarian argument espoused by many members of this community (e.g. Muehlhauser and Salamon), it goes something like this:
- Machine intelligence is getting smarter.
- Once an intelligence becomes sufficiently supra-human, its instrumental rationality will drive it towards cognitive self-enhancement (Bostrom), making it a super-powerful, resource-hungry superintelligence.
- If a superintelligence isn't sufficiently human-like or 'friendly', that could be disastrous for humanity.
- Machine intelligence is unlikely to be human-like or friendly unless we take precautions.
I'm in danger of getting into politics. Since I understand that political arguments are not welcome here, I will refer to these potentially unfriendly human intelligences broadly as organizations.
Smart organizations
By "organization" I mean something commonplace, with a twist. It's commonplace because I'm talking about a bunch of people coordinated somehow. The twist is that I want to include the information technology infrastructure used by that bunch of people within the extension of "organization".
Do organizations have intelligence? I think so. Here's some of the reasons why:
- We can model human organizations as having preference functions. (Economists do this all the time)
- Human organizations have a lot of optimization power.
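The first bullet can be made concrete with the economists' standard toy: a firm choosing its output to maximize a profit "preference function". The price and cost curve here are purely illustrative.

```python
# A firm chooses output q to maximize profit = revenue - cost,
# with an illustrative price of 10 and quadratic cost 0.5 * q^2.
def profit(q, price=10.0):
    return price * q - 0.5 * q**2

# The organization "acts" by selecting the output its preferences rank highest:
best_q = max(range(100), key=profit)
print(best_q)  # 10: the firm produces until marginal cost equals price
```

Nothing in this model cares whether the maximizer is a person or a bunch of people plus their IT infrastructure, which is the point.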
I talked with Mr. Muehlhauser about this specifically. I gather that at least at the time he thought human organizations should not be counted as intelligences (or at least as intelligences with the potential to become superintelligences) because they are not as versatile as human beings.
So when I am talking about super-human intelligence, I specifically mean an agent that is as good or better than humans at just about every skill set that humans possess for achieving their goals. So that would include things like not just mathematical ability or theorem proving and playing chess, but also things like social manipulation and composing music and so on, which are all functions of the brain, not the kidneys.
...and then...
It would be a kind of weird [organization] that was better than the best human or even the median human at all the things that humans do. [Organizations] aren’t usually the best in music and AI research and theory proving and stock markets and composing novels. And so there certainly are [Organizations] that are better than median humans at certain things, like digging oil wells, but I don’t think there are [Organizations] as good or better than humans at all things. More to the point, there is an interesting difference here because [Organizations] are made of lots of humans and so they have the sorts of limitations on activities and intelligence that humans have. For example, they are not particularly rational in the sense defined by cognitive science. And the brains of the people that make up organizations are limited to the size of skulls, whereas you can have an AI that is the size of a warehouse.
I think that Muehlhauser is slightly mistaken on a few subtle but important points. I'm going to assert my position on them without much argument because I think they are fairly sensible, but if any reader disagrees I will try to defend them in the comments.
- When judging whether an entity has intelligence, we should consider only the skills relevant to the entity's goals.
- So, if organizations are not as good at a human being at composing music, that shouldn't disqualify them from being considered broadly intelligent if that has nothing to do with their goals.
- Many organizations are quite good at AI research, or outsource their AI research to other organizations with which they are intertwined.
- The cognitive power of an organization is not limited to the size of skulls. The computational power of many organizations comprises both the skulls of their members and possibly "warehouses" of digital computers.
- With the ubiquity of cloud computing, it's hard to say that a particular computational process has a static spatial bound at all.
Mean organizations
* My preferred standard of rationality is communicative rationality, a Habermasian ideal of a rationality aimed at consensus through principled communication. As a consequence, when I believe a position to be rational, I believe that it is possible and desirable to convince other rational agents of it.