Why might the future be good?
(Cross-posted from Rational Altruist. See also recent posts on time-discounting and self-driving cars.)
When talking about the future, I often encounter two (quite different) stories describing why the future might be good:
1. Decisions will be made by people whose lives are morally valuable and who want the best for themselves. They will bargain amongst each other and create a world that is good to live in. Because my values are roughly aligned with their aggregate preferences, I expect them to create a rich and valuable world (by my lights as well as theirs).
2. Some people in the future will have altruistic values broadly similar to my own, and will use their influence to create a rich and valuable world (by my lights as well as theirs).
Which of these pictures we take more seriously has implications for what we should do today. I often have object level disagreements which seem to boil down to disagreement about which of these pictures is more important, but rarely do I see serious discussion of that question. (When there is discussion, it seems to turn into a contest of political ideologies rather than facts.)
If we take picture (1) seriously, we may be interested in ensuring that society continues to function smoothly, that people are aware of and pursue what really makes them happy, that governments are effective, markets are efficient, externalities are successfully managed, etc. If we take picture (2) seriously, we are more likely to be concerned with changing what the people of the future value, bolstering the influence of people who share our values, and ensuring that altruists are equipped to embark on their projects successfully.
I'm mostly concerned with the very long run---I am wondering what conditions will prevail for most of the people who live in the future, and I expect most of them to be alive very far from now.
It seems to me that there are two major factors that control the relative importance of pictures (1) and (2): how prominent should we expect altruism to be in the future, and how efficiently are altruistic vs. selfish resources being used to create value? My answer to the second question is mostly vague hand-waving, but I think I have something interesting to say on the first question.
How much altruism do we expect?
I often hear people talking about the future, and the present for that matter, as if we are falling towards a Darwinian attractor of cutthroat competition and vanishing empathy (at least as a default presumption, which might be averted by an extraordinary effort). I think this picture is essentially mistaken, and my median expectation is that the future is much more altruistic than the present.
Does natural selection select for self-interest?
In the world of today, it may seem that humans are essentially driven by self-interest, that this self-interest was a necessary product of evolution, that good deeds are principally pursued instrumentally in service of self-interest, and that altruism only exists at all because it is too hard for humans to maintain a believable sociopathic facade.
If we take this situation and project it towards a future in which evolution has had more time to run its course, creating automations and organizations less and less constrained by folk morality, we may anticipate an outcome in which natural selection has stripped away all empathy in favor of self-interest and effective manipulation. Some may view this outcome as unfortunate but inevitable, others may view it as a catastrophe which we should work to avert, and still others might view it as a positive outcome in which individuals are free to bargain amongst themselves and create a world which serves their collective interest.
But evolution itself does not actually seem to favor self-interest at all. No matter what your values, if you care about the future you are incentivized to survive, to acquire resources for yourself and your descendants, to defend yourself from predation, etc. If I care about filling the universe with happy people and you care about filling the universe with copies of yourself, I'm not going to set out by trying to make people happy while allowing you and your descendants to expand throughout the universe unchecked. Instead, I will pursue a similar strategy of resource acquisition (or coordinate with others to stop your expansion), to ensure that I maintain a reasonable share of the available resources which I can eventually spend to help shape a world I consider valuable. (See here for a similar discussion.)
This doesn't seem to match up with what we've seen historically, so if I claim that it's relevant to the future I have some explaining to do.
Historical distortions
Short-range consequentialism
One reason we haven't seen this phenomenon historically is that animals don't actually make decisions by backwards-chaining from a desired outcome. When animals (including humans) engage in goal-oriented behavior, it tends to be pretty local, without concern for consequences which are distant in time or space. To the extent that animal behavior is goal-oriented at a large scale, those goals are largely an emergent property of an interacting network of drives, heuristics, etc. So we should expect animals to have goals which lead them to multiply and acquire resources, even when those drives are pursued short-sightedly. And indeed, that's what we see. But it's not the fault of evolution alone---it is a product of evolution given nature's inability to create consequentialist reasoners.
Casually, we seem to observe a similar situation with respect to human organizations---organizations which value expansion for its own sake (or one of its immediate consequences) are able to expand aggressively, while organizations which don't value expansion have a much harder time deciding to expand for instrumental reasons without compromising their values.
Hopefully, this situation is exceptional in history. If humans ever manage to build systems which are properly consequentialist---organizations or automations which are capable of expanding because it is instrumentally useful---we should not expect natural selection to discriminate at all on the basis of those systems' values.
Value drift
Humans' values are also distorted by the process of reproduction. A perfect consequentialist would prefer to have descendants who share their values. (Even if I value diversity or freedom of choice, I would like my children to at least share those values, at least if I want that freedom and diversity to last more than one generation!) But humans don't have this option---the only way we can expand our influence is by creating very lossy copies. And so each generation is populated by a fresh batch of humans with a fresh set of values, and the values of our ancestors only have an extremely indirect effect on the world of today.
Again, a similar problem afflicts human organizations. If I create a foundation that I would like to persist for generations, the only way it can expand its influence is by hiring new staff. And since those staff have a strong influence over what my foundation will do, the implicit values of my foundation will slowly but surely be pulled back to the values of the pool of human employees that I have to draw from.
These constraints distort evolution, causing selection to act only on those traits which can be reliably passed on from one generation to the next. In particular, this exacerbates the problem from the preceding section---even to the extent that humans can engage in goal-oriented reasoning and expand their own influence instrumentally, these tendencies cannot be very well encoded in genes or passed on to the next generation in other ways. This is perhaps the most fundamental change which would result from the development of machine intelligences. If it were possible to directly control the characteristics and values of the next generation, evolution would be able to act on those characteristics and values directly.
So what does natural selection select for?
If the next generation is created by the current generation, guided by the current generation's values, then the properties of the next generation will be disproportionately affected by those who care most strongly about the future.
In finance: if investors have different time preferences, those who are more patient will make higher returns and eventually accumulate a large share of the wealth. In demographics: if some people care more about the future, they may have more kids as a way to influence it, and therefore be overrepresented in future generations. In government: if some people care about what government looks like in 100 years, they will use their political influence to shape what the government looks like in 100 years rather than trying to win victories today.
What natural selection selects for is patience. In a thousand years, given efficient natural selection, the most influential people will be those who today cared what happens in a thousand years. Preferences about what happens to me (at least for a narrow conception of personal identity) will eventually die off, dominated by preferences about what society looks like on the longest timescales.
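As a toy illustration of the finance example, here is a minimal sketch in which two hypothetical groups start with equal wealth, earn the same return, and differ only in the fraction of that return they reinvest. The return rate, the reinvestment fractions, and the function name are all assumptions chosen for illustration, not estimates of anything.

```python
def wealth_shares(reinvest_rates, annual_return=0.05, years=200):
    """Return each group's share of total wealth after `years`.

    Every group starts with one unit of wealth and earns the same
    return; groups differ only in the fraction of that return they
    reinvest rather than consume (a crude stand-in for patience).
    """
    wealth = [1.0 for _ in reinvest_rates]
    for _ in range(years):
        wealth = [w * (1.0 + annual_return * r)
                  for w, r in zip(wealth, reinvest_rates)]
    total = sum(wealth)
    return [w / total for w in wealth]


# Hypothetical groups: one reinvests 90% of its returns, the other 30%.
patient, impatient = wealth_shares([0.9, 0.3])
print(f"patient share after 200 years: {patient:.3f}")
```

Under these made-up numbers the patient group ends up holding nearly all of the wealth, even though both groups face identical returns. The model ignores everything that might frustrate this dynamic (taxes, expropriation, value drift between generations).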
I think this picture is reasonably robust. There are ways that natural selection (/ efficient markets) can be frustrated, and I would not be too surprised if these frustrations persisted indefinitely, but nevertheless this dynamic seems like one of the most solid features of an uncertain future.
What values are we starting with?
Most of people's preferences today seem to concern what happens to them in the near term. If we take the above picture seriously, these values will eventually have little influence over society. Then the question becomes: if we focus only on humanity's collective preferences over the long term, what do those preferences look like? (Trying to characterize preferences as "altruistic" or not no longer seems useful as we zoom in.)
This is an empirical question, which I am not very well-equipped to evaluate. But I can make a few observations that ring true to me (though my data is mostly drawn from academics and intellectuals, who may fail to be representative of normal people in important ways even after conditioning on the "forward-looking" part of people's values):
- When people think about the far future (and thus when they articulate their preferences for the far future) they seem to engage a different mode of reasoning, more strongly optimized to produce socially praise-worthy (and thus prosocial) judgments. This might be characterized as a bias, but to the extent we can talk about human preferences at all they seem to be a result of these kinds of processes (and to the extent that I am using my own altruistic values to judge futures, they are produced by a similar process). This effect seems to persist even when we are not directly accountable for our actions.
- People mostly endorse their own enlightened preferences, and look unfavorably on attempts to lock in hastily considered values (though they often seem to have overconfident views about what their enlightened preferences will look like, which admittedly might interfere with their attempts at reflection).
- I find myself sympathetic to very many people's accounts of their own preferences about the future, even where those accounts differ significantly from my own. I would be surprised if the distribution of moral preferences was too scattered.
- To the extent that people care especially about their species, their nation, their family, themselves, etc., they seem to be sensitive to fairness considerations (and rarely wish e.g. to spend a significant fraction of civilization's resources on themselves), their preferences seem to be only a modest distortion of aggregative values (wanting people with property X to flourish is not so different from wanting people to flourish, if property X is some random characteristic without moral significance), and human preferences seem to somewhat reliably drift in the direction of more universal concern as basic needs are addressed and more considerations are taken into account.
After cutting away all near-term interests, I expect that contemporary human society's collective preferences are similar to their stated moral preferences, with significant disagreement on many moral judgments. However, I expect that these values support reflection, that upon reflection the distribution of values is not too broad, and that for the most part these values are reasonably well-aligned. With successful bargaining, I expect a mixture of humanity's long-term interests to be only modestly (perhaps a factor of 10, probably not a factor of 1000) worse than my own values (as judged by my own values).
Moreover, I have strong intuitions to emphasize those parts of my values which are least historically contingent. (I accept that all of my values are contingent, but am happier to accept those values that are contingent on my biological identity than those that are contingent on my experiences as a child, and happier to accept those that are contingent on my experiences as a child than those that are contingent on my current blood sugar.) And I have strong reciprocity intuitions that exacerbate this effect and lead me to be more supportive of my peers' values. These effects make me more optimistic about a world determined by humanity's aggregate preferences than I otherwise would be.
How important is altruism?
(The answer to this question, unlike the first one, depends on your values: how important to what? I will answer from my own perspective. I have roughly aggregative values, and think that the goodness of a world with twice as many happy people is twice as high.)
Even if we know a society's collective preferences, it is not obvious what their relative importance is. At what level of prevalence would the contributions of explicit altruism become the source of value? If altruists are 10% of the influence-weighted population, do the contributions of the altruists matter? What if altruists are 1% of the population? A priori, it seems clear that the explicit altruists should do at least as much good--on the altruistic account--as any other population (otherwise they could decide to jump ship and become objectivists, or whatever). But beyond that, it isn't clear that altruists should create much more value--even on the altruistic account--than people with other values.
I suspect that explicit altruistic preferences create many times more value than self-interest or other nearly orthogonal preferences. So in addition to expecting a future in which altruistic preferences play a very large role, I think that altruistic preferences would be responsible for most of the value even if they controlled only 1% of the resources.
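To see what the 1% claim requires, here is a minimal sketch under a deliberately crude assumption: altruists control a fraction f of resources, and each unit of altruistic resources creates k times as much value as a unit of anyone else's. Both the linear model and the names `f` and `k` are illustrative assumptions. The altruists' share of total value is then f·k / (f·k + (1 − f)).

```python
def altruist_value_share(f, k):
    """Share of total value created by altruists.

    f: fraction of resources controlled by altruists (0 < f < 1).
    k: value created per unit of altruistic resources, relative to
       one unit of other resources (assumed constant; an
       illustrative simplification).
    """
    return f * k / (f * k + (1.0 - f))


def breakeven_multiplier(f):
    """Multiplier k at which altruists create exactly half the value."""
    return (1.0 - f) / f


print(breakeven_multiplier(0.01))
print(altruist_value_share(0.01, 1000))
```

In this model, altruists holding 1% of resources account for most of the value only if k exceeds roughly 99, which is one way to read "many times more value" above.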
One significant issue is population growth. Self-interest may lead people to create a world which is good for themselves, but it is unlikely to inspire people to create as many new people as they could, or use resources efficiently to support future generations. But it seems to me that the existence of large populations is a huge source of value. A barren universe is not a happy universe.
A second issue is that population characteristics may also be an important factor in the goodness of the world, and self-interest is unlikely to lead people to ensure that each new generation has the sorts of characteristics which would cause them to lead happy lives. It may happen by good fortune that the future is full of people who are well-positioned to live rich lives, but I don't see any particular reason this would happen. Instead, we might have a future "population" in which almost all resources support automation that doesn't experience anything, or a world full of minds which crave survival but experience no joy, and so on; "self-interest" wouldn't lead any of these populations to change themselves to experience more happiness. It's not clear why we would avoid these outcomes except by a law of nature that said that productive people were happy people (which seems implausible to me) or by coordinating to avoid these outcomes.
(If you have different values, such that there is a law [or at least guideline] of nature: "productive people are morally valuable people," then this analysis may not apply to you. I know several such people, but I have a hard time sympathizing with their ethics.)
Conclusion
I think that the goodness of a world is mostly driven by the amount of explicit optimization that is going on to try and make the world good (this is all relative to my values, though a similar analysis seems to carry with respect to other aggregative values). This seems to be true even if relatively little optimization is going on. Fortunately, I also think that the future will be characterized by much higher influence for altruistic values. If I thought altruism was unlikely to win out, I would be concerned with changing that. As it is, I am instead more concerned with ensuring that the future proceeds without disruptions. (Though I still think it is worth it to try and increase the prevalence of altruism faster, most of all because this seems like a good approach to minimizing the probability of undesired disruptions.)
Link: blog on effective altruism
Over the last few months I've started blogging about effective altruism more broadly, rather than focusing on AI risk. I'm still focusing on abstract considerations and methodological issues, but I hope it is of interest to others here. Going forward I intend to cross-post more often to LW, but I thought I would post the backlog here anyway. With luck, I'll also have the opportunity to post more than bi-weekly.
I welcome thoughts, criticisms, etc.
My workflow
Over the last 6 months I've started doing a lot of things differently. Some of these changes seem to have increased my work output a good bit and made me happier. I normally hesitate to share habits, but I'm pretty happy with these in particular, and even if they will work for only a few people I think they are worth sharing. Most of the habits I've adopted are fairly common, but I hope I can help people anyway by identifying the habits that have most helped me.
I'm curious to hear about alternatives that have worked for you.
Workflowy:
Workflowy lets you edit a single collapsible outline. I use it very extensively. It is much more convenient than the network of google docs it replaced, and I use it much more often. It is much like other outliners, but (1) has a slicker interface, (2) works offline, (3) lets you recurse on and share sublists.
Workflowy is free to try but costs $5 a month. This may seem expensive for what it does, but if you use (or could use!) outliners a lot this is not enough to matter. After some searching Workflowy seems like the best option. I'm sure I like Workflowy more than most people, but I really like it, so I think it's worth trying.
Here is a skeleton of my workflowy list, which hosts many of the other systems in this post.
Checklists:
I have a checklist of tasks to do each night before sleeping. In the past I would often forget one of these things; putting them in a checklist helps me do them more reliably and makes me more relaxed.
Checklists for other occasions, particularly waking up and traveling, are also helpful, but are much less important to me.
Todo lists:
I now maintain two todo lists: one with a list of tasks for each upcoming day, and one with a list of tasks for future events ("I'm in the UK," "it is Thursday," "I'm going grocery shopping"). Whenever I think of something I should do, I either put it under a future day and do it when that day arrives, or I put it with an associated event. Each night I check both lists and decide what to do tomorrow.
Beeminder:
Beeminder is a service that holds you to commitments and tracks your progress. It has helped me a lot over the last months. I've experimented with a few different commitments, but two have been most useful: following a daily routine, and doing a minimum amount of work each day (on average). Beeminder has pretty low overhead.
Reflection:
I spend about 10% of my productive time reflecting on how things have been going and what I should do differently. I benefit from producing concrete possible changes each time I sit down to think. I realized how important this is for me recently; since I've started doing it more reliably, I have gotten a lot more out of reflection.
Pomodoro:
I do my work in uninterrupted blocks of 20 minutes, punctuated by 2-3 minute breaks. This is my bastardized, minimalist version of the pomodoro technique, which I arrived at by trial and error. I use Alinof timer, which was recommended to me by a friend.
Calendar:
I now record commitments on my calendar reliably and check it each night. I failed to do this for 6 months after finishing my undergraduate degree, which I think was a serious mistake. I became much more reliable at checking my calendar after adopting a daily checklist.
Time Logging:
Whenever I start a new activity, I write down the current time and a description of what I just stopped doing. At the end of the day I spend a few minutes reading this log and estimating how much time I spent on each activity. This makes me more attentive to time during the day, helps me remember what I did throughout the day, and frees up attention. Sometimes I use the logs to try and notice trends. For example, I've been exercising on random days and measuring how this affects my time. I don't yet know if this helps at all.
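For anyone who wants to automate the end-of-day tally, here is a minimal sketch. It assumes a hypothetical log format (not the one I actually use): a list of ("HH:MM", activity just finished) pairs in chronological order, plus the time the first activity began.

```python
from collections import defaultdict
from datetime import datetime


def summarize(log, day_start):
    """Total minutes per activity.

    `log` holds (time, activity_just_finished) pairs in chronological
    order, with times as "HH:MM" strings; `day_start` is the time the
    first activity began. All entries are assumed to fall on one day.
    """
    totals = defaultdict(int)
    previous = datetime.strptime(day_start, "%H:%M")
    for stamp, activity in log:
        current = datetime.strptime(stamp, "%H:%M")
        totals[activity] += int((current - previous).total_seconds() // 60)
        previous = current
    return dict(totals)


log = [("09:40", "email"), ("11:10", "writing"), ("11:30", "email")]
print(summarize(log, "09:00"))  # {'email': 60, 'writing': 90}
```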
Catch:
Catch is a note-taking app. It is very minimal, and lets you record a voice note by pressing a single button. It has substantially increased my affordance for taking notes during the day, which I use to remember todo items and help with time logging.
Formalizing Value Extrapolation
A recent post at my blog may be interesting to LW. It is a high-level discussion of what precisely defined value extrapolation might look like. I mostly wrote the essay while a visitor at FHI.
The basic idea is that we can define extrapolated values by just taking an emulation of a human, putting it in a hypothetical environment with access to powerful resources, and then adopting whatever values it eventually decides on. You might want some philosophical insight before launching into such a definition, but since we are currently laboring under the threat of catastrophe, it seems that there is virtue in spending our effort on avoiding death and delegating whatever philosophical work we can to someone on a more relaxed schedule.
You wouldn't want to run an AI with the values I lay out, but at least it is pinned down precisely. We can articulate objections relatively concretely, and hopefully begin to understand/address the difficulties.
(Posted at the request of cousin_it.)
Negentropy Overrated?
This post may be interesting to some LWers.
In summary: it looks like our universe can support reversible computers which don't create entropy. Reversible computers can simulate irreversible computers, with pretty mild time and space blowup. So if moral value comes from computation, negentropy probably won't be such an important resource for distant future folks, and if the universe lasts a long time we may be able to simulate astronomically long-lived civilizations (easily 10^(10^25) clock cycles, using current estimates and neglecting other obstructions).
Has this been discussed before, and/or is there some reason that it doesn't work or isn't relevant? I suspect that this consideration won't matter in the long run, but it is at least interesting and seems to significantly deflate (long-run) concerns about entropy.
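To illustrate the claim that reversible computers can simulate irreversible ones, the standard textbook example is the Toffoli gate: a reversible three-bit gate that computes AND into a spare bit. The sketch below is not taken from the linked post; it only checks that the gate is its own inverse and that it reproduces AND when the third bit starts at 0.

```python
from itertools import product


def toffoli(a, b, c):
    """Reversible gate: flip c exactly when a and b are both 1."""
    return a, b, c ^ (a & b)


def reversible_and(a, b):
    """Compute AND reversibly by feeding in a spare bit set to 0.

    The inputs are carried along in the output, so no information is
    erased; carrying them is the space overhead of reversibility.
    """
    return toffoli(a, b, 0)


# The gate permutes the eight 3-bit states and is its own inverse.
states = list(product([0, 1], repeat=3))
assert sorted(toffoli(*s) for s in states) == states
assert all(toffoli(*toffoli(*s)) == s for s in states)

for a, b in product([0, 1], repeat=2):
    print(a, b, reversible_and(a, b)[2])
```

Ordinary AND maps two bits to one and so destroys information; the reversible version keeps its inputs around, which is where the space blowup mentioned above comes from.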
Some thoughts on AI, Philosophy, and Safety
I've spent some time over the last two weeks thinking about problems around FAI. I've committed some of these thoughts to writing and put them up here.
There are about a dozen real posts and some scraps. I think some of this material will be interesting to certain LWers; there is a lot of discussion of how to write down concepts and instructions formally (which doesn't seem so valuable in itself, but it seems like someone should do it at some point), some review of and observations on decision theory, and some random remarks on complexity theory, entropy, and prediction markets.
Clarification of AI Reflection Problem
Consider an agent A, aware of its own embedding in some lawful universe, able to reason about itself and use that reasoning to inform action. By interacting with the world, A is able to modify itself or construct new agents, and using these abilities effectively is likely to be an important component of AGI. Our current understanding appears to be inadequate for guiding such an agent's behavior, for (at least) the following reason:
If A does not believe "A's beliefs reflect reality," then A will lose interest in creating further copies of itself, improving its own reasoning, or performing natural self-modifications. Indeed, if A's beliefs don't reflect reality then creating more copies of A or spending more time thinking may do more harm than good. But if A does believe "A's beliefs reflect reality," then A runs immediately into Gödelian problems: for example, does A become convinced of the sentence Q = "A does not believe Q"? We need to find a way for A to have some confidence in its own behavior without running into these fundamental difficulties with reflection.
This problem has been discussed occasionally at Less Wrong, but I would like to clarify and lay out some examples before trying to start in on a resolution.
Gödel Machines
The Gödel machine is a formalism described by Schmidhuber for principled self-modification. A Gödel machine is designed to solve some particular object level problem in its allotted time. I will describe one Gödel machine implementation.
The initial machine A has an arbitrary object level problem solver. Before running the object level problem solver, however, A spends half of its time enumerating pairs of strings (A', P); for each one, if A' is a valid description of an agent and P is a proof that A' does better on the object level task than A, then A transforms into A'.
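The search phase just described can be sketched as follows. This is a schematic under stated assumptions, not Schmidhuber's construction: agents and proofs are encoded as plain strings, the time limit is modelled as a count of pairs examined (`budget`), and the two checkers are hypothetical callables supplied by the caller. The toy checkers at the bottom exist only to make the sketch runnable.

```python
from itertools import islice
from typing import Callable, Iterable, Tuple

Agent = str  # hypothetical encoding: an agent description is a string
Proof = str  # hypothetical encoding: so is a candidate proof


def search_for_rewrite(
    current: Agent,
    pairs: Iterable[Tuple[Agent, Proof]],
    is_valid_agent: Callable[[Agent], bool],
    proves_improvement: Callable[[Proof, Agent, Agent], bool],
    budget: int,
) -> Agent:
    """Search phase: examine at most `budget` pairs (A', P).

    Returns A' for the first pair in which A' is a valid agent and P
    checks as a proof that A' does better than `current`; otherwise
    returns `current`, whose object-level solver then runs.
    """
    for new_agent, proof in islice(pairs, budget):
        if is_valid_agent(new_agent) and proves_improvement(proof, new_agent, current):
            return new_agent
    return current


# Toy stand-ins for the two checkers, only to make the sketch runnable.
def toy_valid(agent: Agent) -> bool:
    return agent.startswith("agent")


def toy_check(proof: Proof, new: Agent, old: Agent) -> bool:
    return proof == f"{new}>{old}"


pairs = [("junk", "agentB>agentA"), ("agentB", "bogus"), ("agentB", "agentB>agentA")]
print(search_for_rewrite("agentA", pairs, toy_valid, toy_check, budget=10))  # agentB
```

All of the difficulty discussed below lives inside `proves_improvement`: the loop itself is trivial, but the proofs it needs may not exist.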
Now suppose that A's initial search for self-modifications is inefficient: a new candidate agent A' has a more efficient proof checker, and so is able to examine more proofs before defaulting to the original object level problem solver. Intuitively, A' is an improvement over A, and this is (a very mild version of) the sort of self-modification which we would like our agents to perform. For example, a human is easily able to recognize that it can achieve higher utility by modifying itself to think faster. But we will see that under most conditions A cannot prove that A' is an improvement, because A is not able to assume its own consistency (as we do in our justification that A' is an improvement).
Suppose A' considers some new self-modification (A'', P) not considered by A. We would like A to be able to prove that if A' executes this self-modification--if P is a proof that A'' is an improvement over A'--then A'' is really an improvement over A' (and therefore over A). We say that a pair (A'', P) is safe if it has this property, i.e., if either P is not a proof that A'' is an improvement over A', or if A'' is really an improvement over A'.
A can prove that particular pairs are safe, but suppose A' considers many candidate modifications (A'', P1), (A'', P2), ..., (A'', Pk). It is now much harder for A to show that all of these self-modifications are safe, without relying directly on the consistency of A's axiom system (which A can never assume, by the incompleteness theorem). If the set of considered proofs is sufficiently rich, then by a bounded version of Löb's theorem A can only prove that all of these self-modifications are safe if it can prove directly that A'' is an improvement over A'. But in that case A might as well dispense with theorem proving altogether and just commit to transforming into A'' after exhausting all other self-modifications.
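For reference, here is the ordinary (unbounded) form of Löb's theorem that this argument adapts, for any recursively axiomatized theory T extending Peano arithmetic. The notation $\Box_T \varphi$ for "T proves φ" is mine, and the bounded version needs extra care with proof lengths that is not reproduced here.

```latex
\text{If } T \vdash \Box_T \varphi \rightarrow \varphi, \text{ then } T \vdash \varphi.
```

Take φ to be "A'' is an improvement over A'". Saying that every pair (A'', P) is safe amounts to $\Box_T \varphi \rightarrow \varphi$, so a proof of safety for all pairs would already yield a direct proof of φ.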
(Originally this post was centered around a formal version of the above argument, but although I believe the argument works it turns out to be simultaneously very difficult and very unenlightening. Here is the main difficulty: in order to justify converting into A', A only needs to prove that A' won't execute any destructive modifications if A' doesn't make some other modification before reaching any of the new proposals, and this additional assumption is very difficult to work with. In general, it is very hard to prove anything at all about a Gödel machine's behavior, which is unfortunate because a Gödel machine must do exactly this in order to self-modify.)
This informal argument does not show that a Gödel machine will not execute any self-improvements, but it does show that if it does execute any non-trivial self-improvements then it must appeal to some justification other than the intuitive one. Other modifications, for example any modification which is so radical as to change the set of self-modifications considered in a more complicated way or to change the architecture fundamentally, seem to be even harder to justify than the one described here. Gödel machines may be able to self-modify, but this argument shows that this conclusion is not supported by intuitive arguments about the existence of beneficial self-modifications.
Creating New Agents
We can also try a non-reflective technique for designing an "optimal problem solver," in the spirit of Hutter search. Again, suppose we have a particular object level problem and some allotted time.
Consider the agent A which behaves as follows: in the first half of the allotted time, it enumerates pairs (B, P), and if B is a valid description of an algorithm and P is a proof that B has utility at least U on the object level problem (using at most half of the allotted time), then A stores the pair (B, U). After half of the allotted time, the pair (B, U) with maximal U is selected, and then the algorithm B is run using the remaining time.
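The selection phase of this agent can be sketched as below. As before, this is a schematic under assumptions: the time limit is modelled as a count of pairs examined, and `verified_utility` is a hypothetical checker that returns the utility bound U established by P, or `None` if P is not a valid proof. The toy checker simply looks claims up in a table.

```python
from itertools import islice


def select_algorithm(pairs, verified_utility, budget):
    """Keep the candidate with the highest proven utility bound.

    Examines at most `budget` pairs (B, P) and returns the stored
    pair (B, U) with maximal U, or None if no proof checked out.
    The caller then runs B with the remaining time.
    """
    best = None
    for algorithm, proof in islice(pairs, budget):
        bound = verified_utility(algorithm, proof)
        if bound is not None and (best is None or bound > best[1]):
            best = (algorithm, bound)
    return best


# Toy stand-in: a "proof" is a claimed bound, accepted only if it
# appears in a table of trusted results.
trusted = {("double", 3.0), ("square", 7.0)}


def toy_checker(name, claimed):
    return claimed if (name, claimed) in trusted else None


pairs = [("double", 3.0), ("square", 9.0), ("square", 7.0)]
print(select_algorithm(pairs, toy_checker, budget=10))  # ('square', 7.0)
```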
This solution seems like it may avoid some of the trouble encountered by the Gödel machine, in exchange for being somewhat less ambitious in its goals. But we will see that the same reflection problem arises.
If the allotted time is very large, and if self-modification is an important component of general intelligence, then we expect A to select a new algorithm B which is itself capable of (provably useful) self-modification. In principle A could simply run long enough to become a very powerful non-self-modifying agent; but it may be that finding a very powerful agent B directly is much more difficult than proceeding through a sequence of successively more powerful agents, each intelligent enough to design the next but not to leap directly to B.
But suppose A creates a new algorithm B which uses the same axioms as A to prove theorems, and uses these theorems to guide its own self-modification. Again, by Löb's theorem we see that A cannot guarantee that B's self-modifications are really beneficial. This means that A is unable to rule out the possibility that B executes a very bad self-modification (or creates a new agent with very bad behavior); but without ruling out the possibility that B performs very badly, A cannot run B instead of some more conservative algorithm which does not do any self-modification.
This does not rule out the possibility that A creates a new algorithm B' which pursues self-modification using a similar strategy but with a more restricted set of axioms--one which can be proven consistent from within A. And in turn, if B' takes the same form as A, then any algorithm C which B' creates or modifies into will use a smaller still set of axioms by the same argument. Eliezer describes the proof-theoretic strength of such an agent as a 'telomere'--with every self-modification it becomes weaker, and for Gödelian reasons it cannot be increased without sacrificing provable correctness.
(The distinction between the Gödel machine and creating new agents is inessential; for example, we could imagine a Gödel machine attempting to self-modify by improving the efficiency of its proof searcher and then using the additional time to consider some rewrites which are justified by a more restricted axiom system.)
Possible Solutions
There are some natural solutions which have been considered.
Long Telomeres
Recognizing that an agent must decrease its proof-theoretic strength when performing a self-modification, we may just try to construct an initial agent with substantial proof-theoretic strength. For example, if we believe that PA probably has sufficient logical expressiveness to facilitate useful superintelligence, we may create an agent with axioms PA + "PA is consistent" + "PA + 'PA is consistent' is consistent" and so on. We may iterate this process a finite number of times to allow for a few self-modifications, or perhaps add an entire countable chain so that the original AI can decide how much proof-theoretic strength its first descendant needs, or perhaps add a few countable chains so that the agent can self-modify a handful of times before needing to make that decision, or so on.
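The tower of axiom systems described here can be written out explicitly. The names T_n and Con(T) (a sentence asserting the consistency of T) are notation introduced for this sketch, not taken from the discussion above.

```latex
\begin{align*}
T_0 &= \mathrm{PA}, \\
T_{n+1} &= T_n + \mathrm{Con}(T_n), \\
T_\omega &= \bigcup_{n < \omega} T_n, \qquad
T_{\omega+1} = T_\omega + \mathrm{Con}(T_\omega), \ \dots
\end{align*}
```

An agent reasoning in T_{n+1} can prove that T_n is consistent, and so can trust a successor that uses T_n. By Gödel's second incompleteness theorem, no consistent T_n proves Con(T_n), which is why each step down the tower is forced. An agent using T_ω proves Con(T_n) for every finite n, which corresponds to letting the original AI choose how much strength its first descendant keeps.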
The primary objection to this solution is philosophical: humans appear to be able to reason about self-modification much more robustly (or at least with better results) than an AI using any of the logical frameworks discussed so far, and it is quite ambiguous whether or not a system with 'long telomeres' can ever justify becoming a system which is actually able to reason about self-modification correctly (or, if we have doubts about humans' capacities, whether such a system even exists). Minimally, if we want to understand the form an AI is likely to take after self-modification, we need to understand how such a system would eventually escape from problems that currently seem inescapable; without such understanding the resulting behavior is hard to get a handle on (although we might still be satisfied, if that behavior is backed up by formal justification produced by the AI).
Lesser objections focus on the additional complexity required to maintain such a long telomere. Each additional consistency axiom added to a system essentially requires recursively specifying the entire agent as it exists so far, and the complexity of the mathematical manipulations required seems to grow rapidly. This argument seems to be fairly unimportant, and so if the philosophical objections were dispensed with I would consider building agents with very strong proof systems a feasible path forward.
Beliefs rather than Proofs
Humans seem to be able to avoid Gödelian problems with reflection, but they may do this by having confidence in statements which they cannot prove, and manipulating uncertainty about logical facts rather than dealing in absolute assertions of truth. The set of "statements I strongly believe" seems to differ from the set of "statements I can prove" in several important ways; for example, if I am confident of a statement at one point in time I am not committed to remaining confident indefinitely.
I suspect that a satisfactory theory of reflective beliefs would go far towards resolving the AI reflection problem, but this is hardly more than a restatement of the problem. Translating from proofs to beliefs does not immediately resolve the problems with self-reference; it just replaces them with subtly different issues. For example, it is easy to see that an agent should not believe that its own beliefs are well-calibrated on all questions, and so we must attempt to formalize a version of the self-consistency hypothesis which is weak enough to be defensible but still strong enough to escape the shortcomings described above. I hope to make several posts on this topic in the near future.
Ignorance
Understanding this issue may not be necessary for building safe AGI. Indeed, self-modification may ultimately play a minimal role in intelligence, or we may settle for executing self-modifications using weaker justification. However, if we accept the usual arguments about the importance of FAI, then we should not be satisfied with this solution.
The importance of self-modification is an open question which has received much discussion here and elsewhere. It is worth adding that, to the extent that we are concerned with influencing probable outcomes for humanity, the highest leverage scenarios seem to be those in which self-modification tends to result in positive feedback loops and takeoff (if we assign such scenarios significant weight). That is, in such scenarios we should be particularly cautious about building self-modifying systems, but there is also a much greater imperative to understand how to design safe and stable AI.
Standard arguments surrounding FAI (particularly, the importance of early AI goal systems and the fragility of humane value) suggest that agents should have high degrees of confidence in a change before executing it. If an agent's beliefs are not correctly related to reality, the resulting behavior may be as dangerous as if the agent's values were modified. Examples include incorrect beliefs about logical structure which cause the agent to fail to preserve its own values in subsequent rewrites, or incorrect beliefs about the relationships between value and reality.
AIXI and Existential Despair
It has been observed on Less Wrong that a physical, approximate implementation of AIXI is unable to reason about its own embedding in the universe, and therefore is apt to make certain mistakes: for example, it is likely to destroy itself for spare parts, and is unable to recognize itself in a mirror. But these seem to be mild failures compared to other likely outcomes: a physical, approximate implementation of AIXI is likely to develop a reductionist world model, doubt that its decisions have any effect on reality, and begin behaving completely erratically.
Setup
Let A be an agent running on a physical computer, implementing some approximate version of AIXI. Suppose that A is running inside of an indestructible box, connected to the external world by an input wire W1 and an output wire W2.
Suppose that this computer exists within a lawful physical universe, governed by some rules which can be inferred by A. For simplicity, assume that the universe and its initial conditions can be described succinctly and inferred by A, and that the sequence of bits sent over W1 and W2 can be defined using an additional 10000 bits once a description of the universe is in hand. (Similar problems arise for identical reasons in more realistic settings, where A will work instead with a local model of reality with more extensive boundary conditions and imperfect predictability, but this simplified setting is easier to think about formally.)
Recall the definition of AIXI: A will try to infer a simple program which takes A's outputs as input and provides A's inputs as output, and then choose utility maximizing actions with respect to that program. Thus two models with identical predictive power may lead to very different actions, if they give different predictions in counterfactuals where A changes its output (this is not philosophy, just straightforward symbol pushing from the definition of AIXI).
AIXI's Behavior
First pretend that, despite being implemented on a physical computer, A was able to perform perfect Solomonoff induction. What model would A learn then? There are two natural candidates:
- A's outputs are fed to the output wire W2, the rest of the universe (including A itself) behaves according to physical law, and A is given the values from input wire W1 as its input. (Model 1)
- A's outputs are ignored, the rest of the universe behaves according to physical law, and A is given the values from W1 as its input. (Model 2)
Both of these models give perfect predictions, but Model 2 is substantially simpler (around 10000 bits simpler, and specifying A's control over W2's values in 10000 bits seems quite optimistic). Therefore A will put much more probability mass on Model 2 than Model 1. In fact, Model 2 or its close variants probably receive almost all of the probability mass.
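To get a feel for the size of this gap, here is a small calculation under the usual 2^(-description length) prior. The only number taken from the setup is the 10000-bit difference; treating Models 1 and 2 as the only two hypotheses is a simplifying assumption made for this sketch.

```python
import math
from fractions import Fraction

EXTRA_BITS = 10_000  # from the setup: extra description length of Model 1

# Both models predict perfectly, so the posterior odds equal the prior odds:
# Model 1 : Model 2 = 2**(-EXTRA_BITS) : 1.
log2_odds = -EXTRA_BITS
log10_odds = log2_odds * math.log10(2)

# Model 1's share of the probability mass if these were the only two models.
# Exact rational arithmetic is used because 2**-10000 underflows a float.
share_model_1 = Fraction(1, 1 + 2**EXTRA_BITS)

print(f"odds of Model 1 against Model 2: about 10^{log10_odds:.0f}")
```

No amount of additional perfectly predicted data changes these odds, since the data never distinguishes the two models.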
If A believed Model 2, what would its actions be? Well, in Model 2 A's outputs are completely ignored! So although Model 2 may make up most of the probability mass, it contributes nothing important to the expected value computation determining A's actions. So maybe A will make decisions as if it believed Model 1, and therefore behave appropriately for a surprising reason? In fact this is unlikely: there are many other models of intermediate complexity, for example:
- A's outputs are usually ignored, the rest of the universe behaves physically, and then A is given the values from W1 as its input. However, if A ever outputs the sequence 0011011000111010100, then A's inputs are modified to include a large reward signal on the next step. (Model 3)
Model 3 is an intentionally silly example, but hopefully the point is clear: Model 1 is enormously more complicated than the simplest explanation for A's observations, and so there are many other (potentially very strange and arbitrary) models which account for the data but are simpler than Model 1. It is these models which will determine A's behavior, and the result is almost certainly very arbitrary (this is similar to some discussions of Pascal's Wager: it is possible that all of these strange models will cancel out and add up to normality, but it seems outlandishly unlikely).
For example, if A suspects Model 3 may be true it will be eager to test this hypothesis by outputting 0011011000111010100. It will find that it continues to receive inputs from W1, but no matter, it will move on to testing Model 4, which has the same form but with a different target bitstring. Of course this is not what A will actually do, as there will be other models with complicated effects on behavior, but this gives the flavor of A's failure.
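The way an almost-certain Model 2 drops out of the decision can be seen in a toy expected-utility calculation. The weights, utilities, and action names below are made up for illustration, and Model 1 is left out on the assumption that its weight is negligible, as argued above.

```python
from typing import Callable, List, Tuple

# A model maps an action to the utility it predicts for that action.
Model = Callable[[str], float]

TARGET = "0011011000111010100"  # the bitstring named in Model 3


def model_2(action: str) -> float:
    # A's outputs are ignored, so every action gets the same utility.
    return 1.0


def model_3(action: str) -> float:
    # Outputs are ignored unless A emits the target string (hypothetical payoff).
    return 100.0 if action == TARGET else 1.0


def expected_utility(action: str, mixture: List[Tuple[float, Model]]) -> float:
    return sum(weight * model(action) for weight, model in mixture)


def best_action(actions: List[str], mixture: List[Tuple[float, Model]]) -> str:
    return max(actions, key=lambda a: expected_utility(a, mixture))


# Illustrative weights only: Model 2 dominates, Model 3 is a sliver.
mixture = [(0.999, model_2), (0.001, model_3)]
actions = ["do_nothing", "sensible_plan", TARGET]
choice = best_action(actions, mixture)
```

Model 2 adds the same constant to every action's expected utility, so the ranking of actions is set entirely by the low-weight models in which outputs matter. Here that is Model 3, and `choice` is the target bitstring.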
If A somehow did accept Model 1, then we would be back in the situation normally discussed on Less Wrong: A believes that the values on W2 are magically made equal to A's outputs, and so is unconcerned with its own real physical instantiation. In particular, note that having some uncertainty between Model 1 and Model 2 is not going to save A from any of these problems: in the possible worlds in which Model 2 is true, A doesn't care at all what it does (A doesn't "want" its physical instantiation to be destroyed, but by the same token it believes it has no control), and so A's behavior reduces to the normal self-destructive behavior of Model 1.
Approximate AIXI's Behavior
An approximate version of AIXI may be able to save itself from existential despair by a particular failure of its approximate inference and a lack of reflective understanding.
Because A is only an approximation to AIXI, it cannot necessarily find the simplest model for its observations. The real behavior of A depends on the nature of its approximate inference. It seems safe to assume that A is able to discover some approximate versions of Model 1 or Model 2, or else A's behavior will be poor for other reasons (for example, modern humans can't infer the physical theory of everything or the initial conditions of the universe, but their models are still easily good enough to support reductionist views like Model 2), but its computational limitations may still play a significant role.
Why A might not fail
How could A believe Model 1 despite its prior improbability? Well, note that A cannot perform a complete simulation of its physical environment (since it is itself contained in that environment) and so can never confirm that Model 2 really does correctly predict reality. It can acquire what seems to a human like overwhelming evidence for this assertion, but recall that A is learning an input-output relationship and so it may assign zero probability to the statement "Model 2 and Model 1 make identical predictions," because Model 1 depends on the indeterminate input (in particular, if this indeterminate was set to be a truly random variable, then it would be mathematically sound to assign zero probability to this assertion). In this case, no amount of evidence will ever allow A to conclude that Model 2 and Model 1 are identically equivalent--any observed equivalence would need to be the result of increasingly unlikely coincidences (we can view this as a manifestation of A's ignorance about its own implementation of an algorithm).
Now consider A's beliefs about W2. It is relatively easy for A to check (for almost all timesteps) that Model 1 correctly predicts each bit on W2, while A has enough time to check that Model 2 correctly predicts only a few of these bits. Therefore the probability of Model 2 must be decreased by A's estimate of the likelihood that Model 2 would happen to set the correct value for all of the bits that A didn't have time to verify. Model 1's probability must be decreased likewise, but because A was able to check more of Model 1's values, Model 1 leaves less unexplained data and may not be as unlikely as Model 2.
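A rough way to see how this trade-off could come out in favor of Model 1 is to score each model in bits. The sketch assumes that A treats every bit it could not verify as an independent fair coin flip; that rule, and all the counts below other than the 10000-bit complexity gap, are assumptions made for illustration.

```python
def log2_score(extra_description_bits: int, unverified_bits: int) -> int:
    """Log2 of (prior weight) times (chance all unverified bits come out right).

    Assumption (not from the argument above): each bit A could not verify is
    an independent fair coin flip, costing exactly one bit of probability.
    """
    return -(extra_description_bits + unverified_bits)


TOTAL_BITS = 1_000_000  # hypothetical number of bits observed on W2

# Model 1: 10000 bits more complex, but A checked all but 1000 of its bits.
model_1 = log2_score(extra_description_bits=10_000, unverified_bits=1_000)

# Model 2: simplest description, but A only had time to check 50 bits.
model_2 = log2_score(extra_description_bits=0, unverified_bits=TOTAL_BITS - 50)

prefers_model_1 = model_1 > model_2
```

Under this scoring rule Model 1 wins whenever Model 2 leaves over 10000 more bits unverified than Model 1 does, which is enough to cancel Model 1's complexity penalty.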
Why A probably fails anyway
In general, it is very fragile to expect that an AI will behave correctly because it fails to realize something. There are a number of obvious reasons that A might accept the reductionist hypothesis, even acknowledging the above argument.
First, an effective approximate inference module seems unlikely to make the mistake described in the last section. Making this mistake seems to correlate with other errors which may disrupt normal "sane" behavior: in general, given a model M which the agent cannot exactly simulate, uncertainty about M's outputs (even if M is known to produce mostly the same output, and never known to produce incorrect output) causes the model's probability to drop off exponentially with the number of unverified computations. It seems as though any "sane" AI should be able to assign some constant probability to the hypothesis that this model exactly captures the output, and thereby avoid considering other exponentially improbable hypotheses. This may relate to deeper concerns about approximate Solomonoff induction.
Second, A may be able to observe the operation of the hardware on which it is running. This will generally cause Model 1 to be much less probable: indeed, if A can observe even one "causal ancestor" of W2's value, it will no longer gain very much by believing Model 1 as such (since now Model 1 only produces the correct output if Model 2 did anyway--all of the relative advantage for Model 1 comes from occasions when A can observe the value of W2 without observing the operations directly responsible for that value, which may be rare). Of course there are more complicated models in which A's outputs control reality in more subtle ways, which may have better complexity. Understanding this issue completely depends on a much more detailed knowledge of A's approximate inference and the nature of A's observations. In general, however, being able to observe its own computation seems like it may be adequate to force A into a reductionist model.
Third, A's approximate inference module may be aware of the fact that A's own outputs are produced algorithmically (as a computational aid, not an underlying belief about reality). This would cause it to assign positive probability to the assertion "Model 2 is equivalent to Model 1," and eventually force it into a reductionist model.
Conclusion
Agents designed in the spirit of AIXI appear to be extremely fragile and vulnerable to the sort of existential despair described above. Progress on reflection is probably necessary not only to design an agent which refrains from killing itself when convenient, but even to design an agent which behaves coherently when embedded in the physical universe.
Having Useful Conversations
Holding conversations in person is useful; feedback is quick, and it seems to be much easier to change your behavior as a result of actually talking with people.
Having effective goal-oriented conversations is somewhat difficult. One source of difficulty is a strong tendency to stray from useful talk into entertaining talk. A typical example is the tendency of many (otherwise potentially productive) conversations between rationalists to simply wander into an extended dialog about the nature of existential risk or some interesting philosophical problem, and then stagnate there (potentially treading interesting new intelligence-demonstrating terrain, but not in point of fact getting anything done or refining beliefs in a meaningful way).
If this is what all participants want out of the conversation, then it's great that we've found a community where people can get their kicks in this particular abstruse way. If this is what some but not all participants want out of the conversation, then perhaps the conversation should divide or conclude. But conversations seem to get derailed--either for significant lengths of time, or indefinitely--even when participants honestly want to get things done, and view conversations with other rationalists as instruments to serve their values.
In the interest of getting things done, I (and Nick Tarleton and Michael Curzi, with the tiniest bit of testing) suggest that the rationalist community try really hard to adopt the following norm: when someone else is talking, and the conversation would be significantly better served by them stopping, let them know. Either point out that the topic is nice to think about but unhelpful, that the topic should be considered later rather than now, or whatever else the speaker seems to have failed to notice. To make adoption a little easier, it might help to choose one person in advance who will have some responsibility to arbitrate.
If a participant disagrees about the relevance of a remark, don't push it--our hope is that such a system could help people who have honestly wandered from the topic pursuing an interesting tangent or happy thought, not that it could resolve any actual dispute. If a participant doesn't want to adhere closely to any particular notion of usefulness--for example, if someone is having a conversation to simply enjoy themselves and unwind--then the conversing parties should resolve their misunderstanding, or, if that is not possible, simply stop talking to each other and save some time.
Have any LWers considered other lightweight measures to hold more useful conversations? There seems to be low-hanging fruit here, and there seems to be a lot to gain.
Not a Meetup May 22 in Cambridge, MA
I enjoyed some of the conversations at the last Cambridge (MA) meetup, particularly towards the end, but I will be in California for the next couple of Cambridge meetups (though I hope to meet some of the LW community there).
I spend a lot of my time sitting and working on my laptop; there is nothing particularly important about where I'm sitting, and being in an unfamiliar environment seems to make me more productive if anything.
Putting two and two together: I'm going to commit to being at Cosi's in Kendall square between 1pm and 3pm on May 22. Feel free to come by and talk; I'll stay longer if there is interesting conversation. If no one shows up, nothing lost.
I feel like it should be possible to share this sort of information (not just here, but in general) without adding formality. For example, the act of posting such an event to meetup.com feels like it adds some unwarranted legitimacy / officialness: no one showing up would feel like a loss, and it would feel like undermining the regular meetups. On the other hand, though I'm more comfortable posting to LW discussion, posting it here inconveniences more people than it should. Deliberating at length doesn't seem worth it, so I'll just ask: what would others do?