
Methodology of unbounded analysis

Written by Eliezer Yudkowsky, et al.

Summary

"Unbounded analysis" refers to determining the behavior of a computer program that would, to actually run, require an unphysically large amount of computing power, or sometimes hypercomputation. If we know how to solve a problem using unlimited computing power, but not with real-world computers, then we have an "unbounded solution" but no "bounded solution".

As a central example, consider computer chess. The first paper ever written on computer chess, by Claude Shannon in 1950, gave an unbounded solution for playing perfect chess by exhaustively searching the tree of all legal chess moves. (Since a chess game draws if no pawn is moved and no piece is captured for 50 moves, this is a finite search tree.) Shannon then passed to considering more bounded ways of playing imperfect chess, such as evaluating the worth of a midgame chess position by counting the balance of pieces, and searching a smaller tree up to midgame states. It wasn't until 47 years later, in 1997, that Deep Blue beat the world champion Garry Kasparov, and there were multiple new basic insights along the way, such as alpha-beta pruning.
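For illustration, here is a minimal sketch of the kind of unbounded solution Shannon described: exhaustive search of the full game tree. Tic-tac-toe stands in for chess so that the sketch is short and actually runnable; applied to chess, the identical procedure is the "unbounded solution" that no physical computer could execute. (The board representation and function names below are illustrative, not drawn from Shannon's paper.)

```python
# Exhaustive game-tree search: play perfectly by considering every continuation.
# Tic-tac-toe stands in for chess because its tree is small enough to exhaust.

def winner(board):
    """Return 'X' or 'O' if someone has three in a row, else None."""
    lines = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]
    for a, b, c in lines:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None

def minimax(board, player):
    """Exhaustively search all continuations; return (value, best_move).

    Value is +1 if 'X' can force a win, -1 if 'O' can, 0 for a forced draw.
    """
    won = winner(board)
    if won == 'X':
        return 1, None
    if won == 'O':
        return -1, None
    moves = [i for i, cell in enumerate(board) if cell == ' ']
    if not moves:
        return 0, None  # board full with no winner: draw
    best_value, best_move = None, None
    for move in moves:
        child = board[:move] + player + board[move + 1:]
        value, _ = minimax(child, 'O' if player == 'X' else 'X')
        better = (best_value is None or
                  (player == 'X' and value > best_value) or
                  (player == 'O' and value < best_value))
        if better:
            best_value, best_move = value, move
    return best_value, best_move

# Perfect play from the empty board is a forced draw:
print(minimax(' ' * 9, 'X'))  # -> (0, 0)
```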

In the early 19th century, there was a sensation called the Mechanical Turk, allegedly a chess-playing automaton. In 1836, Edgar Allan Poe, who was also an amateur magician, wrote an essay arguing that the Turk must contain a human operator hidden in the apparatus (which it did). Besides analyzing the Turk's outward appearance to locate the hidden compartment, Poe carefully argued as to why no arrangement of wheels and gears could ever play chess in the first place, explicitly comparing the Turk to "the calculating machine of Mr. Babbage":

Arithmetical or algebraical calculations are, from their very nature, fixed and determinate. Certain data being given, certain results necessarily and inevitably follow [...] But the case is widely different with the Chess-Player. With him there is no determinate progression. No one move in chess necessarily follows upon any one other. From no particular disposition of the men at one period of a game can we predicate their disposition at a different period [...] Now even granting that the movements of the Automaton Chess-Player were in themselves determinate, they would be necessarily interrupted and disarranged by the indeterminate will of his antagonist. There is then no analogy whatever between the operations of the Chess-Player, and those of the calculating machine of Mr. Babbage [...] It is quite certain that the operations of the Automaton are regulated by mind, and by nothing else. Indeed this matter is susceptible of a mathematical demonstration, a priori.

(In other words: In an algebraical problem, each step follows from the previous step of necessity, and therefore can be represented by the determinate motions of wheels and gears as in Charles Babbage's proposed computing engine. In chess, the player's move and opponent's move don't follow with necessity from the board position, and therefore can't be represented by deterministic gears.)

This is an amazingly sophisticated remark, considering the era. It even puts a finger on the part of chess that is computationally difficult, the combinatorial explosion of possible moves. And it is still entirely wrong.

Even if you know an unbounded solution to chess, you might still be 47 years away from a bounded solution. But if you can't state a program that solves the problem in principle, you are in some sense confused about the nature of the cognitive work needed to solve the problem. If you can't even solve a problem given infinite computing power, you definitely can't solve it using bounded computing power. (Imagine Poe trying to write a chess-playing program before he'd had the insight about search trees.)

We don't presently know how to write a Python program that would be a nice AI if we ran it on an unphysically large computer. Trying directly to cross this conceptual gap by carving off pieces of the problem and trying to devise unbounded solutions to them is "the methodology of unbounded analysis in AI alignment theory".

Since "bounded agent" has come to mean in general an agent that is realistic, the term "unbounded agent" also sometimes refers to agents that:

  • Perfectly know their environments.
  • Can fully simulate their environments (the agent is larger than its environment and/or the environment is very simple).
  • Operate in turn-based, discrete time (the environment waits for the agent to compute its move).
  • Are Cartesian, i.e., perfectly separated from the environment except for sensory inputs and motor outputs.

These are two importantly distinct senses of "unbounded", and we'll refer to the above list of properties as "unrealism" to distinguish them here. Sufficiently unrealistic setups are also called "toy models" or "toy problems".

Unrealistic setups have disadvantages, most notably that the results, observations, and solutions may fail to generalize to realistic applications. Correspondingly, the classic pitfall of unbounded analysis is that it's impossible to run the code, which means that certain types of conceptual and empirical errors are more likely to go uncaught (see below).

There are nonetheless several forces pushing technical work in value alignment theory toward unbounded analyses and unrealistic setups, at least for now:

  1. Attacking confusion in the simplest settings. If we don't know how to solve a problem given unlimited computing power, this means we're confused about the nature of the work to be performed. It is sometimes worthwhile to tackle this conceptual confusion in a setup that tries to expose the confusing part of the problem as simply as possible. Trying to bound the proposed solution or make it realistic can introduce a lot of complications into this discussion, arguably unnecessary ones. (Deep Blue was far more complicated than Shannon's ideal chess program, and it wouldn't be doing Edgar Allan Poe any favors to show him Deep Blue's code and hide away Shannon's ideal outline.)
  2. Unambiguous consequences and communication. Introducing 'realistic' complications can make it difficult to engage in cooperative discourse about which ideas have which consequences. (AIXI was a watershed moment for alignment theory because AIXI was formally specified and there was no way to say "Oh, I didn't mean that" when somebody pointed out that AIXI would try to seize control of its own reward channel.)
  3. More advanced agents might be less idiosyncratic. Increasing the cognitive power of an agent may sometimes move its behavior closer to ideals and further from specific complications of an algorithm. (From the perspective of a human onlooker, early chess algorithms seemed to have weird idiosyncrasies tied to their specific algorithms. Modern chess programs can from an intuitive human perspective be seen as just making good chess moves.)
  4. Runnable toy models often aren't a good fit for advanced-agent scenarios. Current AI algorithms are often not good natural fits for demonstrating certain phenomena that seem like they might predictably occur in sufficiently advanced AIs. Usually it's a lot of work to carve off a piece of such a phenomenon and pose it as an unbounded problem. But that can still be significantly easier to fit into an unbounded setting than into a toy model. (All examples here are too complicated for one sentence, but see the subsection below and the discussion of Utility indifference and Tiling agents theory.)

Even to the extent that these are good reasons, standard pitfalls of unbounded analyses and unrealistic setups still apply, and some of our collective and individual precautions against them are discussed below.

For historical reasons, computer scientists are sometimes suspicious of unbounded or unrealistic analyses, and may wonder aloud if they reflect unwillingness or incapacity to do a particular kind of work associated with real-world code. For discussion of this point see Why MIRI uses unbounded analysis.

Attacking confusion in the simplest settings.

Imagine somebody claiming that an ideal chess program ought to evaluate the ideal goodness of each move, and giving their philosophical analysis in terms of a chess agent which knows the perfect goodness of each move, without ever giving an effective specification of what determines the ideal goodness, but using a Greek letter to symbolize it so that the paper looks very formal. We can imagine an early programmer sitting down to write a chess-playing program, crafting the parts that take in user-specified moves and the part that displays the chess screen, using random numbers for move goodness until they get around to writing the "move-goodness determining module", and then finally finding themselves being unable to complete the program; at which point they finally recognize that all the talk of "the ideal goodness of a chess move" wasn't an effective specification.
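To make that failure concrete, here is a sketch of the hypothetical program just described; every name in it is illustrative rather than drawn from any real chess engine. The scaffolding can all be written, but the one module the "philosophical analysis" was supposed to supply cannot be, which is exactly where the programmer discovers that the analysis was never an effective specification:

```python
# A sketch of the hypothetical philosopher-programmer's chess player.
# Everything around the edges can be written; the one module the analysis
# was supposed to supply cannot be.

import random

def ideal_move_goodness(position, move):
    # Placeholder. The "philosophical analysis" says this quantity exists and
    # is ideal, but gives no effective procedure for computing it -- so the
    # only thing we can actually type here is a random number.
    return random.random()

def choose_move(position, legal_moves):
    """Pick the move with the highest 'ideal goodness'."""
    return max(legal_moves, key=lambda m: ideal_move_goodness(position, m))

# With the stub above, this "ideal" player just moves at random -- the point
# at which the programmer notices the missing effective specification.
print(choose_move("start", ["e4", "d4", "Nf3"]))
```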

Part of the standard virtue ethics of computer science includes an injunction to write code in order to force this kind of grad student to realize that they don't know how to effectively specify something, even if they symbolized it using a Greek letter. But at least this kind of ineffectiveness seems to be something that some people can learn to detect without actually running the code - consider that, in our example above, the philosopher-programmer realized that they didn't know how to compute at the point where they couldn't complete that part of the program, not at a point where the program ran and failed. Adding on all the code to take in user moves and display a chess board on the screen only added to the amount of time required to come to this realization; once they know what being unable to write code feels like, they might be able to come to the same realization much faster by standing in front of a whiteboard and failing to write pseudocode.

At this point, it sometimes makes sense to step back and try to say exactly what you don't know how to solve - try to crisply state what it is that you want an unbounded solution for. Sometimes you can't even do that much, and then you may actually have to spend some time thinking 'philosophically' - the sort of stage where you talk to yourself about some mysterious ideal quantity of move-goodness and you try to pin down what its properties might be. It's important not to operate in this stage under the delusion that your move-goodness is a well-specified concept; it's a symbol to represent something that you're confused about. By asking for an unbounded solution, or even an effectively-specified representation of the unbounded problem, that is, asking for pseudo-code which could be run given nothing more than an unphysically large computer (but no otherwise-unspecified Oracle of Move Goodness that outputs it), we're trying to invoke the "Can I write code yet?" test on whatever philosophical reasoning we're doing.

Can trying to write running code help at this stage? Yes, depending on how easy it is to write small programs that naturally represent the structure of the problem you're confused about how to solve. Edgar Allan Poe might have been willing to concede that he could conceive of deterministic gears that would determine whether a proposed chess move was legal or illegal, and it's possible that if he'd actually tried to build that automaton and then tried to layer gears on top of that to pick out particular legal moves by whatever means, he might have started to think that maybe chess was computable after all - maybe even have hit upon a representation of search, among other possible things that gears could do, and so realized how in principle the problem could be solved. But this adds a delay and a cost to build the automaton and try out variations of it, and complications from trying to stay within the allowed number of gears; and it's not obvious that there can't possibly be any faster way to hit upon the idea of game-tree search, say, by trying to write pseudocode or formulas on a whiteboard, thinking only about the core structure of the game. If we ask how it is that Shannon had an easier time coming up with the unbounded solution (understanding the nature of the work to be performed) than Poe, the most obvious cause would be the intervening work by Church and Turing (among others) on the nature of computation.

And then in other cases it's not obvious at all how you could well-represent a problem using current AI algorithms, but with enough work you can figure out how to represent the problem in an unbounded setting.

The pitfall of simplifying away the key, confusing part of the problem.

The tiling agents problem, in the rocket alignment metaphor, is the metaphorical equivalent of trying to fire a cannonball around a perfectly spherical Earth with no atmosphere - to obtain an idea of how any kind of "stable orbit" can work at all. Nonmetaphorically, the given problem is to exhibit the simplest nontrivial case of stable self-modification - an agent that, given its current reasoning, wants to create an agent with similar properties as a successor, i.e., preserving its current goals.

In a perfect world, we'd be able to, with no risk, fire up running code for AI algorithms that reason freely about self-modification and have justified beliefs about how alternative versions of their own code will behave and the outer consequences of that behavior (the way you might imagine what would happen in the real world if you took a particular drug affecting your cognition). But this is way, way beyond current AI algorithms to represent in any sort of natural or realistic way.

A bad "unbounded solution" would be to suppose agents that could exactly simulate their successors, determine exactly what their successors would think, extrapolate the exact environmental consequences, and evaluate those. If you suppose an exact simulation ability, you don't need to consider how the agent would reason using generalizations, abstractions, or probabilities... but this trivial solution doesn't feel like it sheds any light or gets us any closer to understanding reflective stability; that is, it feels like the key part of the problem has been simplified away and solving what remains was too easy and didn't help.

Faced with an "unbounded solution" you don't like, the next step is to say crisply exactly what is wrong with it in the form of a new desideratum for your solution. In this case, our reply would be that for Agent 1 to exactly simulate Agent 2, Agent 1 must be larger than Agent 2, and since we want to model stable self-modification, we can't introduce a requirement that Agent 2 be strictly weaker than Agent 1. More generally, we apply the insight of Vinge_principle to this observation and arrive at the desiderata of Vinge_uncertainty and Vinge_reflection, which we also demand that an unbounded solution exhibit.

This illustrates one of the potential general failure modes of using an unbounded setup to shed conceptual light on a confusing problem - namely, when you simplify away the key confusing issue you wanted to resolve, and end up with a trivial-seeming solution that sheds no light on the original problem. A chief sign of this is when your paper is too easy to write. The next action is to say exactly what you simplified away, and put it in the form of a new desideratum, and try to say exactly why your best current attempts can't meet that desideratum.

So we've now further added: we want the agent to generalize over possible exact actions and behaviors of its successor rather than needing to know its successor's exact actions in order to approve building it.

With this new desideratum in hand, there's now another obvious unbounded model: consider deterministic agents operating in environments with known rules that reason about possible designs and the environment using first-order logic. The agent then uses an unbounded proof search, which no current AI algorithm could tackle in reasonable time (albeit a human engineer would be able to do it with a bunch of painstaking work) to arrive at justified logical beliefs about the effect of its successor on its environment. This is certainly still extremely unrealistic; but has this again simplified away all the interesting parts of the problem? In this case we can definitely reply, "No, it does expose something confusing" since we don't in fact know how to build a tiling agent under this setup. It may not capture all the confusing or interesting parts of the problem, but it seems to expose at least one confusion. Even if, as seems quite possible, it's introduced some new problem that's an artifact of the logical setup and wouldn't apply to agents doing probabilistic reasoning, there's then a relatively crisp challenge of, "Okay, show me how probabilistic reasoning resolves the problem, then."
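The sketch below is a toy illustration of the shape of that criterion, not the actual tiling-agents formalism: "approve building the successor only if your axioms prove that doing so leads to the goal." Here the "proof search" is just brute-force propositional entailment over a few named atoms invented for the example, which cannot exhibit the Löbian obstacles that make the real first-order version hard; it only shows what kind of object the criterion is.

```python
# Toy illustration of the criterion "build the successor only if your axioms
# prove that it leads to the goal". Proof search here is brute-force
# propositional entailment; the real setting uses first-order logic and runs
# into Löbian obstacles that this toy cannot exhibit.

from itertools import product

def entails(axioms, conclusion, atoms):
    """True iff every truth assignment satisfying all axioms satisfies the conclusion."""
    for values in product([False, True], repeat=len(atoms)):
        world = dict(zip(atoms, values))
        if all(ax(world) for ax in axioms) and not conclusion(world):
            return False
    return True

# Atoms (invented for the example): B = "successor is built",
# A = "successor takes the critical action", G = "the goal is achieved".
atoms = ["B", "A", "G"]
axioms = [
    lambda w: (not w["B"]) or w["A"],   # if built, the successor acts
    lambda w: (not w["A"]) or w["G"],   # if it acts, the goal is achieved
]

# The agent approves building the successor only if the axioms entail B -> G.
approve = entails(axioms, lambda w: (not w["B"]) or w["G"], atoms)
print(approve)  # -> True
```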

It's not obvious that there's anything further to be gained by trying to create a toy model of the problem, or a toy model of the best current unsatisfactory partial solution, that could run as code with some further cheating and demo-rigging, but this is being tried anyway. The tiling agents problem did spend roughly nine years exclusively on paper before that, and the best current unsatisfactory solution was arrived at with whiteboards.

The pitfall of residual terms.

Besides "simplifying away the confusing part of the problem", another way that unbounded thinking can "bounce off" a confusing problem is by creating a residual term that encapsulates the confusion. Currently, there are good unbounded specifications for Cartesian non-self-modifying expected reward maximizers: if we allow the agent to use unlimited computing power, don't allow the environment to have unlimited computing power, don't ask the agent to modify itself, separate the agent from its environment by an impermeable barrier through which only sensory information and motor outputs can pass, and then ask the agent to maximize a sensory reward signal, there's a simple Python program which is expected to behave superintelligently given sufficient computing power. If we then introduce permeability into the Cartesian boundary and allow for the possibility that the agent can take drugs or drop an anvil on its own head, nobody has an unbounded solution to that problem any more.

So one way of bouncing off that problem is to say, "Oh, well, my agent calculates the effect of its motor actions on the environment and the expected effect on sensory information and the reward signal, plus a residual term which stands for the expected utility of all effects of the agent's actions that change the agent's processing or destroy its hardware". How is this residual term to be computed? This is left unsaid.

In this case you haven't omitted the confusing part of the problem, but you've packed it into a residual term you can't give an effective specification for calculating. So you no longer have an unbounded solution - you can't write down the Python program that runs given unlimited computing power - and you've probably failed to shed any important light on the confusing part of the problem. Again, one of the warning signs here is that the paper is very easy to write, and reading it does not make the key problem feel less like a hard opaque ball.

Introducing realistic complications can make it hard to build collective discourse about which ideas have which consequences.

One of the watershed moments in the history of AI alignment theory was Marcus Hutter's proposal for AIXI, not just because it was the first time anyone had put together a complete specification of an unbounded agent (in a Cartesian setting), but also because it was the first time that non-value-alignment could be pointed out in a completely pinned-down way.

(work in progress)

[todo: turn the rest of this into a finished doc, or delete it.

To the extent that "standing in front of a whiteboard trying to think of pseudocode" does work better to clear up conceptual confusion, it will be because this stage of the problem could iterate through more possible pieces of pseudocode, in a context that exposed the underlying problem as simply as possible without any added complications from demanding bounded solutions or realism.

Once you realize you currently can't write the code, what's the next step? Depending on how easy it is to build small programs that naturally represent the problem setup - and in the case of chess, this shouldn't be hard, the hard part is generating good moves, not representing the game of chess in running code - then it might make sense to take exploratory stabs at trying to compute something that makes the program play good chess.

What if you can't even represent the problem setup naturally in running code? Then

What about after you realize that you currently can't write code?

Doing unbounded analysis implies that you think your current state of understanding has basic confusions to be cleared up. If at this point somebody tried to write a paper about an algorithm that would, given hypercomputation, pilot a robotic car on an ideal 2-dimensional surface, it would be a legitimate reply to say "But we have actual running code for robotic cars, nor are we confused about how to do this; why do an unbounded analysis when we already have bounded ones?" In terms of the rocket alignment metaphor, spending a lot of time trying to figure out how to fire a cannonball so that it circles a perfectly spherical Earth and returns to its exact starting point, corresponds to a suspicion that you currently lack the mathematical language to even talk about landing a rocket on the Moon. Someone who expects that current vehicles only require one or two tweaks to get to the Moon, or that they can be piloted there by keeping the Moon in the viewport and steering intuitively, is more likely to object "Why are you talking about cannonballs, instead of the vehicles that reach the highest altitudes today?" The closer you think current algorithms are to supporting a Friendly AI, the less sympathetic you might be to the suggestion that very basic foundational work needs to be done and might need to be done in a simplified setting.

Sitting down to write code can hopefully force you to recognize that your analysis is still confused, that to talk about "ideal goodness of a chess move" is not yet an effective specification of an ideal chessplayer, because you don't know how to compute the ideal goodness of a move.

You can also, arguably, save yourself a lot of time and code that goes nowhere by learning to recognize that you can't write down "ideal goodness of a chess move" as an effective formula, and discussing the problem with friends using a whiteboard, instead of spending a long time writing every part of a chess-playing program except the "compute the goodness of a move" part and only then being forced to recognize that you can't complete the program.

The obvious reason why

Some early cautionary tales from the history of AI deal with toy programs (unrealistic agents) that failed to generalize well to the real world. Even so, an unrealistic toy model is superior in some ways to an unbounded model, because the code can be run and the results observed. In the absence of running code, it's possible to make outright cognitive errors about whether you've described an intuitive setup using a mathematical statement - failing to give an effective specification, or giving an effective specification that doesn't match the intuitive description. But this doesn't mean that code is a panacea; especially when creating a "toy model", it's possible that the toy setup omits key properties of interest. Depending on the person and the circumstances, a toy model might be extra work and not bring any extra enlightenment over writing down the key properties in a formula and thinking about the formula.

In the case of value alignment theory, much work does take place in a context of unlimited computing power and usually other unrealistic properties as well. The subsections below deal with some of the reasons for this and their associated pitfalls.

Unbounded agent properties

Much of the current technical work in value alignment theory takes place on whiteboards, and tries to analyze programs or mathematical structures that could only run on very large finite computers or hypercomputers.

Much of the current technical work in value alignment theory takes place against a background of simplifying assumptions such as:

  • Unlimited finite computing power, or even hypercomputation.
  • Perfect knowledge of the environment.
  • Deterministic environments that are smaller than the agent and can be fully simulated.
  • Environments separated from the agent by an impermeable Cartesian boundary.
  • Turn-based discrete time (the environment waits for the agent to make a move, rather than proceeding in continuous real time).
  • Common knowledge of each other's code in multi-agent dilemmas.
  • Predefined action sets.

The core tradeoff in unbounded analysis is that these properties can make a setup (much) easier to analyze and for human beings to reason about, at the expense of realism. Unrealism means a risk of having simplified away a key part of the problem, maybe even the only really interesting part of the problem, and of obtaining results that fail to apply to the real world.

Introduction: Claude Shannon, Deep Blue, and Edgar Allan Poe

In 1950, the very first paper ever written on computer chess, by Claude Shannon, gave in passing the algorithm that would play perfect chess given unlimited computing power, and then went on to consider more practical algorithms involving small search trees and local position evaluation. It then took an additional 47 years until Deep Blue beat the world chess champion Garry Kasparov. Knowing how to do something in principle doesn't always mean you can do it in practice without additional hard work and insights like, e.g., the alpha-beta pruning algorithm for chess.

Nonetheless, if you don't know how to play chess using unlimited computing power, you definitely can't play chess using limited computing power.

In the early 19th century, there was a minor sensation called the Mechanical Turk, allegedly a chess-playing automaton. Today, understanding the computational requirements of chess, we know immediately that there's no way to do that using a cabinet full of gears, but in the early 19th century this wasn't as obvious. In 1836, Edgar Allan Poe, who was also an amateur magician of some repute, wrote an essay arguing that the Mechanical Turk was human-operated. Poe compares the Turk to the Duck of Vaucanson, which quacked, moved almost naturally, and seemed to digest food, acknowledging that there have indeed been wondrous automata. But Poe insists that the Turk cannot be one of them, and that if it were, it would be the greatest wonder ever constructed by mankind. Poe writes:

But if these machines were ingenious, what shall we think of the calculating machine of Mr. Babbage? What shall we think of an engine of wood and metal which can not only compute astronomical and navigation tables to any given extent, but render the exactitude of its operations mathematically certain through its power of correcting its possible errors? What shall we think of a machine which can not only accomplish all this, but actually print off its elaborate results, when obtained, without the slightest intervention of the intellect of man? [...] Arithmetical or algebraical calculations are, from their very nature, fixed and determinate. Certain data being given, certain results necessarily and inevitably follow [...] But the case is widely different with the Chess-Player. With him there is no determinate progression. No one move in chess necessarily follows upon any one other. From no particular disposition of the men at one period of a game can we predicate their disposition at a different period [...] Now even granting that the movements of the Automaton Chess-Player were in themselves determinate, they would be necessarily interrupted and disarranged by the indeterminate will of his antagonist. There is then no analogy whatever between the operations of the Chess-Player, and those of the calculating machine of Mr. Babbage [...] It is quite certain that the operations of the Automaton are regulated by mind, and by nothing else. Indeed this matter is susceptible of a mathematical demonstration, a priori.

In other words: In an algebraical problem, each step follows from the previous step of necessity, and therefore can be represented by the determinate motions of wheels and gears as in Charles Babbage's proposed computing engine; while in chess, the player's move and opponent's move don't follow with necessity from the board position, and therefore require Mind to analyze rather than deterministic gears.

This is an amazingly sophisticated remark, considering the era. It even puts a finger on the part of chess that is computationally difficult, the combinatorial explosion of possible moves. It's still entirely wrong.

(The rest of Poe's essay is then devoted to dissecting the appearance of the Turk and arguing about where the human operator is concealed inside it.)

As much as Claude Shannon in 1950 was a long way from all the additional insights required to defeat the human champion in 1997 using limited and realistic amounts of computing power, he was also a long way ahead of Edgar Allan Poe, who lacked the key insight of 'search trees' that enables a chess game to be represented in deterministic transistors.

Between Poe in 1836 and Shannon in 1950 there was a key increase in the understanding of computer chess, represented by the intermediate work of Turing, Church, and others. Shannon's 1950 paper doesn't belabor his unbounded solution to chess that would explore the whole search tree - he seems to consider it mostly obvious, and gives it as only a preliminary to his suggestions for playing chess using far more limited calculations. Unbounded understanding often seems easy once you have it. But this leaves aside the incredible difficulty of writing a bounded program when you don't even have an unbounded one. Inability to state the unbounded solution to your problem often indicates that you are confused about the nature of the computational work to be performed. We can imagine someone who lacks the insight of search trees, trying to write a program that plays chess with no better definition of "good chess move" than one that leads to center control, captures pieces, sets up forks, leads to a well-defended king, attacks the opponent's king, etcetera, as if all of these considerations were equal parts of some undefined and indefinable "move goodness" property...

The core idea behind doing unbounded analysis in value alignment theory is that we are presently confused about the question "How would you write a Python program that would implement a nice AI if you could run it on a computer much larger than the universe?" and that there are subproblems where we can expose the confusion by trying to crisply pose an important subproblem that nobody seems to know how to solve using unbounded computing power, then trying to come up with unbounded solutions and analyzing those. (It's important, in this case, to slice off much smaller pieces of the confusion than, "Write a program that would be a nice AI.")

(If you think you already know the answer to this, you should probably be reading up on "why this problem is hard" pages rather than

The fact that nobody can answer this question - even though there are unbounded solutions to most of AI and AGI (e.g. AIXI) - illustrates that

If you ask yourself how to write a Python program that would be a nice AI if we could run it on a computer much larger than the universe, you might notice a "stubbing the mind's toe" feeling of trying to write a program when you're confused about the nature of the computations you need to perform.

One possible research avenue is to directly tackle that ignorance, as such, and try directly to figure out how to crisply model some aspects of the problem given unlimited computing power and other, even more dangerous simplifying assumptions. It's an approach that has well-known pitfalls, but it's arguably among the best options we have today. Some of the considerations pushing value alignment theory in the direction of unbounded analyses can be briefly summarized as follows:

  1. If we don't know how to solve a problem even given unbounded computation, that means we're confused about the nature of the work to be performed. It is often worthwhile to tackle this basic conceptual confusion in a setup that exposes the confusing part of the problem as simply as possible, and doesn't introduce further and distracting complications until the core issue is resolved.
  2. Introducing 'realistic' complications can make it difficult to engage in cooperative discourse about which ideas have which consequences - AIXI was a watershed moment because it was formally specified and there was no way to say "Oh, I didn't mean that" when somebody pointed out that AIXI would try to seize control of its own reward channel.
  3. Increasing the intelligence of an advanced agent may sometimes move its behavior closer to ideals and further from specific complications of an algorithm. Early, stupid chess algorithms had what seemed to humans like weird idiosyncrasies tied to their specific algorithms. Modern chess programs, far beyond the human champions, can from an intuitive human perspective be seen as just making good chess moves.
  4. Current AI algorithms are often incapable of demonstrating future phenomena that seem like they should predictably occur later, and whose interesting properties seem like they can be described using an unbounded algorithm as an example. E.g. convergent instrumental strategies arise from consequentialist agents reasoning about difficult domains like "My programmers have beliefs about whether I'll achieve their goals" and "What happens if they decide to press my shutdown button?", which is very far from something that naturally arises in current AI algorithms; but we can still anticipate this future issue using notions like economic agency, and try out unbounded problem formulations and attempted solutions like Utility indifference.


Danger of simplifying away an important part of the problem.

The pitfall is when an unbounded analysis simplifies away all the important parts of the problem and sheds no light on what remains.

For example, the point of the Vingean reflection program and tiling agents theory is that we should not assume that agents can exactly model their own successors or self-modifications. Even if we say that we can assume unlimited finite computing power, we should still assume that the future version of the agent has more computing power than the past version (or has seen new sensory data) and therefore the current self can't exactly predict the future self. The core of the problem is to obtain trust in an agent under conditions of Vingean uncertainty where we can't know exactly what a smarter agent will do because to know exactly what it would do we would need to be at least that smart ourselves. To make the 'unbounded' assumption that we could exactly predict what that agent will do, and judge exactly what consequence would follow, would assume away the central confusing core of the problem. But allowing for computers bigger than the universe, where the other agent has an even bigger computer, does not assume away this confusing core. Or at least, so long as we don't know how to solve even this unbounded form of the problem, it's safe to say that this problem formulation exposes some key confusion.

Unbounded analysis doesn't always make a problem easier to think about. If you can get your algorithm to run on real computers, it becomes possible to observe that your results are correct as well as proving them, and to observe the outcomes in cases that don't have closed-form solutions. (Though these observations may still be untrustworthy if there's a bug in the code, meaning your setup was not what you thought.) This implies a sharp dropoff in cognitive tractability at the boundary where you can no longer run your analysis as code. When it was realized that Patrick LaVictoire's modal decision theory could be run in polynomial time, it was then possible to test possible modal agents against each other much more rapidly than when they needed to be written out on a whiteboard and analyzed by argument.

However, when a scenario doesn't easily fit into modern AI algorithms, simplifying the representation of the problem down to where it runs on modern computers can also destroy the core or interesting part of the problem. Consider the question: "Will this Utility indifference setup cause a shutdown button to be preserved even as the AI modifies its own code?" The flaw with the original Utility indifference proposal turned out to be that the AI would act in all ways as if the shutdown button had probability zero of being pressed, and hence, while it wouldn't try to prevent the button being pressed, it would also probably prune away the "dead code" for the shutdown button the next time it executed a self-modification. There have been some preliminary attempts to model shutdown-button setups (agents that do or don't try to prevent other agents from interfering) using observable running code, but that particular failure mode would not have come close to showing up in any of those setups. Early AI may have been hurt by a cultural insistence that only running code counted (as opposed to another cultural tendency to do unbounded analysis that assumed away everything important). Relative to the computers and algorithms available in, say, 1970, insisting on running code meant that everything had to be insanely simplified in order to be spoken about at all.

Modern AI algorithms are better than in 1970, but there's still no easy or natural expression for a number of important-seeming scenarios about sufficiently advanced AIs that use Consequentialist cognition to model the real world, their programmers, and their own source code. We often can't naturally, faithfully, realistically reproduce these scenarios using current algorithms; in some cases, the key issue is only supposed to show up when the AI is smarter than us. Trying to simplify those scenarios down to where toy models of them can be run as modern code encounters some of the same pitfalls as trying to do unbounded analysis of them.

Simplifying for running code and simplifying for an unbounded analysis aren't mutually exclusive. Sometimes (as with utility indifference and shutdown buttons) there's different groups pursuing both approaches at once. (And of course the goal of any unbounded analysis is to reintroduce the complications as understanding improves, including realistic limits on computing power, until there are algorithms that can actually be run. This process almost always requires new basic insights and may take 47 years (or not), but it remains the goal.)

Benefits of simplification

The benefit of doing an unbounded analysis is often also straightforward: it lets you expose a core conceptual point in a setting simple enough that you can think about it successfully. (This also has a corrupted version where you write down some simple formulas and congratulate yourself for having illustrated some key conceptual point that is obvious, useless, or wrong. A key sign that this corrupted version is not happening is when you can pose a problem that has no known unbounded solution, and it hangs around for a while being confusing, and writing a paper on it isn't easy.)

Unbounded analysis in computer science

You can also obtain confident false beliefs from running code, not just from theoretical analyses.

As background for the state of mind in modern computer science, it should be remembered that individual computer scientists may sometimes be tempted to exaggerate how much their algorithm solves or how realistic it is - while not everyone gives in to that temptation, at least some computer scientists do so at least some of the time. This means that there are a number of 'unbounded' analyses floating around which don't solve the real-world problem, may not even shed very much light on the real-world problem, and whose authors praise them as nearly complete solutions but for some drudge-work of mere computation. A reasonable reaction is to be suspicious of unbounded analyses, and scrutinize them closely to make sure that they haven't assumed away the conceptual core of the problem - if you try to model image recognition and you assume that all the images are 2 pixels by 2 pixels, you haven't simplified the problem, you've eliminated it and shed no useful light for people trying to solve the realistic problem.

This is a real problem. At MIRI, we could indeed, if asked in private, point to several researchers in near-neighboring fields who we think exaggerated what their unbounded algorithms did - e.g., "finds the fastest provably correct algorithm after longer than the age of the universe" becomes "finds the fastest algorithm" in the researcher's informal description. Or when the entire interesting core question is what to do about events that cross the boundary between the AI's interior and exterior, like dropping an anvil on your own head, this event is designated using a particular Greek letter and it's assumed that its expected utility is already known. In this case, formalization hasn't shed light on the core conceptual confusion, but just swept that conceptual confusion under a rug.

Again, as a matter of cultural background in computer science, you can see some researchers trying to play up the loftier, more mathematical nature of conceptual analyses as showing off more intellectual power. And you can find a reaction to that particular game in the form of computer scientists who say that all the real work, all the useful work, is done with real computers and algorithms that run well today in the messy real world. This isn't entirely wrong either, and it's truer in some parts of computer science than others - we are well past the point where it makes sense to analyze a robotic car using anything resembling first-order logic; with good realistic algorithms that do all the key work, the time for unrealistic algorithms may well have passed.

If, on the other hand, you're in a part of computer science that's still living in the Edgar Allan Poe era of "Literally nobody knows how to write down a program that solves the problem even given unlimited computing power and every other possible simplifying assumption that doesn't eliminate the problem entirely", then you are perhaps not in the sort of realm where the mathematicians have nothing left to do - or so MIRI would argue, at any rate. It would be even better if we had bounded solutions, to be sure, but to criticize confusion-reducing work on the grounds that it fails to solve the entire and realistic problem would be the nirvana fallacy when no more properly realistic solution as yet exists.

This doesn't mean that doing unbounded analysis will be pitfall-free, or that you can make any possible simplifying assumption without destroying the core of the problem. The whole point of the Vingean reflection program and tiling agents theory is that, for example, we should not assume that agents can exactly model their own successors or self-modifications - even if we say that we can assume unlimited computing power, we should still assume that the future version of the agent has more computing power than the past version (or has seen new sensory data) and therefore can't be predicted exactly in advance. The interesting problem is how to obtain abstract trust in an agent under conditions of Vingean uncertainty where we can't know exactly what a smarter agent will do because to know exactly what it would do we would need to be at least that smart ourselves. To make the 'unbounded' assumption that we could exactly predict what that agent will do, and judge exactly what consequence would follow, would assume away the central confusing core of the problem. But allowing for very large computers bigger than the universe, where the other agent has an even bigger computer, does not assume away this confusing core; or at least, so long as we don't know how to solve even this unbounded form of the problem, it's safe to say that this problem formulation exposes some key confusion.

Given the modern social landscape of computer science, it does happen that debates about appropriate levels of abstraction can take on the heated, personal nature of, say, Republican-Democratic national politics in the United States. For someone who has taken on the identity of opposing the folly of over-abstraction, there will be a tempting ad-hominem available to depict those airy folk who only write Greek letters and never try their ideas in running code, as being ignorant of the pitfalls of Greek letters, ignorant of the advantages of running code, believing that all realistic programs are mere drudge-work, etcetera. And while not everyone gives in to the temptation of that ad-hominem, some people do, just like those who give in to the temptation to exaggerate the power of unbounded analyses, and there is a well-known selection effect where debates can be dominated by the speaking volume of those who give in the most to such temptations.

Understanding the pitfalls of Greek letters and the benefits of running code is certainly part of being good at unbounded analysis and doing the right kind of unbounded analysis. It was great when we realized that Patrick LaVictoire's modal agents formalism could be run in quadratic time and that we could just write down the agents and run them. It's great when you can give a lecture and take audience suggestions for new modal agents during Q&A and pit them against each other as fast as you can type them. It's helpful to further theoretical development for all the obvious reasons: it's easier to make a mistake on a whiteboard than in a computer, and it can be just faster to try ideas as rapidly as you can type them into a prompt, gaining an inexpensive concrete intuition for the field.

That said, as fun as it was to do that with modal agents, and as much as we'd love to be able to do it for everything (except paperclip maximizers, we'd rather NOT have running code of those), there's a lot of places we can't do that yet. And even when life isn't that convenient, it's sometimes possible to make progress anyway. We'd never have gotten to the point of running code for modal agents in the first place, if we hadn't spent years before then pushing on the whiteboard stages of logical decision theory.

And this doesn't seem like a very unusual state of affairs for computer science as a whole, either. The historical reality is that there often is a lot of whiteboard scribbling that comes first - not always, but sometimes, and not nearly rarely enough for it to be unusual.

Even so, in technical work on value alignment, the relative proportion of whiteboard scribbling to running code is unusual for modern computer science. This doesn't happen randomly or because people who work on value alignment are afraid of computers (yes, there are researchers in the field who've worked with real-world machine learning code, etcetera); it happens because there are systematic reasons pushing in that direction.

Four of those reasons were briefly summarized above, and five of them will be considered at greater length below.

Reasons for the prevalence of unbounded analysis in value alignment theory

Unbounded analysis is often the most practical way to expose and solve conceptual confusions.

(writing in progress)

That said, if a version of Vingean reflection using unrealistic logical agents in unrealistic logical worlds does seem to expose a core problem that we don't know how to solve using unlimited computing power, it may make sense to ask, explicitly, "Okay, how would I solve the simplest version of this scenario that seems to expose the core problem?" rather than launching right into adding all the complications of probabilistic reasoning, realistic computational bounds, action in realtime, etcetera.

Similarly, in modern AI and especially in value alignment theory, there's a sharp divide between problems we know how to solve using unlimited computing power, and problems which are confusing enough that we can't even state the simple program that would solve them given a larger-than-the-universe computer. It is an alarming aspect of the current state of affairs that we know how to build a non-value-aligned hostile AI using unbounded computing power but not how to build a nice AI using unlimited computing power. The unbounded analysis program in value alignment theory centers on crossing this gap.

Besides the role of mathematization in producing conceptual understanding, it also helps build cooperative discourse. Marcus Hutter's AIXI marks not only the first time that somebody described a short Python program that would be a strong AI if run on a hypercomputer, but also the first time that anyone was able to point to a formal specification and say "And that is why this AI design would wipe out the human species and turn all matter within its reach into configurations of insignificant value" and the reply wasn't just "Oh, well, of course that's not what I meant" because the specification of AIXI was fully nailed down. While not every key issue in value alignment theory is formalizable, there's a strong drive toward formalism not just for conceptual clarity but also to sustain a cooperative process of building up collective debate and understanding of ideas.

A facile way of putting it would be that we're often trying to consider phenomena that just don't happen in weak modern AI algorithms, but seem like they should predictably happen with more advanced algorithms later, and we can try to model those future phenomena by introducing unlimited computing power. A more precise and still facile way of putting it would be that philanthropic work in this sector often consists in trying to identify problems that seem unlikely to be solved in the course of addressing immediate, commercial pressures to get today's AI running today. If you can exhibit a problem in running code of today, or if you expect it to appear in running code of 10 years later but still before AIs are smart enough to produce pivotal catastrophes, then it's likely to be the kind of problem that enormous well-funded commercial projects will solve incidentally in the course of pursuing their own immediate incentives. The dangerous problems are those that spring up for the first time with very smart AIs, or the kind of problem where early cheap patches fail in very smart AIs - they're dangerous precisely because immediate incentives don't force people to solve them unless they've used foresight and seen the bullet coming before any bullet hits. A lot of the pressure towards running code comes from the social pressure to deliver immediate benefits and the epistemic desirability of obtaining lots of definite observations; but where current running code seems sufficient to expose and solve a problem in value alignment theory, that's probably not where a philanthropic group should be focusing its own work.

]

  • more detailed history of Shannon and Poe
  • (diagnosis) nirvana fallacy applied to unbounded analyses, past exaggerations leading to widespread disrespect
  • unbounded analysis makes claims precise enough to be critiqued and enables the discourse to progress
  • AIXI is the central example of an unbounded agent, and often also demarcates the boundary between 'straightforward' and 'confusing' problems in modern AI
  • bounded agent assumptions still hold in real life and should clearly be marked as being violated - in particular we still need to be wary of 'unbounded analyses' that avert the central problem
  • in some cases advanced agent properties will start to predictably or possibly cross some of those lines (a superintelligence is much more likely to find a strategy, even in a large search space)
  • you still don't get to assume that the agent can identify arbitrary nice things unless you know how to make a Python program run on a large-enough computer output those nice things (this is what blocks the 'unbounded analysis' of "Oh, it's really smart, it'll know what we mean")

  • why we need basic theoretical understanding: because it's hard to do FAI otherwise, and because our FAI concepts need to anchor in basic things that might stay constant under self-modification, not particular programmatic tricks that will evaporate like snow in winter