The limited predictor problem

10 cousin_it 21 March 2012 12:15AM

This post requires some knowledge of logic, computability theory, and K-complexity. Much of the credit goes to Wei Dai. The four sections of the post can be read almost independently.

The limited predictor problem (LPP) is a version of Newcomb's Problem where the predictor has limited computing resources. To predict the agent's action, the predictor simulates the agent for N steps. If the agent doesn't finish in N steps, the predictor assumes that the agent will two-box. LPP is similar to the ASP problem, but with simulation instead of theorem proving.

1. Solving the problem when the agent has a halting oracle

Consider the agent defined in "A model of UDT with a halting oracle", and a predictor that can run the agent's code step-by-step, with oracle calls and all. Turns out that this agent solves LPP correctly if N is high enough. To understand why, note that the agent offloads all interesting work to oracles that return instantly, so the agent's own runtime is provably bounded. If that bound is below N, the agent's oracle will prove that the predictor predicts the agent correctly, so the agent will one-box.

2. Failing to solve the problem when N is algorithmically random

Consider a setting without oracles, with only Turing-computable programs. Maybe the agent should successively search for proofs somehow?

Unfortunately you can't solve most LPPs this way, for a simple but surprising reason. Assume that the predictor's time limit N is a large and algorithmically random number. Then the predictor's source code is >log(N) bits long, because N must be defined in the source code. Then any proof about the world program must also have length >log(N), because the proof needs to at least quote the world program itself. Finding a proof by exhaustive search takes exponential time, so the agent will need >N steps. But the predictor simulates the agent for only N steps. Whoops!

3. Solving the problem when N is large but has a short definition

As usual, let U be the world program that returns a utility value, and A be the agent program that returns an action and has access to the world's source code. Consider the following algorithm for A:

  1. From L=1 to infinity, search for proofs up to length L of the form "if A()=a and runtime(A)<g(L), then U()=u", where g(L) is an upper bound on runtime(A) if A stops the search at length L. Upon finding at least one proof for each possible a, go to step 2.
  2. Search for proofs up to length f(L) of the form "if runtime(A)<g(L), then A()≠a", where f(L) is some suitably fast-growing function like 10^L. If such a proof is found, return a.
  3. If we're still here, return the best a found on step 1.

This algorithm is very similar to the one described in "A model of UDT without proof limits", but with the added complication that A is aware of its own runtime via the function g(L). By an analogous argument, A will find the "intended" proof that the predictor predicts A correctly if runtime(A) is small enough, as long as the "intended" proof exists and isn't too long relative to the predictor's time limit N. More concretely, A will solve all instances of LPP in which N is larger than g(L), where L is the length of the "intended" proof. For example, if f(L)=10^L, then g(L) is doubly exponential, so A will successfully solve LPPs where the predictor's source code defines N using triple exponentials or some more compact notation.

4. A broader view

TDT and UDT were originally designed for solving "decision-determined" problems. The agent figures out how the resulting utility logically depends on the agent's action, then returns the action with the highest utility, thus making the premise true.

But a cleverly coded decision program can also control other facts about itself. For example, the program may figure out how the resulting utility depends on the program's return value and running time, then choose the best return value and choose how long to keep running, thus making both premises true. This idea is a natural extension of quining (you carefully write a program that can correctly judge its own runtime so far) and can be generalized to memory consumption and other properties of programs.

With enough cleverness we could write a program that would sometimes decide to waste time, or run for an even number of clock cycles, etc. We did not need so much cleverness in this post because LPP lies in a smaller class that we may call "LPP-like problems", where utility depends only on the agent's return value and runtime, and the dependence on runtime is monotonous - it never hurts to return the same value earlier. That class also includes all the usual decision-determined problems like Newcomb's Problem, and our A also fares well on those.

I was surprised to find so many new ideas by digging into such a trivial-looking problem as LPP. This makes me suspect that advanced problems like ASP may conceal even more riches, if only we have enough patience to approach them properly...

A model of UDT without proof limits

13 cousin_it 20 March 2012 07:41PM

This post requires some knowledge of decision theory math. Part of the credit goes to Vladimir Nesov.

Let the universe be a computer program U that returns a utility value, and the agent is a subprogram A within U that knows the source code of both A and U. (The same setting was used in the reduction of "could" post.) Here's a very simple decision problem:

def U():
  if A() == 1:
    return 5
  else:
    return 10

The algorithm for A will be as follows:

  1. Search for proofs of statements of the form "A()=a implies U()=u". Upon finding at least one proof for each possible a, go to step 2.
  2. Let L be the maximum length of proofs found on step 1, and let f(L) be some suitably fast-growing function like 10^L. Search for proofs shorter than f(L) of the form "A()≠a". If such a proof is found, return a.
  3. If we're still here, return the best a found on step 1.

The usual problem with such proof-searching agents is that they might stumble upon "spurious" proofs, e.g. a proof that A()==2 implies U()==0. If A finds such a proof and returns 1 as a result, the statement A()==2 becomes false, and thus provably false under any formal system; and a false statement implies anything, making the original "spurious" proof correct. The reason for constructing A this particular way is to have a shot at proving that A won't stumble on a "spurious" proof before finding the "intended" ones. The proof goes as follows:

Assume that A finds a "spurious" proof on step 1, e.g. that A()=2 implies U()=0. We have a lower bound on L, the length of that proof: it's likely larger than the length of U's source code, because a proof needs to at least state what's being proved. Then in this simple case 10^L steps is clearly enough to also find the "intended" proof that A()=2 implies U()=10, which combined with the previous proof leads to a similarly short proof that A()≠2, so the agent returns 2. But that can't happen if A's proof system is sound, therefore A will find only "intended" proofs rather than "spurious" ones in the first place.

Quote from Nesov that explains what's going on:

With this algorithm, you're not just passively gauging the proof length, instead you take the first moral argument you come across, and then actively defend it against any close competition

By analogy we can see that A coded with f(L)=10^L will correctly solve all our simple problems like Newcomb's Problem, the symmetric Prisoner's Dilemma, etc. The proof of correctness will rely on the syntactic form of each problem, so the proof may break when you replace U with a logically equivalent program. But that's okay, because "logically equivalent" for programs simply means "returns the same value", and we don't want all world programs that return the same value to be decision-theoretically equivalent.

A will fail on problems where "spurious" proofs are exponentially shorter than "intended" proofs (or even shorter, if f(L) is chosen to grow faster). We can probably construct malicious examples of decision-determined problems that would make A fail, but I haven't found any yet.

Zeckhauser's roulette

11 cousin_it 19 January 2012 07:22PM

Imagine you're playing Russian roulette. Case 1: a six-shooter contains four bullets, and you're asked how much you'll pay to remove one of them.  Case 2: a six-shooter contains two bullets, and you're asked how much you'll pay to remove both of them. Steven Landsburg describes an argument by Richard Zeckhauser and Richard Jeffrey saying you should pay the same amount in both cases, provided that you don't have heirs and all your remaining money magically disappears when you die. What do you think?

A way of specifying utility functions for UDT

9 cousin_it 23 December 2011 12:37AM

The original UDT post defined an agent's preferences as a list of programs it cares about, and a utility function over their possible execution histories. That's a good and general formulation, but how do we make a UDT agent care about our real world, seeing as we don't know its true physics yet? It seems difficult to detect humans (or even paperclips) in the execution history of an arbitrary physics program.

But we can reframe the problem like this: the universal prior is a probability distribution over all finite bitstrings. If we like some bitstrings more than others, we can ask a UDT agent to influence the universal prior to make those bitstrings more probable. After all, the universal prior is an uncomputable mathematical object that is not completely known to us (and even defined in terms of all possible computer programs, including ours). We already know that UDT agents can control such objects, and it turns out that such control can transfer to the real world.

For example, you could pick a bitstring that describes the shape of a paperclip, type it into a computer and run a UDT AI with a utility function which says "maximize the probability of this bitstring under the universal prior, mathematically defined in such-and-such way". For good measure you could give the AI some actuators, e.g. connect its output to the Internet. The AI would notice our branch of the multiverse somewhere within the universal prior, notice itself running within that branch, seize the opportunity for control, and pave over our world so it looks like a paperclip from as many points of view as possible. The thing is like AIXI in its willingness to kill everyone, but different from AIXI in that it probably won't sabotage itself by mining its own CPU for silicon or becoming solipsistic.

This approach does not see the universal prior as a fixed probability distribution over possible universes, instead you could view it as picking from several different possible universal priors. A utility function thus specified doesn't always look like utility maximization from within the universal prior's own multiverse of programs. For example, you could ask the agent to keep a healthy balance between paperclips and staples in the multiverse, i.e. minimize |P(A)-P(B)| where A is a paperclip bitstring, B is a staple bitstring, and P is the universal prior.

The idea is scary because well-meaning people can be tempted to use it. For example, they could figure out which computational descriptions of the human brain correspond to happiness and ask their AI to maximize the universal prior probability of those. If implemented correctly, that could work and even be preferable to unfriendly AI, but we would still lose many human values. Eliezer's usual arguments apply in full force here. So if you ever invent a powerful math intuition device, please don't go off building a paperclipper for human happiness. We need to solve the FAI problem properly first.

Thanks to Paul Christiano and Gary Drescher for ideas that inspired this post.

A model of UDT with a halting oracle

41 cousin_it 18 December 2011 02:18PM

This post requires some knowledge of mathematical logic and computability theory. The basic idea is due to Vladimir Nesov and me.

Let the universe be a computer program U that can make calls to a halting oracle. Let the agent be a subprogram A within U that can also make calls to the oracle. The source code of both A and U is available to A.

Here's an example U that runs Newcomb's problem and returns the resulting utility value:

  def U():
    # Fill boxes, according to predicted action.
    box1 = 1000
    box2 = 1000000 if (A() == 1) else 0
    # Compute reward, based on actual action.
    return box2 if (A() == 1) else (box1 + box2)

A complete definition of U should also include the definition of A, so let's define it. We will use the halting oracle only as a provability oracle for some formal system S, e.g. Peano arithmetic. Here's the algorithm of A:

  1. Play chicken with the universe: if S proves that A()≠a for some action a, then return a.
  2. For every possible action a, find some utility value u such that S proves that A()=a ⇒ U()=u. If such a proof cannot be found for some a, break down and cry because the universe is unfair.
  3. Return the action that corresponds to the highest utility found on step 2.

Now we want to prove that the agent one-boxes, i.e. A()=1 and U()=1000000. That will follow from two lemmas.

Lemma 1: S proves that A()=1 ⇒ U()=1000000 and A()=2 ⇒ U()=1000. Proof: you can derive that from just the source code of U, without looking at A at all.

Lemma 2: S doesn't prove any other utility values for A()=1 or A()=2. Proof: assume, for example, that S proves that A()=1 ⇒ U()=42. But S also proves that A()=1 ⇒ U()=1000000, therefore S proves that A()≠1. According to the first step of the algorithm, A will play chicken with the universe and return 1, making S inconsistent unsound (thx Misha). So if S is sound, that can't happen.

We see that the agent defined above will do the right thing in Newcomb's problem. And the proof transfers easily to many other toy problems, like the symmetric Prisoner's Dilemma.

But why? What's the point of this result?

There's a big problem about formalizing UDT. If the agent chooses a certain action in a deterministic universe, then it's a true fact about the universe that choosing a different action would have caused Santa to appear. Moreover, if the universe is computable, then such silly logical counterfactuals are not just true but provable in any reasonable formal system. When we can't compare actual decisions with counterfactual ones, it's hard to define what it means for a decision to be "optimal".

For example, one previous formalization searched for formal proofs up to a specified length limit. Problem is, that limit is a magic constant in the code that can't be derived from the universe program alone. And if you try searching for proofs without a length limit, you might encounter a proof of a "silly" counterfactual which will make you stop early before finding the "serious" one. Then your decision based on that silly counterfactual can make it true by making its antecedent false... But the bigger problem is that we can't say exactly what makes a "silly" counterfactual different from a "serious" one.

In contrast, the new model with oracles has a nice notion of optimality, relative to the agent's formal system. The agent will always return whatever action is proved by the formal system to be optimal, if such an action exists. This notion of optimality matches our intuitions even though the universe is still perfectly deterministic and the agent is still embedded in it, because the oracle ensures that determinism is just out of the formal system's reach.

P.S. I became a SingInst research associate on Dec 1. They did not swear me to secrecy, and I hope this post shows that I'm still a fan of working in the open. I might just try to be a little more careful because I wouldn't want to discredit SingInst by making stupid math mistakes in public :-)

No one knows what Peano arithmetic doesn't know

17 cousin_it 16 December 2011 09:36PM

WARNING: this post requires some knowledge of mathematical logic and computability theory.

I was just talking with Wei Dai and something came up that seems at once obvious and counterintuitive. Though if the argument is correct, I guess it will be old news to about 50% of the people who read my posts :-)

Imagine you have an oracle that can determine if an arbitrary statement is provable in Peano arithmetic. Then you can try using it as a halting oracle: for an arbitrary Turing machine T, ask "can PA prove that there's an integer N such that T makes N steps and then halts?". If the oracle says yes, you know that the statement is true for standard integers because they're one of the models of PA, therefore N is a standard integer, therefore T halts. And if the oracle says no, you know that there's no such standard integer N because otherwise the oracle would've found a long and boring proof involving the encoding of N as SSS...S0, therefore T doesn't halt. So your oracle can indeed serve as a halting oracle.

On the other hand, if you had a halting oracle to begin with, you could use it as a provability oracle for PA: "if a program successively enumerates all proofs in PA, will it ever find a proof for such-and-such statement?"

So having a provability oracle for PA or any other consistent formal system that proves some valid arithmetic truths (like ZFC) is equivalent to having a halting oracle, and thus leads to a provability oracle for any other formal system. In other words, if you knew all about the logical implications of PA, then you would also know all about the logical implications of ZFC and all other formal systems. Hee hee.

ETA: this line leads to a nontrivial question. Is there a formal system (not talking about the standard integers, I guess) whose provability oracle is strictly weaker than the halting oracle, but still uncomputable?

ETA 2: the question seems to be resolved, see Zetetic's comment and my reply.

K-complexity of everyday things

11 cousin_it 04 December 2011 02:54PM

What can we say about the K-complexity of a non-random string from our universe, e.g. the text of Finnegans Wake? It contains lots of patterns making it easy to compress using a regular archiver, but can we do much better than that?

On one hand, the laws of physics in our universe seem to be simple, and generating the text is just a matter of generating our universe then pointing to the text. On the other hand, our evolution involved a lot of quantum randomness, so pointing to humans within the universe could require a whole lot of additional bits above and beyond the laws of physics. So does anyone have good arguments whether the K-complexity of Finnegans Wake is closer to 10% or 0.1% of its length?

What you think about them

6 cousin_it 01 December 2011 02:28AM

LWers who asked for personal feedback/critique in the comments to What I think about you: Dorikka, atucker, muflax, jsalvatier, Tyrell_McAllister, TheOtherDave.

LWers who asked for personal feedback at lukeprog's Tell me what you think about me: lukeprog, Jolly, SilasBarta, gwern.

If you have met any of these LWers face-to-face, or just have some constructive criticism to offer them, I urge you to do so by any discreet channel, e.g. private messages on LW or the various feedback forms that these LWers have set up. Most of them have declared Crocker's rules. Please don't hesitate: you have a chance to do a lot of good.

There seems to be a large unmet demand for feedback. Any ideas how to better satisfy that demand are welcome!

ETA: Antisuji, endoself, John_Maxwell_IV.

What I think about you

16 cousin_it 29 November 2011 11:28PM

So, having heard Mike Li compare Jaynes to a thousand-year-old vampire, one question immediately popped into my mind:

"Do you get the same sense off me?" I asked.

-- Eliezer Yudkowsky, The Level Above Mine

Most people need feedback in many areas. Most people can give feedback in many areas. But for some reason I don't see a lot of actual honest feedback happening, neither in my personal life, nor at work, nor here on LW. This looks like some sort of market failure, or perhaps a bug in society.

Would we benefit from a norm that encouraged asking for feedback or critique in any area, perhaps using open threads set up specially for that? I think we would. What do you think?

Why would an AI try to figure out its goals?

13 cousin_it 09 November 2011 10:08AM

"So how can it ensure that future self-modifications will accomplish its current objectives? For one thing, it has to make those objectives clear to itself. If its objectives are only implicit in the structure of a complex circuit or program, then future modifications are unlikely to preserve them. Systems will therefore be motivated to reflect on their goals and to make them explicit." -- Stephen M. Omohundro, The Basic AI Drives

This AI becomes able to improve itself in a haphazard way, makes various changes that are net improvements but may introduce value drift, and then gets smart enough to do guaranteed self-improvement, at which point its values freeze (forever). -- Eliezer Yudkowsky, What I Think, If Not Why

I have stopped understanding why these quotes are correct. Help!

More specifically, if you design an AI using "shallow insights" without an explicit goal-directed architecture - some program that "just happens" to make intelligent decisions that can be viewed by us as fulfilling certain goals - then it has no particular reason to stabilize its goals. Isn't that anthropomorphizing? We humans don't exhibit a lot of goal-directed behavior, but we do have a verbal concept of "goals", so the verbal phantom of "figuring out our true goals" sounds meaningful to us. But why would AIs behave the same way if they don't think verbally? It looks more likely to me that an AI that acts semi-haphazardly may well continue doing so even after amassing a lot of computing power. Or is there some more compelling argument that I'm missing?

View more: Prev | Next