Limited agents need approximate induction
[This post borders on some well-trodden ground in information theory and machine learning, so ideas in this post have an above-average chance of having already been stated elsewhere, by professionals, better. EDIT: As it turns out, this is largely the case, under the subjects of the justifications for MML prediction and Kolmogorov-simple PAC-learning.]
I: Introduction
I am fascinated by methods of thinking that work for well-understood reasons - that follow the steps of a mathematically elegant dance. If one has infinite computing power the method of choice is something like Solomonoff induction, which is provably ideal in a certain way at predicting the world. But if you have limited computing power, the choreography is harder to find.
To do Solomonoff induction, you search through all Turing machine hypotheses to find the ones that exactly output your data so far, then use the weighted average of those perfect retrodictors to predict the next time step. So the naivest way to build an ideal limited agent is to merely search through lots of hypotheses (chosen from some simple set) rather than all of them, and only run each Turing machine for time less than some limit. At least it's guaranteed to work in the limit of large computing power, which ain't nothing.
Suppose then that we take this nice elegant algorithm for a general predictor, and we implement it on today's largest supercomputer, and we show it the stock market prices from the last 50 years to try to predict stocks and get very rich. What happens?
Bupkis happens, that's what. Our Solomonoff predictor tries a whole lot of Turing machines and then runs out of time before finding any useful hypotheses that can perfectly replicate 50 years of stock prices. This is because such useful hypotheses are very, very, very rare.
We might then turn to the burgeoning field of logical uncertainty, which has a major goal of handling intractable math problems in an elegant and timely manner. We are logically uncertain about what distribution Solomonoff induction will output, so can we just average over that logical uncertainty to get some expected stock prices?
The trouble with this is that current logical uncertainty methods rely on proofs that certain outputs are impossible or contradictory. For simple questions this can narrow down the answers, but for complicated problems it becomes intractable, replacing the hard problem of evaluating lots of Turing machines with the hard problem of searching through lots and lots of proofs about lots of Turing machines - and so again our predictor runs out of time before becoming useful.
In practice, the methods we've found to work don't look very much like Solomonoff induction. Successful methods don't take the data as-is, but instead throw some of it away: curve fitting and smoothing data, filtering out hard-to-understand signals as noise, and using predictive models that approximate reality imperfectly. The sorts of things that people trying to predict stocks are already doing. These methods are vital to improve computational tractability, but are difficult (to my knowledge) to fit into a framework as general as Solomonoff induction.
II: Rambling
Suppose that our AI builds a lot of models of the world, including lossy models. How should it decide which models are best to use for predicting the world? Ideally we'd like to make a tradeoff between the accuracy of the model, measured in the expected utility of how accurate you expect the model's predictions to be, and the cost of the time and energy used to make the prediction.
Once we know how to tell good models, the last piece would be for our agent to make the explore/exploit tradeoff between searching for better models and using its current best.
There are various techniques to estimate resource usage, but how does one estimate accuracy?
Here was my first thought: If you know how much information you're losing (e.g. by binning data), for discrete distributions this sets the Shannon information of the ideal value (given by Solomonoff prediction) given the predicted value. This uses the relationship between information in bits of data and Shannon information that determines how sharp your probability distribution is allowed to be.
But with no guarantees about the normality (or similar niceness properties) of the ideal value given the prediction, this isn't very helpful. The problem is highlighted by hurricane prediction. If hurricanes behaved nicely as we threw away information, weather models would just be small, high-entropy deviations from reality. Instead, hurricanes can change route greatly even with small differences in initial conditions.
The failure of the above approach can be explained in a very general way: it uses too little information about the model and the data, only the amount of information thrown away. To do better, our agent has to learn a lot from its training data - a subject that workers in AI have already been hard at work on. On the one hand, it's a great sign if we can eventually connect ideal agents to current successful algorithms. On the other, doing so elegantly seems like a hard problem.
To sum up in the blandest possible way: If we want to build successful predictors of the future with limited resources, they should use their experience to learn approximate models of the world.
The real trick, though, is going to be to set this on a solid foundation. What makes a successful method of picking models? As we lack access to the future (yet! Growth mindset!), we can't grade models based on their future predictions unless we descend to solipsism and grade models against models. Thus we're left with grading models based on how well they retrodict the data so far. Sound familiar? The foundation we want seems like an analogue to Solomonoff induction, one that works for known reasons but doesn't require perfection.
III: An Example
Here's a paradigm that might or might not be a step in the right direction, but at least gestures at what I mean.
The first piece of the puzzle is that a model that gets proportion P of training bits wrong can be converted to a Solomonoff-accepted perfectly-precise model just by specifying the bits it gets wrong. Suppose we break the model output (with total length N) into chunks of size L, and prefix each chunk with the locations of the wrong bits in that chunk. Then the extra data required to rectify an approximate model is at most N/L·log(P·L)+N·P·log(L). Then the hypothesis where the model is right about the next bit is simpler than the hypothesis when it's wrong, because when the model is right you don't have to spend ~log(L) bits correcting it.
In this way, Solomonoff induction natively cares about some approximate models' predictions. There are some interesting details here that are outside the focus of this particular post. Does using the optimal chunk length lead to Solomonoff induction reflecting model accuracy correctly? What are some better schemes for rectifying models that handle things like models that output probabilities? The point is just that even if your model is wrong on fraction P of the training data, Solomonoff induction will still promote it as long as it's simpler than N-N/L·log(P·L)-N·P·log(L).
The second piece of the puzzle is that induction can be done over processed functions of observations, like smoothing the data or filtering difficult-to-predict parts (noise) out. If this processing increases the accuracy of models, we can use this to make high-accuracy models of functions the training data, and then use those models to predict the the processed future observations as above.
These two pieces allow an agent to use approximate models, and to throw away some of its information, and still have its predictions work for the same reason as Solomonoff induction. We can use this paradigm to interpret what an algorithm like curve fitting is doing - the fitted curve is a high-accuracy retrodiction of some smoothed function of the data, which therefore does a good job of predicting what that smoothed function will be in the future.
There are some issues here. If a model that you are using is not the simplest, it might have overfitting problems (though perhaps you can fix this just by throwing away more information than naively appears necessary) or systematic bias. More generally, we haven't explored how models get chosen; we've made the problem easier to brute force but we need to understand non-brute force search methods and what their foundations are. It's a useful habit to keep in mind what actually works for humans - as someone put it to me recently, "humans can make models they understand that work for reasons they understand."
Furthermore, this doesn't seem to capture reductionism well. If our agent learns some laws of physics and then is faced with a big complicated situation it needs to use a simplified model to make a prediction about, it should still in some sense "believe in the laws of physics," and not believe that this complicated situation violates the laws physics even if its current best model is independent of physics.
IV: Logical Uncertainty
It may be possible to relate this back to logical uncertainty - where by "this" I mean the general thesis of predicting the future by building models that are allowed to be imperfect, not the specific example in part III. Soares and Fallenstein use the example of a complex Rube Goldberg machine that deposits a ball into one of several chutes. Given the design of the machine and the laws of physics, suppose that one can in principle predict the output of this machine, but that the problem is much too hard for our computer to do. So rather than having a deterministic method that outputs the right answer, a "logical uncertainty method" in this problem is one that, with a reasonable amount of resources spent, takes in the description of the machine and the laws of physics, and gives a probability distribution over the machine's outputs.
Meanwhile, suppose that we take an approximately inductive predictor and somehow teach it the the laws of physics, then ask it to predict the machine. We'd like it to make predictions via some appropriately simplified folk model of physics. If this model gives a probability distribution over outcomes - like in the simple case of "if you flip this coin in this exact way, it has a 50% shot at landing heads" - doesn't that make it a logical uncertainty method? But note that the probability distribution returned by a single model is not actually the uncertainty introduced by replacing an ideal predictor with a resource-limited predictor. So any measurement of logical uncertainty has to factor in the uncertainty between models, not just the uncertainty within models.
Again, we're back to looking for some prediction method that weights models with some goodness metric more forgiving than just using perfectly-retrodicting Turing machines, and which outputs a probability distribution that includes model uncertainty. But can we apply this to mathematical questions, and not just Rube Goldberg machines? Is there some way to subtract away the machine and leave the math?
Suppose that our approximate predictor was fed math problems and solutions, and built simple, tractable programs to explain its observations. For easy math problems a successful model can just be a Turing machine that finds the right answer. As the math problems get more intractable, successful models will start to become logical uncertainty methods, like how we can't predict a large prime number exactly, but we can predict it's last digit is 1, 3, 7, or 9. Within this realm we have something like low-level reductionism, where even though we can't find a proof of the right answer, we still want to act as if mathematical proofs work and all else is ignorance, and this will help us make successful predictions.
Then we have complicated problems that seem to be beyond this realm, like P=NP. Humans certainly seem to have generated some strong opinions about P=NP without dependence on mathematical proofs narrowing down the options. It seems to such humans that the genuinely right procedure to follow is that, since we've searched long and hard for a fast algorithm for NP-complete problems without success, we should update in the direction that no such algorithm exists. In approximate-Solomonoff-speak, it's that P!=NP is consistent with a simple, tractable explanation for (a recognizable subset of) our observations, while P=NP is only consistent with more complicated tractable explanations. We could absolutely make a predictor that reasons this way - it just sets a few degrees of freedom. But is it the right way to reason?
For one thing, this seems like it's following Gaifman's proposed property of logical uncertainty, that seeing enough examples of something should convince you of it with probability 1 - which has been shown to be "too strong" in some sense (it assigns probability 0 to some true statements - though even this could be okay if those statements are infinitely dilute). Does the most straightforward implementation actually have the Gaifman condition, or not? (I'm sorry, ma'am. Your daughter has... the Gaifman condition.)
This inductive view of logical uncertainty lacks the consistent nature of many other approaches - if it works, it does so by changing approaches to suit the problem at hand. This is bad if you want your logical uncertainty methods to be based on a simple prior followed by some kind of updating procedure. But logical uncertainty is supposed to be practical, after all, and at least this is a simple meta-procedure.
V: Questions
Thanks for reading this post. In conclusion, here are some of my questions:
What's the role of Solomonoff induction in approximate induction? Is Solomonoff induction doing all of the work, or is it possible to make useful predictions using tractable hypotheses Solomonoff induction would exclude, or excluding intractable hypotheses Solomonoff induction would have to include?
Somehow we have to pick out models to promote to attention in the first place. What properties make a process for this good or bad? What methods for picking models can be shown to still lead to making useful predictions - and not merely in the limit of lots of computing time?
Are humans doing the right thing by making models they understand that work for reasons they understand? What's up with that reductionism problem anyhow?
Is it possible to formalize the predictor discussed in the context of logical uncertainty? Does it have to fulfill Gaifman's condition if it finds patterns in things like P!=NP?
More and Less than Solomonoff Induction
I've been thinking about how to put induction with limited resources on a firmer foundation. This may just be retracing the steps of others, but that's okay with me. Mostly I just want to talk about these thoughts.
After a few points of introduction.
What's Solomonoff induction?
Suppose we're given some starting data, and asked to predict the future. Solomonoff induction predicts the future by combining the predictions of all programs that (1) output the starting data up until now, and (2) aren't the continuation of another such program. The predictions are combined according to a weighting that decreases exponentially as the length of the program increases.
Why is it a good idea?
The simplest answer is that we have a frequentist guarantee that if the "true program" generating our input has some length N (that is, if the observable universe is a big but finite-sized computer), then our predictions will only be wrong a limited number of times, and after that we'll predict the correctly every time.
A more bayesian answer would start with the information that our observations can be generated by some finite-sized program, and then derive that something like Solomonoff induction has to represent our true prior over generating programs - as the length gets bigger, our probability is required to go to zero at infinity, and an exponential is the maximum-entropy such curve. This is not a complete answer, but it at least makes the missing pieces more apparent.
Why won't it work with limited resources
The trouble with using Solomonoff induction in real life is that to pick out which programs output our data so far, we need to run every program - and if the program doesn't ever halt, we need to use a halting oracle to stop it or else we'll take infinite time.
Limited resources require us to only pick from a class of programs that is guaranteed to not run over the limit.
If we have limited time and no halting oracle, we can't check every program. Instead, we are only allowed to check programs drawn from a class of programs that we can check the output of in limited time. The simplest example would be to just check programs but not report the result if we go over some time limit, in which case the class we're picking from is "programs that don't go over the time limit."
This is an application of a general principle - when we impose resource constraints on a real agent, we want to be able to cash them out as properties of an abstract description of what the agent is doing. In this case, we can cash out limited time for our real agent as an inability for our abstract description to have any contributions from long programs.
This isn't really a Bayesian restriction - we don't actually know that the true program is within our search space. This also weakens our frequentist guarantee.
The problem of overfitting - real models allow for incorrect retrodictions.
If we just take Solomonoff induction and put restrictions on it, our predictions will still only come from hypotheses that exactly reproduce our starting data. This is a problem.
For example, if we have a ton of data, like we do, then finding even one program that exactly replicates our data is too hard, and our induction is useless.
We as humans have two main ways of dealing with this. The first is that we ignore most of our data and only try to predict the important parts. The second is that we don't require our models to perfectly retrodict our observations so far (retrodiction- it's like prediction, for the past!).
In fact. having imperfect retrodictions is such a good idea that we can have a problem called "overfitting." Isn't that kinda odd? That it's better to use a simple model that is actually 100% known to be wrong, than to use a very complicated model that has gotten everything right so far.
This behavior makes more sense if we imagine that our data contains contributions both from the thing we're trying to model, and from stuff we want to ignore, in a way that's hard to separate.
What makes real models better than chance? The example of coarse-graining.
Is it even possible to know ahead of time that an imperfect model will make good predictions? It turns out the answer is yes.
The very simplest example is of just predicting the next bit based on which bit has been more common so far. If every pattern were equally likely this wouldn't work, but if we require the probability of a pattern to decrease with its complexity, we're more likely to see repetitions.
A more general example: an imperfect model is always better than chance if it is a likely coarse-graining of the true program. Coarse-graining is when you can use a simple program to predict a complicated one by only worrying about some of the properties of the complicated program. In the picture at right, a simple cellular automaton can predict whether a more complicated one will have the same or different densities of white and black in a region.
When a coarse-graining is exact, the coarse-grained properties follow exact rules all the time. Like in the picture at right, where the coarse-grained pattern "same different same" always evolves into "different different different," even though there are multiple states of the more complicated program that count as "different" and "same."
When a coarse-graining is inexact, only most of the states of the long program follow the coarse-grained rules of the short program. but it turns out that, given some ignorance about the exact rules or situation, this is also sufficient to predict the future better than chance.
Of course, when doing induction, we don't actually know the true program. Instead what happens is that we just find some simple program that fits our data reasonably well (according to some measurement of fit), and we go "well, what happened before is more likely to happen again, so this rule will help us predict the future."
Presumably we combine predictions according to some weighting that include both the length of the program and its goodness of fit.
Machine learning - probably approximately correct
Since this is a machine learning problem, there are already solutions to similar problems, one of which is probably approximately correct learning. The basic idea is that if you have some uniformly-sampled training data, and a hypothesis space you can completely search, then you can give some probabilistic guarantees about how good the hypothesis is that best fits the training data. A "hypothesis," here, is a classification of members of data-space into different categories.
The more training data you have, the closer (where "close" can be measured as a chi-squared-like error) the best-fitting hypothesis is to the actually-best hypothesis. If your hypothesis space doesn't contain the true hypothesis, then that's okay - you can still guarantee that the best-fitting hypothesis gets close to the best hypothesis in your hypothesis space. The probability that a far-away hypothesis would masquerade as a hypothesis within the small "successful" region gets smaller and smaller as the training data increases.
There is a straightforward extension to cases where your data contains contributions from both "things we want to model" and "things we want to ignore." Rather than finding the hypothesis that fits the training data best, we want to find the hypothesis for the "stuff we want to model" that has the highest probability of producing our observed data, after we've added some noise, drawn from some "noise distribution" that encapsulates the basic information about the stuff we want to ignore.
There are certainly some issues comparing this to Solomonoff induction, like the fact that our training data is randomly sampled rather than a time series, but I do like this paradigm a bit better than looking for a guarantee that we'll find the one true answer in finite time.
Bayes' theorem
If there's an easy way to combine machine learning with Solomonoff induction, it's Bayes' theorem. The machine learning was focused on driving down P(training data | chance), and picking a hypothesis with a high P(training data | hypothesis, noise model). We might want to Use Bayes' rule to say something like P(hypothesis | training data, noise model) = P(hypothesis | complexity prior) * P(training data | hypothesis, noise model) / P(training data | noise model).
Anyhow - what are the obvious things I've missed? :)
How to make AIXI-tl incapable of learning
Consider a simple game: You are shown a random-looking 512-bit string h. You may then press one of two buttons, labeled '0' and '1'. No matter which button you press, you will then be shown a 256-bit string s such that SHA512(s) = h. In addition, if you pressed '1' you are given 1$.
This game seems pretty simple, right? s and h are irrelevant, and you should simply press '1' all the time (I'm assuming you value recieving money). Well, let's see how AIXI and AIXI-tl fare at this game.
Let's say the machine already played the game many times. Its memory is h0, b0, r0, s0, h1, ..., b_(n-1), r_(n-1), s_(n-1), h_n, where the list is in chronological order, inputs are unbolded while decisions are bolded, and r_i is the reward signal. It is always the case that r_i=b_i and h_i=SHA512(s_i).
First let's look at AIXI. It searches for models that compress and extrapolate this history up to the limit of its planning horizon. One class of such models is this: there is a list s0, ..., s_N of random 256-bits strings and a list b0, ..., b_N of (possibly compressible) bits. The history is SHA512(s0), b0, b0, s0, ..., b_(N-1), s_(N-1), SHA512(s_N), b_N. Here s_i for i<n must match the s_i in its memory, and s_n must be the almost certainly unique value with SHA512(s_n) = h_n. While I don't have a proof, it intuitively seems like this class of models will dominate the machines probability mass, and repeated arg-max should lead to the action of outputting 1. It wouldn't always do this due to exploration/exploitation considerations and due to the incentive to minimize K (b0, ... b_N) built into its prior, but it should do it most of the time. So AIXI seems good.
Now let's consider AIXI-tl. It picks outputs by having provably correct programs assign lower bounds to the expected utility, and picking the one with the best lower bound, where expected utility is as measured by AIXI. This would include accepting the analysis I just made with AIXI if that analysis can be made provably accurate. Here lies a problem: the agent has seen h_n but hasn't seen s_n. Therefore, it can't be certain that there is an s_n with SHA512(s_n)=h_n. Therefore, it can't be certain that the models used for AIXI actually works (this isn't a problem for AIXI since it has infinite computational power and can always determine that there is such an s_n).
There is an ad hoc fix for this for AIXI-tl: Take the same model as before, but h_n is a random string rather than being SHA512(s_n). This seems at first to work okay. However, it adds n, the current time, as an input to the model, which adds to its complexity. Now other models dependent on the current time also need to be considered. For instance, what if h_n was a random string, and in addition r_n=not(b_n). This models seems more complicated, but maybe in the programming language used for the Solomonoff prior it is shorter. The key point is that the agent won't be able to update past that initial prior no matter how many times it plays the game. In that case AIXI-tl may consistently prefer 0, assuming there aren't any other models it considers that make its behavior even more complicated.
The key problem is that AIXI-tl is handling logical uncertainty badly. The reasonable thing to do upon seeing h_n is assuming that it, like all h_i before it, is a hash of some 256-bit string. Instead, it finds itself unable to prove this fact and is forced into assuming the present h_n is special. This makes it assume the present time is special and makes it incapable of learning from experience.
A basis for pattern-matching in logical uncertainty
Previous logical uncertainty robot designs (e.g. here, here, here) have relied on proving theorems that relate to the statement (e.g. "the trillionth digit of pi is 5") under inspection (where a useful theorem would be e.g. "there are 10 possible mutually exclusive digits [1,2,3,4,5,6,7,8,9,0] the trillionth digit of pi could be."). This is nice. But it doesn't do pattern-matching - if you tell a theorem-proving robot that a statement is true for the first thousand integers, it doesn't even suspect that the statement might be true for the 1001st too. The statements are just independent black boxes.
In order to go further, our robot needs to peek inside the black box. But how, and why should it see patterns?
We tell our robot these facts: "3 is 'odd'. 5 is 'odd'. 7 is 'odd'. 11 is 'odd'."
The robot asks "Could you just tell me what this concept 'odd' means explicitly? I don't know English very well."
"No," we say. "You'll have to guess."
For robots that don't make up information out of thin air, guesses about 'odd'-ness will obey the maximum-entropy principle, which roughly means "give everything equal probability until you learn more."
"Well, robot, what do you think is the probability that 9 is 'odd', given what we've told you?"
"50 percent. Beep boop."
"But all those other things were odd, weren't they?"
"Those other things were not the number 9. Imagine going number by number and labeling each number as 'odd' or 'not odd'. There are equal numbers of possible labelings with 9 odd and 9 not odd, regardless of what those other numbers are. Since by the principle of maximum entropy I start out with all labelings equally likely, 9 has a 50% chance of being 'odd'."
"But isn't it just sensible to think that the next number we give you will be 'odd' if all the others were? What of the appeal of simplicity?"
"What's so great about simplicity? I'm trying not to make up information out of thin air here."
What do we tell our robot to make it guess that 9 is probably odd? Well, we want it to think that patterns of odd-ness that are simpler are more likely. So how about we let the robot peek at our definition of 'odd', just long enough to see that for every integer we can test if it's odd, and how complicated the test is.
Even if our robot has no clue what the constituent symbols were (or, more to the point, not enough processing power to fully explore the ramifications in a more difficult and realistic scenario), knowing the mere number of characters defining 'odd' puts an upper bound on the Kolmogorov complexity of how the numbers are labeled. Heck, even just learning that we can write down the oddness-test is huge, since there are an infinite number of infinite-complexity rules.
Once the robot knows that the complexity of 'odd' is below a certain smallish value, low-complexity hypotheses like "all numbers are 'odd'" and "odd numbers are 'odd'" start to outweigh1 bigger hypotheses like "all numbers except {long list including 9} are 'odd'." These simple hypotheses contain just the sorts of patterns we sometimes want the robot to see, like 'odd'-ness.
We tell our robot these facts: "3 is odd. 5 is odd. 7 is odd. 11 is odd. A number is 'odd' if a 14-character predicate is true."
The robot asks "Could you just tell me what this concept 'odd' means explicitly?"
"No," we say. "You'll have to guess. What do you think is the probability that 9 is 'odd', given what we've told you?"
"65.1 percent."
"Hah, got you! When we said 'odd' we secretly meant prime!"
I suspect that this kind of peeking handles most of the cases where we want something like pattern-matching (specifically, minimum-message-length prediction of patterns) in logical uncertainty. The obvious un-handled case - the choice of axioms, or logical systems, or that sort of thing, seems more like model uncertainty than logical uncertainty - that is, the question is not really "is second-order logic true," it's "does second-order logic correspond to the world I want to model," which is beyond the scope of this Discussion article.
1 How do you turn the number of symbols we wrote it down with into a distribution over possible rules? It's easiest to just maximum-entropy all valid rules with the correct number of symbols. Since many rules can be compressed, the set of rules with some number of symbols is smeared over lower possible Kolmogorov complexities, I think with an exponential preference towards lower complexity. But on the other hand, humans don't hand out maximum-entropy samples of definitions - we'll need probabilistic logic to do probabilistic logic. (Yo dawg, I heard you liked recursion, so we put this joke in itself so you can [this joke] while you [this joke].)
Logical uncertainty, kind of. A proposal, at least.
If you want context and are fast at seeing the implications of math, see Benja's post. This post is much lighter on the math, though it may take more background reading and more laborious interpolation, since it's, well, lighter on the math.
Imagine I introduced my pet robot to a game. The robot has 10 seconds to pick a digit, and if the trillionth prime number ends with that digit, the robot gets a cookie (it likes peanut butter cookies the best). 10 seconds is not enough time for my robot to calculate the answer deductively. And yet, guessing an answer is superior to running out of time quietly. What sort of general logic should my robot follow in under 10 seconds to figure out that it should be indifferent between answering 1, 3, 7 or 9? Does it even make sense to be indifferent between the real answer and an impossible answer, even if you don't know which is which?
As you might expect from context, the proposed solution will involve assigning every true or false math statement a probabability-esque degree of plausibility, with numbers other than 0 or 1 indicating logical uncertainty. Why is this a good idea?
To explain logical uncertainty, let's first take a step back and reframe logical certainty in terms of rules for reasoning that apply to both deductive logic and probabilistic logic. An important resource here is E.T. Jaynes' Probability Theory (pdf) - the most relevant part being page 31 of the book. The key idea is that each of the probability axioms applies just fine no matter what kind of Boolean statement you want to find the probability of. Which is to say probability already applies to arithmetic - the laws of probability are also laws of arithmetic, just in the limit that probabilities go to 1 or 0. Our robot starts with a collection of definitions labeled with probability 1 (like "0 is a number" or "S(0)+0=S(0)" [if this S(0) stuff needs context, see wikipedia]), and then applies deductive rules according to the universal rules of probability. We translate "A implies B" into the language of probabilities as P(AB|C) = P(A|C), and then apply the always-true product rule P(B|AC)=P(AB|C) / P(A|C). If P(A|C)=1, that is, A|C is deductively true, and A implies B, then P(B|AC)=P(B|C)=1. The machinery that underlies deduction is in fact the same machinery that underlies probabilistic reasoning. And we're just going to exploit that a little.
An alternate axiomatization due to Savage (hat tip to articles by Sniffoy and fool) is based just on actions - it doesn't seem necessary for every agent to store numerical plausibilities, but every agent has to act, and if our agent is to act as if it had consistent preferences when presented with bets, it must act as if it calculated probabilities. Just like the conditions of Cox's theorem as used by E.T. Jaynes, the conditions of Savage's theorem apply to bets on arithmetic just fine. So our robot always behaves as if it assigns some probabilities over the last digit of the trillionth prime number - it's just that when our robot's allowed to run long enough, all but one of those probabilities is 0.
So how do we take the basic laws of belief-manipulation, like the product rule or the sum rule, and apply them to cases where we run out of time and can't deduce all the things? If we still want to take actions, we still want to assign probabilities, but we can't use deduction more than a set number of times...
Okay fine I'll just say it. The proposal outlined here is to treat a computationally limited agent's "correct beliefs" as the correct beliefs of a computationally unlimited agent with a limited definition of what deduction can do. So this weakened-deduction agent has a limitation, in that starting from axioms it can only prove some small pool of theorems, but it's unlimited in that it can take the pool of proven theorems, and then assign probabilities to all the unproven true or false statements. After we flesh out this agent, we can find a computationally limited algorithm that finds correct (i.e. equal to the ones from a sentence ago) probabilities for specific statements, rather than all of them. And finally, we have to take this and make a decision procedure - our robot. After all, it's no good for our robot to assign probabilities if it proceeds to get stuck because it tries to compare the utilities of the world if the end of the trillionth prime number were 1 versus 7 and doesn't even know what it means to calculate the utility of the impossible. We have to make a bit of a modification to the whole decision procedure, we can't just throw in probabilities and expect utility to keep up.
So, formally, what's going on when we limit deduction? Well, remember the process of deduction outlined earlier?
We translate "A implies B" into the language of probabilities as P(AB|C) = P(A|C), and then apply the always-true product rule P(B|AC)=P(AB|C) / P(A|C). If P(A|C)=1, that is, A|C is deductively true, and A implies B, then P(B|AC)=P(B|C)=1
There is a chain here, and if we want to limit deduction to some small pool of provable theorems, we need one of the links to be broken outside that pool. As implied, I don't want to mess with the product rule, or else we violate a desideratum of belief. Instead, we'll mess with implication itself - we translate "A implies B" into "P(AB|C)=P(A|C) only if we've spent less than 10 seconds doing deduction." Or "P(AB|C)=P(A|C) only if it's been less than 106 steps from the basic axioms." These limitations are ugly and nonlocal because they represent the intrusion of nonlocal limitations on our agent into a system that previously ran forever.
Note that the weakening of implication does not necessarily determine the shape of our pool of deduced theorems. A weakened-deduction agent could spiral outward from shortest to longest theorems, or it could search more cleverly to advance on some specific theorems before time runs out.
If a weakened-deduction agent just had the product rule and this new way of translating the axioms into probabilities, it would accumulate some pool of known probabilities - it could work out from the probability-1 axioms to show that some short statements had probability 1 and some other short statements had probability 0. It could also prove some more abstract things like P(AB)=0 without proving anything else about A or B, as long as it followed the right search pattern. But it can't assign probabilities outside of deduction - it doesn't have the rules. So it just ends up with a pool of deduced stuff in the middle of a blank plain of "undefined."
Okay, back to referring to E.T. Jaynes (specifically, the bottom of page 32). When deriving the laws of probability from Cox's desiderata, the axioms fall into different groups - there are the "laws of thought" parts, and the "interface" parts. The laws of thought are things like Bayes' theorem, or the product rule. They tell you how probabilities have to fit with other probabilities. But they don't give you probabilities ex nihilo, you have to start with probability-1 axioms or known probabilities and build out from them. The parts that tell you how to get new probabilities are the interface parts, ideas like "if you have equivalent information about two things, they should have the same probability."
So what does our limited-deduction agent do once it reaches its limits of deduction? Well, to put it simply, it uses deduction as much as it can, and then it uses the principle of maximum entropy for the probability of everything else. Maximum entropy corresponds to minimum information, so it satisfies a desideratum like "don't make stuff up."
The agent is assigning probabilities to true or false logical statements, statements like S(0)+S(0)=S(S(0)). If it had an unrestricted translation of "A implies B," it could prove this statement quickly. But suppose it can't. Then this statement is really just a string of symbols. The agent no longer "understands" the symbols, which is to say it can only use facts about the probability of these symbols that were previously proved and are within the pool of theorems - it's only a part of an algorithm, and doesn't have the resources to prove everything, so we have to design the agent to assign probabilities based just on what it proved deductively.
So the design of our unlimited-computation, limited-deduction agent is that it does all the deduction it can according to some search algorithm and within some limit, and this can be specified to take any amount of time. Then, to fill up the infinity of un-deduced probabilities, the agent just assigns the maximum-entropy probability distribution consistent with what's proven. For clever search strategies that figure out things like P(AB)=0 without figuring out P(A), doing this assignment requires interpretation of AND, OR, and NOT operations - that is, we still need a Boolean algebra for statements. But our robot no longer proves new statements about probabilities of these symbol strings, in the sense that P(S(0)+0=S(0))=P(S(0)+S(0)=S(S(0))) is a new statement. An example of a non-new statement would be P(S(0)+0=S(0)) AND S(0)+S(0)=S(S(0))) = P(S(0)+0=S(0)) * P(S(0)+S(0)=S(S(0)) | S(0)+0=S(0)) - that's just the product rule, it hasn't actually changed any of the equations.
End of part 1 exercise: Can deducing an additional theorem lead to our agent assigning less probability to the right answer under certain situations? (Reading part 2 may help)
Okay, now on to doing this with actual bounded resources. And back to the trillionth prime number! You almost forgot about that, didn't you. The plan is to break up the strict deduction -> max entropy procedure, and do it in such a way that our robot can get better results (higher probability to the correct answer) the longer it runs, up to proving the actual correct answer. It starts with no theorems, and figures out the max entropy probability distribution for the end of the trillionth prime number. Said distribution happens to be one-half to everything, e.g. p(1)=1/2 and p(2)=1/2 and p(3)=1/2. The robot doesn't know yet that the different answers are mutually exclusive and exhaustive, much less what's wrong with the answer of 2. But the important thing is, assigning the same number to everything of interest is fast. Later, as it proves relevant theorems, the robot updates the probability distribution, and when it runs out of resources it stops.
Side note: there's also another way of imagining how the robot stores probabilities, used in Benja's post, which is to construct a really big mutually exclusive and exhaustive basis (called "disjunctive normal form"). Instead of storing P(A) and P(B), which are not necessarily mutually exclusive or exhaustive, we store P(AB), P(A¬B) (the hook thingy means "NOT"), P(¬AB), and P(¬A¬B), which are mutually exclusive and exhaustive. These things would then each have probability 1/4, or 1/2N, where N is the number of statements you're assigning probabilities to. This is a pain when N goes to infinity, but can be useful when N is approximately the number of possible last digits of a number.
Back on track: suppose the first thing the robot proves about the last digit of the trillionth prime number is that answers of 1, 2, 3, 4, 5, 6, 7, 8, 9, and 0 are exhaustive. What does that do to the probabilities? In disjunctive normal form, the change is clear - exhaustiveness means that P(¬1¬2¬3¬4¬5¬6¬7¬8¬9¬0)=0, there's no leftover space. Previously there were 210=1024 of these disjunctive possibilities, now there are 1023, and the remaining ones stay equivalent in terms of what's been proven about them (nothing), so the probability of each went from 1/1024 to 1/1023. Two things to note: figuring this out took a small amount of work and is totally doable for the robot, but we don't want to do this work every time we use modus tollens, so we need to have some way to tell whether our new theorem matters to the trillionth prime number.
For example, image we were interested in the statement A. The example is to learn that A, B, and C are mutually exclusive and exhaustive, step by step. First, we could prove that A, B, C are exhaustive - P(¬A¬B¬C)=0. Does this change P(A)? Yes, it changes from 4/8 (N is 3, so 23=8) to 4/7. Then we learn that P(AB)=0, i.e. A and B are mutually exclusive. This leaves us only A¬BC, ¬ABC, A¬B¬C, ¬AB¬C, and ¬A¬BC. P(A) is now 2/5. Now we learn that A and C are mutually exclusive, so the possibilities are ¬ABC, A¬B¬C, ¬AB¬C, and ¬A¬BC. P(A)=1/4. Each of the steps until now have had the statement A right there inside the parentheses - but for the last step, we show that B and C are mutually exclusive, P(BC)=0, and now we just have P(A)=P(B)=P(C)=1/3. We just took a step that didn't mention A, but it changed the probability of A. This is because we'd previously disrupted the balance between ABC and ¬ABC. To tell when to update P(A) we not only need to listen for A to be mentioned, we have to track what A has been entangled with, and what's been entangled with that, and so on in a web of deduced relationships.
The good news is that that's it. The plausibility assigned to any statement A by this finite-computation method is the same plausibility that our computationally-unlimited deductively-limited agent would have assigned to it, given the same pool of deduced theorems. The difference is just that the limited-deduction agent did this for every possible statement, which as mentioned doesn't make as much sense in disjunctive normal form.
So IF we accept that having limited resources is like having a limited ability to do implication, THEN we know how our robot should assign probabilities to a few statements of interest. It should start with the good old "everything gets probability 1/2," which should allow it to win some cookies even if it only has a few milliseconds, and then it should start proving theorems, updating its probabilities when it proves something that should impact those probabilities.
Now onto the last part. The robot's utility function wasn't really designed for U(last digit of trillionth prime number is 1), so what should it do? Well, what does our robot like? It likes having a cookie over not having a cookie. C is for cookie, and that's good enough for it. So we want to transform a utility over cookies into a an expected utility that will let us order possible actions.
We have to make the exact same transformation in the case of ordinary probabilities, so let's examine that. If I flip a coin and get a cookie if I call it correctly, I don't have a terminal U(heads) or U(tails), I just have U(cookie). My expected utility of different guesses comes from not knowing which guess leads to the cookie.
Similarly, the expected utility of different guesses when betting on the trillionth prime number comes from not knowing which guess leads to the cookie. It is possible to care about the properties of math, or to care about whether coins land heads or tails, but that just means we have to drag in causality - your guess doesn't affect how math works, or flip coins over.
So the standard procedure for our robot looks like this:
Start with some utility function U over the world, specifically cookies.
Now, face a problem. This problem will have some outcomes (possible numbers of cookies), some options (that is, strategies to follow, like choosing one of 10 possible digits), and any amount of information about how options correspond to outcomes (like "iff the trillionth prime ends with this digit, you get the cookie").
Now our robot calculates the limited-resources probability of getting different outcomes given different strategies, and from that calculates an expected utility for each strategy.
Our robot then follows one of the strategies with maximum expected utility.
Bonus exercises: Does this procedure already handle probabilistic maps from the options to the outcomes, like in the case of the flipped coin? How about if flipping a coin isn't already converted into a probability, but is left as an underdetermined problem a la "a coin (heads XOR tails) is flipped, choose one."
A model of UDT with a concrete prior over logical statements
I've been having difficulties with constructing a toy scenario for AI self-modification more interesting than Quirrell's game, because you really want to do expected utility maximization of some sort, but currently our best-specified decision theories search through the theorems of one particular proof system and "break down and cry" if they can't find one that tells them what their utility will be if they choose a particular option. This is fine if the problems are simple enough that we always find the theorems we need, but the AI rewrite problem is precisely about skirting that edge. It seems natural to want to choose some probability distribution over the possibilities that you can't rule out, and then do expected utility maximization (because if you don't maximize EU over some prior, it seems likely that someone could Dutch-book you); indeed, Wei Dai's original UDT has a "mathematical intuition module" black box which this would be an implementation of. But how do you assign probabilities to logical statements? What consistency conditions do you ask for? What are the "impossible possible worlds" that make up your probability space?
Recently, Wei Dai suggested that logical uncertainty might help avoid the Löbian problems with AI self-modification, and although I'm sceptical about this idea, the discussion pushed me into trying to confront the logical uncertainty problem head-on; then, reading Haim Gaifman's paper "Reasoning with limited resources and assigning probabilities to logical statements" (which Luke linked from So you want to save the world) made something click. I want to present a simple suggestion for a concrete definition of "impossible possible world", for a prior over them, and for an UDT algorithm based on that. I'm not sure whether the concrete prior is useful—the main point in giving it is to have a concrete example we can try to prove things about—but the definition of logical possible worlds looks like a promising theoretical tool to me.
Logical Uncertainty as Probability
This post is a long answer to this comment by cousin_it:
Logical uncertainty is weird because it doesn't exactly obey the rules of probability. You can't have a consistent probability assignment that says axioms are 100% true but the millionth digit of pi has a 50% chance of being odd.
I'd like to attempt to formally define logical uncertainty in terms of probability. Don't know if what results is in any way novel or useful, but.
Let X be a finite set of true statements of some formal system F extending propositional calculus, like Peano Arithmetic. X is supposed to represent a set of logical/mathematical beliefs of some finite reasoning agent.
Given any X, we can define its "Obvious Logical Closure" OLC(X), an infinite set of statements producible from X by applying the rules and axioms of propositional calculus. An important property of OLC(X) is that it is decidable: for any statement S it is possible to find out whether S is true (S∈OLC(X)), false ("~S"∈OLC(X)), or uncertain (neither).
We can now define the "conditional" probability P(*|X) as a function from {the statements of F} to [0,1] satisfying the axioms:
Axiom 1: Known true statements have probability 1:
P(S|X)=1 iff S∈OLC(X)
Axiom 2: The probability of a disjunction of mutually exclusive statements is equal to the sum of their probabilities:
"~(A∧B)"∈OLC(X) implies P("A∨B"|X) = P(A|X) + P(B|X)
From these axioms we can get all the expected behavior of the probabilities:
P("~S"|X) = 1 - P(S|X)
P(S|X)=0 iff "~S"∈OLC(X)
0 < P(S|X) < 1 iff S∉OLC(X) and "~S"∉OLC(X)
"A=>B"∈OLC(X) implies P(A|X)≤P(B|X)
"A<=>B"∈OLC(X) implies P(A|X)=P(B|X)
etc.
This is still insufficient to calculate an actual probability value for any uncertain statement. Additional principles are required. For example, the Consistency Desideratum of Jaynes: "equivalent states of knowledge must be represented by the same probability values".
Definition: two statements A and B are indistinguishable relative to X iff there exists an isomorphism between OLC(X∪{A}) and OLC(X∪{B}), which is identity on X, and which maps A to B.
[Isomorphism here is a 1-1 function f preserving all logical operations: f(A∨B)=f(A)∨f(B), f(~~A)=~~f(A), etc.]
Axiom 3: If A and B are indistinguishable relative to X, then P(A|X) = P(B|X).
Proposition: Let X be the set of statements representing my current mathematical knowledge, translated into F. Then the statements "millionth digit of PI is odd" and "millionth digit of PI is even" are indistinguishable relative to X.
Corollary: P(millionth digit of PI is odd | my current mathematical knowledge) = 1/2.
A Problem About Bargaining and Logical Uncertainty
Suppose you wake up as a paperclip maximizer. Omega says "I calculated the millionth digit of pi, and it's odd. If it had been even, I would have made the universe capable of producing either 1020 paperclips or 1010 staples, and given control of it to a staples maximizer. But since it was odd, I made the universe capable of producing 1010 paperclips or 1020 staples, and gave you control." You double check Omega's pi computation and your internal calculator gives the same answer.
Then a staples maximizer comes to you and says, "You should give me control of the universe, because before you knew the millionth digit of pi, you would have wanted to pre-commit to a deal where each of us would give the other control of the universe, since that gives you 1/2 probability of 1020 paperclips instead of 1/2 probability of 1010 paperclips."
Is the staples maximizer right? If so, the general principle seems to be that we should act as if we had precommited to a deal we would have made in ignorance of logical facts we actually possess. But how far are we supposed to push this? What deal would you have made if you didn't know that the first digit of pi was odd, or if you didn't know that 1+1=2?
On the other hand, suppose the staples maximizer is wrong. Does that mean you also shouldn't agree to exchange control of the universe before you knew the millionth digit of pi?
To make this more relevant to real life, consider two humans negotiating over the goal system of an AI they're jointly building. They have a lot of ignorance about the relevant logical facts, like how smart/powerful the AI will turn out to be and how efficient it will be in implementing each of their goals. They could negotiate a solution now in the form of a weighted average of their utility functions, but the weights they choose now will likely turn out to be "wrong" in full view of the relevant logical facts (e.g., the actual shape of the utility-possibility frontier). Or they could program their utility functions into the AI separately, and let the AI determine the weights later using some formal bargaining solution when it has more knowledge about the relevant logical facts. Which is the right thing to do? Or should they follow the staples maximizer's reasoning and bargain under the pretense that they know even less than they actually do?
Other Related Posts: Counterfactual Mugging and Logical Uncertainty, If you don't know the name of the game, just tell me what I mean to you
= 783df68a0f980790206b9ea87794c5b6)
Subscribe to RSS Feed
= f037147d6e6c911a85753b9abdedda8d)