You're looking at Less Wrong's discussion board. This includes all posts, including those that haven't been promoted to the front page yet. For more information, see About Less Wrong.

confirmation bias, thought experiment

1 Douglas_Reay 15 July 2016 12:19PM

Why do people end up with differing conclusions, given the same data?

 

Model

The information we get from others can not always be 100% relied upon.  Some of the people telling you stuff are liars, some are stupid, and some are incorrectly or insufficiently informed.  Even in the case where the person giving you an opinion is honest, smart and well informed, they are still unlikely to be able to tell you accurately how reliable their own opinion is.

So our brains use an 'unreliability' factor.  Automatically we take what others tell us, and discount it by a certain amount, depending on how 'unreliable' we estimate the source to be.

We also compare what people tell us about 'known reference points' in order to update our estimates of their unreliability.

If Sally tells me that vaccines cause AIDS and I am very much more certain that this is not the case, than I am of Sally's reliability, then instead of modifying my opinion about what causes AIDS, I modify my opinion of how reliable Sally is.

If I'm only slightly more certain, then I might take the step of asking Sally her reason for thinking that, and looking at her data.

If I have a higher opinion of Sally than my own knowledge of science, and I don't much care or am unaware of what other people think about the relationship between vaccines and AIDS, then I might just accept what she says, provisionally, without checking her data.

If I have a very much higher opinion of Sally, then not only will I believe her, but my opinion of her reliability will actually increase as I assess her as some mould-breaking genius who knows things that others do not.

 

Importantly, once we have altered our opinion, based upon input that we originally considered to be fairly reliable, we are very bad at reversing that alteration, if the input later turns out to be less reliable than we originally thought.  This is called the "continued influence effect", and we can use it to explain a number of things...

 

Experiment

Let us consider a thought experiment where two subjects, Peter and Paul, are exposed to input about a particular topic (such as "Which clothes washing powder is it best to use?") from multiple sources.   Both will be exposed to the same sources, 100 in favour of using the Persil brand of washing powder, and 100 in favour of using the Bold brand of washing powder, but in a different order.

If they both start off with no strong opinion in either direction, would we expect them to end the experiment with roughly the same opinion as each other, or can we manipulate their opinions into differing, just by changing the order in which the sources are presented?

Suppose, with Peter, we start him off with 10 of the Persil side's most reputable and well argued sources, to raise Peter's confidence in sources that support Persil.

We can then run another 30 much weaker pro-Persil sources past him, and he is likely to just nods and accept, without bothering to examine the validity of the arguments too closely, because he's already convinced.

At this point, when he'll consider a source to be a bit suspect, straight away, just because they don't support Persil, we introduce him to the pro-Bold side, starting with the least reliable - the ones that are obviously stupid or manipulative.   Further more, we don't let the pro-Bold side build up momentum.   For every three poor pro-Bold sources, we interrupt with a medium reliability pro-Persil source that's rehashing pro-Persil points that Peter is by now familiar with and agrees with.

After seeing the worst 30 pro-Bold sources, Peter now don't just consider them to be a bit suspect - he considers them to be down right deceptive and mentally categorises all such sources as not worth paying attention to.   Any further pro-Bold sources, even ones that seem to be impartial and well reasoned, he's going to put down as being fakes created by malicious researchers in the pay of an evil company.

We can now, safely, expose Peter to the medium-reliability pro-Bold sources and even the good ones, and will need less and less to refute them, just a reminder to Peter of 'which side he is on', because it is less about the data now, and more about identity - he doesn't see himself as the sort of person who'd support Bold.   He's not a sheep.  He's not taken in by the hoax.

Finally, after 80 pro-Persil sources and 90 pro-Bold sources, we have 10 excellent pro-Bold sources whose independence and science can't fairly be questioned.   But it is too late for them to have much effect, and there are 20 good pro-Persil sources to balance them.

For Paul we do the reverse, starting with pro-Bold sources and only later introducing the pro-Persil side once a known reference point has been established as an anchor.

 

Simulation

Obviously, things are rarely that clear cut in real life.   But people also don't often get data from both sides of an argument at a precisely equal rate.   They bump around randomly, and once one side accumulates some headway, it is unlikely to be reversed.

We could add a third subject, Mary, and consider what is likely to happen if she is exposed to a random succession of sources, each with a 50% chance of supporting one side or the other, and each with a random value on a scale of 1(poor) to 3 (good) for honesty, validity and strength of conclusion supported by the claimed data.

If we use mathematics to make some actual models of the points at which a source agreeing or disagreeing with you affects your estimate of their reliability, we can use a computer simulation of the above thought experiment to predict how different orders of presentation will affect people's final opinion, under each model.   Then we could compare that against real-world data, to see which model best matches reality.

 

Prediction

I think, if this experiment were carried out, one of the properties that would emerge naturally from it is the backfire effect:

" The backfire effect occurs when, in the face of contradictory evidence, established beliefs do not change but actually get stronger. The effect has been demonstrated experimentally in psychological tests, where subjects are given data that either reinforces or goes against their existing biases - and in most cases people can be shown to increase their confidence in their prior position regardless of the evidence they were faced with. "

 

Further Reading

https://en.wikipedia.org/wiki/Confirmation_bias
https://en.wikipedia.org/wiki/Attitude_polarization
http://www.dartmouth.edu/~nyhan/nyhan-reifler.pdf
http://www.tandfonline.com/doi/abs/10.1080/17470216008416717
http://lesswrong.com/lw/iw/positive_bias_look_into_the_dark/
http://www.tandfonline.com/doi/abs/10.1080/14640749508401422
http://rationalwiki.org/wiki/Backfire_effect

Approximating Solomonoff Induction

6 Houshalter 29 May 2015 12:23PM

Solomonoff Induction is a sort of mathematically ideal specification of machine learning. It works by trying every possible computer program and testing how likely they are to have produced the data. Then it weights them by their probability.

Obviously Solomonoff Induction is impossible to do in the real world. But it forms the basis of AIXI and other theoretical work in AI. It's a counterargument to the no free lunch theorem; that we don't care about the space of all possible datasets, but ones which are generated by some algorithm. It's even been proposed as a basis for a universal intelligence test.

Many people believe that trying to approximate Solomonoff Induction is the way forward in AI. And any machine learning algorithm that actually works, to some extent, must be an approximation of Solomonoff Induction.

But how do we go about trying to approximate true Solomonoff Induction? It's basically an impossible task. Even if you make restrictions to remove all the obvious problems like infinite loops/non-halting behavior. The space of possibilities is just too huge to reasonably search through. And it's discrete - you can't just flip a few bits in a program and find another similar program.

We can simplify the problem a great deal by searching through logic circuits. Some people disagree about whether logic circuits should be classified as Turing complete, but it's not really important. We still get the best property of Solomonoff Inducion; that it allows most interesting problems to be modelled much more naturally. In the worst case you have some overhead to specify the memory cells you need to emulate a Turing machine.

Logic circuits have some nicer properties compared to arbitrary computer programs, but they still are discrete and hard to do inference on. To fix this we can easily make continuous versions of logic circuits. Go back to analog. It's capable of doing all the same functions, but also working with real valued states instead of binary.

Instead of flipping between discrete states, we can slightly increase connections between circuits, and it will only slightly change the behavior. This is very nice, because we have algorithms like MCMC that can efficiently approximate true bayesian inference on continuous parameters.

And we are no longer restricted to boolean gates, we can use any function that takes real numbers. Like a function that takes a sum of all of it's inputs, or one that squishes a real number between 0 and 1.

We can also look at how much changing the input of a circuit slightly, changes the output. Then we can go to all the circuits that connect to it in the previous time step. And we can see how much changing each of their input changes their output, and therefore the output of the first logic gate.

And we can go to those gates' inputs, and so on, chaining it all the way through the whole circuit. Finding out how much a slight change to each connection will change the final output. This is called the gradient, and we can then do gradient descent. Basically change each parameter slightly in the direction that increases the output the way we want.

This is a very efficient optimization algorithm. With it we can rapidly find circuits that fit functions we want. Like predicting the price of a stock given the past history, or recognizing a number in an image, or something like that.

But this isn't quite Solomonoff Induction. Since we are finding the best single model, instead of testing the space of all possible models. This is important because essentially each model is like a hypothesis. There can be multiple hypotheses which also fit the data yet predict different things.

There are many tricks we can do to approximate this. For example, if you randomly turn off each gate with 50% probability and then optimize the whole circuit to deal with this. For some reason this somewhat approximates the results of true bayesian inference. You can also fit a distribution over each parameter, instead of a single value, and approximate bayesian inference that way.

Although I never said it, everything I've mentioned about continuous circuits is equivalent to Artificial Neural Networks. I've shown how they can be derived from first principles. My goal was to show that ANNs do approximate true Solomonoff Induction. I've found the Bayes-Structure.

It's worth mentioning that Solomonoff Induction has some problems. It's still an ideal way to do inference on data, it just has problems with self-reference. An AI based on SI might do bad things like believe in an afterlife, or replace it's reward signal with an artificial one (e.g. drugs.) It might not fully comprehend that it's just a computer, and exists inside the world that it is observing.

Interestingly, humans also have these problem to some degree.

Reposted from my blog here.

Limited agents need approximate induction

8 Manfred 24 April 2015 07:42AM

[This post borders on some well-trodden ground in information theory and machine learning, so ideas in this post have an above-average chance of having already been stated elsewhere, by professionals, better. EDIT: As it turns out, this is largely the case, under the subjects of the justifications for MML prediction and Kolmogorov-simple PAC-learning.]

I: Introduction

I am fascinated by methods of thinking that work for well-understood reasons - that follow the steps of a mathematically elegant dance. If one has infinite computing power the method of choice is something like Solomonoff induction, which is provably ideal in a certain way at predicting the world. But if you have limited computing power, the choreography is harder to find.

To do Solomonoff induction, you search through all Turing machine hypotheses to find the ones that exactly output your data so far, then use the weighted average of those perfect retrodictors to predict the next time step. So the naivest way to build an ideal limited agent is to merely search through lots of hypotheses (chosen from some simple set) rather than all of them, and only run each Turing machine for time less than some limit. At least it's guaranteed to work in the limit of large computing power, which ain't nothing.

Suppose then that we take this nice elegant algorithm for a general predictor, and we implement it on today's largest supercomputer, and we show it the stock market prices from the last 50 years to try to predict stocks and get very rich. What happens?

Bupkis happens, that's what. Our Solomonoff predictor tries a whole lot of Turing machines and then runs out of time before finding any useful hypotheses that can perfectly replicate 50 years of stock prices. This is because such useful hypotheses are very, very, very rare.

We might then turn to the burgeoning field of logical uncertainty, which has a major goal of handling intractable math problems in an elegant and timely manner. We are logically uncertain about what distribution Solomonoff induction will output, so can we just average over that logical uncertainty to get some expected stock prices?

The trouble with this is that current logical uncertainty methods rely on proofs that certain outputs are impossible or contradictory. For simple questions this can narrow down the answers, but for complicated problems it becomes intractable, replacing the hard problem of evaluating lots of Turing machines with the hard problem of searching through lots and lots of proofs about lots of Turing machines - and so again our predictor runs out of time before becoming useful.

In practice, the methods we've found to work don't look very much like Solomonoff induction. Successful methods don't take the data as-is, but instead throw some of it away: curve fitting and smoothing data, filtering out hard-to-understand signals as noise, and using predictive models that approximate reality imperfectly. The sorts of things that people trying to predict stocks are already doing. These methods are vital to improve computational tractability, but are difficult (to my knowledge) to fit into a framework as general as Solomonoff induction.

II: Rambling

Suppose that our AI builds a lot of models of the world, including lossy models. How should it decide which models are best to use for predicting the world? Ideally we'd like to make a tradeoff between the accuracy of the model, measured in the expected utility of how accurate you expect the model's predictions to be, and the cost of the time and energy used to make the prediction.

Once we know how to tell good models, the last piece would be for our agent to make the explore/exploit tradeoff between searching for better models and using its current best.

There are various techniques to estimate resource usage, but how does one estimate accuracy?

Here was my first thought: If you know how much information you're losing (e.g. by binning data), for discrete distributions this sets the Shannon information of the ideal value (given by Solomonoff prediction) given the predicted value. This uses the relationship between information in bits of data and Shannon information that determines how sharp your probability distribution is allowed to be.

But with no guarantees about the normality (or similar niceness properties) of the ideal value given the prediction, this isn't very helpful. The problem is highlighted by hurricane prediction. If hurricanes behaved nicely as we threw away information, weather models would just be small, high-entropy deviations from reality. Instead, hurricanes can change route greatly even with small differences in initial conditions.

The failure of the above approach can be explained in a very general way: it uses too little information about the model and the data, only the amount of information thrown away. To do better, our agent has to learn a lot from its training data - a subject that workers in AI have already been hard at work on. On the one hand, it's a great sign if we can eventually connect ideal agents to current successful algorithms. On the other, doing so elegantly seems like a hard problem.

To sum up in the blandest possible way: If we want to build successful predictors of the future with limited resources, they should use their experience to learn approximate models of the world.

The real trick, though, is going to be to set this on a solid foundation. What makes a successful method of picking models? As we lack access to the future (yet! Growth mindset!), we can't grade models based on their future predictions unless we descend to solipsism and grade models against models. Thus we're left with grading models based on how well they retrodict the data so far. Sound familiar? The foundation we want seems like an analogue to Solomonoff induction, one that works for known reasons but doesn't require perfection.

III:  An Example

Here's a paradigm that might or might not be a step in the right direction, but at least gestures at what I mean.

The first piece of the puzzle is that a model that gets proportion P of training bits wrong can be converted to a Solomonoff-accepted perfectly-precise model just by specifying the bits it gets wrong. Suppose we break the model output (with total length N) into chunks of size L, and prefix each chunk with the locations of the wrong bits in that chunk. Then the extra data required to rectify an approximate model is at most N/L·log(P·L)+N·P·log(L). Then the hypothesis where the model is right about the next bit is simpler than the hypothesis when it's wrong, because when the model is right you don't have to spend ~log(L) bits correcting it.

In this way, Solomonoff induction natively cares about some approximate models' predictions. There are some interesting details here that are outside the focus of this particular post. Does using the optimal chunk length lead to Solomonoff induction reflecting model accuracy correctly? What are some better schemes for rectifying models that handle things like models that output probabilities? The point is just that even if your model is wrong on fraction P of the training data, Solomonoff induction will still promote it as long as it's simpler than N-N/L·log(P·L)-N·P·log(L).

The second piece of the puzzle is that induction can be done over processed functions of observations, like smoothing the data or filtering difficult-to-predict parts (noise) out. If this processing increases the accuracy of models, we can use this to make high-accuracy models of functions the training data, and then use those models to predict the the processed future observations as above.

These two pieces allow an agent to use approximate models, and to throw away some of its information, and still have its predictions work for the same reason as Solomonoff induction. We can use this paradigm to interpret what an algorithm like curve fitting is doing - the fitted curve is a high-accuracy retrodiction of some smoothed function of the data, which therefore does a good job of predicting what that smoothed function will be in the future.

There are some issues here. If a model that you are using is not the simplest, it might have overfitting problems (though perhaps you can fix this just by throwing away more information than naively appears necessary) or systematic bias. More generally, we haven't explored how models get chosen; we've made the problem easier to brute force but we need to understand non-brute force search methods and what their foundations are. It's a useful habit to keep in mind what actually works for humans - as someone put it to me recently, "humans can make models they understand that work for reasons they understand."

Furthermore, this doesn't seem to capture reductionism well. If our agent learns some laws of physics and then is faced with a big complicated situation it needs to use a simplified model to make a prediction about, it should still in some sense "believe in the laws of physics," and not believe that this complicated situation violates the laws physics even if its current best model is independent of physics.

IV: Logical Uncertainty

It may be possible to relate this back to logical uncertainty - where by "this" I mean the general thesis of predicting the future by building models that are allowed to be imperfect, not the specific example in part III. Soares and Fallenstein use the example of a complex Rube Goldberg machine that deposits a ball into one of several chutes. Given the design of the machine and the laws of physics, suppose that one can in principle predict the output of this machine, but that the problem is much too hard for our computer to do. So rather than having a deterministic method that outputs the right answer, a "logical uncertainty method" in this problem is one that, with a reasonable amount of resources spent, takes in the description of the machine and the laws of physics, and gives a probability distribution over the machine's outputs.

Meanwhile, suppose that we take an approximately inductive predictor and somehow teach it the the laws of physics, then ask it to predict the machine. We'd like it to make predictions via some appropriately simplified folk model of physics. If this model gives a probability distribution over outcomes - like in the simple case of "if you flip this coin in this exact way, it has a 50% shot at landing heads" - doesn't that make it a logical uncertainty method? But note that the probability distribution returned by a single model is not actually the uncertainty introduced by replacing an ideal predictor with a resource-limited predictor. So any measurement of logical uncertainty has to factor in the uncertainty between models, not just the uncertainty within models.

Again, we're back to looking for some prediction method that weights models with some goodness metric more forgiving than just using perfectly-retrodicting Turing machines, and which outputs a probability distribution that includes model uncertainty. But can we apply this to mathematical questions, and not just Rube Goldberg machines? Is there some way to subtract away the machine and leave the math?

Suppose that our approximate predictor was fed math problems and solutions, and built simple, tractable programs to explain its observations. For easy math problems a successful model can just be a Turing machine that finds the right answer. As the math problems get more intractable, successful models will start to become logical uncertainty methods, like how we can't predict a large prime number exactly, but we can predict it's last digit is 1, 3, 7, or 9. Within this realm we have something like low-level reductionism, where even though we can't find a proof of the right answer, we still want to act as if mathematical proofs work and all else is ignorance, and this will help us make successful predictions.

Then we have complicated problems that seem to be beyond this realm, like P=NP. Humans certainly seem to have generated some strong opinions about P=NP without dependence on mathematical proofs narrowing down the options. It seems to such humans that the genuinely right procedure to follow is that, since we've searched long and hard for a fast algorithm for NP-complete problems without success, we should update in the direction that no such algorithm exists. In approximate-Solomonoff-speak, it's that P!=NP is consistent with a simple, tractable explanation for (a recognizable subset of) our observations, while P=NP is only consistent with more complicated tractable explanations. We could absolutely make a predictor that reasons this way - it just sets a few degrees of freedom. But is it the right way to reason?

For one thing, this seems like it's following Gaifman's proposed property of logical uncertainty, that seeing enough examples of something should convince you of it with probability 1 - which has been shown to be "too strong" in some sense (it assigns probability 0 to some true statements - though even this could be okay if those statements are infinitely dilute). Does the most straightforward implementation actually have the Gaifman condition, or not? (I'm sorry, ma'am. Your daughter has... the Gaifman condition.)

This inductive view of logical uncertainty lacks the consistent nature of many other approaches - if it works, it does so by changing approaches to suit the problem at hand. This is bad if you want your logical uncertainty methods to be based on a simple prior followed by some kind of updating procedure. But logical uncertainty is supposed to be practical, after all, and at least this is a simple meta-procedure.

V: Questions

Thanks for reading this post. In conclusion, here are some of my questions:

What's the role of Solomonoff induction in approximate induction? Is Solomonoff induction doing all of the work, or is it possible to make useful predictions using tractable hypotheses Solomonoff induction would exclude, or excluding intractable hypotheses Solomonoff induction would have to include?

Somehow we have to pick out models to promote to attention in the first place. What properties make a process for this good or bad? What methods for picking models can be shown to still lead to making useful predictions - and not merely in the limit of lots of computing time?

Are humans doing the right thing by making models they understand that work for reasons they understand? What's up with that reductionism problem anyhow?

Is it possible to formalize the predictor discussed in the context of logical uncertainty? Does it have to fulfill Gaifman's condition if it finds patterns in things like P!=NP?

A simple game that has no solution

10 James_Miller 20 July 2014 06:36PM

The following simple game has one solution that seems correct, but isn’t.  Can you figure out why?

 

The Game

 

Player One moves first.  He must pick A, B, or C.  If Player One picks A the game ends and Player Two does nothing.  If Player One picks B or C, Player Two will be told that Player One picked B or C, but will not be told which of these two strategies Player One picked, Player Two must then pick X or Y, and then the game ends.  The following shows the Players’ payoffs for each possible outcome.  Player One’s payoff is listed first.

 

A   3,0    [And Player Two never got to move.]

B,X 2,0

B,Y 2,2

C,X 0,1

C,Y 6,0

continue reading »

More and Less than Solomonoff Induction

4 Manfred 21 May 2014 04:55AM

I've been thinking about how to put induction with limited resources on a firmer foundation. This may just be retracing the steps of others, but that's okay with me. Mostly I just want to talk about these thoughts.

After a few points of introduction.

What's Solomonoff induction?

Suppose we're given some starting data, and asked to predict the future. Solomonoff induction predicts the future by combining the predictions of all programs that (1) output the starting data up until now, and (2) aren't the continuation of another such program. The predictions are combined according to a weighting that decreases exponentially as the length of the program increases.

Why is it a good idea?

The simplest answer is that we have a frequentist guarantee that if the "true program" generating our input has some length N (that is, if the observable universe is a big but finite-sized computer), then our predictions will only be wrong a limited number of times, and after that we'll predict the correctly every time.

A more bayesian answer would start with the information that our observations can be generated by some finite-sized program, and then derive that something like Solomonoff induction has to represent our true prior over generating programs - as the length gets bigger, our probability is required to go to zero at infinity, and an exponential is the maximum-entropy such curve. This is not a complete answer, but it at least makes the missing pieces more apparent.

Why won't it work with limited resources

The trouble with using Solomonoff induction in real life is that to pick out which programs output our data so far, we need to run every program - and if the program doesn't ever halt, we need to use a halting oracle to stop it or else we'll take infinite time.

Limited resources require us to only pick from a class of programs that is guaranteed to not run over the limit.

If we have limited time and no halting oracle, we can't check every program. Instead, we are only allowed to check programs drawn from a class of programs that we can check the output of in limited time. The simplest example would be to just check programs but not report the result if we go over some time limit, in which case the class we're picking from is "programs that don't go over the time limit."

This is an application of a general principle - when we impose resource constraints on a real agent, we want to be able to cash them out as properties of an abstract description of what the agent is doing. In this case, we can cash out limited time for our real agent as an inability for our abstract description to have any contributions from long programs.

This isn't really a Bayesian restriction - we don't actually know that the true program is within our search space. This also weakens our frequentist guarantee.

The problem of overfitting - real models allow for incorrect retrodictions.

If we just take Solomonoff induction and put restrictions on it, our predictions will still only come from hypotheses that exactly reproduce our starting data. This is a problem.

For example, if we have a ton of data, like we do, then finding even one program that exactly replicates our data is too hard, and our induction is useless.

We as humans have two main ways of dealing with this. The first is that we ignore most of our data and only try to predict the important parts. The second is that we don't require our models to perfectly retrodict our observations so far (retrodiction- it's like prediction, for the past!).

In fact. having imperfect retrodictions is such a good idea that we can have a problem called "overfitting."  Isn't that kinda odd? That it's better to use a simple model that is actually 100% known to be wrong, than to use a very complicated model that has gotten everything right so far.

This behavior makes more sense if we imagine that our data contains contributions both from the thing we're trying to model, and from stuff we want to ignore, in a way that's hard to separate.

What makes real models better than chance? The example of coarse-graining.

Is it even possible to know ahead of time that an imperfect model will make good predictions? It turns out the answer is yes.

The very simplest example is of just predicting the next bit based on which bit has been more common so far. If every pattern were equally likely this wouldn't work, but if we require the probability of a pattern to decrease with its complexity, we're more likely to see repetitions.

A more general example: an imperfect model is always better than chance if it is a likely coarse-graining of the true program. Coarse-graining is when you can use a simple program to predict a complicated one by only worrying about some of the properties of the complicated program. In the picture at right, a simple cellular automaton can predict whether a more complicated one will have the same or different densities of white and black in a region.

When a coarse-graining is exact, the coarse-grained properties follow exact rules all the time. Like in the picture at right, where the coarse-grained pattern "same different same" always evolves into "different different different," even though there are multiple states of the more complicated program that count as "different" and "same."

When a coarse-graining is inexact, only most of the states of the long program follow the coarse-grained rules of the short program. but it turns out that, given some ignorance about the exact rules or situation, this is also sufficient to predict the future better than chance.

Of course, when doing induction, we don't actually know the true program. Instead what happens is that we just find some simple program that fits our data reasonably well (according to some measurement of fit), and we go "well, what happened before is more likely to happen again, so this rule will help us predict the future."

Presumably we combine predictions according to some weighting that include both the length of the program and its goodness of fit.

Machine learning - probably approximately correct

Since this is a machine learning problem, there are already solutions to similar problems, one of which is probably approximately correct learning. The basic idea is that if you have some uniformly-sampled training data, and a hypothesis space you can completely search, then you can give some probabilistic guarantees about how good the hypothesis is that best fits the training data. A "hypothesis," here, is a classification of members of data-space into different categories.

The more training data you have, the closer (where "close" can be measured as a chi-squared-like error) the best-fitting hypothesis is to the actually-best hypothesis. If your hypothesis space doesn't contain the true hypothesis, then that's okay - you can still guarantee that the best-fitting hypothesis gets close to the best hypothesis in your hypothesis space. The probability that a far-away hypothesis would masquerade as a hypothesis within the small "successful" region gets smaller and smaller as the training data increases.

There is a straightforward extension to cases where your data contains contributions from both "things we want to model" and "things we want to ignore." Rather than finding the hypothesis that fits the training data best, we want to find the hypothesis for the "stuff we want to model" that has the highest probability of producing our observed data, after we've added some noise, drawn from some "noise distribution" that encapsulates the basic information about the stuff we want to ignore.

There are certainly some issues comparing this to Solomonoff induction, like the fact that our training data is randomly sampled rather than a time series, but I do like this paradigm a bit better than looking for a guarantee that we'll find the one true answer in finite time.

Bayes' theorem

If there's an easy way to combine machine learning with Solomonoff induction, it's Bayes' theorem. The machine learning was focused on driving down P(training data | chance), and picking a hypothesis with a high P(training data | hypothesis, noise model). We might want to Use Bayes' rule to say something like P(hypothesis | training data, noise model) = P(hypothesis | complexity prior) * P(training data | hypothesis, noise model) / P(training data | noise model).

 

Anyhow - what are the obvious things I've missed? :)

The Logic of the Hypothesis Test: A Steel Man

5 Matt_Simpson 21 February 2013 06:19AM

Related to: Beyond Bayesians and Frequentists

Update: This comment by Cyan clearly explains the mistake I made - I forgot that the ordering of the hypothesis space is important is necessary for hypothesis testing to work. I'm not entirely convinced that NHST can't be recast in some "thin" theory of induction that may well change the details of the actual test, but I have no idea how to formalize this notion of a "thin" theory and most of the commenters either 1) misunderstood my aim (my fault, not theirs) or 2) don't think it can be formalized.

I'm teaching an econometrics course this semester and one of the things I'm trying to do is make sure that my students actually understand the logic of the hypothesis test. You can motivate it in terms of controlling false positives but that sort of interpretation doesn't seem to be generally applicable. Another motivation is a simple deductive syllogism with a small but very important inductive component. I'm borrowing the idea from a something we discussed in a course I had with Mark Kaiser - he called it the "nested syllogism of experimentation." I think it applies equally well to most or even all hypothesis tests. It goes something like this:

1. Either the null hypothesis or the alternative hypothesis is true.

2. If the null hypothesis is true, then the data has a certain probability distribution.

3. Under this distribution, our sample is extremely unlikely.

4. Therefore under the null hypothesis, our sample is extremely unlikely.

5. Therefore the null hypothesis is false.

6. Therefore the alternative hypothesis is true.

An example looks like this:

Suppose we have a random sample from a population with a normal distribution that has an unknown mean and unknown variance . Then:

1. Either or where is some constant.

2. Construct the test statistic where is the sample size, is the sample mean, and is the sample standard deviation.

3. Under the null hypothesis, has a distribution with degrees of freedom.

4. is really small under the null hypothesis (e.g. less than 0.05).

5. Therefore the null hypothesis is false.

6. Therefore the alternative hypothesis is true.

What's interesting to me about this process is that it almost tries to avoid induction altogether. Only the move from step 4 to 5 seems anything like an inductive argument. The rest is purely deductive - though admittedly it takes a couple premises in order to quantify just how likely our sample was and that surely has something to do with induction. But it's still a bit like solving the problem of induction by sweeping it under the rug then putting a big heavy deduction table on top so no one notices the lumps underneath. 

This sounds like it's a criticism, but actually I think it might be a virtue to minimize the amount of induction in your argument. Suppose you're really uncertain about how to handle induction. Maybe you see a lot of plausible sounding approaches, but you can poke holes in all of them. So instead of trying to actually solve the problem of induction, you set out to come up with a process which is robust to alternative views of induction. Ideally, if one or another theory of induction turns out to be correct, you'd like it to do the least damage possible to any specific inductive inferences you've made. One way to do this is to avoid induction as much as possible so that you prevent "inductive contamination" spreading to everything you believe. 

That's exactly what hypothesis testing seems to do. You start with a set of premises and keep deriving logical conclusions from them until you're forced to say "this seems really unlikely if a certain hypothesis is true, so we'll assume that the hypothesis is false" in order to get any further. Then you just keep on deriving logical conclusions with your new premise. Bayesians start yelling about the base rate fallacy in the inductive step, but they're presupposing their own theory of induction. If you're trying to be robust to inductive theories, why should you listen to a Bayesian instead of anyone else?

Now does hypothesis testing actually accomplish induction that is robust to philosophical views of induction? Well, I don't know - I'm really just spitballing here. But it does seem to be a useful steel man.

 

Modifying Universal Intelligence Measure

2 Alex_Altair 18 September 2012 11:44PM

In 2007, Legg and Hutter wrote a paper using the AIXI model to define a measure of intelligence. It's pretty great, but I can think of some directions of improvement.

  • Reinforcement learning. I think this term and formalism are historically from much simpler agent models which actually depended on being reinforced to learn. In its present form (Hutter 2005 section 4.1) it seems arbitrarily general, but it still feels kinda gross to me. Can we formalize AIXI and the intelligence measure in terms of utility functions, instead? And perhaps prove them equivalent?
  • Choice of Horizon. AIXI discounts the future by requiring that total future reward is bounded, and therefore so does the intelligence measure. This seems to me like a constraint that does not reflect reality, and possibly an infinitely important one. How could we remove this requirement? (Much discussion on the "Choice of the Horizon" in Hutter 2005 section 5.7).
  • Unknown utility function. When we reformulate it in terms of utility functions, let's make sure we can measure its intelligence/optimization power without having to know its utility function. Perhaps by using an average of utility functions weighted by their K-complexity.
  • AI orientation. Finally, and least importantly, it tests agents across all possible programs, even those which are known to be inconsistent with our universe. This might okay if your agent is a playing arbitrary games on a computer, but if you are trying to determine how powerful an agent will be in this universe, you probably want to replace the Solomonoff prior with the posterior resulting from updating the Solomonoff prior with data from our universe.

Any thought or research on this by others? I imagine lots of discussion has occurred over these topics; any referencing would be appreciated.

Request: Induction, Proof, Experimental mathematics

-4 snarles 18 December 2011 01:29PM

Suppose we made an algorithm capable of forming empirical conjectures for mathematics.  How might such an algorithm discover the principle of mathematical proof?

 

I would like to see an article on the relevant philosophy and mathematical logic background for this problem.  Since I currently lack the inclination to research and write up such an article, I instead made this post.

Intuitive Explanation of Solomonoff Induction

13 lukeprog 01 December 2011 06:56AM

Update: Alex Altair has now finished this tutorial!

 

A while ago I began writing a Solomonoff Induction tutorial, but now it's one of those Less Wrong articles I probably will never have time to write.

But, maybe a few of you can take what I've started, team up, and finish the thing. I'm basically just following along with the Solomonoff induction tutorials from Shane Legg (2008, 2004), but making them simpler and readable by a wider audience. I think this would be a valuable thing for Less Wrong to have, and one that would draw even more traffic to our community, but I don't have time to write it myself anymore! 

Who wants to be a hero and finish this thing? The original markdown source is available here.

 

 

 

This is a tutorial for beginners. Those familiar with Solomonoff induction may prefer to read Rathmanner & Hutter (2011).

People disagree about things. Some say that vaccines cause autism; others say they don't. Some say everything is physical; others believe in a god. Some say that complicated financial derivatives are essential to a modern competitive economy; others think a nation's economy will do better without them. It's hard to know what is true.

And it's hard to know how to figure out what is true. Some think you should assume the things you are most certain about and then deduce all other beliefs from your original beliefs. Others think you should accept at face value the most intuitive explanations of your personal experience. Others think you should generally agree with the scientific consensus until it is disproven.

Wouldn't it be nice if finding out what's true was like baking a cake? What if there was a recipe for finding out what is true? All you'd have to do is follow the written instruction exactly, and after the last instruction you'd inevitably find yourself with some sweet, tasty truth!

As it turns out, there is an exact recipe for finding truth. It was discovered in the 1960s.

The problem is that you don't have time to follow the recipe. To find the truth to even a simple question using this recipe would require you to follow one step after another until long after the heat death of the universe, and you can't do that.

But we can find shortcuts. Suppose you know the exact recipe for baking a cake requires you to count out one molecule of H2O at a time until you have exactly 0.5 cups of water. If you did that, you might not finish before the heat death of the universe. But you could approximate that part of the recipe by measuring out something very close to 0.5 cups of water, and you'd probably still end up with a pretty good cake.

Similarly, once we know the exact recipe for finding truth, we can try to approximate it in a way that allows us to finish all the steps sometime before the heat death of the universe.

This tutorial explains the exact recipe for finding truth, Solomonoff induction, and also explains some attempts to approximate it in ways that allow us to figure out now what is (probably) true. Fear not; I shall not assume you know anything beyond grade-school math.

Like my Crash Course in the Neuroscience of Human Motivation, this tutorial is long. You may not have time for it; that's fine. But if you do read it, I recommend you read it in sections.

Contents:

  1. Algorithms
  2. Induction
  3. Occam’s Razor
  4. Updating Beliefs
  5. The Problem of Priors
  6. Binary Sequence Prediction
  7. Solomonoff and Kolmogorov
  8. Solomonoff's Lightsaber
  9. Approximations
  10. Philosophical Implications

Algorithms

At an early age you learned a set of precisely-defined steps — a 'recipe' or, more formally, an algorithm — that you could use to find the largest number in an unsorted list of numbers like this:

21, 18, 4, 19, 55, 12, 30

The algorithm you learned probably looked something like this:

  1. Note the leftmost item as the largest you've seen in this list so far. If this is the only item on the list, output it as the largest number on the list. Otherwise, proceed to step 2.
  2. Look at the next item. If it is larger than the largest item noted so far, note it as the largest you've seen in this list so far. Proceed to step 3.
  3. If you have not reached the end of the list, return to step 2. Otherwise, output the last noted item as the largest number in the list.

Other algorithms could be used to solve the same problem. For example, you could work your way from right to left instead of from left to right. But the point is that if you follow this algorithm exactly, and you have enough time to complete the task, you can't fail to solve the problem. You can't get confused about what one of the steps means or what the next step is. Every instruction tells you exactly what to do next, all the way through to the answer.

You probably learned other algorithms, too, like how to find the greatest common divisor of any two integers (see image on right).

But not just any set of instructions is a precisely-defined algorithm. Sometimes, instructions are unclear or incomplete. Consider the following instructions based on an article about the scientific method:

  1. Make an observation.
  2. Form a hypothesis that explains the observation.
  3. Conduct an experiment that will test the hypothesis.
  4. If the experimental results disconfirm the hypothesis, return to step #2 and form a hypothesis not yet used. If the experimental results confirm the hypothesis, provisionally accept the hypothesis.

This is not an algorithm.

First, many of the terms are not clearly defined. What counts as an observation? What counts as a hypothesis? What would a hypothesis need to be like in order to ‘explain’ the observation? What counts as an experiment that will ‘test’ the hypothesis? What does it mean for experimental results to ‘confirm’ or ‘disconfirm’ a hypothesis?

Second, the instructions may be incomplete. What do we do if we reach step 4 and the experimental results neither ‘confirm’ nor ‘disconfirm’ the hypothesis under consideration, but instead are in some sense ‘neutral’ toward the hypothesis? These instructions don’t tell us what to do in that case.

An algorithm is a well-defined procedure that takes some value or values as input and, after a finite series of steps, generates some value or values as output.

For example, the ‘find the largest number’ algorithm above could take the input {21, 18, 4, 19, 55, 12, 30} and would, after 13 steps, produce the following output: {55}. Or it could take the input {34} and, after 1 step, produce the output: {34}.

What we’re looking for is a precise algorithm that will produce truth as its output.

Induction

Whether we are a detective trying to catch a thief, a scientist trying to discover a new physical law, or a businessman attempting to understand a recent change in demand, we are all in the process of collecting information and trying to infer the underlying causes.a

The problem of inference is this: We have a collection of observations (data), and we have a collection of hypotheses about the underlying causes of those observations. We’d like to know which hypothesis is correct, so we can use that knowledge to predict future events.

Suppose your data concern a large set of stock market price changes and other events in the world. You’d like to know the processes responsible for the stock market price changes, because then you can predict what the stock market will do in the future, and make some money.

Or, suppose you are a parent. You come home from work to find a chair propped against the refrigerator, with the cookie jar atop the fridge a bit emptier than before. One hypothesis that leaps to mind is that your young daughter used the chair to reach the cookies. However, many other hypothesis explain the data. Perhaps a very short thief broke into your home and stole some cookies. Perhaps your daughter put the chair in front of the fridge because the fridge door is broken and no longer stays shut, and you forgot that your friend ate a few cookies when he visited last night. Perhaps you moved the chair and ate the cookies yourself while sleepwalking the night before.

All these hypotheses are possible, but intuitively it seems like some hypotheses are more likely than others. If you’ve seen your daughter access the cookies this way before but have never been burgled, then the ‘daughter hypothesis’ seems more plausible. If some expensive things from your bedroom and living room are missing and there is hateful graffiti on your door at the eye level of a very short person, then then ‘short thief’ hypothesis seems more plausible than before. If you suddenly remember that your friend ate a few cookies and broke the fridge door last night, the ‘broken fridge door’ hypothesis gains credibility. If you’ve never been burgled and your daughter is out of town and you have a habit of moving and eating things while sleepwalking, the ‘sleepwalking’ hypothesis is less bizarre.

But the weight you give to each hypothesis depends greatly on your prior knowledge. What if you had just been hit on the head and lost all past memories, and for some reason the most urgent thing you wanted to do was to solve the mystery of the chair in front of the fridge door? Then how would weigh the likelihood of the available hypotheses?

Occam’s Razor

Consider a different inductive problem. A computer program outputs the following sequence of numbers:

1, 3, 5, 7

Which number comes next? If you guess correctly, you’ll win $500.

In order to predict the next number in the sequence, you make a hypothesis about the process the computer is using to generate these numbers. One obvious hypothesis is that it is simply listing all the odd numbers in ascending order from 1. If that’s true, you should guess 9 to win the $500.

But perhaps the computer is using a different algorithm to generate the numbers. Suppose that n is the step in the sequence, so that n=1 when it generated ‘1’, n=2 when it generated ‘3’, and so on. Maybe the computer used this equation to calculate each number in the sequence:

2n − 1 + (n − 1)(n − 2)(n − 3)(n − 4)

If so, the next number in the sequence will be 33. (Go ahead, check the calculations.) But doesn’t the first hypothesis seem more likely?

The principle behind this intuition, which goes back to William of Occam, could be stated:

Among all hypotheses consistent with the observations, the simplest is the most likely.

The principle is called Occam’s razor because it ‘shaves away’ unnecessary assumptions.

For example, think about the case of the missing cookies again. In most cases, the ‘daughter’ hypothesis seems to make fewer unnecessary assumptions than the ‘short thief’ hypothesis does. You already know you have a daughter that likes cookies and knows how to move chairs to reach cookies. But in order for the short thief hypothesis to be plausible, you have to assume that (1) a thief found a way to break into your home, that (2) the thief wanted cookies more than anything else from your home, that (3) the thief was, unusually, too short to reach the top of the fridge without the help of a chair, and many other unnecessary assumptions.

Occam’s razor sounds right, but can it be made more precise, and can it be justified? We will return to those questions later. For now, let us consider another important part of induction, that of updating our beliefs in response to new evidence.

Updating Beliefs

You’re a soldier in combat, crouching in a trench. You know for sure there is just one enemy soldier left on the battlefield, about 400 yards away. You also know that if the remaining enemy is a regular army troop, there’s only a small chance he could hit you with one shot from that distance. But if the remaining enemy is a sniper, then there’s a very good chance he can hit you with one shot from that distance. But snipers are rare, so it’s probably just a regular army troop.

You peek your head out of the trench, trying to get a better look.

Bam! A bullet glance off your helmet and you duck down again.

“Okay,” you think. “I know snipers are rare, but that guy just hit me with a bullet from 400 yards away. I suppose it might still be a regular army troop, but there’s a seriously good chance it’s a sniper, since he hit me from that far away.”

After another minute, you dare to take another look, and peek your head out of the trench again.

Bam! Another bullet glances off your helmet! You duck down again.

“Damn,” you think. “It’s definitely a sniper. No matter how rare snipers are, there’s no way that guy just hit me twice in a row from that distance if he’s a regular army troop. He’s gotta be a sniper. I’d better call for support.”

This is an example of updating beliefs in response to evidence, and we do it all the time.

You start with some prior beliefs, and all of them are uncertain. You are 99.99% certain the Earth revolves around the sun, 90% confident your best friend will attend your birthday party, and 40% sure that the song on the radio you’re listening to was played by The Turtles.

Then, you encounter new evidence — new observations — and update your beliefs in response.

Suppose you start out 85% confident the one remaining enemy soldier is not a sniper, leaving only 15% credence to the hypothesis that he is a sniper. But then, a bullet glances off your helmet — an event far more likely if the enemy soldier is a sniper than if he is not. So now you’re only 40% confident he’s not a sniper, and 60% confident he is. Another bullet glances off your helmet, and you update again. Now you’re only 2% confident he’s not a sniper, and 98% confident he is a sniper.

Now, I’m about to show you some math, but don’t run away. The math isn’t supposed to make sense if you haven’t studied it. Don’t worry; I’m going to explain it all.

There’s a theorem in probability theory that tells you how likely one observation is given some other observations. It’s called Bayes’ Theorem.

At this point you may want to take a break and read either tutorial #1, tutorial #2, tutorial #3, or tutorial #4 on Bayes’ Theorem. I’ll only say a little bit more about Bayes’ Theorem in this tutorial.

The short form of Bayes’ Theorem looks like this:

Let’s unpack this. The H refers to some hypothesis, and the E responds to some evidence. p(x) refers to the probability of x. The pipe symbol | means ‘given’ or ‘assuming’. So, Bayes’ Theorem reads:

[The probability of hypothesis H given evidence E] is equal to [the probability of evidence E given hypothesis H] times [the probability of hypothesis H] divided by [the probability of evidence E].

You can see how this tells us what we’d like to know. We want to know the probability of some hypothesis — some model of the world that, if correct, will allow us to successfully predict the future — given the evidence that we have. And that’s what Bayes’ Theorem tells us. Now we just plug the numbers in, and solve for p(H|E).

Of course, it’s not easy to “just plug the numbers in.” You aren’t an all-knowing god. You don’t know exactly how likely it is that the enemy soldier would hit your helmet if he’s a sniper, compared to how likely that is if he’s not a sniper. But you can do your best. With enough evidence, it will become overwhelmingly clear which hypothesis is correct.

At this point, I'll let the tutorials on Bayes' Theorem above do the heavy lifting on how to update beliefs. Let me get back to the part of induction those tutorials don't explain: the choice of priors.

The Problem of Priors

In the example above where you're a soldier in combat, I gave you your starting probabilities: 85% confidence that the enemy soldier was a sniper, and 15% confidence he was not. But what if you don't know your "priors"? What then?

When using Bayes' Theorem to calculate your probabilities, your choice of prior can influence your results greatly. But how can we determine the probability of a hypothesis before seeing any data? What does that even mean?

Legg (2008) explains:

Bayesians respond to this in a number of ways. Firstly, they point out that the problem is generally small, in the sense that with a reasonable prior and quantity of data, the posterior distribution P(h|D) depends almost entirely on D rather than the chosen prior P(h). In fact on any sizable data set, not only does the choice of prior not especially matter, but Bayesian and classical statistical methods typically produce similar results, as one would expect. It is only with relatively small data sets or complex models that the choice of prior becomes an issue.

If classical statistical methods could avoid the problem of prior bias when dealing with small data sets then this would be a significant argument in their favour. However Bayesians argue that all systems of inductive inference that obey some basic consistency principles define, either explicitly or implicitly, a prior distribution over hypotheses. Thus, methods from classical statistics make assumptions that are in effect equivalent to defining a prior. The difference is that in Bayesian statistics these assumptions take the form of an explicit prior distribution. In other words, it is not that prior bias in Bayesian statistics is necessarily any better or worse than in classical statistics, it is simply more transparent.

But this doesn't solve our problem. It merely points out that other statistical techniques don't fare any better.

What we'd like is to reduce the potential for abusing one's opportunity to select priors, and to use agreed-upon priors when possible. Thus, many standard "prior distributions" have been developed. Generally, they distribute probability equally across hypotheses.

But to solve the problem of priors once and for all, we'd like to have an acceptable, universal prior distribution, so that there's no fuzziness about the process of induction. We need a recipe, and algorithm, for finding out the truth. And there can't be any fuzziness in an algorithm.

Binary Sequence Prediction

[intro to binary sequence prediction]

Solomonoff and Kolmogorov

[summarize the contributions of Solomonoff and Kolmogorov]

Solomonoff Lightsaber

[explain the result: Solomonoff Induction]

Approximations

[survey a few of the approximations of Solomonoff Induction in use today]

Philosophical Implications

[survey of some philosophical implications of unviersal induction]

Notes

a This quote and some of the examples to follow are from Legg (2008).

References

Legg (2008). Machine Superintelligence. PhD thesis, Department of Informatics, University of Lugano.

Rathmanner & Hutter (2011). A philosophical treatise of universal induction.

The Doomsday Argument and Self-Sampling Assumption are wrong, but induction is alive and well.

7 RonPisaturo 14 August 2011 06:15PM

Since the Doomsday Argument still is discussed often on Less Wrong, I would like to call attention to my new, short, self-published e-book, The Longevity Argument, which is a much-revised and much-expanded work that began with my paper, “Past Longevity as Evidence for the Future,” in the January 2009 issue of Philosophy of Science. In my judgment, my work provides a definitive refutation of the Doomsday Argument, identifying two elementary errors in the argument.

The first elementary error is that the Doomsday Argument conflates total duration and future duration. Although the Doomsday Argument’s Bayesian formalism is stated in terms of total duration, all attempted real-life applications of the argument—with one exception, a derivation by Gott (1994, 108) of his delta t argument introduced in Gott 1993—actually plug in prior probabilities for future duration.

For example, Leslie (1996, 198–200) presents a Bayesian equation stated in terms of prior probabilities of total instances. But then Leslie (1996, 201–203) plugs into this equation prior probabilities for future instances: humans being born for the next 150 years vs. humans being born for the next many thousands of centuries. Bostrom (2002, 94–96) recounts Leslie’s general argument in terms of births instead of durations of time, using 200 billion total births vs. 200 trillion total births. (A closer parallel to Leslie 1996 would be 80 billion total births vs. 80 trillion total births.) But the error persists: the actual prior probabilities that are plugged in to Leslie’s Bayesian equation, based on all of the real-life risks actually considered by Leslie (1996, 1–153) and Bostrom (2002, 95), are of future births, not total births.

In other words, Leslie supposes a prior probability of doom within the next 150 years or roughly 20 billion births. (The prior probabilities supposed in the Doomsday Argument are prior to knowledge of one’s birth rank.) Leslie then assumes that—since there have already been, say, 60 billion births—this prior probability is equal to the prior probability that the total number of births will have been 80 billion births. However, in the absence of knowledge of one’s birth rank, this assumption is absurd.

The second elementary error is the Doomsday Argument’s use of the Self-Sampling Assumption, which is contradicted by the prior information in all attempts at real-life applications in the literature.

For example, many risks to the human race—including most if not all the real-life risks discussed by Leslie and Bostrom—can reasonably be described mathematically as Poisson processes. Then the Self-Sampling Assumption implies that the risk per birth—the ‘lambda’ in the Poisson formula—is constant throughout the duration of the human race. But Leslie (1996, 202) also supposes that if mankind survives past the next century and a half, then the risk per birth will drop dramatically, because mankind will begin spreading throughout the galaxy. (The Doomsday Argument implicitly relies on such a drop in lambda—and the resultant bifurcation of risk into ‘doom soon’ and ‘doom very much later’—for the argument’s significant claims.) In other words, Leslie’s prior probabilities of doom are mathematical contradictions of the Self-Sampling Assumption that Leslie and Bostrom invoke in the Doomsday Argument.

In my book, I perform Bayesian analyses that correct these errors. These analyses demonstrate that gaining more knowledge of the past can indeed update one’s assessment of the future; but this updating is consistent with common sense instead of with the Doomsday Argument. In short, while refuting the Doomsday Argument, I vindicate induction.

 The price of my e-book is $4. However, professional scholars and educators are invited to email me to request a complimentary evaluation copy (not for further distribution, of course). I extend the same offer to the first ten Less Wrong members with a Karma Score of 100 or greater who email me. (I may send to more than ten, or to some with lower Karma Scores, but I don’t want to make an open-ended commitment.)

For an abstract of the e-book, see this entry on PhilPapers. For a non-technical introduction, see here on my blog.

The e-book covers much more than the Doomsday Argument; here is a one-sentence summary: The Doomsday Argument, Self-Sampling Assumption, and Self-Indication Assumption are wrong; Gott’s delta t argument (Gott 1993, 315–316; 1994) underestimates longevity, providing lower bounds on probabilities of longevity, and is equivalent to Laplace’s Rule of Succession (Laplace 1812, xii–xiii; [1825] 1995, 10–11); but Non-Parametric Predictive Inference based on the work of Hill (1968, 1988, 1993) and Coolen (1998, 2006) forms the basis of a calculus of induction.

References

Bostrom, Nick (2002), Anthropic Bias: Observation Selection Effects in Science and Philosophy. New York & London: Routledge.

Coolen, Frank P.A. (1998), “Low Structure Imprecise Predictive Inference For Bayes' Problem”, Statistics & Probability Letters 36: 349–357.

——— (2006), On Probabilistic Safety Assessment in the Case of Zero Failures. Journal of Risk and Reliability 220 (Proceedings of the Institute of Mechanical Engineers O): 105–114.

Gott, J. Richard III (1993), “Implications of the Copernican Principle for our Future Prospects”, Nature 363: 315–319.

 ——— (1994), “Future Prospects Discussed”, Nature 368: 108.

Hill, Bruce M. (1968), “Posterior Distribution of Percentiles: Bayes' Theorem for Sampling from a Population”, Journal of the American Statistical Association 63: 677–691.

——— (1988), “De Finetti’s Theorem, Induction, and A(n) or Bayesian Nonparametric Predictive Inference”, Bayesian Statistics 3, Edited by Bernardo J.M., DeGroot, M.H., Lindley, D.V. & Smith A.F.M. Oxford: Oxford University Press: 211–241.

——— (1993), “Parametric Models for An: Splitting Processes and Mixtures”, Journal of the Royal Statistical Society B 55: 423–433.

Laplace, Pierre-Simon (1812), Theorie Analytique des Probabilités. Paris: Courcier.

——— ([1825] 1995), Philosophical Essay on Probabilities. Translated by Andrew I. Dale. Originally published as Essai philosophique sur les probabilite´s (Paris: Bachelier). New York: Springer-Verlag.

Leslie, John (1996), The End of the World: The Science and Ethics of Human Extinction. London: Routledge.

 

Here is and Addendum addressing the question by Manfred to elaborate on my statement, "the Self-Sampling Assumption implies that the risk per birth—the ‘lambda’ in the Poisson formula—is constant throughout the duration of the human race."

To avoid integrals, let me discuss a binomial process, which is a discrete version of a Poisson process.

Suppose you are studying a species from another planet. Suppose the only main risk to the species is an asteroid hitting the planet. Suppose the risk of an asteroid hit in a year is q. Given that the present moment is within a window (from the past through to the future) of N years without an asteroid hit, what is the probability P(Y) that the present moment is within year Y of that window?

P(Y) = [q(1 – q)Y(1 – q)N–Yq]/B, where B is the probability that the window is N years.

P(Y) = [q2(1 – q)N]/B.

Since Y does not appear in this formula, it is clear that P(Y) is constant for all Y. That is, since q is constant, P(Y) is uniform in [1, N], and P(Y) = 1/N. This result is equivalent to the Self-Sampling Assumption with units of time (years) as the reference class.

But suppose that the risk of an asteroid hit in the past was q, but the species has just built an asteroid destroyer, and the risk in the future is r where r << q. Then

P(Y) = [q(1 – q)Y(1 – r)N–Yr]/B.

[8/16/2011: Corrected the final 'r' in the above equation from a 'q'.] Y does appear in this formula. Clearly, the greater the value of Y, the smaller the value of P(Y). That is, contrary to the Self-Sampling Assumption, it is very likely that the present moment is in the early part of the window of N years.

The above argument demonstrates why the choice of ‘reference class’ matters. If the risk is constant per unit time, then the correct reference class is units of time. If the risk is constant per birth, then the correct reference class is births. Suppose birth rates increase exponentially. Then constant risk per unit time precludes constant risk per birth, and vice versa. The two reference classes cannot both be right. More generally, if the prior information stipulates that risk per birth is not constant, then the Self-Sampling Assumption using a reference class of births does not apply.

This passage is from my book (p. 59):

Here is a more philosophical and less mathematical perspective on the same point. SSA [the Self-Sampling Assumption] rests on the premise that all indexical information has been removed from the prior information. One's birth rank, which applies only to oneself, is such indexical information that is removed from the prior information before SSA is invoked. But even in the absence of birth rank, the prior information may—and usually does—include information that is indexical. For example, if the prior information states that λpast is large and λfuture is small, then the prior information is stating something that is true only of the present—namely, that the present is when λ changes abruptly from a large value to a small value. It turns out that this indexical information contradicts the mathematical conclusion of SSA. Moreover, this indexical information cannot be removed without consequence from the prior information, because the prior probabilities rest on it.

Perhaps the statement that Manfred quotes would have been clearer if I had instead written the following: The Self-Sampling Assumption implies that the risk per birth—the ‘lambda’ in the Poisson formula—is constant throughout the past and present.

 

[Link] The Bayesian argument against induction.

4 Peterdjones 18 July 2011 09:52PM

In 1983 Karl Popper and David Miller published an argument to the effect that probability theory could be used to disprove induction. Popper had long been an opponent of induction. Since probability theory in general, and Bayes in particular is often seen as rescuing induction from the standard objections, the argument is significant.

It is being discussed over at the Critical Rationalism site.

Induction, Deduction, and the Collatz Conjecture: the Decidedly Undecidable Propositions.

3 potato 15 June 2011 03:21PM

The question I want to ask is "is there a proof for every statement about the natural numbers that seems to be inductively verifiable, but is a general recursive decision problem, provided your sample is large enough?" Simply, if we define a binary property 'P' that can only be tested by algorithms that might not halt, and show, say by computation, that every natural number up to some arbitrarily large number 'N' has the property P, does that mean that there must be a generalized deductive proof that for all natural numbers P holds?

How big can N get before there must be a deductive proof that P holds for all natural numbers? What if N was larger than Graham's number[1]?  What if we showed thousands of years from now--using computers that are unimaginably strong by today's standards--that every number less than or equal to the number you get when you put Graham's number as both of the inputs to the Ackermann function[2] has a property P. But they still have no generalized deductive proof that all numbers have P, is that enough for these future number theorists to state that it is a scientific fact that all natural numbers have the property P? 

It may seem impossible for such a disagreement to come up between induction and deduction, but we are already at a similar (though admittedly less dramatic) impasse. The disagreement centers around the Collatz conjecture, which states that every number is a Collatz number. To test if some number n is a Collatz number,  if n is even, divide it by 2 to get n / 2, if n is odd multiply it by 3 and add 1 to obtain 3n + 1, and repeat this process with the new number thus obtained, indefinitely, if you eventually reach the cycle 4, 2, 1, then n is a Collatz number. Every number up to 20 × 2^58 has been shown to be a Collatz number[3], and it has been shown that there are no cycles with a smaller period than 35400[4], yet there is still no deductive proof of the Collatz conjecture[5]. In fact, one method of generalized proof has already been shown to be undecidable[6]. If it was shown that no general proof could ever be found, but all numbers up to the unimaginably large ones described above were shown to be Collatz numbers, what epistemological status should we grant the conjecture?

It has been made clear by the history of mathematics that a lack of small counterexamples to a conjecture of the form "for all natural numbers P(n) = 1" where 'P' is a binary function (let's call this an "inductive-conjecture"), is not at all a proof of that conjecture. There have been many inductive-conjectures with that sort of evidence in their favor that have later been shown not to be theorems, just because the counter examples were very large. But what if a proof of the undecidability of a conjecture of that form is given, what then if no counter example had been found up to the insanely large values described above?

If there is a binary property 'R' that holds for all natural numbers, i.e., there is no counter example, and it can be deductively shown that no proof of 'R' holding for all natural numbers exists, then the implications for epistemology, ontology and the scientific/rational endeavor in general are huge. If some facts about the natural numbers don't follow from the application of valid inferences onto axiom and theorems, then what makes us think that all the facts about the natural world must follow from natural laws in combination with the initial state of the universe? If there is such a property, then that means that in a completely deterministic system where you have all the rules describing all the possible ways that things can change, and you have all the rest of the formally verifiable information about the system, there still might be some fact about this system which is true but does not follow from those rules. Those statements would only be verifiable by finite probabilistic sampling from an infinite population with an undetermined standard deviation, but still be true facts. Our crowning example of such a system would of course be the theory of the natural numbers itself if such a binary property were discovered. Suppose the Collatz conjecture were shown to be undecidable, that would mean that there is no counter example, i.e., all numbers are Collatz numbers (since the existence of a counter example would guarantee the conjecture's decidability), but there would also be no generalized proof that no counter example exists (since the existence of such a proof would guarantee the decidability of the conjecture). So since we can't verify the conjecture either way, should we call it meaningless/unverifiable? Or is logically undecidable somehow distinct from literally meaningless? What restrictions/expectations should we have if we believe that an inductive-conjecture is undecidable, and how would those restrictions change if we believed that conjecture was actually unverifiable.

Let's call the claim that "there is a binary property R which holds for all natural numbers and that there is no counter example that can or will ever be found, but which also cannot be proven to hold for all natural numbers" the "first Potato conjecture". How would one ever show the first potato conjecture, or even offer evidence in it's favor? Let's say we knew that some property 'R_b' held of all natural numbers that we might ever test. Then we would have a proof of this and R_b could not be our R. If we get a candidate property 'R_c'  that isn't capable of being proven or dis-proven of all natural numbers, then we will never know if it holds for all natural numbers. Could induction even offer us any evidence in this case? Is a finite sample ever representative of an infinite population with no standard deviation even in the case of simple succession? If not, then what evidence could we ever offer for or against the potato conjecture, if an undecidable inductive-conjecture were discovered? If the answer is no evidence one way or the other, does that mean that the potato conjecture is meaningless, or just undecidable?

 

(But no, really, I'm asking.)

 


[1]: http://en.wikipedia.org/wiki/Graham's number

[2]: http://en.wikipedia.org/wiki/Ackermann_function

[3]: http://www.ieeta.pt/~tos/3x+1.html

[4]: http://www.jstor.org/pss/2044308

[5]: http://en.wikipedia.org/wiki/Collatz_conjecture

[6]: Quoting Lagarias 1985: "J. H. Conway proved the remarkable result that a simple generalization of the problem is algorithmically undecidable." The work was reported in: J. H. Conway (1972). "Unpredictable Iterations". Proceedings of the 1972 Number Theory Conference : University of Colorado, Boulder, Colorado, August 14–18, 1972. Boulder: University of Colorado. pp. 49–52.

A Philosophical Treatise of Universal Induction (Link)

11 XiXiDu 10 June 2011 06:53PM

Abstract: Understanding inductive reasoning is a problem that has engaged mankind for thousands of years. This problem is relevant to a wide range of fields and is integral to the philosophy of science. It has been tackled by many great minds ranging from philosophers to scientists to mathematicians, and more recently computer scientists. In this article we argue the case for Solomonoff Induction, a formal inductive framework which combines algorithmic information theory with the Bayesian framework. Although it achieves excellent theoretical results and is based on solid philosophical foundations, the requisite technical knowledge necessary for understanding this framework has caused it to remain largely unknown and unappreciated in the wider scientific community. The main contribution of this article is to convey Solomonoff induction and its related concepts in a generally accessible form with the aim of bridging this current technical gap. In the process we examine the major historical contributions that have led to the formulation of Solomonoff Induction as well as criticisms of Solomonoff and induction in general. In particular we examine how Solomonoff induction addresses many issues that have plagued other inductive systems, such as the black ravens paradox and the confirmation problem, and compare this approach with other recent approaches.

Link: mdpi.com/1099-4300/13/6/1076/

Download PDF Full-Text: mdpi.com/1099-4300/13/6/1076/pdf

Authors: Samuel Rathmanner and Marcus Hutter

Published: 3 June 2011

Via: vetta.org/2011/06/treatise-on-universal-induction/