Meetup : Urbana-Champaign: The Steep Approach to Crazytown
Discussion article for the meetup : Urbana-Champaign: The Steep Approach to Crazytown
As was suggested last week, we'll be checking out some exercises from the Scientology community. Making fun of them is, of course, mandatory, but the goal is to find some that are interesting enough to try and then trying them. I don't see how this could go wrong.
Meetup : Urbana-Champaign: Practical Rationality
Discussion article for the meetup : Urbana-Champaign: Practical Rationality
What techniques do you use for acting effectively given that you are a squishy irrational meatbag? Jack will contribute some ideas which he has smuggled to us from CFAR.
Also, we will continue refining the rules of Wits and Wagers.
Meetup : Urbana-Champaign: Reconstituting
Discussion article for the meetup : Urbana-Champaign: Reconstituting
In the puddles and pools formed by the fall rains, meetup groups that seemed dead end their arid hibernation and emerge from the earth.
Come join us for the first meetup of the new school year. Board games and light discussion are likely. A starter topic: increasing exposure to positive black swans.
Thought experiments on simplicity in logical probability
A common feature of many proposed logical priors is a preference for simple sentences over complex ones. This is sort of like an extension of Occam's razor into math. Simple things are more likely to be true. So, as it is said, "why not?"
Well, the analogy has some wrinkles - unlike hypothetical rules for the world, logical sentences do not form a mutually exclusive set. Instead, for every sentence A there is a sentence not-A with pretty much the same complexity, and probability 1-P(A). So you can't make the probability smaller for all complex sentences, because their negations are also complex sentences! If you don't have any information that discriminates between them, A and not-A will both get probability 1/2 no matter how complex they get.
But if our agent knows something that breaks the symmetry between A and not-A, like that A belongs to a mutually exclusive and exhaustive set of sentences with differing complexities, then it can assign higher probabilities to simpler sentences in this set without breaking the rules of probability. Except, perhaps, the rule about not making up information.
The question: is the simpler answer really more likely to be true than the more complicated answer, or is this just a delusion? If so, is it for some ontologically basic reason, or for a contingent and explainable reason?
There are two complications to draw your attention to. The first is in what we mean by complexity. Although it would be nice to use the Kolmogorov complexity of any sentence, which is the length of the shortest program that prints the sentence, such a thing is uncomputable by the kind of agent we want to build in the real world. The only thing our real-world agent is assured of seeing is the length of the sentence as-is. We can also find something in between Kolmogorov complexity and length by doing a brief search for short programs that print the sentence - this is the meaning usually intended in this article, and I'll call it "apparent complexity."
The second complication is in what exactly a simplicity prior is supposed to look like. In the case of Solomonoff induction the shape is exponential - more complicated hypotheses are exponentially less likely. But why not a power law? Why not even a Poisson distribution? Does the difficulty of answering this question mean that thinking that simpler sentences are more likely is a delusion after all?
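To make the shape question concrete, here is a tiny sketch (my own illustration, not taken from any particular proposal) of what an exponential prior versus a power-law prior assigns to a hypothetical mutually exclusive, exhaustive set of sentences with made-up apparent complexities:

```python
# Toy comparison of two simplicity priors over a hypothetical mutually
# exclusive, exhaustive set of three sentences. The complexities are invented.
complexities = [5, 12, 40]  # apparent complexities of sentences A, B, C

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

exponential = normalize([2.0 ** (-k) for k in complexities])  # Solomonoff-style
power_law = normalize([k ** (-2.0) for k in complexities])    # a power law instead

print("exponential prior:", exponential)
print("power-law prior:  ", power_law)
# The exponential prior makes sentence C astronomically unlikely; the power law
# merely makes it somewhat unlikely. Which behavior is right is the question.
```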
Thought experiments:
1: Suppose our agent knew from a trusted source that some extremely complicated sum could only be equal to A, or to B, or to C, which are three expressions of differing complexity. What are the probabilities?
Commentary: This is the most sparse form of the question. Not very helpful regarding the "why," but handy to stake out the "what." Do the probabilities follow a nice exponential curve? A power law? Or, since there are just the three known options, do they get equal consideration?
This is all based off intuition, of course. What does intuition say when various knobs of this situation are tweaked - if the sum is of unknown complexity, or of complexity about that of C? If there are a hundred options, or countably many? Intuitively speaking, does it seem like favoring simpler sentences is an ontologically basic part of your logical prior?
2: Consider subsequences of the digits of pi. If I give you a pair (n,m), you can tell me the m digits following the nth digit of pi. So if I start a sentence like "the subsequence of digits of pi (10^100, 10^2) = ", do you expect to see simpler strings of digits on the right side? Is this a testable prediction about the properties of pi?
Commentary: We know that there is always a short-ish program to produce the sequences, which is just to compute the relevant digits of pi. This sets a hard upper bound on the possible Kolmogorov complexity of sequences of pi (that grows logarithmically as you increase m and n), and past a certain m this will genuinely start restricting complicated sequences, and thus favoring "all zeros" - or does it?
After all, this is weak tea compared to an exponential simplicity prior, for which the all-zero sequence would be hojillions of times more likely than a messy one. On the other hand, an exponential curve allows sequences with higher Kolmogorov complexity than the computation of the digits of pi.
Does the low-level view outlined in the first paragraph above demonstrate that the exponential prior is bunk? Or can you derive one from the other with appropriate simplifications (keeping in mind Kolmogorov complexity vs. apparent complexity)? Does pi really contain more long simple strings than expected, and if not what's going on with our prior?
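For what it's worth, thought experiment 2 can be probed numerically, at least in a crude way. The sketch below is my own, with small made-up values of n and m, and zlib compression as a very rough stand-in for apparent complexity; it is nowhere near a real test of the question above.

```python
# Crude probe of thought experiment 2: compress m-digit windows of pi starting
# near digit n, using zlib output size as a rough proxy for apparent complexity.
# n, m, and the number of windows are arbitrary small values for illustration.
import zlib
from mpmath import mp

n, m, windows = 1000, 100, 10
mp.dps = n + m * windows + 20          # compute enough decimal digits of pi
digits = str(mp.pi)[2:]                # drop the leading "3."

for i in range(windows):
    chunk = digits[n + i * m : n + (i + 1) * m].encode()
    print(i, len(zlib.compress(chunk)))  # smaller output ~ "simpler" window
```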
3: Suppose I am writing an expression that I want to equal some number you know - that is, the sentence "my expression = your number" should be true. If I tell you the complexity of my expression, what can you infer about the likelihood of the above sentence?
Commentary: If we had access to the Kolmogorov complexity of your number, then we could completely rule out answers that were too K-simple to work. With only an approximation, it seems like we can still say that simple answers are less likely up to a point. Then as my expression gets more and more complicated, there are more and more available wrong answers (and, outside of the system a bit, it becomes less and less likely that I know what I'm doing), and so probability goes down.
In the limit that my expression is much more complex than your number, does an elegant exponential distribution emerge from underlying considerations?
Raven paradox settled to my satisfaction
The raven paradox, originated by Carl Gustav Hempel, is an apparent absurdity of inductive reasoning. Consider the hypothesis:
H1: All ravens are black.
Inductively, one might expect that seeing many black ravens and no non-black ones is evidence for this hypothesis. As you see more black ravens, you may even find it more and more likely.
Logically, a statement is equivalent to its contrapositive (where you negate both things and flip the order). Thus if "if it is a raven, it is black" is true, so is:
H1': If it is not black, it is not a raven.
Take a moment to double-check this.
Inductively, just like with H1, one would expect that seeing many non-black non-ravens is evidence for this hypothesis. As you see more and more examples, you may even find it more and more likely. Thus a yellow banana is evidence for the hypothesis "all ravens are black."
Since this is silly, there is an apparent problem with induction.
Resolution
Consider the following two possible states of the world:
[Figure: two hypothetical worlds, each specifying how many black ravens, yellow ravens, and yellow bananas exist.]
Suppose that these are your two hypotheses, and you observe a yellow banana (drawing from some fixed distribution over things). Q: What does this tell you about one hypothesis versus another? A: It tells you bananas-all about the number of black ravens.
One might contrast this with a hypothesis where there is one less banana, and one more yellow raven, by some sort of spontaneous generation.
[Figure: a third hypothetical world, with one fewer banana and one more yellow raven.]
Observations of both black ravens and yellow bananas cause us to prefer 1 over 3, now!
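Here is a minimal numerical version of the comparison. The counts are invented for illustration, since the original figures are not reproduced here; H1 and H2 are the two hypotheses above, and H3 is the contrasting one with one fewer banana and one more yellow raven.

```python
# Likelihoods of observing various things under three hypothetical worlds,
# drawing uniformly from the objects in each world. Counts are made up.
from fractions import Fraction

worlds = {
    "H1": {"black raven": 100, "yellow raven": 0, "yellow banana": 100},
    "H2": {"black raven": 99,  "yellow raven": 1, "yellow banana": 100},
    "H3": {"black raven": 99,  "yellow raven": 1, "yellow banana": 99},
}

def likelihood(world, observation):
    return Fraction(world[observation], sum(world.values()))

for obs in ("yellow banana", "black raven"):
    print(obs, {name: likelihood(w, obs) for name, w in worlds.items()})
# With these counts, a yellow banana is equally likely under H1 and H2 (so it
# is no evidence between them), while both yellow bananas and black ravens are
# slightly less likely under H3 - so both observations favor H1 over H3.
```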
The moral of the story is that the amount of evidence that an observation provides is not just about whether it is consistent with the "active" hypothesis - it is about the difference in likelihood between when the hypothesis is true versus when it's false.
This is a pretty straightforward moral - it's a widely known pillar of statistical reasoning. But spotting where it goes missing in the raven paradox takes a bit of effort. This is because we're using an implicit model of the problem (driven by some combination of outside knowledge and framing effects) where nonblack ravens replace black ravens, but don't replace bananas. The logical statements H1 and H1' are not alone enough to tell you how to update upon seeing new evidence. Or to put it another way, the version of induction that drives the raven paradox is in fact wrong, but probability theory implies a bigger version.
(Technical note: In the hypotheses above, the exact number of yellow bananas does not have to be the same for observing a yellow banana to provide no evidence - what has to be the same is the measure of yellow bananas in the probability distribution we're drawing from. Talking about "99 ravens" is more understandable, but what differentiates our hypotheses are really the likelihoods of observing different events [there's our moral again]. This becomes particularly important when extending the argument to infinite numbers of ravens - infinities or no infinities, when you make an observation you're still drawing from some distribution.)
Maybe we're not doomed
This is prompted by Scott's excellent article, Meditations on Moloch.
I might caricature (grossly unfairly) his post like this:
- Map some central problems for humanity onto the tragedy of the commons.
- Game theory says we're doomed.
- Incentives for government employees sometimes don't match the needs of the people.
- This has costs, and those costs help explain why some things that suck, suck.
- Map some central problems for humanity onto the iterated prisoner's dilemma.
- Evolutionary game theory says we're not doomed.
Top-Down and Bottom-Up Logical Probabilities
I.
I don't know very much model theory, and thus I don't fully understand Hutter et al.'s logical prior, detailed here, but nonetheless I can tell you that it uses a very top-down approach. About 60% of what I mean is that the prior is presented as a completed object with few moving parts, one that fits the authors' mathematical tastes and the abstract properties they propose the function should have. And for another thing, it uses model theory - a dead giveaway.
There are plenty of reasons to take a top-down approach. Yes, Hutter et al.'s function isn't computable, but sometimes the properties you want require uncomputability. And it's easier to come up with something vaguely satisfactory if you don't have to have many moving parts. This can range from "the prior is defined as a thing that fulfills the properties I want" on the lawful good side of the spectrum, to "clearly the right answer is just the exponential of the negative complexity of the statement, duh".
Probably the best reason to use a top-down approach to logical uncertainty is so you can do math to it. When you have some elegant description of global properties, it's a lot easier to prove that your logical probability function has nice properties, or to use it in abstract proofs. Hence why model theory is a dead giveaway.
There's one other advantage to designing a logical prior from the top down, which is that you can insert useful stuff like a complexity penalty without worrying too much. After all, you're basically making it up as you go anyhow, so you don't have to worry about where it comes from like you would if you were going from the bottom up.
A bottom-up approach, by contrast, starts with an imagined agent with some state of information and asks what the right probabilities to assign are. Rather than pursuing mathematical elegance, you'll see a lot of comparisons to what humans do when reasoning through similar problems, and demands for computability from the outset.
For me, a big opportunity of the bottom-up approach is to use desiderata that look like principles of reasoning. This leads to more moving parts, but also outlaws some global properties that don't have very compelling reasons behind them.
II.
Before we get to the similarities, rather than the differences, we'll have to impose the condition of limited computational resources. A common playing field, as it were. It would probably serve just as well to extend bottom-up approaches to uncomputable heights, but I am the author here, and I happen to be biased towards the limited-resources case.
The part of top-down assignment using limited resources will be played by a skeletonized pastiche of Paul Christiano's recent report:
i. No matter what, with limited resources we can only assign probabilities to a limited pool of statements. Accordingly, step one is to use some process to choose the set S0 of statements (and their negations) to assign probabilities to.
ii. Then we use something like a weakened consistency condition (one that can be decided between pairs of sentences in polynomial time) to set constraints on the probability function over S0. For example, sentences that are identical except for a double-negation have to be given the same probability.
iii. Christiano constructs a description-length-based "pre-prior" function that is bigger for shorter sentences. There are lots of options for different pre-priors, and I think this is a pretty good one.
iv. Finally, assign a logical probability function over S0 that is as similar as possible to the pre-prior while fulfilling the consistency condition. Christiano measures similarity using cross-entropy between the two functions, so that the problem is one of minimizing cross-entropy subject to a finite list of constraints. (Even if the pre-prior decreases exponentially, this doesn't mean that complicated statements will have exponentially low logical probability, because of the condition from step two that P(a statement) + P(its negation) = 1 - in a state of ignorance, everything still gets probability 1/2. The pre-prior only kicks in when there are more options with different description lengths.)
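As a concrete illustration of step iv, here is a toy of my own (not code from the report, and the exact cross-entropy convention is my assumption): take three mutually exclusive, exhaustive sentences with made-up description lengths, impose the single constraint that their probabilities sum to one, and find the distribution closest to the length-based pre-prior.

```python
# Toy constrained fit to a length-based pre-prior, in the spirit of step iv.
# Three mutually exclusive, exhaustive sentences with invented description lengths.
import numpy as np
from scipy.optimize import minimize

lengths = np.array([4.0, 8.0, 15.0])
pre_prior = 2.0 ** (-lengths)
pre_prior /= pre_prior.sum()               # normalized length-based pre-prior

def cross_entropy(p):
    return -np.sum(pre_prior * np.log(np.clip(p, 1e-12, None)))

constraints = [{"type": "eq", "fun": lambda p: p.sum() - 1.0}]  # exclusive + exhaustive
result = minimize(cross_entropy, x0=np.full(3, 1.0 / 3.0),
                  bounds=[(0.0, 1.0)] * 3, constraints=constraints, method="SLSQP")
print(result.x)   # comes out roughly proportional to 2^(-length)
```

With only the P(statement) + P(negation) = 1 constraints from step ii, the same machinery gives 1/2 to everything, as noted in the parenthetical above; the pre-prior only bites once richer constraints, like the mutual exclusivity used here, come into play.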
Next, let's look at the totally different world of a bottom-up assignment of logical probabilities, played here by a mildly rephrased version of my past proposal.
i. Pick a set of sentences S1 to try and figure out the logical probabilities of.
ii. Prove the truth or falsity of a bunch of statements in the closure of S1 under conjunction and negation (i.e. if sentences a and b are in S1, a&b is in the closure of S1).
iii. Assign a logical probability function over the closure of S1 under conjunction with maximum entropy, subject to the constraints proved in part two, plus the constraints that the conjunction of each sentence with its negation has probability 0.
These turn out to be really similar! Look in step three of my bottom-up example - there's even a sneakily-inserted top-down condition about going through every single statement and checking an aspect of consistency. In the top-down approach, every theorem of a certain sort is proved, while in the bottom-up approach there are allowed to be lots of gaps - but the same sorts of theorems are proved. I've portrayed one as using proofs only about sentences in S0, and the other as using proofs in the entire closure of S1 under conjunction, but those are just points on an available continuum (for more discussion, see Christiano's section on positive semidefinite methods).
The biggest difference is this "pre-prior" thing. On the one hand, it's essential for giving us guarantees about inductive learning. On the other hand, what piece of information do we have that tells us that longer sentences really are less likely? I have unresolved reservations, despite the practical advantages.
III.
A minor confession - my choice of Christiano's report was not coincidental at all. The causal structure went like this:
Last week - Notice dramatic similarities in what gets proved and how it gets used between my bottom-up proposal and Christiano's top-down proposal.
Now - Write post talking about generalities of top-down and bottom-up approaches to logical probability, and then find as a startling conclusion the thing that motivated me to write the post in the first place.
The teeensy bit of selection bias here means that though these similarities are cool, it's hard to draw general conclusions.
So let's look at one more proposal, this one due to Abram Demski, modified to use limited resources.
i. Pick a set of sentences S2 to care about.
ii. Construct a function on sentences in S2 that is big for short sentences and small for long sentences.
iii. Start with the set of sentences that are axioms - we'll shortly add new sentences to the set.
iv. Draw a sentence from S2 with probability proportional to the function from step two.
v. Do a short consistency check (can use a weakened consistency condition, or just limited time) between this sentence and the sentences already in the set. If it's passed, add the sentence to the set.
vi. Keep doing steps four and five until you've either added or ruled out all the sentences in S2.
vii. The logical probability of a sentence is defined as the probability that it ends up in our set after going through this process. We can find this probability using Monte Carlo, by just running the process a bunch of times and counting up what portion of the time each sentence is in the set by the end.
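Here is a runnable toy version of steps iii-vii (my own sketch: the "sentences" are just constraints over two boolean variables, there are no axioms, and sentence length in characters stands in for the function from step two):

```python
# Monte Carlo logical probabilities in the spirit of the process above.
# Sentences are sets of satisfying assignments over two booleans (x, y);
# a set of sentences is consistent iff some assignment satisfies all of them.
import random
from itertools import product

MODELS = set(product([0, 1], repeat=2))           # all assignments to (x, y)
S2 = {                                            # hypothetical sentence pool
    "x":       {m for m in MODELS if m[0] == 1},
    "not x":   {m for m in MODELS if m[0] == 0},
    "x and y": {m for m in MODELS if m == (1, 1)},
    "x or y":  {m for m in MODELS if m[0] or m[1]},
}
weight = {s: 2.0 ** (-len(s)) for s in S2}        # shorter sentences drawn more often

def one_run():
    accepted, remaining, allowed = set(), set(S2), set(MODELS)
    while remaining:
        pool = list(remaining)
        s = random.choices(pool, weights=[weight[t] for t in pool])[0]
        remaining.discard(s)
        if allowed & S2[s]:                       # consistency check (step v)
            accepted.add(s)
            allowed &= S2[s]
    return accepted

runs = [one_run() for _ in range(10000)]
for s in S2:
    print(s, sum(s in r for r in runs) / len(runs))   # step vii: Monte Carlo estimate
```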
Okay, so this one looks pretty different. But let's look for the similarities. The exact same kinds of things get proved again - weakened or scattershot consistency checks between different sentences. If all you have in S2 are three mutually exclusive and exhaustive sentences, the one that's picked first wins - meaning that the probability function over what sentence gets picked first is acting like our pre-prior.
So even though the method is completely different, what's really going on is that sentences are being given measure that looks like the pre-prior, subject to the constraints of weakened consistency (via rejection sampling) and normalization (keep repeating until all statements are checked).
In conclusion: not everything is like everything else, but some things are like some other things.
More and Less than Solomonoff Induction
I've been thinking about how to put induction with limited resources on a firmer foundation. This may just be retracing the steps of others, but that's okay with me. Mostly I just want to talk about these thoughts.
After a few points of introduction.
What's Solomonoff induction?
Suppose we're given some starting data, and asked to predict the future. Solomonoff induction predicts the future by combining the predictions of all programs that (1) output the starting data up until now, and (2) aren't the continuation of another such program. The predictions are combined according to a weighting that decreases exponentially as the length of the program increases.
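As a toy illustration of that weighting (my own sketch; real Solomonoff induction sums over all programs for a universal machine and is uncomputable, whereas the "programs" here are just repeating bit patterns, with pattern length standing in for program length):

```python
# Toy Solomonoff-style predictor: combine all repeating patterns that reproduce
# the data so far, weighted by 2^(-pattern length), to predict the next bit.
from itertools import product

def generates(pattern, data):
    return all(data[i] == pattern[i % len(pattern)] for i in range(len(data)))

def predict_next(data, max_len=8):
    weights = {"0": 0.0, "1": 0.0}
    for n in range(1, max_len + 1):
        for pattern in product("01", repeat=n):
            if generates(pattern, data):
                next_bit = pattern[len(data) % n]
                weights[next_bit] += 2.0 ** (-n)    # exponential length weighting
    total = weights["0"] + weights["1"]
    return {b: w / total for b, w in weights.items()}

print(predict_next("0101010"))   # strongly favors "1" as the next bit
```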
Why is it a good idea?
The simplest answer is that we have a frequentist guarantee that if the "true program" generating our input has some length N (that is, if the observable universe is a big but finite-sized computer), then our predictions will only be wrong a limited number of times, and after that we'll predict correctly every time.
A more bayesian answer would start with the information that our observations can be generated by some finite-sized program, and then derive that something like Solomonoff induction has to represent our true prior over generating programs - as the length gets bigger, our probability is required to go to zero at infinity, and an exponential is the maximum-entropy such curve. This is not a complete answer, but it at least makes the missing pieces more apparent.
Why won't it work with limited resources?
The trouble with using Solomonoff induction in real life is that to pick out which programs output our data so far, we need to run every program - and if the program doesn't ever halt, we need to use a halting oracle to stop it or else we'll take infinite time.
Limited resources require us to only pick from a class of programs that is guaranteed to not run over the limit.
If we have limited time and no halting oracle, we can't check every program. Instead, we are only allowed to check programs drawn from a class of programs that we can check the output of in limited time. The simplest example would be to just check programs but not report the result if we go over some time limit, in which case the class we're picking from is "programs that don't go over the time limit."
This is an application of a general principle - when we impose resource constraints on a real agent, we want to be able to cash them out as properties of an abstract description of what the agent is doing. In this case, we can cash out limited time for our real agent as an inability for our abstract description to have any contributions from long programs.
This isn't really a Bayesian restriction - we don't actually know that the true program is within our search space. This also weakens our frequentist guarantee.
The problem of overfitting - real models allow for incorrect retrodictions.
If we just take Solomonoff induction and put restrictions on it, our predictions will still only come from hypotheses that exactly reproduce our starting data. This is a problem.
For example, if we have a ton of data, like we do, then finding even one program that exactly replicates our data is too hard, and our induction is useless.
We as humans have two main ways of dealing with this. The first is that we ignore most of our data and only try to predict the important parts. The second is that we don't require our models to perfectly retrodict our observations so far (retrodiction - it's like prediction, for the past!).
In fact, having imperfect retrodictions is such a good idea that we can have a problem called "overfitting." Isn't that kinda odd? That it's better to use a simple model that is actually 100% known to be wrong, than to use a very complicated model that has gotten everything right so far.
This behavior makes more sense if we imagine that our data contains contributions both from the thing we're trying to model, and from stuff we want to ignore, in a way that's hard to separate.
What makes real models better than chance? The example of coarse-graining.
Is it even possible to know ahead of time that an imperfect model will make good predictions? It turns out the answer is yes.
The very simplest example is of just predicting the next bit based on which bit has been more common so far. If every pattern were equally likely this wouldn't work, but if we require the probability of a pattern to decrease with its complexity, we're more likely to see repetitions.
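For concreteness, that simplest rule looks like this (a trivial sketch):

```python
# Predict the next bit by majority vote over the bits seen so far.
def predict_next_bit(history):
    ones = sum(history)
    return 1 if 2 * ones >= len(history) else 0   # tie broken toward 1, arbitrarily

print(predict_next_bit([1, 0, 1, 1, 0]))   # -> 1
```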
A more general example: an imperfect model is always better than chance if it is a likely coarse-graining of the true program. Coarse-graining is when you can use a simple program to predict a complicated one by only worrying about some of the properties of the complicated program. In the picture at right, a simple cellular automaton can predict whether a more complicated one will have the same or different densities of white and black in a region.
When a coarse-graining is exact, the coarse-grained properties follow exact rules all the time. Like in the picture at right, where the coarse-grained pattern "same different same" always evolves into "different different different," even though there are multiple states of the more complicated program that count as "different" and "same."
When a coarse-graining is inexact, only most of the states of the long program follow the coarse-grained rules of the short program. But it turns out that, given some ignorance about the exact rules or situation, this is also sufficient to predict the future better than chance.
Of course, when doing induction, we don't actually know the true program. Instead what happens is that we just find some simple program that fits our data reasonably well (according to some measurement of fit), and we go "well, what happened before is more likely to happen again, so this rule will help us predict the future."
Presumably we combine predictions according to some weighting that includes both the length of the program and its goodness of fit.
Machine learning - probably approximately correct
Since this is a machine learning problem, there are already solutions to similar problems, one of which is probably approximately correct learning. The basic idea is that if you have some uniformly-sampled training data, and a hypothesis space you can completely search, then you can give some probabilistic guarantees about how good the hypothesis is that best fits the training data. A "hypothesis," here, is a classification of members of data-space into different categories.
The more training data you have, the closer (where "close" can be measured as a chi-squared-like error) the best-fitting hypothesis is to the actually-best hypothesis. If your hypothesis space doesn't contain the true hypothesis, then that's okay - you can still guarantee that the best-fitting hypothesis gets close to the best hypothesis in your hypothesis space. The probability that a far-away hypothesis would masquerade as a hypothesis within the small "successful" region gets smaller and smaller as the training data increases.
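One standard form of that guarantee, for a finite hypothesis space in the realizable case, says that m >= (1/epsilon) * (ln|H| + ln(1/delta)) training examples suffice for the best-fitting hypothesis to have true error at most epsilon with probability at least 1 - delta. A quick calculator (my addition; this is textbook PAC material rather than anything specific to this post):

```python
# Sample-size bound for finite hypothesis classes in the realizable PAC setting.
import math

def pac_sample_bound(hypothesis_count, eps, delta):
    return math.ceil((math.log(hypothesis_count) + math.log(1.0 / delta)) / eps)

print(pac_sample_bound(hypothesis_count=10**6, eps=0.01, delta=0.01))  # ~1843
```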
There is a straightforward extension to cases where your data contains contributions from both "things we want to model" and "things we want to ignore." Rather than finding the hypothesis that fits the training data best, we want to find the hypothesis for the "stuff we want to model" that has the highest probability of producing our observed data, after we've added some noise, drawn from some "noise distribution" that encapsulates the basic information about the stuff we want to ignore.
There are certainly some issues comparing this to Solomonoff induction, like the fact that our training data is randomly sampled rather than a time series, but I do like this paradigm a bit better than looking for a guarantee that we'll find the one true answer in finite time.
Bayes' theorem
If there's an easy way to combine machine learning with Solomonoff induction, it's Bayes' theorem. The machine learning was focused on driving down P(training data | chance), and picking a hypothesis with a high P(training data | hypothesis, noise model). We might want to use Bayes' rule to say something like P(hypothesis | training data, noise model) = P(hypothesis | complexity prior) * P(training data | hypothesis, noise model) / P(training data | noise model).
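A minimal sketch of that combination (my own illustration: the hypotheses, description lengths, and Gaussian noise model are all invented): score each hypothesis by a complexity-based prior times the probability of the data under the hypothesis plus noise, then normalize.

```python
# Posterior over hypotheses = complexity prior * noise-model likelihood, normalized.
import math

def posterior(hypotheses, data, sigma=0.5):
    scores = {}
    for name, (length, preds) in hypotheses.items():
        prior = 2.0 ** (-length)                   # complexity prior
        log_like = sum(-((x - p) ** 2) / (2 * sigma ** 2) for x, p in zip(data, preds))
        scores[name] = prior * math.exp(log_like)  # unnormalized Gaussian likelihood
    total = sum(scores.values())
    return {name: s / total for name, s in scores.items()}

data = [1.1, 2.0, 2.9, 4.2]                        # noisy observations (made up)
hypotheses = {                                     # name: (description length, retrodictions)
    "linear":   (10, [1, 2, 3, 4]),                # short-ish program, good fit
    "constant": (5,  [2, 2, 2, 2]),                # shorter program, poor fit
    "lookup":   (40, [1.1, 2.0, 2.9, 4.2]),        # long program, perfect fit
}
print(posterior(hypotheses, data))
# The simple-but-imperfect "linear" hypothesis should dominate: the complexity
# prior crushes "lookup" even though it retrodicts the data exactly.
```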
Anyhow - what are the obvious things I've missed? :)
Meetup : Urbana-Champaign: Recreation
Discussion article for the meetup : Urbana-Champaign: Recreation
Since the weather is nice, let's meet outside, on the benches south of the engineering quad. I'll bring kites - coincidentally, the engineering quad is a great place to fly kites.
We may talk about mechanism design (reading).
Bonus topic: what do debates like this recent one tell us about what's going on in peoples' heads?
Meetup : Urbana-Champaign: Planning and Re-planning
Discussion article for the meetup : Urbana-Champaign: Planning and Re-planning
When things get complicated enough, you have to plan them in advance or they fail. You need blueprints and logistics before you can build a skyscraper. On a personal level, good plans improve our chances of success at anything we can make a plan for.
One trouble with plans is that once you've made them they're sticky. What kind of life to lead, what to study, when to marry - we inherit plans about these things from the past, and we don't always rethink them when appropriate.