We can't use the universal prior in practice unless physics contains harnessable non-recursive processes. But that is exactly the situation in which the universal prior isn't guaranteed to work. So one source of the 'magic' is that it grants us access to higher levels of computation than the phenomena we are predicting (and certainty that we have such access).
Also, the constants involved could be terrible, and there are no guarantees about them (not even probabilistic ones). It is nice to approach the right ratio in the limit, but if your first Graham's number of guesses are bad, that is very bad for (almost) all purposes.
The Bayesian using the universal prior is cheating: Kolmogorov complexity is incomputable, so you have to use a realistic compressor, and then the optimality results go away. But in practice, if your compressor is good, it will almost always outdo the frequentist.
For example, suppose the machine is actually set up to always produce head-biased coins. After observing the coin tosses for a while, a typical intelligent person, just applying common sense, would notice that 90% of the tosses come up heads, and infer that perhaps all the coins are biased towards heads. They would become more certain of this with time, and adjust their answers accordingly. But the frequentist would not (or isn't supposed to) notice this.
They wouldn't (they aren't)?
...He or she would answer "the coin is head-biased with probability .9" 90% of the time, and "the coin is tail-biased with probability .9" 10% of the time, and keep doing this, irrevocably and forever.
I don't understand why your universal prior passes the tape through the Turing machine. If you have a source of genuinely random tape inputs in the first place, why not just use the probability that a tape has the first four bits '1100'? And in any case, what's the advantage of invoking Turing machinery in place of the good old fair coins?
What I keep coming back to here is: doesn't the entire point of this post come down to situations where the parameters in question, the biases of the coins, are not independent? And doesn't that contradict this:
estimate 100 independent unknown parameters
Which leads me to read the latter half of this post as: we can (in principle, though perhaps not computably) estimate 1 complex parameter from 100 data sets better than we can estimate 100 independent unknown parameters from individual data sets. This shouldn't be surprising, and I certainly don't find it so.
The first half just points out...
How hard is it to "crack" a pseudorandom number generator? Say, one used to generate cards in an online poker game?
In the very early days of online poker (late 90's), a group of people did exactly that.
A number of factors combined to make it possible: the site had published the source code of its shuffling algorithm to demonstrate fairness; the shuffle was seeded with a small value derived from the server's clock, so only a tiny fraction of the 52! possible orderings could ever occur; and once a player's own cards and the flop were known, the full deck order could be recovered in real time.
Nowadays, sites keep their algorithms secret, and the even distribution of hands is constantly being verified by third parties with large data sets. They also harvest large sources of entropy, from user mouse gestures and physical RNGs. See PokerStars' write-up on security.
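For anyone curious what such an attack looks like concretely, here is a toy Python sketch (my own illustration, not any real site's code). It assumes the dealer seeds a standard PRNG with something as guessable as the server clock's milliseconds-since-midnight; `deal`, the simplified dealing order, and the search window are all invented for the example:

```python
import random

DECK = [rank + suit for suit in "cdhs" for rank in "23456789TJQKA"]

def deal(seed):
    """Hypothetical dealer: seed a standard PRNG and shuffle the deck."""
    rng = random.Random(seed)
    deck = DECK[:]
    rng.shuffle(deck)
    return deck

# Assumption for this sketch: the seed is milliseconds since midnight.
true_seed = 41_123_456
deck = deal(true_seed)
my_hole, flop = deck[:2], deck[4:7]   # simplified dealing order

# Attacker: with a clock synced to the server within a few seconds, only a few
# thousand seeds are possible. Enumerate them and keep the ones that reproduce
# the cards the attacker can already see.
candidates = [s for s in range(true_seed - 5_000, true_seed + 5_000)
              if (d := deal(s))[:2] == my_hole and d[4:7] == flop]

print(candidates)                 # almost always just the true seed
print(deal(candidates[0])[2:4])   # the opponent's hole cards, recovered
```

Five visible cards are enough to pin down the seed, and with it every card left in the deck.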
Thank you for introducing me to the "universal prior".
The model I'd have used in explaining a sequence of such coin flips would indeed assume that the machine generates each coin independently as heads-biased with fixed probability H. But rather than fixing H=1/2, I'd have a prior distribution for H (probably uniform).
How superior would the universal prior be to this? I confess I can't do the math.
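For illustration, here is a rough grid-approximation sketch of that model in Python (the variable names, toss counts, and the "machine always produces heads-biased coins" scenario are all assumptions made up for the example):

```python
import numpy as np

# H ~ Uniform(0, 1): the chance the machine makes any given coin heads-biased.
# A heads-biased coin shows heads with probability .9, a tails-biased one with .1.
H = np.linspace(0, 1, 1001)
posterior = np.ones_like(H)                  # uniform prior over H
p_heads_given_H = 0.9 * H + 0.1 * (1 - H)    # P(observed toss is heads | H)

# Pretend the machine secretly always makes heads-biased coins.
rng = np.random.default_rng(0)
tosses_are_heads = rng.random(200) < 0.9     # one observed toss per coin

for heads in tosses_are_heads:
    posterior *= p_heads_given_H if heads else 1 - p_heads_given_H
    posterior /= posterior.sum()

# For the next coin: P(it is heads-biased | it showed heads).
p_show_heads = (posterior * p_heads_given_H).sum()
p_heads_biased_given_heads = (posterior * 0.9 * H).sum() / p_show_heads
print(p_heads_biased_given_heads)            # climbs well above the fixed .9
```

When all the coins share a common bias rate different from 50/50, this model eventually does better than the matching prior; the universal prior goes further by also capturing computable patterns that aren't i.i.d. at all (alternating biases, the digits of pi, and so on).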
After observing the coin tosses for a while, a typical intelligent person, just applying common sense, would notice that 90% of the tosses come up heads, and infer that perhaps all the coins are biased towards heads. They would become more certain of this with time, and adjust their answers accordingly.
A Bayesian could do better if you allowed it to remember things.
Coin lands on heads: P(heads-biased | heads) = .9 × .5 / .5 = .9; P(tails-biased | heads) = .1 × .5 / .5 = .1. Another heads: P(heads-biased | heads, heads) = .9 × .9 / (.9 × .9 + .1 × .1) ≈ .988; P(tails-biased | heads, heads) = .1 × .1 / (.9 × .9 + .1 × .1) ≈ .012.
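That arithmetic is easy to check in a few lines of Python. The sketch below (sample toss sequence invented for illustration) treats "the machine always makes heads-biased coins" versus "always tails-biased" as the two hypotheses, starting from the 50/50 prior:

```python
# P(machine always makes heads-biased coins), updated one observed toss at a time.
p_heads_machine = 0.5
for toss in "HHTHHHHHHH":                      # a made-up run that is ~90% heads
    like_h = 0.9 if toss == "H" else 0.1       # P(toss | heads-biased machine)
    like_t = 0.1 if toss == "H" else 0.9       # P(toss | tails-biased machine)
    p_heads_machine = (like_h * p_heads_machine /
                       (like_h * p_heads_machine + like_t * (1 - p_heads_machine)))
    print(round(p_heads_machine, 4))           # .9, .9878, ... as computed above
```

This is the sense in which a Bayesian who remembers earlier coins pulls ahead: after a long run of heads, it assigns the next coin far more than a .9 chance of being heads-biased.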
Too many restrictions there, I think. The format doesn't have to be nice: any format it doesn't already know, it will learn after a fixed-length penalty. We could just dump in the Internet Archive raw.
Alright, the thing I meant by "nice" needs some elaboration. I'll put it this way: For an Ultimate AI, all of its knowledge -- with no exceptions -- is in terms of constraints on what it expects to observe. (And yes, this is what rationalists should strive for too.) So there is no "light is waves", no "that's the Doppler effect". There are only mappings from inputs to probability distributions on future inputs. (Confusingly, this would also mean an expectation for [phenomena explainable as] humans [generating sound waves explainable by them] saying [something we would recognize as the statement] "light is waves". Phew!)
Any large human-generated knowledge base initially appears to AIXI as a long string of characters and/or some input/output black box. What in its input space do the characters refer to? What is the appropriate way to group them? Most importantly, after being told that "this stuff is true" or "this is a lot of what goes on in the environment that computes your inputs", how does it know how it maps to the rest of the environment's generating function? (Which I guess is ultimately the same as the first question.)
That problem is nearly as intractable as starting from just an Occamian prior. It's only resolved by symbol-grounding, which means representing knowledge in the form of a probability distribution on observations, in a way your program can understand. Which I think brings you back into the AI-complete realm.
But those other environments where it performs poorly are the ones where it ought to perform poorly: its best performance is reserved for the most likely future histories, just as we would aspire to do.
Okay, you're right, I wasn't keeping the comparison baselines straight. If you could give the program enough knowledge, in a form it understands or could quickly learn, to distinguish this computable environment from substantively different ones (including its location within it, the relevant history, etc.), then yes, it would make better inferences than humans.
But the point stands that dumping any kind of knowledge base on a computable version of AIXI won't help you a bit until you've done a lot more of the cognitive labor.
[I posted this to open thread a few days ago for review. I've only made some minor editorial changes since then, so no need to read it again if you've already read the draft.]
This is a belated reply to cousin_it's 2009 post Bayesian Flame, which claimed that frequentists can give calibrated estimates for unknown parameters without using priors.
And indeed they can. Here's the simplest example that I can think of that illustrates the spirit of frequentism:
Suppose there is a machine that produces biased coins. You don't know how the machine works, except that each coin it produces is either biased towards heads (in which case each toss of the coin will land heads with probability .9 and tails with probability .1) or towards tails (in which case each toss of the coin will land tails with probability .9 and heads with probability .1). For each coin, you get to observe one toss, and then have to state whether you think it's biased towards heads or tails, and the probability that your answer is correct.
Let's say that you decide to follow this rule: after observing heads, always answer "the coin is biased towards heads with probability .9" and after observing tails, always answer "the coin is biased towards tails with probability .9". Do this for a while, and it will turn out that 90% of the time you are right about which way the coin is biased, no matter how the machine actually works. The machine might always produce coins biased towards heads, or always towards tails, or decide based on the digits of pi, and it wouldn't matter—you'll still be right 90% of the time. (To verify this, notice that in the long run you will answer "heads" for 90% of the coins actually biased towards heads, and "tails" for 90% of the coins actually biased towards tails.) No priors needed! Magic!
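The 90% claim is easy to check by simulation. Here is a quick sketch (illustrative code, not part of the original argument); swap anything you like into `coin_bias` and the hit rate stays at roughly .9:

```python
import random

random.seed(0)

def coin_bias(i):
    """How the machine chooses coin i's bias; replace with any rule at all."""
    return "H"                                  # e.g. always heads-biased

n, correct = 100_000, 0
for i in range(n):
    bias = coin_bias(i)
    p_heads = 0.9 if bias == "H" else 0.1
    toss = "H" if random.random() < p_heads else "T"
    guess = toss                                # the frequentist rule: echo the toss
    correct += guess == bias

print(correct / n)                              # ≈ 0.9 regardless of coin_bias
```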
What is going on here? There are a couple of things we could say. One was mentioned by Eliezer in a comment: that this frequentist guarantee relies on having perfect information about experimental setups and likelihood ratios.
In this example, the "perfect information about experimental setups and likelihood ratios" is the information that a biased coin will land the way it's biased with probability .9. I think this is a valid criticism, but it's not complete. There are perhaps many situations where we have much better information about experimental setups and likelihood ratios than about the mechanism that determines the unknown parameter we're trying to estimate. This criticism leaves open the question of whether it would make sense to give up Bayesianism for frequentism in those situations.
The other thing we could say is that while the frequentist in this example appears to be perfectly calibrated, he or she is liable to pay a heavy cost for this in accuracy. For example, suppose the machine is actually set up to always produce head-biased coins. After observing the coin tosses for a while, a typical intelligent person, just applying common sense, would notice that 90% of the tosses come up heads, and infer that perhaps all the coins are biased towards heads. They would become more certain of this with time, and adjust their answers accordingly. But the frequentist would not (or isn't supposed to) notice this. He or she would answer "the coin is head-biased with probability .9" 90% of the time, and "the coin is tail-biased with probability .9" 10% of the time, and keep doing this, irrevocably and forever.
The frequentist magic turns out to be weaker than it first appeared. What about the Bayesian solution to this problem? Well, we know that it must involve a prior, so the only question is which one. The maximum entropy prior that is consistent with the information given in the problem statement is to assign each coin an independent probability of .5 of being head-biased, and .5 of being tail-biased. It turns out that a Bayesian using this prior will give the exact same answers as the frequentist, so this is also an example of a "matching prior". (To verify: P(biased heads | observed heads) = P(OH|BH)*P(BH)/P(OH) = .9*.5/.5 = .9)
But a Bayesian can do much better. A Bayesian can use a universal prior. (With a universal prior based on a universal Turing machine, the prior probability that the first 4 coins will be biased "heads, heads, tails, tails" is the probability that the UTM will produce 1100 as the first 4 bits of its output, when given a uniformly random input tape.) Using such a prior guarantees that no matter how the coin-producing machine works, as long as it doesn't involve some kind of uncomputable physics, in the long run your expected total Bayes score will be no worse than someone who knows exactly how the machine works, except by a constant (that's determined by the algorithmic complexity of the machine). And unless the machine actually settles into deciding the bias of each coin independently with 50/50 probabilities, your expected Bayes score will also be better than the frequentist (or a Bayesian using the matching prior) by an unbounded margin as time goes to infinity.
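The universal prior itself is incomputable, but the flavor of this guarantee can be seen with a crude finite stand-in: a Bayesian mixture over a handful of hand-picked "programs" for the machine. Everything below is a toy construction for illustration, not the actual universal prior; the point is that the mixture's total log score trails that of someone who knows the true machine by at most minus the log of the prior weight given to the truth:

```python
import math
import random

random.seed(1)

# Each hypothesis maps the coin index i to P(coin i is heads-biased).
hypotheses = {
    "always heads-biased": lambda i: 1.0,
    "always tails-biased": lambda i: 0.0,
    "independent 50/50":   lambda i: 0.5,
    "alternating":         lambda i: float(i % 2 == 0),
}
weights = {name: 1 / len(hypotheses) for name in hypotheses}   # uniform prior

mixture_score = 0.0   # total log score of the mixture's predictions
oracle_score = 0.0    # total log score of someone who knows the true machine

for i in range(1000):
    toss_is_heads = random.random() < 0.9      # true machine: always heads-biased

    # Each hypothesis's probability for this toss: P(heads) = .9*q + .1*(1-q).
    like = {}
    for name, q_of in hypotheses.items():
        q = q_of(i)
        p_heads = 0.9 * q + 0.1 * (1 - q)
        like[name] = p_heads if toss_is_heads else 1 - p_heads

    pred = sum(weights[name] * like[name] for name in hypotheses)
    mixture_score += math.log(pred)
    oracle_score += math.log(0.9 if toss_is_heads else 0.1)

    for name in hypotheses:                    # Bayes update of the weights
        weights[name] *= like[name] / pred

print(oracle_score - mixture_score)   # never exceeds -log(1/4) = log 4 ≈ 1.39
```

The universal prior plays the same game over all computable hypotheses at once, which is why the constant in the guarantee depends on the algorithmic complexity of the machine.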
I consider this magic also, because I don't really understand why it works. Is our prior actually a universal prior, or is the universal prior just a handy approximation that we can substitute in place of the real prior? Why does the universe that we live in look like a giant computer? What about uncomputable physics? Just what are priors, anyway? These are some of the questions that I'm still confused about.
But as long as we're choosing between different magics, why not pick the stronger one?