All of MedicJason's Comments + Replies

To my mind all such questions are related to arguments about solipsism, i.e. the notion that even other humans don't, or may not, have minds/consciousness/qualia. The basic argument is that I can only see behavior (not mind) in anyone other than myself. Most everyone rejects solipsism, but I don't know if there are actually many very good arguments against it, except that it is morally unappealing (if anyone knows of any, please point them out). I think the same questions hold regarding emulations, only even more so (at least with other humans we know th... (read more)

0Baughn
Arguments against solipsism? Well, the most obvious one would be that everyone else appears to be implemented on very nearly the same hardware I am, and so they should be conscious for the same reasons I am. Admittedly I don't know what those reasons are. It's possible that there are none, that the only reason I'm conscious is that I'm not implemented on a brain but in fact on some unknown thing by a Dark Lord of the Matrix; and although that would make solipsism not quite as impossible as the previous argument suggests, it doesn't seem likely. Even in Matrix scenarios.

They don't. To get the probabilities of something occurring in our universe, you need to get the information about our universe first. Solomonoff Induction tells you how to do that in an arbitrary universe. Only after you get enough evidence to understand the universe do you start getting good results.

Yes, but we already have lots of information about our universe. So, making use of all that, if we could start using SI to, say, predict the weather, would its predictions be well-calibrated? (They should be - modern weather predictions are already wel... (read more)

0Viliam_Bur
I admit I am rather confused here, but here is my best guess: It is not true, in our specific world, that all predictions compatible with the past will occur in exact proportion to their bit-length complexity. Some of them will occur more frequently, some of them will occur less frequently. The problem is, you don't know which ones. Because all of them are compatible with the past, so how could you tell the difference, except by a lucky guess? How could any other model tell the difference, except by a lucky guess? How could you tell which model guessed the difference correctly, except by a lucky guess? So if you want to get the best result on average, assigning the probability according to the bit-length complexity is best.
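A minimal sketch of the "best on average" idea in Python, under a big simplification (not the real construction): instead of programs on a universal machine, the "hypotheses" are just all bit strings up to a fixed length that extend the observed data, each weighted by 2^-length. With nothing but the past to go on, the mixture splits its prediction evenly over the continuations, which is the point above: no other assignment does predictably better.

```python
# Toy "Solomonoff-style" mixture: every bit string up to a fixed length,
# weighted by 2^-length. A sketch under that simplifying assumption only.
from itertools import product

def toy_mixture_next_bit(observed, max_len=12):
    """P(next bit = 1 | observed) under a 2^-length prior over all bit
    strings of length <= max_len that extend the observed prefix."""
    p_one = 0.0
    total = 0.0
    for n in range(len(observed) + 1, max_len + 1):
        for bits in product("01", repeat=n):
            h = "".join(bits)
            if not h.startswith(observed):
                continue
            w = 2.0 ** -n              # shorter "universes" get more weight
            total += w
            if h[len(observed)] == "1":
                p_one += w
    return p_one / total

print(toy_mixture_next_bit("0110"))    # 0.5 -- the prior alone favours neither bit
```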

You quoted me

"the theory seems to predict that possible (evidence-compatible) events or states in the universe will occur in exact or fairly exact proportion to their relative complexities as measured in bits [...] if I am predicting between 2 (evidence-compatible) possibilities, and one is twice as information-complex as the other, then it should actually occur 1/3 of the time"

then replied

"Let's suppose that there are two hypotheses H1 and H2, each of them predicting exactly the same events, except that H2 is one bit longer and therefore ha... (read more)

This seems reasonable - it basically relies on the fact that most statements are false, so adding a statement whose truth-value is as yet unknown is likely to make the hypothesis false.

However, that's vague. It supports Occam's Razor pretty well, but does it also offer good evidence that those likelihoods will manifest in real-world probabilities IN EXACT PROPORTION to the bit-lengths of their inputs? That is a much more precise claim! (For convenience I am ignoring the problem of multiple algorithms where hypotheses have different bit-lengths.)
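For what it's worth, here is the arithmetic the 2^-length prior actually implies, as a tiny Python sketch (the lengths are illustrative numbers, not real program lengths): one extra bit halves the weight, giving the 1/3 vs 2/3 split from the quoted reply, whereas a hypothesis with twice the bit-length is penalized by a factor of 2^n, not 2.

```python
# What the 2^-length prior implies for two surviving hypotheses
# (lengths are illustrative, not real program lengths).
def normalized_weights(len_a, len_b):
    wa, wb = 2.0 ** -len_a, 2.0 ** -len_b
    return wa / (wa + wb), wb / (wa + wb)

print(normalized_weights(10, 11))   # one bit longer -> (0.666..., 0.333...)
print(normalized_weights(10, 20))   # twice as long  -> (0.99902..., 0.00097...)
```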

0DaFranker
Nope, and we have no idea where we'd even start on evaluating this, precisely because of the various problems relating to different languages. I think this is an active area of research. It does seem, though, by observation and inference (heh, use whatever tools you have), that more efficient languages tend to formulate shorter hypotheses, which hints at this. There have also been some demonstrations of how well SI works for learning and inferring about a completely unknown environment. I think this was what AIXI was about, though I can't recall specifics.

Yes, that was the post I read that generated my current line of questioning.

My reply to Viliam_Bur was phrased in terms of probabilities in a single universe, while your post here is in terms of mathematically possible universes. Let me try to rephrase my point to him in many-worlds language. This is not how I originally thought of the question, though, so I may end up a little muddled in translation.

Take your original example, where half of the Mathematically Possible Universes start with 1, and the other half with 0. It is certainly possible to imag... (read more)

0DaFranker
Hmm, I think I see what you mean. Yes, there's no reason for Solomonoff to be well-calibrated in the end, but once we obtain information that most of the universes starting with 0 do not work, that is data against which most of the hypotheses starting with 0 will fail. At this point, brute Solomonoff induction will be obviously inefficient, and we should begin using the heuristic of testing almost only hypotheses starting with 1.

In fact, we're already doing this: we know for a fact that we live in the subset of universes where the acceleration between two particles is not constant and invariant of distance. So it is known that the simpler hypothesis where gravitational attraction is "0.02c/year times the total mass of the objects" is not more likely than the one where gravitational attraction also depends on distance and angular momentum and other factors, despite the former being much less complex than the latter (or so we presume).

There are still murky depths and open questions, such as (IIRC) how to calculate how "long" (see Kolmogorov complexity) the instructions are. Suppose we build two universal Turing machines with different sets of internal instructions. We run Solomonoff Induction on the first machine, and it turns out that 01110101011110101010101111011 is the simplest possible program that will output "110", and by analyzing the language and structure of the machine we learn that this corresponds to the hypothesis "2*3", with the output being "6". Meanwhile, on the second machine, 1111110 will also output "110", and by analyzing it we find out that this corresponds to the hypothesis "6", with the output being "6". On the first machine, to encode the hypothesis "6", we must write 101010101111110110101111111110000000111111110000110, which is much more complex than the earlier "2*3" hypothesis, while on the second machine the "2*3" hypothesis is input as 1010111010101111, which is much longer than the "6" hypothesis. Which hypothesis, between "2*3
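To make the reference-machine worry concrete, here is a small Python sketch using the bit-lengths from the example above (they are taken as given, not computed from real machines): the same two hypotheses get opposite rankings depending on which machine defines "length".

```python
# The reference-machine problem in miniature: the same hypothesis gets a
# different code length -- and hence a different 2^-length prior -- depending
# on which universal machine you fix. Lengths taken from the example above.
code_length = {
    "machine 1": {"2*3": 29, "6": 51},   # arithmetic cheap, literals costly
    "machine 2": {"2*3": 16, "6": 7},    # literals cheap, arithmetic costly
}

for machine, lengths in code_length.items():
    ratio = 2.0 ** (lengths["6"] - lengths["2*3"])   # P("2*3") / P("6")
    print(f'{machine}: "2*3" is {ratio:g} times as likely as "6"')
# machine 1: "2*3" is about 4.2 million times as likely as "6"
# machine 2: "6" is 512 times as likely as "2*3"
# The invariance theorem only bounds the length difference by a
# machine-dependent constant, so short hypotheses can be re-ranked like this.
```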

Thank you for your reply. It does clear up some of the virtues of SI, especially when used to generate priors absent any evidence. However, as I understand it, SI does take evidence into account - one removes all the possibilities incompatible with the evidence, then renormalizes the probabilities of the remaining possibilities. Right?

If so, one could still ask - after taking account of all available evidence - is SI then well-calibrated? (At some point it should be well-calibrated, right? More calibrated than human beings. Otherwise, how is it useful... (read more)

1Pfft
Yes. The prediction error theorem states that as long as the true distribution is computable, the estimate will converge quickly to the true distribution. However, almost all the work done here comes from the conditioning. The proof uses the fact that for any computable mu, M(x) > 2^(-K(mu)) mu(x). That is, M does not assign a "very" small probability to any possible observation. The exact prior you pick does not matter very much, as long as it dominates the set of all possible distributions mu in this sense. If you have some other distribution P, such that for every mu there is a C with P(x) > C mu(x), you get a similar theorem, differing by just the constant in the inequality. So I disagree with this: It's ok if the prior is not very exact. As long as we don't overlook any possibilities as a priori super-unlikely when they are not, we can use observations to pin down the exact proportions later.
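A small numerical sketch of the "constant washes out" point, with a finite grid of Bernoulli hypotheses standing in for the full mixture (an illustrative assumption, not the real construction): two priors that differ only in how much weight they give the true hypothesis converge to essentially the same predictions once enough data has been conditioned on.

```python
# A finite grid of Bernoulli(p) hypotheses stands in for the full mixture.
# Two priors that differ only by a multiplicative boost on the true hypothesis
# make nearly identical predictions after conditioning on enough data.
import random

random.seed(0)
ps = [i / 20 for i in range(1, 20)]                 # candidate biases 0.05..0.95
true_p = 0.7
data = [1 if random.random() < true_p else 0 for _ in range(500)]

def predictive(prior, data):
    """P(next bit = 1 | data) under the mixture with the given prior."""
    post = list(prior)
    for x in data:
        post = [w * (p if x else 1 - p) for w, p in zip(post, ps)]
        s = sum(post)
        post = [w / s for w in post]
    return sum(w * p for w, p in zip(post, ps))

uniform = [1 / len(ps)] * len(ps)
boost = [100.0 if abs(p - true_p) < 1e-9 else 1.0 for p in ps]
skewed = [b / sum(boost) for b in boost]            # 100x extra weight on p = 0.7

print(predictive(uniform, data))                    # ~0.7
print(predictive(skewed, data))                     # ~0.7 as well
```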
0Viliam_Bur
I am not sure about the terminology. I would call the described process "Solomonoff priors, plus updating", but I don't know the official name.

I believe the answer is "yes, with enough evidence it is better calibrated than humans". How much would "enough evidence" be? Well, you need some to compensate for the fact that humans are already born with some physiology and instincts adapted by evolution to our laws of physics. But this is a finite amount of evidence. All the evidence that humans get should be processed better by the hypothetical "Solomonoff priors plus updating" process. So even if the process started from zero and got the same information as humans, at some moment it should become and remain better calibrated.

Let's suppose that there are two hypotheses H1 and H2, each of them predicting exactly the same events, except that H2 is one bit longer and therefore half as likely as H1. Okay, so there is no evidence to distinguish between them. Whatever happens, we either reject both hypotheses, or we keep H1 twice as likely as H2. Is that a problem? In real life, no. We will use the system to predict future events. We will ask about a specific event E, and by definition both H1 and H2 would give the same answer. So why should we care whether the answer was derived from H1, from H2, or from a combination of both? The question will be: "Will it rain tomorrow?" and the answer will be: "No." That's all, from outside.

Only if you try to look inside and ask "What was your model of the world that you used for this prediction?" would the machine tell you about H1, H2, and infinitely many other hypotheses. Then you could ask it to use Occam's razor to choose only the simplest one and display it to you. But internally, it could keep all of them (we already suppose it has infinite memory and infinite processing power). Note, if I understand it correctly, that it would actually be impossible for the machine to tell whether in general two hypotheses H1 and H2 are e
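A tiny Python sketch of the H1/H2 point (the numbers are hypothetical): if two hypotheses assign identical probabilities to every observation, updating never changes their 2:1 ratio, and the prediction you get "from outside" does not depend on how the weight is split between them.

```python
# H1 and H2 predict exactly the same things; H2 is one bit longer, so it
# starts with half the weight. Updating on any data leaves the ratio at 2:1,
# and the mixture's answer to "will it rain tomorrow?" is the same either way.
def update(weights, likelihoods):
    posterior = [w * l for w, l in zip(weights, likelihoods)]
    total = sum(posterior)
    return [p / total for p in posterior]

w = [2 / 3, 1 / 3]                      # prior ratio 2:1 (H2 is one bit longer)
for _ in range(10):                     # ten observations of any kind
    w = update(w, [0.8, 0.8])           # identical likelihoods by assumption
print(w)                                # still [0.666..., 0.333...]

p_rain = 0.3                            # both hypotheses give the same number
print(w[0] * p_rain + w[1] * p_rain)    # mixture prediction: 0.3, regardless
```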
0DaFranker
Yes, and the first piece of evidence is rather trivial. For any given law of physics, chemistry, etc., or basically any model of anything in the universe, I can conjure up an arbitrary number of more and more complicated hypotheses that match the current data, but all or nearly all of which will fail utterly against new data obtained later. For a very trivial thought experiment / example, we could have an alternate hypothesis which includes all of the current data, with only instructions to the Turing machine to print this data. Then we could have another which includes all the current data twice, but tells the Turing machine to only print one copy. Necessarily, both of these will fail against new data, because they will only print the old data and halt. We could conjure up infinitely many variants of this which also contain arbitrary amounts of gibberish right after the old data - gibberish which will match the new data only with probability 1/2^n, where n is the length of the new data / gibberish, assuming perfect randomness.
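A quick Python sketch of that thought experiment (the bit strings are hypothetical): a hypothesis that replays the old data and then appends random padding matches the next n bits of new data with probability about 2^-n.

```python
# "Replay the old data, then gibberish" hypotheses almost always fail on new
# data: the padding matches the next n bits with probability about 2**-n.
import random

random.seed(1)
old_data = [random.randint(0, 1) for _ in range(20)]
new_data = [random.randint(0, 1) for _ in range(10)]     # what actually comes next

trials = 100_000
survivors = 0
for _ in range(trials):
    padding = [random.randint(0, 1) for _ in range(len(new_data))]
    hypothesis = old_data + padding                      # fits the old data perfectly
    if hypothesis[len(old_data):] == new_data:           # ...but must also fit the new
        survivors += 1

print(survivors / trials)                                # ~2**-10, i.e. ~0.001
```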

Hi, my name is Jason, this is my first post. I have recently been reading about two subjects here, Calibration and Solomonoff Induction; reading them together has given me the following question:

How well-calibrated would Solomonoff Induction be if it could actually be calculated?

That is to say, if one generated priors on a whole bunch of questions based on information complexity measured in bits - if you took all the hypotheses that were assigned a probability of 10% - would about 10% of those actually turn out to be correct?

I don't immediately see why Solomonoff Inductio... (read more)
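A Python sketch of what the calibration check in the question would look like operationally, assuming a hypothetical list of (assigned probability, outcome) pairs: bucket the predictions by the probability they were assigned and compare each bucket's empirical hit rate to its nominal value.

```python
# Operational version of "is it well-calibrated?": of everything assigned ~10%,
# did about 10% come true? `predictions` is a hypothetical list of
# (assigned_probability, came_true) pairs.
from collections import defaultdict

def calibration_table(predictions, bins=10):
    buckets = defaultdict(list)
    for prob, came_true in predictions:
        buckets[min(int(prob * bins), bins - 1)].append(came_true)
    for b in sorted(buckets):
        outcomes = buckets[b]
        nominal = (b + 0.5) / bins
        observed = sum(outcomes) / len(outcomes)
        print(f"assigned ~{nominal:.0%}: {observed:.0%} came true "
              f"({len(outcomes)} predictions)")

# Usage (made-up data): calibration_table([(0.1, False), (0.1, True), (0.8, True)])
```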

0DaFranker
Viliam_Bur gives a great run-down of what's going on. For a more detailed introduction, though, see this post explaining Solomonoff Induction, or perhaps you'd prefer to jump straight to this paragraph (Solomonoff's Lightsaber) that contains an explanation of why shorter (simpler) hypotheses are more likely under Solomonoff Induction.

To make the bridge between that and what Viliam is saying: basically, if we consider all mathematically possible universes, then half the universes will start with a 1, and the other half with a 0. Then a quarter will start with 11, another quarter with 10, and so on. Which means that, to reuse the example in the above-linked post, 01001101 (which matches observed data perfectly so far) will appear in 1 out of 256 mathematically possible universes, and 1000111110111111000111010010100001 (which also matches the data just as perfectly) will only appear in 1 out of 17179869184 mathematically possible universes.

So if we expect to live in one out of all the mathematically possible universes, but have no idea what properties it has (or if you just got warped to a different universe with different laws of physics), which of the two hypotheses do you want? The one that is true more often, in more of the possible universes, because you're more likely to be in one of those than in one where the longer, rarer hypothesis holds. That's the basic simplified logic behind it.
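The counting behind those numbers, as a short Python check (the bit strings are the ones quoted above):

```python
# Fraction of equally-weighted possible universes that start with each prefix.
short_hyp = "01001101"
long_hyp = "1000111110111111000111010010100001"

print(2 ** len(short_hyp))                       # 256
print(2 ** len(long_hyp))                        # 17179869184
print(2 ** (len(long_hyp) - len(short_hyp)))     # the short one is 67108864x as common
```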
6Viliam_Bur
Solomonoff Induction could be well-calibrated across mathematically possible universes. If a hypothesis has a probability of 10%, you should expect it to be true in 10% of the universes.

The important thing is that Solomonoff priors are just a starting point in our reasoning. Then we update on evidence, which is at least as important as having reasonable priors. If it does not seem well calibrated, that is because you can't get good calibration without using evidence. Imagine that at this moment you are teleported to another universe with completely different laws of physics... do you expect any other method to work better than Solomonoff Induction? Yes, gradually you get data about the new universe and improve your model. But that's exactly what you are supposed to do with Solomonoff priors. You wouldn't predictably get better results by starting from different priors.

To me it seems that Occam's Razor is a rule of thumb, and Solomonoff Induction is the mathematical background explaining why the rule of thumb works. (OR: "Choose the simplest hypothesis that fits your data." Me: "Okay, but why?" SI: "Because it is more likely to be the correct one.") You can't get a good "recipe for truth" without actually looking at the evidence. Solomonoff Induction is the best thing you can do without the evidence (or before you start taking the evidence into account).

Essentially, Solomonoff Induction will help you avoid the following problems:

* Getting inconsistent results. For example, if you instead supposed that "if I don't have any data confirming or rejecting a hypothesis, I will always assume its prior probability is 50%", then if I give you two new hypotheses X and Y without any data, you are supposed to think that p(X) = 0.5 and p(Y) = 0.5, but also e.g. p(X and Y) = 0.5 (because "X and Y" is also a hypothesis you don't have any data about).

* Giving such an extremely low probability to a reasonable hypothesis that available evidence cannot convince you otherwise. Fo
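A small Python check of the first bullet, as a sketch: searching over joint distributions for two hypotheses shows that p(X) = p(Y) = p(X and Y) = 0.5 is only satisfiable when X and Y are the same event, so the "give every unknown hypothesis 50%" rule cannot treat genuinely distinct hypotheses coherently.

```python
# The "assign 50% to everything" rule is incoherent for distinct hypotheses:
# search joint distributions over (X, Y) for one with
# P(X) = P(Y) = P(X and Y) = 0.5. Only the degenerate case where X and Y
# are always true or false together survives.
from itertools import product

grid = [i / 20 for i in range(21)]               # probabilities in steps of 0.05
solutions = []
for p_tt, p_tf, p_ft in product(grid, repeat=3): # P(X,Y), P(X,not Y), P(not X,Y)
    p_ff = 1 - p_tt - p_tf - p_ft
    if p_ff < -1e-9:
        continue
    if (abs(p_tt + p_tf - 0.5) < 1e-9 and        # P(X) = 0.5
            abs(p_tt + p_ft - 0.5) < 1e-9 and    # P(Y) = 0.5
            abs(p_tt - 0.5) < 1e-9):             # P(X and Y) = 0.5
        solutions.append((p_tt, p_tf, p_ft, round(p_ff, 2)))

print(solutions)                                 # [(0.5, 0.0, 0.0, 0.5)] only
```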