The witch hypothesis is going to do relatively well at explaining strange, rare events that could have been produced by humans casting low-complexity magic spells. It will do relatively badly at concisely explaining observations that contain a whole lot of mundane, redundant information (e.g. a video taken by a person walking through a building); there are other hypotheses that assign much higher probabilities to these.
In particular, given the pigeonhole principle, the witch hypothesis isn't going to help at all for compressing random noise, and for similar reasons, for any simple stochastic model (e.g. a probabilistic program), it isn't going produce better codes (on average) for stuff that model would have produced than the model itself would have, as the KL divergence between the distributions is positive.
Quoting Eliezer (from the page you linked):
This lets us see clearly the problem with using “The lady down the street is a witch; she did it” to explain the pattern in the sequence “0101010101″. If you’re sending a message to a friend, trying to describe the sequence you observed, you would have to say: “The lady down the street is a witch; she made the sequence come out 0101010101.” Your accusation of witchcraft wouldn’t let you shorten the rest of the message; you would still have to describe, in full detail, the data which her witchery caused.
There is a somewhat general problem with MDL-type frameworks, which is that they will often posit magic in the absence of having a good causal model, as opposed to being "confused" and thinking that some good causal model exists but is currently unknown. (This is mainly a problem for bounded approximations to MDL, rather than unbounded MDL)
EDIT: additional thoughts copied from the thread:
If the human can execute arbitrary programs and is computable, and can interpret messages of the form "run this program on the rest of the message", then by definition the human brain based computer is a UTM, so it can be used in Solomonoff induction, weirdly enough. However, there is a concern in that the UTM is meant to be a prior, whereas the brain is more of a representation of a posterior. So it will be able to overfit things you already know. This might not be a problem if you were using CDT but would be a problem for UDT, since UDT isn't supposed to change its prior as it gets more observations (this would make it stop caring about non-actual worlds in e.g. counterfactual mugging, leading to dynamic inconsistency).
In general if you're using UDT then your choice of prior is a choice of which possible worlds you care about and how much. There won't be universally compelling arguments for a particular prior the same way there aren't universally compelling arguments for particular values.
Hmm, so I agree that if your message is a short binary string, then obviously the description length of that binary string will be much shorter in Python than it is on a brain.
But if (as is usual in my daily experience) the message is a bunch of human visual sensory input, then I expect a human brain to have a much shorter description length of that input than a Python script. A python script would probably have to include a simulation of a human brain in order to produce the sensory input, which is likely going to be extremely complex.
The definition of Solomonoff induction is indifferent to the choice of universal Turing machine, because the difference it makes is a bounded number of bits. Two calculations of Kolmogorov complexity using different UTMs will always agree to within a number of bits c, where c depends on both of the UTMs (and measures how easily each can simulate the other).
c can be arbitrarily large.
If you pack your UTM full of preferred hypotheses given short codings (e.g. "let it be a human brain"), then you will get those hypotheses back out of it. But that did not come from Solomonoff induction. It came from your choice of UTM.
This raises the question: if, contra the theoretical indifference to choice of UTM, the choice does matter, how should the choice be made? One might consider a UTM having minimal description length, but which UTM do you use to determine that, before you've chosen one? Suppose one first chooses an arbitrary UTM T0, then determines which UTM T1 is given the shortest description length by T0, then generates T2 from T1 in the same way, does this necessarily converge on a UTM that in some definable sense has no extra hypotheses stuffed into it? Or does this process solve nothing?
Alternatively, you might go with some standard construction of a UTM out of a computability theory textbook. Those look minimal enough that no complex hypotheses would be unjustly favoured, but it seems to me there is still a theoretical gap to be plugged here.
My recursive suggestion won't work. One can devise a UTM that gives the shortest code to itself, by the usual reflexivity constructions. The computability theory textbook method looks better. But what theoretical justification can be given for it? Why are we confident that bad explanations are not lurking within it?
Actually, perhaps we shouldn't be. It has already been remarked by Eliezer that Solomonoff induction gives what looks like undue weight to hypotheses involving gigantic numbers with short descriptions, e.g. 3^^^3, despite the fact that...
I previously tried to do something similar - define an objective complexity upper bound by adding the complexity within a UTM to the complexity of the UTM. I don't recall how I defined the complexity of a UTM - I don't think it was recursive, though. Perhaps I used an average over all UTMs, weighted by the complexity of each UTM? But that still requires us to already have a definition of complexity for UTMs! (Unless there's a unique fixed point in the assignment of complexity to UTMs, but then I'd suspect that function to be non-computable. And uniqueness doesn't seem very likely anyway.)
Yeah, if you treat your mind as a TM predicting the next experience, you run into the problem of witches. I prefer to think about TMs that could've generated my mind, and what these TMs will give me as my next experience. That way you don't get witches. Though it's still a mystery how to measure "simplicity" of such TMs.
Wouldn't carrying this analogy to its conclusion mean that you would have to generalize it for a human mind which doesn't already have a bunch of prior understanding of the world? Instead of a universal turing machine, you're using "a universal turing machine that has been taught about witches", which feels like it should be called out as part of the experimental methodology.
So really you would imagine describing the hypothesis and all of the information required to understand the hypotheses to a human that doesn't already know English and isn't already familiar with witches. Rather than a single sentence you'd end up with a book.
I remember somewhere in the sequences the example of Thor causing thunderstorms *feels* simpler to us than weather dynamics, because our brains handwave away the complexity of Thor.
Source: pure speculation and hazy recollection of old LessWrong articles.
I feel like I must be missing something in my understanding of Solomonoff induction.
Compare the length of the description "a witch did it" + (a description of it), and the description by itself. It is worth noting that a lot of ideas or activities humans think of as simple are really complicated, and things people think of as really complicated can be really simple - especially if they're counterintuitive.
I've recently been thinking about Solomonoff induction, and in particular the free choice of Universal Turing Machine.
One variable that seems like a potential choice here is a human brain (my brain for example). It's obviously a bit of a weird choice, but I don't see any reason to disprefer it over a Python interpreter, given that the whole point of Solomonoff induction is to define a prior and so my knowledge of physics or atoms shouldn't really come into play when choosing a UTM.
Concretely, the UTM would be a human simulated in an empty room with an infinitely large notebook symbolizing the tape. It's output would be an encoding of sensory data, and it's input would be a string of instructions in english.
If we do that, then the sentence "the woman at the end of the street is a witch, she did it" suddenly becomes one of the simplest hypotheses that are available to us. Since the english sentence is so short, we now basically just need to give an encoding of who that woman is and what the action in question is, which is probably also going to be a lot shorter in human language than machine language (since our UTM already understands basic physics, society, etc.), and then our simulated human (which since Solomonoff induction doesn't have runtime constraints can take as much time as they want) should be able to produce a prediction of historical sensory input quite well, with relatively little input.
I feel like I must be missing something in my understanding of Solomonoff induction. I have a lot more thoughts, but maybe someone else has already thought of this and can help me understand this. Some thoughts that come to mind: