A clarification if I might:
"is the probability that we will see data sequence E, given that we run program H on the universal Turing machine."
I think it would be helpful to word it as "output begins with the data sequence E", since it is a very common misconception that it suffices to see E somewhere within the output, i.e. that it suffices for H to "explain" the data (the original article used "explains").
When thinking of e.g. the universe, "explains" is typically taken to mean "the universe contains me somewhere", a form of anthropic reasoning, which can lead to a concept substantially different from Solomonoff induction.
As a side note, one can obtain a kind of anthropic-reasoning prior by including some self-description on an extra tape that the program can read; the code can then search for instances of itself within the models for only a constant cost, but it still needs to be predictive, i.e. output a string that begins with the observed data. This seems no different (up to a constant) from simply including the self-description as part of the data sequence E. Edit: on second thought, the extra tape is different in a major, fallible way: a self-description on the extra tape, if sufficiently complete, can make it possible to construct a god in your own image for 'goddidit'. One should just add the self-description as part of the data sequence E. It is still no different up to a constant, though.
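To make the distinction concrete, here is a minimal Python sketch. The function names and the bit strings are hypothetical, chosen only for illustration; the point is that a prefix test and a substring test can disagree on the same output.

```python
def matches_as_prediction(output: str, evidence: str) -> bool:
    """Predictive reading: the output must BEGIN with the evidence."""
    return output.startswith(evidence)


def merely_contains(output: str, evidence: str) -> bool:
    """Looser 'explains' reading: the evidence appears anywhere in the output."""
    return evidence in output


evidence = "0110"
output_a = "01101001"  # begins with the evidence
output_b = "11101100"  # contains "0110", but only starting at the fourth symbol

for name, output in (("a", output_a), ("b", output_b)):
    print(name, matches_as_prediction(output, evidence), merely_contains(output, evidence))
```

Under the predictive reading only output_a counts as a match, while the looser reading accepts both.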
You've read the introduction to Bayes' theorem. You've read the introduction to Solomonoff induction. Both describe fundamental theories of epistemic rationality. But how do they fit together?
It turns out that it’s pretty simple. Let’s take a look at Bayes’ theorem.
For a review:
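As a reminder, the standard statement of Bayes' theorem for a hypothesis H and evidence E, with the denominator written out as a sum over every hypothesis H_i, is:

```latex
P(H \mid E) = \frac{P(E \mid H)\, P(H)}{\sum_i P(E \mid H_i)\, P(H_i)}
```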
In terms of Solomonoff induction:
The denominator has the same meaning as the numerator, except that it is summed over every possible hypothesis. This normalizes the probability in the numerator. Any hypothesis that does not match the data E exactly has P(E|Hi) = 0, so its term contributes nothing to the sum. If a hypothesis does output E exactly, then P(E|Hi) = 1, and that matching hypothesis contributes its weight to the normalizing sum in the denominator.
Let's see an example with these things substituted. Here, the set of Hi is the set of hypotheses that match.
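One way to write the substituted form is sketched below. It assumes the usual Solomonoff prior, in which a program H that is \ell(H) bits long gets weight 2^{-\ell(H)}; that weight and the symbol \ell are assumptions here, not something stated in the text above.

```latex
P(H \mid E) = \frac{2^{-\ell(H)}}{\sum_{i \,:\, H_i \text{ matches } E} 2^{-\ell(H_i)}}
\qquad \text{for any } H \text{ that matches } E
```

Every non-matching hypothesis has P(E|Hi) = 0, so it appears in neither the numerator nor the sum.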
In summary, Bayes’ theorem says that once we find all matching hypotheses, we can find the probability of each one by dividing its individual weight by the sum of the weights of all the matching hypotheses.
This is intuitive, and matches Bayes’ theorem both mathematically and philosophically. Updating occurs when you get more bits of evidence E. This eliminates some of the hypotheses Hi, which makes the normalizing sum in the denominator smaller.
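The renormalization and the effect of updating can be sketched with a toy example in Python. Everything here is hypothetical: the four "programs", their lengths, and their outputs are made up, and the weight 2^(-length) is the usual Solomonoff-style prior, assumed rather than taken from the text. A real universal Turing machine has infinitely many programs; this only illustrates the arithmetic.

```python
from fractions import Fraction

# Hypothetical toy "programs": (name, length in bits, output produced).
HYPOTHESES = [
    ("A", 2, "010101"),
    ("B", 3, "010011"),
    ("C", 3, "011111"),
    ("D", 4, "110101"),
]


def weight(length_bits: int) -> Fraction:
    """Assumed prior weight of a program: 2^(-length)."""
    return Fraction(1, 2 ** length_bits)


def posterior(evidence: str):
    """Return (posterior over matching hypotheses, normalizing sum).

    A hypothesis matches when its output begins with the evidence,
    i.e. P(E|H) = 1; otherwise P(E|H) = 0 and it drops out of the sum.
    """
    matching = {
        name: weight(length)
        for name, length, output in HYPOTHESES
        if output.startswith(evidence)
    }
    total = sum(matching.values(), Fraction(0))
    if total == 0:
        raise ValueError("no hypothesis matches the evidence")
    return {name: w / total for name, w in matching.items()}, total


# Feeding in more bits of evidence eliminates hypotheses one by one.
for evidence in ("01", "010", "0101"):
    probs, total = posterior(evidence)
    print(evidence, "denominator:", total, "posterior:", probs)
```

With evidence 01, hypotheses A, B and C match and the denominator is 1/2; with 010, only A and B remain and the denominator falls to 3/8, so A's share rises from 1/2 to 2/3.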