Brevity of code and english can correspond via abstraction.
I don't know why brevity in low and high abstraction programs/explanations/ideas would correspond (I suspect they wouldn't). If brevity in low/high abstraction stuff corresponded; isn't that like contradictory? If a simple explanation in high abstraction is also simple in low abstraction then abstraction feels broken; typically ideas only become simple after abstraction. Put another way: the reason to use abstraction is to make ideas/thing that are highly complex into things that are less complex.
I think Occam's Razor makes sense only if you take into account abstractions (note: O.R. itself is still a rule of thumb regardless). Occam's Razor doesn't make sense if you think about all the extra stuff an explanation invokes - partially because that body of knowledge grows as we learn more, and good ideas become more consistent with the population of other ideas over time.
When people think of short code they think of doing complex stuff with a few lines of code. e.g. cat asdf.log | cut -d ',' -f 3 | sort | uniq
. When people think of (good) short ideas they think of ideas which are made of a few well-established concepts that are widely accessible and easy to talk about. e.g. we have seasons because energy from sunlight fluctuates ~sinusoidally through our annual orbit.
One of the ways SI can use abstraction is via the abstraction being encoded in both the program, program inputs, and the observation data.
(I think) SI uses an arbitrary alphabet of instructions (for both programs and data), so you can design particular abstractions into your SI instruction/data language. Of course the program would be a bit useless for any other problem than the one you designed it for, in this case.
Is there literature arguing that code and English brevity usually or always correspond to each other?
I don't know of any.
If not, then most of our reasons for accepting Occam’s Razor wouldn’t apply to SI.
I think some of the reasoning makes sense in a pointless sort of way. e.g. the hypothesis 1100
corresponds to the program "output 1 and stop". The input data is from an experiment, and the experiment was "does the observation match our theory?", and the result was 1
. The program 1100
gets fed into SI pretty early, and it matches the predicted output. The reason this works is that SI found a program which has info about 'the observation matching the theory' already encoded, and we fed in observation data with that encoding. Similarly, the question "does the observation match our theory?" is short and elegant like the program. The whole thing works out because all the real work is done elsewhere (in the abstraction layer).
Intuitive explanation: Say it takes X bits to specify a human, and that the human knows how to correctly predict whatever sequence we're applying SI to. SI has to find the human among the other 2^X programs of length X. Say SI is trying to predict the next bit. There will be some fraction of those 2^X programs that predict it will be 0, and some fraction predicting 1. There fractions define SI's probabilities for what the next bit will be. Imagine the next bit will be 0. Then SI is predicting badly if greater than half of those programs predict a 1. But then, all those programs will be eliminated in the update phase. Clearly, this can happen at most X times before most of the weight of SI is on the human hypothesis(or a hypothesis that's just as good at predicting the sequence in question)
The above is a sketch, not quite how SI really works. Rigorous bounds can be found here, in particular the bottom of page 979("we observe that Theorem 2 implies the number of errors of the universal predictor is finite if the number of errors of the informed prior is finite..."). In the case where the number of errors is not finite, the universal and informed prior still have the same asymptotic rate of growth of error (error of universal prior is in big-O class of error of informed prior)
When I say the 'sense of simplicity of SI', I use 'simple program' to mean the programs that SI gives the highest weight to in its predictions(these will by definition be the shortest programs that haven't been ruled out by data). The above results imply that, if humans use their own sense of simplicity to predict things, and their predictions do well at a given task, SI will be able to learn their sense of simplicity after a bounded number of errors.
I think you can input multiple questions by just feeding a sequence of question/answer pairs. Actually getting SI to act like a question-answering oracle is going to involve various implementation details. The above arguments are just meant to establish that SI won't do much worse than humans at sequence prediction(of any type) -- so, to the extent that we use simplicity to attempt to predict things, SI will "learn" that sense after at most a finite number of mistakes(in particular, it won't do any *worse* than 'human-SI', hypotheses ranked by the shortness of their English description, then fed to a human predictor)