Brevity of code and English can correspond via abstraction.
I don't know why brevity in low- and high-abstraction programs/explanations/ideas would correspond (I suspect it wouldn't). If brevity in low- and high-abstraction stuff corresponded, wouldn't that be contradictory? If a simple explanation at a high level of abstraction were also simple at a low level, abstraction would feel broken; typically ideas only become simple after abstraction. Put another way: the reason to use abstraction is to turn ideas/things that are highly complex into things that are less complex.
I think Occam's Razor makes sense only if you take abstractions into account (note: O.R. itself is still a rule of thumb regardless). Occam's Razor doesn't make sense if you count all the extra stuff an explanation invokes - partly because that body of knowledge grows as we learn more, and good ideas become more consistent with the population of other ideas over time.
When people think of short code they think of doing complex stuff with a few lines of code, e.g. `cat asdf.log | cut -d ',' -f 3 | sort | uniq`. When people think of (good) short ideas they think of ideas which are made of a few well-established concepts that are widely accessible and easy to talk about, e.g. we have seasons because energy from sunlight fluctuates ~sinusoidally through our annual orbit.
One of the ways SI can use abstraction is via the abstraction being encoded in the program, the program inputs, and the observation data.
(I think) SI uses an arbitrary alphabet of instructions (for both programs and data), so you can design particular abstractions into your SI instruction/data language. Of course, in that case the program would be pretty useless for any problem other than the one you designed it for.
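To make that concrete, here's a minimal Python sketch (not real SI; the instruction alphabet and the run_program function are invented for illustration) of a language where one opcode already encodes a domain abstraction, so the matching hypothesis is a single symbol long - but only for data the alphabet was designed around.

```python
import math

# Toy "abstraction-laden" instruction language (invented for illustration,
# not a real SI machine): the opcode 'S' means "emit a year of ~sinusoidal
# seasonal values"; '0' and '1' just emit constants.
def run_program(program, steps=12):
    out = []
    for op in program:
        if op == "S":   # the seasons abstraction is baked into the alphabet
            out += [math.sin(2 * math.pi * m / steps) for m in range(steps)]
        elif op == "0":
            out.append(0.0)
        elif op == "1":
            out.append(1.0)
    return out

# Observation data encoded with the same abstraction in mind:
observed = [math.sin(2 * math.pi * m / 12) for m in range(12)]

# In this language the matching hypothesis is one symbol long...
print(run_program("S") == observed)                      # True

# ...but it's useless for data the alphabet wasn't designed around:
print(run_program("S") == [m / 12 for m in range(12)])   # False
```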
Is there literature arguing that code and English brevity usually or always correspond to each other?
I don't know of any.
If not, then most of our reasons for accepting Occam’s Razor wouldn’t apply to SI.
I think some of the reasoning makes sense in a pointless sort of way. e.g. the hypothesis `1100` corresponds to the program "output 1 and stop". The input data is from an experiment, and the experiment was "does the observation match our theory?", and the result was `1`. The program `1100` gets fed into SI pretty early, and it matches the predicted output. The reason this works is that SI found a program which has info about 'the observation matching the theory' already encoded, and we fed in observation data with that encoding. Similarly, the question "does the observation match our theory?" is short and elegant like the program. The whole thing works out because all the real work is done elsewhere (in the abstraction layer).
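Here's a toy version of that `1100` story (the two-bit opcodes and the first_match function are made up for illustration; real SI enumerates programs for a universal machine and weights them by length): enumerate bitstring programs shortest-first and return the first one whose output matches the observation. Because the encoding already bakes in "does the observation match our theory?", the winning program is tiny.

```python
from itertools import count, product

# Toy decoder (invented encoding): "11" = output 1, "10" = output 0,
# "00" = stop. So the program 1100 means "output 1 and stop".
def run(program):
    out = ""
    for i in range(0, len(program), 2):
        op = program[i:i + 2]
        if op == "11":
            out += "1"
        elif op == "10":
            out += "0"
        elif op == "00":
            return out           # halt and return what's been output
        else:
            return None          # invalid program in this toy language
    return None                  # ran off the end without halting

# The observation, already encoded with the abstraction
# "1 = the observation matched our theory":
observation = "1"

# Enumerate programs shortest-first (SI-style) until one reproduces the data.
def first_match(observation):
    for length in count(2, 2):                     # 2, 4, 6, ... bits
        for bits in product("01", repeat=length):
            program = "".join(bits)
            if run(program) == observation:
                return program

print(first_match(observation))   # "1100" - found early because it's short
```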
That seems a fair approach in general (i.e. how can we use the program efficiently/profitably), but I don't think it answers the question in the OP. I think it actually implies the opposite effect: as you go through more layers of abstraction things get more and more complex (i.e. simplicity doesn't hold across layers of abstraction). That's why the strategy you mention needs to cover ever larger and larger problem spaces to make sense.
So this would still mean most of our reasoning about Occam's Razor wouldn't apply to SI.
I'm not sure we (humanity) know enough to claim only a short string needs to be added. I think GPT-3 hints at a counter-example b/c GPT has been growing geometrically.
Moreover, I don't think we have any programs (or ideas for programs) that are anywhere near sophisticated enough to answer meaningful questions - unless they just regurgitate an answer. So we don't have a good reason to claim to know what we'd need to add to extend your solution to handle more and more cases (especially increasingly technical/sophisticated cases).
Intuitively I think there is (physically) a way to do something like what you describe efficiently, because humans are an example of this - we have no known limit on understanding new ideas. However, it's not okay to use humans as a stand-in for a hypothetical SI program, b/c such a program does other stuff we don't know how to do with SI programs (like taking into account itself, other actors, and the universe broadly).
If the hypothetical program does stuff we don't understand and we also don't understand its data encoding methods, then I don't think we can make claims about how much data we'd need to add.
I think it's reasonable, and intuitive, that there's no upper limit on the amount of data we'd need to add to such a program as we input increasingly sophisticated questions - that holds for both people and the hypothetical programs you mention.