Solomonoff induction is generally given as the correct way to penalise more complex hypotheses when calculating priors. A great introduction can be found here.
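For reference, the usual statement (which I take to be a standard fact rather than anything specific to the linked introduction) is that the Solomonoff prior of a hypothesis $H$ sums over all programs $p$ that output $H$ on a fixed universal Turing machine $U$, so it falls off roughly exponentially in the length of the shortest such program, i.e. the Kolmogorov complexity $K(H)$:

$$M(H) \;=\; \sum_{p \,:\, U(p)=H} 2^{-\ell(p)} \;\approx\; 2^{-K(H)}.$$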
My question is, how is this actually calculated in practice?
As an example, say I have 2 hypotheses:
A. The probability distribution of the output $y$ is given by the same normal distribution for all inputs, with mean $\mu$ and standard deviation $\sigma$.
B. The probability distribution of the output $y$ is given by a normal distribution depending on an input $x$, with mean $\mu + \beta x$ and standard deviation $\sigma$.
It is clear that hypothesis B is more complex (it uses an additional input [$x$], has an additional parameter [$\beta$], and requires 2 additional operations to calculate), but how does one calculate the actual penalty that B should be given vs A?
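Solomonoff's measure itself is uncomputable, so any concrete penalty has to be a proxy for it. As an illustration only (not the "correct" penalty the question asks about), here is a minimal sketch of one common computable stand-in, a BIC-style / two-part-MDL comparison: fit both hypotheses to data, then score each by its log-likelihood minus a description-length term of $\tfrac{k}{2}\log n$ for its $k$ parameters. The toy data, parameter counts, and fitting choices below are assumptions made purely for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumption for illustration): y depends weakly on x.
n = 200
x = rng.normal(size=n)
y = 0.3 * x + rng.normal(loc=1.0, scale=0.5, size=n)

def gaussian_loglik(y, mean, sigma):
    """Log-likelihood of y under N(mean, sigma^2); mean may be an array."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (y - mean)**2 / (2 * sigma**2))

# Hypothesis A: y ~ N(mu, sigma^2), ignoring x.  k_A = 2 parameters (mu, sigma).
mu_A = y.mean()
sigma_A = y.std()
loglik_A = gaussian_loglik(y, mu_A, sigma_A)

# Hypothesis B: y ~ N(mu + beta * x, sigma^2).  k_B = 3 parameters (mu, beta, sigma).
beta_B, mu_B = np.polyfit(x, y, 1)      # least-squares fit of mean = mu + beta * x
resid = y - (mu_B + beta_B * x)
sigma_B = resid.std()
loglik_B = gaussian_loglik(y, mu_B + beta_B * x, sigma_B)

# BIC-style description-length penalty: (k / 2) * log(n) nats per model.
def penalised_score(loglik, k, n):
    return loglik - 0.5 * k * np.log(n)

score_A = penalised_score(loglik_A, k=2, n=n)
score_B = penalised_score(loglik_B, k=3, n=n)

# B pays a fixed extra penalty of 0.5 * log(n) nats for its extra parameter;
# it wins only if the data raise its log-likelihood by more than that amount.
print(f"log-lik A = {loglik_A:.1f}, penalised score A = {score_A:.1f}")
print(f"log-lik B = {loglik_B:.1f}, penalised score B = {score_B:.1f}")
print(f"extra penalty on B = {0.5 * np.log(n):.2f} nats")
```

In this proxy the extra parameter is priced at $\tfrac{1}{2}\log n$ nats, full stop; a genuinely Solomonoff-style penalty would also have to price the extra input and the two extra operations in some fixed description language, which is exactly the part the question is asking how to pin down.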
How do we choose the correct version of Occam's razor? As always, we use Occam's razor to assign prior probabilities to each possibility (here, each candidate version of Occam's razor), then update on real-world observations. There's an obvious circularity problem here. I think the version that humans intuitively use lies in a large region of the space of possible versions with the property that, if you use one version from that region to choose a new version and keep repeating this self-reflection, the process converges.