The scientific method is wonderfully simple, intuitive, and above all effective. Based on the available evidence, you formulate several hypotheses and assign prior probabilities to each one. Then, you devise an experiment which will produce new evidence to distinguish between the hypotheses. Finally, you perform the experiment, and adjust your probabilities accordingly. 

So far, so good. But what do you do when you cannot perform any new experiments?

This may seem like a strange question, one that leans dangerously close to unprovable philosophical statements with no real-world consequences. But it is in fact a serious problem facing the field of cosmology. We must learn that when there is no new evidence that could cause us to change our beliefs (or even when there is), the best thing to do is to rationally re-examine the evidence we already have.


Cosmology is the study of the universe as we see it today. The discoveries of supernovae, black holes, and even galaxies all fall in the realm of cosmology. More recently, the CMB (Cosmic Microwave Background) has been found to contain essential information about the origin and structure of our universe, encoded in an invisible pattern of bright and dark spots in the sky.

Of course, we have no way to create new stars or galaxies of our own; we can only observe the behaviour of those that are already there. But the universe is not infinitely old, and information cannot travel faster than light. So all the cosmological observations we can possibly make come from a single slice of the universe - a 4-dimensional cone of spacetime. And as there are a finite number of events in this cone, cosmology has only a limited amount of data it can ever gather; in fact, the amount of data that even exists is finite.

Now, finite does not mean small, and there is much that can be deduced even from a restricted data set. It all depends on how you use the data. But you only get one chance; if you need to find trained physicists who have not yet read the data, you had better hope you didn't already release it to the public domain. Ideally, you should know how you are going to distribute the data before it is acquired.


The problem is addressed in this paper (The Virtues of Frugality - Why cosmological observers should release their data slowly), published almost a year ago by three physicists. They give details of the Planck satellite, whose mission is to measure the CMB at greater resolution and sensitivity than anyone has managed before. At the time the paper was written, the preliminary results had been released, showing the satellite to be operating properly. By now, its mission is complete, and the data is being analysed and collated in preparation for release.

The above paper holds the Planck satellite to be significant because with it we are rapidly reaching a critical point. As of now, analysis of the CMB is limited not primarily by the accuracy of our measurements, but by interference from other microwave sources, and by the cosmic variance itself.

"Cosmic variance" stems from the notion that the amount of data in existence is finite. Imagine a certain rare galactic event A that occurs with probability 0.5 whenever a certain set of conditions are met, independently of all previous occurrences of A. So far, the necessary conditions have been met exactly 2 million times. How many events A can be expected to happen? The answer is 1 million, plus or minus one thousand. This uncertainty of 1,000 is the cosmic variance, and it poses a serious problem. If we have two theories of the universe, one of which is correct in its description of A, and one of which predicts that A will happen with probability 0.501, when A has actually happened 1,001,000 times (a frequency of 0.5005), this is not statistically significant evidence to distinguish between those theories. But this evidence is all the evidence there is; so if we reach this point, there will never be any way of knowing which theory is correct, even though there is a significant difference between their predictions.

This is an extreme example and an oversimplification, but we do know (from experience) that people tend to cling to their current beliefs and demand additional evidence. If there is no such evidence either way, we must use techniques of rationality to remove our biases and examine the situation dispassionately, to see which side the current evidence really supports.


The Virtues of Frugality proposes one solution. Divide the data into pieces (methods for determining the boundaries of these pieces are given in VoF). Find a physicist who has never seen the data set in detail. Show him the first piece of data, let him design models and set parameters based on this data piece. When he is satisfied with his ability to predict the contents of the second data piece, show him that one as well and let him adjust his parameters and possibly invent new models. Continue until you have exhausted all the data.

To a Bayesian superintelligence, this is transparent nonsense. Given a certain list of theories and associated prior probabilities (e.g. the set of all computable theories with complexity below a given limit), there is only one right answer to the question "What is the probability that theory K is true given all the available evidence?" Just because we're dealing in probability doesn't mean we can't be certain.
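
To make that "one right answer" concrete, here is a minimal sketch of an exact Bayesian update over a toy list of theories (the hypotheses, priors and likelihoods are all invented for illustration):

    # Exact Bayesian updating over a fixed, finite list of theories.
    # The hypotheses and all the numbers are invented for illustration.
    priors = {"K1": 0.5, "K2": 0.3, "K3": 0.2}
    likelihoods = {"K1": 0.10, "K2": 0.40, "K3": 0.25}   # P(evidence E | theory)

    p_evidence = sum(priors[k] * likelihoods[k] for k in priors)
    posteriors = {k: priors[k] * likelihoods[k] / p_evidence for k in priors}

    for k, p in posteriors.items():
        print(f"P({k} | E) = {p:.3f}")

    # Given the same priors and the same evidence, these posteriors are the
    # unique answer; the order in which the evidence is examined changes nothing.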

Humans, however, are not Bayesian superintelligences, and we are not capable of conceiving of all computable theories at once. Given new evidence, we might think up a new theory that we would not previously have considered. VoF asserts that we cannot then use the evidence we already have to check that theory; we must find new evidence. We already know that the evidence we have fits the theory, because it made us think of it. Using that same evidence to check it would be incorrect; not because of confirmation bias, but simply because we are counting the same evidence twice.


This sounds reasonable, but I happen to disagree. The authors' view forgets the fact that the intuition which brought the new theory to our attention is itself using statistical methods, albeit unconsciously. Checking the new theory against the available evidence (basing your estimated prior probability solely on Occam's Razor) is not counting the same evidence twice; it's checking your working. Every primary-school child learning arithmetic is told that if they suspect they have made a mistake (which is generally the case with primary-school children), they should derive their result again, ideally using a different method. That is what we are doing here; we are re-evaluating our subconscious estimate of the posterior probability using mathematically exact methods.

That is not to say that the methods for analysing finite data sets cannot be improved, simply that the improvement suggested by VoF is suboptimal. Instead, I suggest a method which paraphrases one of Yudkowsky's posts: that of giving all the available evidence to separate individuals or small groups, without telling them of any theories which had already been developed based on that evidence, and without letting them collude with any other such groups. The human tendency to be primed by existing ideas instead of thinking of new ones would thus be reduced in effect, since there would be other groups with different sets of existing ideas.

Implementing such a system would be difficult, if not downright politically dangerous, in our current academic society. Still, I have hope that this is merely a logistical problem, and that we as a species are able to overcome our biases even in such restricted circumstances. Because we may only get one chance.

47 comments

Well done. Since most of the top-level posts by new users in the last couple of months have IMHO gone badly off the rails, I was tensing for the moment when this one would, but it never did.

My comments have gone badly off-topic though. :)

I think this question applies in a lot of areas where experiments are possible but extremely expensive; for most of these that I can think of, we are in much greater trouble because we're starting with even less data.

Two examples. Personal: theories 1 and 2 are that I (for some value of I) will be happiest as a 1) doctor or 2) physicist. I can sort of experiment with this, crudely and grossly, and I can spend some time gathering supporting data. However, without actually making a career in both physics and medicine, I haven't really experimented; even if I did, there would be all sorts of variables I couldn't control.

National: theories 1 and 2 are that we (for some value of we) will better provide for the future of our citizens as 1) democratic capitalists or 2) social democrats. This is testable in the same way, and just as one can look to other doctors and physicists to gain insight, a nation can look to other nations; but fundamentally, transitioning back and forth between economic systems is extremely costly.

So, in the absence of a superintelligence that can perfectly simulate doctor-you and physicist-you, or a whole nation of us, how do we reach scientifically justifiable conclusions, and how do we refine our theories and models with so little data?

This sounds a lot like the concept of cross-validation in machine learning. Suppose you have a bunch of data -- emails that are either spam or nonspam, for example -- and you want to make a model that will allow you to predict whether an email is spam or not. You split the data into n subsets, and for each subset S, do this test:

  1. Create a model from all the data that are not in this set S.

  2. Test the model on the data from S. How good are the predictions?

This method lets you test how well your machine learning methods are doing, so you can tweak them or go with a different method. Google recently got some publicity for offering a service that will apply a bunch of machine learning methods to your data and pick the best one; it picks the best one by running cross-validation.
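
For concreteness, a minimal cross-validation sketch (the synthetic data and the scikit-learn classifier are stand-ins I picked, not anything specific to spam filtering):

    # Minimal k-fold cross-validation sketch with synthetic stand-in data.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

    model = LogisticRegression(max_iter=1000)
    scores = cross_val_score(model, X, y, cv=5)   # 5 folds: fit on 4, test on the 5th

    print("accuracy on each held-out fold:", scores.round(3))
    print("mean accuracy:", round(scores.mean(), 3))

    # Comparing these held-out scores across rival models is the machine-learning
    # analogue of testing a theory on data it was not built from.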

Applying this to scientists, we get a variation on what you proposed in your penultimate paragraph: give part of the data to separate individuals or groups, stick them in a Science Temple in the mountains somewhere without Internet access, and then once they've come up with theories, test them on the data that they weren't given. For extra hilarity, split the data 50/50 among two large research groups, let them think of theories for a year, then have them exchange theories but not data. Then sit back and watch the argument.

You could also combine this sort of thing with the method suggested in VoF, by giving data out slowly to separate groups, in different orders. I'm not sure which of these ideas is best.

The scientific method is wonderfully simple, intuitive, and above all effective. Based on the available evidence, you formulate several hypotheses and assign prior probabilities to each one. Then, you devise an experiment which will produce new evidence to distinguish between the hypotheses. Finally, you perform the experiment, and adjust your probabilities accordingly.

Either there is more than one "scientific method", it isn't really a method, or science doesn't actually follow the scientific method (and therefore, cannot be justified by that method).


Science approximates that method. Most scientists don't explicitly assign their hypotheses prior probabilities and then use Bayesian updating on the results of their experiments, but their brains have to assign different levels of credence to different hypotheses (to determine which ones get the attention) and adjust those credence levels as new data comes in, and the larger scientific community performs a similar process to determine consensus; Bayes-structure is implicit in there even if you're not actually using probability math.

http://en.wikipedia.org/wiki/Scientific_method

I think your third idea is right: science cannot be used to justify its own fundamental principles. For one thing, that argument would be very circular.

Imagine a certain rare galactic event A that occurs with probability 0.5 whenever a certain set of conditions are met

It seems like this is an indirect description of the Black Swan problem. Consider an astrophysical event so unlikely that it has a 50% chance of occurring in the entire light cone. Theory A completely prohibits the event, while theory B assigns the event a very small probability. If the event does not occur in the observation set, there is no way to distinguish between A and B.

This is a really great post, and I'm glad it has raised these issues, because I've been interested in them for a while.

Could you clarify this for me?

all the cosmological observations we can possibly make come from a single slice of the universe - a 4-dimensional cone of spacetime. And as there are a finite number of events in this cone, cosmology has only a limited amount of data it can ever gather; in fact, the amount of data that even exists is finite.

Surely we constantly receive new data from the receding boundary of the observable universe? As we move through time our past light-cone follows us, swallowing up more events that will affect us.

Also, if we literally have all possible evidence, then we lose our motivation to care, as the theory can't actually help us predict anything. Where we really need these techniques is when we have all the evidence that is currently available, but the theory might make more predictions in the more distant future.

Surely we constantly receive new data from the receding boundary of the observable universe?

Yes, but the effect is so small I didn't think it worth mentioning. Over the course of your natural lifetime, your past light-cone will extend by about 100 years. Since it already envelops almost 14 billion years, you won't get much new information relative to what you already know. If you have reason to believe that your lifespan will exceed 5 billion years, the situation is very different.

Where we really need these techniques is when we have all the evidence that is currently available, but the theory might make more predictions in the more distant future.

Looking back, I implicitly assumed (without justification) that improving our understanding of the universe always has a positive utility, regardless of currently known predictive power. You may disagree if your utility function differs from mine. But you are correct in that a new theory may make predictions that we will only be able to test in the distant future, so thanks for making my post more rigorous :)

Over the course of your natural lifetime, your past light-cone will extend by about 100 years. Since it already envelops almost 14 billion years, you won't get much new information relative to what you already know.

You are forgetting the impact of improving science. In fact, most of what we know about the 14 billion year light cone has been added to our knowledge in the last few hundred years due to improved instruments and improved theories. As theories improve, we build better instruments and reinterpret data we collected earlier. As I explained in a recent comment, suggesting new tests for distinguishing between states of the universe is an important part of the progress of science.

You are right about the growth rate of the accessible light cone, but we will continue to improve the amount of information we extract from it over time until our models are perfect.

Actually, since the Universe is accelerating, the past light cone effectively gets smaller over time. Billions of years from now there will be significantly less cosmological data available.

I don't think so. Any events in the past now will still be in the past in a billion years. The past light-cone can only get bigger. (I think you might be misunderstanding my use of the word "past". I'm using the relativistic definition: the set of all events from which one can reach present-day Earth while travelling slower than lightspeed.)

Without getting mathematical: there are galaxies moving away from us faster than the speed of light (and moreover every galaxy outside the Local Group is accelerating away from us). In the future these galaxies will not be visible to Earth-based observers. Similarly the CMB will be more redshifted and hence contain less information. So if you're using a meaning of "event" such that every Planck volume of space produces an event every Planck time regardless of whether there are any atoms there or not, then yes, that number can only go up. But if you're talking about actually interesting things to observe, then it's certainly going down.

There are galaxies moving away from us faster than the speed of light (and moreover every galaxy outside the Local Group is accelerating away from us). In the future these galaxies will not be visible to Earth-based observers.

If they're moving away from us faster than the speed of light, they're not observable now either. As for currently observable galaxies, the event horizon between what we can observe and what we cannot is receding at lightspeed, relativity does not allow us to observe anything break the light barrier, therefore nothing observable can outrace the event horizon and become hidden from us. Black holes notwithstanding because Stephen Hawking is still working on that one.

Similarly the CMB will be more redshifted and hence contain less information.

I don't think redshifting destroys information. Unless you mean that the information will be hidden by noise from within our own galaxy, which is perfectly true.

Anyway, I accept that the amount of data one can gather from current observations may go down over time. But, over a long enough time period, the amount of data one can gather from current and past observations will go up, because there is more past to choose from. Of course, even over millennia it will only go up by an insignificant amount, so we should be careful with the data we have.

Right, so this is the standard misunderstanding about what it means for space itself to be expanding. These two Wikipedia articles might be a good place to start, but in brief: relativity forbids information to pass through space faster than light, but when space itself expands then the distance between two objects can increase faster than c without a problem. (The second link quotes a number of 2 trillion years for the time when no galaxies not currently gravitationally bound to us will be visible.)

I don't think redshifting destroys information.

Well, technically I guess it just lowers the information density, which means less information can be gathered by observers on Earth (and less is available inside the observable universe, etc.) And then eventually the wavelength will be greater than the size of the observable Universe and thus undetectable entirely.

Thanks for the links. It all makes a lot more sense to me now (though at 2 trillion years, the timescales involved are much longer than I had considered). One last quibble: Relativity does not forbid the space between two objects (call them A and B) from expanding faster than c, it's true. But a photon emitted by object A would not be going fast enough to outrace the expansion of space, and would never reach B. So B would never obtain any information about A if they are flying apart faster than light.

But because the expansion of the Universe is accelerating, the apparent receding velocity caused by the expansion is increasing, and, for any object distant enough, will at some point become greater than c, causing the object to disappear beyond the cosmological horizon.

This, obviously, assuming that the current theories are correct in this respect.

But a photon emitted by object A would not be going fast enough to outrace the expansion of space, and would never reach B. So B would never obtain any information about A if they are flying apart faster than light.

I think that was the point, but since the expansion is accelerating this was not always the case.

A and B are retreating faster than light now (in our reference frame), so the light they are emitting now will not reach each other.

However, A and B are far apart, say 5 billion light years. 5 billion years ago A and B were receding more slowly - perhaps half the speed of light, so the light emitted 5 billion years ago from A is now reaching B. Hence, B currently sees light from A.

Five billion years in the future this will not be the case. Sometime in the next 5 billion years B will observe A to redshift all the way to zero and wink out.

Agreed. Thanks.

While I tend to agree with your rationale as it stands on its own, I don't think the biggest problem with implementing it is politics in the academic world; a bigger problem may be the potential for abuse that would come with the authority to restrict the availability of information.

In fact, this concept has already shown itself to be politically valuable as well as attractive: lots of governments and professional groups restrict access to information, sometimes even as true believers in their own justifications, but with an outcome that is decidedly unattractive for the losers in that game.

I think I have a solution to that. Delay the release of the information only long enough to set up a few groups of scientists who are cloistered away from the rest of the world in a monastery somewhere. Release the information to the public, but don't tell the cloistered groups of scientists; to them you dole it out slowly. Or give each of them different parts of the data, or give it to them all at once and don't let them collaborate.

Offer generous research funding and prestige for going to one of these Science Temples, and you'll soon fill them up with ambitious post-docs eager to improve their chances of getting tenure. Maybe this isn't the most humane system, but then neither is modern-day academia. Hell, I bet a lot of professors would jump at the opportunity to do some research, uninterrupted, with no classes to teach, and no funding proposals to write.

Notice that this method doesn't require restricting information from anybody except small groups who have agreed to this restriction willingly.

How effective do you think this would actually be? I think first off, we have to ensure these volunteers believe in the principle of limited access to data, and aren't just going after these positions because of the money. It seems like it would be hard to effectively cloister researchers away from the world while keeping them productive. I'm not a physicist, but I know several, and they spend an awful lot of time online for various reasons. They could access the internet only through a human proxy, but I would quit my job if I had to do that.

Cloistering juries away from trial-relevant news seems to be similar, but I don't know anything about that.

There are also (at least) two strong motivations for cheating: wanting more data, and wanting to be more right. I don't know any scientist who doesn't always want more data, and knowing that data is being withheld must be frustrating. Furthermore, especially in a competitive environment, having access to the data your theory will be judged against is a strong advantage. I'm not at all sure how many physicists would be happy being cloistered away, and then be willing to follow the rules once there. I don't know enough to take a stance one way or another, but I am skeptical.

Upvoted; I think cosmology in particular and astronomy in general is a very important case study of how to make theories and generally become less wrong in a situation where you absolutely cannot make any changes to the thing you're studying. (Of course I'm likely biased because it happens to be the field I'm in at the moment.)

Have you seen Vassar's talk on the development of the scientific method? He mentions the Scholarly method, whereby different scholars would work on the same data, and if they came up with the same theory independently, that was strong evidence.

How about this: it doesn't matter.

If we want to build a device that only works if a certain theory is true, we can use it to test the theory. If not, you can do what you want either way, so what does it matter?

There are still similar useful problems. For instance: you can keep getting new data on economics, but there's no way anyone's going to let you do an experiment. In addition, the data you're getting is very bad if you're trying to eliminate bias. It can't be solved in that way, though.

For instance: you can keep getting new data on economics, but there's no way anyone's going to let you do an experiment.

This is somewhat true of macroeconomics, but manifestly untrue of microeconomics. Economists are constantly doing experiments to learn more about how incentives and settings affect behavior. And the results are being applied in the real world, sometimes in environments where alternative hypotheses can be compared.

And even in macroeconomics, work like that explained in Freakonomics shows how people can compare historical data from polities that chose different policies and learn from the different outcomes. So even if no individual scientist will be allowed to conduct a controlled experiment on the macroeconomy, there are enough competing theories that politicians are constantly following different policies, and providing data that sheds light on the consequences of different choices.

Yes; if your theories don't differ in their constraints on expectation, you both can't test the difference and the difference doesn't matter for the future.

The problem is when your theories diverge in the future predictions such that you would take radically different courses of action, depending on which was true.

The Virtues of Frugality - Why cosmological observers should release their data slowly

...doesn't seem very convincing to me. Publish your data already, dammit!

Yeah, but I had to credit them for giving me the idea in the first place.

Alternative solution: package all the cosmic observations into a big database. Given an astrophysical theory, instantiate it as a specialized compression program, and invoke it on the database. To select between rival theories, measure the sum of encoded file size plus length of compressor; smaller is better.
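
A rough sketch of what that scoring rule might look like (the toy "theories", their sizes in bits, and the synthetic data are all invented; a real version would compress actual observation tables):

    # Toy minimum-description-length comparison:
    #   score = (length of the theory/compressor in bits) + (bits to encode the data under it)
    # Everything here is invented for illustration.
    from math import log2
    import random

    random.seed(0)
    data = [1 if random.random() < 0.5005 else 0 for _ in range(100_000)]
    ones = sum(data)

    def codelength_bits(p_one, n_ones, n_total):
        """Bits needed to encode the 0/1 sequence under a Bernoulli(p_one) model."""
        n_zeros = n_total - n_ones
        return -(n_ones * log2(p_one) + n_zeros * log2(1 - p_one))

    theories = {
        # name: (probability the theory assigns to A, rough size of the theory in bits)
        "theory_p_0.500": (0.500, 1_000),
        "theory_p_0.501": (0.501, 1_200),
    }

    for name, (p, theory_bits) in theories.items():
        total = theory_bits + codelength_bits(p, ones, len(data))
        print(f"{name}: {total:,.1f} bits in total")

    # Smaller total wins; the theory's own length penalises gratuitous complexity.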

So every theory has to cover all of cosmology? Most astrophysicists study really specific things, and make advances in those; their theories [usually] don't contradict other theories, but usually don't make predictions about the entire universe.

I actually don't think this would work at all. Each event observed is usually uncorrelated (often outside each other's light cones) with almost all other events. In this case, room for compression is very small; certainly a pulsar series can be compressed well, but can it be compressed substantially better by better astrophysical theories? I think a program that could actually model so much of the universe would be huge compared to a much more naive compressor, and the difference might well exceed the difference in compression.

Also, everything has a margin of error. Is this compression supposed to be lossless? No physical theory will outperform 7zip (or whatever), because to get all the digits right it will need those or correction factors of nearly that size anyway. If it's lossy, how are we ensuring that it's accepting the right losses, and the right amount? Given these, I suspect a model of our observational equipment and the database storage model will compress much better than a cosmological model, and both might under-perform a generic compression utility.

So every theory has to cover all of cosmology? Most astrophysicists study really specific things, and make advances in those;

In this case, the researcher would take the current standard compressor/model and make a modification to a single module or component of the software, and then show that the modification leads to improved codelengths.

No physical theory will outperform 7zip (or whatever), because to get all the digits right it will need those or correction factors of nearly that size anyway.

You're getting at an important subtlety, which is that if the observations include many digits of precision, it will be impossible to achieve good compression in absolute terms. But the absolute rate is irrelevant; the point is to compare theories. So maybe theory A can only achieve 10% compression, but if the previous champion only gets 9%, then theory A should be preferred. But a specialized compressor based on astrophysical theories will outperform 7zip on a database of cosmological observations, though maybe by only a small amount in absolute terms.

You seem very confident in both those points. Can you justify that? I'm familiar with (both implemented and reverse-engineered) both generic and specialized compressions algorithms, and I don't personally see a way to assure that (accurate) cosmological models substantially outperform generic compression, at least without loss. On the other hand, I have little experience with astronomy, so please correct me where I'm making inaccurate assumptions.

I'm imagining this database to be structured so that it holds rows along the lines of (datetime, position, luminosity by spectral component). Since I don't have a background in astronomy, maybe that's a complete misunderstanding. However, I see this as holding an enormous number of events, each of which consists of a small amount of information, most of which are either unrelated to other events in the database or trivially related so that very simple rules would predict better than trying to model all of the physical processes occurring in the stars [or whatever] that were the source of the event.

Part of the reason I feel this way is that we can gather so little information; the luminosity of a star varies, and we understand at least some about what can make it vary, but I am currently under the impression that actually understanding a distant star's internal processes is so far away from what we can gather from the little light we receive that most of the variance is expected but isn't predictable. We don't even understand our own Sun that well!

There is also the problem of weighting items. If I assume that an accurate cosmological model would work well, then one that accurately predicts stellar life cycles but wholly misunderstands the acceleration of the expansion of the universe would do much better than a model that accurately captured all of that but was, even to a small degree, less well fitted to observed stellar life cycles (even if it is more accurate and less overfitted). Some of the most interesting questions we are investigating right now are the rarest events; if we have a row in the database for each observable time period, you start with an absolutely enormous number of rows for each observable star, but once-in-a-lifetime events are what really intrigue and confound us; starting with so little data, compressing them is simply not worth the compressor's time, relative to compressing the much better understood phenomena.

I think it would be easier to figure out how much data it would take. For example, I can easily work out that the information content of something happening 1,001,000 times out of 2,000,000, when that's how often it should happen, is 1,999,998.56 bits. Actually making the compression isn't so easy.

This seems like it would lead to overfitting on the random details of our particular universe, when what we really want (I think) is a theory that equally describes our universe or any sufficiently similar one.

First off, when you have that much data, over-fitting won't make a big difference. For example, you'll get a prediction that something happens between 999,000 and 1,001,000 times, instead of 1,000,000. Second, the correct answer would take 2,000,000 bits. The incorrect one would take 1,001,000 * (-ln(0.5005)/ln(2)) + 999,000 * (-ln(0.4995)/ln(2)) = 1,999,998.56 bits. The difference in data will always be how unlikely it is to be that far from the mean.

Third, and most importantly, no matter how much your intuition says otherwise, this actually is the correct way to do it. The more bits you have to use, the less likely it is. The coincidence might not seem interesting, but that exact sequence of data is unlikely. What normally makes it seem like a coincidence is that there seems to be a way to explain it with smaller data.
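
For what it's worth, the two codelengths above can be reproduced directly:

    # Reproduces the codelengths quoted above: 2,000,000 events, 1,001,000 of them A.
    from math import log2

    n, n_a = 2_000_000, 1_001_000
    n_not_a = n - n_a

    correct_model = n * -log2(0.5)                                  # p = 0.5
    overfit_model = n_a * -log2(0.5005) + n_not_a * -log2(0.4995)   # p = 0.5005

    print(f"p = 0.5    : {correct_model:,.2f} bits")   # 2,000,000.00
    print(f"p = 0.5005 : {overfit_model:,.2f} bits")   # 1,999,998.56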

Can someone else explain this better?


I agree that this is an attractive alternative solution.

And allow me to rephrase. Since human scientists stick too much to the first hypothesis that seems to fit the data (confirmation bias) and have a regrettable tendency unfairly to promote hypotheses that they and their friends discovered (motivated cognition -- the motivation being the fame that comes from being known as the discoverer of an important successful hypothesis), it would win for the enterprise of science to move where possible to having algorithms generate the hypotheses.

Since the hypotheses "found" (more accurately, "promoted to prominence" or "favored") by the algorithms will be expressed in formal language, professionals with scientific skills, PhD, tenure and such will still be needed to translate them into English. Professionals will also still be necessary to refine the hypothesis-finding (actually "hypothesis-favoring") algorithms and to identify good opportunities for collecting more observations.

How do you come up with astrophysical theories in the first place? Ask a human, an AI, or just test every single computable possibility?

The compression method doesn't specify that part; this shouldn't be considered a weakness, since the traditional method doesn't either. Both methods depend on human intuition, strokes of genius, falling apples, etc.

Of course, we have no way to create new stars or galaxies of our own

Well, we do make simulations.

Simulations allow us to predict what will happen if theory X is true. We still need to find a corresponding real event to check whether the prediction agrees with the fact.

I agree, I was just pointing out the flaw in your wording.

Oh. Thanks.

I think a sensible step would be to treat a human as a black box statistical learning machine which can produce patterns (systems, theories etc.) that simplify the data and can be used to make predictions. A human has no special qualities that distinguish it from automated approaches such as a support vector machine. These solutions can be seen as necessary approximations to the correct space of possible hypotheses, from which making predictions may be impractical (particularly as priors are unknown).

One way of having confidence in a particular output from these black boxes is the use of some additional prior over the likelihood of different theories (their elegance if you like), but I'm not sure to what extent such a prior can rationally be determined, i.e. the pattern of likely theories, of which simplicity is a factor.

Another approach is the scientific method: model with a subset and then validate with additional data (a common AI approach to minimise overfitting). I am not sufficiently knowledgeable in statistical learning theory to know how (or if) such approaches can be shown to provably improve predictive accuracy, but I think this book covers some of it (other Less Wrong readers are likely to know more).

Culturally, we also apply a prior on the black box itself; i.e. when Einstein proposed theories, people rationally assumed they were more likely, as given the limited data his black box seemed to suffer from less overfitting. Of course, we have few samples and don't know the variance, so he could just have been lucky.

Another perspective: if we cannot obtain any more information on a subject, is it valuable to continue to try to model it? In effect, the data is the answer, and predictive power is irrelevant as no more data can be obtained.

but I'm not sure to what extent such a prior can rationally be determined, i.e. the pattern of likely theories, of which simplicity is a factor.

A theory that takes one more bit to specify must be less than half as likely. Either that, or all finite theories have infinitesimal likelihoods. I can't tell you how much less than half, and I can't tell you what compression algorithm you're using. Trying to program the compression algorithm only means that the language you just used is the algorithm.

Technically, the extra bit thing is only as the amount of data goes to infinity, but that's equivalent to the compression algorithm part.

I also assign anything with infinities as having zero probability, because otherwise paradoxes of infinity would break probability and ethics.

That's the extent to which it can be done.
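
A toy illustration of the normalisation argument, under the simplifying assumption that every binary string of length n counts as a separate theory (a prefix-free encoding shifts the exact threshold, but not the shape of the argument):

    # If each extra bit of theory length multiplies the prior by r, the total
    # prior mass at length n is (2**n) * (r**n) = (2*r)**n, since there are
    # 2**n candidate theories of that length. The sum over n only converges
    # (i.e. the prior can be normalised) when r < 1/2.
    def total_mass(r, max_len=60):
        return sum((2 * r) ** n for n in range(1, max_len + 1))

    for r in (0.45, 0.5, 0.55):
        print(f"per-bit factor {r}: prior mass over lengths 1..60 = {total_mass(r):,.1f}")

    # r = 0.45 converges to a finite value, r = 0.5 grows linearly with the cutoff,
    # and r = 0.55 blows up; the latter two would force every finite theory's
    # normalised prior towards zero as the cutoff grows.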

Interesting.

Is it correct to say that the bit-based prior is a consequence of creating an internally consistent formalisation of the aesthetic heuristic of preferring simpler structures to complex ones?

If so, I was wondering if it could be extended to reflect other aesthetics. For example, if an experiment produces a single result that is inconsistent with an existing simple physics theory, it may be that the simplest theory that explains this data is to treat this result as an isolated exception; however, aesthetically we find it more plausible that this exception is evidence of a larger theory that the sample is one part of.

In contrast when attempting to understand the rules of a human system (e.g. a bureaucracy) constructing a theory that lacked exceptions seems unlikely ("that's a little too neat"). Indeed when stated informally the phrase might go "in my experience, that's a little too neat" implying that we formulate priors based on learned patterns from experience. In the case of the bureaucracy, this may stem from a probabilistic understanding of the types of system that result from a particular 'maker' (i.e. politics).

However, this moves the problem to one of classifying contexts and determining which contexts are relevant. If this process is considered part of the theory, then it may considerably increase its complexity, always preferring theories which ignore context. Unless, of course, the theory is complete (incorporating all contexts), in which case the simplest theory may share these contextual models and thus become the universal simplest model. It would therefore not be rational to apply Kolmogorov complexity to a problem in isolation; i.e. probability and reductionism are not compatible.