MockTurtle comments on Is simplicity truth indicative? - Less Wrong
Looking at the machine learning section of the essay, and the paper it mentions, I believe the author to be making a bit too strong a claim based on the data. When he says:
"In some cases the simpler hypotheses were not the best predictors of the out-of-sample data. This is evidence that on real world data series and formal models simplicity is not necessarily truth-indicative."
... he fails to take into account that many more of the complex hypotheses get high error rates than the simpler hypotheses (despite a few of the more complex hypotheses getting the smallest error rates in some cases), which still says that when you have a whole range of hypotheses, you're more likely to get higher error rates when choosing a single complex one than a single simple one. It sounds like he says Occam's Razor is not useful just because the simplest hypothesis isn't ALWAYS the most likely to be true.
Similarly, when he says:
"In a following study on artificial data generated by an ideal fixed 'answer', (Murphy 1995), it was found that a simplicity bias was useful, but only when the 'answer' was also simple. If the answer was complex a bias towards complexity aided the search."
This is not actually relevant to the discussion of whether simple answers are more likely to be fact than complex answers, for a given phenomenon. If you say "It turns out that you're more likely to be wrong with a simple hypothesis when the true answer is complex", this does not affect one way or the other the claim that simple answers may be more common than complex answers, and thus that simple hypotheses may be, all else being equal, more likely to be true than complex hypotheses when both match the observations.
That being said, I am sympathetic to the author's general argument. While complexity (elaboration), when humans are devising theories, tends to just mean more things which can be wrong when further observations are made, this does not necessarily tell us whether natural phenomena are generally 'simple' or not. If you observe only a small (not perfectly representative) fraction of the phenomenon, then a simple hypothesis produced at this time is likely to be proven wrong in the end. I'm not sure if this is really an interesting thing to say, however - when talking about the actual phenomena, they are neither really simple nor complex. They have a single true explanation. It's only when humans are trying to establish the explanation based on limited observation that simplicity and complexity come into it.
Did you look up the papers he referenced, then? Or are you speaking just based on your impression of his summaries? I too thought that his summaries were potentially misleading, but I failed to track down the papers he mentioned to verify that for certain.
This perspective is new to me. What are your thoughts on things like Solomonoff induction? It seems to me like that's sufficiently abstract that it requires simplicity to be a meaningful idea even outside the human psyche. I cannot really imagine any thinking-like process that doesn't involve notions of simplicity.
The first paper he mentions in the machine learning section can be found here, if you'd like to take a look: Murphy and Pazzani 1994. I had more trouble finding the others which he briefly mentions, and so relied on his summary for those.
As for the 'complexity of phenomena rather than theories' bit I was talking about, your reminder of Solomonoff induction has made me change my mind, and perhaps we can talk about 'complexity' when it comes to the phenomena themselves after all.
My initial mindset (reworded with Solomonoff induction in mind) was this: Given an algorithm (phenomenon) and the data it generates (observations), we are trying to come up with algorithms (theories) that create the same set of data. In that situation, Occam's Razor is saying "the shorter the algorithm you create which generates the data, the more likely it is to be the same as the original data-generating algorithm". So, as I said before, the theories are judged on their complexity. But the essay is saying, "Given a set of observations, there are many algorithms that could have originally generated it. Some algorithms are simpler than others, but nature does not necessarily choose the simplest algorithm that could generate those observations."
So then it would follow that when searching for a theory, the simplest ones will not always be the correct ones, since the observation-generating phenomenon was not chosen by nature to necessarily be the simplest phenomenon that could generate those observations. I think that may be what the essay is really getting at.
Someone please correct me if I'm wrong, but isn't the above only kinda valid when our observations are incomplete? Intuitively, it would seem to me that given the FULL set of possible observations from a phenomenon, if you believe any theory but the simplest one that generates all of them, surely you're making irrefutably unnecessary assumptions? The only reason you'd ever doubt the simplest theory is if you think there are extra observations you could make which would warrant extra assumptions and a more complex theory...
It might be a difference of starting points, then. We can either start with a universal approach, a broad prior, and use general heuristics like Occam's Razor, then move towards the specifics of a situation, or we can start with a narrow prior and a view informed by local context, to see how Nature typically operates in such domains according to the evidence of our intuitions, then try to zoom out. Of course both approaches have advantages in some cases, so what's actually being debated is their relative frequency.
I'm not sure of any good way to survey the problem space in an unbiased way to assess whether or not this assertion is typically true (maybe Monte Carlo simulations over random algorithms or something ridiculous like that?), but the point that adding unnecessary additional assumptions to a theory is flawed practice seems like a good heuristic argument suggesting we should generally assume simplicity. Does the fact that naive neural nets almost always fail when applied to out of sample data constitute a strong general argument against the anti-universalizing approach? Or am I just mixing metaphors recklessly here, with this whole "localism" thing? Simplicity and generalizability are more or less the same thing, right? Or is that question assuming the conclusion once again?
I think this demonstrates the problem rather well. In the end, the phenomenon you are trying to model has a level of complexity N. You want your model (neural network or theory or whatever) to have the same level of complexity - no more, no less. So the fact that naive neural nets fail on out of sample data for a given problem shows that the neural network did not reach sufficient complexity. That most naive neural networks fail shows that most problems have at least a bit more complexity than that embodied in the simplest neural networks.
As for how to approach the problem in view of all this... Consider this: for any particular problem of complexity N, there are N - 1 levels of complexity below it, which may fail to make accurate predictions due to oversimplification. And then there's an infinity of complexity levels above N, which may fail to make accurate predictions due to overfitting. So it makes sense to start with simple theories, and keep adding complexity as new observations arrive, and gradually improve the predictions we make, until we have the simplest theory we can which still produces low errors when predicting new observations.
I say low errors because to truly match all observations would certainly be overfitting! So there at the end we have the same problem again, where we trade off accuracy on current data against overfitting errors on future data... Simple (higher errors) versus complex (higher overfitting)... At the end of the process, only empiricism can help us find the theory that produces the lowest error on future data!
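To make the "start simple, add complexity, let held-out error decide" procedure concrete, here is a rough sketch on a toy problem. Everything in it is an assumption for illustration and none of it comes from the essay or the papers: the "phenomenon" is a hypothetical 4-level step function, a "theory" of complexity k is a piecewise-constant fit with k equal-width bins, and the names `true_signal`, `fit_bins` and `mse` are made up.

```python
import random

def true_signal(x):
    # Hypothetical "phenomenon": a 4-level step function on [0, 1).
    return float(int(4 * x))

def make_data(n, rng, noise_sd=0.5):
    # Evenly spaced inputs, noisy observations of the signal.
    xs = [(i + 0.5) / n for i in range(n)]
    ys = [true_signal(x) + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys

def fit_bins(xs, ys, k):
    # A "theory" of complexity k: one constant per equal-width bin.
    sums, counts = [0.0] * k, [0] * k
    for x, y in zip(xs, ys):
        b = min(int(x * k), k - 1)
        sums[b] += y
        counts[b] += 1
    overall = sum(ys) / len(ys)  # fallback for bins with no data
    return [sums[b] / counts[b] if counts[b] else overall for b in range(k)]

def mse(model, xs, ys):
    # Mean squared prediction error of a fitted model on (xs, ys).
    k = len(model)
    return sum((y - model[min(int(x * k), k - 1)]) ** 2
               for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(0)
train_x, train_y = make_data(40, rng)   # the observations we have
test_x, test_y = make_data(200, rng)    # the "future" observations

complexities = [1, 2, 4, 8, 16, 32]
train_err, test_err = {}, {}
for k in complexities:
    model = fit_bins(train_x, train_y, k)
    train_err[k] = mse(model, train_x, train_y)
    test_err[k] = mse(model, test_x, test_y)
    print(f"k={k:2d}  train error={train_err[k]:.3f}  held-out error={test_err[k]:.3f}")

# Only the held-out error can tell us where to stop adding complexity.
best_k = min(complexities, key=lambda k: test_err[k])
print("complexity with lowest held-out error:", best_k)
```

Because each bin count here refines the previous one, the training error can only go down as k grows, so it can never tell you when to stop. The held-out error is the only number in the sketch that penalizes both oversimplifying and overfitting.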
This is one possibility. Another, MUCH more common in practice, is that your NN overfitted the in-sample data and so trivially failed at out-of-sample forecasting.
To figure out the complexity of the process you're trying to model, you first need to be able to separate features of that process from noise and this is far from a trivial exercise.
This is more along the lines of what I was thinking. Most instances of complexity that seem like they're good are in practice going to be versions of overfitting to noise. Or, perhaps stated more concisely and powerfully, noise and simplicity are opposites (information entropy), thus if we dislike noise we should like simplicity. Does this seem like a reasonable perspective?
Not quite. Noise and simplicity are not opposites. I would say that the amount of noise in the data (along with the amount of data) imposes a limit, an upper bound, on the complexity that you can credibly detect.
Basically, if your data is noisy you are forced to consider only low-complexity models.
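A minimal sketch of what I mean, under assumptions that are entirely mine: a hypothetical signal with a large coarse step plus a small fine-grained wiggle, and two piecewise-constant models, a coarse one (2 bins) and a fine one (8 bins) that is able to represent the wiggle. The function names are made up for illustration. The held-out error is averaged over many simulated datasets at a low and a high noise level.

```python
import random

def true_signal(x):
    # Hypothetical signal: a big coarse step plus a small fine-grained detail.
    coarse = 0.0 if x < 0.5 else 2.0
    detail = 0.3 if int(8 * x) % 2 == 0 else -0.3
    return coarse + detail

def sample(n, noise_sd, rng):
    xs = [(i + 0.5) / n for i in range(n)]
    ys = [true_signal(x) + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys

def fit_bins(xs, ys, k):
    # Piecewise-constant model with k equal-width bins on [0, 1).
    sums, counts = [0.0] * k, [0] * k
    for x, y in zip(xs, ys):
        b = min(int(x * k), k - 1)
        sums[b] += y
        counts[b] += 1
    overall = sum(ys) / len(ys)
    return [sums[b] / counts[b] if counts[b] else overall for b in range(k)]

def mse(model, xs, ys):
    k = len(model)
    return sum((y - model[min(int(x * k), k - 1)]) ** 2
               for x, y in zip(xs, ys)) / len(xs)

def mean_heldout_error(k, noise_sd, trials=200, seed=0):
    # Average held-out error over many simulated train/test pairs.
    # The fixed seed means both models see the same simulated data.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        train_x, train_y = sample(40, noise_sd, rng)
        test_x, test_y = sample(200, noise_sd, rng)
        total += mse(fit_bins(train_x, train_y, k), test_x, test_y)
    return total / trials

results = {}
for noise_sd in (0.1, 2.0):
    coarse_err = mean_heldout_error(2, noise_sd)
    fine_err = mean_heldout_error(8, noise_sd)
    results[noise_sd] = (coarse_err, fine_err)
    print(f"noise sd={noise_sd}: coarse model {coarse_err:.3f}, fine model {fine_err:.3f}")
```

In this toy setup the fine detail is really there in both cases. With little noise the more complex model predicts better on average; with a lot of noise the same amount of data can no longer support it, and the simpler model wins even though it is "wrong" about the underlying signal.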
Can you elaborate on why you think it's a boundary, not an opposite? I still feel like it's an opposite. My impression, from self-study, is that randomness in information means the best way to describe, e.g., a sequence of coin flips is to copy the sequence exactly; there is no algorithm or heuristic that allows you to describe the random information more efficiently, like "all heads" or "heads, tails, heads, tails, etc." That sort of efficient description of information seems identical to simplicity to me. If randomness is defined as the absence of simplicity...
I guess maybe all of this is compatible with an upper bound understanding, though. What is there that distinguishes the upper bound understanding from my "opposites" understanding, that goes your way?
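Here is a quick sketch of the coin-flip intuition, with the caveat that it uses zlib's compressed size as a stand-in for description length. That is an assumption on my part: a general-purpose compressor only gives a crude upper bound on how short a description can be, and the pseudo-random flips are just a convenient substitute for real coin flips.

```python
import random
import zlib

def compressed_size(flips: str) -> int:
    # Bytes needed by zlib to describe the sequence (a rough proxy only).
    return len(zlib.compress(flips.encode("ascii"), 9))

n = 4096
rng = random.Random(0)

sequences = {
    "all heads": "H" * n,
    "alternating": "HT" * (n // 2),
    "random": "".join(rng.choice("HT") for _ in range(n)),
}

sizes = {name: compressed_size(seq) for name, seq in sequences.items()}
for name, size in sizes.items():
    print(f"{name:12s} {n} flips -> {size} bytes compressed")
```

The two patterned sequences shrink to a tiny fraction of the size of the random one, which is the sense in which "all heads" has a short description and a random sequence does not.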
Noise is not randomness. What is "noise" depends on the context, but generally it means the part of the signal that we are not interested in and do not care about other than that we'd like to get rid of it.
But we may be talking in different frameworks. If you define simplicity as the opposite (or inverse) of Kolmogorov complexity and if you define noise as something that increases the Kolmogorov complexity then yes, they are kinda opposite by definition.