VoiceOfRa comments on Beyond Statistics 101 - Less Wrong
You are viewing a comment permalink. View the original post to see all comments and the full post content.
PCA and other dimensionality reduction techniques are great, but there's another very useful technique that most people (even statisticians) are unaware of: dimensional analysis, and in particular, the Buckingham pi theorem. For some reason, this technique is used primarily by engineers in fluid dynamics and heat transfer despite its broad applicability. This is the technique that makes scale-model testing (e.g., in wind tunnels) work, but it's useful for more than just scaling. I find it very useful for reducing the number of variables when developing models and conducting experiments.
Dimensional analysis starts from a few basic axioms about dimensioned quantities and works out what they imply. You can use these implications to construct new variables from the old ones. The model is usually complete in a smaller number of these new variables. The technique does not tell you which variables are "correct", just how many independent ones are needed. Identifying "correct" variables requires data, domain knowledge, or both. (And sometimes there's no clear "best" variable; multiple choices work equally well.)
Dimensional analysis does not help with categorical variables, or with numbers that are already dimensionless (though occasionally a combination of dimensionless variables turns out to be what's "correct"). This is the main restriction. The reduction in the number of variables equals the number of independent dimensions involved, so in mechanics you can expect at best a reduction of about 3 (mass, length, and time). Dimensional analysis is most useful for physical problems with roughly 3 to 10 variables.
The basic idea is this: Dimensions are some sort of metadata which can tell you something about the structure of the problem. You can always rewrite a dimensional equation, for example, to be dimensionless on both sides. You should notice that some terms become constants when this is done, and that simplifies the equation.
Here's a physical example: Let's say you want to measure the drag on a sphere (units: N). You know this depends on the air speed (units: m/s), the kinematic viscosity (units: m^2/s), the air density (units: kg/m^3), and the diameter of the sphere (units: m). So you have 5 variables in total: one dependent and 4 independent. Let's say you want to do a factorial design with 4 levels in each independent variable, with no replications. You'd have to do 4^4 = 256 experiments. This is clearly too complicated.
What fluid dynamicists have recognized is that you can rewrite the relationship in terms of different variables, and nothing is missing. The Buckingham pi theorem mentioned previously says that we only need 2 dimensionless variables given our 5 dimensional variables. So, instead of the drag force, you use the drag coefficient, and instead of the speed, viscosity, etc., you use the Reynolds number. Now, you only need to do 4 experiments to get the same level of representation.
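As a sketch of how the counting works (assuming numpy; the concrete numbers below are made up purely for illustration): write each variable's units as a column of exponents over the base dimensions, and the pi theorem says the number of independent dimensionless groups is the number of variables minus the rank of that matrix.

```python
import numpy as np

# Dimensional matrix for the sphere-drag example: one column per
# variable, one row per base dimension, entries are the exponents
# of that dimension in the variable's units.
#           drag   speed  viscosity  density  diameter
#           (N)    (m/s)  (m^2/s)    (kg/m^3) (m)
A = np.array([
    [ 1,     0,      0,       1,       0],   # kg
    [ 1,     1,      2,      -3,       1],   # m
    [-2,    -1,     -1,       0,       0],   # s
])

# Buckingham pi: number of independent dimensionless groups
# = number of variables - rank of the dimensional matrix.
n_groups = A.shape[1] - np.linalg.matrix_rank(A)
print(n_groups)  # -> 2

# The two conventional groups, with made-up numbers just to show
# the standard formulas (frontal area of a sphere = pi*d^2/4):
drag, speed, nu, rho, diam = 1.2, 10.0, 1.5e-5, 1.2, 0.1
reynolds = speed * diam / nu
drag_coeff = drag / (0.5 * rho * speed**2 * (np.pi / 4) * diam**2)
```

The rank computation is why the reduction equals the number of independent dimensions: here the three rows (kg, m, s) are independent, so 5 variables collapse to 5 − 3 = 2 groups.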
As it turns out, you can use techniques like PCA on top of dimensional analysis to determine that certain dimensionless parameters are unimportant (there are other ways too). This further simplifies models.
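A minimal sketch of that idea on synthetic data (all numbers made up for illustration): if one dimensionless group is nearly redundant with another across your experiments, PCA on the standardized groups concentrates most of the variance in fewer components, hinting that a group can be dropped.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "experiments": three dimensionless groups, where pi3 is
# (by construction) nearly redundant with pi1.
pi1 = rng.uniform(1e3, 1e5, 200)                             # e.g. Reynolds-like
pi2 = rng.uniform(0.1, 10.0, 200)                            # independent group
pi3 = np.sqrt(pi1) * (1 + 0.01 * rng.standard_normal(200))   # ~ f(pi1)

# Standardize, then PCA via SVD.
X = np.column_stack([pi1, pi2, pi3])
X = (X - X.mean(axis=0)) / X.std(axis=0)
_, s, _ = np.linalg.svd(X, full_matrices=False)
explained = s**2 / np.sum(s**2)   # variance explained per component

# Two components capture nearly all the variance, suggesting one
# of the three groups is redundant here.
print(explained.round(3))
```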
There's a lot more on this topic than what I have covered and mentioned here. I would recommend reading the book Dimensional Analysis and the Theory of Models for more details and the proof of the pi theorem.
(Another advantage of dimensional analysis: If you discover a useful dimensionless variable, you can get it named after yourself.)
That's because outside of physics (and possibly chemistry) there are enough constants running around that all quantities are effectively dimensionless. I'm having a hard time seeing a situation in, say, biology where I could propose dimensional analysis with a straight face, to say nothing of the softer sciences.
One thing that most scientists in these soft sciences already have a good grasp on, but a lot of laypeople do not, is the idea of appropriately normalizing parameters: for instance, dividing something by the mass of the body, or by the population of a nation, to compare individuals or nations of different sizes.
People often make bad comparisons by failing to normalize properly. But hopefully most people reading this article are not at risk of that.
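A trivial sketch of that kind of normalization, with hypothetical numbers:

```python
# Hypothetical totals and populations, purely for illustration.
countries = {
    # name: (total_widgets, population)
    "A": (3_000_000, 10_000_000),
    "B": (2_000_000, 4_000_000),
}

# Raw totals favor A, but normalizing by population reverses the
# comparison: B produces more widgets per person.
per_capita = {name: total / pop for name, (total, pop) in countries.items()}
print(per_capita)  # -> {'A': 0.3, 'B': 0.5}
```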
As I said, dimensional analysis does not help with categorical variables. And when the number of dimensions is low and/or the number of variables is large, dimensional analysis can be useless. I think it's a necessary component of any model builder's toolbox, but not a tool you will use for every problem. Still, I would argue that it's underutilized. When dimensional analysis is useful, it definitely should be used. (For example, despite its obvious applications in physics, I don't think most physics undergrads learn the Buckingham pi theorem. It's usually only taught to engineers learning fluid dynamics and heat transfer.)
Two very common dimensionless parameters, ratios and fractions, certainly appear in biology. Also, the subject of allometry in biology is basically simple dimensional analysis.
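As a sketch of the allometry point (synthetic data with a made-up exponent): allometric laws have the power-law form y = a * x^b, so fitting a straight line in log-log space recovers the scaling exponent.

```python
import numpy as np

rng = np.random.default_rng(1)

# Allometric laws are power laws, y = a * x**b, i.e. a straight
# line in log-log space: log y = log a + b * log x.
# Synthetic data generated with a known exponent b = 0.75 and
# multiplicative noise; the fit should recover roughly that b.
mass = rng.uniform(1.0, 1000.0, 100)   # body mass, arbitrary units
rate = 3.0 * mass**0.75 * np.exp(0.05 * rng.standard_normal(100))

b, log_a = np.polyfit(np.log(mass), np.log(rate), 1)
print(round(b, 2))  # close to 0.75
```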
I've seen dimensional analysis applied in other soft sciences as well; political science, psychology, and sociology are a few examples I'm aware of. I can't comment much on the utility of its application in these cases, but it's such a simple technique that I think it's worth trying whenever you have data with units.
Speaking more generally, the idea of simplification coming from applying transformations to data has broad applicability. Dimensional analysis is just one example of this.