Daniel_Burfoot comments on Beyond Statistics 101 - Less Wrong

19 Post author: JonahSinick 26 June 2015 10:24AM

Comments (129)

Comment author: btrettel 27 June 2015 01:01:56AM * 7 points

PCA and other dimensionality reduction techniques are great, but there's another very useful technique that most people (even statisticians) are unaware of: dimensional analysis, and in particular, the Buckingham pi theorem. For some reason, this technique is used primarily by engineers in fluid dynamics and heat transfer despite its broad applicability. This is the technique that makes scale-model testing (as in wind tunnels) work, but it's useful for more than scaling. I find it very useful for reducing the number of variables when developing models and conducting experiments.

Dimensional analysis recognizes a few basic axioms about models with dimensions and sees what they imply. You can use these to construct new variables from the old variables. The model is usually complete in a smaller number of these new variables. The technique does not tell you which variables are "correct", just how many independent ones are needed. Identifying "correct" variables requires data, domain knowledge, or both. (And sometimes, there's no clear "best" variable; several work equally well.)

Dimensional analysis does not help with categorical variables or with numbers that are already dimensionless (though by luck, sometimes combinations of dimensionless variables are actually what's "correct"). This is the main restriction. And you can expect it to reduce the number of variables by about 3 at best. Dimensional analysis is most useful for physical problems with maybe 3 to 10 variables.

The basic idea is this: Dimensions are a kind of metadata that can tell you something about the structure of the problem. You can always rewrite a dimensional equation, for example, to be dimensionless on both sides. You should notice that some terms become constants when this is done, and that simplifies the equation.

Here's a physical example: Let's say you want to measure the drag on a sphere (units: N). You know this depends on the air speed (units: m/s), kinematic viscosity (units: m^2/s), air density (units: kg/m^3), and the diameter of the sphere (units: m). So, you have 5 variables in total: the drag plus 4 independent variables. Let's say you want to do a factorial design with 4 levels in each of the 4 independent variables, with no replications. You'd have to do 4^4 = 256 experiments. This is clearly too many.

What fluid dynamicists have recognized is that you can rewrite the relationship in terms of different variables, and nothing is missing. The Buckingham pi theorem mentioned previously says that we only need 2 dimensionless variables given our 5 dimensional variables, because those 5 variables involve 3 base dimensions (mass, length, and time) and 5 - 3 = 2. So, instead of the drag force, you use the drag coefficient, and instead of the speed, viscosity, etc., you use the Reynolds number. Now, you only need to do 4 experiments (one per level of the Reynolds number) to get the same level of representation.
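The counting in the sphere example can be checked with a short script. This is only a sketch: the variable labels and helper functions are made up for illustration, the dimension exponents are read off the units quoted above, and the drag coefficient is taken to be the drag force divided by density times speed squared times diameter squared, up to a constant factor.

```python
from fractions import Fraction

# Exponents of (mass, length, time) for each variable in the sphere example.
# The labels are illustrative, not taken from any library.
DIMENSIONS = {
    "drag_force": (1, 1, -2),           # N = kg m / s^2
    "speed": (0, 1, -1),                # m/s
    "kinematic_viscosity": (0, 2, -1),  # m^2/s
    "density": (1, -3, 0),              # kg/m^3
    "diameter": (0, 1, 0),              # m
}


def matrix_rank(rows):
    """Rank of a matrix via Gaussian elimination over exact fractions."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


def count_pi_groups(dimensions):
    """Buckingham pi: number of variables minus rank of the dimensional matrix."""
    return len(dimensions) - matrix_rank(list(dimensions.values()))


def is_dimensionless(powers, dimensions):
    """Check that a product of variables raised to `powers` has no dimensions."""
    totals = [0, 0, 0]
    for name, power in powers.items():
        for i, exponent in enumerate(dimensions[name]):
            totals[i] += power * exponent
    return all(t == 0 for t in totals)


# Reynolds number: speed * diameter / kinematic viscosity
REYNOLDS = {"speed": 1, "diameter": 1, "kinematic_viscosity": -1}
# Drag coefficient, up to a constant: force / (density * speed^2 * diameter^2)
DRAG_COEFFICIENT = {"drag_force": 1, "density": -1, "speed": -2, "diameter": -2}

LEVELS = 4
n_groups = count_pi_groups(DIMENSIONS)
runs_dimensional = LEVELS ** (len(DIMENSIONS) - 1)  # 4 independent variables
runs_dimensionless = LEVELS ** (n_groups - 1)       # 1 independent group

print(n_groups, runs_dimensional, runs_dimensionless)
print(is_dimensionless(REYNOLDS, DIMENSIONS),
      is_dimensionless(DRAG_COEFFICIENT, DIMENSIONS))
```

The rank of the dimensional matrix is 3 here, which is where the reduction from 5 variables to 2 comes from. The script only counts groups and checks candidates; it does not say which dimensionless groups are the "correct" ones to use.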

As it turns out, you can use techniques like PCA on top of dimensional analysis to determine that certain dimensionless parameters are unimportant (there are other ways too). This further simplifies models.

There's a lot more on this topic than what I have covered and mentioned here. I would recommend reading the book "Dimensional Analysis and the Theory of Models" for more details and the proof of the pi theorem.

(Another advantage of dimensional analysis: If you discover a useful dimensionless variable, you can get it named after yourself.)

Comment author: Daniel_Burfoot 28 June 2015 06:01:47PM * 3 points

I've always been amazed at the power of dimensional analysis. To me the best example is the problem of calculating the period of an oscillating mass on a spring. The relevant values are the spring constant K (kg/s^2) and the mass M (kg), and the period T is in (s). The only way to combine K and M to obtain a value with dimensions of (s) is sqrt(M/K), and that's the correct form of the actual answer - no calculus required!
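Spelled out, the exponent matching goes roughly as follows. The exponents a and b are introduced here for illustration, and the argument assumes that K and M are the only relevant quantities.

```latex
% Assume T \propto K^a M^b and match dimensions on both sides.
[T] = [K]^a [M]^b
\quad\Longrightarrow\quad
\mathrm{s} = (\mathrm{kg\,s^{-2}})^a \, \mathrm{kg}^b
           = \mathrm{kg}^{\,a+b} \, \mathrm{s}^{-2a}

% Matching the powers of kg and s gives two linear equations:
a + b = 0, \qquad -2a = 1
\quad\Longrightarrow\quad
a = -\tfrac{1}{2}, \quad b = \tfrac{1}{2}

% Hence, up to a dimensionless constant C that this argument cannot fix:
T = C \sqrt{M / K}
```

For the ideal spring the constant turns out to be C = 2π, but that value comes from solving the equation of motion, not from the dimensions.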

Comment author: Douglas_Knight 01 July 2015 06:25:23AM 2 points

Actually, there's another parameter, the displacement. It turns out that the spring period does not depend on the displacement, but that's a miracle that is special to springs. Instead, look at the pendulum. The same dimensional analysis gives the square root of the length divided by gravitational acceleration. That's off by a dimensionless constant, 2π. Moreover, even that is only approximately correct. The real answer depends on the displacement in a complicated way.
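The amplitude dependence of the pendulum can be seen numerically. The sketch below assumes an ideal pendulum (point mass, rigid massless rod, no friction), for which the exact period can be written as 2π sqrt(L/g) divided by the arithmetic-geometric mean of 1 and cos(θ0/2), where θ0 is the release angle. The function names and the default g = 9.81 m/s^2 are choices made for this example.

```python
import math


def agm(a, b, max_iter=60):
    """Arithmetic-geometric mean of two positive numbers."""
    for _ in range(max_iter):
        if abs(a - b) <= 1e-15 * abs(a):
            break
        a, b = (a + b) / 2, math.sqrt(a * b)
    return a


def pendulum_period(length, amplitude_rad, g=9.81):
    """Exact period of an ideal pendulum released from rest at amplitude_rad.

    Valid for 0 <= amplitude_rad < pi; the period diverges as the
    amplitude approaches pi (pendulum balanced upside down).
    """
    if not 0 <= amplitude_rad < math.pi:
        raise ValueError("amplitude must satisfy 0 <= amplitude < pi")
    small_angle_period = 2 * math.pi * math.sqrt(length / g)
    return small_angle_period / agm(1.0, math.cos(amplitude_rad / 2))


length = 1.0  # metres
base = 2 * math.pi * math.sqrt(length / 9.81)
for degrees in (1, 10, 45, 90, 150):
    ratio = pendulum_period(length, math.radians(degrees)) / base
    print(f"{degrees:>3} degrees: period / (2*pi*sqrt(L/g)) = {ratio:.5f}")
```

The printed ratio stays close to 1 for small swings and grows with the amplitude, so the factor multiplying sqrt(L/g) is a function of the dimensionless release angle, not a single constant.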

Comment author: btrettel 01 July 2015 01:55:39PM * 0 points

This is a good point. At best you can figure out that the period is proportional to (not equal to) sqrt(M/K) multiplied by some function of other dimensionless parameters, say, one involving displacement and another characterizing the non-linearity (if K is just the initial slope, as I've seen done before). It's a fortunate coincidence if the other parameters are unimportant. You cannot determine based solely on dimensional analysis whether certain parameters are unimportant.