variance of 5 degrees
Nitpick: Units of variance would be 5 degrees^2.
Also, personally I find standard deviation easier to think about, and initially thought that you were accidentally calling the standard deviation the variance, though for Kalman filters variance does seem more useful.
Thanks! Edited. Yeah, I specifically focused on variance because of how Bayesian updates combine Normal distributions.
Good post!
Is it common to use Kalman filters for things that have nonlinear transformations, by approximating the posterior with a Gaussian (eg. calculating the closest Gaussian distribution to the true posterior by JS-divergence or the like)? How well would that work?
Grammar comment--you seem to have accidentally a few words at
Measuring multiple quantities: what if we want to measure two or more quantities, such as temperature and humidity? Furthermore, we might know that these are [missing words?] Then we now have multivariate normal distributions.
There are a number of Kalman-like things you can do when your updates are nonlinear.
The "extended Kalman filter" uses a local linear approximation to the update. There are higher-order versions. The EKF unsurprisingly tends to do badly when the update is substantially nonlinear. The "unscented Kalman filter" uses (kinda) a finite-difference approximation instead of the derivative, deliberately taking points that aren't super-close together to get an approximation that's meaningful on the scale of your actual uncertainty. Going further in that direction you get "particle filters" which represent your uncertainty not as a Gaussian but by a big pile of samples from its distribution. (There's a ton of lore on all this stuff. I am in no way an expert on it.)
Very neat tool, thanks for the conciseness of the explanation. Though I hope I won't have to measure 70° temperatures by hand any time soon. (I know, I know, it's in Fahrenheit, but it still sounds... dissonant ? to my european ears)
Summary: the Kalman Filter is Bayesian updating applied to systems that are changing over time, assuming all our distributions are Gaussians and all our transformations are linear.
Preamble - the general Bayesian approach to estimation: the Kalman filter is an approach to estimating moving quantities. When I think about a Bayesian approach to estimation, I think about passing around probability distributions: we have some distribution as our prior, we gather some evidence, and we have a new distribution as our posterior. In general, the mean of our distribution measures our best guess of the underlying value, and the variance represents our uncertainty.
In the Kalman filter, the only distribution we use is the normal/Gaussian distribution. One important property of this is that it can be parameterized completely by the mean and variance (or covariance in the multi-variate case.) If you know those two values, you know everything about the distribution.
As a result, people often talk about the Kalman filter as though it's estimating means and variances at different points, but I find it easier to think of it as outputting a distribution representing our current knowledge at any point.
The simplest case: taking multiple measurements of a fixed quantity with an accurate but imprecise sensor. For example, say we're trying to measure the temperature with a thermometer that we believe is accurate but has a variance of 5 degrees².
We're very bad at estimating temperatures by hand, so let's say our prior distribution is that the temperature is somewhere around 70 degrees with a variance of 20, or $N(70, 20)$. We take one readout from the thermometer, which (by assumption) yields a normal distribution centered around the true temperature $t$ with variance 5: $N(t, 5)$. The thermometer reads 78. What's our new estimate?
Well, it turns out there's a simple rule for combining Normal distributions with known variance: if our prior is $N(\mu_0, \sigma_0^2)$ and our observation is $N(\mu_1, \sigma_1^2)$, then the posterior is $N(\mu', \sigma'^2)$ with
(1) $\mu' = \mu_0 + k(\mu_1 - \mu_0)$
(2) $\sigma'^2 = \sigma_0^2 - k\sigma_0^2$, where
(3) $k = \frac{\sigma_0^2}{\sigma_0^2 + \sigma_1^2}$ is called the Kalman gain.
So if our first reading is 78, then $k = \frac{20}{25} = 0.8$, $\sigma'^2 = 20 - 0.8 \cdot 20 = 4$, and $\mu' = 70 + 0.8 \cdot (78 - 70) = 76.4$. If we take another reading, we'd apply the same set of calculations, except our prior would be $N(76.4, 4)$.
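As a quick check of the arithmetic, here is a minimal Python sketch of equations (1) to (3). The function name `kalman_update` and its argument names are made up for illustration; only the numbers come from the example above.

```python
def kalman_update(prior_mean, prior_var, obs_mean, obs_var):
    """Combine a Gaussian prior with a Gaussian observation (equations 1-3)."""
    k = prior_var / (prior_var + obs_var)                  # (3) Kalman gain
    post_mean = prior_mean + k * (obs_mean - prior_mean)   # (1)
    post_var = prior_var - k * prior_var                   # (2)
    return post_mean, post_var, k


# Prior N(70, 20), thermometer with variance 5 reads 78.
mean, var, k = kalman_update(70.0, 20.0, 78.0, 5.0)
print(f"gain={k:.2f} mean={mean:.2f} variance={var:.2f}")
```

Feeding the returned mean and variance back in as the next prior handles a second reading.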
Some intuition: let's look at the Kalman gain. First, note that its value is always between 0 and 1. Second, note that the gain is close to 0 if $\sigma_1^2$ is large compared to $\sigma_0^2$, and close to 1 in the opposite case. Intuitively, we can think of the Kalman gain as a measure of how much we trust our new observation relative to our prior, where the variances are a measure of uncertainty.
What happens to the mean? It moves along the line from our prior mean to the observation. If we trust the observation a lot, k is nearly 1, and we move almost all the way. If we trust the prior much more than the observation, we adjust our estimate very little. And if we trust them equally, we take the average of the two.
Also note that the variance always goes down. Once again, if we trust the new information a lot, the variance goes down a bunch. If we trust the new information and our prior equally, then the variance is halved.
Finally, as a last tidbit, it doesn't matter which distribution is the prior and which is the observation in this case - we'll get exactly the same posterior if we switch them around.
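These properties are easy to check numerically. The sketch below uses a hypothetical helper `combine` (not a name from the post) to verify the swap symmetry on the thermometer example and the equal-trust case on made-up numbers.

```python
def combine(mean_a, var_a, mean_b, var_b):
    """Scalar Gaussian update: treat (a) as the prior and (b) as the observation."""
    k = var_a / (var_a + var_b)
    return mean_a + k * (mean_b - mean_a), var_a - k * var_a


# Swapping prior and observation should give the same posterior.
forward = combine(70.0, 20.0, 78.0, 5.0)
swapped = combine(78.0, 5.0, 70.0, 20.0)
print(forward, swapped)

# Equal trust (illustrative numbers): the mean is the average, the variance is halved.
equal_mean, equal_var = combine(70.0, 10.0, 80.0, 10.0)
print(equal_mean, equal_var)
```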
Adding a sensor: none of the math above assumes we're always using the same sensor. As long as we assume all our sensors draw from distributions centered around the true mean and with a known (or estimated) variance, we can update on observations from any number of sensors, using the same update rule.
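A rough sketch of what that looks like: fold the same scalar update over a list of readings, each tagged with its own sensor variance. The second sensor (variance 2, reading 75) is invented for illustration and is not from the example above.

```python
def kalman_update(prior_mean, prior_var, obs_mean, obs_var):
    """Scalar Gaussian update, as in equations (1)-(3)."""
    k = prior_var / (prior_var + obs_var)
    return prior_mean + k * (obs_mean - prior_mean), prior_var - k * prior_var


# (reading, sensor variance) pairs; the second sensor is hypothetical.
readings = [(78.0, 5.0), (75.0, 2.0)]

mean, var = 70.0, 20.0  # prior N(70, 20)
for reading, sensor_var in readings:
    mean, var = kalman_update(mean, var, reading, sensor_var)
    print(f"after reading {reading}: mean={mean:.3f} variance={var:.3f}")
```

The more precise sensor pulls the estimate harder, because its smaller variance produces a larger gain.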
Measuring multiple quantities: what if we want to measure two or more quantities, such as temperature and humidity? Then we now have multivariate normal distributions. While a single-variable Gaussian is parameterized by its mean and variance, an n-variable Gaussian is parameterized by a vector of n means and an n×n covariance matrix: $N(\vec{\mu}, \Sigma)$.
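Our update equations are the multivariate versions of the equations above: given a prior distribution $N(\vec{\mu}_0, \Sigma_0)$ and a measurement $\vec{\mu}_1$ from a sensor with covariance matrix $\Sigma_1$, our posterior distribution is $N(\vec{\mu}', \Sigma')$ with:
(4) $\vec{\mu}' = \vec{\mu}_0 + K(\vec{\mu}_1 - \vec{\mu}_0)$
(5) $\Sigma' = \Sigma_0 - K\Sigma_0$
(6) $K = \Sigma_0(\Sigma_0 + \Sigma_1)^{-1}$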
These are basically just the matrix versions of equations (1), (2), and (3).
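Here is a small pure-Python sketch of the matrix update for the two-variable case, with the mean update written in the same innovation form as equation (1). The helper names and the humidity numbers are assumptions for illustration; the covariances are deliberately diagonal so each component can be checked against the scalar rule.

```python
def mat_mul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def mat_vec(a, v):
    return [sum(x * y for x, y in zip(row, v)) for row in a]

def mat_add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def mat_sub(a, b):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def inv_2x2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def kalman_update(mu0, sigma0, mu1, sigma1):
    """Multivariate Gaussian update for 2-dimensional states."""
    K = mat_mul(sigma0, inv_2x2(mat_add(sigma0, sigma1)))        # gain
    innovation = [m1 - m0 for m0, m1 in zip(mu0, mu1)]
    mu = [m0 + c for m0, c in zip(mu0, mat_vec(K, innovation))]  # mean
    sigma = mat_sub(sigma0, mat_mul(K, sigma0))                  # covariance
    return mu, sigma

# State is (temperature, humidity); the humidity values are made up.
mu, sigma = kalman_update(
    [70.0, 50.0], [[20.0, 0.0], [0.0, 100.0]],   # prior
    [78.0, 40.0], [[5.0, 0.0], [0.0, 25.0]],     # measurement and sensor covariance
)
print(mu, sigma)
```

With off-diagonal entries in the covariance matrices, a reading of one quantity would also shift the estimate of the other.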
Adding predictable change over time: so far, we've covered Bayesian updates when you're making multiple measurements of some static set of quantities. But what about when things are changing? A classic example is a moving car. For this case, let's assume we're measuring two quantities – position and velocity.
For a bit more detail, say at time 0 our vector is $\vec{\mu}_0 = \begin{pmatrix} x_0 \\ v_0 \end{pmatrix}$, where $x_0$ is the position and $v_0$ is the velocity. Then at time $\tau$, we might expect the position to be $x_0 + \tau \cdot v_0$, and the velocity to be the same on average. We can represent this with a matrix: $\vec{\mu}' = F\vec{\mu}_0$, where $F$ is the matrix $\begin{pmatrix} 1 & \tau \\ 0 & 1 \end{pmatrix}$.
More generally, say our belief at time $t$ is $N(\vec{\mu}_0, \Sigma_0)$. Then our belief at time $t + \tau$, before we make any new observations, should be $F\,N(\vec{\mu}_0, \Sigma_0)$. Fortunately there's a simple formula for this: $F\,N(\vec{\mu}_0, \Sigma_0) = N(F\vec{\mu}_0, F\Sigma_0 F^T)$.
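To make the transition step concrete, here is a small sketch that pushes a position/velocity belief forward by $\tau = 2$. All the numbers and helper names are invented for illustration.

```python
def mat_mul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def mat_vec(a, v):
    return [sum(x * y for x, y in zip(row, v)) for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def predict(mu, sigma, F):
    """Push the belief N(mu, sigma) through the linear map F."""
    return mat_vec(F, mu), mat_mul(mat_mul(F, sigma), transpose(F))

tau = 2.0
F = [[1.0, tau],
     [0.0, 1.0]]

mu0 = [0.0, 10.0]          # position 0, velocity 10 (illustrative)
sigma0 = [[4.0, 0.0],
          [0.0, 1.0]]      # independent uncertainty in position and velocity

mu_pred, sigma_pred = predict(mu0, sigma0, F)
print(mu_pred)     # predicted mean
print(sigma_pred)  # predicted covariance
```

Note that the position variance grows and a position/velocity correlation appears: uncertainty about the velocity turns into uncertainty about where the car ends up.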
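Putting it all together, say our belief at time $t$ is $N(\vec{\mu}_0, \Sigma_0)$, and at time $t + \tau$ we measure a value $\vec{\mu}_1$ from a sensor with covariance matrix $\Sigma_1$. Then we perform the Bayesian update with $F\,N(\vec{\mu}_0, \Sigma_0) = N(F\vec{\mu}_0, F\Sigma_0 F^T)$ as the prior and $N(\vec{\mu}_1, \Sigma_1)$ as the observation:
(7) $\vec{\mu}' = F\vec{\mu}_0 + K(\vec{\mu}_1 - F\vec{\mu}_0)$
(8) $\Sigma' = F\Sigma_0 F^T - K F\Sigma_0 F^T$
(9) $K = F\Sigma_0 F^T (F\Sigma_0 F^T + \Sigma_1)^{-1}$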
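A minimal end-to-end sketch of one filter step (transition, then update) for a 2-dimensional state follows. It assumes, as the equations here do, that the sensor reports the full state (position and velocity) directly; the helper names and all numbers are hypothetical.

```python
def mat_mul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def mat_vec(a, v):
    return [sum(x * y for x, y in zip(row, v)) for row in a]

def mat_add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def mat_sub(a, b):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def transpose(a):
    return [list(col) for col in zip(*a)]

def inv_2x2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def kalman_step(mu0, sigma0, F, mu1, sigma1):
    """One step: apply the transition F, then update on the measurement mu1."""
    mu_prior = mat_vec(F, mu0)                                   # F mu0
    sigma_prior = mat_mul(mat_mul(F, sigma0), transpose(F))      # F Sigma0 F^T
    K = mat_mul(sigma_prior, inv_2x2(mat_add(sigma_prior, sigma1)))
    innovation = [o - p for o, p in zip(mu1, mu_prior)]
    mu_post = [p + c for p, c in zip(mu_prior, mat_vec(K, innovation))]
    sigma_post = mat_sub(sigma_prior, mat_mul(K, sigma_prior))
    return mu_post, sigma_post

F = [[1.0, 2.0], [0.0, 1.0]]                     # tau = 2
mu_post, sigma_post = kalman_step(
    [0.0, 10.0], [[4.0, 0.0], [0.0, 1.0]],       # belief at time t
    F,
    [22.0, 9.0], [[2.0, 0.0], [0.0, 1.0]],       # measurement at t + tau
)
print(mu_post, sigma_post)
```

Running the filter over time is just this function in a loop, with each posterior becoming the next step's belief.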
And that's the main idea! We just adjust our prior by applying a transition function/matrix to it first. In practice, the Kalman filter tends to quickly converge to true values, and is widely used in applications such as GPS tracking.