A wild speculation about a possible application of this to network interpretability:
Given a dog-cat classifying network with the following layer format ,[1] this can be viewed as a sequence of compressions and information-throwing-away-operations (the zeros in the S, and the taking of all negative values to 0 in ReLU discard information), where for any given input, the information relevant to the classification task is placed into the entries which get preserved (the non-zero entries in S, and the positive entries in the ReLU).
In other words, the network is playing a very similar version of the generalized heat engine game, but where the energy being extracted is relevant information, and the transformations are subject to the constraint that they must all be orthogonal and linear.
If this framing is useful, we should expect that if we hold each S constant, randomize the Vs and Us, and then re-optimize them, we get back Vs and Us similar to the ones we started with.
Where is an SVD of our learned weight matrix.
This post continues where the previous post left off.
Recap from previous post: we have two sets of IID coins. One set is the “hot pool”, $X_H$, in which each coin has probability 0.2 of being heads. The other set is the “cold pool”, $X_C$, in which each coin has probability 0.1 of being heads. We apply transformations to $X$. Each transformation must be invertible and conserve the total number of heads across both pools. We want to choose such a transformation to make some of the coins (nearly) deterministically heads. We call the number of deterministic coins produced “work”, $w$.
Key idea from previous post: this is a compression problem. We want w of the coins to be deterministic and the transformation to be reversible, so all the information from the initial coin-state must be compressed into the non-deterministic coins in the final state. The thermodynamic flavor of the problem comes from the additional constraint: we want to compress the initial state “data” while also conserving the total number of heads.
In this post, we will make one small tweak to this setup relative to the previous post. In the previous post, “work” just meant making some coins deterministically heads; we didn’t have a strong opinion about which coins, so long as we knew which coins they were. In this post, we’ll assume that we start with some extra coins outside our two pools which are deterministically tails, and use our “work” to make those particular coins deterministically heads. This makes it a bit cleaner to separate the “moving energy around” and “moving uncertainty around” aspects of the problem, though of course those two pieces end up coupled.
This post will dig more into the optimization aspect of the problem, look at temperature as a “price” at which we can trade energy (or analogous conserved quantities) for entropy, and view the heat engine problem as arbitrage between two subsystems with different temperatures/prices. That, in turn, yields the usual thermodynamic efficiency limit. Then, we’ll point to some generalizations and applications to wrap up.
Lagrange Multipliers
Let’s think about just one of our two pools in isolation. We’ll imagine adding/removing marginal heads (analogous to energy) to/from the pool.
Recall that our initial coin distribution for each pool is maxentropic subject to a constraint on the number of heads. If we add one head to a pool, then the constraint is relaxed slightly, so that pool’s entropy can increase - the maximum entropy distribution on those coins with one more head allowed will have slightly more entropy than the initial maximum entropy distribution.
How much can the entropy increase? That’s given by the Lagrange multiplier associated with the constraint. As long as the number of heads we add is small (i.e. small enough to use a linear approximation), the increase in maximum entropy will be roughly the number of heads added times the Lagrange multiplier. In economic terms: the Lagrange multiplier is the “price” at which we can trade marginal heads for a marginal change in maximum entropy. (Indeed, prices in economics are exactly the Lagrange multipliers in agents’ optimization problems.)
In standard stat mech, this Lagrange multiplier is the (inverse) temperature. Specifically: in our setup, if we assume energy is proportional to the number of heads and take the limit as the number of coins goes to infinity, then our constraint on total heads becomes a constraint on average energy. The Lagrange multiplier associated with the average energy constraint in the entropy maximization problem is $\beta \propto \frac{1}{T}$, with the proportionality given by Boltzmann’s constant.
Quantitatively, for our example problem, the (initial) Lagrange multiplier for each pool is $-\log\left(\frac{p_H}{1-p_H}\right)$, i.e. the negative log odds of heads in each pool, where $p_H$ is that pool’s heads-probability and logs are base 2. That’s 2 bits/head for the hot pool and roughly 3.17 bits/head for the cold pool. Conceptually: if heads have probability 0.2, then one head makes a contribution of −log(0.2) to the entropy, while one tail makes a contribution of −log(0.8). Flipping a tail to a head therefore increases entropy by −log(0.2)+log(0.8)=log(4)=2 bits. (Though note that this is a post-hoc conceptual explanation; Lagrange multipliers are usually best calculated using the usual methods of convex optimization and maximum entropy.)
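As a quick check of those two numbers, here is a minimal sketch that evaluates the log-odds formula for each pool. The function name `entropy_price_bits_per_head` is a hypothetical helper introduced for illustration, and logs are taken base 2 so the result is in bits per head.

```python
import math

def entropy_price_bits_per_head(p_heads: float) -> float:
    """Lagrange multiplier ("price"): marginal maximum entropy, in bits,
    gained per head added to a pool whose coins are heads with
    probability p_heads. Equals -log2(p / (1 - p))."""
    return math.log2((1.0 - p_heads) / p_heads)

hot_price = entropy_price_bits_per_head(0.2)   # hot pool from the example
cold_price = entropy_price_bits_per_head(0.1)  # cold pool from the example

print(f"hot: {hot_price:.2f} bits/head, cold: {cold_price:.2f} bits/head")
```

The colder pool has the higher price: adding a head to it buys more entropy than adding the same head to the hot pool.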
Arbitrage
If we have two pools at different temperatures, then we can “arbitrage” heads/energy between them to increase the maximum entropy of the whole system.
We remove one head from the hot pool (remember: this just means subtracting 1 from the constraint on the number of heads in that pool). In our example, the hot pool’s Lagrange multiplier is 2, so this decreases the maximum entropy by roughly 2 bits. But then, we add one head to the cold pool, so its maximum entropy increases by roughly 3.17 bits. The total number of heads across the full system remains constant, but the maximum entropy of the full system has increased by 1.17 bits.
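The "roughly" in that paragraph can be checked directly on a finite pool. The sketch below assumes the large-pool approximation in which the maximum entropy of n coins with an expected head count of k is n times the binary entropy of k/n; the pool size `N` and the helper names are hypothetical choices for illustration.

```python
import math

def binary_entropy_bits(p: float) -> float:
    """Entropy in bits of a single coin with heads-probability p."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def max_entropy_bits(n_coins: int, n_heads: float) -> float:
    """Maximum entropy of n_coins coins constrained to n_heads expected
    heads (assumed large-pool approximation: IID coins at p = n_heads / n_coins)."""
    return n_coins * binary_entropy_bits(n_heads / n_coins)

N = 1_000_000  # hypothetical pool size, large enough for the linear approximation

# Move one head from the hot pool (p = 0.2) to the cold pool (p = 0.1).
hot_change = max_entropy_bits(N, 0.2 * N - 1) - max_entropy_bits(N, 0.2 * N)
cold_change = max_entropy_bits(N, 0.1 * N + 1) - max_entropy_bits(N, 0.1 * N)
net_gain = hot_change + cold_change

print(f"hot pool: {hot_change:+.3f} bits, cold pool: {cold_change:+.3f} bits")
print(f"net gain: {net_gain:+.3f} bits")
```

The per-pool changes come out close to the two Lagrange multipliers, with a small second-order correction that shrinks as the pool grows.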
What does this mean in terms of extractable “work”, i.e. number of coins we can deterministically make heads?
To extract a unit of work, we remove a head from one of the pools, and add that head to our initially-tails pool, reducing the maximum entropy of the source pool by one head’s worth (2 bits for the hot pool, 3.17 bits for the cold pool, same as earlier). To maximize efficiency, we’ll take it from the hot pool, so each head of work will decrease the maximum achievable entropy by 2 bits.
To make our whole transformation valid, we must move enough heads from hot pool to cold pool to offset the maximum entropy loss of our work-coins. Assuming we take the work-coins from the hot pool, we’ll need to move roughly 2/1.17 ≈ 1.71 heads from hot to cold for each head of work extracted. In terms of thermodynamic efficiency: for each head removed from the hot pool, we can extract roughly 1/(1 + 1.71) ≈ 0.37 heads of work.
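The arithmetic in this step can be sketched in a few lines. The variable names are hypothetical; the prices are the two Lagrange multipliers from the example (2 bits/head and log2(9) ≈ 3.17 bits/head).

```python
import math

hot_price = math.log2(0.8 / 0.2)   # bits of max entropy per head, hot pool
cold_price = math.log2(0.9 / 0.1)  # bits of max entropy per head, cold pool

# Each head moved hot -> cold frees up this much maximum entropy.
gain_per_transfer = cold_price - hot_price

# Each work-coin taken from the hot pool costs hot_price bits,
# which must be paid for by transfers.
transfers_per_work = hot_price / gain_per_transfer

# Heads of work per head removed from the hot pool
# (one work head plus the transfers needed to pay for it).
efficiency = 1.0 / (1.0 + transfers_per_work)

print(f"heads moved per head of work: {transfers_per_work:.2f}")
print(f"efficiency: {efficiency:.2f}")
```

The same efficiency falls out of `1 - hot_price / cold_price`, which is the price form of the usual temperature ratio.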
Writing out the general equation: our Lagrange multipliers are inverse temperatures $\frac{1}{T_H}$ and $\frac{1}{T_C}$. Each work-coin “costs” $\frac{1}{T_H}$ bits of maximal entropy, and each head moved from hot to cold “earns” $\frac{1}{T_C} - \frac{1}{T_H}$ bits of maximal entropy, so the number of heads we need to move from hot to cold for each head of work is
$$\frac{\frac{1}{T_H}}{\frac{1}{T_C} - \frac{1}{T_H}} = \frac{1}{\frac{T_H}{T_C} - 1}$$
Finally, the traditional thermodynamic efficiency measure: for each head removed from the hot pool, we can extract $\frac{1}{1 + \frac{1}{\frac{T_H}{T_C} - 1}} = 1 - \frac{T_C}{T_H}$ heads of work.
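For readers who want the intermediate algebra, the simplification of the efficiency expression can be written out step by step. This only expands the formula already stated, with $T_H$ and $T_C$ the hot and cold temperatures:

```latex
\begin{align*}
\frac{1}{1 + \dfrac{1}{\frac{T_H}{T_C} - 1}}
  &= \frac{\frac{T_H}{T_C} - 1}{\left(\frac{T_H}{T_C} - 1\right) + 1}
   = \frac{\frac{T_H}{T_C} - 1}{\frac{T_H}{T_C}}
   = 1 - \frac{T_C}{T_H}.
\end{align*}
```

In the running example, $\frac{1}{T_H} = 2$ and $\frac{1}{T_C} \approx 3.17$, so $1 - \frac{T_C}{T_H} \approx 1 - \frac{2}{3.17} \approx 0.37$, matching the number computed earlier.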
As expected, we've reproduced the usual thermodynamic efficiency limit.
Recap
Let’s recap the reasoning.
We want our transformation to be reversible, so all of the information from the initial distribution must be “stored” in the final distribution. That means the final distribution must have at least as much entropy as the initial distribution - otherwise we won’t have enough space. So, our transformation must not decrease the maximum achievable entropy. That’s the argument from the previous post.
This post says that, if we decrease the number of heads in a pool by 1, then that has a “cost” in maximum achievable entropy, and that cost is given by the Lagrange multiplier in the entropy maximization problem (i.e. the inverse temperature). With one hot pool and one cold pool, we can “arbitrage” by moving heads from one pool to the other, freeing up extra entropy. We can then “spend” that extra entropy to remove heads from a pool and turn them into work. This gives us the usual thermodynamic efficiency limit, $1 - \frac{T_C}{T_H}$.
Further Generalization
With this setup, it’s easy to see further generalizations. We could have more constraints; these would each have a “price” associated with them, given by the corresponding Lagrange multiplier in the maximum entropy problem. We could even have nonlinear constraints (i.e. not additive across the pools), which we'd handle by local linear approximation. We could have more pools, and we could arbitrage between the pools whenever the prices are different.
We can also generalize thermal equilibrium. Traditionally, we consider two systems to be in thermal equilibrium if they have the same temperature, i.e. the same Lagrange multipliers/prices. More generally, we can consider systems with many constraints and many pools to be in equilibrium when the Lagrange multipliers/prices for all pools match. Notably, this is essentially the same equilibrium condition used in microeconomics: economic agents in a market are in equilibrium when the Lagrange multipliers/prices on all their utility-maximization problems match, and those matching prices are called the “market prices”. So thermal equilibrium corresponds to economic equilibrium in a rather strong sense. (One difference, however: in the thermodynamic picture, the entropies of different pools are added together, whereas we can’t always add utilities across agents in economics - the economic model is a bit more general in that sense. Thermal equilibrium is an economic equilibrium, but the reverse does not apply.)
Another generalization: recall that our original hot pool had heads-probability 0.2, and the cold pool 0.1. Those numbers were chosen so that adding a head to either pool would increase the pool’s maximum entropy, just as adding energy to a system usually increases its maximum entropy in physics. But it could go the other way. For instance, if the hot pool had heads-probability 0.8, then removing a head would increase its maximum entropy. In that case, we could just remove heads for free! (Well, at least until the excess heads ran out.) Alas, in thermodynamics, presumably such a system would be highly unstable.
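A small sketch makes the sign flip concrete. It reuses the log-odds formula for the Lagrange multiplier; the function name is a hypothetical helper, and a negative result is read as "removing a head raises the pool's maximum entropy".

```python
import math

def entropy_price_bits_per_head(p_heads: float) -> float:
    """Marginal maximum entropy (bits) gained per head added to a pool
    with heads-probability p_heads; negative when p_heads > 0.5."""
    return math.log2((1.0 - p_heads) / p_heads)

price_at_02 = entropy_price_bits_per_head(0.2)  # original hot pool
price_at_08 = entropy_price_bits_per_head(0.8)  # the flipped case

print(f"p = 0.2: {price_at_02:+.2f} bits/head")
print(f"p = 0.8: {price_at_08:+.2f} bits/head")
```

With a negative price, taking a head out of the pool adds entropy instead of costing it, which is why those heads come "for free" until the heads-probability drops back to 0.5.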
Finally, a generalization with potentially very wide applicability: it turns out that all of Bayesian probability can be expressed in terms of maximum entropy. When data comes in, our update says “now maximize entropy subject to the data variable being deterministically the observed value”, and this turns out to be equivalent to Bayes’ rule. So, if we express all our models in maximum entropy terms, then essentially any system would be in the right form to apply the ideas above. That said, it wouldn’t necessarily say anything interesting about any random system; it’s the interplay of compression and additional constraints which makes things interesting.
Applications?
I’m still absorbing all this myself. Some examples of the sort of problems I imagine it might apply to, beyond physics:
I see two (not-insurmountable) barriers to applying thermo-like ideas to these problems. First, outside of physics, our transformations don’t always need to be invertible. In more general problems, I expect we’d want to factor the problem into two parts: one part where our choices reduce the number of possible environments, and another part where we just move uncertainty around within the possible environments. The second part would be thermo-like.
The other barrier is the “goal”. In the thermodynamic setup, we’re trying to deterministically extract a resource - heads, in our toy problem, or energy in physics. This resource-extraction is not synonymous with whatever the goal is in a general optimization problem; resource-extraction would usually just be a subgoal. In any particular problem, we might be able to identify subgoals which involve deterministic resource extraction, but it would be more useful to have a general method for tying a generic goal to the thermo-like problem.
Again, these problems don’t seem insurmountable, or even very conceptually difficult. They’d take some legwork, but are probably tractable.
I’d also be interested to hear other applications which jump to mind. I’m still mulling this over, so there are probably whole categories of use-cases that I’m missing.