Just got around to reading this, and I found it helpful. Thank you.
This was a cool post! I was familiar with f-divergences as a generalization of KL divergence, and of course familiar with maxent methods, but I hadn't seen the two put together before.
Problem: I'm left unsure when or why I would ever want to use this machinery. Sufficient statistics are an intuitive concept, and I can look around at the world and make guesses about where I would want to use sufficient statistics to model things. Independence is also an intuitive concept, and I can look around the world and make guesses about where to use that. Combining those two, I can notice places in the world where (some version of) KPD should bind, and if it doesn't bind then I'm surprised.
But I don't see how to notice places where max-f-divergence distributions should bind to reality. I can see where sufficient statistics should exist in the world, but that's not a sufficient condition for a max-f-divergence distribution; there are (IIUC) lots of other distributions for which sufficient statistics exist. So why focus on the max-f-divergence class specifically? What intuitive property of (some) real-world systems nails down that class of distributions? Maybe some kind of convexity condition on the distribution or on updates or something?
"A practical roadblock is that the above numerical results for inference are terribly slow to compute..."
Not sure exactly what you're doing numerically, but here's how I usually handle vanilla maxent problems (in my usual notation, which is not quite the same as yours; apologies for putting marginal translation work on you). We start with
$$\max_{P[X]} H(X) \quad \text{subject to} \quad E[f_j(X)] = \mu_j \text{ for each } j$$
Transform that to
$$\max_{P[X]} H(X) \quad \text{subject to} \quad E[f_j(X) - \mu_j] = 0 \text{ for each } j$$
This gives a dual problem for the partition function, which I'll write in full:
$$\min_{\lambda} \; \ln \sum_x \exp\Big(\sum_j \lambda_j \, (f_j(x) - \mu_j)\Big)$$
That's an unconstrained optimization problem, it's convex in the happy direction, and standard optimizers (e.g. scipy using jax for derivatives) can usually handle it very quickly.
I would guess that process generalizes just fine to other f-divergences, since it's basically just relying on convex optimization tricks. And that should yield quite fast numerics.
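Concretely, here's a minimal sketch of that dual solve, assuming a small discrete state space; the two moment constraints (targets E[X] = 3 and E[X^2] = 12 on support 0..9) are made-up placeholders, to be swapped for whatever constraints you actually have:

```python
import jax
import jax.numpy as jnp
import numpy as np
from scipy.optimize import minimize

# Hypothetical setup: X ranges over 0..9, with two moment constraints.
xs = jnp.arange(10.0)
f = jnp.stack([xs, xs**2])        # feature functions f_j(x), shape (2, 10)
mu = jnp.array([3.0, 12.0])       # target expectations E[f_j(X)]

def dual(lam):
    # ln sum_x exp( sum_j lam_j * (f_j(x) - mu_j) ) -- convex in lam
    return jax.scipy.special.logsumexp(lam @ (f - mu[:, None]))

dual_grad = jax.grad(dual)
res = minimize(lambda l: float(dual(jnp.asarray(l))),
               x0=np.zeros(2),
               jac=lambda l: np.asarray(dual_grad(jnp.asarray(l))),
               method="BFGS")

# Recover the maxent distribution from the optimal multipliers.
lam = jnp.asarray(res.x)
logits = lam @ f
p = jnp.exp(logits - jax.scipy.special.logsumexp(logits))
print(p)       # the maxent distribution
print(f @ p)   # should reproduce mu at the optimum
```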
A barrier for : suppose X and Y are both bitstrings of length 2k. The first k bits of the two strings are equal (i.e. X[:k] == Y[:k] in python notation); the rest are independent. Otherwise, all bits are maxentropic (i.e. IID 50/50 coinflips).
Then there's an exact (deterministic) natural latent: $\Lambda := X[:k] = Y[:k]$. But I(X; Y), H(X|Y), and H(Y|X) are all much larger than zero; each is k bits.
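A quick brute-force check of those numbers for small k (just my own verification sketch; the construction is exactly the one above):

```python
from itertools import product
from collections import defaultdict
import math

k = 2  # small k so we can enumerate the joint distribution exactly

def H(dist):
    # Shannon entropy in bits of a dict {outcome: probability}.
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

strings = list(product([0, 1], repeat=k))
joint = defaultdict(float)
# X = shared + a, Y = shared + b, with shared, a, b IID uniform k-bit strings.
for shared in strings:
    for a in strings:
        for b in strings:
            joint[(shared + a, shared + b)] += 1 / len(strings) ** 3

px, py = defaultdict(float), defaultdict(float)
for (x, y), p in joint.items():
    px[x] += p
    py[y] += p

Hx, Hy, Hxy = H(px), H(py), H(joint)
print("I(X;Y)  =", Hx + Hy - Hxy)   # = k
print("H(X|Y)  =", Hxy - Hy)        # = k
print("H(Y|X)  =", Hxy - Hx)        # = k
```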
Maybe twice a year I go looking for this comment and can't find it, so I'm copying it into shortform:
Oh, I can just give you a class of nontrivial predictions of expected utility theory. I have not seen any empirical results on whether these actually hold, so consider them advance predictions.
So, a bacterium needs a handful of different metabolic resources - most obviously energy (i.e. ATP), but also amino acids, membrane lipids, etc. And often bacteria can produce some metabolic resources via multiple different paths, including cyclical paths - e.g. it's useful to be able to turn A into B but also B into A, because sometimes the environment will have lots of B and other times it will have lots of A.

Now, there's the obvious prediction that the bacterium won't waste energy turning B into A and then back into B again - i.e. it will suppress one of those two pathways (assuming the cycle is energy-burning), depending on which metabolite is more abundant. Utility generalizes this idea to arbitrarily many reactions and products, and predicts that at any given time we can assign some (non-unique) "values" to each metabolite (including energy carriers), such that any reaction whose reactants have more total "value" than its products is suppressed (or at least not catalyzed; the cell doesn't really have good ways to suppress spontaneous reactions other than putting things in separate compartments).
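To make the prediction concrete, here's a toy version of the test it implies, with entirely made-up stoichiometry: given the set of reactions a cell is observed to actively catalyze, a linear program asks whether any consistent "value" assignment exists.

```python
import numpy as np
from scipy.optimize import linprog

# Columns are metabolites (A, B, ATP); rows are the reactions the cell is
# observed to actively catalyze, as net stoichiometry (products positive,
# reactants negative). "A + ATP -> B" is the row [-1, +1, -1].
active = np.array([
    [-1.0,  1.0, -1.0],   # A + ATP -> B
    [ 2.0, -1.0,  0.0],   # B -> 2A
])

# The prediction: there exist values v (one per metabolite) such that every
# active reaction weakly increases total value, i.e. active @ v >= 0.
# The bound v >= 1 just pins down the (arbitrary) overall scale.
n_metabolites = active.shape[1]
res = linprog(c=np.zeros(n_metabolites),
              A_ub=-active,                 # encodes active @ v >= 0
              b_ub=np.zeros(active.shape[0]),
              bounds=[(1, None)] * n_metabolites,
              method="highs")
print("consistent values exist:", res.success)
print("example values (A, B, ATP):", res.x)
```

An infeasible result would be the interesting case - as discussed below, it would point at the cell doing something useful which the model isn't capturing.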
Of course in practice this will be an approximation, and there may be occasional exceptions where the cell is doing something the model doesn't capture. If we were to do this sort of analysis in a signalling network rather than a metabolic network, for instance, there would likely be many exceptions: cells sometimes burn energy to maintain a concentration at a specific level, or to respond quickly to changes, and this particular model doesn't capture the "value" of information-content in signals; we'd have to extend our value-function in order for the utility framework to capture that. But for metabolic networks, I expect that to mostly not be an issue.
That's really just utility theory; expected utility theory would involve an organism storing some resources over time (like e.g. fat). Then we'd expect to be able to assign "values" such that the relative "value" assigned to stored resources which are not currently used is a weighted sum of the "values" assigned to those resources in different possible future environments (of the sort the organism might find itself in after something like its current environment, in the ancestral world), and the weights in the sums should be consistent. (This is a less-fleshed-out prediction than the other one, but hopefully it's enough of a sketch to give the idea.)
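A short formalization of that prediction, in my own (non-canonical) notation: letting $v_e(r)$ be the "value" assigned to resource $r$ in possible future environment $e$, the claim is that there exist weights $w_e \geq 0$ such that

$$v_{\text{stored}}(r) = \sum_e w_e \, v_e(r) \quad \text{for every stored resource } r.$$

The consistency condition is that the same weights $w_e$ work for every stored resource $r$; that's what makes it a nontrivial prediction rather than a definition.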
Of course, if we understand expected utility theory deeply, then these predictions are quite trivial; they're just saying that organisms won't make Pareto-suboptimal use of their resources! It's one of those predictions where, if it's false, then we've probably discovered something interesting - most likely some place where an organism is spending resources to do something useful which we haven't understood yet. [EDIT-TO-ADD: This is itself intended as a falsifiable prediction - if we go look at an anomaly and don't find any unaccounted-for phenomenon, then that's a very big strike against expected utility theory.] And that's the really cool prediction here: it gives us a tool to uncover unknown-unknowns in our understanding of a cell's behavior.
I think one thing I didn't communicate in the post is that I don't necessarily intend to hypothesize deep nonconsent as a terminal preference. So, for instance,
"women are scared men will get angry if they go from "yes" to "no", in a way they won't if the woman goes from "----" to "no", so women delay being explicit until they have all the information"
sounds to me like one of many possible generators of deep nonconsent preference - i.e. it's directly explaining why women would typically have a deep-in-the-sense-of-appearing-in-lots-of-places preference for nonconsent behavior. It therefore sounds not-at-all at odds with the post, or at least what I had in mind when writing the post.
"Another I played with was e.g. "blame avoidance", i.e. something-like-ladybrain really wants any dating/sex to happen in a way which is "not her fault". That seems to mostly generate the same predictions."
Do you think it has some disadvantage, such that you didn't choose to mention it at all in the OP?
"Blame avoidance" seems like a candidate generator of deep nonconsent preference: if one never consents to anything that's going on, then one is not to blame for any of it (or so goes the story). There are other generators one could imagine as well - e.g. Elizabeth hypothesized elsethread 'women are scared men will get angry if they go from "yes" to "no", in a way they won't if the woman goes from "----" to "no", so women delay being explicit until they have all the information'. That's another hypothesis for what might generate deep nonconsent preference.
I settled on the term "deep nonconsent preference" because that seemed like the most direct description of the behavior-cluster, while assuming the least about what generates that behavior. I did not think (and still don't think) I had enough information to nail down a primary generator of the behavior.
Can you gesture at what kind of data would be helpful to bring in-frame?
So, there's this thing called Solomonoff induction. It works, provably, for anything Turing computable. And human social behavior is definitely Turing computable.
"If a theory claims to compactly generate any significant set of social dynamics, that's evidence against the theory" is an anti-inductive prior. It's like saying that things which have happened less often before are more likely in the future, and therefore the sun will certainly not rise tomorrow.
Look, I don't like dealing with the sort of stuff I called "deep nonconsent" in this post. Sure, I'm quite kinky in bed, but in the rest of the mating process? When someone who's interested won't send any goddamn signals, or sends negative signals while hoping that I pursue, it's just incredibly obnoxious. I strongly prefer to deal with women who actually send signals when interested, or better yet just ask me out. I want to date and fuck women who are, like, "on my team", not trying to make everything pointlessly difficult all the time.
And maybe that will change at some point. It's the sort of thing which sometimes seems less obnoxious as one understands it better. But man, for now, I sure prefer to just avoid women who do that shit.
Like, okay, let's put it this way - if it were to turn out to have been true the entire time, what other generator could produce this evidence, but would also produce evidence incompatible with this model? Or, in what way could "nonconsent" be missing the point about the generator? I'd sure like to see a slightly more ladybrain discussion, if that's available.
I totally agree that there are other possible generators which look very similar to "deep nonconsent preference". Another I played with was e.g. "blame avoidance", i.e. something-like-ladybrain really wants any dating/sex to happen in a way which is "not her fault". That seems to mostly generate the same predictions.
So yeah, I am totally ready to believe there's some other nearby generator, and if you have one which also better explains some additional things, then please state it - I want to know it. I have not found it on my own, and one of the main points of posting this stuff online is that sometimes people come along and tell me what I'm missing. That's what I want. If you have clean examples where the model in the post would produce incorrect interpretations of what's going on, I also want to hear those. What I don't want is people being like "this is problematic and missing important things" without actually saying a single thing that it's wrong about or presenting any alternative model.
Tutoring.
This answer was generated by considering what one can usefully hire other people to do full time, for oneself (or one's family) alone.