The first qualitatively new phenomenon in Information Theory when you move from two variables to three is the presence of negative values: information measures (entropy, conditional entropy, mutual information) are always nonnegative for two variables, but the triple mutual information $I(X;Y;Z)$ can be negative.
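A concrete instance (the standard example, not from the original post): take $X, Y$ independent fair coins and $Z = X \oplus Y$; then $I(X;Y) = 0$ but $I(X;Y\mid Z) = 1$ bit, so $I(X;Y;Z) = I(X;Y) - I(X;Y\mid Z) = -1$ bit. A minimal check:

```python
# A minimal check (standard XOR example): I(X;Y;Z) = -1 bit when
# Z = X xor Y for independent fair coins X, Y.
import itertools
import math

# Joint distribution over (x, y, z): uniform over the 4 points with z = x ^ y.
p = {(x, y, x ^ y): 0.25 for x, y in itertools.product([0, 1], repeat=2)}

def H(vars_idx):
    """Joint entropy (in bits) of the coordinates listed in vars_idx."""
    marg = {}
    for point, prob in p.items():
        key = tuple(point[i] for i in vars_idx)
        marg[key] = marg.get(key, 0.0) + prob
    return -sum(q * math.log2(q) for q in marg.values() if q > 0)

# Inclusion-exclusion form:
# I(X;Y;Z) = H(X)+H(Y)+H(Z) - H(XY)-H(XZ)-H(YZ) + H(XYZ)
i3 = (H([0]) + H([1]) + H([2])
      - H([0, 1]) - H([0, 2]) - H([1, 2])
      + H([0, 1, 2]))
print(i3)  # -1.0
```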
This so far is a relatively well-known fact. But what is the first new qualitative thing when moving from three to four variables? Non-Shannon-type Inequalities.
A fundamental result in Information Theory is that...
Relevant: Alignment as a Bottleneck to Usefulness of GPT-3
between alignment and capabilities, which is the main bottleneck to getting value out of GPT-like models, both in the short term and the long(er) term?
By the way, Gemini 2.5 Pro and o3-mini-high are good at tic-tac-toe. I was surprised because the last time I tested this on o1-preview, it did quite terribly.
Where in the literature can I find the proof of the lower bound?
Previous discussion, comment by A.H.:
...Sorry to be a party pooper, but I find the story of Jason Padgett (the guy who 'banged his head and became a math genius') completely unconvincing. From the video that you cite, here is the 'evidence' that he is a 'math genius':
- He tells us, with no context, 'the inner boundary of pi is f(x)=x sin(pi/x)'. Ok!
- He makes 'math inspired' drawings (some of which admittedly are pretty cool but they're not exactly original) and sells them on his website
- He claims that a physicist (who is not named or interviewed) saw him drawing i
I wrote "your brain can wind up settling on either of [the two generative models]", not both at once.
Ah, that makes sense. So the picture I should have is: whatever local algorithm the brain runs oscillates over time between multiple local MAP solutions that correspond to qualitatively different high-level information (e.g., clockwise vs counterclockwise). Concretely, something like the metastable states of a Hopfield network, or the update steps of predictive coding (literally gradient updates to find the MAP solution for perception!!) oscillating between multiple local minima?
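If that's the right picture, here's a toy sketch (my own construction, not from the thread): a Hopfield net with two stored patterns has two attractors, and an ambiguous input settles into one or the other depending on noise and initialization; a crude analogue of bistable perception under local MAP search.

```python
# Toy sketch: a Hopfield network with two stored patterns ("percepts") settles
# into one of two attractors depending on the noisy initial condition.
import numpy as np

rng = np.random.default_rng(0)

patterns = np.array([
    [1, 1, 1, 1, -1, -1, -1, -1],   # "percept A"
    [1, -1, 1, -1, 1, -1, 1, -1],   # "percept B"
])
# Hebbian weights; zero the self-connections.
W = patterns.T @ patterns
np.fill_diagonal(W, 0)

def settle(state, steps=100):
    """Asynchronous updates (each is energy non-increasing) toward a fixed point."""
    state = state.copy()
    for _ in range(steps):
        i = rng.integers(len(state))
        state[i] = 1 if W[i] @ state >= 0 else -1
    return state

# Ambiguous observation: a noisy mixture of the two percepts.
noisy = np.sign(patterns[0] + patterns[1] + rng.normal(0, 1, 8)).astype(int)
final = settle(noisy)
print("settled on A" if (final == patterns[0]).all() else
      "settled on B" if (final == patterns[1]).all() else "spurious state")
```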
Curious about the claim regarding bistable perception as the brain "settling" differently on two distinct but roughly equally plausible generative model parameters behind an observation. In standard statistical terms, should I think of it as: two parameters having similarly high Bayesian posterior probability, but the brain not explicitly representing this posterior, instead using something like local hill climbing to find a local MAP solution—bistable perception corresponding to the two different solutions this process converges to?
If correct, to what ext...
I just read your koan and wow it's a great post, thank you for writing it. It also gave me some new insights as to how to think about my confusions and some answers. Here's my chain of thought:
I like the definition: it's the minimum expected code length for a distribution under constraints on the code (namely, constraints on the kind of beliefs you're allowed to have; after fixing that belief, the optimal code is, as always, the negative log prob).
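In symbols, my reading of that definition, with $\mathcal{C}$ the set of allowed beliefs $q$:

$$H_{\mathcal{C}}(p) \;=\; \min_{q \in \mathcal{C}} \; \mathbb{E}_{x \sim p}\,[-\log q(x)],$$

which recovers the ordinary entropy $H(p)$ when $\mathcal{C}$ is unconstrained, since the cross-entropy is minimized at $q = p$.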
Also, the examples in Proposition 1 were pretty cool in that they gave new characterizations of some well-known quantities - the log determinant of the covariance matrix does indeed intuitively measure the uncertainty of a random variable, but it is very cool to see that it in fact has an entropy interpretation!
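For reference, the standard connection here (which I'd guess is the one behind that example): for $X \sim \mathcal{N}(\mu, \Sigma)$ in $\mathbb{R}^n$, the differential entropy is

$$h(X) \;=\; \tfrac{1}{2}\log\!\big((2\pi e)^n \det \Sigma\big),$$

so $\log\det\Sigma$ is, up to scaling and an additive constant, exactly a differential entropy.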
It's kinda sad because after a brief search it seems like none of the original authors are interested in extending this framework.
That makes sense. I've updated towards thinking this is reasonable (albeit binning and discretization are still ad hoc) and captures something real.
We could formalize it like $Y = f(X) + \varepsilon$, where $f$ is the deterministic map, with $\varepsilon$ being some independent noise parameterized by $\sigma$. Then $I(X;Y)$ would become finite. We could think of binning the output of a layer to make it stochastic in a similar way.
Ideally we'd like the new measure to be finite even for deterministic maps (this is the case for the noisy $I(X;Y)$ above) and some strict d...
Ah you're right. I was thinking about the deterministic case.
Your explanation of the jacobian term accounting for features "squeezing together" makes me update towards thinking maybe the quantizing done to turn neural networks from continuous & deterministic to discrete & stochastic, while ad hoc, isn't as unreasonable as I originally thought it was. This paper is where I got the idea that discretization is bad because it "conflates 'information theoretic stuff' with 'geometric stuff', like clustering" - but perhaps this is in fact capturing something real.
Thank you, first three examples make sense and seem like an appropriate use of mutual information. I'd like to ask about the fourth example though, where you take the weights as unknown:
epistemic status: unoriginal. trying to spread a useful framing of theoretical progress introduced in an old post.
Tl;dr, often the greatest theoretical challenge comes from the step of crossing the chasm from [developing an impractical solution to a problem] to [developing some sort of polytime solution], because the natures of the two solutions can be opposites.
Summarizing Diffractor's post on Program Search and Incomplete Understanding:
Solving a foundational problem to its implementation often takes the ...
I'd vote for removing the stage "developing some sort of polytime solution" and just calling 4 "developing a practical solution". I think listing that extra step is coming from the perspective of someone who's more heavily involved in complexity classes. We're usually interested in polynomial time algorithms because they're usually practical, but there are lots of contexts where practicality doesn't require a polynomial time algorithm, or really, where we're just not working in a context where it's natural to think in terms of algorithms with run-times.
I agree with this framing. The issue of characterizing in what way Our World is Special is the core theoretical question of learning theory.
Framing it as a single bottleneck (the step from 3 to 4) maybe understates how large the space of questions here is. E.g., it encompasses virtually every field of theoretical computer science, and the physics & mathematics relevant to computation outside of AIT and numerical math.
Thank you! I tried it on this post, and while the post itself is pretty short, the raw content that I get seems to be extremely long (larger than the o1 context window, for example), with a bunch of font-related information in between. Is there a way to fix this?
The critical insight is that this is not always the case!
Let's call two graphs I-equivalent if their sets of independencies (implied by d-separation) are identical. A theorem of Bayes Nets says that two graphs are I-equivalent if and only if they have the same skeleton and the same set of immoralities.
This last constraint, plus the constraint that the graph must be acyclic, allows some arrow directions to be identified - namely, across all I-equivalent graphs that are perfect maps of a distribution, some of the edges have identical directions assigned to them.
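As a concrete check of that criterion, a minimal sketch (my own, using networkx; the function names are mine):

```python
# Verma-Pearl criterion: two DAGs are I-equivalent iff they have the same
# skeleton and the same immoralities (v-structures with non-adjacent parents).
import networkx as nx

def skeleton(g: nx.DiGraph) -> set:
    return {frozenset(e) for e in g.edges}

def immoralities(g: nx.DiGraph) -> set:
    """V-structures a -> c <- b with a, b non-adjacent."""
    out = set()
    for c in g.nodes:
        ps = sorted(g.predecessors(c))
        for i, a in enumerate(ps):
            for b in ps[i + 1:]:
                if not (g.has_edge(a, b) or g.has_edge(b, a)):
                    out.add((frozenset((a, b)), c))
    return out

def i_equivalent(g1: nx.DiGraph, g2: nx.DiGraph) -> bool:
    return skeleton(g1) == skeleton(g2) and immoralities(g1) == immoralities(g2)

chain    = nx.DiGraph([("A", "B"), ("B", "C")])  # A -> B -> C
fork     = nx.DiGraph([("B", "A"), ("B", "C")])  # A <- B -> C
collider = nx.DiGraph([("A", "B"), ("C", "B")])  # A -> B <- C (an immorality)
print(i_equivalent(chain, fork))      # True: both imply exactly A indep C given B
print(i_equivalent(chain, collider))  # False: the collider's directions are identified
```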
The IC ...
Epistemic status: metaphysics
I was reading Factored Space Models (previously, Finite Factored Sets) and was trying to understand in what sense it was a Theory of Time.
Scott Garrabrant says "[The Pearlian Theory of Time] ... is the best thing to happen to our understanding of time since Einstein". I read Pearl's book on Causality[1], and while there's math, this metaphysical connection that Scott seems to make isn't really explicated. Timeless Causality and Timeless Physics is the only place I saw this vie...
The grinding inevitability is not a pressure on you from the outside, but a pressure from you, towards the world. This type of determination is the feeling of being an agent with desires and preferences. You are the unstoppable force, moving towards the things you care about, not because you have to but simply because that’s what it means to care.
I think this is probably one of my favorite quotes of all time. I translated it to Korean (with somewhat major stylistic changes) with the help of ChatGPT:
...의지(意志)라 함은,
하나의 인간으로서,
멈출 수 없는 힘으로
(Roughly, back in English: "Will is: as a human being, with an unstoppable force...")
https://www.lesswrong.com/posts/KcvJXhKqx4itFNWty/k-complexity-is-silly-use-cross-entropy-instead
The K-complexity of a function is the length of its shortest code. But having many many codes is another way to be simple! Example: gauge symmetries in physics. Correcting for length-weighted code frequency, we get an empirically better simplicity measure: cross-entropy.
[this] is a well-known notion in algorithmic information theory, and differs from K-complexity by at most a constant
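In symbols (my transcription of the post's idea, with $U$ a prefix universal machine):

$$K(f) \;=\; \min_{p\,:\,U(p)=f} |p|, \qquad \mathrm{CE}(f) \;=\; -\log_2 \sum_{p\,:\,U(p)=f} 2^{-|p|},$$

so $\mathrm{CE}(f) \le K(f)$ always (the sum includes the shortest code), and the coding theorem gives $K(f) \le \mathrm{CE}(f) + O(1)$.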
Epistemic status: literal shower thoughts, perhaps obvious in retrospect, but was a small insight to me.
I’ve been thinking about: “what proof strategies could prove structural selection theorems, and not just behavioral selection theorems?”
Typical examples of selection theorems in my mind are: coherence theorems, good regulator theorem, causal good regulator theorem.
"I always remember, [Hamming] would come into my office and try to solve a problem [...] I had a very big blackboard, and he’d start on one side, write down some integral, say, ‘I ain’t afraid of nothin’, and start working on it. So, now, when I start a big problem, I say, ‘I ain’t afraid of nothin’, and dive into it."
The question is whether this expression is easy to compute or not, and fortunately the answer is that it's quite easy! We can evaluate the first term by the simple Monte Carlo method of drawing many independent samples and evaluating the empirical average, as we know the distribution explicitly and it was presumably chosen to be easy to draw samples from.
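For concreteness, a minimal sketch of that kind of estimate (a toy example of mine, not the post's actual objective):

```python
# Simple Monte Carlo: estimate E_{z ~ q}[f(z)] by an empirical average, which
# works precisely because q is explicit and easy to draw samples from.
import numpy as np

rng = np.random.default_rng(0)

def mc_expectation(sample_q, f, n=100_000):
    """Empirical average of f over i.i.d. samples from q."""
    return f(sample_q(n)).mean()

# Toy case: E[z^2] for z ~ N(0, 1) is exactly 1.
print(mc_expectation(lambda n: rng.normal(size=n), lambda z: z ** 2))  # ~ 1.0
```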
My question when reading this was: why can't we say the same thing about the second term? i.e., draw many independent samples and evaluate the empirical average...
Tl;dr, Systems are abstractable to the extent they admit an abstracting causal model map with low approximation error. This should yield a pareto frontier of high-level causal models consisting of different tradeoffs between complexity and approximation error. Then try to prove a selection theorem for abstractability / modularity by relating the form of this curve and a proposed selection criteria.
Recall, an abstracting causal model (ACM)—exact transformations, $\tau$-abstractions, and approximations—is a m...
I don't know if this is just me, but it took me an embarrassingly long time in my mathematical education to realize that the following three terminologies, which introductory textbooks used interchangeably without being explicit, mean the same thing. (Maybe this is just because English is my second language?)
X ⇒ Y means X is sufficient for Y means X only if Y
X ⇐ Y means X is necessary for Y means X if Y
I'd also love to have access!
(the causal incentives paper convinced me to read it, thank you! good book so far)
if you read Sutton & Barto, it might be clearer to you how narrow are the circumstances under which 'reward is not the optimization target', and why they are not applicable to most AI things right now or in the foreseeable future
Can you explain this part a bit more?
My understanding is that the situations in which 'reward is not the optimization target' are those where the assumptions of the policy improvement theorem don't hold. In particular, the theorem (that iterating policy improvemen...
I think something in the style of abstracting causal models would make this work - defining a high-level causal model such that there is a map from the states of the low-level causal model to it, in a way that's consistent with mapping low-level interventions to high-level interventions. Then you can retain the notion of causality to non-low-level-physical variables with that variable being a (potentially complicated) function of potentially all of the low-level variables.
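Roughly, in symbols (my paraphrase of the abstracting-causal-models consistency condition, with $\tau$ mapping low-level states to high-level states and $\omega$ mapping low-level interventions to high-level ones):

$$\tau_{*}\, P\big(M_L^{\,\mathrm{do}(i)}\big) \;=\; P\big(M_H^{\,\mathrm{do}(\omega(i))}\big) \qquad \text{for all } i \in \mathcal{I}_L,$$

i.e., pushing the intervened low-level distribution forward through $\tau$ agrees with intervening directly in the high-level model.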
tl;dr, the unidimensional continuity of preference assumption in the money pumping argument used to justify the VNM axioms corresponds to the assumption that there exists some unidimensional "resource" that the agent cares about, and this language is provided by the notion of "souring / sweetening" a lottery.
Various coherence theorems - or more specifically, various money pumping arguments - generally have the following form:
...If you violate this principle, then [you are rationally re
Yeah, I'd like to know if there's a unified way of thinking about information-theoretic quantities and causal quantities, though a quick literature search doesn't turn up anything interesting. My guess is that we'd want separate boundary metrics for informational separation and causal separation.
I no longer think the setup above is viable, for reasons that connect to why I think Critch's operationalization is incomplete and why boundaries should ultimately be grounded in Pearlian Causality and interventions.
(Note: I am thinking as I'm writing, so this might be a bit rambly.)
Intuition: Why does a robust glider in Lenia intuitively feel like a system possessing a boundary? Well, I imagine various situations that happen in the world (like bullets), and this pattern mostly stays stable in the face of them.
Now, n...
I think it's plausible that the general concept of boundaries can be characterized somewhat independently of preferences, while at the same time boundary-preservation is a quality that agents mostly satisfy (discussion here; very unsure about this). I see Critch's definition as a first iteration of an operationalization for boundaries in the general, somewhat-preference-independent sense.
But I do agree that ultimately all of this should tie back to game theory. I find Discovering Agents most promising in this regard, though there are still a l...
EDIT: I no longer think this setup is viable, for reasons that connect to why I think Critch's operationalization is incomplete and why boundaries should ultimately be grounded in Pearlian Causality and interventions. Check update.
I believe there's nothing much in the way of actually implementing an approximation of Critch's boundaries[1] using deep learning.
Recall, Critch's boundaries are:
Damn, why did Pearl recommend (in the preface of his causality book) that readers read all the chapters other than chapter 2 (and the last review chapter)? Chapter 2 is literally the coolest part - inferring causal structure from purely observational data! I almost skipped that chapter because of it...
Here's my current take, I wrote it as a separate shortform because it got too long. Thanks for prompting me to think about this :)
I find the intersection of computational mechanics, boundaries/frames/factored-sets, and some works from the causal incentives group - especially discovering agents and robust agents learn causal world model (review) - to be a very interesting theoretical direction.
By boundaries, I mean a sustaining/propagating system that informationally/causally insulates its 'viscera' from the 'environment,' and only allows relatively small amounts of deliberate information flow through certain channels in both directions. Living systems are an example of it (from bacte...
Discovering agents provides a genuine causal, interventionist account of agency and an algorithm to detect agents, motivated by the intentional stance. I find this paper very enlightening from a conceptual perspective!
I've tried to think of problems that needed to be solved before we can actually implement this on real systems - both conceptual and practical - on approximate order of importance.
Thanks, it seems like the link got updated. Fixed!
Quick paper review of Measuring Goal-Directedness from the causal incentives group.
tl;dr, the goal-directedness of a policy wrt a utility function is measured by its min distance to one of the policies implied by the utility function, as per the intentional stance - that one should model a system as an agent insofar as doing so is useful.
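A toy sketch of the flavor of that measure (my own construction; the paper's actual definition is built differently, so treat the Boltzmann family below as a stand-in for "the policies implied by the utility function"):

```python
# Score a policy's goal-directedness w.r.t. a utility function U as its minimum
# KL divergence to a family of Boltzmann-rational policies for U. All names and
# parameter choices here are hypothetical, for illustration only.
import numpy as np

def boltzmann_policy(utilities, beta):
    """Softmax policy over actions with rationality parameter beta."""
    z = beta * utilities
    z -= z.max()
    p = np.exp(z)
    return p / p.sum()

def goal_directedness(policy, utilities, betas=np.linspace(0.0, 10.0, 101)):
    """Negative min KL divergence from policy to any Boltzmann policy for U."""
    kls = [np.sum(policy * np.log(policy / boltzmann_policy(utilities, b)))
           for b in betas]
    return -min(kls)

pi = np.array([0.7, 0.2, 0.1])
U = np.array([1.0, 0.5, 0.0])
print(goal_directedness(pi, U))  # closer to 0 => more goal-directed w.r.t. U
```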
Thank you, that is very clarifying!
I've been doing a deep dive on this post, and while the main theorems make sense I find myself quite confused about some basic concepts. I would really appreciate some help here!
Does anyone know if Shannon arrived at entropy from the axiomatic definition first, or the operational definition first?
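(For reference, the two routes as I understand them: the axiomatic one takes desiderata - continuity in the $p_i$, monotonicity in $n$ for uniform distributions, and the grouping rule below - which together force $H = -K\sum_i p_i \log p_i$; the operational one is the source coding theorem, where $H$ is the optimal achievable compression rate.)

$$H(p_1,\ldots,p_n) = H(p_1+p_2,\, p_3,\ldots,p_n) + (p_1+p_2)\, H\!\left(\tfrac{p_1}{p_1+p_2},\, \tfrac{p_2}{p_1+p_2}\right)$$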
I've been thinking about these two distinct ways in which we seem to arrive at new mathematical concepts. Looking at the countless partial information decomposition measures in the literature, all derived/motivated on an axiomatic basis, and not knowing which intuition to prioritize over which, I've been putting less of a premium on axiomatic conceptual definitions than I used to:
Just finished the local causal states paper, it's pretty cool! A couple of thoughts though:
I don't think the causal states factorize over the dynamical bayes net, unlike the original random variables (by assumption). Shalizi doesn't claim this either.
Also I don't fo...
a Markov blanket represents a probabilistic fact about the model without any knowledge you possess about values of specific variables, so it doesn't matter if you actually do know which way the agent chooses to go.
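(For reference, the standard distribution-level notion being described: $B$ is a Markov blanket of $X$ within variables $V$ iff $X \perp\!\!\!\perp V \setminus (B \cup \{X\}) \mid B$ - a property of the joint distribution alone, stated without reference to realized values.)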
The usual definition of Markov blankets is in terms of the model without any knowledge of the specific values as you say, but I think in Critch's formalism this isn't the case. Specifically, he defines the 'Markov Boundary' of (being the non-abstracted physics-ish model) as a function of the random variable (where he wri...
I thought if one could solve one NP-complete problem then one could solve all of them. But you say that treewidth doesn't help at all with the Clique problem. Is the parameterized complexity filtration by treewidth not preserved by the equivalence between different NP-complete problems somehow?
All NP-complete problems should have parameters that make the problem polynomial when bounded - trivially so, via translating the problem => 3-SAT => Bayes Net and using the treewidth bound.
This isn't the case for the clique problem (finding max clique) because it's not NP...
You mention treewidth - are there other quantities of similar importance?
I'm not familiar with any, though ChatGPT does give me some examples! copy-pasted below:
...
- Solution Size (k): The size of the solution or subset that we are trying to find. For example, in the k-Vertex Cover problem, k is the maximum size of the vertex cover. If k is small, the problem can be solved more efficiently.
- Treewidth (tw): A measure of how "tree-like" a graph is. Many hard graph problems become tractable when restricted to graphs of bounded treewidth. Algorithms that leverage tr
I like to think of treewidth in terms of its characterization from tree decomposition, a task where you find a clique tree (or junction tree) of an undirected graph.
A clique tree for an undirected graph is a tree such that:
- each node of the tree is a clique of the graph, and every maximal clique of the graph appears as a node, and
- (running intersection) for every vertex of the graph, the tree nodes whose cliques contain that vertex form a connected subtree.
You can check that these properties hold in the example below. I will also refer to nodes of a clique tree...
Bayes Net inference algorithms maintain their efficiency by using dynamic programming at multiple levels (toy sketch after the list):
Level 0: Naive Marginalization
Level 1: Variable Elimination
Level 2: Clique-tree...
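To make Levels 0 and 1 concrete, a toy sketch (my own example; the distributions are made up):

```python
# Contrast Level 0 and Level 1 on a chain A -> B -> C: naive marginalization
# sums over the full joint, while variable elimination pushes sums inward and
# reuses partial results.
import numpy as np

pA = np.array([0.6, 0.4])                      # P(A)
pB_A = np.array([[0.9, 0.1], [0.2, 0.8]])      # P(B | A), rows indexed by A
pC_B = np.array([[0.7, 0.3], [0.5, 0.5]])      # P(C | B), rows indexed by B

# Level 0: materialize the full joint P(A, B, C), then sum out A and B.
# Exponential in the number of variables in general.
joint = pA[:, None, None] * pB_A[:, :, None] * pC_B[None, :, :]
pC_naive = joint.sum(axis=(0, 1))

# Level 1: variable elimination. Sum out A first, then B; never build the joint.
pB = pA @ pB_A          # eliminate A: P(B)
pC = pB @ pC_B          # eliminate B: P(C)
print(np.allclose(pC, pC_naive))  # True
```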
Perhaps I should one day in the far far future write a sequence on Bayes Nets.
Some low-effort TOC (this is basically mostly Koller & Friedman):
Speaking from the perspective of someone still developing basic mathematical maturity and often lacking prerequisites, it's very useful as a learning aid. For example, it significantly expanded the range of papers or technical results accessible to me. If I'm reading a paper containing unfamiliar math, I no longer have to go down the rabbit hole of tracing prerequisite dependencies, which often expand exponentially (partly because I don't know which results or sections in the prerequisite texts are essential, making it difficult to scope my focus). Now I c...