Tomorrow can be brighter than today
Although the night is cold
The stars may seem so very far awayBut courage, hope and reason burn
In every mind, each lesson learned
Shining light to guide our wayMake tomorrow brighter than today
Epistemic status: metaphysics
I was reading Factored Space Models (previously, Finite Factored Sets) and was trying to understand in what sense it was a Theory of Time.
Scott Garrabrant says "[The Pearlian Theory of Time] ... is the best thing to happen to our understanding of time since Einstein". I read Pearl's book on Causality[1], and while there's math, this metaphysical connection that Scott seems to make isn't really explicated. Timeless Causality and Timeless Physics is the only place I saw this view explained explicitly, but not at the level of math / language used in Pearl's book.
Here is my attempt at explicitly writing down what all of these views are pointing at (in a more rigorous language)—the core of the Pearlian Theory of Time, and in what sense FSM shares the same structure.
Causality leave a shadow of conditional independence relationships over the observational distribution. Here's an explanation providing the core intuition:
The metaphysical claim is that, this direction is the definition of time[5], morally so, based on the intuition provided by the example above.
So, the Pearlian Theory of Time is the claim that Time is the partial order over the variables of a Bayes Net corresponding to the perfect map of a distribution.
Abstracting away, the structure of any Theory of Time is then to:
This is exactly what Factored Space Model does:
... thus justifying calling FSM a Theory of Time in the same spirit that Pearlian Causal Discovery is a Theory of Time.
Chapter 2, specifically, which is about Causal Discovery. All the other chapters are mostly irrelevant for this purpose.
By (1) making a graph with edge direction corresponding to equation assignment direction, (2) pushforwarding uncertainties to endogenous variables, and (3) letting interventional distributions be defined by the truncated factorization formula.
By (1) forgetting the causal semantics, i.e. no longer associating the graph with all the interventional distributions, and only the no intervention observational distribution.
This shortform answers this question I had.
Pearl comes very close. In his Temporal Bias Conjecture (2.8.2):
"In most natural phenomenon, the physical time coincides with at least one statistical time."
(where statistical time refers to the aforementioned direction.)
But doesn't go as far as this ought to be the definition of Time.
The grinding inevitability is not a pressure on you from the outside, but a pressure from you, towards the world. This type of determination is the feeling of being an agent with desires and preferences. You are the unstoppable force, moving towards the things you care about, not because you have to but simply because that’s what it means to care.
I think this is probably one of my favorite quotes of all time. I translated it to Korean (with somewhat major stylistic changes) with the help of ChatGPT:
의지(意志)라 함은,
하나의 인간으로서,
멈출 수 없는 힘으로 자신이 중요히 여기는 것들을 향해 나아가는 것.이를 따르는 갈아붙이는 듯한 필연성은,
외부에서 자신을 압박하는 힘이 아닌,
스스로가 세상을 향해 내보내는 압력임을.해야 해서가 아니라,
단지 그것이 무언가를 소중히 여긴다는 뜻이기 때문에.
https://www.lesswrong.com/posts/KcvJXhKqx4itFNWty/k-complexity-is-silly-use-cross-entropy-instead
The K-complexity of a function is the length of its shortest code. But having many many codes is another way to be simple! Example: gauge symmetries in physics. Correcting for length-weighted code frequency, we get an empirically better simplicity measure: cross-entropy.
[this] is a well-known notion in algorithmic information theory, and differs from K-complexity by at most a constant
Epistemic status: literal shower thoughts, perhaps obvious in retrospect, but was a small insight to me.
I’ve been thinking about: “what proof strategies could prove structural selection theorems, and not just behavioral selection theorems?”
Typical examples of selection theorems in my mind are: coherence theorems, good regulator theorem, causal good regulator theorem.
And I got stuck here wondering, man, how do I ever prove anything structural.
Then I considered some theorems that, if you squint really really hard, could also be framed in the selection theorem language in a very broad sense:
This made me realize that to prove selection theorems on structural properties of agents, you should obviously give more mathematical structure to the “agent” in the first place:
And recall, we actually have an agent foundations style selection theorem that does prove something structural about agent internals by giving more mathematical structure to the agent:
So in short, we need more initial structure or even assumptions on our “agent,” at least more so than literally a single node in a Bayes Net, to expect to be able to prove something structural.
Here is my 5-minute attempt to put more such "structure" to the [agent/decision node] in the Causal good regulator theorem with the hopes that this would make the theorem more structural, and perhaps end up as a formalization of the Agent-like Structure Problem (for World Models, at least), or very similarly the Approximate Causal Mirror hypothesis:
I'm being very very [imprecise/almost misleading] here—because I'm just trying to make a high-level point and the details don't matter too much—one of the caveats (among many) being that this statement makes the theoretically yet unjustified connection between SGD and Bayes.
"I always remember, [Hamming] would come into my office and try to solve a problem [...] I had a very big blackboard, and he’d start on one side, write down some integral, say, ‘I ain’t afraid of nothin’, and start working on it. So, now, when I start a big problem, I say, ‘I ain’t afraid of nothin’, and dive into it."
The question is whether this expression is easy to compute or not, and fortunately the answer is that it's quite easy! We can evaluate the first term by the simple Monte Carlo method of drawing many independent samples and evaluating the empirical average, as we know the distribution explicitly and it was presumably chosen to be easy to draw samples from.
My question when reading this was: why can't we say the same thing about ? i.e. draw many independent samples and evaluate the empirical average? Usually is also assumed known and simple to sample from (e.g., gaussian).
So far, my answer is:
Tl;dr, Systems are abstractable to the extent they admit an abstracting causal model map with low approximation error. This should yield a pareto frontier of high-level causal models consisting of different tradeoffs between complexity and approximation error. Then try to prove a selection theorem for abstractability / modularity by relating the form of this curve and a proposed selection criteria.
Recall, an abstracting causal model (ACM)—exact transformations, -abstractions, and approximations—is a map between two structural causal models satisfying certain requirements that lets us reasonably say one is an abstraction, or a high-level causal model of another.
Now consider a curve: x-axis is the node count, and y-axis is the minimum approximation error of ACMs of the original system with that node count (subject to some conditions[1]). It would hopefully an decreasing one[2].
Then, try hard to prove a selection theorem of the following form: given low-level causal model satisfying certain criteria (eg low regret over varying objectives, connection costs), the abstractability curve gets pushed further downwards. Or conversely, find conditions that make this true.
I don't know how to prove this[3], but at least this gets closer to a well-defined mathematical problem.
I've been thinking about this for an hour now and finding the right definition here seems a bit non-trivial. Obviously there's going to be an ACM of zero approximation error for any node count, just have a single node that is the joint of all the low-level nodes. Then the support would be massive, so a constraint on it may be appropriate.
Or instead we could fold it in to the x-axis—if there is perhaps a non ad-hoc, natural complexity measure for Bayes Nets that capture [high node counts => high complexity because each nodes represent stable causal mechanisms of the system, aka modules] and [high support size => high complexity because we don't want modules that are "contrived" in some sense] as special cases, then we could use this as the x-axis instead of just node count.
Immediate answer: Restrict this whole setup into a prediction setting so that we can do model selection. Require on top of causal consistency that both the low-level and high-level causal model have a single node whose predictive distribution are similar. Now we can talk about eg the RLCT of a Bayes Net. I don't know if this makes sense. Need to think more.
Or rather, find the appropriate setup to make this a decreasing curve.
I suspect closely studying the robust agents learn causal world models paper would be fruitful, since they also prove a selection theorem over causal models. Their strategy is to (1) develop an algorithm that queries an agent with low regret to construct a causal model, (2) prove that this yields an approximately correct causal model of the data generating model, (3) then arguing that this implies the agent must internally represent something isomorphic to a causal world model.
I don't know if this is just me, but it took me an embarrassingly long time in my mathematical education to realize that the following three terminologies, which introductory textbooks used interchangeably without being explicit, mean the same thing. (Maybe this is just because English is my second language?)
X => Y means X is sufficient for Y means X only if Y
X <= Y means X is necessary for Y means X if Y
The critical insight is that this is not always the case!
Let's call two graphs I-equivalent if their set of independencies (implied by d-separation) are identical. A theorem of Bayes Nets say that two graphs are I-equivalent if they have the same skeleton and the same set of immoralities.
This last constraint, plus the constraint that the graph must be acyclic, allows some arrow directions to be identified - namely, across all I-equivalent graphs that are the perfect map of a distribution, some of the edges have identical directions assigned to them.
The IC algorithm (Verma & Pearl, 1990) for finding perfect maps (hence temporal direction) is exactly about exploiting these conditions to orient as many of the edges as possible:
More intuitively, (Verma & Pearl, 1992) and (Meek, 1995) together shows that the following four rules are necessary and sufficient operations to maximally orient the graph according to the I-equivalence (+ acyclicity) constraint:
Anyone interested in further detail should consult Pearl's Causality Ch 2. Note that for some reason Ch 2 is the only chapter in the book where Pearl talks about Causal Discovery (i.e. inferring time from observational distribution) and the rest of the book is all about Causal Inference (i.e. inferring causal effect from (partially) known causal structure).