The first qualitatively new phenomenon in Information Theory when you move from two variables to three is the presence of negative values: information measures (entropy, conditional entropy, mutual information) are always nonnegative for two variables, but the triple mutual information $I(X;Y;Z)$ can be negative.
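A concrete instance (the standard example, not from the original post): take $X, Y$ independent fair coins and $Z = X \oplus Y$; then $I(X;Y) = 0$ but $I(X;Y\mid Z) = 1$ bit, so $I(X;Y;Z) = I(X;Y) - I(X;Y\mid Z) = -1$ bit. A minimal check:

```python
# A minimal check (standard XOR example): I(X;Y;Z) = -1 bit when
# Z = X xor Y for independent fair coins X, Y.
import itertools
import math

# Joint distribution over (x, y, z): uniform over the 4 points with z = x ^ y.
p = {(x, y, x ^ y): 0.25 for x, y in itertools.product([0, 1], repeat=2)}

def H(vars_idx):
    """Joint entropy (in bits) of the coordinates listed in vars_idx."""
    marg = {}
    for point, prob in p.items():
        key = tuple(point[i] for i in vars_idx)
        marg[key] = marg.get(key, 0.0) + prob
    return -sum(q * math.log2(q) for q in marg.values() if q > 0)

# Inclusion-exclusion form:
# I(X;Y;Z) = H(X)+H(Y)+H(Z) - H(XY)-H(XZ)-H(YZ) + H(XYZ)
i3 = (H([0]) + H([1]) + H([2])
      - H([0, 1]) - H([0, 2]) - H([1, 2])
      + H([0, 1, 2]))
print(i3)  # -1.0
```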
This so far is a relatively well-known fact. But what is the first new qualitative thing when moving from three to four variables? Non-Shannon-type Inequalities.
A fundamental result in Information Theory is that...
Relevant: Alignment as a Bottleneck to Usefulness of GPT-3
between alignment and capabilities, which is the main bottleneck to getting value out of GPT-like models, both in the short term and the long(er) term?
By the way, Gemini 2.5 Pro and o3-mini-high are good at tic-tac-toe. I was surprised because the last time I tested this on o1-preview, it did quite terribly.
Where in the literature can I find the proof of the lower bound?
Previous discussion, comment by A.H.:
...Sorry to be a party pooper, but I find the story of Jason Padgett (the guy who 'banged his head and became a math genius') completely unconvincing. From the video that you cite, here is the 'evidence' that he is a 'math genius':
- He tells us, with no context, 'the inner boundary of pi is f(x)=x sin(pi/x)'. Ok!
- He makes 'math inspired' drawings (some of which admittedly are pretty cool but they're not exactly original) and sells them on his website
- He claims that a physicist (who is not named or interviewed) saw him drawing i
I wrote "your brain can wind up settling on either of [the two generative models]", not both at once.
Ah, that makes sense. So the picture I should have is: whatever local algorithm the brain runs oscillates over time between multiple local MAP solutions that correspond to qualitatively different high-level information (e.g., clockwise vs counterclockwise). Concretely, something like the metastable states of a Hopfield network, or the update steps of predictive coding (literally gradient updates to find the MAP solution for perception!!) oscillating between multiple local minima?
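If that's the right picture, here's a toy sketch (my own construction, not from the thread): a Hopfield net with two stored patterns has two attractors, and an ambiguous input settles into one or the other depending on noise and initialization; a crude analogue of bistable perception under local MAP search.

```python
# Toy sketch: a Hopfield network with two stored patterns ("percepts") settles
# into one of two attractors depending on the noisy initial condition.
import numpy as np

rng = np.random.default_rng(0)

patterns = np.array([
    [1, 1, 1, 1, -1, -1, -1, -1],   # "percept A"
    [1, -1, 1, -1, 1, -1, 1, -1],   # "percept B"
])
# Hebbian weights; zero the self-connections.
W = patterns.T @ patterns
np.fill_diagonal(W, 0)

def settle(state, steps=100):
    """Asynchronous updates (each is energy non-increasing) toward a fixed point."""
    state = state.copy()
    for _ in range(steps):
        i = rng.integers(len(state))
        state[i] = 1 if W[i] @ state >= 0 else -1
    return state

# Ambiguous observation: a noisy mixture of the two percepts.
noisy = np.sign(patterns[0] + patterns[1] + rng.normal(0, 1, 8)).astype(int)
final = settle(noisy)
print("settled on A" if (final == patterns[0]).all() else
      "settled on B" if (final == patterns[1]).all() else "spurious state")
```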
Curious about the claim regarding bistable perception as the brain "settling" differently on two distinct but roughly equally plausible generative model parameters behind an observation. In standard statistical terms, should I think of it as: two parameters having similarly high Bayesian posterior probability, but the brain not explicitly representing this posterior, instead using something like local hill climbing to find a local MAP solution—bistable perception corresponding to the two different solutions this process converges to?
If correct, to what ext...
I just read your koan and wow it's a great post, thank you for writing it. It also gave me some new insights as to how to think about my confusions and some answers. Here's my chain of thought:
I like the definition: it's the minimum expected code length for a distribution under constraints on the code (namely, constraints on the kind of beliefs you're allowed to have; after fixing that belief, the optimal code is, as always, the negative log prob).
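In symbols, my reading of that definition, with $\mathcal{C}$ the set of allowed beliefs $q$:

$$H_{\mathcal{C}}(p) \;=\; \min_{q \in \mathcal{C}} \; \mathbb{E}_{x \sim p}\,[-\log q(x)],$$

which recovers the ordinary entropy $H(p)$ when $\mathcal{C}$ is unconstrained, since the cross-entropy is minimized at $q = p$.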
Also, the examples in Proposition 1 were pretty cool in that they gave new characterizations of some well-known quantities - the log determinant of the covariance matrix does indeed intuitively measure the uncertainty of a random variable, but it is very cool to see that it in fact has an entropy interpretation!
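For reference, the standard connection here (which I'd guess is the one behind that example): for $X \sim \mathcal{N}(\mu, \Sigma)$ in $\mathbb{R}^n$, the differential entropy is

$$h(X) \;=\; \tfrac{1}{2}\log\!\big((2\pi e)^n \det \Sigma\big),$$

so $\log\det\Sigma$ is, up to scaling and an additive constant, exactly a differential entropy.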
It's kinda sad because after a brief search it seems like none of the original authors are interested in extending this framework.
That makes sense. I've updated towards thinking this is reasonable (albeit binning and discretization are still ad hoc) and captures something real.
We could formalize it like $Y = f(X) + \varepsilon$, where $f$ is the deterministic map, with $\varepsilon$ being some independent noise parameterized by $\sigma$. Then $I(X;Y)$ would become finite. We could think of binning the output of a layer to make it stochastic in a similar way.
Ideally we'd like the new measure to be finite even for deterministic maps (this is the case for the noisy $I(X;Y)$ above) and some strict d...
Ah you're right. I was thinking about the deterministic case.
Your explanation of the jacobian term accounting for features "squeezing together" makes me update towards thinking maybe the quantizing done to turn neural networks from continuous & deterministic to discrete & stochastic, while ad hoc, isn't as unreasonable as I originally thought it was. This paper is where I got the idea that discretization is bad because it "conflates 'information theoretic stuff' with 'geometric stuff', like clustering" - but perhaps this is in fact capturing something real.
Thank you, first three examples make sense and seem like an appropriate use of mutual information. I'd like to ask about the fourth example though, where you take the weights as unknown:
epistemic status: unoriginal. trying to spread a useful framing of theoretical progress introduced in an old post.
Tl;dr, often the greatest theoretical challenge comes from the step of crossing the chasm from [developing an impractical solution to a problem] to [developing some sort of polytime solution], because the natures of the two solutions can be opposites.
Summarizing Diffractor's post on Program Search and Incomplete Understanding:
Solving a foundational problem to its implementation often takes the ...
I'd vote for removing the stage "developing some sort of polytime solution" and just calling 4 "developing a practical solution". I think listing that extra step is coming from the perspective of someone who's more heavily involved in complexity classes. We're usually interested in polynomial time algorithms because they're usually practical, but there are lots of contexts where practicality doesn't require a polynomial time algorithm, or really, where we're just not working in a context where it's natural to think in terms of algorithms with run-times.
I agree with this framing. The issue of characterizing in what way Our World is Special is the core theoretical question of learning theory.
Framing it as a single bottleneck (the step from 3 to 4) maybe understates how large the space of questions here is. E.g., it encompasses virtually every field of theoretical computer science, and the physics & mathematics relevant to computation outside of AIT and numerical math.
Thank you! I tried it on this post, and while the post itself is pretty short, the raw content that I get seems to be extremely long (larger than the o1 context window, for example), with a bunch of font-related information in between. Is there a way to fix this?
The critical insight is that this is not always the case!
Let's call two graphs I-equivalent if their sets of independencies (implied by d-separation) are identical. A theorem of Bayes Nets says that two graphs are I-equivalent if and only if they have the same skeleton and the same set of immoralities.
This last constraint, plus the constraint that the graph must be acyclic, allows some arrow directions to be identified - namely, across all I-equivalent graphs that are perfect maps of a distribution, some of the edges have identical directions assigned to them.
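As a concrete check of that criterion, a minimal sketch (my own, using networkx; the function names are mine):

```python
# Verma-Pearl criterion: two DAGs are I-equivalent iff they have the same
# skeleton and the same immoralities (v-structures with non-adjacent parents).
import networkx as nx

def skeleton(g: nx.DiGraph) -> set:
    return {frozenset(e) for e in g.edges}

def immoralities(g: nx.DiGraph) -> set:
    """V-structures a -> c <- b with a, b non-adjacent."""
    out = set()
    for c in g.nodes:
        ps = sorted(g.predecessors(c))
        for i, a in enumerate(ps):
            for b in ps[i + 1:]:
                if not (g.has_edge(a, b) or g.has_edge(b, a)):
                    out.add((frozenset((a, b)), c))
    return out

def i_equivalent(g1: nx.DiGraph, g2: nx.DiGraph) -> bool:
    return skeleton(g1) == skeleton(g2) and immoralities(g1) == immoralities(g2)

chain    = nx.DiGraph([("A", "B"), ("B", "C")])  # A -> B -> C
fork     = nx.DiGraph([("B", "A"), ("B", "C")])  # A <- B -> C
collider = nx.DiGraph([("A", "B"), ("C", "B")])  # A -> B <- C (an immorality)
print(i_equivalent(chain, fork))      # True: both imply exactly A indep C given B
print(i_equivalent(chain, collider))  # False: the collider's directions are identified
```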
The IC ...
Epistemic status: metaphysics
I was reading Factored Space Models (previously, Finite Factored Sets) and was trying to understand in what sense it was a Theory of Time.
Scott Garrabrant says "[The Pearlian Theory of Time] ... is the best thing to happen to our understanding of time since Einstein". I read Pearl's book on Causality[1], and while there's math, this metaphysical connection that Scott seems to make isn't really explicated. Timeless Causality and Timeless Physics is the only place I saw this vie...
The grinding inevitability is not a pressure on you from the outside, but a pressure from you, towards the world. This type of determination is the feeling of being an agent with desires and preferences. You are the unstoppable force, moving towards the things you care about, not because you have to but simply because that’s what it means to care.
I think this is probably one of my favorite quotes of all time. I translated it to Korean (with somewhat major stylistic changes) with the help of ChatGPT:
...의지(意志)라 함은,
하나의 인간으로서,
멈출 수 없는 힘으로
(Roughly, back in English: "Will is: as a human being, with an unstoppable force...")
https://www.lesswrong.com/posts/KcvJXhKqx4itFNWty/k-complexity-is-silly-use-cross-entropy-instead
The K-complexity of a function is the length of its shortest code. But having many many codes is another way to be simple! Example: gauge symmetries in physics. Correcting for length-weighted code frequency, we get an empirically better simplicity measure: cross-entropy.
[this] is a well-known notion in algorithmic information theory, and differs from K-complexity by at most a constant
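In symbols (my transcription of the post's idea, with $U$ a prefix universal machine):

$$K(f) \;=\; \min_{p\,:\,U(p)=f} |p|, \qquad \mathrm{CE}(f) \;=\; -\log_2 \sum_{p\,:\,U(p)=f} 2^{-|p|},$$

so $\mathrm{CE}(f) \le K(f)$ always (the sum includes the shortest code), and the coding theorem gives $K(f) \le \mathrm{CE}(f) + O(1)$.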
Epistemic status: literal shower thoughts, perhaps obvious in retrospect, but was a small insight to me.
I’ve been thinking about: “what proof strategies could prove structural selection theorems, and not just behavioral selection theorems?”
Typical examples of selection theorems in my mind are: coherence theorems, good regulator theorem, causal good regulator theorem.
"I always remember, [Hamming] would come into my office and try to solve a problem [...] I had a very big blackboard, and he’d start on one side, write down some integral, say, ‘I ain’t afraid of nothin’, and start working on it. So, now, when I start a big problem, I say, ‘I ain’t afraid of nothin’, and dive into it."
The question is whether this expression is easy to compute or not, and fortunately the answer is that it's quite easy! We can evaluate the first term by the simple Monte Carlo method of drawing many independent samples and evaluating the empirical average, as we know the distribution explicitly and it was presumably chosen to be easy to draw samples from.
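For concreteness, a minimal sketch of that kind of estimate (a toy example of mine, not the post's actual objective):

```python
# Simple Monte Carlo: estimate E_{z ~ q}[f(z)] by an empirical average, which
# works precisely because q is explicit and easy to draw samples from.
import numpy as np

rng = np.random.default_rng(0)

def mc_expectation(sample_q, f, n=100_000):
    """Empirical average of f over i.i.d. samples from q."""
    return f(sample_q(n)).mean()

# Toy case: E[z^2] for z ~ N(0, 1) is exactly 1.
print(mc_expectation(lambda n: rng.normal(size=n), lambda z: z ** 2))  # ~ 1.0
```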
My question when reading this was: why can't we say the same thing about the second term? i.e., draw many independent samples and evaluate the empirical average...
Tl;dr, Systems are abstractable to the extent they admit an abstracting causal model map with low approximation error. This should yield a pareto frontier of high-level causal models consisting of different tradeoffs between complexity and approximation error. Then try to prove a selection theorem for abstractability / modularity by relating the form of this curve and a proposed selection criteria.
Recall, an abstracting causal model (ACM)—exact transformations, $\tau$-abstractions, and approximations—is a m...
I don't know if this is just me, but it took me an embarrassingly long time in my mathematical education to realize that the following three terminologies, which introductory textbooks used interchangeably without being explicit, mean the same thing. (Maybe this is just because English is my second language?)
X ⇒ Y means X is sufficient for Y means X only if Y
X ⇐ Y means X is necessary for Y means X if Y
I'd also love to have access!
(the causal incentives paper convinced me to read it, thank you! good book so far)
if you read Sutton & Barto, it might be clearer to you how narrow are the circumstances under which 'reward is not the optimization target', and why they are not applicable to most AI things right now or in the foreseeable future
Can you explain this part a bit more?
My understanding is that the situations in which 'reward is not the optimization target' are those where the assumptions of the policy improvement theorem don't hold. In particular, the theorem (that iterating policy improvemen...
I think something in the style of abstracting causal models would make this work - defining a high-level causal model such that there is a map from the states of the low-level causal model to it, in a way that's consistent with mapping low-level interventions to high-level interventions. Then you can retain the notion of causality to non-low-level-physical variables with that variable being a (potentially complicated) function of potentially all of the low-level variables.
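Roughly, in symbols (my paraphrase of the abstracting-causal-models consistency condition, with $\tau$ mapping low-level states to high-level states and $\omega$ mapping low-level interventions to high-level ones):

$$\tau_{*}\, P\big(M_L^{\,\mathrm{do}(i)}\big) \;=\; P\big(M_H^{\,\mathrm{do}(\omega(i))}\big) \qquad \text{for all } i \in \mathcal{I}_L,$$

i.e., pushing the intervened low-level distribution forward through $\tau$ agrees with intervening directly in the high-level model.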
tl;dr, the unidimensional continuity of preference assumption in the money pumping argument used to justify the VNM axioms corresponds to the assumption that there exists some unidimensional "resource" that the agent cares about, and this language is provided by the notion of "souring / sweetening" a lottery.
Various coherence theorems - or more specifically, various money pumping arguments - generally have the following form:
...If you violate this principle, then [you are rationally re
Yeah, I'd like to know if there's a unified way of thinking about information-theoretic quantities and causal quantities, though a quick literature search doesn't turn up anything interesting. My guess is that we'd want separate boundary metrics for informational separation and causal separation.
I no longer think the setup above is viable, for reasons that connect to why I think Critch's operationalization is incomplete and why boundaries should ultimately be grounded in Pearlian Causality and interventions.
(Note: I am thinking as I'm writing, so this might be a bit rambly.)
Intuition: Why does a robust glider in Lenia intuitively feel like a system possessing a boundary? Well, I imagine various situations that happen in the world (like bullets), and this pattern mostly stays stable in the face of them.
Now, n...
I think it's plausible that the general concept of boundaries can be characterized somewhat independently of preferences, while at the same time boundary-preservation is a quality that agents mostly satisfy (discussion here; very unsure about this). I see Critch's definition as a first iteration of an operationalization for boundaries in the general, somewhat-preference-independent sense.
But I do agree that ultimately all of this should tie back to game theory. I find Discovering Agents most promising in this regard, though there are still a l...
EDIT: I no longer think this setup is viable, for reasons that connect to why I think Critch's operationalization is incomplete and why boundaries should ultimately be grounded in Pearlian Causality and interventions. Check update.
I believe there's nothing much in the way of actually implementing an approximation of Critch's boundaries[1] using deep learning.
Recall, Critch's boundaries are:
Damn, why did Pearl recommend (in the preface of his causality book) that readers read all the chapters other than chapter 2 (and the last review chapter)? Chapter 2 is literally the coolest part - inferring causal structure from purely observational data! I almost skipped that chapter because of it...
Here's my current take, I wrote it as a separate shortform because it got too long. Thanks for prompting me to think about this :)
I find the intersection of computational mechanics, boundaries/frames/factored-sets, and some works from the causal incentives group - especially discovering agents and robust agents learn causal world model (review) - to be a very interesting theoretical direction.
By boundaries, I mean a sustaining/propagating system that informationally/causally insulates its 'viscera' from the 'environment,' and only allows relatively small amounts of deliberate information flow through certain channels in both directions. Living systems are an example of it (from bacte...
Discovering agents provides a genuine causal, interventionist account of agency and an algorithm to detect agents, motivated by the intentional stance. I find this paper very enlightening from a conceptual perspective!
I've tried to think of problems that needed to be solved before we can actually implement this on real systems - both conceptual and practical - on approximate order of importance.
Thanks, it seems like the link got updated. Fixed!
Quick paper review of Measuring Goal-Directedness from the causal incentives group.
tl;dr, the goal-directedness of a policy wrt a utility function is measured by its min distance to one of the policies implied by the utility function, as per the intentional stance - that one should model a system as an agent insofar as doing so is useful.
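A toy sketch of the flavor of that measure (my own construction; the paper's actual definition is built differently, so treat the Boltzmann family below as a stand-in for "the policies implied by the utility function"):

```python
# Score a policy's goal-directedness w.r.t. a utility function U as its minimum
# KL divergence to a family of Boltzmann-rational policies for U. All names and
# parameter choices here are hypothetical, for illustration only.
import numpy as np

def boltzmann_policy(utilities, beta):
    """Softmax policy over actions with rationality parameter beta."""
    z = beta * utilities
    z -= z.max()
    p = np.exp(z)
    return p / p.sum()

def goal_directedness(policy, utilities, betas=np.linspace(0.0, 10.0, 101)):
    """Negative min KL divergence from policy to any Boltzmann policy for U."""
    kls = [np.sum(policy * np.log(policy / boltzmann_policy(utilities, b)))
           for b in betas]
    return -min(kls)

pi = np.array([0.7, 0.2, 0.1])
U = np.array([1.0, 0.5, 0.0])
print(goal_directedness(pi, U))  # closer to 0 => more goal-directed w.r.t. U
```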
Thank you, that is very clarifying!
I've been doing a deep dive on this post, and while the main theorems make sense I find myself quite confused about some basic concepts. I would really appreciate some help here!
Does anyone know if Shannon arrived at entropy from the axiomatic definition first, or the operational definition first?
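(For reference, the two routes as I understand them: the axiomatic one takes desiderata - continuity in the $p_i$, monotonicity in $n$ for uniform distributions, and the grouping rule below - which together force $H = -K\sum_i p_i \log p_i$; the operational one is the source coding theorem, where $H$ is the optimal achievable compression rate.)

$$H(p_1,\ldots,p_n) = H(p_1+p_2,\, p_3,\ldots,p_n) + (p_1+p_2)\, H\!\left(\tfrac{p_1}{p_1+p_2},\, \tfrac{p_2}{p_1+p_2}\right)$$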
I've been thinking about these two distinct ways in which we seem to arrive at new mathematical concepts. Looking at the countless partial information decomposition measures in the literature, all derived/motivated on an axiomatic basis, and not knowing which intuition to prioritize over which, I've been putting less of a premium on axiomatic conceptual definitions than I used to:
Just finished the local causal states paper, it's pretty cool! A couple of thoughts though:
I don't think the causal states factorize over the dynamical bayes net, unlike the original random variables (by assumption). Shalizi doesn't claim this either.
Also I don't fo...
a Markov blanket represents a probabilistic fact about the model without any knowledge you possess about values of specific variables, so it doesn't matter if you actually do know which way the agent chooses to go.
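(For reference, the standard distribution-level notion being described: $B$ is a Markov blanket of $X$ within variables $V$ iff $X \perp\!\!\!\perp V \setminus (B \cup \{X\}) \mid B$ - a property of the joint distribution alone, stated without reference to realized values.)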
The usual definition of Markov blankets is in terms of the model without any knowledge of the specific values as you say, but I think in Critch's formalism this isn't the case. Specifically, he defines the 'Markov Boundary' of (being the non-abstracted physics-ish model) as a function of the random variable (where he wri...
I thought if one could solve one NP-complete problem then one could solve all of them. But you say that treewidth doesn't help at all with the Clique problem. Is the parameterized complexity filtration by treewidth not preserved by the equivalence between different NP-complete problems somehow?
All NP-complete problems should have parameters that make the problem polynomial when bounded - trivially so, via translating the problem => 3-SAT => Bayes Net and using the treewidth bound.
This isn't the case for the clique problem (finding max clique) because it's not NP...
You mention treewidth - are there other quantities of similar importance?
I'm not familiar with any, though ChatGPT does give me some examples! copy-pasted below:
...
- Solution Size (k): The size of the solution or subset that we are trying to find. For example, in the k-Vertex Cover problem, k is the maximum size of the vertex cover. If k is small, the problem can be solved more efficiently.
- Treewidth (tw): A measure of how "tree-like" a graph is. Many hard graph problems become tractable when restricted to graphs of bounded treewidth. Algorithms that leverage tr
I like to think of treewidth in terms of its characterization from tree decomposition, a task where you find a clique tree (or junction tree) of an undirected graph.
A clique tree for an undirected graph is a tree such that:
- each node of the tree is a clique of the graph, and every maximal clique of the graph appears as a node, and
- (running intersection) for every vertex of the graph, the tree nodes whose cliques contain that vertex form a connected subtree.
You can check that these properties hold in the example below. I will also refer to nodes of a clique tree...
Bayes Net inference algorithms maintain their efficiency by using dynamic programming at multiple levels (toy sketch after the list):
Level 0: Naive Marginalization
Level 1: Variable Elimination
Level 2: Clique-tree...
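To make Levels 0 and 1 concrete, a toy sketch (my own example; the distributions are made up):

```python
# Contrast Level 0 and Level 1 on a chain A -> B -> C: naive marginalization
# sums over the full joint, while variable elimination pushes sums inward and
# reuses partial results.
import numpy as np

pA = np.array([0.6, 0.4])                      # P(A)
pB_A = np.array([[0.9, 0.1], [0.2, 0.8]])      # P(B | A), rows indexed by A
pC_B = np.array([[0.7, 0.3], [0.5, 0.5]])      # P(C | B), rows indexed by B

# Level 0: materialize the full joint P(A, B, C), then sum out A and B.
# Exponential in the number of variables in general.
joint = pA[:, None, None] * pB_A[:, :, None] * pC_B[None, :, :]
pC_naive = joint.sum(axis=(0, 1))

# Level 1: variable elimination. Sum out A first, then B; never build the joint.
pB = pA @ pB_A          # eliminate A: P(B)
pC = pB @ pC_B          # eliminate B: P(C)
print(np.allclose(pC, pC_naive))  # True
```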
Perhaps I should one day in the far far future write a sequence on Bayes Nets.
Some low-effort TOC (this is basically mostly Koller & Friedman):
Speaking from the perspective of someone still developing basic mathematical maturity and often lacking prerequisites, it's very useful as a learning aid. For example, it significantly expanded the range of papers or technical results accessible to me. If I'm reading a paper containing unfamiliar math, I no longer have to go down the rabbit hole of tracing prerequisite dependencies, which often expand exponentially (partly because I don't know which results or sections in the prerequisite texts are essential, making it difficult to scope my focus). Now I c...