All of Dmitry Vaintrob's Comments + Replies

Thanks for this post. I would argue that part of an explanation here could also be economic: modernity brings specialization and a move from the artisan economy of objects as uncommon, expensive, multipurpose, and with a narrow user base (illuminated manuscripts, decorative furniture) to a more utilitarian and targeted economy. Early artisans need to compete for a small number of rich clients by being the most impressive, artistic, etc., whereas more modern suppliers follow more traditional laws of supply and demand and track more costs (cost-effectiveness... (read more)

I'm managing to get verve and probity, but having issues with wiles

I really liked the post - I was confused by the meaning and purpose of the no-coincidence principle when I was at ARC, and this post clarifies it well. I like that this is asking for something weaker than a proof (or a probabilistic weakening of proof), since [related to the example of using the Riemann hypothesis] incompleteness leads you to expect, in general, true results that give rise to "surprising" families of circuits which are not provable by logic. I can also see Paul's point about how this statement is sort of like P vs. BPP, but not quite.

More specifically, this feels like a sort of 2nd-order boolean/polynomial hierarchy statement whose first-order version is P vs. BPP. Are there analogues of this for other orders?

Looks like a conspiracy of pigeons posing as lw commenters has downvoted your post

Thanks!

I haven't managed to grok your loss-scales explanation (the "interpretability insights" section) without reading your other post, though.

Not saying anything deep here. The point is just that you might have two cartoon pictures:

  1. every correctly classified input is either the result of a memorizing circuit or of a single coherent generalizing circuit behavior. If you remove a single generalizing circuit, your accuracy will degrade additively.
  2. a correctly classified input is the result of a "combined" circuit consisting of multiple parallel generalizing "subproce
... (read more)
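(A minimal numerical sketch of the two cartoons above, with made-up numbers: in picture 1, ablating one of k circuits removes its share of inputs; in picture 2, every input depends on all k parallel subprocesses, so one degraded subprocess hurts everything a little:)

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs, k = 10_000, 5

# Picture 1: each input is handled by exactly one generalizing circuit.
owner = rng.integers(0, k, size=n_inputs)   # which circuit handles each input
acc_after_ablation = np.mean(owner != 0)    # ablate circuit 0: additive ~1/k drop
print(f"picture 1: 1.00 -> {acc_after_ablation:.2f}")

# Picture 2: every input needs all k parallel subprocesses to fire.
p_fire = 0.9                                # hypothetical per-subprocess reliability
acc_full = p_fire**k
acc_degraded = p_fire**(k - 1) * 0.5        # one subprocess degraded to 0.5
print(f"picture 2: {acc_full:.2f} -> {acc_degraded:.2f}")
```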

Thanks for the questions! 

You first introduce the SLT argument that tells us which loss scale to choose (the "Watanabe scale", derived from the Watanabe critical temperature).

Sorry, I think the context of the Watanabe scale is a bit confusing. I'm saying that in fact it's the wrong scale to use as a "natural scale". The Watanabe scale depends only on the number of training datapoints, and doesn't notice any other properties of your NN or your phenomenon of interest. 

Roughly, the Watanabe scale is the scale on which loss improves if you memorize a... (read more)
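(As I understand it, the back-of-the-envelope version is: in the SLT free-energy asymptotics $F_n \approx n L_n + \lambda \log n$, trading one unit of effective dimension $\lambda$ against training loss pins down the per-sample loss scale

$$\Delta L_{\text{Watanabe}} \sim \frac{\log n}{n},$$

which indeed depends on the number of datapoints $n$ and on nothing else about the network or the phenomenon.)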

Thanks! Yes the temperature picture is the direction I'm going in. I had heard the term "rate distortion", but didn't realize the connection with this picture. Might have to change the language for my next post

This seems overstated

In some sense this is the definition of the complexity of an ML algorithm; more precisely, the free energy is the direct analog of the complexity measurement from information theory (the "entropy", or Solomonoff complexity). (I'm writing a distillation on this, but it is a standard result.) The relevant question then becomes whether the "SGLD" sampling techniques used in SLT for measuring the free energy (or, technically, its derivative) actually converge to reasonable values in polynomial time. This is checked pretty extensively in th... (read more)
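(For readers who haven't seen it, here is a minimal sketch of the kind of SGLD-based estimator being discussed, with illustrative hyperparameters of my own choosing; real measurements need batching, step-size tuning, and convergence diagnostics:)

```python
import numpy as np

def sgld_llc(loss, grad, w0, n, gamma=100.0, eps=1e-5, steps=5000, seed=0):
    """Crude SGLD estimate of the free-energy derivative (the local learning
    coefficient) near a trained point w0, sampling the localized tempered
    posterior exp(-n*beta*L(w) - gamma/2 * |w - w0|^2). Sketch only."""
    rng = np.random.default_rng(seed)
    beta = 1.0 / np.log(n)  # Watanabe-style inverse temperature
    w = w0.copy()
    losses = []
    for _ in range(steps):
        # Langevin step: gradient drift plus Gaussian noise
        drift = -(eps / 2) * (n * beta * grad(w) + gamma * (w - w0))
        w = w + drift + np.sqrt(eps) * rng.standard_normal(w.shape)
        losses.append(loss(w))
    # standard estimator: lambda_hat = n * beta * (E[L(w)] - L(w0))
    return n * beta * (np.mean(losses) - loss(w0))

# sanity check on a toy quadratic loss, where the true value is d/2
d, n = 4, 10_000
print(sgld_llc(lambda w: 0.5 * np.sum(w**2), lambda w: w, np.zeros(d), n))
```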

1Louis Jaburi
The linked paper considers only large models which are DLNs. I don't find this too compelling as evidence for large models with non-linearities. Other measurements I have seen for bigger/deeper non-linear models seem promising, but I wouldn't call them robust yet (though it is not clear to me whether this is because of an SGLD implementation/hyperparameter issue or because there is a more fundamental problem here). As long as I don't have a clearer picture of the relationship between free energy and training dynamics under SGD, I agree with OP that the claim is too strong.

Thanks for the reference, and thanks for providing an informed point of view here. I would love to have more of a debate here, and would quite like being wrong as I like tropical geometry.
 

First, about your concrete question:

As I understand it, here the notion of "density of polygons" is used as a kind of proxy for the derivative of a PL function?

Density is a proxy for the second derivative: indeed, the closer a function is to linear, the easier it is to approximate it by a linear function. I think a similar idea occurs in 3D graphics, in mesh optimiz... (read more)
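(The standard quantitative version of this intuition, for one-dimensional PL approximation: on a segment of length $h$, linear interpolation of a $C^2$ function $f$ satisfies

$$\|f - \mathrm{PL}_h f\|_{\infty} \;\le\; \frac{h^2}{8}\,\max |f''|,$$

so achieving uniform error $\varepsilon$ requires local knot spacing $h(x) \sim \sqrt{8\varepsilon/|f''(x)|}$, i.e., knot density proportional to $\sqrt{|f''(x)|}$.)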

2Alexander Gietelink Oldenziel
Like David Holmes, I am not an expert in tropical geometry, so I can't give the best case for why it may be useful; only a real expert putting in serious effort can make that case. Let me nevertheless respond to some of your claims.

* PL functions are quite natural for many reasons. They are simple, and they naturally appear as minimizers of various optimization procedures; see e.g. the discussion in section 5 here.
* Polynomials don't satisfy the padding argument, so architectures based on them will typically fail to have the correct simplicity bias.

On 1, "Algebraic geometry isn't good at dealing with deep composition of functions, and especially approximate composition": I agree a typical course in algebraic geometry does not much consider composition of functions, but that doesn't seem to me a strong argument that the tools of algebraic geometry are not relevant here. Certainly, more sophisticated methods beyond classical scheme theory may be important [likely involving something like PROPs], but ultimately I'm not aware of any fundamental obstruction here.

On 2: I don't agree with the contention that algebraic geometry is somehow not suited to questions of approximation. E.g., the Weil conjectures are really an approximate/average statement about points of curves over finite fields. The same objection you make could have been made about singularity theory before we knew about SLT.

I agree with you that a probabilistic perspective on ReLUs/piecewise-linear functions is probably important. It doesn't seem unreasonable to me in the slightest to consider some sort of tempered posterior on the space of piecewise-linear functions. I don't think this invalidates the potential of polytope-flavored thinking.
1DavidHolmes
Hmm, so I'm very wary of defending tropical geometry when I know so little about it; if anyone more informed is reading, please jump in! But until then, I'll have a go.

Hmm, even for a very small value of `might'? I'm not saying that someone who wants to contribute to ML needs to seriously consider learning some tropical geometry, just that if one already knows tropical geometry it's not a crazy idea to poke around a bit and see if there are applications.

I agree this is an important point. I don't actually have a good idea what activation functions people use in practise these days. Thinking about asymptotic linearity makes me think about the various papers appearing using polynomial activation functions. Do you have an opinion on this? For people in algebraic geometry it's appealing as it generates lots of AG problems (maybe very hard ones), but I don't have a good feeling as to whether it's got anything much to do with `real life' ML. I can link to some of the papers I'm thinking of if that's helpful, or maybe you are already a bit familiar.

I think you're right; this paper just came to mind because I was reading it recently. A little googling suggests there are some applications. This paper seems to give an application of tropical geometry to the complexity of linear programming: https://inria.hal.science/hal-03505719/document and this list of conference abstracts seems to give other applications: https://him-application.uni-bonn.de/fileadmin/him/Workshops/TP3_21_WS1_Abstracts.pdf Whether they are 'convincing' I leave up to you.

Fair, though one might also see that as an interesting challenge. I don't have a feeling as to whether this is for really fundamental reasons, or people just haven't tried so hard yet. There are plenty of cases where "high degree" is enough (Faltings's theorem is the first thing that comes to mind, but there are lots). But I agree that "degree approximately 5" feels quite unnatural.

If I understand correctly, you want a way of thinking about a reference class of programs that has some specific, perhaps interpretability-relevant or compression-related properties in common with the deterministic program you're studying?

I think in this case I'd actually say the tempered Bayesian posterior by itself isn't enough, since even if you work locally in a basin, it might not preserve the specific features you want. In this case I'd probably still start with the tempered Bayesian posterior, but then also condition on the specific properties/explicit features/ etc. that you want to preserve. (I might be misunderstanding your comment though)

Statistical localization in disordered systems, and dreaming of more realistic interpretability endpoints

[epistemic status: half fever dream, half something I think is an important point to get across. Note that the physics I discuss is not my field though close to my interests. I have not carefully engaged with it or read the relevant papers -- I am likely to be wrong about the statements made and the language used.]

A frequent discussion I get into in the context of AI is "what is an endpoint for interpretability". I get into this argument from two sides:... (read more)

2Noosphere89
I think the axiom that they'd use to prove something like this mathematically probably depends on assuming scale separation, such that you can discover laws that, while not fully accurate, are much better than random chance and cheap to compute; this means you can get more compute to learn new laws, until you hit the limiting point of a Theory of Everything: https://www.lesswrong.com/posts/HcjL8ydHxPezj6wrt/book-review-the-structure-of-scientific-revolutions#qyPgFjpqNDaZ8cneh

What application do you have in mind? If you're trying to reason about formal models without trying to completely rigorously prove things about them, then I think thinking of neural networks as stochastic systems is the way to go. Namely, you view the weights as a random variable: solve a stochastic optimization problem to produce a weight-valued random variable, then condition it on whatever knowledge about the weights/activations you assume is available. This can be done both in the Bayesian "thermostatic" sense as a model of idealized networks, and ... (read more)

1Dalcy
I just read your koan and wow, it's a great post, thank you for writing it. It also gave me some new insights as to how to think about my confusions and some answers. Here's my chain of thought:

* If I want my information-theoretic quantities to not degenerate, then I need some distribution over the weights. What is the natural distribution to consider?
* Well, there's the Bayesian posterior.
* But I feel like there is a sense in which an individual neural network with its weights should be considered as a deterministic information-processing system on its own, without reference to an ensemble.
* Using the Bayesian posterior won't let me do this:
  * If I have a fixed neural network that contains a circuit C that takes activation X (at a particular location in the network) to produce activation Y (at a different location), it would make sense to ask questions about the nature of the information processing that C does, like I(X;Y).
  * But intuitively, taking the weights as unknown averages everything out -- even if my original fixed network had a relatively high probability density in the Bayesian posterior, it is unlikely that X and Y would be related by similar circuit mechanisms given another weight sampled randomly from the posterior.
  * Same with sampling from the post-SGD distribution.
* So it would be nice to find a way to interpolate the two. And I think the idea of a tempered local Bayesian posterior from your koan post is basically the right way to do this!

(And all of this makes me think papers that measure mutual information between activations in different layers by introducing a noise distribution over the parameters of f are a lot more reasonable than I originally thought.)

This is where this question of "scale" comes in. I want to add that (at least morally/intuitively) we are also thinking about discrete systems like lattices, and then instead of a regulator you have a coarse-graining or a "blocking transformation", which you have a lot of freedom to choose. For example in PDLT, the object that plays the role of coarse-graining is the operation that takes a probability distribution on neurons and applies a single-layer NN to it.

1Tahp
I consider the lattice to be a regulator as well, but, semantics aside, thank you for the example.

Thanks for the reference -- I'll check out the paper (though there are no pointer variables in this picture inherently).

I think there is a miscommunication in my messaging. Possibly through overcommitting to the "matrix" analogy, I may have given the impression that I'm doing something I'm not. In particular, the view here isn't a controversial one -- it has nothing to do with Everett or einselection or decoherence. Crucially, I am saying nothing at all about quantum branches.

I'm now realizing that when you say map or territory, you're probably talking abo... (read more)

Thanks for the questions!

  1. Yes, "QFT" stands for "Statistical field theory" :). We thought that this would be more recognizable to people (and also, at least to some extent, statistical is a special case of quantum). We aren't making any quantum proposals.
  2.  
    1. We're following (part of) this community, and interested in understanding and connecting the different parts better. Most papers in the "reference class" we have looked at come from (a variant of) this approach. (The authors usually don't assume Gaussian inputs or outputs, but just high width compared
... (read more)
1Simon Pepin Lehalleur
On 1., you should consider that, for people who don't know much about QFT and its relationship with SFT (like, say, me 18 months ago), it is not at all obvious that QFT can be applied beyond quantum systems!  In my case, the first time I read about "QFT for deep learning" I dismissed it automatically because I assumed it would involve some far-fetched analogies with quantum mechanics.
1Simon Pepin Lehalleur
  Where could I read about this?

To add: I think the other use of "pure state" comes from this context. Here if you have a system of commuting operators and take a joint eigenspace, the projector is mixed, but it is pure if the joint eigenvalue uniquely determines a 1D subspace; and then I think this terminology gets used for wave functions as well
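(A minimal example of the distinction, assuming two qubits: take the commuting operators $Z \otimes I$ and $I \otimes Z$. The joint eigenvalue $(+1,+1)$ determines the 1D subspace spanned by $|00\rangle$, so the projector $|00\rangle\langle 00|$ is pure. By contrast, the eigenvalue $+1$ of $Z \otimes I$ alone has a 2D eigenspace, and the normalized projector $\tfrac{1}{2}\big(|00\rangle\langle 00| + |01\rangle\langle 01|\big)$ is mixed.)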

One person's "Occam's razor" may be description length, another's may be elegance, and a third person's may be "avoiding having too much info inside your system" (as some anti-MW people argue). I think discussions like "what's real" need to be done thoughtfully, otherwise people tend to argue past each other, and come off overconfident/ underinformed. 

To be fair, I did use language like this so I shouldn't be talking -- but I used it tongue-in-cheek, and the real motivation given in the above is not "the DM is a more fundamental notion" but "DM lets y... (read more)

2Charlie Steiner
Maybe talking about the construction of pointer states? That linked paper does it just as you might prefer, putting the Boltzmann distribution into a density matrix. But of course you could rephrase it as a probability distribution over states and the math goes through the same, you've just shifted the vibe from "the Boltzmann distribution is in the territory" to "the Boltzmann distribution is in the map." Sure, at some level of description it's useful to say that measurement is irreversible, just like at some level of description it's useful to say entropy always increases. Just like with entropy, it can be derived from boundary conditions + reversible dynamics + coarse-graining. Treating measurements as reversible probably has more applications than treating entropy as reversible, somewhere in quantum optics / quantum computing.

Thanks - you're right. I have seen "pure state" referring to a basis vector (e.g. in quantum computation), but in QTD your definition is definitely correct. I don't like the term "pointer variable" -- is there a different notation you like?

1jacob_drori
I'd prefer "basis we just so happen to be measuring in". Or "measurement basis" for short. You could use "pointer variable", but this would commit you to writing several more paragraphs to unpack what it means (which I encourage you to do, maybe in a later post).

Yeah, this also bothered me. The notion of "probability distribution over quantum states" is not a good notion: the matrix $I$ is both $|0\rangle \langle 0| + |1\rangle \langle 1|$ and $|a\rangle \langle a| + |b\rangle \langle b|$ for any other orthonormal basis $\{|a\rangle, |b\rangle\}$. The fact that these should be treated equivalently seems totally arbitrary. The point is that density-matrix mechanics is the notion of probability for quantum states, and can be formalized as such (dynamics of informational lower bounds given observations). I was sort of getting at this with the long "explaining probability to an alien" footnote, but I don't think it landed (and I also don't have the right background to make it precise)
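(Concretely, for a single qubit with $|\pm\rangle = \frac{1}{\sqrt{2}}(|0\rangle \pm |1\rangle)$:

$$\tfrac{1}{2} I \;=\; \tfrac{1}{2}|0\rangle\langle 0| + \tfrac{1}{2}|1\rangle\langle 1| \;=\; \tfrac{1}{2}|+\rangle\langle +| + \tfrac{1}{2}|-\rangle\langle -|,$$

so the uniform mixture over $\{|0\rangle, |1\rangle\}$ and the uniform mixture over $\{|+\rangle, |-\rangle\}$ are the same density matrix, and no measurement can distinguish the two preparations.)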

I've found our Agent Smith :) If you are serious, I'm not sure what you mean. Like there is no ontology in physics -- every picture you make is just grasping at pieces of whatever theory of everything you eventually develop

When you say there's "no such thing as a state," or "we live in a density matrix," these are statements about ontology: what exists, what's real, etc.

Density matrices use the extra representational power they have over states to encode a probability distribution over states. If we regard the probabilistic nature of measurements as something to be explained, putting the probability distribution directly into the thing we live in is what I mean by "explain with ontology."

Epistemology is about how we know stuff. If we start with a world that does not inherent... (read more)

I like this! Something I would add at some point before unitarity is that there is another type of universe that we almost inhabit, where your vectors of states have nonnegative real coefficients that sum to 1, and your evolution matrices are Markovian (i.e., they have nonnegative coefficients and preserve the sum of coordinates). In a certain sense in such a universe it's weird to say "the universe is .3 of this particle being in state 1 and .7 of it being in state 2", but if we interpret this as a probability, we have lived experience of this.
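(A toy instance of such a universe, with made-up numbers: the state is a probability vector and the dynamics a column-stochastic matrix,

$$M = \begin{pmatrix} 0.9 & 0.2 \\ 0.1 & 0.8 \end{pmatrix}, \qquad M \begin{pmatrix} 0.3 \\ 0.7 \end{pmatrix} = \begin{pmatrix} 0.41 \\ 0.59 \end{pmatrix},$$

where the columns are nonnegative and sum to 1, so the total probability $\sum_i p_i$ is conserved -- in contrast to the unitary case, which conserves $\sum_i |c_i|^2$.)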

Something that I like ... (read more)

1Optimization Process
That is... a very interesting and attractive way of looking at it. I'll chew on your longer post and respond there!

Why I'm in AI sequence: 2020 Journal entry about gpt3

I moved from math academia to full-time AI safety a year ago -- in this I'm in the same boat as Adam Shai, whose reflection post on the topic I recommend you read instead of this.

In making the decision, I went through a lot of thinking and (attempts at) learning about AI before that. A lot of my thinking had been about whether a pure math academic can make a positive difference in AI, and examples that I thought counterindicated this -- I finally decided this might be a good idea after talking to my sis... (read more)

Yeah, I agree that it would be even more interesting to look at various complexity parameters. The inspiration here, of course, is physics: isolating a particle/effective particle (like a neutron in a nucleus) or an interaction between a fixed set of particles, by putting it in a regime where other interactions and groupings drop out. The go-to for a physicist is temperature: you can isolate a neutron by putting the nucleus in a very high-temperature environment like a collider where the constituent baryons separate. This (as well as the behavior wrt generalit... (read more)

Thanks! Are you saying there is a better way to find citations than a random walk through the literature? :)

I didn't realize that the pictures above limit to literal pieces of sin and cos curves (and Lissajous curves more generally). I suspect this is a statement about the singular values of the "sum" matrix S of upper-triangular 1's?
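(A quick numerical check of this suspicion -- a sketch using the lower-triangular convention; the upper-triangular version is its transpose, which just swaps left and right singular vectors:)

```python
import numpy as np

T = 200
S = np.tril(np.ones((T, T)))  # cumulative-sum matrix: (S x)_t = sum_{s<=t} x_s
U, svals, Vt = np.linalg.svd(S)

# Check: the k-th left singular vector is (approximately) a sampled sinusoid,
# u_k(t) ~ sin((k - 1/2) * pi * t / T), matching the Lissajous-curve picture.
t = np.arange(1, T + 1) - 0.5
for k in (1, 2, 3):
    ref = np.sin((k - 0.5) * np.pi * t / T)
    ref /= np.linalg.norm(ref)
    print(k, abs(U[:, k - 1] @ ref))  # each close to 1.0
```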

The "developmental clock" observation is neat! Never heard of it before. Is it a qualitative "parametrization of progress" thing or are there phase transition phenomena that happen specifically around the midpoint?

2Linda Linsefors
One of the talks at ILIAD had a set of PCA plots where PC2 turned around at different points for different training setups. I think the turning point corresponded to when the model started to overfit, but I don't quite remember. Whatever the meaning of the turning point was, I think they also verified it with some other observation. Given that this was ILIAD, the other observation was probably the LLC. If you want to look it up I can try to find the talk among the recordings.
7Daniel Murfet
Hehe. Yes that's right, in the limit you can just analyse the singular values and vectors by hand, it's nice.

No general implied connection to phase transitions, but the conjecture is that if there are phase transitions in your development then you can for general reasons expect PCA to "attempt" to use the implicit "coordinates" provided by the Lissajous curves (i.e. a binary tree: the first Lissajous curve uses PC2 to split the PC1 range in half, and so on) to locate stages within the overall development. I got some way towards proving that by extending the literature I cited in the parent, but had to move on, so take the story with a grain of salt. This seems to make sense empirically in some cases (e.g. our paper).
2Morpheus
Yes!

Hmm, I'm not sure how what you're describing (learn on a bunch of examples of (query, well-thought-out guess)) is different from other forms of supervised learning.

Based on the paper Adam shared, it seems that part of the "amortizing" picture is that instead of simple supervised learning you look at examples of the form (context1, many examples from context1), (context2, many examples from context2), etc., in order to get good at quickly performing inference on new contexts.

It sounds like in the Paul Christiano example, you're assuming access to some inter... (read more)

4jessicata
I was trying to say things related to this: Which sounds like supervised learning. Adam seemed to want to know how that relates to scaling up inference time compute so I said some ways they are related. I don't know much about amortized inference in general. The Goodman paper seems to be about saving compute by caching results between different queries. This could be applied to LLMs but I don't know of it being applied. It seems like you and Adam like this "amortized inference" concept and I'm new to it so don't have any relevant comments. (Yes I realize my name is on a paper talking about this but I actually didn't remember the concept) I don't think I implied anything about o3 relating to parallel heuristics.

Thanks! I spent a bit of time understanding the stochastic inverse paper, though haven't yet fully grokked it. My understanding here is that you're trying to learn the conditional probabilities in a Bayes net from samples. The "non-amortized" way to do this for them is to choose a (non-unique) maximal inverse factorization that satisfies some d-separation condition, then guess the conditional probabilities on the latent-generating process by just observing frequencies of conditional events -- but of course this is very inefficient, in particular because th... (read more)
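(A toy illustration of the frequency-counting step, with hypothetical numbers for a two-node net $A \to B$; the inefficiency shows up because, with many parents, the number of conditioning events you need frequencies for grows exponentially:)

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Forward model: A ~ Bernoulli(0.3); B|A=1 ~ Bernoulli(0.9), B|A=0 ~ Bernoulli(0.2)
A = rng.random(n) < 0.3
B = np.where(A, rng.random(n) < 0.9, rng.random(n) < 0.2)

# "Inverse" conditionals estimated by raw conditional frequencies:
print(A[B].mean())    # P(A=1 | B=1); Bayes gives 0.27/0.41 ~ 0.66
print(A[~B].mean())   # P(A=1 | B=0); Bayes gives 0.03/0.59 ~ 0.05
```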

FWIW, I like John's description above (and probably object much less than baseline to humorously confrontational language in research contexts :). I agree that for most math contexts, using the standard definitions with morphism sets and composition mappings is easier to prove things with, but I think the intuition described here is great and often in better agreement with how mathematicians intuit about category-theoretic constructions than the explicit formalism.

This phenomenon exists, but is strongly context-dependent. Areas of math adjacent to abstract algebra are actually extremely good at updating conceptualizations when new and better ones arrive. This is for a combination of two related reasons: first, abstract algebra is significantly concerned about finding "conceptual local optima" of ways of presenting standard formal constructions, and these are inherently stable and require changing infrequently; second, when a new and better formalism is found, it tends to be so powerfully useful that papers that use ... (read more)

This is very nice! So the way I understand what you linked is this: in the "Edgeworth expansion" picture I was distilling, the order-$d$ approximation for the probability distribution associated to the sum variable $S_n$ above is $\phi(t)\,P_d(t,\epsilon)$, where $\phi(t)$ is the probability distribution associated with a Gaussian and $P_d$ is a polynomial in $t$ and the perturbative parameter $\epsilon = 1/\sqrt{n}$. The paper you linked says that a related natural thing to do is to take the Fourier transform, which will be th... (read more)
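For concreteness, the standard form of the expansion, with $\mathrm{He}_k$ the Hermite polynomials and $\kappa_k$ the standardized cumulants of a single summand:

$$p_n(t) \;\approx\; \phi(t)\left(1 + \frac{\kappa_3}{6\sqrt{n}}\,\mathrm{He}_3(t) + \frac{1}{n}\left[\frac{\kappa_4}{24}\,\mathrm{He}_4(t) + \frac{\kappa_3^2}{72}\,\mathrm{He}_6(t)\right] + \dots\right).$$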

3Simon Pepin Lehalleur
Q: How can I use LaTeX in these comments? I tried to follow https://www.lesswrong.com/tag/guide-to-the-lesswrong-editor#LaTeX but it does not seem to render.

Here is the simplest case I know, which is a sum of dependent identically distributed variables. In physical terms, it is about the magnetisation of the 1d Curie-Weiss (=mean-field Ising) model. I follow the notation of the paper https://arxiv.org/abs/1409.2849 for ease of reference; this is roughly Theorem 8 + Theorem 10:

Let $M_n=\sum_{i=1}^n \sigma(i)$ be the sum of $n$ dependent Bernoulli random variables $\sigma(i)\in\{\pm 1\}$, where the joint distribution is given by
$$\mathbb{P}(\sigma)\sim \exp\left(\frac{\beta}{n}M_n^2\right).$$
Then:

* When $\beta=1$, the fluctuations of $M_n$ are very large and we have an anomalous CLT: $\frac{M_n}{n^{3/4}}$ converges in law to the probability distribution $\sim \exp(-\frac{x^4}{12})$.
* When $\beta<1$, $M_n$ satisfies a normal CLT: $\frac{M_n}{n^{1/2}}$ converges to a Gaussian.
* When $\beta>1$, $M_n$ does not satisfy a limit theorem (there are two lower-energy configurations).

In statistical mechanics, this is an old result of Ellis-Newman from 1978; the paper above puts it into a more systematic probabilistic framework, and proves finer results about the fluctuations (Theorems 16 and 17). The physical intuition is that $\beta=1$ is the critical inverse temperature at which the 1d Curie-Weiss model goes through a continuous phase transition.

In general, one should expect such anomalous CLTs in the thermodynamic limit of continuous phase transitions in statistical mechanics, with the shape of the CLT controlled by the Taylor expansion of the microcanonical entropy around the critical parameters. Indeed Ellis and his collaborators have worked out a number of such cases for various mean-field models (which according to Meliot-Nikeghbali also fit in their mod-Gaussian framework). It is of course very difficult to prove such results rigorously outside of mean-field models.

Thanks for asking! I said in a later shortform that I was trying to do too many things in this post, with only vague relationships between them, and I'm planning to split it into pieces in the future.

Your 1-3 are mostly correct. I'd comment as follows:

  1. (and also kind of 3) That advice of using the tempered local Bayesian posterior (I like the term -- let's shorten it to TLBP) is mostly aimed at non-SLT researchers (but may apply also to some SLT experiments). The suggestion is simpler than to compute expectations. Rather, it's just to run a single experimen

... (read more)
1Simon Pepin Lehalleur
I mentioned samples and expectations for the TLBP because it seems possible (and is suggested by the role of degeneracies in SLT) that different samples can correspond to qualitatively different degradations of the model. Cartoon picture: besides the robust circuit X of interest, there are "fragile" circuits A and B, and most samples at a given loss scale degrade either A or B but not both.

I agree that there is no strong reason to overindex on the Watanabe temperature, which is derived from an idealised situation: global Bayesian inference, degeneracies exactly at the optimal parameters, "relatively finite variance", etc. The scale you propose seems quite natural, but I will let LLC-practitioners comment on that.

Why you should try degrading NN behavior in experiments.

I got some feedback on the post I wrote yesterday that seems right. The post is trying to do too many things, and not properly explaining what it is doing, why this is reasonable, and how the different parts are related.

I want to try to fix this, since I think the main piece of advice in this post is important, but gets lost in all the mess.

This main point is:

experimentalists should in many cases run an experiment on multiple neural nets with a variable complexity dial that allows some "natural" deg

... (read more)

Thanks so much for this! Will edit

Cool! I haven't seen these, good to have these to point to (and I'm glad that Richard Ngo has thought about this)

Thanks for the context! I didn't follow this discourse very closely, but I think your "optimistic assumptions" post wasn't the main offender -- it's reasonable to say that "it's suspicious when people are bad at backchaining but think they're good at backchaining or their job depends on backchaining more than they are able to". I seem to remember reading some responses/ related posts that I had more issues with, where the takeaway was explicitly that "alignment researchers should try harder at backchaining and one-shotting baba-is-you-like problems because... (read more)

So the oscillating phase formula is about approximately integrating the function $e^{f(x)/\hbar}$ against various "priors" $p(x)$ (or more generally any fixed function $g$), where $f$ is a Lagrangian (think energy) and $\hbar$ is a small parameter. It gives an asymptotic series in powers of $\hbar$. The key point is that (more or less) the $k$th perturbative term only depends on the $k$th-order power series expansion of $f$ around the "stationary points" (i.e., saddle points, where $\mathrm{Jac}(f) = 0$) when $f$ is imaginary, on the maxima of $f$ when $f$ is real, and there is a mixed form that depen... (read more)
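(In one dimension, writing $f = i\phi$ with $\phi$ real, the oscillatory case reads

$$\int e^{i\phi(x)/\hbar}\, g(x)\, dx \;\sim\; \sum_{\phi'(x_*) = 0} e^{i\phi(x_*)/\hbar}\, e^{\frac{i\pi}{4}\operatorname{sgn}\phi''(x_*)}\, \sqrt{\frac{2\pi\hbar}{|\phi''(x_*)|}}\,\big(g(x_*) + O(\hbar)\big),$$

with each higher-order correction depending only on finitely many derivatives of $\phi$ and $g$ at the stationary points.)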

Alignment is not all you need. But that doesn't mean you don't need alignment.

One of the fairytales I remember reading from my childhood is the "Three sillies". The story is about a farmer encountering three episodes of human silliness, but it's set in one more frame story of silliness: his wife is despondent because there is an axe hanging in their cottage, and she thinks that if they have a son, he will walk underneath the axe and it will fall on his head.

The frame story was much more memorable to me than any of the "body" stories, and I randomly remembe... (read more)

FYI I think that by the time I wrote Optimistic Assumptions, Longterm Planning, and "Cope", I had updated on the things you criticize about it here (but, I had started writing it awhile ago from a different frame and there is something disjointed about it)

But, like, I did mean both halves of this seriously:

I think you should be scared about this, if you're the sort of theoretic researcher, who's trying to cut at the hardest parts of the alignment problem (whose feedback loops are weak or nonexistent) 

I think you should be scared about this, if you'r

... (read more)
8Noosphere89
Something like this is a big reason why I'm not a fan of MIRI, because I think this sort of neuroticism is somewhat encouraged by that group. Also, remember that the current LW community is selected for scrupulosity and neuroticism, which IMO is not that good for solving a lot of problems. Richard Ngo and John Maxwell illustrate it here: https://www.lesswrong.com/posts/uM6mENiJi2pNPpdnC/takeaways-from-one-year-of-lockdown#g4BJEqLdvzgsjngX2 https://www.lesswrong.com/posts/uM6mENiJi2pNPpdnC/takeaways-from-one-year-of-lockdown#3GqvMFTNdCqgfRNKZ

I'm not exactly sure what you mean by "what you want" here. It is not the case that you can exactly reconstruct most probability distributions you'll encounter in real life from their moments/cumulants (hence the expansion is perturbative, not exact).

But in the interpretability/ field-theoretic model of wide NN's point of view, this is what you want (specifically, the fourth-order correction)

1transhumanist_atom_understander
Well, usually I'm not inherently interested in a probability density function, I'm using it to calculate something else, like a moment or an entropy or something. But I guess I'll see what you use it for in future posts.

Yes, I actually thought about this a bit. It is definitely the case that the LC (or RLCT) in the SLT context is also exactly a (singular) stationary phase expansion. Unfortunately, the Fourier transform of a random variable, including a higher-dimensional one, really does have an isolated nondegenerate maximum at 0 (unless the support of your random variable is contained in a union of linear subspaces, which is kinda boring/ reducible to simpler contexts). Maybe if you think about some kind of small perturbation of a lower-dimensional system, you can get s... (read more)
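(The computation behind this: for a random variable $X$ with mean $\mu$ and covariance $\Sigma$, the characteristic function satisfies

$$\log \mathbb{E}\, e^{i\langle \xi, X\rangle} \;=\; i\langle \xi, \mu\rangle - \tfrac{1}{2}\,\xi^{\top}\Sigma\,\xi + O(|\xi|^3),$$

so $|\hat{p}(\xi)|$ has a maximum at $\xi = 0$ with Hessian $-\Sigma$, which is nondegenerate exactly when $X$ is not supported on a proper affine subspace.)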

2Alexander Gietelink Oldenziel
Sorry these words are not super meaningful to me. Would you be able to translate this from physics speak ?

Thanks for writing this! I've participated in some similar conversations and on balance, think that working in a lab is probably net good for most people assuming you have a reasonable amount of intellectual freedom (I've been consistently impressed by some papers coming out of Anthropic).

Still, one point made by Kaarel in a recent conversation seemed like an important update against working in a lab (and working on "close-to-the-metal" interpretability in general). Namely, I tend to not buy arguments by MIRI-adjacent people that "if we share our AI insigh... (read more)

Maybe a reductive summary is "general is good if outer alignment is easy but inner alignment is hard, but bad in the opposite case"

2Alexander Gietelink Oldenziel
Isn't it the other way around? If inner alignment is hard, then general is bad, because applying less selection pressure, i.e. more generality and more simplicity prior, means more daemons/gremlins

I haven't thought about this enough to have a very mature opinion. On one hand, being more general means you're liable to Goodhart more (i.e., with enough deeply general processing power, you understand that manipulating the market to start World War 3 will make your stock portfolio grow, so you act misaligned). On the other hand, being less general means that AIs are more liable to "partially memorize" how to act aligned in familiar situations, and go off the rails when sufficiently out-of-distribution situations are encountered. I think this is related to the question of "how general are humans", and how stable human values are to being much more or much less general

2Alexander Gietelink Oldenziel
I guess I'm mostly thinking about the regime where AIs are more capable and general than humans. At first glance the latter failure mode seems more like a capability failure, something one would expect to go away as AI truly surpasses humans. It doesn't seem core to the alignment problem to me.
2Dmitry Vaintrob
Maybe a reductive summary is "general is good if outer alignment is easy but inner alignment is hard, but bad in the opposite case"

You mean on more general algorithms being good vs. bad?

2Alexander Gietelink Oldenziel
Yes.

Yep, have been recently posting shortforms (as per your recommendation), and totally with you on the "halfbaked-by-design" concept (if Cheeseboard can do it, it must be a good idea right? :)

I still don't agree that free energy is core here. I think that the relevant question, which can be formulated without free energy, is whether various "simplicity/generality" priors push towards or away from human values (and you can then specialize to questions of effective dimension/llc, deep vs. shallow networks, ICL vs. weight learning, generalized ood generalizatio... (read more)

2Noosphere89
I think simplicity/generality priors effectively have 0 effect on whether it's pushed towards or away from human values, and is IMO kind of orthogonal to alignment-relevant questions.
2Alexander Gietelink Oldenziel
Could you give some examples of what you are thinking of here ?

On the surprising effectiveness of linear regression as a toy model of generalization.

Another shortform today (since Sunday is the day of rest). This time it's really a hot take: I'm not confident about the model described here being correct.

Neural networks aren't linear -- that's the whole point. They notice interesting, compositional, deep information about reality. So when people use linear regression as a qualitative comparison point for behaviors like generalization and learning, I tend to get suspicious. Nevertheless, the track record of linear regre... (read more)

4Alexander Gietelink Oldenziel
Loving this! Epic.

A couple things that come to mind:

* Linear features = sufficient statistics of exponential families?
  * The simplest case is that of Gaussians and covariance matrices (which comes down to linear regression)
  * formalized by the GPD theorem
  * see the generalization by John
  * Exponential families are a fairly good class, but not closed under hierarchical structure. A basic example: a mixture of Gaussians is not exponential, i.e., not described in terms of just linear regression.
* The centrality of ReLU neural networks.
  * Understanding ReLU neural networks is probably 80-90% of understanding NN architectures. At sufficient scale, pure MLPs have the same or better scaling laws than transformers.
  * There are several lines of evidence that gradient descent has an inherent bias towards splines/piecewise-linear functions/tropical polynomials; see e.g. here and references therein.
  * Serious analysis of ReLU neural networks can be done through tropical methods. A key paper is here.

You say: "very cool piece of the analysis here is locally modelling ReLU learning as building a convex function as a max of linear functions (and explaining why non-ReLU learning should exhibit a softer version of the same behavior). This is a somewhat "shallow" point of view on learning, but probably captures a nontrivial part of what's going on, and this predicts that every new weight update only has local effect -- i.e., is felt in a significant way only by a small number of datapoints (the idea being that if you're defining a convex function as the max of a bunch of linear functions, shifting one of the linear functions will only change the values in places where this particular linear function was dominant). The way I think about this phenomenon is that it's a good model for "local learning", i.e., learning closer to memorization on the memorization-generalization spectrum that only updates the behavior on a small cluster of similar datapoints (e.g. the LLM
7Adam Shai
Ari's work is on Arxiv here

Thanks! I definitely believe this, and I think we have a lot of evidence for this in both toy models and LLMs (I'm planning a couple of posts on this idea of "training stories"), and also theoretical reasons in some contexts. I'm not sure how easy it is to extend the specific approach used in the proof for parity to a general context. I think it inherently uses the fact of orthogonality of Fourier functions on boolean inputs, and understanding other ML algorithms in terms of nice orthogonal functions seems hard to do rigorously, unless you either make some... (read more)

I'm not sure I agree with this -- this seems like you're claiming that misalignment is likely to happen through random diffusion. But I think most worries about misalignment are more about correlated issues, where the training signal consistently disincentivizes being aligned in a subtle way (e.g. a stock trading algorithm manipulating the market unethically because the pressure of optimizing income at any cost diverges from the pressure of doing what its creators would want it to do). If diffusion were the issue, it would also affect humans and not be spe... (read more)

4Alexander Gietelink Oldenziel
The free energy talk probably confuses more than it elucidates. I'm not talking about random diffusion per se, but about the connection between uniform sampling and simplicity, and the simplicity-accuracy tradeoff. I've tried explaining more carefully where my thinking is currently at in my reply to Lucius. Also caveat that shortforms are halfbaked-by-design.
6Noosphere89
I agree that Goodharting is an issue, and this has been discussed as a failure mode, but a lot of AI risk writing definitely assumed that something like random diffusion was a non-trivial component of how AI alignment failures happened. For example, pretty much all of the reasoning around random programs being misaligned/bad is using the random diffusion argument.