LESSWRONG
LW

All of Zach Furman's Comments + Replies

I’ve been trying to understand modules for a long time. They’re a particular algebraic structure in commutative algebra which seems to show up everywhere any time you get anywhere close to talking about rings - and I could never figure out why. Any time I have some simple question about algebraic geometry, for instance, it almost invariably terminates in some completely obtuse property of some module. This confused me. It was never particularly clear to me from their definition why modules should be so central, or so “deep.”

I’m going to try to explain the ... (read more)

9Daniel Murfet6d

Yeah it's a nice metaphor. And just as the most important thing in a play is who dies and how, so too we can consider any element x∈M as a module homomorphism ϕx:R→M and consider the kernel Ann(x)=Kerϕx which is called the annihilator (great name). Then ϕx factors as R→R/Ann(x)→M where the second map is injective, and so in some sense M is "made up" of all sorts of quotients R/I where I varies over annihilators of elements. There was a period where the structure of rings was studied more through the theory of ideals (historically this as in turn motivated by the idea of an "ideal" number) but through ideas like the above you can see the theory of modules as a kind of "externalisation" of this structure which in various ways makes it easier to think about. One manifestation of this I fell in love with (actually this was my entrypoint into all this since my honours supervisor was an old-school ring theorist and gave me Stenstrom to read) is in torsion theory.

9Simon Pepin Lehalleur7d

Some of my favourite topics in pure mathematics! Two quick general remarks: 1. I don't hold such a strong qualitative distinction between the theory of group actions, and in particular linear representations, and the theory of modules. They are both ways to study an object by having it act on auxiliary structures/geometry. Because there are in general fewer tools to study group actions than modules, a lot of pure mathematics is dedicated to linearizing the former to the latter in various ways. 2. There is another perspective on modules over commutative rings which is central to algebraic geometry: modules are a specific type of sheaves which generalize vector bundles. More precisely, a module over a commutative ring R is equivalent to a "quasicoherent sheaf" on the affine scheme Spec(R), and finitely generated projective modules correspond in this way to vector bundles over Spec(R). Once you internalise this equivalence, most of the basic theory of modules in commutative algebra becomes geometrically intuitive, and this is the basis for many further developments in algebraic geometry.

Zach Furman's Shortform

Zach Furman2mo106

Yeah, I completely agree this is a good research direction! My only caveat is I don’t think this is a silver bullet in the same way capabilities benchmarks are (not sure if you’re arguing this, just explaining my position here). The inevitable problem with interpretability benchmarks (which to be clear, your paper appears to make a serious effort to address) is that you either:

Train the model in a realistic way - but then you don’t know if the model really learned the algorithm you expected it to
Train the model to force it to learn a particular algorithm -

... (read more)

Zach Furman's Shortform

Zach Furman2mo7633

Understanding deep learning isn’t a leaderboard sport - handle with care.

Saliency maps, neuron dissection, sparse autoencoders - each surged on hype, then stalled^[1] when follow‑up work showed the insight was mostly noise, easily spoofed, or valid only in cherry‑picked settings. That risks being negative progress: we spend cycles debunking ghosts instead of building cumulative understanding.

The root mismatch is methodological. Mainstream ML capabilities research enjoys a scientific luxury almost no other field gets: public, quantitative benchmarks tha... (read more)

6Thomas Kwa2mo

Why not require model organisms with known ground truth and see if the methods accurately reveal them, like in the paper? From the abstract of that paper: This reduces the problem from covering all sources of doubt to making a sufficiently realistic model organism. This was our idea with InterpBench, and I still find it plausible that with better execution one could iterate on it (or the probe, crosscoder, etc. equivalent) and make interpretability progress.

1[comment deleted]2mo

4tailcalled2mo

Research thrives on answering important questions. However, the trouble with interpretability for AI safety is that there are no important questions getting answered. Typically the real goal is to understand the neural networks well enough to know whether they are scheming, but that's a threefold bad idea: * You cannot make incremental progress on it; either you know whether they are scheming, or you don't * Scheming is not the main AI danger/x-risk * Interpretability is not a significant bottleneck in detecting scheming (we don't even have good, accessible examples of contexts where AI is applied and scheming would be a huge risk) To solve this, people substitute various goals, e.g. predictive accuracy, under the assumption that incremental predictive accuracy is helpful. But we already have perfectly adequate ways of predicting the behavior of the neural networks, it's called running the neural networks.

RA x ControlAI video: What if AI just keeps getting smarter?

Zach Furman2mo30

It's not relevant to predictions about how well the learning algorithm generalises. And that's the vastly more important factor for general capabilities.

Quite tangential to your point, but the problem with the universal approximation theorem is not just "it doesn't address generalization" but that it doesn't even fulfill its stated purpose: it doesn't answer the question of why neural networks can space-efficiently approximate real-world functions, even with arbitrarily many training samples. The statement "given arbitrary resources, a neural network can a... (read more)

Interpreting Complexity

Zach Furman4mo63

After convergence, the samples should be viewed as drawn from the stationary distribution, and ideally they have low autocorrelation, so it doesn't seem to make sense to treat them as a vector, since there should be many equivalent traces.

This is a very subtle point theoretically, so I'm glad you highlighted this. Max may be able to give you a better answer here, but I'll try my best to attempt one myself.

I think you may be (understandably) confused about a key aspect of the approach. The analysis isn't focused on autocorrelation within individual traces, ... (read more)

3rpglover644mo

Thanks! I did notice that we were comparing traces at the same parameters values by the third read-through, so I appreciate the clarification. I think the thing that would have made this clear to me is an explicit mention that it only makes sense to compare traces within the same run.

Stan van Wingerden's Shortform

Zach Furman7mo30

IIRC @jake_mendel and @Kaarel have thought about this more, but my rough recollection is: a simple story about the regularization seems sufficient to explain the training dynamics, so a fancier SLT story isn't obviously necessary. My guess is that there's probably something interesting you could say using SLT, but nothing that simpler arguments about the regularization wouldn't tell you also. But I haven't thought about this enough.

Singular learning theory: exercises

Zach Furman9mo10

Good catch, thanks! Fixed now.

Generalization, from thermodynamics to statistical physics

Zach Furman2y30

It's worth noting that Jesse is mostly following the traditional "approximation, generalization, optimization" error decomposition from learning theory here - where "generalization" specifically refers to finite-sample generalization (gap between train/test loss), rather than something like OOD generalization. So e.g. a failure of transformers to solve recursive problems would be a failure of approximation, rather than a failure of generalization. Unless I misunderstood you?

5Noosphere892y

Ok, I understand now. You haven't misunderstood me. I'm not sure what to do with my comment above now.

Generalization, from thermodynamics to statistical physics

Zach Furman2y*Ω130

Repeating a question I asked Jesse earlier, since others might be interested in the answer: how come we tend to hear more about PAC bounds than MAC bounds?

3Jesse Hoogland2y

I think this mostly has to do with the fact that learning theory grew up in/next to computer science where the focus is usually worst-case performance (esp. in algorithmic complexity theory). This naturally led to the mindset of uniform bounds. That and there's a bit of historical contingency: people started doing it this way, and early approaches have a habit of sticking.

Singular learning theory and bridging from ML to brain emulations

Zach Furman2y21

Note that in the SLT setting, "brains" or "neural networks" are not the sorts of things that can be singular (or really, have a certain $λ$ ) on their own - instead they're singular for certain distributions of data.

This is a good point I often see neglected. Though there's some sense in which a model $p (x | w)$ can "be singular" independent of data: if the parameter-to-function map $w \mapsto p (x | w)$ is not locally injective. Then, if a distribution $p (x)$ minimizes the loss, the preimage of $p (x)$ in parameter space can have non-t... (read more)

TOMORROW: the largest AI Safety protest ever!

Zach Furman2y105

A possible counterpoint, that you are mostly advocating for awareness as opssosed to specific points is null, since pretty much everyone is aware of the problem now - both society as a whole, policymakers in particular, and people in AI research and alignment.

I think this specific point is false, especially outside of tech circles. My experience has been that while people are concerned about AI in general, and very open to X-risk when they hear about it, there is zero awareness of X-risk beyond popular fiction. It's possible that my sample isn't representa... (read more)

Arguments for optimism on AI Alignment (I don't endorse this version, will reupload a new version soon.)

Zach Furman2y*1813

In the cybersecurity analogy, it seems like there are two distinct scenarios being conflated here:

1) Person A says to Person B, "I think your software has X vulnerability in it." Person B says, "This is a highly specific scenario, and I suspect you don't have enough evidence to come to that conclusion. In a world where X vulnerability exists, you should be able to come up with a proof-of-concept, so do that and come back to me."

2) Person B says to Person A, "Given XYZ reasoning, my software almost certainly has no critical vulnerabilities of any kind. I'm ... (read more)

7bigjeff52y

The reason Person A in scenario 2 has the intuition that Person B is very wrong is because there are dozens, if not hundreds of examples where people claimed no vulnerabilities and were proven wrong. Usually spectacularly so, and often nearly immediately. Consider the fact that the most robust software developed by the most wealthy and highly motivated companies in the world, who employ vast teams of talented software engineers, have monthly patch schedules to fix their constant stream vulnerabilities, and I think it's pretty easy to immediately discount anybody's claim of software perfection without requiring any further evidence. All the evidence Person A needs is the complete and utter lack of anybody having achieved such a thing in the history of software to discount Person B's claims. I've never heard of an equivalent example for AI. It just seems to me like Scenario 2 doesn't apply, or at least it cannot apply at this point in time. Maybe in 50 years we'll have the vast swath of utter failures to point to, and thus a valid intuition against someone's 9-9's confidence of success, but we don't have that now. Otherwise people would be pointing out examples in these arguments instead of vague unease regarding problem spaces.

Thomas Kwa's MIRI research experience

Zach Furman2y32

I wish I had a more short-form reference here, but for anyone who wants to learn more about this, Rocket Propulsion Elements is the gold standard intro textbook. We used in my university rocketry group, and it's a common reference to see in industry. Fairly well written, and you should only need to know high school physics and calculus.

Alexander Gietelink Oldenziel's Shortform

Zach Furman2y10

Obviously this is all speculation but maybe I'm saying that the universal approximation theorem implies that neural architectures are fractal in space of all distributtions (or some restricted subset thereof)?

Oh I actually don't think this is speculation, if (big if) you satisfy the conditions for universal approximation then this is just true (specifically that the image of $W$ is dense in function space). Like, for example, you can state Stone-Weierstrass as: for a Hausdorff space X, and the continuous functions under the sup norm $C (X, R)$ , th... (read more)

2Alexander Gietelink Oldenziel2y

Ooooo okay so this seems like it's directly pointing to the fractal story! Exciting!

Alexander Gietelink Oldenziel's Shortform

Zach Furman2y10

Sorry, I realized that you're mostly talking about the space of true distributions and I was mainly talking about the "data manifold" (related to the structure of the map $x \mapsto p (x ∣ w^{*})$ for fixed $w^{*}$ ). You can disregard most of that.

Though, even in the case where we're talking about the space of true distributions, I'm still not convinced that the image of $W$ under $p (x ∣ w)$ needs to be fractal. Like, a space-filling assumption sounds to me like basically a universal approximation argument - you're assuming that the image of&n... (read more)

3Alexander Gietelink Oldenziel2y

Obviously this is all speculation but maybe I'm saying that the universal approximation theorem implies that neural architectures are fractal in space of all distributtions (or some restricted subset thereof)? Curious what's your beef with universal approximation? Stone-weierstrass isn't quantitative - is that the reason? If true it suggest the fractal dimension (probably related to the information dimension I linked to above) may be important.

2Alexander Gietelink Oldenziel2y

Obviously this is all speculation but maybe I'm saying that the universal approximation theorem implies that neural architectures are fractal in space of all distributtions (or some restricted subset thereof)? Stone-weierstrass isn't quantitative. If true it suggest the fractal dimension (probably related to the information dimension I linked to above) may be important.

Alexander Gietelink Oldenziel's Shortform

Zach Furman2y10

Very interesting, glad to see this written up! Not sure I totally agree that it's necessary for $W$ to be a fractal? But I do think you're onto something.

In particular you say that "there are points $y$ in the larger dimensional space that are very (even arbitrarily) far from $W$ ," but in the case of GPT-4 the input space is discrete, and even in the case of e.g. vision models the input space is compact. So the distance must be bounded.

Plus if you e.g. sample a random image, you'll find there's usually a finite distance you need to trave... (read more)

2Alexander Gietelink Oldenziel2y

The input and output spaces etc Ω are all discrete but the spaces of distributions Δ(Ω) on those spaces are infinite (but still finite-dimensional). It depends on what kind of metric one uses, compactness assumptions etc whether or not you can be arbitrarily far. I am being rather vague here. For instance, if you use the KL-divergence, then K(q|puniform) is always bounded - indeed it equals the entropy of the true distribution H(q)! I don't really know what ML people mean by the data manifold so won't say more about that. I am talking about the space W of parameter values of a conditional probability distribution p(x|w). I think that W having nonconstant local dimension doesn't seem that relevant since the largest dimensional subspace would dominate? Self-intersections and singularities could certainly occur here. (i) singularities in the SLT sense have to do with singularities in the level sets of the KL-divergence (or loss function) - don't see immediately how these are related to the singularities that you are talking about here (ii) it wouldn't increase the dimensionality (rather the opposite). The fractal dimension is important basically because of space-filling curves : a space that has a low-dimensional parameterization can nevertheless have a very large effective dimensions when embedded fractally into a larger-dimensional space. These embeddings can make a low-dimensional parameterization effectively have higher dimension.

Against Almost Every Theory of Impact of Interpretability

Zach Furman2y*10

if the distribution of intermediate neurons shifts so that Othello-board-state-detectors have a reasonably high probability of being instantiated

Yeah, this "if" was the part I was claiming permutation invariance causes problems for - that identically distributed neurons probably couldn't express something as complicated as a board-state-detector. As soon as that's true (plus assuming the board-state-detector is implemented linearly), agreed, you can recover it with a linear probe regardless of permutation-invariance.

This is a more reasonable objection(alth

Zach Furman2y10

The reason the Othello result is surprising to the NTK is that neurons implementing an "Othello board state detector" would be vanishingly rare in the initial distribution, and the NTK thinks that the neuron function distribution does not change during training.

Yeah, that's probably the best way to explain why this is surprising from the NTK perspective. I was trying to include mean-field and tensor programs as well (where that explanation doesn't work anymore).

As an example, imagine that our input space consisted of five pixels, and at initialization neur

... (read more)

2interstice2y

The same argument applies - if the distribution of intermediate neurons shifts so that Othello-board-state-detectors have a reasonably high probability of being instantiated, it will be possible to construct a linear probe detecting this, regardless of the permutation-invariance of the distribution. This is a more reasonable objection(although actually, I'm not sure if independence does hold in the tensor programs framework - probably?)

Against Almost Every Theory of Impact of Interpretability

Zach Furman2y1-2

I think the core surprising thing is the fact that the model learns a representation of the board state. The causal / linear probe parts are there to ensure that you've defined "learns a representation of the board state" correctly - otherwise the probe could just be computing the board state itself, without that knowledge being used in the original model.

This is surprising to some older theories like statistical learning, because the model is usually treated as effectively a black box function approximator. It's also surprising to theories like NTK, mean-... (read more)

2interstice2y

Permutation-invariance isn't the reason that this should be surprising. Yes, the NTK views neurons as being drawn from an IID distribution, but once they have been so drawn, you can linearly probe them as independent units. As an example, imagine that our input space consisted of five pixels, and at initialization neurons were randomly sensitive to one of the pixels. You would easily be able to construct linear probes sensitive to individual pixels even though the distribution over neurons is invariant over all the pixels. The reason the Othello result is surprising to the NTK is that neurons implementing an "Othello board state detector" would be vanishingly rare in the initial distribution, and the NTK thinks that the neuron function distribution does not change during training.

1kave2y

Thanks! The permutation-invariance of a bunch of theories is a helpful concept

Against Almost Every Theory of Impact of Interpretability

Zach Furman2y20

Yeah, that was what I was referring to. Maybe “algorithmic model” isn’t the most precise - what we know is that the NN has an internal model of the board state that’s causal (i.e. the NN actually uses it to make predictions, as verified by interventions). Theoretically it could just be forming this internal model via a big lookup table / function approximation, rather than via a more sophisticated algorithm. Though we’ve seen from modular addition work, transformer induction heads, etc that at least some of the time NNs learn genuine algorithms.

1kave2y

I think that means one of the following should be surprising from theoretical perspectives: 1. That the model learns a representation of the board state 1. Or that a linear probe can recover it 2. That the board state is used causally Does that seem right to you? If so, which is the surprising claim? (I am not that informed on theoretical perspectives)

Against Almost Every Theory of Impact of Interpretability

Zach Furman2y10

Agreed - that alone isn’t particularly much, just one of the easier things to express succinctly. (Though the fact that this predates deep learning does seem significant to me. And the fact that SLT can delineate precisely where statistical learning theory went wrong here seems important too.)

Another is that can explain phenomena like phase transitions, as observed in e.g. toy models of superposition, at a quantitative level. There’s also been a substantial chunk of non-SLT ML literature that has independently rediscovered small pieces of SLT, like failure... (read more)

Against Almost Every Theory of Impact of Interpretability

Zach Furman2y90

Yeah, I can expand on that - this is obviously going be fairly opinionated, but there are a few things I'm excited about in this direction.

The first thing that comes to mind here is singular learning theory. I think all of my thoughts on DL theory are fairly strongly influenced by it at this point. It definitely doesn't have all the answers at the moment, but it's the single largest theory I've found that makes deep learning phenomena substantially "less surprising" (bonus points for these ideas preceding deep learning). For instance, one of the first thin... (read more)

2interstice2y

FWIW most potential theories of deep learning are able to explain these, I don't think this distinguishes SLT particularly much.

Against Almost Every Theory of Impact of Interpretability

Zach Furman2y169

I can't speak for Richard, but I think I have a similar issue with NTK and adjacent theory as it currently stands (beyond the usual issues). I'm significantly more confident in a theory of deep learning if it cleanly and consistently explains (or better yet, predicts) unexpected empirical phenomena. The one that sticks out most prominently in my mind, that we see constantly in interpretability, is this strange correspondence between the algorithmic "structure" we find in trained models (both ML and biological!) and "structure" in the data generating proces... (read more)

2Noosphere899mo

A note is that as it turns out, OthelloGPT learned a bag of heuristics, and there was no clean algorithm: https://www.lesswrong.com/posts/gcpNuEZnxAPayaKBY/othellogpt-learned-a-bag-of-heuristics-1

2kave2y

What is the work that finds the algorithmic model of the game itself for Othello? I'm aware of (but not familiar with) some interpretability work on Othello-GPT (Neel Nanda's and Kenneth Li), but thought it was just about board state representations.

interstice2y110

I intended my comment to apply to "theories of deep learning" in general, the NTK was only meant as an example. I agree that the NTK has problems such that it can at best be a 'provisional' grand theory. The big question is how to think about feature learning. At this point, though, there are a lot of contenders for "feature learning theories" - the Maximal Update Parameterization, Depth Corrections to the NTK, Perturbation Theory, Singular Learning Theory, Stochastic Collapse, SGD-Induced Sparsity....

So although I don't think the NTK can be a final answe... (read more)

Thoughts on Loss Landscapes and why Deep Learning works

Zach Furman2y1312

Someone with better SLT knowledge might want to correct this, but more specifically:

Studying the "volume scaling" of near-min-loss parameters, as beren does here, is really core to SLT. The rate of change of this volume as you change your epsilon loss tolerance is called the "density of states" (DOS) function, and much of SLT basically boils down to an asymptotic analysis of this function. It also relates the terms in the asymptotic expansion to things you care about, like generalization performance.

You might wonder why SLT needs so much heavy machinery, s... (read more)

3beren2y

Looks like I really need to study some SLT! I will say though that I haven't seen many cases in transformer language models where the eigenvalues of the Hessian are 90% zeros -- that seems extremely high.

When is correlation transitive?

Zach Furman2y30

If anyone wants an interactive visual for the correlation upper and lower bounds, I made one for myself here: https://www.math3d.org/chXa4xZrC.

The x, y, and z axes are the correlations between X and Y, and Y and Z, and X and Z respectively. Everything inside the blue surface is possible. There are also some sliders to help visualize level sets.

Residual stream norms grow exponentially over the forward pass

Zach Furman2y20

Yep, pre-LN transformers avoid the vanishing gradient problem.
Haven't checked this myself, but the phenomenon seems to be fairly clean? See figure 3.b in the paper I linked, or figure 1 in this paper.

I actually wouldn't think of vanishing/exploding gradients as a pathological training problem but a more general phenomenon about any dynamical system. Some dynamical systems (e.g. the sigmoid map) fall into equilibria over time, getting exponentially close to one. Other dynamical systems (e.g. the logistic map) become chaotic, and similar trajectories diverge... (read more)

Residual stream norms grow exponentially over the forward pass

Zach Furman2y*12-1

Exponential growth is a fairly natural thing to expect here, roughly for the same reason that vanishing/exploding gradients happen (input/output sensitivity is directly related to param/output sensitivity). Based on this hypothesis, I'm preregistering the prediction that (all other things equal) the residual stream in post-LN transformers will exhibit exponentially shrinking norms, since it's known that post-LN transformers are more sensitive to vanishing gradient problems compared to pre-LN ones.

Edit: On further thought, I still think this intuition is co... (read more)

1StefanHex2y

Oh I hadn't thought of this, thanks for the comment! I don't think this apply to Pre-LN Transformers though? 1. In Pre-LN transformers every layer's output is directly connected to the residual stream (and thus just one unembedding away from logits), wouldn't this remove the vanishing gradient problem? I just checkout out the paper you linked, they claim exponentially vanishing gradients is a problem (only) in Post-LN, and how Pre-LN (and their new method) prevent the problem, right? 2. The residual stream norm curves seem to follow the exponential growth quite precisely, do vanishing gradient problems cause such a clean result? I would have intuitively expected the final weights to look somewhat pathological if they were caused by such a problem in training. Re prediction: Isn't the sign the other way around? Vanishing gradients imply growing norms, right? So vanishing gradients in Post-LN would cause gradients to grow exponentially towards later (closer to output) layers (they also plot something like this in Figure 3 in the linked paper). I agree with the prediction that Post-LN will probably have even stronger exponential norm growth, but I think that this has a different cause to what we find here.

Deep learning models might be secretly (almost) linear

Zach Furman2y20

Great discussion here!

Leaving a meta-comment about priors: on one hand, almost-linear features seem very plausible (a priori) for almost-linear neural networks; on the other, linear algebra is probably the single mathematical tool I'd expect ML researchers to be incredibly well-versed in, and the fact that we haven't found a "smoking gun" at this point with so much potential scrutiny makes me suspect.

And while this is a very natural hypothesis to test, and I'm excited for people to do so, it seems possible that the field's familiarity with linear methods i... (read more)

Exploring the Lottery Ticket Hypothesis

Zach Furman2y107

A bit of a side note, but I don't even think you need to appeal to new architectures - it looks like the NTK approximation performs substantially worse even with just regular MLPs (see this paper, among others).

Interpreting Neural Networks through the Polytope Lens

Zach Furman2y70

For anyone who wants to play around with this themselves, you might be interested in a small Colab notebook I made, with some interactive 2D and 3D plots.

Are extrapolation-based AIs alignable?

Zach Furman2y*10

To be clear, I don't know the answer to this!

Spitballing here, the key question to me seems to be about the OOD generalization behavior of ML models. Models that receive similarly low loss on the training distribution still have many different ways they can behave on real inputs, so we need to know what generalization strategies are likely to be learned for a given architecture, training procedure, and dataset. There is some evidence in this direction, suggesting that ML models are biased towards a simplicity prior over generalization strategies.

If this is... (read more)

Are extrapolation-based AIs alignable?

Zach Furman2y30

In other words, does there exist any dataset such that generating extrapolations from it leads to good outcomes, even in the hands of bad actors?

I think this is an important question to ask, but "even in the hands of bad actors" is just too difficult a place to start. I'm sure you're aware, but it's an unsolved problem whether there exists a dataset / architecture / training procedure such that "generating extrapolations from it leads to good outcomes," for sufficiently capable ML models, even in the hands of good actors. (And the "bad actor" piece can at ... (read more)

2cousin_it2y

Ok, let's assume good actors all around. Imagine we have a million good people volunteering to generate/annotate/curate the dataset, and the eventual user of the AI will also be a good person. What should we tell these million people, what kind of dataset should they make?

Empirical risk minimization is fundamentally confused

Zach Furman2y61

My summary (endorsed by Jesse):

1. ERM can be derived from Bayes by assuming your "true" distribution is close to a deterministic function plus a probabilistic error, but this fact is usually obscured
2. Risk is not a good inner product (naively) - functions with similar risk on a given loss function can be very different
3. The choice of functional norm is important, but uniform convergence just picks the sup norm without thinking carefully about it
4. There are other important properties of models/functions than just risk
5. Learning theory has failed to find tight (generalization) bounds, and bounds might not even be the right thing to study in the first place

Coordinate-Free Interpretability Theory

Zach Furman2y170

Since nobody here has made the connection yet, I feel obliged to write something, late as I am.

To make the problem more tractable, suppose we restrict our set of coordinate changes to ones where the resulting functions can still (approximately) be written as a neural network. (These are usually called "reparameterizations.") This occurs when multiple neural networks implement (approximately) the same function; they're redundant. One trivial example of this is the invariance of ReLU networks to scaling one layer by a constant, and the next layer by the inve... (read more)

3johnswentworth2y

That's a great connection which I had indeed not made, thanks! Strong-upvoted.

Interpreting Neural Networks through the Polytope Lens

Zach Furman2y40

Dropping some late answers here - though this isn't my subfield, so forgive me if I mess things up here.

Correct me if I'm wrong, but it struck while reading this that you can think of a neural network as learning two things at once:
a classification of the input into 2^N different classes (where N is the total number of neurons), each of which gets a different function applied to it
those functions themselves

This is exactly what a spline is! This is where the spline view of neural networks comes from (mentioned in Appendix C of the post). What you call "clas... (read more)

A Neural Network undergoing Gradient-based Training as a Complex System

Zach Furman2y20

The field of complex systems seems like a great source of ideas for interpretability and alignment. In lieu of a longer comment, I'll just leave this great review by Teehan et al. on emergent structures in LLMs. Section 3 in particular is great.

AGI in sight: our look at the game board

Zach Furman2y3859

But in the last few years, we’ve gotten: [...]
Robots (Boston Dynamics)

Broadly agree with this post, though I'll nitpick the inclusion of robotics here. I don't think it's progressing nearly as fast as ML, and it seems fairly uncontroversial that we're not nearly as close to human-level motor control as we are to (say) human-level writing. I only bring this up because a decent chunk of bad reasoning (usually underestimation) I see around AGI risk comes from skepticism about robotics progress, which is mostly irrelevant in my model.

1quetzal_rainbow2y

I think you can find it interesting: https://ai.googleblog.com/2022/12/rt-1-robotics-transformer-for-real.html?m=1

1a3orn2y138

I'm not sure why some skepticism would be unjustified from lack of progress in robots.

Robots require reliability, because otherwise you destroy hardware and other material. Even in areas where we have had enormous progress, (LLMs, Diffusion) we do not have reliability, such that you can trust the output of them without supervision, broadly. So such lack of reliability seems indicative of perhaps some fundamental things yet to be learned.

7Andrea_Miotti2y

Good point, and I agree progress has been slower in robotics compared to the other areas. I just edited the post to add better examples (DayDreamer, VideoDex and RT-1) of recent robotics advances that are much more impressive than the only one originally cited (Boston Dynamics), thanks to Alexander Kruel who suggested them on Twitter.

6[anonymous]2y

Do you have a hypothesis why? Robotic tasks add obvious tangible value, you would expect significant investment into robotics driven by sota AI models. Yet no one appears to be seriously trying and well funded.

Abstraction As Symmetry and Other Thoughts

Zach Furman2y40

Have you looked into "conditionally conserved" quantities/symmetries here? Most macroscopic properties fall into this category - e.g. the color of a particular material is conserved so long as it doesn't change phase or transmute (i.e. it stays within a particular energy range). This is associated with a (spontaneously-broken) symmetry, since the absorption spectrum of a material can be uniquely determined from its space group. I'd be willing to bet that the only information accessible at a distance (up to a change of variables) are these conditionally con... (read more)

Does the Telephone Theorem give us a free lunch?

Answer by Zach FurmanFeb 15, 202352

This is something I've thought about recently - a full answer would take too long to write, but I'll leave a couple comments.

First, what this implies about learning algorithms can be summarized as "it explains the manifold hypothesis." The Telephone Theorem creates an information bottleneck that limits how much information can be captured at a distance. This means that a 64x64 RGB image, despite being nominally 12288-dimensional, in reality captures far less information and lies on a much lower-dimensional latent space. Chaos has irreversibly dispersed all... (read more)

Abstraction As Symmetry and Other Thoughts

Zach Furman2y10

This is something I've been thinking about recently. In particular, you can generalize this by examining temporary conserved quantities, such as phases of matter (typically produced by spontaneous symmetry-breaking). This supports a far richer theory of information-accessible-at-a-distance than only permanently conserved quantities like energy can provide, and allows for this information to have dynamics like a stochastic process. In fact, if you know a bit of solid-state physics you probably realize exactly how much of our observed macroscopic properties ... (read more)

How does GPT-3 spend its 175B parameters?

Zach Furman2y20

Why does GPT-3 use the same matrix for word embedding and final predictions? I would expect this to constrain the model, and the only potential upsides I can see are saving parameters (lol) and preserving interpretability (lmao)^[8]. Other resources like A Mathematical Framework for Transformer Circuits use different embedding/unembedding matrices - their $W_{E}$ and $W_{U}$ . Perhaps this is not necessary for GPT-3 since the final feed-forward network can perform an appropriate linear transformation, and in A Mathematical Framework they are looking at

Zach Furman3y65

I don't think the game is an alarming capability gain at all - I agree with LawrenceC's comment below. It's more of a "gain-of-function research" scenario to me. Like, maybe we shouldn't deliberately try to train a model to be good at this? If you've ever played Diplomacy, you know the whole point of the game is manipulating and backstabbing your way to world domination. I think it's great that the research didn't actually seem to come up with any scary generalizable techniques or dangerous memetics, but I think ideally shouldn't even be trying in the first place.

Is the speed of training large models going to increase significantly in the near future due to Cerebras Andromeda?

Zach Furman3y20

So if streaming works as well as Cereberas claims, GPUs can do that as well or better.

Hmm, I'm still not sure I buy this, after spending some more time thinking about it. GPUs can't stream a matrix multiplication efficiently, as far as I'm aware. My understanding is that they're not very good at matrix-vector operations compared to matrix-matrix because they rely on blocked matrix multiplies to efficiently use caches and avoid pulling weights from RAM every time.

Cerebras says that the CS-2 is specifically designed for fast matrix-vector operations, and use... (read more)

2jacob_cannell3y

Of course GPUs can and do stream a larger matrix multiplication from RAM - the difference is that the GPU design has multiple OOM more bandwidth to the equivalent external RAM (about 3 OOM to be more specific). Also the latest lovelace/hopper GPUs have more SRAM now - 50MB per chip, so about 1GB of SRAM for a 200 GPU pod similar to the cerebras wafer. The CS-2 is only good at matrix-vector operations that fit in its SRAM capacity. As a thought experiment, consider running a brain-like ANN with 10B neurons and 10T sparse weights. Simulating one second of activity requires only on order 10T sparse ops or a couple OOM more dense ops which is already within current single GPU capability. The problem is that streaming in the 10TB of weight data would take several minutes on the CS-2's pathetic slow IO path. Meanwhile the equivalent priced 200 GPU pod can fit the weights in GPU RAM and has the performance to simulate about a hundred instances of that brain sized model in real time, so about 10000x higher performance than the CS-2. Weights outnumber activations by 3 or 4 OOM so moving weights over long distances as in CS-2 is enormously inefficient, compared to moving the activations around (as in the GPU design) which uses very little bandwidth. The future is in the opposite direction of that CS-2 'weight streaming' - towards more optimal neurmorphic computing - where the weights stay in place and the activations flow through them.

Is the speed of training large models going to increase significantly in the near future due to Cerebras Andromeda?

Zach Furman3y20

The Andromeda 'supercomputer' has peak performance of 120 pflops dense compared to 512 pflops dense for a single 256 H100 GPU pod from nvidia

I'm not sure if PFLOPs are a fair comparison here though, if I understand Cerebras' point correctly. Like, if you have ten GPUs with one PFLOP each, that's technically the same number of PFLOPs as a single GPU with ten PFLOPs. But actually that single GPU is going to train a lot faster than the ten GPUs because the ten GPUs are going to have to spend time communicating with each other. Especially as memory limitations... (read more)

3jacob_cannell3y

Well that's not quite right - otherwise everyone would be training on single GPUs using very different techniques, which is not what we observe. Every parallel system has communication, but it doesn't necessarily 'spend time' on that in the blocking sense, it typically happens in parallel with computation. SOTA models do now seem often limited by RAM, so model parallelism is increasingly important as it is RAM efficient. This is actually why cerebras's strategy doesn't make sense: GPUs are optimized heavily for the sweet spot in terms of RAM capacity/$ and RAM bandwidth. The wafer scale approach instead tries to use on-chip SRAM to replace off-chip RAM, which is just enormously more expensive - at least an OOM more expensive in practice. This of course is bogus because with model parallelism you can tune the interconnect requirements based on the model design, and nvidia has been tuning their interconnect tradeoffs for years in tandem with researchers cotuning their software/models for nvidia hardware. So current training setups are not strongly limited by interconnect vs other factors - some probably are, some underutilize interconnect and are limited by something else, but nvidia knows all of this, has all that data, and has been optimizing for these use cases weighted by value for years now (and is empirically better at this game than anybody else). The upside of a wafer scale chip is fast on-chip transfer, the downside is slower off-chip transfer (as that is limited by the 2d perimeter of the much larger chip). For equal flops and or $$, the GPU design of breaking up the large tile into alternating logic and RAM subsections has higher total off chip RAM and off-chip transfer bandwidth. The more ideal wafer design would be one where you had RAM stacked above in 3D, but cerebras doesn't do that presumably because they need that whole surface for heat transfer. If you look inside the engine block of the CS-2 form their nice virtual tour you can see that the waf

Is the speed of training large models going to increase significantly in the near future due to Cerebras Andromeda?

Zach Furman3y80

Hmm, I see how that would happen with other architectures, but I'm a bit confused how this is $O (n^{2})$ here? Andromeda has the weight updates computed by a single server (MemoryX) and then distributed to all the nodes. Wouldn't this be a one-to-many broadcast with $O (n)$ transmission time?

3Razied3y

You're completely right, I don't know how I missed that, I must be more tired than I thought I was.

A Mystery About High Dimensional Concept Encoding

Zach Furman3y126

No substantive reply, but I do want to thank you for commenting here - original authors publicly responding to analysis of their work is something I find really high value in general. Especially academics that are outside the usual LW/AF sphere, which I would guess you are given your account age.

Should we push for requiring AI training data to be licensed?

Answer by Zach FurmanOct 19, 2022166

I'm not sure exactly where I land on this, but I think it's important to consider that restricting the data companies can train on could influence the architectures they use. Self-supervised autoregressive models a-la GPT-3 seem a lot more benign than full-fledged RL agents. The latter is a lot less data hungry than the former (especially in terms of copyrighted data). There are enough other factors here to not make me completely confident in this analysis, but it's worth thinking about.

3Jozdien3y

I'm leaning toward the current paradigm being preferable to a full-fledged RL one, but want to add a point - one of my best guesses for proto-AGI involves massive LLMs hooked up to some RL system. This might not require RL capabilities on the same level of complexity as pure RL agents, and RL is still being worked on today.

A shot at the diamond-alignment problem

Zach Furman3y75

This proposal looks really promising to me. This might be obvious to everyone, but I think much better interpretability research is really needed to make this possible in a safe(ish) way. (To verify the shard does develop, isn't misaligned, etc.) We'd just need to avoid the temptation to take the fancy introspection and interpretability tools this would require and use them as optimization targets, which would obviously make them useless as safeguards.

Zach Furman3y30

This is definitely the core challenge of the language model approach, and may be the reason it fails. I actually believe language models aren't the most likely approach to achieve superintelligence. But I also place a non-trivial probability on this occurring, which makes it worth thinking about for me.

Let me try to explain why I don't rule this possibility out. Obviously GPT-3 doesn't know more than a human, as evident in its sub-human performance on common tasks and benchmarks. But suppose we instead have a much more advanced system, a near-optimal seque... (read more)

2jbash3y

Hmm. That made me actually try to think concretely about how to elicit "superhuman" information. You could give it a counterfactual prompt. You could keep sweetening the pot with stuff that made it harder and harder to explain how the prompt could occur without the problem actually being solved. ... but of course you'd still have to be sure that what you got was right. Especially if it determined internally that the problem was totally impossible, it might always output something that would convince everybody if it were proposed, but would still be wrong. It might do that even if the problem could be solved, if the actual solution were less likely to be widely believed by humans than some attractive pseudo-solution. Or it could itself be wrong. Or it might decide it was supposed to be writing a science fiction story. Seems like the approach would work better for questions where you could actually check the results. I don't personally think that sort of model will ever get that smart internally, but it's not like I'm perfect either...