Deep learning as program synthesis
Epistemic status: This post is a synthesis of ideas that are, in my experience, widespread among researchers at frontier labs and in mechanistic interpretability, but rarely written down comprehensively in one place - different communities tend to know different pieces of evidence. The core hypothesis - that deep learning is performing something like tractable program synthesis - is not original to me (even to me, the ideas are ~3 years old), and I suspect it has been arrived at independently many times. (See the appendix on related work.) This is also far from finished research - more a snapshot of a hypothesis that seems increasingly hard to avoid, and a case for why formalization is worth pursuing. I discuss the key barriers and how tools like singular learning theory might address them towards the end of the post.

Thanks to Dan Murfet, Jesse Hoogland, Max Hennick, and Rumi Salazar for feedback on this post.

> Sam Altman: Why does unsupervised learning work?
>
> Dan Selsam: Compression. So, the ideal intelligence is called Solomonoff induction…[1]

The central hypothesis of this post is that deep learning succeeds because it's performing a tractable form of program synthesis - searching for simple, compositional algorithms that explain the data. If correct, this would reframe deep learning's success as an instance of something we understand in principle, while pointing toward what we would need to formalize to make the connection rigorous.

I first review the theoretical ideal of Solomonoff induction and the empirical surprise of deep learning's success. Next, mechanistic interpretability provides direct evidence that networks learn algorithm-like structures; I examine the cases of grokking and vision circuits in detail. Broader patterns provide indirect support: how networks evade the curse of dimensionality, generalize despite overparameterization, and converge on similar representations. Finally, I discuss what formalization would require, why it's hard, and how singular learning theory might help.
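For context on the quote above, it may help to recall the standard definition of the Solomonoff prior (a textbook formulation, added here for reference rather than taken from the post itself): the prior probability of a string is the total weight of programs that produce it, with shorter programs weighted exponentially more heavily,

$$
M(x) \;=\; \sum_{p \,:\, U(p) = x*} 2^{-|p|},
$$

where $U$ is a universal (monotone) Turing machine, the sum ranges over programs $p$ whose output begins with $x$, and $|p|$ is the length of $p$ in bits. Prediction proceeds by conditioning, $M(x_{n+1} \mid x_{1:n}) = M(x_{1:n} x_{n+1}) / M(x_{1:n})$. The $2^{-|p|}$ penalty is the formal sense in which "compression" and "simple programs that explain the data" coincide.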
Great questions, thank you!
Yes, this is correct, because (as I explain briefly in the "search problem" section) the loss function factors as a composition of the parameter-function map and the function-loss map, so by the chain rule you'll always get zero gradient along degenerate directions. (And if you're near-degenerate, you'll get near-zero gradients, proportional to the degree of degeneracy.) So SGD will find it hard to get unstuck from degeneracies.
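To spell out that chain-rule step (my own notation, a sketch rather than something quoted from the post): write $F$ for the parameter-function map and $\ell$ for the function-loss map, so the loss is the composite $L = \ell \circ F$. Then for a direction $v$ in parameter space,

$$
L(w) = \ell(F(w)), \qquad \partial_v L(w) = D\ell\big(F(w)\big)\big[\partial_v F(w)\big].
$$

If $v$ is a degenerate direction at $w$ - that is, $F$ is locally constant along $v$, so $\partial_v F(w) = 0$ - then $\partial_v L(w) = 0$ regardless of the loss, and if $\partial_v F(w)$ is merely small, the gradient component along $v$ is correspondingly small.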