gallabytes — LessWrong

in practice, we seem to train the world model and understanding machine first and the policy only much later as a thin patch on top of the world model. this is not guaranteed to stay true but seems pretty durable so far. thus, the relevant heuristics are about base models not about randomly initialized neural networks.

separately, I do think randomly initialized neural networks have some strong baseline of fuzziness and conceptual corrigibility, which is in a sense what it means to have a traversible loss landscape.

Varieties Of Doom

gallabytes23d62

what's the easiest thing you think LLMs won't be able to do in 5 years ie by EoY 2030? what about EoY 2026?

Varieties Of Doom

gallabytes25d8-2

It strains credibility to imagine economically useless humans being allowed to keep a disproportionate share of capital in a society where every decision they make with it is net-negative in the carefully tuned structures of posthuman minds.

I'm not really convinced this is true. the share in this case is disproportionate but astronomically small - eg even full dominion over earth is a very cheap thing to grant. if it's even 0.001% destabilizing to ai society to expropriate the humans it's not going to be worth it.

we don't need much caring about humans to lead to ~preserved property rights. I expect we'll overshoot the needed amount by quite a lot, and extra marginal caring is probably good.

Sonnet 4.5's eval gaming seriously undermines alignment evals, and this seems caused by training on alignment evals

gallabytes1mo10

in my use sonnet 4.5 seems genuinely more aligned than sonnet 4, even in scenarios which aren't eval flavored. purely anecdotal and not remotely systematic, but it really does seem a lot less inclined to try to pass off failure as success ("the tests are failing, but this is intended behavior!" kind of thing)

The Obliqueness Thesis

gallabytes1y124

the track record of people trying to broadly ensure that humanity continues to be in control of the future

What track record?

Ironing Out the Squiggles

gallabytes2y1610

But do they also generalize out of training distribution more similarly? If so, why?

Neither of them is going to generalize very well out of distribution, and to the extent they do it will be via looking for features that were present in-distribution. The old adage "to imagine 10-dimensional space, first imagine 3-space, then say 10 really hard".

My guess is that basically every learning system which tractably approximates Bayesian updating on noisy high dimensional data is going to end up with roughly Gaussian OOD behavior. There's been some experiments where (non-adversarially-chosen) OOD samples quickly degrade to uniform prior, but I don't think that's been super robustly studied.

The way humans generalize OOD is not that our visual systems are natively equipped to generalize to contexts they have no way of knowing about, that would be a true violation of no-free-lunch theorems, but that through linguistic reflection & deliberate experimentation some of us can sometimes get a handle on the new domain, and then we use language to communicate that handle to others who come up with things we didn't, etc. OOD generalization is a process at the (sub)cultural & whole-nervous-system level, not something that individual chunks of the brain can do well on their own.

This is also confusing/concerning for me. Why would it be necessary or helpful to have such a large dataset to align the shape/texture bias with humans?

Well it might not be, but you need large datasets to motivate studying large models, as their performance on small datasets like imagenet is often only marginally better.

A 20b param ViT trained on 10m images at 224x224x3 is approximately 1 param for every 75 subpixels, and 2000 params for every image. Classification is an easy enough objective that it very likely just overfits, unless you regularize it a ton, at which point it might still have the expected shape bias at great expense. Training a 20b param model is expensive, I don't think anyone has ever spent that much on a mere imagenet classifier, and public datasets >10x the size of imagenet with any kind of labels only started getting collected in 2021.

To motivate this a bit, humans don't see in frames but let's pretend we do. At 60fps for 12h/day for 10 years, that's nearly 9.5 billion frames. Imagenet is 10 million images. Our visual cortex contains somewhere around 5 billion neurons, which is around 50 trillion parameters (at 1 param / synapse & 10k synapses / neuron, which is a number I remember being reasonable for the whole brain but vision might be 1 or 2 OOM special in either direction).

Ironing Out the Squiggles

gallabytes2y50

adversarial examples definitely still exist but they'll look less weird to you because of the shape bias.

anyway this is a random visual model, raw perception without any kind of reflective error correction loop, I'm not sure what you expect it to do differently, or what conclusion you're trying to draw from how it does behave? the inductive bias doesn't precisely match human vision, so it has different mistakes, but as you scale both architectures they become more similar. that's exactly what you'd expect for any approximately Bayesian setup.

the shape bias increasing with scale was definitely conjectured long before it was tested. ML scaling is very recent though,and this experiment was quite expensive. Remember when GPT-2 came out and everyone thought that was a big model? This is an image classifier which is over 10x larger than that. They needed a giant image classification dataset which I don't think even existed 5 years ago.

Ironing Out the Squiggles

gallabytes2y30

Scale basically solves this too, with some other additions (not part of any released version of MJ yet) really putting a nail in the coffin, but I can't say too much here w/o divulging trade secrets. I can say that I'm surprised to hear that SD3 is still so much worse than Dalle3, Ideogram on that front - I wonder if they just didn't train it long enough?

Ironing Out the Squiggles

gallabytes2y134

They put too much emphasis on high frequency features, suggesting a different inductive bias from humans.

This was found to not be true at scale! It doesn't even feel that true w/weaker vision transformers, seems specific to convnets. I bet smaller animal brains have similar problems.

On Anthropic’s Sleeper Agents Paper

gallabytes2y10

Order matters more at smaller scales - if you're training a small model on a lot of data and you sample in a sufficiently nonrandom manner, you should expect catastrophic forgetting to kick in eventually, especially if you use weight decay.

LESSWRONG
LW

LESSWRONG
LW

Posts

Wikitag Contributions

Comments