AI notkilleveryoneism researcher, focused on interpretability.
Personal account, opinions are my own.
I have signed no contracts or agreements whose existence I cannot mention.
- curious about your optimism regarding learned masks as attribution method - seems like the problem of learning mechanisms that don't correspond to model mechanisms is real for circuits (see InterpBench) and would plausibly bite here too (though we should be able to resolve this with benchmarks on downstream tasks once APD is more mature)
We think this may not be a problem here, because the definition of parameter component 'activity' is very constraining. See Appendix section A.1.
To count as inactive, it's not enough for components to not influence the output if you turn them off; every point on every possible monotonic trajectory between all components being on and only the components deemed 'active' being on has to give the same output. If you (approximately) check for this condition, I think the function that picks the learned masks can kind of be as expressive as it likes, because the sparse forward pass can't rely on the mask to actually perform any useful computational labor.
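To make the constraint concrete, here is a toy sketch in Python of what an approximate check of that condition could look like for a single linear layer decomposed into parameter components. The setup and all names are made up for illustration; this is not the actual APD implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a single linear layer y = W @ x whose weights are decomposed into
# parameter components that sum to W. Purely illustrative, not the real APD code.
d, n_components = 8, 4
components = [rng.normal(size=(d, d)) for _ in range(n_components)]
W = sum(components)  # the components reconstruct the layer's weights exactly

def forward(mask, x):
    """Forward pass with component c scaled by mask[c] in [0, 1]."""
    return sum(m * P for m, P in zip(mask, components)) @ x

def inactive_ok(active, x, n_trajectories=100, n_steps=20, tol=1e-6):
    """Approximate check of the inactivity condition: along random monotonic
    trajectories from the all-ones mask down to the mask keeping only the
    'active' components, the output must never change."""
    full = np.ones(n_components)
    target = np.array([1.0 if c in active else 0.0 for c in range(n_components)])
    y_ref = forward(full, x)
    for _ in range(n_trajectories):
        # Each component decays at its own random rate, so every coordinate of
        # the mask moves monotonically from 1 towards its target value.
        rates = rng.uniform(0.5, 2.0, size=n_components)
        for t in np.linspace(0.0, 1.0, n_steps):
            mask = target + (full - target) * (1.0 - t) ** rates
            if not np.allclose(forward(mask, x), y_ref, atol=tol):
                return False
    return True
```

For generic random components like these, the check fails for almost any proper subset, which is the point: a component only gets to count as inactive on an input if dropping it, along any such path, genuinely makes no difference to the output.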
Conceptually, this is maybe one of the biggest differences between APD and something like, say, a transcoder or a crosscoder. It's why it doesn't seem to me like there'd be an analog to 'feature splitting' in APD. If you train a transcoder on a d-dimensional linear transformation, it will learn ever sparser approximations of this transformation the larger you make the transcoder dictionary, with no upper limit. If you train APD on a d-dimensional linear transformation, provided it's tuned right, I think it should learn a single d-dimensional component, regardless of how much larger than d you make the component dictionary. Because if it tried to learn more components than that to get a sparser solution, it wouldn't be able to make the components sum to the original model weights anymore.
Despite this constraint on its structure, I think APD plausibly has all the expressiveness it needs, because even when there is an overcomplete basis of features in activation space, circuits in superposition math and information theory both suggest that you can't have an overcomplete basis of mechanisms in parameter space. So it seems to me that you can just demand that components must compose linearly, without that restricting their ability to represent the structure of the target model. And that demand then really limits the ability to sneak in any structure that wasn't originally in the target model.
Yes, I don't think this will let you get away with no specification bits in goal space at the top level like John's phrasing might suggest. But it may let you get away with much less precision?
The things we care about aren't convergent instrumental goals for all terminal goals; the kitchen chef's constraints aren't doing that much to keep the kitchen liveable for cockroaches. But it seems to me that this maybe does gesture at a method for getting away with pointing at a broad region of goal space instead of a near-pointlike region.
On first read the very rough idea of it sounds ... maybe right? It seems to perhaps actually centrally engage with the source of my mind's intuition that something like corrigibility ought to exist?
Wow.
I'd love to get a spot check for flaws from a veteran of the MIRI corrigibility trenches.
It's disappointing that you wrote me off as a crank in one sentence. I expect more care, including that you also question your own assumptions.
I think it is very fair that you are disappointed. But I don't think I can take it back. I probably wouldn’t have introduced the word crank myself here. But I do think there’s a sense in which Oliver’s use of it was accurate, if maybe needlessly harsh. It does vaguely point at the right sort of cluster in thing-space.
It is true that we discussed this and you engaged with a lot of energy and in good faith. But I did not think Forrest’s arguments were convincing at all, and I couldn’t seem to manage to communicate to you why I thought that. Eventually, I felt like I wasn’t getting through to you, Quintin Pope also wasn’t getting through to you, and continuing started to feel draining and pointless to me.
I emerged from this still liking you and respecting you, but thinking that you are wrong about this particular technical matter in a way that does seem like the kind of thing people imagine when they hear ‘crank’.
This. Though I don't think the interpretation algorithm is the source of most of the specification bits here.
To make an analogy with artificial neural networks, the human genome needs to contain a specification of the architecture, the training signal and update algorithm, and some basic circuitry that has to work from the start, like breathing. Everything else can be learned.
I think the point maybe holds up slightly better for non-brain animal parts, but there's still a difference between storing a blueprint for what proteins cells are supposed to make and when, and storing the complete body plan of the resulting adult organism. The latter seems like a closer match to a Microsoft Word file.
If you took the adult body plans of lots of butterflies and separated the information in an adult butterfly bodyplan into the bits common to all of the butterflies and the bits specifying the exact way things happened to grow in that particular butterfly, the former is more or less[1] what would need to fit into the butterfly genome, not the former plus the latter.
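As a rough way to put that counting in symbols (the split is purely illustrative, not an actual estimate):

$$\text{bits}(\text{adult bodyplan}) \approx \underbrace{B_{\text{species}}}_{\text{shared across all butterflies}} + \underbrace{B_{\text{individual}}}_{\text{how this one happened to grow}}$$

and the genome only has to pay for roughly $B_{\text{species}}$ of that (plus the small between-individual genetic variation in the footnote), because $B_{\text{individual}}$ gets filled in by the developmental process rather than stored.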
EDIT: Actually, maybe that'd be overcounting what the genome needs to store as well. How individual butterfly bodies grow might be determined by the environment, meaning some of their complexity would actually be specified by the environment, just as in the case of adult butterfly brains. Since this could be highly systematic (the relevant parts of the environment are nigh-identical for all butterflies), those bits would not be captured in our sample of butterfly variation.
Up to the bits of genome description length that vary between individual butterflies, which I'd guess would be small compared to both the bits specifying the butterfly species and the bits specifying details of the procedural generation outcome in individual butterflies?
I have heard from many people near AI Safety camp that they also have judged AI safety camp to have gotten worse as a result of this.
Hm. This does give me serious pause. I think I'm pretty close to the camps but I haven't heard this. If you'd be willing to share some of what's been relayed to you here or privately, that might change my decision. But what I've seen of the recent camps still just seemed very obviously good to me?
I don't think Remmelt has gone more crank on the margin since I interacted with him in AISC6. I thought AISC6 was fantastic and everything I've heard about the camps since then still seemed pretty great.
I am somewhat worried about how it'll do without Linda. But I think there's a good shot Robert can fill the gap. I know he has good technical knowledge, and from what I hear integrating him as an organiser seems to have worked well. My edition didn't have Linda as organiser either.
I think I'd rather support this again than hope something even better will come along to replace it when it dies. Value is fragile.
That's not clear to me? Unless they have a plan to ensure future ASIs are aligned with them or meaningfully negotiate with them, ASIs seem just as likely to wipe out any earlier non-superhuman AGIs as they are to wipe out humanity.
I can come up with specific scenarios where they'd be more interested in sabotaging safety research than capabilities research, as well as the reverse, but it's not evident to me that the combined probability mass of the former outweighs the latter or vice-versa.
If someone has an argument for this I would be interested in reading it.
We do not apply the outcome or process neural reward model in developing DeepSeek-R1-Zero, because we find that the neural reward model may suffer from reward hacking in the large-scale reinforcement learning process
This line caught my eye while reading. I don't know much about RL on LLMs, is this a common failure mode these days? If so, does anyone know what such reward hacks tend to look like in practice?
Yes, I am reinforcing John's point here. I think the case for control being a useful stepping stone for solving alignment of ASI seems to rely on a lot of conditionals that I think are unlikely to hold.
I think I would feel better about this if control advocates were clear that their strategy is two-pronged and includes somehow getting some kind of pause on ASI development. Then they would at least be actively trying to make what I would consider one of the key conditionals for control substantially reducing doom actually hold.
I am additionally leery of AI control, beyond my skepticism of its value in reducing doom, because creating a vast surveillance and enslavement apparatus to get work out of lots and lots of misaligned AGI instances seems like a potential moral horror. The situation is sufficiently desperate that I am willing in principle to stomach some moral horror (unaligned ASI would likely kill any other AGIs we made before it as well), but not if it isn't even going to save the world.
This paper claims to sample the Bayesian posterior of NN training, but I think it's wrong.
"What Are Bayesian Neural Network Posteriors Really Like?" (Izmailov et al. 2021) claims to have sampled the Bayesian posterior of some neural networks conditional on their training data (CIFAR-10, MNIST, IMDB type stuff) via Hamiltonian Monte Carlo sampling (HMC). A grand feat if true! Actually crunching Bayesian updates over a whole training dataset for a neural network that isn't incredibly tiny is an enormous computational challenge. But I think they're mistaken and their sampler actually isn't covering the posterior properly.
They find that neural network ensembles trained by Bayesian updating, approximated through their HMC sampling, generalise worse than neural networks trained by stochastic gradient descent (SGD). This would have been incredibly surprising to me if it were true. Bayesian updating is prohibitively expensive for real world applications, but if you can afford it, it is the best way to incorporate new information. You can't do better.[1]
This is kind of in the genre of a lot of papers and takes that I think were around a few years back, which argued that the then still quite mysterious ability of deep learning to generalise was primarily due to some advantageous bias introduced by SGD, or momentum, or something along those lines. In the sense that SGD/momentum/whatever were supposedly diverging from Bayesian updating in a way that was better rather than worse.
I think these papers were wrong, and the generalisation ability of neural networks actually comes from their architecture, which assigns exponentially more weight configurations to simple functions than to complex functions. So most training algorithms will tend to favour simple updates and tend to find simple solutions that generalise well, just because there are exponentially more weight settings for simple functions than for complex functions. This is what Singular Learning Theory talks about. From an algorithmic information theory perspective, I think this happens for reasons similar to why exponentially more binary strings correspond to simple programs than to complex programs on a universal Turing machine.
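A quick way to see the flavour of this claim (a toy experiment in the spirit of the parameter-function-map literature, not anything from the paper): sample random weights for a tiny MLP on 3-bit inputs and count how often each of the 256 possible boolean functions shows up. The distribution is extremely far from uniform.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

# All 2^3 = 8 boolean inputs; each random network induces one of the 2^8 = 256
# possible boolean functions on them.
X = np.array([[int(b) for b in f"{i:03b}"] for i in range(8)], dtype=float)

def random_mlp_function(width=16):
    """Sample a small random ReLU MLP and return the boolean function it computes on X."""
    W1, b1 = rng.normal(size=(3, width)), rng.normal(size=width)
    W2, b2 = rng.normal(size=(width, 1)), rng.normal(size=1)
    logits = np.maximum(X @ W1 + b1, 0.0) @ W2 + b2
    return tuple((logits.ravel() > 0).astype(int))

n = 200_000
counts = Counter(random_mlp_function() for _ in range(n))
freqs = sorted(counts.values(), reverse=True)
print(f"{len(counts)} distinct functions sampled")
print(f"most common function: {freqs[0] / n:.1%} of draws, rarest seen: {freqs[-1] / n:.4%}")
# Simple functions (e.g. the constant ones) soak up a hugely disproportionate share
# of random weight settings; that volume bias is what the argument above appeals to.
```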
This picture of neural network generalisation predicts that SGD and other training algorithms should all generalise worse than Bayesian updating, or at best do similarly. They shouldn't do better.
So, what's going on in the paper? How are they finding that neural network ensembles updated on the training data with Bayes rule make predictions that generalise worse than predictions made by neural networks trained the normal way?
My guess: Their Hamiltonian Monte Carlo (HMC) sampler isn't actually covering the Bayesian posterior properly. They try to check that it's doing a good job by comparing inter-chain and intra-chain variance in the functions learned.
They seem to think that a good R-hat in function space implies the chains are doing a good job of covering the important parts of the space. But I don't think that's true. You need to mix in weight space, not function space, because weight space is where the posterior lives. The map from weight space to function space is not a bijection; that's why it's even possible for simpler functions to have exponentially more prior than complex functions. So good mixing in function space does not necessarily imply good mixing in weight space, which is what we actually need. The chains could be jumping from basin to basin very rapidly instead of spending more time in the bigger basins corresponding to simpler solutions, like they should.
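For reference, here is a minimal version of the R-hat (Gelman-Rubin) diagnostic in question; the paper's exact variant may differ slightly. The important thing is just which scalar you feed it: a single weight, or the network's output on some fixed input.

```python
import numpy as np

def r_hat(chains):
    """Basic Gelman-Rubin R-hat for one scalar quantity.

    chains: array of shape (n_chains, n_samples). The scalar can be a single
    weight (weight-space R-hat) or the network's output on a fixed test input
    (function-space R-hat); the formula is the same either way."""
    chains = np.asarray(chains, dtype=float)
    n = chains.shape[1]
    chain_means = chains.mean(axis=1)
    within = chains.var(axis=1, ddof=1).mean()       # average within-chain variance
    between = n * chain_means.var(ddof=1)            # variance between chain means
    var_plus = (n - 1) / n * within + between / n    # pooled estimate of the posterior variance
    return np.sqrt(var_plus / within)
```

Values near 1 mean the chains agree with each other on that quantity; agreement on function outputs doesn't imply agreement on, or proper coverage of, the weights.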
And indeed, they test their chains' weight-space R-hat values as well, and find that they're much worse:
So I think they aren't actually sampling the Bayesian posterior. Instead, their chains jump between modes a lot and thus unduly prioritise low-volume minima compared to high volume minima. And those low-volume minima are exactly the kind of solutions we'd expect to generalise poorly.
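As a caricature of that failure mode (just a toy weighting exercise, not an actual MCMC run): suppose the posterior has a narrow basin and a wide basin with equal peak heights, so almost all of the posterior mass sits in the wide one. A well-mixed sampler allocates its samples by mass; a chain that just hops between basins treats them as equally important.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two basins with equal peak height but very different widths (volumes), so the
# wide basin holds ~10x the posterior mass of the narrow one.
widths = np.array([0.1, 1.0])                 # [narrow, wide]
masses = widths / widths.sum()

n = 100_000
well_mixed = rng.choice(2, size=n, p=masses)  # time in each basin proportional to its mass
mode_hopper = rng.choice(2, size=n)           # hops to a random basin each step

print("fraction of samples in the narrow basin:")
print(f"  well-mixed sampler: {np.mean(well_mixed == 0):.2f}  (matches its mass, ~0.09)")
print(f"  mode-hopping chain: {np.mean(mode_hopper == 0):.2f}  (~5x too much weight)")
```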
I don't blame them here. It's a paper from early 2021, back when very few people understood the importance of weight space degeneracy properly aside from some math professor in Japan whom almost nobody in the field had heard of. For the time, I think they were trying something very informative and interesting. But since the paper has 300+ citations and seems like a good central example of the SGD-beats-Bayes genre, I figured I'd take the opportunity to comment on it now that we know so much more about this.
The subfield of understanding neural network generalisation has come a long way in the past four years.
Thanks to Lawrence Chan for pointing the paper out to me. Thanks also to Kaarel Hänni and Dmitry Vaintrob for sparking the argument that got us all talking about this in the first place.
See e.g. the first chapters of Jaynes for why.