Lucius Bushnaq

AI notkilleveryoneism researcher, focused on interpretability. 

Personal account, opinions are my own. 

I have signed no contracts or agreements whose existence I cannot mention.

I have not updated much on these results so far, though I haven't looked at them in detail yet. My guess is that if you already had a view of SAE-style interpretability somewhat similar to mine [1,2], these papers shouldn't be much of an additional update for you.

I don't share the feeling that too little of relevance has happened over the last ten years for us to seem on track to solve it within a hundred years, if the world's technology[1] were magically frozen in time.

Some more insights from the past ten years that look to me like they're plausibly nascent steps in building up a science of intelligence and maybe later, alignment:

  • We understood some of the basics of general pattern matching: how it is possible for embedded minds that can't be running actual Solomonoff induction to still have some ability to extrapolate from old data to new data. This used to be a big open problem in embedded agency, at least to me, and I think it is largely solved now. Admittedly, a lot of the core work here actually happened more than ten years ago, but people in ML or in our community didn't know about it. [1,2]
  • Natural latents. [1,2,3]
  • Some basic observations and theories about the internal structure of the algorithms neural networks learn, and how they learn them. Yes, our networks may be a very small corner of mind space, but one example is way better than no examples! There's a lot on this one, so the following is just a very small and biased selection. Note how some of these works are starting to properly build on each other. [1,2,3,4,5,6,7,8,9,10,11,12]
  • Some theory trying to link how AIs work to how human brains work. I feel less able to evaluate this one, but if the neurology basics are right it seems quite useful. [1]
  • QACI. What I'd consider the core useful QACI insight maybe sounds kind of obvious once you know about it. But I, at least, didn't know about it. Like, if someone had told me: "A formal process we can describe that we're pretty sure would return the goals we want an AGI to optimise for is itself often a sufficient specification of those goals." I would've replied: "Well, duh." But I wouldn't have realised the implication. I needed to see an actual example for that. Plausibly MIRI people weren't as dumb as me here and knew this pre-2015, I'm not sure.
  • The mesa-optimiser paper. This one probably didn't have much insight that didn't already exist pre-2015. But I think it communicated something central about the essence of the alignment problem to many people who hadn't realised it before. [1]

If we were a normal scientific field with no deadline, I would feel very good about our progress here, particularly given how small we are. CERN costs ca. 1.2 billion a year; I think all the funding for technical work and governance over the past 20 years taken together doesn't add up to one year of that. Even if at the end of it all we still had to get ASI alignment right on the first try, I would still feel mostly good about this, if we had a hundred years.

I would also feel better about the field-building situation if we had a hundred years. Yes, a lot of the things people tried for field building over the past ten years didn't work as well as hoped. But we didn't try that many things, a lot of the attempts struck me as inadequate in really basic ways that seem fixable in principle, and I would say the end result still wasn't no useful field building. I think the useful parts of the field have grown quite a lot even in the past three years! Just not as much as people like John or me thought they would, and not as much as we probably needed them to, given the deadlines we seem likely to have.

Not to say that I wouldn't still prefer to do some human intelligence enhancement first, even if we had a hundred years. That's just the optimal move, even in a world where things look less grim. 

But what really kills it for me is just the sheer lack of time. 
 

  1. ^

    Specifically AI and intelligence enhancement

I wonder whether this is due to the fact that he's used to thinking about human brains, where we're (AFAIK) nowhere near being able to identify the representation of specific concepts, and so we might as well use the most philosophically convenient description.

I don't think this description is philosophically convenient. Believing p and believing things that imply p are genuinely different states of affairs in a sensible theory of mind. Thinking through concrete mech interp examples of the former vs. the latter makes it less abstract in what sense they are different, but I think I would have objected to Chalmers's definition even back before we knew anything about mech interp. It would just have been harder for me to articulate what exactly is wrong with it.

(Abstract) I argue for the importance of propositional interpretability, which involves interpreting a system’s mechanisms and behavior in terms of propositional attitudes
...
(Page 5) Propositional attitudes can be divided into dispositional and occurrent. Roughly speaking, occurrent attitudes are those that are active at a given time. (In a neural network, these would be encoded in neural activations.) Dispositional attitudes are typically inactive but can be activated. (In a neural network, these would be encoded in the weights.) For example, I believe Paris is the capital of France even when I am asleep and the belief is not active. That is a dispositional belief. On the other hand, I may actively judge France has won more medals than Australia. That is an occurrent mental state, sometimes described as an “occurrent belief”, or perhaps better, as a “judgment” (so judgments are active where beliefs are dispositional). One can make a similar distinction for desires and other attitudes.

I don't like it. It does not feel like a clean natural concept in the territory to me.

Case in point:

(Page 9) Now, it is likely that a given AI system may have an infinite number of propositional attitudes, in which case a full log will be impossible. For example, if a system believes a proposition p, it arguably dispositionally believes p-or-q for all q. One could perhaps narrow down to a finite list by restricting the log to occurrent propositional attitudes, such as active judgments. Alternatively, we could require the system to log the most significant propositional attitudes on some scale, or to use a search/query process to log all propositional attitudes that meet a certain criterion.

I think what this is showing is that Chalmers's definition of "dispositional attitudes" has a problem: it lacks any notion of the amount and kind of computational labour required to turn 'dispositional' attitudes into 'occurrent' ones. That's why he ends up with AI systems having an uncountably infinite number of dispositional attitudes. 

One could try to fix up Chalmers's definition by making up some notion of computational cost, or circuit complexity, or something of the sort, that's required to convert a dispositional attitude into an occurrent attitude, and then only list dispositional attitudes up to some cost cutoff c that we are free to pick as applications demand.

But I don't feel very excited about that. At that point, what is this notion of "dispositional attitudes" really still providing us that wouldn't be less cumbersome to describe in the language of circuits? There, you don't have this problem. An AI can have a query-key lookup for the proposition p and just not have a query-key lookup for the proposition p-or-q. Instead, if someone asks whether p-or-q is true, it first performs the lookup for p, then uses some general circuits for evaluating simple propositional logic to calculate that p-or-q is true. This is an importantly different computational and mental process from having a query-key lookup for p-or-q in the weights and just directly performing that lookup, so we ought to describe a network that does the former differently from a network that does the latter. It does not seem like Chalmers's proposed log of 'propositional attitudes' would do this. It would describe both networks the same way, as having a propositional attitude of believing p-or-q, discarding a distinction between them that is important for understanding the model's mental state in a way that will let us do things such as successfully predicting the model's behavior in a different situation.
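
To make the distinction concrete, here is a minimal toy sketch. The dictionary-lookup framing and all names are my own illustration, not anything from Chalmers's paper or a real model; the point is just that a propositional-attitude log would describe both systems as believing p-or-q even though they run different computations:

```python
# Toy sketch: two systems that a propositional-attitude log would both describe
# as "believing p-or-q", despite implementing different computations.
# All names here are hypothetical illustrations, not real model internals.

KNOWLEDGE_A = {"p": True, "p or q": True}   # stores a direct entry for "p or q"
KNOWLEDGE_B = {"p": True}                   # only stores "p"

def answer_direct(store, query):
    """System A: a single query-key lookup answers the question."""
    return store[query]

def answer_derived(store, query):
    """System B: looks up "p", then applies a general disjunction circuit."""
    if " or " in query:
        left, right = query.split(" or ")
        # general propositional-logic step: true if either disjunct is stored as true
        return store.get(left, False) or store.get(right, False)
    return store.get(query, False)

print(answer_direct(KNOWLEDGE_A, "p or q"))   # True, via one lookup
print(answer_derived(KNOWLEDGE_B, "p or q"))  # True, via lookup of "p" plus logic
```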

I'm all for trying to come up with good definitions for model macro-states which throw away tiny implementation details that don't matter, but this definition does not seem to me to carve the territory in quite the right way. It throws away details that do matter.

The idea behind the motivation is indeed that you want to encode the attribution of each rank-1 piece separately. In practice, computing the attribution of a component as a whole actually does involve calculating the attributions of all its rank-1 pieces and summing them up, though you're correct that nothing we do requires storing those intermediate results.

While it technically works out, you are pointing at a part of the math that I think is still kind of unsatisfying. If Bob calculates the attributions and sends them to Alice, why would Alice care about getting the attribution of each rank-1 piece separately if she doesn't need them to tell which component to activate? Why can't Bob just sum them before he sends them? It kind of vaguely makes sense to me that Alice would want the state of a multi-dimensional object on the forward pass described with multiple numbers, but what exactly are we assuming she wants that state for? It seems that she has to be doing something with it that isn't just running her own sparser forward pass.

I'm brooding over variations of this at the moment, trying to find something for Alice to do that connects better to what we actually want to do. Maybe she is trying to study the causal traces of some forward passes, but has offloaded the cost of running those traces onto Bob, and now she wants the shortest summary of the traces for her investigation, under the constraint that uncompressing the summary shouldn't cost her much compute. Or maybe Alice wants something else. I don't know yet.

This paper claims to sample the Bayesian posterior of NN training, but I think it's wrong.

"What Are Bayesian Neural Network Posteriors Really Like?" (Izmailov et al. 2021) claims to have sampled the Bayesian posterior of some neural networks conditional on their training data (CIFAR-10, MNIST, IMDB type stuff) via Hamiltonian Monte Carlo sampling (HMC). A grand feat if true! Actually crunching Bayesian updates over a whole training dataset for a neural network that isn't incredibly tiny is an enormous computational challenge. But I think they're mistaken and their sampler actually isn't covering the posterior properly.

They find that neural network ensembles trained by Bayesian updating, approximated through their HMC sampling, generalise worse than neural networks trained by stochastic gradient descent (SGD). This would have been incredibly surprising to me if it were true. Bayesian updating is prohibitively expensive for real world applications, but if you can afford it, it is the best way to incorporate new information. You can't do better.[1] 

This is in the genre of a lot of papers and takes that I think were around a few years back, which argued that the then still quite mysterious ability of deep learning to generalise was primarily due to some advantageous bias introduced by SGD, or momentum, or something along those lines. In the sense that SGD/momentum/whatever supposedly diverged from Bayesian updating in a way that was better rather than worse.

I think these papers were wrong, and that the generalisation ability of neural networks actually comes from their architecture, which assigns exponentially more weight configurations to simple functions than to complex functions. So most training algorithms will tend to favour simple updates and find simple solutions that generalise well, just because there are exponentially more weight settings for simple functions than for complex functions. This is what Singular Learning Theory talks about. From an algorithmic information theory perspective, I think this happens for reasons similar to why exponentially more binary strings correspond to simple programs than to complex programs on a Turing machine.
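
As a rough illustration of the volume story (my own toy experiment, not taken from any of the papers discussed here), one can sample random weights for a tiny network over 3-bit inputs and count how often each boolean function shows up. The expectation is that constant and near-constant functions take up far more weight volume than complex ones:

```python
# Illustrative sketch: sample random weights for a tiny MLP on 3-bit inputs and
# count how often each boolean function is realised. If simple functions occupy
# exponentially more weight volume, they should dominate the counts.
import itertools
from collections import Counter
import numpy as np

rng = np.random.default_rng(0)
inputs = np.array(list(itertools.product([0, 1], repeat=3)), dtype=float)

def random_function():
    W1 = rng.normal(size=(3, 8)); b1 = rng.normal(size=8)
    W2 = rng.normal(size=(8, 1)); b2 = rng.normal(size=1)
    h = np.maximum(inputs @ W1 + b1, 0.0)         # ReLU hidden layer
    out = (h @ W2 + b2 > 0).astype(int).ravel()   # threshold output on all 8 inputs
    return tuple(out)                             # the boolean function realised

counts = Counter(random_function() for _ in range(20_000))
for f, c in counts.most_common(5):
    print(f, c)   # typically: constant / near-constant functions dominate
```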

This picture of neural network generalisation predicts that SGD and other training algorithms should all generalise worse than Bayesian updating, or at best do similarly. They shouldn't do better.

So, what's going on in the paper? How are they finding that neural network ensembles updated on the training data with Bayes rule make predictions that generalise worse than predictions made by neural networks trained the normal way?

My guess: Their Hamiltonian Monte Carlo (HMC) sampler isn't actually covering the Bayesian posterior properly. They try to check that it's doing a good job by comparing inter-chain and intra-chain variance in the functions learned. 

We apply the classic Gelman et al. (1992) "R̂" potential-scale-reduction diagnostic to our HMC runs. Given two or more chains, R̂ estimates the ratio between the between-chain variance (i.e., the variance estimated by pooling samples from all chains) and the average within-chain variance (i.e., the variances estimated from each chain independently). The intuition is that, if the chains are stuck in isolated regions, then combining samples from multiple chains will yield greater diversity than taking samples from a single chain.

They seem to think that a good R̂ in function space implies the chains are doing a good job of covering the important parts of the space. But I don't think that's true. You need to mix in weight space, not function space, because weight space is where the posterior lives. The map from weight space to function space is not a bijection; that's why it's even possible for simple functions to have exponentially more prior than complex functions. So good mixing in function space does not necessarily imply good mixing in weight space, which is what we actually need. The chains could be jumping from basin to basin very rapidly instead of spending more time in the bigger basins corresponding to simpler solutions, as they should.
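
For concreteness, here is a minimal sketch of the standard Gelman-Rubin R̂ statistic (the textbook formula, not necessarily the paper's exact implementation). The same formula can be applied per test-set prediction (function space) or per weight (weight space), and nothing forces the two to agree:

```python
# Minimal sketch of the Gelman-Rubin R-hat diagnostic (standard formula).
# `chains` has shape (num_chains, num_samples); the statistic can be computed
# per predicted probability (function space) or per weight (weight space).
import numpy as np

def r_hat(chains: np.ndarray) -> float:
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    chain_vars = chains.var(axis=1, ddof=1)
    B = n * chain_means.var(ddof=1)        # between-chain variance
    W = chain_vars.mean()                  # average within-chain variance
    var_plus = (n - 1) / n * W + B / n     # pooled variance estimate
    return float(np.sqrt(var_plus / W))

rng = np.random.default_rng(0)
well_mixed = rng.normal(size=(4, 1000))                      # chains sample the same distribution
stuck = rng.normal(size=(4, 1000)) + np.arange(4)[:, None]   # each chain stuck near its own mode
print(r_hat(well_mixed))  # close to 1.0
print(r_hat(stuck))       # well above the conventional 1.1 threshold: poor mixing
```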

And indeed, they test their chains' weight-space R̂ as well, and find that it's much worse:

Figure 2. Log-scale histograms of R̂ convergence diagnostics. Function-space R̂s are computed on the test-set softmax predictions of the classifiers and weight-space R̂s are computed on the raw weights. About 91% of CIFAR-10 and 98% of IMDB posterior-predictive probabilities get an R̂ less than 1.1. Most weight-space R̂ values are quite small, but enough parameters have very large R̂s to make it clear that the chains are sampling from different distributions in weight space.
...
(From section 5.1) In weight space, although most parameters show no evidence of poor mixing, some have very large R̂s, indicating that there are directions in which the chains fail to mix.

...
(From section 5.2) The qualitative differences between (a) and (b) suggest that while each HMC chain is able to navigate the posterior geometry the chains do not mix perfectly in the weight space, confirming our results in Section 5.1.

So I think they aren't actually sampling the Bayesian posterior. Instead, their chains jump between modes a lot and thus unduly prioritise low-volume minima compared to high-volume minima. And those low-volume minima are exactly the kind of solutions we'd expect to generalise poorly.

I don't blame them here. It's a paper from early 2021, back when very few people understood the importance of weight space degeneracy properly aside from some math professor in Japan whom almost nobody in the field had heard of. For the time, I think they were trying something very informative and interesting. But since the paper has 300+ citations and seems like a good central example of the SGD-beats-Bayes genre, I figured I'd take the opportunity to comment on it now that we know so much more about this. 

The subfield of understanding neural network generalisation has come a long way in the past four years.

Thanks to Lawrence Chan for pointing the paper out to me. Thanks also to Kaarel Hänni and Dmitry Vaintrob for sparking the argument that got us all talking about this in the first place.

  1. ^

     See e.g. the first chapters of Jaynes for why.

  • curious about your optimism regarding learned masks as an attribution method - seems like the problem of learning mechanisms that don't correspond to model mechanisms is real for circuits (see InterpBench) and would plausibly bite here too (though should be able to resolve this with benchmarks on downstream tasks once APD is more mature)

We think this may not be a problem here, because the definition of parameter component 'activity' is very constraining. See Appendix section A.1. 

To count as inactive, it's not enough for components to not influence the output when you turn them off: every point on every possible monotonic trajectory between 'all components on' and 'only the components deemed active on' has to give the same output. If you (approximately) check for this condition, I think the function that picks the learned masks can be about as expressive as it likes, because the sparse forward pass can't rely on the mask to actually perform any useful computational labour.
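
Here's a rough sketch of the kind of check I mean, in my own simplified notation (components as weight matrices that sum to the model's weights, masks scaling each component's contribution); see Appendix A.1 of the paper for the actual definition:

```python
# Rough sketch (simplified notation, not the paper's implementation): a set of
# components counts as inactive on an input only if the output stays the same at
# every sampled point on a monotonic path from "all components on" to "only the
# active components on".
import numpy as np

def forward(components, masks, x):
    W = sum(m * C for m, C in zip(masks, components))  # masked components sum to a weight matrix
    return W @ x

def inactive_components_check(components, active, x, n_points=100, tol=1e-5):
    k = len(components)
    rng = np.random.default_rng(0)
    full_out = forward(components, np.ones(k), x)
    for i in range(n_points):
        # Points on monotonic paths: active components stay at 1, inactive ones
        # take masks anywhere in [0, 1] (including 0, the "only active" endpoint).
        level = 0.0 if i == 0 else rng.uniform(0.0, 1.0, size=k)
        masks = np.where(active, 1.0, level)
        if np.max(np.abs(forward(components, masks, x) - full_out)) > tol:
            return False  # some partial ablation changes the output
    return True

# Example: two components; the second never touches this input.
C1 = np.array([[1.0, 0.0], [0.0, 0.0]])
C2 = np.array([[0.0, 0.0], [0.0, 1.0]])
x = np.array([3.0, 0.0])
print(inactive_components_check([C1, C2], np.array([True, False]), x))  # True
print(inactive_components_check([C1, C2], np.array([False, True]), x))  # False: C1 is needed here
```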

Conceptually, this is maybe one of the biggest differences between APD and something like, say, a transcoder or a crosscoder. It's why it doesn't seem to me like there'd be an analog of 'feature splitting' in APD. If you train a transcoder on a d-dimensional linear transformation, it will learn ever sparser approximations of this transformation the larger you make the transcoder dictionary, with no upper limit. If you train APD on a d-dimensional linear transformation, provided it's tuned right, I think it should learn a single d-dimensional component, regardless of how much larger than d you make the component dictionary. Because if it tried to learn more components than that to get a sparser solution, it wouldn't be able to make the components sum to the original model weights anymore.

Despite this constraint on its structure, I think APD plausibly has all the expressiveness it needs, because even when there is an overcomplete basis of features in activation space, circuits in superposition math and information theory both suggest that you can't have an overcomplete basis of mechanisms in parameter space. So it seems to me that you can just demand that components must compose linearly, without that restricting their ability to represent the structure of the target model. And that demand then really limits the ability to sneak in any structure that wasn't originally in the target model.

Yes, I don't think this will let you get away with no specification bits in goal space at the top level like John's phrasing might suggest. But it may let you get away with much less precision? 

The things we care about aren't convergent instrumental goals for all terminal goals; the kitchen chef's constraints aren't doing that much to keep the kitchen liveable for cockroaches. But it seems to me that this maybe does gesture at a method for getting away with pointing at a broad region of goal space instead of a near-pointlike region.

On first read the very rough idea of it sounds ... maybe right? It seems to perhaps actually centrally engage with the source of my mind's intuition that something like corrigibility ought to exist?

Wow. 

I'd love to get a spot check for flaws from a veteran of the MIRI corrigibility trenches.

It's disappointing that you wrote me off as a crank in one sentence. I expect more care, including that you also question your own assumptions.

I think it is very fair that you are disappointed. But I don't think I can take it back. I probably wouldn’t have introduced the word crank myself here. But I do think there’s a sense in which Oliver’s use of it was accurate, if maybe needlessly harsh. It does vaguely point at the right sort of cluster in thing-space.

It is true that we discussed this and you engaged with a lot of energy and in good faith. But I did not think Forrest’s arguments were convincing at all, and I couldn’t seem to manage to communicate to you why I thought that. Eventually, I felt like I wasn’t getting through to you, Quintin Pope also wasn’t getting through to you, and continuing started to feel draining and pointless to me.

I emerged from this still liking you and respecting you, but thinking that you are wrong about this particular technical matter in a way that does seem like the kind of thing people imagine when they hear ‘crank’.
