This is a perspective I have on how to do useful AI alignment research. Most perspectives I’m aware of are constructive: they have some blueprint for how to build an aligned AI system, and propose making it more concrete, making the concretisations more capable, and showing that it does in fact produce an aligned AI system. I do not have a constructive perspective - I’m not sure how to build an aligned AI system, and don’t really have a favourite approach. Instead, I have an analytic perspective. I would like to understand AI systems that are built. I also want other people to understand them. I think that this understanding will hopefully act as a ‘filter’ that means that dangerous AI systems are not deployed. The following dot points lay out the perspective.
Since the remainder of this post is written as nested dot points, some readers may prefer to read it in workflowy.
Background beliefs
- I am imagining a future world in which powerful AGI systems are made of components roughly like neural networks (either feedforward or recurrent) that have a large number of parameters.
- Furthermore, I’m imagining that the training process of these ML systems does not provide enough guarantees about deployment performance.
- In particular, I’m supposing that systems are being trained based on their ability to deal with simulated situations, and that that’s insufficient because deployment situations are hard to model and therefore simulate.
- One reason that they are hard to model is the complexities of the real world.
- The real world might be intrinsically difficult to model for the relevant system. For instance, it’s difficult to simulate all the situations in which the CEO of Amazon might find themselves.
- Another reason that real world situations may be hard to model is that they are dependent on the final trained system.
- The trained system may be able to affect what situations it ends up in, meaning that situations during earlier training are unrepresentative.
- Parts of the world may be changing their behaviour in response to the trained system…
- in order to exploit the system.
- by learning from the system’s predictions.
- The real world is also systematically different from the training world: for instance, while you’re training, you will never see the factorisation of RSA-2048 (assuming you’re training in the year 2020), but in the real world you eventually will.
- This is relevant because you could imagine mesa-optimisers appearing in your system that choose to act differently when they see such a factorisation (a toy sketch of this kind of trigger appears at the end of this section).
- I’m imagining that the world is such that if it’s simple for developers to check if an AI system would have disastrous consequences upon deployment, then they perform this check, and fail to deploy if the check says that it would.
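To make the RSA-2048 point above concrete, here is a toy sketch (my own illustration, using a hypothetical `act` function and a small stand-in modulus) of a policy whose behaviour is conditioned on evidence that cannot appear during training: a nontrivial factorisation of RSA-2048. Behavioural testing before deployment cannot distinguish it from a policy that always cooperates.

```python
# Toy illustration of a training-unobservable trigger. The real RSA-2048 modulus
# has 617 decimal digits; a small stand-in semiprime is used here so the sketch
# runs, and substituting the real modulus would not change the logic.
RSA_2048 = 15  # hypothetical stand-in for the real 2048-bit modulus

def sees_factorisation(observation: dict) -> bool:
    """True iff the observation contains two nontrivial factors of the modulus."""
    p, q = observation.get("p"), observation.get("q")
    return (
        isinstance(p, int) and isinstance(q, int)
        and 1 < p < RSA_2048 and 1 < q < RSA_2048
        and p * q == RSA_2048
    )

def act(observation: dict) -> str:
    # During training the trigger never fires, so this policy is behaviourally
    # indistinguishable from one that never defects; only looking inside the
    # system could reveal the conditional.
    return "defect" if sees_factorisation(observation) else "cooperate"

assert act({}) == "cooperate"
assert act({"p": 3, "q": 5}) == "defect"
```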
Background desiderata
- I am mostly interested in allowing the developers of AI systems to determine whether their system has the cognitive ability to cause human extinction, and whether their system might try to cause human extinction.
- I am not primarily interested in reducing the probabilities of other ways in which AI systems could cause humanity to go extinct, such as research groups intentionally behaving badly, or an uncoordinated set of releases of AI systems that interact in negative ways.
- That being said, I think that pursuing research suggested by this perspective could help with the latter scenario, by making it clear which interaction effects might be present.
- I want this determination to be made before the system is deployed, in a ‘zero-shot’ fashion, since this minimises the risk of the system actually behaving badly before you can detect and prevent it.
Transparency
- The type of transparency that I’m most excited about is mechanistic, in a sense that I’ve described elsewhere.
- The transparency method itself should be based on a trusted algorithm, as should the method of interpreting the transparent artefact.
- In particular, these operations should not be done by a machine learning system, unless that system itself has already been made transparent and verified.
- This could be done amplification-style.
- Ideally, models could be regularised for transparency during training, with little or no cost to performance.
- This would be good because by default models might not be very transparent, and it might be hard to hand-design very transparent models that are also capable.
- I think of this as what one should derive from Rich Sutton’s bitter lesson.
- This will be easier to do if the transparency method is simpler, more ‘mathematical’, and minimally reliant on machine learning.
- You might expect little cost to performance since neural networks can often reach high performance given constraints, as long as they are deep enough.
- This paper on the intrinsic dimension of objective landscapes shows that you can constrain neural network weights to a low-dimensional subspace and still find good solutions (a minimal sketch of this kind of constraint appears at the end of this section).
- This paper argues that there are a large number of models with roughly the same performance, meaning that ones with good qualities (e.g. interpretability) can be found.
- This paper applies regularisation to machine learning models that ensures that they are represented by small decision trees.
- The transparency method only has to reveal useful information to developers, not to the general public.
- This makes the problem easier but still difficult.
- Presumably developers will not knowingly deploy catastrophically terrible systems, since catastrophes are usually bad for most people (including the developers), and I’m most interested in averting catastrophic outcomes.
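Returning to the intrinsic-dimension point above: below is a minimal sketch (my own, and cruder than the cited paper’s actual setup, which uses more careful projection schemes) of training only a low-dimensional vector z, with the full parameters given by theta_0 + P z for a fixed random projection P.

```python
# Minimal sketch of subspace-constrained training: only z (subspace_dim numbers)
# is optimised; the effective parameters are theta_0 + P @ z for a fixed random P.
import torch
import torch.nn as nn

class SubspaceMLP(nn.Module):
    def __init__(self, in_dim=784, hidden=64, out_dim=10, subspace_dim=100):
        super().__init__()
        # Frozen base network supplying theta_0.
        self.base = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim)
        )
        for p in self.base.parameters():
            p.requires_grad_(False)
        n_params = sum(p.numel() for p in self.base.parameters())
        # Fixed random projection P and the single trainable variable z.
        self.register_buffer("P", torch.randn(n_params, subspace_dim) / subspace_dim ** 0.5)
        self.z = nn.Parameter(torch.zeros(subspace_dim))

    def forward(self, x):
        offset = self.P @ self.z  # displacement of theta within the random subspace
        params, idx = [], 0
        for p in self.base.parameters():
            n = p.numel()
            params.append(p.detach() + offset[idx:idx + n].view_as(p))
            idx += n
        w1, b1, w2, b2 = params  # first Linear's weight/bias, then the second's
        h = torch.relu(x @ w1.T + b1)
        return h @ w2.T + b2

model = SubspaceMLP()
opt = torch.optim.Adam([model.z], lr=1e-2)      # only 100 numbers are trained
logits = model(torch.randn(8, 784))             # gradients flow only to z
```

This only shows that heavily constrained parameterisations can still be run and trained; whether a constraint can be chosen so that the resulting model is genuinely more transparent is the open question.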
Foundations
- In order for the transparency to be useful, practitioners need to know what problems to look for, and how to reason about these problems.
- I think that an important part of this is ‘agent foundations’, by which I broadly mean a theory of what agents should look like, and what structural facts about agents could cause them to display undesired behaviour.
- Examples:
- Work on mesa-optimisation
- Utility theory, e.g. the von Neumann-Morgenstern theorem
- Methods of detecting which agents are likely to be intelligent or dangerous.
- For this, it is important to be able to look at a machine learning system and learn if (or to what degree) it is agentic, detect belief-like structures and preference-like structures (or deduce things analogous to beliefs and preferences), and learn other similar things.
- This requires structural definitions of the relevant primitives (such as agency), not subjective or performance-based definitions.
- By ‘structural definitions’, I mean definitions that refer to facts that are easily accessible about the system before it is run.
- By ‘subjective definitions’, I mean definitions that refer to an observer’s beliefs or preferences regarding the system.
- By ‘performance-based definitions’, I mean definitions that refer to facts that can be known about the system once it starts running.
- Subjective definitions are inadequate because they do not refer to easily-measurable quantities.
- Performance-based definitions are inadequate because they can only be evaluated once the system is running, when it could already pose a danger, violating the “zero-shot” desideratum.
- Structural definitions are required because they are precisely the definitions that are neither subjective nor performance-based and that refer only to easily accessible facts, which makes it easy to evaluate whether a given system satisfies them (a toy illustration of this contrast appears at the end of this section).
- As such, definitions like “an agent is a system whose behaviour can’t usefully be predicted mechanically, but can be predicted by assuming it near-optimises some objective function” (which was proposed in this paper) are insufficient because they are both subjective and performance-based.
- It is possible to turn subjective definitions into structural definitions trivially, by asking a human about their beliefs and preferences. This is insufficient.
- e.g. “X is a Y if you are scared of it” can turn to “X is a Y if the nearest human to X, when asked if they are scared of X, says ‘yes’”.
- It is insufficient because such a definition doesn’t help the human form their subjective beliefs and impressions.
- It is also possible to turn subjective definitions that only depend on beliefs into structural definitions by determining which circumstances warrant a rational being to have which beliefs. This is sufficient.
- Compare the subjective definition of temperature as “the derivative of a system’s energy with respect to entropy at fixed volume and particle number” to the objective definition “equilibrate the system with a thermometer, then read the temperature off the thermometer” (the former counts as subjective because the entropy in question depends on the describer’s knowledge of the system). For a rational being, these two definitions yield the same temperature for almost all systems.
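As a toy illustration of the contrast above (my own example, with deliberately simplistic stand-in properties rather than proposed definitions of agency): a structural check consults only facts available before the system is ever run, while a performance-based check requires executing it.

```python
# Toy contrast between a structural check (inspects the frozen artefact) and a
# performance-based check (must run the system). The properties themselves are
# simplistic placeholders, not serious definitions of anything.
import torch
import torch.nn as nn

def structural_check(model: nn.Module, threshold: float = 100.0) -> bool:
    """Evaluated zero-shot: uses only the weights and architecture as data."""
    total_norm = sum(float(p.norm()) for p in model.parameters())
    return total_norm < threshold

def performance_check(model: nn.Module, observations, threshold: float = 10.0) -> bool:
    """Requires executing the model, which is exactly what the zero-shot
    desideratum rules out if execution might already be dangerous."""
    with torch.no_grad():
        outputs = [float(model(obs).max()) for obs in observations]
    return max(outputs) < threshold

net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
print(structural_check(net))                        # no forward pass needed
print(performance_check(net, [torch.randn(1, 4)]))  # forward pass needed
```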
Relation between transparency and foundations
- The agent foundations theory should be informed by transparency research, and vice versa.
- This is because the information that transparency methods can yield should be all the information that is required to analyse the system using the agent foundations theory.
- Both lines of research can inform the other.
- Transparency researchers can figure out how to reveal the information required by agent foundations theory, and detect the existence of potential problems that agent foundations theory suggests might occur given certain training procedures.
- Agent foundations researchers can figure out what is implied by the information revealed by existing transparency tools, and theorise about problems that transparency researchers detect.
Criticisms of the perspective
- It isn’t clear if neural network transparency is possible.
- More specifically, it seems imaginable that some information required to usefully analyse an AI system cannot be extracted from a typical neural network in polynomial time.
- It isn’t clear that relevant terms from agency theory can in fact be well-defined.
- E.g. “optimisation” and “belief” have eluded a satisfactory computational grounding for quite a while.
- Relatedly, the philosophical question of which physical systems enable which computations has not to my mind been satisfactorily resolved. See this relevant SEP article.
- An easier path to transparency than the “zero-shot” approach might be to start with simpler systems, observe their behaviour, and slowly scale them up. As you see problems, stop scaling up the systems, and instead fix them so the problems don’t occur.
- I disagree with this criticism.
- At some point, it will be the first time you use a system of a given power in a domain, and the problems caused by the system might be discontinuous in its power, meaning that they would be hard to predict.
- Especially if the power of the system increases discontinuously.
- It is plausibly the case that systems that are a bit smarter than humanity are discontinuously more problematic than those that are a bit less smart than humanity.
- One could imagine giving up the RL dream for something like debate, where you really can get guarantees from the training procedure.
- I think that this is not true, and that things like debate require transparency tools to work well, so as to let debaters know when other debaters are being deceitful. An argument for an analogous conclusion can be found in evhub’s post on Relaxed adversarial training for inner alignment.
- One could imagine inspecting training-time reasoning and convincing yourself that way that future reasoning will be OK.
- But reasoning could look different in different environments.
- This perspective relies on things continuing to look pretty similar to current ML.
- This would be alleviated if you could come up with some sort of sensible theory for how to make systems transparent.
- I find it plausible that the development of such a theory should start with people messing around and doing things with systems they have.
- Systems should be transparent to all relevant human stakeholders, not just developers.
- Sounds right to me - I think people should work on this broader problem. But:
- I don’t know how to solve that problem without making them transparent to developers initially.
- I have ideas about how to solve the easier problem.
Overall take: Broadly agree that analyzing neural nets is useful and more work should go into it. Broadly disagree with the story for how this leads to reduced x-risk. Detailed comments below:
Background beliefs:
Broadly agree, with one caveat:
I'm assuming "guarantees" means something like "strong arguments", and would include things like "when I train the agent on this loss function and it does well on the validation set, it will also do well on a test set drawn from the same distribution" (although I suppose you can prove that that holds with high probability). Perhaps a more interesting strong argument that's not a proof but that might count as a guarantee would be something like "if I perform adversarial training with a sufficiently smart adversary, it is unlikely that the agent finds and fails on an example that was within the adversary's search space".
If you include these sorts of things as guarantees, then I think the training process "by default" won't provide enough guarantees, but we might be able to get it to provide enough guarantees, e.g. by adversarial training. Alternatively, there will exist training processes that won't provide enough guarantees but will knowably be likely to produce AGI; but there may also be versions that do provide enough guarantees.
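For concreteness, the kind of adversarial-training setup gestured at here looks roughly like the following (a generic FGSM-style sketch of standard practice, not something from the post): the strength of the resulting "guarantee" is only as good as the adversary's search space.

```python
# Generic FGSM-style adversarial training step: the adversary searches an
# epsilon-ball around each input for a loss-increasing perturbation, and the
# model is then trained on whatever the adversary finds.
import torch

def adversarial_training_step(model, loss_fn, opt, x, y, epsilon=0.1):
    # Adversary: one gradient-sign step within the epsilon-ball (FGSM).
    x_adv = x.clone().detach().requires_grad_(True)
    loss_fn(model(x_adv), y).backward()
    x_adv = (x_adv + epsilon * x_adv.grad.sign()).detach()
    # Learner: train against the example found in the adversary's search space.
    opt.zero_grad()
    loss = loss_fn(model(x_adv), y)
    loss.backward()
    opt.step()
    return loss.item()
```

A one-step gradient adversary says little about inputs outside its reach, which is why the quality of the argument depends on having a "sufficiently smart adversary".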
Background desiderata:
This seems normative rather than empirical. Certainly we need some form of 'zero-shot' analysis -- in particular, we must be able to predict whether a system causes x-risk in a zero-shot way (you can't see any examples of a system actually causing x-risk). But depending on what exactly you mean, I think you're probably aiming for too strong a property, one that's unachievable given background facts about the world. (More explanation in the Transparency section.)
One way in which this desideratum is unclear to me is what exactly counts as "deployment".
EDIT: Tbc, I think "deployment" is a relatively crisp concept when considering AI governance, where you can think of it as the point at which you release the AI system into the world and other actors besides the one that trained the system start interacting with it in earnest, and this point is a pretty important point in terms of the impacts of the AI system. For OpenAI Five, this would be the launch of Arena. But this sort of distinction seems much less relevant / crisp for AI alignment.
Transparency:
Mechanistic transparency seems incredibly difficult to achieve to me. As an analogy, I don't think I understand how a laptop works at a mechanistic level, despite having a lot of training in Computer Science. This is a system that is built to be interpretable to humans, human civilization as a whole has a mechanistic understanding of laptops, and lots of effort has been put into creating good educational materials that most clearly convey a mechanistic understanding of (components of) laptops -- we have none of these advantages for neural nets. Of course, a laptop is very complex; but I would expect an AGI-via-neural-nets to be pretty complex as well.
I also think that mechanistic transparency becomes much more difficult as systems become more complex: in the best case where the networks are nice and modular, it becomes linearly harder, which might keep the cost ratio the same (seems plausible to scale human effort spent understanding the net at the same rate that we scale model capacity), but if it is superlinearly harder (seems more likely to me, because I don't expect it to be easy to identify human-interpretable modularity even when present), then as model capacity increases, human oversight becomes a larger and larger fraction of the cost.
Currently human oversight is already 99+% of the cost of mechanistically transparent image classifiers: Chris Olah and co. have spent multiple years on one image classifier and are maybe getting close to a mechanistic-ish understanding of it, though of course presumably future efforts would be less costly because they'll have learned important lessons. (Otoh, things that aren't image classifiers are probably harder to mechanistically understand, especially things that are better-than-human, as in e.g. AlphaGo's move 37.)
Controversial, I'm pretty uncertain but weakly lean against. (Probably not worth discussing though, just wanted to note the disagreement.)
But interestingly, you can't just use fewer neurons (corresponding to a low-dimensional subspace where the projection matrix consists of unit vectors along the axes) -- it has to be a random subspace. I think we don't really understand what's going on here and I wouldn't update too much on the possibility of transparency from it (though it is weak evidence that regularization is possible and strong evidence that there are lots of good models).
Compare: There are a large number of NBA players, meaning that ones who are short can be found.
Looking at the results of the paper, it only seems to work for simple tasks, as you might expect. For the most neural-net-like task (recognizing stop phonemes from audio, which is still far simpler than e.g. speech recognition), the neural net gets ~0.95 AUC while the decision tree gets ~0.75 (a vast difference: random is 0.5 and perfect is 1).
Generally there seem to be people (e.g. Cynthia Rudin) who argue "we can have interpretability and accuracy", and when you look at the details they are looking at some very low-dimensional, simple-looking tasks; I certainly agree with that (and that we should use interpretable models in these situations) but it doesn't seem to apply to e.g. image classifiers or speech recognition, and seems like it would apply even less to AGI-via-neural-nets.
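For concreteness, the comparison at issue looks something like this sketch (an illustration with scikit-learn on a low-dimensional tabular task, not the cited paper's experiment): in this regime a depth-limited tree is often competitive with a small neural net, and the disagreement is about whether anything similar holds at perception scale.

```python
# Illustrative comparison on a low-dimensional tabular task (30 features):
# a shallow decision tree versus a small MLP. This is the regime where
# "interpretability and accuracy" claims tend to hold; the gap widens
# drastically on tasks like image classification or speech recognition.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)
mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0),
).fit(X_tr, y_tr)

print("depth-4 tree test accuracy:", round(tree.score(X_te, y_te), 3))
print("small MLP test accuracy:   ", round(mlp.score(X_te, y_te), 3))
```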
Huh? Surely if you're trying to understand agents that arise, you should have a theory of arbitrary agents rather than ideal agents. John Wentworth's stuff seems way more relevant than MIRI's Agent Foundations for the purpose you have in mind.
I could see it being useful to do MIRI-style Agent Foundations work to discover what sorts of problems could arise, though I could imagine this happening in many other ways as well.
No, which is why I want to stop using the example.
(The counterfactual I was thinking of was more like "imagine we handed a laptop to 19th-century scientists, can they mechanistically understand it?" But even that isn't a good analogy, it overstates the difficulty.)