Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.
Audio version here (may not be up yet).
HIGHLIGHTS
Thread: Circuits (Chris Olah et al) (summarized by Nicholas): The (currently incomplete) Circuits thread of articles builds a case around 3 main claims:
1. Neural network features - the activation values of hidden layers - are understandable.
2. Circuits - the weights connecting these features - are also understandable.
3. Universality - when training different models on different tasks, you will get analogous features.
Zoom In provides an overview of the argument. The next two articles go into detail on particular sets of layers or neurons.
Claim 1: Neural Network Features - the activation values of hidden layers - are understandable.
They make seven arguments for this claim in Zoom In which are expanded upon in subsequent articles.
1. Feature Visualization: By optimizing the input to maximize the activation of a particular neuron, they can obtain an image of what that neuron reacts most strongly to (see the sketch after this list). They create and analyze these for all 1056 neurons in the first five layers of the InceptionV1 image classification model. While some of them were difficult to understand, they were able to classify and understand the purpose of most of the neurons. A simple example is that curve-detecting neurons produce feature visualizations of curves of a particular orientation. A more complex example is neurons detecting boundaries between high and low frequency, which are often helpful for separating foreground and background.
2. Dataset examples: They also look at the examples in the dataset that maximize a particular neuron. These align with the feature visualizations. Neurons with a particular curve in the feature visualization also fire strongest on dataset examples exhibiting that curve.
3. Synthetic Examples: They also create synthetic examples and find that neurons fire on the expected synthetically generated examples. For example, they generate synthetic curves with a wide range of orientations and curvatures. Curve detectors respond most strongly to a particular orientation and curvature that matches the feature visualizations and highest activation dataset examples. Curve Detectors includes many more experiments and visualizations of curve detectors on the full distribution of curvature and orientation.
4. Joint Tuning: In the case of curve detectors, they rotate the maximally activating dataset examples and find that as the curves change in orientation, the corresponding curve detector neurons increase and decrease their activations in the expected pattern.
5. Feature Implementation: By looking at the circuit used to create a neuron, they can read off the algorithm for producing that feature. For example, curve detectors are built from line detectors and earlier curve detectors, combined in a way that indicates the resulting neuron only activates on curves of a particular orientation and curvature.
6. Feature Use: In addition to looking at the inputs to the neuron, they also look at the outputs to see how the feature is used. For example, curves are frequently used in neurons that recognize circles and spirals.
7. Handwritten Circuits: After understanding existing curve detectors, they can implement their own curve detectors by hand-coding all the weights, and those reliably detect curves.
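To make the first technique concrete, here is a minimal sketch of feature visualization via gradient ascent on the input. It uses torchvision's GoogLeNet as a stand-in for InceptionV1, and the layer and channel choice is arbitrary; the Circuits work itself uses the Lucid library with additional regularizers and image transformations, so treat this as the bare-bones idea rather than their exact method.

```python
# A minimal sketch of feature visualization: gradient ascent on the input image
# to maximize one channel's activation. torchvision's GoogLeNet stands in for
# InceptionV1 (requires a recent torchvision); layer and channel are arbitrary.
import torch
import torchvision

model = torchvision.models.googlenet(weights="DEFAULT").eval()
for p in model.parameters():
    p.requires_grad_(False)

activations = {}
model.inception3b.register_forward_hook(
    lambda module, inputs, output: activations.update(layer=output)
)

channel = 100                                    # arbitrary channel, for illustration
image = torch.rand(1, 3, 224, 224, requires_grad=True)
optimizer = torch.optim.Adam([image], lr=0.05)

for step in range(256):
    optimizer.zero_grad()
    model(image)
    loss = -activations["layer"][0, channel].mean()  # maximize mean activation
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        image.clamp_(0, 1)                       # keep pixels in a valid range
```

The dataset-example and synthetic-example checks (items 2 and 3) are the complementary direction: instead of optimizing an input, you search a dataset, or a parameterized family of synthetic curves, for the inputs that most activate the same channel.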
Claim 2: Circuits - the weights connecting these features - are also understandable.
They provide a number of examples of neurons, both at deep and shallow layers of the network, that are composed of earlier neurons via clear algorithms. As mentioned above, curve detectors are excited by earlier curve detectors in similar orientations and inhibited by ones of opposing orientations.
A large part of ImageNet is focused on distinguishing a hundred species of dogs. A pose-invariant dog head and neck detector can be shown to be composed of two earlier detectors for dogs facing left and right. These in turn are constructed from earlier detectors of fur in a particular orientation.
They also describe circuits for dog head, car, boundary, fur, circle, and triangle detectors.
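The evidence for these circuit-level claims comes from directly reading the weights that connect one layer's features to the next. As a rough illustration (a toy sequential CNN, not the authors' InceptionV1 tooling), this is the kind of weight slice one would inspect:

```python
# A toy sketch of inspecting the weights connecting an "earlier" feature to a
# "later" one. The model, layers, and channel indices are illustrative
# assumptions, not the Circuits setup (which reads InceptionV1 weights).
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=5, padding=2),   # earlier features
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=5, padding=2),  # later features
)

earlier, later = 4, 7                              # hypothetical channel indices
kernel = model[2].weight[later, earlier]           # 5x5 kernel connecting them

# Strongly positive weights mean the earlier feature excites the later one;
# strongly negative weights mean it inhibits it (e.g. opposing curve orientations).
print(kernel)
print("net effect:", kernel.sum().item())
```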
Claim 3: Universality: when training different models on different tasks, you will get analogous features.
This is the most speculative claim, and most of the articles so far have not addressed it directly. However, the early layers of vision (edges, etc.) are believed to be common to many computer vision networks. They describe in detail the first five layers of InceptionV1 and categorize all of the neurons.
Layer 1 is the simplest: 85% of the neurons either detect simple edges or contrasts in colors.
Layer 2 starts to be more varied and detects edges and color contrasts with some invariance to orientation, along with low frequency patterns and multiple colors.
In Layer 3, simple shapes and textures begin to emerge, such as lines, curves, and hatches, along with color contrasts that are more invariant to position and orientation than those in the earlier layers.
Layer 4 has a much more diverse set of features. 25% are textures, but there are also detectors for curves, high-low frequency transitions, brightness gradients, black and white, fur, and eyes.
Layer 5 continues the trend of having features with more variety and complexity. One example is boundary detectors, which combine a number of low-level features into something that can detect boundaries between objects.
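For reference, the "1056 neurons" figure quoted earlier is just the total channel count of these five layers. A quick, hedged way to check this uses torchvision's GoogLeNet as a stand-in for the InceptionV1 model the articles study (so the layer names differ from the Circuits/Microscope naming):

```python
# Count the channels ("neurons" in the Circuits sense) of the first five layers,
# using torchvision's GoogLeNet as a stand-in for InceptionV1. Layer names here
# differ from the conv2d0...mixed3b naming used by the Circuits articles.
import torch
import torchvision

model = torchvision.models.googlenet(weights=None).eval()
layers = ["conv1", "conv2", "conv3", "inception3a", "inception3b"]

channels = {}
for name in layers:
    getattr(model, name).register_forward_hook(
        lambda module, inputs, output, name=name: channels.update({name: output.shape[1]})
    )

with torch.no_grad():
    model(torch.zeros(1, 3, 224, 224))   # dummy image, just to trigger the hooks

print(channels)                          # e.g. {'conv1': 64, ..., 'inception3b': 480}
print(sum(channels.values()))            # 1056
```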
They also highlight a few phenomena that are not yet fully understood:
Polysemantic neurons are neurons that respond to multiple unrelated inputs, such as parts of cars and parts of cats. What is particularly interesting is that these are often constructed from earlier features that are then spread out across multiple neurons in a later layer.
The combing phenomenon is that curve and line detectors on multiple models and datasets tend to be excited by small lines that are perpendicular to the curve. Potential hypotheses are that many curves in the data have them (e.g. spokes on a wheel), that it is helpful for fur detection, that it provides higher contrast between the orientation of the curve and the background, or that it is just a side effect rather than an intrinsically useful feature.
Nicholas's opinion: Even from only the first three posts, I am largely convinced that most of what neural networks do can be understood in this way. The main open question to me is the scalability of this approach. As neural networks get more powerful, do they become more interpretable or less interpretable? Or does it follow a more complex pattern like the one suggested here (AN #72)? I’d love to see some quantitative metric of how interpretable a model is and see how that has changed for the vision state of the art each year. Another related topic I am very interested in is how these visualizations change over training. Do early layers develop first? Does finetuning affect some layers more than others? What happens to these features if the model is overfit?
The other thing I found very exciting about all of these posts is the visualization tools that were used (omitting these is a major shortcoming of this summary). For example, you can click on any of the neurons mentioned in the paper and it opens up a Microscope page that lets you see all the information on that feature and its circuits. I hope that as we get better and more generic tools for analyzing neural networks in this way, this could become very useful for debugging and improving neural network architectures.
TECHNICAL AI ALIGNMENT
MESA OPTIMIZATION
Inner Alignment: Explain like I'm 12 Edition (Rafael Harth) (summarized by Rohin): This post summarizes and makes accessible the paper Risks from Learned Optimization in Advanced Machine Learning Systems (AN #58).
LEARNING HUMAN INTENT
Online Bayesian Goal Inference for Boundedly-Rational Planning Agents (Tan Zhi-Xuan et al) (summarized by Rohin): Typical approaches to learning from demonstrations rely on assuming that the demonstrator is either optimal or noisily optimal. However, this is a pretty bad description of actual human reasoning: it is more accurate to say we are boundedly-rational planners. In particular, it makes more sense to assume that our plans are produced by a noisy planning process. How might we capture this in an algorithm?
This paper models the demonstrator as using a bounded probabilistic A* search to find plans for achieving their goal. The planner is also randomized to account for the difficulty of planning: in particular, when choosing which state to “think about” next, it chooses randomly with higher probability for more promising states (as opposed to vanilla A* which always chooses the most promising state).
The search may fail to find a plan that achieves the goal, in which case the demonstrator follows the actions of the most promising plan found by A* search until it can no longer do so (either because an action leads to a state the search hadn’t considered, or because it reaches the end of the partial plan). Thus, this algorithm can assign significant probability to plans that fail to reach the goal.
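A minimal sketch of this demonstrator model is below. The node budget, the softmax-style sampling rule, and the fallback to the best partial plan are illustrative assumptions chosen to match the description above; the paper's exact formulation and implementation will differ in the details.

```python
# A sketch of a boundedly-rational planner: A* search that samples which node
# to expand (more promising nodes are more likely), runs under a node budget,
# and falls back to its most promising partial plan if the search fails.
import math
import random

def probabilistic_astar(start, goal_test, successors, heuristic,
                        budget=1000, temperature=1.0):
    frontier = [(start, [start], 0.0)]            # (state, path so far, cost g)
    visited = {start}
    best = frontier[0]                            # most promising node seen so far

    def f(node):                                  # f(n) = g(n) + h(n)
        state, _, g = node
        return g + heuristic(state)

    for _ in range(budget):
        if not frontier:
            break
        # Sample the next node to expand with probability ~ exp(-f / temperature),
        # instead of always expanding the lowest-f node as vanilla A* would.
        fs = [f(node) for node in frontier]
        weights = [math.exp(-(v - min(fs)) / temperature) for v in fs]
        node = frontier.pop(random.choices(range(len(frontier)), weights=weights)[0])
        if f(node) < f(best):
            best = node
        state, path, g = node
        if goal_test(state):
            return path                           # found a full plan to the goal
        for next_state, step_cost in successors(state):
            if next_state not in visited:
                visited.add(next_state)
                frontier.append((next_state, path + [next_state], g + step_cost))

    # Search failed: the demonstrator just follows its most promising partial plan.
    return best[1]
```

Goal inference then amounts to asking, for each candidate goal, how likely the observed (possibly failed) trajectory is under this noisy planner.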
The experiments show that this feature allows their SIPS algorithm to infer goals even when the demonstrator fails to reach their goal. For example, if an agent needs to get two keys to unlock two doors to get a blue gem, but only manages to unlock the first door, the algorithm can still infer that the agent’s goal was to obtain the blue gem.
I really like that this paper is engaging with the difficulty of dealing with systematically imperfect demonstrators, and it shows that the approach can do much better than Bayesian IRL on the domains they consider.
Rohin's opinion: It has previously been argued (AN #31) that in order to do better than the demonstrator, you need to have a model of how the demonstrator makes mistakes. In this work, that model is something like, “while running A* search, the demonstrator may fail to find all the states, or may find a suboptimal path before an optimal one”. This obviously isn’t exactly correct, but is hopefully moving in the right direction.
Note that in the domains that the paper evaluates on, the number of possible goals is fairly small (at most 20), presumably because of computational cost. However, even if we ignore computational cost, it’s not clear to me whether this would scale to a larger number of goals. Conceptually, this algorithm is looking for the most likely item out of the set of optimal demonstrations and plausible suboptimal or failed demonstrations. When the number of goals is low, this set is relatively small, and the true answer will likely be the clear winner. However, once the number of goals is much larger, there may be multiple plausible answers. (This is similar to the fact that, since neural nets can encode many possible algorithms and multiple parameter settings optimize your objective, you usually get an algorithm that fails to transfer out of distribution instead of the desired one.)
"Go west, young man!" - Preferences in (imperfect) maps (Stuart Armstrong) (summarized by Rohin): This post argues that by default, human preferences are strong views built upon poorly defined concepts, that may not have any coherent extrapolation in new situations. To put it another way, humans build mental maps of the world, and their preferences are defined on those maps, and so in new situations where the map no longer reflects the world accurately, it is unclear how preferences should be extended. As a result, anyone interested in preference learning should find some incoherent moral intuition that other people hold, and figure out how to make it coherent, as practice for the case we will face where our own values will be incoherent in the face of new situations.
Rohin's opinion: This seems right to me -- we can also see this by looking at the various paradoxes found in the philosophy of ethics, which involve taking everyday moral intuitions and finding extreme situations in which they conflict, and it is unclear which moral intuition should “win”.
FORECASTING
Amplified forecasting: What will Buck's informed prediction of compute used in the largest ML training run before 2030 be? (Ought) (summarized by Rohin): Ought has recently run experiments on how to amplify expert reasoning, to produce better answers than a time-limited expert could produce themselves. This experiment centers on the question of how much compute will be used in the largest ML training run before 2030. Rather than predict the actual answer, participants provided evidence and predicted what Buck’s posterior would be after reading through the comments and evidence.
Buck’s quick prior was an extrapolation of the trend identified in AI and Compute (AN #7), and suggested a median of around 10^13 petaflop/s-days. Commenters pointed out that the existing trend relied on a huge growth rate in the amount of money spent on compute, which seemed to lead to implausible amounts of money by 2030 (a point previously made here (AN #15)). Buck’s updated posterior has a median of around 10^9 petaflop/s-days, with a mode of around 10^8 petaflop/s-days (estimated to be 3,600 times larger than AlphaStar).
Rohin's opinion: The updated posterior seems roughly right to me -- looking at the reasoning of the prize-winning comment, it seems like a $1 trillion training run in 2030 would be about 10^11 petaflop/s-days, which seems like the far end of the spectrum. The posterior assigns about 20% to it being even larger than this, which seems too high to me, but the numbers above do assume a “business-as-usual” world, and if you assign a significant probability to getting AGI before 2030, then you probably should have a non-trivial probability assigned to extreme outcomes.
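For concreteness, here is the back-of-the-envelope version of that calculation. The ~$10 per petaflop/s-day price is not taken from the comment; it is just the value implied by the two figures quoted above ($1 trillion buying about 10^11 petaflop/s-days).

```python
# Back-of-the-envelope: how large a training run (in petaflop/s-days) a given
# budget buys at an assumed effective price. The $10/petaflop/s-day price is an
# illustrative assumption implied by the figures quoted above, not a sourced number.
def training_run_size(budget_usd, usd_per_pfs_day=10.0):
    return budget_usd / usd_per_pfs_day

for budget in [1e9, 1e11, 1e12]:                  # $1B, $100B, $1T
    print(f"${budget:.0e} buys ~{training_run_size(budget):.0e} petaflop/s-days")
# The $1T row gives ~1e11 petaflop/s-days, matching the prize-winning comment.
```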
Competition: Amplify Rohin’s Prediction on AGI researchers & Safety Concerns (Andreas Stuhlmüller) (summarized by Rohin): Ought ran a second competition to amplify my forecast on a question of my choosing. I ended up asking “When will a majority of top AGI researchers agree with safety concerns?”, specified in more detail in the post. Notably, I require the researchers to understand the concerns that I think the AI safety community has converged on, as opposed to simply saying that they are concerned about safety. I chose the question because it seems like any plan to mitigate AI risk probably requires consensus, at least amongst AI researchers, that AI risk is a real concern. (More details in this comment.)
My model is that this will be caused primarily by compelling demonstrations of risk (e.g. warning shots), and these will be easier to do as AI systems become more capable. So it depends a lot on models of progress; I used a median of 20 years until “human-level reasoning”. Given that we’ll probably get compelling demonstrations before then, but also it can take time for consensus to build, I also estimated a median of around 20 years for consensus on safety concerns, and then made a vaguely lognormal prior with that median. (I also estimated a 25% chance that it never happens, e.g. due to a global catastrophe that prevents more AI research, or because we build an AGI and see it isn’t risky, etc.)
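As a rough illustration of how a prior like this can be assembled (the lognormal spread and the Monte Carlo approach are illustrative assumptions; the post only pins down the 20-year median and the 25% "never" mass):

```python
# Sketch of the prior described above: 25% chance consensus never happens,
# otherwise a lognormal time-to-consensus with a median of 20 years.
# The spread (sigma) is an illustrative assumption, not a value from the post.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
p_never, median_years, sigma = 0.25, 20.0, 0.8

years = rng.lognormal(mean=np.log(median_years), sigma=sigma, size=n)
years[rng.random(n) < p_never] = np.inf          # "never" outcomes

for horizon in (10, 20, 30, 50):
    print(f"P(consensus within {horizon} years) ≈ {(years <= horizon).mean():.2f}")
```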
Most of the commenters were more optimistic than I was, thinking that we might already have consensus (given that I restricted it to AGI researchers), which led to several small updates towards optimism. One commenter pointed out that in practice, concern about AI risk tends to be concentrated amongst RL researchers, who are a tiny fraction of all AI researchers, and probably a tiny fraction of AGI researchers as well (given that natural language processing and representation learning seem likely to be relevant to AGI). This led to a single medium-sized update towards pessimism. Overall these washed out: my posterior was a bit more optimistic than my prior, and had higher entropy (i.e. it was more uncertain).
AI STRATEGY AND POLICY
Overcoming Barriers to Cross-cultural Cooperation in AI Ethics and Governance (Seán S. ÓhÉigeartaigh et al) (summarized by Rohin): This paper argues that it is important for AI ethics and governance to be cross-cultural, and provides a few recommendations towards this goal:
1. Develop AI ethics and governance research agendas requiring cross-cultural cooperation
2. Translate key papers and reports
3. Alternate continents for major AI research conferences and ethics and governance conferences
4. Establish joint and/or exchange programmes for PhD students and postdocs
Read more: Longer summary from MAIEI
How Will National Security Considerations Affect Antitrust Decisions in AI? An Examination of Historical Precedents (Cullen O'Keefe) (summarized by Rohin): This paper looks at whether the US has historically used antitrust law to advance unrelated national security objectives, and concludes that this is rare, and that, especially recently, economic considerations tend to be given more weight than national security considerations.
PODCAST
An audio podcast version of the Alignment Newsletter is available, recorded by Robert Miles.