Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.
Audio version here (may not be up yet).
Please note that while I work at DeepMind, this newsletter represents my personal views and not those of my employer.
HIGHLIGHTS
Evaluating Large Language Models Trained on Code (Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde, Jared Kaplan et al) (summarized by Rohin): You’ve probably heard of GitHub Copilot, the programming assistant tool that can provide suggestions while you are writing code. This paper evaluates Codex, a precursor to the model underlying Copilot. There’s a lot of content here; I’m only summarizing what I see as the highlights.
The core ingredient for Codex was the many, many public repositories on GitHub, which provided hundreds of millions of lines of training data. With such a large dataset, the authors were able to get good performance by training a model completely from scratch, though in practice they finetuned an existing pretrained GPT model as it converged faster while providing similar performance.
Their primary tool for evaluation is HumanEval, a collection of 164 hand-constructed Python programming problems where the model is provided with a docstring explaining what the program should do along with some unit tests, and the model must produce a correct implementation of the resulting function. Problems are not all equally difficult; an easier problem asks Codex to “increment all numbers in a list by 1” while a harder one provides a function that encodes a string of text using a transposition cipher and asks Codex to write the corresponding decryption function.
To improve performance even further, they collect a sanitized finetuning dataset of problems formatted similarly to those in HumanEval and train Codex to perform well on such problems. These models are called Codex-S. With this, we see the following results:
1. Pretrained GPT models get roughly 0%.
2. The largest 12B Codex-S model succeeds on the first try 29% of the time. (A Codex model of the same size only gets roughly 22%.)
3. There is a consistent scaling law for reduction in loss. This translates into a less consistent graph for performance on the HumanEval dataset, where once the model starts to solve at least (say) 5% of the tasks, there is a roughly linear increase in the probability of success when doubling the size of the model.
4. If instead we generate 100 samples and check whether they pass the unit tests to select the best one, then Codex-S gets 78%. If we still generate 100 samples but select the sample that has the highest mean log probability (perhaps because we don’t have an exhaustive suite of unit tests), then we get 45%.
They also probe the model for bad behavior, including misalignment. In this context, they define misalignment as a case where the user wants A, but the model outputs B, and the model is both capable of outputting A and capable of distinguishing between cases where the user wants A and the user wants B.
Since Codex is trained primarily to predict the next token, it has likely learned that buggy code should be followed by more buggy code, that insecure code should be followed by more insecure code, and so on. This suggests that if the user accidentally provides examples with subtle bugs, then the model will continue to create buggy code, even though the user would want correct code. They find that exactly this effect occurs, and that the divergence between good and bad performance increases as the model size increases (presumably because larger models are better able to pick up on the correlation between previous buggy code and future buggy code).
Rohin's opinion: I really liked the experiment demonstrating misalignment, as it seems like it accurately captures the aspects that we expect to see with existentially risky misaligned AI systems: they will “know” how to do the thing we want, they simply won’t be “motivated” to actually do it.
TECHNICAL AI ALIGNMENT
TECHNICAL AGENDAS AND PRIORITIZATION
Measurement, Optimization, and Take-off Speed (Jacob Steinhardt) (summarized by Sudhanshu): In this blogpost, the author argues that "trying to measure pretty much anything you can think of is a good mental move that is heavily underutilized in machine learning". He motivates the value of measurement and additional metrics by (i) citing evidence from the history of science, policy-making, and engineering (e.g. x-ray crystallography contributed to rapid progress in molecular biology), (ii) describing how, conceptually, "measurement has several valuable properties" (one of which is to act as interlocking constraints that help to error-check theories), and (iii) providing anecdotes from his own research endeavours where such approaches have been productive and useful (see, e.g. Rethinking Bias-Variance Trade-off (AN #129)).
He demonstrates his proposal by applying it to the notion of optimization power -- an important idea that has not been measured or even framed in terms of metrics. Two metrics are offered: (a) the change (typically deterioration) of performance when trained with a perturbed objective function with respect to the original objective function, named Outer Optimization, and (b) the change in performance of agents during their own lifetime (but without any further parameter updates), such as the log-loss on the next sentence for a language model after it sees X number of sequences at test time, or Inner Adaptation. Inspired by these, the article includes research questions and possible challenges.
He concludes with the insight that take-off would depend on these two continuous processes, Outer Optimization and Inner Adaptation, that work on very different time-scales, with the former being, at this time, much quicker than the latter. However, drawing an analogy from evolution, where it took billions of years of optimization to generate creatures like humans that were exceptional at rapid adaptation, we might yet see a fast take-off where Inner Adaptation turns out to be an exponential process that dominates capabilities progress. He advocates for early, sensitive measurement of this quantity as it might be an early warning sign of imminent risks.
Sudhanshu's opinion: Early on, this post reminded me of Twenty Billion Questions; even though they are concretely different, these two pieces share a conceptual thread. They both consider the measurement of multiple quantities essential for solving their problems: 20BQ for encouraging AIs to be low-impact, and this post for productive framings of ill-defined concepts and as a heads-up about potential catastrophes.
Measurement is important, and this article poignantly argues why and illustrates how. It volunteers potential ideas that can be worked on today by mainstream ML researchers, and offers up a powerful toolkit to improve one's own quality of analysis. It would be great to see more examples of this technique applied to other contentious, fuzzy concepts in ML and beyond. I'll quickly note that while there seems to be minimal interest in this from academia, measurement of optimization power has been discussed earlier in several ways, e.g. Measuring Optimization Power, or the ground of optimization (AN #105).
Rohin's opinion: I broadly agree with the perspective in this post. I feel especially optimistic about the prospects of measurement for (a) checking whether our theoretical arguments hold in practice and (b) convincing others of our positions (assuming that the arguments do hold in practice).
FORECASTING
Fractional progress estimates for AI timelines and implied resource requirements (Mark Xu et al) (summarized by Rohin): One methodology for forecasting AI timelines is to ask experts how much progress they have made toward human-level AI within their subfield over the last T years. You can then extrapolate linearly to see when 100% of the problem will be solved. The post linked above collects such estimates, with a typical estimate being 5% of a problem being solved in the twenty-year period between 1992 and 2012. Overall these estimates imply a timeline of 372 years.
This post provides a reductio argument against this pair of methodology and estimate. The core argument is that if you linearly extrapolate, then you are effectively saying “assume that business continues as usual: then how long does it take?” But “business as usual” in the case of the last 20 years involves an increase in the amount of compute used by AI researchers by a factor of ~1000, so this effectively says that we’ll get to human-level AI after a 1000^{372/20} ≈ 10^56 increase in the amount of available compute. (The authors do a somewhat more careful calculation that breaks apart improvements in price and growth of GDP, and get 10^53.)
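The back-of-the-envelope figure above can be checked in a few lines. This only reproduces the simple version of the calculation using the numbers quoted in the summary; it does not attempt the authors' more careful breakdown, and the variable names are my own.

```python
import math

years_implied = 372     # timeline implied by the expert estimates
period_years = 20       # length of the reference period (1992 to 2012)
compute_growth = 1000   # approximate growth in compute over one period

# Number of 20-year "business as usual" periods in the implied timeline.
periods = years_implied / period_years

# Work in log10 space so the result stays a manageable number.
log10_multiplier = periods * math.log10(compute_growth)

print(f"{periods:.1f} periods -> compute grows by about 10^{log10_multiplier:.1f}")
```

The exponent comes out just under 56, which is where the rounded 10^56 comes from.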
This is a stupendously large amount of compute: it far dwarfs the amount of compute used by evolution, and even dwarfs the maximum amount of irreversible computing we could have done with all the energy that has ever hit the Earth over its lifetime (the bound comes from Landauer’s principle).
Given that evolution did produce intelligence (us), we should reject the argument. But what should we make of the expert estimates then? One interpretation is that “proportion of the problem solved” behaves more like an exponential, because the inputs are growing exponentially, and so the time taken to do the last 90% can be much less than 9x the time taken for the first 10%.
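To make the "more like an exponential" interpretation concrete, here is a toy model. It is not from the post: it assumes the fraction solved is f(t) = c * (g^t - 1), where g is the yearly growth factor of inputs (taken from the ~1000x per 20 years figure), and it assumes, purely for illustration, that the first 10% took 40 years.

```python
import math

growth_per_year = 1000 ** (1 / 20)  # ~1000x per 20 years, as quoted in the summary
t_first = 40.0                      # hypothetical: years taken for the first 10%
frac_first = 0.10

# Toy model (an assumption): f(t) = c * (g**t - 1), so that f(0) = 0.
# Calibrate c so that f(t_first) = frac_first.
c = frac_first / (growth_per_year ** t_first - 1)

# Solve f(T) = 1 for T: g**T = 1/c + 1.
t_total = math.log(1 / c + 1, growth_per_year)
t_last = t_total - t_first

# Linear extrapolation would instead give 9x the time of the first 10%.
linear_last = t_first * (1 - frac_first) / frac_first

print(f"exponential toy model: last 90% takes {t_last:.1f} years")
print(f"linear extrapolation:  last 90% takes {linear_last:.0f} years")
```

Under these assumptions the last 90% takes under seven years rather than 360. The specific numbers depend entirely on the assumed functional form; the point is only that linear and exponential readings of the same expert estimate can differ by orders of magnitude.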
Rohin's opinion: This seems like a pretty clear reductio to me, though it is possible to argue that this argument doesn’t apply because compute isn’t the bottleneck, i.e. even with infinite compute we wouldn’t know how to make AGI. (That being said, I mostly do think we could build AGI if only we had enough compute; see also last week’s highlight on the scaling hypothesis (AN #156).)
MISCELLANEOUS (ALIGNMENT)
Progress on Causal Influence Diagrams (Tom Everitt et al) (summarized by Rohin): Many of the problems we care about (reward gaming, wireheading, manipulation) are fundamentally a worry that our AI systems will have the wrong incentives. Thus, we need Causal Influence Diagrams (CIDs): a formal theory of incentives. These are graphical models (AN #49) in which there are action nodes (which the agent controls) and utility nodes (which determine what the agent wants). Once such a model is specified, we can talk about various incentives the agent has. This can then be used for several applications:
1. We can analyze what happens when you intervene on the agent’s action. Depending on whether the RL algorithm uses the original or modified action in its update rule, we may or may not see the algorithm disable its off switch.
2. We can avoid reward tampering (AN #71) by removing the connections from future rewards to utility nodes; in other words, we ensure that the agent evaluates hypothetical future outcomes according to its current reward function.
3. A multiagent version allows us to recover concepts like Nash equilibria and subgames from game theory, using a very simple, compact representation.
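As a rough sketch of what "specifying such a model" can look like, the snippet below encodes a toy diagram as a plain Python dictionary and checks a simple path condition. Everything here is an assumption for illustration: the node names are invented, no CID library is used, and the check (a node lies on a directed path from a decision node to a utility node) is a simplified stand-in for the graphical incentive criteria in the CID literature, not a statement of them.

```python
from typing import Dict, List, Set

# Hypothetical toy diagram: edges point from cause to effect.
edges: Dict[str, List[str]] = {
    "action": ["user_opinion", "reward"],
    "user_opinion": ["reward"],
    "weather": ["reward", "mood"],
    "mood": [],
    "reward": [],
}

# Node roles; anything not listed here is treated as a chance node.
node_kind: Dict[str, str] = {"action": "decision", "reward": "utility"}


def descendants(node: str, graph: Dict[str, List[str]]) -> Set[str]:
    """All nodes reachable from `node` by following directed edges."""
    seen: Set[str] = set()
    stack = list(graph.get(node, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph.get(current, []))
    return seen


def on_decision_to_utility_path(x: str, graph: Dict[str, List[str]],
                                kinds: Dict[str, str]) -> bool:
    """True if some decision node can influence x and x can influence some utility node."""
    decisions = [n for n, k in kinds.items() if k == "decision"]
    utilities = [n for n, k in kinds.items() if k == "utility"]
    influenced = any(x in descendants(d, graph) for d in decisions)
    matters = any(u in descendants(x, graph) for u in utilities)
    return influenced and matters


for name in ["user_opinion", "weather", "mood"]:
    print(name, on_decision_to_utility_path(name, edges, node_kind))
```

In this toy diagram only `user_opinion` sits between the agent's action and its reward, which is the structural shape of a manipulation worry: the agent can affect the variable, and the variable affects what the agent is rewarded for.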
AI GOVERNANCE
A personal take on longtermist AI governance (Luke Muehlhauser) (summarized by Rohin): We’ve previously seen (AN #130) that Open Philanthropy struggles to find intermediate goals in AI governance that seem robustly good to pursue from a longtermist perspective. (If you aren’t familiar with longtermism, you probably want to skip to the next summary.) In this personal post, the author suggests that there are three key bottlenecks driving this:
1. There are very few longtermists in the world; those that do exist often don’t have the specific interests, skills, and experience needed for AI governance work. We could try to get others to work on relevant problems, but:
2. We don’t have the strategic clarity and forecasting ability to know which intermediate goals are important (or even net positive). Maybe we could get people to help us figure out the strategic picture? Unfortunately:
3. It's difficult to define and scope research projects that can help clarify which intermediate goals are worth pursuing when done by people who are not themselves thinking about the issues from a longtermist perspective.
Given these bottlenecks, the author offers the following career advice for those who hope to do work from a longtermist perspective in AI governance:
1. Career decisions should be especially influenced by the value of experimentation, learning, aptitude development, and career capital.
2. Prioritize future impact, for example by building credentials to influence a 1-20 year “crunch time” period. (But make sure to keep studying and thinking about how to create that future impact.)
3. Work on building the field, especially with an eye to reducing bottleneck #1. (See e.g. here.)
4. Try to reduce bottleneck #2 by doing research that increases strategic clarity, though note that many people have tried this and it doesn’t seem like the situation has improved very much.
NEWS
Open Philanthropy Technology Policy Fellowship (Luke Muehlhauser) (summarized by Rohin): Open Philanthropy is seeking applicants for a US policy fellowship program focused on high-priority emerging technologies, especially AI and biotechnology. Application deadline is September 15.
Read more: EA Forum post
FEEDBACK
I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.
PODCAST
An audio podcast version of the Alignment Newsletter is available, recorded by Robert Miles.
I am unlikely to post on the EA forum. (I only recently started posting much here, and I find most of EA rather unconvincing, aside from the one sentence summary, which is obviously a good thing.) Considering my negativity toward long-termism, I'm glad you leaned toward the productive side in your response. My response is a bit long; I didn't manage to get down what I was trying to say when it was shorter. Feel free to ignore it.
I will state that all of that is AI safety. Even the safety of the AI is determined by the overarching world upon which it is acting. A perfectly well controlled AI is unsafe if regulations followed by defense-bot-3000 state that all rebels must be ended, and everyone matches the definition of a rebel. The people that built defense-bot-3000 probably didn't intend to end humanity because a human law said to. Identically, they probably didn't mean for defense-bot-4000 to stand by and let it happen because a human is required in the loop by the 4000 version, and defense-bot-3000 made sure to kill those in charge of defense-bot-4000 at the start for its instrumental value.
Should a police bot let criminals it can prove are guilty run free, because their actions are justified in this instance? Should a facial recognition system point out that it has determined that the new intern matches a spy for the government of that country? Should people be informed that a certain robot is malfunctioning, and likely to destroy an important machine in a hospital [when that means the people will simply destroy the sapient robot, but if the machine is destroyed people might die]? These are moral and legal governance questions that are also clearly AI safety questions.
I'd like to compare it to computer science where we know seemingly very disparate things are theoretically identical, such as iteration versus recursion, and hardware vs software. Regulation internal to the AI is the narrow construal of AI safety, while regulation external to it is governance. (Whether this regulation is on people or on the AI directly can be an important distinction, but either way it is still governance.)
Governance is thus actually a subset of AI safety broadly construed. And it should be broadly construed, since there is no difference between an inbuilt part of the AI and a part of the environment it is being used in if they lead to the same actions.
That wasn't actually my point though. The definition of whether or not you call it AI safety isn't important. You want to make it safe to have AI in use in society through regulation and cultural action. If you don't understand AI, your regulation and cultural bits will be wrong. You do not currently understand AI, especially what effects it will actually have dealing with people [since sufficient AIs don't exist to get data, and current approaches are not well understood in terms of why they do what they do].
Human culture has been massively changed by computers, the internet, cellphones, and so on. If I were older, I'd have a much longer list. If [and this is a big if] AI turns out to be that big of a thing, you can't know what it will look like at this stage. That's why you have to wait to find out [while trying to figure out what it will actually do.] If AI turns out to mostly be good at tutoring people, you need completely different regulation than if it turns out to only be good at war, and both are very different than if it is good at a wide variety of things.
Questions of human society rest on two things. First, what are people truly like on the inside? We aren't good at figuring that out, but we have documented several thousand years of trying, and we're starting to learn. Second, what is the particular culture like? Actual human level AI would massively change all of our cultures, to fit or banish the contours of the actual and perceived effects of the devices. (Also, what are the AIs like on the inside? What are their natures? What cultures would spring up amongst different AIs?)