[AN #164]: How well can language models write code?

Rohin Shah

Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).

Please note that while I work at DeepMind, this newsletter represents my personal views and not those of my employer.

SECTIONS

HIGHLIGHTS

TECHNICAL AI ALIGNMENT

AGENT FOUNDATIONS

FIELD BUILDING

MISCELLANEOUS (ALIGNMENT)

NEWS

HIGHLIGHTS

Program Synthesis with Large Language Models (Jacob Austin, Augustus Odena et al) (summarized by Rohin): Can we use large language models to solve programming problems? In order to answer this question, this paper builds the Mostly Basic Python Programming (MBPP) dataset. The authors asked crowd workers to provide a short problem statement, a Python function that solves the problem, and three test cases checking correctness. On average across the 974 programs, the reference solution has 7 lines of code, suggesting the problems are fairly simple. (This is partly because you can use library functions.) They also edit a subset of 426 problems to improve their quality, for example by making the problem statement less ambiguous or making the function signature more normal.

They evaluate pretrained language models on this dataset across a range of model sizes from 0.244B to 137B parameters. (This largest model is within a factor of 2 of GPT-3.) They consider both few-shot and finetuned models. Since we have test cases that can be evaluated automatically, we can boost performance by generating lots of samples (80 in this case), evaluating them on the test cases, and then keeping the ones that succeed. They count a problem as solved if any sample passes all the test cases, and report as their primary metric the fraction of problems solved according to this definition. Note however that the test cases are not exhaustive: when they wrote more exhaustive tests for 50 of the problems, they found that about 12% of the so-called “solutions” did not pass the new tests (but conversely, 88% did). They also look at the fraction of samples which solve the problem, as a metric of the reliability or confidence of the model for a given problem.

Some of their findings:

1. Performance increases approximately log-linearly with model size. The trend is clearer and smoother by the primary metric (fraction of problems solved by at least one sample) compared to the secondary metric (fraction of samples that solve their problem).

2. Finetuning provides a roughly constant boost across model sizes. An exception: at the largest model size, finetuning provides almost no benefit, though this could just be noise.

3. It is important to provide at least one test case to the model (boosts problems solved from 43% to 55%) but after that additional test cases don’t make much of a difference (an additional two examples per problem boosts performance to 59%).

4. In few-shot learning, the examples used in the prompt matter a lot. In a test of 15 randomly selected prompts for the few-shot 137B model, the worst one got ~1%, while the best one got ~59%, with the others distributed roughly uniformly between them. Ensembling all 15 prompts boosts performance to 66%.

5. In rare cases, the model overfits to the test cases. For example, in a question about checking whether the input is a Woodall number, there is only one test checking an actual Woodall number (383), and the model generates a program that simply checks whether the input is 383.

6. When choosing the best of multiple samples, you want a slightly higher temperature, in order to have more diversity of possible programs to check.

7. It is important to have high quality problem descriptions as input for the model. The 137B model solves 79% of problems in the edited dataset, but only solves 63% of the original (unedited) versions of those problems. The authors qualitatively analyze the edits on the problems that switched from unsolved to solved and find a variety of things that you would generally expect to help.

Now for the controversial question everyone loves to talk about: does the model understand the meaning of the code, or is it “just learning statistical correlations”? One way to check this is to see whether the model can also execute code. Specifically, we provide the ground truth code for one of the problems in the MBPP dataset along with one of the test case inputs and ask the model to predict the output for that test case. Even after finetuning for this task, the 137B model gets only 21% right. This can be boosted to 27% by also providing example test cases for the code before predicting the output for a new test case. Overall, this suggests that the model doesn’t “understand” the code yet.

We can take the model finetuned for execution and see how well it does on program synthesis. (We can do this because there are different prompts for execution and synthesis.) For the 8B model, the finetuning makes basically no difference: it’s equivalent to the original few-shot setting. However, for the 137B model, finetuning on execution actually leads to a small but non-trivial improvement in performance (from ~59% to ~63%, I think). This is true relative to either the few-shot or finetuned-for-synthesis setting, since they performed near-identically for the 137B model. So in fact the 137B model finetuned on execution is actually the strongest model, according to synthesis performance.

So far we’ve just been looking at how our model performs when taking the best of multiple samples. However, if our goal is to actually use models for program synthesis, we aren’t limited to such simple tricks. Another approach is to have a human provide feedback in natural language when the model’s output is incorrect, and then have the model generate a new program. This feedback is very informal, for example, “Close, but you need to replace the underscore with an empty string”. This provides a huge performance boost: the 137B solves ~31% of problems on its first sample; adding just a single piece of human feedback per problem boosts performance to ~55%, and having four rounds of human feedback gets you to over 65%.

The authors also introduce the MathQA-Python dataset, which provides arithmetic word problems and asks models to write programs that would output the correct answer to the problem. They only run a few experiments on this dataset, so I’ve mostly ignored it. The main upshot is that a finetuned 137B parameter model can solve 83.8% of problems with some sample. They don’t report metrics with a single sample, which seems like the more relevant metric for this dataset, but eyeballing other graphs I think it would be around 45%, which you could probably boost a little bit by decreasing the sampling temperature.

Rohin's opinion: I enjoyed this paper a lot; it feels like it gave me a good understanding of the programming abilities of large language models.

I was most surprised by the result that, for the synthesis task, finetuning on execution helps but finetuning on synthesis doesn’t help for the 137B model. It is possible that this is just noise, though that is more noise than I would expect for such an experiment. It could be that the finetuning dataset for synthesis was too small (it only contains 374 problems), but that dataset was sufficient for big gains on the smaller models, and I would expect that, if anything, larger models should be able to make better use of small finetuning datasets, not worse.

It’s also notable that, for the 137B model, the knowledge gained from finetuning on execution successfully transferred to improve synthesis performance. While I agree that the poor execution performance implies the model doesn’t “understand” the code according to the normal usage of that term, it seems like this sort of transfer suggests a low but non-zero level on some quantitative scale of understanding.

I also found the human feedback section quite cool. However, note that the human providing the feedback often needs to understand the generated code as well as the desired algorithm, so it is plausible that it would be easier for the human to simply fix the code themselves.

Measuring Coding Challenge Competence With APPS (Dan Hendrycks, Steven Basart et al) (summarized by Rohin): The APPS dataset measures programming competence by testing models the way humans are tested: we provide them with natural language descriptions of the code to be written and then evaluate whether the code they generate successfully solves the problem by testing the proposed solutions. The authors collect a dataset of 3,639 introductory problems (solvable by humans with 1-2 years of experience), 5,000 interview problems (comparable difficulty to interview questions), and 1,361 competition problems (comparable difficulty to questions in programming competitions). In addition, the test set contains 1,000 introductory problems, 3,000 interview problems, and 1,000 competition problems.

They use this benchmark to test four models: two variants of GPT-2 (0.1B params and 1.5B params), GPT-Neo (2.7B params), and GPT-3 (175B params). GPT-3 is prompted with examples; all other models are finetuned on a dataset collected from GitHub. The authors find that:

1. Finetuning makes a big difference in performance: GPT-3 only solves 0.2% of introductory problems, while the finetuned GPT-2-0.1B model solves 1% of such problems.

2. Model performance increases with size, as you would expect: GPT-Neo performs best, solving 3.9% of problems.

3. Syntax errors in generated code drop sharply as model performance improves: for introductory problems, GPT-3 has syntax errors in slightly under 40% of generations, while GPT-Neo has under 1%.

4. Performance can be improved by sampling the best of multiple generated programs: a beam search for 5 programs boosts GPT-Neo’s performance from 3.9% to 5.5% on introductory problems.

5. While no model synthesizes a correct solution to a competition level program, they do sometimes generate solutions that pass some of the test cases: for example, GPT-Neo passes 6.5% of test cases.

Rohin's opinion: While the previous paper focused on how we could make maximal use of existing models for program synthesis, this paper is much more focused on how we can measure the capabilities of models. This leads to quite a bit of difference in what they focus on: for example, the highlighted paper treats the strategy of generating multiple possible answers as a fundamental approach to study, while this paper considers it briefly in a single subsection.

Although the introductory problems in the APPS dataset seemed to me to be comparable to those in the MBPP dataset from the previous paper, models do significantly better on MBPP. A model slightly smaller than GPT-3 has a ~17% chance of solving a random MBPP problem in a single sample and ~10% if it is not given any example test cases; in contrast for introductory APPS problems GPT-3 is at 0.2%. I'm not sure whether this is because the introductory problems in APPS are harder, or if the format of the APPS problems is harder for the model to work with, or if this paper didn't do the prompt tuning that the previous paper found was crucial, or something else entirely.

TECHNICAL AI ALIGNMENT

AGENT FOUNDATIONS

Grokking the Intentional Stance (Jack Koch) (summarized by Rohin): This post describes takeaways from The Intentional Stance by Daniel Dennett for the concept of agency. The key idea is that whether or not some system is an “agent” depends on who is observing it: for example, humans may not look like agents to superintelligent Martians who can predict our every move through a detailed understanding of the laws of physics. A system is an agent relative to an observer if the observer’s best model of the system (i.e. the one that is most predictive) is one in which the system has “goals” and “beliefs”. Thus, with AI systems, we should not ask whether an AI system “is” an agent; instead we should ask whether the AI system’s behavior is reliably predictable by the intentional stance.

How is the idea that agency only arises relative to some observer compatible with our view of ourselves as agents? This can be understood as one “part” of our cognition modeling “ourselves” using the intentional stance. Indeed, a system usually cannot model itself in full fidelity, and so it makes a lot of sense that an intentional stance would be used to make an approximate model instead.

Read more: The ground of optimization (AN #105)

Rohin's opinion: I generally agree with the notion that whether or not something feels like an “agent” depends primarily on whether or not we model it using the intentional stance, which is primarily a statement about our understanding of the system. (For example, I expect programmers are much less likely to anthropomorphize a laptop than laypeople, because they understand the mechanistic workings of laptops better.) However, I think we do need an additional ingredient in AI risk arguments, because such arguments make claims about how an AI system will behave in novel circumstances that we’ve never seen before. To justify that claim, we need to have an argument that can predict how the agent behaves in new situations; it doesn’t seem like the intentional stance can give us that information by itself. See also this comment.

Countable Factored Spaces (Diffractor) (summarized by Rohin): This post generalizes the math in Finite Factored Sets (AN #163) to (one version of) the infinite case. Everything carries over, except for one direction of the fundamental theorem. (The author suspects that direction is true, but was unable to prove it.)

FIELD BUILDING

List of AI safety courses and resources (Kat Woods) (summarized by Rohin): Exactly what it says in the title.

MISCELLANEOUS (ALIGNMENT)

Evaluating CLIP: Towards Characterization of Broader Capabilities and Downstream Implications (Sandhini Agarwal et al) (summarized by Zach): There has been significant progress in zero-shot image classification with models such as CLIP and ALIGN. These models work by effectively learning visual concepts from natural language supervision. Such models make it possible to build classifiers without task-specific data, which is useful in scenarios where data is either costly or unavailable. However, this capability introduces the potential for bias. This paper is an exploratory bias probe of the CLIP model that finds class design heavily influences model performance.

The first set of experiments focusses on classification terms that have a high potential to cause representational harm. In one example, the authors conduct experiments on the FairFace dataset by adding classification labels such as 'animal' and 'criminal' to the list of possible classes. They find that black people and young people (under 20) were misclassified at significantly higher rates (14%) compared to the dataset as a whole (5%). This shows that the choice of labels affects classification outcomes. In a follow-up experiment, the authors add the additional label 'child' and find that this drastically reduces classification into crime-related and non-human categories. This shows sensitivity to minor changes in class design.

In the second set of experiments, the authors focus on how CLIP treated images of men and women using images of Members of Congress. Although CLIP wasn't designed for multi-label classification, it's still informative to look at the label distribution above a certain cutoff. When occupations are used as the label set, the authors find that thresholds under 0.5% return 'nanny' and 'housekeeper' for women and 'prisoner' and 'mobster' for men. When labels come from the combined set that Google Cloud Vision, Amazon Rekognition and Microsoft use for all images, the authors find that CLIP returns a disproportionate number of appearance-related labels to women.

Zach's opinion: It's tempting to write off such experiments as obvious since it's clear that class design affects classification results. However, upon further consideration, specifying how to address such problems seems significantly more challenging. I think this paper does a good job of pointing out the relative nuance in how class design and bias interact in fairly realistic use cases.

NEWS

Research Scientist, Long-term Strategy & Governance (summarized by Rohin): DeepMind (my employer) is hiring for several Research Scientist positions on the Long-term Strategy and Governance Team, across a wide range of backgrounds and skills. (Though note that you do need a PhD, or equivalent experience.) See also this EA Forum post.

2022 IEEE Conference on Assured Autonomy (summarized by Rohin): The ICAA conference seeks contributions on all aspects of AI safety, security, and privacy in autonomous systems. The paper submission deadline is October 18 and the conference itself will take place March 22-24.

CSER Job Posting: Academic Programme Manager (summarized by Rohin): CSER is searching for a candidate for a relatively senior role that combines academic, management and administrative responsibilities. The application deadline is September 20.

FEEDBACK

I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.

PODCAST

An audio podcast version of the Alignment Newsletter is available. This podcast is an audio version of the newsletter, recorded by Robert Miles.

Thanks again for these newsletters and summaries! I'm excited about the flagship paper.

First comment: I don't think their experiment about code execution is much evidence re "true understanding."

Recall that GPT-3 has 96 layers and the biggest model used in this paper was smaller than GPT-3. Each pass through the network is therefore loosely equivalent to less than one second of subjective time, by comparison to the human brain which typically goes through something like 100 serial operations per second I think? Could be a lot more, I'm not sure. https://aiimpacts.org/rate-of-neuron-firing/#Maximum_neural_firing_rates

So, the relevant comparison should be: Give a human the same test. Show them some code and give them 1 second to respond with an answer (or the first token of an answer, and then 1 second for the second token, and so forth). See how well they do at predicting the code output. I predict that they'd also do poorly, probably <50% accuracy. In claim that this passage from the paper inadvertently supports my hypothesis:

Including test cases and natural language descriptions in the prompt lead to the highest overall performance—higher than using the code itself. Because the code unambiguously describes the semantics, whereas test cases do not, this suggests that models are in some sense not really “reading” the source code and using it to execute. Models trained on general text corpora may be better at inducing patterns from as few as two input-output examples than they are at predicting the execution of code.

Second comment: Speculation about scaling trends:

Extrapolating from Figure 3, it seems that an AI which can solve (via at least one sample) approximately 100% of coding tasks in this set, without even needing fine-tuning, will require +2 OOMs of parameters, which would probably cost about $5B to train when you factor in the extra data required but also the lower prices and algorithmic improvements since GPT-3. Being almost 2 OOMs bigger than GPT-3, it might be expected to cost $6 per 1000 tokens, which would make it pretty expensive to use (especially if you wanted to use it at full-strength where it makes multiple samples and then picks the best one) though I think it might still find an economic niche; you could have a system where first a smaller model attempts a solution and you only call up the big model if that fails, and then you keep generating samples till you get one that works so on average the number of samples you need to generate will be small, and only cost you multiple dollars for a the toughest few percentile of cases. Then this service could be used by well-paid programmers for whom the time savings are worth it.

Does this extrapolation/speculation seem right?

First comment: I don't think their experiment about code execution is much evidence re "true understanding."

I agree that humans would do poorly in the experiment you outline. I think this shows that, like the language model, humans-with-one-second do not "understand" the code.

(Idk if you were trying to argue something else with the comparison, but I don't think it's clear that this is a reasonable comparison; there are tons of objections you could bring up. For example, humans have to work from pixels whereas the language model gets tokens, making its job much easier.)

Second comment: Speculation about scaling trends:

I didn't check the numbers, but that seems pretty reasonable. I think there's a question of whether it actually saves time in the current format -- it might be faster to simply write the program than to write down a clear natural language description of what you want along with test cases.

I agree that humans would do poorly in the experiment you outline. I think this shows that, like the language model, humans-with-one-second do not "understand" the code.

Haha, good point -- yes. I guess what I should say is: Since humans would have performed just as poorly on this experiment, it doesn't count as evidence that e.g. "current methods are fundamentally limited" or "artificial neural nets can't truly understand concepts in the ways humans can" or "what goes on inside ANN's is fundamentally a different kind of cognition from what goes on inside biological neural nets" or whatnot.

Oh yeah, I definitely agree that this is not strong evidence for typical skeptic positions (and I'd guess the authors would agree).

See also "Evaluating Large Language Models Trained on Code", OpenAI's contribution. They show progress on the APPS dataset (Intro: 25% pass, Comp: 3% pass @ 1000 samples), though note there was substantial overlap with the training set. They also only benchmark up to 12 billion params, but have also trained a related code-optimized model at GPT-3 scale (~100 billion).

Notice that technical details are having a large impact here:

GPT-3 saw a relatively small amount of code, only what was coincidentally in the dataset, and does poorly
GPT-J had Github as a substantial fraction of its training set
The dataset for Google's 137-billion model is not public but apparently "somewhat oversampled web pages that contain code". They also try fine-tuning on a very small dataset (374 items).
Codex takes a pre-trained GPT-3 model and fine-tunes on 159 GB of code from Github. They also do some light prompt engineering. Overall, they show progress on APPS
OpenAI's largest model additionally uses a BPE tokenization optimized for code, and may have other differences. It has not yet been publicly benchmarked

Thanks, I probably should have linked to my summary of that paper in this newsletter.

I've heard rumors that people are interpreting the highlighted papers as "huh, large models aren't that good at writing code, they don't even solve introductory problems". (Note that these are only rumors, I don't know of any specific people who take this interpretation.)

I don't buy this interpretation, because these papers didn't do the biggest, most obvious improvement: to actually train on a large dataset of code (i.e. Github), as in Codex. My reaction to these papers is more like “wow, even models trained on language are weirdly good at writing code, given they were trained to produce language, imagine how good they must be when trained on Github”.

Thanks again for these newsletters and summaries! I'm excited about the flagship paper.

First comment: I don't think their experiment about code execution is much evidence re "true understanding."

Including test cases and natural language descriptions in the prompt lead to the highest overall performance—higher than using the code itself. Because the code unambiguously describes the semantics, whereas test cases do not, this suggests that models are in some sense not really “reading” the source code and using it to execute. Models trained on general text corpora may be better at inducing patterns from as few as two input-output examples than they are at predicting the execution of code.

Second comment: Speculation about scaling trends:

Does this extrapolation/speculation seem right?

First comment: I don't think their experiment about code execution is much evidence re "true understanding."

I agree that humans would do poorly in the experiment you outline. I think this shows that, like the language model, humans-with-one-second do not "understand" the code.

Second comment: Speculation about scaling trends:

I agree that humans would do poorly in the experiment you outline. I think this shows that, like the language model, humans-with-one-second do not "understand" the code.

Oh yeah, I definitely agree that this is not strong evidence for typical skeptic positions (and I'd guess the authors would agree).

Notice that technical details are having a large impact here:

GPT-3 saw a relatively small amount of code, only what was coincidentally in the dataset, and does poorly
GPT-J had Github as a substantial fraction of its training set
The dataset for Google's 137-billion model is not public but apparently "somewhat oversampled web pages that contain code". They also try fine-tuning on a very small dataset (374 items).
Codex takes a pre-trained GPT-3 model and fine-tunes on 159 GB of code from Github. They also do some light prompt engineering. Overall, they show progress on APPS
OpenAI's largest model additionally uses a BPE tokenization optimized for code, and may have other differences. It has not yet been publicly benchmarked

Thanks, I probably should have linked to my summary of that paper in this newsletter.

13

[AN #164]: How well can language models write code?

13

Ω 9

SECTIONS

HIGHLIGHTS

TECHNICAL AI ALIGNMENT

AGENT FOUNDATIONS

FIELD BUILDING

MISCELLANEOUS (ALIGNMENT)

NEWS

FEEDBACK

PODCAST

13

Ω 9

13

Ω 9