Above is a link to an interesting post about synthetic code generation with a transformer model trained on The Pile, which contains a large chunk of GitHub and StackOverflow. Because CommonCrawl is deficient in this kind of data, the much smaller GPT-J-6B outperforms OpenAI's largest publicly available GPT-3 models. The performance is impressive enough that one wonders how capable a 100+ billion parameter model trained on The Pile would be, let alone what an AlphaGo-level engineering effort aimed at synthetic code generation would achieve.
As The Pile was created to provide a dataset for models with 100+ billion parameters, we may not have to wait long. The examples in the post are clearly trivial, but I personally take this to be something of a fire alarm. I was not previously aware of how poorly optimized GPT-3 was for code generation, and I have updated toward expecting surprising gains in this area in the next few years.
I no longer consider agents with superhuman performance in competitive programming to be a ridiculous thing to pursue.
It is useful to remind myself how shocked I would have been to see such things in 2012. Back then, I would have taken this as a sign that AGI was near.
Scenario-based planning postulates that one should predict symptoms emblematic of a given scenario and then robotically assume one is in that scenario once a sufficient number of those symptoms have come to pass. I am unsure whether there is wisdom in this approach, but I find it a discomfiting line of thought.
Dan Hendrycks, Steven Basart, et al. recently released APPS, a benchmark for measuring the ability of ML models to write code. One part of the benchmark measures performance on competitive programming questions. I wrote a Metaculus question on when people expect this benchmark to be solved, operationalized as achieving above 80% strict accuracy on the competitive programming section.
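For concreteness, here is a minimal sketch of what "strict accuracy" means: a problem only counts as solved if the generated program passes every one of its test cases. The data layout and the `run_program` harness below are my own stand-ins for illustration, not the actual APPS evaluation code, which the reader should consult for the real details.

```python
import subprocess
import sys

def run_program(source: str, stdin: str, timeout: float = 4.0) -> str:
    """Run candidate Python source in a subprocess and capture its stdout.

    Hypothetical harness: APPS's real judge differs, but the idea is the same.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", source],
            input=stdin, capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return ""
    return result.stdout.strip()

def strict_accuracy(solutions: dict[str, str],
                    problems: dict[str, list[tuple[str, str]]]) -> float:
    """Fraction of problems whose generated program passes *every* test case.

    `solutions` maps problem id -> generated source; `problems` maps
    problem id -> list of (stdin, expected stdout) test cases.
    """
    solved = sum(
        all(run_program(solutions[pid], stdin) == expected
            for stdin, expected in tests)
        for pid, tests in problems.items()
    )
    return solved / len(problems)
```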
Initial results are encouraging. GPT-Neo 2.7B passes nearly 20% of test cases on average for introductory coding problems when the model is allowed five attempts (see Table 4 in the paper). A fine-tuned GPT-J-6B is likely to do even better.
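That test-case-average metric is much more forgiving than strict accuracy: each problem contributes the fraction of tests passed by its best attempt, rather than an all-or-nothing score. A hedged sketch of how I read that metric, reusing `run_program` from the snippet above; the paper's exact aggregation may differ in details:

```python
def best_of_k_test_case_average(attempts: list[str],
                                test_cases: list[tuple[str, str]]) -> float:
    """Highest fraction of test cases passed by any of the k attempts.

    Averaging this value over all problems gives a test-case-average score
    in the spirit of the APPS paper's Table 4 (my reading, not their code).
    """
    def fraction_passed(program: str) -> float:
        passed = sum(run_program(program, stdin) == expected
                     for stdin, expected in test_cases)
        return passed / len(test_cases)

    return max(fraction_passed(p) for p in attempts)
```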
The APPS repository also provides the fine-tuned weights for GPT-Neo 2.7B and code to run it, though without a GPU inference takes roughly forever.
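For anyone who wants to poke at it, something like the following should work for sampling from such a checkpoint with the HuggingFace transformers library. The checkpoint path and prompt format below are placeholders, not the repository's exact interface; the tokenizer is the standard GPT-Neo one.

```python
# Sketch of sampling from a fine-tuned GPT-Neo 2.7B checkpoint.
# "path/to/apps-finetuned-gpt-neo" is a placeholder for wherever the
# weights from the APPS repository end up on disk.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"  # CPU works, but is very slow

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-2.7B")
model = AutoModelForCausalLM.from_pretrained(
    "path/to/apps-finetuned-gpt-neo"  # placeholder checkpoint path
).to(device)

# Illustrative prompt only; the repo defines its own prompt format.
prompt = "QUESTION:\nGiven a list of integers on one line, print the largest.\n\nANSWER:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
output = model.generate(
    **inputs, max_new_tokens=256, do_sample=True,
    temperature=0.8, pad_token_id=tokenizer.eos_token_id,
)
# Strip the prompt tokens and print only the generated continuation.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:]))
```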
I asked Dan Hendrycks on the EleutherAI Discord about the performance of GPT-J-6B on APPS. He didn't say they were definitely going to test it, but my takeaway was that it might happen.
I could imagine test-driven automated programming evolving over the next ten to twenty years, where an LM-guided search tries to create functions that match a description and pass all the test cases.
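A toy version of that loop is easy to write down. Here `sample_program` stands in for whatever LM sampling call is used (e.g. the `generate` snippet above), and `run_program` is the subprocess helper sketched earlier; a real system would presumably use something far smarter than blind resampling.

```python
def search_for_solution(description: str,
                        test_cases: list[tuple[str, str]],
                        sample_program,
                        budget: int = 100):
    """Sample candidate programs from an LM until one passes every test.

    `sample_program(description)` is a hypothetical callable that returns
    candidate source code. Returns the first fully passing candidate, or
    None if the sampling budget is exhausted.
    """
    for _ in range(budget):
        candidate = sample_program(description)
        if all(run_program(candidate, stdin) == expected
               for stdin, expected in test_cases):
            return candidate
    return None
```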