Above is a link to an interesting post about synthetic code generation with a transformer model trained on The Pile, which contains a large chunk of GitHub and StackOverflow. Because CommonCrawl is deficient in this area, the much smaller GPT-J-6B outperforms OpenAI’s largest publicly available GPT-3 models at code generation. The performance is impressive enough that one wonders how capable a 100+ billion parameter model trained on The Pile would be, let alone what an AlphaGo-level engineering effort toward the end of synthetic code generation would achieve.
As The Pile was created to provide a dataset for 100-billion-parameter+ models, we may not have to wait long. The examples in the post are clearly trivial, but I personally take this to be something of a fire alarm. I was not previously aware of how poorly optimized GPT-3 was for code generation, and I have updated toward expecting surprising gains in this area over the next few years.
I no longer consider agents with superhuman performance in competitive programming to be a ridiculous thing to pursue.
It is useful to remind myself of how shocked I would be to see such things in 2012. In 2012 I would have taken this as a sign that AGI was near.
Scenario-based planning postulates that one should predict symptoms emblematic of a given scenario and then robotically assume you are in said scenario once a sufficient number of these symptoms come to pass. I am unsure whether there is wisdom in this approach, but I find it a discomfiting line of thought.
To elaborate on this a little more: maintenance is the kind of nasty field where '99% accurate' may still not be nearly good enough if you want to unlock big productivity gains of the sort you get by replacing humans entirely, rather than merely saving a few minutes here or there looking up API docs, etc. Amdahl's law is not mocked: if a human has to manually review and step in, then the tool cannot deliver more than modest factor gains, any more than learning to type really fast will deliver life-changing productivity gains. Maintenance is almost by definition about the long tail of subtle bugs, system interactions, faulty assumptions, and business-driven requirement changes.* If you're a SWE at Google, you don't spend very much time writing little self-contained greenfield scripts of 100-500 lines. You'll spend a lot more time doing, say, code reviews of new pulls, which involve no writing of greenfield code at all. Something like Codex can help knock out the occasional script, or help in learning a new system, or be a very useful substrate for static analysis tools (like Coverity on steroids), but I can confidently predict that Codex is not going to make programmers even 10x more productive. Utility doesn't increase smoothly with accuracy: it plateaus and jumps. You don't want to use a voice transcription system which makes 10% errors, but at 5% it might suddenly become useful.
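The Amdahl's-law point can be made concrete with a two-line calculation. (The 30%-of-time-writing-code figure below is a made-up illustration of mine, not a measured number.)

```python
def amdahl_speedup(automated_fraction: float, automation_speedup: float) -> float:
    """Amdahl's law: overall speedup when only a fraction of the work is accelerated."""
    return 1.0 / ((1.0 - automated_fraction) + automated_fraction / automation_speedup)

# Illustration: even if writing fresh code became effectively free (a ~billion-fold
# speedup), a developer who spends only 30% of their time on it caps out below 1.5x:
overall = amdahl_speedup(0.30, 1e9)  # ~1.43
```

The unautomated 70% (review, debugging, requirements) dominates no matter how fast the automated part gets, which is why '10x more productive' requires removing the human from the loop, not assisting them.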
But ironically, in many ways, developing DL code is far simpler. Sometimes, solving a much harder problem is much easier. DL is much more self-contained and amenable to self-modification. The complexity of the learned tasks resides in the weights, not the seed algorithm which learns the NN; the seed algorithm may be extremely simple and short, a few hundred lines at most, including all the boilerplate and wrappers and ceremony. You can write backprop and CNNs in a few hundred lines for a self-contained CPU implementation. Available DL libraries let you create & train an arch like GPT in a few dozen lines (Karpathy does minGPT in <300 lines of bloated code). Rip Van Winkle is an interesting exercise in estimating the complexity, in a Kolmogorov sort of way, of a formerly-SOTA CNN ResNet at 1,032 bits. Evolutionary search programs like AutoML-Zero can recapitulate backprop and other core algorithms in a few lines. We also see this in the breakthroughs themselves: why do MLPs suddenly work? Because you add like 1 line to re-normalize or gate intermediate activations, while 99.9% of the code remains the same. Why did ResNets suddenly make 'deep' (>10-layer) NNs work? Because you add like 1-3 lines to define a shortcut connection. Why did NNs suddenly start working around 2009? Because you added 1 line for the right initialization, and 1 line for a ReLU instead of a sigmoid nonlinearity. Why did X suddenly work? We could go on all day. (Why is one person a genius and another person ordinary? Differences at a few thousand alleles which could be encoded in less than a kilobyte. Why did humans take over the world while chimpanzees are in zoos, if our genomes are like 99.9% identical? Everything is fragile.) The space of all possible programs of a few hundred self-contained lines to bootstrap a general meta-learning agent is vast...
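To make the "1-line diff" point concrete, here is a toy NumPy sketch (names and shapes are mine, purely illustrative): the ReLU swap is one line, and the residual shortcut is one more.

```python
import numpy as np

def relu(x):
    # the 1-line nonlinearity swap: max(0, x) instead of a sigmoid
    return np.maximum(0.0, x)

def layer(x, W):
    return relu(W @ x)

def plain_block(x, W):
    return layer(x, W)

def residual_block(x, W):
    # the 1-3 line ResNet change: a shortcut connection around the layer,
    # so the identity mapping (and its gradient) passes through unimpeded
    return x + layer(x, W)
```

With `W = 0`, the plain block kills the signal entirely while the residual block passes it through unchanged, which is the intuition for why shortcuts made very deep stacks trainable: everything else in the codebase stays the same.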
but it's also exactly the sort of task where a self-supervised agent can acquire most of the necessary bits from the environment, solving basic problems like how to create valid ASTs (the sort of knowledge that isn't in AutoML-Zero-esque systems, and mostly accounts for their boil-the-ocean inefficiency), and then use the tiny bit of supervision from evolutionary RL losses to close the gap by selecting only plausible modifications to test, running a feasible number of iterations, and modifying the last handful of key lines.
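A minimal sketch of that "propose a modification, keep it only if the loss improves" loop, in the spirit of a (1+1) evolutionary search; everything here is a toy stand-in of mine, not AutoML-Zero's actual machinery:

```python
import random

def evolve(candidate, mutate, loss, iterations=500, seed=0):
    """(1+1) evolutionary search: propose one mutation at a time,
    keep it only if it strictly improves the loss."""
    rng = random.Random(seed)
    best, best_loss = candidate, loss(candidate)
    for _ in range(iterations):
        proposal = mutate(best, rng)
        proposal_loss = loss(proposal)
        if proposal_loss < best_loss:
            best, best_loss = proposal, proposal_loss
    return best, best_loss

# Toy stand-in for "mutate a program": nudge an integer by +/-1 to minimize (x - 7)^2.
best, best_loss = evolve(0, lambda x, rng: x + rng.choice([-1, 1]),
                         lambda x: (x - 7) ** 2)
```

The bottleneck in such systems is not the loop but the proposal distribution: a self-supervised model that only emits syntactically valid, plausible edits shrinks the search space enormously compared to blind mutations, which is exactly the "acquire most of the necessary bits from the environment" claim above.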
Thus, an asymmetry in code-generating AIs. A code-generating AI could be almost completely useless for 'easy' maintenance tasks like fixing bugs in production code, because it comes with so much overhead and unreliability that it isn't worth the hassle, while still offering enormous gains in ranking candidates for the 'hard' problem of rewriting a core DL algorithm. It is unfortunate that we live in a world where you can apparently be 99.9% of the way to a human or an AGI and have the result be completely useless, rather than 99.9% as powerful, because it means you may get no warning signs before that last 1-line fix; but that looks like the world we live in, as opposed to a gradualist world where half-working AIs take over half the global economy or something.
* If you've paid attention to the popups on Gwern.net, you've probably noticed that they've changed a number of times; the Wikipedia popups, specifically, have now gone through 8 completely different implementations. The 8th iteration, ironically, is very similar to the 1st iteration: it requests an article summary from the Wikipedia APIs and displays it; that's all. I & Obormot have spent a breathtaking amount of time on this, not because the actual coding itself takes up substantial time (none of it is remotely impressive algorithmically), but because the hard part is understanding what should even be done in the first place and what tradeoff between static, dynamic, inlined vs external, popup vs popin, etc. works best, then implementing and testing in the real world to see how it felt in practice, what users thought, and how it scaled as I fixed bugs & found edge-cases... By the 8th iteration, what we'd learned was that static or inlined couldn't work at scale or provide recursion in any feasible way and were dead ends, and the main motivation for those - displaying hyperlinked excerpts - was moot because we were using the wrong WP API in the first iteration: there was a 'mobile' API which, I discovered after hours of docs reading, provided useful rather than butchered excerpts and would have worked fine all along. "Time is a circle."