We recruited 95 professional developers, split them randomly into two groups, and timed how long it took them to write an HTTP server in JavaScript. One group used GitHub Copilot to complete the task, and the other didn't. We tried to control as many factors as we could: all developers were already familiar with JavaScript, we gave everyone the same instructions, and we leveraged GitHub Classroom to automatically score submissions for correctness and completeness with a test suite. We're sharing a behind-the-scenes blog post soon about how we set up our experiment!
In the experiment, we measured how successful each group was, on average, in completing the task, and how long each group took to finish.
- The group that used GitHub Copilot had a higher rate of completing the task (78%, compared to 70% in the group without Copilot).
- The striking difference was speed: developers who used GitHub Copilot completed the task 55% faster than developers who didn't. Specifically, the developers using GitHub Copilot took on average 1 hour and 11 minutes to complete the task, while the developers who didn't use GitHub Copilot took on average 2 hours and 41 minutes.
My opinion: Because of the usual reasons (publication bias, the replication crisis, the task being "easy," etc.), I don't think we should take this particularly seriously until many more independent experiments have been run. Still, it's worth knowing about.
Related: https://ai.googleblog.com/2022/07/ml-enhanced-code-completion-improves.html
We compare the hybrid semantic ML code completion of 10k+ Googlers (over three months across eight programming languages) to a control group and see a 6% reduction in coding iteration time (time between builds and tests) when exposed to single-line ML completion. These results demonstrate that the combination of ML and SEs can improve developer productivity. Currently, 3% of new code (measured in characters) is now generated from accepting ML completion suggestions.
In my experience it's best at extending things, because it can predict from the context of the file you're working in. If I try to generate code from scratch, it goes off in weird directions I don't actually want pretty quickly, and I constantly have to course-correct with comments.
Honestly, I think the whole "build from the ground up" vs. "extending, modifying, and fixing" dichotomy here is a little confused, though. What scale are we even talking about?
A big part of Copilot's efficiency gains comes from very small-scale suggestions, like filling out the rest of a for loop statement. It can generally guess immediately what you want to iterate over. I happened to be on a plane without internet access recently, decided to do a bit of coding anyway, needed to write a for loop, and was seriously put off by the fact that I couldn't write the whole thing by just pressing Tab. I had to actually think about how a stupid for loop is written! What a waste of time!
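To make the scale concrete, this is the kind of completion I mean. Type `for (` over an array named `items` and Copilot will typically fill in the entire loop body from context (the variable names here are invented for illustration):

```javascript
// The sort of boilerplate loop Copilot completes from a few keystrokes:
const items = ['alpha', 'beta', 'gamma'];
const upper = [];
for (let i = 0; i < items.length; i++) {
  upper.push(items[i].toUpperCase());
}
// upper is now ['ALPHA', 'BETA', 'GAMMA']
```

Nothing here requires any thought, which is exactly the point: it's pure finger-memory boilerplate, and offloading it is where a lot of the measured speedup plausibly comes from.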