In late June of 2021, GitHub launched a ‘technical preview’ of what they termed GitHub Copilot, described as an ‘AI pair programmer which helps you write better code’. Quite predictably, responses to this announcement varied from glee at the glorious arrival of our code-generating AI overlords, to dismay and predictions of doom and gloom as before long companies would be firing software developers en-masse.
As is usually the case with such controversial topics, neither of these extremes are even remotely close to the truth. In fact, the OpenAI Codex machine learning model which underlies GitHub’s Copilot is derived from OpenAI’s GPT-3 natural language model, and features many of the same stumbles and gaffes which GTP-3 has. So if Codex and with it Copilot isn’t everything it’s cracked up to be, what is the big deal, and why show it at all?
The Many Definitions of AI
The first major attempt at establishing a true field of artificial intelligence was the Dartmouth workshop in 1956. This would see some of the foremost minds in the fields of mathematics, neuroscience, and computer sciences come together to essentially brainstorm on a way to create what they would term ‘artificial intelligence’, following the more common names at the time like ‘thinking machines’ and automata theory.
Despite the hopeful attitude during the 1950s and 1960s, it was soon acknowledged that Artificial Intelligence was a much harder problem than initially assumed. Today, AI capable of thinking like a human is referred to as artificial general intelligence (AGI) and still firmly the realm of science-fiction. Much of what we call ‘AI’ today is in fact artificial narrow intelligence (ANI, or Narrow AI) and encompasses technologies that approach aspects of AGI, but which are generally very limited in their scope and application.
Most ANIs are based around artificial neural networks (ANNs) which roughly copy the concepts behind biological neural networks such as those found in the neocortex of mammals, albeit with major differences and simplifications. ANNs like classical NNs and recurrent NNs (RNNs) — what’s used for GPT-3 and Codex — are programmed during training using backpropagation, which is a process that has no biological analog.
Essentially, RNN-based models like GPT-3 are curve fitting models, which use regression analysis in order to match a given input with its internal data points, the latter of which are encoded in the weights assigned to the connections within its network. This makes NNs at their core mathematical models, capable of efficiently finding probable matches within their network of parameters. When it comes to GPT-3 and similar natural language synthesis systems, their output is therefore based on probability rather than understanding. Therefore much like with any ANN the quality of this output is is highly dependent on the training data set.
Garbage In, Garbage Out
All of this means that an ANN is not capable of thought or reasoning and is thus not aware of the meaning of the text which it generates. In the case of OpenAI’s Codex, it has no awareness of what code it writes. This leads to the inevitability of having a human check the work of the ANN, as also concluded in a recent paper by OpenAI (Mark Chen et al., 2021). Even though Codex was trained on code instead of natural language, it has as little concept of working code as it has of proper English grammar or essay writing.
This is borne out by the FAQ on GitHub’s Copilot page as well, which notes that on the first attempt to fill in a blanked out function’s code it got it right only 43% of the time and 57% when given 10 attempts. Mark Chen et al. tested the generated Python output from Codex against prepared unit tests. They showed that different versions of Codex managed to generate correct code significantly less than half the time for a wide variety of inputs. These inputs ranged from interview questions to docstring descriptions.
Furthermore, Chen et al. note that since Codex has no awareness of what code means, there are no guarantees that generated code will run, be functionally correct, and not contain any security or other flaws. Considering that the training set for Codex consisted of gigabytes of code taken from GitHub without a full validation for correctness, function, or security issues, this means that whatever results roll out of the regression analysis has at most the guarantee of being as correct as code copied from a vaguely relevant StackOverflow post.
Let’s See the Code
Of note when it comes to using GitHub Copilot is that OpenAI’s Codex being based on GPT-3, it too is exclusively licensed to Microsoft, which also explains its association with GitHub, and why at least during the current technical preview phase it requires the use of the Visual Studio Code IDE. After installing the GitHub Copilot extension in VSC and logging in, your code will be sent to the Microsoft data center where Codex runs, for analysis and suggestions.
Any code suggestions by Copilot will be offered automatically without explicit input from the user. All it needs is some comments which describe the functionality of code that should follow, and possibly a function signature. When the system figures it has found something to contribute, it will show these options and allow the user to pick them.
Unfortunately, the technical preview for Copilot only provides access to a very limited number of people, so after the initial Zerg Rush following the announcement I haven’t been able to obtain access yet. Fortunately a couple of those who have gained access have written up their thoughts.
One TypeScript developer (Simona Winnekes) wrote up their thoughts after using Copilot to create a minimal quiz app in TypeScript and Chakra. After describing the intention for sections of the code in comments, Copilot would suggest code, which first involved bludgeoning Copilot into actually using Chakra UI as a dependency. Checking Copilot’s suggestions would often reveal faulty or incorrect code, which got fixed by writing more explicit instructions in the comments and picking the intended option from Copilot’s suggestions.
Over at Scott Logic, Colin Eberhardt had a very mixed experience with Copilot. While he acknowledged a few ‘wow’ moments where Copilot was genuinely somewhat useful or even impressive, but the negatives won out in the end. His complaints focused on the latency between typing something and a suggestion from Copilot popping up. This, along with the ‘autocomplete’ model used by Copilot leads to a ‘workflow’ akin to a pair programming body who seemingly randomly rips your keyboard away from you to type something.
Colin’s experience was that when Copilot stuck to suggesting 2-3 lines of code, the cognitive load of validating Copilot’s suggestions was acceptable. However, when larger blocks of code were suggested, he didn’t feel like the overhead of validating Copilot’s suggestions was worth it over just typing the code oneself. Even so he sees potential in Copilot, especially once it becomes a real AI partner programming buddy.
The most comprehensive analysis probably comes from Jeremy Howard over at Fast.ai. In a blog post titled ‘Is GitHub Copilot a blessing, or a curse?’, Jeremy makes the astute observation that most time is taken up not by writing code, but by designing, debugging, and maintaining it. This leads into the ‘curse’ part, as Copilot’s (Python) code turns out to be rather verbose. What happens to code design and architecture (not to mention ease of maintenance) when the code is largely whatever Copilot and kin generate?
When Jeremy asked Copilot to generate code to fine-tune a PyTorch model, the resulting code did work, but was slow and led to poor tuning. This leads to another issue with Copilot: how does one know that the solution presented is the most optimal one for a given problem? When digging through StackOverflow and programming forums and blogs, you’re likely to stumble over a whole range of possible approaches, along with advantages and disadvantages.
Since Copilot’s generated code goes through no such considerations, what is ultimately the true value of the generated code beyond that it passes the (auto-generated) unit test?
Evolution, Not Revolution
Also helpfully noted by Jeremy is that Copilot isn’t nearly as revolutionary as it makes itself out to be. For a number of years now there have been options like GitHub’s Semantic Code Search, Tabnine with an ‘AI assistant’ that works with a myriad of languages (including non-scripting ones), and earlier this year Microsoft released IntelliCode for Visual Studio. The common pattern here? AI-based code completion.
With this much competition already out there for GitHub’s Copilot, it’s more important than ever to realize where it fits in the development process, and how it could be adjusted to fit different development styles. Most importantly, we need to get rid of the bubbly, starry-eyed notion that these are ‘AI pair programmer buddies’. Clearly these are more akin to ambitious auto-completion algorithms, with all of their advantages and disadvantages.
Some developers love to toggle on all auto-completion features in the IDE, from brackets to function and class names so that they can practically hit Enter to generate half of their code, while others prefer to painstakingly chisel each character into the file alongside screens filled with documentation and API references. Obviously Copilot isn’t going to win over such disparate types of developers.
Perhaps the most important argument against Copilot and kin is that these are just dumb-as-bricks algorithms with zero consideration for the code they generate. With the human developer always having to validate the generated code, it would seem that the days of StackOverflow et al. aren’t quite numbered yet, and software developer jobs are still quite safe.