GitHub Copilot And The Unfulfilled Promises Of An Artificial Intelligence Future

In late June of 2021, GitHub launched a ‘technical preview’ of what they termed GitHub Copilot, described as an ‘AI pair programmer which helps you write better code’. Quite predictably, responses to this announcement ranged from glee at the glorious arrival of our code-generating AI overlords, to dismay and predictions of doom and gloom as companies would surely soon be firing their software developers en masse.

As is usually the case with such controversial topics, neither of these extremes is even remotely close to the truth. In fact, the OpenAI Codex machine learning model which underlies GitHub’s Copilot is derived from OpenAI’s GPT-3 natural language model, and it features many of the same stumbles and gaffes which GPT-3 has. So if Codex, and with it Copilot, isn’t everything it’s cracked up to be, what is the big deal, and why show it at all?

The Many Definitions of AI

Baker Library at Dartmouth College. (Credit: Gavin Huang, CC BY 3.0)

The first major attempt at establishing a true field of artificial intelligence was the Dartmouth workshop in 1956. It saw some of the foremost minds in the fields of mathematics, neuroscience, and computer science come together to essentially brainstorm on a way to create what they termed ‘artificial intelligence’, a name chosen to supersede the then more common labels of ‘thinking machines’ and automata theory.

Despite the hopeful attitude during the 1950s and 1960s, it was soon acknowledged that artificial intelligence was a much harder problem than initially assumed. Today, AI capable of thinking like a human is referred to as artificial general intelligence (AGI) and remains firmly in the realm of science fiction. Much of what we call ‘AI’ today is in fact artificial narrow intelligence (ANI, or narrow AI), encompassing technologies that approach aspects of AGI, but which are generally very limited in their scope and application.

Most ANIs are based on artificial neural networks (ANNs), which roughly copy the concepts behind biological neural networks such as those found in the neocortex of mammals, albeit with major differences and simplifications. Whether classical feed-forward networks, recurrent NNs (RNNs), or the Transformer-based networks behind GPT-3 and Codex, ANNs are programmed during training using backpropagation, a process that has no biological analog.

Essentially, models like GPT-3 are curve-fitting models: they use regression analysis to match a given input against their internal data points, which are encoded in the weights assigned to the connections within the network. This makes NNs at their core mathematical models, capable of efficiently finding probable matches within their network of parameters. When it comes to GPT-3 and similar natural language synthesis systems, their output is therefore based on probability rather than understanding, and as with any ANN, the quality of this output is highly dependent on the training data set.
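
To make ‘probability rather than understanding’ concrete, consider the toy Python sketch below. The token scores are made-up numbers rather than real model output, but the mechanics are the same: the scores are turned into a probability distribution over possible next tokens and a likely continuation is emitted, with no notion of whether the resulting text or code is correct.

    import math
    import random

    # Hypothetical raw scores ("logits") for a few candidate next tokens.
    logits = {"return": 2.1, "print": 0.3, "import": -1.0}

    # Softmax turns the raw scores into a probability distribution.
    total = sum(math.exp(v) for v in logits.values())
    probs = {token: math.exp(v) / total for token, v in logits.items()}

    # The "suggestion" is simply the likeliest (or a sampled) continuation;
    # nothing here knows or cares whether the completed code would work.
    print(max(probs, key=probs.get))
    print(random.choices(list(probs), weights=list(probs.values()))[0])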

Garbage In, Garbage Out

The historic Pioneer Building in San Francisco, home to OpenAI and Neuralink. (Credit: HaeB, CC BY-SA 4.0)

All of this means that an ANN is not capable of thought or reasoning and is thus not aware of the meaning of the text which it generates. In the case of OpenAI’s Codex, it has no awareness of what code it writes. This leads to the inevitability of having a human check the work of the ANN, as also concluded in a recent paper by OpenAI (Mark Chen et al., 2021). Even though Codex was trained on code instead of natural language, it has as little concept of working code as it has of proper English grammar or essay writing.

This is borne out by the FAQ on GitHub’s Copilot page as well, which notes that when asked to fill in the body of a blanked-out function, Copilot got it right 43% of the time on the first attempt, and 57% of the time when given ten attempts. Mark Chen et al. tested the Python code generated by Codex against prepared unit tests, and showed that different versions of Codex produced correct code significantly less than half the time across a wide variety of inputs, ranging from interview questions to docstring descriptions.
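
This style of functional-correctness testing is straightforward to picture. The Python sketch below uses a hypothetical prompt, candidate completion, and unit test (none of them taken from the actual benchmark): the candidate is assembled, executed, and either passes the prepared test or it does not.

    # Hypothetical prompt (signature plus docstring) as handed to the model.
    PROMPT = ('def add_one_to_each(numbers):\n'
              '    """Return a new list with every element incremented by one."""\n')

    # A body a model might plausibly suggest for that prompt.
    CANDIDATE_BODY = '    return [n + 1 for n in numbers]\n'

    def passes_unit_test(prompt, body):
        namespace = {}
        exec(prompt + body, namespace)           # assemble and load the candidate
        func = namespace["add_one_to_each"]
        try:
            assert func([1, 2, 3]) == [2, 3, 4]  # the prepared unit test
            assert func([]) == []
            return True
        except AssertionError:
            return False

    print(passes_unit_test(PROMPT, CANDIDATE_BODY))  # True only if the test passes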

Furthermore, Chen et al. note that since Codex has no awareness of what code means, there are no guarantees that generated code will run, will be functionally correct, or will be free of security and other flaws. Considering that the training set for Codex consisted of gigabytes of code taken from GitHub without full validation for correctness, functionality, or security issues, whatever rolls out of the regression analysis comes with, at most, the guarantee of being as correct as code copied from a vaguely relevant StackOverflow post.

Let’s See the Code

Of note when it comes to using GitHub Copilot is that OpenAI’s Codex, like the GPT-3 model it is based on, is exclusively licensed to Microsoft. This also explains its association with GitHub, and why, at least during the current technical preview phase, it requires the use of the Visual Studio Code IDE. After installing the GitHub Copilot extension in VSC and logging in, your code will be sent to the Microsoft data center where Codex runs, for analysis and suggestions.

Code suggestions from Copilot are offered automatically, without explicit input from the user. All it needs is a comment describing the functionality of the code that should follow, and possibly a function signature. When the system decides it has something to contribute, it shows its options and lets the user pick one.
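
As a purely hypothetical illustration of that prompt style (the function below and its suggested body are made up for this article, not actual Copilot output), the ‘input’ amounts to a comment plus a signature, and the suggestion is ordinary code that still has to be reviewed by a human:

    # Parse a string like "2021-06-29" and return the year, month, and day as ints.
    def parse_iso_date(date_string):
        # A plausible machine suggestion for the body; it happily ignores
        # edge cases (bad separators, missing fields) unless told otherwise.
        year, month, day = date_string.split("-")
        return int(year), int(month), int(day)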

Unfortunately, the technical preview for Copilot only provides access to a very limited number of people, so after the initial Zerg Rush following the announcement I haven’t been able to obtain access yet. Fortunately a couple of those who have gained access have written up their thoughts.

“I doubt I will ever use the GitHub Copilot on a daily basis, definitely not in professional settings like work for a client or while employed.” -- Simona Winnekes

One TypeScript developer (Simona Winnekes) wrote up their thoughts after using Copilot to create a minimal quiz app in TypeScript and Chakra. After describing the intention for sections of the code in comments, Copilot would suggest code, which first involved bludgeoning Copilot into actually using Chakra UI as a dependency. Checking Copilot’s suggestions would often reveal faulty or incorrect code, which got fixed by writing more explicit instructions in the comments and picking the intended option from Copilot’s suggestions.

Simona’s findings were that while Copilot works with JavaScript, Python, and TypeScript, and can help when writing repetitive code or unit tests, the generated code needed constant validation, and Copilot would often refuse to use desired modules and dependencies. The generated code had a distinct ‘stitched together’ feeling to it as well, lacking the consistency expected from a human developer. Ultimately, writing the quiz app by hand took Simona about 15 minutes, while humoring the Copilot AI buddy stretched it to two hours. Enthusiasm for continued use of Copilot was understandably low after this experience.

“I think it’s going to be a little longer before Copilot delivers a genuine productivity boost. But I am convinced that this is coming.” -- Colin Eberhardt

Over at Scott Logic, Colin Eberhardt had a very mixed experience with Copilot. While he acknowledged a few ‘wow’ moments where Copilot was genuinely somewhat useful or even impressive, the negatives won out in the end. His complaints focused on the latency between typing something and a suggestion from Copilot popping up. This, along with the ‘autocomplete’ model used by Copilot, leads to a ‘workflow’ akin to having a pair programming buddy who seemingly randomly rips the keyboard away from you to type something.

Colin’s experience was that when Copilot stuck to suggesting 2-3 lines of code, the cognitive load of validating its suggestions was acceptable. However, when larger blocks of code were suggested, he didn’t feel the overhead of validating them was worth it over just typing the code himself. Even so, he sees potential in Copilot, especially once it becomes a real AI pair programming buddy.

“Copilot might be more useful for languages that are high on boilerplate, and have limited meta-programming functionality, such as Go.” -- Jeremy Howard

The most comprehensive analysis probably comes from Jeremy Howard over at Fast.ai. In a blog post titled ‘Is GitHub Copilot a blessing, or a curse?’, Jeremy makes the astute observation that most time is taken up not by writing code, but by designing, debugging, and maintaining it. This leads into the ‘curse’ part, as Copilot’s (Python) code turns out to be rather verbose. What happens to code design and architecture (not to mention ease of maintenance) when the code is largely whatever Copilot and kin generate?

When Jeremy asked Copilot to generate code to fine-tune a PyTorch model, the resulting code did work, but it was slow and led to poor tuning. This leads to another issue with Copilot: how does one know that the presented solution is the optimal one for a given problem? When digging through StackOverflow and programming forums and blogs, you’re likely to stumble over a whole range of possible approaches, along with their advantages and disadvantages.

Since Copilot’s generated code goes through no such considerations, what is ultimately the true value of the generated code beyond that it passes the (auto-generated) unit test?
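
For context, the ‘fine-tuning’ in Jeremy’s experiment boils down to something like the toy PyTorch sketch below, with a hypothetical stand-in model and data set rather than his code or Copilot’s. Even this minimal version involves choices, such as which layers to freeze, which optimizer to use, and what learning rate to pick, and those are exactly the trade-offs that a generated snippet silently decides for you.

    import torch
    from torch import nn
    from torch.utils.data import DataLoader, TensorDataset

    # Stand-ins for a pretrained backbone and a small labeled dataset.
    model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
    data = DataLoader(TensorDataset(torch.randn(256, 20),
                                    torch.randint(0, 2, (256,))), batch_size=32)

    # One choice among many: freeze the backbone and train only the head.
    for p in model[0].parameters():
        p.requires_grad = False

    opt = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(3):                      # epoch count: another silent choice
        for xb, yb in data:
            opt.zero_grad()
            loss_fn(model(xb), yb).backward()
            opt.step()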

Evolution, Not Revolution

Also helpfully noted by Jeremy is that Copilot isn’t nearly as revolutionary as it makes itself out to be. For a number of years now there have been options like GitHub’s Semantic Code Search, Tabnine with an ‘AI assistant’ that works with a myriad of languages (including non-scripting ones), and Microsoft’s own IntelliCode for Visual Studio. The common pattern here? AI-based code completion.

Example of Microsoft’s Visual Studio IntelliCode ‘AI-assisted development’.

With this much competition already out there for GitHub’s Copilot, it’s more important than ever to realize where it fits in the development process, and how it could be adjusted to fit different development styles. Most importantly, we need to get rid of the bubbly, starry-eyed notion that these are ‘AI pair programmer buddies’. Clearly these are more akin to ambitious auto-completion algorithms, with all of their advantages and disadvantages.

Some developers love to toggle on all auto-completion features in the IDE, from brackets to function and class names so that they can practically hit Enter to generate half of their code, while others prefer to painstakingly chisel each character into the file alongside screens filled with documentation and API references. Obviously Copilot isn’t going to win over such disparate types of developers.

Perhaps the most important argument against Copilot and kin is that these are just dumb-as-bricks algorithms with zero consideration for the code they generate. With the human developer always having to validate the generated code, it would seem that the days of StackOverflow et al. aren’t quite numbered yet, and software developer jobs are still quite safe.

36 thoughts on “GitHub Copilot And The Unfulfilled Promises Of An Artificial Intelligence Future”

  1. This is not bright for new projects since you’re putting all of your capital expense write-offs into the post-deployment issue log and then wasting more time since nobody knows what’s going on. If the model trained with the project devs, learned to finish their sentences like a best friend, and kept its mouth shut until deployment, it might find some use with the support team since they often have no idea what they’ve just been thrown into.

    I shall call my tool “mini me.”

    1. Yeah, that’s been a thing in Visual Studio for a pretty long time now. It’s called Intellicode. It’s like Intellisense except it trains on the code you’re actually working with and starts predicting snippets you often type, modified for the current context. It’s pretty great.

      Before complaining that a tool doesn’t do what you think it should, learn what tools are already available.

      1. We also have compile-time reflection, compile-time source generation, macros, scaffolding, etc etc etc. This thing is kind of worse than just copy-pasting random stuff, since at least someone looked at a general question and response list to pick one. Signing off on code really needs to be treated like signing off on an audit, unless your name is Arthur Andersen.

        If it learned off of the engineering team then it would help level 2 support or whatever fix the code with the autocomplete behavior of the team and possibly avoid escalation.

        The one tool that I wish we did have is a plugin for browsers that detects someone acting like a jerk, and offers a minor shock when they click submit. Maybe it could disable network connections for a month as well. You could easily know the kind of person you’re dealing with by the torched scars on their hands.

  2. So they grabbed all the GitHub GPL/BSD/??? code and used that to train a tool to help generate code for commercial and open source projects. With co-pilot, now the provenance of every single line of code and its original license status is critical for the legal use of the tool.

    For BSD licensed code this is legally no problem. But if BSD source code is then locked up in a GPL 3.0 straitjacket, that is a major disrespect to the developers of the code: their code, which previously could be used in all commercial products with no question of copyright violations, may be tarnished by also being shoved into GPL 3.0 projects.

    Now if they had trained it on GPL code and had it co-pilot only GPL projects, that would be OK.
    And trained a second one on BSD code only and had that one co-pilot for commercial and BSD projects, that would be legally OK.

    There may be trouble ahead for anyone who uses co-pilot in its current incarnation.

      1. Bye bye total freedom, hello GPL jail cell. If the developers wanted that they would not have used the BSD license; they wanted total freedom, so they chose the BSD license. The GPL 3.0 license in particular is like an infectious virus: anything touched by it can only ever be GPL 3.0 from then on. The person would be stealing BSD licensed code with no way to ever give anything back to repay the moral debt.

        Let’s look at it the other way: let’s pick the magic number “0x5F3759DF”
        (e.g. https://github.com/id-Software/Quake-III-Arena/blob/master/code/game/q_math.c#L561 – GPL-2.0 License )
        Now if you copied the function verbatim and used it in a commercial product, technically you would be in breach of copyright.

        But if you dig into the history of the “fast inverse square root” and use only the maths, you would be fine, since mathematical and chemical formulas aren’t protected by copyright or patents!
        (ref: https://en.wikipedia.org/wiki/Fast_inverse_square_root#History )
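
        For illustration only, a reimplementation from the published maths might look something like the Python sketch below, written from the description in that history article rather than copied from the GPL source:

            import struct

            def fast_inverse_sqrt(x):
                # Reinterpret the single-precision bits of x as an unsigned integer.
                i = struct.unpack("<I", struct.pack("<f", x))[0]
                # The magic constant yields a first approximation of 1/sqrt(x).
                i = 0x5F3759DF - (i >> 1)
                y = struct.unpack("<f", struct.pack("<I", i))[0]
                # One Newton-Raphson iteration refines the estimate.
                return y * (1.5 - 0.5 * x * y * y)

            print(fast_inverse_sqrt(4.0))  # roughly 0.5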

      2. This makes the sweeping assumption that the person who clicked the button on GitHub saying the code is GPL has the legal right to do so.

        There is loads of code on GitHub with very dubious pedigree which is often marked as GPL.

    1. If it gives code samples with the license intact then there’s little difference from just going to GitHub and copying it off. The programmer’s responsibility remains the same.

          1. No, it’s definitely not okay. Even code that Copilot “generates itself” must be license encumbered: there’s no way for Copilot to know what the code actually does (the training corpus has no outputs), so the work has to be derivative. It can’t possibly be considered a reimplementation.

            It’s a terrible idea by GitHub trying to monetize their data with no effort. Of course, I’m sure someone will find a way to get it declared legal, but there’s no way it should be.

    2. How about this for an idea? I take some BSD licensed code, and make a GPL derivative project that uses it.

      Those who use _my_ code, may only do so under the GPL (whatever version I selected). The original BSD-licensed code is still permitted to be used, as BSD-licensed code, for whatever purpose the BSD license permits.

      IF, and only IF, I make changes to that BSD-licensed code, I do have the right to contribute those changes BACK to the BSD-licensed project: the up-stream contribution is BSD-licensed. Then IF accepted, THAT code contained in the contribution is available to ALL users of that BSD library or project under the BSD license. GPL restrictions do not apply here; _other_ users of that same BSD-licensed code are not affected.

      If I do not make changes to the BSD library, there is nothing to contribute, and _other_ users of that same BSD-licensed code, are not restricted by the GPL in any way shape or form.

      You are free to call yourself “Truth”, but it’d be a good idea if you’d post some occasionally.

  3. This sounds like a steroid-enhanced version of the same faulty premise that has always plagued developer “productivity aids”: to wit, the assumption that the hard part of coding is *typing.*

    The hard part of coding is holding the design in your head. If code is generated for you, that task becomes harder, not easier. If you have to pay attention to what your tools are doing, that makes it harder. If your flow is interrupted, that makes it harder.

    IMO even autocomplete is counterproductive most of the time; the idea of having Clippy just yelling drunken syntax errors at me while I’m trying to focus sounds like a living hell. I can only assume this is a case of a solution looking for a problem, possibly combined with tool developers assuming their users are idiots (because that’s definitely a common theme).

    1. I will agree, most typing aids aren’t actually that important.

      Keeping track of the project and its associated mechanisms is the far more challenging part. Even when splitting a project into many sub functions that can each be handled individually.

      Then there is reading up on documentation for how various libraries and APIs work. Getting a snippet of code isn’t really any help when the commands within it are frankly cryptic unless one has a good understanding of the library/API in question; even clear commands can have slight nuances that impact the code in considerable ways, which one wouldn’t know without having studied the documentation.

      1. Communication is key. The issue exists for every service relationship, where people in the know will have a set of skills and common approaches, and a user/customer who has a wish.

        Sometimes the service provider is inflexible, sometimes the user, sometimes the problem itself is not solvable.

        Essentially you just add another barrier to a successful implementation: communication between people.

        (Though obviously exchange and communication can be useful, in general).

  4. I have heard people talk about the copilot feature for some time now. I myself don’t see it as all that useful to be fair.

    Let’s first of all just skip past the whole debate around the licensing of the code snippets, and the morality of taking millions of developers’ code and just plagiarizing it through an AI system that seemingly doesn’t do much on its own.

    My question is rather, is this type of tool even all that useful for a developer?

    I myself at least don’t consider random poorly commented snippets of code as useful in the slightest. Nor does it help speed up programming either, since I prefer having a decent understanding of the code being used.

    I have heard some people say that Copilot is going to be revolutionary for new programmers, but I couldn’t disagree more with that statement, since a new programmer has no clue what the snippet of code actually does, nor understands the nuances within it.

    One thing I think is far more important for programmers in general is the documentation surrounding various libraries, APIs, and so forth. A lot of documentation is in general fairly abhorrent, usually so terse and to the point that the information given is, in a lot of cases, meaningless to people who don’t have a deep understanding of the thing being documented. Tutorials are rarely better and usually end up skipping a lot of important details, since obviously one should already know that when dealing with the library/API in question….

    I will, however, accept that it is hard to make actually good documentation that is informative while not blabbering on forever about surrounding fluff.

    But in the end, most of my time programming is not spent on typing code, but rather reading documentation for the libraries and APIs that are used in the project. And copilot doesn’t help with this in the slightest.

  5. They should have instead focused on auto generated unit tests from the code that a human wrote.

    However, the value of even that is dubious at best, since many developers ignore unit tests anyway.

    Most of the value would be from the human looking at the test code and realizing that it shouldn’t pass as written.

    It would also be nice when refactoring to have tests that would indicate where behavior changed (un)intentionally.

    Note: I wouldn’t be surprised to hear that those tools already exist, but I have yet to use them, so, they’d be new to me.
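
    Tools in roughly this spirit do exist, by the way: property-based testers such as Hypothesis generate the test inputs automatically, although a human still has to state the properties. A minimal sketch, where my_sort is just a stand-in for the human-written code under test:

        from hypothesis import given, strategies as st

        def my_sort(xs):
            # Stand-in for the human-written code under test.
            return sorted(xs)

        @given(st.lists(st.integers()))
        def test_my_sort(xs):
            out = my_sort(xs)
            assert all(a <= b for a, b in zip(out, out[1:]))  # output is ordered
            assert sorted(xs) == out                          # no elements lost or added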

  6. > All of this means that an ANN is not capable of thought or reasoning and is thus not aware of the meaning of the text which it generates.
    I don’t see how this follows from what you said. We know that language models are at least capable of things like arithmetic on problems not from their training data (GPT-2/3 are hampered by the BPE tokenization they use, which means the model doesn’t see numbers as digit sequences but as weirdly scrambled chunks which may contain multiple digits, yet they can still do a little). They can’t do a lot of things humans can, but there are several orders of magnitude of scale difference.

  7. If you’re going to write on a subject, be knowledgeable. Perhaps run it by someone that is knowledgeable when you’re not. AGI has nothing to do with human-like intelligence. AGI simply is a machine that is capable of performing any task a human can. Human-like intelligence is not a prerequisite. Feynman understood this 40 years ago. We need not simulate a thing, in order to emulate it. Planes do not fly like birds, AI need not have intelligence like humans. Odds are, AI will *never* have human-like intelligence, as it’s an entirely different substrate with an entirely unique evolutionary path. That’s a good thing, as most humans aren’t exactly the models I’d like an AGI or ASI to be modeled on.

    1. There is also Explainable Artificial Intelligence which has been a big topic in self-driving cars, because they should be able to prove they are safe, and if they fail, why they failed.

      In coding this becomes even more necessary. You shouldn’t use a piece of code that you don’t understand, unless it’s a well tested and validated library.

      For that very reason the AI should write code that is understandable or be able to explain every little detail.

  8. This looks like a real-life implementation of the joke about the StackSort algorithm, where a script would search Stack Overflow for code claimed to sort a list, run each claimed set of code, and see if it actually sorted the data.

  9. Fundamentally this shouldn’t be legal. At all. No way.

    Training a network on copyrighted (even copylefted) data in such a way that the network can reproduce the data with the right prompt is obviously copying. It’s like a network that’s fed songs, and then you ask “I want a song like the Beatles” and it dumps out Yesterday.

    Using a network for recognition is one thing, but there’s no way this is okay unless the code’s completely public domain or this tool provides *all* the licenses of its content.

      1. No, it’s not the same thing. Humans learn with *context* – the knowledge of the behavior of the code and its inputs/outputs. Here you’re specifically missing that (because you don’t feed the code’s inputs and outputs in). It’s just the text, which is definitely copyrighted.

        It’s not even like figuring out arithmetic from natural language because that also contains the context in the form of responses.

        Being handed a corpus of code with no ability to see it run (and no prior knowledge of code!) means it’s impossible to extract the ideas and reimplement. You couldn’t do clean-room reengineering here: there’s no way to confirm your “new” behavior works.

      1. Of course humans do it. But if I work for a company, memorize all their code, leave, and duplicate it, obviously that’s infringement.

        The entire idea of extracting the idea of the code from the implementation *requires* that you know what the code does, which means you *have* to run it. This doesn’t: so it cannot be extracting the ideas.

  10. I’m not sure people are paying enough attention to this part:

    “After installing the GitHub Copilot extension in VSC and logging in, your code will be sent to the Microsoft data center where Codex runs, for analysis and suggestions…”

    That means people who work on proprietary code with this stuff are literally giving away company secrets. Kudos to the author for making this critical point.
