New technology often brings with it a bit of controversy. When considering stem cell therapies, self-driving cars, genetically modified organisms, or nuclear power plants, fears and concerns come to mind as much as, if not more than, excitement and hope for a brighter tomorrow. New technologies force us to evolve perspectives and establish new policies in hopes that we can maximize the benefits and minimize the risks. Artificial Intelligence (AI) is certainly no exception. The stakes, including our very position as Earth’s apex intellect, seem exceedingly weighty. Mathematician Irving Good’s oft-quoted wisdom that the “first ultraintelligent machine is the last invention that man need make” describes a sword that cuts both ways. It is not entirely unreasonable to fear that the last invention we need to make might just be the last invention that we get to make.
Artificial Intelligence and Learning
Artificial intelligence is currently the hottest topic in technology. AI systems are being tasked to write prose, make art, chat, and generate code. Setting aside the horrifying notion of an AI programming or reprogramming itself, what does it mean for an AI to generate code? It should be obvious that an AI is not just a normal program whose code was written to spit out any and all other programs. Such a program would need to have all programs inside itself. Instead, an AI learns from being trained. How it is trained is raising some interesting questions.
Humans learn by reading, studying, and practicing. We learn by training our minds with collected input from the world around us. Similarly, AI and machine learning (ML) models learn through training. They must be provided with examples from which to learn. The examples that we provide to an AI are referred to as the data corpus of the training process. The robot Johnny 5 from “Short Circuit”, like any curious-minded student, needs input, more input, and more input.
Learning to Program
A primary input that humans use to learn programming is a collection of example programs. These example programs are generally printed in books, provided by teachers, or found in various online samples or projects. Such example programs make up the corpus for training the student programmer. Students can carefully read through example programs and then attempt to recreate those programs or modify them to create different programs. As a student advances, they usually study increasingly complex programs and they start combining techniques discovered from multiple example programs into more complex patterns.
Just as humans learn to program by studying program code, an AI can learn to program by studying existing programs. Stated more correctly, the AI trains on a corpus of existing program code. The corpus is not stored within the AI model any more than books studied by the human programmer are stored within the student. Instead, the corpus is used to train the model in a statistical sense. Outputs generated by the trained AI do not come from copies of programs in the corpus, because the trained AI does not contain those programs. The outputs should instead be generated from the statistical model of the corpus that has been trained into the AI system.
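To make "trained in a statistical sense" concrete, here is a deliberately tiny sketch. It is not how Codex or any production system works (those are large neural networks); it is a toy bigram model, and the corpus, function names, and token markers are all invented for illustration. The point is only that the trained model holds follow-on statistics between tokens, not the snippets themselves.

```python
import random
from collections import defaultdict

def train_bigram_model(corpus):
    """Record, for every token, which tokens followed it anywhere in the corpus."""
    model = defaultdict(list)
    for snippet in corpus:
        tokens = ["<start>"] + snippet.split() + ["<end>"]
        for current, following in zip(tokens, tokens[1:]):
            model[current].append(following)
    return model

def generate(model, rng, max_tokens=20):
    """Sample a token sequence from the learned follow-on statistics."""
    token, output = "<start>", []
    for _ in range(max_tokens):
        token = rng.choice(model[token])
        if token == "<end>":
            break
        output.append(token)
    return " ".join(output)

# Hypothetical three-snippet "corpus", pre-tokenized with spaces.
corpus = [
    "for i in range ( n ) : total += i",
    "for item in items : print ( item )",
    "while n > 0 : n -= 1",
]
model = train_bigram_model(corpus)
sample = generate(model, random.Random(0))
print(sample)
```

With a corpus this small, the sampled output can easily reproduce a training snippet word for word, which is a miniature version of the verbatim-match question that comes up with real code-generating systems.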
AI Systems that Generate Code
GitHub Copilot is based on the OpenAI Codex. It uses comments in the code of a human programmer as its natural language prompts. From these prompts, Copilot can suggest code blocks directly into the human programmer’s editor screen. The programmer can accept the code blocks, or not, and then test the new code as part of their program. The OpenAI Codex has been trained on a corpus of publicly available program code along with associated natural language text. Public GitHub repositories are included in that corpus.
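As a rough picture of that workflow, the snippet below is a hand-written illustration, not actual Copilot output. The first comment plays the role of the natural language prompt, the function body stands in for the kind of block an assistant might suggest, and the assertions at the end are the programmer testing the accepted code. The function name and task are assumptions chosen for the example.

```python
# Prompt written by the programmer:
# return the n-th Fibonacci number, computed iteratively.
def fibonacci(n):
    # From here down is the sort of block a code assistant might suggest.
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

# The programmer accepts the suggestion and then tests it.
assert fibonacci(0) == 0
assert fibonacci(1) == 1
assert fibonacci(10) == 55
```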
Copilot documentation does claim that its outputs are generated from a statistical model and that the model does not contain a database of code. On the other hand, it has been found that code suggested by the AI model does match a code snippet from the training set about one percent of the time. One reason for this happening at all is that some natural language prompts correspond to a relatively universal solution. Similarly, if we were to ask a group of programmers to write C code for using binary trees, the results might largely resemble the code in chapter six of Kernighan & Ritchie because that is a common component in the training corpus for human C programmers. If accused of plagiarism, some of those programmers might even retort, “That’s just how a binary tree works.”
But [sometimes Copilot will recreate code _and comments_ verbatim](https://github.blog/2021-06-30-github-copilot-research-recitation/). Copilot has implemented a filter to detect and suppress code suggestions that match public code from GitHub. The filter can be enabled or disabled by the user. There are plans to eventually provide references for code suggestions that match public code from GitHub so that the user can look into the match and decide how to proceed.
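How such a filter might work in principle can be sketched in a few lines. This is an assumption-laden toy, not GitHub's implementation, whose internals are not described here: the function names, the whitespace tokenization, and the six-token window are all hypothetical choices. It flags a suggestion when any run of consecutive tokens also appears in a known public snippet, and it can be switched off, mirroring the user setting mentioned above.

```python
def token_windows(code, size):
    """Return the set of all runs of `size` consecutive tokens in the code."""
    tokens = code.split()
    return {tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)}

def matches_public_code(suggestion, public_snippets, size=6):
    """True if any run of `size` tokens in the suggestion appears in public code."""
    suggestion_windows = token_windows(suggestion, size)
    return any(suggestion_windows & token_windows(snippet, size)
               for snippet in public_snippets)

def filter_suggestions(suggestions, public_snippets, enabled=True):
    """Drop suggestions that match public code, unless the filter is disabled."""
    if not enabled:
        return list(suggestions)
    return [s for s in suggestions
            if not matches_public_code(s, public_snippets)]

# Hypothetical data: one "public" snippet and two candidate suggestions.
public = ["def add ( a , b ) : return a + b"]
candidates = [
    "def add ( a , b ) : return a + b",   # verbatim match, suppressed
    "def mul ( x , y ) : return x * y",   # no shared six-token run, kept
]
kept = filter_suggestions(candidates, public)
```

A real system would need a far larger index and a sturdier notion of similarity, since renaming a single variable is enough to slip past exact token matching like this.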
Is Learning Always Encouraged?
Even if it’s very rare that an AI model trained on a corpus of example code later generates code matching the corpus, we should still consider instances where the code should not have been used to train the model to begin with. There may be limits to when and which source code can be used for training AI models. Looking to the field of intellectual property, software can be protected by patent, copyright, trademark, and trade secret.
Patents generally offer the broadest protection. When a system or method practices one or more claims of a patent, it is said to infringe the patent. It does not matter who wrote the code, where it came from, or even if the programmer had no idea of the existence of the patent. Objections to software patents aside, this one is straightforward. If an AI model generates code that practices a patented method, it does not matter whether that code matches any existing code; there is a real risk of patent infringement.
Trade secret only applies in the highly pathological situation where the source code was misappropriated, or stolen, from the original owner who was acting to keep the source code secret. Obviously, stolen source code should not be used for any purpose including the training of AI models. Source code that has been published online by its author or owner is not being protected as a trade secret. Trademarks only really apply to names, logos, slogans, or other identifying marks associated with the software and not to the source code itself.
When considering AI model training, copyright concerns can be a little more nuanced. Copyright protection covers original works of authorship fixed in a tangible medium of expression including literary, dramatic, musical, and artistic works, such as poetry, novels, movies, songs, computer software, and architecture. Copyrights do not protect facts, ideas, systems, or methods of operation. Generally, studying copyrighted code and then rewriting your own code is not an infringement of the original copyright. Copyright does not protect the concepts or operations of computer code; it merely protects the specific expression or presentation of the code. Anyone else can write their own code that accomplishes the same thing without offending the copyright.
Copyright can protect computer code from being reproduced into other code that is substantially similar to the original. However, copyright does not protect against reading, studying, or learning from computer code. If the code has been published online, it is generally accepted that others are allowed to read it and learn from it. At one extreme, the concept clearly does not extend to “reading” the protected work with a photocopier to make a duplicate. So it remains to be seen if, and to what extent, the concept of being free to read will extend to “reading” the copyrighted work into an AI model.
Law and Ethics Controlling the Corpus
There is litigation pending against GitHub, Microsoft, and OpenAI alleging that the AI systems violate the legal rights of programmers who have posted code on public GitHub repositories. The lawsuits specifically point out that much of the public code was posted under one of several open-source licenses that require derivative works to include attribution to the original author, notice of that author’s copyright, and a copy of the license itself. These include the GPL, Apache, and MIT licenses. The lawsuits accuse defendants of training on computer code that does not belong to them without proper attribution, ignoring privacy policies, violating online terms of service, and offending the Digital Millennium Copyright Act (DMCA) provisions that protect against removal or alteration of copyright management information.
It is interesting to note, however, that the pending suits do not explicitly allege copyright violation. The defendants posit that any assertion of copyright would be defeated under the fair use doctrine. The facts do appear to parallel those in Authors Guild v. Google, where Google scanned in the contents of books to make them searchable online. Publishers and authors complained that Google did not have permission to scan in their copyrighted works. However, the court granted summary judgment in favor of Google, affirming that Google met the legal requirements of the fair use doctrine.
An interesting open project for the development of source code models is The Stack. The Stack is part of BigCode and maintains a 6.4 TB corpus of source code under permissive license. The project seems strongly rooted in ethical transparency. For example, The Stack allows creators to request removal of their code from the corpus.
Projects like Copilot, OpenAI Codex, and The Stack will likely continue to bring very interesting questions to light. As AI technology advances in its ability to suggest code blocks, or eventually write code itself, clarity around authorship rights will evolve. Of course, authorship rights may be the least of our worries.
12 thoughts on “Will A.I. Steal All The Code And Take All The Jobs?”
I wonder how many folks will still be annoyed if every program written with AI aid, so likely trained only on stuff under various open licenses, is always automatically copy-left licensed itself. You do lose the attribution, as you don’t want to waste time looking for all the prior programmers that contributed bits here and there, but at least the results are supposed to be available – so the benefits of open-source the programmers obviously believed in enough to publish under remain.
There is still the chance of the AI killing off folks’ jobs, which is perhaps as much the reason for kicking up a fuss, and it doesn’t mean all the prior licences will be entirely complied with, since the GPL versions, MIT, etc. are similar but not the same as each other. But at least it stays permissive, which has to be better than taking stuff that should be copy-left style only and using it incorrectly…
Chinese AI is already stealing ALL the code.
Only if AI advances to a point where it can innovate.
I’ve tried OpenAI to solve some really obscure problems and it didn’t come up with one single usable answer. I guess all it can do well is generate code for checklists, a dvd or movie database etc because those are the most common examples on the internet. As soon as you try something that’s off the beaten track it’ll produce nonsense.
But it’s great whenever it does give you proper answers, although chances are that what you’re trying to do is reinventing the wheel and you should use a library instead. It would be great if the AI saw you doing things and suggested a library that not only does exactly what you want, but does it much better and more securely than the average programmer can do him/herself.
I think AI is great for doing the menial and repetitive tasks of programming like refactoring, filling in scaffolding, searching for common mistakes and security holes etc so that you can focus on solving the actual problems that AI have never seen before and make a mess of trying to solve.
An AI can’t really do inspired thinking. Like, this piece of metal, bent with several sharp turns is something to hold papers together, then think of unbending it to poke in a hole to unlock a dvd drive for example, not unless it has come across several references in its training set.
Agreed. Now…I do think AI poses severe dangers if not vigorously restrained and controlled. A Navy funded study, a few years back, said much, recognizing possible “Skynet” scenarios.
However, I do not believe AI will ever be truly conscious or self-aware…even if capable of pretending it is.
If you are training it on github etc examples willy-nilly, you’re training it with bugs.
Garbage in, garbage out.
Maybe it won’t be code.
The thing was that people in the past said the low skilled low paid jobs would be replaced with robots.
Now that does happen but robots cost money initially and on going.
Replacing a job with AI has a much higher ROI given that the job you replace is likely desk based creating digital things and likely higher paid.
So the idea that the blue collar / lower classes will all lose their jobs because of automation is rapidly turning into it being the white collar / middle classes.
Suddenly universal income got a lot more appealing.
A parasite, once attached to a reliable host, has no need or incentive to do anything further.
If you think guaranteed income will “free” us to suddenly become a nation of 300 million artists and craftsmen, think again.
In the end, “universal income” is a phrase used to sound brainy when in fact one is really proposing the dumb idea of paying people to sit on their rears. Its arrival in any wide-sweeping form will mark the beginning of the end of civilization.
Don’t look now; we’re already paying people to sit on their rears.
Under the current regime, you may have significant benefits as an unemployed person that you would lose the moment you do find a job. There’s actually incentive to NOT work, in many cases.
UBI would not be conditional on employment, so any small amount you earn over UBI would only contribute to your well being, not disqualify you for existing benefits. And UBI should be set very near subsistence levels; it SHOULD be uncomfortably close to poverty. If someone is ok living at that level, I’m here to tell you they’re already living off of welfare.
my reaction to the headline:
Gosh, I hope so!
Historically, technology exists specifically to increase productivity. One person with an ox can plow more than one person by hand. And with a tractor instead of an ox, a person is more productive still. End result, far fewer people work on farms than a century ago (yes, that was only 1923, and they had tractors back then).
Repeat this for every industry. From textiles to construction. A person operates a machine that does the labor of many. Ideally this increase in productivity yields a higher standard of living.
So why shouldn’t doctors be able to handle more patients in the same amount of time? Or programmers debug or write more software? Or engineers design more chips in a smaller area, at lower power, and in less time?
That said, current so-called AI is garbage. What it outputs is basically useless. I can’t use its output to publish whitepapers. I can’t take the code and use it in a production environment. I can’t even effectively use an AI generated test plan because I still have to make sure it really did what it was asked and covers my original requirements.
Ultimately AI is going to be trained to pretend to do its job, like your laziest and most irresponsible employee. It will be rewarded time and time again for meeting the letter of the requirement without grasping the spirit and without going above and beyond what is required. The only reason AI will have a job is because it’s free; any employee like that you would have fired right away.
A recent article (https://www.theregister.com/2023/02/06/uh_oh_attackers_can_extract/) covered research showing that generative AI was capable of regurgitating degraded but clearly recognisable copies of images from the training data. I think results like this make it very hard to argue that a trained ML model is not a derivative work of the training set.
It will be interesting to see which way the courts go here because there doesn’t seem to be much difference between feeding copyright material into an AI and getting a slightly degraded copy of the original back, and feeding copyright material (say a movie) into a different mathematical model (say the H.264 algorithm) and getting a slightly degraded copy of the original back.
What I have seen so far seems to be training on bad examples. I think we are going to have a case of severe inbreeding, sitting with moron AIs three generations from now, each training on the gobbledygook from the previous generation.