A recent change to the NetBSD commit guidelines amends them to state that code generated by Large Language Models (LLMs) or similar technologies, such as ChatGPT, Microsoft’s Copilot, or Meta’s Code Llama, is presumed to be tainted code. The amendment extends the existing section on tainted code, which already covered any code not written directly by the person committing it, and which exists because of licensing concerns. The obvious reason behind this is that otherwise code may be copied into the NetBSD codebase which may have been licensed under an incompatible (or proprietary) license.
In the case of LLM-based code generators like those mentioned above, the problem stems from the fact that they are trained on millions of lines of code from all over the internet, released under a wide variety of licenses. Invariably, some of that code will be covered by a license that’s not acceptable for the NetBSD codebase. The guideline does note that such auto-generated commits may still be admissible, but they require written permission from core developers, and presumably an in-depth audit of the code’s heritage. This should leave non-trivial commits churned out by ChatGPT and kin out in the cold.
The debate about the validity of works produced by current-gen “artificial intelligence” software is only just beginning, but there’s little question that NetBSD has made the right call here. From a legal and software engineering perspective this policy makes perfect sense, as LLM-generated code simply doesn’t meet the project’s standards. That said, code produced by humans brings with it a whole different set of potential problems.
The history of BSD is well known; why would they risk that nightmare again?
The only thing people using the tool would be doing is validating what was blatant theft of code. If Microsoft was serious, they would have trained it on their own source code, whose licensing they actually control, and then given it away free for everyone to use. But even they do not trust it.
I wouldn’t trust it if it was based on their code.
Good luck with explaining that to their paying customers.
“The obvious reason behind this is that otherwise code may be copied into the NetBSD codebase which may have been licensed under an incompatible (or proprietary) license.”
Nothing to do with correctness of generated code?
That’s part of the standard review policies. This is specifically addressing the potential legal ramifications of using LLM-generated code.
While that’s certainly an issue, the principal risks apply whether or not the code in question is correct (c/p from a post I made on fedi about three minutes ago):
* AI-produced content has no authorship by a natural person, and the Copyright Office (and in turn the courts) have consistently held that without authorship by a natural person, a work is ineligible for Copyright protection. Without the legal mechanisms afforded by Copyright protection, licenses cannot be enforced.
* There is as yet no legal consensus as to whether the Copyright status of works included in the training corpus of a “generative” model affects, at all, the legality of the generated output.
Licensing generates lawsuits. “Correctness” usually does not. Most code today was made by programmers not fit to shine the shoes of the ones from several decades ago, yet they are human, all too human.
Next step is that no commits will be allowed from any coder who has read anything more than K&R or who hasn’t lived in a white-walled, windowless cubicle since age 14.
Finally! My chance to shine!
Rise of The Planet of The Code Monkeys
Cubicles are already a luxury, Open Spaces are one shared cubicle for everyone.
We only use code from free range programmers, who eat natural grass and code in the sunshine.
Perhaps you could say that a human brain is much like an organic LLM. We absorb large amounts of information all day, weigh it for quality and store it in our biological neural network. Much of the information we process is probably copyrighted (one way or another), but luckily that doesn’t prevent us from still using that information internally to create code and other things.
Biological and artificial LLMs work in the same way and are trained on absurd amounts of training data. The difference is that biological LLMs are part of humans, so that every line of code generated by a human comes with a valuable contribution to society, and a human can only generate a limited amount of code. Also, training a biological LLM is called reading, and is how text and open source code are intended to be used. The amount of code generated by artificial LLMs is limited only by the natural resources (materials and energy) used to build and run the infrastructure supporting the LLM, without bringing a contribution to society. Artificial LLMs are also not limited by the speed at which they can read and type. All of this, in my opinion, makes artificial LLMs unfair competition for humans. But that is a complicated philosophical question.
Brains don’t work like LLMs – at all.
“Brain is a computer” has set us back.
Some people’s brains do.
Dumb AF people will just string together word salad, just like an LLM.
But enough about politics.
VPs exist to keep people from shooting the boss.
The right wingest rightwinger would take a bullet for Biden.
A commie would have taken a bullet to keep Cheney out of office.
An LLM has zero idea what an algorithm or the code expressing it means, and no ability to find out. Even if just cut-and-pasting code from Stack Overflow, a human has at least the capability – even if not the desire – to review and understand it.
An AI/LLM is perfectly capable of identifying what the code is intended to do, finding bugs in the code, and improving or extending it. Therefore, it is capable of reviewing code. Whether you would call that ‘understanding’ ventures into philosophical terrain, but from a functional perspective, it doesn’t really matter.
Sorry, none of this adds up to meaningful differences between brains and LLMs (of which there are many, by the way). All you are saying is that humans are not massively parallel enough, and that we just need more and more of them. LLMs can only generate a limited amount of whatever before they give out as well; hardware is not immortal. You can just install the same software on other hardware? You can do that with humans as well, and as a bonus the hardware actually tends to assemble its own replacements.
This is just another restatement of the fact that most AI is actually just a low-quality crowdsourcing operation based out of India until the VC check goes through. AI sucks because its quality is still low and it requires handholding from somebody who actually knows what it is meant to be doing. The same is true for the largest percentile of humans.
Humans can comprehend and experiment, and generate code that’s completely novel to the individual, even if it’s just by virtue of being code monkeys on ~~typewriters~~ mechanical keyboards.
An LLM, on the other hand, can only ever generate code directly based on its training data. And an LLM is guaranteed to have seen code with incompatible licences, whereas a human could theoretically become a programmer without ever having dealt with a less permissive licence.
Your comment raises an important point about the creative and experimental nature of human programmers compared to AI/LLMs. But it’s worth noting that an AI can sometimes demonstrate a form of creativity or innovation that surprises even its creators. An early example of this is AlphaGo, developed by Google’s DeepMind. AlphaGo, an AI designed to play the board game Go, famously made a move during a match against a top human player that was considered highly unconventional and unexpected. Move 37 in game two of AlphaGo’s series against Lee Sedol was initially deemed a mistake by many experts. However, it later became clear that this move was a brilliant strategic play that contributed significantly to AlphaGo’s victory. This instance showed that an AI could come up with novel approaches that were not directly taught or explicitly present in its training data.
In the context of coding, while LLMs generate code based on patterns learned from their training data, they can combine and apply these patterns in ways that may seem innovative or unexpected. Just as AlphaGo’s novel move wasn’t a direct replication of its training data but a unique application of its learned strategies, LLMs can sometimes produce code solutions that appear novel and inventive. So, while humans indeed bring a unique ability to comprehend, experiment, and create, dismissing AI’s potential for innovation might overlook the nuanced ways in which an AI can surprise us with its outputs.
Legally speaking, humans and LLMs are very different, which is what matters for licensing.
So long as it’s audited by a human and doesn’t have blatantly proprietary stuff in it, who cares where the code came from? Is some unscrupulous patent troll going to claim “for… next” as proprietary? That got shot down a long time ago, when that guy put transistors in a windshield-wiper delay and GM or Ford claimed just doing that was proprietary – to no avail for them.
The Copyright Office (currently) cares.
Output from an LLM is not eligible for protection under Copyright – this under the established principle that Copyright can only be granted to works of ‘human authorship.’
Who’s going to know?
Well, unlike the monkey case, humans are everywhere in the chain that makes up an LLM, and they are more involved than the camera owner was.
Just to say: the artwork here is gorgeous. I love how angry the programmer is, and the cat ears on the AI squiddy.
Nod to the Matrix machines seen during the battle for Zion.
No doc on how this artwork was made? Which LLM and training data were used?
If I ask ChatGPT a question, and it answers “yes”
– does that mean it might be a tainted answer and I need to get permission from all possible copyright holders?
Yes
Smart move. AI’s biggest inherent risk right now is its legality. Any company seeking to utilize AI faces possible implosion: LLMs could become illegal given the copyrighted data in their training sets. Why people want to invest in this is beyond me. It is so obscenely unstable an ecosystem that it reminds me of the early Napster era: download free music, why, who cares, just do it.
Seriously, how have there not been huge copyright infringement lawsuits against AI companies yet?
There are several ongoing.
Just wondering, how would they tell whether it’s generated code or just poorly written?
Who hasn’t at some point or another relied upon or copied code from a website, book, or another developer? Is any software development actually devoid of all prior copyrighted work?
I think that to address the concern here, all LLMs must be scrapped and retrained with new training data completely devoid of anything copyrighted or from an unknown source. Is that even possible? Is the use of website text and (nearly) all printed text off limits for generating character and word frequency?
It seems lawyers and ethicists are going to have a field day for the next couple of decades.
With code, what is permitted as fair use? I use LLMs all the time to rough ideas out, and the result pretty much looks like what I might write. Sometimes it comes back with a cleverer way to do something; often I can point out that it is being silly and there is an easier/faster way. The thing is, I do not see it pulling the whole of anything out of something else, at least not for me. So, is someone going to yell code theft over a loop or an if/then? I think with LLMs it is not so much about not training them on certain things as about making sure the output does not draw too heavily from one source. This would also jibe with trying to produce correct code. I also think LLMs should have a way to feed user input back into themselves: when they are shown to be wrong and it is provable, they should take that into account moving forward, and if it is questionable, they should go with what they have until a good number of users refute it.