The world of AI is abuzz, or at least parts of it are, at the news of Meta’s release of Llama 2. This is an AI text model which is thought to surpass ChatGPT in capabilities, and which the social-media-turned-VR-turned-own-all-your-things company wants you to know is open to all. That’s right, the code is open source and you can download the model, and Meta want you to feel warm and fuzzy about it. Unfortunately all is not as it seems, because of course the model isn’t open source and is subject to a licensing restriction which makes it definitely not free of charge for larger users. This is of course disappointing to anyone hoping for an AI chatbot without restrictions, but we’re guessing Meta would prefer not to inadvertently enable a competitor.
Happily for the open source user, large or small, who isn’t afraid of a little work, there’s an alternative in the form of OpenLLaMA, but we understand that won’t be for all users. Whichever LLM you use though, please don’t make the mistake of imagining that it possesses actual intelligence.
Thanks to the CoupledAI team for the tip!
And there’s still the outstanding question of how the licensing of the training data affects the license of the LLM output.
Exactly. These cases haven’t gone to court yet. If they are ever found to have violated copyright by having trained it on copyrighted material, then using their model might open you up to liability. An individual user or research team might not be at a particularly high risk for being sued, but a university or other larger organization would be and might not be able to afford that risk.
An individual user could very easily be at risk if they’re delivering code/content to a customer.
If an LLM is ruled to be copyright infringing, then my clients can ask me to confirm I’ve not used it. If I have, then the code I’ve delivered is now worthless, and they can demand I rewrite it at my expense to meet the contract (which would include assurances on copyright / licensing status) or demand their money back.
Even if they were trained on copyrighted materials, aren’t we, as humans, also doing the exact same thing? We even take notes on such copyrighted material.
The legal issue is with the end result: a human’s output may be close to the original but not exact, while an exact copy may be possible for an LLM. However, at some point everything is copyrighted, so it becomes hard not to copy something from someone (depending on the size).
It’s not necessarily a copyright violation to use copyrighted material for training. It may become a problem in systems that overfit, and reproduce recognizable parts of copyrighted material in their outputs.
It’s gonna fall under fair use. Anything else would be a goose chase.
That is far from guaranteed, and it depends on how the copyrighted material is used. Today’s AI does not create new art, to use a USPTO term; it is just copying the material, and that is not fair use.
I would go one step further: if the entire training dataset used is not available, is it really open source? After all, you cannot independently replicate the same results using the source code alone.
Only if we ever get GNUGPT.
Multiple GNU/GPTs running as kernels in GNU/TURD.
But you would still end up needing access to exaflops or zettaflops of processing to generate the models.
For a BSDGPT or GNUGPT it may require some kind of BitTorrent-like system for distributing the massive amounts of training and testing data to the millions of participating peer machines, and a similar distributed writable file system for receiving the model generated by the millions of peers.
Sounds like a blockchain database..😏
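For a sense of scale on the compute point above, here is a back-of-the-envelope sketch. It assumes the commonly quoted rule of thumb that training costs roughly 6 × parameters × tokens floating point operations, and it plugs in the figures Meta published for the largest Llama 2 model (70 billion parameters, about 2 trillion training tokens). The sustained throughput numbers are hypothetical round figures chosen for illustration, not measurements of any real cluster or volunteer network.

```python
SECONDS_PER_DAY = 86_400


def training_flops(params: float, tokens: float) -> float:
    """Rough total training cost in FLOPs, using the ~6 * N * D rule of thumb."""
    return 6.0 * params * tokens


def days_to_train(total_flops: float, sustained_flops_per_second: float) -> float:
    """Wall-clock days needed at a given sustained throughput (an assumption)."""
    return total_flops / sustained_flops_per_second / SECONDS_PER_DAY


if __name__ == "__main__":
    # Published Llama 2 figures: 70B parameters, ~2T training tokens.
    total = training_flops(params=70e9, tokens=2e12)
    print(f"Estimated training cost: {total:.2e} FLOPs")

    # Hypothetical sustained throughputs, for illustration only.
    for label, rate in [("1 exaFLOP/s", 1e18), ("100 teraFLOP/s", 1e14)]:
        print(f"{label}: about {days_to_train(total, rate):,.0f} days")
```

Under those assumptions the job takes days at a sustained exaFLOP per second and centuries on a single fast accelerator, which is why the peer-to-peer idea keeps coming up. The sketch ignores communication overhead, which would likely dominate in any widely distributed setup.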
Ah, the dataset is available to anyone – it’s all the nice copyrighted content you can find on Google!
IMHO this is getting into the weeds a bit on open source philosophy, but from my perspective open source is really about the ability to inspect and modify the functioning of software. Including the training data wouldn’t actually make it easier to inspect the functioning of an LLM, because they’re literally impossible to understand internally using currently available methods and tools; they’re too complex and their emergent properties act in non-intuitive ways. And you can modify/retrain a model without the prior training data just as easily as you can with that data. The only thing missing is some indirect insight into what types of responses the LLM might have developed, but that type of insight isn’t a required inclusion for open source. If it’s delivered in a format that enables inspection and modification of the neural network, along with the code to further train it and the support code to make it run, then for all practical purposes it is open source.
The real question is surely not what the licenses say, but rather whether this Llama model can be downloaded to run entirely locally (and not need extra training after download unless you want to train it further for some specific field of topics to discuss) and then run entirely without connecting to any Meta-owned or other remote infrastructure. And also whether you can, in practical terms, modify the downloaded model to add or remove capabilities for specific forms of apparent “reasoning” and/or restrictions on what text it is willing to output. How does it score on all those measures? Is this yet an AI you can download, run locally, and tinker with in practical terms to your heart’s content so long as you don’t then use it commercially or release a derivative version publicly?
Lol! Hell no.
No to which part? Is there yet an effective AI language model which can be run fully locally offline?
Only if the local system has massive amounts of storage AND the data that is downloaded doesn’t contain any copyrighted material.
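On the storage question, a rough estimate is easy to sketch. What you download is the model weights rather than the training set, so the size is approximately the parameter count times the bytes used per parameter. The parameter counts below are the three published Llama 2 sizes; the bit widths are common storage precisions. The function name is made up for this example, and the figures ignore quantization metadata and the extra memory needed at run time, so treat them as lower bounds.

```python
def weight_size_gb(params: float, bits_per_param: int) -> float:
    """Approximate size of the weights alone, in gigabytes (10**9 bytes)."""
    return params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    # Published Llama 2 model sizes.
    models = {"7B": 7e9, "13B": 13e9, "70B": 70e9}
    # Common storage precisions: 16-bit floats, 8-bit and 4-bit quantization.
    for name, params in models.items():
        sizes = ", ".join(
            f"{bits}-bit: {weight_size_gb(params, bits):.1f} GB"
            for bits in (16, 8, 4)
        )
        print(f"Llama 2 {name} -> {sizes}")
```

By this estimate the smallest model at 16 bits is about 14 GB, and the largest is about 140 GB, or roughly 35 GB if quantized to 4 bits.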