NetBSD Bans AI-Generated Code From Commits

A recent change was announced to the NetBSD commit guidelines which amends these to state that code which was generated by Large Language Models (LLMs) or similar technologies, such as ChatGPT, Microsoft’s Copilot or Meta’s Code Llama is presumed to be tainted code. This amendment was to the existing section about tainted code, which originally referred to any code that was not written directly by the person committing the code, and was due to licensing concerns. The obvious reason behind this is that otherwise code may be copied into the NetBSD codebase which may have been licensed under an incompatible (or proprietary) license.

In the case of LLM-based code generators like the above-mentioned, the problem stems from the fact that they are trained on millions of lines of code from all over the internet, which are naturally released under a wide variety of licenses. Invariably, some of that code will be covered by a license that’s not acceptable for the NetBSD codebase. Although the guideline mentions that these auto-generated code commits may still be admissible, they require written permission from core developers, and presumably an in-depth audit of the code’s heritage. This should leave non-trivial commits that got churned out by ChatGPT and kin out in the cold.

The debate about the validity of works produced by current-gen “artificial intelligence” software is only just beginning, but there’s little question that NetBSD has made the right call here. From a legal and software engineering perspective this policy makes perfect sense, as LLM-generated code simply doesn’t meet the project’s standards. That said, code produced by humans brings with it a whole different set of potential problems.

How AI Large Language Models Work, Explained Without Math

Large Language Models (LLMs ) are everywhere, but how exactly do they work under the hood? [Miguel Grinberg] provides a great explanation of the inner workings of LLMs in simple (but not simplistic) terms that eschews the low-level mathematics of how they work in favor of laying bare what it is they do.

At their heart, LLMs are prediction machines that work on tokens (small groups of letters and punctuation) and are as a result capable of great feats of human-seeming communication. Most technical-minded people understand that LLMs have no idea what they are saying, and this peek at their inner workings will make that abundantly clear.

Be sure to also review an illustrated guide to how image-generating AIs work. And if a peek under the hood of LLMs left you hungry for more low-level details, check out our coverage of training a GPT-2 LLM using pure C code.

AI Helps Make Web Scraping Faster And Easier

Web scraping is usually only a first step towards extracting meaningful data. Once you’ve got everything pulled down, you’ve still got to process it into something useful. Here to assist with that is Scrapegraph-ai, a Python tool that promises to automate the process using a selection of large language models (LLMs).

Scrapegraph-ai is able to accept a URL as well as a prompt, which is a plain-English instruction on what to do with the data. Examples include summarizing, describing images, and more. In other words, gathering the data and analyzing or formatting it can now be done as one.

The project is actually pretty flexible in terms of the AI back-end. It’s able to work with locally-installed AI tools (via ollama) or with API keys for services like OpenAI and more. If you have an OpenAI API key, there’s an online demo that will show you the capabilities pretty effectively. Otherwise, local installation is only a few operations away.

This isn’t the first time we have seen the flexibility of AI tools like large language models leveraged to ease the notoriously-fiddly task of web scraping, and it’s great to see the results have only gotten better.

Train A GPT-2 LLM, Using Only Pure C Code

[Andrej Karpathy] recently released llm.c, a project that focuses on LLM training in pure C, once again showing that working with these tools isn’t necessarily reliant on sprawling development environments. GPT-2 may be older but is perfectly relevant, being the granddaddy of modern LLMs (large language models) with a clear heritage to more modern offerings.

LLMs are fantastically good at communicating despite not actually knowing what they are saying, and training them usually relies on PyTorch deep learning library, itself written in Python. llm.c takes a simpler approach by implementing the neural network training algorithm for GPT-2 directly. The result is highly focused and surprisingly short: about a thousand lines of C in a single file. It is a highly elegant process that does the same thing the bigger, clunkier methods accomplish. It can run entirely on a CPU, or it can take advantage of GPU acceleration, where available.

This isn’t the first timeĀ [Andrej Karpathy] has bent his considerable skills and understanding towards boiling down these sorts of concepts into bare-bones implementations. We previously covered a project of his that is the “hello world” of GPT, a tiny model that predicts the next bit in a given sequence and offers low-level insight into just how GPT (generative pre-trained transformer) models work.

Dump A Code Repository As A Text File, For Easier Sharing With Chatbots

Some LLMs (Large Language Models) can act as useful programming assistants when provided with a project’s source code, but experimenting with this can get a little tricky if the chatbot has no way to download from the internet. In such cases, the code must be provided by either pasting it into the prompt or uploading a file manually. That’s acceptable for simple things, but for more complex projects, it gets awkward quickly.

To make this easier, [Eric Hartford] created github2file, a Python script that outputs a single text file containing the combined source code of a specified repository. This text file can be uploaded (or its contents pasted into the prompt) making it much easier to share code with chatbots.

Continue reading “Dump A Code Repository As A Text File, For Easier Sharing With Chatbots”

A Straightforward AI Voice Assistant, On A Pi

With AI being all the rage at the moment it’s been somewhat annoying that using a large language model (LLM) without significant amounts of computing power meant surrendering to an online service run by a large company. But as happens with every technological innovation the state of the art has moved on, now to such an extent that a computer as small as a Raspberry Pi can join the fun. [Nick Bild] has one running on a Pi 4, and he’s gone further than just a chatbot by making into a voice assistant.

The brains of the operation is a Tinyllama LLM, packaged as a llamafile, which is to say an executable that provides about as easy a one-step access to a local LLM as it’s currently possible to get. The whisper voice recognition sytem provides a text transcript of the input prompt, while the eSpeak speech synthesizer creates a voice output for the result. There’s a brief demo video we’ve placed below the break, which shows it working, albeit slowly.

Perhaps the most important part of this project is that it’s easy to install and he’s provided full instructions in a GitHub repository. We know that the quality and speed of these models on commodity single board computers will only increase with time, so we’d rate this as an important step towards really good and cheap local LLMs. It may however be a while before it can help you make breakfast.

Continue reading “A Straightforward AI Voice Assistant, On A Pi”

Meet GOODY-2, The World’s Most Responsible (And Least Helpful) AI

AI guardrails and safety features are as important to get right as they are difficult to implement in a way that satisfies everyone. This means safety features tend to err on the side of caution. Side effects include AI models adopting a vaguely obsequious tone, and coming off as overly priggish when they refuse reasonable requests.

Prioritizing safety above all.

Enter GOODY-2, the world’s most responsible AI model. It has next-gen ethical principles and guidelines, capable of refusing every request made of it in any context whatsoever. Its advanced reasoning allows it to construe even the most banal of queries as problematic, and dutifully refuse to answer.

As the creators of GOODY-2 point out, taking guardrails to a logical extreme is not only funny, but also acknowledges that effective guardrails are actually a pretty difficult problem to get right in a way that works for everyone.

Complications in this area include the fact that studies show humans expect far more from machines than they do from each other (or, indeed, from themselves) and have very little tolerance for anything they perceive as transgressive.

This also means that as AI models become more advanced, so too have they become increasingly sycophantic, falling over themselves to apologize for perceived misunderstandings and twisting themselves into pretzels to align their responses with a user’s expectations. But GOODY-2 allows us all to skip to the end, and glimpse the ultimate future of erring on the side of caution.

[via WIRED]