Tired Of Web Scraping? Make The AI Do It

[James Turk] has a novel approach to the problem of scraping web content in a structured way without needing to write the kind of page-specific code web scrapers usually have to deal with. How? Just enlist the help of a natural language AI. Scrapeghost relies on OpenAI’s GPT API to parse a web page’s content, pull out and classify any salient bits, and format it in a useful way.

What makes Scrapeghost different is how the data gets organized: when instantiating the scraper, you define the data you wish to extract. For example:

from scrapeghost import SchemaScraper
scrape_legislators = SchemaScraper(
    schema={
        "name": "string",
        "url": "url",
        "district": "string",
        "party": "string",
        "photo_url": "url",
        "offices": [{"name": "string", "address": "string", "phone": "string"}],
    }
)

The kicker is that this format is entirely up to you! The GPT models are very, very good at processing natural language, and scrapeghost uses GPT to process the scraped data and find (using the example above) whatever looks like a name, district, party, photo, and office address and format it exactly as requested.
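
To give a sense of how little glue code that leaves, here is a hypothetical usage sketch based on scrapeghost's documented examples (the exact response attributes may differ between versions): hand the scraper a URL and it returns data shaped like your schema.

# Hypothetical usage sketch, based on scrapeghost's documented examples;
# the exact response attributes may vary between versions.
resp = scrape_legislators("https://www.ilga.gov/house/rep.asp?MemberID=3071")
print(resp.data)
# -> a dict shaped like the schema above: name, url, district, party,
#    photo_url, and a list of offices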

It’s an experimental tool and you’ll need an API key from OpenAI to use it, but it has useful features and is certainly a novel approach. There’s a tutorial and even a command-line interface, so check it out.

How Much Programming Can ChatGPT Really Do?

By now we've all seen articles where the entire copy has been written by ChatGPT. It's essentially a trope of its own at this point, so we will start out by assuring you that this article is being written by a human. AI tools do seem poised to be extremely disruptive to certain industries, though this doesn't necessarily have to be a bad thing as long as they continue to be viewed as tools rather than direct replacements. ChatGPT can be used to assist in plenty of tasks and can help augment processes like programming (rather than becoming the programmer itself); this article shows a few examples of what it might be used for.

[Image caption: AI comments are better than nothing… probably.]

While it can write some programs on its own, in some cases quite capably, for specialized or complex tasks it might not be quite up to the challenge yet. It will often appear extremely confident in its solutions even when it's providing poor or false information, but that doesn't mean it can't or shouldn't be used at all.

The article goes over a few of the ways it can function more as an assistant than a programmer, including generating filler content for something like an SQL database, converting data from one format to another, converting programs from one language to another, and even helping with a program's debugging process.
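
As a concrete (and hedged) illustration of the filler-data case, here's a minimal sketch that asks the model for SQL INSERT statements through the OpenAI Python client as it existed when this was written; the table definition and prompt are made up for the example, and the client interface has since been revised.

# Sketch: asking the model for SQL filler data programmatically.
# Uses the openai package's chat completion call as it existed in early 2023;
# the client interface has since changed, so treat this as illustrative.
import openai

openai.api_key = "sk-..."  # your API key here

prompt = (
    "Generate 10 SQL INSERT statements of realistic-looking filler data "
    "for a table customers(id INT, name TEXT, email TEXT, signup_date DATE)."
)

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)

print(response["choices"][0]["message"]["content"])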

Some other things that ChatGPT can be used for that we’ve been able to come up with include asking for recommendations for libraries we didn’t know existed, as well as asking for music recommendations to play in the background while working. Tools like these are extremely impressive, and while they likely aren’t taking over anyone’s job right now, that might not always be the case.

Why LLaMa Is A Big Deal

You might have heard about LLaMa or maybe you haven’t. Either way, what’s the big deal? It’s just some AI thing. In a nutshell, LLaMa is important because it allows you to run large language models (LLM) like GPT-3 on commodity hardware. In many ways, this is a bit like Stable Diffusion, which similarly allowed normal folks to run image generation models on their own hardware with access to the underlying source code. We’ve discussed why Stable Diffusion matters and even talked about how it works.

LLaMa is a family of transformer language models from Facebook/Meta research, ranging from 7 billion to 65 billion parameters and trained on publicly available datasets. Their research paper showed that the 13B version outperformed GPT-3 on most benchmarks, and that LLaMa-65B is right up there with the best of them. LLaMa was unique in that inference could be run on a single GPU, thanks to some optimizations made to the transformer itself and the model being about 10x smaller. While Meta recommended that users have at least 10 GB of VRAM to run inference on the larger models, that's a huge step down from the 80 GB A100 cards that often run these models.

While this was an important step forward for the research community, it became a huge one for the hacker community when [Georgi Gerganov] rolled in. He released llama.cpp on GitHub, which runs inference on a LLaMa model with 4-bit quantization. His code was focused on running LLaMa-7B on your MacBook, but we've seen versions running on smartphones and Raspberry Pis. There's even a version written in Rust! A rough rule of thumb is that anything with more than 4 GB of RAM can run LLaMa. Model weights are available through Meta with some rather strict terms, but they've been leaked online and can be found even in a pull request on the GitHub repo itself.
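
llama.cpp's actual quantization format is more involved (it packs two 4-bit values per byte in its own block layout), but the general idea behind blockwise 4-bit quantization, the trick that shrinks the weights enough to fit in a few gigabytes of RAM, can be sketched in a few lines of numpy. This illustrates the principle only; it is not the project's code.

# A minimal numpy sketch of blockwise 4-bit quantization: the general idea,
# not llama.cpp's exact format.
import numpy as np

def quantize_q4(weights, block_size=32):
    """Quantize a 1-D float array to 4-bit integers plus one scale per block."""
    blocks = weights.reshape(-1, block_size)
    # One scale per block, chosen so the largest value maps into the int4 range.
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0                      # avoid division by zero
    q = np.clip(np.round(blocks / scales), -8, 7)  # 4-bit signed integers
    return q.astype(np.int8), scales

def dequantize_q4(q, scales):
    """Reconstruct approximate float weights from the quantized blocks."""
    return (q * scales).reshape(-1)

w = np.random.randn(1024).astype(np.float32)
q, s = quantize_q4(w)
w_hat = dequantize_q4(q, s)
print("max error:", np.abs(w - w_hat).max())

Four bits per weight plus one float scale per 32 weights works out to roughly a sixth of the float32 storage, which puts a 7-billion-parameter model right in that 4 GB ballpark.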

BitTorrent For Language Models

In the old days of the Internet, FTP was sufficient for downloading the occasional file. But with the widespread use of computer audio and video, it was easy to swamp an FTP server, so, eventually, BitTorrent was born. The idea was that you would download bits and pieces of a file from different places, and in theory other people would download the bits and pieces you have if they needed them. Now Petals wants to use this same method with language models. These AI language models are all the rage, but they take significant computer resources. The idea behind Petals is like BitTorrent: you handle a small part of the model (about 8 gigabytes, which is small compared to the 352 gigabytes required), and other people host the other parts.

Of course, if you are privacy-minded, that means that some amount of your data is going out to the public, but for your latest chatbot experiments, that might not be a big problem. You can install Petals in an Anaconda environment or run a Docker image if you don’t want to set up anything. If you just want to access the distributed network’s chatbot based on BLOOMZ-176B, you can do that online.
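
If you'd rather poke at it from code than from the web demo, client usage looks roughly like the sketch below, adapted from the Petals examples of the time; class and model names may have changed in later releases, so treat it as illustrative.

# Sketch adapted from the Petals project's examples of the time; class and
# model names may have changed since, so treat this as illustrative.
from transformers import BloomTokenizerFast
from petals import DistributedBloomForCausalLM

MODEL_NAME = "bigscience/bloom-petals"  # or the BLOOMZ variant for the chatbot

tokenizer = BloomTokenizerFast.from_pretrained(MODEL_NAME)
model = DistributedBloomForCausalLM.from_pretrained(MODEL_NAME)

# Each forward pass is routed through peers hosting their slices of the model.
inputs = tokenizer("A cat sat on", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=5)
print(tokenizer.decode(outputs[0]))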


Hands-On: NVIDIA Jetson Orin Nano Developer Kit

NVIDIA's Jetson line of single-board computers is doing something different in a vast sea of relatively similar Linux SBCs. Designed for edge computing applications, such as a robot that needs to perform high-speed computer vision while out in the field, they provide exceptional performance in a board that's of comparable size and weight to other SBCs on the market. The only difference, as you might expect, is that they tend to cost a lot more: the current top-of-the-line Jetson AGX Orin Developer Kit is $1,999 USD.

Luckily for hackers and makers like us, NVIDIA realized they needed an affordable gateway into their ecosystem, so they introduced the $99 Jetson Nano in 2019. The product proved so popular that just a year later the company refreshed it with a streamlined carrier board that dropped the cost of the kit down to an incredible $59. Looking to expand on that success even further, today NVIDIA announced a new upmarket entry into the Nano family that lies somewhere in the middle.

While the $499 price tag of the Jetson Orin Nano Developer Kit may be a bit steep for hobbyists, there’s no question that you get a lot for your money. Capable of performing 40 trillion operations per second (TOPS), NVIDIA estimates the Orin Nano is a staggering 80X as powerful as the previous Nano. It’s a level of performance that, admittedly, not every Hackaday reader needs on their workbench. But the allure of a palm-sized supercomputer is very real, and anyone with an interest in experimenting with machine learning would do well to weigh (literally, and figuratively) the Orin Nano against a desktop computer with a comparable NVIDIA graphics card.

We were provided with one of the very first Jetson Orin Nano Developer Kits before their official unveiling during NVIDIA GTC (GPU Technology Conference), and I’ve spent the last few days getting up close and personal with the hardware and software. After coming to terms with the fact that this tiny board is considerably more powerful than the computer I’m currently writing this on, I’m left excited to see what the community can accomplish with the incredible performance offered by this pint-sized system.


Modifying Artwork With Glaze To Interfere With Art Generating Algorithms

With the rise of machine-generated art we have also seen a major discussion begin about the ethics of using existing, human-made art to train these art models. Their defenders will often claim that the original art cannot be reproduced by the generator, but this is belied by the fact that one possible query to these generators is to produce art in the style of a specific artist. This is where feature extraction comes into play, and where Glaze enters the picture as a potential obfuscation tool.

Developed by researchers at the University of Chicago, the theory behind this tool is covered in their preprint paper. The essential concept is that an artist can pick a target 'cloak style', which is used by Glaze to calculate specific perturbations which are added to the original image. These perturbations are not easily detected by the human eye, but will be picked up by the feature extraction algorithms of current machine-generated art models.
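
To make the concept a little more concrete, here is a rough PyTorch sketch of the underlying optimization: nudge the image so that a feature extractor "sees" the target style while the pixel-space change stays within a small budget. Glaze itself works against the feature extractor of the text-to-image models and uses a more careful perceptual limit; the VGG network, loss, and epsilon below are stand-ins for illustration only.

# A rough sketch of the style-cloaking idea, not the Glaze implementation.
# An off-the-shelf VGG stands in for the real feature extractor.
import torch
import torch.nn.functional as F
import torchvision.models as models

feature_extractor = models.vgg16(weights=models.VGG16_Weights.DEFAULT).features.eval()
for p in feature_extractor.parameters():
    p.requires_grad_(False)

def cloak(image, style_target, epsilon=0.03, steps=200, lr=0.01):
    """image, style_target: float tensors of shape (1, 3, H, W) in [0, 1].
    style_target is the same artwork rendered in the chosen cloak style."""
    delta = torch.zeros_like(image, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    with torch.no_grad():
        target_features = feature_extractor(style_target)
    for _ in range(steps):
        opt.zero_grad()
        perturbed = (image + delta).clamp(0, 1)
        # Pull the perturbed image toward the target style in feature space.
        loss = F.mse_loss(feature_extractor(perturbed), target_features)
        loss.backward()
        opt.step()
        # Keep the perturbation small enough to be hard to spot by eye.
        with torch.no_grad():
            delta.clamp_(-epsilon, epsilon)
    return (image + delta).clamp(0, 1).detach()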

The Singularity Isn’t Here… Yet

So, GPT-4 is out, and it's all over for us meatbags. Hype has reached fever pitch: in the latest and greatest of AI chatbots, we finally have something that can surpass us. The singularity has happened, and personally I welcome our new AI overlords.

Hang on a minute though, I smell a rat, and it comes down to defining just what intelligence is. In my time I've hung out with a lot of very bright people, as well as a lot of not-so-bright people who nonetheless think they're very clever simply because they have a bunch of qualifications and diplomas. Sadly the experience hasn't bestowed God-like intelligence on me, but it has given me a handle on the difference between intelligence and knowledge.

My premise is that we humans are conditioned by our education system to equate learning with intelligence, mostly because we have flaky CPUs and worse memory, and that makes learning something a bit of an effort. Thus when we see an AI, a machine that can learn everything because it has a decent CPU and memory, we’re conditioned to think of it as intelligent because that’s what our schools train us to do. In fact it seems intelligent to us not because it’s thinking of new stuff, but merely through knowing stuff we don’t because we haven’t had the time or capacity to learn it.

Growing up and making my earlier career around a major university, I've seen this in action so many times: people who master one skill, rote-learning the school textbook or the university tutor's pet views and theories, and barfing them up all over the exam paper to get their amazing qualifications. On paper they're the cream of the crop, and while it's true they're not thick, they're rarely the special clever people they think they are. People with truly above-average intelligence exist, but in smaller numbers, and their occurrence is not a 1:1 mapping with holders of advanced university degrees.

Even the examples touted of GPT's brilliance tend to reinforce this. It can do the bar exam or the SAT test, so we're told it's as intelligent as a school-age kid or a lawyer. Both of those qualifications follow our educational system's flawed premise that education equates to intelligence, so as a machine that has learned all the facts, it follows my point above about learning by rote. The machine has simply barfed up the answers it learned onto the exam paper. Is that intelligence? Is a search engine intelligent?

This is not to say that tools such as GPT-4 are not amazing creations that have a lot of potential to do good things aside from filling up the internet with superficially readable spam. Everyone should have a play with them and investigate their potential, and from that will no doubt come some very interesting things. Just don’t confuse them with real people, because sometimes meatbags can surprise you.