[James Turk] has a novel approach to the problem of scraping web content in a structured way without needing to write the kind of page-specific code web scrapers usually have to deal with. How? Just enlist the help of a natural language AI. Scrapeghost relies on OpenAI’s GPT API to parse a web page’s content, pull out and classify any salient bits, and format it in a useful way.
What makes Scrapeghost different is how the data gets organized. When instantiating scrapeghost, one defines the data one wishes to extract. For example:
from scrapeghost import SchemaScraper

scrape_legislators = SchemaScraper(
    schema={
        "name": "string",
        "url": "url",
        "district": "string",
        "party": "string",
        "photo_url": "url",
        "offices": [{"name": "string", "address": "string", "phone": "string"}],
    }
)
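From there, pointing the scraper at a page is essentially a one-liner. The snippet below is only a rough sketch of that step: the URL is a made-up placeholder, and it assumes (per the project's documentation at the time of writing) that the scraper object can be called directly on a URL and returns a response whose data attribute holds the extracted dictionary.

# Hypothetical usage sketch -- the URL is a placeholder, and the callable/.data
# interface is assumed from scrapeghost's docs rather than guaranteed here.
resp = scrape_legislators("https://example.state.gov/legislators/42")
print(resp.data)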
The kicker is that this format is entirely up to you! The GPT models are very, very good at processing natural language, and scrapeghost uses GPT to process the scraped data and find (using the example above) whatever looks like a name, district, party, photo, and office address, formatting it all exactly as requested.
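To make that concrete, a result for the schema above should come back shaped like the schema itself. The dictionary below is purely illustrative; every value in it is invented, and it is only meant to show the structure one would expect:

# Illustrative output only -- these values are invented, not real scraped data.
{
    "name": "Jane Example",
    "url": "https://example.state.gov/legislators/42",
    "district": "12",
    "party": "Independent",
    "photo_url": "https://example.state.gov/photos/42.jpg",
    "offices": [
        {"name": "Capitol Office", "address": "123 State House, Springfield", "phone": "555-0100"},
    ],
}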
It’s an experimental tool and you’ll need an API key from OpenAI to use it, but it has useful features and is certainly a novel approach. There’s a tutorial and even a command-line interface, so check it out.
I’m tired of web scraping existing in the first place. The internet is for humans.
Humans use web scraping to make life easier for themselves and others.
That’s always how it starts.
What in the world does that mean?
Don’t get him started.
Only about 5% of the Internet is for humans. Only 0.03 – 0.04% is available without a login.
96% of statistics are made up on the spot……
Plenty of sources. Here is one
https://www.spiceworks.com/it-security/security-general/articles/dark-web-vs-deep-web/
> The internet is for humans.
When was that ever true / the case?
It was developed as a nuclear-Armageddon-safe communications network for military use (wasn’t it?).
And since then it has been for businesses, advertising, porn, cats, lies, propaganda, connecting all kinds of idiots so they can stay in their Q beliefs (religions = sects, conspiracy stories, esoteric BS, just to name a few).
I mean yes, it helped connect scientists and paved the way for more open-source ideas in several areas (and many other things), but still…
Wars are being fought “for humans”; the developed world has been destroying nature for >100 years for (their) humans.
-> What exactly do you mean by “for humans”?
well said
Likely to be used more and more by different types of bots as time goes on. Humans may be relegated to 2nd-class users. Or 3rd class if you include cats.
Indeed. We had electronic mail before the Internet. Systems used to dial up to each other on a regular basis to exchange messages.
… really makes you think!
The nuclear war thing is a myth; some of the Arpanet crowd began to speculate about the survivability of communication networks later on, but it was created as a boring old way to network a bunch of mainframes operated by the military and academia. I believe the origin myth came from InfoWorld in the 1990s, and it doesn’t really make sense given the dependence on a fairly small number of leased lines and point-to-point microwave links.
The point about it not being for humans still stands though.
Surely “scrapegoat” would be a better name?
“you’ll need an API key from OpenAI to use it,”
This means it’s going to cost money to scrape pages… slowly. I wouldn’t really trust it to return perfect results either. I’ll stick to writing bots.
Right… What happens when it decides that the scraping results are too sensitive and it leaves out information? You’d have to run a normal web scraper in parallel to verify that it isn’t leaving information out or hallucinating extra information.
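For anyone who actually wants that kind of cross-check, a small spot-check script is usually enough to catch obvious omissions. The sketch below uses requests and BeautifulSoup; the URL, CSS selector, and gpt_result structure are all placeholders for illustration and are not part of Scrapeghost's own API:

# Sanity-check sketch: compare one GPT-extracted field against a value pulled
# with a conventional parser. Selector and field names are hypothetical.
import requests
from bs4 import BeautifulSoup

def check_name(url, gpt_result):
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.select_one("h1.legislator-name")  # hypothetical selector
    page_name = tag.get_text(strip=True) if tag else None
    # Flag a mismatch so a human can review possible omissions or hallucinations.
    if page_name and page_name != gpt_result.get("name"):
        print(f"Mismatch: page says {page_name!r}, GPT returned {gpt_result.get('name')!r}")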
Exactly this, the limitations of ChatGPT don’t make it worth it.
Pretty cool project. I’m working on something similar that uses GPT to generate web scrapers -> https://www.kadoa.com