[James Turk] has a novel approach to the problem of scraping web content in a structured way without needing to write the kind of page-specific code web scrapers usually have to deal with. How? Just enlist the help of a natural language AI. Scrapeghost relies on OpenAI’s GPT API to parse a web page’s content, pull out and classify any salient bits, and format it in a useful way.
What makes Scrapeghost different is how the data gets organized. When instantiating scrapeghost, one defines the data one wishes to extract. For example:
from scrapeghost import SchemaScraper

scrape_legislators = SchemaScraper(
    schema={
        "name": "string",
        "url": "url",
        "district": "string",
        "party": "string",
        "photo_url": "url",
        "offices": [{"name": "string", "address": "string", "phone": "string"}],
    }
)
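From there, pointing the scraper at a page is essentially a one-liner. The snippet below is only a rough sketch of that step: the URL is a made-up placeholder, and it assumes (per the project's documentation at the time of writing) that the scraper object can be called directly on a URL and returns a response whose data attribute holds the extracted dictionary.

# Hypothetical usage sketch -- the URL is a placeholder, and the callable/.data
# interface is assumed from scrapeghost's docs rather than guaranteed here.
resp = scrape_legislators("https://example.state.gov/legislators/42")
print(resp.data)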
The kicker is that this format is entirely up to you! The GPT models are very, very good at processing natural language, and scrapeghost uses GPT to process the scraped data and find (using the example above) whatever looks like a name, district, party, photo, and office address, formatting it all exactly as requested.
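To make that concrete, a result for the schema above should come back shaped like the schema itself. The dictionary below is purely illustrative; every value in it is invented, and it is only meant to show the structure one would expect:

# Illustrative output only -- these values are invented, not real scraped data.
{
    "name": "Jane Example",
    "url": "https://example.state.gov/legislators/42",
    "district": "12",
    "party": "Independent",
    "photo_url": "https://example.state.gov/photos/42.jpg",
    "offices": [
        {"name": "Capitol Office", "address": "123 State House, Springfield", "phone": "555-0100"},
    ],
}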
It’s an experimental tool and you’ll need an API key from OpenAI to use it, but it has useful features and is certainly a novel approach. There’s a tutorial and even a command-line interface, so check it out.
I’m tired of web scraping existing in the first place. The internet is for humans.
Humans use web scraping to make life easier for themselves and others.
That’s always how it starts.
What in the world does that mean?
Don’t get him started.
Only about 5% of the Internet is for humans. Only 0.03 – 0.04% is available without a login.
96% of statistics are made up on the spot……
Plenty of sources. Here is one
https://www.spiceworks.com/it-security/security-general/articles/dark-web-vs-deep-web/
> The internet is for humans.
When was that ever true / the case?
It was developed as a nuclear-Armageddon-safe communications network for military use (wasn’t it?).
And since then it has been for businesses, advertising, porn, cats, lies, propaganda, connecting all kinds of idiots so they can stay in their Q beliefs (religions = sects, conspiracy stories, esoteric BS, just to name a few).
I mean yes, it helped connect scientists and paved the way for more open-source ideas in several areas (and many other things), but still…
Wars are being fought “for humans”; the developed world has been destroying nature for >100 years for (their) humans.
-> What exactly do you mean by “for humans”?
well said
Likely to be used more and more by different types of bots as time goes on. Humans may be relegated to 2nd-class users. Or 3rd class if you include cats.
Indeed. We had electronic mail before the Internet. Systems used to dial up to each other on a regular basis to exchange messages.
… really makes you think!
The nuclear war thing is a myth; some of the Arpanet crowd began to speculate about the survivability of communication networks later on, but it was created as a boring old way to network a bunch of mainframes operated by the military and academia. I believe the origin myth came from InfoWorld in the 1990s, and it doesn’t really make sense given the dependence on a fairly small number of leased lines and point-to-point microwave links.
The point about it not being for humans still stands though.
Surely “scrapegoat” would be a better name?
“you’ll need an API key from OpenAI to use it,”
This means it’s going to cost money to scrape pages… slowly. I wouldn’t really trust it to return perfect results either. I’ll stick to writing bots.
Right… What happens when it decides that the scraping results are too sensitive and it leaves out information? You’d have to run a normal web scraper in parallel to verify that it isn’t leaving information out or hallucinating extra information.
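For anyone who actually wants that kind of cross-check, a small spot-check script is usually enough to catch obvious omissions. The sketch below uses requests and BeautifulSoup; the URL, CSS selector, and gpt_result structure are all placeholders for illustration and are not part of Scrapeghost's own API:

# Sanity-check sketch: compare one GPT-extracted field against a value pulled
# with a conventional parser. Selector and field names are hypothetical.
import requests
from bs4 import BeautifulSoup

def check_name(url, gpt_result):
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.select_one("h1.legislator-name")  # hypothetical selector
    page_name = tag.get_text(strip=True) if tag else None
    # Flag a mismatch so a human can review possible omissions or hallucinations.
    if page_name and page_name != gpt_result.get("name"):
        print(f"Mismatch: page says {page_name!r}, GPT returned {gpt_result.get('name')!r}")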
Exactly this, the limitations of ChatGPT don’t make it worth it.
Pretty cool project. I’m working on something similar that uses GPT to generate web scrapers -> https://www.kadoa.com