AI Helps Make Web Scraping Faster And Easier

Web scraping is usually only a first step towards extracting meaningful data. Once you’ve got everything pulled down, you’ve still got to process it into something useful. Here to assist with that is Scrapegraph-ai, a Python tool that promises to automate the process using a selection of large language models (LLMs).

Scrapegraph-ai is able to accept a URL as well as a prompt, which is a plain-English instruction on what to do with the data. Examples include summarizing, describing images, and more. In other words, gathering the data and analyzing or formatting it can now be done as one.
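To make the "URL plus prompt" idea concrete, here is a minimal sketch of the general pattern using only the Python standard library. It is not Scrapegraph-ai's actual API: the names `html_to_text`, `build_request`, `scrape_and_ask`, and the `llm` callable are all hypothetical, and the model back-end is deliberately left as a plain function you would supply yourself.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Reduce an HTML document to its visible text, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def fetch_url(url):
    """Download a page and decode it, falling back to UTF-8."""
    with urlopen(url, timeout=10) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def build_request(instruction, page_text, max_chars=4000):
    """Combine the plain-English instruction with (truncated) page text."""
    return f"Instruction: {instruction}\n\nPage content:\n{page_text[:max_chars]}"


def scrape_and_ask(url, instruction, llm, fetch=fetch_url):
    """Fetch a page, strip it to text, and hand it to an LLM callable.

    `llm` is an assumption of this sketch: any function that takes a
    prompt string and returns the model's reply as a string.
    """
    page_text = html_to_text(fetch(url))
    return llm(build_request(instruction, page_text))
```

Calling `scrape_and_ask(some_url, "Summarize this page", llm=my_model)` would then do the gathering and the analysis in one step, with `my_model` wrapping whichever local or hosted model you prefer. Keeping the model behind a simple callable is one way to get the kind of back-end flexibility such a tool needs, though the real project may well structure things differently.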

The project is actually pretty flexible in terms of the AI back-end. It’s able to work with locally-installed AI tools (via ollama) or with API keys for services like OpenAI and more. If you have an OpenAI API key, there’s an online demo that will show you the capabilities pretty effectively. Otherwise, local installation is only a few operations away.

This isn’t the first time we have seen the flexibility of AI tools like large language models leveraged to ease the notoriously-fiddly task of web scraping, and it’s great to see the results have only gotten better.

13 thoughts on “AI Helps Make Web Scraping Faster And Easier”

  1. This is a pretty cool use of LLMs, although it’s not really web scraping; it’s basically making sense of the text data on a page, which is quite a valid use case for LLMs.

    Web scraping is all about efficiency and doing things massively in parallel (because no one has the time to scrape gigabytes of ASCII or other data one item at a time). I am an embedded systems engineer, but I enjoy writing web scrapers to scrape a lot of stuff from the internet. All to help with my hoarding instinct.

  2. Meh, great. The AI craze has resulted in a bunch of people I work with, who don’t write code, telling people they do… then running to me with the latest mess that ChatGPT crapped out that doesn’t work.

    1. If you are a real programmer, that’s good news. The lazy ones will rely on ChatGPT to do their work, fail, and show themselves as lazy programmers. The good ones can see what was wrong, fix it, and have good code.

      1. I code for work and consider myself to be mediocre. I can get done whatever needs doing.

        I dislike the rise of non-coders claiming they are coders because of AI. If they use it as a learning method then sure, no problem.

        1. I am a sysadmin and do code for work, mostly personal tools and several dozen scripts, and I don’t care much. People can claim whatever they want, but if the code blows up in their faces, it’s pretty obvious they weren’t coders at all.

          What I do care about is when ChatGPT becomes good enough to write code like an intermediate coder, but makes mistakes like a beginner. Companies will start to trust it, put code in production, and face an onslaught of attacks when hackers all around the world figure out ChatGPT wrote the code and understand what common pitfalls to search for.

          Imagine a buffer overflow on libssl, or libpng, or anything network related that gets pushed on a widely used product. That is what worries me about ChatGPT as programmer.

          And there are deep fakes, fabricated news posts, fake video calls… The greatest AI threat isn’t the terminators, but a lack of trust in information: one day nobody will know for sure if they can trust anything at all, because fake and real will be indistinguishable from each other. People will “know for sure” that table salt is toxic, or Bluetooth causes cancer, or compact fluorescent lamps release poison, or LED lights are taking pictures and sending them to a foreign government (or your own).

          1. Yeah, my start was personal (numerous RF contributions, C/C++ gadgets, countless Python scripts), then I started writing code for work products and got quickly snatched up to do that.

            It’s always the non-coders that come to you wanting you to fix the crap, which they then turn around and claim they made.

            You are absolutely right to be concerned though. I am there with you.

            Personally, I learn better by figuring things out on my own, vice being told an answer which may/may not be right.

          2. I saw a post today of gpt2-chat coding a Flappy Bird game in Python in one go. That’s beyond mediocre in my book, and we are still in the very early stages of what LLMs will be capable of. People claiming credit for other people’s (or things’) work is an age-old problem, and is separate from the rise of AI tech.

        2. Thing is, AI (a misnomer) can’t think. It only uses what it was fed (good and bad) to begin with and spits it back as ‘answers’. Yet because of the name ‘AI’ it is being ‘accepted’ as a source of truth. As said above, with the blending of fake vs real, no one will know what is correct. The world is becoming scary enough and getting worse, and the ‘net’ is a fertile bed for this ‘technology’. I still think a standard search engine is best, rather than the ‘blending’. Let the user evaluate the results. Of course, then we have to use our own minds… Tough on some people who don’t want to think it through.

        3. “I code for work and consider myself to be mediocre. ”
          You, sir, are a proper unicorn. I have never heard or seen anyone say something with these two concepts together.
          I also agree with you 100%.

  3. lol, So these n00bs trust their ai? (evil laugh). I created an “ai” api endpoint exploit/virus. Keep trusting your bs ai, lol.
    yes, visit the links it shows you. No those links don’t have a no-click image with reverse shell waiting to pwn you. It’s all safe. GPT assures you. g3t r3kt
