How Search Engines Enabled Finding Needles In A WWW-Sized Haystack

When the World Wide Web surged into existence during the 1990s, we were introduced to the problem of how to actually find something in this ever-ballooning construction zone that easily outpaced even the fastest post-WW2 urban sprawl. Although domain names provided a way to find servers using DNS rather than having to mash in IP addresses, you still somehow had to know the relevant URL.

A range of solutions were thought up over time, from printed Yellow Pages type guides, to online curated lists of resources, to things like web rings where one website would link to a relevant similar website. This was the time when word-of-mouth was also very relevant, with people proudly announcing their own website on Geocities or another hosting service.

Search engines already existed long before the WWW became the hot new thing during the 1990s, but it was the WWW that would really push them to their limits. As anyone who used search engines for the WWW back then can attest, they had many issues. Often you’d end up using multiple search engines to find something, and despite fierce competition between web search engines to become the starting page in people’s browsers, actually finding things on the WWW remained a tough problem.

Since a web search engine ‘just’ has to index the WWW and match a search query against the results, why was this such a hard problem that persisted until Google apparently cracked the code?

Unplanned Sprawl

URLs branching off from the main Wikipedia page in 2004. (Credit: Chris 73, Wikimedia)

A nice thing about the WWW is that it was designed to be accessible to all, requiring only an Internet connection and thus opening up the possibility of setting up your own webserver. This unsurprisingly led to a very rapid growth of pages on the WWW, with content appearing, being modified and sometimes vanishing at an ever-increasing pace, making it extremely hard to keep up with.

This is however not how things started when the World Wide Web was created in 1989. Before its opening to the public in 1993 the pace of growth was slow enough that an index could be maintained by hand. This was kept up until late 1992, with the last version of said index still online on the W3 website.

Over the course of a short few years, the WWW would change the face of the world forever alongside a surge of IBM-compatible PCs, exploding multimedia content, all the dot-com hype and perhaps best of all endless ‘free’ hosting services as long as you didn’t mind an advertising banner plastered above your personal homepage’s content.

Even internet service providers (ISPs) would often offer their own hosting service, along with endless n00b-friendly tools to make something resembling a website for whatever hobby you fancied. In addition to proving that one can absolutely argue about style and the prevalence of colorblindness, this would also serve to balloon the number of websites at an exponential rate.

Whether or not the WWW killing off the Gopher-based internet was a bad thing remains the topic of debate, though it’s beyond question that Gopher integrated search functionality into its protocol, mirroring a file system.

Infinite Library Indexing

Without any provisions in the HTTP protocol of the WWW, the only realistic way for search engines to create an index of the ever-expanding and changing WWW is to perform so-called web crawling. This means going through every known document, following any links found in them, and making sure to revisit any documents in case their contents got changed since the last visit.

The first complication here is that since the search engine’s database is the only real index for the web, initial discovery is purely organic, starting from a certain number of URL seeds in what is called the crawl frontier. This forms an integral part of a web crawler.
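As a rough illustration of the crawl frontier idea, here is a minimal Python sketch. It does not touch the network: the `TOY_WEB` dictionary, the `example` URLs and the `toy_fetch_links` helper are all made up for this example and stand in for fetching a page and extracting its links. Real crawlers add politeness delays, robots.txt handling, prioritization and a great deal more.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
# A real crawler would fetch and parse pages over HTTP instead.
TOY_WEB = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://c.example/", "http://d.example/"],
    "http://c.example/": ["http://a.example/"],
    "http://d.example/": [],
    "http://island.example/": ["http://a.example/"],  # nothing links to this page
}


def toy_fetch_links(url):
    """Stand-in for 'download the page and extract its links'."""
    return TOY_WEB.get(url, [])


def crawl(seeds, fetch_links, max_pages=1000):
    """Breadth-first crawl from the seed URLs; returns URLs in visit order."""
    frontier = deque(seeds)  # URLs discovered but not yet visited
    seen = set(seeds)        # everything ever queued, so link loops terminate
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


order = crawl(["http://a.example/"], toy_fetch_links)
print(order)
```

Note that `http://island.example/` never shows up in the result: no crawled page links to it, so a crawler that starts from this single seed has no way of learning that it exists. That is the organic discovery problem in miniature.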

The Structure of Queues that Feed the URL Stream in the WebFountain Crawler (Credit: Edwards et al., 2001)

Development of the algorithms and architecture behind these crawlers formed a major part of the early WWW, with IBM researchers on the WebFountain project in 2001 estimating a grand total of about 500 million pages, with – as they put it – web crawlers caught between the comfortable cushion of Moore’s Law and the hard place of the web’s exponential growth. Today this number is probably closer to forty billion pages.

Although the Google Search web crawler was already pretty good back in 2001, WebFountain improved on it by using a distributed system, with ‘ants’ working through their own list of URLs to crawl, as described in the development paper by Jenny Edwards et al.

Beyond the basic recursive following of links in a document there are many confounding factors, such as when to recrawl a URL, which very much depends on how often the content on it is expected to be updated. Here one dives into the territory of statistics, as depending on the type of site we can make an educated guess on how often it is expected to be updated. For example, a government’s historical news pages are unlikely to see frequent updates, whereas the front page of a news site can see updates practically every few minutes.
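One very simple way to approximate this kind of educated guess is to adapt the revisit interval to what the crawler observes: shorten it when a page turned out to have changed, lengthen it when it did not. The Python sketch below assumes exactly that halve-or-double heuristic. The interval bounds and function names are invented for illustration and are not taken from WebFountain or any other real crawler, which would use far more careful statistics.

```python
import hashlib

# Assumed bounds for this sketch, in seconds.
MIN_INTERVAL = 5 * 60           # never recrawl more often than every 5 minutes
MAX_INTERVAL = 30 * 24 * 3600   # never wait longer than 30 days


def fingerprint(content):
    """Hash of the page content, used to detect changes between visits."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def next_interval(current, old_fp, new_fp):
    """Halve the interval if the page changed, double it otherwise."""
    if new_fp != old_fp:
        proposed = current / 2
    else:
        proposed = current * 2
    # Clamp to the assumed bounds.
    return max(MIN_INTERVAL, min(MAX_INTERVAL, proposed))


# A news front page that changes on every visit gets crawled more and more often.
interval = 3600
last = fingerprint("headline 0")
for visit in range(1, 4):
    now = fingerprint(f"headline {visit}")
    interval = next_interval(interval, last, now)
    last = now
print(interval)

# An archive page that never changes drifts towards the maximum interval.
archive_interval = 24 * 3600
fp = fingerprint("press release from years ago")
for _ in range(3):
    archive_interval = next_interval(archive_interval, fp, fp)
print(archive_interval)
```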

Inverted Indexing

As complex as the topic of web crawling is, the fun part begins when you have pruned all duplicate documents and stripped all the irrelevant fluff that’s not text to be indexed. In order to make the resulting search index at all searchable before the heat death of the Universe you cannot simply do a full text search on every single document whenever someone enters a search query.

Instead an index is constructed whereby certain keywords are mapped to documents. This inverted index is generally implemented as a hash table or similar data structure which provides quick access to the full text documents, not unlike the keyword index in the back of a book, or the more elaborate concordance of yesteryear. These latter works also provide a keyword index, but add accompanying text to provide immediate context to further save time.

Creating an inverted index is a fairly labor-intensive process, with a new document often first used to build a forward index, which decomposes the text into its keywords, prior to updating (or creating) the inverted index. As with all such text-processing tasks, and data structures in general, there are many ways to go about it, with some fun curveballs thrown into the mix such as parsing languages that do not separate words with spaces, like Japanese.
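To make the forward and inverted index a bit more concrete, here is a toy Python sketch. The three sample documents, the function names and the naive tokenizer are all assumptions made for this example. In particular the tokenizer only handles text with space-separated words in the basic Latin alphabet, which is exactly the shortcut that falls apart for a language like Japanese.

```python
import re
from collections import defaultdict

# Made-up sample documents, keyed by document ID.
DOCS = {
    1: "Gopher integrated search into its protocol",
    2: "Web crawlers follow links to build a search index",
    3: "An inverted index maps keywords to documents",
}


def tokenize(text):
    """Naive tokenizer: lowercase, then split on anything not a-z or 0-9."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_forward_index(docs):
    """Forward index: document ID -> set of keywords in that document."""
    return {doc_id: set(tokenize(text)) for doc_id, text in docs.items()}


def build_inverted_index(forward):
    """Inverted index: keyword -> set of document IDs containing it."""
    inverted = defaultdict(set)
    for doc_id, words in forward.items():
        for word in words:
            inverted[word].add(doc_id)
    return inverted


def search(inverted, query):
    """AND query: IDs of the documents that contain every query word."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(inverted.get(words[0], set()))
    for word in words[1:]:
        result &= inverted.get(word, set())
    return result


forward = build_forward_index(DOCS)
inverted = build_inverted_index(forward)
print(search(inverted, "search index"))
print(search(inverted, "index"))
```

The query never scans the document text itself: each word is a single hash table lookup, followed by a set intersection. This sketch leaves out everything that makes the real thing hard, such as ranking, word positions for phrase queries, stemming, and updating the index while documents keep changing.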

All of which is to say that implementing a search engine is easy, but making it performant, accurate and efficient at the same time is a minor nightmare. This is basically why search engines took so long to stop being so terrible, as the engineers behind them were trying to solve many rather complex problems, presumably with the C-suite and investors breathing down their necks during the dot-com days.

Search Battles

Over on the Wikipedia entry for ‘Search engine’ we find a pretty good timeline of web search engines, along with their current status. Perhaps unsurprisingly none of the 1993-era ones made it, but 1994’s WebCrawler somehow crawled into the modern age, along with Lycos. Much like 1990’s Archie search engine and similar for the Gopher web, many of these early search engines simply couldn’t compete in the rapidly changing years leading up to the new millennium.

This was also the era in which some figured that the WWW simply needed to become more ‘3D’ with virtual environments using VRML, bringing it closer to sci-fi like that portrayed in Snow Crash or Tron. Perhaps unfortunately the WWW remained the domain of mostly text and images, although most recently the flood of JavaScript frameworks appears to want to turn once simple HTML documents into full-blown desktop-like applications, all probably to the delight of web crawler engineers.

Meanwhile some search engines figured that they could lift along on the hard work of others, with so-called meta search engines collating the results from multiple search engines to save people the trouble of querying them individually. Here 1996’s Dogpile is still going strong.

Some search engines are missing from the list, such as Marginalia, which boasts the use of open source software for its indexing and crawling, while focusing on non-commercial content. There is also the ever excellent Frog Find that provides a bridge between modern search engines and systems that really cannot run the latest web browser.

Today’s Survivors

The search engine landscape remains a brutal one today, with us having to recently say farewell to Jeeves, of Ask Jeeves fame, most recently seen carrying the Ask.com name. Personally I didn’t really Ask Jeeves much back in the day, instead mostly using AltaVista (RIP) and probably Lycos and a few others that I do not recall off the top of my head.

Having Google Search burst on the scene by 2000 was definitely quite the event, which was certainly when the web search game improved. Looking back it probably was less that Google Search was simply better, but more that it pushed hard on just being a search engine, whereas the others were still very much stuck in that early WWW mindset of being a portal to the web.

To a certain extent this is understandable, as search engines aren’t a charity and running the associated hardware as well as the required bandwidth costs a lot of money. Despite this it would seem that we still have a rather thriving web search engine landscape, even if ChatGPT, Claude and kin are trying to become the very last ‘site’ you will ever need. This even as their little web crawlers are still doing the same crawling as has been done since the birth of the WWW.

19 thoughts on “How Search Engines Enabled Finding Needles In A WWW-Sized Haystack”

  1. Occasionally I search for the topic of a particular HaD article (using DDG rather than G), and I am impressed how often a brand newly published HaD article not only appears, but is near the top!

  2. If I remember right the most optimistic guesstimates were around 10% of the online pages are index-ABLE, ie can be potentially indexed.

    Google and other for-profit ventures are not in the biz of finding the pages you are looking for, they are in the biz of making profit of those pages offering profit. As a nice side effect they happen to index the rest while at it, but in all fairness the train had long left the station once web 2.0 (semantic web) mostly faded away; though, coincidence has it, AI may have picked up the slack for unrelated reasons (though, AI-assisted search sure does more daydreaming or navel gazing, whichever occupies its attention span at the moment).

    WWW still remains mostly walled gardens, and it is a good thing in a sense, because the absolute majority of them is still free to access, if you know where to look. I’d say it managed to survive the commercial enshitification mostly unscathed, and it rode through separating WWWII (education institutions only) as its own VPN – it also learned not to rely on any one provider as the only main one and work around artificial walls. I’d say these two feats accomplished about compensate for rather okay indexing of the inter-ether : ]

  3. frogfind

    On Firefox:

    Search functions are disabled on modern devices to save API quota.
    Please visit this site on a retro computer to search.

    On Dillo:

    Hier entsteht eine neue Internetpräsenz – hosted by 1blu

    NetSurf:

    [ SYSTEM ALERT: API QUOTA EXCEEDED ]

    Our search servers have reached their limit (reset mignight GMT-7)!

    Well OK?!

  4. The number of times I have done a search for something very specific, like an exact part number, and all the search results do not contain the part number I searched for, would lead me to believe this article is the first half of a two-parter on “The Rise Of…” and the conclusion is yet to come.

    Google used to be magic, then they turned it to shit by returning results not for your search string but for what they think you are looking for, and they always got that wrong. But Verbatim mode still worked like the old magic! Then they broke that too. Then DuckDuckGo… which never was the same kind of magic, but now it’s just trash. As above, most of the time, every single result does not contain the search string (CTRL+F). Putting the string in double quotes sometimes very rarely works. Two terms in double quotes, no matter what they are or how common or even if you know for a fact there are pages with those two exact terms in them, gives No Results Found…

    I just try to never do anything that needs a web search, because I already know it won’t find what I am looking for.

      1. It’s more than just a decline.
        I stopped using Google in 2018 when I found that over 75% of my searches returned nothing relevant on the first 200 results.
        Refusing to actually search for what you ask for has made them completely useless.

        They have become a used-car-salesman stereotype.
        They don’t care what you want or need, they are going to do their best to have you leave in that 1987 Fordge Crumbach

    1. The search optimization ass hats are to blame for that.
      It’s like those ads on Craigslist that have a big block of text listing everything remotely similar to what is advertised.

      Google’s search by photo is pretty amazing. You can find stuff without knowing what it is.

  5. I find that if you continually look for “example p/n” over the course of a week or so the search results get better. For example I was looking for information on an old school thermal camera and over the course of several days of continuous search the result got better. I always thought it was due to the web spiders looking in places they had not before.

    1. More like your personal search bubble is expanding.

      Sometimes you have to wave your hands in the right way to distract the search engine into breaking out of the bubble it put you in to give you ‘more relevant’ (read: more monetizable) results.

  6. Before Google, I remember a thing called SavvySearch that basically was a search engine aggregator — enter a search term, and it would be submitted to several configured search engines. Don’t remember what happened to it, though — my professor at the time was big into using it.

  7. Though it may not be practical today, back when Google was becoming popular I almost always found what I wanted quicker (and often only) with Altavista. Turns out that what most links referenced was not often what I wanted. It remains much more trouble to find obscure sites with today’s popular search engines than it was then.

  8. When Google started to become popular, somebody would test the efficiency of said search engine by typing in 2 (apparently) unrelated words. Then look at how many hits were returned.

    Thus the cyber-sport of “Google Whacking” was created. Scoring in GW is similar to golf: lower is better.
    The perfect GW score? 1. My personal best was 200.

  9. One of the key features that distinguished Google from earlier indexing schemes was an attempt to automatically use how frequently pages were referenced from other pages as a rough indication of the value of their content. One of my own early attempts to manually build an index of a community ranked fairly well on that scale.

    Of course most search engines have now adopted something like this. And the downside is that we now have people attempting Search Engine Optimization, trying to salt links in everywhere they can in order to improve their ranking on the results page.
