Internet Archive Hits One Trillion Web Pages

Server racks branded with Internet Archive

In case you didn’t hear, on October 22, 2025, the Internet Archive, which hosts the Wayback Machine at archive.org, celebrated a milestone: one trillion web pages archived for posterity.
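
If you want to poke at that trillion-page index yourself, the Wayback Machine exposes a public availability API that returns the archived snapshot closest to a requested date. Here is a minimal sketch in Python using only the standard library; the hackaday.com URL and the 2004 timestamp are just illustrative inputs.

    import json
    import urllib.parse
    import urllib.request

    def closest_snapshot(url, timestamp=None):
        """Ask the Wayback Machine availability API for the snapshot of
        `url` closest to `timestamp` (YYYYMMDDhhmmss), or the newest one."""
        params = {"url": url}
        if timestamp:
            params["timestamp"] = timestamp
        api = "https://archive.org/wayback/available?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(api) as resp:
            data = json.load(resp)
        # When no snapshot exists, the API returns an empty
        # "archived_snapshots" object, so this comes back as None.
        return data.get("archived_snapshots", {}).get("closest")

    snap = closest_snapshot("hackaday.com", "20041001")
    if snap:
        print(snap["timestamp"], snap["url"])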

Founded in 1996 by Brewster Kahle, the organization and its facilities grew through the late nineties; in 2001, access to the archive was greatly improved by the introduction of the Wayback Machine. On their own website, as captured on October 21, 2009, they explained their mission and purpose:

Most societies place importance on preserving artifacts of their culture and heritage. Without such artifacts, civilization has no memory and no mechanism to learn from its successes and failures. Our culture now produces more and more artifacts in digital form. The Archive’s mission is to help preserve those artifacts and create an Internet library for researchers, historians, and scholars.

We were curious about the Internet Archive’s technology. Storing a copy (in fact, two copies!) of the internet is no mean feat, so we did some digging to find out how it’s done. The best information available is in this article from 2016: 20,000 Hard Drives on a Mission. They keep two copies of every “item”, stored in Linux directories. In 2016 they had over 30 petabytes of content and were ingesting at a rate of 13 to 15 terabytes per day, with web and television being the most voluminous.

In 2016 they had around 20,000 individual disk drives, housed in specialized computers called “datanodes”. Each datanode holds 36 data drives plus two operating-system drives, and datanodes are organized into racks of 10 machines, for 360 data drives per rack. These racks are interconnected via high-speed Ethernet to form a storage cluster.

Even though content storage tripled between 2012 and 2016, the count of disk drives stayed about the same, thanks to improvements in disk drive technology. Datanodes that were once populated with 36 individual 2-terabyte drives are today filled with 8-terabyte drives, moving single-node capacity from 72 terabytes (64.8 TB formatted) to 288 terabytes (259.2 TB formatted) in the same physical space. The evolution of disk density did not happen in a single step, so there are populations of 2, 3, 4, and 8 TB drives in the storage clusters.
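
To sanity-check those numbers, here is a quick back-of-the-envelope sketch in Python. The 0.9 formatted-capacity factor is an assumption inferred from the 72 TB raw / 64.8 TB formatted figures above, not an official Internet Archive number.

    # Capacity math for a 36-drive datanode across drive generations.
    DATA_DRIVES_PER_NODE = 36
    NODES_PER_RACK = 10
    FORMATTED_FACTOR = 0.9  # assumed ~10% formatting overhead (72 TB -> 64.8 TB)

    for drive_tb in (2, 3, 4, 8):
        node_raw = DATA_DRIVES_PER_NODE * drive_tb
        node_fmt = node_raw * FORMATTED_FACTOR
        rack_raw = node_raw * NODES_PER_RACK
        print(f"{drive_tb} TB drives: {node_raw} TB raw per node "
              f"({node_fmt:.1f} TB formatted), {rack_raw} TB raw per rack")

Running it reproduces the 72/64.8 TB and 288/259.2 TB single-node figures quoted above, and shows a fully 8 TB rack holding 2,880 TB raw.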

We will leave you with the visual styling of the Hackaday Beta in 2004, and what an early google.com or amazon.com looked like back in the day. A super big shout-out to the Internet Archive: thanks for providing such an invaluable service to our community, and congratulations on this excellent achievement.

12 thoughts on “Internet Archive Hits One Trillion Web Pages”

  1. I really hope that one day we’ll see them get the funding and manpower to go through and organize the general archive.

    It’s still useful and you can still find things, but I feel there is a lot of duplicated content and things that would benefit from being grouped together.

    User-sourced metadata only gets you so far, after all, because a user might forget, make mistakes, leave things blank, or might not even have the option for certain important elements.

    If we had a team of people cleaning up the archive, there would be magic to be had in linking all sorts of items together.

  2. I wonder about the physical security of the archive. If the San Andreas fault slips again in a BIG way, does the archive lose both copies? It’s fantastic that they have this data, but hard drives are fragile things. Most importantly, it’s almost impossible to store digital data for LONG periods of time. Just look at how much trouble it is to read that old tape from Unix version 4. We need to keep this data for thousands of years. I don’t think anybody’s figured out a good way to keep digital data safe and accessible for more than 20 years (which is equivalent to just a few seconds for an archaeologist).

    1. I think right now the bigger issue the archive needs to tackle is all the different groups who try to strike down archives. From authors to politicians, everyone goes after the archive, and they lack the funding to properly fight back against these pests.

      1. I’m not sure what politicians you’re referring to, but in the U.S., authors have specific rights under longstanding copyright law.

        Copying… excuse me… “archiving”… the copyrighted contents of my website without securing my permission is one such violation. Now, as their intent is benign and my site, in and of itself, is not a revenue-generator, to me this violation becomes a technicality not worth creating bad feelings over.

        On the other hand, when the full text of books I’ve written shows up in the archive–books still in print and still a source of income to me—every “free” download from the archive is literally money out of my pocket. You can be sure I will… and have… taken them to task for this, despite their purpose being… again… benign.

        I like the Internet Archive, I use it, and I’ve donated to their operation. But authors still have rights.

        1. Copying… excuse me… “archiving”… the copyrighted contents of my website without securing my permission is one such violation

          I don’t know about websites, but libraries and dead-tree-archives are explicitly allowed to make copies of books “for archival purposes.” What they aren’t allowed to do is make those archived materials available for loan like a regular library book until the underlying copyrights expire.

          The idea behind this exception is to preserve things for posterity. So if there’s a rare, out-of-print-but-in-copyright book in my local library’s collection, they can make a copy and put the copy in a vault, but keep the “original” that they already had out on display or even in circulation if they want to.

          I’d have to read the law with the eyes of a lawyer to know how this “archivist’s exception” applies to web sites.

        2. My disdain for authors comes from their false sense of moral superiority. They will preach to the heavens about how important libraries are (because they are), libraries which are freely allowed to distribute a singular copy of their book to as many people as they want. BUT if the book is an ebook, no sir: the library can only distribute one copy at a time and MUST PAY for each loan out.

          If a library wants to loan out a physical book to 1000 people, one at a time, they buy one book. If they want to loan out an ebook to 1000 people, one at a time, they must pay for 1000 books. It is absurd and abusive to libraries. The Internet Archive was only attempting to undo the absurd burden ebooks have put on public libraries.

          1. Really? Why would the library have to buy 1000 ebooks if they’re only loaning out to one person at a time? Or do you mean that they make a “royalty” type payment for each time the ebook is lent out?

            What’s the cost of one physical copy of a book vs. one ebook (license, I assume)? Surely it’s not a 1:1 pricing model.

            Can you tell that I haven’t used the services of a library for quite a number of years? I guess I worry about the day that libraries no longer exist but, on the other hand, I’m not really doing anything to support my local library.
