News Sites Are Blocking Internet Archive Over AI Scraping Fears

June 8, 2026 by Maya Posch 31 Comments

Especially in this era of the Internet, the role of the Internet Archive’s Wayback Machine has become increasingly essential as more and more web content vanishes into the ether or is surreptitiously altered to hide salient details. More recently a new worry has cropped up in the form of data scraping for so-called AI systems, or at least that’s one of the excuses being offered for blocking the Wayback Machine’s web crawlers, with [Andrew Deck] and [Hanaa’ Tameez] of [Nieman Lab] detailing the impact and the reasons provided.

Some news outlets like The Baltimore Banner insist that they’re only blocking the Wayback Machine crawlers because they are worried that LLM chatbots would otherwise ‘improperly cite’ the source of content, while outlets like The Atlantic have put a blanket anti-scraping policy in place. Meanwhile news outlets are generally happy to let paid commercial news archiving outlets like ProQuest and LexisNexis index their content, showing a potential financial incentive.

Whatever the reasons, the direct effect is that as content is modified or vanishes during, for example, a system migration, buy-out or bankruptcy, researchers who rely on the Wayback Machine are pretty much forced to fall back on paid offerings by ProQuest and kin, which lack the pure archiving focus and free access to information. It will also leave big holes in what the Wayback Machine can cover in its archives, with news especially becoming very spotty.

Incidentally there’s an ongoing petition over at SaveTheArchive.com which people can sign.

Marion Stokes Fought Disinformation With VCRs

January 20, 2026 by Kristina Panos 50 Comments

You’ve likely at least heard of Marion Stokes, the woman who constantly recorded television for over 30 years. She comes up on reddit and other places every so often as a hero archivist who fought against disinformation and disappearing history. But who was Marion Stokes, and why did she undertake this project? And more importantly, what happened to all of those tapes? Let’s take a look.

Marion the Librarian

Marion was born November 25, 1929 in Germantown, Philadelphia, Pennsylvania. Noted for her left-wing beliefs as a young woman, she became quite politically active, and was even courted by the Communist Party USA to potentially become a leader. Marion was also involved in the civil rights movement.

Continue reading “Marion Stokes Fought Disinformation With VCRs” →

Internet Archive Hits One Trillion Web Pages

November 18, 2025 by John Elliot V 14 Comments

In case you didn’t hear — on October 22, 2025, the Internet Archive, who host the Wayback Machine at archive.org, celebrated a milestone: one trillion web pages archived, for posterity.

Founded in 1996 by Brewster Kahle, the organization and its facilities grew through the late nineties; in 2001 access to their archive was greatly improved by the introduction of the Wayback Machine. On their own website, as of Oct 21, 2009, they explained their mission and purpose:

Most societies place importance on preserving artifacts of their culture and heritage. Without such artifacts, civilization has no memory and no mechanism to learn from its successes and failures. Our culture now produces more and more artifacts in digital form. The Archive’s mission is to help preserve those artifacts and create an Internet library for researchers, historians, and scholars.

We were curious about the Internet Archive technology. Storing a copy (in fact two copies!) of the internet is no mean feat, so we did some digging to find out how it’s done. The best information available is in this article from 2016: 20,000 Hard Drives on a Mission. They keep two copies of every “item”, which are stored in Linux directories. In 2016 they had over 30 petabytes of content and were ingesting at a rate of 13 to 15 terabytes per day, with web and television being the most voluminous.

In 2016 they had around 20,000 individual disk drives, each housed in specialized computers called “datanodes”. The datanodes have 36 data drives plus two operating system drives per machine. Datanodes are organized into racks of 10 machines, having 360 data drives per rack. These racks are interconnected via high-speed Ethernet to form a storage cluster.

Even though content storage tripled over 2012 to 2016, the count of disk drives stayed about the same; this is because of disk drive technology improvements. Datanodes that were once populated with 36 individual 2 terabyte drives are today filled with 8 terabyte drives, moving single node capacity from 72 terabytes (64.8 T formatted) to 288 terabytes (259.2 T formatted) in the same physical space. The evolution of disk density did not happen in a single step, so there are populations of 2, 3, 4, and 8 T drives in the storage clusters.
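
The capacity figures above follow from simple arithmetic, sketched here in Python. The 0.9 formatted-to-raw ratio is not stated in the source; it is an assumption inferred from the quoted pairs (64.8/72 and 259.2/288), and the function names are purely illustrative.

```python
DATA_DRIVES_PER_NODE = 36   # data drives per datanode (from the article)
NODES_PER_RACK = 10         # datanodes per rack (from the article)
FORMATTED_RATIO = 0.9       # assumption: inferred from 64.8/72 and 259.2/288


def node_capacity_tb(drive_tb):
    """Return (raw, formatted) capacity in terabytes for one datanode."""
    raw = DATA_DRIVES_PER_NODE * drive_tb
    return raw, round(raw * FORMATTED_RATIO, 1)


def rack_data_drives():
    """Return the number of data drives in one rack."""
    return DATA_DRIVES_PER_NODE * NODES_PER_RACK


if __name__ == "__main__":
    for size in (2, 8):
        raw, formatted = node_capacity_tb(size)
        print(f"{size} TB drives: {raw} TB raw, {formatted} TB formatted")
    print(f"{rack_data_drives()} data drives per rack")
```

Swapping 2 TB drives for 8 TB drives quadruples per-node capacity without touching the drive count, which is why storage could triple while the number of drives stayed roughly flat.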

We will leave you with the visual styling of Hackaday Beta in 2004, and what an early google.com or amazon.com looked like back in the day. Super big shout out to the Internet Archive, thanks for providing such an invaluable service to our community, and congratulations on this excellent achievement.

Wayback Proxy Lets Your Browser Party Like It’s 1999

May 26, 2025 by Tyler August 14 Comments

This project is a few years old, but it might be appropriate to cover it late since [richardg867]’s Wayback Proxy is, quite literally, timeless.

It does, more-or-less, what it says on the tin: it is an HTTP proxy that retrieves pages from the Internet Archive’s Wayback Machine, or the Oocities archive of old Geocities sites. (Remember Geocities?) It is meant to sit on a Raspberry Pi or similar SBC between you and the modern internet. A line in a config file lets you specify the exact date. We found this via YouTube in a video by [The Science Elf] (embedded below, for those of you who don’t despise YouTube) in which he attaches a small screen and dial to his Pi to create what he calls the “Internet Time Machine” using the Wayback Proxy. (Sadly [The Science Elf] did not see fit to share his work, but it would not be difficult to recreate the Python script that edits config.json.)
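
A script that edits config.json is probably not much more than the following sketch. It assumes the file is a flat JSON object with the target date stored as a `YYYYMMDD` string under a `DATE` key; check the key name and format against your copy of Wayback Proxy before relying on it. The function name `set_proxy_date` is hypothetical, and this is not [The Science Elf]'s code.

```python
import json
from datetime import date
from pathlib import Path


def set_proxy_date(config_path, when, key="DATE"):
    """Write `when` into the config file as a YYYYMMDD string.

    Assumption: the date lives under `key` in a flat JSON object.
    All other settings in the file are preserved.
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config[key] = when.strftime("%Y%m%d")
    path.write_text(json.dumps(config, indent=4) + "\n", encoding="utf-8")
    return config[key]
```

A dial handler could then call something like `set_proxy_date("config.json", date(1999, 12, 31))`. Whether the proxy picks up the change without a restart is worth verifying on your own setup.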

What’s the point? Well, if you have a retro-computer from the late 90s or early 2000s, you’re missing out on a key part of the vintage experience without access to the vintage internet. This was the era when desktops were being advertised as made to get you “Online”. Using Wayback Proxy lets you relive those halcyon days, or live them for the first time, for the younger set. At least, you can relive those parts of the old internet which could be archived, which sadly isn’t everything. Still, for a nostalgia trip, or a living history exhibit to show the kids? It sounds delightful.

Of course it is possible to hit up the modern web on a retro PC (or on a Mac Plus). As long as you’re not caught up in an internet outage, as this author recently was.

Continue reading “Wayback Proxy Lets Your Browser Party Like It’s 1999” →

Access The Information Superhighway With A Mac Plus

October 17, 2024 by Bryan Cockfield 5 Comments

For some time now, Apple has developed a reputation for manufacturing computers and phones that are not particularly repairable or upgradable. While this reputation is somewhat deserved, especially in recent years, it seems less true for their older machines. Their second and perhaps most influential computer, the Apple II, was so upgradable that the machine had a production run of nearly two decades. Similarly, the Macintosh Plus of 1986 was surprisingly upgradable and repairable, and [Hunter] demonstrates its capabilities by bringing one onto the modern Internet, albeit with a few tricks to adapt the old hardware and software to the modern era.

The Mac Plus was salvaged from a thrift store, and the first issue to solve was that it had some rotten capacitors that had to be replaced before the computer could be reliably powered on at all. [Hunter] then got to work bringing this computer online, with the only major hardware modification being a BlueSCSI hard drive emulator which allows using an SD card instead of an original hard disk. It can also emulate an original Macintosh Ethernet card, allowing it to fairly easily get online.

The original operating system and browser don’t support modern protocols such as HTTPS, or web technologies like JavaScript and CSS, so a tool called MacProxy was used to bridge this gap. It serves simplified HTML from the Internet to the Mac Plus, but [Hunter] wanted it to work even better, adding modular domain-specific handling to allow the computer to more easily access sites like Reddit, YouTube, and even Hackaday, although he does call us out a bit for not maintaining our retro page as well as we ought to.

[Hunter] has also built an extension to use the Wayback Machine to serve websites to the Mac from a specific date in the past, which really enhances the retro feel of using a computer like this to access the Internet. Of course, if you don’t have original Macintosh hardware but still want to have the same experience of the early Internet or retro hardware this replica Mac will get you there too.
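
Serving a site "from a specific date" comes down to asking the Wayback Machine for a capture near a given timestamp. The sketch below only builds the public snapshot URL; it is not taken from [Hunter]'s extension, and the function name `wayback_url` is made up for illustration. The Wayback Machine generally redirects such a request to the capture closest to the requested timestamp.

```python
from datetime import datetime


def wayback_url(original_url, when):
    """Build a Wayback Machine URL for `original_url` near `when`.

    Uses the 14-digit YYYYMMDDhhmmss timestamp form.
    """
    timestamp = when.strftime("%Y%m%d%H%M%S")
    return f"https://web.archive.org/web/{timestamp}/{original_url}"
```

A proxy sitting between the retro machine and the Internet would fetch this URL on the old computer's behalf, since the vintage browser cannot speak HTTPS itself.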

Continue reading “Access The Information Superhighway With A Mac Plus” →

This Week In Security: The Internet Archive, Glitching With A Lighter, And Firefox In-the-wild

October 11, 2024 by Jonathan Bennett 7 Comments

The Internet Archive has been hacked. This is an ongoing story, but it looks like this started at least as early as September 28, while the site itself was showing a creative message on October 9th, telling visitors they should be watching for their email addresses to show up on Have I Been Pwned.

Hi folks, yes, I'm aware of this. I've been in communication with the Internet Archive over the last few days re the data breach, didn't know the site was defaced until people started flagging it with me just now. More soon. https://t.co/uRROXX1CF9

— Troy Hunt (@troyhunt) October 9, 2024

There are questions still. The site defacement seems to have included either a subdomain takeover, or a long tail attack resulting from the polyfill takeover. So far my money is on something else as the initial vector, and the polyfill subdomain as essentially a red herring.

Troy Hunt has confirmed that he received 31 million records, loaded them into the HIBP database, and sent out notices to subscribers. The Internet Archive had email addresses, usernames, and bcrypt hashed passwords.

In addition, the Archive has been facing Distributed Denial of Service (DDoS) attacks off and on this week. It’s an open question whether the same people are behind the breach, the message, and the DDoS. So far it looks like one group or individual is behind both the breach and vandalism, and another group, SN_BLACKMETA, is behind the DDoS.

Continue reading “This Week In Security: The Internet Archive, Glitching With A Lighter, And Firefox In-the-wild” →

The Internet Archive Has Been Hacked

October 10, 2024 by Zoe Skyforest 33 Comments

There are a great many organizations out there, all with their own intentions—some selfish, some selfless, some that land somewhere in between. Most would put the Internet Archive in the category of the library—with its aim of preserving and providing knowledge for the aid of all who might call on it. Sadly, as [theresnotime] reports, it appears this grand institution has been hacked.

On Wednesday, users visiting the Internet Archive were greeted with a foreboding popup that stated the following:

Have you ever felt like the Internet Archive runs on sticks and is constantly on the verge of suffering a catastrophic security breach? It just happened. See 31 million of you on HIBP!

The quote appears to refer to Have I Been Pwned (HIBP), a site that collates details of security breaches so individuals can check if their details have been compromised.

According to founder Brewster Kahle, the site was apparently DDoS’d, with the site defaced via a JavaScript library. It’s believed this may have been a polyfill supply chain attack. As for the meat of the hack, it appears the individuals involved made off with usernames, emails, and salted, hashed passwords. Meanwhile, as Wired reports, it appears Have I Been Pwned first received the stolen data of 31 million users on September 30.

At the time of writing, it appears the Internet Archive has restored the website to some degree of normal operation. It’s sad to see one of the Internet’s most useful and humble institutions fall victim to a hack like this one. As is always the way, no connected machine is ever truly safe, no matter how much we might hope that’s not the case.

[Thanks to Sammy for the tip!]

Hackaday

internet archive

14 Articles
