If you’ve logged onto GitHub recently and you’re an active user, you might have noticed a new badge on your profile: “Arctic Code Vault Contributor”. Sounds pretty awesome, right? But whose code got archived in this vault, how is it being stored, and what’s the point?
They Froze My Computer!
On February 2nd, 2020, GitHub took a snapshot of every public repository that met one of the following criteria:
- Active since Nov 2019
- 250 or more stars
- At least one star and new commits since Feb 2019
Then they traveled to Svalbard, found a decommissioned coal mine, and archived the code in deep storage underground – but not before they made a very cinematic video about it.
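For the programmers in the audience, the selection criteria listed above can be sketched as a simple predicate. This is only an illustration: the `Repo` record is hypothetical, and the exact cut-off dates are assumptions, since the criteria only name "Nov 2019" and "Feb 2019".

```python
from dataclasses import dataclass
from datetime import date

# Assumed cut-offs: the criteria only name the month, so the 1st is a guess.
ACTIVE_SINCE = date(2019, 11, 1)
COMMITS_SINCE = date(2019, 2, 1)
STAR_THRESHOLD = 250


@dataclass
class Repo:
    """Hypothetical record of a public repository at snapshot time."""
    name: str
    stars: int
    last_commit: date


def is_archived(repo: Repo) -> bool:
    """Return True if the repo meets at least one of the three criteria."""
    recently_active = repo.last_commit >= ACTIVE_SINCE
    popular = repo.stars >= STAR_THRESHOLD
    starred_and_maintained = (
        repo.stars >= 1 and repo.last_commit >= COMMITS_SINCE
    )
    return recently_active or popular or starred_and_maintained
```

Under these assumptions, an unstarred repository last touched in mid-2019 would miss the cut, while a single star would have been enough to get it in.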
How It Works
For the combination of longevity, price and density, GitHub chose film storage, provided by piql.
There’s nothing too remarkable about the storage medium: the tarball of each repository is encoded on standard silver halide film as a 2D barcode, which is distributed across frames of 8.8 million pixels each (roughly 4K). Whilst officially rated for 500 years, the film should last at least 1000 years.
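To make the idea of spreading a file across barcode frames a little more concrete, here is a minimal sketch. It assumes one bit per pixel and no error correction or framing markers, which is almost certainly simpler than whatever piql actually does; the function names are made up for illustration.

```python
from typing import Iterable, Iterator, List


def bytes_to_frames(data: bytes, pixels_per_frame: int) -> Iterator[List[int]]:
    """Split data into fixed-size frames of 0/1 pixels (assumed 1 bit each)."""
    # Unpack every byte into its eight bits, most significant bit first.
    bits = [(byte >> shift) & 1 for byte in data for shift in range(7, -1, -1)]
    for start in range(0, len(bits), pixels_per_frame):
        frame = bits[start:start + pixels_per_frame]
        # Pad the final frame with zeros so every frame has the same size.
        frame += [0] * (pixels_per_frame - len(frame))
        yield frame


def frames_to_bytes(frames: Iterable[List[int]], length: int) -> bytes:
    """Reassemble the first `length` bytes from a sequence of frames."""
    bits = [bit for frame in frames for bit in frame]
    out = bytearray()
    for i in range(0, length * 8, 8):
        byte = 0
        for bit in bits[i:i + 8]:
            byte = (byte << 1) | bit
        out.append(byte)
    return bytes(out)
```

With a toy frame size of 16 pixels, a five-byte payload spills across three frames and decodes back to the original bytes, provided the reader knows the payload length.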
You might imagine that all of GitHub’s public repositories would take up a lot of space when stored on film, but the data turns out to only be 21TB when compressed – this means the whole archive fits comfortably in a shipping container.
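As a back-of-the-envelope check on those numbers, the frame count can be estimated from the figures quoted here. The sketch below assumes decimal terabytes and one bit per pixel with no overhead for error correction or the human-readable guide, so treat the result as a lower bound rather than the real reel count.

```python
ARCHIVE_BYTES = 21 * 10**12    # 21 TB, assuming decimal terabytes
PIXELS_PER_FRAME = 8_800_000   # frame size quoted in the article
BITS_PER_PIXEL = 1             # assumption: pure black/white, no overhead


def frames_needed(total_bytes: int, pixels_per_frame: int,
                  bits_per_pixel: int = 1) -> int:
    """Return the minimum number of frames needed to hold total_bytes."""
    total_bits = total_bytes * 8
    bits_per_frame = pixels_per_frame * bits_per_pixel
    # Ceiling division using integers only, to avoid float rounding.
    return -(-total_bits // bits_per_frame)


if __name__ == "__main__":
    frames = frames_needed(ARCHIVE_BYTES, PIXELS_PER_FRAME, BITS_PER_PIXEL)
    print(f"{frames:,} frames")
```

Under those assumptions the archive works out to roughly 19 million frames.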
Each reel starts with slides containing an unencoded, human-readable text guide in multiple languages, explaining to future humanity how the archive works. If you have five minutes, the guide is good fun to read, if only to see how GitHub explains the archive to whoever discovers it. It’s interesting to see the range of future knowledge the guide caters to: it starts by explaining in very basic terms what computers and software are, despite the fact that decompression software would be required to use any of the archive. To bridge this gap, they are also providing a “Tech Tree”, a comprehensive guide to modern software, compilation, encoding, compression etc. Interestingly, whilst the introductory guide is open source, the Tech Tree does not appear to be.
But the bigger question is not how GitHub did it, but why.
Why?
The mission of the GitHub Archive Program is to preserve open source software for future generations.
GitHub talks about two reasons for preserving software like this: historical curiosity and disaster. Let’s talk about historical curiosity first.
There is an argument that preserving software is essential to preserving our cultural heritage. This is an easy argument to accept: even if you’re in the camp that believes there’s nothing artistic about a bunch of ones and zeros, it can’t be denied that software is a platform and medium for an incredibly diverse range of modern culture.
GitHub also cites past examples of important technical information being lost to history, such as the search for the blueprints of the Saturn V, or the discovery of the Roman mortar which built the Pantheon. But data storage, backup, and networks have evolved significantly since Saturn V’s blueprints were produced. Today people frequently quip, “once it’s on the internet, it’s there forever”. What do you reckon? Do you think the argument that software (or rather, the subset of software which lives in public GitHub repos) could be easily lost in 2020+ is valid?
Whatever your opinion, simply preserving open source software on long timescales is already being done by many other organisations, and it doesn’t require an arctic bunker. To justify the bunker, we have to consider GitHub’s second motive: a large-scale disaster.
If Something Goes Boom
We can’t predict what apocalyptic disasters the future may bring – that’s sort of the point. But if humanity gets into a fix, would a code vault be useful?
Firstly, let’s get something straight: in order for us to need to use a code archive buried deep in Svalbard, something needs to have gone really, really wrong. Wrong enough that things like softwareheritage.org, the Wayback Machine, and countless other “conventional” backups aren’t working. So this would be a disaster that has wiped out the majority of our digital infrastructure, including worldwide redundancy backups and networks, requiring us to rebuild things from the ground up.
This raises the question: if we were to rebuild our digital world, would we make a carbon copy of what already exists, or would we rebuild from scratch? There are two sides to this coin: could we rebuild our existing systems, and would we want to rebuild our existing systems?
Tackling the former first: modern software is built upon many, many layers of abstraction. In a post-apocalyptic world, would we even be able to use much of the software with our infrastructure/lower-level services wiped out? To take a random, perhaps tenuous example, say we had to rebuild our networks, DNS, ISPs, etc. from scratch. Inevitably behavior would be different, nodes and information missing, and so software built on layers above this might be unstable or insecure. To take more concrete examples, this problem is greatest where open-source software relies on closed-source infrastructure — AWS, 3rd party APIs, and even low-level chip designs that might not have survived the disaster. Could we reimplement existing software stably on top of re-hashed solutions?
The latter point (would we want to rebuild our software as it is now) is more subjective. I have no doubt every Hackaday reader has one or two things they might change about, well, almost everything, but can’t due to existing infrastructure and legacy systems. Would the opportunity to rebuild modern systems be able to win out over the time cost of doing so?
Finally, you may have noticed that software is evolving rather quickly. Being a web developer today who is familiar with all the major technologies in use looks pretty different from the same role 5 years ago. So does archiving a static snapshot of code make sense given how quickly it would be out of date? Some would argue that throwing around numbers like 500 to 1000 years is pretty meaningless for reuse if the software landscape has completely changed within 50. If an apocalypse were to occur today, would we want to rebuild our world using code from the 80s?
Even if we weren’t to directly reuse the archived code to rebuild our world, there are still plenty of reasons it might be handy when doing so, such as referring to the logic implemented within it, or the architecture, data structures and so on. But these are just my thoughts, and I want to hear yours.
Was This a Useful Thing to Do?
The thought that there is a vault in the Arctic directly containing code you wrote is undeniably fun. What’s more, your code will now almost certainly outlive you! But do you, dear Hackaday reader, think this project is a fun exercise in sci-fi, or does it hold real value to humanity?
27 thoughts on “Ask Hackaday: Why Did GitHub Ship All Our Software Off To The Arctic?”
Bad idea, because all software sucks. They should have archived computer science books instead.
On the contrary, this software is looking to be pretty cool.
Fully agree, open source is LOW quality and full of bugs. Remember the Heartbleed bug: 2 years in production!!
Almost all code is low quality and full of bugs. That’s just how humans do things.
You must be trolling, but I just have to ask: So in order to get good quality and bug free code we just need to hide the code?
Duh, haven’t you ever heard of cold storage?
My very first thought was “So in 1000 years, how are you going to convert the code into an executable? And what hardware will run the binary?”
But it is funny, there is so little overlap between the most advanced technology about 1000 years ago and today.
You cannot even ask these kinds of questions about the most advanced technology on the planet about 1000 years ago. I guess that you would be close to the time of the Battle of Hastings (1066) so (just before gunpowder cannons) you would be talking about archiving examples of boiled leather armours, chainmails, javelins, long spears, swords, maces, axes, simple bows and crossbows.
They did a fair job of recording what they thought worth recording though. We’ve got a tapestry, the domesday book, various chronicles.
“But it is funny, there is so little overlap between the most advanced technology about 1000 years ago and today.”
Uh… yes there is? The examples you gave are all weapons, but believe it or not, people do more than fight each other. Hard to believe, I know, but true.
Roman concrete, for instance, is still being studied by scientists today, and we know how to do it because Charlemagne preserved a ton of classical manuscripts during the Carolingian Renaissance. Even with weapons, however, wootz steel, which apparently developed carbon nanotubes in its matrix, is still made by a complicated process that is not fully understood.
I don’t really doubt that in 1000 years there will be similar technological archaeologists investigating how mid-20th century technological accomplishments happened within the limitations of the time.
> The examples you gave are all weapons
I went for the easiest thing that I could think of from that time period. Roman concrete is from about 2000 years ago. An unfortunate reality but the most advanced technology in any time period is typically how to efficiently spy on each other or kill each other.
You could argue that mathematics, music, paintings or poetry of the period were more technologically advanced than weapons – who is to say. The only invention I could find from that time period, that stood out to me in some way, was the invention of the pound lock for transporting large quantities of cargo in China.
> you would be talking about archiving examples of boiled leather armours, chainmails, javelins, long spears, swords, maces, axes, simple bows and crossbows.
I think a LOT of historians would find such an archive to be very valuable.
My thoughts would run to (soft) archiving of technology/history/mathematics/etc. of the times. Not the physical as that can always be re-created. Lots of history/knowledge was lost for example in the destruction of the library of Alexandria. If all that knowledge of the time were in ‘cold’ dry storage somewhere….
Not sure how code will benefit the future other than to make for an interesting history lesson 1000 years from now. Sort of how we appreciate the old mechanical calculators for example…
“Lots of history/knowledge was lost for example in the destruction of the library of Alexandria.”
Also, lots of knowledge was only preserved because some people thought it was important and copied it, and it was later found. Which… is exactly a parallel to this.
It is thanks to the monks during the “Dark Ages”, who wrote fresh copies from rotting papyrus/parchment, that we have non-Christian works from Seneca, Plato, Socrates, Cicero…
> the Tech Tree does not appear to be [open source]
According to some of the other comments in the issues, it’s not that it’s proprietary, it’s just that it hasn’t been finished yet. It’s not like it’s going to be needed in the short term anyway, so they can take the time to do it right.
Look at the extents we go to, to dig things up just to learn about previous civilizations. Not so we can duplicate it, but just so we know more about our history and how things were done. Bones, tools, household belongings, clues to their beliefs, calendars…This isn’t about rebuilding. This is to preserve that facet of our existence. It is perfectly feasible for our civilization to be “reset” by some combination of catastrophic events (pandemic anyone?). Our various infrastructures are complex and in some ways, fragile. What was the state of our technology 1000 years ago? or even 100?
Unfortunately, digital data is not a physical object and is easily lost. It requires some amount of technology to recover. Film appears to be the oldest technology that we can feasibly store that much data on for “longish” periods of time.
I think it would be more useful to have a cold storage wikipedia backup.
I think I’ve got Encarta 97 well buried somewhere.
At least a few years ago you could download snapshots of Wikipedia. No idea about the size, but the text only should not be too big once compressed. If you have some disk space left, why not?
Github? What is Github, anyway?
I’ve read about this from several sources, what I haven’t been able to figure out is what it cost.
At some (lowish) cost, it’s a valuable exercise as a PR stunt or just a conversation starter to get people like us thinking about “what happens when.”
As far as functioning as a “useful” backup – no.
Something bad enough has to happen to wipe out all the other live copies of this data. Then we have to recover from that event to the point that we have the _ability_ to recover that archive, read, and translate that data. As well as having both hardware and software infrastructure in place to make that software worth recovering. All within the lifespan of the media.
Seems more likely we’ll be using the film strips as lashings to hold logs together to build a raft to get off the island.
Oh great. Now my junk files have a place in the Ark in case more things get out of whack in the next little while. 2020 isn’t over yet…
Even in the movie world, 2020 and 2021 are end of the world:
– In Jade’s World, Skynet became self-aware in 2021. It launched an attack on humanity on 18 June, resulting in Judgment Day.
– In the Dark Fate timeline, after the termination of Skynet and the destruction of a Cyberdyne building, a new timeline was created, in which Judgment Day happened in the 2020s.
Out of interest I took a look at softwareheritage.org, and typed in “Hello World” as a search string, expecting to find the first C program in K&R. All I got was a big pile of spam.
That just looks like a big archive of crap. Github is probably not far off!
“Why Did GitHub Ship All Our Software Off To The Arctic?”
Because they can???
Because they watched the Mad Max series and figure in an apocalypse the Aussies would just use the cans of film as handy “BBQ in a box” fires, if they sent them there.
With all the code, I hope they are including some means of explaining the hardware it runs on…
Binary would be meaningless, unless it is mapped to some Op Code, and the OpCode is mapped to the construction of a processor (NAND gates and such) and I/O.
I forgot another intermediary step.
Converting the 2D barcode to binary, and decompression algorithms…
Otherwise, it will all be Voynich Manuscript to future people.