Ask Hackaday: Why Did GitHub Ship All Our Software Off To The Arctic?

If you’ve logged onto GitHub recently and you’re an active user, you might have noticed a new badge on your profile: “Arctic Code Vault Contributor”. Sounds pretty awesome, right? But whose code got archived in this vault, how is it being stored, and what’s the point?

They Froze My Computer!

On February 2nd, GitHub took a snapshot of every public repository that met any one of the following criteria:

  • Activity between November 13th, 2019 and February 2nd, 2020
  • At least one star and new commits between February 2nd, 2019 and February 2nd, 2020
  • 250 or more stars

Then they traveled to Svalbard, found a decommissioned coal mine, and archived the code in deep storage underground – but not before they made a very cinematic video about it.

How It Works

Source: GitHub

For the combination of longevity, price and density, GitHub chose film storage, provided by piql.

There’s nothing too remarkable about the storage medium: the tarball of each repository is encoded on standard silver halide film as a 2D barcode, which is distributed across frames of 8.8 million pixels each (roughly 4K). Whilst officially rated for 500 years, the film should last at least 1000.
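
As a very rough illustration of the idea – and this is only a sketch, not piql’s actual format, which adds error correction, alignment marks, and per-frame metadata – packing a tarball into frame-sized black-and-white bitmaps might look something like the Python below. The 4096 x 2160 frame size (about 8.8 million pixels), the one-bit-per-pixel density, and the input file name are all assumptions for illustration:

    # Minimal sketch: pack an arbitrary tarball into frame-sized 1-bit images.
    # Assumptions (not piql's real spec): 4096 x 2160 frames (~8.8 Mpixels),
    # one payload bit per pixel, no error correction or metadata.
    import numpy as np
    from PIL import Image

    FRAME_W, FRAME_H = 4096, 2160
    BITS_PER_FRAME = FRAME_W * FRAME_H

    def tarball_to_frames(path):
        data = np.fromfile(path, dtype=np.uint8)
        bits = np.unpackbits(data)                    # one bit per pixel
        pad = (-len(bits)) % BITS_PER_FRAME
        bits = np.pad(bits, (0, pad))                 # pad out the final frame
        frames = bits.reshape(-1, FRAME_H, FRAME_W)
        for i, frame in enumerate(frames):
            Image.fromarray(frame * 255).save(f"frame_{i:06d}.png")

    tarball_to_frames("repository.tar.gz")            # hypothetical input file

Decoding is just the reverse – read the pixels back, pack the bits into bytes, and strip the padding – which is exactly why the archive also needs human-readable instructions to bootstrap the tooling.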

You might imagine that all of GitHub’s public repositories would take up a lot of space when stored on film, but the data turns out to be only 21TB when compressed – this means the whole archive fits comfortably in a shipping container.
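
For a sense of how much film that implies, here’s a back-of-the-envelope estimate. The one-usable-bit-per-pixel figure is an assumption (the real encoding density and error-correction overhead aren’t spelled out), so treat the result as an order of magnitude only:

    # Rough frame-count estimate; payload density is an assumption, not an official figure.
    archive_bytes = 21e12                         # 21 TB of compressed repositories
    payload_bits_per_frame = 8.8e6                # ~8.8 Mpixels at 1 usable bit each
    frames = archive_bytes * 8 / payload_bits_per_frame
    print(f"~{frames / 1e6:.0f} million frames")  # on the order of 19 million frames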

Each reel starts with slides containing an un-encoded, human-readable text guide in multiple languages, explaining to future humanity how the archive works. If you have five minutes, reading the guide to see how GitHub explains the archive to whoever discovers it is good fun. It’s interesting to see the range of future knowledge the guide caters to – it starts by explaining in very basic terms what computers and software are, despite the fact that de-compression software would be required to use any of the archive. To bridge this gap, they are also providing a “Tech Tree”, a comprehensive guide to modern software, compilation, encoding, compression, etc. Interestingly, whilst the introductory guide is open source, the Tech Tree does not appear to be.

But there’s a bigger question than how GitHub did it: why did they do it?

Why?

The mission of the GitHub Archive Program is to preserve open source software for future generations.

GitHub talks about two reasons for preserving software like this: historical curiosity and disaster. Let’s talk about historical curiosity first.

There is an argument that preserving software is essential to preserving our cultural heritage. It’s an easy argument to buy: even if you’re in the camp that believes there’s nothing artistic about a bunch of ones and zeros, it can’t be denied that software is a platform and medium for an incredibly diverse swath of modern culture.

GitHub also cites past examples of important technical information being lost to history, such as the search for the blueprints of the Saturn V, or the rediscovery of the Roman mortar recipe used to build the Pantheon. But data storage, backup, and networks have evolved significantly since the Saturn V’s blueprints were produced. Today people frequently quip, “once it’s on the internet, it’s there forever”. What do you reckon? Do you think the argument that software (or rather, the subset of software which lives in public GitHub repos) could be easily lost in 2020+ is valid?

Whatever your opinion, simply preserving open source software on long timescales is already being done by many other organisations, and it doesn’t require an Arctic bunker. For that we have to consider GitHub’s second motive: a large-scale disaster.

If Something Goes Boom

We can’t predict what apocalyptic disasters the future may bring – that’s sort of the point. But if humanity gets into a fix, would a code vault be useful?

Firstly, let’s get something straight: in order for us to need to use a code archive buried deep in Svalbard, something needs to have gone really, really wrong. Wrong enough that things like softwareheritage.org, the Wayback Machine, and countless other “conventional” backups aren’t working. So this would be a disaster that has wiped out the majority of our digital infrastructure, including worldwide redundant backups and networks, requiring us to rebuild things from the ground up.

This raises the question: if we were to rebuild our digital world, would we make a carbon copy of what already exists, or would we rebuild from scratch? There are two sides to this coin: could we rebuild our existing systems, and would we want to?

Tackling the former first: modern software is built upon many, many layers of abstraction. In a post-apocalyptic world, would we even be able to use much of the software with our infrastructure and lower-level services wiped out? To take a random, perhaps tenuous example, say we had to rebuild our networks, DNS, ISPs, etc. from scratch. Inevitably behaviour would be different, nodes and information would be missing, and so software built on the layers above might be unstable or insecure. To take more concrete examples, this problem is greatest where open-source software relies on closed-source infrastructure – AWS, third-party APIs, and even low-level chip designs that might not have survived the disaster. Could we stably reimplement existing software on top of hastily rebuilt foundations?

The latter point – would we want to rebuild our software as it is now – is more subjective. I have no doubt every Hackaday reader has one or two things they might change about, well, almost everything, but can’t due to existing infrastructure and legacy systems. Would the opportunity to rebuild modern systems be able to win out over the time cost of doing so?

Finally, you may have noticed that software is evolving rather quickly. Being a web developer today who is familiar with all the major technologies in use looks pretty different from the same role 5 years ago. So does archiving a static snapshot of code make sense given how quickly it would be out of date? Some would argue that throwing around numbers like 500 to 1000 years is pretty meaningless for reuse if the software landscape has completely changed within 50. If an apocalypse were to occur today, would we want to rebuild our world using code from the 80s?

Even if we weren’t to directly reuse the archived code to rebuild our world, there are still plenty of reasons it might be handy when doing so, such as referring to the logic implemented within it, or the architecture, data structures and so on. But these are just my thoughts, and I want to hear yours.

Was This a Useful Thing to Do?

The thought that there is a vault in the Arctic containing code you wrote is undeniably fun. What’s more, your code will now almost certainly outlive you! But do you, dear Hackaday reader, think this project is a fun exercise in sci-fi, or does it hold real value for humanity?

95 thoughts on “Ask Hackaday: Why Did GitHub Ship All Our Software Off To The Arctic?”

    1. I fail to see the suck in software these days. Modern FOSS is easy to use, does a lot, and it works reliably.

      The only real suck comes from the cloud-based stuff and network dependence, and the modern trend of going back to manual setup, configuration, and use, as opposed to automating things like Python’s auto build and cache process for bytecode, or mDNS lookups.

      Computer science doesn’t quite address the suck, because all the core algorithms we use today are pretty much fine, when they’re actually designed for practical use as a package rather than modular tech demos.

        1. It’s mostly only nightmarish because it’s too big to find where to start without some serious digging. They could do better on making it easier to find the place to make the change you want to make.

          But in general, people are able to maintain this stuff. It’s not bad enough to get in the way of development. Usually the most sucktastic part is removed features and breaking API changes with very little notice, but even that’s not common.

          Under the hood doesn’t need to be perfect, we can tell it’s getting better by the great UI, and how we hear about browser exploits less and less every year.

            1. “Bloat” is rarely a problem. Most apps aren’t even slow because of bloat, they’re slow because nobody optimized them, or they’re making cloud API calls, or someone thought it would be fun to write something by hand that really should have used a GPU-accelerated library, or because they’re running in their own sandboxed container loading a whole OS, or some other resource-intensive security thing.

            It takes a *lot* of bloat to make a performance issue, and normally, bugs outside the core of the app just show an error message and don’t let you use that particular feature, if it’s well designed.

            The only time it usually matters is extreme security critical stuff. For an average user, if it’s an order of magnitude more secure than the odds of getting their entire phone physically stolen, they’re probably going to say it’s good enough.

    2. @Nicci said: “Bad idea, because all software sucks. They should have archived computer science books instead.”

      It’s the humans that learned from those computer science books that wrote all that sucky software in the first place!

    1. I hadn’t, so I looked it up!

      Storge (/ˈstɔːrɡi/,[1] from the Ancient Greek word στοργή storgē[2]) or familial love refers to natural or instinctual affection,[1][3] such as the love of a parent towards offspring and vice versa.
      -Wikipedia

  1. My very first thought was “So in 1000 years, how are you going to convert the code into an executable? And what hardware will run the binary?”

    But it is funny how little overlap there is between the most advanced technology of about 1000 years ago and today.

    You cannot even ask these kinds of questions about the most advanced technology on the planet about 1000 years ago. I guess that you would be close to the time of the Battle of Hastings (1066), so (just before gunpowder cannons) you would be talking about archiving examples of boiled leather armours, chainmail, javelins, long spears, swords, maces, axes, simple bows and crossbows.

    1. “But it is funny, there is so little overlap between the most advanced technology about 1000 years ago and today.”

      Uh… yes there is? The examples you gave are all weapons, but believe it or not, people do more than fight each other. Hard to believe, I know, but true.

      Roman concrete, for instance, is still being studied by scientists today, and we know how to make it because Charlemagne preserved a ton of classical manuscripts during the Carolingian Renaissance. Even with weapons, however, wootz steel, which apparently developed carbon nanotubes in its matrix, is still made by a complicated process that isn’t fully understood.

      I don’t really doubt that in 1000 years there will be similar technological archaeologists investigating how mid-20th century technological accomplishments happened within the limitations of the time.

      1. > The examples you gave are all weapons
        I went for the easiest thing that I could think of from that time period. Roman concrete is from about 2000 years ago. An unfortunate reality, but the most advanced technology in any time period is typically how to efficiently spy on or kill each other.

        You could argue that mathematics, music, paintings or poetry of the period were more technologically advanced than weapons – who is to say. The only invention I could find from that time period, that stood out to me in some way, was the invention of the pound lock for transporting large quantities of cargo in China.

        1. “I went for the easiest thing that I could think of from that time period. Roman concrete is from about 2000 years ago. ”

          Roman concrete was *developed* about 2000 years ago, just like a lot of the code GitHub preserved is older than “right now” too.

          Right around the time period you were thinking of was the Carolingian renaissance, like I said – around the late 8th to 9th century. So a little earlier, but not much. That’s when Charlemagne’s empire began re-transcribing many old Latin texts so they could be preserved for the future. And this is how we *know* about Roman concrete’s construction methods.

          An organization deciding to transcribe and preserve many of the great works of scholarship and literature at the time. Sound familiar?

      2. Weapons are often a snapshot of the technology of civilization (ironic word).
        We look at stone tools, bows and arrows, and metallurgy to figure out what they knew in physics, chemistry, etc., and their skill levels in manufacturing.

    2. > you would be talking about archiving examples of boiled leather armours, chainmails, javelins, long spears, swords, maces, axes, simple bows and crossbows.

      I think a LOT of historians would find such an archive to be very valuable.

    3. My thoughts would run to (soft) archiving of the technology/history/mathematics/etc. of the times. Not the physical, as that can always be re-created. Lots of history/knowledge was lost, for example, in the destruction of the Library of Alexandria. If all that knowledge of the time were in ‘cold’ dry storage somewhere….

      Not sure how code will benefit the future other than to make for an interesting history lesson 1000 years from now. Sort of how we appreciate the old mechanical calculators, for example…

      1. “Lots of history/knowledge was lost for example in the destruction of the library of Alexandria.”

        Also, lots of knowledge was only preserved because some people thought it was important and copied it, and it was later found. Which… is exactly a parallel to this.

        1. It was thanks to the monks during the “Dark Ages”, who wrote fresh copies from rotting papyrus/parchment, that we have non-Christian works from Seneca, Plato, Socrates, Cicero…

  2. > the Tech Tree does not appear to be [open source]

    According to some of the other comments in the issues, it’s not that it’s proprietary, it’s just that it hasn’t been finished yet. It’s not like it’s going to be needed in the short term anyway, so they can take the time to do it right.

    1. Pity. The tech tree is by far the most important/interesting part of this archive. That and a summary of a few (hundred) fundamentals of computing (xor, rot13, compression mechanisms, PKI, etc) is what a future civilization would really want/need – the rest is of historical value but far less “usefulness”.

  3. Look at the lengths we go to, to dig things up just to learn about previous civilizations. Not so we can duplicate it, but just so we know more about our history and how things were done. Bones, tools, household belongings, clues to their beliefs, calendars… This isn’t about rebuilding. This is to preserve that facet of our existence. It is perfectly feasible for our civilization to be “reset” by some combination of catastrophic events (pandemic, anyone?). Our various infrastructures are complex and, in some ways, fragile. What was the state of our technology 1000 years ago? Or even 100?

    Unfortunately, digital data is not a physical object and is easily lost. It requires some amount of technology to recover. Film appears to be the oldest technology that we can feasibly store that much data on for “longish” periods of time.

    1. …which is already done on a regular basis by multiple groups.

      The existence of other good ideas doesn’t diminish the value of this. Why are many HaD commenters so ignorant and close-minded?

    2. At least a few years ago you could download snapshots of Wikipedia. No idea about the size, but the text only should not be too big once compressed. If you have some disk space left, why not?

      1. You still can, and it’s not very big by today’s standards. I think the raw archive including talk and user pages and old revisions is a few terabytes, but indexed and compressed versions for offline reading come in at 40GB for the full text of English Wikipedia or 90GB with large thumbnails of all the images.

        Check out Kiwix, it’s an offline browser for Wikipedia and anything else someone wants to package up as a ZIM file; available archives include the above mentioned Wikipedia (and most Wikimedia sites), all of the various Stack Exchange sites, many TED talks, Crashcourse, Project Gutenberg, and many others.
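
        If you’d rather script the download than use the desktop app, a minimal Python sketch with requests looks roughly like this – the ZIM file name in the URL is a placeholder, so grab the real link from the Kiwix library listing first:

          # Stream a ZIM snapshot to disk and report its size.
          # The file name below is a placeholder; get the real one from
          # the Kiwix library listing first.
          import requests

          url = "https://download.kiwix.org/zim/wikipedia/EXAMPLE.zim"
          size = 0
          with requests.get(url, stream=True) as r:
              r.raise_for_status()
              with open("wikipedia.zim", "wb") as f:
                  for chunk in r.iter_content(chunk_size=1 << 20):  # 1 MB chunks
                      f.write(chunk)
                      size += len(chunk)
          print(f"Saved {size / 1e9:.1f} GB")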

  4. I’ve read about this from several sources; what I haven’t been able to figure out is what it cost.
    At some (lowish) cost, it’s a valuable exercise as PR stunt or just conversation starter to get people like us thinking about “what happens when.”

    As far as functioning as a “useful” backup – no.
    Something bad enough has to happen to wipe out all the other live copies of this data. Then we have to recover from that event to the point that we have the _ability_ to recover that archive, read, and translate that data. As well as having both hardware and software infrastructure in place to make that software worth recovering. All within the lifespan of the media.

    Seems more likely we’ll be using the film strips as lashings to hold logs together to build a raft to get off the island.

  5. Oh great. Now my junk files have a place in the Ark in case more things get out of whack in the next little while. 2020 isn’t over yet…

    Even in the movie world, 2020 and 2021 are the end of the world:
    – In Jade’s World, Skynet became self-aware in 2021. It launched an attack on humanity on 18 June, resulting in Judgment Day.
    – In the Dark Fate timeline, after the termination of Skynet and the destruction of a Cyberdyne building, a new timeline was created, in which Judgment Day happened in the 2020s.

    1. Only 21TB?! In a few years’ time, that should fit on a pen drive.

      Come to think of it, I should be able to archive the whole of github on the NAS in my office.

      Time to rattle up a quick bash script.

      Actually, on second thought rural broadband in the UK is so bad it will take the best part of 1000 years to download (and the best part of 1000 years for the current government to get round to improving it).

      I might be quicker rowing a boat to Svalbard and taking pictures of their film on my phone.

          1. That’s an aesthetic choice, you can make the pigeon a little ammo belt thing, or like a photographers jacket with pockets, just put them all in a natty little courier satchel, or make a necklace out of them if you must, don’t drill them through the middle.

      1. Flash-based storage isn’t good for archives, as the memory cells can lose their electric charge due to high temperatures, high-energy radiation, or material defects from write/erase cycles.

        The 21TB does not include anything of sellable commercial value (i.e. IP). There is no p0rn, music, movies, games, etc.

  6. Out of interest I took a look at softwareheritage.org, and typed in “Hello World” as a search string, expecting to find the first C program in K&R. All I got was a big pile of spam.

    That just looks like a big archive of crap. Github is probably not far off!

    1. There are projects that have no indication of their status, i.e. whether all the code works or it is just a dump of non-working code in the hope that someone would pick up and finish the project. Quite often I thought I had found something useful until I looked at the fewer than 10 lines of code and/or the lack of status and documentation.

      I make a habit of not committing to my GitHub until the code/hardware is at least in a workable state.

    1. Because they watched the Mad Max series and figure in an apocalypse the Aussies would just use the cans of film as handy “BBQ in a box” fires, if they sent them there.

  7. With all the code, I hope they are including some means of explaining the hardware it runs on…
    Binary would be meaningless, unless it is mapped to some Op Code, and the OpCode is mapped to the construction of a processor (NAND gates and such) and I/O.

    1. I forgot another intermediary step.
      Converting the 2D barcode to binary, and decompression algorithms…

      Otherwise, it will all be a Voynich Manuscript to future people.

    2. Github deals primarily with source code, not binaries.

      As for opcodes, I was very impressed with the folks that reverse engineered the instruction set of programmable calculators by statistical analysis of the raw binary ROM dump and a lot of trial and error. The full official instruction set was later published and it was almost dead on. It is similar to the way crypto is cracked, e.g. jumps are most frequent, etc. So in the right hands of a determined mind with lots of time, even a binary dump carries some information.
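
      Just to illustrate the very first step of that kind of analysis – and this is only a sketch, the ROM file name is made up and real reverse engineering goes far beyond counting bytes – a frequency tally over a raw dump takes only a few lines of Python:

        # Count how often each byte value appears in a raw ROM dump.
        # Frequent values are candidate opcodes (jumps, loads, returns);
        # mapping them to a real instruction set takes far more context.
        from collections import Counter

        with open("calculator_rom.bin", "rb") as f:    # made-up file name
            counts = Counter(f.read())

        for byte, n in counts.most_common(10):
            print(f"0x{byte:02X}: {n}")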

  8. Boy! That software will be out of date! … next year. ;-)

    It’s a PR stunt, plain and simple. M$ does nothing benevolent without a vampiric monetary upside for themselves. Heck, nobody will probably even be allowed to prove whether or not the data was ever stored in the vault. Unless, of course, it shows up in a court battle of M$ vs. you.

    Personally I feel the whole thing to be completely irrelevant. But who knows what someone will do 1,000 years from now for amusement or curiosity.

    1. “But who knows what someone will do 1,000 years from now for amusement or curiosity.”

      After they’ve returned from the day’s hunting and gathering, and the fire in their cave still has a bit more light/heat before it goes out for the night.

  9. A bit of a tangent…

    I’d actually like to find a way to archive blueprints on tiny glass slides with a laser. I’d like to build some things, then hide, in a little compartment on the object itself, glass-etched micro copies of all the blueprints needed to make it – so someone with a microscope could find them, pull them out, and see the dimensions needed to rebuild it in a few hundred years.

    1. Isn’t glass a liquid? All run together ‘over time’ ;) . Might be better to store on a hard substance…. Kind of funny actually … going back to writing on tablets of stone … so to speak!

      1. Glass as a slow liquid was proposed when old (e.g. 100-year-old) glass bottles were measured and found to have thicker bottoms.
        It was later realized that the bottle-making technique of that era resulted in bottles with thick bottoms!

          1. Oh, lets not bother with facts because people *say* stuff!

            “Glass is a liquid because the windows in old churches is thicker at the bottom!”

            No, it’s thicker at the bottom because old glassmakers couldn’t MAKE perfectly flat glass in the first place and so they installed the windows thick side down! It never changed at all!

  10. I’ve been saying it a lot lately, but we really need an archival-grade way of writing software. A compiled or fast interpreted language that can be done in under 100k lines of C, with modern OOP and everything, and no out there experimental stuff that limits anyone’s interest, that we can all agree to just not change anything about, aside from adding platform ports and such.

    Programs should be distributed as source code, and build and run should be automatic, no thinking about makefiles or anything.

    Python is almost perfect, but it’s not designed to be a fixed archival format, nobody will be maintaining 3.8 in 30 years.

    1. Doesn’t need to run per se, as it can always be ‘re-written’ in the language of that future day if necessary. More important is the algorithm or ‘function’ of the code, so the wheel won’t have to be re-invented, so to speak.

      1. Rewriting code is subject to economic forces, and aside from algorithms that deal with “soft” stuff like computer vision and media processing, a lot of software is already pretty obvious how one would rewrite it using the current available black box algorithms people take for granted, like hash tables and compression.

        Data formats and protocols are probably the hard part though. Preserving those is important, but companies don’t seem to want to reveal those…

    2. How is that supposed to work for embedded software? The target cannot compile the source. Are we supposed to put all of the required build tools in the tarball? Do we really have to put whole copies of lex and yacc and swig and latex in our tarball? How idiotic is that?

      1. A lot of embedded software is usually pretty simple and already is archival friendly to some degree. I suspect the very first Arduino programs would be very easy to compile today, and even things like MPLAB haven’t changed much. GUIless C/C++ with no operating system versions to worry about, no shared libraries to share with newer programs, etc, aren’t really that big of an issue.

        But making an “archival” build environment that did in fact have lex/yacc/swig/whatever (as opposed to bundling build tools with every single embedded firmware source tarball) seems like a pretty good idea.

        Even with no disaster, I can imagine someone using software like this for the config GUI of some $2000 piece of gear, to make sure it never becomes unusable with new computers.

        1. As a professional embedded systems engineer, I laugh at your entire assertion.

          Also, let me know when you’ve made the environment that is all things to all people. I wanna see it.

        2. You are aware that all of these build tools have forked and gone their own separate ways so there is no way to have a standard lex or yacc because there are too many different forks and versions?

          1. There’s a way to achieve *a* standard version. As in, the company making crappy 8051s meant to survive the apocalypse can say “Here’s exactly the build tools to work with this, it will always run on this archival VM, it includes the proper lex and yacc, so you’ll always be able to compile old programs for this chip on any host machine”

            Not that practical for daily use, but this is about having a backup plan if things really fall apart.

  11. I haven’t seen a clear presentation of exactly what information they archived. Was it only the files stored in the repositories or was it the related user profiles as well?
    Did they anonymize all the data first by removing names and email addresses?
    What if I had personal information in my repository and I want to be forgotten? Will they send their archiving guy over to Svalbard to scratch my details off the film?

    1. “What if I had personal information in my repository and I want to be forgotten?”

      How does it feel to want? You dumped your stinky turds on GitHub and now you want to kick some cat litter over them. Too bad, you own that crap and it’s right there for your future job interviewers to Google, tough on you

    2. It has been said *at length* never to put sensitive information into a public git repository.

      Why do you demand other people take responsibility for actions you took? It’s not even ambiguous – if you don’t want the information to be public, DON’T MAKE IT PUBLIC.

  12. Archiving it in such a format has the added benefit of reducing the amount of material that can be modified to suit the political narrative at the time. Already here in Australia, changes are being made to rewrite our history to make it look better in the eyes of the disruptive minority.

  13. I guess in 1000 years scientists will announce:
    “Before computers served some useful purpose they were mostly greeting the world or blinking a light. We don’t know why people needed more and more complex machines to do that, but analysis of The Code from The Arc leaves no doubt about that.”

  14. I’m not sure if the point is to provide something that will be used to rebuild a society, or even used at all. I think it would be kinda like when we find artifacts from long-vanished civilizations – it’s a glimpse into a time that is forever gone, a way to see what our ancestors found meaningful and spent their time doing.
    Someday in the fairly distant future WE will be someone’s long-forgotten ancestors, and I think it would be a pretty cool thing to find! I don’t think the particulars of the software make a difference; I would imagine it would be studied more on a cultural level, so the need to re-compile everything into working programs would be unnecessary.

  15. If society collapses, how long will it take for the survivors to reinvent enough technology to go to the Arctic again? And how long will it take them to, by utter chance, find that vault?

    I think a better thing to do would be to store it in some kind of monument which will still be there 1000 years from now. Things that would serve as such a monument are huge pyramids, huge temple complexes – simply any huge man-made structure that can be seen from afar by people who have not even reinvented the lens yet.

  16. My suggestion: keep all the memes instead of real code. That will explain a lot to future generations.
    Nobody will be interested in rebuilding the same exact code (with all the tools needed to interpret it). This archive will be the same as old vintage Sumerian accounting claims: interesting as a historical view, not as real tools.

  17. Ah, if only there had been such an archive after the Black Death. European leaders would have been able to re-establish key theories like the earth-centered universe, demonic possession as the cause of all illness, and the divine right of Kings.
