We covered Malamud’s General Index this week, and Mike and I were talking about it on the podcast as well. It’s the boldest attempt we’ve seen so far to open up scientific knowledge for everyone, and not just the wealthiest companies and institutions. The trick is how to do that without running afoul of copyright law, because the results of research are locked inside their literary manifestations — the journal articles.

The Index itself is composed of one-to-five-word snippets from 107,233,728 scientific articles. So if you’re looking for everything the world knows about “tincture of iodine”, you can find all the papers that mention it, along with important keywords from the corpus and metadata like the DOI of each article. It’s like the searchable card catalog of, well, everything. And it’s freely downloadable if you’ve got a couple terabytes of storage to spare. That alone is incredible.
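To get a feel for what an index of one-to-five-word snippets looks like, here is a toy sketch in Python. It is not the General Index's actual pipeline or file format: the tokenizer, the `ngrams` and `build_index` helpers, and the three made-up "papers" are all assumptions for illustration only.

```python
from collections import defaultdict
import re


def ngrams(text, max_n=5):
    """Yield every run of 1..max_n consecutive words in text, lowercased."""
    # Assumption: a crude tokenizer that keeps only letters and digits.
    words = re.findall(r"[a-z0-9]+", text.lower())
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])


def build_index(articles, max_n=5):
    """Map each n-gram to the set of article IDs that contain it."""
    index = defaultdict(set)
    for article_id, text in articles.items():
        for gram in ngrams(text, max_n):
            index[gram].add(article_id)
    return index


# Hypothetical article IDs and text, invented for this example.
articles = {
    "paper-a": "Tincture of iodine was applied to the wound.",
    "paper-b": "We compare tincture of iodine with ethanol.",
    "paper-c": "Ethanol alone was used as a control.",
}

index = build_index(articles)
print(sorted(index["tincture of iodine"]))  # ['paper-a', 'paper-b']
```

The point of the sketch is that the lookup table holds only short word runs and pointers back to the papers, never the full text of any article.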

What I think is most remarkable is that this makes good on separating scientific ideas from their prison: the words in which they’re written, which are subject to copyright. Indeed, US copyright law (17 U.S.C. § 102(b)) is very explicit about not wanting to harm the free sharing of ideas.

“In no case does copyright protection for an original work of authorship extend to any idea, procedure, process, system, method of operation, concept, principle, or discovery, regardless of the form in which it is described, explained, illustrated, or embodied in such work.”

But this has always been paradoxical. How do you restrict dissemination of the papers without restricting dissemination of the embodied ideas or results? In the olden days, you could tell others about the results, but that just doesn’t scale. Until today, only the richest companies and institutions had access to this bird’s-eye view of scientific research. Similar datasets gleaned from Google’s book-scanning program have trained its AIs and seeded its search engine, but they only give you a limited, nearly useless peek.

Of course, if you want to read the entirety of particular papers under copyright, you still have to pay for them. And that’s partly the point: the General Index is not meant to destroy copyrights, but to give you access to the underlying knowledge despite the real-world constraints on implementing copyright law. We think that stands to be revolutionary.

  1. Scientific literature is slowly getting to the point where all papers will be provided completely free (both to read and to publish) in one way or another, either as preprints (arXiv or similar) or as draft versions on university servers. It seems crazy that we are still not at that point. When the parasitism eventually comes to an end, the greedy publishers will need to find another way to make easy money off of other people’s hard work. They are holding on for the time being with things like ‘paid open access’, but they’ve found it very hard to say no to publishing junk research when the authors or their universities are willing to pay $1k or more for each paper. Easy money. They sometimes try to justify it by claiming that this covers the ‘license’ (usually a direct copy of CC or something). It’s a joke.

    1. But, as with all information becoming ‘free’, there’s a problem of rationalisation. Not all papers are equal. Some are garbage, some are world-changing. By making them all equally accessible, the world paradoxically gets dumber.

      1. I agree, there needs to be rationalisation and ranking. At the moment, the actual quality control is done by peer-reviewers (who don’t get a single penny) and sub-editors (who also usually don’t get a single penny). Most of the actual admin is also done by automated online systems these days. This can all continue if the papers are free. As many journals are published by professional bodies, any genuine running costs should come out of the membership fees (this is another racket, as all ‘professionals’ are expected to be members or accredited members). Ranking of ‘importance’ of papers is often linked indirectly to the ranking of the journals themselves (with ‘Nature’ and ‘Science’ probably at the top, and usually publishing the most important research). Strangely, subscription/access fees to journals are not very reflective of their ‘ranking’ anyway. The ‘world-changing’ and ‘garbage’ papers are equally cheap/expensive. There’s nothing in the money that publishers get that contributes to the quality of the science.

      2. > By making them all equally accessible, the world paradoxically gets dumber.

        Right now the garbage information is abundant and freely available (e.g. clickbait articles on all the popular news sites) and the good quality world-changing stuff is behind paywalls. I believe making them all equally accessible would only make the situation better.

        1. ^ The number of people I’ve met who sincerely believed the moon would have a green hue on 4/20 because of [insert false “scientific” article here] astounds me. Equal access being an improvement also kinda assumes people want to read the good quality world-changing stuff despite a long history of picking up the latest edition of Esquire: “My alien-lizard neighbor was impregnated by a lobster furry, am I the father?” but hey, it can’t get any worse, right?


  2. It seems quite reasonable that the copyright law would clearly define the limits of its scope by not extending its protections to “any idea, procedure, process, system, method of operation, concept, principle, or discovery” as that is the role of patent law.

    1. Now if only patents weren’t horribly broken too.

      Patent examiners these days simply cannot be expert enough to determine if an idea is novel or if some prior art exists somewhere. But that only partially explains the march of bogus patents being issued these days, often of the form “doing X but with Y” where Y={computers, the internet, machine learning, blockchain, …}. The EFF regularly does a spotlight on these, like “patent on watching movies….using computers over the internet.” We all know about MakerBot’s patent on heated boxes.

      Then there are trolls who simply buy up old patents or file bogus ones to harass people.

