Separating Ideas From Words

We covered Malamud’s General Index this week, and Mike and I were talking about it on the podcast as well. It’s the boldest attempt we’ve seen so far to open up scientific knowledge for everyone, and not just the wealthiest companies and institutions. The trick is how to do that without running afoul of copyright law, because the results of research are locked inside their literary manifestations — the journal articles.

The Index itself is composed of one-to-five-word snippets of 107,233,728 scientific articles. So if you’re looking for everything the world knows about “tincture of iodine”, you can find all the papers that mention it, and then important keywords from the corpus and metadata like the ISBN of the article. It’s like the searchable card catalog of, well, everything. And it’s freely downloadable if you’ve got a couple terabytes of storage to spare. That alone is incredible.

What I think is most remarkable is this makes good on figuring out how to separate scientific ideas from their prison — the words in which they’re written — which are subject to copyright. Indeed, if you look into US copyright law, it’s very explicit about not wanting to harm the free sharing of ideas.

“In no case does copyright protection for an original work of authorship extend to any idea, procedure, process, system, method of operation, concept, principle, or discovery, regardless of the form in which it is described, explained, illustrated, or embodied in such work.”

But this has always been paradoxical. How do you restrict dissemination of the papers without restricting dissemination of the embodied ideas or results? In the olden days, you could tell others about the results, but that just doesn’t scale. Until today, only the richest companies and institutions had access to this bird’s eye view of scientific research — similar datasets gleaned from Google’s book-scanning program have trained their AIs and seeded their search machines, but they only give you a useless and limited peek.

Of course, if you want to read the entirety of particular papers under copyright, you still have to pay for them. And that’s partly the point, because the General Index is not meant to destroy copyrights, but give you access to the underlying knowledge despite the real world constraints on implementing copyright law, and we think that stands to be revolutionary.