Separating Ideas From Words

November 6, 2021

We covered Malamud’s General Index this week, and Mike and I were talking about it on the podcast as well. It’s the boldest attempt we’ve seen so far to open up scientific knowledge for everyone, and not just the wealthiest companies and institutions. The trick is how to do that without running afoul of copyright law, because the results of research are locked inside their literary manifestations — the journal articles.

The Index itself is composed of one-to-five-word snippets from 107,233,728 scientific articles. So if you’re looking for everything the world knows about “tincture of iodine”, you can find all the papers that mention it, along with important keywords from the corpus and metadata like the ISBN of the article. It’s like a searchable card catalog of, well, everything. And it’s freely downloadable if you’ve got a couple terabytes of storage to spare. That alone is incredible.
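To make the "one-to-five-word snippets" idea concrete, here is a toy sketch in Python of how such an index could be built and queried. It is only an illustration: the function names, the tokenizer, and the two sample documents are hypothetical, and nothing here is taken from the General Index's actual pipeline or schema.

```python
import re

def ngrams(text, max_n=5):
    """Yield every run of 1..max_n consecutive words from text (hypothetical helper)."""
    # Assumption: lowercase alphanumeric tokens are a good enough notion of "word".
    words = re.findall(r"[a-z0-9]+", text.lower())
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])

def build_index(docs, max_n=5):
    """Map each n-gram to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for gram in ngrams(text, max_n):
            index.setdefault(gram, set()).add(doc_id)
    return index

# Made-up sample documents, not real papers.
docs = {
    "paper-a": "Tincture of iodine was applied to the wound.",
    "paper-b": "Iodine deficiency remains common.",
}
index = build_index(docs)

print(sorted(index["tincture of iodine"]))  # ['paper-a']
print(sorted(index["iodine"]))              # ['paper-a', 'paper-b']
```

The index stores only short word runs and which document they came from, so a lookup tells you where a phrase appears without reproducing the article itself.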

What I think is most remarkable is that it figures out how to separate scientific ideas from their prison, the copyrighted words in which they’re written. Indeed, if you look into US copyright law, it’s very explicit about not wanting to harm the free sharing of ideas.

“In no case does copyright protection for an original work of authorship extend to any idea, procedure, process, system, method of operation, concept, principle, or discovery, regardless of the form in which it is described, explained, illustrated, or embodied in such work.”

But this has always been paradoxical. How do you restrict dissemination of the papers without restricting dissemination of the embodied ideas or results? In the olden days, you could tell others about the results, but that just doesn’t scale. Until today, only the richest companies and institutions had access to this bird’s eye view of scientific research — similar datasets gleaned from Google’s book-scanning program have trained their AIs and seeded their search machines, but they only give you a useless and limited peek.

Of course, if you want to read the entirety of particular papers under copyright, you still have to pay for them. And that’s partly the point: the General Index is not meant to destroy copyrights, but to give you access to the underlying knowledge despite the real-world constraints on implementing copyright law. We think that stands to be revolutionary.

11 thoughts on “Separating Ideas From Words”

eckythump says:

November 6, 2021 at 10:57 am

Scientific literature is slowly getting to the point where all papers will be provided completely free (both to read and to publish) in one way or another. Either as preprints (Arxiv or similar), or draft versions on university servers. It seems crazy that we are still not at that point. When the parasitism eventually comes to an end, the greedy publishers will need to find another way to make easy money off of other people’s hard work. They are holding on for the time being with things like ‘paid open access’, but they’ve found it very hard to say no to publishing junk research when the authors or their universities are willing to pay $1k or more for each paper. Easy money. They sometimes try to justify it by claiming that this covers the ‘license’ (usually a direct copy of CC or something). It’s a joke.

1. Alphatek says:
  
  November 7, 2021 at 1:24 am
  
  But, as with all information becoming ‘free’, there is a problem of rationalisation. Not all papers are equal. Some are garbage, some are world-changing. By making them all equally accessible, the world paradoxically gets dumber.
  
  1. eckythump says:
    
    November 7, 2021 at 3:28 am
    
    I agree, there needs to be rationalisation and ranking. For the actual quality control, at the moment, this is done by peer-reviewers (who don’t get a single penny) and sub-editors (who also usually don’t get a single penny). Most of the actual admin is also done by automated online systems these days. This can all continue if the papers are free. As many journals are published by professional bodies, any genuine running costs should come out of the membership fees (this is another racket as all ‘professionals’ are expected to be members or accredited members). Ranking of ‘importance’ of papers is often linked indirectly to the ranking of the journals themselves (with ‘Nature’ and ‘Science’ probably at the top, and usually publishing the most important research). Strangely, subscription/access fees to journals are not very reflective of their ‘ranking’ anyway. The ‘world-changing’ and ‘garbage’ papers are equally cheap/expensive. There’s nothing in the money that publishers get that contributes to the quality of the science.
    
  2. volt-k says:
    
    November 8, 2021 at 4:25 am
    
    > By making them all equally accessible, the world paradoxically gets dumber.
    
    Right now the garbage information is abundant and freely available (e.g. clickbait articles on all the popular news sites) and the good quality world-changing stuff is behind paywalls. I believe making them all equally accessible would only make the situation better.
    
    1. MM says:
      
      November 9, 2021 at 8:33 pm
      
      ^ The amount of people I’ve met that sincerely believed the moon would have a green hue on 4/20 because of [insert false “scientific” article here] astounds me. Equal-access being an improvement also kinda assumes people want to read the good quality world-changing stuff despite a long history of picking up the latest edition of Esquire: “My alien-lizard neighbor was impregnated by a lobster furry, am I the father?” but hey, it can’t get any worse, right?
      
      ……Right?
      
petercat says:

November 6, 2021 at 11:11 am

It seems quite reasonable that the copyright law would clearly define the limits of its scope by not extending its protections to “any idea, procedure, process, system, method of operation, concept, principle, or discovery” as that is the role of patent law.

1. M says:
  
  November 6, 2021 at 12:49 pm
  
  Now if only patents weren’t horribly broken too.
  
  Patent examiners these days just simply cannot be expert enough to determine if an idea is novel or if some prior art exists somewhere. But that only partially explains the march of bogus patents being issued these days, often of the form “doing X but with Y” where Y={computers, the internet, machine learning, blockchain, …}. The EFF regularly does a spotlight on these, like “patent on watching movies….using computers over the internet.” We all know about MakerBot’s patent on heated boxes.
  
  Then there’s trolls who simply buy up old patents or file bogus ones to harass people.
  
Jose says:

November 6, 2021 at 12:01 pm

I’ve known about this site for a long time: http://paperscape.org/
It is a “map” of arXiv papers where you can navigate between different science clusters and find the papers with more references… Also, it’s interesting that the author of this is Damien George, the author of MicroPython.

1. Elliot Williams says:
  
  November 8, 2021 at 3:10 am
  
  Super cool link! Thanks.
  
Nick says:

November 8, 2021 at 3:10 am

If real research becomes more accessible, perhaps misinformation will become less attractive.

AbsoluteRecoil says:

November 8, 2021 at 5:48 am

Really useless waste of time. It already exists and it’s called the Internet. Libgen is pretty much all I need. Pirate every single scientific paper.
