Malamud’s General Index: Research Gist, No Slap On The Wrist

Tired of that unsettling feeling you get from looking for paywalled papers on that one site that shall not be named? Yeah, us too. But now there’s an alternative that should feel a little less illegal: this new index of the world’s research papers over on the Internet Archive.

It’s an index of words and short phrases (up to five words) culled from approximately 107 million research papers. The point is to make it easier for scientists to gain insights from papers that they might not otherwise have access to. The Index should also make computerized analysis of the world’s research easier. Call it a gist machine.

Technologist Carl Malamud created this index, which doesn’t contain the full text of any paper. Some of the researchers with early access to the Index said that it is quite helpful for text mining. The only real barrier to entry is that there is no web search portal for it: you have to download 5TB of compressed files and roll your own program. In addition to sentence fragments, the files contain 20 billion keywords and tables with the papers’ titles, authors, and DOI numbers, which will help users locate the full paper if necessary.
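To get a feel for what an index of "words and short phrases (up to five words)" means in practice, here is a minimal sketch of n-gram counting in Python. It is not Malamud's pipeline and says nothing about the actual file layout of the General Index; the function name `ngrams`, the lowercase alphanumeric tokenizer, and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter

def ngrams(text, max_n=5):
    """Count every run of 1..max_n consecutive words in text.

    Assumption: words are lowercase runs of letters and digits. A real
    index would need a far more careful tokenizer.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of n words across the text.
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts

sample = "the index of words and short phrases"
counts = ngrams(sample)
for phrase in ("the index", "the index of words and"):
    print(phrase, counts[phrase])
```

Note that the six-word run "the index of words and short" never appears in `counts`, which is why an index like this can tell you which phrases a paper contains without letting you reassemble the full text.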

Nature’s write-up makes a salient point: how could Malamud have made this index without access to all of those papers, paywalled and otherwise? Malamud admits that he had to get copies of all 107 million articles in order to build the thing, and that they are safe inside an undisclosed location somewhere in the US. And he released the files under Public Resource, a non-profit he founded in Sebastopol, CA. But we have to wonder how different this really is from, say, the Google Books N-Gram Viewer, or Google Scholar. Is the difference that Google is big enough to say they’re big enough to get away with it?

If this whole thing reminds you of another defender of free information, remember that you can (and should) remove the DRM from his e-book of collected writings.

Via r/technology

Removing DRM From Aaron Swartz’s EBook

After his death, Aaron Swartz became one of the Internet’s most famous defenders of the free exchange of information, one of the most polarizing figures on the topic of intellectual property, and the most famous person who still held on to the ideals the Internet was founded on. Aaron was against DRM, fought for the users, and encouraged open access to information.

Early this year, Verso Books published the collected writings of Aaron Swartz. This eBook, according to Verso, contains ‘social DRM’, a watermarking technology that Verso estimates will, “contribute £200,000 to the publisher’s revenue in its first year.” This watermarking technology embeds uniquely identifiable personal information into individual copies of eBooks.

With a heavy sigh, you realize you do not live in the best of all possible worlds.

The Institute for Biblio-Immunology had a similar reaction to Verso Books’ watermarking technology applied to the collected writings of Aaron Swartz. In a communique released late last weekend, they cracked this watermarking scheme and released the code to remove this ‘social DRM’ from ePub files.

The watermarking technology in Aaron Swartz’s eBook comes courtesy of BooXtream, a security solution that makes every eBook sold unique through advanced watermarking and personalization features. “A publication that has been BooXtreamed can be traced back to the shop and even the individual customer,” the BooXtream website claims. That stands in complete opposition to all of Aaron Swartz’s beliefs.

After analyzing several digital copies of Aaron Swartz’s eBook, the Institute for Biblio-Immunology is confident they have a tool that removes BooXtream’s watermarks in EPUB eBooks. Several watermarks were found, including the very visible (Ex Libris images, disclaimer page watermarks, and footer watermarks) and the very hidden (image metadata, filename watermarks, and timestamp fingerprints).
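The hidden watermarks are easier to picture once you remember that an EPUB is a ZIP archive, and that every entry in a ZIP carries its own filename and modification time. The following Python sketch only lists that per-entry metadata using the standard `zipfile` module. It does not reproduce the Institute's tool and makes no claim about how BooXtream actually encodes anything; the helper names and the tiny demo archive built in memory are hypothetical.

```python
import io
import zipfile

def entry_timestamps(epub_bytes):
    """Return (filename, modification time) for every entry in an EPUB."""
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as zf:
        return [(info.filename, info.date_time) for info in zf.infolist()]

def make_demo_epub():
    """Build a tiny stand-in archive in memory (hypothetical contents)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(zipfile.ZipInfo("mimetype", (2016, 1, 1, 12, 0, 0)),
                    "application/epub+zip")
        zf.writestr(zipfile.ZipInfo("OEBPS/chapter1.xhtml", (2016, 1, 1, 12, 0, 0)),
                    "<html></html>")
        # One entry with a different timestamp, to show how it stands out.
        zf.writestr(zipfile.ZipInfo("OEBPS/cover.jpg", (2016, 3, 9, 8, 30, 2)),
                    b"")
    return buf.getvalue()

entries = entry_timestamps(make_demo_epub())
for name, stamp in entries:
    print(name, stamp)
```

Comparing this listing across two separately purchased copies of the same book is one plausible way such per-copy differences would show up, though that is an assumption about method rather than something the communique is quoted as saying here.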

While the Institute believes this tool can be used to de-BooXtream all currently available ‘social DRM’ed’ eBooks, they do expect the watermarking techniques will be quickly modified. This communique from the Institute for Biblio-Immunology merely provides the background of what BooXtream does, not the prescription for the disease of ‘social DRM’. These techniques can be applied to further social DRM’ed eBooks, which, we think, is what Aaron would have done.

This Has Not Been A Good Week For The Hacker Community

RIP

The Internet lost a few great minds this week. [Aaron Swartz], confronted with an upcoming federal trial for his actions in downloading and releasing public domain academic articles from JSTOR, hanged himself this week. As one of the co-developers of RSS, the Creative Commons license, and a slew of other works, [Aaron]’s legacy expanded the freedoms and possibilities of the most important human invention since the book.

Perhaps overshadowed in the news by [Aaron] is [Fabio Varesano], the man behind FreeIMU and Femtoduino. He died of a sudden heart attack at the much too young age of 28. The RC helicopter/plane/drone and HCI/physical computing communities lose a great mind with [Fabio]’s passing.

There is talk on the Dangerous Prototypes forum of continuing the development of FreeIMU, a project it seems [Fabio] worked on alone. We’d love to see someone pick up the reins of the FreeIMU project, hopefully after doing a run of the current hardware and donating the proceeds to [Fabio]’s family.