Page rankings are the secret sauce of websites that automatically aggregate user submissions. The basic formula used by Hacker News was published a few years back. But there are several pieces of the puzzle that are missing from that specification. [Ken Shirriff] recently published an analysis that digs deeper to expose the article penalization system used by Hacker News’ ranking engine.
One might assume that the user up and down votes are what determine a page’s lifespan on the front page. But it turns out that a complex penalization system makes a huge difference. It takes into account keywords and domain names, but also weighs controversy. It’s a bit amusing to note that this article on the topic was itself penalized, knocking it off of the front page.
You can get the full details of the system from his post, but we found his investigation methods to be equally interesting. He scraped two pages of the news feed every minute using Python and the Beautiful Soup package (a pretty common scraping practice). This data set allowed him to compare the known algorithm with actual results. What was left was a set of anomalies that made enough sense for him to reverse engineer the unpublished formulas being used.
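To make the "compare the known algorithm with actual results" step concrete, here is a minimal sketch of the idea. It is not [Ken Shirriff's] code. It assumes the commonly cited form of the published formula, (votes - 1)^0.8 / (age_hours + 2)^1.8, and the `Story` records, function names, and sample numbers are all made up for illustration. A story gets flagged when it sits below another story whose computed score is lower, which hints that something outside the formula pushed it down.

```python
from dataclasses import dataclass


@dataclass
class Story:
    title: str
    votes: int
    age_hours: float
    rank: int  # observed position in the scraped feed, 1 = top


def raw_score(votes, age_hours, vote_exp=0.8, gravity=1.8):
    """Score under the ASSUMED published formula (exponents are assumptions)."""
    return max(votes - 1, 0) ** vote_exp / (age_hours + 2) ** gravity


def find_anomalies(stories):
    """Return stories ranked below some story that has a lower raw score."""
    lowest_above = float("inf")  # lowest raw score among unflagged stories seen so far
    anomalies = []
    for story in sorted(stories, key=lambda s: s.rank):
        score = raw_score(story.votes, story.age_hours)
        if score > lowest_above:
            # The formula alone would have placed this story higher.
            anomalies.append(story)
        else:
            lowest_above = score
    return anomalies


# Hypothetical snapshot of one scrape; "C" is ranked lower than its votes suggest.
snapshot = [
    Story("A", votes=100, age_hours=2, rank=1),
    Story("B", votes=50, age_hours=3, rank=2),
    Story("C", votes=80, age_hours=2, rank=3),
    Story("D", votes=10, age_hours=5, rank=4),
]
flagged = [s.title for s in find_anomalies(snapshot)]
```

Run once a minute over real scrapes, a check like this would pile up a list of suspicious stories, and patterns in that list (shared domains, title keywords, comment counts) are the sort of thing that could point to hidden penalties.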
[Rajesh] put web scraping to good use in order to gather the information important to him. He’s published two posts about it. One scrapes Amazon daily to see if the books he wants to read have reached a certain price threshold. The other scrapes Rotten Tomatoes in order to display the audience score next to the critics’ score for the top rental movies.
Web scraping uses scripts to gather information programmatically from HTML rather than using an API to access data. We recently featured a conceptual tutorial on the topic, and even came across a hack that scraped all of our own posts. [Rajesh’s] technique is pretty much the same.
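For anyone who hasn't tried it, pulling a value out of raw HTML looks roughly like the sketch below. To keep it dependency-free we've swapped in Python's built-in `html.parser` module rather than Beautiful Soup, and the markup, the `price` class name, and the `extract_prices` helper are all hypothetical. A real page will use different tags and will need its own selectors.

```python
from html.parser import HTMLParser

# Hypothetical markup standing in for a downloaded product page.
SAMPLE_HTML = """
<html><body>
  <div class="book">
    <span class="title">Example Book</span>
    <span class="price">$12.99</span>
  </div>
</body></html>
"""


class PriceParser(HTMLParser):
    """Collects the numeric value of every <span class="price"> element."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "price" in classes:
            self._in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False

    def handle_data(self, data):
        text = data.strip()
        if self._in_price and text:
            # Drop the currency symbol and keep the number.
            self.prices.append(float(text.lstrip("$")))


def extract_prices(html):
    """Parse an HTML string and return the prices found in it."""
    parser = PriceParser()
    parser.feed(html)
    return parser.prices


prices = extract_prices(SAMPLE_HTML)
```

Beautiful Soup wraps this kind of tree walking in a friendlier search interface, which is a big part of why it shows up in so many scraping projects.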
He’s using Python scripts with the Beautiful Soup module to parse the DOM tree for the information he’s after. In the case of the Amazon script, he sets a target price for a specific book he’s after and will get an email automatically when it gets there. With Rotten Tomatoes he sometimes likes to see the audience score when considering a movie, but you can’t get it on the list at the website; you have to click through to each movie. His script keeps a database so that it doesn’t continually scrape the same information. The collected numbers are displayed alongside the critics’ scores as seen above.
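The two housekeeping ideas here, a price threshold and a database that prevents repeat scrapes, can be sketched in a few lines. This is our own illustration under stated assumptions, not [Rajesh's] code: it uses SQLite from the Python standard library, the table layout and function names are invented, "reaching" the target is taken to mean at or below it, and the `fetch` argument stands in for whatever function actually downloads and parses a movie page.

```python
import sqlite3


def open_cache(path=":memory:"):
    """Open (or create) a small cache of scores; schema is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scores ("
        "movie_url TEXT PRIMARY KEY, audience_score INTEGER)"
    )
    return conn


def get_audience_score(conn, movie_url, fetch):
    """Return the cached score, calling fetch(movie_url) only on a cache miss."""
    row = conn.execute(
        "SELECT audience_score FROM scores WHERE movie_url = ?", (movie_url,)
    ).fetchone()
    if row is not None:
        return row[0]
    score = fetch(movie_url)  # hypothetical scraper supplied by the caller
    conn.execute(
        "INSERT INTO scores (movie_url, audience_score) VALUES (?, ?)",
        (movie_url, score),
    )
    conn.commit()
    return score


def should_alert(current_price, target_price):
    """True once the price is at or below the target (assumed meaning)."""
    return current_price <= target_price


# Usage with a stand-in fetcher that records how often it is called.
calls = []


def fake_fetch(url):
    calls.append(url)
    return 87


cache = open_cache()
first = get_audience_score(cache, "https://example.com/movie/1", fake_fetch)
second = get_audience_score(cache, "https://example.com/movie/1", fake_fetch)
```

The second lookup is answered from the database, so the stand-in fetcher only runs once. Passing a file path to `open_cache` instead of the in-memory default would keep the cache around between daily runs.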