Crunching The News For Fun And Little Profit

Do you ever look at the news, and wonder about the process behind the news cycle? I did, and for the last couple of decades it’s been the subject of one of my projects. The Raspberry Pi on my shelf runs my word trend analysis tool for news content, and since my journey from curious geek to having my own large corpus analysis system has taken twenty years it’s worth a second look.

How Career Turmoil Led To A Two Decade Project

A hanging sign surrounded by ornate metalwork, with the legend "Cyder house".
This is very much a minority spelling. Colin Smith, CC BY-SA 2.0.

In the middle of the 2000s I had come out of the dotcom crash mostly intact, and was working for a small web shop. When they went bust I was casting around as one does, and spent a while as a Google quality rater while I looked for a new permie job. These teams are employed by the search giant through temporary employment agencies, and in loose terms their job is to be the trained monkeys against whom the algorithm is tested. The algorithm chose X, and if the humans also chose X, the algorithm is probably getting it right. Being a quality rater is not in any way a high-profile job, but with the big shiny G on my CV I soon found myself in demand from web companies seeking some white-hat search engine marketing expertise. What I learned mirrored my lesson from a decade earlier in the CD-ROM business, that on the web as in any other electronic publishing medium, good content well presented has priority over any black-hat tricks.

But what makes good content? Forget an obsession with stuffing bogus keywords in the text, and instead talk about the right things, and do it authoritatively. What are the right things in this context? If you are covering a subject, you need to do so using the right language; that which the majority uses rather than language only you use. I can think of a bunch of examples which I probably shouldn’t talk about, but an example close to home for me comes in cider. In the UK, cider is a fermented alcoholic drink made from apples, and as a craft cidermaker of many years standing I have a good grasp of its vocabulary. The accepted spelling is “Cider”, but there’s an alternate spelling of “Cyder” used by some commercial producers of the drink. It doesn’t take long to realise that online, hardly anyone uses cyder with a Y, and thus pages concentrating on that word will do less well than those talking about cider.

A graph of the word football versus the word soccer in British news.
We Brits rarely use the word “soccer” unless there’s a story about the Club World Cup in America.

I started to build software to analyse language around a given topic, with the aim of discerning the metaphorical cider from the cyder. It was a great surprise a few years later to discover that I had invented for myself the already-existing field of computational linguistics, something that would have saved me a lot of time had I known about it when I began. I was taking a corpus of text and computing the frequencies and collocates (words that appear alongside each other) of the words within it, and from that I could quickly see which wording mattered around a subject, and which didn’t. This led seamlessly to an interest in what the same process would look like for news data with a time axis added, so I created a version which harvested its corpus from RSS feeds. Thus began my decades-long project.

Continue reading “Crunching The News For Fun And Little Profit”

Hash Functions With The Golden Ratio

In the realm of computer science, it’s hard to go too far without encountering hashing or hash functions. The concept appears throughout security, from encryption to password storage to crypto, and more generally whenever large or complex data must be efficiently mapped to a smaller, fixed-size set. Hashing makes the process of looking for data much faster for a computer than performing a search and can be incredibly powerful when mastered. [Malte] did some investigation into hash functions and seems to have found a method called Fibonacci hashing that not only seems to have been largely forgotten but which speeds up this lookup process even further.

In a typical hashing operation, the data is transformed in some way, with part of this new value used to store it in a specific location. That second step is often done with an integer modulo function. But the problem with any hashing operation is that two different pieces of data end up with the same value after the modulo operation is performed, resulting in these two different pieces of data being placed at the same point. The Fibonacci hash, on the other hand, uses the golden ratio rather than the modulo function to map the final location of the data, resulting in many fewer instances of collisions like these while also being much faster. It also appears to do a better job of using the smaller fixed-size set more evenly as a consequence of being based around Fibonacci numbers, just as long as the input data doesn’t have a large number of Fibonacci numbers themselves.

Going through the math that [Malte] goes over in his paper shows that, at least as far as performing the mapping part of a hash function, the Fibonacci hash performs much better than integer modulo. Some of the comments mention that it’s a specific type of a more general method called multiplicative hashing. For those using hash functions in their code it might be worth taking a look at either way, and [Malte] admits to not knowing everything about this branch of computer science as well but still goes into an incredible amount of depth about this specific method. If you’re more of a newcomer to this topic, take a look at this person who put an enormous bounty on a bitcoin wallet which shows why reverse-hashing is so hard.

The Python documentation for str.strip().

Faster String Processing With Bloom Filters

At first, string processing might seem very hard to optimize. If you’re looking for a newline in some text, you have to check every character in the string against every type of newline, right? Apparently not, as [Abhinav Upadhyay] tells us how CPython does some tricks in string processing.

The trick in question is based on bloom filters, used here to quickly tell whether a character possibly matches any in a predefined set. A bloom filter works by condensing a set of more complex data to a couple of bits in an array. When an element is added, a bit is set, the index of which is determined by a hash function. To test whether an element might be in the filter, the same is done but by testing the bit instead of setting it. This effectively allows a fast check of whether an element might be in the filter.

CPython doesn’t stop optimizing there: instead of a complicated hash function, it simply uses the lowest 6 bits. It also has a relatively small bit array at only 64 bits which allows it to avoid memory all together, which in turn makes the comparisons much faster. [Abhinav] goes far into more detail in his article, definitely worth a read for any computer scientists among us.

Nowadays there is ever increasing amounts of talk about AI (specifically large language models), so why not apply an LLM to Python to fix the bugs for you?

Taking A Crack At The Traveling Salesman Problem

The human mind is a path-planning wizard. Think back to pre-lockdown days when we all ran multiple errands back to back across town. There was always a mental dance in the back of your head to make sense of how you planned the day. It might go something like “first to the bank, then to drop off the dry-cleaning. Since the post office is on the way to the grocery store, I’ll pop by and send that box that’s been sitting in the trunk for a week.”

This sort of mental gymnastics doesn’t come naturally to machines — it’s actually a famous problem in computer science known as the traveling salesman problem. While it is classified in the industry as an NP-hard problem in combinatorial optimization, a more succinct and understandable definition would be: given a list of destinations, what’s the best round-trip route that visits every location?

This summer brought news that the 44-year old record for solving the problem has been broken. Let’s take a look at why this is a hard problem, and how the research team from the University of Washington took a different approach to achieve the speed up.

Continue reading “Taking A Crack At The Traveling Salesman Problem”

Grab A Stanford Computer Science Education

There are two reasons to go to school: learn about something and to get a coveted piece of paper that helps you get jobs, or at least, job interviews. With so many schools putting material online, you can do the first part without spending much money as long as you don’t expect the school to help you or grant you that piece of paper. Stanford has a huge computer science department and [Rui Ma] cataloged over 150 computer science classes available online in some form from the University. Just the thing to while away time during the quarantine.

Apparently, [Rui] grabbed the 2020 course catalog to find on-campus classes and found the companion website for each class, organizing them for our benefit. The list doesn’t include the actual online class offerings, which you can find directly from Stanford, although there is another list for that.

Continue reading “Grab A Stanford Computer Science Education”

A Turing-Complete CPU From RAM

Building a general-purpose computer means that you’ll have to take a lot of use cases into consideration, and while the end product might be useful for a lot of situations, it will inherently contain a lot of inefficiencies. On the other hand, if you want your computer to do one thing and do it very well, you can optimize to extremes and still get results. This computer, built from RAM, is just such an example.

The single task in this case was to build a computer that can compute the Fibonacci sequence.  Since it only does one thing, another part of the computer that can be simplified (besides the parts list) is the instruction set. In this case, the computer uses a single instruction: byte-byte-jump. Essentially all this computer does is copy one byte to another, and then perform an unconditional jump. Doing this single task properly is enough to build every other operation from, so this was chosen for simplicity even though the science behind why this works is a little less intuitive.

Of course, a single instruction set requires a lot of clock cycles to work (around 200 for a single operation). The hardware used in this build is also interesting and although it uses a Raspberry Pi to handle some of the minutiae, it’s still mostly done entirely in RAM chips, only cost around $15, and is a fascinating illustration of some of the more interesting fundamentals of computer science. If you’re interested, you can build similar computers out of 74-series chips as well.

Tony Brooker And Autocode – The First High-level Language

The field of computer science has undeniably changed the world for virtually every single person by now. Certainly for you as Hackaday reader, but also for everyone around you, whether they’re working in the field themselves, or are simply enjoying the fruits of convenience it bears. What was once a highly specialized niche field for a few chosen people has since grown into a discipline that not only created one of the biggest industry in modern times, but also revolutionized every other industry, some a few times over.

The fascinating part about all this is the relatively short time span it took to get here, and with that the privilege to live in an era where some of the pioneers and innovators, the proverbial giants whose shoulders every one of us is standing on, are still among us. Sadly, one of them, [Tony Brooker], a pioneer of the early programming language concept known as Autocode, passed away in November. Reaching the remarkable age of 94, the truly sad part however is that this might be the first time you hear his name, and there’s a fair chance you never heard of Autocode either.

But Autocode was probably the first high-level computer language, and as such played a fundamental role in the development of whatever you’re coding in today. So to honor the memory of [Tony Brooker], let’s remember the work he did with Autocode, and the leap in computer science history that it represented.

Continue reading “Tony Brooker And Autocode – The First High-level Language”