Your job is to make a circuit that will illuminate a light bulb when it hears the song “Mary Had a Little Lamb”. So you breadboard a mic, op amp, your favorite microcontroller (and an ADC if needed) and get to work. You will sample the incoming data and compare it to a known template. When you get a match, you light the light. The first step is to make the template. But what to make the template of?

“Hey boss, what style of the song do you want to trigger the light? Is it children singing, piano, what?”

“I want the light to shine whenever any version of the song occurs. It could be singing, keyboard, guitar, any musical instrument or voice in any key. And I want it to work even if there’s a lot of ambient noise in the background.”

Uh oh. Your job just got a lot harder. Is it even possible? How do you make templates of every possible version of the song? Stumped, you talk over your dilemma at lunch with a friend, who just so happens to be [Jeff Hawkins], a guy who’s already put a great deal of thought into this very problem.

“Well, the brain solves your puzzle easily.” [Hawkins] says coolly. “Your brain can recall the memory of that song no matter if it’s vocal, instrumental in any key or pitch. And it can pick it out from a lot of noise.”

“Yeah, but how does it do that?” you ask. “The patterns of electrical signals entering the brain have to be completely different for different versions of the song, just like the patterns from my ADC. How does the brain store the countless number of templates required to ID the song?”

“Well…” [Hawkins] chuckles. “The brain does not store templates like that. The brain only remembers the parts of the song that don’t change, or are invariant. The brain forms what we call invariant representations of real world data.”

Eureka! Your riddle has been solved. You need to construct an algorithm that stores only the parts of the song that don’t change. These parts will be the same in all versions – vocal or instrumental in any key. It will be these invariant, unchanging parts of the song that you will look for to trigger the light. But how do you implement this in silicon?

Some organizations have taken Hawkins’ ideas and stealthily run with them, with schemes already underway at companies like IBM and federal organizations like DARPA to implement his ideas in silicon…

Indeed, companies are already working to implement [Jeff Hawkins’] theory of intelligence into their own systems. It’s a complicated theory, which is laid out in his book – On Intelligence. Forming invariant representations (IR) is only the beginning, and we will discuss other parts of the theory in later articles. But for now, we will concentrate on how one would go about forming IRs of real world data in silicon. We simply cannot move forward with the theory until this core component is understood. The problem is nobody seems to know how to do this. Or if they do, they’re not talking. This is where you come in!

Consider this image. Let us pretend these are serial signals coming off multiple ADCs. On the other end of the circuit would be different versions of our song, with A – E representing those different versions. Because the data is constantly changing, we sample 4 signals at the same time for each version, which are numbered 1 – 4.

Immediately, we see a common pattern in all versions at times T4, T5 and T6. If we can somehow set our microcontroller to listen at these times, we can detect all versions of the song. Further, we can see another pattern between the versions at times T1, T2 and T3. This type of analysis can be used to distinguish between the different versions. Both patterns are invariant representations of the song – a common, unchanging pattern hidden in the midst of a constantly changing environment.
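To make the comparison concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the sample values are invented, only three versions are shown, and `invariant_slots` is a hypothetical helper name. It simply reports the time slots at which every version carries identical samples on all four signals.

```python
from typing import Dict, List

# Each version is 4 sampled signals (rows) over 6 time slots T1..T6 (columns).
Version = List[List[int]]


def invariant_slots(versions: Dict[str, Version]) -> List[int]:
    """Return the 1-based time slots whose samples are identical in every version."""
    grids = list(versions.values())
    n_slots = len(grids[0][0])
    slots = []
    for t in range(n_slots):
        # Column t of each version: one sample from each of its signals.
        columns = [tuple(row[t] for row in grid) for grid in grids]
        if all(col == columns[0] for col in columns):
            slots.append(t + 1)
    return slots


# Invented data: T1..T3 differ between versions, T4..T6 are shared.
versions = {
    "A": [[0, 1, 0, 1, 0, 1], [1, 1, 0, 0, 1, 1], [0, 0, 1, 1, 1, 0], [1, 0, 1, 0, 0, 1]],
    "B": [[1, 0, 1, 1, 0, 1], [0, 0, 1, 0, 1, 1], [1, 1, 0, 1, 1, 0], [0, 1, 0, 0, 0, 1]],
    "C": [[1, 1, 1, 1, 0, 1], [0, 1, 0, 0, 1, 1], [0, 1, 0, 1, 1, 0], [1, 1, 1, 0, 0, 1]],
}

print(invariant_slots(versions))  # [4, 5, 6]
```

Real signals would never line up this neatly, so an exact equality test like this one would have to give way to some tolerance or statistical measure.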

This is a hypothetical example of course. In the real world, the signals would vary wildly. The key is to find the part that does not. Can you do it? How would you create an invariant representation of a real world event?

1. NotArduino says:

Umm, this is not a hack Will.

1. It’s hack a day. If you get less than one per day you can start complaining.

1. asfwer says:

Touche.

2. rusty shackled says:

I always did like you.

3. NotArduino says:

Since we’re liberally interpreting what a hack-a-day means I choose the day length of the ISS, 93 minutes. I don’t see any hacks today.

1. doed says:

You may get to use that day length if you actually were on the ISS. Until then, shut your dick holster.

4. Pusalieth says:

ha, booom. Can’t stand that NotArduino asshole anyway. Since his posts in others.

5. pyroavr says:

Pithy and smooth. I like it!

6. FredTheRanger says:

Technically, it’s “fewer than one hack per day.” Things you can count are “fewer”. Things you cannot count are “less”. For example, you don’t have “fewer salt”, you have “less salt”. However, you have “fewer grains of salt.” :P

1. FredTheRanger says:

ugh. that was barely funny to begin with, then it stripped my snarky [pedant] [/pedant] tags out.

2. I guess I’d go with something like this:
1-Identify a clear data point such as a note/sound (Hz) or an overlaid combination of them(ex.DTMF)
2-Identify a second data point and calculate ΔHz (or Δ?)
3-Do this for a definite number of data points (ex. The Hunger Games whistling is 4 notes, so we get Δ1,Δ2 and Δ3)
4-Apply a relative scale of Δ1 to Δ2, Δ2 to Δ3 and Δ1 to Δ3 to get something along the form of Δ1=A*Δ2, Δ1=B*Δ3 and Δ2=C*Δ3
5-Those values would represent a frequency ratio between key notes and would always be the same relative to one another.
6-Keep scanning incoming notes until 3 of them in a row are within tolerance of the Invariant representation and voila!

One could also add some timing values or any number of variables as the complexity would be relative to Number of data points * Attributes of a data point in a Matrix shape.

So the computation scales from something small to something pretty huge depending on the needs, and all you actually have to program is an invariant representation.

TL;DR:
Invariant representation = Relative values/ratios of data points (in my understanding)

This approach would work for any variant of a song, in any octave as long as the notes are within their expected ranges.

Relativity solves everything. ;)
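A rough sketch of the relative-pitch idea follows, with one deliberate change that should be read as an assumption rather than the commenter's method: instead of ratios of ΔHz values, it compares intervals on a log-frequency scale (semitones). Ratios of deltas break down whenever a note repeats (Δ = 0), and this tune repeats notes right away. The function names and the 0.5 semitone tolerance are made up for illustration.

```python
import math
from typing import List


def semitone_intervals(freqs_hz: List[float]) -> List[float]:
    """Pitch steps between consecutive notes, in semitones (12 * log2 of the ratio)."""
    return [12.0 * math.log2(b / a) for a, b in zip(freqs_hz, freqs_hz[1:])]


def matches(template: List[float], heard_hz: List[float], tol: float = 0.5) -> bool:
    """True if some run of notes in heard_hz follows the template intervals within tol."""
    heard = semitone_intervals(heard_hz)
    n = len(template)
    for start in range(len(heard) - n + 1):
        window = heard[start:start + n]
        if all(abs(h - t) <= tol for h, t in zip(window, template)):
            return True
    return False


# Opening phrase E D C D E E E, equal temperament with A4 = 440 Hz.
E4, D4, C4 = 329.63, 293.66, 261.63
phrase = [E4, D4, C4, D4, E4, E4, E4]
template = semitone_intervals(phrase)

# The same phrase moved up seven semitones, preceded by two unrelated notes.
shift = 2 ** (7 / 12)
transposed = [200.0, 500.0] + [f * shift for f in phrase]

print(matches(template, transposed))  # True
```

Because only the intervals are stored, one template covers every key. Extracting clean note frequencies from a noisy microphone signal is a separate problem that this sketch does not touch.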

1. Jason Doege says:

Basically, look for relative pitch and relative meter to identify a song, then. I wonder how Soundhound does it. I’ve whistled a song into it (a song that had no whistling) and it was able to identify the song for me.

1. Maxwell says:

Probably a similar way, it (probably) just looks for an average pitch, which with whistling would be higher than the normal song. So it steps up what pitches it’s looking for, and then is able to identify it, based on timing between notes most likely.

2. Blue Footed Booby says:

I suddenly want to see how those sorts of services handle music like classical fugues, which basically have multiple independent melodies. Can you just whistle one of the voices, or do you have to provide counterpoint to yourself?

2. hojo says:

I’d guess that doing an FFT of the “noise”, to find your “data points” and then applying your delta approach to results of similar amplitude would help. (full disclosure: I have no idea how to implement an FFT)
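For anyone in the same boat, here is a minimal sketch of picking the loudest frequency out of a frame of samples. It uses a naive discrete Fourier transform in plain Python so that it has no dependencies; an FFT produces the same bins much faster, and in practice a library routine would be the sensible choice. The sample rate, frame length and function name are assumptions chosen for the example.

```python
import cmath
import math
from typing import List


def dominant_frequency(samples: List[float], sample_rate: float) -> float:
    """Return the frequency in Hz of the strongest DFT bin, ignoring the DC bin."""
    n = len(samples)
    best_bin, best_mag = 1, 0.0
    for k in range(1, n // 2):
        # Naive DFT for bin k: correlate the frame with a complex sinusoid.
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(samples))
        if abs(acc) > best_mag:
            best_bin, best_mag = k, abs(acc)
    return best_bin * sample_rate / n


sample_rate = 8000.0
n = 400  # a 50 ms frame, so bins are spaced 20 Hz apart
tone = [math.sin(2 * math.pi * 440.0 * i / sample_rate) for i in range(n)]

print(dominant_frequency(tone, sample_rate))  # 440.0
```

Running this on successive frames gives one pitch estimate per frame, which is the kind of “data point” the delta approach above needs. The 20 Hz bin spacing is far too coarse for low notes, so a real detector would need longer frames or interpolation between bins.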

3. Rob says:

You run into issues though when you have the same melody but with different sets of lyrics. Your method seems to describe a reduction function to map the melody. For example, the church hymn “Glorious Things of Thee Are Spoken” shares the melody of a classical composition by Haydn. Haydn set that melody with lyrics that would become “Deutschlandlied” (the German national anthem). In some cases, people think of “Deutschlandlied” by the bastardized title “Deutschland Über Alles” and associate it with the Nazis (who made much use of it for a time). So in that situation, you have one melody with two sets of lyrics with three potential (and *very* different) associations. Your method is going to run into some issues there. Similar scenario with “‘O Sole Mio” … opera in one use of the melody but then Elvis recorded “It’s Now or Never” to the same melody.

You’d have to go multiple layers deep with your method, each layer more complex than the last, to sort out the differences in a situation of the kind that I described, and that’s just one of the possible variables. Alternately, each melody could be treated as a single item and upon recall the results would include a list of the possible variants. But that’s rather unsatisfying.

It’s an interesting topic to think about, for sure…

3. Oliver says:

Isn’t this in theory how Shazam and the like work too? The audio watermark or something?

4. why not simply count/measure the off times or lower pitch compared upper if noisy?

5. Rusty Shackleford says:

The invariant parts are the relative note durations and the relative pitch intervals between the notes. I see a lot of Fourier transforms in your future…

1. wretch says:

That is cool.

2. JIm B says:

Shazam doesn’t lump version A of a song with version B of a song. In fact, Shazam differentiates between, say, the live and studio versions of a song, or the original recording and a cover version. So while I’m sure it is finding invariants of a sort, it isn’t the type which this article is talking about.

6. CodeRed says:

You must go higher. I don’t think the problem is easily solved if you are looking at raw signals, you need more abstraction. Think of how your brain does it. Relative changes in pitch, tone, and volume in a pattern. More than just tones/notes, you recognize that a bunch of these in series with similar audio qualities represent a song, and that’s how you are able to separate it from background noise. Or at least you think. It’s possible your brain has heard many separate versions, and uses all these similar versions in recognizing a new version. Think of how it can be difficult to understand a coworker with an accent, but over time it gets easier as you learn new versions of already familiar words.

I think you could dedicate an entire software framework to solving such problems, and likely would need to.

1. This is censorship at its finest.

Deploying code that achieves the same result as Shazam does not mean it’s infringing a patent. If it did, I could patent “soundwaves as a means of communication” and attack anyone producing soundwaves on that bogus copyright basis.

Secundo, copyright has been created to promulgate technological advancement by giving the means to the people creating something new to survive while they push the concept forward. A very important point there also is that copyright is based on “for profit” industries and as such, someone figuring out by himself how to make a computer recognize music for the sheer pleasure of the challenge is not doing copyright infringement.

I wish the EFF would grab that case and push it to the court.

7. ruben says:

Your problem is quite similar to that of speech recognition, where you try to recognize words independent of the speed, pitch etc.
Speech recognition (and in fact most pattern recognition in sequential data) is usually done by using Hidden Markov Models (HMM).
Basically, an HMM consists of a Markov chain (MC) which is not directly observable (hidden) and a set of observations that are each linked to a node in the MC.
What you’re trying to do is determine the state your system is most likely in, based on the observations you have made so far. You need two sets of probabilities: one is the transition probabilities of the MC, the other is the conditional probabilities of the observations. These probabilities are learned from examples.
An HMM is really just a special kind of Bayesian network.

In other words:
the MC could be what you call your IR, whereas the samples from your song would be the observations.
To recap:
HMMs are the de facto standard for that kind of problem and can be implemented quite efficiently.
You should definitely look into that.
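As a rough illustration of “the state your system is most likely in, given the observations so far”, here is a small sketch of the forward (filtering) recursion for an HMM. The two hidden states, the two observation symbols and every probability below are invented for the example; in a real system they would be learned from recordings, and the chain would more likely have one state per position in the tune.

```python
from typing import Dict, List

Table = Dict[str, Dict[str, float]]


def forward_filter(obs: List[str], states: List[str], start_p: Dict[str, float],
                   trans_p: Table, emit_p: Table) -> Dict[str, float]:
    """Return P(hidden state | observations so far) via the normalized forward algorithm."""
    belief = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    total = sum(belief.values())
    belief = {s: p / total for s, p in belief.items()}
    for o in obs[1:]:
        new = {}
        for s in states:
            # Predict: probability of arriving in s from any previous state.
            prior = sum(belief[r] * trans_p[r][s] for r in states)
            # Update: weight by how well s explains the new observation.
            new[s] = prior * emit_p[s][o]
        total = sum(new.values())
        belief = {s: p / total for s, p in new.items()}
    return belief


# Toy model: a tune mostly moves by small steps, noise mostly by large jumps.
states = ["song", "noise"]
start_p = {"song": 0.5, "noise": 0.5}
trans_p = {"song": {"song": 0.9, "noise": 0.1}, "noise": {"song": 0.1, "noise": 0.9}}
emit_p = {"song": {"step": 0.9, "jump": 0.1}, "noise": {"step": 0.3, "jump": 0.7}}

belief = forward_filter(["step", "step", "step", "step"], states, start_p, trans_p, emit_p)
print(belief)
```

After four small pitch steps in a row the belief leans heavily towards "song"; a run of jumps pushes it the other way. The light would come on once the "song" probability crosses whatever threshold you are comfortable with.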

8. phreaknik says:

As much as I love playing with silicone, I think you mean “silicon”. I could be wrong though ;)

1. furiousd says:

That got to me as well

9. Duwogg says:

Well, your brain only does it with billions of neurons. And his theory is just that so far. Not saying that this isn’t in fact how our brains handle such things. But AI, as smart as it is, is nowhere near as good as the human brain, or even mouse brain… yet.
Detecting complex patterns in noise has been going on for years. This is just streamlining the process by sticking to the invariable data so you don’t have to so many
reference patterns.

Now switch in video signals instead of audio and objects instead of patterns…

1. Duwogg says:

store so many reference patterns… sorry

10. butterfly says:

Hawkins is well known in machine learning community as a PR whore – he has zero actual results to prove his theories.

It’s like if Elon Musk had spent the last decade talking about how electric motors worked but never built a car.

11. TERCOM had a pretty nice take on it.

In the real world, we only ever get approximations. I run an IRC bot that someone wrote that only knows the Markov-chain algorithm, but each day among its mad rants something surprisingly insightful emerges. It’s quite uncanny, and makes one wonder about the ghost in the machine.

You seem to be looking for a meta-solution. – There is more than one reason for philosophers starving, and that is how Kurt Gödel died.

Perhaps when the system appears to be in error, you have simply not found out why it did the right thing?

Humans gamble against sure losses and play it safe for small gains. Often we are not aware if we are behaving rationally or irrationally.

12. One thing Will didn’t mention is that [Jeff’s] study of the brain was a hack in itself. He fell in love with the inner workings of the brain back in 1979. After failing to get Intel (his employer at the time) on board with his research, he tried to join the MIT media lab. Again, he was rejected. [Jeff] then decided to try to use his career in the computer industry to fund his passion for the mind. Considering how big PalmOS was in its heyday, I’d say he did a pretty good job.

1. dinre says:

How so? On both counts.

13. vonskippy says:

Might be easier just to leave the light on 24/7.

14. Chris C. says:

To make it reasonably robust, you need to use multiple approaches:

1) Submit a sample to an audio identification service, like Gracenote, Echoprint, etc. Maybe several simultaneously. Look for “Mary Had a Little Lamb” in the song title. This will find versions of the song known to the ID services. Though it may be somewhat dependent on audio quality, if you’re using an open microphone to record the audio.

2) Fuzzy matching of relative pitch/tempo relationships. I’m sure there’s some prior art and services that do this already. For example, isn’t there a website where you can whistle a tune and find out what it is? Or clap the rhythm? Things like this may find some unknown versions of the song.

3) Of course, it’s possible a progressive rock band will give ol’ Mary the treatment, completely noodling around with the pitch/tempo relationships to the point where they can no longer be recognized by #2. And if it’s a new/uncommon song, #1 won’t catch it either. The last resort is speech recognition. Listen for specific lyrics. Pre-filtering out some of the non-vocal content first would help. In case only a few words can be picked out, you can still examine them for relative temporal relationships. More specialized approaches may also be beneficial. For example, detecting only vowel sounds via cepstral analysis, then performing the same tests for temporal relationships; rather than attempting to recognize full words.

4) And finally aggregate all the data gathered using some weighting system.
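The weighting step could be as simple as the sketch below. The detector names, scores, weights and the 0.5 threshold are all invented for illustration; nothing here reflects how any real identification service reports confidence.

```python
from typing import Dict


def aggregate(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of per-detector confidence scores, each in the range 0..1."""
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Hypothetical results from the three approaches listed above.
scores = {"id_service": 0.0, "pitch_tempo": 0.8, "lyrics": 0.6}
weights = {"id_service": 1.0, "pitch_tempo": 2.0, "lyrics": 1.0}

confidence = aggregate(scores, weights)
light_on = confidence >= 0.5  # threshold picked arbitrarily for the example
print(confidence, light_on)
```

Here the ID service draws a blank, but the fuzzy pitch/tempo match is trusted twice as much and carries the vote.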

Of course, it’s all easier said than done. So it’s important to utilize existing resources as much as possible, as I’ve suggested above.

Once I had a very detailed and strange dream, in which I saw a possible future of AI. Relating and extending what I saw to the topic at hand:

There will be a central repository of code modules, that share a common API and language (or at least compile to the same intermediate language). People will write modules to perform specific tasks and upload them. Some will be open-source, so others could tweak existing modules in an attempt to improve performance, and upload them as a different version.

And if you wanted your device to recognize when a particular song is played, it will simply contact the repository and download a handful of “music recognizer” modules. Each of which gets fed the audio, and returns a result and percent certainty, which will be aggregated and presented to the user. The users may occasionally give the device feedback, telling it whether it’s right or wrong; which gets uploaded to the repository along with the result of all modules, generating reliability statistics for each module, and cross-correlating what modules generated similar results, resources (memory/CPU) required to produce the result, etc.

The device will periodically check the available modules and statistics. If it has sufficient resources, it will download more modules; typically high-rated ones, but should occasionally try others as well. If resources are running tight, it will delete modules that get it wrong too often, or which consistently produce similar results to another running module and are therefore redundant.

Modules should be as granular as possible. For example, a “music recognizer” module might require “vowel detection”. In which case it can experimentally try any vowel detector available, saving separate stats on how well each combination worked. Perhaps not all would have compatible APIs, but folks could also write modules that do nothing but translate across particular APIs, increasing the number of usable combinations.

It’s kind of like the concept of RobotWar, extended to more useful purposes. But there’s more.

If any device can allocate some spare resources, it can attempt generating its own modules automatically, using neural nets, genetic algorithms, or other methods yet to be invented. If, for example, a promising “vowel detector” is generated, it will automatically upload the module to the repository. (In reality, it would likely be some far smaller and more granular task that’s only a part of vowel detection.) Other devices can then use it, or continue mutating it in massive parallelism, re-uploading a new version if an improvement is made. As time goes on, machine-generated solutions will gradually increase in prevalence, and true AI will start to be born.

1. aptitude search “recognition”

Github is nice too, but the search isn’t as seamless and many projects lack autotools/cmake/qmake.

1. Chris C. says:

Got a web link? Remember that Linux accounts for less than 2% of desktop usage. Maybe higher around here, but I’m still not in that group. I’m curious what you’re referring to, just not to the point of loading an OS to find out.

1. Sorry for replying so late.

It was a general reference to package management. You were describing something people have been working on for a long time. Perhaps there was an original far-reaching vision such as yours…

15. NBM says:

I’m sorry to say, this guy is wrong. I see a good hammer, but not everything is a nail. It is good to look at problems in different ways, but this theory of mind isn’t the end / whole game. To say that robots as a platform for intelligence is impractical is a bit shortsighted. Best news for me out of this article is that my work remains undiscovered… :) At least he isn’t claiming that the mind is a set of pattern matchers like Kurzweil (that guy is a LOON!!). Rodney Brooks was closer, but I don’t think he knew exactly why or how he was so close…

16. toodlestech says:

Great post! I’m definitely going to buy the book.

18. Pusalieth says:

This I think is a lot harder than what even this makes it out to be. The brain also stores information in multiple places, and sub-sequences of the original. Continuing with the music example, every part of the brain is activated when to neuron is activated, logic, visualization, hearing, smelling, etc. However much information the original neuron storage is the information executed, with the same level of chemical bond relationship. Which is really why garbage in, garbage out, is truer than most people know, because the conscious mind is an amazing piece of technology, that executes incredibly complex and I don’t think we’ll ever comprehend it. I always loved the quote, “If the brain was simple enough to comprehend, then we would be too simple to understand it.” Perfect quote, and just as a comparison, physics has been studied since the beginning of time, just through magic, but people still sought to understand it, and because of this physics is the most advanced field of science, yet still, we understand just as much about the universe now as when we started. Think about it, whether you take one step to infinity or a million, you’re just as far from the end. To me this also is the mind, the closer we get to comprehension, the more complex and advanced it gets, thereby increasing the amount of needed understanding exponentially. Good stuff though

19. Trui says:

Nobody said the light had to turn off, so my solution would just be to leave it on permanently.

20. John says:

Oh wow, an ad for a book! I wonder how much money each Amazon click-through brings in. At least you guys tried to make an article… Someone feels guilty about selling out. Gotta grow that brand though, and really who is a bigger brand in the maker movement than Hackaday?

1. That’s complete bullshit. We would never start slipping down the editorial slope of monetizing Amazon links, or reviewing books for a fee. We’re Hackaday, not the Washington Post or New York Times.

21. Hirudinea says:

I think vonskippy and Trui are on the right track but we need something more subtle, how about when the circuit hears anything that sounds like human speech/song it says “Was that ‘Mary had a little lamb’?” Picking up the user response of “Yes” would turn on the light, and it does meet the parameters of the task. So when do I get my grant?

22. Okian Warrior says:

I’ve been working on this as my day job continuously for the past 7 years.

The problem is wildly complex and abstract, and nothing that can be solved in a few minutes of thinking about invariant representations.

To give an idea of the complexity involved, imagine building a computer program to learn and play any board game – chess, checkers, go-moku, or anything else – with no initial knowledge of the game. As far as anyone can tell, there are no neuronal circuits in the brain which are specific to chess or checkers or any other game. The algorithm of the brain is universal – it applies to *any* game.

If that’s not enough, consider that the game can be given to you in any format. Chess board descriptions might be passed in as an 8×8 grid of integers, where each grid position can be an integer 0=empty, 1=pawn, 2=rook, 3=knight, and so on.

As the algorithm has no built-in information of the game, it also doesn’t have information about the *descriptive format* of the game information. In the previous example, the program doesn’t know whether 1 is pawn or 3 is pawn, and when given the board info it doesn’t know in what order the squares are listed. It could be left-to-right up-down, or it could be down-to-up from left-to-right, or it could be alternating (left to right for the top row, right-to-left for the next, and so on), or outside in (top row, right column, bottom row, left column, then move in by 1 square).

It’s always the *same* format, but the program doesn’t know what that format is.

And as far as the invariant representation goes, note that you can write your name using a pen in your hand, or held in your toes, or held in your mouth, or taped to your elbow, or protruding from your hip. You can write your name in the snow without using your hands.

You never practiced *any* of these output modes, yet you can do them recognizably well on the first try. Imagine programming a robot to translate any action it learns in one input mode into any output mode.

And we can recognize a song sung slower, faster, in a different key, or (within reason) with varying tempo and varying pitch.

We can recognize a song among 2 songs played at the same time.

Solving this requires knowledge of computability theory, information theory, [really high-end concepts in] probability theory, and requires a solution to the [currently unsolved] “mixture of gaussians” problem.

And for the record, Jeff makes some provable mistakes. To take a concrete example, his cortical column simulations use minimum euclidean distance to choose the most likely pattern match. He does not support this choice with evidence or mathematical proof, and it happens to be the wrong choice. The correct “match distance” function can be deduced from first principles. This is one reason the outputs from his commercial software are finicky and noisy.
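For readers who have not met the term, the sketch below shows what “choose the pattern at minimum Euclidean distance” means in general. It is a generic nearest-pattern lookup with made-up vectors and names, not a reconstruction of Hawkins’ software, and it takes no position on which distance function is the right one.

```python
import math
from typing import Dict, List


def euclidean(a: List[float], b: List[float]) -> float:
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_pattern(sample: List[float], patterns: Dict[str, List[float]]) -> str:
    """Name of the stored pattern closest to the sample in Euclidean distance."""
    return min(patterns, key=lambda name: euclidean(sample, patterns[name]))


# Hypothetical stored patterns and a noisy input.
patterns = {"rising": [0.0, 0.5, 1.0], "falling": [1.0, 0.5, 0.0], "flat": [0.5, 0.5, 0.5]}
print(nearest_pattern([0.1, 0.4, 0.9], patterns))  # rising
```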

It’s said that engineers use their tools to build something, and scientists build new tools. This is waaaay out in the realm of theoretical research, and no application of known tools that anyone knows will solve it.

1. Brains do optimization really well though, because of how neural networks do their thing. IBM’s new brain chip might change what you have taken for granted the past 7 years…

1. NBM says:

… I think that chip has its uses, but re-creation of the mind in silicon isn’t one of them. I think it is silly to try and model the brain in silicon (neural, symbol, etc). Propagation delays, chemical changes, these things and how they affect the mind are dependent on the make of the mind – biological, electrochemical, and proximal aspects. The nature of the mind (what it does at a suitable level of abstraction where physical medium is abstracted from function) has to be captured, not its explicit functional form. Siliconized life can never be a one to one clone of biological life – just not possible. In that, life is the common attribute, which begs the question, “What IS Life?”. Then, one is faced with a question that has seemingly forever plagued man. Any thoughts in this will be highly theoretical – but not necessarily wrong. The harder part is moving from a theory to a working example. As for hardware capabilities, I think we are further along than most think. I find this to be no different than what were certainly numerous instances of pre-life in the primordial ooze from which life evolved, with all of the potential of its cousins, lying in wait for that key attribute to be revealed, the attribute that enabled life’s transition.

1. Sorry for replying so late.

You write well.

Good luck!

23. NewCommentor1283 says:

silicon and fat have very very different strengths and weaknesses.

silicon is very weak at finding patterns, exactly what we’re doing here
but it is very hard to trick/fool

fat on the other hand,
is VERY good at recognising patterns!
just its extremely easy to trick/fool.

PS: “fat” as in braincells,
braincells that happen to be programmed to work as a unit (usually)

1. NewCommentor1283 says:

“combination of the two”

a little hint from the past/future lolz

24. tz2026 says:

Think OCR. Letters are shapes, curves, lines, vertices. They may be wide or thin, bold, italic…
Now imagine a spectrogram of audio. That will have certain shapes which will still be the same for bass to soprano, largo to presto.

25. Anonymous Coward says:

Study of musical theme dictionaries in book form, from before the time when general purpose computers became miniaturized, cheap, widely available, and easy to use, may provide clues for solving the problem: A Dictionary of Musical Themes by Sam Morgenstern and Harold Barlow was published in 1950, and The Dictionary of Tunes and Musical Themes by Denys Parsons was published in 1975.

Both dictionaries use methods that ignore key signatures, time signatures, and meter to generate lookup keys. The former requires transposition to C (major or minor), and the latter invents a method that replaces the musical knowledge necessary to transpose with up, down, and repeat. As with any hashing function, it is possible for themes that are different to appear identical after such processing.
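The up/down/repeat scheme is simple enough to sketch in a few lines. The version below takes pitches as MIDI note numbers (an assumption made for convenience; any comparable pitch values would do) and uses `*` for the first note, followed by `u`, `d` or `r` for each later note.

```python
from typing import List


def parsons_code(pitches: List[int]) -> str:
    """Encode a melody as its contour: * for the first note, then u(p), d(own) or r(epeat)."""
    code = ["*"]
    for prev, cur in zip(pitches, pitches[1:]):
        if cur > prev:
            code.append("u")
        elif cur < prev:
            code.append("d")
        else:
            code.append("r")
    return "".join(code)


mary = [64, 62, 60, 62, 64, 64, 64]      # E D C D E E E
mary_up_a_fourth = [p + 5 for p in mary]  # same tune, different key
other = [64, 60, 59, 60, 72, 72, 72]      # a different tune with the same contour

print(parsons_code(mary))              # *dduurr
print(parsons_code(mary_up_a_fourth))  # *dduurr
print(parsons_code(other))             # *dduurr
```

The third line shows the collision the comment warns about: the contour throws away interval sizes, so distinct melodies can share a key, and longer phrases are needed to tell them apart.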

26. JAS II says:

I just finished reading the book today. It was okay. I tinker with neural nets and simple AI sketches and was mostly looking for some new ideas and inspiration. Didn’t really find it here. His big idea is that the prediction is the essence of intelligence. He stresses the importance of recursion, time, and the similarity of pattern processing across senses. He suggests autoassociative memory nets are much closer to real brain function than feedforward nets. Nothing revolutionary there. I felt like he began by saying, “let’s not get lost in the details”, and then proceeds to get lost in the details. I appreciate his humble attitude though. Kurzweil always strikes me as a bit arrogant. I agree with NBM that the cortical column as pattern recognizer doesn’t seem quite right. Hawkins’ prediction model makes more sense to me.

27. Dan says:

I remember hearing that most songs can be identified simply by the sequence of same/up/down pitches, if given say 20 notes or so, so completely ignoring timing and the size of the pitch jumps. I’m not sure what set of songs this applied to though; it was 20 years ago or so I heard this.

28. Quantise notes (whether played via instrument or sung)
You can now pick out time periods.
1,1,1,1/1,1,2 Mary had a little lamb.
Crotchet, crotchet, crotchet, crotchet, crotchet, crotchet, minim.

Pitch needn’t be fixed, but has clear intonation between notes. -which again can be measured.
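A minimal sketch of that quantising step, assuming note durations have already been measured in seconds: each duration is divided by the shortest one and rounded, which gives the same 1,1,1,1,1,1,2 pattern at any tempo. The function name and the sample timings are invented, and using the shortest note as the unit is a simplification that a single clipped note would upset.

```python
from typing import List


def rhythm_pattern(durations_s: List[float]) -> List[int]:
    """Express note durations as whole multiples of the shortest note."""
    unit = min(durations_s)
    return [round(d / unit) for d in durations_s]


fast = [0.25, 0.24, 0.26, 0.25, 0.25, 0.26, 0.50]  # brisk, slightly uneven playing
slow = [1.0, 1.1, 0.9, 1.0, 1.0, 1.0, 2.1]         # the same phrase, four times slower

print(rhythm_pattern(fast))  # [1, 1, 1, 1, 1, 1, 2]
print(rhythm_pattern(slow))  # [1, 1, 1, 1, 1, 1, 2]
```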

The biggest problem really is how much to sample, e.g what if I’m playing so slowly the notes last a minute each? There has to be some cut off….

That’s why most “x must perform y under any circumstance” requirements are usually impossible, or easy to call a fail. (Is it meant to turn on with the words spoken in a monotone? What about if I’m singing under my breath next to a jackhammer?)