In a previous article, we talked about the idea of the invariant representation and theorized about different ways of implementing such an idea in silicon. The hypothetical example of identifying a song without knowledge of pitch or form was used to lay a foundation for the end goal – to identify real world objects and events without the need for predefined templates. Such a task is possible if one can separate the parts of real world data that change from those that do not. By looking only at the parts of the data that do not change – the invariant parts – one can identify real world events with superior accuracy compared to a template based system.
Consider a friend’s face. Imagine she were sitting in front of you, her face taking up most of your visual field. Your brain identifies the face as your friend’s without trouble. Now imagine you were in a crowded nightclub, looking for the same friend. You catch a glimpse of her from several yards away, and your brain IDs the face without trouble – almost as easily as it did when she was sitting in front of you.
I want you to think about the raw data coming off the eye and going into the brain during both scenarios. The two sets of data would be completely different, yet your brain is able to find a commonality between the two events. How? It can do this because the data that makes up the memory of your friend’s face is stored in an invariant form. There is no template of your friend’s face in your brain. It stores only the parts that do not change – the distance between the eyes, the distance from an eye to the nose or from an ear to the mouth, the shape her hairline makes on her forehead. These data points do not change with distance, lighting conditions or other ‘noise’.
One can argue over the specifics of how the brain does this. Exactly right or not, the idea of the invariant representation is a powerful one, and implementing such an idea in silicon is a worthy goal. Read on as we continue to explore this idea in ever deeper detail.
If we could stick a sensor in different areas of your brain during both scenarios, we would find an interesting pattern. The part of the cortex that is connected directly to the eye is called V1. As one would expect, the neural firing in this area changes rapidly, and in completely different patterns, between seeing your friend’s face up close and seeing it in the nightclub.
But a peculiar thing happens if we put the probe in the area of the visual cortex known as IT (the inferotemporal cortex). The patterns there are stable, slow changing and very similar to each other. Your brain has somehow formed the invariant representation of your friend’s face in the IT area from the raw, fast changing data coming out of V1.
It does this through a hierarchy. Information flows up the hierarchy, and back down, as we will learn in the next article.
It has long been known that the visual cortex is laid out in a hierarchy. The neurons in V1 fire when certain line segments appear in the visual field. One set of neurons might fire when it sees a horizontal line, while another set fires when it sees a line at, say, 45 degrees. V2 cells fire when they see shapes like circles, boxes and stars. It’s not until you get to IT that you see cells firing for things like a car, a tree or a face.
These are fast changing, low level patterns transitioning into slow changing, high level patterns. The cortex forms sequences of sequences, or invariant representations of other invariant representations as information climbs the cortical hierarchy.
This is our goal – to identify a tree, car or any real world object by forming an invariant representation of it, and doing so in a hierarchical form. This is not easy, and has never been successfully demonstrated before. If someone can figure this out, it would be a monumental step forward in computer technology.
Modeling Hawkins’ Theory in Silicon
Each level of the hierarchy has only three jobs – to identify repeating patterns, assign each pattern a name, and pass that name on to the next level in the hierarchy.
The primary tier (like V1) sees the pattern 10100101 repeating often, so it gives it the name 56a and passes only that name to the next level. The next level sees the pattern of 34a, 56a and 12a repeating often, so it gives this pattern the name 866b and passes only that name to the next level up. That level sees the pattern 845b, 567b, 866b and 435b repeating often, so it gives it the name 7656c and passes it up. This process continues until a steady invariant representation of the real world object is formed.
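To make the mechanism concrete, here is a minimal sketch of one tier in C++. The class name, the repeat threshold and the sequential naming scheme are all assumptions made for illustration; the theory itself only demands the three jobs described above.

```cpp
#include <map>
#include <string>

// A minimal sketch of one tier in the hierarchy (the threshold and
// naming scheme are illustrative assumptions). Each tier has exactly
// three jobs: spot repeating input patterns, assign a frequent
// pattern a name, and pass that name up.
class Tier {
public:
    Tier(char suffix, int threshold) : suffix_(suffix), threshold_(threshold) {}

    // Feed one input pattern -- a raw bit string like "10100101" on the
    // primary tier, or a concatenation of names from the tier below.
    // Returns the pattern's name once it has repeated often enough,
    // or an empty string while the tier is still learning.
    std::string observe(const std::string& pattern) {
        auto it = names_.find(pattern);
        if (it != names_.end()) return it->second;   // already named
        if (++counts_[pattern] >= threshold_) {      // repeated enough?
            std::string name = std::to_string(nextId_++) + suffix_;
            names_[pattern] = name;                  // e.g. "1a", "2a"...
            return name;                             // pass up the hierarchy
        }
        return "";                                   // not yet significant
    }

private:
    char suffix_;                               // 'a' on the primary tier, 'b' above...
    int threshold_;                             // repeats required before naming
    int nextId_ = 1;
    std::map<std::string, int> counts_;         // pattern -> times seen
    std::map<std::string, std::string> names_;  // pattern -> assigned name
};
```

Notice that a tier never knows whether its input is raw sensor data or names from the tier below – the same code can run at every level, which is what makes the hierarchy possible.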
Let’s work through an example of identifying a simple shape, such as a square. Imagine that whenever a horizontal line is in the field of view of our camera, the pattern 11011101 appears on our ADC. We see this pattern a lot over a period of time as the square stays in the field of view, so we assign it the name 6A and pass it up to Tier Three of the hierarchy. The same process takes place for the other three lines of the square.
It is critical to understand that the ONLY thing Tier Three sees are the names passed up from Tier Four. Tier Three then does much the same thing Tier Four did – find repeating patterns, give them a name, and pass that name up. It notices that 6A and the names for the square’s other three lines occur together often, so it assigns the group the name 322B and passes it up to Tier Two. This process is repeated until the invariant representation of the square is created.
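Using the hypothetical Tier class sketched above, the square example might play out like this. The bit patterns, the threshold of three repeats and the two-tier wiring are all made up for the demonstration:

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Usage sketch: recognizing a square with two tiers.
// (Reuses the Tier class from the previous listing.)
int main() {
    Tier tierFour('A', 3);   // sees raw ADC patterns, like V1
    Tier tierThree('B', 3);  // sees ONLY the names Tier Four passes up

    // The four line segments of the square, as hypothetical ADC patterns.
    const std::vector<std::string> lines = {
        "11011101", "01110110", "10011001", "00101011"};

    // The square stays in the camera's view, so the same four
    // patterns repeat frame after frame.
    for (int frame = 1; frame <= 5; ++frame) {
        std::string namesSeen;                   // what Tier Four hands upward
        for (const std::string& line : lines) {
            std::string name = tierFour.observe(line);
            if (!name.empty()) namesSeen += name + " ";
        }
        if (namesSeen.empty()) continue;         // Tier Four still learning
        // Tier Three treats the co-occurring names as a single pattern
        // and, once it repeats, names the whole group -- the square.
        std::string square = tierThree.observe(namesSeen);
        if (!square.empty())
            std::printf("frame %d: square recognized as %s\n",
                        frame, square.c_str());
    }
    return 0;
}
```

On the third frame each line pattern has repeated enough to earn a name, and two frames later the group of names itself repeats enough for Tier Three to name the square.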
Let this sink in, and in the next article we will explore the role of feedback in the hierarchy, and how it can theoretically be combined with prediction to create an artificial intelligence.
None of this is possible, however, without getting the theory onto hardware and into code. Now the onus is on you: how would you program an Arduino to implement this theory in hardware and software?
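As one possible starting point, and nothing more, here is a rough Arduino-flavored skeleton of the primary tier. Every detail is an assumption: eight digital pins standing in for the 8-bit ADC pattern, a small fixed table in place of std::map (most Arduino boards lack the C++ standard library), and the serial port playing the part of the next tier up.

```cpp
// Hypothetical Arduino starting point (all pin choices, sizes and
// thresholds are assumptions). Eight digital inputs stand in for the
// 8-bit pattern coming off an ADC; a fixed table replaces std::map.
const int PATTERN_PINS[8] = {2, 3, 4, 5, 6, 7, 8, 9};
const int THRESHOLD = 3;        // repeats before a pattern earns a name
const int MAX_PATTERNS = 16;    // tiny memory budget on an Uno

byte patterns[MAX_PATTERNS];    // observed patterns
int  counts[MAX_PATTERNS];      // how often each has been seen
int  used = 0;

void setup() {
  Serial.begin(9600);
  for (int i = 0; i < 8; i++) pinMode(PATTERN_PINS[i], INPUT);
}

void loop() {
  // Read the current 8-bit pattern from the pins.
  byte pattern = 0;
  for (int i = 0; i < 8; i++)
    if (digitalRead(PATTERN_PINS[i]) == HIGH) pattern |= (1 << i);

  // Find or add the pattern in the table (the tier's "memory").
  int slot = -1;
  for (int i = 0; i < used; i++)
    if (patterns[i] == pattern) { slot = i; break; }
  if (slot < 0 && used < MAX_PATTERNS) {
    slot = used++;
    patterns[slot] = pattern;
    counts[slot] = 0;
  }

  // Once a pattern repeats often enough, report its name upward.
  // Here "upward" is just the serial port; a real build might feed
  // a second tier running on the same chip.
  if (slot >= 0 && ++counts[slot] == THRESHOLD) {
    Serial.print("named pattern ");
    Serial.print(pattern, BIN);
    Serial.print(" as ");
    Serial.print(slot + 1);
    Serial.println("A");
  }
  delay(100);
}
```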