Things have gotten freaky. A few years ago, Google showed us that neural networks’ dreams are the stuff of nightmares, but more recently we’ve seen them used for giving game character movements that are indistinguishable from that of humans, for creating photorealistic images given only textual descriptions, for providing vision for self-driving cars, and for much more.
Being able to do all this well, and in some cases better than humans, is a recent development. Creating photorealistic images is only a few months old. So how did all this come about?
Perceptrons: The 40s, 50s And 60s
We begin in the middle of the 20th century. One popular type of early neural network at the time attempted to mimic the neurons in biological brains using an artificial neuron called a perceptron. We’ve already covered perceptrons here in detail in a series of articles by Al Williams, but briefly, a simple one looks as shown in the diagram.
Given input values, weights, and a bias, it produces an output that’s either 0 or 1. Suitable values can be found for the weights and bias that make a NAND gate work. But for reasons detailed in Al’s article, for an XOR gate you need more layers of perceptrons.
In a famous 1969 paper called “Perceptrons”, Minsky and Papert pointed out the various conditions under which perceptrons couldn’t provide the desired solutions for certain problems. However, the conditions they pointed out applied only to the use of a single layer of perceptrons. It was known at the time, and even mentioned in the paper, that by adding more layers of perceptrons between the inputs and the output, called hidden layers, many of those problems, including XOR, could be solved.
Despite this way around the problem, their paper discouraged many researchers, and neural network research faded into the background for a decade.
Backpropagation And Sigmoid Neurons: The 80s
In 1986 neural networks were brought back to popularity by another famous paper called “Learning internal representations by error propagation” by David Rummelhart, Geoffrey Hinton and R.J. Williams. In that paper they published the results of many experiments that addressed the problems Minsky talked about regarding single layer perceptron networks, spurring many researchers back into action.
Also, according to Hinton, still a key figure in the area of neural networks today, Rummelhart had reinvented an efficient algorithm for training neural networks. It involved propagating back from the outputs to the inputs, setting the values for all those weights using something called a delta rule.
The set of calculations for setting the output to either 0 or 1 shown in the perceptron diagram above is called the neuron’s activation function. However, for Rummelhart’s algorithm, the activation function had to be one for which a derivative exists, and for that they chose to use the sigmoid function (see diagram).
And so, gone was the perceptron type of neuron whose output was linear, to be replaced by the non-linear sigmoid neuron, still used in many networks today. However, the term Multilayer Perceptron (MLP) is often used today to refer not to the network containing perceptrons discussed above but to the multilayer network which we’re talking about in this section with it’s non-linear neurons, like the sigmoid. Groan, we know.
Also, to make programming easier, the bias was made a neuron of its own, typically with a value of one, and with its own weights. That way its weights, and hence indirectly its value, could be trained along with all the other weights.
And so by the late 80s, neural networks had taken on their now familiar shape and an efficient algorithm existed for training them.
Convoluting And Pooling
In 1979 a neural network called Neocognitron introduced the concept of convolutional layers, and in 1989, the backpropagation algorithm was adapted to train those convolutional layers.
What does a convolutional layer look like? In the networks we talked about above, each input neuron has a connection to every hidden neuron. Layers like that are called fully connected layers. But with a convolutional layer, each neuron in the convolutional layer connects to only a subset of the input neurons. And those subsets usually overlap both horizontally and vertically. In the diagram, each neuron in the convolutional layer is connected to a 3×3 matrix of input neurons, color-coded for clarity, and those matrices overlap by one.
This 2D arrangement helps a lot when trying to learn features in images, though their use isn’t limited to images. Features in images occupy pixels in a 2D space, like the various parts of the letter ‘A’ in the diagram. You can see that one of the convolutional neurons is connected to a 3×3 subset of input neurons that contain a white vertical feature down the middle, one leg of the ‘A’, as well as a shorter horizontal feature across the top on the right. When training on numerous images, that neuron may become trained to fire strongest when shown features like that.
But that feature may be an outlier case, not fitting well with most of the images the neural network would encounter. Having a neuron dedicated to an outlier case like this is called overfitting. One solution is to add a pooling layer (see the diagram). The pooling layer pools together multiple neurons into one neuron. In our diagram, each 2×2 matrix in the convolutional layer is represented by one element in the pooling layer. But what value goes in the pooling element?
In our example, of the 4 neurons in the convolutional layer that correspond to that pooling element, two of them have learned features of white vertical segments with some white across the top. But one of them encounters this feature more often. When that one encounters a vertical segment and fires, it will have a greater value than the other. So we put that greater value in the corresponding pooling element. This is called max pooling, since we take the maximum value of the 4 possible values.
Notice that the pooling layer also reduces the size of the data flowing through the network without losing information, and so it speeds up computation. Max pooling was introduced in 1992 and has been a big part of the success of many neural networks.
A deep neural network is one that has many layers. As our own Will Sweatman pointed out in his recent neural networking article, going deep allows for layers nearer to the inputs to learn simple features, as with our white vertical segment, but layers deeper in will combine these features into more and more complex shapes, until we arrive at neurons that represent entire objects. In our example when we show it an image of a car, neurons that match the features in the car fire strongly, until finally the “car” output neuron spits out a 99.2% confidence that we showed it a car.
Many developments have contributed to the current success of deep neural networks. Some of those are:
- the introduction starting in 2010 of the ReLU (Rectified Linear Unit) as an alternative activation function to the sigmoid. See the diagram for ReLU details. The use of ReLUs significantly sped up training. Barring other issues, the more training you do, the better the results you get. Speeding up training allows you to do more.
- the use of GPUs (Graphics Processing Units). Starting in 2004 and being applied to convolutional neural networks in 2006, GPUs were put to use doing the matrix multiplication involved when multiplying neuron firing values by weight values. This too speeds up training.
- the use of convolutional neural networks and other technique to minimize the number of connections as you go deeper. Again, this too speeds up training.
- the availability of large training datasets with tens and hundreds of thousands of data items. Among other things, this helps with overfitting (discussed above).
To give you some idea of just how complex these deep neural networks can get, shown here is Google’s Inception v3 neural network written in their TensorFlow framework. The first version of this was the one responsible for Google’s psychedelic deep dreaming. If you look at the legend in the diagram you’ll see some things we’ve discussed, as well as a few new ones that have made a significant contribution to the success of neural networks.
The example shown here started out as a photo of a hexacopter in flight with trees in the background. It was then submitted to the deep dream generator website, which produced the image shown here. Interestingly, it replaced the propellers with birds.
By 2011, convolutional neural networks with max pooling, and running on GPUs had achieved better-than-human visual pattern recognition on traffic signs with a recognition rate of 98.98%.
Processing And Producing Sequences – LSTMs
The Long Short Term Memory (LSTM) neural network is a very effective form of Recurrent Neural Networks (RNN). It’s been around since 1995 but has undergone many improvements over the years. These are the networks responsible for the amazing advancements in speech recognition, producing captions for images, producing speech and music, and more. While the networks we talked about above were good for seeing a pattern in a fixed size piece of data such as an image, LSTMs are for pattern recognition in a sequence of data or for producing sequences of data. Hence, they do speech recognition, or produce sentences.
They’re typically depicted as a cell containing different types of layers and mathematical operations. Notice that in the diagram, the cell points back to itself, hence the name Recurrent neural network. That’s because when an input arrives, the cell produces an output, but also information that’s passed back in for the next time input arrives. Another way of depicting it is by showing the same cell but at different points in time — the multiple cells with arrows showing data flow between them are really the same cell with data flowing back into it. In the diagram, the example is one where we give an encoder cell a sequence of words, one at a time, the result eventually going to a “thought vector”. That vector then feeds the decoder cell which outputs a suitable response, one word at a time. The example is of Google’s Smart Reply feature.
LSTMs can be used for analysing static images though, and with an advantage over the other types of networks we’ve see so far. If you’re looking at a static image containing a beach ball, you’re more likely to decide it’s a beach ball rather than a basket ball if you’re viewing the image as just one frame of a video about a beach party. An LSTM will have seen all the frames of the beach party leading up to the current frame of the beach ball and will use what it’s previously seen to make its evaluation about the type of ball.
Generating Images With GANs
Perhaps the most recent neural network architecture that’s giving freaky results are really two networks competing with each other, the Generative Adversarial Networks (GANs), invented in 2014. The term, generative, means that one network generates data (images, music, speech) that’s similar to the data it’s trained on. This generator network is a convolutional neural network. The other network is called the discriminator and is trained to tell whether an image is real or generated. The generator gets better at fooling the discriminator, while the discriminator gets better at not being fooled. This adversarial competition produces better results than having just a generator.
In late 2016, one group improved on this further by using two stacked GANs. Given a textual description of the desired image, the Stage-I GAN produces a low resolution image missing some details (e.g. the beak and eyes on birds). This image and the textual description are then passed to the Stage-II GAN which improves the image further, including adding the missing details, and resulting in a higher resolution, photo-realistic image.
And there are many more freaky results announced every week. Neural network research is at the point where, like scientific research, so much is being done that it’s getting hard to keep up. If you’re aware of any other interesting advancements that I didn’t cover, please let us know in the comments below.