I’d like to start this post with a demonstration about human learning and generalization. Below is a training set which has been grouped into YES and NO images. Look at these, then decide which image from the test set should also be labelled YES.
That shouldn’t have been too difficult. In fact, I suspect even those kids in the training set would notice that the YES instances all contain two of the same thing. One might assume that learning this categorisation would come naturally to a Convolutional Neural Network. All the better, one could even start with a pretrained model already able to recognise the main objects in an image, then simply train a couple more layers on top for the final YES / NO classification. In truth, the matter is more subtle.
The reason a neural network succeeds at object recognition is that we specifically architect it for the job, building in our prior knowledge to guide the class of functions it can learn. We constrain it to use convolutions - in essence, to look for structures which are built up compositionally from parts, and which may be seen in many different locations. Without these constraints on the kinds of patterns it should look for, the original AlexNet network would have totalled around 80 billion parameters and learned nothing. Adding a little prior knowledge about images into the network’s architecture is precisely what allows it to generalize from ‘only’ a thousand examples per class.
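To make the effect of weight sharing concrete, here is a rough back-of-the-envelope sketch for a single layer. The layer sizes are my own assumptions, modelled on AlexNet's first layer (a 227x227x3 input, 96 filters of size 11x11, a 55x55x96 output). It does not attempt to reproduce the 80 billion figure above, only the size of the gap for one layer.

```python
# Rough parameter count for one layer, with and without weight sharing.
# All sizes are assumptions modelled on AlexNet's first layer.

in_h, in_w, in_c = 227, 227, 3      # input image: height, width, channels
out_h, out_w, out_c = 55, 55, 96    # output feature map: height, width, channels
kernel = 11                         # each filter is 11x11 pixels

# Convolution: each output channel reuses one small filter at every location.
conv_params = kernel * kernel * in_c * out_c + out_c  # weights + biases

# Fully connected: every output unit has its own weight for every input value.
n_inputs = in_h * in_w * in_c
n_outputs = out_h * out_w * out_c
dense_params = n_inputs * n_outputs + n_outputs       # weights + biases

print(f"convolutional layer:   {conv_params:,} parameters")
print(f"fully connected layer: {dense_params:,} parameters")
print(f"ratio: roughly {dense_params // conv_params:,}x")
```

Under these assumptions the unconstrained layer needs tens of billions of parameters where the convolutional one needs tens of thousands, which is the sense in which the prior does most of the work.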
Unfortunately, this particular architecture might not do so well at the task above, even if it were given a thousand YES / NO images to train its upper layers on. The convolutional structure is designed to learn about 2-dimensional arrangements of parts, but it has no baked-in preference for what those parts are or how they relate to each other. Specifically, there is nothing in its structure that makes sameness of parts a meaningful concept. Thus, this network could easily be trained to recognise the two-ducks pattern or the two-monkeys pattern, but would still have no reason to see a common property between them and generalize this to a new example of two otters.
If we wanted to train a neural network on the task above then we could easily modify the convolutional structure to understand ‘sameness’, but that is exactly my point. There is no such thing as general purpose induction, and if we want our models to generalize from data the way we do then they need to have the same inductive biases we have. That’s difficult because we have an awful lot of them.
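A toy sketch may help show what an inductive bias buys. Neither learner below is a neural network, and both class names are hypothetical; they simply stand in for a model with no notion of sameness and a model with that notion built into its hypothesis space. Images are reduced to pairs of object labels, which is an assumption made purely for illustration.

```python
# Two learners see the same training data but carry different built-in biases.
train = [
    (("duck", "duck"), True),
    (("monkey", "monkey"), True),
    (("duck", "monkey"), False),
    (("monkey", "kid"), False),
]

class Memorizer:
    """Remembers which exact pairs were labelled YES; nothing links one pair to another."""

    def fit(self, data):
        self.yes_pairs = {pair for pair, label in data if label}
        return self

    def predict(self, pair):
        return pair in self.yes_pairs

class SamenessLearner:
    """Its only feature is built in: are the two parts equal?"""

    def fit(self, data):
        # Learn by majority vote which label goes with 'same' and with 'different'.
        votes = {True: [], False: []}
        for (a, b), label in data:
            votes[a == b].append(label)
        self.label_for = {same: sum(v) * 2 > len(v) for same, v in votes.items()}
        return self

    def predict(self, pair):
        a, b = pair
        return self.label_for[a == b]

memorizer = Memorizer().fit(train)
sameness = SamenessLearner().fit(train)

new_pair = ("otter", "otter")  # objects never seen during training
print("memorizer:", memorizer.predict(new_pair))
print("sameness: ", sameness.predict(new_pair))
```

Both learners fit the training set perfectly; only the one whose structure makes sameness expressible carries that fit over to the otters.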
Evolution has not just endowed our brain with a stew of 86 billion neurons (impressive, albeit only a third as many as an elephant’s). It has also shaped the architecture to an enormous degree so that, before even looking at the world, we come prepared with just the right structure for learning about certain things. Just as CNNs are born to learn 2-dimensional arrangements of parts, our brains seem innately primed for understanding such concepts as motion, faces, spatial layout of the environment, motor control, people and animals, speech, language, and even thinking about thinking.
In fact, an increasingly compelling body of neuroimaging studies has found brain regions specific to each of these, and some abilities such as face detection and imitation may even be present at birth. This of course does not discredit the idea that such abilities can be learned from enough data with fairly general purpose tools, but it does suggest that evolution may have given us a serious kick-start in these areas.
A ten-minute old infant imitates his father sticking out his tongue
If we are to create truly intelligent machines, we will need a way to build much richer structure into our models than is found in CNNs - the kind of structure that we are born with. This is a tremendously difficult problem which, I believe, will need to be attacked jointly from two sides:
1. Use our prior knowledge to design new model structures
This knowledge may come from our intuitions, or from research in cognitive science and neuroscience about how the brain solves certain problems. In either case, how we can bake such knowledge into our models is heavily shaped by the tools we use to build them.
One toolbox we have available includes probability models, programming languages, and especially probabilistic programming languages. These provide a powerful way to write down all of our prior knowledge in a high-level and expressive language, so that our models can generalize well from just a few examples. Brenden Lake’s Science paper on one-shot learning is a terrific display of what probability models can achieve where deep learning struggles. Of course the difficulty with these models is that, as they grow in complexity, actually using them for inference can become prohibitively slow. Thus, finding fast approximate inference techniques is an important area of research - one with some exciting frontiers that I hope to explore in my next post.
Deep learning takes the opposite approach: the fast learning rule of stochastic gradient descent allows one to provide much weaker priors, so long as one is willing to make up for it with data. Unfortunately, this approach also makes it very difficult to include stronger prior knowledge when we'd like to. We are lucky that convolutions are such a natural way to give networks some basic rules¹ for image recognition (so natural that the idea is much older than I am), yet it is much less obvious what kind of architectural decisions might guide a network to understand objects in terms of moveable parts, joints and surfaces, for example. This difficulty of incorporating prior knowledge into neural networks is one of their biggest weaknesses.
2. Build systems which discover structure from data
The probabilistic learning community has produced some wonderful research in this area. I have a particular fondness for Roger Grosse’s work on matrix decompositions, which can discover a wide range of model structures by composing simpler elements together. The Automatic Statistician takes a similar approach with Gaussian Process kernels, and uses this to produce beautiful reports of the patterns in any given dataset, in PDF format. Still, both systems are fairly coarse-grained, making use of just a few large building blocks, and so are somewhat restricted in the type of model structures they can learn.
When the building blocks are more primitive this idea starts to look more like ‘program synthesis’ - automatically learning the code of a (probabilistic) program from examples of desired output - and it typically involves a stochastic search not unlike the mechanism of evolution. This is very difficult to do well from scratch, but if enough prior knowledge of a program’s overall structure is included (a technique called sketching) then it is possible to successfully fill in the blanks.
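As a minimal sketch of the 'fill in the blanks' idea, the toy below fixes a program's overall shape and searches only over its holes. Everything here is an assumption made for illustration: the sketch f(x) = (x OP1 C1) OP2 C2, the three operators, the constant range, and the names run and synthesize. It also swaps the stochastic search described above for plain exhaustive enumeration, which is only feasible because the space is tiny.

```python
import itertools
import operator

# Input/output examples of the desired behaviour (generated here by 2*x + 3).
examples = [(1, 5), (2, 7), (5, 13)]

# The sketch fixes the overall shape: f(x) = (x OP1 C1) OP2 C2.
# Only the four holes OP1, C1, OP2, C2 are left for the search to fill in.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}
CONSTANTS = range(0, 6)

def run(program, x):
    """Evaluate a filled-in sketch on input x."""
    op1, c1, op2, c2 = program
    return OPS[op2](OPS[op1](x, c1), c2)

def synthesize(examples):
    """Return the first hole assignment consistent with every example, or None."""
    for program in itertools.product(OPS, CONSTANTS, OPS, CONSTANTS):
        if all(run(program, x) == y for x, y in examples):
            return program
    return None

program = synthesize(examples)
if program is not None:
    op1, c1, op2, c2 = program
    print(f"f(x) = (x {op1} {c1}) {op2} {c2}")
    print("f(10) =", run(program, 10))
```

The sketch leaves only 3 x 6 x 3 x 6 = 324 candidates to check. Without it, the search would have to range over program shapes as well, which is where the real difficulty lies.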
To learn a similar structure for a neural network, one could again define a constrained set of architectures and search over this space. However, what I find much more interesting is a new direction of research, aiming to abstract away the structure of the computation itself.
The Neural Programmer-Interpreter is one such model. A recurrent neural network is given access to a persistent memory module and uses it to store 'programs' in a kind of learnable code. The network is then able to read these programs and execute them on new inputs, train them using supervision at various levels of abstraction, and compose them together into new larger programs. By encoding the computation this way, rather than as weights in a fixed graph, the network can learn not just the parameters of an algorithm but also its structure. In a strong sense this is the neural analog of program synthesis, and I find it an exciting direction.
How can we build a mind with the right structure to learn about the world? I'm sure this great challenge will require many breakthroughs over many decades, but I have a feeling that in the next few years we will finally have the right tools to at least get a handle on it: systems which can integrate our rich prior knowledge of the world when we provide it, while teaching themselves from data when we don’t.
¹ Formally, features learned by a CNN are translation equivariant. Cohen and Welling show that a generalisation of CNNs can provide equivariance to the larger group of translations, axis-aligned reflections and 90-degree rotations.
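Translation equivariance can be checked directly on a small example. The sketch below is a simplification under stated assumptions: it works in one dimension rather than two, and it uses wrap-around (circular) boundaries so that the property holds exactly. The helper names shift and conv1d are my own.

```python
def shift(signal, k):
    """Circularly shift a list k places to the right."""
    k %= len(signal)
    return signal[-k:] + signal[:-k] if k else list(signal)

def conv1d(signal, kernel):
    """Slide the kernel over the signal with wrap-around; output has the input's length."""
    n = len(signal)
    return [
        sum(kernel[j] * signal[(i + j) % n] for j in range(len(kernel)))
        for i in range(n)
    ]

signal = [0, 0, 1, 3, 1, 0, 0, 0]   # a small 'bump' to be detected
kernel = [1, -2, 1]                 # an arbitrary filter

# Equivariance: shifting the input and then filtering gives the same result
# as filtering first and then shifting the output by the same amount.
shift_then_conv = conv1d(shift(signal, 2), kernel)
conv_then_shift = shift(conv1d(signal, kernel), 2)
print(shift_then_conv == conv_then_shift)  # prints True
```

In other words, a feature detector learned at one position responds identically at every other position, just displaced, which is the 'basic rule' that convolutions hand to the network for free.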