The Unreasonable Reputation of Neural Networks

January 12, 2016

It is hard not to be enamoured by deep learning nowadays, watching neural networks show off their endless accumulation of new tricks. There are, as I see it, at least two good reasons to be impressed:

(1) Neural networks can learn to model many natural functions well, from weak priors.
The idea of marrying hierarchical, distributed representations with fast, GPU-optimised gradient calculations has turned out to be very powerful indeed. The early days of neural networks saw problems with local optima, but the ability to train deeper networks has solved this and allowed backpropagation to shine through. After baking in a small amount of domain knowledge through simple architectural decisions, deep learning practitioners now find themselves with a powerful class of parameterised functions and a practical way to optimise them.

The first such architectural decisions were the use of either convolutions or recurrent structure, to imbue models with spatial and temporal invariances. From this alone, neural networks excelled in image classification, speech recognition, machine translation, Atari games, and many more domains. More recently, mechanisms for top-down attention over inputs have shown their worth in image and natural language tasks, while differentiable memory modules such as tapes and stacks have even enabled networks to learn the rules of simple algorithms from only input-output pairs.

(2) Neural networks can learn surprisingly useful representations.
While the community still waits eagerly for unsupervised learning to bear fruit, deep supervised learning has shown an impressive aptitude for building generalisable and interpretable features. That is to say, the features learned when a neural network is trained to predict P(y|x) often turn out to be both semantically interpretable and useful for modelling some other related function P(z|x).

As just a few examples of this:

  • Units of a convolutional network trained to classify scenes often learn to detect specific objects in those scenes (such as a lighthouse), even though they were not explicitly trained to do so (Zhou et al., 2015)
  • Correlations in the bottom layers of an image classification network provide a surprisingly good signature for the artistic style of an image, allowing new images to be synthesised using the content of one and the style of another (Gatys et al., 2015)
  • A recurrent network [correction below] taught to predict missing words from sentences learns meaningful word embeddings, where simple vector arithmetic can be used to find semantic analogies. For example:
    • v_king - v_man + v_woman ≈ v_queen
    • v_Paris - v_France + v_Italy ≈ v_Rome
    • v_Windows - v_Microsoft + v_Google ≈ v_Android
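The analogy trick above amounts to a simple nearest-neighbour search in embedding space. A minimal sketch in plain Python, using tiny hand-made vectors purely for illustration (real embeddings would be learned by a model such as word2vec and have hundreds of dimensions):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def analogy(a, b, c, vectors):
    """Return the word whose vector is closest to vec(a) - vec(b) + vec(c),
    excluding the three query words themselves."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

# Tiny hand-constructed vectors, purely illustrative -- not from any real model.
toy = {
    "king":  [0.9, 0.8, 0.1],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "queen": [0.9, 0.0, 0.9],
    "apple": [0.0, 0.2, 0.1],
}

print(analogy("king", "man", "woman", toy))  # -> queen
```

With real embeddings the same few lines of arithmetic recover the analogies listed above, which is precisely what makes the result so striking.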

I have no doubt that the next few years will see neural networks turn their attention to yet more tasks, integrate themselves more deeply into industry, and continue to impress researchers with new superpowers. This is all well justified, and I have no intention to belittle the current and future impact of deep learning; however, the optimism about just what these models can achieve in terms of intelligence has been worryingly reminiscent of the 1960s.


Extrapolating from the last few years’ progress, it is enticing to believe that Deep Artificial General Intelligence is just around the corner, and that just a few more architectural tricks, bigger datasets and faster hardware are all that is required to take us there. I feel that there are a couple of solid reasons to be much more sceptical.

To begin with, it is a bad idea to intuit how broadly intelligent a machine must be, or have the capacity to be, based solely on a single task. The checkers-playing machines of the 1950s amazed researchers and many considered these a huge leap towards human-level reasoning, yet we now appreciate that achieving human or superhuman performance in this game is far easier than achieving human-level general intelligence. In fact, even the best humans can easily be defeated by a search algorithm with simple heuristics. The development of such an algorithm probably does not advance the long term goals of machine intelligence, despite the exciting intelligent-seeming behaviour it gives rise to, and the same could be said of much other work in artificial intelligence such as the expert systems of the 1980s. Human or superhuman performance in one task is not necessarily a stepping-stone towards near-human performance across most tasks.
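The "search algorithm with simple heuristics" alluded to above really can be small. A depth-limited negamax sketch makes the point; here it is applied to a toy subtraction game rather than checkers (the game, move generator and evaluation function are all illustrative stand-ins, not any historical program):

```python
def negamax(state, depth, moves, apply_move, evaluate):
    """Value of `state` for the player to move: search `depth` plies,
    then fall back to the heuristic `evaluate` at the frontier."""
    legal = moves(state)
    if depth == 0 or not legal:
        return evaluate(state)
    return max(-negamax(apply_move(state, m), depth - 1,
                        moves, apply_move, evaluate)
               for m in legal)

# Toy game: players alternately remove 1-3 stones from a pile;
# whoever takes the last stone wins.
moves = lambda n: [m for m in (1, 2, 3) if m <= n]
apply_move = lambda n, m: n - m
evaluate = lambda n: -1 if n == 0 else 0  # empty pile: player to move has lost

print(negamax(4, 10, moves, apply_move, evaluate))  # -> -1 (losing position)
print(negamax(5, 10, moves, apply_move, evaluate))  # ->  1 (winning position)
```

A handful of lines of exhaustive search plays this game perfectly, and essentially the same recipe, scaled up with a hand-crafted evaluation function, is what beat human checkers players. Nothing in it resembles general reasoning.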

By the same token, the ability of neural networks to learn interpretable word embeddings, say, does not remotely suggest that they are the right kind of tool for a human-level understanding of the world. It is impressive and surprising that these general-purpose, statistical models can learn meaningful relations from text alone, without any richer perception of the world, but this may speak much more about the unexpected ease of the task itself than it does about the capacity of the models. Just as checkers can be won through tree-search, so too can many semantic relations be learned from text statistics. Both produce impressive intelligent-seeming behaviour, but neither necessarily paves the way towards true machine intelligence.

I’d like to reflect on specifically what neural networks are good at, and how this relates to human intelligence. Deep learning has produced amazing discriminative models, generative models and feature extractors, but common to all of these is the use of a very large training dataset. Its place in the world is as a powerful tool for general-purpose pattern recognition, in situations where both n (the number of training examples) and d (their dimensionality) are large. Very possibly it is the best tool for working in this paradigm.

This is a very good fit for one particular class of problems that the brain solves: finding good representations to describe the constant and enormous flood of sensory data it receives. Before any sense can be made of the environment, the visual and auditory systems need to fold, stretch and twist this data from raw pixels and waves into a form that better captures the complex statistical regularities in the signal*. Whether this is learned from scratch or handed down as a gift from evolution, the brain solves this problem adeptly - and there is even recent evidence that the representations it finds are not too dissimilar from those discovered by a neural network. I contend that deep learning may well provide a fantastic starting point for many problems in perception.

That said, this high n, high d paradigm is a very particular one, and is not the right environment to describe a great deal of intelligent behaviour. The many facets of human thought include planning towards novel goals, inferring others' goals from their actions, learning structured theories to describe the rules of the world, inventing experiments to test those theories, and learning to recognise new object kinds from just one example. Very often they involve principled inference under uncertainty from few observations. For all the accomplishments of neural networks, it must be said that they have only ever proven their worth at tasks fundamentally different from those above. If they have succeeded in anything superficially similar, it has been because they saw many hundreds of times more examples than any human ever needed to.

Deep learning has brought us one branch higher up the tree towards machine intelligence and a wealth of different fruit is now hanging within our grasp. While the ability to learn good features in high dimensional spaces from weak priors with lots of data is both new and exciting, we should not fall into the trap of thinking that most of the problems an intelligent agent faces can be solved in this way. Gradient descent in neural networks may well play a big part in helping to build the components of thinking machines, but it is not, itself, the stuff of thought.

This has been the first post of what I hope will become a long-running blog, where I'll structure and share my thoughts about machine intelligence. Later posts are likely to address how to define intelligence, what might be the most promising routes to building it, and where we can draw inspiration. I hope it will serve as a platform for discussion too, so please go ahead and rebuke me in the comments below. And subscribe!
Correction: The model used to produce word analogies was actually a log linear skip-gram model, trained to discriminate nearby word pairs from negative samples (Mikolov et al., 2013). Many thanks to fnl for pointing this out.
*One suggested principle for this is the Efficient coding hypothesis, which ties into information theory and data compression.



Excellent Analysis

Great assessment, and a well-written analysis.
I tend to agree with your conclusion that while deep learning is a useful tool and a big step towards AGI, it is not the be-all-end-all of an intelligent system. There are other missing parts. I'm curious - you mentioned one-shot learning and unsupervised learning as potential next steps. Do you think there is a clear road map to develop deep learning frameworks capable of those, or will it require A) "breakthroughs" or B) much more computational power?

I disagree

I disagree. To be clear, I am not saying that deep learning is going to lead to solving general intelligence, but I think there is a possibility that it could.

> This high n, high d paradigm is a very particular one, and is not the right environment to describe a great deal of intelligent behaviour.

It is true that deep learning methods are very data hungry, but there have been some advances in unsupervised, semi-supervised and transfer learning recently. Ladder networks, for one, are getting 1% error using only 10 labeled examples per class on MNIST.

I am not familiar with the term "high D", but I am assuming it stands for high input dimensionality. I don't think NLP tasks such as machine translation can be described as having high input dimensionality.

> Many semantic relations [can] be learned from text statistics. [They] produce impressive intelligent-seeming behaviour, but [don't] necessarily pave the way towards true machine intelligence.

Nothing "necessarily paves the way towards true machine intelligence". But if you look at Google's Neural Conversations paper you will see that the model learned to answer questions using common sense reasoning. I don't think that can be written off easily as corpus statistics. It requires combining information in new ways. In my opinion it is a (very tiny) step towards intelligence.

I believe that models we have currently are analogous to dedicated circuits in a computer chip. They can only do what they are trained/designed to do. General intelligence requires CPU-like models that can load different programs and modify their own programs. The training objective would be some combination of supervised, unsupervised and reinforcement learning.

I posted the same comment on reddit.

links please

Can you please provide links to the papers and theories you mention in your post? It is all very interesting.

Mikolov et al. isn't an RNN

> A recurrent network taught to predict missing words from sentences learns meaningful word embeddings, where simple vector arithmetic can be used to find semantic analogies (Mikolov et al., 2013).

Wrong. The network described there isn't an RNN. For more details, see Yoav Goldberg, who gives an excellent introduction to understanding the factorization presented in said paper.

Unexpected analogies

The early versions of Google Translate would translate Blair in English to Sarkozy in French and Kohl in German.
Apparently they used a lot of official EU documents as parallel texts.