The Unreasonable Reputation of Neural Networks

January 12, 2016

It is hard not to be enamoured by deep learning nowadays, watching neural networks show off their endless accumulation of new tricks. There are, as I see it, at least two good reasons to be impressed:

(1) Neural networks can learn to model many natural functions well, from weak priors.
The idea of marrying hierarchical, distributed representations with fast, GPU-optimised gradient calculations has turned out to be very powerful indeed. The early days of neural networks saw problems with local optima, but the ability to train deeper networks has solved this and allowed backpropagation to shine through. After baking in a small amount of domain knowledge through simple architectural decisions, deep learning practitioners now find themselves with a powerful class of parameterised functions and a practical way to optimise them.

The first such architectural decisions were the use of either convolutions or recurrent structure, to imbue models with spatial and temporal invariances. From this alone, neural networks excelled in image classification, speech recognition, machine translation, Atari games, and many more domains. More recently, mechanisms for top-down attention over inputs have shown their worth in image and natural language tasks, while differentiable memory modules such as tapes and stacks have even enabled networks to learn the rules of simple algorithms from only input-output pairs.

(2) Neural networks can learn surprisingly useful representations.
While the community still waits eagerly for unsupervised learning to bear fruit, deep supervised learning has shown an impressive aptitude for building generalisable and interpretable features. That is to say, the features learned when a neural network is trained to predict P(y|x) often turn out to be both semantically interpretable and useful for modelling some other related function P(z|x).

As just a few examples of this:

  • Units of a convolutional network trained to classify scenes often learn to detect specific objects in those scenes (such as a lighthouse), even though they were not explicitly trained to do so (Zhou et al., 2015)
  • Correlations in the bottom layers of an image classification network provide a surprisingly good signature for the artistic style of an image, allowing new images to be synthesised using the content of one and the style of another (Gatys et al., 2015)
  • A recurrent network [correction below] taught to predict missing words from sentences learns meaningful word embeddings, where simple vector arithmetic can be used to find semantic analogies. For example:
    • v_king - v_man + v_woman ≈ v_queen
    • v_Paris - v_France + v_Italy ≈ v_Rome
    • v_Windows - v_Microsoft + v_Google ≈ v_Android
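The analogy trick above amounts to a simple nearest-neighbour search in embedding space. A minimal sketch in plain Python, using tiny hand-made vectors purely for illustration (real embeddings would be learned by a model such as word2vec and have hundreds of dimensions):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def analogy(a, b, c, vectors):
    """Return the word whose vector is closest to vec(a) - vec(b) + vec(c),
    excluding the three query words themselves."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

# Tiny hand-constructed vectors, purely illustrative -- not from any real model.
toy = {
    "king":  [0.9, 0.8, 0.1],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "queen": [0.9, 0.0, 0.9],
    "apple": [0.0, 0.2, 0.1],
}

print(analogy("king", "man", "woman", toy))  # -> queen
```

With real embeddings the same few lines of arithmetic recover the analogies listed above, which is precisely what makes the result so striking.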

I have no doubt that the next few years will see neural networks turn their attention to yet more tasks, integrate themselves more deeply into industry, and continue to impress researchers with new superpowers. This is all well justified, and I have no intention to belittle the current and future impact of deep learning; however, the optimism about just what these models can achieve in terms of intelligence has been worryingly reminiscent of the 1960s.


Extrapolating from the last few years’ progress, it is enticing to believe that Deep Artificial General Intelligence is just around the corner, and that just a few more architectural tricks, bigger datasets and faster hardware are all that is required to take us there. I feel that there are a couple of solid reasons to be much more sceptical.

To begin with, it is a bad idea to intuit how broadly intelligent a machine must be, or have the capacity to be, based solely on a single task. The checkers-playing machines of the 1950s amazed researchers and many considered these a huge leap towards human-level reasoning, yet we now appreciate that achieving human or superhuman performance in this game is far easier than achieving human-level general intelligence. In fact, even the best humans can easily be defeated by a search algorithm with simple heuristics. The development of such an algorithm probably does not advance the long term goals of machine intelligence, despite the exciting intelligent-seeming behaviour it gives rise to, and the same could be said of much other work in artificial intelligence such as the expert systems of the 1980s. Human or superhuman performance in one task is not necessarily a stepping-stone towards near-human performance across most tasks.
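The "search algorithm with simple heuristics" alluded to above really can be small. A depth-limited negamax sketch makes the point; here it is applied to a toy subtraction game rather than checkers (the game, move generator and evaluation function are all illustrative stand-ins, not any historical program):

```python
def negamax(state, depth, moves, apply_move, evaluate):
    """Value of `state` for the player to move: search `depth` plies,
    then fall back to the heuristic `evaluate` at the frontier."""
    legal = moves(state)
    if depth == 0 or not legal:
        return evaluate(state)
    return max(-negamax(apply_move(state, m), depth - 1,
                        moves, apply_move, evaluate)
               for m in legal)

# Toy game: players alternately remove 1-3 stones from a pile;
# whoever takes the last stone wins.
moves = lambda n: [m for m in (1, 2, 3) if m <= n]
apply_move = lambda n, m: n - m
evaluate = lambda n: -1 if n == 0 else 0  # empty pile: player to move has lost

print(negamax(4, 10, moves, apply_move, evaluate))  # -> -1 (losing position)
print(negamax(5, 10, moves, apply_move, evaluate))  # ->  1 (winning position)
```

A handful of lines of exhaustive search plays this game perfectly, and essentially the same recipe, scaled up with a hand-crafted evaluation function, is what beat human checkers players. Nothing in it resembles general reasoning.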

By the same token, the ability of neural networks to learn interpretable word embeddings, say, does not remotely suggest that they are the right kind of tool for a human-level understanding of the world. It is impressive and surprising that these general-purpose, statistical models can learn meaningful relations from text alone, without any richer perception of the world, but this may speak much more about the unexpected ease of the task itself than it does about the capacity of the models. Just as checkers can be won through tree-search, so too can many semantic relations be learned from text statistics. Both produce impressive intelligent-seeming behaviour, but neither necessarily paves the way towards true machine intelligence.

I’d like to reflect on specifically what neural networks are good at, and how this relates to human intelligence. Deep learning has produced amazing discriminative models, generative models and feature extractors, but common to all of these is the use of a very large training dataset. Its place in the world is as a powerful tool for general-purpose pattern recognition, in situations where both n (the number of training examples) and d (their dimensionality) are large. Very possibly it is the best tool for working in this paradigm.

This is a very good fit for one particular class of problems that the brain solves: finding good representations to describe the constant and enormous flood of sensory data it receives. Before any sense can be made of the environment, the visual and auditory systems need to fold, stretch and twist this data from raw pixels and waves into a form that better captures the complex statistical regularities in the signal*. Whether this is learned from scratch or handed down as a gift from evolution, the brain solves this problem adeptly - and there is even recent evidence that the representations it finds are not too dissimilar from those discovered by a neural network. I contend that deep learning may well provide a fantastic starting point for many problems in perception.

That said, this high n, high d paradigm is a very particular one, and is not the right environment to describe a great deal of intelligent behaviour. The many facets of human thought include planning towards novel goals, inferring others' goals from their actions, learning structured theories to describe the rules of the world, inventing experiments to test those theories, and learning to recognise new object kinds from just one example. Very often they involve principled inference under uncertainty from few observations. For all the accomplishments of neural networks, it must be said that they have only ever proven their worth at tasks fundamentally different from those above. If they have succeeded in anything superficially similar, it has been because they saw many hundreds of times more examples than any human ever needed to.

Deep learning has brought us one branch higher up the tree towards machine intelligence and a wealth of different fruit is now hanging within our grasp. While the ability to learn good features in high dimensional spaces from weak priors with lots of data is both new and exciting, we should not fall into the trap of thinking that most of the problems an intelligent agent faces can be solved in this way. Gradient descent in neural networks may well play a big part in helping to build the components of thinking machines, but it is not, itself, the stuff of thought.

This has been the first post of what I hope will become a long-running blog, where I'll structure and share my thoughts about machine intelligence. Later posts are likely to address how to define intelligence, what might be the most promising routes to building it, and where we can draw inspiration. I hope it will serve as a platform for discussion too, so please go ahead and rebuke me in the comments below. And subscribe!
Correction: The model used to produce word analogies was actually a log linear skip-gram model, trained to discriminate nearby word pairs from negative samples (Mikolov et al., 2013). Many thanks to fnl for pointing this out.
*One suggested principle for this is the Efficient coding hypothesis, which ties into information theory and data compression.



Excellent Analysis

Great assessment, and a well-written analysis.
I tend to agree with your conclusion that while deep learning is a useful tool and a big step towards AGI, it is not the be-all-end-all of an intelligent system. There are other missing parts. I'm curious - you mentioned one-shot learning and unsupervised learning as potential next steps. Do you think there is a clear road map to develop deep learning frameworks capable of those, or will it require A) "breakthroughs" or B) much more computational power?

I disagree

I disagree. To be clear, I am not saying that deep learning is going to lead to solving general intelligence, but I think there is a possibility that it could.

> This high n, high d paradigm is a very particular one, and is not the right environment to describe a great deal of intelligent behaviour.

It is true that deep learning methods are very data hungry, but there have been some advances in unsupervised, semi-supervised and transfer learning recently. Ladder networks, for one, are getting 1% error using only 10 labeled examples per class on MNIST.

I am not familiar with the term "high D", but I am assuming it stands for high input dimensionality. I don't think NLP tasks such as machine translation can be described as having high input dimensionality.

> Many semantic relations [can] be learned from text statistics. [They] produce impressive intelligent-seeming behaviour, but [don't] necessarily pave the way towards true machine intelligence.

Nothing "necessarily paves the way towards true machine intelligence". But if you look at Google's Neural Conversations paper you will see that the model learned to answer questions using common sense reasoning. I don't think that can be written off easily as corpus statistics. It requires combining information in new ways. In my opinion it is a (very tiny) step towards intelligence.

I believe that models we have currently are analogous to dedicated circuits in a computer chip. They can only do what they are trained/designed to do. General intelligence requires CPU-like models that can load different programs and modify their own programs. The training objective would be some combination of supervised, unsupervised and reinforcement learning.

I posted the same comment on reddit.

links please

Can you please provide links to the papers and theories you mention in your post? It is all very interesting.

Mikolov et al. isn't an RNN

> A recurrent network taught to predict missing words from sentences learns meaningful word embeddings, where simple vector arithmetic can be used to find semantic analogies (Mikolov et al., 2013).

Wrong. The network described there isn't an RNN. For more details, see Yoav Goldberg, who gives an excellent introduction to understanding the factorization presented in said paper.

Unexpected analogies

The early versions of Google Translate would translate Blair in English to Sarkozy in French and Kohl in German.
Apparently they used a lot of official EU documents as parallel texts.