ELI5: How Deep Learning, Such as Google's DeepMind AI, Works

Imagine that you attend a really large school (K-12), and every morning before class, the entire school meets for assembly in the main hall. During assembly, each grade sits in their own row in order, with the kindergartners up the front, followed by the first graders, second graders, and so on. Behind the twelfth graders, the "cool" kids sit in their own row up the very back: it just so happens that they are the captains of each of the different extracurricular activities that the school offers (football, music, debating, chess, etc.).

All of the students at the school participate in one or more of these activities, except the captains who are too busy to participate in any activity other than their own. Occasionally during assembly, the principal will make an announcement about one of the activities, and will ask all of the students involved to stand up. Unfortunately, all of the older kids tune out long before, so only the kindergartners are attentive enough to stand up at first.

Luckily, each student vaguely knows who in the grade below participates in the same activities as them, and so, upon seeing some of the students in the row in front stand up, makes a best guess as to whether they should stand up too: some of the first graders stand up, followed by some of the second graders, and so on, until finally one of the captains in the very back row stands up.

What we have described here is the forward pass of an artificial neural network. With only the first row of standing kindergartners as input, we are able to figure out which activity the principal is talking about because of the captain standing in the very back row. This is an example of classification.
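For anyone who wants to see the assembly as code, here is a minimal sketch of a forward pass in plain Python. Everything named here is an assumption made for illustration: the tiny school (three kindergartners, one row of two first graders, two captains), the hand-picked trust values in `layers`, and the choice of a sigmoid to turn "how convinced am I" into a number between 0 and 1. A real network would learn its weights rather than have them written by hand.

```python
import math

def sigmoid(x):
    # Squash any number into the range (0, 1): "how strongly am I standing?"
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, layers):
    """Pass activations row by row from the front of the hall to the back.

    layers[k][j][i] is how much student j in row k+1 trusts
    student i in the row directly in front.
    """
    activations = inputs
    for weights in layers:
        activations = [
            sigmoid(sum(w * a for w, a in zip(row, activations)))
            for row in weights
        ]
    return activations

def classify(inputs, layers):
    # The captain standing most confidently names the activity.
    outputs = forward(inputs, layers)
    return max(range(len(outputs)), key=lambda i: outputs[i])

# Hypothetical, hand-picked trust values (not learned).
layers = [
    [[2.0, -1.0, 0.0], [-1.0, 2.0, 1.0]],  # first graders watching kindergartners
    [[3.0, -3.0], [-3.0, 3.0]],            # captains watching first graders
]

print(classify([1.0, 0.0, 0.0], layers))  # captain 0 stands
print(classify([0.0, 1.0, 1.0], layers))  # captain 1 stands
```

Adding more grades just means adding more weight tables to `layers`; the loop in `forward` stays the same.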

Now as you can imagine, this process doesn't always go according to plan, and sometimes the wrong captain will stand up during an announcement. When this happens, one of the teachers up the back will quietly tell the mistaken captain that they should be sitting. The captain immediately sits down, but in the process also taps on the shoulders of his teammates standing in front and tells them, "Hey, this wasn't us, I think we should be sitting down."

Each of those students then in turn sits down and tells their standing teammates in front the same thing: some of the twelfth graders sit down and tip off their eleventh grade friends, who tip off their tenth grade friends, and so on, all the way back up to the kindergartners.

Now every time a student realizes that they were standing at the wrong time, they get a little embarrassed, and also a little mad at their friends who stood up in front of them for making them think that they should be standing too. As a result, they are more reluctant to stand up the next time they see those same friends in front of them standing.

What we have described here is known as backpropagation. This algorithm is the key to training neural networks (the "learning" in "machine learning"). When the neural network gives the wrong answer as output, we can propagate the error all the way back to the inputs, and modify the parameters (or weights) of the network in a way that will improve its overall accuracy. In this case, the parameters of the network are each individual student's trust of their friends standing up in front of them.
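Here is a rough sketch of a single "wrong captain" correction, again in plain Python. The setup is assumed for illustration only: two kindergartners, one hidden row of two students, two captains, sigmoid activations, a squared-error measure of how wrong the captains were, and a made-up learning rate `lr` that controls how much trust changes per mistake. The story has students sitting down; in the code the equivalent is nudging each trust value against the direction that caused the error.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, w_hidden, w_out):
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    out = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w_out]
    return hidden, out

def squared_error(out, target):
    return 0.5 * sum((o - t) ** 2 for o, t in zip(out, target))

def backprop_step(x, target, w_hidden, w_out, lr=0.5):
    """Update the weights in place after one announcement."""
    hidden, out = forward(x, w_hidden, w_out)
    # Blame at each captain: how wrong they were, scaled by the sigmoid slope.
    delta_out = [(o - t) * o * (1 - o) for o, t in zip(out, target)]
    # Blame passed forward to each hidden student by the captains who
    # trusted them (computed before any weight is changed).
    delta_hidden = [
        h * (1 - h) * sum(d * row[j] for d, row in zip(delta_out, w_out))
        for j, h in enumerate(hidden)
    ]
    # Adjust trust: captains in the hidden row, hidden row in the kindergartners.
    for d, row in zip(delta_out, w_out):
        for j, h in enumerate(hidden):
            row[j] -= lr * d * h
    for d, row in zip(delta_hidden, w_hidden):
        for i, xi in enumerate(x):
            row[i] -= lr * d * xi

# Hypothetical starting trust values.
w_hidden = [[0.5, -0.5], [0.3, 0.8]]
w_out = [[1.0, -1.0], [-1.0, 1.0]]
x, target = [1.0, 0.0], [0.0, 1.0]  # captain 1 should have stood

before = squared_error(forward(x, w_hidden, w_out)[1], target)
backprop_step(x, target, w_hidden, w_out)
after = squared_error(forward(x, w_hidden, w_out)[1], target)
print(before, after)  # the error is smaller after the correction
```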

Now the way that machine learning works is that we train a model with a huge amount of data to give the best performance possible. In this case, if the principal were to make announcements every morning for several months, we can imagine that over time, the students would get better at determining when to stand up based on the students standing in front of them, meaning that the captains would less frequently stand up at the wrong time.
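The "several months of announcements" idea can be sketched as a training loop. This is a toy under stated assumptions, not how any production system is trained: the school has four kindergartners, three students in one hidden row, and two captains; there are only two made-up announcement patterns (labelled football and chess); trust values start out random with a fixed seed; and 2000 mornings and a learning rate of 0.5 are arbitrary choices.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_once(x, target, w_hidden, w_out, lr):
    """One announcement: forward pass, then one backpropagation update.

    Returns the squared error measured before the update.
    """
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    out = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w_out]
    delta_out = [(o - t) * o * (1 - o) for o, t in zip(out, target)]
    delta_hidden = [
        h * (1 - h) * sum(d * row[j] for d, row in zip(delta_out, w_out))
        for j, h in enumerate(hidden)
    ]
    for d, row in zip(delta_out, w_out):
        for j, h in enumerate(hidden):
            row[j] -= lr * d * h
    for d, row in zip(delta_hidden, w_hidden):
        for i, xi in enumerate(x):
            row[i] -= lr * d * xi
    return 0.5 * sum((o - t) ** 2 for o, t in zip(out, target))

random.seed(0)  # fixed seed so the run is repeatable
w_hidden = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
w_out = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]

# (which kindergartners stand, which captain should stand)
announcements = [
    ([1.0, 1.0, 0.0, 0.0], [1.0, 0.0]),  # football (hypothetical pattern)
    ([0.0, 0.0, 1.0, 1.0], [0.0, 1.0]),  # chess (hypothetical pattern)
]

error_per_morning = []
for morning in range(2000):
    total = sum(train_once(x, t, w_hidden, w_out, lr=0.5) for x, t in announcements)
    error_per_morning.append(total)

print(error_per_morning[0], error_per_morning[-1])  # error shrinks over time
```

Real training sets have many more examples than two, which is exactly why so much data and so many repetitions are needed.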

Now the "deep" in "deep learning" refers to the many layers of computation involved. In this case, each layer corresponds to one grade sitting in its row: with our K-12 school, the kindergartners are the input layer, the first through twelfth graders are twelve hidden layers, and the captains are the output layer.

In practice, it has been found in numerous applications that this approach using many layers can lead to big gains in performance compared to simpler models. For example, consider what would happen if only the kindergartners and the captains attended assembly. A network like this, with no hidden layers in between, is known as a linear classifier, and it's really quite difficult for the captains to be sure that they are standing at the right time based on just the kindergartners!
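One classic way to see why the missing rows matter is a made-up activity whose captain should stand when exactly one of two kindergartners is standing, but not when both or neither are (the XOR pattern). The sketch below is an illustration under assumptions that are not part of the story above: a hard yes/no `step` decision instead of a gradual one, an extra `bias` term for the lone captain, a coarse grid of candidate trust values, and a hand-built middle row of two students.

```python
from itertools import product

def step(z):
    # Hard decision: stand (1) or stay seated (0).
    return 1 if z > 0 else 0

# (first kindergartner, second kindergartner) -> should the captain stand?
cases = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def linear_captain(x, w1, w2, bias):
    # The captain looks straight at the kindergartners: no rows in between.
    return step(w1 * x[0] + w2 * x[1] + bias)

def layered_captain(x):
    # One hidden row of two students, with hand-picked trust values.
    either = step(x[0] + x[1] - 0.5)  # stands if at least one kindergartner stands
    both = step(x[0] + x[1] - 1.5)    # stands only if both kindergartners stand
    return step(either - both - 0.5)  # captain: "either, but not both"

# Try every combination of trust values on a coarse grid from -2 to 2.
grid = [i * 0.5 for i in range(-4, 5)]
best_linear = max(
    sum(linear_captain(x, w1, w2, b) == y for x, y in cases)
    for w1, w2, b in product(grid, repeat=3)
)
layered_score = sum(layered_captain(x) == y for x, y in cases)

print(best_linear, layered_score)  # prints "3 4"
```

No setting of the lone captain's trust values gets all four cases right, on this grid or any other, because the pattern cannot be separated by a single straight line. One extra row is enough to fix it.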

/r/explainlikeimfive Thread