Machine Learning in a Year

I've transitioned in my job from pure programming to working on machine learning. I can detail a bit here how I went about learning the requisite background.

Overall, the process involved about a year of on-and-off self-education. I never took any online courses; I just don't find them useful.

The book that kicked me off on the right path was:

  • Probabilistic Programming and Bayesian Methods for Hackers

This is not strictly a machine learning book. It's essentially an entire book (written as an IPython notebook, but also available in paperback) dedicated to explaining a powerful class of algorithms for Bayesian inference, Markov chain Monte Carlo (MCMC), using the PyMC library.

The value here is that it's very introductory in nature, it's very hands-on, and it got me in the mindset of thinking about statistics when I hadn't done so in years.
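
To make the MCMC-with-PyMC idea concrete, here's a minimal sketch of the kind of model the book walks you through: inferring a coin's bias from observed flips. I've written it against the PyMC3 API, so treat the exact calls and names as my own assumption rather than the book's code.

```python
import numpy as np
import pymc3 as pm

# Toy problem: infer a coin's bias from observed flips using MCMC.
flips = np.random.binomial(1, 0.7, size=200)   # synthetic data, true bias 0.7

with pm.Model():
    p = pm.Uniform("p", lower=0, upper=1)      # prior over the unknown bias
    pm.Bernoulli("obs", p=p, observed=flips)   # likelihood of the observed flips
    trace = pm.sample(2000, tune=1000)         # MCMC draws from the posterior

print(trace["p"].mean())   # posterior mean should land near 0.7
```

The point is that you never derive the posterior analytically; the sampler does the work, which is exactly the mindset the book builds.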

I jumped into this book next:

  • The Elements of Statistical Learning

I didn't read it cover-to-cover and in fact there's still a lot of material here I haven't read. This is a pretty intense book for someone new to statistics. Even if you have a rock-solid background in mathematics, the statistical language may still throw you for a loop here and there. Nevertheless, it made for excellent "background" reading while I absorbed information from other sources.

At this point the Deep Learning book was available:

  • Deep Learning -- deeplearningbook.org

This is an excellent book, and despite the title it covers more than just pure deep learning. That's definitely the focus, but the introductory chapters on statistics and machine learning in general were fantastic. I recommend reading this book cover-to-cover. It's a gold mine. I was able to go back to The Elements of Statistical Learning after this and follow along much more easily.

After reading that book, I was shocked to find I could read state-of-the-art research papers in deep learning and follow along with most or even every detail. To get started, I recommend:

  • Distill -- distill.pub

This is a new online machine learning journal focused on exposition and clear explanations more than original research. Publications are rare but uniformly excellent. My own work is focused pretty strongly on sequence processing rather than images, so here are some research papers I read:

  • http://distill.pub/2016/augmented-rnns/ – purely expository article that visually covers the basics of some of the following papers.
  • https://arxiv.org/abs/1410.5401 – introduces the neural Turing machine (NTM). An NTM augments an LSTM with a memory bank and fully differentiable memory transactions, meaning the model is trainable end-to-end with gradient descent. If the internal memory of the LSTM acts like the registers on a CPU, then the memory bank is similar to RAM (a sketch of this soft memory access follows after this list).
  • https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf – introduces the classic seq2seq model, where an encoding RNN produces a single fixed-length "thought vector" which is decoded by a decoding RNN (a minimal sketch of this also follows the list).
  • https://arxiv.org/pdf/1409.0473.pdf – introduces attentional interfaces between RNNs, as an alternative to the classic seq2seq model. This allows an attending RNN to focus on any part of the output of an encoding RNN, with the focus changing at each time step (see the small attention sketch after the list). This sort of differentiable attention is also called "soft attention" to distinguish it from "hard attention", which cannot be trained with gradient descent.
  • https://arxiv.org/pdf/1511.08228.pdf – introduces the neural GPU, an alternative to NTMs which has shown superior performance in learning algorithms like adding and multiplying 2000-bit binary numbers from examples with only 20-bit numbers. A neural GPU constructs a 3D tensor called a "mental image". The initial mental image is all zeros except the first layer (a matrix, or vector of vectors), which consists of all the input vectors. This image is then repeatedly convolved with a 4D kernel bank tensor, with each convolution producing a new 3D tensor. Once the convolutions are finished, the sequence of output vectors is read back off the first layer of the final mental image tensor (a rough sketch follows the list).
  • https://arxiv.org/pdf/1610.08613.pdf – introduces "active memory" for neural GPUs, as an alternative to soft attention. In active memory models, all memory cells are accessed or modified equally.
  • https://www.gwern.net/docs/2016-graves.pdf – Nature paper from Google DeepMind introducing the differentiable neural computer (DNC), a relatively straightforward modification of the NTM, which some of the same authors introduced previously. Unlike the NTM, the DNC monitors memory freeness, provides mechanisms for deallocating and allocating memory, and maintains a temporal link matrix tracking which memory cells were written to in succession, even if they're not adjacent (the link-matrix update is sketched after the list).
  • https://storage.googleapis.com/deepmind-media/alphago/AlphaGoNaturePaper.pdf – another Nature paper from DeepMind that describes AlphaGo. Despite the magnitude of the accomplishment, this paper is pretty basic and very accessible.
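
The sketches promised above, in order. These are my own toy NumPy snippets written to make the mechanisms concrete; none of them are code from the papers, and the variable names and sizes are arbitrary. First, the NTM-style soft memory access: addressing weights come from a softmax over similarities, and reads and writes touch every slot a little, so everything stays differentiable.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# A toy memory bank: N slots, each a vector of width W (the "RAM" sitting next
# to the controller LSTM's "registers").
N, W = 8, 4
memory = np.random.randn(N, W)

# Content-based addressing: the controller emits a key, each slot is scored by
# similarity to the key, and the scores are normalized into weights.
key = np.random.randn(W)
weights = softmax(memory @ key)        # shape (N,), non-negative, sums to 1

# Soft read: a weighted average over every slot, so gradients flow through all
# of memory and the model is trainable end-to-end with gradient descent.
read_vector = weights @ memory         # shape (W,)

# Soft write: every slot is partially erased and updated in proportion to the
# same weights.
erase = np.random.rand(W)              # erase vector in [0, 1]
add = np.random.randn(W)               # add vector
memory = memory * (1 - np.outer(weights, erase)) + np.outer(weights, add)
```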
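
Next, the seq2seq "thought vector" in the same toy style (the paper uses LSTMs; plain tanh cells keep the sketch short): the encoder folds the whole input sequence into one fixed-length vector, and the decoder unrolls from that vector alone.

```python
import numpy as np

H, D, T = 16, 8, 5                 # hidden size, vector size, sequence length
rng = np.random.default_rng(0)
Wh_enc = 0.1 * rng.standard_normal((H, H))
Wx_enc = 0.1 * rng.standard_normal((H, D))
Wh_dec = 0.1 * rng.standard_normal((H, H))
Wy = 0.1 * rng.standard_normal((D, H))

inputs = [rng.standard_normal(D) for _ in range(T)]

# Encoder: a vanilla RNN compresses the entire input into one "thought vector".
h = np.zeros(H)
for x in inputs:
    h = np.tanh(Wh_enc @ h + Wx_enc @ x)
thought_vector = h

# Decoder: another RNN unrolls from the thought vector alone to emit outputs.
h = thought_vector
outputs = []
for _ in range(T):
    h = np.tanh(Wh_dec @ h)
    outputs.append(Wy @ h)
```

The bottleneck is obvious here: everything the decoder knows has to squeeze through that single vector, which is exactly what attention relaxes.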
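
The attention sketch is the same arithmetic as the NTM's soft read, just applied to the encoder's hidden states instead of a memory bank: the decoder scores every encoder state at each step and reads a context vector as their weighted average.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

T, H = 5, 16
encoder_states = np.random.randn(T, H)   # one hidden state per input position
decoder_state = np.random.randn(H)       # the attending RNN's current state

# Score every encoder state against the decoder state (a dot-product score
# here; the paper uses a small learned network instead).
scores = encoder_states @ decoder_state          # shape (T,)
alpha = softmax(scores)                          # attention weights, sum to 1

# The context vector is a weighted average of all encoder states, so the focus
# can shift at every decoder time step and the whole thing stays differentiable.
context = alpha @ encoder_states                 # shape (H,)
```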
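
A rough sketch of the neural GPU's mental image, under my reading of the paper (the real model uses gated convolutional units and learned embeddings; plain convolutions and tanh stand in here):

```python
import numpy as np
from scipy.signal import convolve2d

n, w, m = 8, 4, 6                      # length, width, channels of the mental image
mental_image = np.zeros((n, w, m))
mental_image[:, 0, :] = np.random.randn(n, m)    # input vectors fill the first layer

kernel_bank = 0.1 * np.random.randn(3, 3, m, m)  # the 4D kernel bank tensor

def conv_step(image, kernels):
    # One step: each output channel sums 2D convolutions over all input channels.
    out = np.zeros_like(image)
    for j in range(m):
        for i in range(m):
            out[:, :, j] += convolve2d(image[:, :, i], kernels[:, :, i, j], mode="same")
    return np.tanh(out)

# Convolve the mental image repeatedly (roughly one step per input position),
# then read the output sequence back off the first layer.
for _ in range(n):
    mental_image = conv_step(mental_image, kernel_bank)
outputs = mental_image[:, 0, :]
```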
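
Finally, the DNC's temporal link matrix, as I understand the paper's equations (treat the exact update as my paraphrase): a precedence vector remembers which slots were written most recently, and the link matrix accumulates "slot i was written right after slot j", so reads can later be replayed in write order.

```python
import numpy as np

N = 8                         # number of memory slots
link = np.zeros((N, N))       # link[i, j] ~ "slot i was written right after slot j"
precedence = np.zeros(N)      # how recently each slot was the write target

def update_temporal_links(link, precedence, write_w):
    # write_w: the current write weighting over slots (non-negative, sums to <= 1).
    link = ((1 - write_w[:, None] - write_w[None, :]) * link
            + write_w[:, None] * precedence[None, :])
    np.fill_diagonal(link, 0.0)                      # a slot never links to itself
    precedence = (1 - write_w.sum()) * precedence + write_w
    return link, precedence

# Example: two successive writes, first to slot 2 and then to slot 5.
link, precedence = update_temporal_links(link, precedence, np.eye(N)[2])
link, precedence = update_temporal_links(link, precedence, np.eye(N)[5])
print(link[5, 2])   # 1.0: slot 5 was written immediately after slot 2
```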

All throughout this process I was playing around with toy problems using Python and Spark.
