Follow-up to my previous post, achieving the LeCun MLP benchmark with our simple neural network.

If you are interested in coding neural networks I highly recommend taking a look at how other libraries are structured before striking off on your own. If you write the layers of the network in a modular way then you get a lot of flexibility in how you can structure your networks.

It's definitely not obvious how to write a modular network if you've never seen it done before, and the way your code is written now doesn't take advantage of the inherently modular structure of the model you are building. For example, changing the activation function means editing both your forward and backprop code. Changing the number of layers would require changes as well: writing a 3 layer network and a 2 layer network would require two different classes (or a major refactoring to support both).
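To make this concrete, here is a rough Python sketch (the names are mine, not from your code or any library) of what pulling the activation out into its own object buys you: the rest of the network only ever calls forward and backward on "some activation", so swapping tanh for a sigmoid is a one-line change and none of the backprop code moves. The same trick applied to whole layers fixes the layer-count problem, which is what the torch7 discussion below is about.

```python
import numpy as np

class Tanh:
    def forward(self, x):
        self.out = np.tanh(x)
        return self.out
    def backward(self, grad_out):
        # d/dx tanh(x) = 1 - tanh(x)^2
        return grad_out * (1.0 - self.out ** 2)

class Sigmoid:
    def forward(self, x):
        self.out = 1.0 / (1.0 + np.exp(-x))
        return self.out
    def backward(self, grad_out):
        # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x))
        return grad_out * self.out * (1.0 - self.out)

# The layer that uses this never mentions tanh; it just calls
# activation.forward(...) and activation.backward(...).
activation = Tanh()   # swap for Sigmoid() and nothing else changes
```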

The best example I know of a well structured neural network library is the nn module for torch7. They do a very good job of separating operations into "modules" that can be composed into computational graphs. Their modules can be composed in sequence (e.g. linear layer -> tanh -> linear layer -> softmax -> loss function) and also into more complex graph structures which are hard to draw in reddit comments.

You should specifically look at how nn implements the primitives for forward and backprop inside its modules. There are three relevant functions in each module: updateOutput, updateGradInput and accGradParameters. updateOutput implements a forward pass through the module, and updateGradInput and accGradParameters together implement backprop (they compute the delta messages and the parameter gradients, respectively). Implementing these three operations is enough to allow modules to be composed in arbitrary ways without any special code within the modules themselves. For example, look at the Sequential module, which composes a bunch of arbitrary modules in sequence and is itself a module, which means you can put a Sequential inside a Sequential (or inside a different type of container module to get more elaborate structures).
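Here is a rough Python paraphrase of that structure (mine, not the actual torch7 code, which is Lua, and heavily simplified), just to show how little machinery the three-method interface needs and how a Sequential falls out of it:

```python
import numpy as np

class Module:
    def updateOutput(self, input):                     # forward pass
        raise NotImplementedError
    def updateGradInput(self, input, gradOutput):      # delta message for the module below
        raise NotImplementedError
    def accGradParameters(self, input, gradOutput):    # accumulate parameter gradients
        pass                                           # parameter-free modules do nothing
    def backward(self, input, gradOutput):             # convenience: both backprop steps
        gradInput = self.updateGradInput(input, gradOutput)
        self.accGradParameters(input, gradOutput)
        return gradInput

class Linear(Module):
    def __init__(self, n_in, n_out):
        self.weight = 0.1 * np.random.randn(n_out, n_in)
        self.bias = np.zeros(n_out)
        self.gradWeight = np.zeros_like(self.weight)
        self.gradBias = np.zeros_like(self.bias)
    def updateOutput(self, input):
        self.output = self.weight.dot(input) + self.bias
        return self.output
    def updateGradInput(self, input, gradOutput):
        self.gradInput = self.weight.T.dot(gradOutput)
        return self.gradInput
    def accGradParameters(self, input, gradOutput):
        self.gradWeight += np.outer(gradOutput, input)   # note the +=
        self.gradBias += gradOutput

class Tanh(Module):
    def updateOutput(self, input):
        self.output = np.tanh(input)
        return self.output
    def updateGradInput(self, input, gradOutput):
        self.gradInput = gradOutput * (1.0 - self.output ** 2)
        return self.gradInput

class Sequential(Module):
    def __init__(self, *modules):
        self.modules = list(modules)
    def updateOutput(self, input):
        for m in self.modules:
            input = m.updateOutput(input)
        self.output = input
        return self.output
    def backward(self, input, gradOutput):
        # walk the children in reverse, handing each one the input it saw on the
        # forward pass and the delta message coming from the module above it
        inputs = [input] + [m.output for m in self.modules[:-1]]
        for m, inp in zip(reversed(self.modules), reversed(inputs)):
            gradOutput = m.backward(inp, gradOutput)
        self.gradInput = gradOutput
        return self.gradInput

# a Sequential is itself a Module, so composition nests arbitrarily:
net = Sequential(Linear(784, 100), Tanh(), Linear(100, 10))
x = np.random.randn(784)
out = net.updateOutput(x)
grad_in = net.backward(x, np.ones(10))   # np.ones(10) stands in for dloss/dout
```

This leaves out batching, loss criterions and zeroing the accumulated gradients between updates, but the composition property the nn module is built on is all there.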

I don't know of any great written references on modular backprop in neural networks specifically. Section 3 of the Efficient Backprop paper you cite in your post touches on it, but isn't enough; Leon Bottou has a paper called A Framework for the Cooperation of Learning Algorithms that also explains it, but it is not a very easy paper to read. There is, however, a very nice section in Nocedal and Wright on automatic differentiation (http://www.bioinfo.org.cn/~wangchao/maa/Numerical_Optimization.pdf , section 7.2) that covers the right things. The examples they use are not quite at the right scale, but if you pretend each node in Figure 7.2 is a layer in your network instead of a single variable then you will see the connection.

Backpropagation corresponds to what Nocedal and Wright call "the reverse mode of automatic differentiation". I don't mean that these two things are similar; I mean "backpropagation" and "reverse mode automatic differentiation" are literally the same thing under two different names. If you read section 7.2 of Nocedal and Wright and compare what they say to the structure of modules in torch7, you will understand (for example) why torch modules have updateGradInput but then have accGradParameters instead of updateGradParameters, which would be more symmetric.
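A toy illustration of that asymmetry (my own example, not from the book and not torch code): in a reverse sweep every intermediate value gets exactly one delta message passed back along the chain, but a parameter is a leaf that can be used in more than one place (weight sharing, or several backward calls before you apply an update), and its total gradient is the sum of the contributions from each use. That is why the parameter side accumulates with += while the input side just computes a message and passes it back:

```python
import numpy as np

W = np.random.randn(3, 3)      # one weight matrix used twice (weight sharing)
x = np.random.randn(3)

# forward: y = W * tanh(W * x), loss = 0.5 * ||y||^2
h = np.tanh(W.dot(x))
y = W.dot(h)

# reverse sweep
grad_y = y                          # dloss/dy
grad_W = np.zeros_like(W)           # parameter gradient starts at zero and accumulates
grad_W += np.outer(grad_y, h)       # contribution from the second use of W
grad_h = W.T.dot(grad_y)            # delta message for h: computed once, passed back
grad_z = grad_h * (1.0 - h ** 2)    # back through tanh (z = W * x)
grad_W += np.outer(grad_z, x)       # contribution from the first use of W
grad_x = W.T.dot(grad_z)            # delta message for the input
```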

I recommend looking at torch7 because I think it is very well structured, but you could look at other libraries if you want. Caffe is not a bad choice, although it doesn't make a clean separation between updateGradInput and accGradParameters, which I think is a design mistake. I recommend not looking at pylearn2 (or any other theano-based library), even though it might seem like the obvious choice because it is also written in Python. Pylearn2 is a very nice library, but it is built around symbolic differentiation (through theano), which is a totally different approach from what I have described here. If you want to commit to theano (there are upsides and downsides to doing this) then the pylearn2 source code is essential reading; otherwise it will just confuse the matter.
