I have spent quite a lot of my life working with Neural Networks, of the common MLP with backprop sort. They are amazingly versatile and in their own right quite beautiful – if not a little mysterious. The one bit of them which has always made me wonder out loud, and the bit which most people struggle with, is the back-propagation of the error, and how that actually works. So, here is my take on the whole thing insofar as I understand it. I wrote this recently in an email to a colleague and thought I would post it out into posterity in case someone else finds it useful.
First, training a network goes like this…
for all rows
- present a row of input values as a vector
- make a forward pass through the network in its current state to make a prediction
- compare the prediction to the actual dependent variable value/vector in that row and compute an error
- propagate that error back through the network such that its affect is proportional to the strength of interconnections (more later)
- move to next row
One pass through all rows is called an epoch (for some bizarre reason). After each epoch you look at your total error measure and decide whether to stop.
The forward pass is just a matter of taking the input value and multiplying it by the weight between that input node and each node in the next layer to which it is connected. At the recipient node’s end it gets a collection of weighted inputs (the activation value) which it sums and passes through its transfer function – normally a sigmoid.
For avoidance of doubt is the normal way of describing the transfer function and in my case
While we’re here it is an important feature of that function that it is entirely differentiable because during backpropagation its first derivative is used to pass the error back across a node. It is left as an exercise for the student to prove that…
My network architecture is an input layer, a single hidden layer and an output layer. The number of nodes in the hidden layer is calculated using a heuristic I developed some years ago which gives me a good number. This is a bit that I’d like to keep to myself because it’s quite clever and part of IP that I would claim as my own.
So, to back propagation. The basic idea is that you take the error in the output node and spread it backwards through the network according to the strength of interconnections. In this way a large error which is associated with a strong connection causes a big step in weight change. The effect of each row on the weights can be moderated by two coefficients, one is called the learning rate, L, and you can adjust it to make each row have a smaller effect on the weights, and the other is called momentum, M, and you can adjust that so that a small amount of the previously calculated change is also included.
In my generalised case the error is not a single value but a vector of values because my dependent variable is a vector. In my case I am using the network as a discriminatory tool. I combine these errors into an overall error measure for the purposes of stopping, but for the purposes of backprop each node in the output layer sends a different error back through the network. In a similar way to the forward pass, these errors are summed by the recipient node and passed backwards through the first derivative of the transfer function. The result is an error they can then pass on to the nodes in the preceding layer that are connected to them.
The backprop process calculates the error in the output layer, runs it backwards to the hidden layer, and calculates an “error” for each of the hidden nodes. The error in the output nodes is then used to adjust the weights between the hidden and output layers and the “errors” in the hidden nodes are used to adjust the weights between the input and hidden nodes. So if you laid the network out left to right, inputs to outputs, then the error in the right hand layer moderates the weights immediately to its left, and the process repeats in a right to left direction.
Now, the “error” in the output layer that is backpropagated is just the difference between the actual and predicted value, however we want the derivative of the error with respect to the weights since we want to make a weight step, , in the direction that maximises that derivative and takes closest to the minimum in the error surface. Jumping right to the answer that derivative is given by…
where i and j denote the node in each layer that the weight corresponds to, L is the learning rate, is the derivative of the error for the jth node and is the activation value (i.e. the input value) of the ith node. Notice that comes from the jth layer and comes from the ith layer – we’re going backwards.
Now, dig out your maths, the derivation for comes from using the chain rule for partial differentials.
We are trying to get to the error derivative wrt weights, i.e. and using the chain rule we can express this as…
and we can get to each of these partial derivative terms numerically.
The first pair of terms together we’ll call and it is the error derivative for the jth node
For the first term in this partial derivative chain, we can get at it because we are defining E as the RMS error
which we can differentiate with respect to giving
for the next term, is the jth node’s output value, which is simply so
and that’s where our differentiable transfer function comes into play.
Assembling these back we get
Now, for the last term in the original chain we need to get . For that we return to the fact that the relationship between the input to the jth node, , and the weights that make it up is simply a sum…
so the partial derivative is simply !
is very easily calculated because we have , we know how to do and is just the “error” in the output node. we have in any case because we activate the nodes in the forward pass.
The adjustment to the ijth weight, including the learning rate, is then
We add our momentum term from the previous row, so for the Nth row
which is completely calculable numerically.
Now, that’s how backprop works and you wind up taking successive “steps” in the weight space across the error surface in directions which take you towards a minimum.
What nobody tells you is that this is the easy bit. The hard bits are all the stuff you need to wrap round your Neural Network to make it useful. There is one thing that, unless you find a way to deal with, will make your Neural Network will be next to useless for all practical purposes and that is the tendency to over-fit to the data.
What this kind of network actually amounts to is a weighted superposition of the transfer function. Because we chose a non-linear function, i.e. a sigmoid, it can be shown quite reliably that, with enough degrees of freedom, you can estimate pretty much any curve you like by superimposing them on top of each other. The number of degrees of freedom in the network is large because each weight adds one, and in a fully interconnected network the number of weights is potentially very large. This is a real problem because if you match the line of your actual data exactly you are almost certainly over-fitting, especially if the data contains noise (and all real world data does).
To cope with this you have to test the predictability of the model against a set of data on which you have not trained the weights. You can measure the error against that test set in the same way you do the training set and you can compare the two. When the error in the training set falls as the error in the test set rises you are very likely to be over-fitting and modelling the noise in the data.
That in turn means you need to be able to sample a test set and present that during training for epoch-wise comparison of the errors. There’s quite a lot of infrastructure which has nothing to do with the network to make that happen. You can make it the user’s problem to supply the test data, but if you are going to make your Neural Net really useful you’ll do it for them. The critical thing to remember when devising a scheme for partitioning data into training and test sets is that the distribution of the dependent variable(s) in the test set must be very close to that in the training data. Given how back propagation works it should be clear that you stand to skew the resulting model qute dramatically if you train it in an area of the error space which is not representative of the overall data. Be careful if you decide to sample at random that you are really doing that, and there is not some implicit order in the data which will skew your distributions.
Then you need to decide which of the many epochs represents the “best” model. Many years of experience has taught me that the only models which have any reliable practical use are the ones which have predictive ability on the test data. It is really tempting to go for a model which has a low error overall and appears to fit the data very well statistically. However, unless it does as well on the test data it is a highly spurious model and you’ll get disastrous predictions from it. You won’t know your predictions are rubbish until later. It is often an unpalatable truth that a less accurate prediction which is directionally correct is much more useful than a hyper-accurate one which is probably incorrect. In most business situations you only really need a directionally correct prediction – it’s a model after all, not the bible – but that reasoning is far from common sense, which says be as accurate as possible.
As you are training your network you don’t know a priori which epoch is going to be “best” by any measure you may choose, so you have to also provide a means of returning to the one that, once finished, you decide is the best. What I tend to do is a quick persistence of the weights each time my “best” measure improves. I have a highly performant way of doing that persistence so as not to interrupt the training too much nor add to the overall time the algorithm takes to execute. When I have finished I then re-hydrate my “best” model and persist it in a more useful form. In fact I keep a sequence of models in this way because, as a practitioner, I have learnt that sometimes you get several good solutions which exhibit different characteristics and I’d like to use my judgement and wider knowledge of the application to choose which one to use.
Prior to that you also have to decide when to stop. Stop too early and you may miss non-linear structure in the data. Stop too late and you may have wasted a lot of computing time. You will need to develop an algorithm for stopping which may be quite complex. Don’t be afraid of adding complexity here too because it will be a great commfort to you later to know that you have optimised the amount of time spent learning to extract all the information there is in your data.
So, good luck if you are thinking of building your own Neural Net – I strongly encourage you to have a go at it and I hope that this page has been of some use in broadening your knowledge.
[If you want to use my algorithms please get in contact, I have a business set up specifically to license them and several business already deriving value from them. If you have a modelling problem that you feel you can’t solve or would like some expert practical advice about, please also get in touch, we’re pretty good at that too.]