Sunday, August 19, 2007

Netflix Prize: Feature Output Error Curve

I wanted to fill out some details surrounding what might best be referred to as the feature output error curve, mentioned in Brandyn Webb's (AKA Simon Funk) now well-read blog post over at http://sifter.org/~simon/journal/20061211.html. He refers to it as the output-output curve, or the output vs target output curve.

Having completed training on a feature, we could simply apply it to the residuals matrix and start training the next one. However, it turns out that a feature has a 'shape', and that shape can be found by plotting a graph with the predicted rating along the X-axis and the mean error on the Y-axis, i.e. the mean error of all predictions of a given value. In practice, of course, we have to group the predictions into small ranges along the X-axis in order to be able to calculate a mean for each group. By error here I mean the signed difference between the prediction and the target rating, so it includes both negative (prediction was too low) and positive (too high) values.
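To make that recipe concrete, here is a minimal sketch of computing such a curve (Python/NumPy, purely illustrative and not the code behind my plots; the bin count and the 0-4 range are just example choices):

import numpy as np

def feature_output_error_curve(predictions, targets, n_bins=50, lo=0.0, hi=4.0):
    # Group predictions into small ranges (bins) along the X-axis and return
    # the bin centres plus the mean signed error (prediction - target) of all
    # predictions falling into each bin.
    edges = np.linspace(lo, hi, n_bins + 1)
    centres = 0.5 * (edges[:-1] + edges[1:])
    errors = predictions - targets                    # positive = prediction too high
    bins = np.clip(np.digitize(predictions, edges) - 1, 0, n_bins - 1)
    mean_err = np.full(n_bins, np.nan)                # NaN marks empty bins
    for b in range(n_bins):
        in_bin = (bins == b)
        if in_bin.any():
            mean_err[b] = errors[in_bin].mean()
    return centres, mean_err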

Here is such a graph, from one of my own experiments, for the first feature. The red line is the curve for the probe ratings; the black line is the same plot for the training data ratings. Oh, and for reasons that even I am not sure of, I rebased all of the ratings into the 0-4 range instead of 1-5. Apologies for any resulting confusion!

As I said, Brandyn actually refers to a predicted vs target rating output curve, which is basically the same as the above plots but presented in a form that lets you map directly from a predicted rating to an adjusted rating (more on that later): they are the above plots with the offsets subtracted from the line y = x.


As you can see, this form shows the kinks at the ends of the plots that Brandyn mentions, but the overall shape/detail of the curves is less clear.

I should also point out at this stage that Brandyn was probably referring to the output of the feature vs its target output (e.g. +1 would get us to the correct rating, but the feature generated +0.5), whereas I'm referring to the current best prediction (the current residual plus the output of the current feature) vs the target rating - so the residuals may contain a 3, the feature generates +0.5, and the target rating is 4. It all amounts to the same thing, viewed from slightly different perspectives, 'tis all.

Going back to the first graph, then: it shows us that low predictions tend to be too low, and high predictions tend to be too high. So when we apply our feature to the residuals matrix we can pass our predictions through this graph, apply the appropriate offset, and achieve a better RMSE on the residuals matrix (i.e. eliminate more error).
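Continuing the sketch from above (again illustrative only), applying the curve then amounts to interpolating the mean error at each prediction and subtracting it:

import numpy as np

def apply_error_curve(predictions, centres, mean_err):
    # Subtract the interpolated mean error from each prediction. Assumes the
    # curve came from feature_output_error_curve above; empty (NaN) bins are
    # simply skipped when interpolating.
    ok = ~np.isnan(mean_err)
    offsets = np.interp(predictions, centres[ok], mean_err[ok])
    return predictions - offsets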

If you don't apply the offsets then the curve actually remains in the residuals matrix, quite possibly disrupting the discovery of later features. You can, however, re-calculate the feature output error curve for a residuals matrix at any time and extract the corresponding amount of error.

One interesting observation is that if you apply such a curve after each feature, the curve gets less and less prominent, eventually becoming more-or-less flat apart from a bit of noise. Brandyn therefore decided not to calculate this curve after 20 features or so. However, if you go another 20 features without applying any offsets and then plot the curve again, you can see that a curve has re-emerged, and applying the offsets to the residuals matrix as before will reduce RMSE. This re-emergence is, I assume, the result of the accumulation of many faint curves: individually they are lost in the background noise, but together they become prominent again. On this basis I choose to apply the curve for the first 20 features (as Brandyn does) and then again every 20 features. Perhaps I should just keep applying a curve at each feature?
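For what it's worth, that schedule could be written down as something like this (a sketch only; the parameter names are mine):

def should_apply_curve(feature_index, apply_first=20, apply_every=20):
    # Apply the correction curve for each of the first `apply_first` features
    # (counting from zero), then again every `apply_every` features after that.
    if feature_index < apply_first:
        return True
    return (feature_index + 1) % apply_every == 0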

But what of the two plots in each graph? One is plotted for the probe set predictions and one for the training set. In the above plots the training set curve (black) has a more prominent curvature than the probe curve (red). This means that if you remove the training set curve you are over-adjusting the probe predictions (residuals), and if you apply the probe curve you are under-adjusting the training residuals. I suppose you could apply the two curves to their respective residuals, but as a general rule I think we should consistently apply the same functions to both sets of residuals to prevent them becoming separated in some way - maybe that instinct is wrong. Also, when training on the full data set you do not have any set that is akin to the probe set. You have the qualifying set, sure, but you don't know its target ratings (that is the whole point of the contest!), which are required to calculate a feature output error curve.

Another option, and one I have used to generate a quiz submission, is to store the curves generated for each feature against the probe data, and then to apply those same curves during SVD on the full data set. This is on the basis that those saved curves will more closely represent the curve that would have been generated by the qualifying set, if we had access to its ratings. Hmmm.


A loose end...

If, like me, your maths is a bit rusty [or in my case it was never that shiny in the first place ;) ] then you might be thinking: why did we take the mean error at each point on the x-axis? We are trying to reduce squared error (E^2), not error (E), so applying an offset to the predicted ratings as described, although it lowers E, may in fact increase E^2. Hmm, well, no. This is one of those things that some might find intuitively obvious, but I wasn't really sure until I worked through the maths. If you have a list of numbers and you can apply an offset to all of them and you want to get the lowest possible total sum(V), well then obviously you subtract the mean. And as it happens the same is true of sum(V^2): just subtract the mean. ------ no no no! see comment below!
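For the sum(V^2) part at least, a quick numerical check (not from the post itself) shows that subtracting the mean does give the smallest possible sum of squares:

import numpy as np

# For a fixed list of errors v, find by brute force the constant offset c
# that minimises sum((v - c)^2), and compare it with the mean of v.
rng = np.random.default_rng(0)
v = rng.normal(loc=0.3, scale=1.0, size=1000)
candidates = np.linspace(-1.0, 1.0, 2001)
sums_of_squares = [np.sum((v - c) ** 2) for c in candidates]
print(candidates[int(np.argmin(sums_of_squares))], v.mean())
# the two printed values agree to within the grid spacing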