Katie Malone is a colleague of mine from the CERN days and together we have been working on the Kaggle Character Recognition Contest. As opposed to Kaggle's other contests, which have monetary prizes, this is a contest only in the sense that 2nd grade soccer is a contest, i.e., self-esteem is the prize.
Anyhoo, Katie and I submitted using a Support Vector Machine (SVM) as our algorithm and scored an accuracy of 0.97429, putting us in 169th place. Yay? Then we did a Virtual SVM and scored an accuracy of 0.98571. Currently that puts us in 41st place on the Kaggle Leaderboard.
So let's back this up and I'll explain more.
First, the point of the contest is to give people practice in machine learning and big data/data scientist topics. The point specifically of this contest is to recognize characters from the MNIST Database of handwritten digits. MNIST offers this as a 60,000 example training set and a 10,000 example test set. Kaggle has re-packaged that into 42,000 training examples and 28,000 test examples. The examples are each a scan of a handwritten digit broken down into a 28x28 re-centered matrix where each pixel is the image intensity scaled from 0 to 255. The actual digits look something like this:
After taking Stanford Professor Andrew Ng's machine learning course last fall, Katie and I wanted to get some more experience with it. In particular, I wanted to develop the topic of SVMs a bit more. Also, the MNIST page indicated that some types of SVMs did really well. In particular, I was interested in the results of DeCoste and Scholkopf with Virtual SVMs, so I tracked their paper down, because the word "virtual" makes everything cooler. Just ask Lori Singer.
An SVM uses the training points to define a hyperplane that separates the data into two categories, choosing the plane that maximizes the margin between the classes. The technique can be extended to multiple categories by training many SVMs and having them vote on the result. We used the SVM implementation in R's e1071 package, which is just a wrapper for the popular C/C++ libsvm library. We also pre-processed the data. The images are 28x28 pixel matrices, so I "de-rez'ed" them to 14x14 by averaging 2x2 clusters of pixels. This gave us 196 features instead of 784. Originally, this was a temporary measure to make the code more tractable for experimenting and learning, but it seemed to help accuracy, so we kept it. This set-up with a 3rd-degree polynomial kernel is how we came in 169th. Our buddy Burton DeWilde used a Random Forest to come in 189th. Our first goal was, of course, to beat Burton.
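Our actual code is in R with e1071, but the pre-processing step is easy to sketch in Python with NumPy and scikit-learn. The `derez` helper below is a hypothetical name for the 2x2-average downsampling, and the tiny random dataset is just to show the shapes; the real run fits on the Kaggle training examples.

```python
import numpy as np
from sklearn.svm import SVC

def derez(images):
    """Downsample flattened 28x28 images to 14x14 by averaging each 2x2 block.

    images: array of shape (n, 784), one flattened image per row.
    Returns an array of shape (n, 196).
    """
    n = images.shape[0]
    x = images.reshape(n, 14, 2, 14, 2)         # split each axis into (block, within-block)
    return x.mean(axis=(2, 4)).reshape(n, 196)  # average over each 2x2 block

# Tiny demo on fake data (real runs use the Kaggle training set).
rng = np.random.default_rng(0)
X = rng.integers(0, 256, size=(10, 784)).astype(float)
y = np.array([0, 1] * 5)

X_small = derez(X)                  # 196 features instead of 784
clf = SVC(kernel="poly", degree=3)  # 3rd-degree polynomial kernel, as in the post
clf.fit(X_small, y)
print(X_small.shape)  # (10, 196)
```

The reshape trick works because splitting each 28-pixel axis into 14 blocks of 2 lets a single `mean` collapse both within-block axes at once.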
Next we extended it to a Virtual SVM. The somewhat tongue-in-cheek rule in machine learning is that the biggest data set always wins. One way to get more data is to synthesize it, that is, to create more from the data you already have. For the character set, we could shift the character images a bit, or rotate them, or somehow distort them. Or even dump word processor fonts in. For example, we decided to shift them up and down, left and right by 1 pixel. We call this a 3x3 jitter, because the center of the picture can move to a new location in a 3x3 box. If we do that, our 42,000 examples become close to 378,000 examples. That's a lot of examples to run over. My laptop can barely do 42,000 in a reasonable time.
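For the curious, here's roughly what a 3x3 jitter looks like in Python (we did ours in R). The `jitter_3x3` name is made up for this sketch; it pads the image with a zero border and takes all nine one-pixel croppings, so each example becomes nine.

```python
import numpy as np

def jitter_3x3(image):
    """Return the 9 copies of a 28x28 image shifted by -1, 0, or +1 pixels
    in each direction (a "3x3 jitter"); edges are zero-filled."""
    padded = np.pad(image, 1)  # 30x30 with a zero border
    return [padded[1 + dy:29 + dy, 1 + dx:29 + dx]
            for dy in (-1, 0, 1) for dx in (-1, 0, 1)]

img = np.zeros((28, 28))
img[10, 10] = 255.0  # a single bright pixel to track
shifted = jitter_3x3(img)
print(len(shifted))        # 9
print(shifted[0][11, 11])  # 255.0 -- the tracked pixel moved one pixel down and right
```

The middle copy (`dy = dx = 0`) is the original image, which is why 42,000 examples multiply out to 9 x 42,000 = 378,000.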
The clever trick with a Virtual SVM is that when you run the SVM on the original data, it should catch most of what is important for prediction in that dataset. DeCoste and Scholkopf found that instead of synthesizing from all your data, you could do it from just the support vectors in the model and get the same or better improvement over the original dataset. The support vectors are what define the boundary of the hyperplane, and thus what defines the model. So we just jittered the support vectors after running the SVM once. This gave us a second-round dataset with about 40,000 examples: the original support vectors plus the data synthesized from them. And the result from that gave us our current score.
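The whole two-pass procedure can be sketched in a few lines of Python with scikit-learn (again, our real code is in R; `virtual_svm` and `jitter_3x3` are hypothetical names, and the random data is only there to make the sketch runnable).

```python
import numpy as np
from sklearn.svm import SVC

def jitter_3x3(flat_img, side=28):
    """All 9 one-pixel shifts of a flattened side x side image (zero-filled edges)."""
    padded = np.pad(flat_img.reshape(side, side), 1)
    return [padded[1 + dy:side + 1 + dy, 1 + dx:side + 1 + dx].ravel()
            for dy in (-1, 0, 1) for dx in (-1, 0, 1)]

def virtual_svm(X, y):
    """Two-pass Virtual SVM in the spirit of DeCoste and Scholkopf:
    1. fit an SVM on the full training set,
    2. jitter only its support vectors,
    3. refit on the support vectors plus their jittered copies."""
    first = SVC(kernel="poly", degree=3).fit(X, y)
    sv_idx = first.support_  # indices of the support vectors
    X2 = np.vstack([v for i in sv_idx for v in jitter_3x3(X[i])])
    y2 = np.repeat(y[sv_idx], 9)  # each support vector yields 9 examples
    return SVC(kernel="poly", degree=3).fit(X2, y2)

# Tiny demo on fake data (a real run uses the 42,000 Kaggle examples).
rng = np.random.default_rng(1)
X = rng.random((30, 784))
y = np.array([0, 1] * 15)
model = virtual_svm(X, y)
print(model.support_vectors_.shape[1])  # 784
```

Because the second fit only sees support vectors and their jitters, the second-round dataset stays far smaller than jittering all 42,000 examples would.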
If you're curious about more details, you can look at our Git repo, which contains our code for doing this. It's pretty straightforward (I hope).