Minimalist MNIST



At some point everybody uses the MNIST dataset (the drosophila of machine learning). My challenge was to use the smallest model possible to classify the MNIST handwritten digits.

The data


Examples of the MNIST training data.

MNIST consists of 60,000 training and 10,000 test handwritten digits: the figures 0 to 9. I reserved 10k of the training data for validation. MNIST is a solved problem; we have essentially hit the Bayes error, where no further improvement is possible due to errors in the data set.

Baseline model

My baseline model is a fairly standard four-layer MNIST classifier:

  1. A convolutional layer with 32 3×3 kernels.
  2. Another convolutional layer with 64 3×3 kernels, followed by 2×2 max pooling and dropout.
  3. A dense layer of 128 nodes, with more dropout.
  4. A ten-node softmax output layer.

This model (and all those that follow) was optimised with Adam and trained until EarlyStopping found that the validation loss had stopped improving. The 1,119,882 parameters of this model yield an average F1 score of 0.9900 (averaged across the ten classes).
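For reference, the baseline can be sketched in Keras roughly as follows. The activation functions, padding and dropout rates are my assumptions, as the post doesn't state them:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_baseline():
    # Sketch of the four-layer baseline classifier described above.
    # Dropout rates, activations and padding are assumed, not stated in the post.
    return models.Sequential([
        layers.Input(shape=(28, 28, 1)),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.25),          # assumed rate
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),           # assumed rate
        layers.Dense(10, activation="softmax"),
    ])

model = build_baseline()
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```

Training with `tf.keras.callbacks.EarlyStopping(monitor="val_loss")` reproduces the stopping criterion described above.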


Training and validation loss. Training stopped when the validation loss stopped improving.


First layer convolution kernels


Failures from the test set. Some of these have been written by someone unfamiliar with Arabic numerals!

Smaller models

Smaller models (fewer parameters) have less capacity to learn. As we reduce the capacity of our models, they become quicker to train (and quicker to make inferences with), but we expect performance to decline.

The strategies I used to reduce the model size:

  • Reducing the number of kernels in each convolutional layer.
    • Halving the number of kernels reduced the parameters from 1120k to 300k, and the average F1 from 0.9900 to 0.9860.
  • Adding an extra convolutional + max pooling + dropout layer.
    • Adding one more layer (with 32 kernels) reduced the parameters from 300k to 115k, and increased the average F1 from 0.9860 to 0.9875. I probably should have done this first!
  • Reducing the number of nodes in the dense layer (the penultimate layer), which dramatically reduces the number of parameters.
    • Reducing the dense layer from 64 to 16 nodes cut the parameters from 115k to 39k, and the average F1 from 0.9875 to 0.9725.
  • Halving the number of kernels in each convolutional layer again.
    • Parameters reduced from 39k to 16k; F1 down from 0.9725 to 0.9670.
  • Removing the final dense layer.
  • Further cutting the number of convolutional kernels.
  • Adding a 1×1 convolutional kernel to drastically cut the number of connections to the final (softmax) layer.

Minimum model with average F1 > 0.9

The smallest model that achieved an average F1 over 0.9 was:

  1. A 4-kernel 3×3 convolutional layer.
  2. A 3-kernel 3×3 convolutional layer with max pooling and dropout.
  3. A 3-kernel 3×3 convolutional layer with max pooling and dropout.
  4. A 1-kernel 1×1 convolutional layer.
  5. A ten-node softmax output layer.

This model used 841 parameters and achieved an average F1 of 0.9005.
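The minimal architecture might look like this in Keras. The padding, pool sizes and dropout rates are my assumptions, so the parameter count of this sketch will differ somewhat from 841:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Sketch of the smallest model with average F1 > 0.9. Padding, pool sizes
# and dropout rates are assumptions; the exact parameter count depends on them.
tiny = models.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(4, (3, 3), activation="relu", padding="same"),
    layers.Conv2D(3, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.25),
    layers.Conv2D(3, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.25),
    layers.Conv2D(1, (1, 1), activation="relu"),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),
])
```

The 1×1 convolution collapses the feature maps to a single channel, which is what keeps the final softmax layer so cheap.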


Examples (failures in red) from the 841 parameter model.

Parameters vs average F1

As shown below, the average F1 score increased as the models gained capacity, up to the point where the architectures I selected could no longer extract new information.


The average F1 achieved for each model.





2017 flights


No prizes for guessing which city I have to be in, and which city I want to be in! A special mention to Montreal for (just) taking me outside of the UK & US!


My 2017 flights, slightly transparent pen so that repeated routes stand out.

Constable and Turner visit Nevada

Some recently unearthed masterpieces from J M W Turner and John Constable's visits to Nevada. The fountains at the Bellagio are older than I thought! See below for a short explanation of neural style transfer.


Turner (left) and Constable must have sat side-by-side at the Bellagio. It was a good compromise between Turner’s love of the sea and Constable’s love of musically synced fountains.

Constable later went up to Tahoe and was rightly inspired by the scenery.


Lake Tahoe. I don’t think Constable’s very good at doing skies when there aren’t many clouds.

Neural style transfer

Neural style transfer uses three images: C, a content image; S, a style image; and G, a generated image (which starts as C plus noise). The loss function combines a content loss and a style loss. The content loss compares the activations produced by C and G at a chosen layer of a pre-trained CNN (here VGG-19). This measures the similarity of each image's content. The style loss compares the correlations between filter activations (the Gram matrices) in G and S. This compares the look and feel of the image. The pixels of G are altered by gradient descent to minimise the combined loss.
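The two losses are easy to sketch in NumPy. Here a_C, a_S and a_G stand for the activations of one chosen VGG-19 layer; the shapes and weighting are made up for illustration:

```python
import numpy as np

def content_loss(a_C, a_G):
    # Mean squared difference between content and generated activations.
    return np.mean((a_C - a_G) ** 2)

def gram_matrix(a):
    # a: activations of shape (height, width, channels).
    # The Gram matrix captures which filters tend to fire together.
    f = a.reshape(-1, a.shape[-1])      # (h*w, channels)
    return f.T @ f

def style_loss(a_S, a_G):
    # Compare filter co-activations rather than the activations themselves.
    return np.mean((gram_matrix(a_S) - gram_matrix(a_G)) ** 2)

# Toy activations for a single layer:
rng = np.random.default_rng(0)
a_C = rng.normal(size=(14, 14, 8))
a_S = rng.normal(size=(14, 14, 8))
a_G = a_C + 0.1 * rng.normal(size=(14, 14, 8))   # G starts as C + noise

total = content_loss(a_C, a_G) + 100.0 * style_loss(a_S, a_G)  # weighted sum
```

In the full method both losses are usually summed over several layers, with a weighting that trades content fidelity against style.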

The content images I used were The Fighting Temeraire tugged to her last berth to be broken up, Turner's 1838 oil painting showing the inevitable march of progress, and Constable's 1820 Salisbury Cathedral from Lower Marsh Close. I accidentally used Constable's worst painting of Salisbury.


Source (by Constable), Content (by me), Generated


Source (by Turner), Content (by me), Generated

Learning to count (regression)

Last time we tried to count the number of white pixels on a black image. Using a classification approach fundamentally limited the counter to the number of classes (i.e. the number of output neurons). To get round this limitation I replaced the output layers with a single output node with ReLU activation.


A simple network trained on the numbers 0 to 9 is able to predict the numbers 0 to 59.

What if we make the images bigger? How high can we count? I tried one 3×3 filter (same padding) followed by three successive 3×3 filters with stride 3×3, which quickly reduced the dimensions down to a small flattened layer. This did OK, but was hardly the 100% accuracy I demand!


Mostly convolutional CNN counts to within a few pixels of the correct answer.

Of course as this is a trivial problem we could cheat:


AveragePooling + one output node = 100% accuracy!
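The cheat can be demonstrated directly in NumPy: global average pooling multiplied by the image area recovers the count exactly, with no learning at all.

```python
import numpy as np

def count_white(img):
    # Global average pooling times the number of pixels recovers the
    # white-pixel count exactly, whatever the image size.
    return img.mean() * img.size

img = np.zeros((100, 100))
img.ravel()[:42] = 1.0        # place 42 white pixels
count_white(img)              # ~42, to floating-point precision
```

A network with an average-pooling layer feeding one output node only has to learn a single scale factor, which is why it hits 100%.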

Next time: A harder problem, where average pooling won’t work.

Learning to count (classifier)

I aim to train a CNN (convolutional neural network) to count (let’s say up to 100), starting with counting the number of white pixels on a black image. I start by making a classifier, trained on images that contain 1, 2, 3, 4 or 5 white pixels.

Results after training for 10 epochs on 12,800 images are pretty good; y is the true value (the label) and y_hat is the prediction.


Classifier trained on 12,800 images. These are from the 500 image test set. Test accuracy: 100%

Sadly this does less well on a test set that includes higher numbers (below). Dr R suggested I adopt a “1, 2, 3, 4, many” approach, which would give good accuracy, but I think would be a bit unsatisfying.


Performance less impressive on higher numbers.

Network architecture

I used one 3×3 convolutional filter, then a couple of dense layers. I expected the filter to converge to one high value surrounded by low values: the perfect shape for picking out white pixels surrounded by dark ones. This wasn't what I found, as shown by these three examples, which have been stretched so that abs(max(W)) = 100.

I guess that it doesn’t really matter which convolution you use, provided you understand the output!


Making Your Mind Up


Judging Eurovision: how to fairly compare incomplete ranks & scores

The Eurovision Song Contest is an expression of centuries of European geo-politics and rivalry disguised as a friendly song competition, bizarrely also featuring Israel, Azerbaijan and Australia.

A friend (definitely not me) recently hosted a Eurovision party, where merrymakers completed official BBC Eurovision 2017 Party Pack Grand Final scorecards. The question now: how do we compare our scores against the final results?

Eurovision scorecards

Completed BBC scorecards, each using a different scale and with frequent missing scores.

Seven people have entered the competition: SB, JMS, Chris, Lea, Jon, M1 & M2. M1 and M2 are mystery people who were so embarrassed by the concept that they couldn’t bring themselves to write their names.


The most obvious method of scoring is to compare the top three scores of each person. Sadly only one person has a match so we can declare him the winner. Hurray.

Truth     SB          JMS          Chris    Lea     Jon       M1       M2
Portugal  Azerbaijan  Portugal     Romania  Sweden  Germany   Romania  Sweden
Bulgaria  Sweden      Armenia      Ukraine  Spain   Portugal  Moldova  Moldova
Moldova   Portugal    The N’lands  Belgium  Norway  Israel    Norway   Norway

This is not a very good method because a) it only considers 3 out of the 26 results, b) it only uses the rankings and ignores the scores.

Winner: JMS (me). Hurray!

Top three picks in ground truth top five?

It feels fairer to see how many of the true top three songs were in each person's top five. This is similar to the scoring of the classification aspect of the ImageNet competitions.

     In top 5           Points
SB   Portugal, Moldova
JMS  Portugal
Lea  Portugal
Jon  Portugal
M1   Moldova
M2   Moldova
I feel that this is fairer because we have widened the comparison to consider more positions; however, it still only uses the rankings.

Winner: SB
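This scoring rule is simple to make precise. A small helper, using SB's picks as the example; SB's top three are known from the first table, but the last two picks here are hypothetical placeholders just to fill out a top five:

```python
def top3_in_top5(truth_top3, picks_top5):
    # How many of the true top three appear in a person's top five.
    return len(set(truth_top3) & set(picks_top5))

truth = ["Portugal", "Bulgaria", "Moldova"]

# SB's first three picks come from the scorecards; the final two entries
# are hypothetical placeholders.
sb_top5 = ["Azerbaijan", "Sweden", "Portugal", "Moldova", "Italy"]

top3_in_top5(truth, sb_top5)   # 2: Portugal and Moldova
```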

Series analysis

To compare the number of votes against the final position, I first normalised each person's scorecard so that their average score was 10. This seemed like a fair way of dealing with missing values.
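The normalisation can be sketched as: scale each scorecard so that the mean of the scores that were actually filled in is 10, leaving missing entries missing. The example scorecard here is illustrative:

```python
import numpy as np

def normalise(scores):
    # Scale a scorecard so the mean of the non-missing scores is 10.
    # Missing scores (np.nan) stay missing.
    s = np.asarray(scores, dtype=float)
    return s * (10.0 / np.nanmean(s))

card = [8, 5, np.nan, 2, 9]   # one illustrative scorecard, out of 10
normed = normalise(card)
```

Scaling rather than shifting keeps the ratios between each person's scores intact, which matters for the correlation comparison that follows.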

Eurovision 2017 normalised scores

Fairly poor correlation between the entrants and the real final score (GT score).

We see that the real scores are heavily weighted towards the top few countries. This is clearer if we only look at the final scores (GT score), the jury scores and the public vote. You can see how the jury vote and the public vote are combined to make the final score:


The final score (GT score), jury score (GT jury) and public vote (GT vote).

Due to the method by which the Eurovision final score is calculated (two scores are added, each of which is itself a vote), I don't think the final score should be considered a fair representation of how good a song is: we shouldn't conclude that Portugal's entry was twice as good as Moldova's.

Next time, we’ll consider some better methods for judging our scorers. Hopefully this will show that I won!

The jamesgeo well-travelled map



After years of deliberation, months of data wrangling and several reviews by family, I have finally finished version 1 of the jamesgeo well-travelled map. This map aims to answer the question “how well travelled am I?”. Unsurprisingly the answer is not very.

The map is made by taking every point on the Earth's surface that I've been to and buffering it by 100 km, without letting the buffer cross any international borders unless I did. I chose 100 km because I'd argue that culture, geography, geology etc. change significantly over about this distance.
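The buffering rule can be roughly sketched in pure Python, ignoring the border-clipping step (which needs country polygons): a point counts as "travelled" if it lies within 100 km great-circle distance of any visited point. The coordinates below are illustrative:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two (lat, lon) points, in kilometres.
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def well_travelled(lat, lon, visited, radius_km=100.0):
    # True if (lat, lon) is within radius_km of any visited point.
    return any(haversine_km(lat, lon, v_lat, v_lon) <= radius_km
               for v_lat, v_lon in visited)

visited = [(51.5074, -0.1278)]            # e.g. London
well_travelled(51.75, -1.26, visited)     # Oxford, ~80 km away: True
```

The real map additionally clips each 100 km buffer against national borders, which needs a GIS library and country polygons rather than a distance check.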


Well-travelled map, showing the Earth’s surface that I’ve experienced.


Well-travelled map, without the context, paints a somewhat dismal view.