
At some point everybody uses the MNIST dataset (the drosophila of machine learning). My challenge was to classify the MNIST handwritten digits with the smallest model possible.

The data


Examples of the MNIST training data.

MNIST consists of 60,000 training and 10,000 test images of handwritten digits: the figures 0 to 9. I reserved 10,000 of the training images for validation. MNIST is essentially a solved problem; the best models are close to the Bayes error, the point where no further improvement is possible because of errors in the data set itself.
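The hold-out split described above can be sketched in a few lines. This is only an illustration: the post does not say how the 10,000 validation images were chosen, so the random shuffle, the seed and the helper name `train_val_split` are all assumptions.

```python
import random


def train_val_split(n_samples, n_val, seed=0):
    """Return (train_idx, val_idx), a shuffled, disjoint partition of range(n_samples).

    Assumption: a uniform random split; the original post does not state the method.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # The first n_val shuffled indices become the validation set.
    return indices[n_val:], indices[:n_val]


train_idx, val_idx = train_val_split(60_000, 10_000)
print(len(train_idx), len(val_idx))
```

The index lists can then be used to slice the image and label arrays, leaving the 10,000 test images untouched until the final evaluation.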

Baseline model

My baseline model is a pretty standard four-layer MNIST classifier:

  1. A convolutional layer with 32 3×3 kernels.
  2. Another convolutional layer with 64 3×3 kernels, followed by 2×2 max pooling and dropout.
  3. A dense layer of 128 nodes and more dropout.
  4. A ten-node softmax output layer.

This model (and all those following) was optimised with Adam and trained until EarlyStopping found that the validation loss had stopped improving. The 1,119,882 parameters of this model yield an average F1 score of 0.9900 (averaged across the ten classes).
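For readers unfamiliar with the metric, here is a minimal sketch of an F1 score averaged across classes. It assumes the "average F1" is an unweighted (macro) mean of the per-class scores. The post does not say which tool computed it, and the function name `macro_f1` and the toy labels are purely illustrative.

```python
def macro_f1(y_true, y_pred, n_classes=10):
    """Unweighted mean of the per-class F1 scores."""
    f1_scores = []
    for c in range(n_classes):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)  # true positives
        fp = sum(1 for t, p in pairs if t != c and p == c)  # false positives
        fn = sum(1 for t, p in pairs if t == c and p != c)  # false negatives
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / n_classes


# Toy three-class example (not MNIST data).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
score = macro_f1(y_true, y_pred, n_classes=3)
print(round(score, 4))  # 0.8222
```

Because every class contributes equally to the mean, a model that fails badly on a single digit is penalised even when its overall accuracy stays high.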


Training and validation loss. Training stopped when the validation loss stopped improving.
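The stopping rule can be sketched without any framework. This is a simplified illustration of patience-based early stopping, not the exact callback used for these models: the `patience` and `min_delta` values are assumptions, and the loss values are invented for the example.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the index of the epoch at which training would stop.

    Training stops once the validation loss has failed to improve on its
    best value by more than min_delta for `patience` consecutive epochs.
    """
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss      # new best: reset the patience counter
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return len(val_losses) - 1  # never triggered: ran every epoch


# Invented validation losses: the best value occurs at epoch 3.
losses = [0.30, 0.12, 0.08, 0.07, 0.075, 0.072, 0.08, 0.09]
stop_epoch = early_stopping_epoch(losses, patience=3)
print(stop_epoch)  # 6
```

A larger `patience` tolerates noisier validation curves at the cost of a few extra epochs of training.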


First layer convolution kernels


Failures from the test set. Some of these have been written by someone unfamiliar with Arabic numerals!

Smaller models

Smaller models (fewer parameters) have less capacity to learn. As we reduce the capacity of our models, they become quicker to train (and quicker to make inferences), but we expect performance to decline.
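It helps to see where the parameters of a network like the baseline actually sit. The sketch below counts them layer by layer, assuming 28×28 single-channel inputs, unpadded stride-1 convolutions and a bias on every kernel and node. None of these settings are stated in the post, so the totals under these assumptions need not match the reported figure exactly.

```python
def conv2d_params(kernels, kernel_size, in_channels):
    """Weights plus one bias per kernel for a square 2-D convolution."""
    return kernels * (kernel_size * kernel_size * in_channels + 1)


def dense_params(nodes, inputs):
    """Weights plus one bias per node for a fully connected layer."""
    return nodes * (inputs + 1)


size = 28                                      # assumed input width and height
conv1 = conv2d_params(32, 3, in_channels=1)    # 320
size = size - 3 + 1                            # unpadded 3x3 convolution: 26
conv2 = conv2d_params(64, 3, in_channels=32)   # 18,496
size = (size - 3 + 1) // 2                     # second convolution, then 2x2 pooling: 12
dense = dense_params(128, size * size * 64)    # 1,179,776
output = dense_params(10, 128)                 # 1,290

total = conv1 + conv2 + dense + output
dense_share = dense / total
print(conv1, conv2, dense, output)
print(f"dense layer share: {dense_share:.1%}")
```

Under these assumptions almost all of the weights connect the flattened feature map to the dense layer, so shrinking that feature map (with extra pooling) or the dense layer itself is where most of the savings are.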

The strategies I used to reduce the model size:

  • Reducing the number of kernels in each convolutional layer.
    • Halving the number of nodes in each layer reduced the parameters from 1120k to 300k and the average F1 from 0.9900 to 0.9860.
  • Adding an extra convolutional + max pooling + dropout layer.
    • Adding one more layer (with 32 kernels) reduces the parameters from 300k to 115k, and increases the average F1 from 0.9860 to 0.9875. I probably should have done this first!
  • Reducing the number of nodes in the dense layer (the penultimate layer), which dramatically reduces the number of parameters.
    • Reducing the dense layer nodes from 64 to 16 reduces the number of parameters from 115k to 39k. This reduces the average F1 from 0.9875 to 0.9725.
  • Halving the number of kernels in each convolutional layer again.
    • This reduces the parameters from 39k to 16k, and the average F1 from 0.9725 to 0.9670.
  • Removing the final dense layer
  • Further cutting the number of convolutional kernels
  • Adding a 1×1 convolutional layer, to drastically cut the number of connections to the final (softmax) layer.
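The effect of the 1×1 convolution in the last strategy can be illustrated with a little arithmetic. The sketch below compares feeding a feature map straight into the softmax layer with first squeezing it to a single channel. The 5×5×3 feature map is a hypothetical size chosen for illustration, not one taken from the models above, and biases are assumed on every layer.

```python
def softmax_layer_params(inputs, classes=10):
    """Weights plus one bias per class for a dense softmax layer."""
    return classes * (inputs + 1)


def pointwise_conv_params(in_channels, out_channels=1):
    """A 1x1 convolution has one weight per input channel, plus a bias, per kernel."""
    return out_channels * (in_channels + 1)


height, width, channels = 5, 5, 3  # hypothetical feature map size

# Option A: flatten all channels and connect them directly to the softmax layer.
direct = softmax_layer_params(height * width * channels)  # 10 * (75 + 1) = 760

# Option B: collapse to one channel with a 1x1 convolution, then flatten.
squeezed = pointwise_conv_params(channels) + softmax_layer_params(height * width)  # 4 + 260 = 264

print(direct, squeezed)  # 760 264
```

The 1×1 layer costs only a handful of weights, yet divides the number of inputs to the softmax layer by the channel count.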

Minimum model with average F1 > 0.9

The smallest model that achieved an average F1 over 0.9 was:

  1. 4 kernel 3×3 convolutional layer.
  2. 3 kernel 3×3 convolutional layer with max pooling and dropout.
  3. 3 kernel 3×3 convolutional layer with max pooling and dropout.
  4. 1 kernel 1×1 convolutional layer.
  5. 10 node softmax output layer.

This model used 841 parameters and achieved an average F1 of 0.9005.


Examples (failures in red) from the 841 parameter model.

Parameters vs average F1

As shown below, the average F1 score increased as the model gained capacity, until the point where the architecture I selected could no longer extract new information.


The average F1 achieved for each model.