Computer Vision Deep Learning Primer with Keras and Python

Jonathan Fraine
38 min read · May 1, 2023


An extended, detailed primer for the Keras Deep Learning Framework and Convolutional Neural Networks with Examples in Python for Computer Vision

Photo by Ion Fet on Unsplash

Intro to Machine Learning with Keras

Machine learning is the flavour of AI in which we train an algorithm to statistically infer a mapping between a set of features and a set of labels. The features can be tables of data, images, time series, language strings (sentences), etc. The labels can be integers (often: classification), strings (NLP), images (generative AI), or floats (real numbers: regression). The “labels” can also be the features for clustering, dimensionality reduction, denoising, or generative AI.

Machine learning is statistical inference towards predictions on as-yet-unseen data, through the training of parameters for algorithms that probe the feature-space environment (e.g., how pixels are related to each other). The inference most often aims to predict new labels or values from data that the algorithm has never seen. Accurately predicting unseen data is called “generalisability”. If the model does not predict well on unseen data, then it is said to underfit or overfit.

Deep learning is a variant of machine learning that trains sequences of linear matrices and non-linear (activation) functions to map features onto labels. Models derived from deep learning are referred to as neural networks (NNs). Training a deep learning model means updating the elements of those matrices — the weights and biases of the linear equations — so that the output of the network matches the provided labels; the parameters of the non-linear transformations are updated at the same time as the linear matrices.

Keras is a deep learning library that provides a high-level interface to GPU- and CPU-optimised linear matrix mathematics. It was designed by Francois Chollet — a data scientist at Google at the time — and released in 2015 as a stand-alone package that could integrate several existing deep learning libraries: Theano, TensorFlow, and CNTK.

Keras became prominent as a high-level API interface because it hid most of the complicated NN logic and syntax “under the hood”. The API allows developers to “add a convolutional layer” (Keras) instead of “defining what, where, and how a convolutional layer behaves …” (TensorFlow).

Later, TensorFlow version 1.2 (TF-1.2) shipped an optimised version of the Keras API built into TensorFlow as `tf.keras` — an implementation of Keras specifically optimised for the TensorFlow framework.

Convolutional Neural Networks (CNNs)

CNNs are a special application of neural networks that work with convolutional kernels as linear matrices. Much of the rest of the neural network machinery is the same for CNNs and non-CNNs: linear matrices, non-linear transformations, backpropagation, etc.

In particular, a CNN applies its linear matrices to the data through convolutional integration, instead of plain linear matrix multiplication. Convolutional integration follows this procedure (a minimal numpy sketch follows the list):

  1. Multiply a kernel (“filter”) by a subset of the feature set:
    the subset is a window of the image or time series.
  2. Sum the result of that multiplication over the window.
  3. Move (or “slide”) the window to the next pixel or timestamp.
  4. Repeat for all windows by sliding through the pixels or timestamps.
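
To make these steps concrete, below is a minimal, illustrative numpy sketch of the multiply-sum-slide loop — a stride of one, no padding, and no kernel flip (as is conventional for CNN frameworks). Deep learning libraries implement the same operation far more efficiently on GPUs.

import numpy as np

def convolve2d(image, kernel):
    """Minimal 'valid' 2D convolution: multiply, sum, then slide the window."""
    img_rows, img_cols = image.shape
    k_rows, k_cols = kernel.shape
    out_rows = img_rows - k_rows + 1
    out_cols = img_cols - k_cols + 1
    output = np.zeros((out_rows, out_cols))
    for row in range(out_rows):
        for col in range(out_cols):
            window = image[row:row + k_rows, col:col + k_cols]  # 1. subset (window)
            output[row, col] = np.sum(window * kernel)          # 1+2. multiply, then sum
    return output                                               # 3+4. the loops slide the window

# Example: a 3x3 vertical-edge kernel over a random 28x28 "image"
image = np.random.rand(28, 28)
kernel = np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]], dtype=float)
feature_map = convolve2d(image, kernel)  # shape: (26, 26)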

The benefit of convolutional neural networks is that they take advantage of intrinsic (often physical) correlations between neighbouring points in the features: pixel-to-pixel correlations or timestep-to-timestep correlations. Another benefit is that they can be passed over an image or time series of any size by subsampling it into the predefined window size. Likewise, they can be rapidly parallelised on GPUs.

CNNs have created the backbone for the vast improvements in deep learning functionality since ~2012 with AlexNet.

  1. CNNs can classify images, detect objects, and segment images orders of magnitude faster than humans — and often more accurately
  2. CNNs are at the core of many generative models that create images or sounds
  3. CNNs are the backbone of anomaly detection systems for time series and image data sets
  4. CNNs are the underlying functionality for automated driving and augmented reality (e.g., Human Pose Estimation).

MNIST Dataset

The MNIST data set consists of 70k hand-written digits (0, 1, …, 9), created in 1998 as a combination of two other (older) datasets. It has been used to train and benchmark new machine learning algorithms; for example — but not detailed here — support vector machines (SVMs), random forests (RFs), dense neural networks (DNNs), convolutional neural networks (CNNs), etc.

Each of the 70k images (“feature vectors”) contains a single hand-written digit near the centre of a 28x28-pixel greyscale image. The pixel values range from 0 (black) to 255 (white).

Several machine learning (ML) methods have been used to improve the prediction (labelling) of handwritten digits, as a benchmark of basic algorithmic capability: PCA, SVM, XGB, etc. These same algorithms can be used for face detection or image classification as well. As such, CNNs that perform well on image classification are valuable for many historically difficult computer vision tasks that held robust ML applications back for decades.

Loading the Keras API

To load the Keras library, we will call the API built into the TensorFlow framework library.

For this example, we will take advantage of the layers, callbacks, utils, datasets, and models modules:

  1. layers to include and update the structure of the neural network:
    e.g., Conv2D, MaxPooling, Dropout, etc
  2. callbacks for using the EarlyStopping object during training to avoid overfitting
  3. utils for categorical classification data preparation
  4. datasets to load the datasets for our training process
  5. models to initialise the TensorFlow-Keras framework in the context of a neural network model
import numpy as np
from matplotlib import pyplot as plt
from tensorflow.keras import layers, callbacks, utils, datasets, models

# Set a random seed to maintain similar values
np.random.seed(42)

Load MNIST Data Set and Preprocess It

The expected properties for MNIST include 10 classes and 70k greyscale 28x28 images with values ranging from 0 to 255.

num_classes = 10 # 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
img_shape = 28, 28 # 28 rows & 28 columns -- greyscale (the colour channel is added below)

# Load data from keras
# x = inputs and y = outputs

(x_train, y_train), (x_test, y_test) = datasets.mnist.load_data()

# Numerical operations (e.g. CNNs) prefer values between 0, 1 or -1, 1
# We must scale the values from 255 (default inputs) to 0-1 (CNN inputs)
# Note that we also convert the arrays to float32,
# this could be float16/float8 for speed or float64 for accuracy, etc
x_train = x_train.astype("float32") / 255
x_test = x_test.astype("float32") / 255

# Conv2D-nets require 4D inputs: batch, row, col, color
# Unfortunately, MNIST does not provide a color dimension: it's greyscale
# Therefore, we add the color dimensions to represent greyscale
COLOR_DIM = -1
x_train = np.expand_dims(x_train, axis=COLOR_DIM)
x_test = np.expand_dims(x_test, axis=COLOR_DIM)

Display samples of the MNIST data set with input labels (green)

We created a helper function to plot samples before and after training.
See the plot_samples function in the Colab notebook.

Confirming that the data set is as we expect it:

num_classes = len(np.unique(y_train))  # 10
img_shape = x_train[0].shape # 28, 28, 1

One-Hot Encoding

Converting class labels 0,1,…,9 to multiclass vector [0 0 0 0 0 0 1 0 0 0]

Currently, the class labels are integer values from 0–9. In contrast, neural networks are sequences of linear matrix operations with non-linear transforms — called “activation functions” — interspersed. Neural networks were designed to work with values from 0 to 1 (or -1 to 1). As such, it is beneficial to provide outputs that are vectorised (1-d matrices) with binary integer values: [0, 1].

A vectorised output (the NN's expected format) that contains integers of either 0 or 1 is called a one-hot encoding. The word hot refers to electrical engineering slang for “the current is live”. As such, the one-hot encoding provides a set of gates: 0 for cold and 1 for hot.

For example, with MNIST: a 1 (hot) in the 7th element would mean that the CNN algorithm is predicting that the input digit is a six (“6”). Note that the one-hot encoding is zero-indexed or “starts from zero”. As such, if there is a 1 in the 4th element, then the CNN is predicting that the input digit is a three (“3”).

print("Sample of labels before one-hot encoding")
print(np.random.choice(y_train.ravel(), size=5))

# tf.keras function to transform integer labels to one-hot encodings
y_train = utils.to_categorical(y_train, num_classes)
y_test = utils.to_categorical(y_test, num_classes)

print("Sample of labels after one-hot encoding")
print(y_train[np.random.choice(range(y_train.shape[0]), size=5)])
Sample of labels before one-hot encoding
[3 3 8 4 5]

Sample of labels after one-hot encoding
[[0 0 0 1 0 0 0 0 0 0]
 [0 0 0 1 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 1 0]
 [0 0 0 0 1 0 0 0 0 0]
 [0 0 0 0 0 1 0 0 0 0]]

Confirm that the data is as expected

Confirm that all 60k (train) and 10k (test) images (i.e., the input data: “x”) are properly scaled (0–1), and that they have square shapes of (28, 28) pixels per image — with 1 color: (28, 28, 1)

print("Confirm Shape")
print(f"x_train shape: {x_train.shape}") # (60000, 28, 28, 1)
print(f"x_test shape : {x_test.shape}") # (10000, 28, 28, 1)

print("Confirm Samples")
print(f"train samples: {x_train.shape[0]}") # 60000
print(f"test samples : {x_test.shape[0]}") # 10000

print("Confirm Train Range")
print(f"x_train min: {x_train.min()}") # 0.0
print(f"x_train max: {x_train.max()}") # 1.0

print("Confirm Test Range")
print(f"x_test min: {x_test.min()}") # 0.0
print(f"x_test max: {x_test.max()}") # 1.0

Confirm that the label (“y”) values are one-hot encoded, with shape (number of samples, number of classes), and that their values range from zero to one.

print("Confirm Shape")
print(f"y_train shape: {y_train.shape}") # (60000, 10)
print(f"y_test shape : {y_test.shape}") # (10000, 10)

print("Confirm Samples")
print(f"train samples: {y_train.shape[0]}") # 60000
print(f"test samples : {y_test.shape[0]}") # 10000

print("Confirm Train Range")
print(f"y_train min: {y_train.min()}") # 0.0
print(f"y_train max: {y_train.max()}") # 1.0

print("Confirm Test Range")
print(f"y_test min: {y_test.min()}") # 0.0
print(f"y_test max: {y_test.max()}") # 1.0

Understanding Convolutional Neural Networks

A CNN is often configured as a sequence of convolutional layers, batch normalisation layers, pooling layers, and activation layers. For classification, the network output must be flattened (using the Flatten layer) and transformed into the output layer shape, using the Dense layer. Furthermore, we will use the Dropout layer to randomly turn off pathways in an effort to avoid overfitting — memorising the training set and not generalising well — a common issue with NNs, especially with Dense layers.

Note that there is always an input layer and an output layer — corresponding to the input images and the output labels. In addition, there can also be any number of hidden layers: all of the layers and operations between the input and output layers. The hidden layers are the meat and connective tissue of any NN.

Layers of a CNN

A convolutional layer provides a stack of convolutional kernels (i.e., small matrices). Each kernel is also referred to as a “filter”. A Conv2D layer (set of filters) can be parameterised by

  • The number of filters or kernels
  • The size of each kernel: 1D for time-series, 2D for images, etc
  • The padding size — i.e., how many zeros are on the outside — same logic as with an FFT
  • The stride — the number of pixels to slide between convolutions across the image.
    [Note that a stride larger than one (1) reduces the size of the output feature space by the stride size in each direction. This functionality is related to pooling (see below).]
  • There are many other Conv2D parameters, but these are the options most often modified.

A batch normalisation layer takes the output of the convolutional layer and renormalises it closer to a standard Normal distribution — a Gaussian with mean zero and width one. The purpose of this is to control the output of each convolutional layer or stack throughout the NN to avoid exploding or vanishing gradients — a common issue with deeper neural networks. Before the invention of Batch Normalisation, many CNNs (and other NNs) would generate hidden layer outputs (feature maps) that either summed too close to zero or too close to infinity for all inputs.

  • Feature maps are the results of the images after the convolutions. If we have 32 filters in a convolutional layer, then the layer generates 32 feature maps as convolved images.
  • Batch normalisation trains far fewer parameters than a convolutional or dense layer — in Keras, only a learned scale and offset per feature map (plus non-trainable running statistics) — so it is comparatively cheap; a sketch of a Conv block with batch normalisation follows this list.
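
The model we build later in this article does not use batch normalisation, but a typical Conv block with it might look like the following sketch (illustrative only, not part of the tutorial model):

from tensorflow.keras import layers, models

# A hypothetical Conv block with batch normalisation (not used in the tutorial model below)
conv_block = models.Sequential([
    layers.Conv2D(32, kernel_size=(3, 3), padding='same', use_bias=False),
    layers.BatchNormalization(),  # renormalise the feature maps per batch
    layers.Activation('relu'),    # non-linearity applied after normalisation
])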

A pooling layer with a pool size of two (2) transforms a 100 x 100 feature map (or input image) into a 50 x 50 output feature map. The purpose of the pooling layer is to reduce the number of features that subsequent convolutional (or other) layers must process — i.e., reducing the dimensionality and complexity of the data processed by the deeper layers of the network.

The most common forms of pooling are MaxPool and AvgPool:

  • MaxPool takes a set of pixels (e.g., 2x2 views) and returns the maximum of the set. This is believed to have the effect of returning the features in the image that maximise the activation deeper in the network: i.e., maximise the predictability.
  • AvgPool takes a set of pixels (e.g., 2x2 views) and returns the average of the set. This has the effect of averaging over an image or feature set to reduce its computational size and complexity.

I prefer AvgPool lower in the stack — closer to the image input layer — and MaxPool deeper in the stack — closer to the label output layer. The use and location of pooling layers are up to the NN architect (you).
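
As a concrete illustration of what a 2x2 MaxPool does to a single feature map, here is a small numpy sketch (illustration only; Keras applies the same idea per feature map and per batch on the GPU):

import numpy as np

def max_pool_2x2(feature_map):
    """Illustrative 2x2 max pooling: keep the maximum of each 2x2 window."""
    rows, cols = feature_map.shape
    trimmed = feature_map[:rows - rows % 2, :cols - cols % 2]  # drop odd edge rows/cols
    windows = trimmed.reshape(rows // 2, 2, cols // 2, 2)      # group into 2x2 windows
    return windows.max(axis=(1, 3))                            # maximum per window

x = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool_2x2(x))  # [[ 5.  7.] [13. 15.]] -- a 4x4 map becomes 2x2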

Reducing the dimensionality serves three functions:

  1. Pooling acts to regularise the network; i.e., it limits the network's ability to overfit by latching onto narrow features in the training data. Without regularisation, all NNs lose some of their ability to generalise — to predict well on the (unseen) testing/validation data.
  2. Pooling is believed to focus the deeper layers of the NN on larger-scale features. Think of an image of a car next to a lake. One set of deep features could classify the car, while others classify the lake.
  3. Pooling reduces the number of NN parameters that must be fit, which reduces the time it takes to train and predict from a NN.

Head node: from features to predictions

The head of the neural network is the final set of layers that transforms the output of the last Conv2D layer (in our case) from the backbone into the NN’s prediction, which we will compare to the labels.

In our convolutional network, the head begins with a Flatten layer. The Flatten layer takes an NxM feature map (or a stack of them) and converts it into a single flat vector. For example, if the last Conv2D layer outputs a 5x5 feature map, then the Flatten layer will output a 25-element vector: i.e., 5 times 5.

To transform the output of the Flatten layer — whatever its length — into the 10-element vector that matches the one-hot encoded labels, we use one or more Dense layers. A Dense layer is a linear matrix (or set of matrices) that upscales or downscales the remaining convolved feature values into the final output labels. For example, if the last Conv2D layer outputs a 3x3 feature map, then the Flatten layer will output a 9-element vector, and the Dense layer will train a 9x10 matrix (plus 10 biases) to transform those 9 features into the 10-element, one-hot encoded prediction.
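
As a worked example tied to the model we build below: its last MaxPool layer outputs 7x7x64 feature maps, so the Flatten and Dense layers contribute the following counts (they match the model.summary() output further down):

# Worked example for the CNN built below (last MaxPool output: 7 x 7 x 64)
flattened_size = 7 * 7 * 64               # 3136 values leave the Flatten layer
dense_params = flattened_size * 10 + 10   # a 3136x10 weight matrix plus 10 biases
print(flattened_size, dense_params)       # 3136 31370 -- matches model.summary() below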

The last pre-output operation is often to regularise the output layers using a Dropout layer. A Dropout layer randomly chooses a set of elements in the matrices to turn off (set to zero). This technique seeks to minimise the number of nonzero outputs (per Dense layer) that are passed onto the next section of the NN -- or to the output layer in our case.

The motivation for dropout is that the network does not understand reality; it only understands matrix transformations. As such, without dropout, we allow the NN to operate on all of the feature maps at the same time. With a dropout rate of 0.5, we only allow the NN to train on half of the outputs at a time.

The fitting algorithm automatically draws a new random number every batch and allows the NN to train on a different 50% of the output vector(s). If a set of neurons from the dense layer are unnecessary for accurate prediction, then Dropout is likely to reveal this, and the algorithm is likely to keep the elements in the matrix (neurons) as zero.

Dropout layers can be used anywhere in the network, including the input, hidden layers, and output. Dropout is most often used with Dense layers because Dense layers very easily overfit the data by using all of the input values to make predictions. Dropout adds stochasticity to the NN linear matrix training.
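
A tiny numpy sketch of the idea (the activations below are made-up stand-ins for a layer's outputs):

import numpy as np

rng = np.random.default_rng(42)
activations = rng.random(10)                      # outputs of a hypothetical Dense/Flatten layer
keep_mask = rng.random(10) > 0.5                  # keep roughly 50% of the neurons this batch
dropped = np.where(keep_mask, activations, 0.0)   # the rest are zeroed for this batch
# Keras additionally rescales the kept activations by 1 / (1 - rate) during training,
# so the expected magnitude of the layer output stays the same.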

Activation (function) layers are the secret sauce of neural networks. At a base level, a neural network is a stack of linear transformations (matrices) with non-linear functions interspersed. The non-linear functions are the activation layers. Without the activation layers, the neural network would only be a large sequence of linear transformations, which could be collapsed into a single matrix — not matrices, just a single matrix. The activation functions transform each sequence of linear matrices into a new feature space that cannot be collapsed into a single matrix.

Activation functions come in many forms. The most common activation functions are related to the ReLU activation (see below). In contrast, the most classical activation functions are the sigmoid and the tanh (hyperbolic tangent) functions. Because early neural network research attempted to approximate a “neuron” as a smooth function from an input (x) to an output (y: 0–1), the early researchers adopted the sigmoid function, which takes the form: 1 / (1 + exp(-x)). It takes as input any real value and only outputs values between 0 and 1.

Important: It may be helpful to realise that, in all of the activation functions, the input data (x) is the images or deeper feature maps; while the w and b are the weights and biases of the neural network. As such, in all of the matrices and functions, the x is static with respect to training or fitting. The parameters that the algorithm trains are the weights and biases, which were not represented in the sigmoid function above. The weights and biases are both the values of the linear matrices and the parameters of the activation functions.

The .fit or .train algorithm uses a technique called backpropagation (see below) to modify these weights and biases. It is used to update the slopes (weights: w) and offsets (biases: b) of the Sigmoid, TanH, or ReLU functions. It also shapes the convolutional kernels by updating the values within the kernel matrices. In the training phase, the set of w and b values are the parameters being trained. During the prediction phase, w and b values are held constant to predict new labels from new images.

Activation Functions

Sigmoid or “logistic function”

  • At large negative values (x→−∞), the sigmoid asymptotically approaches zero
  • At large positive values (x→+∞), the sigmoid asymptotically approaches one
  • The sigmoid function approximates a straight line close to x = b. With default values of b = 0 and w = 1 the y-intercept is 0.5
  • In the neural network context, there are weights (slopes) and biases (intercepts) that are trained (i.e., “fit”) by the network training algorithm (backpropagation). It modifies the slopes/intercepts of the matrices and functions, which also changes how fast the function falls to 0 or 1 for the sigmoid.

Softmax

A softmax layer is very similar to a sigmoid layer, extended to vectorised data. It is most often used as the output layer of a multi-class classification neural network such as ours. The softmax provides a sigmoid function per dimension of the output. In the case of one-hot encoding, it establishes which of the num_classes output vector elements are most logically the correct prediction.

The softmax computes a weighted, exponential average per element of the output vector (one-hot encoding): softmax(x_i) = exp(x_i) / Σ_j exp(x_j). The sigmoid (logistic) function is the special case of the softmax for a single, binary classification: True/False.

Tanh

Researchers then realised that the sigmoid function could not add a penalty to errant features: the output was either “yes” or “unknown/no”, as opposed to “yes”, “no”, and “unknown”.

As such, they adapted the tanh function for activation:

  • At large negative values (x→−∞), the tanh asymptotically approaches -1 (a penalty)
  • At large positive values (x→+∞), the tanh also asymptotically approaches one (same as the sigmoid)
  • The tanh function also approximates a straight line close to x = 0, with a default y-intercept of 0, which symbolises “unknown”.
  • In the neural network context, there are weights (slopes) and biases (intercepts) that are trained (i.e., “fit”) by the backpropagation algorithm. The algorithm modifies the slope/intercept of the activation function and linear matrices, which also changes how fast the function falls to -1 or 1 for the tanh.

ReLU: Rectified Linear Unit

Technically, a linear function can be used as an activation function. Unfortunately, a linear function with linear matrices would similarly collapse the neural network into a single (very complicated-looking) matrix.

A “rectified linear unit” (ReLU) is linear from zero to infinity (all non-negative inputs) and outputs exactly zero for all values below the bias value. The slope (weight) and intercept (bias) associated with the ReLU are trained (fit) during the iterative backpropagation process.

The rectified linear unit is simple, but dramatically faster to compute than the sigmoid or tanh functions. As such, ReLU simplifies the mathematics inside the neural network and allows neural architecture experimentation to focus outside the activation function. Even so, it still provides the necessary pinch of non-linearity to avoid collapsing the NN into a single matrix — it still transforms the feature maps into a new feature space.

Leaky ReLU, ELU, PrLU, SeLU

Adaptations of ReLU come in many forms that are either less strictly linear than the ReLU (while still linear at large x), less static below zero (not constant as x→−∞), or both.

Although I find ELU (exponential linear unit) more elegant, my favourite ReLU is the LeakyReLU because it is simple and functional: two straight lines pasted together.

Note that w_neg is usually provided as a hyperparameter as opposed to a weight to be trained. If so, then it is provided before training and does not change over the lifetime of the NN. Moreover, the w_neg is usually a tiny number, such as 10^−4 (0.0001). Keeping w_neg small allows the LeakyReLU to leak or penalise values less than the bias (offset), as opposed to “flattening” them. Too large a w_neg would leak too much, while too small a w_neg might not leak enough.
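
For reference, here are minimal numpy versions of the activation functions discussed in this section (the weights w and biases b are omitted for clarity; in a real NN they are folded into the surrounding linear layers, and these sketches are illustrative rather than the framework implementations):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))          # squashes any real value into (0, 1)

def softmax(x):
    e_x = np.exp(x - np.max(x))              # subtract the max for numerical stability
    return e_x / e_x.sum()                   # normalised outputs that sum to 1

def tanh(x):
    return np.tanh(x)                        # squashes any real value into (-1, 1)

def relu(x):
    return np.maximum(0.0, x)                # zero below 0, linear above

def leaky_relu(x, w_neg=1e-4):
    return np.where(x > 0, x, w_neg * x)     # small "leak" for negative inputs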

See Activation Functions — ML Glossary documentation for more examples, details, visualisations, and explanations.

Image from Automatic localization of casting defects with convolutional neural networks Ferguson et al. (2017)

Build the Model: Convolutional Neural Network

We will use the Sequential form of the Keras architecture for model creation. This means that each layer is appended to the model, similar to a Python list’s .append() operation.

The operations look like this:

model = Sequential()
model.add(input_layer)
model.add(hidden_layer1)

model.add(hidden_layerN)
model.add(output_layer)

Define the hyperparameters for the CNN and the fitting process.

  • kernel_shape: the size of the convolutional kernel in 2D
  • activation: the activation function used between layers or blocks
  • pool_shape: the size of MaxPool/AvgPool downsamplings in 2D
  • nfilters_hidden1: the number of convolutional filters for the first layer
  • n_filters_hidden2: number of convolutional filters for the second layer

Note that the pooling layers (i.e., MaxPooling/AvgPooling) reduce the size of the “image” or “feature map” passed to subsequent convolutional layers by a factor of pool_shape = 2 in each direction. As such, it is usual to increase the number of filters in subsequent convolutional layers by a factor of 2 (or by the pool shape per dimension). This keeps the number of features shrinking roughly linearly per Pooling operation rather than quadratically, and it also helps with GPU optimisation.

Example: because we used pool_shape = 2, the second convolutional layer has twice as many filters as the first convolutional layer. With 28x28 images and 32 filters, the first Conv2D layer outputs 25k elements as its feature maps. The Pooling reduced the size of the images to 14x14, while we increased the number of filters from 32 to 64, which results in 12.5k elements in the feature maps output by the second Conv2D layer. The number of features output by the second Conv2D layer is only half the number of features output by the first Conv2D layer — as opposed to a quarter. The convolved images are smaller, but the deeper layer generates more of them.
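
A quick back-of-the-envelope check of that claim, using the image sizes and filter counts described above:

# Feature-map sizes for the two Conv2D layers described above
first_conv = 28 * 28 * 32    # 25,088 values per image after the first Conv2D
second_conv = 14 * 14 * 64   # 12,544 values per image after pooling + the second Conv2D
print(second_conv / first_conv)  # 0.5 -- half the features, not a quarter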

kernel_shape = 3, 3  # train 3x3 kernels across all Conv layers
activation = 'relu' # use the Rectified Linear Unit activation function
pool_shape = 2, 2 # reduce dimensionality by 4 = 2 x 2
dropout_rate = 0.5 # drop 50% of neurons
padding = 'same' # maintain the shape of feature maps per layer
strides = 1 # Default. Do not downsample via stride

nfilters_hidden1 = 32 # Start with 32 convolution filters to train
nfilters_hidden2 = 64 # end with twice as many filters to train next

Construct the neural network model using keras.models.Sequential() and model.add(…):

# Define how we will build the model
model = models.Sequential(name='MNIST_CNN_Tutorial')

# Create the input layer to define the shape of each image (and, optionally, the batch size)
model.add(
    layers.Input(
        shape=img_shape,
        # batch_size=batch_size,
        name='Image_Batch_Input_Layer',
    )
)

# Add the first convolution layer with 32 filters
model.add(
    layers.Conv2D(
        filters=nfilters_hidden1,
        kernel_size=kernel_shape,
        activation=activation,
        padding=padding,
        strides=strides,
        name='First_Conv2D_Layer'
    )
)

# Reduce the dimensionality after the first Conv-layer w/ MaxPool2D
model.add(
    layers.MaxPooling2D(
        pool_size=pool_shape,
        name="First_MaxPool2D_Layer"
    )
)

# Add the second convolution layer with 64 filters
model.add(
    layers.Conv2D(
        filters=nfilters_hidden2,
        kernel_size=kernel_shape,
        activation=activation,
        padding=padding,
        strides=strides,
        name='Second_Conv2D_Layer'
    )
)

# Reduce the dimensionality after the second Conv-layer w/ MaxPool2D
model.add(
    layers.MaxPooling2D(
        pool_size=pool_shape,
        name="Second_MaxPool2D_Layer"
    )
)

# Convert the 2D outputs to a 1-D vector in preparation for label prediction
model.add(
    layers.Flatten(
        name="Flatten_from_Conv2D_to_Dense"
    )
)

# Dropout 50% of the neurons from the Conv+Flatten layers to regularise
model.add(
    layers.Dropout(
        rate=dropout_rate,
        name="Dropout_from_Dense_to_Output"
    )
)

# Compute the weighted logistic (softmax) for each possible label in the one-hot encoding
model.add(
    layers.Dense(
        units=num_classes,
        activation="softmax",
        name="n-Dimensional_Logistic_Output_Layer"
    )
)

Here we have created an updated version of LeNet5, similar to the figure below. The pyramidal structure below is a common and useful visualisation of the nature of convolutional neural networks.

Image provided by WikiCommons under CC0 license.

Printout of the numerical CNN structure

Here, we will also “visualise” the neural network in text form, using the model.summary() method:

Model: "MNIST_CNN_Tutorial"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
First_Conv2D_Layer (Conv2D) (None, 28, 28, 32) 320

First_MaxPool2D_Layer (None, 14, 14, 32) 0
(MaxPooling2D)

Second_Conv2D_Layer (Conv2D) (None, 14, 14, 64) 18496

Second_MaxPool2D_Layer (None, 7, 7, 64) 0
(MaxPooling2D)

Flatten_from_Conv2D_to_Dens (None, 3136) 0
(Flatten)

Dropout_from_Dense_to_Outpu (None, 3136) 0
(Dropout)

n-Dimensional_Logistic_Outp (None, 10) 31370
(Dense)

=================================================================
Total params: 50,186
Trainable params: 50,186
Non-trainable params: 0
_________________________________________________________________

The rows in the text table above provide the name of each layer, the type of each layer (in parentheses), the shape of each layer’s output, and the number of trainable parameters per layer. Note that the Dropout, Flatten, and MaxPooling2D layers have zero (0) trainable parameters.

Compile the model

When fitting the model, we must first understand

  • Loss Functions: What the algorithm should compare against to determine if training is improving; i.e., “goodness of fit”
  • Minimisation Algorithm: Which algorithm should process the training/fitting itself
  • [Optional] What metrics should we track to convince ourselves that our model fits the data well, and is generalised to unseen data

1. Loss Functions: Goodness of Fit

There are many metrics that could serve as our goodness of fit. For linear regression, the usual choice is the mean squared error, because it corresponds to a chi-squared statistic — i.e., the (negative) natural log of a Normal (Gaussian) likelihood. It has been shown to represent a literal bias-variance tradeoff — minimising overfitting and underfitting at the same time. An algorithm that iterates to find the smallest mean squared error is also referred to as minimising the chi-squared.

For classification, MSE is almost irrelevant because the class labels are represented by the logistic vector (float: 0 to 1). As a result, the algorithms compare the output of the CNN (softmax: weighted logistic per class) to the one-hot encoded label vectors.

Cross-Entropy

With respect to binary classification, the most prolific metric for comparing predictions to labels is called cross-entropy. Cross-entropy measures how dissimilar two vectors (or distributions) are from each other. The use of the word “entropy” here is related to the second law of thermodynamics, where the entropy of a physical system is always maintained or increased. Entropy is thus connected to the residual scatter: to minimise the entropy is to minimise the scatter in the residuals.

In the information criteria framework (i.e., model fitting), entropy is referred to as the “quality of information”, and minimising entropy is considered to be the same as increasing the quality of information. Returning to the physics analogy, reducing the entropy provides a clearer image of the underlying system. A large entropy (as in the initial conditions of a model fit) implies a large uncertainty about the system; i.e., the predictions from the network do not yet represent the data or labels.

For one-hot encoded (multi-class, not binary) classification, the same principle can be applied to the one-hot encoded label vectors. We must first understand that, because each label (0, 1, 2, …, 9) is distinct, it is referred to as categorical. Therefore, a multi-dimensional classification that minimises cross-entropy as its loss function uses what is referred to as categorical_crossentropy.

Mathematically, cross-entropy compares the true label distribution, p, to the predicted distribution, q, across the input data and labels: H(p, q) = −Σ p log(q).

In the model fitting context, with classification, the prediction of the CNN is often labelled ŷ (y-hat) and the label itself is referred to as y. Therefore, for binary classification (not our use case), the cross-entropy simplifies to: loss = −[y log(ŷ) + (1 − y) log(1 − ŷ)].

Categorical Crossentropy
For a multi-class classification (our use case; finally!), we compute the summation of the cross-entropy over all categories: loss = −Σ_c y_c log(ŷ_c), where c runs over the ten possible classes.
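
As an illustrative numpy sketch for a single sample (the prediction vector below is made up for the example):

import numpy as np

# Categorical cross-entropy for one sample: one-hot label vs softmax prediction
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 0])          # true digit: "6"
y_pred = np.array([0.01, 0.01, 0.02, 0.01, 0.02, 0.02,
                   0.85, 0.02, 0.02, 0.02])                 # hypothetical softmax output
loss = -np.sum(y_true * np.log(y_pred))                     # ~0.163: confident and correct
print(loss)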

2. Optimizers: Algorithms to optimise the loss functions

There are many different algorithms that can optimise a loss function. We say optimise because a few objectives are maximised (rarely), while most loss functions are minimised.

The optimisation algorithms often iterate parameters (the neural network weights and biases) in order to reduce error: the difference between predictions and expectations:

error = prediction − expectation = prediction − label

The most commonly used optimisation approach that I know of is maximum likelihood estimation, in the form of a minimum chi-squared estimation. The chi-squared is (proportional to) the negative natural log of the Gaussian likelihood.

This is also the weighted mean squared error — weighted by the inverse square of the uncertainties.

Maximum Likelihood Estimation (i.e., minimising chisq) starts with a guess — initial condition — and iteratively attempts to improve the solution by minimising error through sequential searching of nearby parameters, then storing the parameters that reduce the error, which is expected to improve the predictions as compared to the expectations.

This is very nearly the same process with neural networks. The algorithm is referred to as backpropagation or “backward propagation of errors”. The underlying principle is that the error from the output of the NN (prediction-label) is iteratively multiplied by the weights of each layer in the neural network to determine the relative adjustment necessary per layer to optimise the NN and produce improved predictions. When the error approaches zero, the updates also approach zero and the algorithm is considered to have converged. With hope, a converged backpropagation also means that the predictions generalise well to unseen data. This is an entirely different consideration for “error”, referred to as generalisation error.

Gradient Descent
The most direct application of backpropagation is gradient descent, which is the algorithm described above: iteratively subtract changes in the error throughout the NN, weighted by the weights of the NN: w_new = w_old − α ⋅ ∂error/∂w.

Note that the derivative term (∂error/∂w) is the portion of the error attributed to the weights of each layer. For the earlier layers in the network, the error gradient has been multiplied through more and more of the deeper layers, so the resulting adjustments become smaller and smaller. In turn, this leads to issues with loss of relevance and early-onset overfitting.

The vanishing gradient problem happens when a neural network is so large that the error drops to zero before processing the deepest layers of the network. Several solutions have been invented to mitigate this, but knowing that the problem exists, ahead of time, helps us to design the deepest of neural networks.

In the iterative update equation above, the α term refers to the learning rate, which establishes how significant an effect each adjustment has on the weights of the NN. A larger α leads to a faster reduction of error throughout the neural network, but not necessarily to the minimum loss — more likely, it results in a local minimum or overfitting. A smaller α could lead to more robust, deeper minima for the NN, while taking longer to converge. Understanding the balance between the two — and, where useful, using a learning rate scheduler — minimises this issue.
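
As a sketch of how one might adjust the learning rate in tf.keras (illustrative values; this tutorial simply relies on the Adam defaults):

from tensorflow.keras import optimizers, callbacks

# Illustrative: an explicit Adam learning rate (1e-3 is the tf.keras default)
optimizer = optimizers.Adam(learning_rate=1e-3)

# Optional: halve the learning rate whenever the validation loss plateaus for 2 epochs
lr_scheduler = callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=2)

# These would be passed to model.compile(optimizer=optimizer, ...) and
# model.fit(..., callbacks=[lr_scheduler]), respectively -- not done in this tutorial.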

(Modern) Stochastic Gradient Descent Algorithms
Several updates to the gradient descent algorithm have been designed to maximise the robustness of solutions, while minimising the opportunity for the algorithm to overfit.

The most prolific variant of the GD algorithm is the stochastic gradient descent (SGD), which is a probabilistic approach to GD that is designed to improve training speeds and minimise overfitting. The algorithm works to improve error by randomly examining batches of samples (images for this example). This adds useful randomness to the training, as well as optimises the GPU operations.

The major updates to the stochastic gradient descent algorithm are adaptive learning rate and momentum. These resulted in the algorithms referred to as AdaGrad and AdaM (adam). The latter — adam — is the most used gradient descent algorithm for neural networks at the time of this writing. Although I am also a fan of Eve, it is not as readily accessible as Adam in the pre-built frameworks.

  1. Adaptive gradient descent (AdaGrad) reduces the learning rate when enough subsequent iterations of the SGD do not see improvements in the error.
  2. Adaptive Moment Estimation (AdaM) builds on AdaGrad-style adaptive learning rates and adds momentum, which takes information from previous updates and adds it to the current update. The gradients are adjusted relative to the current error, while also including small contributions from previous updates as a momentum down the gradient.

3. Metrics [Optional]:

Metrics are similar to losses that we track, but not train, during the fitting process. In addition to the metric (loss function) that the algorithm minimises, most Python libraries can also track other metrics per epoch. Examples of other metrics include accuracy, AUC, precision, recall, f1-score, etc.

  1. Accuracy: with respect to classification, this is the comparison of the prediction by the CNN to the labels expected by the data: how many times did it predict correctly?
  2. AUC: Area Under the ROC Curve. The ROC curve is the functional relationship between the true positive rate and the false positive rate. The true positive rate is the percentage of true positives predicted over a range of predictive probability thresholds; the false positive rate is the percentage of false positives predicted over the same range of thresholds. By comparing the two and integrating (summing) under the resulting curve, we can estimate how well our CNN separates the classes by the magnitude of the AUC. A large AUC implies large confidence in the classification results.
  3. Precision: Comparison of the true positives to True and False positives (higher is better)
  4. Recall: Comparison of the true positives to all positive labels (higher is better)
  5. F1-score: the harmonic mean of precision and recall (per threshold). Similar in spirit to the AUC in that it summarises two competing quantities derived from a binary confusion matrix.
model.compile(
    loss="categorical_crossentropy",
    optimizer="adam",
    metrics=["accuracy"]
)

Train the CNN model

We have finally come to the part of the project to actually fit the model to the data. First, let’s recap the project flow for summarisation purposes:

# Import Modules
from tensorflow.keras import layers, utils, datasets, models

# Load Data
(x_train, y_train), (x_test, y_test) = datasets.mnist.load_data()

# Preprocess data
x_train, x_test = preprocess(x_train, x_test)

# Encode labels
y_train, y_test = one_hot_encoding(y_train, y_test)

# Create Model
model = create_cnn_model()

# Compile Model
model.compile(...)

Coming next are the fit and evaluate steps:

# Fit Model
model.fit(...)

# Evaluate Model
model.evaluate(...)

The fitting process requires three new hyperparameters, each of which has its own unique effect on the result of the model fitting process.

  1. epochs: The maximum number of iterations (full passes over the training data) that we allow the model to fit. If we also introduce the EarlyStopping callback, then the epochs provided act as the maximum number of epochs. EarlyStopping monitors the validation loss and stops iterating if the validation loss does not improve for "a number of epochs" defined as patience in the creation of the EarlyStopping instance.
  2. validation_split: Because all models overfit eventually, it has become customary to allow the algorithm to separate a percentage of the training data set and use this as a validation data set. The percentage of the training data for the algorithm to hold aside -- i.e., not train on -- is referred to as the validation_split. In the example below, we chose 10% validation with validation_split=0.1
  3. batch_size: In order to define the structure of the data — as seen by the NN on the GPU — we must also define the batch size. The batch_size is the number of images (or data) provided to the NN per training iteration. We could show the NN one image at a time — which would take much longer — or we could show all of the images at the same time, which would likely overload the CPU, RAM, GPU, etc. The batch_size is used to send the most amount of data to the CPU, RAM, and GPU without overloading them. It streamlines the training time by balancing between too many RAM-to-GPU operations (or worse: disk to RAM to GPU), while not wasting time by overloading the CPU-RAM or GPU-RAM (i.e., avoid using swap).
  4. batch_size reconsidered: a moderate batch_size improves the performance of the SGD algorithm and its related cousins: AdaGrad, AdaM, Eve, etc. batch_size controls the statistical capacity per batch provided to the SGD (Stochastic Gradient Descent) algorithm. Because SGD is designed to avoid local minima in the loss function by deliberately jumping from batch to batch, a single image is too few to form a statistic, and including all of the images at once is too many to allow randomised adjustments to the momentum.

Note: Because we have 60k training images — then 10% set aside for validation split — and a batch size of 128 (see below) — the model.fit progress bar will track 422 batches:

GPU Optimisation
Note that batch sizes are often powers of 2 because this optimises the GPU’s ability to parallelise the computations across its cores. A GPU has thousands of cores, but often only ~10 GB of RAM, and the input-output between the CPU and GPU can be more costly than the GPU operations themselves. Therefore, optimising the configuration of the model fitting process for the GPU — i.e., the data range, the batch size, the optimizer, etc. — improves the time-to-convergence as well as the SGD's avoidance of local minima. If we optimise the number of operations per batch on the GPU, then we optimise the global fitting process and speed up both training and prediction.

epochs = 15  # How many iterations should we cycle over the entire MNIST dataset
validation_split = 0.1 # how many images to hold out per epoch: 10%
batch_size = 128 # nominal use cases: 32, 64, 128, 256, 512
early_stopping = callbacks.EarlyStopping(patience=5)

history = model.fit(
    x=x_train,
    y=y_train,
    batch_size=batch_size,
    epochs=epochs,
    validation_split=validation_split,
    callbacks=[early_stopping]
)
Epoch 1/15
422/422 [==============================] - 4s 7ms/step - loss: 0.3617 - accuracy: 0.8913 - val_loss: 0.0809 - val_accuracy: 0.9770
Epoch 2/15
422/422 [==============================] - 3s 7ms/step - loss: 0.1086 - accuracy: 0.9667 - val_loss: 0.0611 - val_accuracy: 0.9830
Epoch 3/15
422/422 [==============================] - 3s 7ms/step - loss: 0.0814 - accuracy: 0.9736 - val_loss: 0.0477 - val_accuracy: 0.9875
...
Epoch 15/15
422/422 [==============================] - 3s 8ms/step - loss: 0.0319 - accuracy: 0.9896 - val_loss: 0.0281 - val_accuracy: 0.9917

Interpreting the output of the training process

Rolling Progress output

The output from Keras for the fitting process provides

  • A timer (progress bar) to estimate how much time each batch will take
  • An estimate for the ms/step to estimate how much time the full training will take
  • The training loss — referred to simply as “loss” — tracks the loss metric (categorical cross-entropy above) for the training data seen by the AdaM algorithm: the 90% of the training set randomly selected by the algorithm
  • The training accuracy (or other metrics) — referred to simply as “accuracy” — tracks the provided metric (accuracy above) for the same training data seen by the AdaM algorithm
  • The validation loss (“val_loss”) tracks the loss metric (categorical cross-entropy above) for the “held out” or “validation” data: the 10% randomly selected by the algorithm and unseen by the AdaM algorithm
  • The validation accuracy (“val_accuracy”) tracks the provided metric (accuracy above) for the same “held out” or “validation” data

Because Keras and neural networks are best optimised on a GPU, if you are following along with the provided Colab notebook, we suggest that you activate the GPU option:

Menu > Runtime > Change runtime type > Hardware accelerator: GPU

Because a data set is the pairing of features (images here) and labels (now one-hot-encodings), the loss metric — categorical cross entropy — measures how different the model predictions become compared to the provided labels.

In short, our loss function improves if the predictions match the labels better, which therefore decreases the categorical cross-entropy.

One of the most important metrics to understand is the val_loss. If the val_loss continues to decrease, then we can reasonably expect that the model will generalise to unseen data. The final generalisability test comes next, when predicting on and evaluating the official test data — the literally “held out” data — rather than the validation data that we gave to the algorithm but on which it did not train.

The val_loss is often larger than the loss (training loss) because most algorithms can learn the training data, but not all of them can efficiently generalise to, or predict, data that they have never seen: validation data and test data. As such, the val_loss can be larger, but it should not be “significantly larger” (~10x).

If the val_loss begins to increase continuously, then the model is overfitting and the fitting process should be stopped early — preferably using the EarlyStopping callback. If the val_loss never decreases at all, then the model is probably not the best configuration for the problem set. Note that (with Keras) if we stop a model.fit operation manually, it often discards the existing progress -- i.e., we would likely have to start over again.
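
For reference, a common EarlyStopping configuration looks like the sketch below (restore_best_weights is optional and was not used in the fit call above):

early_stopping = callbacks.EarlyStopping(
    monitor='val_loss',          # watch the validation loss (the default)
    patience=5,                  # stop after 5 epochs without improvement
    restore_best_weights=True,   # roll the model back to its best epoch
)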

If the train_loss does not meaningfully decrease at all, then the model is likely not effective for the data: i.e., using a convolutional network on tabular data that is not physically, spatially, or temporally correlated. ConvNets are designed for data that include physical correlations:

  • Images are correlated pixel to pixel and color-to-color
  • Time series are correlated time step to time step
  • Geospatial data are correlated similarly to images, as well as to geolocation and time.
  • Language (NLP) is correlated through the position of each word within the sentence

Another (literal) output from the fitting process is the History callback (left-hand side of model.fit). An instance of the History callback contains many attributes and a few methods, but it also contains a dictionary that is (logically?) called "history".

Because it has become standard to — unimaginatively — name the History instance returned by model.fit "history", the numerically useful output of model.fit is therefore the history.history dictionary:

history.history:

key | purpose | example
-------------|----------------------------------------------|------------------------------------
loss | sequence (list) of all training loss | [0.364, 0.108, 0.084, 0.07 , 0.064]
accuracy | sequence (list) of all training accuracies | [0.889, 0.967, 0.974, 0.978, 0.98 ]
val_loss | sequence (list) of all validation loss | [0.077, 0.058, 0.045, 0.04 , 0.037]
val_accuracy | sequence (list) of all validation accuracies | [0.979, 0.984, 0.988, 0.989, 0.99 ]

To view these results for ourselves, let us print out the values of the history dict

# The results below may differ from the table above because of randomised 
# initial conditions and randomised processes during fitting: e.g., `Dropout`
for key, val in history.history.items():
    print(f'{key:>12}: {np.round(val, 3)}')

Evaluating Model Fitting

It is often essential to plot the loss curves over time to understand if and where the training achieved its best results — preferably before the training stopped. If the curves are still decreasing, then we may want to run the model.fit for more epochs.

Another key intuition is that if the training converges to a reasonable solution “too quickly”, then we could adjust the learning rate for the AdaM optimiser. A smaller learning rate will converge more slowly, but provides the algorithm with enough wiggle room to not “jump over” a better minimum loss solution. Higher learning rates could miss more effective solutions, but also take less wall clock time to converge.

Training and Validation loss curves over epoch. Generated by the author in the Colab notebook

In the above learning curve (loss vs epoch), we can see that the val_loss reached its minimum at epoch = 11, while the val_accuracy reached its maximum at epoch = 9. Both of these extremes are likely influenced by statistical noise.
Training several identical CNNs on identical data — with randomly generated initial conditions — and then plotting the loss curves should reveal a more robust understanding of the learning-curve features: minima and maxima. Using the EarlyStopping callback, we sustain less risk of overfitting with "too many" epochs -- i.e., epochs = 50 would still stop shortly after epoch = 11 if that remained the loss minimum for patience = 5 sequential epochs.

Statistically monitoring CNNs (“several, identical CNNs”) is not covered here, but can be implemented with a for loop on the model.fit above and storing individual history instances for later comparison.
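
A minimal sketch of that loop, assuming a helper create_cnn_model() that rebuilds and compiles the same architecture each run (a placeholder name, as in the recap earlier):

histories = []
for run in range(10):
    model_i = create_cnn_model()           # placeholder: rebuild + compile the same CNN
    history_i = model_i.fit(
        x_train,
        y_train,
        batch_size=batch_size,
        epochs=epochs,
        validation_split=validation_split,
        callbacks=[callbacks.EarlyStopping(patience=5)],
        verbose=0,                         # silence the per-epoch progress bars
    )
    histories.append(history_i.history)    # keep each run's loss/accuracy curves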

Generalisation

Evaluate the model on the holdout or test data set
The test data are the images and labels that were held out for the purpose of estimating how well our new classifier would behave on data that it has never seen. This is meant to quantify the generalisability of the model.

If the training loss (called loss in the .fit output) is sufficiently small to consider the training to be successful, while the validation loss (val_loss in the .fit output) is significantly worse (2x-10x, or worse), then we can infer that the model has overfit the training data.

In contrast, if the training loss and the validation loss are both similar (~1x-2x), then we could move on to estimating the loss values on the data which the model has never seen: the test data.

Evaluate on the test data

test_score = model.evaluate(x_test, y_test, verbose=0)
train_score = model.evaluate(x_train, y_train, verbose=0)

print(f"     Test loss: {test_score[0]}")
print(f" Train loss: {train_score[0]}")
print()
print(f" Test accuracy: {test_score[1]}")
print(f"Train accuracy: {train_score[1]}")
     Test loss: 0.02488
Train loss: 0.01596

Test accuracy: 99.13%
Train accuracy: 99.52%

In the example above, the training achieved a loss (categorical cross-entropy) of ~0.0160, while the testing achieved a loss of ~0.0249 (values will change with subsequent runs). Is that good? It depends on the data, the model, the developer, and how ‘well’ we feel that it fits.

This test-vs-training error means that the CNN classifier performs about 56% worse on unseen data. Nominally, that is considered a reasonable difference — bravo! More than a 100% difference (i.e., 2x) becomes less and less reasonable. Unfortunately, all such value comparisons are subjective to the problem set, the data environment, the model environment, as well as the experience of the community working with similar data.

It is very difficult to interpret the quality of a single model training. Often, we require 10s of runs with a single data set to build intuition. Then we can estimate the mean and std-dev of the training and testing losses, as well as other metrics.

The purpose of the test loss is not exactly to determine if this one model and this one run are good fits, but to compare different CNNs or other models on the same data set. If we add a layer to the CNN; change the kernel size; or decide to remove the convolutions and create a DenseNet instead: then comparing the test loss (or an even newer, unseen dataset) is valuable for understanding which model will most likely generalise better on new data. The best method of determining the quality of fit is to attain new, labelled data and compare the test loss from that data — this is often after a system goes live to test.

Personally, I think it is well fit. From experience, the test loss being double the training loss is not unusual. I have seen bad fits, with factors of 10x — 1000x. To become more comfortable with these values, I would first run 10 more identical CNNs — with randomised initial conditions — to compare.

Other metrics also help us interpret these results. The example monitors the accuracy metric, which is something that most people like to see. In contrast, experienced machine learning developers most often find that accuracy can be misleading. Experienced ML developers would include other metrics for a multi-faceted quality comparison. My preferred additional metrics are the AUC, the F1-score, precision, and recall. See above for a descriptive list.
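
As a sketch of how those additional metrics could be computed after training — assuming scikit-learn is available, which is not otherwise used in this article:

import numpy as np
from sklearn.metrics import classification_report, roc_auc_score

y_pred_proba = model.predict(x_test)              # softmax outputs, shape (10000, 10)
y_pred_class = np.argmax(y_pred_proba, axis=1)    # predicted digit per image
y_true_class = np.argmax(y_test, axis=1)          # decode the one-hot labels

print(classification_report(y_true_class, y_pred_class))             # precision, recall, f1 per class
print(roc_auc_score(y_true_class, y_pred_proba, multi_class='ovr'))  # one-vs-rest AUC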

Visualising results

pred_test = model.predict(x_test)

>>> 313/313 [==============================] - 1s 3ms/step
plot_samples(
    images=x_test,
    labels=y_test,
    predictions=pred_test,
    n_rows=7,
    n_cols=7,
    figsize=(8, 8)
)
Generated by the author in the Colab notebook

The GREEN numbers in the upper left of an image represent True predictions from our minimalist CNN. The RED numbers in the upper left of an image represent False predictions. Of the examples randomly selected and shown, one is incorrect. This does not form a statistic, but it is a reminder that error still exists. Moreover, humans can infer that the single incorrect value could also have confused a human: a nine that looks like a seven.

Simple Upgrades: Testing other Datasets

To reflect the versatility of the convolutional neural network for related computer vision classification tasks, we can input different data sets and see how well the same model performs. Similar to MNIST, these datasets were created to teach researchers how to build newer, deeper, and more complex CNN architectures.

By making very simple modifications to our code, we can test the CNN on ever more complex computer vision databases — all of which are built into tf.keras.

To modify the existing code, we create a flag that selects the dataset we wish to examine, then load the dataset of our choosing — and modify the data shapes if necessary. Next, we can run the exact same notebook for almost any computer vision classification dataset provided by tf.keras:

"""
# This is the expected properties for MNIST
num_classes = 10 # 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
img_shape = 28, 28, 1 # 28 rows, 28 columns, 1 color -- grayscale
"""

# load data from keras
# x = inputs and y = outputs

dataset = 'mnist' # 'mnist' # 'fashion_mnist' # 'cifar10'
if dataset == 'mnist':
    (x_train, y_train), (x_test, y_test) = datasets.mnist.load_data()
if dataset == 'fashion_mnist':
    (x_train, y_train), (x_test, y_test) = datasets.fashion_mnist.load_data()
if dataset == 'cifar10':
    (x_train, y_train), (x_test, y_test) = datasets.cifar10.load_data()

# Numerical operations (e.g. CNNs) prefer values between 0, 1 or -1, 1
# We must scale the values from 255 (default inputs) to 0-1 (CNN inputs)
# Note that we also convert the arrays to float32,
# this could be float16/float8 for speed or float64 for accuracy, etc
x_train = x_train.astype("float32") / 255
x_test = x_test.astype("float32") / 255

# Conv2D-nets require 4D inputs: batch, row, col, color
if np.ndim(x_train) == 3: # no color dimension
    # Add the color dimension to represent greyscale
    COLOR_DIM = -1
    x_train = np.expand_dims(x_train, axis=COLOR_DIM)
    x_test = np.expand_dims(x_test, axis=COLOR_DIM)

Fashion MNIST

If dataset = ‘fashion_mnist’, then Keras will load the Fashion MNIST data set. Fashion MNIST is a collection of 70k greyscale images, with 28x28 resolution, sorted into 10 classes of images containing clothing and accessories. If we replace the words “clothing and accessories” with “handwritten digits”, then this statement would perfectly describe the MNIST dataset — and that is the point.

Fashion MNIST was created as improvements in CNN technology progressed. CNNs can reach accuracies of over 99.7% on the MNIST data set, which leaves too little wiggle room for researchers to demonstrate more advanced NN architectures. As a result, teams needed a new data set to experiment with new CNNs and related computer vision technology. This is why Zalando introduced Fashion MNIST in 2017.

Fashion MNIST is a more modern data set that is very similar to MNIST but provides more complexity for researchers to experiment with improved computer vision deep learning techniques. Fashion MNIST increased the image complexity in order to reduce inherent accuracy and provide ML researchers with more flexibility to explore innovative CNN architectures.

    Test loss: 0.277
Train loss: 0.225

Test accuracy: 89.96%
Train accuracy: 91.84%

Because we did not modify our CNN configuration, the results are as follows: the training and testing losses are about 10x worse for Fashion MNIST than for MNIST. Furthermore, the accuracy dropped by roughly 10 percentage points; but it is better to examine the failure rate, which increased more than 10x: from 0.87% to 10.04%.

Left: examples of predictions (red/green) with corresponding Fashion MNIST images. Right: Loss and Accuracy over epochs. Generated by the author in the Colab notebook

CIFAR-10

If dataset = ‘cifar10’, then Keras will load the CIFAR-10 data set, which is a collection of 60k COLOR images, with 32x32 resolution, sorted into 10 classes of animal and vehicle images. This too was developed as an extension of the MNIST dataset concept. Providing colour images increases the number of channels in the input layer from 1 to 3: greyscale to RGB, respectively. Additionally, the images are now 32x32, which increases the number of input data points per channel by more than 30%, and the number of features (pixels-by-colour) per sample (image) from 784 to 3,072 — roughly four times as many input values per image as MNIST overall. Even this barely begins to approach modern computer vision tasks (512x512 RGBD) or common geospatial images with 11 channels.

    Test loss: 0.892
Train loss: 0.805

Test accuracy: 69.85%
Train accuracy: 73.16%

Because we still have not modified the CNN configuration, the results are as follows: the training and testing losses are about 3x worse for CIFAR-10 than for Fashion MNIST, and the accuracy is reduced by another 20% (compared to Fashion MNIST). This is not an unreasonable result because we have not modified the neural network in the least.

Examining the failure rate instead of the accuracy, we see that the exact same CNN model sustains a failure rate of 30.15% on the larger, colour images provided by the CIFAR-10 dataset. This is in comparison to 10.04% for Fashion MNIST and 0.87% for MNIST. The increase in test loss, test failure rates, etc. reveals the need for deeper, larger, more complex CNNs to create models that generalise well on ever more complex image datasets.

Left: examples of predictions (red/green) with corresponding CIFAR-10 images. Right: Loss and Accuracy over epochs. Generated by the author in the Colab notebook

For more information, and to play with the code or data, please see the provided Colab notebook.

