Generative AI Through the Ages: Convolutional Autoencoders Part 2
Build a Convolutional Autoencoder to Understand the Original Architecture of Generative AI.
I originally wrote this article as a 60-page Medium post about convolutional autoencoders as the origin and backbone of Generative AI. Editors and colleagues recommended that I break it up into three parts: architecture+history, application [here], and latent space evaluation.
The most common autoencoder architecture for computer vision deep learning is the convolutional autoencoder (CAE). CAEs are connected stacks of convolutional kernels for the Encoder and Decoder, with linear (dense) matrices in between to transform the encoded images into the latent space. Further dense layers connect the latent space back onto the Decoder for image reconstruction.
In a previous article, we discussed how convolutional autoencoders are the backbone and evolutionary ancestor of modern Generative AI models. Here we will show how to build one, explain how each component contributes to the generative properties of GenAI, and evaluate our model on the MNIST dataset. In the next (sub-)article, we will visually evaluate the latent space and generative capacity of even this simple model.
Compartmentalised ConvAutoencoder to Prepare for the Future
Depending on the use case, the architecture of the CAE could be included in a single class or a set of classes that either inherit from each other or are inherited into a final aggregated class. For both educational purposes and maintainability, we chose here to follow the latter method: a set of classes inherited into a final aggregated class.
Make It Real: Build a ConvAutoencoder from Scratch
Here we begin the implementation of our convolutional autoencoder: instantiating, training, visualising, and evaluating it. First, ensure that the appropriate libraries are installed to run the code below.
The code can be run in full with the associated Colab notebook.
!pip install pygtc scikit-learn tensorflow
and import the libraries into the run environment
import numpy as np
import plotly.graph_objects as go
from tensorflow.keras import datasets
from matplotlib import pyplot as plt
from pygtc import plotGTC
from sklearn.model_selection import train_test_split
from tensorflow.keras.layers import (
Input,
Dense,
Conv2D,
MaxPooling2D,
Conv2DTranspose,
Flatten,
Reshape
)
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras import backend as K
from tensorflow import image
Most of the operations involved in training a neural network are probabilistic and rely heavily on pseudo-random number generators. As such, to enable reproducibility of our results over multiple trial runs, we set an initial random seed: 42, because we're happily nerdy.
np.random.seed(42)
The Encoder and Decoder Classes
In the Encoder and Decoder classes below, we take as input the convolutional kernel parameters, image shape, and number of latent dimensions to configure the modules of our Convolutional Autoencoder. The convolutional kernel parameters are a dict that establishes the kernel shape, the number of filters, the activation function, the stride length, and the pool size (for MaxPooling2D layers).
The Encoder class maps the input image to its corresponding latent vector through a sequence of convolutional layers. The Decoder class maps the latent vector to its corresponding reconstructed image through a symmetric sequence of convolutional layers.
Attributes:
- input_shape (tuple): The shape of the input image. Default is (28, 28, 1).
- latent_dim (int): The dimension of the latent space. Default is 10.
- conv_params (dict): A dictionary of convolution parameters. (See below)
The constructor for the Encoder class initializes the encoder with the given input shape and latent dimension. Furthermore, the conv_params dictionary contains 7 other parameters that control the size, shape, and complexity of the convolutional layers (filters/kernels) and blocks (stacks of filters/kernels).
conv_params as key:value pairs
- activation is the flavor of non-linear activation function used between convolutional blocks. It defaults here to relu, or Rectified Linear Unit.
- decoder_activation is the non-linear activation function used at the top of the Decoder stack (the output of the Autoencoder) to best match the reconstructed image to the input image. We default the decoder activation to 'sigmoid' because we preprocessed the datasets to span a range of 0 to 1, and the sigmoid enforces an output range of 0 to 1.
- padding is the structure of the padding around the image after each convolutional layer. We chose the setting 'same' to ensure that we do not unintentionally lose information at the edges of the feature maps.
- strides sets the number of pixels by which the convolutional kernels step between each transformation. Our default of 2 results in each subsequent feature map (the output of a convolutional layer) having a quarter the number of features (half the size in both the X and Y directions).
- pool_size establishes how many pixels the MaxPooling layers bin together. By choosing a pool_size of 1, we force the CAE to effectively skip the MaxPooling operation, in favor of letting the strides dominate the feature-size reduction. Strides and MaxPooling serve as similar dimensionality reduction methods.
- kernel_size sets the size of each convolutional filter in each dimension. With a default of 3, each kernel has a 3x3 shape, or 9 parameters. The parameter can take an integer for square kernels or a 2-tuple for non-square kernel shapes.
- n_filters is a list that specifies how many filters should be trained in the convolutional layer of each block. We only include logic for one layer per block, but multiple layers could be introduced per block. In our default setting, we chose an Encoder sequence of [16, 8, 4]: the first layer has 16 filters, the second has 8, and the third has 4. The Decoder uses the reverse of this list by default: [4, 8, 16].
See my primer for more details on each hyperparameter and more.
class Encoder:
"""
    An Encoder class for a Convolutional Autoencoder. This class maps
the input image to its corresponding latent vector through a
sequence of convolutional layers.
Attributes:
input_shape (tuple): The shape of the input image.
Default is (28, 28, 1).
latent_dim (int): The dimension of the latent space.
Default is 10.
conv_params (dict): A dictionary of convolution parameters.
Default is None.
"""
def __init__(
self, input_shape=(28, 28, 1), latent_dim=10,
conv_params=None):
"""
The constructor for Encoder class. Initializes the Encoder
with the given input shape and latent dimension.
Args:
input_shape (tuple): The shape of the input image.
Default is (28, 28, 1).
latent_dim (int): The dimension of the latent space.
Default is 10.
conv_params (dict): A dictionary of convolution parameters.
Default is None.
"""
# Initialize convolution parameters
self.conv_params = conv_params
# If no conv_params provided, use default values
if self.conv_params is None:
self.conv_params = {
'activation': 'relu',
'decoder_activation': 'sigmoid',
'padding': 'same',
'strides': 2,
'pool_size': 1,
'kernel_size': 3,
'n_filters': [16, 8, 4]
}
# Ensure the latent_dim is a scalar
assert(np.ndim(latent_dim) == 0)
# Ensure the input_shape is a vector
assert(np.ndim(input_shape) == 1)
# If input_shape is a 2D image, append a channel dimension
if len(input_shape) == 2:
input_shape = (*input_shape, 1)
# Set the input shape and latent dimension
self.input_shape = input_shape
self.latent_dim = latent_dim
def build_encoder(self):
"""
Builds the encoder model. The model takes an image as input and
outputs a latent vector.
The function first applies a series of convolution layers to the
input image. The number of filters in each
layer is specified by the 'n_filters' key in the conv_params
dictionary. The activation function, padding, and strides for
these layers are also specified by the conv_params dictionary.
The output of the last convolution layer is then flattened and
passed through a dense layer. The number of units in this dense
layer is equal to the dimension of the latent space.
The output of the dense layer is the latent vector, which is the
output of the encoder model.
"""
# Ensure the 'n_filters' key in conv_params is a list or array
assert(
isinstance(
self.conv_params['n_filters'], (tuple, list, np.ndarray)
)
)
# Define the input image
self.input_image = Input(shape=self.input_shape)
x = self.input_image
# Define the kernel and pool shapes
kernel_shape = [self.conv_params['kernel_size']] * 2
pool_shape = [self.conv_params['pool_size']] * 2
# Encoding layers
for nfilters in self.conv_params['n_filters']:
# Apply a convolution layer with the given parameters
x = Conv2D(
filters=nfilters,
kernel_size=kernel_shape,
activation=self.conv_params['activation'],
padding=self.conv_params['padding'],
strides=self.conv_params['strides']
)(x)
# Store the shape of the volume before flattening
self.volume_size = K.int_shape(x)
# Flatten the volume
x = Flatten()(x)
# Apply a dense layer to map the flattened volume
# to the latent space
self.latent = Dense(self.latent_dim)(x)
# Define the encoder model
self.encoder = Model(self.input_image, self.latent)
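As a quick check that this piece works on its own, here is a minimal usage sketch, assuming only the class above and the imports at the top of the article:
# Minimal sketch: build and inspect a standalone Encoder
encoder_only = Encoder(input_shape=(28, 28, 1), latent_dim=10)
encoder_only.build_encoder()
encoder_only.encoder.summary()
# encoder_only.volume_size now holds the innermost feature-map shape,
# e.g. (None, 4, 4, 4) with the default conv_params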
The Decoder Class
The Decoder class below is very similar to the Encoder class above. In fact, the Decoder was generated from the Encoder by adding a map (Dense + Reshape layers) from the latent space to the expected shape of the innermost convolutional layer. After that, the for loop runs over the reversed list of filter counts and uses Conv2DTranspose layers, as opposed to the Conv2D layers seen in the Encoder class above.
The inputs to the Decoder are deliberately identical to those of the Encoder because both will later be combined into an object that inherits from both the Encoder and the Decoder.
Attributes:
- input_shape (tuple): The shape of the input image. Default is (28, 28, 1).
- latent_dim (int): The dimension of the latent space. Default is 10.
- conv_params (dict): A dictionary of convolution parameters. Default is None.
- volume_size (tuple): The shape of the innermost (2D) convolutional volume, i.e., the feature-map shape into which the Decoder reshapes the latent vector.
An additional input to the Decoder, volume_size, acts as both a bypass and an explicit input for an otherwise implicit value. The volume_size is nominally inherited from the Encoder object in the Autoencoder class below. Accepting it as an input parameter here bypasses the need for an Encoder to be given to the Decoder.
Exposing the shape of the innermost convolutional layer, volume_size, enables future use cases that might require manipulating the Decoder without a symmetric Encoder, or without an Encoder at all.
See my primer for more details on each hyperparameter and more.
class Decoder:
"""
A Decoder class for a Convolutional Autoencoder. This class maps
the latent vector to its corresponding reconstructed image through
a sequence of convolutional layers. The final output layer is
reshaped deliberately to match the shape of the input image:
ncols x nrows x nchannels.
Attributes:
input_shape (tuple): The shape of the input image.
Default is (28, 28, 1).
latent_dim (int): The dimension of the latent space.
Default is 10.
conv_params (dict): A dictionary of convolution parameters.
Default is None.
"""
def __init__(
self, input_shape=(28, 28, 1), latent_dim=10,
conv_params=None, volume_size=None):
"""
The constructor for Decoder class. Initializes the Decoder with
the given latent dimension and image shape.
Args:
input_shape (tuple): The shape of the input image.
Default is (28, 28, 1).
latent_dim (int): The dimension of the latent space.
Default is 10.
conv_params (dict): A dictionary of convolution parameters.
Default is None.
"""
# Initialize convolution parameters
self.conv_params = conv_params
# If no conv_params provided, use default values
if self.conv_params is None:
self.conv_params = {
'activation': 'relu',
'decoder_activation': 'sigmoid',
'padding': 'same',
'strides': 2,
'pool_size': 1,
'kernel_size': 3,
'n_filters': [16, 8, 4]
}
"""
If Decoder is established within the Autoencoder,
then the volume_size is inherited from the Encoder object.
If the Decoder is instantiated alone, then the volume_size
must be calculated or provided as an input variable.
"""
if not hasattr(self, 'volume_size'):
# TODO: calculate `volume_size` from latent_dim and kernel
# shapes assert(volume_size is not None)
            # Because the Decoder here is being built outside an Autoencoder,
            # the `volume_size` must be provided as an input; inside the
            # Autoencoder it is inherited from the `Encoder`
self.volume_size = volume_size
# Ensure the latent_dim is a scalar
assert(np.ndim(latent_dim) == 0)
# Ensure the input_shape is a 1D-vector
assert(np.ndim(input_shape) == 1)
# If input_shape is a 2D image (greyscale),
# append a channel dimension
if len(input_shape) == 2:
input_shape = (*input_shape, 1)
# Set the input shape and latent dimension
self.input_shape = input_shape
self.latent_dim = latent_dim
def build_decoder(self):
"""
Builds the decoder model. The model takes a latent vector as
input and outputs a reconstructed image.
The function first maps the latent vector to a dense layer with
the same number of units as the product of the volume size.
The output of this dense layer is then reshaped to match the
volume size.
The reshaped output is then passed through a series of
transposed convolution layers (also known as deconvolution
layers). The number of filters in each layer is specified
by the 'n_filters' key in the conv_params dictionary.
The activation function, padding, and strides for these layers
are also specified by the conv_params dictionary.
Finally, a convolution layer is applied to the output of the
last deconvolution layer. The number of filters in this layer
is equal to the number of channels in the input image.
The activation function for this layer is specified by the
'decoder_activation' key in the conv_params dictionary.
The output of the final convolution layer is then resized to
match the size of the input image. The resized image is the
output of the decoder model.
"""
# Define the latent inputs
latent_inputs = Input(shape=(self.latent_dim,))
# Compute the volume of the reshaped dense layer
volume_shape = self.volume_size[1:]
volume = np.prod(volume_shape)
# Map the latent inputs to a dense layer and reshape it
x = Dense(volume)(latent_inputs)
x = Reshape(volume_shape)(x)
# Define the kernel and pool shapes
kernel_shape = [self.conv_params['kernel_size']] * 2
pool_shape = [self.conv_params['pool_size']] * 2
# Decoding layers
for nfilters in self.conv_params['n_filters'][::-1]:
# Apply a transposed convolution layer with the
# given parameters
x = Conv2DTranspose(
filters=nfilters,
kernel_size=kernel_shape,
activation=self.conv_params['activation'],
padding=self.conv_params['padding'],
strides=self.conv_params['strides']
)(x)
# The output must be the number of channels
nchannels = self.input_shape[-1]
# Apply a convolution layer to map the volume to the
# number of channels
decoded = Conv2D(
filters=nchannels,
kernel_size=kernel_shape,
activation=self.conv_params['decoder_activation'],
padding='valid'
)(x)
# Get the size of the input image
img_size = K.int_shape(self.input_image)[1:]
# Resize the decoded image to match the size of the input image
resized_image_tensor = image.resize(
images=decoded,
size=list(img_size[:2]),
method='bilinear',
preserve_aspect_ratio=True,
antialias=False,
name=None,
)
# Define the decoder model
self.decoder = Model(latent_inputs, resized_image_tensor)
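To illustrate the volume_size bypass, here is a hypothetical standalone usage sketch. Note that build_decoder, as written, also reads self.input_image (normally set by the Encoder) to determine the target image size, so we provide a stand-in Input tensor; the (None, 4, 4, 4) volume shape is what the default Encoder settings would produce for 28x28 inputs.
# Hypothetical sketch: build a Decoder without a matching Encoder instance
decoder_only = Decoder(
    input_shape=(28, 28, 1),
    latent_dim=10,
    volume_size=(None, 4, 4, 4)  # innermost feature-map shape (assumed)
)
# build_decoder() reads self.input_image only to get the target image size
decoder_only.input_image = Input(shape=(28, 28, 1))
decoder_only.build_decoder()
decoder_only.decoder.summary()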
Putting it All Together: ConvAutoencoder as a builder object
The Encoder class above transforms the input images into their latent space representation, and the Decoder class above transforms the latent space vector into a reconstructed image. To tie those pieces together, the ConvAutoencoder class inherits from both the Encoder and Decoder classes above. This level of modularity helps us focus on each component quasi-independently, which benefits both the explainability and maintainability of the CAE components.
In later articles, we will build out the full generative AI prompting for image generation through to Latent Stable Diffusion (LSD), which starts as a Convolutional Variational Autoencoder. The LSD CAE is a denoising AE that manipulates the latent space representation according to the embedded user query or prompt.
This modularity allows the LSD to stage the CAE (class below) and adds the flexibility for Variational and Constrained input (prompting) in later development, while maintaining explainability. As such, the inputs to an LSD ConvAutoencoder class are nearly identical to the Encoder and Decoder here, except that it finely manipulates the autoencoder to generate new samples.
class ConvAutoencoder(Encoder, Decoder):
"""
A ConvAutoencoder class that combines the Encoder and Decoder classes
    to form a complete Convolutional Autoencoder. The ConvAutoencoder
    class takes an image as input, encodes it into a
latent vector using the Encoder, and then decodes the latent vector
back into an image using the Decoder.
Attributes:
input_shape (tuple): The shape of the input image.
Default is (28, 28, 1).
latent_dim (int): The dimension of the latent space. Default is 10.
autobuild (bool): Whether to automatically build the autoencoder
upon initialization. Default is True.
training_params (dict): A dictionary of parameters for training.
Default is None.
conv_params (dict): A dictionary of convolution parameters.
Default is None.
"""
def __init__(
self, input_shape=(28, 28, 1), latent_dim=10, autobuild=True,
conv_params=None):
"""
The constructor for ConvAutoencoder class.
Initializes the ConvAutoencoder with the given input shape and
latent dimension.
Args:
input_shape (tuple): The shape of the input image.
Default is (28, 28, 1).
latent_dim (int): The dimension of the latent space.
Default is 10.
autobuild (bool): Whether to automatically build the
autoencoder upon initialization. Default is True.
training_params (dict): A dictionary of parameters for
training. Default is None.
conv_params (dict): A dictionary of convolution parameters.
Default is None.
"""
# Initialize the Encoder and Decoder superclasses
super(Encoder, self).__init__()
super(Decoder, self).__init__()
# Set the convolution parameters
self.conv_params = conv_params
if self.conv_params is None:
self.conv_params = {
'activation': 'relu',
'decoder_activation': 'sigmoid',
'padding': 'same',
'strides': 2,
'pool_size': 1,
'kernel_size': 3,
'n_filters': [16, 8, 4]
}
# Check the dimensions of the latent dimension and input shape
assert(np.ndim(latent_dim) == 0)
assert(np.ndim(input_shape) == 1)
# If the input shape is 2D, add a channel dimension
if len(input_shape) == 2:
input_shape = (*input_shape, 1)
# Set the input shape and latent dimension
self.input_shape = input_shape
self.latent_dim = latent_dim
# Build the autoencoder if autobuild is True
if autobuild: # Default
self.build_autoencoder()
def build_autoencoder(self):
"""
Builds the autoencoder model. The model takes an image as input
and outputs a reconstructed image.
The function first checks if the encoder and decoder models have
been built. If not, it calls the build_encoder and build_decoder
methods to build them.
The autoencoder model is then constructed by passing the input
image through the encoder and decoder models in sequence.
"""
# Build the encoder model if it hasn't been built yet
if not hasattr(self, 'encoder'):
self.build_encoder()
# Build the decoder model if it hasn't been built yet
if not hasattr(self, 'decoder'):
self.build_decoder()
# Construct the autoencoder model by passing the input image
# through the encoder and decoder
self.autoencoder = Model(
self.input_image,
self.decoder(
self.encoder(
self.input_image
)
)
)
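A minimal sketch of building the combined model, assuming the classes above:
# Build the combined model; autobuild=True constructs all three sub-models
cae = ConvAutoencoder(input_shape=(28, 28, 1), latent_dim=10)
cae.encoder.summary()      # image -> latent vector
cae.decoder.summary()      # latent vector -> reconstructed image
cae.autoencoder.summary()  # image -> reconstructed image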
Make it Work
With the ConvAutoencoder architecture now settled, we want to introduce the training and inference components. The ConvAETrainer class below inherits from the previous three classes to build the ConvAutoencoder, compile it, and train it on the provided dataset. The dataset we use here is either MNIST or FashionMNIST. In addition to the train method, the class also provides encode_image, decode_image, and generate_image methods.
These methods include:
1. encode_image computes the latent vector from an input image
2. decode_image computes the reconstructed image from an input image
3. generate_image computes the reconstructed image from a latent vector
The third method above (generate_image) is the generative part of “Generative AI”. It takes as input a representation of the latent space and outputs a reconstructed image. In the case of Latent Stable Diffusion, the “latent space” representation is generated by embedding the user prompt.
In our case here, we will probe the statistical distributions over the latent space to identify the most likely regions for each input data class (MNIST: digit numbers). Using the modes from those statistical distributions, we will generate images of handwritten digits that coincide with the multi-dimensional latent space modes, i.e., the compressed, latent-space image representation (see the usage sketch after the class below).
class ConvAETrainer(ConvAutoencoder, Encoder, Decoder):
"""
A ConvAETrainer class that inherits from the ConvAutoencoder class.
This class is used to train the Convolutional Autoencoder and
generate new images from the latent space.
Attributes:
input_shape (tuple): The shape of the input image.
Default is (28, 28, 1).
latent_dim (int): The dimension of the latent space.
Default is 10.
training_params (dict): A dictionary of training parameters.
Default is None.
conv_params (dict): A dictionary of convolution parameters.
Default is None.
"""
def __init__(
self, input_shape=(28, 28, 1), latent_dim=10,
autobuild=True, training_params=None, conv_params=None):
"""
The constructor for ConvAETrainer class. Initializes the
ConvAETrainer with the given input shape, latent dimension,
training parameters, and convolution parameters.
Args:
input_shape (tuple): The shape of the input image.
Default is (28, 28, 1).
latent_dim (int): The dimension of the latent space.
Default is 10.
training_params (dict): A dictionary of training parameters.
Default is None.
conv_params (dict): A dictionary of convolution parameters.
Default is None.
"""
# Initialize the Encoder and Decoder superclasses
super(Encoder, self).__init__()
super(Decoder, self).__init__()
super(ConvAutoencoder, self).__init__()
# Set the convolution parameters
self.conv_params = conv_params
if self.conv_params is None:
self.conv_params = {
'activation': 'relu',
'decoder_activation': 'sigmoid',
'padding': 'same',
'strides': 2,
'pool_size': 1,
'kernel_size': 3,
'n_filters': [16, 8, 4]
}
# Set the autoencoder parameters
self.training_params = training_params
if self.training_params is None:
self.training_params = {
'optimizer': 'adam',
'loss': 'binary_crossentropy',
'metrics': None
}
# Check the dimensions of the latent dimension and input shape
assert(np.ndim(latent_dim) == 0)
assert(np.ndim(input_shape) == 1)
# If the input shape is 2D, add a channel dimension
if len(input_shape) == 2:
input_shape = (*input_shape, 1)
# Set the input shape and latent dimension
self.input_shape = input_shape
self.latent_dim = latent_dim
# Build the autoencoder if autobuild is True
if autobuild: # Default
self.build_autoencoder()
def train(
self, x_train, x_val, epochs=50, batch_size=128,
callbacks=None, shuffle=True):
"""
Train the autoencoder.
Args:
x_train (np.array): The training data.
x_val (np.array): The validation data.
epochs (int): The number of epochs to train for.
Default is 50.
batch_size (int): The batch size for training.
Default is 128.
callbacks (list): A list of callbacks to apply
during training. Default is None.
shuffle (bool): Whether to shuffle the training data
before each epoch. Default is True.
Returns:
A History object. Its History.history attribute is a
record of training loss values and metrics values at
successive epochs, as well as validation loss values
and validation metrics values (if applicable).
"""
# Build the autoencoder if it hasn't been built yet
if not hasattr(self, 'autoencoder'):
self.build_autoencoder()
# Compile the autoencoder model with the specified optimizer,
# loss function, and metrics
self.autoencoder.compile(
optimizer=self.training_params['optimizer'],
loss=self.training_params['loss'],
metrics=self.training_params['metrics']
)
# Train the autoencoder
return self.autoencoder.fit(
x_train,
x_train,
epochs=epochs,
batch_size=batch_size,
shuffle=shuffle,
callbacks=callbacks,
validation_data=(x_val, x_val)
)
def encode_image(self, image):
"""
Get the encoded representation of the image.
Args:
image (np.array): The image to encode.
Returns:
The encoded representation of the image.
"""
# Ensure the encoder has been built
assert(hasattr(self, 'encoder'))
# Encode the image
return self.encoder.predict(image)
def decode_image(self, images):
"""
Get the decoded (reconstructed) image.
Args:
images (np.array): The images to decode.
Returns:
The decoded images.
"""
# Ensure the autoencoder has been built
assert(hasattr(self, 'autoencoder'))
# Decode the images
return self.autoencoder.predict(images)
def generate_image(self, latent_vector):
"""
Generate a new image from the latent space.
Args:
latent_vector (np.array): The latent vector from which
to generate the reconstructed image.
Returns:
The generated image.
"""
# Generate the image from the latent vector
return self.decoder.predict(latent_vector)
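As a preview of how these methods fit together, here is a hypothetical usage sketch. It assumes a trained ConvAETrainer instance (built and fit as shown later in this article) and the preprocessed x_test and y_test arrays; it averages the latent vectors per digit class and generates one image per class mean.
# Hypothetical usage, assuming `conv_autoencoder` has already been trained
latent_vectors = conv_autoencoder.encode_image(x_test)   # (n_samples, latent_dim)
reconstructions = conv_autoencoder.decode_image(x_test)  # (n_samples, 28, 28, 1)

# One latent vector per digit class: the mean of that class's encodings
class_means = np.stack([
    latent_vectors[y_test == digit].mean(axis=0) for digit in range(10)
])
generated_digits = conv_autoencoder.generate_image(class_means)  # (10, 28, 28, 1)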
Preprocessing the Data
The datasets are stored remotely with greyscale values from 0 to 255. Because the convolutional neural network expects input values from 0 to 1, we will load the data and then divide each pixel by 255 (the max value). Because MNIST and FashionMNIST images are stored as 2D arrays per image, we expand the input images with a third dimension of size 1, i.e., a single greyscale channel. This also allows the project to scale up to color images (3D arrays per image).
def preprocess(features, labels):
"""
Normalizes the supplied array and reshapes it into
the appropriate format.
"""
features = features.astype("float32") / 255.0
if np.ndim(features) == 3: # Greyscale is N samples of 2D arrays
features = np.expand_dims(features, axis=-1)
if np.ndim(labels) == 2:
labels = labels.ravel()
return features, labels
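A quick sanity check of preprocess (a sketch; the printed shapes assume the MNIST dataset):
(x_raw, y_raw), _ = datasets.mnist.load_data()
x_scaled, y_flat = preprocess(x_raw, y_raw)
print(x_raw.shape, x_raw.dtype, x_raw.max())            # (60000, 28, 28) uint8 255
print(x_scaled.shape, x_scaled.dtype, x_scaled.max())   # (60000, 28, 28, 1) float32 1.0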
Train, Validation, Test Splitting
In addition to preprocessing, the load_and_preprocess_data function below will split the data according to the prescribed test_size. The default test_size is 50% because we want a reliable estimate of generalisability by testing on a large set of unseen (untrained-on) images. Generalisation error is the error derived from using the model "in the wild", often after deployment or at least during a later decision-making process.
There is a different use of the words test and validation in the nomenclature because the ML community iterated on them over time. The train_test_split function uses the scikit-learn terminology: test_size. In contrast, the model.fit method uses the keras terminology: validation_data and val_loss. Furthermore, the MNIST dataset includes train and test images, where the test images are meant to remain unseen by the model during the training process. This coincides with the keras terminology. Our use here is that:
- train images are used to update the model weights during training.
- validation images are used to monitor a proxy for generalisation error during the training process.
- test images are used to evaluate the model results as a more direct proxy for generalisation error.
Because MNIST already comes with 10k dedicated test images, we use train_test_split only on the 60k training images, producing 50% train images and 50% validation images (30k each). Validation images are used during the training process to compute the val_loss, which allows the callbacks, and the developers, to be more confident that the training process has not devolved past the minimum generalisation error per model.
Generalisation error is best estimated after training as the test_loss, using the test images that were never provided to the training procedure. Most developers use the val_loss as a running proxy for the test_loss. If the final val_loss does not match the test_loss within an expected variance of a few percent, then the model will likely not sustain the expected generalisation error.
In later articles, we will probe hyperparameter optimisation by training hundreds of CAEs and other ML models. For each model, we must compute the test_loss to select the best model, as well as to weight each model's contribution towards a weighted ensemble. When performing hyperparameter optimisation, however, the test_loss should no longer be used as a robust proxy for the generalisation error. Depending on the size of the provided dataset, we could split it into four components: train, validation, test, and generalisation (a minimal sketch of such a split follows below). If the dataset is not large enough for four splits, then new data should be acquired and monitored, often during deployment and later inference.
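Here is that sketch, on dummy data; the rest of this article uses the simpler three-way split implemented below.
# Four-way split sketch (train / validation / test / generalisation) on dummy data
X_dummy = np.random.rand(1000, 28, 28, 1)
y_dummy = np.random.randint(0, 10, size=1000)

x_trainval, x_holdout, y_trainval, y_holdout = train_test_split(
    X_dummy, y_dummy, test_size=0.5)
x_tr, x_vl, y_tr, y_vl = train_test_split(
    x_trainval, y_trainval, test_size=0.5)
x_te, x_gen, y_te, y_gen = train_test_split(
    x_holdout, y_holdout, test_size=0.5)
print(len(x_tr), len(x_vl), len(x_te), len(x_gen))  # 250 250 250 250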
def load_and_preprocess_data(dataset='mnist', test_size=0.5):
known_datasets = ['mnist', 'fashion_mnist', 'cifar10', 'cifar100']
if dataset == 'mnist': # 2D arrays of greyscale images
mnist = datasets.mnist.load_data()
(x_train, y_train), (x_test, y_test) = mnist
if dataset == 'fashion_mnist': # 2D arrays of greyscale images
fashion_mnist = datasets.fashion_mnist.load_data()
(x_train, y_train), (x_test, y_test) = fashion_mnist
if dataset == 'cifar10': # 3D arrays of color images
cifar10 = datasets.cifar10.load_data()
(x_train, y_train), (x_test, y_test) = cifar10
if dataset == 'cifar100': # 3D arrays of color images
cifar100 = datasets.cifar100.load_data()
(x_train, y_train), (x_test, y_test) = cifar100
x_train, y_train = preprocess(x_train, y_train)
x_test, y_test = preprocess(x_test, y_test)
x_train, x_val, y_train, y_val = train_test_split(
x_train,
y_train,
test_size=test_size # 50% of train test for validation
)
return x_train, x_val, x_test, y_train, y_val, y_test
Load Data, Make Model, Fit Model to Data
Load Data
Here we assign the dataset and test_size parameters that dominate our use case for our Convolutional Autoencoder, then load and preprocess the data as specified above.
dataset = 'mnist'
test_size = 0.5
x_train, x_val, x_test, y_train, y_val, y_test = load_and_preprocess_data(
dataset=dataset,
test_size=test_size
)
Make the Model
With the data in hand, we can now define the autoencoder architecture, instantiate the model, and load the data into the ConvAETrainer instance. We chose the latent_dim to be 10 dimensions because it is convenient when we later associate the modes of each latent dimension with a specific class or digit, from which to generate new handwritten digits. This choice is not necessary, but it is convenient for pedagogical and explainable development.
Loss Function Selection
For our loss function, we chose binary_crossentropy because our greyscale images span a range from 0 to 1, and we expect the reconstructed values to lie between 0 and 1. Each loss function has its own behaviour, assumptions, expectations, use cases, and value. The input data, output activation function, and loss function should be chosen as a triplet. We suggest exploring the loss function with respect to the dataset and preprocessing as a "hyper-hyperparameter".
For clarity on the deep learning background context with regards to loss functions and datasets, please see my primer on Computer Vision Deep Learning Primer with Keras and Python.
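As a quick, standalone illustration of how two candidate losses score the same imperfect reconstruction in the [0, 1] range (not part of the training pipeline):
from tensorflow.keras import losses

y_true_demo = np.array([[0.0, 0.5, 1.0]])
y_pred_demo = np.array([[0.1, 0.5, 0.9]])
print(float(losses.BinaryCrossentropy()(y_true_demo, y_pred_demo)))  # cross-entropy penalty
print(float(losses.MeanSquaredError()(y_true_demo, y_pred_demo)))    # mean squared error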
latent_dim = 10
# https://link.springer.com/chapter/10.1007/978-3-031-11349-9_30
# loss = 'mse' # good for small images and large datasets
# loss = 'mae' # good for small images and large datasets
loss = 'binary_crossentropy' # good for large images and small datasets
training_params = {
# 'optimizer': 'adadelta',
'optimizer': 'adam',
'loss': loss,
'metrics': None
}
conv_params={
'activation': 'relu',
'decoder_activation': 'linear',
'padding': 'same',
'strides': 2,
'pool_size': 1,
'kernel_size': 3,
'n_filters': [16, 8, 4]
}
input_shape = x_train.shape[1:] # skip the number of samples
conv_autoencoder = ConvAETrainer(
input_shape=input_shape,
latent_dim=latent_dim,
training_params=training_params,
conv_params=conv_params
)
print(conv_autoencoder.autoencoder.summary())
print(conv_autoencoder.encoder.summary())
print(conv_autoencoder.decoder.summary())
Calculating the Size of the Model
In a Convolutional Autoencoder, it is crucial to understand the number of parameters in the network to appreciate the complexity and capacity of the model. To count the number of parameters in a Convolutional Autoencoder, we need to consider the encoder, decoder, and latent space components. The primary components contributing to the parameter count are convolutional layers (Enc + Dec) and fully connected layers.
- Encoder Convolutional Layers: For a Conv2D layer, the number of parameters is determined by the size of the filters (or kernels), the number of input channels, and the number of output channels (n_filters): n_params = (kernel_size · kernel_size · n_filters_in + 1) · n_filters_out
- Decoder Convolutional Layers: The parameter calculation for Conv2DTranspose layers is similar to that of Conv2D layers. The formula remains the same as above.
- Latent Space: For the fully connected (Dense) layers used for dimensionality reduction or reshaping, the number of parameters is: n_params = (n_input_units + 1) · n_output_units
In each n_params calculation above, there is a "+1" term, which represents the bias term for each layer or filter. The bias term allows the network to rebalance the average amplitude of each layer or filter to help avoid exploding or vanishing gradients. For convolutional layers, a more common technique not shown here is to introduce a BatchNormalization layer after each convolutional filter.
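As a quick check of these formulas in code, applied to our default configuration (3x3 kernels, one greyscale input channel, and [16, 8, 4] Encoder filters):
def conv2d_params(kernel_size, n_filters_in, n_filters_out):
    # (kernel_size * kernel_size * n_filters_in + 1) * n_filters_out
    return (kernel_size * kernel_size * n_filters_in + 1) * n_filters_out

def dense_params(n_input_units, n_output_units):
    return (n_input_units + 1) * n_output_units

print(conv2d_params(3, 1, 16))   # 160: first Encoder Conv2D layer
print(conv2d_params(3, 16, 8))   # 1160: second Encoder Conv2D layer
print(conv2d_params(3, 8, 4))    # 292: third Encoder Conv2D layer
print(dense_params(64, 10))      # 650: Dense layer into the latent space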
Examining the Size of Our Autoencoder
In the default configuration above, each filter defaults to a 3x3 kernel. With Encoder layers of [16, 8, 4] filters, the Encoder includes 2262 trainable parameters: 1612 in the convolutional layers and 650 in the Dense projection into the latent space. The Decoder has a slightly different number of parameters, because reversing the filter sequence changes the input and output channels of each layer (and the Dense layer out of the latent space has a different shape), which results in 2461 Decoder parameters to train.
In addition to parameters, it is important to maintain a functional understanding of the size and number of feature maps (the outputs per filter/kernel per layer). Feature maps are the image-like 2D arrays generated by each filter transformation. In computer vision, these would be the transformed images, like edge detections or color gradients. With 16 filters/kernels in the first block, a 28x28 image outputs 16 feature maps, each with a quarter the number of input pixels, because the stride is set to 2. This means that the first convolutional layer produces 3136 feature pixels: 28 x 28 x 16 / 4.
After 3 convolutional blocks, with 16, 8, and 4 filters sequentially, the Encoder outputs four 4x4 feature maps, which is 64 feature pixels. The latent vector results from a linear matrix transformation using a Dense layer to convert these 64 feature pixels into the 10 latent space dimensions. This introduces a (64+1) x 10 linear matrix, which therefore includes 650 parameters for the neural network to train.
The convolutional layers in the Encoder (28 filters) and the Decoder (29 filters, including the single-channel output layer) account for 3369 parameters, while the two Dense layers (into and out of the latent space) require 1354 parameters: roughly 29% of the 4723-parameter network in just two layers. This shows that Dense layers can dramatically outnumber the parameters required for a comparable operation performed by a set of Conv2D layers.
Scaling Up to a Real World Application with Latent Stable Diffusion
Each convolutional layer produces a number of feature maps equal to its number of filters. With strides of 2, each feature map is a quarter the size of the input feature map to that layer. For reference, the first "feature map" is the input image data itself, and the next feature maps are the transformed images after being convolved by the first layer of filters. For the Encoder, this amounts to 3136 + 392 + 64, or 3592 feature pixels. The Decoder feature maps are larger still, because the transposed convolutions upsample to 32x32x16 before the final convolution and resize, giving roughly 20,000 more feature pixels. All of these feature pixels (i.e., elements of the feature maps) must be stored on the GPU per image in the batch, alongside the 4723 (shared) convolutional autoencoder parameters. With a batch size of 128, that amounts to a few million floating point numbers stored on the GPU.
These few million floats are not significant for modern GPUs. Then again, our experiment only includes seven convolutional layers and small images. These values grow linearly with the number of input pixels (image height x image width) and the batch size. For an HD image (1920x1080 pixels), the number of floats stored for even our simple convolutional autoencoder would scale into the billions. If we then expand to much larger autoencoders, the original Latent Stable Diffusion paper quotes roughly 1.5 billion parameters, which requires a GPU that can hold and process several billions of float values, i.e., the GPU must be able to efficiently process tens of GB/s. These are still trifles compared to modern LSDs and LLMs, which are built on similar technology.
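For the curious, here is a rough sketch that tallies how many activation values each sub-model above produces per input image, by summing the output elements of every layer (batch axis excluded); the exact attribute access assumes the TF2 Keras functional models built above.
def count_feature_pixels(model):
    # Sum the number of output elements (ignoring the batch axis) of every layer
    total = 0
    for layer in model.layers:
        shape = layer.output_shape
        if isinstance(shape, list):  # e.g. InputLayer may report a list of shapes
            shape = shape[0]
        total += int(np.prod([dim for dim in shape[1:] if dim is not None]))
    return total

print(count_feature_pixels(conv_autoencoder.encoder))  # a few thousand feature pixels
print(count_feature_pixels(conv_autoencoder.decoder))  # roughly twenty thousand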
Our Encoder Size
Model: "Encoder"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) [(None, 28, 28, 1)] 0
conv2d (Conv2D) (None, 14, 14, 16) 160
conv2d_1 (Conv2D) (None, 7, 7, 8) 1160
conv2d_2 (Conv2D) (None, 4, 4, 4) 292
flatten (Flatten) (None, 64) 0
dense (Dense) (None, 10) 650
=================================================================
Total params: 2262 (8.84 KB)
Trainable params: 2262 (8.84 KB)
Non-trainable params: 0 (0.00 Byte)
________________________________________________________________
Our Decoder Size
Model: "Decoder"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_2 (InputLayer) [(None, 10)] 0
dense_1 (Dense) (None, 64) 704
reshape (Reshape) (None, 4, 4, 4) 0
conv2d_transpose (Conv2DTr (None, 8, 8, 4) 148
anspose)
conv2d_transpose_1 (Conv2D (None, 16, 16, 8) 296
Transpose)
conv2d_transpose_2 (Conv2D (None, 32, 32, 16) 1168
Transpose)
conv2d_3 (Conv2D) (None, 30, 30, 1) 145
tf.image.resize (TFOpLambd (None, 28, 28, 1) 0
a)
=================================================================
Total params: 2461 (9.61 KB)
Trainable params: 2461 (9.61 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
Our Autoencoder Size
Model: "Autoencoder"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) [(None, 28, 28, 1)] 0
model (Functional) (None, 10) 2262
model_1 (Functional) (None, 28, 28, 1) 2461
=================================================================
Total params: 4723 (18.45 KB)
Trainable params: 4723 (18.45 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
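The same totals can also be read programmatically from the models (assuming the conv_autoencoder instance above):
print(conv_autoencoder.encoder.count_params())      # 2262
print(conv_autoencoder.decoder.count_params())      # 2461
print(conv_autoencoder.autoencoder.count_params())  # 4723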
Try It and See What Happens
As far as we can tell, we have everything set up. Now, we shall set up the training procedure, configure the callbacks, and initiate the training process.
Training Setup
We will train in batches of 128 MNIST images for a minimum of 25 epochs and a maximum of 200 epochs (see callbacks). Furthermore, we will set shuffle=True because it is standard to shuffle the data to help avoid local minima by training on different orderings of the same 30k images.
Callbacks: Early Stopping
There are many useful callbacks that can be considered for monitoring the training, improving the generalisability, reducing the wall-time to complete, etc. As a minimum, we should always include an EarlyStopping callback. Because we set the maximum number of epochs arbitrarily, we could set it to 1 epoch and produce a low-quality model. We could also set it to 1 billion epochs, such that the training would never end in the lifetime of the human species on Earth.
More than just wall-clock time, it has been shown that generalisation error increases with too many epochs (after the train-validation sweet spot), because the model keeps trying to improve the training loss without regard for the validation loss, which we use as a running proxy for the generalisation error. We will see below that the validation error may start to increase after many iterations, while the training error continues to decrease. The lowest validation error, just before that increase, marks the sweet spot.
As such, the EarlyStopping callback monitors the validation loss and deactivates the training process when the val_loss (or a monitor of our choosing) stops improving over successive epochs. The EarlyStopping callback has three parameters that we set differently from the default.
- patience (=20) means that if the val_loss does not improve upon the current best within 20 epochs, then the EarlyStopping callback will deactivate the training and return the history.
- monitor (='val_loss') means that EarlyStopping will track the val_loss metric. We could ask it to track the RMSE, the training loss, the F1 score, etc. We could also provide it with an external function that takes the features and labels as input and returns a single value.
- start_from_epoch (=25) lets the EarlyStopping callback take a nap until the 26th epoch. There is often considerable fluctuation in the val_loss over the first 10+ epochs, and we want to avoid the EarlyStopping callback deactivating the training before it has properly begun. We must, therefore, assume that the generalisation error (proxied by the validation loss) is not minimised in the first 25 epochs.
epochs = 200
batch_size = 128
shuffle = True
patience = 20
start_from_epoch = 25
callbacks = [
EarlyStopping(
patience=patience,
monitor="val_loss",
min_delta=0,
verbose=0,
mode="auto",
baseline=None,
restore_best_weights=True,
start_from_epoch=start_from_epoch,
),
]
We embedded the training procedure within ConvAETrainer and instantiated it as the conv_autoencoder instance. We thus start training by calling the conv_autoencoder.train method, which is a wrapper for the Keras autoencoder.fit. This method takes the training data (x_train), the validation data (x_val), the list of callbacks, and the training parameters that we discussed above. The output of conv_autoencoder.train is a History object, whose history attribute is a dict, that we store as history. The conv_autoencoder.train procedure also outputs a stream of meaningful text, including the training loss and val_loss.
For more details about understanding the Keras training output, please see my primer on Computer Vision Deep Learning Primer with Keras and Python.
history = conv_autoencoder.train(
x_train=x_train,
x_val=x_val,
epochs=epochs,
batch_size=batch_size,
callbacks=callbacks,
shuffle=shuffle
)
"""
Epoch 1/200
235/235 [==============================] - 22s 30ms/step -
loss: 0.3454 - val_loss: 0.2447
Epoch 2/200
235/235 [==============================] - 6s 24ms/step -
loss: 0.2230 - val_loss: 0.2112
Epoch 3/200
235/235 [==============================] - 5s 22ms/step -
loss: 0.2075 - val_loss: 0.2021
...
Epoch 197/200
235/235 [==============================] - 3s 12ms/step -
loss: 0.1432 - val_loss: 0.1432
Epoch 198/200
235/235 [==============================] - 2s 10ms/step -
loss: 0.1415 - val_loss: 0.1411
Epoch 199/200
235/235 [==============================] - 3s 11ms/step -
loss: 0.1410 - val_loss: 0.1405
Epoch 200/200
235/235 [==============================] - 3s 12ms/step -
loss: 0.1783 - val_loss: 0.1629
"""
Now that the training has completed, let's understand how to evaluate the results, which are predominantly stored in the history.history dict. Because the effectiveness of training (loss and val_loss) is correlated with how many epochs the algorithm iterated, we will make an epochs array and plot the loss values against it.
In our trial run, the validation loss did not trigger the EarlyStopping callback. As a result, we have a full 200 epochs to evaluate. This could imply that we should either update the learning rate, increase the number of epochs, or change the loss function from binary_crossentropy to mse, mae, or a number of other options.
The global result of the visualisation is that both the validation and training loss decay over time, which is a positive sign of training effectiveness. It implies that the model is able to determine meaningful correlations between the input and output data. Of course, because we are working with an autoencoder, the input and output data are the same.
Putting this together, we can understand that effective training implies that the latent space of the autoencoder meaningfully captures enough information from the encoded input images to effectively generate reconstructed images. This means that the latent space could be effective at generating new images with the Decoder independently. It also means that the encoder can be considered effective enough at compressing these images without losing significant information.
Visual Evaluation of Loss and Error
When first experimenting with deep learning and loss curves, most new ML developers wonder how to interpret the features of the loss curve: loss over epochs. Aside from the global, start-to-finish decay in the loss, there are also quasi-periodic spikes throughout the epochs. In our visualisation below, these spikes occur every 20–40 epochs in both the training (blue) and validation (red) loss.
The cause of these spikes is not perfectly knowable, but the most common explanation is variability in the mini-batch stochastic gradient descent (SGD). This variability means that when the mini-batch SGD grabs a set of images, those images are significantly different from any of the images that the model has understood well thus far, i.e., more varied.
As a made-up example, if training was unlucky such that the model had not yet learned to understand images of handwritten zero digits, then, when stochastically asked to train on a mini-batch consisting mostly of zeros, the autoencoder would find it very difficult to reconstruct them, and the loss would spike. There are other explanations for the spikes, but this is the version that best relates to our example, as well as to the architecture of our autoencoder.
As a result, another mark of quality for the training process is whether the amplitude of these spikes decreases as the number of training epochs increases. In our example visualisation, the largest spike (other than epoch 0) occurs at epoch 90. If we continue with the made-up example about misunderstanding handwritten zeros, then we could say that the autoencoder "had not yet learned" how to recognise a zero.
The stochastic nature of the process and the real-world variability of the dataset combine in such a way that each successive epoch is not pre-determined to produce a better model. The good news is that our visualisation shows that the spikes become less significant with increased epochs, to the point of converging closer to the loss curve.
epochs = np.arange(len(history.history['loss']))
plots = [
go.Scatter(
x=epochs,
y=np.log10(history.history['loss']),
name='Loss'
),
go.Scatter(
x=epochs,
y=np.log10(history.history['val_loss']),
name='Validation Loss'
)
]
fig = go.Figure(plots)
fig.show()
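As a small optional add-on (a sketch using the same history object), we can overlay a rolling mean of the validation loss to make the decay of the spike amplitude easier to see:
window = 10  # rolling-mean window, in epochs
val_loss = np.asarray(history.history['val_loss'])
smoothed = np.convolve(val_loss, np.ones(window) / window, mode='valid')

fig.add_trace(
    go.Scatter(
        x=epochs[window - 1:],  # align the valid-mode convolution with its epochs
        y=np.log10(smoothed),
        name='Smoothed Validation Loss'
    )
)
fig.show()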
Conclusion
We created a basic convolutional autoencoder using independent classes with double and triple inheritance. Each class could have been built into a single class object, but the learning and (future) testing experience of compartmentalising the autoencoder provided a detailed understanding of each piece of the puzzle.
- The Encoder transforms the input image, through matrix manipulations and convolutional transformations, into the latent vector.
- The latent vector is the encoded representation of the input images. With hope, it contains sufficient information to reconstruct the image.
- The Decoder transforms the latent vector into the reconstructed image using symmetric matrix manipulations and convolutional transformations, upscaling the autoencoded features into an image representation of the latent vector.
In the next (sub-)article, we will visually evaluate the latent space and the reconstructed images to understand how the distributions over the latent space store the image data. We will also reveal how to post-process and constrain the latent space to reconstruct specific handwritten digits, by isolating the multidimensional regions that correspond to the digit label of our choosing.
Read through Part 1 of this article series to understand how Generative AI evolved from the simpler architecture of Convolutional Autoencoders. In Part 3, we visually evaluate how to probe and constrain the latent space in order to understand its generative properties through latent space manipulation. If you prefer a direct, deep dive into the subject, try out the full text of this article series as one continuous article by selecting [Full] below.
If you like the article and would like to support me make sure to:
- 👏 Clap for the story (53 claps) and follow me 👉
- 📰 View more content on my medium profile
- 🔔 Follow Me: LinkedIn | Medium | GitHub | Twitter
- 🚀👉 Join the Medium membership program to continue learning without limits. I’ll receive a small portion of your membership fee if you use the following link, at no extra cost to you.