Generative AI Through the Ages: Convolutional Autoencoders

Jonathan Fraine
43 min read · Mar 3, 2024


Image Generated by the Author using DALL-E 3 Prompted by the Article Title

Generative AI (GenAI) is any computational algorithm that can create new content resembling a set of data from which it has learned or mapped reproducible patterns.

Modern versions of GenAI are based on deep learning neural networks that train on massive data sets of images, audio signals, and text. They have become highly popular in the last few years with user-facing applications, such as Stable Diffusion, Midjourney, and DALL-E, as well as Large Language Models (LLMs), such as GitHub Copilot, ChatGPT, or Claude.

At some level in each of these examples, an Autoencoder was configured to encode and decode each application's data environment. Most often, an Autoencoder is implemented at the first (encoder) and last (decoder) stages to embed or vectorise the inputs, as well as to transform the final latent stage into a new image, text, audio, etc.

The goal of most Generative AI experiments is to create an algorithm that can generate new data samples that do not explicitly exist in the dataset. A well-trained model can produce samples that closely represent the data environment on which it was trained. The data environment could include images alone, audio alone, text alone, as well as images + text, audio + text, or images + classes. In the most impressive cases, the algorithm can generate convincing examples outside the expected boundaries of the data environment, such as a cave drawing of a Neanderthal using a cell phone.

Generated by the Author using DALL-E 3

Original User-facing Image Generators

One of the first famous examples of a user-facing GenAI specific to computer vision was ThisPersonDoesNotExist.com. It deployed a variant of a Generative Adversarial Network (GAN), which is known to create very crisp, artificial images of faces when trained on a corpus of human faces.

As we show in this article, an Autoencoder is often visualised as two pyramids pointing at each other. The architecture of a GAN can be visualised as an inside-out Autoencoder, with the image domains touching.

(left) Diagram of a Convolutional Autoencoder. (right) GAN Diagram Created by Flipping ConvAE [Image from Snover et al 2020]

Even the first generation of GANs could produce astonishingly precise, human-like facial images that impressed public users, not just AI experts. Still, the first GANs were limited in their scope of reality, and the generated human faces often resembled one another, varying in only a handful of features: gender, eye type, hair color, hair style, facial expression, etc. As we will discuss here, this was because the latent spaces were narrow and produced low-quality image mappings that failed to capture real-world, multimodal distributions, a failure known as mode collapse.

Both GANs and Autoencoders have improved significantly since their inception. GAN developers created tools and methods to broaden their latent spaces. At nearly the same time and pace, the image-generation clarity of Autoencoders, especially Constrained Convolutional Autoencoders, improved dramatically with increased computational capacity (i.e., more complex models) and more expansive datasets.

As stated above, almost all modern Generative AI techniques utilise an Autoencoder to encode (or “embed”) and decode their input/output data streams. As a result, we will discuss Autoencoders here as the backbone of Generative AI. In the context of image generators, the most prevalent variant is the Convolutional Autoencoder.

Autoencoders

Although GenAI has blossomed with resounding popularity in the last few years, Autoencoders (AEs) were first invented as far back as the 1980s. They were able to generate new images by exploring the multidimensional latent-space distributions learned from training on existing image datasets. The most notable original dataset was the MNIST handwritten digits. For clarity on early computer vision deep learning and more background on the MNIST handwritten digits dataset, please see my primer, Computer Vision Deep Learning Primer with Keras and Python.

Even though Autoencoders had generative capability, they were not widely used to generate new images of human faces, but instead served as denoising, feature-extraction, data-compression, or data-augmentation models. In fact, an early technique for training deep neural networks for classification was first to train an autoencoder in which the dimensionality of the latent space was equal to the desired number of classes. After throwing away the decoder, the encoder could be fine-tuned along with new output layers to retrain it as a classifier.
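As a hedged sketch of that pretraining trick, the snippet below reuses the conv_autoencoder.encoder built later in this article; the 10-unit softmax head and the compile settings are illustrative assumptions, not a prescribed recipe.

from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Model

# Keep the pretrained encoder; discard the decoder entirely
encoder = conv_autoencoder.encoder

# Attach a classification head sized to the number of classes (10)
probs = Dense(10, activation='softmax')(encoder.output)
classifier = Model(encoder.input, probs)

# Fine-tune the encoder weights along with the new output layer
classifier.compile(optimizer='adam',
                   loss='sparse_categorical_crossentropy',
                   metrics=['accuracy'])
classifier.fit(x_train, y_train, validation_data=(x_val, y_val))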

Diagram of a Convolutional Autoencoder. [Image from Snover et al 2020]

For our computer vision use case, an Autoencoder is a three-part neural network that inputs an image and outputs another image. The output image is referred to as the reconstructed image. The three components of our autoencoder are an Encoder, a latent-space representation, and a Decoder.

The input of the Encoder is the preprocessed image, while the output of the Encoder is the latent vector. The input of the Decoder is that latent vector, while the output of the Decoder is the reconstructed image (scaled similarly to the input image). The latent space is an abstract, multidimensional representation of the transformed or compressed image that is nominally one-dimensional. Each element of that 1D vector can be interpreted as a learned feature over the dataset.

If we prioritised feature-space interpretability, and, say, focused on human face generation, then one of the features could represent eye shape or color, another could change the face shape, a third could affect the ear height, while a fourth could manipulate the hair style. From this perspective, the values of the latent vector could be a quantification of the features measured in the input image by the Encoder.

Similarly, changing the values of the latent vector elements before passing it through the Decoder could be akin to tuning a knob on the features of the reconstructed, human-face image. For a more casual user example, the latent space can be interpreted as a non-linear compression or compressed representation of the input image. Compression is, indeed, a well established use case for Autoencoders.
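As a purely hypothetical sketch of this knob-turning, where encoder and decoder stand in for the trained Keras models built later in this article, and the feature index and offset are illustrative assumptions:

# Encode a single (hypothetical) face image into its latent vector
latent = encoder.predict(face_image[None, ...])  # shape: (1, latent_dim)

# Nudge one latent element: our imagined "hair style" knob
latent_tweaked = latent.copy()
latent_tweaked[0, 3] += 1.5

# Decode the tweaked vector into a modified reconstructed face
new_face = decoder.predict(latent_tweaked)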

In reality, these features can be manipulated, but they rarely act independently or linearly: they are very often correlated through the non-linear transformations of the Encoder's feature mappings. At the same time, it is reasonable to interpret the latent space as a collection of feature elements that represent the image dataset. In our toy example below, we will use a post-processing method to artificially constrain the distributions over our latent vectors to generate specific handwritten digits with increasing clarity.

Furthermore, there are variations on autoencoders where the latent vector is not explicitly defined. Instead, the output of the Encoder is simply passed through to the Decoder. These variants do not produce a vector that can be extracted, examined, and manipulated. The Encoder would instead output a matrix as an Encoded “image” or feature map that is still a sample from the latent space.

Nonetheless, because GenAI directly implements latent-space manipulation as a key element of steering the output via user prompting, we will focus on the nominal Autoencoder variant that outputs an explicit vector within the latent space. In the context of dimensionality reduction, the latent vector is by definition smaller than the image itself. All the same, it contains as much of the original information as is necessary to reconstruct the output image with minimum error or loss. Because of this use case, the latent vector is sometimes referred to as the encoding or embedding vector.

This use case either uses the AE as a low-pass filter (denoising) or stores only the latent vector necessary to reconstruct the image (compression). Furthermore, if we train an AE on a set of data, then we can compress that data into its set of latent vectors. Think of this as encoding on your computer, transmitting the latent vector via the internet, and decoding on another computer. If properly trained, the second computer could reconstruct the image to very high accuracy (i.e., low loss or error).

As a result, only the image latent vector would need to be transmitted from our system to another without needing to overload the bandwidth with the full image. Note that this compression is nominally “lossy compression”, which is a well established norm for internet streaming services.
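A minimal sketch of that transmission pipeline, assuming a trained encoder/decoder pair like the ones built later in this article and a hypothetical send/receive transport layer:

# Sender: compress 28x28 images (784 floats) into 10-float latent vectors
latent_vectors = encoder.predict(images)

send(latent_vectors)      # hypothetical network transport
received = receive()      # hypothetical network transport

# Receiver: lossy reconstruction from the transmitted latent vectors
reconstructed = decoder.predict(received)

For 28x28 greyscale images and a 10-dimensional latent space, this is roughly a 78x reduction in transmitted floats per image.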

Convolutional Autoencoders

The most common autoencoder architecture for computer vision deep learning is the convolutional autoencoder (CAE). CAEs are connected stacks of convolutional kernels for the Encoder and Decoder, with linear matrices in between to transform the encoded images into the latent space. Additional Dense layers then connect the latent space to the Decoder for image reconstruction.

Convolutions are powerful tools for manipulating images because they take into account the intrinsic (i.e., physical) correlations between pixels: humans perceive these intrinsic correlations as colors and objects spanning neighboring pixels. Convolutions are specifically used because they adeptly take advantage of these physical correlations.

Note that the original autoencoder architectures were dominated by stacks of Dense layers, which almost solely involved linear matrix transformations. Dense autoencoders are bulky stacks of matrices that do not intrinsically take advantage of physical correlations inherent to images.

Furthermore, convolutional autoencoders can be much smaller, performing the same number of transformations with smaller matrices; or, in the other direction, CAEs can be much deeper, spending the same number of parameters on many more transformations, as compared to Dense autoencoders. These properties are permissible because a single convolutional filter often requires orders of magnitude fewer parameters to train than a Dense layer.

Furthermore, the same CAE can be used for any image input shape. Although it may not be wise, a CAE trained on 28x28 images can be applied to 256x256 images. This works because the convolutional operation passes each kernel over the image pixel-by-pixel. In a CAE configuration, the majority of trained parameters within the autoencoder belong to convolutional filters, also known as kernels. As such, a convolution only observes the set of pixels under its kernel, because the convolutional operation multiplies small windows and sums over them. In contrast, Dense layers compute linear matrix multiplications over the entire image at once.
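A back-of-the-envelope comparison makes this concrete (the layer sizes below are illustrative, using the parameter-counting formulas discussed later in this article):

# Dense layer mapping a flattened 28x28 image to another 28x28 image
dense_params = (28 * 28 + 1) * (28 * 28)  # 615,440 parameters

# Conv2D layer with 16 filters of 3x3 over the same greyscale image
conv_params = (3 * 3 * 1 + 1) * 16        # 160 parameters

print(dense_params, conv_params)  # 615440 160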

For extensive details on convolution deep learning methods, hyperparameters, use cases, and architecture: please see my primer on Computer Vision Deep Learning Primer with Keras and Python.

Compartmentalised ConvAutoencoder to Prepare for the Future

Depending on the use case, the architecture of the CAE could be included in a single class or a set of classes that either inherit from each other or are inherited into a final aggregated class. For both educational purposes and maintainability, we chose here to follow the latter method: a set of classes inherited into a final aggregated class.

We create a class for the Encoder and an independent class for the Decoder. In future articles that build upon this, we will introduce both a variational component to the autoencoder (VAE), as well as a constrained input to represent the user prompt. As such, this method affords us the simplicity of adding the VAE Sampling and class constraints to the Autoencoder class later. Currently, our CAE class here only inherits from the Encoder and Decoder classes.

Make It Real

Here we will begin our implementation of our convolutional autoencoder by instantiating, training, visualising, and evaluating. First, ensure that we have the appropriate libraries installed to run the code below.

The code can be run in full with the associated colab notebook.

!pip install pygtc scikit-learn tensorflow

and import the libraries into the run environment

import numpy as np
import plotly.graph_objects as go

from tensorflow.keras import datasets
from matplotlib import pyplot as plt
from pygtc import plotGTC


from sklearn.model_selection import train_test_split
from tensorflow.keras.layers import (
    Input,
    Dense,
    Conv2D,
    MaxPooling2D,
    Conv2DTranspose,
    Flatten,
    Reshape
)

from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras import backend as K
from tensorflow import image

Most of the operations involved in training a neural network are definitively probabilistic and weigh heavily on pseudo-random number generators. As such, to enable reproducibility between our results over multiple trial runs, we institute an initial random seed: 42, because we're happily nerdy.

np.random.seed(42)

The Encoder and Decoder Classes

In the Encoder and Decoder classes below, we take as input the convolutional kernel parameters, image shape, and number of latent dimensions to configure the modules of our Convolutional Autoencoder. The convolutional kernel parameters are a dict that establishes the kernel shape, the number of filters, the activation function, the stride length, and the pool size (for MaxPool2D layers).

The Encoder class maps the input image to its corresponding latent vector through a sequence of convolutional layers. The Decoder class maps the latent vector to its corresponding reconstructed image through a symmetric sequence of convolutional layers.

Attributes:

  • input_shape (tuple): The shape of the input image. Default is (28, 28, 1).
  • latent_dim (int): The dimension of the latent space. Default is 10.
  • conv_params (dict): A dictionary of convolution parameters. (See below)

The constructor for the Encoder class initializes the encoder with the given input shape and latent dimension. Furthermore, the conv_params dictionary contains 7 parameters that control the size, shape, and complexity of the convolutional layers (filters/kernels) and blocks (stacks of filters/kernels).

conv_params as key:value pairs

  • activation is the flavor of non-linear activation function used between convolutional blocks. It defaults here to relu or Rectified Linear Unit.
  • decoder_activation is the flavor of non-linear activation function used at the top of the Decoder stack (the output of the Autoencoder) to best match the reconstructed image to the input image. We default the decoder activation to 'sigmoid' because we preprocess the datasets to a range of 0 to 1, and the sigmoid enforces an output range of 0 to 1.
  • padding is the structure of the padding around the image after each convolutional layer. We chose the hyperparameter setting ‘same’ to ensure that we do not lose information on the edges of feature maps without intending to do so.
  • stride is a parameter that sets the number of pixels by which the convolutional kernels step between each transformation. Our default of 2 results in each subsequent feature map (the output of a convolutional layer) being created with a quarter of the features (half the size in both the X and Y directions); see the sketch after this list.
  • pool_size establishes how many pixels the MaxPool layers bin together. By choosing a pool_size of 1, we force the CAE to skip the MaxPool operation in favor of letting the stride dominate the feature-size reduction; both stride and MaxPool serve as similar dimensionality-reduction methods.
  • kernel_size is the hyperparameter that sets the size of each convolutional filter in each dimension. With a default of 3, we establish that each kernel will have a 3x3 shape, or 9 parameters. The parameter can take an integer for square kernels or a 2-tuple for non-square kernel shapes.
  • n_filters is a list that specifies how many filters should be trained per convolutional layer in each block. We only include logic for one layer per block, but multiple layers could be introduced per block. In our default setting, we chose an Encoder sequence of [16, 8, 4]: the first layer has 16 filters, the second 8, and the third 4. The Decoder is by default the inverse of this list: [4, 8, 16].

See my primer for more details on each hyperparameter and more.
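As a small illustrative sketch (not part of the classes below), here is how strides=2 with padding='same' shrinks the feature maps through the default Encoder, using ceil(size / stride) per layer:

import numpy as np

size = 28
for n_filters in [16, 8, 4]:
    size = int(np.ceil(size / 2))  # stride 2 halves each dimension
    print(f"{n_filters} filters -> {size}x{size} feature maps")
# 16 filters -> 14x14, 8 filters -> 7x7, 4 filters -> 4x4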

class Encoder:
    """
    An Encoder class for a Convolutional Autoencoder. This class maps
    the input image to its corresponding latent vector through a
    sequence of convolutional layers.

    Attributes:
        input_shape (tuple): The shape of the input image.
            Default is (28, 28, 1).
        latent_dim (int): The dimension of the latent space.
            Default is 10.
        conv_params (dict): A dictionary of convolution parameters.
            Default is None.
    """

    def __init__(
            self, input_shape=(28, 28, 1), latent_dim=10,
            conv_params=None):
        """
        The constructor for the Encoder class. Initializes the Encoder
        with the given input shape and latent dimension.

        Args:
            input_shape (tuple): The shape of the input image.
                Default is (28, 28, 1).
            latent_dim (int): The dimension of the latent space.
                Default is 10.
            conv_params (dict): A dictionary of convolution parameters.
                Default is None.
        """

        # Initialize convolution parameters
        self.conv_params = conv_params

        # If no conv_params provided, use default values
        if self.conv_params is None:
            self.conv_params = {
                'activation': 'relu',
                'decoder_activation': 'sigmoid',
                'padding': 'same',
                'strides': 2,
                'pool_size': 1,
                'kernel_size': 3,
                'n_filters': [16, 8, 4]
            }

        # Ensure the latent_dim is a scalar
        assert(np.ndim(latent_dim) == 0)
        # Ensure the input_shape is a vector
        assert(np.ndim(input_shape) == 1)

        # If input_shape is a 2D image, append a channel dimension
        if len(input_shape) == 2:
            input_shape = (*input_shape, 1)

        # Set the input shape and latent dimension
        self.input_shape = input_shape
        self.latent_dim = latent_dim

    def build_encoder(self):
        """
        Builds the encoder model. The model takes an image as input and
        outputs a latent vector.

        The function first applies a series of convolution layers to
        the input image. The number of filters in each layer is
        specified by the 'n_filters' key in the conv_params dictionary.
        The activation function, padding, and strides for these layers
        are also specified by the conv_params dictionary.

        The output of the last convolution layer is then flattened and
        passed through a dense layer. The number of units in this dense
        layer is equal to the dimension of the latent space.

        The output of the dense layer is the latent vector, which is
        the output of the encoder model.
        """
        # Ensure the 'n_filters' key in conv_params is a list or array
        assert(
            isinstance(
                self.conv_params['n_filters'], (tuple, list, np.ndarray)
            )
        )

        # Define the input image
        self.input_image = Input(shape=self.input_shape)
        x = self.input_image

        # Define the kernel and pool shapes
        kernel_shape = [self.conv_params['kernel_size']] * 2
        pool_shape = [self.conv_params['pool_size']] * 2

        # Encoding layers
        for nfilters in self.conv_params['n_filters']:
            # Apply a convolution layer with the given parameters
            x = Conv2D(
                filters=nfilters,
                kernel_size=kernel_shape,
                activation=self.conv_params['activation'],
                padding=self.conv_params['padding'],
                strides=self.conv_params['strides']
            )(x)

        # Store the shape of the volume before flattening
        self.volume_size = K.int_shape(x)
        # Flatten the volume
        x = Flatten()(x)
        # Apply a dense layer to map the flattened volume
        # to the latent space
        self.latent = Dense(self.latent_dim)(x)

        # Define the encoder model
        self.encoder = Model(self.input_image, self.latent)

The Decoder Class

The Decoder class below is very similar to the Encoder class above. In fact, the Decoder was generated from the Encoder by adding a map (Dense + Reshape layers) from the latent space to the expected shape of the innermost convolutional layer. After that, the for loop runs over the reversed n_filters list and uses Conv2DTranspose layers, as opposed to the Conv2D layers seen in the Encoder class above.

The inputs to the Decoder are deliberately identical to the Encoder because the input will later become a combination object that inherits from both the Encoder and Decoder.

Attributes:

  • input_shape (tuple): The shape of the input image. Default is (28, 28, 1).
  • latent_dim (int): The dimension of the latent space. Default is 10.
  • conv_params (dict): A dictionary of convolution parameters. Default is None.
  • volume_size (tuple): The shape of the innermost (2D) convolutional layer.

An additional input to the Decoder, volume_size, acts as both a bypass and an explicit input for an otherwise implicit value. The volume_size is nominally inherited from the Encoder object in the Autoencoder class below; providing it as an input parameter bypasses the need to give an Encoder to the Decoder. Exposing the shape of the innermost convolutional layer as volume_size enables future use cases that might require manipulating the Decoder without a symmetric Encoder, or without an Encoder at all.

See my primer for more details on each hyperparameter and more.
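As a brief usage sketch (valid once the class below is defined), the Decoder can be instantiated standalone by supplying volume_size explicitly; the value (None, 4, 4, 4) assumes the default [16, 8, 4] Encoder architecture derived later in this article:

decoder_only = Decoder(
    input_shape=(28, 28, 1),
    latent_dim=10,
    volume_size=(None, 4, 4, 4)
)
# Note: build_decoder() also references `input_image` for the final
# resize, which is normally provided by the Encoder inside the
# Autoencoder class below.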

class Decoder:
    """
    A Decoder class for a Convolutional Autoencoder. This class maps
    the latent vector to its corresponding reconstructed image through
    a sequence of convolutional layers. The final output layer is
    reshaped deliberately to match the shape of the input image:
    ncols x nrows x nchannels.

    Attributes:
        input_shape (tuple): The shape of the input image.
            Default is (28, 28, 1).
        latent_dim (int): The dimension of the latent space.
            Default is 10.
        conv_params (dict): A dictionary of convolution parameters.
            Default is None.
    """
    def __init__(
            self, input_shape=(28, 28, 1), latent_dim=10,
            conv_params=None, volume_size=None):
        """
        The constructor for the Decoder class. Initializes the Decoder
        with the given latent dimension and image shape.

        Args:
            input_shape (tuple): The shape of the input image.
                Default is (28, 28, 1).
            latent_dim (int): The dimension of the latent space.
                Default is 10.
            conv_params (dict): A dictionary of convolution parameters.
                Default is None.
        """

        # Initialize convolution parameters
        self.conv_params = conv_params

        # If no conv_params provided, use default values
        if self.conv_params is None:
            self.conv_params = {
                'activation': 'relu',
                'decoder_activation': 'sigmoid',
                'padding': 'same',
                'strides': 2,
                'pool_size': 1,
                'kernel_size': 3,
                'n_filters': [16, 8, 4]
            }

        """
        If the Decoder is established within the Autoencoder,
        then the volume_size is inherited from the Encoder object.
        If the Decoder is instantiated alone, then the volume_size
        must be calculated or provided as an input variable.
        """
        if not hasattr(self, 'volume_size'):
            # TODO: calculate `volume_size` from latent_dim and the
            #   kernel shapes; e.g., assert(volume_size is not None)
            # Because this operation only exists inside an Autoencoder,
            # the `volume_size` is assumed to be inherited from
            # the `Encoder`
            self.volume_size = volume_size

        # Ensure the latent_dim is a scalar
        assert(np.ndim(latent_dim) == 0)

        # Ensure the input_shape is a 1D-vector
        assert(np.ndim(input_shape) == 1)

        # If input_shape is a 2D image (greyscale),
        # append a channel dimension
        if len(input_shape) == 2:
            input_shape = (*input_shape, 1)

        # Set the input shape and latent dimension
        self.input_shape = input_shape
        self.latent_dim = latent_dim

    def build_decoder(self):
        """
        Builds the decoder model. The model takes a latent vector as
        input and outputs a reconstructed image.

        The function first maps the latent vector to a dense layer with
        the same number of units as the product of the volume size.
        The output of this dense layer is then reshaped to match the
        volume size.

        The reshaped output is then passed through a series of
        transposed convolution layers (also known as deconvolution
        layers). The number of filters in each layer is specified
        by the 'n_filters' key in the conv_params dictionary.
        The activation function, padding, and strides for these layers
        are also specified by the conv_params dictionary.

        Finally, a convolution layer is applied to the output of the
        last deconvolution layer. The number of filters in this layer
        is equal to the number of channels in the input image.
        The activation function for this layer is specified by the
        'decoder_activation' key in the conv_params dictionary.

        The output of the final convolution layer is then resized to
        match the size of the input image. The resized image is the
        output of the decoder model.
        """
        # Define the latent inputs
        latent_inputs = Input(shape=(self.latent_dim,))

        # Compute the volume of the reshaped dense layer
        volume_shape = self.volume_size[1:]
        volume = np.prod(volume_shape)

        # Map the latent inputs to a dense layer and reshape it
        x = Dense(volume)(latent_inputs)
        x = Reshape(volume_shape)(x)

        # Define the kernel and pool shapes
        kernel_shape = [self.conv_params['kernel_size']] * 2
        pool_shape = [self.conv_params['pool_size']] * 2

        # Decoding layers
        for nfilters in self.conv_params['n_filters'][::-1]:
            # Apply a transposed convolution layer with the
            # given parameters
            x = Conv2DTranspose(
                filters=nfilters,
                kernel_size=kernel_shape,
                activation=self.conv_params['activation'],
                padding=self.conv_params['padding'],
                strides=self.conv_params['strides']
            )(x)

        # The output must be the number of channels
        nchannels = self.input_shape[-1]

        # Apply a convolution layer to map the volume to the
        # number of channels
        decoded = Conv2D(
            filters=nchannels,
            kernel_size=kernel_shape,
            activation=self.conv_params['decoder_activation'],
            padding='valid'
        )(x)

        # Get the size of the input image
        img_size = K.int_shape(self.input_image)[1:]

        # Resize the decoded image to match the size of the input image
        resized_image_tensor = image.resize(
            images=decoded,
            size=list(img_size[:2]),
            method='bilinear',
            preserve_aspect_ratio=True,
            antialias=False,
            name=None,
        )

        # Define the decoder model
        self.decoder = Model(latent_inputs, resized_image_tensor)

Putting it All Together: ConvAutoencoder as a builder object

The Encoder class above transforms the input images into their latent-space representation, and the Decoder class above transforms the latent-space vector into a reconstructed image. To tie those pieces together, the ConvAutoencoder class inherits from both the Encoder and Decoder classes above. This level of modularity helps us to focus on each component quasi-independently, which benefits both the explainability and maintainability of the CAE components.

In later articles, we will build out full generative AI prompting for image generation, through to Latent Stable Diffusion (LSD), which starts as a Convolutional Variational Autoencoder. The LSD CAE is a denoising AE that manipulates the latent-space representation according to the embedded user query or prompt.

The modularity allows the LSD to stage the CAE (the class below) and adds the flexibility for later development with Variational and Constrained inputs (prompting), while maintaining explainability. As such, the inputs to an LSD ConvAutoencoder class are nearly identical to those of the Encoder and Decoder here, except that the LSD finely manipulates the autoencoder to generate new samples.

class ConvAutoencoder(Encoder, Decoder):
    """
    A ConvAutoencoder class that combines the Encoder and Decoder
    classes to form a complete Convolutional Autoencoder.

    The ConvAutoencoder class takes an image as input, encodes it into
    a latent vector using the Encoder, and then decodes the latent
    vector back into an image using the Decoder.

    Attributes:
        input_shape (tuple): The shape of the input image.
            Default is (28, 28, 1).
        latent_dim (int): The dimension of the latent space.
            Default is 10.
        autobuild (bool): Whether to automatically build the
            autoencoder upon initialization. Default is True.
        conv_params (dict): A dictionary of convolution parameters.
            Default is None.
    """
    def __init__(
            self, input_shape=(28, 28, 1), latent_dim=10,
            autobuild=True, conv_params=None):
        """
        The constructor for the ConvAutoencoder class.
        Initializes the ConvAutoencoder with the given input shape and
        latent dimension.

        Args:
            input_shape (tuple): The shape of the input image.
                Default is (28, 28, 1).
            latent_dim (int): The dimension of the latent space.
                Default is 10.
            autobuild (bool): Whether to automatically build the
                autoencoder upon initialization. Default is True.
            conv_params (dict): A dictionary of convolution parameters.
                Default is None.
        """

        # Initialize the Encoder and Decoder superclasses
        super(Encoder, self).__init__()
        super(Decoder, self).__init__()

        # Set the convolution parameters
        self.conv_params = conv_params
        if self.conv_params is None:
            self.conv_params = {
                'activation': 'relu',
                'decoder_activation': 'sigmoid',
                'padding': 'same',
                'strides': 2,
                'pool_size': 1,
                'kernel_size': 3,
                'n_filters': [16, 8, 4]
            }

        # Check the dimensions of the latent dimension and input shape
        assert(np.ndim(latent_dim) == 0)
        assert(np.ndim(input_shape) == 1)

        # If the input shape is 2D, add a channel dimension
        if len(input_shape) == 2:
            input_shape = (*input_shape, 1)

        # Set the input shape and latent dimension
        self.input_shape = input_shape
        self.latent_dim = latent_dim

        # Build the autoencoder if autobuild is True
        if autobuild:  # Default
            self.build_autoencoder()

    def build_autoencoder(self):
        """
        Builds the autoencoder model. The model takes an image as input
        and outputs a reconstructed image.

        The function first checks if the encoder and decoder models
        have been built. If not, it calls the build_encoder and
        build_decoder methods to build them.

        The autoencoder model is then constructed by passing the input
        image through the encoder and decoder models in sequence.
        """
        # Build the encoder model if it hasn't been built yet
        if not hasattr(self, 'encoder'):
            self.build_encoder()

        # Build the decoder model if it hasn't been built yet
        if not hasattr(self, 'decoder'):
            self.build_decoder()

        # Construct the autoencoder model by passing the input image
        # through the encoder and decoder
        self.autoencoder = Model(
            self.input_image,
            self.decoder(
                self.encoder(
                    self.input_image
                )
            )
        )

Make it Work

With the ConvAutoencoder architecture now settled, we want to introduce the training and inference components. The ConvAETrainer class below inherits from the previous three classes to build the ConvAutoencoder, compile it, and train it on the provided dataset. The dataset we use here is either MNIST or FashionMNIST. In addition to the train method, the class also provides encode_image, decode_image, and generate_image methods.

These methods include

1. encode_image computes the latent vector from an input image

2. decode_image computes the reconstructed image from an input image

3. generate_image computes the reconstructed image from a latent vector

The 3rd method above (generate_image) is the generative part of "Generative AI". It takes as input a latent-space vector and outputs an image. In the case of Latent Stable Diffusion, the "latent space" representation is generated by embedding the user prompt.

In our case here, we will probe the statistical distributions over the latent space to identify the most likely regions for each input data class (MNIST: digit classes). Using the modes from those statistical distributions, we will generate images of handwritten digits that coincide with the multi-dimensional latent-space modes, i.e., the latent-space (compressed) image representations.
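A sketch of that mode-seeking procedure, assuming the trained conv_autoencoder instantiated below; the per-class median latent vector is used here as a simple stand-in for each class's mode:

# Encode the test images into their latent vectors
latent_vectors = conv_autoencoder.encode_image(x_test)

# Approximate each digit's latent mode with the per-class median
digit_modes = np.array([
    np.median(latent_vectors[y_test == digit], axis=0)
    for digit in range(10)
])  # shape: (10, latent_dim)

# Generate one handwritten digit per latent mode
generated_digits = conv_autoencoder.generate_image(digit_modes)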

class ConvAETrainer(ConvAutoencoder, Encoder, Decoder):
    """
    A ConvAETrainer class that inherits from the ConvAutoencoder class.
    This class is used to train the Convolutional Autoencoder and
    generate new images from the latent space.

    Attributes:
        input_shape (tuple): The shape of the input image.
            Default is (28, 28, 1).
        latent_dim (int): The dimension of the latent space.
            Default is 10.
        training_params (dict): A dictionary of training parameters.
            Default is None.
        conv_params (dict): A dictionary of convolution parameters.
            Default is None.
    """
    def __init__(
            self, input_shape=(28, 28, 1), latent_dim=10,
            autobuild=True, training_params=None, conv_params=None):
        """
        The constructor for ConvAETrainer class. Initializes the
        ConvAETrainer with the given input shape, latent dimension,
        training parameters, and convolution parameters.

        Args:
            input_shape (tuple): The shape of the input image.
                Default is (28, 28, 1).
            latent_dim (int): The dimension of the latent space.
                Default is 10.
            training_params (dict): A dictionary of training
                parameters. Default is None.
            conv_params (dict): A dictionary of convolution parameters.
                Default is None.
        """
        # Initialize the Encoder and Decoder superclasses
        super(Encoder, self).__init__()
        super(Decoder, self).__init__()
        super(ConvAutoencoder, self).__init__()

        # Set the convolution parameters
        self.conv_params = conv_params

        if self.conv_params is None:
            self.conv_params = {
                'activation': 'relu',
                'decoder_activation': 'sigmoid',
                'padding': 'same',
                'strides': 2,
                'pool_size': 1,
                'kernel_size': 3,
                'n_filters': [16, 8, 4]
            }

        # Set the autoencoder parameters
        self.training_params = training_params
        if self.training_params is None:
            self.training_params = {
                'optimizer': 'adam',
                'loss': 'binary_crossentropy',
                'metrics': None
            }

        # Check the dimensions of the latent dimension and input shape
        assert(np.ndim(latent_dim) == 0)
        assert(np.ndim(input_shape) == 1)

        # If the input shape is 2D, add a channel dimension
        if len(input_shape) == 2:
            input_shape = (*input_shape, 1)

        # Set the input shape and latent dimension
        self.input_shape = input_shape
        self.latent_dim = latent_dim

        # Build the autoencoder if autobuild is True
        if autobuild:  # Default
            self.build_autoencoder()

    def train(
            self, x_train, x_val, epochs=50, batch_size=128,
            callbacks=None, shuffle=True):
        """
        Train the autoencoder.

        Args:
            x_train (np.array): The training data.
            x_val (np.array): The validation data.
            epochs (int): The number of epochs to train for.
                Default is 50.
            batch_size (int): The batch size for training.
                Default is 128.
            callbacks (list): A list of callbacks to apply
                during training. Default is None.
            shuffle (bool): Whether to shuffle the training data
                before each epoch. Default is True.

        Returns:
            A History object. Its History.history attribute is a
            record of training loss values and metrics values at
            successive epochs, as well as validation loss values
            and validation metrics values (if applicable).
        """
        # Build the autoencoder if it hasn't been built yet
        if not hasattr(self, 'autoencoder'):
            self.build_autoencoder()

        # Compile the autoencoder model with the specified optimizer,
        # loss function, and metrics
        self.autoencoder.compile(
            optimizer=self.training_params['optimizer'],
            loss=self.training_params['loss'],
            metrics=self.training_params['metrics']
        )

        # Train the autoencoder
        return self.autoencoder.fit(
            x_train,
            x_train,
            epochs=epochs,
            batch_size=batch_size,
            shuffle=shuffle,
            callbacks=callbacks,
            validation_data=(x_val, x_val)
        )

    def encode_image(self, image):
        """
        Get the encoded representation of the image.

        Args:
            image (np.array): The image to encode.

        Returns:
            The encoded representation of the image.
        """
        # Ensure the encoder has been built
        assert(hasattr(self, 'encoder'))

        # Encode the image
        return self.encoder.predict(image)

    def decode_image(self, images):
        """
        Get the decoded (reconstructed) image.

        Args:
            images (np.array): The images to decode.

        Returns:
            The decoded images.
        """
        # Ensure the autoencoder has been built
        assert(hasattr(self, 'autoencoder'))

        # Decode the images
        return self.autoencoder.predict(images)

    def generate_image(self, latent_vector):
        """
        Generate a new image from the latent space.

        Args:
            latent_vector (np.array): The latent vector from which
                to generate the reconstructed image.

        Returns:
            The generated image.
        """
        # Generate the image from the latent vector
        return self.decoder.predict(latent_vector)

Preprocessing the Data

The datasets are stored remotely with greyscale values from 0 to 255. Because the layers within the convolutional neural network expect input values from 0 to 1, we will load the data and then divide each pixel by 255 (the max value). Because MNIST and FashionMNIST images are stored as 2D arrays per image, we expand the input images with a third dimension of size 1, i.e., greyscale. This also allows the project to scale up to colored images (3D arrays per image).

def preprocess(features, labels):
    """
    Normalizes the supplied array and reshapes it into
    the appropriate format.
    """

    features = features.astype("float32") / 255.0

    if np.ndim(features) == 3:  # Greyscale is N samples of 2D arrays
        features = np.expand_dims(features, axis=-1)

    if np.ndim(labels) == 2:
        labels = labels.ravel()

    return features, labels

Train, Validation, Test Splitting

In addition to preprocessing, the load_and_preprocess_data function below will split the data according to the prescribed test_size. The default test_size is 50% because we want to improve our generalisability estimate by testing on a large set of unseen, untrained images. Generalisation error is the error derived from using the model "in the wild", often after deployment or at least during a later decision-making process.

There is a different use of the words test and validation in the nomenclature because the ML community iterated on them over time. The train_test_split function uses the scikit-learn terminology: test_size. In contrast, the model.fit method uses the Keras terminology: validation_data and val_loss. Furthermore, the MNIST dataset includes train and test images, where the test images are meant to remain unseen by the model during the training process. This coincides with the Keras terminology. Our usage here is as follows:

  • train images are used to update the model weights during training.
  • validation images are used to monitor a proxy for generalisation error during the training process.
  • test images are used to evaluate the model results as a more direct proxy for generalisation error.

Because MNIST comes with 10k dedicated test images, we will use train_test_split to divide the 60k training images into 50% train images and 50% validation images. Validation images are used during the training process to compute the val_loss, which allows the callbacks, and developers, to be more confident that the training process has not progressed past the minimum generalisation error per model.

Generalisation error is best estimated after training as the test_loss, using the test images never provided to model.fit. Most developers use the val_loss as a running proxy for the test_loss. If the final val_loss does not match the test_loss within an expected variance of a few percent, then the model will likely not sustain the expected generalisation error.

In later articles, we will probe hyperparameter optimisation by training hundreds of CAEs and other ML models. For each model, we must compute the test_loss to select the best model, as well as to weight each model's contribution towards a weighted ensemble. When performing hyperparameter optimisation, the test_loss should then no longer be used as a robust proxy for the generalisation error. Depending on the size of the provided dataset, we could split it into four components: train, validation, test, and generalisation. If the dataset is not large enough for four splits, then new data should be acquired and monitored, often during deployment and later inference.
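As a quick sketch of that sanity check, assuming the trained conv_autoencoder and history produced later in this article:

# Estimate the generalisation error on the held-out test images
test_loss = conv_autoencoder.autoencoder.evaluate(x_test, x_test)
final_val_loss = history.history['val_loss'][-1]

print(f"val_loss: {final_val_loss:.4f} | test_loss: {test_loss:.4f}")
# A mismatch beyond a few percent suggests the val_loss was not a
# reliable proxy for the generalisation error.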

def load_and_preprocess_data(dataset='mnist', test_size=0.5):
    known_datasets = ['mnist', 'fashion_mnist', 'cifar10', 'cifar100']

    if dataset == 'mnist':  # 2D arrays of greyscale images
        mnist = datasets.mnist.load_data()
        (x_train, y_train), (x_test, y_test) = mnist
    if dataset == 'fashion_mnist':  # 2D arrays of greyscale images
        fashion_mnist = datasets.fashion_mnist.load_data()
        (x_train, y_train), (x_test, y_test) = fashion_mnist
    if dataset == 'cifar10':  # 3D arrays of color images
        cifar10 = datasets.cifar10.load_data()
        (x_train, y_train), (x_test, y_test) = cifar10
    if dataset == 'cifar100':  # 3D arrays of color images
        cifar100 = datasets.cifar100.load_data()
        (x_train, y_train), (x_test, y_test) = cifar100

    x_train, y_train = preprocess(x_train, y_train)
    x_test, y_test = preprocess(x_test, y_test)

    x_train, x_val, y_train, y_val = train_test_split(
        x_train,
        y_train,
        test_size=test_size  # 50% of train set for validation
    )

    return x_train, x_val, x_test, y_train, y_val, y_test

Load Data, Make Model, Fit Model to Data

Load Data

Here we assign the dataset and test_size parameters that dominate our use case for our Convolutional Autoencoder, then load and preprocess the data as specified above.

dataset = 'mnist'
test_size = 0.5

x_train, x_val, x_test, y_train, y_val, y_test = load_and_preprocess_data(
    dataset=dataset,
    test_size=test_size
)

Make the Model

With the data in hand, we can now define the autoencoder architecture, instantiate the model, and load the data into the ConvAETrainer instance. We chose latent_dim to have 10 dimensions because it is convenient when we later associate the mode of each latent distribution with a specific class or digit, from which to generate new handwritten digits. It is unnecessary, but convenient for pedagogical and explainable development.

Loss Function Selection

For our loss function, we chose binary_crossentropy because our greyscale images span a range from 0 to 1, and we expect the reconstructed values to lie between 0 and 1. Each loss function has its own behaviour, assumptions, expectations, use cases, and value. The input data, output activation function, and loss function should be chosen as a triplet. We suggest exploring the loss function, with respect to the dataset and preprocessing, as a "hyper-hyperparameter".

For clarity on the deep learning background context with regards to loss functions and datasets, please see my primer on Computer Vision Deep Learning Primer with Keras and Python.

latent_dim = 10

# https://link.springer.com/chapter/10.1007/978-3-031-11349-9_30
# loss = 'mse' # good for small images and large datasets
# loss = 'mae' # good for small images and large datasets
loss = 'binary_crossentropy' # good for large images and small datasets

training_params = {
    # 'optimizer': 'adadelta',
    'optimizer': 'adam',
    'loss': loss,
    'metrics': None
}

conv_params = {
    'activation': 'relu',
    # 'sigmoid' matches the 0-to-1 preprocessing and the
    # binary_crossentropy loss chosen above
    'decoder_activation': 'sigmoid',
    'padding': 'same',
    'strides': 2,
    'pool_size': 1,
    'kernel_size': 3,
    'n_filters': [16, 8, 4]
}

input_shape = x_train.shape[1:]  # skip the number of samples

conv_autoencoder = ConvAETrainer(
    input_shape=input_shape,
    latent_dim=latent_dim,
    training_params=training_params,
    conv_params=conv_params
)

print(conv_autoencoder.autoencoder.summary())
print(conv_autoencoder.encoder.summary())
print(conv_autoencoder.decoder.summary())

Calculating the Size of the Model

In a Convolutional Autoencoder, it is crucial to understand the number of parameters in the network to appreciate the complexity and capacity of the model. To count the number of parameters in a Convolutional Autoencoder, we need to consider the encoder, decoder, and latent space components. The primary components contributing to the parameter count are convolutional layers (Enc + Dec) and fully connected layers.

  • Encoder Convolutional Layers: For a Conv2D layer, the number of parameters is determined by the size of the filters (or kernels), the number of input channels, and the number of output channels (n_filters):
    n_params = (kernel_size × kernel_size × n_filters_in + 1) × n_filters_out
  • Decoder Convolutional Layers: The parameter calculation for Conv2DTranspose layers is similar to that of Conv2D layers; the formula remains the same as above.
  • Latent Space: For the fully connected (Dense) layers used for dimensionality reduction or reshaping, the number of parameters is:
    n_params = (n_input_units + 1) × n_output_units

In each n_params calculation above, there is a “+1” term, which represents the bias term for each layer or filter. The bias term allows the network to rebalance the average amplitude of each of the layers or filters to avoid exploding or vanishing gradients. For convolutional layers, a more common technique not shown here is to introduce a BatchNormalization layer after each convolutional filter.
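To make the arithmetic concrete, the sketch below applies the two formulas to the default conv_params and reproduces the counts reported by the model summaries further below:

def conv2d_params(kernel_size, n_in, n_out):
    # (k * k * channels_in + 1) * channels_out; the "+1" is the bias
    return (kernel_size * kernel_size * n_in + 1) * n_out

def dense_params(n_in, n_out):
    # (inputs + 1) * outputs; the "+1" is the bias
    return (n_in + 1) * n_out

# Encoder: 1 input channel -> [16, 8, 4] filters -> Dense to 10
encoder_total = (
    conv2d_params(3, 1, 16)     # 160
    + conv2d_params(3, 16, 8)   # 1160
    + conv2d_params(3, 8, 4)    # 292
    + dense_params(64, 10)      # 650
)

# Decoder: Dense from 10 -> 64 -> [4, 8, 16] filters -> 1 channel
decoder_total = (
    dense_params(10, 64)        # 704
    + conv2d_params(3, 4, 4)    # 148
    + conv2d_params(3, 4, 8)    # 296
    + conv2d_params(3, 8, 16)   # 1168
    + conv2d_params(3, 16, 1)   # 145
)

print(encoder_total, decoder_total)  # 2262 2461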

Examining the Size of Our Autoencoder

In the default configuration above, each filter defaults to a 3x3 kernel. With Encoder layers of [16, 8, 4] filters, the convolutional blocks of this default Encoder include 1612 trainable parameters, plus 650 in the latent-space Dense transformation, for the 2262 total shown below. The Decoder has a slightly different number of parameters because the reversed filter ordering changes the input channels per layer (and the Decoder appends a final Conv2D), which results in 2461 Decoder parameters to train.

In addition to parameters, it is important to maintain a functional understanding of the size and number of feature maps (the outputs per filter/kernel per layer). Feature maps are the image-like 2D arrays generated per filter transformation; in computer vision, these would be the transformed images, like edge detections or color gradients. With 16 filters/kernels in the first block, a 28x28 image would output 16 feature maps, each with a quarter of the input pixels, because stride is set to 2. This means that the first convolutional layer produces 3136 feature pixels: 28 × 28 × 16 / 4.

After 3 convolutional blocks, with 16, 8, and 4 filters sequentially, the Encoder outputs four 4x4 feature maps, which is 64 feature pixels. The latent vector results from a linear matrix transformation using a Dense layer to convert these 64 feature pixels into the 10 latent-space dimensions. This introduces a (64 + 1) × 10 linear matrix, which therefore includes 650 parameters for the neural network to train.

The convolutional layers in the Encoder and Decoder account for 3369 of the network's 4723 trainable parameters, while the two Dense layers (into and out of the latent space) require 1354 parameters: roughly 29% of the total network in just two layers. This shows that Dense layers can dramatically outnumber the parameters required for a comparable set of Conv2D layers.

Scaling Up to a Real World Application with Latent Stable Diffusion

Each convolutional layer produces as many feature maps as it has filters, and with strides=2 each feature map is a quarter the size of the layer's input. For reference, the zeroth "feature map" is the input image itself, and the first set of feature maps is the transformed images after the first layer of filters. This amounts to 3136 + 392 + 64, or 3592, feature pixels from the Encoder feature maps; by approximate symmetry, another ~3592 feature pixels come from the Decoder feature maps. All ~7200 feature pixels (i.e., elements of the feature maps), as well as all 4723 convolutional autoencoder parameters, must be stored on the GPU per image in the batch. With a batch size of 128, we need roughly 1.5 million floating point numbers stored on the GPU.

These 1.5M floats are not significant for modern GPUs; moreover, our experiment only includes a handful of convolutional layers and small images. These values expand linearly with the number of input pixels (image height × image width) and with the batch size. For an HD image (1920x1080 pixels), the float count for even our simple convolutional autoencoder would scale into the billions. If we then expand our scope to much larger autoencoders, the original Latent Stable Diffusion paper quotes 1.5 billion parameters, which requires a GPU that can process several billion float values, i.e., one that can efficiently process tens of GB/s. These are still trifles compared to modern LSDs and LLMs, which are built on similar technology.
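The sketch below mirrors this simplified accounting; the symmetric-Decoder assumption and the per-image storage of parameters are simplifications carried over from the text above:

feature_pixels = 3592 * 2   # Encoder + (assumed symmetric) Decoder maps
n_params = 4723             # trainable parameters
batch_size = 128

floats_mnist = (feature_pixels + n_params) * batch_size
print(f"{floats_mnist:,}")  # ~1.5 million floats for 28x28 images

# Feature pixels scale linearly with image area (parameters do not)
hd_scale = (1920 * 1080) / (28 * 28)  # ~2645x more pixels
floats_hd = (feature_pixels * hd_scale + n_params) * batch_size
print(f"{floats_hd:,.0f}")  # billions of floats for 1920x1080 images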

Our Encoder Size

Model: "Encoder"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) [(None, 28, 28, 1)] 0

conv2d (Conv2D) (None, 14, 14, 16) 160

conv2d_1 (Conv2D) (None, 7, 7, 8) 1160

conv2d_2 (Conv2D) (None, 4, 4, 4) 292

flatten (Flatten) (None, 64) 0

dense (Dense) (None, 10) 650

=================================================================
Total params: 2262 (8.84 KB)
Trainable params: 2262 (8.84 KB)
Non-trainable params: 0 (0.00 Byte)
________________________________________________________________

Our Decoder Size

Model: "Decoder"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_2 (InputLayer) [(None, 10)] 0

dense_1 (Dense) (None, 64) 704

reshape (Reshape) (None, 4, 4, 4) 0

conv2d_transpose (Conv2DTr (None, 8, 8, 4) 148
anspose)

conv2d_transpose_1 (Conv2D (None, 16, 16, 8) 296
Transpose)

conv2d_transpose_2 (Conv2D (None, 32, 32, 16) 1168
Transpose)

conv2d_3 (Conv2D) (None, 30, 30, 1) 145

tf.image.resize (TFOpLambd (None, 28, 28, 1) 0
a)

=================================================================
Total params: 2461 (9.61 KB)
Trainable params: 2461 (9.61 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________

Our Autoencoder Size

Model: "Autoencoder"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) [(None, 28, 28, 1)] 0

model (Functional) (None, 10) 2262

model_1 (Functional) (None, 28, 28, 1) 2461

=================================================================
Total params: 4723 (18.45 KB)
Trainable params: 4723 (18.45 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________

Try It and See What Happens

As far as we can tell, we have everything set up. Now, we shall set up the training procedure, configure the callbacks, and initiate the training process.

Training Setup

We will train in batches of 128 MNIST images for a minimum of 25 epochs and a maximum of 200 epochs (see the callbacks below). Furthermore, we will set shuffle=True because it is standard to shuffle the data to avoid local minima caused by training on the same sequence of the same 30k images.

Callbacks: Early Stopping

There are many useful callbacks for monitoring the training, improving the generalisability, reducing the wall-time to completion, etc. At a minimum, we should always include an EarlyStopping callback. Because we set the maximum number of epochs arbitrarily, we could set it to 1 epoch and produce a low-quality model; we could also set it to 1 billion epochs, such that the training would never end in the lifetime of the human species on Earth.

More than just wall-clock time, it has been shown that generalisation error increases with too many epochs (past the train-validation sweet spot), because the model continues to improve the training loss without regard for the validation loss, which we use as a running proxy for the generalisation error. We will see below that the validation error may start to increase after many iterations, while the training error continues to decrease. The lowest validation error occurs within that sweet spot.

As such, the EarlyStopping callback monitors the validation loss and halts the training process when the val_loss (or a monitor of our choosing) fails to improve over successive epochs. The EarlyStopping callback has three parameters that we set differently from their defaults.

  • patience (=20) means that if the val_loss does not improve from the current best after 20 epochs, then the EarlyStopping callback will deactivate the training and return the history
  • monitor (=‘val_loss’) means that EarlyStopping will track the val_loss metric. We could ask it to track the RMSE, the training loss, the F1 score, etc. We could also provide it with an external function to call that takes as input the features and labels, then returns a single value.
  • start_from_epoch (=25) lets the EarlyStopping callback take a nap until the 26th epoch. There is often a considerable fluctuation in the val_loss over the first 10+ epochs. We want to avoid the EarlyStopping callback from deactivating the training before it has begun. We must, therefore, assume that the generalisation error (proxy by the validation loss) is not minimised in the first 25 epochs.
epochs = 200
batch_size = 128
shuffle = True
patience = 20
start_from_epoch = 25

callbacks = [
    EarlyStopping(
        patience=patience,
        monitor="val_loss",
        min_delta=0,
        verbose=0,
        mode="auto",
        baseline=None,
        restore_best_weights=True,
        start_from_epoch=start_from_epoch,
    ),
]

We embedded the training procedure within ConvAETrainer and instantiated it as the conv_autoencoder instance. We thus start training by calling the conv_autoencoder.train method, which is a wrapper for the Keras autoencoder.fit. This method takes the training data (x_train), the validation data (x_val), the list of callbacks, and the training parameters that we discussed above. The output of conv_autoencoder.train is a Keras History object that we store as history. The training procedure also outputs a stream of meaningful text, including the loss and val_loss.

For more details about understanding the Keras training output, please see my primer on Computer Vision Deep Learning Primer with Keras and Python.

history = conv_autoencoder.train(
    x_train=x_train,
    x_val=x_val,
    epochs=epochs,
    batch_size=batch_size,
    callbacks=callbacks,
    shuffle=shuffle
)


"""
Epoch 1/200
235/235 [==============================] - 22s 30ms/step -
loss: 0.3454 - val_loss: 0.2447
Epoch 2/200
235/235 [==============================] - 6s 24ms/step -
loss: 0.2230 - val_loss: 0.2112
Epoch 3/200
235/235 [==============================] - 5s 22ms/step -
loss: 0.2075 - val_loss: 0.2021

...

Epoch 197/200
235/235 [==============================] - 3s 12ms/step -
loss: 0.1432 - val_loss: 0.1432
Epoch 198/200
235/235 [==============================] - 2s 10ms/step -
loss: 0.1415 - val_loss: 0.1411
Epoch 199/200
235/235 [==============================] - 3s 11ms/step -
loss: 0.1410 - val_loss: 0.1405
Epoch 200/200
235/235 [==============================] - 3s 12ms/step -
loss: 0.1783 - val_loss: 0.1629
"""

Now that the training has completed, let's understand how to evaluate the results, which are predominantly stored in the history.history dict. Because the effectiveness of training (loss and val_loss) is correlated with how many epochs the algorithm iterated, we will make an epochs array and plot the loss values against it.

In our trial run, the validation loss did not trigger the EarlyStopping callback. As a result, we have a full 200 epochs to evaluate. This could imply that we should either update the learning rate, increase the number of epochs, or change the loss function from binary_crossentropy to mse, mae, or a number of other options.

The global result of the visualisation is that both the validation and training loss decay over time, which is a positive sign of training effectiveness. It implies that the model is able to determine meaningful correlations between the input and output data. Of course, because we are working with an autoencoder, the input and output data are the same.

Putting this together, we can understand that effective training implies that the latent space of the autoencoder is able to meaningfully capture enough information from the encoded input images to effectively generate reconstructed images. This means that the latent space could be effective at generating new images with the Decoder independently. It also means that the Encoder can be considered effective enough at compressing these images without losing significant information.

Visual Evaluation of Loss and Error

When first experimenting with deep learning and loss curves, most new ML developers wonder how to interpret the features of the loss curve: loss over epochs. Aside from the global, start-to-finish decay in the loss, there are also quasi-periodic spikes throughout the epochs. In our visualisation below, these spikes occur every 20–40 epochs in both the training (blue) and validation (red) loss.

Train + Validation Loss Evaluations over epochs

The cause of these spikes is not perfectly knowable, but the most common explanation is variability in the mini-batch stochastic gradient descent (SGD): when the mini-batch SGD grabs a set of images that are significantly different from any of the images that the model has understood well thus far (i.e., more varied), the loss spikes.

As a made-up example: if the mini-batches were unlucky, such that the model had not yet learned any images of handwritten zeros, then stochastically being asked to train on a mini-batch consisting only of zeros would make reconstruction very difficult for the autoencoder. There are other explanations for the spikes, but this is the version that best relates to our example, as well as to the architecture of our autoencoder.

As a result, another mark of quality for the training process is whether the amplitude of these spikes decreases as the number of training epochs increases. In our example visualisation above, the largest spike (other than epoch 0) occurs at epoch 90. If we continue with the made-up example about misunderstood handwritten zeros, then we could say that the autoencoder “had not yet learned” how to reconstruct a zero.

The stochastic nature of the process and the real-world variability of the data set combine in such a way that each successive epoch is not pre-determined to produce a better model. The good news is that our visualisation shows the spikes became less significant with increased epochs, converging toward the loss curve.

import numpy as np
import plotly.graph_objects as go

# Plot the log10 of the training and validation loss over epochs
epochs = np.arange(len(history.history['loss']))

plots = [
    go.Scatter(
        x=epochs,
        y=np.log10(history.history['loss']),
        name='Loss'
    ),
    go.Scatter(
        x=epochs,
        y=np.log10(history.history['val_loss']),
        name='Validation Loss'
    )
]

fig = go.Figure(plots)
fig.show()
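To go beyond eyeballing the spike decay, a quick numpy sketch (not part of the original notebook) can measure how far each epoch’s loss deviates above a moving average of the loss curve:

window = 10  # smoothing window in epochs
loss_arr = np.array(history.history['loss'])

# Centred moving average via convolution with a flat kernel
smoothed = np.convolve(loss_arr, np.ones(window) / window, mode='same')

# Spike amplitude: positive deviation above the smoothed curve
spikes = np.clip(loss_arr - smoothed, 0, None)

fig = go.Figure(go.Scatter(x=epochs, y=spikes, name='Spike Amplitude'))
fig.show()

If training is stabilising, this spike-amplitude trace should trend toward zero at later epochs.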

Visual Evaluation of Test Images

Now that we trust our autoencoder understands our data set well enough to test, we shall visually evaluate the reconstructed images. Our class object ConvAETrainer includes a wrapper method for ConvAutoencoder.autoencoder.predict, which first confirms that self.autoencoder exists. This process takes as input the images held out for testing (x_test), encodes them into the latent space, and immediately decodes them into reconstructed images. This is an important evaluation because the autoencoder was not trained on (“has not seen”) the test images before.

# Equivalently: conv_autoencoder.autoencoder.predict(x_test)
xtest_decoded = conv_autoencoder.decode_image(x_test)
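For reference, a minimal sketch of what that wrapper might look like inside ConvAETrainer (an assumed structure for illustration, not the verbatim class code):

# Hypothetical sketch of the wrapper method inside ConvAETrainer
def decode_image(self, images):
    """Reconstruct `images` by passing them through the full autoencoder."""
    # Guard: the underlying Keras model must exist before predicting
    assert hasattr(self, 'autoencoder'), 'Build or load the autoencoder first'
    return self.autoencoder.predict(images)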

To visually evaluate the quality of our autoencoder from input-to-reconstruction, we randomly select 10 input test images and their related reconstruction images. The third column in our visualisation subtracts the input image from the reconstructed image to visually evaluate the reconstruction error.

(left) Input, (left middle) Decoded, (right middle) Error, and (right) Absolute Error [Generated by the Author]

The Good News

Although we see a ghost in the error image, we should first be content that the reconstructed image (left middle column) is not Gaussian or otherwise arbitrary noise. Similarly, it is not a pixel-for-pixel copy of the input image. If either of those failure modes were present, we could interpret that the autoencoder failed to train. In turn, we should be sufficiently content that the autoencoder was able to reasonably reconstruct the majority of the input images.

Not The Good News

In contrast, we can of course see ghosts of the input image in each error image (right middle column). Specifically, we can see that the outline or edges of the input image persist in the error images. This informs us that the autoencoder is not able to properly reconstruct the edges of major features. In some of the reconstructed images, the human-inferred digit is completely incorrect. That informs us that the encoder, latent space, or decoder is unable to properly distinguish between specific digits, such as zeros and eights, or slanted fours and nines.

The Path Forward

We have several options to improve these results. These will not be explored here, but they should be within the scope of a test bed of autoencoders with which we could experiment.

Options:

  • Continue training for more epochs, because the autoencoder may simply “get better” with more time.
  • Change the loss function from binary_crossentropy to another option, because the SGD algorithm may be able to train the autoencoder more efficiently to capture more image features (or the edges of image features) by minimising a different loss function.
  • Include more convolutional layers in the Encoder/Decoder. This could help the Encoder to more accurately compress the image data onto the same latent space, such that the symmetric Decoder is able to more precisely reconstruct the images.
  • Add skip connections that tie the output of individual Encoder layers to the symmetric Decoder layers. This will be discussed in a later article about the U-Net architecture. U-Net often improves the accuracy of the edges in reconstructed images.
  • Increase or decrease the size of the latent space (latent_dim). The latent space must have enough scale to properly capture the real world variability of the image space in order for the Decoder to more precisely reconstruct the images. The latent space should also not be larger than is required to properly generate new images with precision, otherwise the correlations between latent dimensions may add sources of structured noise.
  • De-symmetrise the Decoder and Encoder. Assuming that the Encoder and Latent space are able to fully capture the variability within the input image domain, it is still possible that the Decoder is unable to properly reconstruct the images with enough precision to avoid missing the edges or confusing digits. A Decoder with an independent architecture may provide the flexibility required to more precisely reconstruct the images.

There are more options. Given the range and complexity of these choices, we will explore hyperparameter optimisation in a future article to probe many hyperparameter choices at once. The hope is that we will optimise over enough hyperparameters that we have not missed one, or a range thereof, that would produce the best model for our use case.
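As a teaser, a hypothetical sketch of a small manual sweep might look like the following; the ConvAETrainer constructor arguments shown here are assumptions for illustration only:

results = {}
for latent_dim_ in [2, 10, 32]:
    for loss_ in ['binary_crossentropy', 'mse']:
        # Assumed constructor signature, for illustration only
        trainer = ConvAETrainer(latent_dim=latent_dim_, loss=loss_)
        hist = trainer.train(
            x_train=x_train,
            x_val=x_val,
            epochs=epochs,
            batch_size=batch_size,
            callbacks=callbacks,
            shuffle=shuffle
        )
        results[(latent_dim_, loss_)] = min(hist.history['val_loss'])

# Select the configuration with the lowest validation loss
best_config = min(results, key=results.get)
print(f'Best (latent_dim, loss): {best_config}')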

from matplotlib import pyplot as plt

n_comparisons = 10
fig, axs = plt.subplots(nrows=n_comparisons, ncols=4, figsize=(10, 40))

for axs_ in axs:
    # Randomly select a test image, then plot the input, reconstruction,
    # error, and absolute error side by side
    idx = np.random.randint(0, x_test.shape[0])
    axs_[0].imshow(x_test[idx], interpolation='None')
    axs_[1].imshow(xtest_decoded[idx], interpolation='None')
    axs_[2].imshow(
        xtest_decoded[idx] - x_test[idx],
        interpolation='None'
    )
    axs_[3].imshow(
        abs(xtest_decoded[idx] - x_test[idx]),
        interpolation='None'
    )

for axs_ in axs:
    for ax_ in axs_:
        ax_.set_axis_off()

Diagnose Latent Space: Visual Evaluation

To understand how well the latent space captures the variability in the scope of the input image space — how dynamic and variable the handwritten digits are — we will encode the full set of test images and visually evaluate the distribution over the latent space. This is a valuable exercise for understanding the encoded images. It may also be misguided, because the question often becomes “how well distributed are the latent vectors?” without first asking the nearly impossible question, “how should the latent vectors be distributed?”

In the direction of answering the latter question, we often prefer our latent vectors to be Gaussian distributed. In a later article, we will discuss Variational Autoencoders which deliberately constrain the latent space to be closer to Gaussian distributed. This is done by bisecting the output of the Encoder into a mean and variance latent vector — i.e., a two-part latent space.
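As a preview, a minimal sketch of that two-part latent head, following the canonical Keras VAE pattern rather than this article’s ConvAutoencoder:

import tensorflow as tf
from tensorflow.keras import layers

class Sampling(layers.Layer):
    """Reparameterisation trick: z = z_mean + sigma * epsilon, epsilon ~ N(0, I)."""
    def call(self, inputs):
        z_mean, z_log_var = inputs
        epsilon = tf.random.normal(shape=tf.shape(z_mean))
        return z_mean + tf.exp(0.5 * z_log_var) * epsilon

# Assumed usage, given a flattened encoder feature tensor `features`:
# z_mean = layers.Dense(latent_dim, name='z_mean')(features)
# z_log_var = layers.Dense(latent_dim, name='z_log_var')(features)
# z = Sampling()([z_mean, z_log_var])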

# Encode the held-out test images into the latent space
xtest_encoded = conv_autoencoder.encode_image(x_test)

An aesthetic visualisation for any multidimensional distribution is a stair diagram of 2D histograms, with the associated 1D histograms along the diagonal. In our case, we will use KDEs instead of histograms, via the pyGTC library and its pygtc.plotGTC function.

Because we have 10 latent dimensions, there are 10 rows and 10 columns to the stair plot over our latent space. Each 2D histogram is an evaluation over the cross-correlation between each pairing of two latent dimensions. Each 1D histogram is an evaluation over the marginalised distribution of each latent dimension.

We have further split the visualisation into each of the 10 handwritten digits to produce overlapping cross-correlation and marginalised distributions of the latent vectors — with one color per digit (see Legend).

from pygtc import plotGTC

plotGTC(
    [xtest_encoded[y_test == label_] for label_ in range(10)],
    chainLabels=[f'Label: {label_}' for label_ in range(10)]
)
Stair plot for 2D Correlation and Marginalised 1D Distributions over the Latent Space

Visualising the cross-correlations alongside the marginalised distributions per latent vector is a very important evaluation of the validity of our interpretation of the latent vectors — especially if we choose to assume that the latent space is Gaussian distributed. Unfortunately, with more dimensions, it becomes less human-tractable to understand the distributions of the latent space.
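As a numerical companion to the visual check, a quick sketch (not in the original notebook) can test the Gaussian assumption per latent dimension with scipy’s D’Agostino–Pearson normality test:

from scipy import stats

# Test each latent dimension of the encoded test set for normality
for dim in range(latent_dim):
    stat, p_value = stats.normaltest(xtest_encoded[:, dim])
    verdict = 'consistent with Gaussian' if p_value > 0.05 else 'non-Gaussian'
    print(f'Latent dim {dim}: p = {p_value:.3g} ({verdict})')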

Kernel Density Estimation over the Latent Space

Here we will instead plot only the 1D Kernel Density Estimates per handwritten digit label, one digit per color, to understand how the latent space has been trained or constructed by our architecture and our chosen algorithms. If we examine the documentation for the Kernel Density Estimator, we should first understand that smoothness is built into the process: the KDE algorithm places a Gaussian kernel on each data point, which often produces a Gaussian or multimodal-Gaussian density estimate.

What we are able to infer from the visualisations is that the latent space has a definite range, with a small number of modes — sometimes only one — per digit label (i.e., per color) per latent dimension. The purpose of this evaluation is to establish that we may be able to isolate specific regions of the latent space per digit in order to construct a specific digit of our choosing.

from sklearn.neighbors import KernelDensity

nrows = latent_dim // 2
ncols = latent_dim // 5

fig, axs = plt.subplots(nrows=nrows, ncols=ncols, figsize=(20, 20))

X_plot = np.linspace(-5, 5, 1000)[:, np.newaxis]
kde = KernelDensity(kernel="gaussian", bandwidth=0.2)

for i, axs_ in enumerate(axs):
    for j, ax_ in enumerate(axs_):
        for label_ in range(10):
            # Only label the curves in the first subplot to build one legend
            label = label_ if i + j == 0 else None

            # This digit's encoded values in latent dimension i*ncols + j
            x_ = xtest_encoded[y_test == label_, i*ncols + j]
            kde = kde.fit(np.expand_dims(x_, axis=-1))
            log_dens = kde.score_samples(X_plot)

            ax_.fill(X_plot[:, 0], np.exp(log_dens), alpha=0.4, label=label)
            ax_.set_ylim(0.0, 1.2)

            if label is not None:
                ax_.legend()
Kernel Density Estimates Over the Latent Space per Digit Label

Unconstrained Probe of Latent Space

To express the scope of the unconstrained latent space, we will randomly sample it in all 10 dimensions, then visualise the reconstructed images. Although some features exist in the unconstrained reconstructions that humans can over-interpret as handwriting, without a constraint to manipulate the latent space into specific and controllable modes, the visualisation below is nearly random noise.

The encoder does implicitly constrain the latent space to capture the digits as shapes. Unfortunately, that implicit constraint is not complete enough to predict the outcome of a random sampling over the latent space itself.

100 unconstrained samples over the latent space, reconstructed through the decoder [Image by Author]
n_samples = 10

# Sample 100 random latent vectors and decode them into images
x_samples_rand = conv_autoencoder.generate_image(
    np.random.normal(0, 4, size=(n_samples*n_samples, latent_dim))
)

fig, axs = plt.subplots(nrows=n_samples, ncols=n_samples, figsize=(20, 20))
for i, axs_ in enumerate(axs):
    for j, ax_ in enumerate(axs_):
        idx = i*n_samples + j
        x_sample_ = x_samples_rand[idx]
        ax_.imshow(x_sample_, interpolation='None')
        ax_.axis('off')

Constrained Probe of Latent Space

In contrast, we did establish that the latent space per digit includes a few modes with wide distributions. As a post-processing method, we can sample the latent space near those modes and produce reconstructed images that are closer to the expected labels, albeit with a wide variety of handwriting styles, edge strengths, and perceived digit values.

At this point, it is important to note that human psychology may be overfitting the results to interpret that the latent space is aware of human handwriting. Statistically, the model does not know what a human digit is; it has simply been trained to output pixel values close to what we expect for handwritten digits by capturing the distribution over the input image space. The human tendency to perceive patterns takes over by identifying the reconstructions as numbers.

In the analogy of monkeys reproducing poetry by typing randomly on keyboards for long enough, we are constraining the latent vectors to “only type on the keys that humans identify as poetry (numbers)”. The visualisation below has been post-processed to emulate a constrained latent space over the span of human digits provided in the training phase.

100 constrained samples over the latent space, reconstructed through the decoder [Image by Author]
# Per-digit centre and spread of the encoded test set in latent space
median_per_label = {
    label_: np.median(xtest_encoded[y_test == label_], axis=0)
    for label_ in range(10)
}

std_per_label = {
    label_: np.std(xtest_encoded[y_test == label_], axis=0)
    for label_ in range(10)
}

# Sample near each digit's latent mode, then decode into images;
# size=(n_samples, latent_dim) lets the per-dimension medians and stds
# broadcast correctly across the batch of samples
x_samples_constrained = [
    conv_autoencoder.generate_image(
        np.random.normal(
            median_per_label[label_],
            std_per_label[label_],
            size=(n_samples, latent_dim)
        )
    )
    for label_ in range(10)  # one batch of samples per digit label
]

fig, axs = plt.subplots(nrows=n_samples, ncols=n_samples, figsize=(20, 20))
for i, axs_ in enumerate(axs):
    for j, ax_ in enumerate(axs_):
        # Row i: digit label i; column j: the j-th sample for that digit
        x_sample_ = x_samples_constrained[i][j]
        ax_.imshow(x_sample_, interpolation='None')
        ax_.axis('off')

Overconstrained Probe of Latent Space

If we further constrain the input to the latent space by sampling narrowly around the latent-space modes of each digit, then our reconstructed images resemble the handwritten digits provided to the autoencoder even more closely.

This process is inspiring, although still imperfect. Among the hundreds of digits that we reconstruct, there are incorrect ones that humans recognise as shapes that do not represent the expected labels: 0–9.

100 overconstrained samples over the latent space, reconstructed through the decoder [Image by Author]
# Reuse median_per_label and std_per_label from the constrained probe above,
# but sample with half the per-digit standard deviation
x_samples_overconstrained = [
    conv_autoencoder.generate_image(
        np.random.normal(
            median_per_label[label_],
            std_per_label[label_] / 2,  # Overconstrain: half the width
            size=(n_samples, latent_dim)
        )
    )
    for label_ in range(10)  # one batch of samples per digit label
]

fig, axs = plt.subplots(nrows=n_samples, ncols=n_samples, figsize=(20, 20))
for i, axs_ in enumerate(axs):
    for j, ax_ in enumerate(axs_):
        # Row i: digit label i; column j: the j-th sample for that digit
        x_sample_ = x_samples_overconstrained[i][j]
        ax_.imshow(x_sample_, interpolation='None')
        ax_.axis('off')

This code can be run in full with the associated Colab notebook.

Onward to Latent Stable Diffusion

Latent Stable Diffusion, discussed above, introduced carefully constrained variational autoencoders to manipulate the latent space into reconstructing specific images based on the text-encoded input (query or prompt). The end-goal of this series of articles is to illuminate the underlying principles of Latent Stable Diffusion by first evaluating its predecessors and components.

Conclusion

We created a basic convolutional autoencoder using independent classes with double and triple inheritance. The classes could have been combined into a single class object, but the learning and (future) testing experience of compartmentalising the autoencoder provided a detailed understanding of each piece of the puzzle.

  • The Encoder transforms the input image, through matrix manipulations and convolutional transformations, into the latent vector.
  • The latent vector is the encoded representation of the input images. Ideally, it contains sufficient information to reconstruct the image.
  • The Decoder transforms the latent vector into the reconstructed image using symmetric matrix manipulations and convolutional transformations, upscaling the encoded features into an image representation of the latent vector.

We then visually evaluated the latent space and the reconstructed images to understand how the distributions of the latent space store the image data. We also revealed how we can post-process to constrain the latent space and reconstruct specific handwritten digits by isolating the multidimensional regions that correspond to the digit label of our choosing.

To fully enable the generation of new digits, we could have robustly constrained the latent space by introducing a second encoding that represents the digit label of our choosing. This will be elucidated in a future article.
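As a hint of that approach, here is a hypothetical sketch of conditioning the Decoder on a one-hot digit label; the names and shapes are illustrative assumptions, not the future article’s design:

from tensorflow.keras import layers

latent_input = layers.Input(shape=(latent_dim,), name='latent')
label_input = layers.Input(shape=(10,), name='one_hot_label')

# Concatenate the digit label onto the latent vector so the Decoder
# can be steered toward a specific digit at generation time
conditioned = layers.Concatenate()([latent_input, label_input])
# ... feed `conditioned` into the Decoder stack in place of the raw latent vector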


If you like the article and would like to support me make sure to:

  • 👏 Clap for the story (50 claps) and follow me 👉
  • 📰 View more content on my medium profile
  • 🔔 Follow Me: LinkedIn | Medium | GitHub | Twitter
  • 🚀👉 Join the Medium membership program to continue learning without limits. I’ll receive a small portion of your membership fee if you use the following link, at no extra cost to you.
