Generative AI Through the Ages: Convolutional Autoencoders Part 2
Build a Convolutional Autoencoder to Understand the Original Architecture of Generative AI.
I originally wrote this article as a 60-page Medium post about convolutional autoencoders as the origin and backbone of Generative AI. Editors and colleagues recommended that I break it up into three parts: architecture+history, application [here], and latent space evaluation.
The most common autoencoder architecture for computer vision deep learning is the convolutional autoencoder (CAE). CAEs are connected stacks of convolutional kernels for the Encoder and Decoder, with linear (dense) matrices in between to transform the encoded images into the latent space. Further dense layers connect the latent space back onto the Decoder for image reconstruction.
In a previous article, we discussed how convolutional autoencoders are the backbone and evolutionary ancestor of modern Generative AI models. Here we will show how to build one, explain how each component contributes to the generative properties of GenAI, and evaluate our model on the MNIST dataset. In the next (sub-)article, we will visually evaluate the latent space and generative capacity of even this simple model.
Compartmentalised ConvAutoencoder to Prepare for the Future
Depending on the use case, the architecture of the CAE could be included in a single class or a set of classes that either inherit from each other or are inherited into a final aggregated class. For both educational purposes and maintainability, we chose here to follow the latter method: a set of classes inherited into a final aggregated class.
Make It Real: Build a ConvAutoencoder from Scratch
Here we begin the implementation of our convolutional autoencoder: instantiating, training, visualising, and evaluating it. First, ensure that the appropriate libraries are installed to run the code below.
The code can be run in full with the associated Colab notebook.
!pip install pygtc scikit-learn tensorflow
and import the libraries into the run environment
import numpy as np
import plotly.graph_objects as go
from tensorflow.keras import datasets
from matplotlib import pyplot as plt
from pygtc import plotGTC
from sklearn.model_selection import train_test_split
from tensorflow.keras.layers import (
Input,
Dense,
Conv2D,
MaxPooling2D,
Conv2DTranspose,
Flatten,
Reshape
)
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras import backend as K
from tensorflow import image
Most of the operations involved in training a neural network are probabilistic and rely heavily on pseudo-random number generators. As such, to enable reproducibility of our results over multiple trial runs, we set an initial random seed: 42, because we're happily nerdy.
np.random.seed(42)
The Encoder and Decoder Classes
In the Encoder and Decoder classes below, we take as input the convolutional kernel parameters, image shape, and number of latent dimensions to configure the modules of our Convolutional Autoencoder. The convolutional kernel parameters are a dict that establishes the kernel shape, the number of filters, the activation function, the stride length, and the pool size (for MaxPooling2D layers).
The Encoder class maps the input image to its corresponding latent vector through a sequence of convolutional layers. The Decoder class maps the latent vector to its corresponding reconstructed image through a symmetric sequence of convolutional layers.
Attributes:
- input_shape (tuple): The shape of the input image. Default is (28, 28, 1).
- latent_dim (int): The dimension of the latent space. Default is 10.
- conv_params (dict): A dictionary of convolution parameters. (See below)
The constructor for the Encoder class initializes the encoder with the given input shape and latent dimension. Furthermore, the conv_params dictionary contains 7 other parameters that control the size, shape, and complexity of the convolutional layers (filters/kernels) and blocks (stacks of filters/kernels).
conv_params as key:value pairs
- activation is the flavor of non-linear activation function used between convolutional blocks. It defaults here to relu, or Rectified Linear Unit.
- decoder_activation is the non-linear activation function used at the top of the Decoder stack (the output of the Autoencoder) to best match the reconstructed image to the input image. We default the decoder activation to 'sigmoid' because we preprocessed the datasets to span a range of 0 to 1, and the sigmoid enforces an output range of 0 to 1.
- padding is the structure of the padding around the image after each convolutional layer. We chose the setting 'same' to ensure that we do not unintentionally lose information at the edges of the feature maps.
- strides sets the number of pixels by which the convolutional kernels step between each transformation. Our default of 2 results in each subsequent feature map (the output of a convolutional layer) having a quarter the number of features (half the size in both the X and Y directions).
- pool_size establishes how many pixels the MaxPooling layers bin together. By choosing a pool_size of 1, we force the CAE to effectively skip the MaxPooling operation, in favor of letting the strides dominate the feature-size reduction. Strides and MaxPooling serve as similar dimensionality reduction methods.
- kernel_size sets the size of each convolutional filter in each dimension. With a default of 3, each kernel has a 3x3 shape, or 9 parameters. The parameter can take an integer for square kernels or a 2-tuple for non-square kernel shapes.
- n_filters is a list that specifies how many filters should be trained in the convolutional layer of each block. We only include logic for one layer per block, but multiple layers could be introduced per block. In our default setting, we chose an Encoder sequence of [16, 8, 4]: the first layer has 16 filters, the second has 8, and the third has 4. The Decoder uses the reverse of this list by default: [4, 8, 16].
See my primer for more details on each hyperparameter and more.
class Encoder:
"""
    An Encoder class for a Convolutional Autoencoder. This class maps
the input image to its corresponding latent vector through a
sequence of convolutional layers.
Attributes:
input_shape (tuple): The shape of the input image.
Default is (28, 28, 1).
latent_dim (int): The dimension of the latent space.
Default is 10.
conv_params (dict): A dictionary of convolution parameters.
Default is None.
"""
def __init__(
self, input_shape=(28, 28, 1), latent_dim=10,
conv_params=None):
"""
The constructor for Encoder class. Initializes the Encoder
with the given input shape and latent dimension.
Args:
input_shape (tuple): The shape of the input image.
Default is (28, 28, 1).
latent_dim (int): The dimension of the latent space.
Default is 10.
conv_params (dict): A dictionary of convolution parameters.
Default is None.
"""
# Initialize convolution parameters
self.conv_params = conv_params
# If no conv_params provided, use default values
if self.conv_params is None:
self.conv_params = {
'activation': 'relu',
'decoder_activation': 'sigmoid',
'padding': 'same',
'strides': 2,
'pool_size': 1,
'kernel_size': 3,
'n_filters': [16, 8, 4]
}
# Ensure the latent_dim is a scalar
assert(np.ndim(latent_dim) == 0)
# Ensure the input_shape is a vector
assert(np.ndim(input_shape) == 1)
# If input_shape is a 2D image, append a channel dimension
if len(input_shape) == 2:
input_shape = (*input_shape, 1)
# Set the input shape and latent dimension
self.input_shape = input_shape
self.latent_dim = latent_dim
def build_encoder(self):
"""
Builds the encoder model. The model takes an image as input and
outputs a latent vector.
The function first applies a series of convolution layers to the
input image. The number of filters in each
layer is specified by the 'n_filters' key in the conv_params
dictionary. The activation function, padding, and strides for
these layers are also specified by the conv_params dictionary.
The output of the last convolution layer is then flattened and
passed through a dense layer. The number of units in this dense
layer is equal to the dimension of the latent space.
The output of the dense layer is the latent vector, which is the
output of the encoder model.
"""
# Ensure the 'n_filters' key in conv_params is a list or array
assert(
isinstance(
self.conv_params['n_filters'], (tuple, list, np.ndarray)
)
)
# Define the input image
self.input_image = Input(shape=self.input_shape)
x = self.input_image
# Define the kernel and pool shapes
kernel_shape = [self.conv_params['kernel_size']] * 2
pool_shape = [self.conv_params['pool_size']] * 2
# Encoding layers
for nfilters in self.conv_params['n_filters']:
# Apply a convolution layer with the given parameters
x = Conv2D(
filters=nfilters,
kernel_size=kernel_shape,
activation=self.conv_params['activation'],
padding=self.conv_params['padding'],
strides=self.conv_params['strides']
)(x)
# Store the shape of the volume before flattening
self.volume_size = K.int_shape(x)
# Flatten the volume
x = Flatten()(x)
# Apply a dense layer to map the flattened volume
# to the latent space
self.latent = Dense(self.latent_dim)(x)
# Define the encoder model
self.encoder = Model(self.input_image, self.latent)
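As a quick check that this piece works on its own, here is a minimal usage sketch, assuming only the class above and the imports at the top of the article:
# Minimal sketch: build and inspect a standalone Encoder
encoder_only = Encoder(input_shape=(28, 28, 1), latent_dim=10)
encoder_only.build_encoder()
encoder_only.encoder.summary()
# encoder_only.volume_size now holds the innermost feature-map shape,
# e.g. (None, 4, 4, 4) with the default conv_params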
The Decoder Class
The Decoder class below is very similar to the Encoder class above. In fact, the Decoder was generated from the Encoder by adding a map (Dense + Reshape layers) from the latent space to the expected shape of the innermost convolutional layer. After that, the for loop runs over the reversed list of filter counts and uses Conv2DTranspose layers, as opposed to the Conv2D layers seen in the Encoder class above.
The inputs to the Decoder are deliberately identical to those of the Encoder because both will later be combined into an object that inherits from both the Encoder and the Decoder.
Attributes:
- input_shape (tuple): The shape of the input image. Default is (28, 28, 1).
- latent_dim (int): The dimension of the latent space. Default is 10.
- conv_params (dict): A dictionary of convolution parameters. Default is None.
- volume_size (tuple): The shape of the innermost (2D) convolutional volume, i.e., the feature-map shape into which the Decoder reshapes the latent vector.
An additional input to the Decoder, volume_size, acts as both a bypass and an explicit input for an otherwise implicit value. The volume_size is nominally inherited from the Encoder object in the Autoencoder class below. Accepting it as an input parameter here bypasses the need for an Encoder to be given to the Decoder.
Exposing the shape of the innermost convolutional layer, volume_size, enables future use cases that might require manipulating the Decoder without a symmetric Encoder, or without an Encoder at all.
See my primer for more details on each hyperparameter and more.
class Decoder:
"""
A Decoder class for a Convolutional Autoencoder. This class maps
the latent vector to its corresponding reconstructed image through
a sequence of convolutional layers. The final output layer is
reshaped deliberately to match the shape of the input image:
ncols x nrows x nchannels.
Attributes:
input_shape (tuple): The shape of the input image.
Default is (28, 28, 1).
latent_dim (int): The dimension of the latent space.
Default is 10.
conv_params (dict): A dictionary of convolution parameters.
Default is None.
"""
def __init__(
self, input_shape=(28, 28, 1), latent_dim=10,
conv_params=None, volume_size=None):
"""
The constructor for Decoder class. Initializes the Decoder with
the given latent dimension and image shape.
Args:
input_shape (tuple): The shape of the input image.
Default is (28, 28, 1).
latent_dim (int): The dimension of the latent space.
Default is 10.
conv_params (dict): A dictionary of convolution parameters.
Default is None.
"""
# Initialize convolution parameters
self.conv_params = conv_params
# If no conv_params provided, use default values
if self.conv_params is None:
self.conv_params = {
'activation': 'relu',
'decoder_activation': 'sigmoid',
'padding': 'same',
'strides': 2,
'pool_size': 1,
'kernel_size': 3,
'n_filters': [16, 8, 4]
}
"""
If Decoder is established within the Autoencoder,
then the volume_size is inherited from the Encoder object.
If the Decoder is instantiated alone, then the volume_size
must be calculated or provided as an input variable.
"""
if not hasattr(self, 'volume_size'):
# TODO: calculate `volume_size` from latent_dim and kernel
# shapes assert(volume_size is not None)
            # Because the Decoder here is being built outside an Autoencoder,
            # the `volume_size` must be provided as an input; inside the
            # Autoencoder it is inherited from the `Encoder`
self.volume_size = volume_size
# Ensure the latent_dim is a scalar
assert(np.ndim(latent_dim) == 0)
# Ensure the input_shape is a 1D-vector
assert(np.ndim(input_shape) == 1)
# If input_shape is a 2D image (greyscale),
# append a channel dimension
if len(input_shape) == 2:
input_shape = (*input_shape, 1)
# Set the input shape and latent dimension
self.input_shape = input_shape
self.latent_dim = latent_dim
def build_decoder(self):
"""
Builds the decoder model. The model takes a latent vector as
input and outputs a reconstructed image.
The function first maps the latent vector to a dense layer with
the same number of units as the product of the volume size.
The output of this dense layer is then reshaped to match the
volume size.
The reshaped output is then passed through a series of
transposed convolution layers (also known as deconvolution
layers). The number of filters in each layer is specified
by the 'n_filters' key in the conv_params dictionary.
The activation function, padding, and strides for these layers
are also specified by the conv_params dictionary.
Finally, a convolution layer is applied to the output of the
last deconvolution layer. The number of filters in this layer
is equal to the number of channels in the input image.
The activation function for this layer is specified by the
'decoder_activation' key in the conv_params dictionary.
The output of the final convolution layer is then resized to
match the size of the input image. The resized image is the
output of the decoder model.
"""
# Define the latent inputs
latent_inputs = Input(shape=(self.latent_dim,))
# Compute the volume of the reshaped dense layer
volume_shape = self.volume_size[1:]
volume = np.prod(volume_shape)
# Map the latent inputs to a dense layer and reshape it
x = Dense(volume)(latent_inputs)
x = Reshape(volume_shape)(x)
# Define the kernel and pool shapes
kernel_shape = [self.conv_params['kernel_size']] * 2
pool_shape = [self.conv_params['pool_size']] * 2
# Decoding layers
for nfilters in self.conv_params['n_filters'][::-1]:
# Apply a transposed convolution layer with the
# given parameters
x = Conv2DTranspose(
filters=nfilters,
kernel_size=kernel_shape,
activation=self.conv_params['activation'],
padding=self.conv_params['padding'],
strides=self.conv_params['strides']
)(x)
# The output must be the number of channels
nchannels = self.input_shape[-1]
# Apply a convolution layer to map the volume to the
# number of channels
decoded = Conv2D(
filters=nchannels,
kernel_size=kernel_shape,
activation=self.conv_params['decoder_activation'],
padding='valid'
)(x)
# Get the size of the input image
img_size = K.int_shape(self.input_image)[1:]
# Resize the decoded image to match the size of the input image
resized_image_tensor = image.resize(
images=decoded,
size=list(img_size[:2]),
method='bilinear',
preserve_aspect_ratio=True,
antialias=False,
name=None,
)
# Define the decoder model
self.decoder = Model(latent_inputs, resized_image_tensor)
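To illustrate the volume_size bypass, here is a hypothetical standalone usage sketch. Note that build_decoder, as written, also reads self.input_image (normally set by the Encoder) to determine the target image size, so we provide a stand-in Input tensor; the (None, 4, 4, 4) volume shape is what the default Encoder settings would produce for 28x28 inputs.
# Hypothetical sketch: build a Decoder without a matching Encoder instance
decoder_only = Decoder(
    input_shape=(28, 28, 1),
    latent_dim=10,
    volume_size=(None, 4, 4, 4)  # innermost feature-map shape (assumed)
)
# build_decoder() reads self.input_image only to get the target image size
decoder_only.input_image = Input(shape=(28, 28, 1))
decoder_only.build_decoder()
decoder_only.decoder.summary()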
Putting it All Together: ConvAutoencoder as a builder object
The Encoder class above transforms the input images into their latent space representation, and the Decoder class above transforms the latent space vector into a reconstructed image. To tie those pieces together, the ConvAutoencoder class inherits from both the Encoder and Decoder classes above. This level of modularity helps us focus on each component quasi-independently, which benefits both the explainability and maintainability of the CAE components.
In later articles, we will build out the full generative AI prompting for image generation through to Latent Stable Diffusion (LSD), which starts as a Convolutional Variational Autoencoder. The LSD CAE is a denoising AE that manipulates the latent space representation according to the embedded user query or prompt.
This modularity allows the LSD to stage the CAE (class below) and adds the flexibility for Variational and Constrained input (prompting) in later development, while maintaining explainability. As such, the inputs to an LSD ConvAutoencoder class are nearly identical to the Encoder and Decoder here, except that it finely manipulates the autoencoder to generate new samples.
class ConvAutoencoder(Encoder, Decoder):
"""
A ConvAutoencoder class that combines the Encoder and Decoder classes
    to form a complete Convolutional Autoencoder. The ConvAutoencoder
    class takes an image as input, encodes it into a
latent vector using the Encoder, and then decodes the latent vector
back into an image using the Decoder.
Attributes:
input_shape (tuple): The shape of the input image.
Default is (28, 28, 1).
latent_dim (int): The dimension of the latent space. Default is 10.
autobuild (bool): Whether to automatically build the autoencoder
upon initialization. Default is True.
training_params (dict): A dictionary of parameters for training.
Default is None.
conv_params (dict): A dictionary of convolution parameters.
Default is None.
"""
def __init__(
self, input_shape=(28, 28, 1), latent_dim=10, autobuild=True,
conv_params=None):
"""
The constructor for ConvAutoencoder class.
Initializes the ConvAutoencoder with the given input shape and
latent dimension.
Args:
input_shape (tuple): The shape of the input image.
Default is (28, 28, 1).
latent_dim (int): The dimension of the latent space.
Default is 10.
autobuild (bool): Whether to automatically build the
autoencoder upon initialization. Default is True.
training_params (dict): A dictionary of parameters for
training. Default is None.
conv_params (dict): A dictionary of convolution parameters.
Default is None.
"""
# Initialize the Encoder and Decoder superclasses
super(Encoder, self).__init__()
super(Decoder, self).__init__()
# Set the convolution parameters
self.conv_params = conv_params
if self.conv_params is None:
self.conv_params = {
'activation': 'relu',
'decoder_activation': 'sigmoid',
'padding': 'same',
'strides': 2,
'pool_size': 1,
'kernel_size': 3,
'n_filters': [16, 8, 4]
}
# Check the dimensions of the latent dimension and input shape
assert(np.ndim(latent_dim) == 0)
assert(np.ndim(input_shape) == 1)
# If the input shape is 2D, add a channel dimension
if len(input_shape) == 2:
input_shape = (*input_shape, 1)
# Set the input shape and latent dimension
self.input_shape = input_shape
self.latent_dim = latent_dim
# Build the autoencoder if autobuild is True
if autobuild: # Default
self.build_autoencoder()
def build_autoencoder(self):
"""
Builds the autoencoder model. The model takes an image as input
and outputs a reconstructed image.
The function first checks if the encoder and decoder models have
been built. If not, it calls the build_encoder and build_decoder
methods to build them.
The autoencoder model is then constructed by passing the input
image through the encoder and decoder models in sequence.
"""
# Build the encoder model if it hasn't been built yet
if not hasattr(self, 'encoder'):
self.build_encoder()
# Build the decoder model if it hasn't been built yet
if not hasattr(self, 'decoder'):
self.build_decoder()
# Construct the autoencoder model by passing the input image
# through the encoder and decoder
self.autoencoder = Model(
self.input_image,
self.decoder(
self.encoder(
self.input_image
)
)
)
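A minimal sketch of building the combined model, assuming the classes above:
# Build the combined model; autobuild=True constructs all three sub-models
cae = ConvAutoencoder(input_shape=(28, 28, 1), latent_dim=10)
cae.encoder.summary()      # image -> latent vector
cae.decoder.summary()      # latent vector -> reconstructed image
cae.autoencoder.summary()  # image -> reconstructed image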
Make it Work
With the ConvAutoencoder architecture now settled, we want to introduce the training and inference components. The ConvAETrainer class below inherits from the previous three classes to build the ConvAutoencoder, compile it, and train it on the provided dataset. The dataset we use here is either MNIST or FashionMNIST. In addition to the train method, the class also provides encode_image, decode_image, and generate_image methods.
These methods include:
1. encode_image computes the latent vector from an input image
2. decode_image computes the reconstructed image from an input image
3. generate_image computes the reconstructed image from a latent vector
The third method above (generate_image) is the generative part of “Generative AI”. It takes as input a representation of the latent space and outputs a reconstructed image. In the case of Latent Stable Diffusion, the “latent space” representation is generated by embedding the user prompt.
In our case here, we will probe the statistical distributions over the latent space to identify the most likely regions for each input data class (MNIST: digit numbers). Using the modes from those statistical distributions, we will generate images of handwritten digits that coincide with the multi-dimensional latent space modes, i.e., the compressed, latent-space image representation (see the usage sketch after the class below).
class ConvAETrainer(ConvAutoencoder, Encoder, Decoder):
"""
A ConvAETrainer class that inherits from the ConvAutoencoder class.
This class is used to train the Convolutional Autoencoder and
generate new images from the latent space.
Attributes:
input_shape (tuple): The shape of the input image.
Default is (28, 28, 1).
latent_dim (int): The dimension of the latent space.
Default is 10.
training_params (dict): A dictionary of training parameters.
Default is None.
conv_params (dict): A dictionary of convolution parameters.
Default is None.
"""
def __init__(
self, input_shape=(28, 28, 1), latent_dim=10,
autobuild=True, training_params=None, conv_params=None):
"""
The constructor for ConvAETrainer class. Initializes the
ConvAETrainer with the given input shape, latent dimension,
training parameters, and convolution parameters.
Args:
input_shape (tuple): The shape of the input image.
Default is (28, 28, 1).
latent_dim (int): The dimension of the latent space.
Default is 10.
training_params (dict): A dictionary of training parameters.
Default is None.
conv_params (dict): A dictionary of convolution parameters.
Default is None.
"""
# Initialize the Encoder and Decoder superclasses
super(Encoder, self).__init__()
super(Decoder, self).__init__()
super(ConvAutoencoder, self).__init__()
# Set the convolution parameters
self.conv_params = conv_params
if self.conv_params is None:
self.conv_params = {
'activation': 'relu',
'decoder_activation': 'sigmoid',
'padding': 'same',
'strides': 2,
'pool_size': 1,
'kernel_size': 3,
'n_filters': [16, 8, 4]
}
# Set the autoencoder parameters
self.training_params = training_params
if self.training_params is None:
self.training_params = {
'optimizer': 'adam',
'loss': 'binary_crossentropy',
'metrics': None
}
# Check the dimensions of the latent dimension and input shape
assert(np.ndim(latent_dim) == 0)
assert(np.ndim(input_shape) == 1)
# If the input shape is 2D, add a channel dimension
if len(input_shape) == 2:
input_shape = (*input_shape, 1)
# Set the input shape and latent dimension
self.input_shape = input_shape
self.latent_dim = latent_dim
# Build the autoencoder if autobuild is True
if autobuild: # Default
self.build_autoencoder()
def train(
self, x_train, x_val, epochs=50, batch_size=128,
callbacks=None, shuffle=True):
"""
Train the autoencoder.
Args:
x_train (np.array): The training data.
x_val (np.array): The validation data.
epochs (int): The number of epochs to train for.
Default is 50.
batch_size (int): The batch size for training.
Default is 128.
callbacks (list): A list of callbacks to apply
during training. Default is None.
shuffle (bool): Whether to shuffle the training data
before each epoch. Default is True.
Returns:
A History object. Its History.history attribute is a
record of training loss values and metrics values at
successive epochs, as well as validation loss values
and validation metrics values (if applicable).
"""
# Build the autoencoder if it hasn't been built yet
if not hasattr(self, 'autoencoder'):
self.build_autoencoder()
# Compile the autoencoder model with the specified optimizer,
# loss function, and metrics
self.autoencoder.compile(
optimizer=self.training_params['optimizer'],
loss=self.training_params['loss'],
metrics=self.training_params['metrics']
)
# Train the autoencoder
return self.autoencoder.fit(
x_train,
x_train,
epochs=epochs,
batch_size=batch_size,
shuffle=shuffle,
callbacks=callbacks,
validation_data=(x_val, x_val)
)
def encode_image(self, image):
"""
Get the encoded representation of the image.
Args:
image (np.array): The image to encode.
Returns:
The encoded representation of the image.
"""
# Ensure the encoder has been built
assert(hasattr(self, 'encoder'))
# Encode the image
return self.encoder.predict(image)
def decode_image(self, images):
"""
Get the decoded (reconstructed) image.
Args:
images (np.array): The images to decode.
Returns:
The decoded images.
"""
# Ensure the autoencoder has been built
assert(hasattr(self, 'autoencoder'))
# Decode the images
return self.autoencoder.predict(images)
def generate_image(self, latent_vector):
"""
Generate a new image from the latent space.
Args:
latent_vector (np.array): The latent vector from which
to generate the reconstructed image.
Returns:
The generated image.
"""
# Generate the image from the latent vector
return self.decoder.predict(latent_vector)
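As a preview of how these methods fit together, here is a hypothetical usage sketch. It assumes a trained ConvAETrainer instance (built and fit as shown later in this article) and the preprocessed x_test and y_test arrays; it averages the latent vectors per digit class and generates one image per class mean.
# Hypothetical usage, assuming `conv_autoencoder` has already been trained
latent_vectors = conv_autoencoder.encode_image(x_test)   # (n_samples, latent_dim)
reconstructions = conv_autoencoder.decode_image(x_test)  # (n_samples, 28, 28, 1)

# One latent vector per digit class: the mean of that class's encodings
class_means = np.stack([
    latent_vectors[y_test == digit].mean(axis=0) for digit in range(10)
])
generated_digits = conv_autoencoder.generate_image(class_means)  # (10, 28, 28, 1)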
Preprocessing the Data
The datasets are stored remotely with greyscale values from 0 to 255. Because the convolutional neural network expects input values from 0 to 1, we will load the data and then divide each pixel by 255 (the max value). Because MNIST and FashionMNIST images are stored as 2D arrays per image, we expand the input images with a third dimension of size 1, i.e., a single greyscale channel. This also allows the project to scale up to color images (3D arrays per image).
def preprocess(features, labels):
"""
Normalizes the supplied array and reshapes it into
the appropriate format.
"""
features = features.astype("float32") / 255.0
if np.ndim(features) == 3: # Greyscale is N samples of 2D arrays
features = np.expand_dims(features, axis=-1)
if np.ndim(labels) == 2:
labels = labels.ravel()
return features, labels
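A quick sanity check of preprocess (a sketch; the printed shapes assume the MNIST dataset):
(x_raw, y_raw), _ = datasets.mnist.load_data()
x_scaled, y_flat = preprocess(x_raw, y_raw)
print(x_raw.shape, x_raw.dtype, x_raw.max())            # (60000, 28, 28) uint8 255
print(x_scaled.shape, x_scaled.dtype, x_scaled.max())   # (60000, 28, 28, 1) float32 1.0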
Train, Validation, Test Splitting
In addition to preprocessing, the load_and_preprocess_data function below will split the data according to the prescribed test_size. The default test_size is 50% because we want a reliable estimate of generalisability by testing on a large set of unseen (untrained-on) images. Generalisation error is the error derived from using the model "in the wild", often after deployment or at least during a later decision-making process.
There is a different use of the words test and validation in the nomenclature because the ML community iterated on them over time. The train_test_split function uses the scikit-learn terminology: test_size. In contrast, the model.fit method uses the keras terminology: validation_data and val_loss. Furthermore, the MNIST dataset includes train and test images, where the test images are meant to remain unseen by the model during the training process. This coincides with the keras terminology. Our use here is that:
- train images are used to update the model weights during training.
- validation images are used to monitor a proxy for generalisation error during the training process.
- test images are used to evaluate the model results as a more direct proxy for generalisation error.
Because MNIST already comes with 10k dedicated test images, we use train_test_split only on the 60k training images, producing 50% train images and 50% validation images (30k each). Validation images are used during the training process to compute the val_loss, which allows the callbacks, and the developers, to be more confident that the training process has not devolved past the minimum generalisation error per model.
Generalisation error is best estimated after training as the test_loss, using the test images that were never provided to the training procedure. Most developers use the val_loss as a running proxy for the test_loss. If the final val_loss does not match the test_loss within an expected variance of a few percent, then the model will likely not sustain the expected generalisation error.
In later articles, we will probe hyperparameter optimisation by training hundreds of CAEs and other ML models. For each model, we must compute the test_loss to select the best model, as well as to weight each model's contribution towards a weighted ensemble. When performing hyperparameter optimisation, however, the test_loss should no longer be used as a robust proxy for the generalisation error. Depending on the size of the provided dataset, we could split it into four components: train, validation, test, and generalisation (a minimal sketch of such a split follows below). If the dataset is not large enough for four splits, then new data should be acquired and monitored, often during deployment and later inference.
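Here is that sketch, on dummy data; the rest of this article uses the simpler three-way split implemented below.
# Four-way split sketch (train / validation / test / generalisation) on dummy data
X_dummy = np.random.rand(1000, 28, 28, 1)
y_dummy = np.random.randint(0, 10, size=1000)

x_trainval, x_holdout, y_trainval, y_holdout = train_test_split(
    X_dummy, y_dummy, test_size=0.5)
x_tr, x_vl, y_tr, y_vl = train_test_split(
    x_trainval, y_trainval, test_size=0.5)
x_te, x_gen, y_te, y_gen = train_test_split(
    x_holdout, y_holdout, test_size=0.5)
print(len(x_tr), len(x_vl), len(x_te), len(x_gen))  # 250 250 250 250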
def load_and_preprocess_data(dataset='mnist', test_size=0.5):
known_datasets = ['mnist', 'fashion_mnist', 'cifar10', 'cifar100']
if dataset == 'mnist': # 2D arrays of greyscale images
mnist = datasets.mnist.load_data()
(x_train, y_train), (x_test, y_test) = mnist
if dataset == 'fashion_mnist': # 2D arrays of greyscale images
fashion_mnist = datasets.fashion_mnist.load_data()
(x_train, y_train), (x_test, y_test) = fashion_mnist
if dataset == 'cifar10': # 3D arrays of color images
cifar10 = datasets.cifar10.load_data()
(x_train, y_train), (x_test, y_test) = cifar10
if dataset == 'cifar100': # 3D arrays of color images
cifar100 = datasets.cifar100.load_data()
(x_train, y_train), (x_test, y_test) = cifar100
x_train, y_train = preprocess(x_train, y_train)
x_test, y_test = preprocess(x_test, y_test)
x_train, x_val, y_train, y_val = train_test_split(
x_train,
y_train,
test_size=test_size # 50% of train test for validation
)
return x_train, x_val, x_test, y_train, y_val, y_test
Load Data, Make Model, Fit Model to Data
Load Data
Here we assign the dataset and test_size parameters that dominate our use case for our Convolutional Autoencoder, then load and preprocess the data as specified above.
dataset = 'mnist'
test_size = 0.5
x_train, x_val, x_test, y_train, y_val, y_test = load_and_preprocess_data(
dataset=dataset,
test_size=test_size
)
Make the Model
With the data in hand, we can now define the autoencoder architecture, instantiate the model, and load the data into the ConvAETrainer instance. We chose the latent_dim to be 10 dimensions because it is convenient when we later associate the modes of each latent dimension with a specific class or digit, from which to generate new handwritten digits. This choice is not necessary, but it is convenient for pedagogical and explainable development.
Loss Function Selection
For our loss function, we chose binary_crossentropy because our greyscale images span a range from 0 to 1, and we expect the reconstructed values to lie between 0 and 1. Each loss function has its own behaviour, assumptions, expectations, use cases, and value. The input data, output activation function, and loss function should be chosen as a triplet. We suggest exploring the loss function with respect to the dataset and preprocessing as a "hyper-hyperparameter".
For clarity on the deep learning background context with regards to loss functions and datasets, please see my primer on Computer Vision Deep Learning Primer with Keras and Python.
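As a quick, standalone illustration of how two candidate losses score the same imperfect reconstruction in the [0, 1] range (not part of the training pipeline):
from tensorflow.keras import losses

y_true_demo = np.array([[0.0, 0.5, 1.0]])
y_pred_demo = np.array([[0.1, 0.5, 0.9]])
print(float(losses.BinaryCrossentropy()(y_true_demo, y_pred_demo)))  # cross-entropy penalty
print(float(losses.MeanSquaredError()(y_true_demo, y_pred_demo)))    # mean squared error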
latent_dim = 10
# https://link.springer.com/chapter/10.1007/978-3-031-11349-9_30
# loss = 'mse' # good for small images and large datasets
# loss = 'mae' # good for small images and large datasets
loss = 'binary_crossentropy' # good for large images and small datasets
training_params = {
# 'optimizer': 'adadelta',
'optimizer': 'adam',
'loss': loss,
'metrics': None
}
conv_params={
'activation': 'relu',
'decoder_activation': 'linear',
'padding': 'same',
'strides': 2,
'pool_size': 1,
'kernel_size': 3,
'n_filters': [16, 8, 4]
}
input_shape = x_train.shape[1:] # skip the number of samples
conv_autoencoder = ConvAETrainer(
input_shape=input_shape,
latent_dim=latent_dim,
training_params=training_params,
conv_params=conv_params
)
print(conv_autoencoder.autoencoder.summary())
print(conv_autoencoder.encoder.summary())
print(conv_autoencoder.decoder.summary())
Calculating the Size of the Model
In a Convolutional Autoencoder, it is crucial to understand the number of parameters in the network to appreciate the complexity and capacity of the model. To count the number of parameters in a Convolutional Autoencoder, we need to consider the encoder, decoder, and latent space components. The primary components contributing to the parameter count are convolutional layers (Enc + Dec) and fully connected layers.
- Encoder Convolutional Layers: For a Conv2D layer, the number of parameters is determined by the size of the filters (or kernels), the number of input channels, and the number of output channels (n_filters): n_params = (kernel_size · kernel_size · n_filters_in + 1) · n_filters_out
- Decoder Convolutional Layers: The parameter calculation for Conv2DTranspose layers is similar to that of Conv2D layers. The formula remains the same as above.
- Latent Space: For the fully connected (Dense) layers used for dimensionality reduction or reshaping, the number of parameters is: n_params = (n_input_units + 1) · n_output_units
In each n_params calculation above, there is a "+1" term, which represents the bias term for each layer or filter. The bias term allows the network to rebalance the average amplitude of each layer or filter to help avoid exploding or vanishing gradients. For convolutional layers, a more common technique not shown here is to introduce a BatchNormalization layer after each convolutional filter.
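As a quick check of these formulas in code, applied to our default configuration (3x3 kernels, one greyscale input channel, and [16, 8, 4] Encoder filters):
def conv2d_params(kernel_size, n_filters_in, n_filters_out):
    # (kernel_size * kernel_size * n_filters_in + 1) * n_filters_out
    return (kernel_size * kernel_size * n_filters_in + 1) * n_filters_out

def dense_params(n_input_units, n_output_units):
    return (n_input_units + 1) * n_output_units

print(conv2d_params(3, 1, 16))   # 160: first Encoder Conv2D layer
print(conv2d_params(3, 16, 8))   # 1160: second Encoder Conv2D layer
print(conv2d_params(3, 8, 4))    # 292: third Encoder Conv2D layer
print(dense_params(64, 10))      # 650: Dense layer into the latent space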
Examining the Size of Our Autoencoder
In the default configuration above, each filter defaults to a 3x3 kernel. With Encoder layers of [16, 8, 4] filters, the Encoder includes 2262 trainable parameters: 1612 in the convolutional layers and 650 in the Dense projection into the latent space. The Decoder has a slightly different number of parameters, because reversing the filter sequence changes the input and output channels of each layer (and the Dense layer out of the latent space has a different shape), which results in 2461 Decoder parameters to train.
In addition to parameters, it is important to maintain a functional understanding of the size and number of feature maps (the outputs per filter/kernel per layer). Feature maps are the image-like 2D arrays generated by each filter transformation. In computer vision, these would be the transformed images, like edge detections or color gradients. With 16 filters/kernels in the first block, a 28x28 image outputs 16 feature maps, each with a quarter the number of input pixels, because the stride is set to 2. This means that the first convolutional layer produces 3136 feature pixels: 28 x 28 x 16 / 4.
After 3 convolutional blocks, with 16, 8, and 4 filters sequentially, the Encoder outputs four 4x4 feature maps, which is 64 feature pixels. The latent vector results from a linear matrix transformation using a Dense layer to convert these 64 feature pixels into the 10 latent space dimensions. This introduces a (64+1) x 10 linear matrix, which therefore includes 650 parameters for the neural network to train.
The convolutional layers in the Encoder (28 filters) and the Decoder (29 filters, including the single-channel output layer) account for 3369 parameters, while the two Dense layers (into and out of the latent space) require 1354 parameters: roughly 29% of the 4723-parameter network in just two layers. This shows that Dense layers can dramatically outnumber the parameters required for a comparable operation performed by a set of Conv2D layers.
Scaling Up to a Real World Application with Latent Stable Diffusion
Each convolutional layer produces a number of feature maps equal to its number of filters. With strides of 2, each feature map is a quarter the size of the input feature map to that layer. For reference, the first "feature map" is the input image data itself, and the next feature maps are the transformed images after being convolved by the first layer of filters. For the Encoder, this amounts to 3136 + 392 + 64, or 3592 feature pixels. The Decoder feature maps are larger still, because the transposed convolutions upsample to 32x32x16 before the final convolution and resize, giving roughly 20,000 more feature pixels. All of these feature pixels (i.e., elements of the feature maps) must be stored on the GPU per image in the batch, alongside the 4723 (shared) convolutional autoencoder parameters. With a batch size of 128, that amounts to a few million floating point numbers stored on the GPU.
These few million floats are not significant for modern GPUs. Then again, our experiment only includes seven convolutional layers and small images. These values grow linearly with the number of input pixels (image height x image width) and the batch size. For an HD image (1920x1080 pixels), the number of floats stored for even our simple convolutional autoencoder would scale into the billions. If we then expand to much larger autoencoders, the original Latent Stable Diffusion paper quotes roughly 1.5 billion parameters, which requires a GPU that can hold and process several billions of float values, i.e., the GPU must be able to efficiently process tens of GB/s. These are still trifles compared to modern LSDs and LLMs, which are built on similar technology.
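For the curious, here is a rough sketch that tallies how many activation values each sub-model above produces per input image, by summing the output elements of every layer (batch axis excluded); the exact attribute access assumes the TF2 Keras functional models built above.
def count_feature_pixels(model):
    # Sum the number of output elements (ignoring the batch axis) of every layer
    total = 0
    for layer in model.layers:
        shape = layer.output_shape
        if isinstance(shape, list):  # e.g. InputLayer may report a list of shapes
            shape = shape[0]
        total += int(np.prod([dim for dim in shape[1:] if dim is not None]))
    return total

print(count_feature_pixels(conv_autoencoder.encoder))  # a few thousand feature pixels
print(count_feature_pixels(conv_autoencoder.decoder))  # roughly twenty thousand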
Our Encoder Size
Model: "Encoder"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) [(None, 28, 28, 1)] 0
conv2d (Conv2D) (None, 14, 14, 16) 160
conv2d_1 (Conv2D) (None, 7, 7, 8) 1160
conv2d_2 (Conv2D) (None, 4, 4, 4) 292
flatten (Flatten) (None, 64) 0
dense (Dense) (None, 10) 650
=================================================================
Total params: 2262 (8.84 KB)
Trainable params: 2262 (8.84 KB)
Non-trainable params: 0 (0.00 Byte)
________________________________________________________________
Our Decoder Size
Model: "Decoder"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_2 (InputLayer) [(None, 10)] 0
dense_1 (Dense) (None, 64) 704
reshape (Reshape) (None, 4, 4, 4) 0
conv2d_transpose (Conv2DTr (None, 8, 8, 4) 148
anspose)
conv2d_transpose_1 (Conv2D (None, 16, 16, 8) 296
Transpose)
conv2d_transpose_2 (Conv2D (None, 32, 32, 16) 1168
Transpose)
conv2d_3 (Conv2D) (None, 30, 30, 1) 145
tf.image.resize (TFOpLambd (None, 28, 28, 1) 0
a)
=================================================================
Total params: 2461 (9.61 KB)
Trainable params: 2461 (9.61 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
Our Autoencoder Size
Model: "Autoencoder"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) [(None, 28, 28, 1)] 0
model (Functional) (None, 10) 2262
model_1 (Functional) (None, 28, 28, 1) 2461
=================================================================
Total params: 4723 (18.45 KB)
Trainable params: 4723 (18.45 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
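The same totals can also be read programmatically from the models (assuming the conv_autoencoder instance above):
print(conv_autoencoder.encoder.count_params())      # 2262
print(conv_autoencoder.decoder.count_params())      # 2461
print(conv_autoencoder.autoencoder.count_params())  # 4723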
Try It and See What Happens
As far as we can tell, we have everything set up. Now, we shall set up the training procedure, configure the callbacks, and initiate the training process.
Training Setup
We will train in batches of 128 MNIST images for a minimum of 25 epochs and a maximum of 200 epochs (see callbacks). Furthermore, we will set shuffle=True because it is standard to shuffle the data to help avoid local minima by training on different orderings of the same 30k images.
Callbacks: Early Stopping
There are many useful callbacks that can be considered for monitoring the training, improving the generalisability, reducing the wall-time to complete, etc. As a minimum, we should always include an EarlyStopping callback. Because we set the maximum number of epochs arbitrarily, we could set it to 1 epoch and produce a low-quality model. We could also set it to 1 billion epochs, such that the training would never end in the lifetime of the human species on Earth.
More than just wall-clock time, it has been shown that generalisation error increases with too many epochs (after the train-validation sweet spot), because the model keeps trying to improve the training loss without regard for the validation loss, which we use as a running proxy for the generalisation error. We will see below that the validation error may start to increase after many iterations, while the training error continues to decrease. The lowest validation error, just before that increase, marks the sweet spot.
As such, the EarlyStopping callback monitors the validation loss and deactivates the training process when the val_loss (or a monitor of our choosing) stops improving over successive epochs. The EarlyStopping callback has three parameters that we set differently from the default.
- patience (=20) means that if the val_loss does not improve upon the current best within 20 epochs, then the EarlyStopping callback will deactivate the training and return the history.
- monitor (='val_loss') means that EarlyStopping will track the val_loss metric. We could ask it to track the RMSE, the training loss, the F1 score, etc. We could also provide it with an external function that takes the features and labels as input and returns a single value.
- start_from_epoch (=25) lets the EarlyStopping callback take a nap until the 26th epoch. There is often considerable fluctuation in the val_loss over the first 10+ epochs, and we want to avoid the EarlyStopping callback deactivating the training before it has properly begun. We must, therefore, assume that the generalisation error (proxied by the validation loss) is not minimised in the first 25 epochs.
epochs = 200
batch_size = 128
shuffle = True
patience = 20
start_from_epoch = 25
callbacks = [
EarlyStopping(
patience=patience,
monitor="val_loss",
min_delta=0,
verbose=0,
mode="auto",
baseline=None,
restore_best_weights=True,
start_from_epoch=start_from_epoch,
),
]
We embedded the training procedure within ConvAETrainer and instantiated it as the conv_autoencoder instance. We thus start training by calling the conv_autoencoder.train method, which is a wrapper for the Keras autoencoder.fit. This method takes the training data (x_train), the validation data (x_val), the list of callbacks, and the training parameters that we discussed above. The output of conv_autoencoder.train is a History object, whose history attribute is a dict, that we store as history. The conv_autoencoder.train procedure also outputs a stream of meaningful text, including the training loss and val_loss.
For more details about understanding the Keras training output, please see my primer on Computer Vision Deep Learning Primer with Keras and Python.
history = conv_autoencoder.train(
x_train=x_train,
x_val=x_val,
epochs=epochs,
batch_size=batch_size,
callbacks=callbacks,
shuffle=shuffle
)
"""
Epoch 1/200
235/235 [==============================] - 22s 30ms/step -
loss: 0.3454 - val_loss: 0.2447
Epoch 2/200
235/235 [==============================] - 6s 24ms/step -
loss: 0.2230 - val_loss: 0.2112
Epoch 3/200
235/235 [==============================] - 5s 22ms/step -
loss: 0.2075 - val_loss: 0.2021
...
Epoch 197/200
235/235 [==============================] - 3s 12ms/step -
loss: 0.1432 - val_loss: 0.1432
Epoch 198/200
235/235 [==============================] - 2s 10ms/step -
loss: 0.1415 - val_loss: 0.1411
Epoch 199/200
235/235 [==============================] - 3s 11ms/step -
loss: 0.1410 - val_loss: 0.1405
Epoch 200/200
235/235 [==============================] - 3s 12ms/step -
loss: 0.1783 - val_loss: 0.1629
"""
Now that the training has completed, let's understand how to evaluate the results, which are predominantly stored in the history.history dict. Because the effectiveness of training (loss and val_loss) is correlated with how many epochs the algorithm iterated, we will make an epochs array and plot the loss values against it.
In our trial run, the validation loss did not trigger the EarlyStopping callback. As a result, we have a full 200 epochs to evaluate. This could imply that we should either update the learning rate, increase the number of epochs, or change the loss function from binary_crossentropy to mse, mae, or a number of other options.
The global result of the visualisation is that both the validation and training loss decay over time, which is a positive sign of training effectiveness. It implies that the model is able to determine meaningful correlations between the input and output data. Of course, because we are working with an autoencoder, the input and output data are the same.
Putting this together, we can understand that effective training implies that the latent space of the autoencoder meaningfully captures enough information from the encoded input images to effectively generate reconstructed images. This means that the latent space could be effective at generating new images with the Decoder independently. It also means that the encoder can be considered effective enough at compressing these images without losing significant information.
Visual Evaluation of Loss and Error
When first experimenting with deep learning and loss curves, most new ML developers wonder how to interpret the features of the loss curve: loss over epochs. Aside from the global, start-to-finish decay in the loss, there are also quasi-periodic spikes throughout the epochs. In our visualisation below, these spikes occur every 20–40 epochs in both the training (blue) and validation (red) loss.
The cause of these spikes is not perfectly knowable, but the most common explanation is variability in the mini-batch stochastic gradient descent (SGD). This variability means that when the mini-batch SGD grabs a set of images, those images are significantly different from any of the images that the model has understood well thus far, i.e., more varied.
As a made-up example, if training was unlucky such that the model had not yet learned to understand images of handwritten zero digits, then, when stochastically asked to train on a mini-batch consisting mostly of zeros, the autoencoder would find it very difficult to reconstruct them, and the loss would spike. There are other explanations for the spikes, but this is the version that best relates to our example, as well as to the architecture of our autoencoder.
As a result, another mark of quality for the training process is whether the amplitude of these spikes decreases as the number of training epochs increases. In our example visualisation, the largest spike (other than epoch 0) occurs at epoch 90. If we continue with the made-up example about misunderstanding handwritten zeros, then we could say that the autoencoder "had not yet learned" how to recognise a zero.
The stochastic nature of the process and the real-world variability of the dataset combine in such a way that each successive epoch is not pre-determined to produce a better model. The good news is that our visualisation shows that the spikes become less significant with increased epochs, to the point of converging closer to the loss curve.
epochs = np.arange(len(history.history['loss']))
plots = [
go.Scatter(
x=epochs,
y=np.log10(history.history['loss']),
name='Loss'
),
go.Scatter(
x=epochs,
y=np.log10(history.history['val_loss']),
name='Validation Loss'
)
]
fig = go.Figure(plots)
fig.show()
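As a small optional add-on (a sketch using the same history object), we can overlay a rolling mean of the validation loss to make the decay of the spike amplitude easier to see:
window = 10  # rolling-mean window, in epochs
val_loss = np.asarray(history.history['val_loss'])
smoothed = np.convolve(val_loss, np.ones(window) / window, mode='valid')

fig.add_trace(
    go.Scatter(
        x=epochs[window - 1:],  # align the valid-mode convolution with its epochs
        y=np.log10(smoothed),
        name='Smoothed Validation Loss'
    )
)
fig.show()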
Conclusion
We created a basic convolutional autoencoder using independent classes with double and triple inheritance. Each class could have been built into a single class object, but the learning and (future) testing experience of compartmentalising the autoencoder provided a detailed understanding of each piece of the puzzle.
- The Encoder transforms the input image, through matrix manipulations and convolutional transformations, into the latent vector.
- The latent vector is the encoded representation of the input images. With hope, it contains sufficient information to reconstruct the image.
- The Decoder transforms the latent vector into the reconstructed image using symmetric matrix manipulations and convolutional transformations, upscaling the autoencoded features into an image representation of the latent vector.
In the next (sub-)article, we will visually evaluate the latent space and the reconstructed images to understand how the distributions over the latent space store the image data. We will also reveal how to post-process and constrain the latent space to reconstruct specific handwritten digits, by isolating the multidimensional regions that correspond to the digit label of our choosing.
Read through Part 1 of this article series to understand how Generative AI evolved from the simpler architecture of Convolutional Autoencoders. In Part 3, we visually evaluate how to probe and constrain the latent space in order to understand its generative properties through latent space manipulation. If you prefer a direct, deep dive into the subject, try out the full text of this article series as one continuous article by selecting [Full] below.
If you like the article and would like to support me make sure to:
- 👏 Clap for the story (53 claps) and follow me 👉
- 📰 View more content on my medium profile
- 🔔 Follow Me: LinkedIn | Medium | GitHub | Twitter
- 🚀👉 Join the Medium membership program to continue learning without limits. I’ll receive a small portion of your membership fee if you use the following link, at no extra cost to you.