Generative AI Through the Ages: Convolutional Autoencoders Part 3

Jonathan Fraine
12 min read · Mar 9, 2024


Visually Evaluate the Latent Space of a Convolutional Autoencoder to Understand the Original Architecture of Generative AI.

I originally wrote this article as a 60-page Medium post about convolutional autoencoders as the origin and backbone of Generative AI. Editors and colleagues recommended that I break it up into three parts: architecture + history, application, and latent space evaluation [here].

Image Generated by the Author using DALL-E 3 Prompted by “Visually Evaluate the Latent Space of a Convolutional Autoencoder to Understand the Original Architecture of Generative AI”

The most common autoencoder architecture for computer vision deep learning is the convolutional autoencoder (CAE). CAEs are connected stacks of convolutional kernels for the Encoder and Decoder, with dense (linear) layers in between to transform the encoded images into the latent space. Further dense layers connect the latent space onto the Decoder for image reconstruction.
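
To make that layout concrete, here is a minimal Keras sketch of the Encoder-dense-Decoder structure described above. The filter counts, kernel sizes, and latent_dim are illustrative assumptions, not the exact configuration built in Part 2.

from tensorflow import keras
from tensorflow.keras import layers

latent_dim = 10  # assumed size of the latent space

encoder = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),
    layers.Conv2D(16, 3, strides=2, padding="same", activation="relu"),
    layers.Conv2D(32, 3, strides=2, padding="same", activation="relu"),
    layers.Flatten(),
    layers.Dense(latent_dim, name="latent_vector"),  # dense layer into the latent space
], name="encoder")

decoder = keras.Sequential([
    keras.Input(shape=(latent_dim,)),
    layers.Dense(7 * 7 * 32, activation="relu"),  # dense layer back out of the latent space
    layers.Reshape((7, 7, 32)),
    layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu"),
    layers.Conv2DTranspose(16, 3, strides=2, padding="same", activation="relu"),
    layers.Conv2D(1, 3, padding="same", activation="sigmoid"),
], name="decoder")

autoencoder = keras.Sequential([encoder, decoder], name="autoencoder")
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")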

In a previous article, we discussed how convolutional autoencoders are the backbone and evolutionary ancestor of modern Generative AI models, then showed how to build one, how each component contributes to the generative properties of GenAI, and how to evaluate the model on the MNIST dataset. In this (sub-)article, we will visually evaluate the latent space and generative capacity of even this simple model.

Visual Evaluation of Test Images

Here we will take the output of the ConvAutoencoder training and probe the Latent Space to understand how it is distributed, what "generative" means for AI, and look ahead to how we can improve these results.

Continuing from the previous article, we trust that our autoencoder is well-trained and understands our data set well enough to test. We shall visually evaluate the reconstructed images. Our class object ConvAETrainer includes a wrapper method for ConvAutoencoder.autoencoder.predict, which first confirms that self.autoencoder exists. This process takes as input the images held out for testing (x_test), encodes them into the latent space, and immediately decodes them as reconstructed images. This is an important evaluation because the autoencoder was not trained on ("has not seen") the test images before.

# xtest_decoded = conv_autoencoder.autoencoder.predict(x_test)
xtest_decoded = conv_autoencoder.decode_image(x_test)
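
For readers who have not followed Part 2, here is a hedged sketch of what such wrapper methods might look like. The actual implementation lives in Part 2's ConvAETrainer and ConvAutoencoder classes and may differ in names and details.

class ConvAutoencoderWrapperSketch:
    def __init__(self, autoencoder=None, encoder=None, decoder=None):
        self.autoencoder = autoencoder
        self.encoder = encoder
        self.decoder = decoder

    def decode_image(self, x):
        # Full round trip: encode the images into the latent space,
        # then immediately decode them into reconstructions.
        assert self.autoencoder is not None, "Train or load the autoencoder first"
        return self.autoencoder.predict(x)

    def encode_image(self, x):
        # Encoder half only: images -> latent vectors.
        assert self.encoder is not None, "Train or load the encoder first"
        return self.encoder.predict(x)

    def generate_image(self, z):
        # Decoder half only: latent vectors -> images.
        assert self.decoder is not None, "Train or load the decoder first"
        return self.decoder.predict(z)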

To visually evaluate the quality of our autoencoder from input to reconstruction, we randomly select 10 input test images and their related reconstruction images. The third and fourth columns in our visualisation show the signed and absolute difference between the reconstructed and input images, to visually evaluate the reconstruction error.

(left) Input, (left middle) Decoded, (right middle) Error, and (right) Absolute Error [Generated by the Author]

The Good News

Although we see a ghost of the input in each error image, we should first be content that the reconstructed image (left middle column) is not Gaussian or otherwise arbitrary noise. Similarly, it is not an exact copy of the input image. If either of those failure cases were true, we could conclude that the autoencoder failed to train. In turn, we should be sufficiently content that the autoencoder was able to reconstruct a majority of the input images reasonably well.
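
As an optional, quick numerical complement to the visual check, we can compare the per-image mean absolute reconstruction error against a trivial baseline (here, "reconstructing" every image as the mean test image). The baseline choice is an illustrative assumption; the variable names follow those already used in this article.

import numpy as np

# Per-image mean absolute reconstruction error
residual = np.abs(np.squeeze(xtest_decoded) - np.squeeze(x_test))
recon_mae = residual.reshape(residual.shape[0], -1).mean(axis=1)

# Trivial baseline: predict the mean test image for every input
baseline = np.abs(np.squeeze(x_test) - np.squeeze(x_test).mean(axis=0))
baseline_mae = baseline.reshape(baseline.shape[0], -1).mean(axis=1)

print(f"Autoencoder MAE: {recon_mae.mean():.4f}")
print(f"Mean-image MAE:  {baseline_mae.mean():.4f}")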

Not The Good News

In contrast, we can of course see ghosts of the input image in each error image (right middle column). Specifically, we can see that the outlines or edges of the input image remain in the error images. This tells us that the autoencoder is not able to properly reconstruct the edges of major features. In some of the reconstructed images, the human-inferred digit is completely incorrect. That tells us that the encoder, latent space, or decoder is unable to properly distinguish between specific digits, such as zeros and eights, or slanted fours and nines.

The Path forward

We have several options to improve these results. They will not be explored here, but they should be within the scope of a test bed of autoencoders with which we could experiment.

Options:

  • Continue training for more epochs, because the autoencoder may simply "get better" with more time.
  • Change the loss function from binary_crossentropy to another option, because the SGD algorithm may be able to train the autoencoder more efficiently to capture more image features (or edges of image features) by minimising a different loss function.
  • Include more convolutional layers in the Encoder/Decoder. This could help the Encoder more accurately compress the image data onto the same latent space, such that the symmetric Decoder is able to more precisely reconstruct the images.
  • Add skip connections that tie the output of individual Encoder layers to the symmetric Decoder layers. This will be discussed in a later article about the U-Net architecture. U-Net often improves the accuracy of the edges in reconstructed images.
  • Increase or decrease the size of the latent space (latent_dim). The latent space must be large enough to properly capture the real-world variability of the image space in order for the Decoder to more precisely reconstruct the images. The latent space should also not be larger than is required to properly generate new images with precision; otherwise, the correlations between latent dimensions may add sources of structured noise.
  • De-symmetrise the Decoder and Encoder. Assuming that the Encoder and latent space are able to fully capture the variability within the input image domain, it is still possible that the Decoder is unable to reconstruct the images with enough precision to avoid missing the edges or confusing digits. A Decoder with an independent architecture may provide the flexibility required to more precisely reconstruct the images.

There are more options. Given the range and complexity of these choices, we will explore hyperparameter optimisation in a future article to probe many hyperparameter choices at once. The hope is that we select a broad enough set of hyperparameters, and ranges thereof, that we do not miss the combination that best fits the model to our use case.
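
As a sketch of how the options above could be organised for such a sweep, the small grid below enumerates candidate configurations. The specific values are illustrative assumptions, not settings from the original training run.

from itertools import product

search_grid = {
    "loss": ["binary_crossentropy", "mse", "mae"],
    "latent_dim": [5, 10, 20],
    "n_conv_layers": [2, 3, 4],
    "epochs": [20, 50, 100],
}

# Expand the grid into a list of candidate configurations to train and compare
configurations = [
    dict(zip(search_grid.keys(), values))
    for values in product(*search_grid.values())
]

print(f"{len(configurations)} candidate configurations")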

import numpy as np
from matplotlib import pyplot as plt

n_comparisons = 10
fig, axs = plt.subplots(nrows=n_comparisons, ncols=4, figsize=(10, 40))

for axs_ in axs:
    # Randomly select a held-out test image and plot it next to its reconstruction
    idx = np.random.randint(0, x_test.shape[0])
    axs_[0].imshow(x_test[idx], interpolation='None')
    axs_[1].imshow(xtest_decoded[idx], interpolation='None')
    axs_[2].imshow(
        xtest_decoded[idx] - x_test[idx],
        interpolation='None'
    )
    axs_[3].imshow(
        abs(xtest_decoded[idx] - x_test[idx]),
        interpolation='None'
    )

for axs_ in axs:
    for ax_ in axs_:
        ax_.set_axis_off()

Diagnose Latent Space: Visual Evaluation

To understand how well the latent space is able to capture the variability in the scope of the input image space — how dynamic and variable are the handwritten digits — we will encode the full set of test images and visually evaluate the distribution over the latent space. This is a valuable exercise to understand the encoded images. It also may be misguided because the question often becomes “how well distributed are the latent vectors?” without first asking the nearly impossible question “how should the latent vectors be distributed?”

Toward answering the latter question, we often prefer our latent vectors to be Gaussian distributed. In a later article, we will discuss Variational Autoencoders, which deliberately constrain the latent space to be closer to Gaussian distributed. This is done by bisecting the output of the Encoder into a mean and a variance latent vector, i.e., a two-part latent space.
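
Ahead of that dedicated article, here is a minimal sketch of the mean/variance split, following the common Keras reparameterisation pattern. The layer sizes and names are illustrative assumptions, not the VAE that will be presented later.

import tensorflow as tf
from tensorflow.keras import layers

latent_dim = 10

class Sampling(layers.Layer):
    # Reparameterisation trick: z = mean + std * epsilon, with epsilon ~ N(0, 1)
    def call(self, inputs):
        z_mean, z_log_var = inputs
        epsilon = tf.random.normal(shape=tf.shape(z_mean))
        return z_mean + tf.exp(0.5 * z_log_var) * epsilon

# Assume `encoder_features` is the flattened output of the convolutional Encoder
encoder_features = layers.Input(shape=(128,))
z_mean = layers.Dense(latent_dim, name="z_mean")(encoder_features)
z_log_var = layers.Dense(latent_dim, name="z_log_var")(encoder_features)
z = Sampling()([z_mean, z_log_var])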

# Encode the held-out test images into the latent space (Encoder half only)
xtest_encoded = conv_autoencoder.encode_image(x_test)

An aesthetic visualisation for any multidimensional distribution is a stair diagram of 2D histograms, with the corresponding 1D histograms along the diagonal. In our case, we will use a KDE instead of a histogram, via the pyGTC library and its pygtc.plotGTC function.

Because we have 10 latent dimensions, there are 10 rows and 10 columns to the stair plot over our latent space. Each 2D histogram is an evaluation over the cross-correlation between each pairing of two latent dimensions. Each 1D histogram is an evaluation over the marginalised distribution of each latent dimension.

We have further split the visualisation into each of the 10 handwritten digits to produce overlapping cross-correlation and marginalised distributions of the latent vectors — with one color per digit (see Legend).

from pygtc import plotGTC

plotGTC(
    [xtest_encoded[y_test == label_] for label_ in range(10)],
    chainLabels=[f'Label: {label_}' for label_ in range(10)]
)
Stair plot for 2D Correlation and Marginalised 1D Distributions over the Latent Space

Visualising the cross-correlations alongside the marginalised distributions per latent dimension is a very important evaluation of the validity of our interpretation of the latent vectors, especially if we choose to assume that the latent vectors are Gaussian distributed. Unfortunately, with more dimensions, it becomes less tractable for a human to understand the distributions over the latent space.
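
As an optional numerical complement to the stair plot, we can run a normality test on each latent dimension. This only rejects or fails to reject Gaussianity per dimension; it does not answer how the latent vectors should be distributed. The snippet below is a rough check using variable names already defined in this article.

from scipy import stats

for dim in range(xtest_encoded.shape[1]):
    statistic, p_value = stats.normaltest(xtest_encoded[:, dim])
    print(f"Latent dimension {dim}: normality-test p-value = {p_value:.3g}")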

Kernel Density Estimation over the Latent Space

Here we will instead plot only the 1D Kernel Density Estimates per handwritten digit label, one digit per color, to understand how the latent space has been trained or constructed by our architecture and our chosen algorithms. If we examine the documentation for the Kernel Density Estimator, we should first understand that smoothness is built into the process. The KDE algorithm places a Gaussian kernel on each data point, which often results in a smooth Gaussian or multimodal-Gaussian estimate.

What we are able to infer from the visualisations is that the latent space has a definite range, with a small number of modes, sometimes only one mode, per digit label (i.e., per color) per latent dimension. The purpose of this evaluation is to show that we may be able to isolate specific regions of the latent space per digit in order to construct a specific digit of our choosing.

from sklearn.neighbors import KernelDensity

# One panel per latent dimension (latent_dim = 10 gives a 5 x 2 grid)
nrows = latent_dim // 2
ncols = latent_dim // 5

fig, axs = plt.subplots(nrows=nrows, ncols=ncols, figsize=(20, 20))

X_plot = np.linspace(-5, 5, 1000)[:, np.newaxis]
kde = KernelDensity(kernel="gaussian", bandwidth=0.2)

for i, axs_ in enumerate(axs):
    for j, ax_ in enumerate(axs_):
        for label_ in range(10):
            # Only label the curves in the first panel so the legend is drawn once
            label = label_ if i + j == 0 else None
            x_ = xtest_encoded[y_test == label_, i*ncols + j]
            kde = kde.fit(np.expand_dims(x_, axis=-1))
            log_dens = kde.score_samples(X_plot)
            ax_.fill(X_plot[:, 0], np.exp(log_dens), alpha=0.4, label=label)
            ax_.set_ylim(0.0, 1.2)
            if label is not None:
                ax_.legend()
Kernel Density Estimates Over the Latent Space per Digit Label

Unconstrained Probe of Latent Space

To express the scope of the unconstrained latent space, we will randomly sample it in all 10 dimensions, then visualise the reconstructed images. Although some features exist in the unconstrained reconstructed images that humans can over-interpret as handwriting, without a constraint to manipulate the latent space into specific and controllable modes, the visualization below is nearly random noise.

In contrast, the encoder does implicitly constrain the latent space to capture the digits as shapes. Unfortunately, the unconstrained latent space is not complete enough to predict the outcome of a random sampling over itself.

100 unconstrained samples over the latent space, reconstructed through the decoder [Image by Author]
n_samples = 10

# Draw 100 latent vectors at random (no constraint), then decode them into images
x_samples_rand = conv_autoencoder.generate_image(
    np.random.normal(0, 4, size=(n_samples*n_samples, latent_dim))
)

fig, axs = plt.subplots(nrows=n_samples, ncols=n_samples, figsize=(20, 20))

for i, axs_ in enumerate(axs):
    for j, ax_ in enumerate(axs_):
        idx = i*n_samples + j
        x_sample_ = x_samples_rand[idx]
        ax_.imshow(x_sample_, interpolation='None')
        ax_.axis('off')

Constrained Probe of Latent Space

In contrast, we observed that the latent space per digit includes a small number of modes with wide distributions. As a post-processing step, we can sample over the latent space near those modes and produce reconstructed images that are closer to the expected labels, albeit with a wide variety of handwriting styles, edge strengths, and perceived digit values.

At this point, it is important to note that human psychology may be overfitting the results to interpret that the latent space is aware of human handwriting. Statistically, the model does not know what a human digit is; it has simply been trained to output pixel values close to what we expect for handwritten digits by capturing the distribution over the input image space. The human tendency to perceive patterns takes over and identifies the shapes as numbers.

In the analogy of monkeys reproducing poetry by typing randomly on a keyboard for long enough, we are constraining the latent vectors to "only type on the keys that humans identify as poetry" (numbers). The visualisation below has been post-processed to emulate a constrained latent space over the span of human digits provided in the training phase.

100 constrained samples over the latent space, reconstructed through the decoder [Image by Author]
median_per_label = {
    label_: np.median(xtest_encoded[y_test == label_], axis=0)
    for label_ in range(10)
}

std_per_label = {
    label_: np.std(xtest_encoded[y_test == label_], axis=0)
    for label_ in range(10)
}

# For each digit label, sample latent vectors around that label's median,
# with a per-dimension width set by that label's standard deviation
x_samples_constrained = [
    conv_autoencoder.generate_image(
        np.random.normal(
            median_per_label[label_],
            std_per_label[label_],
            size=(n_samples, latent_dim)  # one latent vector per sample
        )
    )
    for label_ in range(10)
]

fig, axs = plt.subplots(nrows=n_samples, ncols=n_samples, figsize=(20, 20))

for i, axs_ in enumerate(axs):
    for j, ax_ in enumerate(axs_):
        # Row i: samples constrained to digit label i; column j: the j-th sample
        x_sample_ = x_samples_constrained[i][j]
        ax_.imshow(x_sample_, interpolation='None')
        ax_.axis('off')

Overconstrained Probe of Latent Space

If we further constrain the input to the latent space by sampling narrowly around the latent-space modes of each digit, then our reconstructed images resemble the handwritten digits provided to the autoencoder even more closely.

This process is inspiring, although still imperfect. Among the hundreds of digits that we reconstruct, there are incorrect digits that humans recognise as shapes that do not represent the expected labels: 0-9.

100 overconstrained samples over the latent space, reconstructed through the decoder [Image by Author]
median_per_label = {
    label_: np.median(xtest_encoded[y_test == label_], axis=0)
    for label_ in range(10)
}

std_per_label = {
    label_: np.std(xtest_encoded[y_test == label_], axis=0)
    for label_ in range(10)
}

# As before, but sample within half of each label's standard deviation
x_samples_constrained = [
    conv_autoencoder.generate_image(
        np.random.normal(
            median_per_label[label_],
            std_per_label[label_] / 2,  # Overconstrain: half the width
            size=(n_samples, latent_dim)
        )
    )
    for label_ in range(10)
]

fig, axs = plt.subplots(nrows=n_samples, ncols=n_samples, figsize=(20, 20))

for i, axs_ in enumerate(axs):
    for j, ax_ in enumerate(axs_):
        # Row i: overconstrained samples for digit label i; column j: the j-th sample
        x_sample_ = x_samples_constrained[i][j]
        ax_.imshow(x_sample_, interpolation='None')
        ax_.axis('off')

This code can be run in full with the associated Colab notebook.

Onward to Latent Stable Diffusion

Latent Stable Diffusion, discussed earlier in this series, introduced carefully constrained variational autoencoders to manipulate the latent space into reconstructing specific images based on the text-encoded input (query or prompt). The end goal of this series of articles is to illuminate the underlying principles of Latent Stable Diffusion by first evaluating its predecessors and components.

Conclusion

We visually evaluated the latent space and the reconstructed images to understand how the distributions over the latent space store the image information. We also showed how we can post-process the latent space to constrain it and reconstruct specific handwritten digits by isolating the multidimensional regions that correspond to the digit label of our choosing.

To fully enable the generation of new digits, we could more robustly constrain the latent space by introducing a second encoding that represents the digit label of our choosing. This will be elucidated in a future article.
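
One common way to introduce such a second encoding, sketched here purely for illustration ahead of that article, is to concatenate a one-hot digit label onto the latent vector before decoding. The helper below is hypothetical and is not the design that will be presented later.

import numpy as np

def conditional_latent(z, labels, n_classes=10):
    # z: (n_samples, latent_dim) latent vectors
    # labels: (n_samples,) integer digit labels to condition on
    one_hot = np.eye(n_classes)[labels]
    # Decoder input becomes (n_samples, latent_dim + n_classes)
    return np.concatenate([z, one_hot], axis=1)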

Read through the previous (sub-)articles to understand how Generative AI evolved from the simpler architecture of Convolutional Autoencoders in Part 1 of this article series, and learn how to build your own Convolutional Autoencoder in Part 2. If you prefer a direct, deep dive into the subject, then try the full text of this article series in one continuous article by selecting [Full] below.

If you like the article and would like to support me make sure to:

  • 👏 Clap for the story and follow me 👉
  • 📰 View more content on my medium profile
  • 🔔 Follow Me: LinkedIn | Medium | GitHub | Twitter
  • 🚀👉 Join the Medium membership program to continue learning without limits. I’ll receive a small portion of your membership fee if you use the following link, at no extra cost to you.



Jonathan Fraine

Director of Engineering for Wikimedia DE. I work with dozens of motivated and enthusiastic developers to improve the future of free knowledge around the world.