Generative AI Through the Ages: Convolutional Autoencoders Part 1
The origin story of Generative AI is rooted in the architecture of convolutional autoencoders.
I originally wrote this article as a 60-page Medium post about convolutional autoencoders as the origin and backbone of Generative AI. Editors and colleagues recommended that I break it into three parts: architecture + history [here], application, and latent space evaluation.
Generative AI (GenAI) is any form of computational algorithm that can create new content related to a set of data from which it has learned or mapped reproducible patterns.
Modern versions of GenAI are based on deep learning neural networks that train on massive datasets of images, audio signals, and text. They have become highly popular in the last few years through user-facing applications, such as Stable Diffusion, Midjourney, and DALL-E, as well as Large Language Models (LLMs), such as GitHub Copilot, ChatGPT, or Claude.
At some level in each of these examples, an Autoencoder was configured to encode and decode each application’s data environment. Most often, an Autoencoder is implemented at the first (encoder) and last (decoder) stages to embed or vectorise the inputs, as well as to transform the final latent stage in order to generate a new image, text, audio, etc.
The goal of most Generative AI experiments is to create an algorithm that can generate new data samples that do not explicitly exist in the dataset. A well-trained model can produce samples that closely represent the data environment on which it was trained. The data environment could include images alone, audio alone, or text alone, as well as images + text, audio + text, or images + classes. In the most impressive cases, the algorithm can generate convincing examples outside the expected boundaries of the data environment, such as a cave drawing of a Neanderthal using a cell phone.
Original User-facing Image Generators
One of the first famous examples of a user-facing GenAI specific to computer vision was This Person Does Not Exist (thispersondoesnotexist.com). It deployed a variant of a Generative Adversarial Network (GAN), which is known to create very crisp, artificial images of faces when trained on a corpus of human faces.
As we show in this article, an Autoencoder is often visualised as two pyramids pointing at each other. The architecture of a GAN can be visualised as an inside-out Autoencoder, with the image domains touching.
Even the first generation of GANs could produce astonishingly precise, human-like facial images that impressed public users, not just AI experts. However, the first GANs were limited in their scope of realism, and the generated human faces often resembled each other. This became apparent to users because the faces varied in only a handful of features: gender, eye type, hair color, hair style, facial expression, etc. As we will discuss here, this occurred because the latent spaces were narrow and produced low-quality image mappings that failed to capture real-world, multimodal distributions, a failure known as mode collapse.
Both GANs and Autoencoders have improved significantly since their inception. GAN developers created tools and methods to broaden the coverage of their latent spaces. At nearly the same time and pace, the image generation clarity of Autoencoders, especially Constrained Convolutional Autoencoders, improved dramatically with increased computational capacity (i.e., more complex models) and larger, more diverse datasets.
As stated above, almost all modern Generative AI techniques utilise an Autoencoder to encode (or “embed”) and decode their input/output data streams. As a result, we will discuss Autoencoders here as the backbone of Generative AI. In the context of image generators, the most prevalent variant is the Convolutional Autoencoder.
Autoencoders
Although GenAI has blossomed with resounding popularity in the last few years, Autoencoders (AEs) were first invented as far back as the 1980s. They were able to generate new images by exploring the multidimensional latent space distributions learned from training on existing image datasets. The most notable original dataset used was MNIST handwritten digits. For clarity on early computer vision deep learning and more background on the MNIST handwritten digits dataset, please see my Computer Vision Deep Learning Primer with Keras and Python.
Even though Autoencoders had generative capability, they were not widely used to generate new images of human faces, but instead served as denoising, feature extraction, data compression, or data augmentation models. In fact, an early technique for training deep neural networks for classification was first to train an autoencoder in which the dimensionality of the latent space was equal to the desired number of classes. After throwing away the decoder, the encoder could be fine-tuned along with the output layers to retrain it as a classifier.
For our computer vision use case, an Autoencoder is a three-part neural network that inputs an image and outputs another image. The output image is referred to as the reconstructed image. The three components of our autoencoder are an Encoder, a latent space representation, and a Decoder.
The input of the Encoder is the preprocessed image, while the output of the Encoder is the latent vector. The input of the Decoder is that latent vector, while the output of the Decoder is the reconstructed image (scaled similarly to the input image). The latent space is an abstract, multidimensional representation of the transformed or compressed image that is nominally one-dimensional. Each element of that 1D vector can be interpreted as a learned feature over the dataset.
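To make this structure concrete, here is a minimal Keras sketch of the Encoder, latent vector, and Decoder (a simple Dense variant for clarity, not the convolutional version we build in Part 2). The 28x28 input shape and 16-dimensional latent size are illustrative assumptions, not prescriptions.

```python
# A minimal sketch of the Encoder -> latent vector -> Decoder structure.
# The 28x28 grayscale input and 16-dimensional latent size are illustrative.
import tensorflow as tf
from tensorflow.keras import layers, Model

LATENT_DIM = 16  # illustrative choice

# Encoder: preprocessed image in, latent vector out
enc_in = layers.Input(shape=(28, 28, 1))
x = layers.Flatten()(enc_in)
x = layers.Dense(128, activation="relu")(x)
latent = layers.Dense(LATENT_DIM, name="latent_vector")(x)
encoder = Model(enc_in, latent, name="encoder")

# Decoder: latent vector in, reconstructed image out (scaled like the input)
dec_in = layers.Input(shape=(LATENT_DIM,))
y = layers.Dense(128, activation="relu")(dec_in)
y = layers.Dense(28 * 28, activation="sigmoid")(y)
dec_out = layers.Reshape((28, 28, 1))(y)
decoder = Model(dec_in, dec_out, name="decoder")

# Autoencoder: Encoder -> latent space -> Decoder, trained to reproduce its input
autoencoder = Model(enc_in, decoder(encoder(enc_in)), name="autoencoder")
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.summary()
```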
If we prioritised feature-space interpretability, and, say, focused on human face generation, then one of the features could represent eye shape or color, another could change the face shape, a third could affect the ear height, while a fourth could manipulate the hair style. From this perspective, the values of the latent vector can be interpreted as a quantification of the features measured in the input image by the Encoder.
Similarly, changing the values of the latent vector elements before passing the vector through the Decoder is akin to turning knobs on the features of the reconstructed, human-face image. For a more casual interpretation, the latent space can be seen as a non-linear, compressed representation of the input image. Compression is, indeed, a well-established use case for Autoencoders.
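As a hypothetical illustration of this knob turning, assuming the encoder and decoder from the sketch above have already been trained: the element index and offset below are arbitrary choices, used only to show the mechanics.

```python
# Nudge one latent element and decode, as a toy "feature knob" example.
# Assumes the trained `encoder` and `decoder` models from the sketch above.
import numpy as np

image = np.random.rand(1, 28, 28, 1).astype("float32")  # stand-in for a real input

z = encoder.predict(image)       # latent vector, shape (1, LATENT_DIM)
z_edited = z.copy()
z_edited[0, 3] += 2.0            # nudge one latent element (index and offset arbitrary)

original_recon = decoder.predict(z)
edited_recon = decoder.predict(z_edited)  # reconstruction with the "knob" turned
```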
In reality, these features can be manipulated, but they do not necessarily act independently or linearly: they are very often entangled through the non-linear transformations of the Encoder feature mappings. At the same time, it is reasonable to interpret the latent space as a collection of feature elements that represent the image dataset. In a future article, we will use a post-processing method to artificially constrain the distributions over our latent vectors to generate specific handwritten digits with increasing clarity.
Furthermore, there are variations on autoencoders where the latent vector is not explicitly defined. Instead, the output of the Encoder is simply passed through to the Decoder. These variants do not produce a vector that can be extracted, examined, and manipulated. The Encoder instead outputs a matrix, an encoded “image” or feature map, that is still a sample from the latent space.
Nonetheless, because GenAI directly implements latent space manipulation as a key element of steering the output via user prompting, we will focus on the nominal Autoencoder variant that outputs an explicit vector within the latent space. In the context of dimensionality reduction, the latent vector is by definition smaller than the image itself. All the same, it retains as much of the original information as is necessary to reconstruct the output image with minimum error or loss. Because of this use case, the latent vector is sometimes referred to as the encoding or embedding vector.
This use case either employs the AE as a low-pass filter (denoising) or stores only the latent vector necessary to reconstruct the image (compression). Furthermore, if we train an AE on a set of data, then we can compress that data into the set of latent vectors. Think of this as encoding on your computer, transmitting the latent vector via the internet, and decoding on another computer. If properly trained, the second computer could reconstruct the image to very high accuracy (i.e., low loss or error).
As a result, only the image’s latent vector would need to be transmitted from our system to another, without overloading the bandwidth with the full image. Note that this compression is nominally “lossy compression”, which is a well-established norm for internet streaming services.
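Here is a toy sketch of that compression workflow, again assuming the trained encoder/decoder pair from the earlier sketch; the "payload" is the 16-element latent vector rather than the 784-pixel image.

```python
# Toy illustration of the compression use case: encode on one machine,
# transmit only the latent vector, decode on another. This is lossy compression.
import numpy as np

image = np.random.rand(1, 28, 28, 1).astype("float32")

payload = encoder.predict(image).astype("float32")   # 16 float32 values = 64 bytes
# ... send `payload` over the network instead of the 784-pixel image ...
reconstruction = decoder.predict(payload)            # decoded on the receiving machine

print(image.size, "values in ->", payload.size, "values transmitted")
```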
Convolutional Autoencoders
The most common autoencoder architecture for computer vision deep learning is the convolutional autoencoder (CAE). CAEs consist of connected stacks of convolutional kernels for the Encoder and Decoder, with Dense (linear matrix) layers in between to transform the encoded images into the latent space. Additional Dense layers connect the latent space back onto the Decoder for image reconstruction.
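Below is a minimal Keras sketch of this layout; the filter counts and 16-dimensional latent size are illustrative assumptions rather than the exact configuration built in Part 2.

```python
# Sketch of a CAE: convolutional stacks for Encoder and Decoder,
# with Dense layers bridging the latent space. Sizes are illustrative.
import tensorflow as tf
from tensorflow.keras import layers, Model

LATENT_DIM = 16

# Convolutional Encoder: image -> feature maps -> latent vector
enc_in = layers.Input(shape=(28, 28, 1))
x = layers.Conv2D(16, 3, strides=2, padding="same", activation="relu")(enc_in)  # 14x14
x = layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(x)       # 7x7
x = layers.Flatten()(x)
latent = layers.Dense(LATENT_DIM, name="latent_vector")(x)
encoder = Model(enc_in, latent, name="conv_encoder")

# Convolutional Decoder: latent vector -> feature maps -> reconstructed image
dec_in = layers.Input(shape=(LATENT_DIM,))
y = layers.Dense(7 * 7 * 32, activation="relu")(dec_in)
y = layers.Reshape((7, 7, 32))(y)
y = layers.Conv2DTranspose(16, 3, strides=2, padding="same", activation="relu")(y)      # 14x14
dec_out = layers.Conv2DTranspose(1, 3, strides=2, padding="same", activation="sigmoid")(y)  # 28x28
decoder = Model(dec_in, dec_out, name="conv_decoder")

cae = Model(enc_in, decoder(encoder(enc_in)), name="conv_autoencoder")
cae.compile(optimizer="adam", loss="mse")
```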
Convolutions are powerful tools for manipulating images because they take into account the intrinsic (i.e., physical) correlations between pixels: humans perceive these correlations as colors and objects spanning neighboring pixels. Convolutions are used specifically because they adeptly take advantage of these physical correlations.
Note that the original autoencoder architectures were dominated by stacks of Dense layers, which almost solely involved linear matrix transformations. Dense autoencoders are bulky stacks of matrices that do not intrinsically take advantage of physical correlations inherent to images.
Compared to Dense autoencoders, convolutional autoencoders can be much smaller, achieving the same number of transformations with far smaller matrices, or much deeper, spending the same number of parameters on many more transformations. These properties are possible because a single convolutional filter often requires orders of magnitude fewer trainable parameters than a Dense layer.
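As a rough, illustrative comparison (the layer sizes below are my own choices): a single Dense layer mapping a flattened 28x28 image to 128 units already carries over 100,000 parameters, while a 3x3 Conv2D layer with 32 filters carries only 320.

```python
# Parameter-count comparison: Dense layer on a flattened image vs. a small Conv2D.
from tensorflow.keras import Input, Model, layers

dense_in = Input(shape=(28 * 28,))
dense_model = Model(dense_in, layers.Dense(128)(dense_in))

conv_in = Input(shape=(28, 28, 1))
conv_model = Model(conv_in, layers.Conv2D(32, 3, padding="same")(conv_in))

print("Dense 784 -> 128 :", dense_model.count_params(), "parameters")  # 100,480
print("Conv2D 3x3, 32 filters:", conv_model.count_params(), "parameters")  # 320
```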
Furthermore, the same CAE can be applied to any image input shape, provided the architecture is fully convolutional (i.e., it has no Dense layers tied to a fixed input size). Although it may not be wise, such a CAE trained on 28x28 images can be applied to 256x256 images. This works because the convolutional operation passes the kernel over the image pixel-by-pixel. In a CAE configuration, the majority of trainable parameters within the autoencoder belong to the convolutional filters, also known as kernels. As such, the convolution only observes the set of pixels under the kernel: it multiplies each small window element-wise and sums over it. In contrast, Dense layers compute linear matrix multiplications over the entire image at once.
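To make the size-agnostic behaviour concrete, here is a small, hypothetical fully convolutional encoder with its spatial input shape left unspecified; the same 3x3 kernels slide over a 28x28 or a 256x256 image, and only the spatial extent of the output changes.

```python
# The same convolutional kernels applied to images of different sizes.
import numpy as np
from tensorflow.keras import layers, Model

inp = layers.Input(shape=(None, None, 1))  # height and width left unspecified
x = layers.Conv2D(16, 3, strides=2, padding="same", activation="relu")(inp)
out = layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(x)
fully_conv_encoder = Model(inp, out)

small = np.zeros((1, 28, 28, 1), dtype="float32")
large = np.zeros((1, 256, 256, 1), dtype="float32")
print(fully_conv_encoder(small).shape)  # (1, 7, 7, 32)
print(fully_conv_encoder(large).shape)  # (1, 64, 64, 32)
```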
For extensive details on convolutional deep learning methods, hyperparameters, use cases, and architecture, please see my Computer Vision Deep Learning Primer with Keras and Python.
Conclusion
We introduced the idea of convolutional autoencoders as the evolutionary ancestor of modern generative AI, including latent diffusion models (e.g., Stable Diffusion), Large Language Models, audio generation, DALL-E 3, etc. In the next (sub-)article, we will create a basic convolutional autoencoder using independent classes with double and triple inheritance. In the last (sub-)article, we will investigate how to visually evaluate and understand latent space distributions and their generative properties.
Read through Part 2 of this article series to understand how to build a convolutional autoencoder in TensorFlow Keras with Python. In Part 3 of this article series, we visually evaluate how to probe and constrain the latent space in order to understand the generative properties of latent space manipulation. If you prefer a direct, deep dive into the subject, then try the full text of this article series as one continuous article by selecting [Full] below.
Join the Medium membership program to continue learning without limits. I’ll receive a small portion of your membership fee if you use the following link, at no extra cost to you.
If you like the article and would like to support me, make sure to:
- 👏 Clap for the story and follow me 👉
- 📰 View more content on my medium profile
- 🔔 Follow Me: LinkedIn | Medium | GitHub | Twitter
We create a class for the Encoder and an independent class for the Decoder. In future articles that build upon this, we will introduce both a variational component to the autoencoder (VAE), as well as a constrained input to represent the user prompt. As such, this method affords us the simplicity of adding the VAE Sampling and class constraints to the Autoencoder class later. Currently, our CAE class here only inherits from the Encoder and Decoder classes.
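A hypothetical skeleton of that class layout is sketched below; the method names and layer choices are my own assumptions for illustration, and the full implementation is the subject of Part 2.

```python
# Skeleton of the class layout: independent Encoder and Decoder classes,
# with the CAE inheriting from both. Names and layer sizes are illustrative.
import tensorflow as tf
from tensorflow.keras import layers


class Encoder:
    def build_encoder(self, latent_dim=16):
        inp = layers.Input(shape=(28, 28, 1))
        x = layers.Conv2D(16, 3, strides=2, padding="same", activation="relu")(inp)
        x = layers.Flatten()(x)
        z = layers.Dense(latent_dim)(x)
        self.encoder = tf.keras.Model(inp, z, name="encoder")


class Decoder:
    def build_decoder(self, latent_dim=16):
        z_in = layers.Input(shape=(latent_dim,))
        y = layers.Dense(14 * 14 * 16, activation="relu")(z_in)
        y = layers.Reshape((14, 14, 16))(y)
        out = layers.Conv2DTranspose(1, 3, strides=2, padding="same",
                                     activation="sigmoid")(y)
        self.decoder = tf.keras.Model(z_in, out, name="decoder")


class ConvAutoencoder(Encoder, Decoder):
    """CAE that inherits from both the Encoder and Decoder classes."""

    def build(self, latent_dim=16):
        self.build_encoder(latent_dim)
        self.build_decoder(latent_dim)
        inp = self.encoder.input
        self.model = tf.keras.Model(inp, self.decoder(self.encoder(inp)))
```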