- Introduction
- Basic Architecture
- Implementation
- The Mathy Bit: Kullback–Leibler Divergence
- Reconstructive and Generative Results
- Coding Space
- PCA
- Interpolations
- Logvar training
- Conditional VAE
Introduction
We saw in Intro to Autoencoders how the encoding space was a non-convex manifold, which makes basic autoencoders a poor choice for generative models. Variational autoencoders fix this issue by ensuring the coding space follows a desirable distribution that we can easily sample from - typically the standard normal distribution.
The theory behind variational autoencoders can be quite involved. Instead of going into too much detail, we try to gain some intuition behind the basic architecture, as well as the choice of loss function and how it helps the autoencoder learn the desired coding distribution.
Basic Architecture
Figure 1: A dense autoencoder
Our goal in building the model is to end up with normally distributed encodings. While that sounds complicated, having a model that generates such encodings is as simple as having the model learn the parameters of the normal distribution. There are two: the mean and the standard deviation.
Figure 1 shows such an architecture. Notice how the encoder part of the model feeds into TWO dense layers. These are the two parameters of our distribution, with Par_0 representing the mean $\mu$, and Par_1 representing the standard deviation $\sigma$. We then feed in a standard normal variable $w$ and transform it into an encoding vector the usual way:
$$
z = \mu + \sigma \cdot w
$$
where each of these variables is a 30-dimensional vector.
Implementation
We use Keras to implement our model and train it on MNIST. See the introductory article on autoencoders for details on how to import and prepare the data. For the rest of the model, we will need the following imports:
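(A minimal sketch, assuming the standalone keras package; tf.keras would work equally well. The snippets below rely on these names.)

```python
from keras.models import Model
from keras.layers import Input, Dense, Flatten, Reshape, Lambda
from keras import backend as K
```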
Encoder model
A variational autoencoder is very similar to a regular autoencoder, except it has a more complicated encoder. We begin by specifying our model hyperparameters, and define a function which samples a standard normal variable and transforms it into our codings via $z = \mu + \sigma \cdot w$.
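A sketch of these pieces; the 30-dimensional coding size follows the text, while everything else is an illustrative choice:

```python
coding_size = 30   # dimensionality of the codings

def sampling(args):
    """Draw w ~ N(0, I) and return z = mu + sigma * w."""
    param0, param1 = args                       # mean and standard deviation
    w = K.random_normal(shape=K.shape(param0))  # standard normal sample
    return param0 + param1 * w
```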
Since we have two separate dense layers (the distribution parameters) that take the same input, we build and connect the layers by hand. The code for building the layers is shown below. Notice how both param0 and param1 take hidden1 as input, consistent with Figure 1.
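A sketch of the layer wiring; layer sizes are illustrative, and the images are kept in their (28, 28, 1) shape:

```python
inputs  = Input(shape=(28, 28, 1))
flat    = Flatten()(inputs)
hidden1 = Dense(512, activation='relu')(flat)

param0 = Dense(coding_size)(hidden1)   # mean of the coding distribution
param1 = Dense(coding_size)(hidden1)   # standard deviation of the coding distribution
```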
The Lambda layer is where the magic happens. It takes the layers param0 and param1 as input (our distribution parameters), and performs the computation outlined by the sampling() function. The layer outputs our desired encoding.
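A sketch of that layer, using the names above:

```python
codings = Lambda(sampling)([param0, param1])   # z = mu + sigma * w
```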
Finally, we use the Model class to instantiate our encoder:
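(a sketch, reusing the tensors defined above)

```python
encoder = Model(inputs, [codings, param0, param1])
```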
Notice that when we instantiate the model, we ask it to output 3 things: the codings, followed by the two parameters. This simply allows us to inspect the learned parameters later. We could just as easily have used a simpler model that outputs only the codings.
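Something along these lines (same hypothetical names):

```python
encoder = Model(inputs, codings)   # codings only; the parameters are not exposed
```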
Decoder and VAE models
The decoder model is the same as before. Here’s how the implementation looks if we build it layer by layer and use the Model class instead of Sequential to instantiate it:
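(a sketch; the layer sizes are illustrative and mirror the encoder)

```python
decoder_inputs = Input(shape=(coding_size,))
decoder_hidden = Dense(512, activation='relu')(decoder_inputs)
decoder_flat   = Dense(28 * 28, activation='sigmoid')(decoder_hidden)
decoder_output = Reshape((28, 28, 1))(decoder_flat)

decoder = Model(decoder_inputs, decoder_output)
```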
Lastly, we stack the two together so we can train the autoencoder:
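(a sketch, reusing the codings tensor produced by the encoder above)

```python
reconstructions = decoder(codings)     # codings is the Lambda output above
vae = Model(inputs, reconstructions)
```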
Optimizer and loss function
Constructing the loss function is a bit trickier than before. Training should achieve two goals:
- good image reconstructions
- a coding space that is normally distributed
For the reconstruction part, we penalize the model using MSE. However, since we didn’t flatten our inputs before feeding them into the model, they have shape (?, 28, 28, 1), and we cannot simply use keras.losses.mse, since it takes the average along the last axis only, i.e. the channels axis (axis=-1); see the source code of the Keras implementation. Instead, we code the MSE error by hand using the Keras backend.
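A sketch of that reconstruction loss, using the hypothetical names above:

```python
# Squared error per pixel, summed over height, width and channel axes,
# leaving one loss value per training example
rec_loss = K.sum(K.square(inputs - reconstructions), axis=[1, 2, 3])
```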
Notice how we didn’t actually divide by the number of inputs. It’s certainly okay to do so (we are computing an average, after all), but that results in smaller gradients. Instead, we can think of this as a pixel-wise MSE, i.e.
$$
(\text{original pixel} - \text{reconstruction pixel})^2
$$
and adding each of those losses together over all pixels (across the single black-and-white channel) by specifying axis=[1, 2, 3].
We also need to penalize the model if it tries to create encodings that are not normally distributed. A thorough explanation of the following piece of code is in the Mathy Bit section.
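A sketch of that distribution loss, with eps as a small constant and param0, param1 as above:

```python
eps = 1e-10   # keeps the log away from zero
dist_loss = 0.5 * K.sum(K.square(param0) + K.square(param1) - 1
                        - K.log(K.square(param1) + eps), axis=-1)
```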
Our total loss will be the average of the two losses we constructed above. Taking the mean collapses the 1-dimensional vector (whose size is the training batch size) to a scalar.
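A sketch of combining them; here the per-example losses are simply summed before taking the batch mean (dividing by two, as a strict average would, only rescales the gradients):

```python
total_loss = K.mean(rec_loss + dist_loss)   # batch-sized vector -> scalar
```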
Finally, we pick an optimizer, compile our models, and train the VAE. The choice of loss for the encoder and decoder models is irrelevant, but we need to compile them if we want to use them individually in the future.
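One possible way to attach the custom loss is Keras's add_loss mechanism; the optimizer, epoch count and batch size are placeholders, and X_train is the prepared training set from the intro article:

```python
vae.add_loss(total_loss)           # attach the custom loss directly
vae.compile(optimizer='rmsprop')   # no loss argument needed after add_loss

# Compile the sub-models as well; the loss chosen here does not matter,
# since they are never trained directly.
encoder.compile(optimizer='rmsprop', loss='mse')
decoder.compile(optimizer='rmsprop', loss='mse')

vae.fit(X_train, epochs=10, batch_size=128)
```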
The Mathy Bit: Kullback–Leibler Divergence
Let’s try to understand the distribution loss of our variational model, and where it comes from. Recall that our 30-dimensional coding vector is obtained in the sampling() function by first sampling from a standard normal distribution and then transforming the result via $z = \mu + \sigma \cdot w$.
Figure 2: distinct Gaussian distributions
These operations are done component-wise, meaning that each component of the 30-dim vector is a normal random variable with mean $\mu$ and variance $\sigma^2$ (one pair of parameters per component). The components are mutually independent, so we can simplify our analysis and work with univariate $\mu$ and $\sigma$.
We want to be able to approximate the coding distribution by a standard normal distribution, so we need a way to measure how far off it is from a standard normal. For this we use the Kullback–Leibler divergence, which is a way to measure the difference between two probability distributions $P$ and $Q$. For continuous distributions, KL divergence is defined as $$ \begin{equation} D_{KL}(P || Q) = \int_\chi p(x) \ln \dfrac{p(x)}{q(x)} dx \end{equation} $$ where $p$ and $q$ are the probability density functions of $P$ and $Q$, respectively, and the integral is taken over the entire sample space $\chi$.
Surprise and Entropy
To understand where the formula for KL-divergence comes from, and why it can be used as a measure of the difference between probability distributions, we take a quick detour into information theory.
The “closeness” of distributions can be reduced to the outcomes and their corresponding probabilities. For continuous distributions (like the normal), there are infinitely many outcomes (in our case, all of $\mathbb{R}$), and technically each individual value occurs with probability zero. This explains why we need to accumulate a differential by integrating over the sample space, but what does the integrand in (1) represent?
Observing a random event evokes some amount of surprise. This can be quantified by a surprise function $I$. It is reasonable to assume that $I$ depends only on the probability of that event, so $I = I(p)$. We define the surprise (or information) associated with an event occurring with probability $p$ by $$ \begin{equation} I(p) = -\ln(p) \end{equation} $$ This definition makes sense as a measure of surprise, since as $p \to 0$ we have $I(p) \to \infty$ monotonically (rare events are more surprising than common events), and also $I(1) = 0$ (there is no surprise when seeing events that are certain).
Suppose $X$ is a discrete random variable whose probability distribution is $P$. The entropy of $X$ is given by $$ \begin{equation} H(X) = -\sum_x p(x) \ln p(x) = \sum_x p(x) I(p(x)) = \mathrm{E}_P(I(P)) \end{equation} $$
where $p(x) = \Pr(X = x)$ and $x$ ranges over all possible values of $X$. In other words, entropy is the expected amount of surprise evoked by a realization of $X$. This definition can be extended to continuous random variables: $$ \begin{equation} H(X) = -\int_\chi p(x) \ln p(x) dx \end{equation} $$
Since we can think of entropy as measuring the degree of randomness of $X$, we can use it to quantify the similarity of two distributions $P$ and $Q$. One might think to simply take the difference $H(P) - H(Q)$, but that is not enough (one can imagine two distinct pdfs with the same entropy). The expectation of information must be taken with respect to one of the distributions, and so the KL divergence $D_{KL}(P \| Q)$ is $$ D_{KL}(P || Q) = \int_\chi p(x) \ln \dfrac{p(x)}{q(x)} dx $$
which is taken with respect to $P$. Using properties of logarithms, this can be rewritten as $$\begin{eqnarray} D_{KL}(P || Q) &=& -\int_\chi p(x) \ln q(x) dx + \int_\chi p(x) \ln p(x) dx \nonumber \\ &=& \mathrm{E}_P(I(Q)) - \mathrm{E}_P(I(P)) \nonumber \end{eqnarray}$$
The first term is the average information of $Q$ with respect to $P$, also called the cross-entropy of $P$ and $Q$, and sometimes denoted by $H(P, Q)$. The second term is the entropy of $P$, namely $H(P)$. Hence the KL divergence can be thought of as the average information gain (or loss) when $Q$ is used instead of $P$.
Figure 3: area to be integrated by $D_{KL}(P \| Q)$
KL-div for Gaussians
In the context of our autoencoder, $P$ is the true distribution of our codings, while $Q$ is an approximation. In order to use $Q$ (standard normal) to generate digits, we want to bring $P$ closer to $Q$, so we minimize $D_{KL}(P \| Q)$ by incorporating it into our model’s total loss function.
Because of the autoencoder’s architecture (in particular the sampling() function), we know $P$ is normal with mean $\mu$ and variance $\sigma^2$, while $Q$ is standard normal. Let’s go ahead and compute $D_{KL}(P \| Q)$ between these normal distributions.
The density functions of $P$ and $Q$ are given by $$ p(x) = \dfrac{1}{\sqrt{2\pi \sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) \quad\text{and}\quad q(x) = \dfrac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right) $$
Plugging these into (1) and applying the rules of logarithms gives $$\begin{eqnarray} D_{KL}(P || Q) &=& \int_\chi \left(\ln p(x) - \ln q(x)\right) p(x) dx \nonumber \\ &=& \int_\chi \left[ \ln\frac{1}{\sigma} + \frac{1}{2}\left( x^2 - \left(\frac{x - \mu}{\sigma}\right)^2\right)\right] p(x) dx \nonumber \end{eqnarray}$$
which is just an expectation relative to $P$. Using properties of expectation, this becomes $$\begin{eqnarray} D_{KL}(P || Q) &=& \mathrm{E}_P \left[\ln \frac{1}{\sigma} + \frac{1}{2} \left( X^2 - \left(\frac{X - \mu}{\sigma}\right)^2\right)\right] \nonumber \\ &=& \ln\frac{1}{\sigma} + \frac{1}{2} \mathrm{E}_P(X^2) - \frac{1}{2} \mathrm{E}_P\left[\left(\frac{X - \mu}{\sigma}\right)^2\right] \nonumber \end{eqnarray}$$
The expectations are taken with respect to $P$, which means $X$ has mean $\mu$ and variance $\sigma^2$. Recall that the variance of a random variable can be defined as $\mathrm{Var}(X) = \mathrm{E}\left[(X - \mathrm{E}(X))^2\right]$, so the last expectation becomes $$ \mathrm{E}_P\left[\left(\frac{X - \mu}{\sigma}\right)^2\right] = \frac{1}{\sigma^2} \mathrm{E}_P\left[(X - \mu)^2\right] = \frac{1}{\sigma^2} \mathrm{Var}(X) = 1 $$
Also, $\mathrm{E}_P(X^2)$ can be rewritten as $$ \mathrm{E}_P(X^2 - 2X\mu + \mu^2 + 2X\mu - \mu^2) = \mathrm{E}_P\left[(X - \mu)^2\right] + 2\mu\mathrm{E}_P(X) - \mathrm{E}_P(\mu^2) = \sigma^2 + \mu^2 $$
and so the KL divergence $D_{KL}(P \| Q)$ becomes $$ \begin{equation} D_{KL}(P || Q) = \ln\frac{1}{\sigma} + \frac{\sigma^2 + \mu^2}{2} - \frac{1}{2} = \boxed{\frac{1}{2}\left(\mu^2 + \sigma^2 - 1 - \ln(\sigma^2)\right)} \end{equation} $$
Keeping in mind that $\mu$ is param0 and $\sigma$ is param1, our implementation of the KL loss is consistent with (5).
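Repeating the dist_loss sketch from the implementation section:

```python
eps = 1e-10
dist_loss = 0.5 * K.sum(K.square(param0) + K.square(param1) - 1
                        - K.log(K.square(param1) + eps), axis=-1)
```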
where eps ensures we don’t take the log of zero.
Reconstructive and Generative Results
Figure 4: Generative progress over 6 epochs
Training converges rapidly, and within a few epochs we are able to use the model for generating images of handwritten digits. Figure 4 shows the model’s generative ability improving over the first 6 epochs of training.
Since training also puts emphasis on normally distributed encodings, the model loses some of its reconstructive ability compared to a regular autoencoder:
Figure 5: Reconstruction output of our model
Coding Space
We can check to see that our model indeed produces codings that are normally distributed with mean 0 and variance 1. Figure 6 shows histograms for various coordinates of the encoded training set:
Figure 6: Encoding coordinate distributions
The distributions visually resemble the standard normal: they are centered around zero, appear symmetrical, and most of the values lie between -2 and 2. While not perfect (coordinate 7 is a bit skewed), the generative results are much better than those of a standard autoencoder:
Figure 7: More generated digits from our model
PCA
We can also visualize the coding distributions by performing 2-dimensional PCA. Notice how the variational model gives a convex coding space.
Figure 8: 2-dim PCA on encodings from our variational autoencoder (left) and a standard autoencoder (right)
If we ignore the color categories, the PCA output on our variational model resembles a 2-dimensional elliptical Gaussian cloud. One would also expect PCA in 3 dimensions to give a convex cloud, and it does:
Figure 9: 3-dim PCA on encodings from our variational autoencoder (left) and a standard autoencoder (right)
Interpolations
Having a convex coding space is not only advantageous for generating images that better resemble those in the original data set. It is also useful in allowing smooth interpolations between digits.
Figure 10: Digit interpolations from our variational autoencoder (left) and a standard autoencoder (right)
Logvar training
One common tweak to the variational autoencoder is to have the model learn param1 as $\log(\sigma^2)$ instead of $\sigma$, resulting in faster convergence during training. If we need $\sigma$, we simply compute $\sigma = \exp\left(\tfrac{1}{2}\log(\sigma^2)\right)$.
To implement this, we first modify the sampling() function to ensure our standard normal variable gets the correct scaling:
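(a sketch, with param1 now interpreted as the log-variance)

```python
def sampling(args):
    param0, param1 = args                        # mean and log-variance
    w = K.random_normal(shape=K.shape(param0))   # standard normal sample
    return param0 + K.exp(0.5 * param1) * w      # sigma = exp(log(sigma^2) / 2)
```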
Taking K.exp(0.5 * param1) gives $\sigma$, as needed. We also need to modify the calculation of dist_loss:
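(a sketch, matching the boxed formula in (5))

```python
# No eps needed: param1 is already the logarithm
dist_loss = 0.5 * K.sum(K.square(param0) + K.exp(param1) - 1 - param1, axis=-1)
```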
giving us $\frac{1}{2}\left(\mu^2 + e^{\log\sigma^2} - 1 - \log\sigma^2\right)$, which matches (5), as desired.
Conditional VAE
Another modification to the VAE is to add a conditioning vector to our inputs. This requires the data to be labeled (the conditioning vector will depend on the label).
Figure 11: Basic CVAE architecture
The MNIST dataset is simple enough that we can bypass the encoder part of the model, and feed the conditioning vector directly into the decoder by concatenating it with the codings output of the encoder, as shown in Figure 11.
Implementation
First we prepare the conditioning vector, which will be a one-hot vector representation of the MNIST labels. Keras provides the to_categorical() function, which does this for us automatically:
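(a sketch; y_train and y_test are the integer label arrays from the data-preparation step)

```python
from keras.utils import to_categorical

condits      = to_categorical(y_train, num_classes=10)   # one-hot training labels
condits_test = to_categorical(y_test, num_classes=10)
```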
Let’s build the encoder model. We use a CNN architecture, which means we can’t concatenate the condits vector directly to the input images. We bypass it all the way to the end, so the encoder never really sees the conditioning. However, if we wanted to incorporate it, one option would be to tile the one-hot vector and append it to the images as extra input channels.
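A sketch of one way this could look; filter counts, kernel sizes and the hidden width are illustrative, and the names condits_in and codings_condit are assumptions (the logvar sampling() from above is reused):

```python
from keras.layers import Conv2D, concatenate

coding_size = 10   # the encoder compresses to 10 dimensions

inputs     = Input(shape=(28, 28, 1))
condits_in = Input(shape=(10,))        # one-hot conditioning vector

x = Conv2D(32, 3, strides=2, padding='same', activation='relu')(inputs)
x = Conv2D(64, 3, strides=2, padding='same', activation='relu')(x)
x = Flatten()(x)
hidden1 = Dense(128, activation='relu')(x)

param0  = Dense(coding_size)(hidden1)             # mean
param1  = Dense(coding_size)(hidden1)             # log-variance
codings = Lambda(sampling)([param0, param1])

# The conditioning bypasses the convolutional stack and is concatenated
# with the codings at the very end
codings_condit = concatenate([codings, condits_in])

encoder = Model([inputs, condits_in], [codings_condit, param0, param1])
```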
Notice how even though the encoder technically compresses the inputs to a 10-dim representation, the actual codings vector is 20-dimensional, since we are concatenating two 10-dim vectors together (the encoded representation and the one-hot conditioning). We take this into account for the decoder’s input:
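(a sketch; a simple dense decoder with illustrative sizes, taking the 20-dim concatenated vector as input)

```python
decoder_inputs = Input(shape=(coding_size + 10,))   # 20-dim: codings + one-hot
decoder_hidden = Dense(512, activation='relu')(decoder_inputs)
decoder_flat   = Dense(28 * 28, activation='sigmoid')(decoder_hidden)
decoder_output = Reshape((28, 28, 1))(decoder_flat)

decoder = Model(decoder_inputs, decoder_output)
```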
We also need to modify the code which stacks the two models together for training, to account for the extra conditioning input:
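(a sketch, reusing the tensors defined above; cvae is a hypothetical name)

```python
reconstructions = decoder(codings_condit)
cvae = Model([inputs, condits_in], reconstructions)
```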
The loss/optimizer parts remain unchanged (we use the logvar version). However, the training code needs the additional input:
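(a sketch; the loss is rebuilt on the new tensors exactly as before, and X_train and condits are the prepared images and one-hot labels)

```python
rec_loss   = K.sum(K.square(inputs - reconstructions), axis=[1, 2, 3])
dist_loss  = 0.5 * K.sum(K.square(param0) + K.exp(param1) - 1 - param1, axis=-1)
total_loss = K.mean(rec_loss + dist_loss)

cvae.add_loss(total_loss)
cvae.compile(optimizer='rmsprop')

# The one-hot conditioning vector is passed alongside the images
cvae.fit([X_train, condits], epochs=10, batch_size=128)
```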
Figure 12: CVAE reconstructions
Results
Having the extra conditioning information allows the CVAE to play a corrective role in reconstructions. Figure 12 shows original MNIST images with a square area cut out, and the CVAE’s best guess at reconstructing the original.
More interestingly, when used as a generative model, the CVAE gives us more control over the type of digit we want to generate.
Figure 13: Controlled output of the CVAE
We see that the one-hot conditioning vector controls the type of digit generated, while the ‘encoding’ piece controls the writing style. This means the encoder model doesn’t have to learn what digit is represented, only the writing style. This is reflected in the distribution of the coding space:
Figure 14: Kernel PCA on encodings from CVAE in 2D (left) and 3D (right)
Notice there is no clustering taking place - the coding space is homogeneous. This enables us to perform smooth digit interpolations between styles (keeping the digit type constant), between digits (keeping the style constant), or both. We built a visualization that allows the reader to explore the coding space.
This wraps up our analysis of the variational autoencoder, and why it works as a generative model. In a future post, we will introduce generative adversarial models, see how they compare to variational autoencoders, and use both to produce realistic images of people.