Introduction to Conditional Flow Matching - Part I, Normalizing Flows
Generative AI is the study and application of algorithms to generate data following some distribution. The distribution could be of anything, in any modality - images of cats and dogs, samples of human speech, medical imaging data, natural language, etc. The accurate modeling of these distributions has shown great potential in creating commercial value as well as helping to address some fundamental societal problems. At their core, all generative techniques share the same goal: to approximate a complex generating distribution using only the samples we have from that distribution.
Over the years, a variety of powerful approaches to achieving this goal have emerged. Some famous examples of these techniques include Generative Adversarial Networks (GANs), transformers, and diffusion models. Each of these methods, while mathematically rich, is grounded in concepts that are relatively straightforward to explain intuitively. GANs approximate the distribution by pitting a generator against a discriminator: the generator tries to fool the discriminator, while the discriminator tries to catch the generator. Over time, both improve at their respective tasks, resulting in high-quality samples from the generator that accurately represent the underlying training data distribution. Transformers generate sequences one token at a time and learn which parts of the already generated sequence to "pay attention" to when generating the next token. Diffusion models iteratively add noise to training samples until they are indistinguishable from some simple-to-model distribution, and then learn a model to reverse the noising process.
Recently, another technique has joined the fray: Conditional Flow Matching (CFM). In contrast to the techniques mentioned above, it is not so straightforward to give an intuition for what's happening in CFM. At a very high level, it is a way to transform a simple generating distribution into the target distribution by flowing through a continuous set of transformations, with a few tricks thrown in to make the modeling tractable. As that is a little too vague of an explanation for anyone's comfort, I've written this article. I aim to provide a thorough, step-by-step explanation of CFMs, offering intuition for each part of the journey.
Today we start with part I on normalizing flows, which are a method to construct complex distributions by transforming a simpler distribution through a series of invertible mappings.
Normalizing flows
Let's say we have a random variable \(X\) distributed according to a simple uniform distribution between 0 and 1: \(X \sim p_x = U(0, 1)\).
Figure 1: The PDF \(p_x\) of random variable \(X \sim U(0, 1)\)
It is a constant function, defined analytically as \(p_x(x) = 1\) on the domain \([0, 1]\) and \(0\) elsewhere. Any sample between 0 and 1 is equally likely, and the PDF integrates to 1 over the full domain.
But now let's say that after taking our sample \(x\), we transform it by some continuously differentiable invertible function \(\phi(x)\), with a continuously differentiable inverse, to get \(y = \phi(x)\). As an example, let's pick \( \phi(x) = 2x \). Applying the transformation \(\phi(x)\) to samples of \(X\) induces a new random variable \(Y\). Let us denote the induced density as \(p_{y} \triangleq [\phi]_{ \# } p_x \). What does the density \(p_y\) look like, such that \(Y \sim p_y \)?
Deriving the maths
Let's put our example to the side for a moment and try to derive a generic answer.
We know that probability mass cannot disappear - i.e. the combined probability of all events should always be 1. As the transformation is invertible and thus bijective, the probability of a sample \( x \) falling within some interval \( [x_1, x_2] \) has to be equal to the probability of a sample \( y \sim p_y \) falling in the image of this interval under the transformation \( \phi \). Since \( \phi \) is also continuous, its invertibility implies that it is strictly monotonic, meaning that the image of the interval \( [x_1, x_2] \) is always the interval \( [y_1, y_2] = [\phi(x_1), \phi(x_2)] \). Note that in the case of a monotonically decreasing transformation, the interval "switches direction".
Figure 2: An invertible function is either monotonically increasing or decreasing, meaning an interval \([x_1, x_2]\) will always map to an interval \([y_1, y_2]\), possibly with inverted ordering.
Writing the preservation of probability under the transformation analytically gives us

$$ \int_{x_1}^{x_2} p_x(x) \, \mathrm{d}x = \int_{y_1}^{y_2} p_y(y) \, \mathrm{d}y $$
Now, importantly, the differentials on the two sides of this equation are not equal. Speaking a bit non-rigorously, remember that the differential \( \mathrm{d}x \) refers to an infinitesimal change in the variable \( x \), and this change in \(x\) does not necessarily imply an equal change in \( y \). Instead, they are related by the chain rule:

$$ \mathrm{d}y = \frac{\mathrm{d}\phi}{\mathrm{d}x} \, \mathrm{d}x $$
In other words, an infinitesimal "nudge" in \( y \) space corresponds to a scaled infinitesimal "nudge" in \( x \) space, and the scale is equal exactly to the derivative of \( y \) with respect to \( x \) at that point.
We can use this to rewrite the integral of the density of our transformed samples \( p_y \) in the space of the original samples, i.e. in \(x\)-space:

$$ \int_{y_1}^{y_2} p_y(y) \, \mathrm{d}y = \int_{x_1}^{x_2} p_y(\phi(x)) \, \frac{\mathrm{d}\phi}{\mathrm{d}x} \, \mathrm{d}x $$
Now, as this must hold for any choice of \( x_1, x_2 \), the integrands themselves must be equal. Importantly, we need to take the absolute value for the equality to hold, as the transformation could have reversed sign (when it's monotonically decreasing, in which case the ordering of \(y_1, y_2 \) is the opposite of \(x_1, x_2 \), and the integral will be negative). This gives us

$$ p_x(x) = p_y(\phi(x)) \left| \frac{\mathrm{d}\phi}{\mathrm{d}x} \right| $$
Now, using the fact that \( \phi(x) \) is invertible, we know that for any choice of \( y \), we have \( x = \phi^{-1}(y) \), which gives us the change of variables rule in one dimension:

$$ p_y(y) = p_x\left(\phi^{-1}(y)\right) \left| \frac{\mathrm{d}\phi}{\mathrm{d}x}\left(\phi^{-1}(y)\right) \right|^{-1} $$
Intuitively, the density at a point in \( y \) space is equal to the density of its corresponding point in \( x \) space corrected for how much the transformation has stretched the space. Using the relation \( \frac{1}{\frac{\mathrm{d}\phi}{\mathrm{d}x}} = \frac{\mathrm{d}\phi^{-1}}{\mathrm{d}y} \) we arrive at an alternative formulation of the same rule:

$$ p_y(y) = p_x\left(\phi^{-1}(y)\right) \left| \frac{\mathrm{d}\phi^{-1}}{\mathrm{d}y}(y) \right| $$
We now return to our example. We started with a uniform distribution, \( p_x(x) = 1 \) for \(x \in [0, 1] \), and \( 0 \) elsewhere. We then applied a transformation of \( \phi(x) = 2x \), with an inverse of \( \phi^{-1}(y) = \frac{y}{2} \). Our new density can now be defined on the transformed interval \( [0, 2] \) as

$$ p_y(y) = p_x\left(\tfrac{y}{2}\right) \left| \frac{\mathrm{d}\phi^{-1}}{\mathrm{d}y}(y) \right| = 1 \cdot \frac{1}{2} = \frac{1}{2} $$

and \( 0 \) elsewhere.
We have defined our first normalizing flow - starting from our source distribution \(p_x\) we applied a flow of \(\phi\) to arrive at a new distribution \(p_y\), where we took care to normalize the induced density.
A slightly more complex example
In the above example, the stretching factor induced by the transformation was constant, independent of the sample \( x \). This does not have to be the case. Let's take a different transformation \( \phi(x) = x^2 \). Note that this function, although not invertible everywhere, is invertible on our source distribution's non-zero domain. So, for any transformed sample \( y \) we can say that \(\phi^{-1}(y) = \sqrt{y} \). Applying our knowledge now, we can define our transformed density on the transformed interval \( (0, 1] \) as

$$ p_y(y) = p_x\left(\sqrt{y}\right) \left| \frac{\mathrm{d}\phi^{-1}}{\mathrm{d}y}(y) \right| = 1 \cdot \frac{1}{2\sqrt{y}} = \frac{1}{2\sqrt{y}} $$
Using a simple starting distribution and a simple transformation, we've now managed to get a distribution that looks a lot more interesting.
Figure 4: The induced density when \(\phi(x) = x^2\)
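To convince ourselves numerically, here is a minimal sketch that draws samples from \( U(0, 1) \), squares them, and compares a histogram of the transformed samples against the derived density \( \frac{1}{2\sqrt{y}} \):

import numpy as np

# Push samples of X ~ U(0, 1) through phi(x) = x^2 and compare the empirical
# density of the transformed samples against the derived p_y(y) = 1 / (2 * sqrt(y)).
rng = np.random.default_rng(0)
y = rng.uniform(0.0, 1.0, size=1_000_000) ** 2

hist, edges = np.histogram(y, bins=20, range=(0.0, 1.0), density=True)
# The average of 1 / (2 * sqrt(y)) over a bin [a, b] is (sqrt(b) - sqrt(a)) / (b - a).
analytic = (np.sqrt(edges[1:]) - np.sqrt(edges[:-1])) / np.diff(edges)
print(np.c_[hist, analytic])  # the two columns agree up to sampling noise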
Generalizing to higher dimensions
So far we have concerned ourselves with distributions and transformations in \(\mathbb{R}\). However, all the above generalizes to higher dimensions - we just need to replace our one dimensional
derivative with the full multidimensional Jacobian.
Quick recap of the Jacobian
The Jacobian of a transformation contains all the partial derivatives of the output with respect to the input. If \(\phi(\mathbf{x}) : \mathbb{R}^2 \rightarrow \mathbb{R}^2 \), then

$$ J_\phi(\mathbf{x}) = \begin{bmatrix} \frac{\partial \phi_1}{\partial x_1} & \frac{\partial \phi_1}{\partial x_2} \\ \frac{\partial \phi_2}{\partial x_1} & \frac{\partial \phi_2}{\partial x_2} \end{bmatrix} $$
The Jacobian generalizes the 1D derivative to higher dimensions, and gives the best possible linear approximation of the transformation at a given point. The interactive
figure below illustrates this. In the left graph, we have graphed what happens to the grid when applying the transformation
$$ \phi(\mathbf{x}) = \begin{bmatrix} x_1 + t \sin(x_2) \\ x_2 + t \sin(x_1) \end{bmatrix} $$
The value of \( t \) can be controlled by a slider. In the right graph, we're showing a zoomed in version of the grid on the left. The amount of zoom is controlled by a second slider. Aside from how the grid is transformed, we're also plotting how the Jacobian of the transformation
at that point would transform the grid. You can see that the further you zoom in, the more closely the Jacobian's linear approximation matches the actual transformation.
You can drag the square around to show different parts of the grid.
Figure 5: Visualization of \(\phi(\mathbf{x}) = [x_1 + t \sin(x_2);\ x_2 + t \sin(x_1)]\) and its Jacobian.
Furthermore, remember that the determinant of a matrix quantifies how much the space is compressed or stretched by the transformation represented by that matrix.
The determinant of the Jacobian thus tells us how much space is stretched by the Jacobian at a given point, and therefore how much space is stretched by the original transformation in an infinitesimal region around that point (as in that region, the transformation is approximately linear).
The idea behind this visualization is from the inimitable Khan Academy, which has an excellent pair of videos showing the intuition behind the Jacobian and its determinant.
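We can also let PyTorch do this computation for us. Here is a minimal sketch that computes the Jacobian of the transformation from Figure 5 at an arbitrarily chosen point (and value of \( t \)) via automatic differentiation, along with its determinant:

import torch

t = 0.5  # same role as the slider value in the figure above

def phi(x):
    # phi(x) = [x1 + t * sin(x2), x2 + t * sin(x1)]
    return torch.stack([x[0] + t * torch.sin(x[1]), x[1] + t * torch.sin(x[0])])

point = torch.tensor([1.0, 2.0])
jacobian = torch.autograd.functional.jacobian(phi, point)
print(jacobian)                           # 2x2 matrix of partial derivatives at `point`
print(torch.linalg.det(jacobian).item())  # how much phi stretches space around `point`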
This means that, generalized to higher dimensions, the change of variables rule looks like this: Let \(\phi(\mathbf{x}) : \mathbb{R}^d \rightarrow \mathbb{R}^d \) be a continuously differentiable invertible transformation, with a continuously differentiable inverse. Let \(p_x\) be the density function of some random variable \( X \). The density induced by applying the transformation \( \phi(\mathbf{x}) \) to samples from \( X \) can be written as

$$ p_y(\mathbf{y}) = p_x\left(\phi^{-1}(\mathbf{y})\right) \left| \det J_{\phi^{-1}}(\mathbf{y}) \right| $$
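To make the rule concrete, here is a minimal sketch that checks it numerically for a hand-picked invertible affine map applied to a standard normal - a case where the resulting density also happens to be known in closed form:

import torch

# A hand-picked invertible affine map y = phi(x) = A x + b applied to x ~ N(0, I) in R^2.
A = torch.tensor([[1.5, 0.5], [0.0, 2.0]])
b = torch.tensor([1.0, -1.0])
standard_normal = torch.distributions.MultivariateNormal(torch.zeros(2), torch.eye(2))

y = torch.tensor([2.0, 1.0])      # an arbitrary point in y-space
x = torch.linalg.solve(A, y - b)  # phi^{-1}(y)

# Change of variables: p_y(y) = p_x(phi^{-1}(y)) * |det J_{phi^{-1}}(y)|, with J_{phi^{-1}} = A^{-1}.
p_y = standard_normal.log_prob(x).exp() / torch.abs(torch.linalg.det(A))

# For this particular map the induced distribution is known exactly: N(b, A A^T).
closed_form = torch.distributions.MultivariateNormal(b, A @ A.T)
print(p_y.item(), closed_form.log_prob(y).exp().item())  # the two values agree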
We have seen how to transform a density \( p_x \) after applying a single transformation \( \phi(x) \) to the samples \( \mathbf{x} \in X \), which returns a new density \( p_y = [\phi]_{ \# }p_x\). If we now apply a second transformation to the samples, we can
transform the density again. In this way, we can apply a whole series of transformations to the samples, and transform the density in step. If we transform the original samples by a series of transformations \(\phi_1, \phi_2, \ldots, \phi_K \), then we denote the combined flow as \( \Phi = \phi_K \circ \ldots \circ \phi_2 \circ \phi_1 \). The density of the combined flow takes into account
the stretching of the space as it goes through the different transformations:

$$ p_y(\mathbf{y}) = p_x\left(\Phi^{-1}(\mathbf{y})\right) \prod_{k=1}^{K} \left| \det J_{\phi_k^{-1}}(\mathbf{y}_k) \right|, \quad \text{where } \mathbf{y}_K = \mathbf{y} \text{ and } \mathbf{y}_{k-1} = \phi_k^{-1}(\mathbf{y}_k) $$
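The reason the individual stretch factors simply multiply (or, in log space, add) is that the Jacobian of a composition is the product of the Jacobians of its parts. A minimal sketch with two hand-picked linear maps:

import torch

# Two invertible linear maps in R^2 (translations do not affect the Jacobian),
# composed into Phi = phi_2 after phi_1.
A1 = torch.tensor([[2.0, 0.0], [0.5, 1.0]])
A2 = torch.tensor([[1.0, 0.3], [0.0, 0.5]])

# The Jacobian of the composition is the matrix product of the individual Jacobians,
# so the log stretch factors of the individual transformations simply add up.
log_det_combined = torch.log(torch.abs(torch.linalg.det(A2 @ A1)))
log_det_sum = torch.log(torch.abs(torch.linalg.det(A1))) + torch.log(torch.abs(torch.linalg.det(A2)))
print(log_det_combined.item(), log_det_sum.item())  # identical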
So now we know how to transform a given source density function when the samples from that density undergo some given transformation \( \phi \). In the context of generative AI, however, the setup becomes more interesting when we know what the target distribution should be, and want to find the transformation that maps the (easy to sample from) source distribution onto this target distribution.
Imagine we have a dataset \( \mathcal{D} \) of samples \(\mathbf{x} \in \mathcal{D} \subseteq \mathbb{R}^2 \), that are distributed according to some target distribution \( p_t \). For illustration purposes, let's say that \( p_t \) is a normal distribution centered at \( \mu = [5, 5] \), with a covariance matrix of \( \Sigma = \begin{bmatrix} 2.25 & -1.75 \\ -1.75 & 2.25 \end{bmatrix} \) - but note that in a real world scenario we would not have access to this generating distribution, only to the samples in \( \mathcal{D} \sim p_t \).
Figure 6: Our target distribution's density \(p_t\)
Starting from a known distribution \( p_x \) in the same space, we now want to find a flow \( \phi(\mathbf{x}) \) that maps samples from known \( p_x \) to unknown \( p_t \).
We are more or less free to choose our starting distribution. Since after training we generate new data by drawing from the starting distribution and pushing the samples through the flow, we should pick a distribution for which sampling is straightforward. Let's pick as our starting distribution a normal distribution centered at the origin with unit covariance:
$$ p_x \triangleq \mathcal{N}(0, \mathbf{I}) $$
For our flow, we are free to pick any parameterized flow. One example is a general linear transformation followed by a translation:

$$ \phi(\mathbf{x}) = \mathbf{A}\mathbf{x} + \mathbf{b} $$
such that \( \mathbf{A} \) and \( \mathbf{b} \) are our trainable variables. As per the change of variables rule, the induced density can be written as

$$ p_y(\mathbf{y}) = p_x\left(\mathbf{A}^{-1}(\mathbf{y} - \mathbf{b})\right) \left| \det \mathbf{A}^{-1} \right| $$
Note that this inverse does not necessarily exist: if \( \mathbf{A} \) is singular, there is no inverse matrix. We need to take this into account later when we are running gradient descent. For now, let us assume this is taken care of. The Jacobian of the inverse is then

$$ J_{\phi^{-1}}(\mathbf{y}) = \mathbf{A}^{-1}, \qquad \left| \det J_{\phi^{-1}}(\mathbf{y}) \right| = \frac{1}{\left| \det \mathbf{A} \right|} $$
We now have all ingredients to plug into a gradient descent loop, to optimize for our parameters \( \mathbf{A}, \mathbf{b} \). Let's write this out in PyTorch. The first class we'll need is a dataset that generates samples from our target distribution. In a real world scenario, you would most likely replace this with a dataset that loads samples from disk.
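Here is a minimal sketch of what such a dataset could look like; the constructor signature is chosen to match the TargetDataset(1000, mean, cov) call in the training setup below:

import numpy as np
import torch
from torch.utils.data import Dataset

class TargetDataset(Dataset):
    # Draws num_samples points from the target Gaussian. In a real-world scenario
    # this would instead load samples of the unknown target distribution from disk.
    def __init__(self, num_samples, mean, cov):
        rng = np.random.default_rng(seed=0)
        samples = rng.multivariate_normal(mean, cov, size=num_samples)
        self.samples = torch.as_tensor(samples, dtype=torch.float32)

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]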
Next, we want to represent our flow. As during training we care only about the inverted flow, we'll write a class to represent the inverted flow explicitly. After training, we can use the inverse of this to get our forward flow.
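Here is a minimal sketch of one way to write this, assuming we parameterize the inverted flow directly as \( \phi^{-1}(\mathbf{y}) = \mathbf{M}\mathbf{y} + \mathbf{c} \) (so that \( \mathbf{A} = \mathbf{M}^{-1} \) and \( \mathbf{b} = -\mathbf{M}^{-1}\mathbf{c} \)); the names induced_density and nll_loss are the ones used by the training loop below:

import torch
import torch.nn as nn

class InvertedAffineFlow(nn.Module):
    # Parameterizes the inverted flow psi(y) = M y + c directly, so no matrix
    # inversion is needed during training. The forward flow is recovered
    # afterwards as phi(x) = M^{-1} (x - c).
    def __init__(self, dim=2):
        super().__init__()
        self.M = nn.Parameter(torch.eye(dim))
        self.c = nn.Parameter(torch.zeros(dim))

    def forward(self, y):
        return self.M @ y + self.c

induced_density = InvertedAffineFlow(dim=2)
source = torch.distributions.MultivariateNormal(torch.zeros(2), torch.eye(2))

def nll_loss(sample):
    # Negative log-likelihood of one target sample under the induced density:
    # -log p_y(y) = -log p_x(psi(y)) - log|det J_psi(y)|, with J_psi(y) = M.
    x = induced_density(sample)
    _, log_abs_det = torch.linalg.slogdet(induced_density.M)
    return -(source.log_prob(x) + log_abs_det)

Note that the \( -\log\left|\det \mathbf{M}\right| \) term in this loss blows up as \( \mathbf{M} \) approaches a singular matrix, which in practice discourages the optimization from collapsing the flow.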
With all the components in place, we can write our setup and training loop:
mean = np.array([5, 5])
cov = np.array([[2.25, -1.75], [-1.75, 2.25]])
dataset = TargetDataset(1000, mean, cov)
optimizer = torch.optim.SGD(induced_density.parameters(), lr=0.001)
num_epochs = 100
for epoch in range(0, num_epochs):
    for sample in dataset:
        optimizer.zero_grad()
        negative_log_likelihood = nll_loss(sample)
        negative_log_likelihood.backward()
        optimizer.step()
Note that this is a very minimal setup - we are doing standard SGD with a fixed learning rate. This code is for illustration purposes - in reality one would be more careful about the training setup.
When training is finished, the underlying transformation parameters \( \mathbf{A}, \mathbf{b} \) have been optimized so that samples from the source distribution passed through the flow will be approximately distributed as the samples from the target distribution. We can visualize this by plotting how the induced density changes over the course of training, as \(\mathbf{A}\) and \(\mathbf{b}\) get closer and closer to the correct target.
Figure 7: The induced density evolves over training towards approximating the target distribution.
We see that in the first step, there's a very large update - the gradients are initially very large, as the training data is exceedingly unlikely under the source distribution. After the first epoch, the density is updated more gradually; nevertheless, the induced density quickly settles on something that looks very similar to our target distribution.
Now that the flow has been trained, we can take any sample in the source space and transform it to the target space. In the figure below, clicking on a point on top of the source density (on the left) will show its corresponding point in the target density (on the right).
Figure 8: Samples from the source distribution can now be mapped to the target distribution.
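In code, and continuing the sketch above, mapping a source sample to the target space just means applying the forward flow, i.e. inverting the trained inverted flow:

# Draw a sample from the source distribution and push it through the trained
# forward flow phi(x) = M^{-1} (x - c).
with torch.no_grad():
    source_sample = source.sample()
    target_sample = torch.linalg.solve(induced_density.M, source_sample - induced_density.c)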
Next steps
We have seen in principle how we can train a parameterized flow to transform samples from a source distribution into samples that are distributed according to some target distribution. However, the example presented here is exceedingly simple. Our flow is not a composite of multiple flows - it is just a single step. Moreover, the flow type we trained (an affine transformation) is more or less trivial.
In the next part, we will look at invertible residual flows, which are a more generic type of non-linear flow representable by a neural network. We will also combine multiple flows to be able to approximate more complex target distributions.