Tutorials # Normalizing flows in InvertibleNetworks.jl

Ziyi (Francis) Yin

# #Basic Normalizing Flow Training and Sampling

This notebook is the first in a series of InvertibleNetworks.jl. In this tutorial, we will layout the basic theory behind Normalizing Flows (NFs) and how to use the implementations in InvertibleNetworks.jl to train and sample from a basic NF. The following notebook in the series demonstrates how to train a conditional normalizing flow. You can also learn the concept and ideology of InvertibleNetworks.jl package in the JuliaCon 2021 presentation with the youtube video here.

In this tutorial, we train an NF using the GLOW architecture, which implements:

• Affine couplying layer from Real-NVP
• Ability to activate multiscale for efficient training and compressive behavior in latent $z$
• ActNorms for stable training
• 1x1 Convolutions for channel mixing between affine couplying layers
using InvertibleNetworks
using LinearAlgebra
using PyPlot
using Flux
using Random

import Flux.Optimise: ADAM, update!
Random.seed!(1234)

PyPlot.rc("font", family="serif"); 

## #What is normalizing flow

Normalizing flow (NF) is a type of invertible neural network (INN) containing a series of invertible layers, which aims to learn a probability distribution (e.g. cat images). After training, NF can output a white noise image given an input as a cat image in the distribution. Thanks to its invertibility, we can easily draw sample images from the "cat" distribution by drawing random white noise and apply the inverse of the NF.

## #Target distribution

For this example, we will learn to sample from the 2-D Rosenbrock distribution. Accessible in InvertibleNetworks.jl using its colloquial name: the "banana" distribution. The idea of invertible networks is that we want to learn an invertible nonlinear mapping $G$ such that

$G_{\theta}(x) = z,$
(1)#

where $\theta$ is the network parameter, $x$ samples from the target distribution, $z$ samples from Gaussian distribution (white noise). After training, NF can sample from the target distribution via evaluating the inverse of NF on white noise, i.e., $G^{-1}_{\theta}(z)$. Let's first generate a training set and plot the target banana distribution.

n_train = 60000;
X_train = sample_banana(n_train);
size(X_train) #(nx, ny, n_channels, n_samples) Note: we put 2 dimensions as channels
fig = figure(); title(L"x \sim p_x(x)")
scatter(X_train[1,1,1,1:400], X_train[1,1,2,1:400]; alpha=0.4, label = L"x \sim p_{X}(x)");
xlabel(L"x_1"); ylabel(L"x_2");
xlim(-4,4); ylim(0,30);
legend();

## #Change of variables formula

$p_x(x) = p_z(f(x)) \, |\det \frac{\partial f}{\partial x}|$
(2)#

This formula allows us to evaluate the density of a sample under a monotone function $f : \mathbf{X} \rightarrow \mathbf{Z}$.

This density estimation is what gives us a maximum likelihood framework for training our parameterized Normalizing Flow $G_{\theta}$.

## #Training a normalizing flow

NF training is based on likelihood maximization of the parameterized model $f_{\theta}$ under the log likelihood of samples from the data distribution X:

$\underset{\mathbf{\theta}}{\operatorname{argmax}} \mathbb E_{{x} \sim p(x)} [\log p(x)].$
(3)#

We approximate the expectation with Monte Carlo samples from the training dataset:

$\underset{\mathbf{\theta}}{\operatorname{max}} \frac{1}{N} \sum_{x \in X_{train}} \log p(x).$
(4)#

We make this a minimization problem by looking at the negative loglikelihood and then the change of variabes formula makes this:

$\underset{\mathbf{\theta}}{\operatorname{min}} \frac{1}{N} \sum_{x \in X_{train}} -\log p(x) = \frac{1}{N} \sum_{x \in X_{train}} [\frac{1}{2}\|G_\theta({x})\|_2^2 - \log | \det \nabla_{x} G_\theta(x) | ]$
(5)#

This means that you apply your network to your data $\hat z = G_\theta(x)$ and want $\hat z$ to look like Normal noise. The log det of the jacobian term is making sure that we learn a distribution.

Calling G.backward will set all the gradients of the trainable parameters in G. We can access these parameters and their gradients with get_params and update them with the optimizer of our choice.

Note: since the network is invertible, we do not need to save intermediate states to calculate the gradient. Instead, we only provide the G.backward function with the final output Z and it will recalculate the intermediate states to calculate the gradients at each layer while backpropagating the residual dZ.

function loss(G, X)
batch_size = size(X)[end]

Z, lgdet = G.forward(X)

l2_loss = 0.5*norm(Z)^2 / batch_size  #likelihood under Normal Gaussian training
dZ = Z / batch_size                   #gradient under Normal Gaussian training

G.backward(dZ, Z)  #sets gradients of G wrt output and also logdet terms

return (l2_loss, lgdet)
end
nx          = 1
ny          = 1

#network architecture
n_in        = 2 #put 2d variables into 2 channels
n_hidden    = 16
levels_L    = 1
flowsteps_K = 10

G = NetworkGlow(n_in, n_hidden, levels_L, flowsteps_K;)
#G = G |> gpu

#training parameters
batch_size = 150
maxiter    = cld(n_train, batch_size)

lr = 9f-4

loss_l2_list    = zeros(maxiter)
loss_lgdet_list = zeros(maxiter)

for j = 1:maxiter
Base.flush(Base.stdout)
idx = ((j-1)*batch_size+1):(j*batch_size)

X = X_train[:,:,:,idx]
#x = x |> gpu

losses = loss(G, X) #sets gradients of G

loss_l2_list[j]    = losses
loss_lgdet_list[j] = losses

(j%50==0) && println("Iteration=", j, "/", maxiter,
"; f l2 = ",   loss_l2_list[j],
"; f lgdet = ",loss_lgdet_list[j],
"; f nll objective = ",loss_l2_list[j] - loss_lgdet_list[j])

for p in get_params(G)
end
end

## #Check training objective log

There are various ways to train a NF:

• train your network to convergence of objective
• use earlystopping to prevent overfitting
• check normality of $\hat z = G_{\theta}(x)$ with qq plots
• as a heuristic simply observe $\hat z = G_{\theta}(x)$ until it looks normal under the eyeball norm.
gt_l2 = 0.5*nx*ny*n_in #likelihood of gaussian noise

fig, axs = subplots(3, 1, sharex=true)

axs.plot(loss_l2_list, color="black", linewidth=0.6);
axs.axhline(y=gt_l2,color="red",linestyle="--",label="Normal Noise Likelihood")
axs.set_ylabel("L2 Norm")
axs.yaxis.set_label_coords(-0.09, 0.5)
axs.legend()

axs.plot(loss_lgdet_list, color="black", linewidth=0.6);
axs.set_ylabel("Log DET")
axs.yaxis.set_label_coords(-0.09, 0.5)

axs.plot(loss_l2_list - loss_lgdet_list, color="black", linewidth=0.6);
axs.set_ylabel("Full Objective")
axs.yaxis.set_label_coords(-0.09, 0.5)
axs.set_xlabel("Parameter Update") 

## #Testing a Normalizing Flow

Since we have access to $p_x(x)$ in the simple 2D Rosenbrock distribution, we can verify that generative samples from our trained network $\hat x = G^{-1}_\theta(z)$ look like they come from $p_x(x)$.

We can verify this visually (easy since this is a 2D dataset) and under the ground truth density of $p_x(x)$.

Let's start by taking samples from $z \sim N(0,I)$

num_test_samples = 500;
Z_test = randn(Float32,nx,ny,n_in, num_test_samples);

fig = figure(); title(L"z \sim p_{Z}(z)")
scatter(Z_test[1,1,1,:], Z_test[1,1,2,:]; alpha=0.4, color="black", label = L"z \sim p_{Z}(z)");
xlabel(L"z_1"); ylabel(L"z_2");
xlim(-5,5); ylim(-5,5);
legend();
ax.set_aspect(1);

Pass Normal samples $z \sim N(0,I)$ through the inverse network $\hat x = G^{-1}_\theta(z)$

X_test = G.inverse(Z_test);
trans_num = 150
start_points = [(Z_test[1,1,1,i], Z_test[1,1,2,i]) for i in 1:trans_num]
end_points = [(X_test[1,1,1,i], X_test[1,1,2,i]) for i in 1:trans_num]

fig = figure(figsize=(7,9)); title(L"Transformed latent $z \rightarrow x=G^{-1}_\theta(z)$");

for line in zip(start_points, end_points)
plot([line,line], [line ,line], alpha=0.2, linewidth=0.3, color="black")
scatter(line, line, marker="o",alpha=0.4, color="black")
scatter(line, line, marker="o",alpha=0.4, color="orange")
end
xlabel("First Dimension"); ylabel("Second Dimension");
ylim(-2.5,20); xlim(-2.5,2.5); ax.set_aspect(1)

Visually compare generative samples with samples from the ground truth density $x \sim p_x(x)$

fig = figure(); title(L"Generative samples of  $x \sim p_{\theta}(x)$")
scatter(X_train[1,1,1,1:400], X_train[1,1,2,1:400]; alpha=0.4, label = L"x \sim p_{X}(x)");
scatter(X_test[1,1,1,1:400], X_test[1,1,2,1:400]; alpha=0.4, color="orange", label = L"x \sim p_{\theta}(x) = G_\theta^{-1}(z)");
xlabel(L"x_1"); ylabel(L"x_2");
xlim(-4,4); ylim(0,30);
legend();