<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Back2Numbers</title>
    <link>/</link>
      <atom:link href="/index.xml" rel="self" type="application/rss+xml" />
    <description>Back2Numbers</description>
    <generator>Source Themes Academic (https://sourcethemes.com/academic/)</generator><language>en-us</language><copyright>© Emmanuel Rialland 2021</copyright><lastBuildDate>Fri, 11 Sep 2020 00:00:00 +0000</lastBuildDate>
    <image>
      <url>/images/icon_hu0b7a4cb9992c9ac0e91bd28ffd38dd00_9727_512x512_fill_lanczos_center_2.png</url>
      <title>Back2Numbers</title>
      <link>/</link>
    </image>
    
    <item>
      <title>Normalising Flows and Neural ODEs</title>
      <link>/post/2020/09/11/normalising-flows/</link>
      <pubDate>Fri, 11 Sep 2020 00:00:00 +0000</pubDate>
      <guid>/post/2020/09/11/normalising-flows/</guid>
      <description>
&lt;script src=&#34;./post/2020/09/11/normalising-flows/index_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;

&lt;div id=&#34;TOC&#34;&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#a-few-words-about-generative-models&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;1&lt;/span&gt; A few words about Generative Models&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#introduction&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;1.1&lt;/span&gt; Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#latent-variables&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;1.2&lt;/span&gt; Latent variables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#examples-of-generative-models&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;1.3&lt;/span&gt; Examples of generative models&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#generative-adversarial-networks-gans&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;1.3.1&lt;/span&gt; Generative Adversarial Networks (&lt;strong&gt;GANS&lt;/strong&gt;)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#variational-autoencoders&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;1.3.2&lt;/span&gt; Variational autoencoders&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#limitations&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;1.4&lt;/span&gt; Limitations&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#normalising-flows&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;2&lt;/span&gt; Normalising flows&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#introduction-1&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;2.1&lt;/span&gt; Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#short-example&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;2.2&lt;/span&gt; Short example&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#preamble&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;2.2.1&lt;/span&gt; Preamble&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#dataset&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;2.2.2&lt;/span&gt; Dataset&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#data-loaders&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;2.2.3&lt;/span&gt; Data loaders&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#normalising-flow-module&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;2.2.4&lt;/span&gt; Normalising flow module&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#layer-definition&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;2.2.5&lt;/span&gt; Layer definition&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#latent-space&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;2.2.6&lt;/span&gt; Latent space&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#training&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;2.2.7&lt;/span&gt; Training&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#sampling&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;2.2.8&lt;/span&gt; Sampling&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#into-the-maths&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;2.3&lt;/span&gt; Into the maths&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#training-loss-optimisation-and-information-flow&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;2.4&lt;/span&gt; Training loss optimisation and information flow&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#basic-flows&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;2.5&lt;/span&gt; Basic flows&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#planar-flows&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;2.5.1&lt;/span&gt; Planar Flows&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#planar-flow-example&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;2.5.2&lt;/span&gt; Planar flow example&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#radial-flows&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;2.5.3&lt;/span&gt; Radial flows&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#more-complex-flows&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;2.6&lt;/span&gt; More complex flows&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#residual-flows-discrete-flows&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;2.6.1&lt;/span&gt; Residual flows (discrete flows)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#other-versions&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;2.7&lt;/span&gt; Other versions&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#continuous-flows-and-neural-ordinary-differential-equations&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;3&lt;/span&gt; Continuous Flows and Neural ordinary differential equations&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#introduction-2&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;3.1&lt;/span&gt; Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#continuous-flows-means-no-crossover&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;3.2&lt;/span&gt; Continuous flows means no-crossover&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#training-solving-the-ode&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;3.3&lt;/span&gt; Training / Solving the ODE&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#what-parameters-to-optimise&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;3.4&lt;/span&gt; What parameters to optimise?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#increase-the-complexity-of-a-flow-augmented-flows&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;3.5&lt;/span&gt; Increase the complexity of a flow: Augmented flows&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#decrease-the-complexity-of-a-flow-regularisation-and-stability&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;3.6&lt;/span&gt; Decrease the complexity of a flow: Regularisation and stability&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#other&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;3.7&lt;/span&gt; Other&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#references&#34;&gt;References&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#literature&#34;&gt;Literature&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#web-references&#34;&gt;Web references&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;[UPDATE 1: Code comments. Julia version.]&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;One of the three best papers awarded at NIPS 2018 was &lt;em&gt;Neural Ordinary Differential Equations&lt;/em&gt; by Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt and David Duvenaud &lt;span class=&#34;citation&#34;&gt;(&lt;a href=&#34;#ref-chenNeuralOrdinaryDifferential2019&#34; role=&#34;doc-biblioref&#34;&gt;Chen et al. 2019&lt;/a&gt;)&lt;/span&gt;. Since then, the field has developed in multiple directions. This post goes through some background about generative models, normalising flows and finally a few of the underlying ideas of the paper. The form does not intend to be mathematically rigorous but to convey some intuitions.&lt;/p&gt;
&lt;hr /&gt;
&lt;div id=&#34;a-few-words-about-generative-models&#34; class=&#34;section level1&#34; number=&#34;1&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;1&lt;/span&gt; A few words about Generative Models&lt;/h1&gt;
&lt;div id=&#34;introduction&#34; class=&#34;section level2&#34; number=&#34;1.1&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;1.1&lt;/span&gt; Introduction&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Generative models&lt;/strong&gt; are about learning simple &lt;strong&gt;representations&lt;/strong&gt; of complex datasets: how, from a few parameters, to generate realistic samples that are similar to a given dataset, with similar probabilities of occurrence. Those few parameters usually follow simple distributions (e.g. uniform or Gaussian), and are transformed through complex transformations into the more complex dataset distribution. This is an unsupervised procedure which, in a sense, mirrors clustering methods: clustering starts from the dataset and summarises it into a few parameters.&lt;/p&gt;
&lt;p&gt;Although unsupervised, the result of this learning can be used as a &lt;strong&gt;pretraining&lt;/strong&gt; step in a later supervised context, or where that dataset is a mix of labelled and un-labelled data. The properties of the well-understood starting probability distributions can then help draw conclusions about the dataset’s distribution or generate synthetic datasets.&lt;/p&gt;
&lt;p&gt;The same methods can also be used in supervised learning to learn the representation of a target dataset (categorical or continuous) as a transformation of the features dataset. The unsupervised becomes supervised.&lt;/p&gt;
&lt;p&gt;What does &lt;em&gt;representation learning&lt;/em&gt; actually mean? It is the automatic search for a few parameters that encapsulate rich enough information to generate a dataset. Generative models learn those parameters and, starting from them, how to re-create samples similar to the original dataset.&lt;/p&gt;
&lt;p&gt;Let’s use cars as an analogy.&lt;/p&gt;
&lt;p&gt;All cars have 4 wheels, an engine, brakes and seats. One could be interested in comfort, or racing them, or lugging things around, or safety, or fitting as many kids as possible. Each base vector could express any one of those characteristics, but all cars will have an engine, brakes and seats. The generation function recreates everything that is common. It doesn’t matter if the car is comfy or not; it needs seats and a steering wheel. The generative function has to create those features. However, the exact number of cylinders, the shape, the seat fabric, or the stiffness of the suspension all depend on the type of car.&lt;/p&gt;
&lt;p&gt;The true fundamentals are not obvious. For a long time, American cars had softer suspension than European cars. The definition of comfortable is relative. The performance of an old car is objectively not the same as that of a new one. Maybe other characteristics are more relevant to generate. Maybe price? Consumption? Year of coming to market? All those factors are obviously inter-related.&lt;/p&gt;
&lt;p&gt;Generative models are more than generating samples from a few fundamental parameters. They also learn what those parameters should be.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;latent-variables&#34; class=&#34;section level2&#34; number=&#34;1.2&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;1.2&lt;/span&gt; Latent variables&lt;/h2&gt;
&lt;p&gt;Still using the car analogy, if the year of a model was not given, the generative process might still be able to conclude that the model year &lt;em&gt;should&lt;/em&gt; be an implicit parameter to be learned, since it is relevant to generating the dataset: year is an unstated parameter that explains the dataset. Both the Lamborghini Miura and the Lamborghini Countach were similar in terms of perceived performance and exclusivity at the time they were created. But their actual performance and styling were incredibly different.&lt;/p&gt;
&lt;p&gt;If looking at the stock market: take a set of market prices at a given date; it would have significantly different meanings in a bull or a bear market. Market regime would be a reasonable latent variable.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;examples-of-generative-models&#34; class=&#34;section level2&#34; number=&#34;1.3&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;1.3&lt;/span&gt; Examples of generative models&lt;/h2&gt;
&lt;p&gt;There are quite a number of generative models, such as restricted Boltzmann machines and deep belief networks. Refer to &lt;span class=&#34;citation&#34;&gt;(&lt;a href=&#34;#ref-theodoridisMachineLearningBayesian2020&#34; role=&#34;doc-biblioref&#34;&gt;Theodoridis 2020&lt;/a&gt;)&lt;/span&gt; and &lt;span class=&#34;citation&#34;&gt;(&lt;a href=&#34;#ref-russellArtificialIntelligenceModern2020&#34; role=&#34;doc-biblioref&#34;&gt;Russell and Norvig 2020&lt;/a&gt;)&lt;/span&gt; for example. Let’s consider generative adversarial networks and variational auto-encoders.&lt;/p&gt;
&lt;div id=&#34;generative-adversarial-networks-gans&#34; class=&#34;section level3&#34; number=&#34;1.3.1&#34;&gt;
&lt;h3&gt;&lt;span class=&#34;header-section-number&#34;&gt;1.3.1&lt;/span&gt; Generative Adversarial Networks (&lt;strong&gt;GANS&lt;/strong&gt;)&lt;/h3&gt;
&lt;p&gt;Recently, GANs have risen to the fore as a way to generate artificial datasets that are, for some definition, indistinguishable from a real dataset. They consist of two parts:&lt;/p&gt;
&lt;div class=&#34;figure&#34; style=&#34;text-align: center&#34;&gt;&lt;span id=&#34;fig:fig-GAN&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;assets/generative-adversarial-network.png&#34; alt=&#34;**Generative Adversarial Networks** *(source: [@hitawalaComparativeStudyGenerative2018]))*&#34; width=&#34;456&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 1.1: &lt;strong&gt;Generative Adversarial Networks&lt;/strong&gt; &lt;em&gt;(source: &lt;span class=&#34;citation&#34;&gt;(&lt;a href=&#34;#ref-hitawalaComparativeStudyGenerative2018&#34; role=&#34;doc-biblioref&#34;&gt;Hitawala 2018&lt;/a&gt;)&lt;/span&gt;))&lt;/em&gt;
&lt;/p&gt;
&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;A generator which is the generative model itself: given a simple representation, the generator proposes samples that aim to be indistinguishable from the dataset samples.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A discriminator whose job is to identify whether a sample comes from the generator or from the dataset.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Both are trained simultaneously:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;if the discriminator finds it obvious to guess, the generator is not doing a good job and needs to improve;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;if the discriminator guesses 50/50 (does no better than flipping a coin), it is not doing a good job and has to discover which dataset features are truly relevant.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
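This tug-of-war can be sketched numerically. The following is a minimal illustration in plain NumPy (not the post's later PyTorch code) of the two binary cross-entropy losses the players optimise; `d_real` and `d_fake` stand for the discriminator's probability estimates on real and generated samples:

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    # The discriminator wants D(real) close to 1 and D(fake) close to 0.
    return -np.mean(np.log(d_real) + np.log(1.0 - d_fake))

def generator_loss(d_fake):
    # The generator wants to fool the discriminator: D(fake) close to 1.
    return -np.mean(np.log(d_fake))

# A confident, correct discriminator achieves a lower loss than a coin-flipping one ...
confident = discriminator_loss(np.array([0.9]), np.array([0.1]))
coin_flip = discriminator_loss(np.array([0.5]), np.array([0.5]))

# ... while the generator's loss shrinks as the discriminator gets fooled.
fooled   = generator_loss(np.array([0.9]))
detected = generator_loss(np.array([0.1]))
```

At the 50/50 point, each player's gradient pushes the other to improve, which is the balance described above.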
&lt;/div&gt;
&lt;div id=&#34;variational-autoencoders&#34; class=&#34;section level3&#34; number=&#34;1.3.2&#34;&gt;
&lt;h3&gt;&lt;span class=&#34;header-section-number&#34;&gt;1.3.2&lt;/span&gt; Variational autoencoders&lt;/h3&gt;
&lt;p&gt;A successful GAN can replicate the richness of a dataset, but not its probability distribution. A GAN can generate a large number of correct sentences, but will not tell how likely that sentence is to occur (or at least guarantee that the distributions match). &lt;em&gt;‘The dog chases the cat’&lt;/em&gt; and &lt;em&gt;‘The Chihuahua chases the cat’&lt;/em&gt; are both perfectly valid, but the latter is less likely to appear.&lt;/p&gt;
&lt;p&gt;Generally speaking, autoencoders learn an &lt;em&gt;encoder&lt;/em&gt; that takes a sample to generate a vector in a latent space, and a &lt;em&gt;decoder&lt;/em&gt; that generates samples from latent state variables. The encoder and the decoder really mirror each other. However, this general approach does not learn how to sample from the latent space. Sampling randomly from the latent space may generate perfectly valid data (i.e. very similar to that in the training dataset), but the distribution of a generated dataset and the training dataset would likely be very different. This is the same problem GANs face.&lt;/p&gt;
&lt;div class=&#34;figure&#34; style=&#34;text-align: center&#34;&gt;&lt;span id=&#34;fig:fig-VAE&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;assets/VAE.png&#34; alt=&#34;**Variational Auto-Encoder** *(source: [Shenlong Wang](http://www.cs.toronto.edu/~urtasun/courses/CSC2541_Winter17/Deep_generative_models.pdf))*&#34; width=&#34;679&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 1.2: &lt;strong&gt;Variational Auto-Encoder&lt;/strong&gt; &lt;em&gt;(source: &lt;a href=&#34;http://www.cs.toronto.edu/~urtasun/courses/CSC2541_Winter17/Deep_generative_models.pdf&#34;&gt;Shenlong Wang&lt;/a&gt;)&lt;/em&gt;
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Variational autoencoders (VAEs) take another approach. Instead of just learning a function representing the data, they learn the parameters of a probability distribution representing the data. We can then sample from the distribution and generate new input data samples. The encoder and the decoder are trained simultaneously on the dataset samples: each sample is projected into the latent space, a generated sample is proposed from that projection, and the training minimises the reconstruction loss. The encoder actually learns the mean and standard deviation of each latent variable, each being a normal distribution. The samples generated will be as rich as the GAN’s, but the probability of a sample being generated will depend on the learned distributions.&lt;/p&gt;
&lt;p&gt;See &lt;span class=&#34;citation&#34;&gt;(&lt;a href=&#34;#ref-kingmaIntroductionVariationalAutoencoders2019&#34; role=&#34;doc-biblioref&#34;&gt;Kingma and Welling 2019&lt;/a&gt;)&lt;/span&gt; for an approachable extensive introduction. The details include implementation aspects (in particular the &lt;em&gt;reparametrisation trick&lt;/em&gt;) that are critical to the success of this approach.&lt;/p&gt;
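As a small sketch of what the reparametrisation trick amounts to (plain NumPy, rather than a full VAE): instead of sampling a latent variable directly from N(mu, sigma²), which is not differentiable, one samples noise from N(0, 1) and computes the latent value deterministically from the learned mean and log-variance:

```python
import numpy as np

def reparameterise(mu, log_var, rng):
    # z = mu + sigma * eps: the randomness sits entirely in eps, so gradients
    # can flow through mu and log_var during training.
    eps = rng.standard_normal(np.shape(mu))
    return mu + np.exp(0.5 * log_var) * eps

rng = np.random.default_rng(0)
# 100,000 draws with mean 2 and variance 0.25 (i.e. standard deviation 0.5)
z = reparameterise(np.full(100_000, 2.0), np.full(100_000, np.log(0.25)), rng)
```

The empirical mean and standard deviation of `z` recover the requested parameters, confirming the draws follow N(mu, sigma²).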
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;limitations&#34; class=&#34;section level2&#34; number=&#34;1.4&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;1.4&lt;/span&gt; Limitations&lt;/h2&gt;
&lt;p&gt;We limited the introduction to those two techniques to merely highlight three fundamental aspects that generative models aim at:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;find a simple representation;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;explore and replicate the richness of the dataset;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;replicate the probability distribution of the dataset.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Note that depending on the circumstances, the last aim may not necessarily be important.&lt;/p&gt;
&lt;p&gt;As usual, training and optimisation methods are at risk of getting stuck at local optima. In the case of those two techniques, this manifests itself in different ways:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;GANs Mode collapse&lt;/em&gt;: Mode collapse occurs in GANs when the generator only explores limited domains. Imagine training a GAN to generate mammals (the dataset would contain kangaroos, whales, dogs and cats…). If the generator proposes everything but kangaroos, it still properly generates mammals, but obviously misses out on a few possibilities. Essentially, the generator reaches a local minimum where a vanishing gradient becomes too small to explore alternatives. This is in part due to the difficulty of progressing the training of both the generator and the discriminator in a way that does not lock either of them in a local optimum while the other still needs improving: if either converges too rapidly, the other will struggle to catch up.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;VAEs Posterior collapse&lt;/em&gt;: Posterior collapse in VAEs arises when the generative model learns to ignore a subset of the latent variables (although the encoder generates those variables) &lt;span class=&#34;citation&#34;&gt;(&lt;a href=&#34;#ref-lucasDonBlameELBO2019&#34; role=&#34;doc-biblioref&#34;&gt;Lucas et al. 2019&lt;/a&gt;)&lt;/span&gt;. This happens when (1) a subset of the latent variable space is good enough to generate a reasonable approximation of the dataset and its distribution, and (2) the loss function does not yield large enough gradients to explore other latent variables to further improve the encoder. (More technically, it happens when the variational distribution closely matches the uninformative prior for a subset of latent variables &lt;span class=&#34;citation&#34;&gt;(&lt;a href=&#34;#ref-tuckerUnderstandingPosteriorCollapse2019&#34; role=&#34;doc-biblioref&#34;&gt;Tucker et al. 2019&lt;/a&gt;)&lt;/span&gt;.) The exact reasons for this are not entirely understood and this remains an active area of research (refer to this extensive list of &lt;a href=&#34;https://github.com/sajadn/posterior-collapse-list&#34;&gt;papers&lt;/a&gt; on the topic).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In the next section, we will get into another approach called &lt;em&gt;Normalising Flows&lt;/em&gt; which, as we will see, addresses those two difficulties. Intuitively:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Mode collapse reflects that the generative process does not generate enough possibilities; that the spectrum of possibilities is not as rich as that of the dataset. Normalising flows attempt to address this in two ways. Firstly, their optimisation process aims at matching the amount of information captured by the learned representation to that of the dataset (in the sense of information theory). Secondly, we will see that normalising flows allow us to start from a sample in the dataset, flow back to the simple distribution, and estimate how (un)likely the generative model would have been to generate this sample.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Posterior collapse could simply be a mismatch between the number of latent variables and the dimensionality of the dataset. As we will see, normalising flows impose that the generative model be a bijection, which takes away the choice of the number of dimensions (although this shifts the issue to one of parameter regularisation).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;On a final note, it will not be surprising that GANs and VAEs have been combined (see &lt;span class=&#34;citation&#34;&gt;(&lt;a href=&#34;#ref-larsenAutoencodingPixelsUsing2016&#34; role=&#34;doc-biblioref&#34;&gt;Larsen et al. 2016&lt;/a&gt;)&lt;/span&gt;).&lt;/p&gt;
&lt;hr /&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;normalising-flows&#34; class=&#34;section level1&#34; number=&#34;2&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;2&lt;/span&gt; Normalising flows&lt;/h1&gt;
&lt;p&gt;Normalising Flows became popular around 2015 with two papers on density estimation &lt;span class=&#34;citation&#34;&gt;(&lt;a href=&#34;#ref-dinhNICENonlinearIndependent2015&#34; role=&#34;doc-biblioref&#34;&gt;Dinh, Krueger, and Bengio 2015&lt;/a&gt;)&lt;/span&gt; and use of variational inference &lt;span class=&#34;citation&#34;&gt;(&lt;a href=&#34;#ref-rezendeVariationalInferenceNormalizing2016&#34; role=&#34;doc-biblioref&#34;&gt;Rezende and Mohamed 2016&lt;/a&gt;)&lt;/span&gt;. However, one should note that the concepts predated those papers. See &lt;span class=&#34;citation&#34;&gt;(&lt;a href=&#34;#ref-kobyzevNormalizingFlowsIntroduction2020a&#34; role=&#34;doc-biblioref&#34;&gt;Kobyzev, Prince, and Brubaker 2020&lt;/a&gt;)&lt;/span&gt; and &lt;span class=&#34;citation&#34;&gt;(&lt;a href=&#34;#ref-papamakariosNormalizingFlowsProbabilistic2019&#34; role=&#34;doc-biblioref&#34;&gt;Papamakarios et al. 2019&lt;/a&gt;)&lt;/span&gt; for recent survey papers.&lt;/p&gt;
&lt;div id=&#34;introduction-1&#34; class=&#34;section level2&#34; number=&#34;2.1&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;2.1&lt;/span&gt; Introduction&lt;/h2&gt;
&lt;p&gt;One important limitation of the approaches described above is that the generation/decoding flow is unidirectional: one starts from a source distribution, sometimes with well-known properties, and generates a richer target distribution. However, given a particular sample in the target distribution, there is no guaranteed way to identify where it would fall in the latent space distribution. That flow of transformation from source to target is not guaranteed to be bijective or invertible (same meaning, different crowds).&lt;/p&gt;
&lt;p&gt;Normalising flows are a generic solution to that issue: a transformation from a simple distribution (e.g. uniform or normal) to a more complex distribution through an invertible and differentiable mapping, where the probability density of a sample can be evaluated by transforming it back to the original distribution. The density is evaluated by computing the density of the normalised inverse-transformed sample. The word &lt;em&gt;normalising&lt;/em&gt; refers to the normalisation of the transformation, not to the fact that the original distribution &lt;em&gt;could&lt;/em&gt; be normal.&lt;/p&gt;
&lt;p&gt;In practice, this is a bit too general to be of any use. Let’s break this down:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The original distribution is simple with well-known statistical properties: i.i.d. Gaussian or uniform distributions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The transformation function is expected to be complicated, and is normally specified as a series of successive transformations, each simpler (though expressive enough) and easy to parametrise.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Each simple transformation is itself invertible and differentiable, therefore guaranteeing that the overall transformation is too.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We want the transformation to be &lt;em&gt;normalised&lt;/em&gt;: the cumulative probability density of the generated targets from latent variables has to equal 1. Otherwise, flowing backwards to use the properties of the original distribution would make no sense.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&#34;figure&#34; style=&#34;text-align: center&#34;&gt;&lt;span id=&#34;fig:fig-NF&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;assets/normalizing-flows-rezende2015.png&#34; alt=&#34;**Normalizing Flows** *(Source: [@rezendeVariationalInferenceNormalizing2016])*&#34; width=&#34;617&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 2.1: &lt;strong&gt;Normalizing Flows&lt;/strong&gt; &lt;em&gt;(Source: &lt;span class=&#34;citation&#34;&gt;(&lt;a href=&#34;#ref-rezendeVariationalInferenceNormalizing2016&#34; role=&#34;doc-biblioref&#34;&gt;Rezende and Mohamed 2016&lt;/a&gt;)&lt;/span&gt;)&lt;/em&gt;
&lt;/p&gt;
&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Geometrically, the probability distribution around each point in the latent variables space is a small volume that is successively transformed with each transformation. Keeping track of all the volume changes ensures that we can relate probability density functions in the original space and the target space.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How to keep track? This is where the condition of having invertible and differentiable transformations becomes important. (Math-speak: we have a series of diffeomorphisms which are transformations from one infinitesimal volume to another. They are invertible and differentiable, and their inverses are also differentiable.) If one imagines that small volume of space around a starting point, that volume gets distorted along the way. At each point, the transformation is differentiable and can be approximated by a linear transformation (a matrix). That matrix is the Jacobian of the transformation at that point (diffeomorphism also means that the Jacobian matrix exists and is invertible). Being invertible, the matrix has no zero eigenvalues and the change of volume is locally equal to the product of all the eigenvalues (more precisely, their absolute values): the volume gets squeezed along some dimensions, expanded along others. Rotations are irrelevant. The product of the eigenvalues is the determinant of the matrix. A negative eigenvalue would mean that the infinitesimal volume is ‘flipped’ along that direction. That sign is irrelevant: the local volume change is therefore the absolute value of the determinant.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We can already anticipate a computation nightmare: determinants are computationally very heavy. Additionally, in order to backpropagate a loss to optimise the transformations’ parameters, we will need the Jacobians of the inverse transformations (the inverse of the transformation Jacobian). Without further simplifying assumptions or tricks, normalising flows would be impractical for large dimensions.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
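To make the change-of-variables bookkeeping concrete, here is a minimal NumPy sketch (not from the post's own code) with a single invertible linear map standing in for the flow: the log-density of a transformed sample is the base log-density of the inverse-mapped point minus log|det J|, where J here is just the constant matrix `A`:

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [1.0, 0.5]])       # an invertible linear map, x = A z + b
b = np.array([1.0, -1.0])

def base_log_density(z):
    # log-density of a standard bivariate normal, evaluated row-wise
    return -0.5 * np.sum(z**2, axis=-1) - np.log(2.0 * np.pi)

def flow_log_density(x):
    # change of variables: log p_x(x) = log p_z(A^{-1}(x - b)) - log|det A|
    z = np.linalg.solve(A, (x - b).T).T
    return base_log_density(z) - np.log(abs(np.linalg.det(A)))
```

Since x = A z + b with z standard normal is exactly N(b, A A^T), the value returned by `flow_log_density` can be checked against the closed-form Gaussian log-density. For a deep flow the same accounting is done layer by layer, which is precisely where the determinant cost mentioned above comes from.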
&lt;/div&gt;
&lt;div id=&#34;short-example&#34; class=&#34;section level2&#34; number=&#34;2.2&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;2.2&lt;/span&gt; Short example&lt;/h2&gt;
&lt;p&gt;We will use examples from the &lt;a href=&#34;https://torchdyn.readthedocs.io/en/latest/&#34;&gt;&lt;code&gt;Torchdyn&lt;/code&gt; library&lt;/a&gt;. &lt;code&gt;Torchdyn&lt;/code&gt; builds on &lt;code&gt;Pytorch&lt;/code&gt; and the polish of the &lt;a href=&#34;https://pytorch-lightning.readthedocs.io/en/latest/&#34;&gt;&lt;code&gt;Pytorch Lightning&lt;/code&gt; library&lt;/a&gt; which streamlines a lot of the &lt;code&gt;Pytorch&lt;/code&gt; boilerplate.&lt;/p&gt;
&lt;p&gt;In this example, we try to model a dataset distribution which is the superposition of 6 bivariate normal distributions centred on the vertices of a hexagon. The idea is to learn how to map and transform a simple distribution (a single bivariate normal distribution) into that distribution with 6 modes.&lt;/p&gt;
&lt;div id=&#34;preamble&#34; class=&#34;section level3&#34; number=&#34;2.2.1&#34;&gt;
&lt;h3&gt;&lt;span class=&#34;header-section-number&#34;&gt;2.2.1&lt;/span&gt; Preamble&lt;/h3&gt;
&lt;p&gt;First some usual imports.&lt;/p&gt;
&lt;div id=&#34;python-version&#34; class=&#34;section level4&#34; number=&#34;2.2.1.1&#34;&gt;
&lt;h4&gt;&lt;span class=&#34;header-section-number&#34;&gt;2.2.1.1&lt;/span&gt; Python version&lt;/h4&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;import sys 

import matplotlib.pyplot as plt

# Pytorch provides the autodifferentation and the neural networks
import torch
import torch.utils.data as data
from torch.distributions import MultivariateNormal

import torchdyn
from torchdyn.models import CNF, NeuralDE, REQUIRES_NOISE
from torchdyn.datasets import ToyDataset

import pytorch_lightning.core.lightning as pl

device = torch.device(&amp;quot;cuda&amp;quot; if torch.cuda.is_available() else &amp;quot;cpu&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;julia-version&#34; class=&#34;section level4&#34; number=&#34;2.2.1.2&#34;&gt;
&lt;h4&gt;&lt;span class=&#34;header-section-number&#34;&gt;2.2.1.2&lt;/span&gt; Julia version&lt;/h4&gt;
&lt;pre class=&#34;julia&#34;&gt;&lt;code&gt;using Random, Distributions, Plots, GR, LinearAlgebra

# Getting ready for GPUs is OK given the automatic fallback to CPU
using CUDA&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;dataset&#34; class=&#34;section level3&#34; number=&#34;2.2.2&#34;&gt;
&lt;h3&gt;&lt;span class=&#34;header-section-number&#34;&gt;2.2.2&lt;/span&gt; Dataset&lt;/h3&gt;
&lt;p&gt;For this simple example, we will work with six Gaussians, each centred on a vertex of a hexagon.&lt;/p&gt;
&lt;div id=&#34;python-version-1&#34; class=&#34;section level4&#34; number=&#34;2.2.2.1&#34;&gt;
&lt;h4&gt;&lt;span class=&#34;header-section-number&#34;&gt;2.2.2.1&lt;/span&gt; Python version&lt;/h4&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;# The dataset has about 16k samples
n_samples = 1 &amp;lt;&amp;lt; 14

# That will be spread across 6 Gaussians on a plane.
n_gaussians = 6

# Torchdyn has a helper function to generate the dataset.
X, yn = ToyDataset().generate(n_samples // n_gaussians, 
                              &amp;#39;gaussians&amp;#39;, 
                              n_gaussians=n_gaussians, 
                              std_gaussians=0.5, 
                              radius=4, dim=2)

# Z-score the generated dataset.
X = (X - X.mean())/X.std()

# Let&amp;#39;s look what we have
plt.figure(figsize=(5, 5))
plt.scatter(X[:,0], X[:,1], c=&amp;#39;black&amp;#39;, alpha=0.2, s=1.)&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34; style=&#34;text-align: center&#34;&gt;&lt;span id=&#34;fig:fig-cnf-datasest&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;assets/cnf-1.png&#34; alt=&#34;Toy dataset&#34; width=&#34;2050&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 2.2: Toy dataset
&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;julia-version-1&#34; class=&#34;section level4&#34; number=&#34;2.2.2.2&#34;&gt;
&lt;h4&gt;&lt;span class=&#34;header-section-number&#34;&gt;2.2.2.2&lt;/span&gt; Julia version&lt;/h4&gt;
&lt;pre class=&#34;julia&#34;&gt;&lt;code&gt;# The Julia version closely follows the Python one but we do not have the benefit of the helper function.

n_samples = 1 &amp;lt;&amp;lt; 14
n_gaussians = 6
n_dims = 2

t_span = (0., 1.)
t_steps = 50

x_span = y_span = -2.5:0.1:2.5

X_span = repeat(x_span&amp;#39;, length(y_span), 1)
Y_span = repeat(y_span,  1,              length(x_span))


function generate_gaussians(; n_dims = n_dims, n_samples=100, n_gaussians=7, 
        radius=1.f0, std_gaussians=0.2f0, noise=0.001f0)

    x = zeros(Float64, n_dims, n_samples * n_gaussians)
    y = zeros(Float64, n_samples * n_gaussians)
    incremental_angle = 2 * π / n_gaussians
    
    dist_gaussian = MvNormal(n_dims, sqrt(std_gaussians))

    if n_dims &amp;gt; 2
        dist_noise = MvNormal(n_dims - 2, sqrt(noise))
    end
    
    current_angle = 0.0f0
    for i ∈ 1:n_gaussians
        current_loc = zeros(Float32, n_dims, 1)
        if n_dims &amp;gt;= 1
            current_loc[1] = radius * cos(current_angle)
        end
        
        if n_dims &amp;gt;= 2
            current_loc[2] = radius * sin(current_angle)
        end
        
        x[1:n_dims, (i-1)*n_samples+1:i*n_samples] = current_loc[1:n_dims] .+ rand(dist_gaussian, n_samples)
        if n_dims &amp;gt; 2
            x[3:n_dims, (i-1)*n_samples+1:i*n_samples] = rand(dist_noise, n_samples)
        end
        
        
        y[   (i-1)*n_samples+1:i*n_samples] = Float32(i) .* ones(Float32, n_samples)
        
        current_angle = current_angle + incremental_angle
    end
    
    return Float64.(x), Float64.(y)
end


X, Y = generate_gaussians(; n_samples = n_samples ÷ n_gaussians, 
                            n_gaussians = n_gaussians, 
                            radius = 4.0f0, 
                            std_gaussians = 0.5f0)
X = (X .- mean(X)) ./ std(X)
X_SIZE = size(X)[2]


# We will continue onward using the Plotly backend
plotly() 
if n_dims == 1
    histogram(X[1, :], title = &amp;quot;Sample from the true density&amp;quot;)
else
    scatter(X[1, :], X[2, :], title = &amp;quot;Sample from the true density&amp;quot;, markershape=:cross, markersize=1)
end&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34; style=&#34;text-align: center&#34;&gt;&lt;span id=&#34;fig:fig-julia-datasest&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;assets/Julia_Hexagon.png&#34; alt=&#34;Toy dataset&#34; width=&#34;300&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 2.3: Toy dataset
&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;data-loaders&#34; class=&#34;section level3&#34; number=&#34;2.2.3&#34;&gt;
&lt;h3&gt;&lt;span class=&#34;header-section-number&#34;&gt;2.2.3&lt;/span&gt; Data loaders&lt;/h3&gt;
&lt;div id=&#34;python-version-2&#34; class=&#34;section level4&#34; number=&#34;2.2.3.1&#34;&gt;
&lt;h4&gt;&lt;span class=&#34;header-section-number&#34;&gt;2.2.3.1&lt;/span&gt; Python version&lt;/h4&gt;
&lt;p&gt;We create data loaders for batches of 1,024:&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;X_train = torch.Tensor(X).to(device)
y_train = torch.LongTensor(yn).long().to(device)

train = data.TensorDataset(X_train, y_train)
trainloader = data.DataLoader(train, batch_size=1024, shuffle=True)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;julia-version-2&#34; class=&#34;section level4&#34; number=&#34;2.2.3.2&#34;&gt;
&lt;h4&gt;&lt;span class=&#34;header-section-number&#34;&gt;2.2.3.2&lt;/span&gt; Julia version&lt;/h4&gt;
&lt;p&gt;Not needed.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;normalising-flow-module&#34; class=&#34;section level3&#34; number=&#34;2.2.4&#34;&gt;
&lt;h3&gt;&lt;span class=&#34;header-section-number&#34;&gt;2.2.4&lt;/span&gt; Normalising flow module&lt;/h3&gt;
&lt;div id=&#34;python-version-3&#34; class=&#34;section level4&#34; number=&#34;2.2.4.1&#34;&gt;
&lt;h4&gt;&lt;span class=&#34;header-section-number&#34;&gt;2.2.4.1&lt;/span&gt; Python version&lt;/h4&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;# Continuous Normalising Flows require an estimate of the trace of the Jacobian matrix.
# This will be explained further down.
def autograd_trace(x_out, x_in, **kwargs):
    &amp;quot;&amp;quot;&amp;quot;Standard brute-force means of obtaining trace of the Jacobian, O(d) calls to autograd&amp;quot;&amp;quot;&amp;quot;
    trJ = 0.
    for i in range(x_in.shape[1]):
        trJ += torch.autograd.grad(x_out[:, i].sum(), x_in, allow_unused=False, create_graph=True)[0][:, i]
    return trJ

# Continuous Normalising Flows
class CNF(nn.Module):
    def __init__(self, net, trace_estimator=None, noise_dist=None):
        super().__init__()

        self.net = net
        self.noise_dist, self.noise = noise_dist, None

        self.trace_estimator = trace_estimator if trace_estimator is not None else autograd_trace;
        if self.trace_estimator in REQUIRES_NOISE:
            assert self.noise_dist is not None, &amp;#39;This type of trace estimator requires specification of a noise distribution&amp;#39;

    def forward(self, x):
        with torch.set_grad_enabled(True):
            # first dimension reserved to divergence propagation
            x_in = torch.autograd.Variable(x[:,1:], requires_grad=True).to(x) 
            
            # the neural network will handle the data-dynamics here
            x_out = self.net(x_in)

            trJ = self.trace_estimator(x_out, x_in, noise=self.noise)
        
        # `+ 0*x` has the only purpose of connecting x[:, 0] to autograd graph
        return torch.cat([-trJ[:, None], x_out], 1) + 0*x &lt;/code&gt;&lt;/pre&gt;
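&lt;p&gt;As a sanity check on the brute-force estimator: for a linear map &lt;span class=&#34;math inline&#34;&gt;\(f(\vec{x}) = W\vec{x}\)&lt;/span&gt;, the Jacobian is &lt;span class=&#34;math inline&#34;&gt;\(W\)&lt;/span&gt; everywhere, so the trace of the Jacobian must equal &lt;span class=&#34;math inline&#34;&gt;\(\operatorname{tr}(W)\)&lt;/span&gt;. A torch-free sketch using finite differences (the matrix and the point are arbitrary):&lt;/p&gt;

```python
# A linear map f(x) = W x has Jacobian W everywhere, so the
# trace of the Jacobian is simply trace(W).
W = [[1.5, -0.3],
     [0.7,  2.0]]

def f(x):
    return [W[0][0] * x[0] + W[0][1] * x[1],
            W[1][0] * x[0] + W[1][1] * x[1]]

x = [0.4, -1.2]
eps = 1e-6

# Finite-difference trace: accumulate d f_i / d x_i over the diagonal,
# mirroring the O(d) loop of autograd calls in autograd_trace
trJ = 0.0
for i in range(2):
    x_eps = list(x)
    x_eps[i] += eps
    trJ += (f(x_eps)[i] - f(x)[i]) / eps

assert abs(trJ - (W[0][0] + W[1][1])) < 1e-4
```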
&lt;/div&gt;
&lt;div id=&#34;julia-version-3&#34; class=&#34;section level4&#34; number=&#34;2.2.4.2&#34;&gt;
&lt;h4&gt;&lt;span class=&#34;header-section-number&#34;&gt;2.2.4.2&lt;/span&gt; Julia version&lt;/h4&gt;
&lt;p&gt;Not needed.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;layer-definition&#34; class=&#34;section level3&#34; number=&#34;2.2.5&#34;&gt;
&lt;h3&gt;&lt;span class=&#34;header-section-number&#34;&gt;2.2.5&lt;/span&gt; Layer definition&lt;/h3&gt;
&lt;div id=&#34;python-version-4&#34; class=&#34;section level4&#34; number=&#34;2.2.5.1&#34;&gt;
&lt;h4&gt;&lt;span class=&#34;header-section-number&#34;&gt;2.2.5.1&lt;/span&gt; Python version&lt;/h4&gt;
&lt;p&gt;We build a &lt;code&gt;NeuralDE&lt;/code&gt; model with a single transformation modelled as a multi-layer perceptron. As we will see, this transformation expresses infinitesimal changes of states. It is the same transformation that is applied from the starting state (the input) all the way to the output.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;f = nn.Sequential(
        nn.Linear(2, 64),
        nn.Softplus(),
        nn.Linear(64, 64),
        nn.Softplus(),
        nn.Linear(64, 64),
        nn.Softplus(),
        nn.Linear(64, 2),
    )

# cnf wraps the net as with other energy models
# default trace_estimator, when not specified, is autograd_trace
cnf = CNF(f, trace_estimator=autograd_trace)
nde = NeuralDE(cnf, solver=&amp;#39;dopri5&amp;#39;, s_span=torch.linspace(0, 1, 2), sensitivity=&amp;#39;adjoint&amp;#39;, atol=1e-4, rtol=1e-4)

multi_gauss_model = nn.Sequential(Augmenter(augment_idx=1, augment_dims=1), nde).to(device)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;julia-version-4&#34; class=&#34;section level4&#34; number=&#34;2.2.5.2&#34;&gt;
&lt;h4&gt;&lt;span class=&#34;header-section-number&#34;&gt;2.2.5.2&lt;/span&gt; Julia version&lt;/h4&gt;
&lt;pre class=&#34;julia&#34;&gt;&lt;code&gt;using DiffEqFlux, Optim, OrdinaryDiffEq, Zygote, Flux, JLD2, Dates, Serialization

# The NN is defined with the Flux package. 32 neurons per dimension.
f = Chain(Dense(n_dims, 32 * n_dims, tanh), 
          Dense(32 * n_dims, 32 * n_dims, tanh), 
          Dense(32 * n_dims, 32 * n_dims, tanh), 
          Dense(32 * n_dims, n_dims)) |&amp;gt; gpu


# The CNF is defined as a differential equation AND the method used for its optimisation (FFJORD)
cnf_ffjord = FFJORD(f, t_span, Tsit5(), basedist = MvNormal(n_dims, 1.), monte_carlo = true)

# The optimisation minimises the negative log-likelihood
function loss_adjoint(θ)
    logpx = cnf_ffjord(X, θ)[1]
    return -mean(logpx)[1]
end
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;latent-space&#34; class=&#34;section level3&#34; number=&#34;2.2.6&#34;&gt;
&lt;h3&gt;&lt;span class=&#34;header-section-number&#34;&gt;2.2.6&lt;/span&gt; Latent space&lt;/h3&gt;
&lt;div id=&#34;python-version-5&#34; class=&#34;section level4&#34; number=&#34;2.2.6.1&#34;&gt;
&lt;h4&gt;&lt;span class=&#34;header-section-number&#34;&gt;2.2.6.1&lt;/span&gt; Python version&lt;/h4&gt;
&lt;p&gt;The latent space is defined as a 2-dimensional multivariate normal with independent components, &lt;span class=&#34;math inline&#34;&gt;\(\mu=0\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(\sigma=1\)&lt;/span&gt;.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;multi_gauss_prior = MultivariateNormal(torch.zeros(2).to(device), torch.eye(2).to(device))&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;julia-version-5&#34; class=&#34;section level4&#34; number=&#34;2.2.6.2&#34;&gt;
&lt;h4&gt;&lt;span class=&#34;header-section-number&#34;&gt;2.2.6.2&lt;/span&gt; Julia version&lt;/h4&gt;
&lt;p&gt;This was already done via the parameter &lt;code&gt;basedist&lt;/code&gt; of the &lt;code&gt;cnf_ffjord&lt;/code&gt; definition with &lt;code&gt;basedist = MvNormal(n_dims, 1.)&lt;/code&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;training&#34; class=&#34;section level3&#34; number=&#34;2.2.7&#34;&gt;
&lt;h3&gt;&lt;span class=&#34;header-section-number&#34;&gt;2.2.7&lt;/span&gt; Training&lt;/h3&gt;
&lt;div id=&#34;python-version-6&#34; class=&#34;section level4&#34; number=&#34;2.2.7.1&#34;&gt;
&lt;h4&gt;&lt;span class=&#34;header-section-number&#34;&gt;2.2.7.1&lt;/span&gt; Python version&lt;/h4&gt;
&lt;p&gt;&lt;code&gt;Pytorch Lightning&lt;/code&gt; also takes care of the training loops, logging and general bookkeeping: a &lt;code&gt;LightningModule&lt;/code&gt; is a &lt;code&gt;Pytorch&lt;/code&gt; module on steroids.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;class LearnerMultiGauss(pl.LightningModule):
    
    def __init__(self, model:nn.Module):
        super().__init__()
        
        self.model = model
        self.iters = 0

    
    def forward(self, x):
        return self.model(x)

    
    def training_step(self, batch, batch_idx):
        self.iters += 1
        x, _ = batch
        xtrJ = self.model(x)
        logprob = multi_gauss_prior.log_prob(xtrJ[:,1:]).to(x) - xtrJ[:,0] # logp(z_S) = logp(z_0) - \int_0^S trJ
        loss = -torch.mean(logprob)
        nde.nfe = 0
        return {&amp;#39;loss&amp;#39;: loss}

    
    def configure_optimizers(self):
        return torch.optim.AdamW(self.model.parameters(), lr=2e-3, weight_decay=1e-5)

    
    def train_dataloader(self):
        return trainloader&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;PytorchLightning&lt;/code&gt; handles the training:&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;learn = LearnerMultiGauss(multi_gauss_model)
trainer = pl.Trainer(max_epochs=300)
trainer.fit(learn)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;julia-version-6&#34; class=&#34;section level4&#34; number=&#34;2.2.7.2&#34;&gt;
&lt;h4&gt;&lt;span class=&#34;header-section-number&#34;&gt;2.2.7.2&lt;/span&gt; Julia version&lt;/h4&gt;
&lt;pre class=&#34;julia&#34;&gt;&lt;code&gt;# First define a callback function that will keep a record of losses and plot the learned distribution
callback = function(params, loss)
    
    store_all = true
    store_loss = false
    store_plot = false
    
    global iter += 1
    
    # Print the current loss
    println(&amp;quot;Iteration $iter  -- Loss: $loss&amp;quot;)
    
    
    # Keep a record of everything

    if store_all || store_loss
        push!(losses, loss)
    end
        
    if store_all || store_plot
        # Plot the transformation
        vals = map( (x, y) -&amp;gt; cnf_ffjord([x, y], params; monte_carlo=false)[1][], 
                    X_span, Y_span)    
    
        p = Plots.contour(x_span, y_span, vals, fill=true)
        p
        push!(list_plots, p)
    
        push!(min_maxes, 
              (minimum(vals), maximum(vals)))
    end
        
    return false
end


# Train using the ADAM optimizer. 

# List accumulators for the results
iter = 0; list_plots = []; min_maxes = []; losses = []

res1 = DiffEqFlux.sciml_train(
        loss_adjoint, 
        cnf_ffjord.p,
        ADAM(0.002), 
        cb = callback,
        maxiters = 100)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;sampling&#34; class=&#34;section level3&#34; number=&#34;2.2.8&#34;&gt;
&lt;h3&gt;&lt;span class=&#34;header-section-number&#34;&gt;2.2.8&lt;/span&gt; Sampling&lt;/h3&gt;
&lt;div id=&#34;python-version-7&#34; class=&#34;section level4&#34; number=&#34;2.2.8.1&#34;&gt;
&lt;h4&gt;&lt;span class=&#34;header-section-number&#34;&gt;2.2.8.1&lt;/span&gt; Python version&lt;/h4&gt;
&lt;p&gt;We can now sample from the independent Gaussians to see what is generated from them.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;# Let&amp;#39;s draw 16k samples
sample = multi_gauss_prior.sample(torch.Size([n_samples]))

# integrating from 1 to 0
multi_gauss_model[1].s_span = torch.linspace(1, 0, 2)
new_x = multi_gauss_model(sample).cpu().detach()
sample = sample.cpu()&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;plt.figure(figsize=(12, 4))

plt.subplot(121)
plt.scatter(new_x[:,1], new_x[:,2], s=2.3, alpha=0.2, linewidths=0.3, c=&amp;#39;blue&amp;#39;, edgecolors=&amp;#39;black&amp;#39;)
plt.xlim(-2, 2) ; plt.ylim(-2, 2)
plt.title(&amp;#39;Samples&amp;#39;)

plt.subplot(122)
plt.scatter(X[:,0], X[:,1], s=2.3, alpha=0.2, c=&amp;#39;red&amp;#39;,  linewidths=0.3, edgecolors=&amp;#39;black&amp;#39;)
plt.xlim(-2, 2) ; plt.ylim(-2, 2)
plt.title(&amp;#39;Data&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34; style=&#34;text-align: center&#34;&gt;&lt;span id=&#34;fig:fig-cnf-comparison&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;assets/cnf-2.png&#34; alt=&#34;Training result&#34; width=&#34;3944&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 2.4: Training result
&lt;/p&gt;
&lt;/div&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;trajectories = multi_gauss_model[1].trajectory(Augmenter(1, 1)(sample.to(device)), s_span=torch.linspace(1, 0, 100)).detach().cpu()

trajectories = trajectories[:, :, 1:] # drop the first dimension (the Jacobian trace)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;n = 1000
plt.figure(figsize=(6, 6))

# Plot the sample
plt.scatter(sample[:n, 0],   sample[:n, 1],   s=4,  alpha=0.8, c=&amp;#39;red&amp;#39;)

# Draw the flow from each sample to the generated data
plt.scatter(trajectories[:,:n, 0],   trajectories[:,:n, 1],   s=0.2, alpha=0.1, c=&amp;#39;olive&amp;#39;)

# Plot the generated data
plt.scatter(trajectories[-1, :n, 0], trajectories[-1, :n, 1], s=4,   alpha=1.0, c=&amp;#39;blue&amp;#39;)

plt.legend([&amp;#39;Prior sample z(S)&amp;#39;, &amp;#39;Flow&amp;#39;, &amp;#39;z(0)&amp;#39;])&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34; style=&#34;text-align: center&#34;&gt;&lt;span id=&#34;fig:fig-cnf-traj&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;assets/cnf-3.png&#34; alt=&#34;Flows&#34; width=&#34;2010&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 2.5: Flows
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Having sampled 1,000 points, we can see that the flow is smooth. For each sampled point (in red), we trace its flow (in olive) to its final destination (in blue). The initial sample follows a 2D Gaussian and expands towards each of the modes. It is important to emphasise how economical this is in terms of parameters. We have become accustomed to deep learning networks with a staggering number of cascaded layers, each with its own parameters to be optimised. This Neural ODE is a &lt;em&gt;single&lt;/em&gt; perceptron with three hidden layers, applied, in effect, an infinite number of times (within the approximation of the ODE solver).&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;julia-version-7&#34; class=&#34;section level4&#34; number=&#34;2.2.8.2&#34;&gt;
&lt;h4&gt;&lt;span class=&#34;header-section-number&#34;&gt;2.2.8.2&lt;/span&gt; Julia version&lt;/h4&gt;
&lt;p&gt;We plot the progress of the 100 iterations:&lt;/p&gt;
&lt;pre class=&#34;julia&#34;&gt;&lt;code&gt;anim = @animate for i ∈ 1:length(list_plots)
    # Necessary to create a new plot for each frame
    Plots.plot(1)
    Plots.plot!(list_plots[i])
end

gif(anim) # GIF converted to mp4 to reduce animation file size&lt;/code&gt;&lt;/pre&gt;
&lt;video width=&#34;320&#34; height=&#34;240&#34; controls&gt;
&lt;source src=&#34;assets/plots.mp4&#34; type=&#34;video/mp4&#34;&gt;
&lt;/video&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;into-the-maths&#34; class=&#34;section level2&#34; number=&#34;2.3&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;2.3&lt;/span&gt; Into the maths&lt;/h2&gt;
&lt;p&gt;The starting distribution is a random variable &lt;span class=&#34;math inline&#34;&gt;\(X\)&lt;/span&gt; with a support in &lt;span class=&#34;math inline&#34;&gt;\(\mathbb{R}^D\)&lt;/span&gt;. For simplicity, we will just assume that the support is all of &lt;span class=&#34;math inline&#34;&gt;\(\mathbb{R}^D\)&lt;/span&gt;, since restricting to measurable supports does not change the results. If &lt;span class=&#34;math inline&#34;&gt;\(X\)&lt;/span&gt; is transformed into &lt;span class=&#34;math inline&#34;&gt;\(Y\)&lt;/span&gt; by an invertible function/mapping &lt;span class=&#34;math inline&#34;&gt;\(f: \mathbb{R}^D \rightarrow \mathbb{R}^D\)&lt;/span&gt; (&lt;span class=&#34;math inline&#34;&gt;\(Y=f(X)\)&lt;/span&gt;), then the density function of &lt;span class=&#34;math inline&#34;&gt;\(Y\)&lt;/span&gt; is:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
\begin{aligned}
P_Y(\vec{y}) &amp;amp; = P_X(\vec{x}) \left| \det \nabla f^{-1}(\vec{y})  \right| \\
                &amp;amp; = P_X(\vec{x}) \left| \det\nabla f(\vec{x}) \right|^{-1}
\end{aligned}
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;where &lt;span class=&#34;math inline&#34;&gt;\(\vec{x} = f^{-1}(\vec{y})\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(\nabla\)&lt;/span&gt; represents the Jacobian operator. Note the use of &lt;span class=&#34;math inline&#34;&gt;\(\vec{x}\)&lt;/span&gt; to denote vectors instead of the usual &lt;span class=&#34;math inline&#34;&gt;\(\mathbf{x}\)&lt;/span&gt; which on-screen is easily read as a scalar.&lt;/p&gt;
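&lt;p&gt;The change-of-variables formula can be checked numerically on a one-dimensional affine map (a sketch; the map &lt;span class=&#34;math inline&#34;&gt;\(f(x) = 2x + 1\)&lt;/span&gt; and the standard normal are illustrative choices):&lt;/p&gt;

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Invertible map f(x) = 2x + 1, with gradient 2 everywhere
def f_inv(y):
    return (y - 1.0) / 2.0

y = 0.7
x = f_inv(y)

# P_Y(y) = P_X(x) |det grad f(x)|^{-1}
p_y = normal_pdf(x) / 2.0

# Y = 2X + 1 with X ~ N(0, 1) is exactly N(1, 2^2); the densities must agree
assert abs(p_y - normal_pdf(y, mu=1.0, sigma=2.0)) < 1e-12
```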
&lt;p&gt;Following the direction of &lt;span class=&#34;math inline&#34;&gt;\(f\)&lt;/span&gt; is the &lt;em&gt;generative&lt;/em&gt; direction; following the direction of &lt;span class=&#34;math inline&#34;&gt;\(f^{-1}\)&lt;/span&gt; is the &lt;em&gt;normalising&lt;/em&gt; direction (as well as being the &lt;em&gt;inference&lt;/em&gt;/&lt;em&gt;encoding&lt;/em&gt; direction in the context of training).&lt;/p&gt;
&lt;p&gt;If &lt;span class=&#34;math inline&#34;&gt;\(f\)&lt;/span&gt; is a composition of individual transformations &lt;span class=&#34;math inline&#34;&gt;\(f = f_N \circ f_{N-1} \circ \cdots \circ f_1\)&lt;/span&gt;, then it naturally follows that:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
\begin{aligned}
\det\nabla f(\vec{x})      &amp;amp; = \prod_{i=1}^N{\det \nabla f_i(\vec{x}_i)} \\
\det\nabla f^{-1}(\vec{x}) &amp;amp; = \prod_{i=1}^N{\det \nabla f_i^{-1}(\vec{x}_i)}
\end{aligned}
\]&lt;/span&gt;&lt;/p&gt;
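&lt;p&gt;This factorisation can be verified with two linear maps, whose Jacobians are the matrices themselves (a sketch with two arbitrary 2x2 matrices):&lt;/p&gt;

```python
# The det of the Jacobian of the composition x -> A(Bx)
# equals the product of the individual determinants.
A = [[2.0, 1.0], [0.0, 3.0]]    # det(A) = 6
B = [[1.0, -1.0], [2.0, 0.5]]   # det(B) = 2.5

def det2(M):
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

def matmul2(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

# The Jacobian of the composition is the matrix product AB
assert abs(det2(matmul2(A, B)) - det2(A) * det2(B)) < 1e-12
```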
&lt;p&gt;To make clear that each Jacobian is &lt;em&gt;not&lt;/em&gt; taken wrt the starting latent variable &lt;span class=&#34;math inline&#34;&gt;\(\vec{x}\)&lt;/span&gt; but wrt the intermediate variables, we use the notation:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
\vec{x}_i = f_{i-1}(\vec{x}_{i-1})
\]&lt;/span&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;training-loss-optimisation-and-information-flow&#34; class=&#34;section level2&#34; number=&#34;2.4&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;2.4&lt;/span&gt; Training loss optimisation and information flow&lt;/h2&gt;
&lt;p&gt;Before moving into examples of normalising flows, we need to comment on the loss function optimisation. How do we determine the generative model’s parameters so that the generated distribution is as close as possible to the real distribution (or at least to the distribution of the samples drawn from that true distribution)?&lt;/p&gt;
&lt;p&gt;A standard way to do this is to calculate the Kullback-Leibler divergence between the two. Recall that the KL divergence &lt;span class=&#34;math inline&#34;&gt;\(\mathbb{KL}(P \vert \vert Q)\)&lt;/span&gt; is &lt;em&gt;not&lt;/em&gt; a distance, as it is not symmetric. I personally read &lt;span class=&#34;math inline&#34;&gt;\(\mathbb{KL}(P \vert \vert Q)\)&lt;/span&gt; as “the loss of information about the true &lt;span class=&#34;math inline&#34;&gt;\(P\)&lt;/span&gt; when using the approximation &lt;span class=&#34;math inline&#34;&gt;\(Q\)&lt;/span&gt;”, which keeps the two distributions in their proper places (writing &lt;span class=&#34;math inline&#34;&gt;\(\mathbb{KL}(P_{true} \vert \vert Q_{est.})\)&lt;/span&gt; helps clarify the proper order).&lt;/p&gt;
&lt;p&gt;The KL divergence is defined as:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
\begin{aligned}
\mathbb{KL}(P_{true} \vert \vert Q_{est.}) = \mathbb{E}_{P_{true}(\vec{x})} \left[ \log \frac{P_{true}(\vec{x})}{Q_{est.}(\vec{x})} \right]
\end{aligned}
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Or for a discrete distribution:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
\begin{aligned}
\mathbb{KL}(P_{true} \vert \vert Q_{est}) &amp;amp; =  \sum_{\vec{x} \in X} P_{true}(\vec{x}) \log \frac{P_{true}(\vec{x})}{Q_{est}(\vec{x})} \\
                                          &amp;amp; =  \sum_{\vec{x} \in X} P_{true}(\vec{x}) \left[ \log P_{true}(\vec{x}) - \log Q_{est}(\vec{x}) \right]
\end{aligned}
\]&lt;/span&gt;&lt;/p&gt;
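&lt;p&gt;The discrete formula is easy to illustrate numerically, including the non-negativity and asymmetry of the divergence (the two three-point distributions below are arbitrary):&lt;/p&gt;

```python
import math

def kl(P, Q):
    # Discrete KL divergence: sum over x of P(x) * log(P(x) / Q(x))
    return sum(p * math.log(p / q) for p, q in zip(P, Q))

P_true = [0.5, 0.3, 0.2]
Q_est = [0.4, 0.4, 0.2]

# Non-negative, zero only when the distributions coincide, and not symmetric
assert kl(P_true, Q_est) > 0.0
assert kl(P_true, P_true) == 0.0
assert kl(P_true, Q_est) != kl(Q_est, P_true)
```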
&lt;p&gt;In our particular case, this becomes:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
\begin{aligned}
\mathbb{KL}(P_{true} \vert \vert P_Y) &amp;amp; = \sum_{\vec{x} \in X} {P_{true}(\vec{x}) \log \frac{P_{true}(\vec{x})}{P_Y(\vec{y})}} \\
                                      &amp;amp; = \sum_{\vec{x} \in X} {P_{true}(\vec{x}) \left[ \log P_{true}(\vec{x}) - \log P_Y(\vec{y}) \right] }
\end{aligned}
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Recalling that we have a transformation from &lt;span class=&#34;math inline&#34;&gt;\(\vec{x}\)&lt;/span&gt; to &lt;span class=&#34;math inline&#34;&gt;\(\vec{y}\)&lt;/span&gt;:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
\begin{aligned}
P_Y(\vec{y}) &amp;amp; = P_X(\vec{x}) \left| \det \nabla f^{-1}(\vec{y})  \right| \\
&amp;amp; = P_X(\vec{x}) \left| \det\nabla f(\vec{x}) \right|^{-1}
\end{aligned}
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;We end up with:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
\mathbb{KL}(P_{true} \vert \vert P_Y) = \sum_{\vec{x} \in X} {P_{true}(\vec{x}) \left[ \log P_{true}(\vec{x}) - \log \left( P_X(\vec{x}) \left| \det \nabla f(\vec{x})  \right|^{-1} \right) \right] }
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Minimising this divergence is achieved by adjusting the parameters that define &lt;span class=&#34;math inline&#34;&gt;\(f\)&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;The KL divergence is one of many loss functions that can measure the distance (in the loose sense of the word) between the true and generated distributions, and it illustrates how logarithms of the probability densities naturally appear. Another common formulation of the loss is the Wasserstein distance.&lt;/p&gt;
&lt;p&gt;In the setting of the normalising flows (and VAEs), we have two transformations: the inference direction (the encoder) and the generative direction (the decoder). Given the back-and-forth nature, it makes sense to &lt;em&gt;not&lt;/em&gt; favour one direction over the other. Instead of using the KL divergence which is not symmetric, we can use the mutual information (this is equivalent to using free energy as in &lt;span class=&#34;citation&#34;&gt;(&lt;a href=&#34;#ref-rezendeVariationalInferenceNormalizing2016&#34; role=&#34;doc-biblioref&#34;&gt;Rezende and Mohamed 2016&lt;/a&gt;)&lt;/span&gt;).&lt;/p&gt;
&lt;p&gt;Regardless of the choice of loss function, directly optimising &lt;span class=&#34;math inline&#34;&gt;\(\mathbb{KL}(P_{true} \vert \vert P_Y)\)&lt;/span&gt; is intractable without serious approximations. Finding more tractable alternative distance measures is an active research topic.&lt;a href=&#34;#fn1&#34; class=&#34;footnote-ref&#34; id=&#34;fnref1&#34;&gt;&lt;sup&gt;1&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;basic-flows&#34; class=&#34;section level2&#34; number=&#34;2.5&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;2.5&lt;/span&gt; Basic flows&lt;/h2&gt;
&lt;p&gt;In their paper, &lt;span class=&#34;citation&#34;&gt;(&lt;a href=&#34;#ref-rezendeVariationalInferenceNormalizing2016&#34; role=&#34;doc-biblioref&#34;&gt;Rezende and Mohamed 2016&lt;/a&gt;)&lt;/span&gt; experimented with simple transformations: a linear transformation (with a simple non-linear function) called &lt;em&gt;planar flows&lt;/em&gt; and flows within a space centered on a reference latent variable called &lt;em&gt;radial flows&lt;/em&gt;.&lt;/p&gt;
&lt;div id=&#34;planar-flows&#34; class=&#34;section level3&#34; number=&#34;2.5.1&#34;&gt;
&lt;h3&gt;&lt;span class=&#34;header-section-number&#34;&gt;2.5.1&lt;/span&gt; Planar Flows&lt;/h3&gt;
&lt;p&gt;A planar flow is formulated as a residual transformation:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
f_i(\vec{x}_i) = \vec{x}_i + \vec{u}_i  h(\vec{w}_i^\intercal \vec{x}_i + b_i)
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;where &lt;span class=&#34;math inline&#34;&gt;\(\vec{u}_i\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(\vec{w}_i\)&lt;/span&gt; are vectors, &lt;span class=&#34;math inline&#34;&gt;\(h(\cdot)\)&lt;/span&gt; is a non-linear real function and &lt;span class=&#34;math inline&#34;&gt;\(b_i\)&lt;/span&gt; is a scalar.&lt;/p&gt;
&lt;p&gt;By defining:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
\psi_i(\vec{z}) = h&amp;#39;(\vec{w}_i^\intercal \vec{z} + b_i) \vec{w}_i
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;the determinant required to normalise the flow can be simplified to (see the original paper for the short steps involved):&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
\left| \det \frac{\partial f_i}{\partial \vec{x}_i}  \right| = \left| \det \left( \mathbb{I} + \vec{u}_i \psi_i(\vec{x}_i)^\intercal \right) \right| = \left| 1 + \vec{u}_i^\intercal \psi_i(\vec{x}_i)  \right|
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;This is a more tractable expression.&lt;/p&gt;
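&lt;p&gt;The simplification is an instance of the matrix determinant lemma, which can be verified numerically in two dimensions (the vectors below are arbitrary):&lt;/p&gt;

```python
# Matrix determinant lemma: det(I + u psi^T) = 1 + u^T psi
u = [0.5, -1.0]
psi = [2.0, 0.3]

# Full 2x2 matrix I + u psi^T
M = [[1.0 + u[0] * psi[0], u[0] * psi[1]],
     [u[1] * psi[0], 1.0 + u[1] * psi[1]]]
det_full = M[0][0] * M[1][1] - M[0][1] * M[1][0]

# Rank-one shortcut: a single dot product instead of a d x d determinant
det_lemma = 1.0 + (u[0] * psi[0] + u[1] * psi[1])

assert abs(det_full - det_lemma) < 1e-12
```

The lemma turns an &lt;span class=&#34;math inline&#34;&gt;\(O(D^3)\)&lt;/span&gt; determinant into an &lt;span class=&#34;math inline&#34;&gt;\(O(D)\)&lt;/span&gt; dot product, which is what makes planar flows cheap to normalise.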
&lt;/div&gt;
&lt;div id=&#34;planar-flow-example&#34; class=&#34;section level3&#34; number=&#34;2.5.2&#34;&gt;
&lt;h3&gt;&lt;span class=&#34;header-section-number&#34;&gt;2.5.2&lt;/span&gt; Planar flow example&lt;/h3&gt;
&lt;p&gt;This is an example inspired by &lt;a href=&#34;https://github.com/abdulfatir/planar-flow-pytorch.git&#34;&gt;https://github.com/abdulfatir/planar-flow-pytorch&lt;/a&gt;.&lt;/p&gt;
&lt;div id=&#34;imports&#34; class=&#34;section level4&#34; number=&#34;2.5.2.1&#34;&gt;
&lt;h4&gt;&lt;span class=&#34;header-section-number&#34;&gt;2.5.2.1&lt;/span&gt; Imports&lt;/h4&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;# https://github.com/abdulfatir/planar-flow-pytorch

%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np

import torch
import torch.nn as nn
device = torch.device(&amp;quot;cuda&amp;quot; if torch.cuda.is_available() else &amp;quot;cpu&amp;quot;)

from tqdm.notebook import tqdm&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;constants-and-parameters&#34; class=&#34;section level4&#34; number=&#34;2.5.2.2&#34;&gt;
&lt;h4&gt;&lt;span class=&#34;header-section-number&#34;&gt;2.5.2.2&lt;/span&gt; Constants and parameters&lt;/h4&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;# Constants

# Size of a layer. We operate on a plane =&amp;gt; 2D
n_dimensions = 2

# Number of layers
n_layers = 16

# Number of samples drawn
n_samples = 500&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;densities-to-be-learned&#34; class=&#34;section level4&#34; number=&#34;2.5.2.3&#34;&gt;
&lt;h4&gt;&lt;span class=&#34;header-section-number&#34;&gt;2.5.2.3&lt;/span&gt; Densities to be learned&lt;/h4&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;# Unnormalized Density Functions

# As a torch object for training
def true_density(z):
    z1, z2 = z[:, 0], z[:, 1]
    norm = torch.sqrt(z1 ** 2 + z2 ** 2)
    exp1 = torch.exp(-0.5 * ((z1 - 2) / 0.8) ** 2)
    exp2 = torch.exp(-0.5 * ((z1 + 2) / 0.8) ** 2)
    u = 0.5 * ((norm - 4) / 0.4) ** 2 - torch.log(exp1 + exp2)
    return torch.exp(-u)

# As a Numpy object for plotting
def true_density_np(z):
    z1, z2 = z[:, 0], z[:, 1]
    norm = np.sqrt(z1 ** 2 + z2 ** 2)
    exp1 = np.exp(-0.5 * ((z1 - 2) / 0.8) ** 2)
    exp2 = np.exp(-0.5 * ((z1 + 2) / 0.8) ** 2)
    u = 0.5 * ((norm - 4) / 0.4) ** 2 - np.log(exp1 + exp2)
    return np.exp(-u)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;figure, axes = plt.subplots(1, 1, figsize=(8, 8))

# True Density
x = np.linspace(-5, 5, 500)
y = np.linspace(-5, 5, 500)

X, Y = np.meshgrid(x, y)

data = np.vstack([X.flatten(), Y.flatten()]).T

# Unnormalized density
density = true_density_np(data) 

axes.pcolormesh(X, Y, density.reshape(X.shape), cmap=&amp;#39;Blues&amp;#39;, shading=&amp;#39;auto&amp;#39;)
axes.set_title(&amp;#39;True density&amp;#39;)
axes.axis(&amp;#39;square&amp;#39;)
axes.set_xlim([-5, 5])
axes.set_ylim([-5, 5])&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34; style=&#34;text-align: center&#34;&gt;&lt;span id=&#34;fig:fig-pf-true-density&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;assets/pf-1.png&#34; alt=&#34;True density&#34; width=&#34;236&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 2.6: True density
&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;definition-of-a-single-layer&#34; class=&#34;section level4&#34; number=&#34;2.5.2.4&#34;&gt;
&lt;h4&gt;&lt;span class=&#34;header-section-number&#34;&gt;2.5.2.4&lt;/span&gt; Definition of a single layer&lt;/h4&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;class PlanarTransform(nn.Module):
    def __init__(self, dim=2):
        super().__init__()
        
        self.u = nn.Parameter(torch.randn(1, dim) * 0.01)
        self.w = nn.Parameter(torch.randn(1, dim) * 0.01)
        self.b = nn.Parameter(torch.randn(()) * 0.01)
    
    def m(self, x):
        # m(x) = -1 + softplus(x), used to constrain w.u_hat &gt; -1 (invertibility)
        return -1 + torch.log(1 + torch.exp(x))
    
    def h(self, x):
        return torch.tanh(x)
    
    def h_prime(self, x):
        return 1 - torch.tanh(x) ** 2
    
    def forward(self, z, logdet=False):
        # z.size() = batch x dim
        u_dot_w = (self.u @ self.w.t()).view(())
        
        # Unit vector in the direction of w
        w_hat = self.w / torch.norm(self.w, p=2) 
        
        # 1 x dim
        u_hat = (self.m(u_dot_w) - u_dot_w) * (w_hat) + self.u 
        affine = z @ self.w.t() + self.b
        
        # batch x dim
        z_next = z + u_hat * self.h(affine) 
    
        if logdet:
            
            # batch x dim
            psi = self.h_prime(affine) * self.w 
            
            # batch x 1
            LDJ = -torch.log(torch.abs(psi @ u_hat.t() + 1) + 1e-8) 
            return z_next, LDJ
        
        return z_next&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;definition-of-a-flow-as-a-concatenation-of-multiple-layers&#34; class=&#34;section level4&#34; number=&#34;2.5.2.5&#34;&gt;
&lt;h4&gt;&lt;span class=&#34;header-section-number&#34;&gt;2.5.2.5&lt;/span&gt; Definition of a flow as a concatenation of multiple layers&lt;/h4&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;class PlanarFlow(nn.Module):
    
    def __init__(self, dim=2, n_layers=16):
        super().__init__()
        
        self.transforms = nn.ModuleList([PlanarTransform(dim) for k in range(n_layers)])
    
    def forward(self, z, logdet=False):
        zK = z
        SLDJ = 0.0
        
        for transform in self.transforms:
            out = transform(zK, logdet=logdet)
            if logdet:
                SLDJ += out[1]
                zK = out[0]
            else:
                zK = out
                
        if logdet:
            return zK, SLDJ
        return zK&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;setup-the-training-model&#34; class=&#34;section level4&#34; number=&#34;2.5.2.6&#34;&gt;
&lt;h4&gt;&lt;span class=&#34;header-section-number&#34;&gt;2.5.2.6&lt;/span&gt; Setup the training model&lt;/h4&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;pf = PlanarFlow(dim=n_dimensions, n_layers=n_layers).to(device)

optimizer = torch.optim.Adam(pf.parameters(), lr=1e-2)
base = torch.distributions.normal.Normal(0., 1.)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;training-by-optimising-the-mathbbkl-divergence&#34; class=&#34;section level4&#34; number=&#34;2.5.2.7&#34;&gt;
&lt;h4&gt;&lt;span class=&#34;header-section-number&#34;&gt;2.5.2.7&lt;/span&gt; Training by optimising the &lt;span class=&#34;math inline&#34;&gt;\(\mathbb{KL}\)&lt;/span&gt; divergence&lt;/h4&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;pbar = tqdm(range(10000))

for i in pbar:
    optimizer.zero_grad()

    z0 = torch.randn(500, 2).to(device)
    zK, SLDJ = pf(z0, True)
    
    log_qk = base.log_prob(z0).sum(-1) + SLDJ.view(-1)
    log_p = torch.log(true_density(zK))
    
    kl = torch.mean(log_qk - log_p, 0)
    kl.backward()
    
    optimizer.step()
    if (i + 1) % 10 == 0:
        pbar.set_description(&amp;#39;KL: %.3f&amp;#39; % kl.item())&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;draw-samples-to-plot-the-resulting-model&#34; class=&#34;section level4&#34; number=&#34;2.5.2.8&#34;&gt;
&lt;h4&gt;&lt;span class=&#34;header-section-number&#34;&gt;2.5.2.8&lt;/span&gt; Draw samples to plot the resulting model&lt;/h4&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;samples = []

for _ in tqdm(range(n_samples)):
    
    # 500 starting sampled points
    z0 = torch.randn(500, 2).to(device)
    
    # Transformed 
    zK = pf(z0).detach().cpu().numpy()

    samples.append(zK)

samples = np.concatenate(samples)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;figure, axes = plt.subplots(1, 2, figsize=(16, 8))

# True Density (unnormalised)
x = np.linspace(-5, 5, 500)
y = np.linspace(-5, 5, 500)

X, Y = np.meshgrid(x, y)
data = np.vstack([X.flatten(), Y.flatten()]).T
density = true_density_np(data) 

axes[0].set_title(&amp;#39;True density&amp;#39;)
axes[0].axis(&amp;#39;square&amp;#39;)
axes[0].set_xlim([-5, 5])
axes[0].set_ylim([-5, 5])
axes[0].pcolormesh(X, Y, density.reshape(X.shape), cmap=&amp;#39;Blues&amp;#39;, shading=&amp;#39;auto&amp;#39;)

# Learned Density
axes[1].set_title(&amp;#39;Learned density&amp;#39;)
axes[1].axis(&amp;#39;square&amp;#39;)
axes[1].set_xlim([-5, 5])
axes[1].set_ylim([-5, 5])
axes[1].hist2d(samples[:, 0], samples[:, 1], bins=100, cmap=&amp;#39;Blues&amp;#39;, shading=&amp;#39;auto&amp;#39;)

plt.savefig(&amp;#39;assets/2ddensity.png&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34; style=&#34;text-align: center&#34;&gt;&lt;span id=&#34;fig:fig-pf-learned-density&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;assets/pf-2.png&#34; alt=&#34;Learned density&#34; width=&#34;464&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 2.7: Learned density
&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;radial-flows&#34; class=&#34;section level3&#34; number=&#34;2.5.3&#34;&gt;
&lt;h3&gt;&lt;span class=&#34;header-section-number&#34;&gt;2.5.3&lt;/span&gt; Radial flows&lt;/h3&gt;
&lt;p&gt;The formulation of radial flows takes a reference hyper-ball centered at a reference point &lt;span class=&#34;math inline&#34;&gt;\(\vec{x}_0\)&lt;/span&gt;. Any point &lt;span class=&#34;math inline&#34;&gt;\(\vec{x}\)&lt;/span&gt; is moved along the direction of &lt;span class=&#34;math inline&#34;&gt;\(\vec{x} - \vec{x}_0\)&lt;/span&gt;, by an amount that depends on &lt;span class=&#34;math inline&#34;&gt;\(\vec{x}\)&lt;/span&gt; itself. In other words, imagine a plain hyper-ball: after many such transformations, you obtain a hyper-potato.&lt;/p&gt;
&lt;p&gt;The flows are defined as:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
f_i(\vec{x}_i) = \vec{x}_i + \beta_i h(\alpha_i, \rho_i) \left( \vec{x}_i - \vec{x}_0 \right)
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;where &lt;span class=&#34;math inline&#34;&gt;\(\alpha_i\)&lt;/span&gt; is a strictly positive scalar, &lt;span class=&#34;math inline&#34;&gt;\(\beta_i\)&lt;/span&gt; is a scalar (invertibility requires &lt;span class=&#34;math inline&#34;&gt;\(\beta_i \geq -\alpha_i\)&lt;/span&gt;), &lt;span class=&#34;math inline&#34;&gt;\(\rho_i = \left\| \vec{x}_i - \vec{x}_0 \right\|\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(h(\alpha_i, \rho_i) = \frac{1}{\alpha_i + \rho_i}\)&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;This family of functions gives the following expression of the determinant:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
\left| \det \nabla f_i(\vec{x}_i) \right| = \left[ 1 + \beta_i h(\alpha_i, \rho_i) \right] ^{D-1} \left[ 1 + \beta_i h(\alpha_i, \rho_i) +  \beta_i \rho_i h&amp;#39;(\alpha_i, \rho_i) \right]
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Again, this is a more tractable expression since &lt;span class=&#34;math inline&#34;&gt;\(h(\cdot)\)&lt;/span&gt; is relatively simple.&lt;/p&gt;
&lt;p&gt;Unfortunately, it was found that those transformations do not scale well to high-dimensional latent spaces.&lt;/p&gt;
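&lt;p&gt;As an illustration, a single radial layer could be sketched as follows, mirroring the &lt;code&gt;PlanarTransform&lt;/code&gt; above. This is a hypothetical sketch, not the implementation used elsewhere in this post; the &lt;code&gt;log_alpha&lt;/code&gt; and softplus re-parametrisations are assumptions chosen to enforce the constraints stated above.&lt;/p&gt;

```python
import torch
import torch.nn as nn

class RadialTransform(nn.Module):
    """Sketch of a single radial flow layer (hypothetical)."""
    def __init__(self, dim=2):
        super().__init__()
        self.x0 = nn.Parameter(torch.randn(1, dim) * 0.01)
        self.log_alpha = nn.Parameter(torch.randn(()) * 0.01)  # alpha = exp(.) > 0
        self.beta = nn.Parameter(torch.randn(()) * 0.01)

    def forward(self, x, logdet=False):
        alpha = torch.exp(self.log_alpha)
        # Re-parametrise so that beta >= -alpha (invertibility constraint)
        beta = -alpha + torch.log(1 + torch.exp(self.beta))
        diff = x - self.x0                            # batch x dim
        rho = torch.norm(diff, dim=1, keepdim=True)   # batch x 1
        h = 1.0 / (alpha + rho)
        x_next = x + beta * h * diff
        if logdet:
            h_prime = -1.0 / (alpha + rho) ** 2
            # |det| = (1 + beta h)^(D-1) * (1 + beta h + beta rho h')
            D = x.size(1)
            LDJ = (D - 1) * torch.log(1 + beta * h) + \
                  torch.log(torch.abs(1 + beta * h + beta * rho * h_prime) + 1e-8)
            # Return the negative, matching the PlanarTransform convention
            return x_next, -LDJ
        return x_next
```

Such a layer could be dropped into the `PlanarFlow` container above unchanged, since it follows the same `(z_next, LDJ)` interface.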
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;more-complex-flows&#34; class=&#34;section level2&#34; number=&#34;2.6&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;2.6&lt;/span&gt; More complex flows&lt;/h2&gt;
&lt;div id=&#34;residual-flows-discrete-flows&#34; class=&#34;section level3&#34; number=&#34;2.6.1&#34;&gt;
&lt;h3&gt;&lt;span class=&#34;header-section-number&#34;&gt;2.6.1&lt;/span&gt; Residual flows (discrete flows)&lt;/h3&gt;
&lt;p&gt;Various proposals were initially put forward with common aims: replacing &lt;span class=&#34;math inline&#34;&gt;\(f\)&lt;/span&gt; by a series of sequentially composed, simpler but expressive base functions, while paying particular attention to the computational costs (see &lt;span class=&#34;citation&#34;&gt;(&lt;a href=&#34;#ref-kobyzevNormalizingFlowsIntroduction2020a&#34; role=&#34;doc-biblioref&#34;&gt;Kobyzev, Prince, and Brubaker 2020&lt;/a&gt;)&lt;/span&gt; and &lt;span class=&#34;citation&#34;&gt;(&lt;a href=&#34;#ref-papamakariosNormalizingFlowsProbabilistic2019&#34; role=&#34;doc-biblioref&#34;&gt;Papamakarios et al. 2019&lt;/a&gt;)&lt;/span&gt; for details).&lt;/p&gt;
&lt;p&gt;Generalised residual flows were a key development. As the name suggests, the transformation echoes the residual structure of ResNet-style networks &lt;span class=&#34;citation&#34;&gt;(&lt;a href=&#34;#ref-heDeepResidualLearning2015&#34; role=&#34;doc-biblioref&#34;&gt;He et al. 2015&lt;/a&gt;)&lt;/span&gt;. Explicitly, &lt;span class=&#34;math inline&#34;&gt;\(f\)&lt;/span&gt; is defined as &lt;span class=&#34;math inline&#34;&gt;\(f(\vec{x}) = \vec{x} + \phi(\vec{x})\)&lt;/span&gt;. The identity term is a matrix whose eigenvalues are all 1. If &lt;span class=&#34;math inline&#34;&gt;\(\phi(\vec{x})\)&lt;/span&gt; represented a simple matrix multiplication, imposing that all its eigenvalues have a norm strictly below 1 would ensure that &lt;span class=&#34;math inline&#34;&gt;\(f\)&lt;/span&gt; remains invertible. An equivalent, and more general, condition is to impose that &lt;span class=&#34;math inline&#34;&gt;\(\phi\)&lt;/span&gt; is Lipschitz-continuous with a constant strictly below 1. That is:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
\forall \vec{x}, \vec{y} \qquad  0 &amp;lt; \left| \phi(\vec{x}) - \phi(\vec{y}) \right| &amp;lt; \left| \vec{x} - \vec{y} \right|
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;and therefore:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
\forall \vec{x}, \vec{h} \neq 0 \qquad  0 &amp;lt; \frac{\left| \phi(\vec{x}+\vec{h}) - \phi(\vec{x}) \right|}{\left| \vec{h} \right|} &amp;lt; 1
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Thanks to this condition, not only is &lt;span class=&#34;math inline&#34;&gt;\(f\)&lt;/span&gt; invertible, but all the eigenvalues of &lt;span class=&#34;math inline&#34;&gt;\(\nabla f = \nabla \left( \mathbb{I} + \phi(x) \right)\)&lt;/span&gt; are strictly positive: adding a transformation with unit eigenvalues (i.e. &lt;span class=&#34;math inline&#34;&gt;\(\mathbb{I}\)&lt;/span&gt;) and a transformation with eigenvalues strictly below unity in norm cannot result in a transformation with nil eigenvalues. Therefore, we can be certain that &lt;span class=&#34;math inline&#34;&gt;\(\left| \det \nabla f \right| = \det \left( \mathbb{I} + \nabla \phi \right)\)&lt;/span&gt; (no negative or nil eigenvalues).&lt;/p&gt;
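&lt;p&gt;A practical consequence of the Lipschitz condition, worth a quick numerical check: &lt;span class=&#34;math inline&#34;&gt;\(f(\vec{x}) = \vec{x} + \phi(\vec{x})\)&lt;/span&gt; can be inverted by the fixed-point iteration &lt;span class=&#34;math inline&#34;&gt;\(\vec{x} \leftarrow \vec{y} - \phi(\vec{x})\)&lt;/span&gt;, which converges by the Banach fixed-point theorem. A minimal sketch, where the choice of &lt;span class=&#34;math inline&#34;&gt;\(\phi\)&lt;/span&gt; and the spectral rescaling are illustrative assumptions:&lt;/p&gt;

```python
import torch

torch.manual_seed(0)

# A contractive phi: tanh of a linear map rescaled to spectral norm 0.5,
# so phi is Lipschitz-continuous with constant at most 0.5 < 1.
W = torch.randn(3, 3)
W = 0.5 * W / torch.linalg.svdvals(W)[0]
phi = lambda x: torch.tanh(x @ W.T)

x_true = torch.randn(4, 3)
y = x_true + phi(x_true)   # forward pass of the residual layer

# Invert by the fixed-point iteration x <- y - phi(x)
x = y.clone()
for _ in range(50):
    x = y - phi(x)
# x has converged back to x_true
```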
&lt;p&gt;Recalling that &lt;span class=&#34;math inline&#34;&gt;\(det(e^A) = e^{tr(A)}\)&lt;/span&gt; and the Taylor expansion of &lt;span class=&#34;math inline&#34;&gt;\(\log\)&lt;/span&gt;, we obtain the following simplification:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
\begin{aligned}
\log \enspace \vert \det \nabla f \vert &amp;amp; = \log \enspace \det(\mathbb{I} + \nabla \phi) \\
                                        &amp;amp; = Tr(\log (\mathbb{I} + \nabla \phi)) \\
\log \enspace \vert \det \nabla f \vert &amp;amp; = \sum_{k=1}^{\infty}{(-1)^{k+1} \frac{tr\left( (\nabla \phi)^k \right)}{k}}
\end{aligned}
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Obviously a trace is much easier to calculate than a determinant. However, the expression now becomes an infinite series. One of the core results of the cited papers is an algorithm limiting the number of terms to calculate in this infinite series.&lt;/p&gt;
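&lt;p&gt;To make the series concrete, here is a hedged sketch of the truncated power-series estimator, combined with Hutchinson’s stochastic trace estimator so that only vector-Jacobian products are needed. The function name and the fixed truncation are assumptions; the cited papers use more refined, unbiased truncation schemes.&lt;/p&gt;

```python
import torch

def logdet_estimate(phi, x, n_terms=10, n_samples=200):
    # Estimate log|det(I + grad phi)| summed over the batch x via the truncated
    # series sum_k (-1)^{k+1} tr((grad phi)^k) / k, with the traces estimated
    # stochastically (Hutchinson). Assumes phi has Lipschitz constant < 1.
    x = x.clone().requires_grad_(True)
    y = phi(x)
    total = 0.0
    for _ in range(n_samples):
        v = torch.randn_like(x)      # probe vector for the trace estimator
        w = v
        for k in range(1, n_terms + 1):
            # w <- (J^T)^k v via repeated vector-Jacobian products
            w = torch.autograd.grad(y, x, w, retain_graph=True)[0]
            total = total + (-1) ** (k + 1) * (w * v).sum() / k
    return total / n_samples

# Sanity check on the linear map phi(x) = 0.1 x, where the exact answer
# is numel(x) * log(1.1)
x = torch.randn(2, 2)
est = logdet_estimate(lambda z: 0.1 * z, x)
```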
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;other-versions&#34; class=&#34;section level2&#34; number=&#34;2.7&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;2.7&lt;/span&gt; Other versions&lt;/h2&gt;
&lt;p&gt;[TODO] Table from Papamakarios&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;continuous-flows-and-neural-ordinary-differential-equations&#34; class=&#34;section level1&#34; number=&#34;3&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;3&lt;/span&gt; Continuous Flows and Neural ordinary differential equations&lt;/h1&gt;
&lt;div id=&#34;introduction-2&#34; class=&#34;section level2&#34; number=&#34;3.1&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;3.1&lt;/span&gt; Introduction&lt;/h2&gt;
&lt;p&gt;Up to now, the normalising flows were defined as a &lt;em&gt;discrete&lt;/em&gt; series of transformations. If we go back to the residual formulation of the flows, the internal state of the flow evolves as&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[\vec{x}_{i+1} = f(\vec{x}_{i}) = \vec{x}_{i} + \phi(\vec{x}_{i})\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;or&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[\vec{x}_{i+1} - \vec{x}_{i} = \phi(\vec{x}_{i})\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;This can be read as the Euler discretisation of the following ordinary differential equation:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[\frac{d\vec{x}(t)}{dt} = \phi\left( \vec{x}(t), \theta \right)\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;In other words, as the steps between layers become infinitesimal, the flows become continuous, where &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; represents the layer’s parameters. Note that the parameters do not depend on the depth &lt;span class=&#34;math inline&#34;&gt;\(t\)&lt;/span&gt;. As remarked by &lt;span class=&#34;citation&#34;&gt;(&lt;a href=&#34;#ref-massaroliDissectingNeuralODEs2020&#34; role=&#34;doc-biblioref&#34;&gt;Massaroli et al. 2020&lt;/a&gt;)&lt;/span&gt;, this formulation with a constant &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; (instead of a depth-dependent &lt;span class=&#34;math inline&#34;&gt;\(\theta(t)\)&lt;/span&gt;) is the deep limit of a residual network with identical layers. We could be more general by using a depth-dependent &lt;span class=&#34;math inline&#34;&gt;\(\theta(t)\)&lt;/span&gt; to create truly continuous neural networks.&lt;/p&gt;
&lt;p&gt;Since &lt;span class=&#34;math inline&#34;&gt;\(\phi(\cdot)\)&lt;/span&gt; does not explicitly depend on &lt;span class=&#34;math inline&#34;&gt;\(t\)&lt;/span&gt;, we can define &lt;span class=&#34;math inline&#34;&gt;\(\vec{x}(t_1) = \phi^{t_1 - t_0}(\vec{x}(t_0)) = \vec{x}(t_0) + \int_{t_0}^{t_1}{\phi(\vec{x}(t))dt}\)&lt;/span&gt; and see that &lt;span class=&#34;math inline&#34;&gt;\(\phi^{t} \circ \phi^{s} = \phi^{t+s}\)&lt;/span&gt;. Assuming, without loss of generality, that &lt;span class=&#34;math inline&#34;&gt;\(t \in \left[ 0, 1 \right]\)&lt;/span&gt;, &lt;span class=&#34;math inline&#34;&gt;\(\phi^1\)&lt;/span&gt; is a smooth flow called a &lt;em&gt;time one map&lt;/em&gt;. Note that under the assumptions that &lt;span class=&#34;math inline&#34;&gt;\(\phi^t(\cdot)\)&lt;/span&gt; is continuous in &lt;span class=&#34;math inline&#34;&gt;\(t\)&lt;/span&gt; and Lipschitz-continuous in &lt;span class=&#34;math inline&#34;&gt;\(\vec{x}\)&lt;/span&gt;, the solution is unique (Picard–Lindelöf, a.k.a. Cauchy–Lipschitz, theorem).&lt;/p&gt;
&lt;p&gt;This presentation of continuous flows is what &lt;span class=&#34;citation&#34;&gt;(&lt;a href=&#34;#ref-chenNeuralOrdinaryDifferential2019&#34; role=&#34;doc-biblioref&#34;&gt;Chen et al. 2019&lt;/a&gt;)&lt;/span&gt; named &lt;strong&gt;Neural Ordinary Differential Equation&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Surprisingly, the log probability density becomes simpler in this continuous setting. The discrete formulation:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[\log(P_Y(\vec{y})) = \log(P_X(\vec{x}))  - \log(\left| \det\nabla \left( \mathbb{I} + \phi(\vec{x}) \right) \right|)\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;becomes&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[\frac{\partial \log(P(\vec{x}(t)))}{\partial t}=-Tr \left( \frac{\partial \phi(\vec{x}(t))}{\partial \vec{x}(t)} \right)\]&lt;/span&gt;
(See Appendix A of the paper for details.)&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;continuous-flows-means-no-crossover&#34; class=&#34;section level2&#34; number=&#34;3.2&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;3.2&lt;/span&gt; Continuous flows means no-crossover&lt;/h2&gt;
&lt;p&gt;Previously, in the context of discrete transformations, the transformation matrix (the Jacobian) could have strictly positive or strictly negative eigenvalues. This is not the case in a continuous context.&lt;/p&gt;
&lt;p&gt;Let’s consider a simple case in one dimension where we simply try to change the sign of a distribution.&lt;/p&gt;
&lt;p&gt;For any value of &lt;span class=&#34;math inline&#34;&gt;\(t\)&lt;/span&gt;, the transformation is only a function of the state at that depth; it does not depend on the trajectories reaching that depth. Therefore, at a hypothetical point of crossing, the transformation could not send two coinciding trajectories in different directions, so trajectories cannot cross.&lt;/p&gt;
&lt;p&gt;Another way to look at this is to realise that at (or infinitesimally around) the point of crossing, the Jacobian of the transformation would need a negative eigenvalue to flip the volume. Starting from strictly positive eigenvalues, and given that &lt;span class=&#34;math inline&#34;&gt;\(\phi(\cdot)\)&lt;/span&gt; is sufficiently smooth, reaching a negative eigenvalue implies going through 0, at which point the transformation ceases to be a diffeomorphism. This is contrary to the design of normalising flows.&lt;/p&gt;
&lt;p&gt;Let’s look at what &lt;code&gt;Torchdyn&lt;/code&gt; would produce. The dataset contains pairs of &lt;code&gt;(-1, 1)&lt;/code&gt; and &lt;code&gt;(1, -1)&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;n_points = 100

# The inputs
X = torch.linspace(-1, 1, n_points).reshape(-1,1)

# The reflected values
y = -X

X_train = torch.Tensor(X).to(device)
y_train = torch.Tensor(y).to(device)

# We train in a single batch
train = data.TensorDataset(X_train, y_train)
trainloader = data.DataLoader(train, batch_size=len(X), shuffle=False)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We define a &lt;code&gt;LightningModule&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;class LearnerReflect(pl.LightningModule):
    def __init__(self, model:nn.Module, settings:dict={}):
        super().__init__()
        self.model = model
    
    def forward(self, x):
        return self.model(x)
    
    def training_step(self, batch, batch_idx):
        x, y = batch      
        y_hat = self.model(x)   
        loss = nn.MSELoss()(y_hat, y)
        logs = {&amp;#39;train_loss&amp;#39;: loss}
        return {&amp;#39;loss&amp;#39;: loss, &amp;#39;log&amp;#39;: logs}   
    
    def configure_optimizers(self):
        return torch.optim.Adam(self.model.parameters(), lr=0.01)

    def train_dataloader(self):
        return trainloader&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The ODE is a single perceptron:&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;# vanilla depth-invariant
f = nn.Sequential(
  nn.Linear(1, 64),
  nn.Tanh(),
  nn.Linear(64,1)
  )

# define the model
model = NeuralDE(f, solver=&amp;#39;dopri5&amp;#39;).to(device)

# train the neural ODE
learn = LearnerReflect(model)
trainer = pl.Trainer(min_epochs=100, max_epochs=200)
trainer.fit(learn)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;# Trace the trajectories
s_span = torch.linspace(0, 1, 100)
reflection_trajectory = model.trajectory(X_train, s_span).cpu().detach()&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;plt.figure(figsize=(12,4))
plot_settings = {
  &amp;#39;n_grid&amp;#39;:30, 
  &amp;#39;x_span&amp;#39;: [-1, 1], 
  &amp;#39;device&amp;#39;: device}

# Plot the learned flows
plot_traj_vf_1D(model, 
                s_span, reflection_trajectory, 
                n_grid=30, 
                x_span=[-1,1], 
                device=device);&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;# evaluate vector field
plot_n_pts = 50
x = torch.linspace(reflection_trajectory[:,:, 0].min(), 
                   reflection_trajectory[:,:, 0].max(), 
                   plot_n_pts)
y = torch.linspace(reflection_trajectory[:,:, 1].min(), 
                   reflection_trajectory[:,:, 1].max(), 
                   plot_n_pts)
X, Y = torch.meshgrid(x, y) 

z = torch.cat([X.reshape(-1,1), Y.reshape(-1,1)], 1)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;# Field vectors
model_f = model.defunc(0,z.to(device)).cpu().detach()

fx = model_f[:, 0].reshape(plot_n_pts , plot_n_pts)
fy = model_f[:, 1].reshape(plot_n_pts, plot_n_pts)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;# plot vector field and its intensity
fig = plt.figure(figsize=(4, 4))
ax = fig.add_subplot(111)

# Draws vector field itself
ax.streamplot(X.numpy().T, Y.numpy().T, 
              fx.numpy().T, fy.numpy().T, 
              color=&amp;#39;black&amp;#39;)

# Contour plot of the field&amp;#39;s intensity 
ax.contourf(X.T, Y.T, 
            torch.sqrt(fx.T**2 + fy.T**2), 
            cmap=&amp;#39;RdYlBu&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This simple example shows that in this form, Neural ODEs are not general enough.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;training-solving-the-ode&#34; class=&#34;section level2&#34; number=&#34;3.3&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;3.3&lt;/span&gt; Training / Solving the ODE&lt;/h2&gt;
&lt;p&gt;When optimising the parameters of discrete layers, we use backpropagation. What is the equivalent in a continuous setting?&lt;/p&gt;
&lt;p&gt;Backpropagation works in a discrete context by propagating training losses backward, allocating them to parameters in proportion to their contribution to the loss, and adjusting the parameters accordingly. The equivalent in a continuous context is the &lt;strong&gt;adjoint sensitivity method&lt;/strong&gt;, which originates from optimal control theory (see &lt;span class=&#34;citation&#34;&gt;(&lt;a href=&#34;#ref-erricoWhatAdjointModel1997&#34; role=&#34;doc-biblioref&#34;&gt;Errico 1997&lt;/a&gt;)&lt;/span&gt; for example).&lt;/p&gt;
&lt;p&gt;Given a loss defined as:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
\mathcal{L(\vec{x}(t_1))} = \mathcal{L} \left( \vec{x}(t_0) + \int_{t_0}^{t_1} \phi(\vec{x}(t), t, \theta) dt \right)
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;the adjoint &lt;span class=&#34;math inline&#34;&gt;\(a(\cdot)\)&lt;/span&gt; is defined as the gradient of the loss for a given hidden state evaluated at &lt;span class=&#34;math inline&#34;&gt;\(\vec{x} = \vec{x}(t)\)&lt;/span&gt;:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
a(t) = \frac{\partial \mathcal{L}}{\partial \vec{x}(t)}
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;The following figure explains what &lt;span class=&#34;math inline&#34;&gt;\(a(\cdot)\)&lt;/span&gt; represents: as &lt;span class=&#34;math inline&#34;&gt;\(t\)&lt;/span&gt; changes, so does the transformation &lt;span class=&#34;math inline&#34;&gt;\(\vec{x}(t)\)&lt;/span&gt; of the input (seen from &lt;span class=&#34;math inline&#34;&gt;\(\vec{x}(t_0)\)&lt;/span&gt;). At a given step &lt;span class=&#34;math inline&#34;&gt;\(t\)&lt;/span&gt;, the loss &lt;span class=&#34;math inline&#34;&gt;\(\mathcal{L}(\vec{x}(t))\)&lt;/span&gt; is a function only of that given state. The adjoint captures how that loss changes, expressed as a function of the progress through the flow &lt;span class=&#34;math inline&#34;&gt;\(t\)&lt;/span&gt; rather than of the value of the hidden state.&lt;/p&gt;
&lt;div class=&#34;figure&#34; style=&#34;text-align: center&#34;&gt;&lt;span id=&#34;fig:fig-adjoint&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;assets/adjoint_curve.png&#34; alt=&#34;**Backpropagation in time of the adjoint sensitivity** *(Source: [@chenNeuralOrdinaryDifferential2019])*&#34; width=&#34;171&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 3.1: &lt;strong&gt;Backpropagation in time of the adjoint sensitivity&lt;/strong&gt; &lt;em&gt;(Source: &lt;span class=&#34;citation&#34;&gt;(&lt;a href=&#34;#ref-chenNeuralOrdinaryDifferential2019&#34; role=&#34;doc-biblioref&#34;&gt;Chen et al. 2019&lt;/a&gt;)&lt;/span&gt;)&lt;/em&gt;
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;A first order of approximation gives the following ODE (see &lt;span class=&#34;citation&#34;&gt;(&lt;a href=&#34;#ref-chenNeuralOrdinaryDifferential2019&#34; role=&#34;doc-biblioref&#34;&gt;Chen et al. 2019&lt;/a&gt;)&lt;/span&gt; Appendix B.1. for details):&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
- \frac{da(t)}{dt} = {a(t)}^\intercal \frac{\partial \phi(\vec{x}(t), t, \theta)}{\partial \vec{x}(t)}
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;We write the negative sign in front of the derivative to make it more apparent that the adjoint sensitivity method is interested in tracking the backward changes of the loss: a positive derivative as &lt;span class=&#34;math inline&#34;&gt;\(t\)&lt;/span&gt; increases becomes a negative derivative as &lt;span class=&#34;math inline&#34;&gt;\(t\)&lt;/span&gt; decreases.&lt;/p&gt;
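&lt;p&gt;A toy numerical check of the adjoint equation, under illustrative assumptions (linear dynamics &lt;span class=&#34;math inline&#34;&gt;\(\phi(\vec{x}) = A\vec{x}\)&lt;/span&gt; and a quadratic loss): integrating the adjoint ODE backward from &lt;span class=&#34;math inline&#34;&gt;\(a(t_1)\)&lt;/span&gt; should recover the gradient of the loss with respect to the initial state, which we can compare against automatic differentiation through the discretised forward pass.&lt;/p&gt;

```python
import torch

torch.manual_seed(0)
A = 0.3 * torch.randn(2, 2)          # linear dynamics dx/dt = A x
x0 = torch.randn(2, requires_grad=True)
dt, n_steps = 1e-3, 1000             # integrate t from 0 to 1

# Forward pass with Euler steps, then a quadratic loss on the final state
x = x0
for _ in range(n_steps):
    x = x + dt * (A @ x)
loss = 0.5 * (x ** 2).sum()
grad_autograd, = torch.autograd.grad(loss, x0)

# Adjoint pass: da/dt = -A^T a, integrated backward from a(1) = dL/dx(1) = x(1)
a = x.detach()
for _ in range(n_steps):
    a = a + dt * (A.T @ a)
# a now approximates dL/dx(0) and matches the autograd gradient
```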
&lt;p&gt;Deep learning libraries such as PyTorch and TensorFlow in Python, or Zygote.jl/Flux.jl/DiffEqFlux.jl in Julia, provide automatic differentiation and a collection of bijections (to express the diffeomorphisms and loss function). They provide the infrastructure to express &lt;span class=&#34;math inline&#34;&gt;\(a(\cdot)\)&lt;/span&gt; and its derivative, track its changes, and optimise the parametrisation &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; of the transformations. R has bindings to the Python libraries.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;what-parameters-to-optimise&#34; class=&#34;section level2&#34; number=&#34;3.4&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;3.4&lt;/span&gt; What parameters to optimise?&lt;/h2&gt;
&lt;p&gt;Recall that, unlike the initial introduction of the Neural ODEs, the general case has depth-dependent parameters &lt;span class=&#34;math inline&#34;&gt;\(\theta(t)\)&lt;/span&gt;. There is no practical &lt;em&gt;general&lt;/em&gt; implementation of those continuous networks. &lt;span class=&#34;citation&#34;&gt;(&lt;a href=&#34;#ref-massaroliDissectingNeuralODEs2020&#34; role=&#34;doc-biblioref&#34;&gt;Massaroli et al. 2020&lt;/a&gt;)&lt;/span&gt; describes two different approaches: hyper-networks, where the parameters are generated by a neural network (one of the inputs being the depth), and what the paper calls a Galerkin-style approach, which expresses &lt;span class=&#34;math inline&#34;&gt;\(\theta(t)\)&lt;/span&gt; on a weighted basis of functions (think polynomials of a Taylor expansion or sines/cosines of a Fourier series) limited to a few terms.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;increase-the-complexity-of-a-flow-augmented-flows&#34; class=&#34;section level2&#34; number=&#34;3.5&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;3.5&lt;/span&gt; Increase the complexity of a flow: Augmented flows&lt;/h2&gt;
&lt;p&gt;As mentioned above, the basic continuous flows are not able to express something as simple as a change of sign of a distribution. This can be addressed with augmented flows (see &lt;span class=&#34;citation&#34;&gt;(&lt;a href=&#34;#ref-dupontAugmentedNeuralODEs2019&#34; role=&#34;doc-biblioref&#34;&gt;Dupont, Doucet, and Teh 2019&lt;/a&gt;)&lt;/span&gt;). The idea is to increase the dimension of the input: simply put, it embeds the flow into a space of higher dimension. &lt;!-- The following figure shows how flipping the sign of the input distribution could be achieved. --&gt;&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;citation&#34;&gt;(&lt;a href=&#34;#ref-dupontAugmentedNeuralODEs2019&#34; role=&#34;doc-biblioref&#34;&gt;Dupont, Doucet, and Teh 2019&lt;/a&gt;)&lt;/span&gt; demonstrate that this augmentation is expressive enough to achieve transformations that the un-augmented flows cannot.&lt;/p&gt;
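&lt;p&gt;A minimal sketch of why augmentation helps, under illustrative assumptions: in one dimension no continuous flow can map &lt;span class=&#34;math inline&#34;&gt;\(x\)&lt;/span&gt; to &lt;span class=&#34;math inline&#34;&gt;\(-x\)&lt;/span&gt; (trajectories would have to cross at 0), but after augmenting to two dimensions a plain rotation field does the job, sending &lt;span class=&#34;math inline&#34;&gt;\((x, 0)\)&lt;/span&gt; to &lt;span class=&#34;math inline&#34;&gt;\((-x, 0)\)&lt;/span&gt;:&lt;/p&gt;

```python
import numpy as np

# The rotation field phi(x, y) = (-pi*y, pi*x) rotates the plane by pi
# as t goes from 0 to 1: the augmented dimension lets trajectories go
# "around" the origin instead of through it.
def rotate_by_flow(p, n_steps=10000):
    dt = 1.0 / n_steps
    x, y = p
    for _ in range(n_steps):            # Euler integration of the flow
        x, y = x + dt * (-np.pi * y), y + dt * (np.pi * x)
    return x, y

x1, y1 = rotate_by_flow((1.0, 0.0))     # approximately (-1, 0)
```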
&lt;p&gt;CHECK Appendix B.3 of &lt;span class=&#34;citation&#34;&gt;(&lt;a href=&#34;#ref-massaroliDissectingNeuralODEs2020&#34; role=&#34;doc-biblioref&#34;&gt;Massaroli et al. 2020&lt;/a&gt;)&lt;/span&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;decrease-the-complexity-of-a-flow-regularisation-and-stability&#34; class=&#34;section level2&#34; number=&#34;3.6&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;3.6&lt;/span&gt; Decrease the complexity of a flow: Regularisation and stability&lt;/h2&gt;
&lt;p&gt;Despite their advantages, continuous flows suffer from potential instability: it does not take much for a dynamical system to exhibit chaotic behaviour. This is all the more possible since the latent space dimension is the same as the dataset’s: a larger number of dimensions means more possible flows within that space. Depth-dependent parameters &lt;span class=&#34;math inline&#34;&gt;\(\theta(t)\)&lt;/span&gt;, instead of a constant &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt;, increase that risk (using a constant being a form of regularisation). (See &lt;span class=&#34;citation&#34;&gt;(&lt;a href=&#34;#ref-zhang2014comprehensive&#34; role=&#34;doc-biblioref&#34;&gt;Zhang, Wang, and Liu 2014&lt;/a&gt;)&lt;/span&gt; for a comprehensive review of the stability of neural networks.) Greater stability can be achieved by penalising extreme or sudden flow divergences, where small changes in inputs yield large changes in output.&lt;/p&gt;
&lt;p&gt;To quantify the propensity for chaotic behaviour, the literature focuses on the &lt;em&gt;Lyapunov exponents&lt;/em&gt; (&lt;strong&gt;LEs&lt;/strong&gt;) of the flows. What do LEs represent? Intuitively, imagine a point in space surrounded by a small volume &lt;span class=&#34;math inline&#34;&gt;\(V_1\)&lt;/span&gt;. When that volume is carried by the flow (with time changing from &lt;span class=&#34;math inline&#34;&gt;\(t_1\)&lt;/span&gt; to &lt;span class=&#34;math inline&#34;&gt;\(t_2\)&lt;/span&gt;), it contracts and/or dilates to &lt;span class=&#34;math inline&#34;&gt;\(V_2\)&lt;/span&gt;. An LE is a measure of this change &lt;span class=&#34;math inline&#34;&gt;\(V_2 / V_1\)&lt;/span&gt; expressed as a logarithm: if the volume is unchanged, the LE &lt;span class=&#34;math inline&#34;&gt;\(\lambda\)&lt;/span&gt; is 0 (&lt;span class=&#34;math inline&#34;&gt;\(e^\lambda = e^0 = 1\)&lt;/span&gt;). A contraction (resp. dilatation) has a negative (resp. positive) exponent. This formulation has two benefits:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;An exponent can be of any sign, but the change of volume is always positive (a negative volume makes no sense); and,&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;for time changing from &lt;span class=&#34;math inline&#34;&gt;\(t_1\)&lt;/span&gt; to &lt;span class=&#34;math inline&#34;&gt;\(t_2\)&lt;/span&gt;, the exponent &lt;span class=&#34;math inline&#34;&gt;\(\lambda\)&lt;/span&gt; is consistently expressed as an instantaneous change independent of time: &lt;span class=&#34;math inline&#34;&gt;\(V_2/V_1 = e^{\lambda (t_2 - t_1)}\)&lt;/span&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
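&lt;p&gt;A small numerical illustration of these two points, under the illustrative assumption of a linear flow, for which the exponents are simply the real parts of the eigenvalues of the matrix:&lt;/p&gt;

```python
import numpy as np

# For the linear flow dx/dt = A x, the Lyapunov exponents are the real parts
# of the eigenvalues of A, and a small volume evolves as V(t) = V(0) e^{tr(A) t}
# (the sum of the exponents is tr(A)).
A = np.array([[-0.5, 0.0],
              [0.0,  0.2]])
exponents = np.sort(np.linalg.eigvals(A).real)   # one contraction, one dilatation
t = 2.0
volume_ratio = np.exp(np.trace(A) * t)           # always positive, any-sign exponents
```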
&lt;p&gt;Adding a penalty term to the cost function is a natural solution:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;span class=&#34;citation&#34;&gt;(&lt;a href=&#34;#ref-yanRobustnessNeuralOrdinary2020&#34; role=&#34;doc-biblioref&#34;&gt;Yan et al. 2020&lt;/a&gt;)&lt;/span&gt; proposes using an estimate of the Lyapunov exponent. However, their proposal is to make this estimation along the flows; in essence, they regularise each flow (from an infinitesimal volume to another along segments of that flow) to avoid successive cycles of contraction/dilatation. Intuitively, this favours flows in the form of funnels (contraction) or horns (dilatation). It is however computationally expensive.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;span class=&#34;citation&#34;&gt;(&lt;a href=&#34;#ref-massaroliDissectingNeuralODEs2020&#34; role=&#34;doc-biblioref&#34;&gt;Massaroli et al. 2020&lt;/a&gt;)&lt;/span&gt; proposes to only calculate between &lt;span class=&#34;math inline&#34;&gt;\(t=0\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(t=1\)&lt;/span&gt; (with &lt;span class=&#34;math inline&#34;&gt;\(\mathcal{L}_{reg} = \sum\limits_{i=1}^{N} \left\| \phi^1(t, x(1), \theta(1)) \right\|_2\)&lt;/span&gt; for a training batch of size &lt;span class=&#34;math inline&#34;&gt;\(N\)&lt;/span&gt;). If &lt;span class=&#34;math inline&#34;&gt;\(\phi^1\)&lt;/span&gt; is zero, there is no change between the initial and final volume of a flow line.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
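&lt;p&gt;As a minimal sketch (with hypothetical names, not the authors&amp;#39; implementation), such a penalty can be added to a task loss as follows:&lt;/p&gt;
&lt;pre class=&#34;julia&#34;&gt;&lt;code&gt;using LinearAlgebra

# Placeholder dynamics standing in for the per-sample integrated volume change ϕ¹
ϕ¹(x) = 0.1 .* x

# L_reg = Σᵢ ‖ϕ¹(xᵢ)‖₂ over a training batch
L_reg(batch) = sum(norm(ϕ¹(x), 2) for x in batch)

# The penalty is added to the task loss with a small weight
total_loss(task_loss, batch; λ_reg = 0.01) = task_loss + λ_reg * L_reg(batch)&lt;/code&gt;&lt;/pre&gt;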
&lt;/div&gt;
&lt;div id=&#34;other&#34; class=&#34;section level2&#34; number=&#34;3.7&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;3.7&lt;/span&gt; Other&lt;/h2&gt;
&lt;p&gt;Previously mentioned generative models can be improved with normalising flows.&lt;/p&gt;
&lt;p&gt;For example, Flow-GAN (Grover, Dhar, and Ermon) combines maximum likelihood and adversarial learning in a single generative model.&lt;/p&gt;
&lt;hr /&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level1 unnumbered&#34;&gt;
&lt;h1&gt;References&lt;/h1&gt;
&lt;div id=&#34;literature&#34; class=&#34;section level2 unnumbered&#34;&gt;
&lt;h2&gt;Literature&lt;/h2&gt;
&lt;div id=&#34;refs&#34; class=&#34;references csl-bib-body hanging-indent&#34;&gt;
&lt;div id=&#34;ref-chenNeuralOrdinaryDifferential2019&#34; class=&#34;csl-entry&#34;&gt;
Chen, Ricky T. Q., Yulia Rubanova, Jesse Bettencourt, and David Duvenaud. 2019. &lt;span&gt;“Neural &lt;span&gt;Ordinary Differential Equations&lt;/span&gt;.”&lt;/span&gt; December 13, 2019. &lt;a href=&#34;http://arxiv.org/abs/1806.07366&#34;&gt;http://arxiv.org/abs/1806.07366&lt;/a&gt;.
&lt;/div&gt;
&lt;div id=&#34;ref-dinhNICENonlinearIndependent2015&#34; class=&#34;csl-entry&#34;&gt;
Dinh, Laurent, David Krueger, and Yoshua Bengio. 2015. &lt;span&gt;“&lt;span&gt;NICE&lt;/span&gt;: &lt;span&gt;Non&lt;/span&gt;-Linear &lt;span&gt;Independent Components Estimation&lt;/span&gt;.”&lt;/span&gt; April 10, 2015. &lt;a href=&#34;http://arxiv.org/abs/1410.8516&#34;&gt;http://arxiv.org/abs/1410.8516&lt;/a&gt;.
&lt;/div&gt;
&lt;div id=&#34;ref-dupontAugmentedNeuralODEs2019&#34; class=&#34;csl-entry&#34;&gt;
Dupont, Emilien, Arnaud Doucet, and Yee Whye Teh. 2019. &lt;span&gt;“Augmented &lt;span&gt;Neural ODEs&lt;/span&gt;.”&lt;/span&gt; October 26, 2019. &lt;a href=&#34;http://arxiv.org/abs/1904.01681&#34;&gt;http://arxiv.org/abs/1904.01681&lt;/a&gt;.
&lt;/div&gt;
&lt;div id=&#34;ref-erricoWhatAdjointModel1997&#34; class=&#34;csl-entry&#34;&gt;
Errico, Ronald M. 1997. &lt;span&gt;“What &lt;span&gt;Is&lt;/span&gt; an &lt;span&gt;Adjoint Model&lt;/span&gt;?”&lt;/span&gt; &lt;em&gt;Bulletin of the American Meteorological Society&lt;/em&gt; 78 (11): 2577–92. &lt;a href=&#34;https://doi.org/10.1175/1520-0477(1997)078&amp;lt;2577:WIAAM&amp;gt;2.0.CO;2&#34;&gt;https://doi.org/10.1175/1520-0477(1997)078&amp;lt;2577:WIAAM&amp;gt;2.0.CO;2&lt;/a&gt;.
&lt;/div&gt;
&lt;div id=&#34;ref-GoodfellowDeepLearning2016&#34; class=&#34;csl-entry&#34;&gt;
Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. &lt;em&gt;Deep Learning&lt;/em&gt;. MIT Press.
&lt;/div&gt;
&lt;div id=&#34;ref-heDeepResidualLearning2015&#34; class=&#34;csl-entry&#34;&gt;
He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. &lt;span&gt;“Deep &lt;span&gt;Residual Learning&lt;/span&gt; for &lt;span&gt;Image Recognition&lt;/span&gt;.”&lt;/span&gt; December 10, 2015. &lt;a href=&#34;http://arxiv.org/abs/1512.03385&#34;&gt;http://arxiv.org/abs/1512.03385&lt;/a&gt;.
&lt;/div&gt;
&lt;div id=&#34;ref-hitawalaComparativeStudyGenerative2018&#34; class=&#34;csl-entry&#34;&gt;
Hitawala, Saifuddin. 2018. &lt;span&gt;“Comparative &lt;span&gt;Study&lt;/span&gt; on &lt;span&gt;Generative Adversarial Networks&lt;/span&gt;.”&lt;/span&gt; January 11, 2018. &lt;a href=&#34;http://arxiv.org/abs/1801.04271&#34;&gt;http://arxiv.org/abs/1801.04271&lt;/a&gt;.
&lt;/div&gt;
&lt;div id=&#34;ref-kingmaIntroductionVariationalAutoencoders2019&#34; class=&#34;csl-entry&#34;&gt;
Kingma, Diederik P., and Max Welling. 2019. &lt;span&gt;“An &lt;span&gt;Introduction&lt;/span&gt; to &lt;span&gt;Variational Autoencoders&lt;/span&gt;.”&lt;/span&gt; &lt;em&gt;Foundations and Trends in Machine Learning&lt;/em&gt; 12 (4): 307–92. &lt;a href=&#34;https://doi.org/ggfm34&#34;&gt;https://doi.org/ggfm34&lt;/a&gt;.
&lt;/div&gt;
&lt;div id=&#34;ref-kobyzevNormalizingFlowsIntroduction2020a&#34; class=&#34;csl-entry&#34;&gt;
Kobyzev, Ivan, Simon J. D. Prince, and Marcus A. Brubaker. 2020. &lt;span&gt;“Normalizing &lt;span&gt;Flows&lt;/span&gt;: &lt;span&gt;An Introduction&lt;/span&gt; and &lt;span&gt;Review&lt;/span&gt; of &lt;span&gt;Current Methods&lt;/span&gt;.”&lt;/span&gt; &lt;em&gt;IEEE Transactions on Pattern Analysis and Machine Intelligence&lt;/em&gt;, 1–1. &lt;a href=&#34;https://doi.org/10.1109/TPAMI.2020.2992934&#34;&gt;https://doi.org/10.1109/TPAMI.2020.2992934&lt;/a&gt;.
&lt;/div&gt;
&lt;div id=&#34;ref-larsenAutoencodingPixelsUsing2016&#34; class=&#34;csl-entry&#34;&gt;
Larsen, Anders Boesen Lindbo, Søren Kaae Sønderby, Ole Winther, and Hugo Larochelle. 2016. &lt;span&gt;“Autoencoding Beyond Pixels Using a Learned Similarity Metric.”&lt;/span&gt; February 10, 2016. &lt;a href=&#34;https://arxiv.org/abs/1512.09300&#34;&gt;https://arxiv.org/abs/1512.09300&lt;/a&gt;.
&lt;/div&gt;
&lt;div id=&#34;ref-lucasDonBlameELBO2019&#34; class=&#34;csl-entry&#34;&gt;
Lucas, James, George Tucker, Roger Grosse, and Mohammad Norouzi. 2019. &lt;span&gt;“Don’t &lt;span&gt;Blame&lt;/span&gt; the &lt;span&gt;ELBO&lt;/span&gt;! &lt;span&gt;A Linear VAE Perspective&lt;/span&gt; on &lt;span&gt;Posterior Collapse&lt;/span&gt;.”&lt;/span&gt; November 6, 2019. &lt;a href=&#34;http://arxiv.org/abs/1911.02469&#34;&gt;http://arxiv.org/abs/1911.02469&lt;/a&gt;.
&lt;/div&gt;
&lt;div id=&#34;ref-massaroliDissectingNeuralODEs2020&#34; class=&#34;csl-entry&#34;&gt;
Massaroli, Stefano, Michael Poli, Jinkyoo Park, Atsushi Yamashita, and Hajime Asama. 2020. &lt;span&gt;“Dissecting &lt;span&gt;Neural ODEs&lt;/span&gt;.”&lt;/span&gt; June 20, 2020. &lt;a href=&#34;http://arxiv.org/abs/2002.08071&#34;&gt;http://arxiv.org/abs/2002.08071&lt;/a&gt;.
&lt;/div&gt;
&lt;div id=&#34;ref-papamakariosNormalizingFlowsProbabilistic2019&#34; class=&#34;csl-entry&#34;&gt;
Papamakarios, George, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, and Balaji Lakshminarayanan. 2019. &lt;span&gt;“Normalizing &lt;span&gt;Flows&lt;/span&gt; for &lt;span&gt;Probabilistic Modeling&lt;/span&gt; and &lt;span&gt;Inference&lt;/span&gt;.”&lt;/span&gt; December 5, 2019. &lt;a href=&#34;http://arxiv.org/abs/1912.02762&#34;&gt;http://arxiv.org/abs/1912.02762&lt;/a&gt;.
&lt;/div&gt;
&lt;div id=&#34;ref-rezendeVariationalInferenceNormalizing2016&#34; class=&#34;csl-entry&#34;&gt;
Rezende, Danilo Jimenez, and Shakir Mohamed. 2016. &lt;span&gt;“Variational &lt;span&gt;Inference&lt;/span&gt; with &lt;span&gt;Normalizing Flows&lt;/span&gt;.”&lt;/span&gt; June 14, 2016. &lt;a href=&#34;http://arxiv.org/abs/1505.05770&#34;&gt;http://arxiv.org/abs/1505.05770&lt;/a&gt;.
&lt;/div&gt;
&lt;div id=&#34;ref-russellArtificialIntelligenceModern2020&#34; class=&#34;csl-entry&#34;&gt;
Russell, Stuart, and Peter Norvig. 2020. &lt;em&gt;Artificial &lt;span&gt;Intelligence&lt;/span&gt;: A &lt;span&gt;Modern Approach&lt;/span&gt;&lt;/em&gt;. 4th ed. Pearson &lt;span&gt;Series&lt;/span&gt; on &lt;span&gt;Artificial Intelligence&lt;/span&gt;. &lt;span&gt;Pearson&lt;/span&gt;. &lt;a href=&#34;http://aima.cs.berkeley.edu/&#34;&gt;http://aima.cs.berkeley.edu/&lt;/a&gt;.
&lt;/div&gt;
&lt;div id=&#34;ref-theodoridisMachineLearningBayesian2020&#34; class=&#34;csl-entry&#34;&gt;
Theodoridis, Sergios. 2020. &lt;em&gt;Machine Learning: A &lt;span&gt;Bayesian&lt;/span&gt; and Optimization Perspective&lt;/em&gt;. &lt;span&gt;Amsterdam Boston Heidelberg London New York Oxford Paris San Diego San Francisco Singapore Sydney Tokyo&lt;/span&gt;: &lt;span&gt;Elsevier, AP&lt;/span&gt;. &lt;a href=&#34;https://doi.org/10.1016/C2019-0-03772-7&#34;&gt;https://doi.org/10.1016/C2019-0-03772-7&lt;/a&gt;.
&lt;/div&gt;
&lt;div id=&#34;ref-tuckerUnderstandingPosteriorCollapse2019&#34; class=&#34;csl-entry&#34;&gt;
Tucker, George, Roger Grosse, Mohammad Norouzi, and James Lucas. 2019. &lt;span&gt;“Understanding &lt;span&gt;Posterior Collapse&lt;/span&gt; in &lt;span&gt;Generative Latent Variable Models&lt;/span&gt;.”&lt;/span&gt; In &lt;em&gt;&lt;span&gt;DeepGenStruct Workshop&lt;/span&gt;&lt;/em&gt;. &lt;a href=&#34;https://www.semanticscholar.org/paper/Understanding-Posterior-Collapse-in-Generative-Lucas-Tucker/7e2f5af5d44890c08ef72a5070340e0ffd3643ea&#34;&gt;https://www.semanticscholar.org/paper/Understanding-Posterior-Collapse-in-Generative-Lucas-Tucker/7e2f5af5d44890c08ef72a5070340e0ffd3643ea&lt;/a&gt;.
&lt;/div&gt;
&lt;div id=&#34;ref-yanRobustnessNeuralOrdinary2020&#34; class=&#34;csl-entry&#34;&gt;
Yan, Hanshu, Jiawei Du, Vincent Y. F. Tan, and Jiashi Feng. 2020. &lt;span&gt;“On &lt;span&gt;Robustness&lt;/span&gt; of &lt;span&gt;Neural Ordinary Differential Equations&lt;/span&gt;.”&lt;/span&gt; January 1, 2020. &lt;a href=&#34;http://arxiv.org/abs/1910.05513&#34;&gt;http://arxiv.org/abs/1910.05513&lt;/a&gt;.
&lt;/div&gt;
&lt;div id=&#34;ref-zhang2014comprehensive&#34; class=&#34;csl-entry&#34;&gt;
Zhang, Huaguang, Zhanshan Wang, and Derong Liu. 2014. &lt;span&gt;“A Comprehensive Review of Stability Analysis of Continuous-Time Recurrent Neural Networks.”&lt;/span&gt; &lt;em&gt;IEEE Transactions on Neural Networks and Learning Systems&lt;/em&gt; 25 (7): 1229–62. &lt;a href=&#34;https://doi.org/10.1109/TNNLS.2014.2317880&#34;&gt;https://doi.org/10.1109/TNNLS.2014.2317880&lt;/a&gt;.
&lt;/div&gt;
&lt;/div&gt;
&lt;hr /&gt;
&lt;/div&gt;
&lt;div id=&#34;web-references&#34; class=&#34;section level2 unnumbered&#34;&gt;
&lt;h2&gt;Web references&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Difficulties of &lt;a href=&#34;https://medium.com/@jonathan_hui/gan-why-it-is-so-hard-to-train-generative-advisory-networks-819a86b3750b&#34;&gt;training GANs&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A blog post by Adam Kosiorek on &lt;a href=&#34;http://akosiorek.github.io/ml/2018/04/03/norm_flows.html&#34;&gt;Normalizing Flows&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A two-part &lt;a href=&#34;https://blog.evjang.com/2018/01/nf1.html&#34;&gt;Normalizing Flows Tutorial&lt;/a&gt; by Eric Jang.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A &lt;a href=&#34;https://www.shakirm.com/slides/DeepGenModelsTutorial.pdf&#34;&gt;Tutorial on Deep Generative Models&lt;/a&gt; by Shakir Mohamed and Danilo Rezende.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href=&#34;https://en.wikipedia.org/wiki/Picard%E2%80%93Lindel%C3%B6f_theorem&#34;&gt;Picard–Lindelöf-Cauchy–Lipschitz theorem&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;TensorFlow &lt;a href=&#34;https://www.tensorflow.org/probability/api_docs/python/tfp/bijectors/Bijector&#34;&gt;bijectors&lt;/a&gt; and &lt;a href=&#34;https://github.com/titu1994/tfdiffeq&#34;&gt;continuous models&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Pytorch &lt;a href=&#34;https://pytorch.org/docs/stable/distributions.html&#34;&gt;bijectors&lt;/a&gt; and &lt;a href=&#34;https://torchdyn.readthedocs.io/en/latest/&#34;&gt;continuous models&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Julia bijectors in &lt;a href=&#34;https://github.com/TuringLang/Bijectors.jl&#34;&gt;Turing&lt;/a&gt; and &lt;a href=&#34;https://github.com/tpapp/TransformVariables.jl&#34;&gt;TransformVariables.jl&lt;/a&gt;, and &lt;a href=&#34;https://diffeqflux.sciml.ai/dev/&#34;&gt;neural ODEs&lt;/a&gt;, which also covers normalising flows and FFJORD.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class=&#34;footnotes&#34;&gt;
&lt;hr /&gt;
&lt;ol&gt;
&lt;li id=&#34;fn1&#34;&gt;&lt;p&gt;Incidentally, this observation is made in the last sentence of the last paragraph of the last chapter of the &lt;a href=&#34;https://www.deeplearningbook.org/&#34;&gt;Deep Learning Book&lt;/a&gt; &lt;span class=&#34;citation&#34;&gt;(&lt;a href=&#34;#ref-GoodfellowDeepLearning2016&#34; role=&#34;doc-biblioref&#34;&gt;Goodfellow, Bengio, and Courville 2016&lt;/a&gt;)&lt;/span&gt;.&lt;a href=&#34;#fnref1&#34; class=&#34;footnote-back&#34;&gt;↩︎&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Change of template</title>
      <link>/post/2020/08/12/change-of-template/</link>
      <pubDate>Wed, 12 Aug 2020 00:00:00 +0000</pubDate>
      <guid>/post/2020/08/12/change-of-template/</guid>
      <description>
&lt;script src=&#34;index_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;I moved to a richer &lt;a href=&#34;https://sourcethemes.com/academic/&#34;&gt;Hugo template&lt;/a&gt;, partly to move things around under the hood, but importantly it gives a sounder platform for the future.&lt;/p&gt;
&lt;p&gt;However, it took many hours of frustration to get &lt;code&gt;blogdown&lt;/code&gt; and the template to nicely render &lt;span class=&#34;math inline&#34;&gt;\(\LaTeX\)&lt;/span&gt; formulas. In the end, it was very simple, although no documentation or blog posts helped: the &lt;code&gt;mathjax: true&lt;/code&gt; YAML header option needs to be changed to &lt;code&gt;math: true&lt;/code&gt;. No need to alternate between &lt;code&gt;.Rmd&lt;/code&gt; or &lt;code&gt;.md&lt;/code&gt; or &lt;code&gt;.Rmarkdown&lt;/code&gt; files, mess around with &lt;code&gt;config.toml&lt;/code&gt; or &lt;code&gt;params.toml&lt;/code&gt;, chase down unknown &lt;code&gt;pandoc&lt;/code&gt; binaries or add new &lt;code&gt;partials&lt;/code&gt; snippets.&lt;/p&gt;
&lt;p&gt;In addition, I finally figured out how to automatically generate table of contents. Insert the following snippet in the file header:&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>An Introduction to Julia</title>
      <link>/talk/hkml202004/</link>
      <pubDate>Thu, 30 Apr 2020 20:17:27 +0800</pubDate>
      <guid>/talk/hkml202004/</guid>
      <description></description>
    </item>
    
    <item>
      <title>Presentation at the Hong Kong Machine Learning meetup</title>
      <link>/post/2020/04/30/presentation-at-the-hong-kong-machine-learning-meetup/</link>
      <pubDate>Thu, 30 Apr 2020 00:00:00 +0000</pubDate>
      <guid>/post/2020/04/30/presentation-at-the-hong-kong-machine-learning-meetup/</guid>
      <description>
&lt;script src=&#34;./rmarkdown-libs/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;I recently made a presentation at the regular &lt;a href=&#34;https://www.meetup.com/Hong-Kong-Machine-Learning-Meetup&#34;&gt;Hong Kong Machine Learning meetup&lt;/a&gt; organised by &lt;a href=&#34;https://gmarti.gitlab.io/&#34;&gt;Gautier Marti&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The presentation was an introduction to &lt;a href=&#34;https://julialang.org/&#34;&gt;Julia&lt;/a&gt; and used as an example a &lt;a href=&#34;https://github.com/Emmanuel-R8/COVID-19-Julia&#34;&gt;SEIR model COVID-19&lt;/a&gt; I had written. The presentation is available on &lt;a href=&#34;https://github.com/Emmanuel-R8/Presentation_HKML_2020_04/raw/master/HKML_Julia_Xarrigan_2020_04_29.pdf&#34;&gt;Github&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It seems to have had some &lt;a href=&#34;https://www.linkedin.com/posts/hong-kong-machine-learning_bye-bye-python-hello-julia-activity-6663079161676075009-rWik&#34;&gt;effect&lt;/a&gt;!&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Forecasting the progression of COVID-19</title>
      <link>/post/2020/03/25/2020-03-25-forecasting-covid-19/</link>
      <pubDate>Wed, 25 Mar 2020 00:00:00 +0000</pubDate>
      <guid>/post/2020/03/25/2020-03-25-forecasting-covid-19/</guid>
      <description>
&lt;script src=&#34;./rmarkdown-libs/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;

&lt;div id=&#34;TOC&#34;&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#the-neherlab-covid-19-forecast-model&#34;&gt;The &lt;span&gt;Neherlab COVID-19&lt;/span&gt; forecast model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#basic-assumptions&#34;&gt;Basic assumptions&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#overview&#34;&gt;Overview&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#age-cohorts&#34;&gt;Age cohorts&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#severity&#34;&gt;Severity&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#seasonality&#34;&gt;Seasonality&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#transmission-reduction&#34;&gt;Transmission reduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#details-of-the-model&#34;&gt;Details of the model&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#population-compartments&#34;&gt;Population compartments&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#model-parameters&#34;&gt;Model parameters&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#infection&#34;&gt;Infection&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#after-infection&#34;&gt;After infection&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#load-data&#34;&gt;Load data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#initialise-parameters&#34;&gt;Initialise parameters&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#fixed-constants&#34;&gt;Fixed constants&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#infrastructure&#34;&gt;Infrastructure&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#parameter-vector&#34;&gt;Parameter vector&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#population&#34;&gt;Population&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#parameters-vector&#34;&gt;Parameters vector&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#differential-equation-solver&#34;&gt;Differential equation solver&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#bilibliography&#34;&gt;Bibliography&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;

&lt;div id=&#34;the-neherlab-covid-19-forecast-model&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;The &lt;a href=&#34;https://neherlab.org/covid19/&#34;&gt;Neherlab COVID-19&lt;/a&gt; forecast model&lt;/h1&gt;
&lt;pre class=&#34;julia&#34;&gt;&lt;code&gt;using CSV, Dates;
using DataFrames, DataFramesMeta;
using Plots, PyPlot;
using DifferentialEquations;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is more a data science post than machine learning. It was born after reading a &lt;a href=&#34;https://www.imperial.ac.uk/media/imperial-college/medicine/sph/ide/gida-fellowships/Imperial-College-COVID19-NPI-modelling-16-03-2020.pdf&#34;&gt;report&lt;/a&gt; from Imperial College London and finding a forecasting model by &lt;a href=&#34;https://neherlab.org/covid19/&#34;&gt;NeherLab&lt;/a&gt;. The numbers produced by those models can only be described as terrifying.&lt;/p&gt;
&lt;p&gt;How do those models work? How are they calibrated?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;BUT&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Remember that, whatever concerns one may have about their precision, those models are all absolutely clear that social distancing and quarantining have a massive impact on death rates. Being careful saves lives. Anybody ignoring those precautions out of excess testosterone is at risk of killing others.&lt;/p&gt;
&lt;p&gt;This post started from one of the pages of the NeherLab site describing their methodology. The work that team is achieving deserves more credit than I can give them.&lt;/p&gt;
&lt;p&gt;The NeherLab website, including the model, is entirely written in Javascript, which is difficult to understand and audit.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;basic-assumptions&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Basic assumptions&lt;/h1&gt;
&lt;p&gt;&lt;strong&gt;WARNING&lt;/strong&gt;: This is not an introduction to SEIR (and variant) compartment modelling of epidemics. For an introduction (the maths are hard to avoid), see a presentation by the &lt;a href=&#34;http://indico.ictp.it/event/7960/session/3/contribution/19/material/slides/0.pdf&#34;&gt;Swiss Tropical and Public Health Institute&lt;/a&gt;. &lt;a href=&#34;https://en.wikipedia.org/wiki/Compartmental_models_in_epidemiology&#34;&gt;Wikipedia&lt;/a&gt; is always an option.&lt;/p&gt;
&lt;div id=&#34;overview&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Overview&lt;/h3&gt;
&lt;p&gt;The model works as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;susceptible individuals are exposed/infected through contact with infectious individuals. Each infectious individual causes on average &lt;span class=&#34;math inline&#34;&gt;\(R_0\)&lt;/span&gt; secondary infections while they are infectious.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Transmissibility of the virus could have seasonal variation, which is parameterized by a “seasonal forcing” amplitude and a “peak month” (the month of most efficient transmission).&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;exposed individuals progress to a symptomatic/infectious state after an average latency&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;infectious individuals recover or progress to severe disease. The ratio of recovery to severe progression depends on age&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;severely sick individuals either recover or deteriorate and turn critical. Again, this depends on the age&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;critically ill individuals either return to regular hospital or die. Again, this depends on the age&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The individual parameters of the model can be changed to allow exploration of different scenarios.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;age-cohorts&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Age cohorts&lt;/h3&gt;
&lt;p&gt;COVID-19 is much more severe in the elderly, and the proportion of elderly in a community is therefore an important determinant of the overall burden on the health care system and of the death toll. We collected age distributions for many countries from data provided by the UN and make those available as input parameters. Furthermore, we use data provided by the epidemiology group of the &lt;a href=&#34;http://weekly.chinacdc.cn/en/article/id/e53946e2-c6c4-41e9-9a9b-fea8db1a8f51&#34;&gt;Chinese CDC&lt;/a&gt; to estimate the fraction of severe and fatal cases by age group.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;severity&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Severity&lt;/h3&gt;
&lt;p&gt;The basic model deals with 3 levels of severity: slow, moderate and fast transmission.&lt;/p&gt;
&lt;pre class=&#34;julia&#34;&gt;&lt;code&gt;# severityLevel = :slow;
severityLevel = :moderate;
# severityLevel = :fast;&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;seasonality&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Seasonality&lt;/h2&gt;
&lt;p&gt;Many respiratory viruses such as influenza, common cold viruses (including other coronaviruses) have a pronounced seasonal variation in incidence which is in part driven by climate variation through the year. We model this seasonal variation using a sinusoidal function with an annual period. This is a simplistic way to capture seasonality. Furthermore, we don’t know yet how seasonality will affect COVID-19 transmission.&lt;/p&gt;
&lt;pre class=&#34;julia&#34;&gt;&lt;code&gt;# Northern or southern hemisphere
latitude = :north;
# latitude = :tropical;
# latitude = :south;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;julia&#34;&gt;&lt;code&gt;# The time unit is days (as floating point)
# Day 0 is taken at 1 March 2020
BASE_DATE = Date(2020, 3, 1);
BASE_DAYS = 0;

function date2days(d) 
    return convert(Float64, datetime2rata(d) - datetime2rata(BASE_DATE))
end;

function days2date(d) 
    return BASE_DATE + Day(d)
end;    &lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;julia&#34;&gt;&lt;code&gt;# Default values for R_0
baseR₀ = Dict( (:north,    :slow)     =&amp;gt; 2.2, 
               (:north,    :moderate) =&amp;gt; 2.7, 
               (:north,    :fast)     =&amp;gt; 3.2, 
               (:tropical, :slow)     =&amp;gt; 2.0, 
               (:tropical, :moderate) =&amp;gt; 2.5, 
               (:tropical, :fast)     =&amp;gt; 3.0,
               (:south,    :slow)     =&amp;gt; 2.2, 
               (:south,    :moderate) =&amp;gt; 2.7, 
               (:south,    :fast)     =&amp;gt; 3.2);
&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;julia&#34;&gt;&lt;code&gt;# Peak date
peakDate = Dict( :north     =&amp;gt; date2days(Date(2020, 1, 1)), 
                 :tropical  =&amp;gt; date2days(Date(2020, 1, 1)),    # although no impact
                 :south     =&amp;gt; date2days(Date(2020, 7, 1)));&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;julia&#34;&gt;&lt;code&gt;# Seasonal forcing parameter \epsilon
ϵ = Dict( (:north,    :slow)     =&amp;gt; 0.2, 
          (:north,    :moderate) =&amp;gt; 0.2, 
          (:north,    :fast)     =&amp;gt; 0.1, 
          (:tropical, :slow)     =&amp;gt; 0.0, 
          (:tropical, :moderate) =&amp;gt; 0.0, 
          (:tropical, :fast)     =&amp;gt; 0.0,
          (:south,    :slow)     =&amp;gt; 0.2, 
          (:south,    :moderate) =&amp;gt; 0.2, 
          (:south,    :fast)     =&amp;gt; 0.1);&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;julia&#34;&gt;&lt;code&gt;# Gives R_0 at a given date
function R₀(d; r_0 = missing, latitude = :north, severity = :moderate)
    if ismissing(r_0)
        r₀ = baseR₀[(latitude, severity)]
    else
        r₀ = r_0
    end
    eps = ϵ[(latitude, severity)]
    peak = peakDate[latitude]
    
    return r₀ * (1 + eps * cos(2.0 * π * (d - peak) / 365.25))
end;&lt;/code&gt;&lt;/pre&gt;
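&lt;p&gt;As an illustrative check of the seasonal factor: at the peak date, &lt;span class=&#34;math inline&#34;&gt;\(R_0\)&lt;/span&gt; is multiplied by &lt;span class=&#34;math inline&#34;&gt;\(1 + \epsilon\)&lt;/span&gt;; half a year later, by &lt;span class=&#34;math inline&#34;&gt;\(1 - \epsilon\)&lt;/span&gt;.&lt;/p&gt;
&lt;pre class=&#34;julia&#34;&gt;&lt;code&gt;# Default northern/moderate scenario: baseR₀ = 2.7, ϵ = 0.2
R₀(peakDate[:north])                  # 2.7 * 1.2 ≈ 3.24
R₀(peakDate[:north] + 365.25 / 2)     # 2.7 * 0.8 ≈ 2.16&lt;/code&gt;&lt;/pre&gt;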
&lt;/div&gt;
&lt;div id=&#34;transmission-reduction&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Transmission reduction&lt;/h2&gt;
&lt;p&gt;The tool allows one to explore temporal variation in the reduction of transmission by infection
control measures. This is implemented as a curve through time that can be dragged by the mouse to
modify the assumed transmission. The curve is read out and used to change the transmission relative
to the baseline parameters for &lt;span class=&#34;math inline&#34;&gt;\(R_0\)&lt;/span&gt; and seasonality. Several studies attempt to estimate the
effect of different aspects of social distancing and infection control on the rate of transmission.
A report by &lt;a href=&#34;https://www.medrxiv.org/content/10.1101/2020.03.03.20030593v1&#34;&gt;Wang et al&lt;/a&gt; estimates a
step-wise reduction of &lt;span class=&#34;math inline&#34;&gt;\(R_0\)&lt;/span&gt; from above three to around 1 and then to around 0.3 due to successive
measures implemented in Wuhan. &lt;a href=&#34;https://www.pnas.org/content/116/27/13174&#34;&gt;This study&lt;/a&gt; investigates
the effect of school closures on influenza transmission.&lt;/p&gt;
&lt;p&gt;This curve is presented as a list of tuples: (days from start date, ratio). Days are counted from the start date. Between dates, the ratio is interpolated linearly; after the last date, the ratio remains constant.&lt;/p&gt;
&lt;pre class=&#34;julia&#34;&gt;&lt;code&gt;startDate = date2days(Date(2020, 3, 1));

mitigationRatio = [(0, 1.00), (30, 0.80), (60, 0.20), (150, 0.50)];

function getCurrentRatio(d; start = BASE_DAYS, schedule = mitigationRatio)
    l = length(schedule)
    
    # If l = 1, ratio will be the only one
    if l == 1 
        return schedule[1][2]
    else
        for i in 2:l
            d1 = schedule[i-1][1]
            d2 = schedule[i  ][1]
            
            if d &amp;lt; d2 
                deltaR = schedule[i][2] - schedule[i-1][2]
                return schedule[i-1][2] + deltaR * (d - d1) / (d2 - d1)
            end
        end
    
        # Last possible choice
        return schedule[l][2]
    end
end;&lt;/code&gt;&lt;/pre&gt;
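&lt;p&gt;For example, with the default schedule above, day 45 lies halfway between &lt;code&gt;(30, 0.80)&lt;/code&gt; and &lt;code&gt;(60, 0.20)&lt;/code&gt;, so the ratio interpolates linearly to 0.50:&lt;/p&gt;
&lt;pre class=&#34;julia&#34;&gt;&lt;code&gt;getCurrentRatio(45)     # 0.80 + (0.20 - 0.80) * (45 - 30) / (60 - 30) = 0.5
getCurrentRatio(75)     # 0.20 + (0.50 - 0.20) * (75 - 60) / (150 - 60) = 0.25
getCurrentRatio(400)    # past the last date, stays at the final 0.50&lt;/code&gt;&lt;/pre&gt;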
&lt;/div&gt;
&lt;div id=&#34;details-of-the-model&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Details of the model&lt;/h2&gt;
&lt;p&gt;Age strongly influences an individual’s response to the virus. The general population is subdivided into age classes, indexed by &lt;span class=&#34;math inline&#34;&gt;\(a\)&lt;/span&gt;, to allow for variable transition rates dependent upon age.&lt;/p&gt;
&lt;pre class=&#34;julia&#34;&gt;&lt;code&gt;# The population will be modeled as a single vector. 
# The vector will be a stack of several vectors, each of them represents a compartment.
# Each compartment vector has a size $nAgeGroup$ representing each age group.
# The compartments are: S, E, I, H, C, R, D, K, L

# We also track the hospital bed usage BED and ICU

# Population to compartments
function Pop2Comp(P)
    
    # To make copy/paste less prone to error 
    g = 0
    
    S = P[ g*nAgeGroup + 1: (g+1)*nAgeGroup]; g += 1
    E = P[ g*nAgeGroup + 1: (g+1)*nAgeGroup]; g += 1
    I = P[ g*nAgeGroup + 1: (g+1)*nAgeGroup]; g += 1
    J = P[ g*nAgeGroup + 1: (g+1)*nAgeGroup]; g += 1
    H = P[ g*nAgeGroup + 1: (g+1)*nAgeGroup]; g += 1
    C = P[ g*nAgeGroup + 1: (g+1)*nAgeGroup]; g += 1
    R = P[ g*nAgeGroup + 1: (g+1)*nAgeGroup]; g += 1
    D = P[ g*nAgeGroup + 1: (g+1)*nAgeGroup]; g += 1
    K = P[ g*nAgeGroup + 1: (g+1)*nAgeGroup]; g += 1
    L = P[ g*nAgeGroup + 1: (g+1)*nAgeGroup]; g += 1
    
    BED = P[ g*nAgeGroup + 1: g*nAgeGroup + 1]
    ICU = P[ g*nAgeGroup + 2: g*nAgeGroup + 2]
    
    return S, E, I, J, H, C, R, D, K, L, BED, ICU
end;&lt;/code&gt;&lt;/pre&gt;
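&lt;p&gt;As an illustrative check of the stacking convention (assuming, say, &lt;code&gt;nAgeGroup = 9&lt;/code&gt; age bands; the value is hypothetical):&lt;/p&gt;
&lt;pre class=&#34;julia&#34;&gt;&lt;code&gt;nAgeGroup = 9
P = collect(1.0:(10 * nAgeGroup + 2))     # 10 compartments + BED + ICU
S, E, I, J, H, C, R, D, K, L, BED, ICU = Pop2Comp(P)

length(S)     # 9, the first age-group block
BED           # [91.0], the first scalar tracker
ICU           # [92.0], the second scalar tracker&lt;/code&gt;&lt;/pre&gt;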
&lt;div id=&#34;population-compartments&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Population compartments&lt;/h3&gt;
&lt;p&gt;Qualitatively, the epidemic model tracks the dynamics of several sub-groups (compartments):&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;./media/post/2020-COVID/images/States.svg&#34;&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Susceptible individuals (&lt;span class=&#34;math inline&#34;&gt;\(S\)&lt;/span&gt;) are healthy and susceptible to being exposed to the virus by contact with an infected individual.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Exposed individuals (&lt;span class=&#34;math inline&#34;&gt;\(E\)&lt;/span&gt;) are infected but asymptomatic. They progress towards a symptomatic state over an average time &lt;span class=&#34;math inline&#34;&gt;\(t_l\)&lt;/span&gt;. Reports are that asymptomatic individuals are contagious. We will assume that they are proportionally less contagious than symptomatic individuals, by a factor &lt;span class=&#34;math inline&#34;&gt;\(\gamma_E\)&lt;/span&gt; applied to &lt;span class=&#34;math inline&#34;&gt;\(R_0\)&lt;/span&gt;. For the purposes of modelling, we will assume (without supporting evidence; this value will be the object of parameter estimation):&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&#34;julia&#34;&gt;&lt;code&gt;γₑ = 0.50;&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;Infected individuals (&lt;span class=&#34;math inline&#34;&gt;\(I\)&lt;/span&gt;) infect an average of &lt;span class=&#34;math inline&#34;&gt;\(R_0\)&lt;/span&gt; secondary infections. On a time-scale of &lt;span class=&#34;math inline&#34;&gt;\(t_i\)&lt;/span&gt;, infected individuals either recover or progress towards severe infection.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;From here on, the compartments differ from the NeherLab model in that we split compartments depending on the severity of the symptoms (Severe or Critical) and the location of the individual (out of the hospital infrastructure, isolated in hospital, or isolated in intensive care units). The transitions reflect the following assumptions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Transition between locations is purely a function of bed availability: as soon as beds are available, they are filled by all age groups in their respective proportions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Transition from severe to critical is assumed to be independent of the location of the patient. For severe patients, the relevance of the location is whether they are isolated or not, that is, whether they can infect susceptible individuals. In the same way that an asymptomatic individual’s contagiousness attracts a ratio &lt;span class=&#34;math inline&#34;&gt;\(\gamma_e\)&lt;/span&gt;, the other compartments attract their own multipliers. The transition from &lt;span class=&#34;math inline&#34;&gt;\(J\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(H\)&lt;/span&gt; to recovery or criticality has a time-scale of &lt;span class=&#34;math inline&#34;&gt;\(t_h\)&lt;/span&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&#34;julia&#34;&gt;&lt;code&gt;# R_0 multipliers depending on severity. Subscript matches the compartment&amp;#39;s name.
# Infected / symptomatic individuals
γᵢ=1.0;

# Severe symptoms
γⱼ=1.0;

# Critical symptoms
γₖ = 2.0;&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;Once critical, the location of a patient influences their chances of recovery. Although we will assume that the time to recovery is identical in all cases, we will assume that the risk of death doubles if a patient is in simple isolation (receiving care but without ICU equipment) and triples if out of hospital.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&#34;julia&#34;&gt;&lt;code&gt;# Fatality multiplier.

# In ICU
δᵤ = 1.0;

# In hospital
δₗ = 2.0;

# Out of hospital
δₖ = 3.0;&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The time-scale to recovery (&lt;span class=&#34;math inline&#34;&gt;\(R\)&lt;/span&gt;) or death (&lt;span class=&#34;math inline&#34;&gt;\(D\)&lt;/span&gt;) is &lt;span class=&#34;math inline&#34;&gt;\(t_u\)&lt;/span&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Recovering and recovered individuals (&lt;span class=&#34;math inline&#34;&gt;\(R\)&lt;/span&gt;) cannot be infected again. We will assume that recovering individuals are not contagious (an assumption without direct medical evidence).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;model-parameters&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Model parameters&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Many estimates of &lt;span class=&#34;math inline&#34;&gt;\(R_0\)&lt;/span&gt; are in the &lt;a href=&#34;https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7001239/&#34;&gt;range of 2-3&lt;/a&gt; with some estimates pointing to considerably &lt;a href=&#34;https://www.medrxiv.org/content/10.1101/2020.02.10.20021675v1&#34;&gt;higher values&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;The serial interval, that is, the time between subsequent infections in a transmission chain, was &lt;a href=&#34;https://www.nejm.org/doi/full/10.1056/NEJMoa2001316&#34;&gt;estimated to be 7-8 days&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;The China CDC compiled &lt;a href=&#34;http://weekly.chinacdc.cn/en/article/id/e53946e2-c6c4-41e9-9a9b-fea8db1a8f51&#34;&gt;extensive data on severity and fatality of more than 40 thousand confirmed cases&lt;/a&gt;.
In addition, we assume that a substantial fraction of infections, especially in the young, go unreported. This is encoded in the columns “Confirmed [% of total]”.&lt;/li&gt;
&lt;li&gt;Seasonal variation in transmission is common for many respiratory viruses, but the strength of seasonal forcing for COVID-19 is uncertain. For more information, see a &lt;a href=&#34;https://smw.ch/article/doi/smw.2020.20224&#34;&gt;study by us&lt;/a&gt; and one by &lt;a href=&#34;https://www.medrxiv.org/content/10.1101/2020.03.04.20031112v1&#34;&gt;Kissler et al&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The parameters of this model fall into three categories: transition time scales, age-specific parameters and a time-dependent infection rate.&lt;/p&gt;
&lt;div id=&#34;transition-time-scales&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Transition time scales&lt;/h4&gt;
&lt;p&gt;The time scales of transition from one compartment to the next: &lt;span class=&#34;math inline&#34;&gt;\(t_l\)&lt;/span&gt;, &lt;span class=&#34;math inline&#34;&gt;\(t_i\)&lt;/span&gt;, &lt;span class=&#34;math inline&#34;&gt;\(t_h\)&lt;/span&gt;, &lt;span class=&#34;math inline&#34;&gt;\(t_u\)&lt;/span&gt;.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class=&#34;math inline&#34;&gt;\(t_l\)&lt;/span&gt;: latency time from infection to infectiousness&lt;/li&gt;
&lt;li&gt;&lt;span class=&#34;math inline&#34;&gt;\(t_i\)&lt;/span&gt;: the time an individual remains infectious, after which he/she either recovers or falls severely ill&lt;/li&gt;
&lt;li&gt;&lt;span class=&#34;math inline&#34;&gt;\(t_h\)&lt;/span&gt;: the time after which a sick person either recovers or deteriorates into a critical state&lt;/li&gt;
&lt;li&gt;&lt;span class=&#34;math inline&#34;&gt;\(t_u\)&lt;/span&gt;: the time a person remains critical before dying or stabilizing (Neherlab uses &lt;span class=&#34;math inline&#34;&gt;\(t_c\)&lt;/span&gt; instead of &lt;span class=&#34;math inline&#34;&gt;\(t_u\)&lt;/span&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&#34;julia&#34;&gt;&lt;code&gt;# Time to infectiousness (written t\_l)
tₗ = Dict(  :slow     =&amp;gt; 5.0, 
            :moderate =&amp;gt; 5.0, 
            :fast     =&amp;gt; 4.0);

# Time infectious (written t\_i)
tᵢ = Dict(  :slow     =&amp;gt; 3.0, 
            :moderate =&amp;gt; 3.0, 
            :fast     =&amp;gt; 3.0);

# Time in hospital bed (not ICU)
tₕ = 4.0;

# Time in ICU 
tᵤ = 14.0;&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;age-specfic-parameters&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Age-specific parameters&lt;/h4&gt;
&lt;p&gt;The age-specific parameters &lt;span class=&#34;math inline&#34;&gt;\(z_a\)&lt;/span&gt;, &lt;span class=&#34;math inline&#34;&gt;\(m_a\)&lt;/span&gt;, &lt;span class=&#34;math inline&#34;&gt;\(c_a\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(f_a\)&lt;/span&gt; determine the relative rates of the different outcomes.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class=&#34;math inline&#34;&gt;\(z_a\)&lt;/span&gt;: a set of numbers reflecting the extent to which an age group is susceptible to initial contagion. Note that NeherLab denotes this vector by &lt;span class=&#34;math inline&#34;&gt;\(I_a\)&lt;/span&gt;, which is easily confused with the compartment evolution &lt;span class=&#34;math inline&#34;&gt;\(I_a(t)\)&lt;/span&gt; notation. (This age-specific scaling somewhat defeats the purpose of a single &lt;span class=&#34;math inline&#34;&gt;\(R_0\)&lt;/span&gt;.)&lt;/li&gt;
&lt;li&gt;&lt;span class=&#34;math inline&#34;&gt;\(m_a\)&lt;/span&gt;: fraction of infectious individuals who become severe (&lt;strong&gt;Hospitalisation Rate&lt;/strong&gt;) or recover immediately (&lt;strong&gt;Recovery Rate&lt;/strong&gt;)&lt;/li&gt;
&lt;li&gt;&lt;span class=&#34;math inline&#34;&gt;\(c_a\)&lt;/span&gt;: fraction of severe cases that turn critical (&lt;strong&gt;Critical Rate&lt;/strong&gt;) or can leave hospital (&lt;strong&gt;Discharge Rate&lt;/strong&gt;)&lt;/li&gt;
&lt;li&gt;&lt;span class=&#34;math inline&#34;&gt;\(f_a\)&lt;/span&gt;: fraction of critical cases that are fatal (&lt;strong&gt;Death Rate&lt;/strong&gt;) or recover (&lt;strong&gt;Stabilisation Rate&lt;/strong&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&#34;julia&#34;&gt;&lt;code&gt;AgeGroup = [&amp;quot;0-9&amp;quot;, &amp;quot;10-19&amp;quot;, &amp;quot;20-29&amp;quot;, &amp;quot;30-39&amp;quot;, &amp;quot;40-49&amp;quot;, &amp;quot;50-59&amp;quot;, &amp;quot;60-69&amp;quot;, &amp;quot;70-79&amp;quot;, &amp;quot;80+&amp;quot;];
zₐ =       [0.05,   0.05,   0.10,    0.15,    0.20,    0.25,    0.30,    0.40,    0.50];
mₐ =       [0.01,   0.03,   0.03,    0.03,    0.06,    0.10,    0.25,    0.35,    0.50];
cₐ =       [0.05,   0.10,   0.10,    0.15,    0.20,    0.25,    0.35,    0.45,    0.55];
fₐ =       [0.30,   0.30,   0.30,    0.30,    0.30,    0.40,    0.40,    0.50,    0.50];

nAgeGroup = length(AgeGroup);
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;infrastruture&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Infrastructure&lt;/h4&gt;
&lt;p&gt;The number of beds available is assumed to be a fixed resource over time. The number of hospital (resp. ICU) beds in use will be denoted &lt;span class=&#34;math inline&#34;&gt;\(\mathscr{H}(t)\)&lt;/span&gt; (resp. &lt;span class=&#34;math inline&#34;&gt;\(\mathscr{U}(t)\)&lt;/span&gt;) up to a maximum of &lt;span class=&#34;math inline&#34;&gt;\(\mathscr{H}_{max}\)&lt;/span&gt; (resp. &lt;span class=&#34;math inline&#34;&gt;\(\mathscr{U}_{max}\)&lt;/span&gt;).&lt;/p&gt;
&lt;p&gt;Although the initial infections arrived via domestic and international travellers (apart from the initial infections in Wuhan, obviously), we will assume no net flow of population in and out of the country of interest.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;infection&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Infection&lt;/h3&gt;
&lt;div id=&#34;susceptible&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Susceptible:&lt;/h4&gt;
&lt;p&gt;The &lt;em&gt;base&lt;/em&gt; rate of contagion is denoted &lt;span class=&#34;math inline&#34;&gt;\(R_0\)&lt;/span&gt;. The actual rate varies with time (to reflect the seasons and the impact of temperature on the virus) and with the effectiveness of mitigation measures such as social distancing. Separately, each age group has a different sensitivity to infection.&lt;/p&gt;
&lt;p&gt;The infection rate &lt;span class=&#34;math inline&#34;&gt;\(\beta_a(t)\)&lt;/span&gt; is age- and time-dependent. It is given by:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[\beta_a(t) = z_a M(t) R_0 \left( 1+\varepsilon \cos \left( 2\pi \frac{t-t_{max}}{t_{year}} \right) \right) \]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;where:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class=&#34;math inline&#34;&gt;\(z_a\)&lt;/span&gt; is the degree to which a particular age group is sensitive to initial infection. It reflects biological sensitivity as well as the degree to which the group is isolated from the rest of the population (denoted &lt;span class=&#34;math inline&#34;&gt;\(I_a\)&lt;/span&gt; in NeherLab).&lt;/li&gt;
&lt;li&gt;&lt;span class=&#34;math inline&#34;&gt;\(M(t)\)&lt;/span&gt; is a time-dependent ratio reflecting the effectiveness of mitigation measures.&lt;/li&gt;
&lt;li&gt;&lt;span class=&#34;math inline&#34;&gt;\(\varepsilon\)&lt;/span&gt; is the amplitude of the seasonal variation in transmissibility.&lt;/li&gt;
&lt;li&gt;&lt;span class=&#34;math inline&#34;&gt;\(t_{max}\)&lt;/span&gt; is the time of the year of peak transmission, and &lt;span class=&#34;math inline&#34;&gt;\(t_{year}\)&lt;/span&gt; is the one-year period of the seasonal oscillation.&lt;/li&gt;
&lt;/ul&gt;
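&lt;p&gt;As a minimal sketch (assuming time measured in days, a one-year seasonal period, and a user-supplied mitigation function; the name &lt;code&gt;infection_rate&lt;/code&gt; and the default values are illustrative, not those of the solver below):&lt;/p&gt;
&lt;pre class=&#34;julia&#34;&gt;&lt;code&gt;# Sketch of the age-specific infection rate.
# Assumptions: t in days, one-year seasonal period, mitigation(t) a scalar ratio.
function infection_rate(t, zₐ, mitigation, R₀; ε = 0.2, t_max = 0.0)
    seasonal = 1 + ε * cos(2π * (t - t_max) / 365.0)
    return zₐ .* mitigation(t) .* R₀ .* seasonal
end&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;At the seasonal peak and without mitigation, the rate reaches its maximum of &lt;span class=&#34;math inline&#34;&gt;\(z_a R_0 (1+\varepsilon)\)&lt;/span&gt; for each age group.&lt;/p&gt;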
&lt;p&gt;Susceptible individuals are exposed to contagious individuals from several compartments:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;asymptomatic infected: &lt;span class=&#34;math display&#34;&gt;\[\gamma_e \beta_a(t) E_a(t)\]&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;symptomatic infected: &lt;span class=&#34;math display&#34;&gt;\[\gamma_i \beta_a(t) I_a(t)\]&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;severe not in hospital: &lt;span class=&#34;math display&#34;&gt;\[\gamma_j \beta_a(t) J_a(t)\]&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;critical not in hospital: &lt;span class=&#34;math display&#34;&gt;\[\gamma_k \beta_a(t) K_a(t)\]&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The sum of these terms gives the flow from susceptible (&lt;span class=&#34;math inline&#34;&gt;\(S\)&lt;/span&gt;) to exposed (&lt;span class=&#34;math inline&#34;&gt;\(E\)&lt;/span&gt;) individuals:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
\begin{aligned}
S2E_a(t) &amp;amp; = \gamma_e \beta_a(t) E_a(t) + \gamma_i \beta_a(t) I_a(t) + \gamma_j \beta_a(t) J_a(t) + \gamma_k \beta_a(t) K_a(t) \\ 
S2E_a(t) &amp;amp; = \beta_a(t) \left( \gamma_e  E_a(t) + \gamma_i I_a(t) + \gamma_j J_a(t) + \gamma_k K_a(t) \right) \\
E2S_a(t) &amp;amp; = -S2E_a(t) \\
\end{aligned}
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;and therefore:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
\frac{dS_{a}(t)}{dt} = - S2E_a(t) = E2S_a(t)
\]&lt;/span&gt;&lt;/p&gt;
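&lt;p&gt;This flow can be sketched as a single vectorised expression (the function name is illustrative, and the γ defaults are the values assumed earlier):&lt;/p&gt;
&lt;pre class=&#34;julia&#34;&gt;&lt;code&gt;# Sketch of the S-to-E flow for all age groups at once.
# β, E, I, J, K are vectors over age groups; the γ multipliers are scalars.
S2E(β, E, I, J, K; γₑ = 0.5, γᵢ = 1.0, γⱼ = 1.0, γₖ = 2.0) =
    β .* (γₑ .* E .+ γᵢ .* I .+ γⱼ .* J .+ γₖ .* K)&lt;/code&gt;&lt;/pre&gt;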
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;after-infection&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;After infection&lt;/h3&gt;
&lt;p&gt;Quantitatively, the model expresses how many individuals transfer from one situation/compartment to another. Flows from compartment X to Y are written as &lt;span class=&#34;math inline&#34;&gt;\(X2Y\)&lt;/span&gt; (obviously &lt;span class=&#34;math inline&#34;&gt;\(X2Y = - Y2X\)&lt;/span&gt;).&lt;/p&gt;
&lt;p&gt;Note that the compartments are split into age groups.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;./media/post/2020-COVID/images/Transitions.svg&#34; alt=&#34;&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Transitions between compartments&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;epidemiology&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Epidemiology&lt;/h4&gt;
&lt;p&gt;Instead of expressing the sum of the flows at each node, it is easier to express each arrow separately and sum them afterwards. For example, the arrow from &lt;span class=&#34;math inline&#34;&gt;\(J\)&lt;/span&gt; to &lt;span class=&#34;math inline&#34;&gt;\(K\)&lt;/span&gt; is:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[JK_a(t) = \frac{c_a}{t_h} J_a(t)\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;with a positive flow following the direction of the arrow.&lt;/p&gt;
&lt;p&gt;In Julia, this becomes:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;        JK = cₐ .* J / tₕ&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;where &lt;code&gt;J&lt;/code&gt; is a vector indexed by age group and &lt;code&gt;.*&lt;/code&gt; is element-wise multiplication.&lt;/p&gt;
&lt;p&gt;After defining the arrows &lt;span class=&#34;math inline&#34;&gt;\(IJ\)&lt;/span&gt;, &lt;span class=&#34;math inline&#34;&gt;\(JK\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(JH\)&lt;/span&gt;, the change in &lt;span class=&#34;math inline&#34;&gt;\(J\)&lt;/span&gt; will simply be:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;        dJ = IJ - JK - JH&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;bed-transfers&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Bed transfers&lt;/h4&gt;
&lt;p&gt;Individuals are transferred into hospital beds then into ICU beds in the order indicated by the red numbers.&lt;/p&gt;
&lt;p&gt;Critical patients already in hospital move into ICU beds as spots become available. The freed hospital beds are first made available to critical patients out of hospital (&lt;span class=&#34;math inline&#34;&gt;\(K\)&lt;/span&gt;). Then, any remaining free beds receive patients in severe condition.&lt;/p&gt;
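&lt;p&gt;A single transfer step can be sketched as follows (the function name is illustrative; the solver below inlines the same logic, and the added guard against an empty compartment is an assumption, since in the solver the compartments are initialised at 1):&lt;/p&gt;
&lt;pre class=&#34;julia&#34;&gt;&lt;code&gt;# Sketch of one proportional transfer step: move as many patients as possible
# from a waiting compartment (a vector over age groups) into the free beds,
# keeping the age proportions. Guards against an empty compartment.
function allocate_beds(waiting, free)
    total = sum(waiting)
    if iszero(total)
        return zero(waiting)
    end
    transfer = min(total, free)
    return transfer / total .* waiting
end&lt;/code&gt;&lt;/pre&gt;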
&lt;/div&gt;
&lt;div id=&#34;safeguards&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Safeguards:&lt;/h4&gt;
&lt;p&gt;Note the need to enforce a few common-sense rules:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;No compartment can have a negative number of people.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The total population figure should remain unchanged. This is done by adjusting the number of susceptible individuals.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Careful accounting of the use of a fixed number of hospital beds.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The number of infected people should always be above the number of reported cases.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
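&lt;p&gt;The first two rules can be checked with a small helper (a sketch; the solver below does not call it, and the name, vector layout and tolerance are assumptions):&lt;/p&gt;
&lt;pre class=&#34;julia&#34;&gt;&lt;code&gt;# Sketch of a sanity check on a population vector P, assumed to be the
# concatenation of all compartments followed by the two bed counters.
# Returns whether no compartment is negative and the total is conserved.
function check_invariants(P, total_population; atol = 1.0)
    people = P[1:end-2]        # drop the BED and ICU counters
    no_negative = minimum(people) ≥ 0.0
    conserved = abs(sum(people) - total_population) ≤ atol
    return no_negative, conserved
end&lt;/code&gt;&lt;/pre&gt;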
&lt;pre class=&#34;julia&#34;&gt;&lt;code&gt;# Helper function ensuring that a change never takes a compartment
# below 0.1 (to avoid rounding errors around 0)
function ensurePositive(d,s)
    return max.(d .+ s, 0.1) .- s
end;

    
    
# The dynamics of the epidemic: a function that mutates its argument dP in place,
# with the signature expected by the ODE solver.
# Don&amp;#39;t pay too much attention to the commented-out debug prints.

function epiDynamics!(dP, P, params, t)
    
    S, E, I, J, H, C, R, D, K, L, BED, ICU = Pop2Comp(P)
    
    BED = BED[1]
    ICU = ICU[1]
    
    r₀, tₗ, tᵢ, tₕ, tᵤ, γₑ, γᵢ, γⱼ, γₖ, δₖ, δₗ, δᵤ, startDays = params 
    
    
    ####################################
    # Arrows reflecting epidemiology - Check signs (just in case)
    EI = ones(nAgeGroup) .* E / tₗ;  EI = max.(EI, 0.0); IE = -EI; 
    IJ = mₐ              .* I / tᵢ;  IJ = max.(IJ, 0.0); JI = -IJ
    JK = cₐ              .* J / tₕ;  JK = max.(JK, 0.0); KJ = -JK
    HL = cₐ              .* H / tₕ;  HL = max.(HL, 0.0); LH = -HL
    
    # Recovery arrows
    IR = (1 .- mₐ)       .* I / tᵢ;  IR = max.(IR, 0.0); RI = -IR
    JR = (1 .- cₐ)       .* J / tₕ;  JR = max.(JR, 0.0); RJ = -JR
    HR = (1 .- cₐ)       .* H / tₕ;  HR = max.(HR, 0.0); RH = -HR
    KR = (1 .- δₖ .* fₐ) .* K / tᵤ;  KR = max.(KR, 0.0); RK = -KR
    LR = (1 .- δₗ .* fₐ) .* L / tᵤ;  LR = max.(LR, 0.0); RL = -LR
    CR = (1 .- δᵤ .* fₐ) .* C / tᵤ;  CR = max.(CR, 0.0); RC = -CR
    
    # Deaths
    KD = δₖ .* fₐ        .* K / tᵤ;  KD = max.(KD, 0.0); DK = -KD
    LD = δₗ .* fₐ        .* L / tᵤ;  LD = max.(LD, 0.0); DL = -LD
    CD = δᵤ .* fₐ        .* C / tᵤ;  CD = max.(CD, 0.0); DC = -CD
    
    
    ####################################
    # Bed transfers
    
    ####### Step 1:
    # Decrease in ICU bed usage (recall that CD and CR are vectors over the age groups)
    dICU = - (sum(CD) + sum(CR));                 dICU = ensurePositive(dICU, ICU)
    
    # ICU beds available
    ICU_free = ICU_max - (ICU + dICU)
    
    # Move as many patients as possible from $L$ to $C$ in proportion of each group
    ICU_transfer = min(sum(L), ICU_free)
    LC = ICU_transfer / sum(L) .* L;    CL = -LC
    
    # Overall change in ICU bed becomes
    dICU = dICU + ICU_transfer;                   dICU = ensurePositive(dICU, ICU)
    
    # And some normal beds are freed
    dBED = -ICU_transfer;                         dBED = ensurePositive(dBED, BED)
    #print(&amp;quot; dBed step 1 &amp;quot;); println(floor.(sum(dBED)))

    ####### Step 2:
    # Beds available
    BED_free = BED_max - (BED + dBED)
    
    # Move as many patients as possible from $K$ to $L$ in proportion of each group
    BED_transfer = min(sum(K), BED_free)
    KL = BED_transfer / sum(K) .* K;   LK = -KL
    
    # Overall change in normal bed becomes
    dBED = dBED + BED_transfer;                   dBED = ensurePositive(dBED, BED)
    #print(&amp;quot; dBed step 2 &amp;quot;); println(floor.(sum(dBED)))
    

    ####### Step 3:
    # Beds available
    BED_free = BED_max - (BED + dBED)
    
    # Move as many patients as possible from $J$ to $H$ in proportion of each group
    BED_transfer = min(sum(J), BED_free)
    JH = BED_transfer / sum(J) .* J;   HJ = -JH 
    
    # Overall change in normal beds becomes
    dBED = dBED + BED_transfer;                   dBED = ensurePositive(dBED, BED)
    #print(&amp;quot; dBed step 3 &amp;quot;); println(floor.(sum(dBED)))
    

    ####################################
    # Sum of all flows + Check never negative compartment
    
    # Susceptible    
    # Calculation of β
    β = getCurrentRatio(t; start = BASE_DAYS, schedule = mitigationRatio) .* zₐ .* 
        R₀(t; r_0 = r₀, latitude = Latitude, severity = SeverityLevel)
    
    #print(&amp;quot;r₀&amp;quot;); println(r₀); println(&amp;quot;R₀&amp;quot;); 
    #println(R₀(t; r_0 = r₀, latitude = Latitude, severity = SeverityLevel)); print()
    
    dS = -β .* (γₑ.*E + γᵢ.*I + γⱼ.*J + γₖ.*K);   dS = min.(-0.01, dS); dS = ensurePositive(dS, S)
    
    #print(&amp;quot;dS&amp;quot;); println(floor.(dS)); println(); 
    
    # Exposed
    dE = -dS + IE;                                dE = ensurePositive(dE, E)
    
    # Infected. 
    dI = EI + JI + RI;                            dI = ensurePositive(dI, I)
    
    # Infected no hospital
    dJ = IJ + HJ + KJ + RJ;                       dJ = ensurePositive(dJ, J)
    
    #print(&amp;quot;I &amp;quot;); println(floor.(IJ)); print(&amp;quot;H &amp;quot;); println(floor.(HJ))
    #print(&amp;quot;K &amp;quot;); println(floor.(KJ)); print(&amp;quot;R &amp;quot;); println(floor.(RJ))
    
    # Infected in hospital
    dH = JH + LH + RH ;                           dH = ensurePositive(dH, H)
    
    # Critical no hospital
    dK = JK + LK + DK + RK;                       dK = ensurePositive(dK, K)
    
    # Critical in hospital
    dL = KL + HL + CL + DL + RL;                  dL = ensurePositive(dL, L)
    
    # Critical in ICU
    dC = LC + DC + RC;                            dC = ensurePositive(dC, C)
    
    # Recovery (can only increase)
    dR = IR + JR + HR + KR + LR + CR;             dR = max.(dR, 0.01)
    
    # Dead (can only increase)
    dD = KD + LD + CD;                            dD = max.(dD, 0.01)
    
    # Vector change of population and update in place
    result = vcat(dS, dE, dI, dJ, dH, dC, dR, dD, dK, dL, [dBED], [dICU])
    #print(&amp;quot; dS &amp;quot;); print(floor.(sum(dS))); print(&amp;quot; dE &amp;quot;); print(floor.(sum(dE))); 
    #print(&amp;quot; dI &amp;quot;); print(floor.(sum(dI))); print(&amp;quot; dJ &amp;quot;); println(floor.(sum(dJ))); 
    #print(&amp;quot; dH &amp;quot;); print(floor.(sum(dH))); print(&amp;quot; dC &amp;quot;); print(floor.(sum(dC))); 
    #print(&amp;quot; dR &amp;quot;); print(floor.(sum(dR))); print(&amp;quot; dD &amp;quot;); print(floor.(sum(dD))); 
    #print(&amp;quot; dK &amp;quot;); print(floor.(sum(dK))); print(&amp;quot; dL &amp;quot;); println(floor.(sum(dL))); println(); 
    for i = 1:length(result)
        dP[i] = result[i]
    end

end;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;load-data&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Load data&lt;/h1&gt;
&lt;p&gt;The data comes from Neherlab’s data repository on &lt;a href=&#34;https://github.com/neherlab/covid19_scenarios_data&#34;&gt;Github&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We will use Italy as an example.&lt;/p&gt;
&lt;pre class=&#34;julia&#34;&gt;&lt;code&gt;country = &amp;quot;Italy&amp;quot;;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This file contains a record of cases day by day.&lt;/p&gt;
&lt;pre class=&#34;julia&#34;&gt;&lt;code&gt;cases = DataFrame(CSV.read(&amp;quot;data/World.tsv&amp;quot;, header = 4));
cases = @where(cases, occursin.(country, :location));
sort!(cases, :time);

# Add a time column in the same format as the other dataframes
cases = hcat(DataFrame(t = date2days.(cases[:, :time])), cases);

# Remove any row with no recorded death
cases = cases[cases.deaths .&amp;gt; 0, :];

last(cases[:, [:time, :cases, :deaths]], 6)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The last rows show the number of cases and deaths up to the last date in the dataset.&lt;/p&gt;
&lt;p&gt;Plotting the number of deaths shows an almost exponential increase (a straight line on a logarithmic scale).&lt;/p&gt;
&lt;pre class=&#34;julia&#34;&gt;&lt;code&gt;using PyPlot;

pyplot();
clf();
ioff();
plot_x = cases.time;
plot_y = cases.deaths;

fig, ax = PyPlot.subplots();

ax.plot(plot_x, plot_y, &amp;quot;ro&amp;quot;);
ax.fill_between(plot_x, plot_y, color=&amp;quot;red&amp;quot;, linewidth=2, label=&amp;quot;Deaths&amp;quot;, alpha=0.3);
ax.legend(loc=&amp;quot;upper left&amp;quot;);
ax.set_xlabel(&amp;quot;time&amp;quot;);
ax.set_ylabel(&amp;quot;Deaths&amp;quot;);
ax.set_yscale(&amp;quot;log&amp;quot;);

PyPlot.savefig(&amp;quot;images/Deaths.png&amp;quot;);&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;./media/post/2020-COVID/images/Deaths.png&#34; alt=&#34;&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Deaths&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;This file contains ICU beds figures.&lt;/p&gt;
&lt;pre class=&#34;julia&#34;&gt;&lt;code&gt;ICU_capacity = select(CSV.read(&amp;quot;data/ICU_capacity.tsv&amp;quot;; delim = &amp;quot;\t&amp;quot;), :country, :CriticalCare);
ICU_capacity = @where(ICU_capacity, occursin.(country, :country))[!, :CriticalCare][1];
ICU_capacity = convert(Float64, ICU_capacity);&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Country codes are necessary to load the next file.&lt;/p&gt;
&lt;pre class=&#34;julia&#34;&gt;&lt;code&gt;country_codes = select(CSV.read(&amp;quot;data/country_codes.csv&amp;quot;), :name, Symbol(&amp;quot;alpha-3&amp;quot;));
country_codes = @where(country_codes, occursin.(country, :name));
countryShort = country_codes[:, Symbol(&amp;quot;alpha-3&amp;quot;)][1];&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This file contains hospital beds figures.&lt;/p&gt;
&lt;pre class=&#34;julia&#34;&gt;&lt;code&gt;hospital_capacity = select(CSV.read(&amp;quot;data/hospital_capacity.csv&amp;quot;, 
                                    types = Dict(:COUNTRY =&amp;gt; String), limit = 1267), :COUNTRY, :YEAR, :VALUE);
hospital_capacity = @where(hospital_capacity, Not(ismissing.(:COUNTRY)));
hospital_capacity = last(@where(hospital_capacity, occursin.(countryShort, :COUNTRY)), 1)[!, :VALUE][1];
hospital_capacity = convert(Float64, hospital_capacity);&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This file contains a distribution of the population in age groups.&lt;/p&gt;
&lt;pre class=&#34;julia&#34;&gt;&lt;code&gt;age_distribution = CSV.read(&amp;quot;data/country_age_distribution.csv&amp;quot;);
age_distribution = @where(age_distribution, occursin.(country, :_key))[!, 2:10];

# Convert to simple matrix
age_distribution = Matrix(age_distribution);
show(age_distribution);&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;initialise-parameters&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Initialise parameters&lt;/h1&gt;
&lt;div id=&#34;fixed-constants&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Fixed constants&lt;/h2&gt;
&lt;pre class=&#34;julia&#34;&gt;&lt;code&gt;SeverityLevel = :moderate;
Latitude = :north;

StartDate = Date(2020, 3, 1);
StartDays = date2days(StartDate);

EndDate = Date(2020, 9, 1);
EndDays = date2days(EndDate);

tSpan = (StartDays, EndDays);&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;infrastructure&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Infrastructure&lt;/h2&gt;
&lt;pre class=&#34;julia&#34;&gt;&lt;code&gt;BED_max = hospital_capacity&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;julia&#34;&gt;&lt;code&gt;ICU_max = ICU_capacity&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;parameter-vector&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Parameter vector&lt;/h2&gt;
&lt;pre class=&#34;julia&#34;&gt;&lt;code&gt;# r₀, tₗ, tᵢ, tₕ, tᵤ, γᵢ, γⱼ, γₖ, δₖ, δₗ, δᵤ, startDate = params 

parameters = [  baseR₀[Latitude, SeverityLevel], 
                tₗ[SeverityLevel], tᵢ[SeverityLevel], tₕ, tᵤ, 
                γₑ, γᵢ, γⱼ, γₖ, 
                δₖ, δₗ, δᵤ, 
                StartDays];&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;population&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Population&lt;/h2&gt;
&lt;pre class=&#34;julia&#34;&gt;&lt;code&gt;Age_Pyramid = transpose(age_distribution);
Age_Pyramid_frac = Age_Pyramid / sum(Age_Pyramid);&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We do not know the actual number of infections at the start of the model. We only know the confirmed cases (almost certainly far below the number of actual infections).&lt;/p&gt;
&lt;p&gt;We assume that actual infections are 3 times more numerous.&lt;/p&gt;
&lt;pre class=&#34;julia&#34;&gt;&lt;code&gt;DeathsAtStart = @where(cases, :time .== StartDate)[!, :deaths][1];
ConfirmedAtStart = @where(cases, :time .== StartDate)[!, :cases][1];
EstimatedAtStart = 3.0 * ConfirmedAtStart;&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;parameters-vector&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Parameters vector&lt;/h2&gt;
&lt;pre class=&#34;julia&#34;&gt;&lt;code&gt;# Note that values are initialised at 1 to avoid division by zero

S0 = Age_Pyramid;
E0 = ones(nAgeGroup);
I0 = EstimatedAtStart * Age_Pyramid_frac;
J0 = ones(nAgeGroup);
H0 = ones(nAgeGroup);
C0 = ones(nAgeGroup);
R0 = ones(nAgeGroup);
D0 = DeathsAtStart * Age_Pyramid_frac;
K0 = ones(nAgeGroup);
L0 = ones(nAgeGroup);

# Everybody confirmed is in hospital
BED = [ConfirmedAtStart];
ICU = [1.0];

P0 = vcat(S0, E0, I0, J0, H0, C0, R0, D0, K0, L0, BED, ICU);
dP = 0 * P0;&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;differential-equation-solver&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Differential equation solver&lt;/h1&gt;
&lt;pre class=&#34;julia&#34;&gt;&lt;code&gt;model = ODEProblem(epiDynamics!, P0, tSpan, parameters);&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;julia&#34;&gt;&lt;code&gt;# Note: progress steps might be too quick to see!
sol = solve(model, Tsit5(); progress = false, progress_steps = 5);&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;julia&#34;&gt;&lt;code&gt;# The solutions are returned as an Array of Arrays: 
#  - it is a vector whose length is the number of timesteps
#  - each element of the vector is a vector of all the variables
nSteps = length(sol.t);
nVars  = length(sol.u[1]);

# Empty dataframe to contain all the numbers
# (When running a loop at top-level, the global keyword is necessary to modify global variables.)
solDF = zeros((nSteps, nVars));
for i = 1:nSteps
    global solDF
    solDF[i, :] = sol.u[i]
end;

solDF = hcat(DataFrame(t = sol.t), DataFrame(solDF));

# Let&amp;#39;s clean the names
compartments =  [&amp;quot;S&amp;quot;, &amp;quot;E&amp;quot;, &amp;quot;I&amp;quot;, &amp;quot;J&amp;quot;, &amp;quot;H&amp;quot;, &amp;quot;C&amp;quot;, &amp;quot;R&amp;quot;, &amp;quot;D&amp;quot;, &amp;quot;K&amp;quot;, &amp;quot;L&amp;quot;];
solnames = vcat([:t], [Symbol(c * repr(n)) for c in compartments for n in 0:(nAgeGroup-1)], [:Beds], [:ICU]);
rename!(solDF, solnames);
&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;julia&#34;&gt;&lt;code&gt;# Create sums for each compartment
# (Consider solDF[!, r&amp;quot;S&amp;quot;])
# 
for c in compartments
    col =  [Symbol(c * repr(n)) for n in 0:(nAgeGroup-1)]
    s = DataFrame(C = sum.(eachrow(solDF[:, col])))
    rename!(s, [Symbol(c)])
        
    global solDF = hcat(solDF, s)
end;

# The D column gives the final number of dead.
println(last(solDF[:, Symbol.(compartments)], 5))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The last row shows the final sizes of the various compartments.&lt;/p&gt;
&lt;p&gt;Next is the evolution of deaths and recoveries over time.&lt;/p&gt;
&lt;pre class=&#34;julia&#34;&gt;&lt;code&gt;pyplot();
clf();
ioff();

fig, ax = PyPlot.subplots();

ax.plot(solDF.t, solDF.D, label = &amp;quot;Forecast&amp;quot;);
ax.plot(solDF.t, solDF.R, label = &amp;quot;Recoveries&amp;quot;);
ax.plot(cases.t, cases.deaths, &amp;quot;ro&amp;quot;, label = &amp;quot;Actual&amp;quot;, alpha = 0.3);

ax.legend(loc=&amp;quot;lower right&amp;quot;);
ax.set_xlabel(&amp;quot;time&amp;quot;);
ax.set_ylabel(&amp;quot;Individuals&amp;quot;);
ax.set_yscale(&amp;quot;log&amp;quot;);

PyPlot.savefig(&amp;quot;images/DeathsForecast.png&amp;quot;);
&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;./media/post/2020-COVID/images/DeathsForecast.png&#34; alt=&#34;&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Increase in Recoveries and Deaths over time&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;It is clear that the model forecasts faster growth than reality. Parameter estimation is necessary.&lt;/p&gt;
&lt;pre class=&#34;julia&#34;&gt;&lt;code&gt;pyplot();
clf();
ioff();

fig, ax = PyPlot.subplots();

ax.plot(solDF.t, solDF.Beds, label = &amp;quot;Beds&amp;quot;);
ax.plot(solDF.t, solDF.ICU, label = &amp;quot;ICU&amp;quot;);

ax.legend(loc=&amp;quot;lower right&amp;quot;);
ax.set_xlabel(&amp;quot;time&amp;quot;);
ax.set_ylabel(&amp;quot;Number of beds&amp;quot;);
ax.set_yscale(&amp;quot;linear&amp;quot;);

PyPlot.savefig(&amp;quot;images/BedUsage.png&amp;quot;);
&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;./media/post/2020-COVID/images/BedUsage.png&#34; alt=&#34;&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Bed Usage over time&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;It is clear that the required number of beds quickly hits the available capacity.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;bilibliography&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Bibliography&lt;/h1&gt;
&lt;p&gt;The Novel Coronavirus Pneumonia Emergency Response Epidemiology Team. The Epidemiological Characteristics of an Outbreak of 2019 Novel Coronavirus Diseases (COVID-19) — China, 2020[J]. China CDC Weekly, 2020, 2(8): 113-122. &lt;a href=&#34;http://weekly.chinacdc.cn/en/article/id/e53946e2-c6c4-41e9-9a9b-fea8db1a8f51&#34;&gt;LINK&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>RNN Compressive Memory Part 1: A high level introduction.</title>
      <link>/post/2020/03/07/rnn-compressive-memory-part-1/</link>
      <pubDate>Sat, 07 Mar 2020 00:00:00 +0000</pubDate>
      <guid>/post/2020/03/07/rnn-compressive-memory-part-1/</guid>
      <description>
&lt;script src=&#34;./rmarkdown-libs/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;

&lt;div id=&#34;TOC&#34;&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#recurrent-neural-networks-rnn&#34;&gt;Recurrent Neural Networks (&lt;em&gt;RNN&lt;/em&gt;)&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#from-simple-rnns-to-lstms&#34;&gt;From simple RNNs to LSTMs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#longshort-term-memory-rnns&#34;&gt;Long/Short Term Memory RNNs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#attention&#34;&gt;Attention&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#beyond-lstm-transformers&#34;&gt;Beyond LSTM: Transformers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#transformer-xl&#34;&gt;Transformer-XL&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#compressive-transformers&#34;&gt;Compressive Transformers&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#introduction&#34;&gt;Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#compression-scheme&#34;&gt;Compression scheme&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#compression-training&#34;&gt;Compression training&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#summary&#34;&gt;Summary&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;

&lt;p&gt;This is the first post of a series dedicated to the Compressive Memory of Recurrent Neural Networks. It is inspired by a recent DeepMind paper published in November 2019 on &lt;a href=&#34;https://arxiv.org/abs/1911.05507&#34;&gt;arXiv&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Currently, the ambition of the series is to follow this plan:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Part 1 (here): A high level introduction to Compressive Memory mechanics starting from basic RNNs;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href=&#34;../rnn-compressive-memory-part-1/index.html&#34;&gt;Part 2&lt;/a&gt;: a detailed explanation of the TransformerXL;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Part 3: an implementation using PyTorch (soon);&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Part 4: finally, its application to time series (soon).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Most likely, this will be fine-tuned over time.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Big thanks to &lt;a href=&#34;https://gmarti.gitlab.io/&#34;&gt;Gautier Marti&lt;/a&gt; and &lt;a href=&#34;http://zoonek.free.fr/blosxom/&#34;&gt;Vincent Zoonekynd&lt;/a&gt; for their suggestions and proof-reading!&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update:&lt;/strong&gt; Additional diagrams (14 March 2020)&lt;/p&gt;
&lt;div id=&#34;recurrent-neural-networks-rnn&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Recurrent Neural Networks (&lt;em&gt;RNN&lt;/em&gt;)&lt;/h2&gt;
&lt;div id=&#34;from-simple-rnns-to-lstms&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;From simple RNNs to LSTMs&lt;/h3&gt;
&lt;p&gt;Traditional neural networks were developed to train/run on information provided in a single step in a consistent format (e.g. images with identical resolution). Conceptually, a neural network could similarly be trained on sequential information (e.g. a video as a series of images) by treating it as a single sample, but that would require (1) being trained on the full sequence (e.g. an entire video), and (2) being able to cope with information of variable length (i.e. short vs. long videos). (1) is computationally intractable, and (2) means that units analysing later parts of the video would not receive as much training as earlier units, when ideally they should all share the same amount of training.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;assets/Recurrent_neural_network_unfold.svg&#34; alt=&#34;&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;&lt;strong&gt;Basic RNN&lt;/strong&gt; (source: &lt;em&gt;Wikipedia&lt;/em&gt;)&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;The original RNN addresses those issues:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Sequences are chopped into small, consistent sub-sequences (say, a &lt;em&gt;segment&lt;/em&gt; of 10 images, or a group of 20 words).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;An RNN layer is a group of blocks (or &lt;em&gt;cells&lt;/em&gt;), each receiving a single element of the segment as input. Note that here &lt;em&gt;layer&lt;/em&gt; does not have the traditional meaning of a layer of neural units fully connected to a previous layer of units. It is a layer of RNN cells. Within each cell, quite a few things happen, including using layers of neural units. From here on, a &lt;em&gt;layer&lt;/em&gt; will refer to an &lt;em&gt;RNN layer&lt;/em&gt; and not a layer of neural units.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Within a layer, cells are identical: they have the same parameters.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Although each element of a sequence might be of interest on its own, it only becomes really meaningful in the context of the other elements. Each cell contains a state vector (called &lt;em&gt;hidden state&lt;/em&gt;). Each cell is trained using an individual element from a segment and the hidden state from the preceding cell. Training the network means training the creation of those states. Passing the hidden state along transfers some context or memory from prior elements of the segment. The cells receiving a segment form a single layer. Each cell would typically (but not necessarily) also include an additional sub-cell to create an output as a function of the hidden state. In that case, the output of a layer can then be used as the input of a new RNN layer.&lt;/p&gt;
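&lt;p&gt;The mechanics of a cell can be sketched in a few lines (a minimal illustration: sizes are arbitrary and random matrices stand in for trained parameters):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

# One step of a vanilla RNN cell: the new hidden state mixes the current
# input element with the hidden state passed in from the preceding cell.
d_in, d_h = 3, 5
W_x = rng.normal(size=(d_h, d_in))   # input weights (learned in practice)
W_h = rng.normal(size=(d_h, d_h))    # recurrent weights (learned in practice)
b = np.zeros(d_h)

def rnn_step(x, h_prev):
    return np.tanh(W_x @ x + W_h @ h_prev + b)

h = np.zeros(d_h)
for x in rng.normal(size=(10, d_in)):   # a segment of 10 elements
    h = rnn_step(x, h)                  # identical parameters for every cell

print(h.shape)  # (5,)
```

&lt;p&gt;Every cell in the layer applies the same &lt;code&gt;rnn_step&lt;/code&gt;; only the input element and the incoming hidden state change.&lt;/p&gt;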
&lt;p&gt;A layer is trained by passing hidden states from prior cells to later cells. The hidden state from prior elements is used to contextualise a current element. To use context from later elements (e.g. in English, a noun giving context to a preceding adjective), a separate layer is trained where context instead passes from later to prior elements. Those forward and backward layers jointly create a &lt;em&gt;bidirectional RNN&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;Historically, RNNs applied to NLP deal with elements which are either one-hot encoded (letters or, more efficiently, tokens), or word embeddings often normalised as unit vectors (for example see &lt;a href=&#34;https://nlp.stanford.edu/projects/glove/&#34;&gt;Word2Vec&lt;/a&gt; and &lt;a href=&#34;https://nlp.stanford.edu/projects/glove/&#34;&gt;GloVe&lt;/a&gt;). RNN cells therefore deal with small, bounded values. Typically, non-linearity is brought by &lt;span class=&#34;math inline&#34;&gt;\(tanh\)&lt;/span&gt; or &lt;span class=&#34;math inline&#34;&gt;\(sigmoid\)&lt;/span&gt; activations, which keep unit values within a bounded range. Those activation functions quickly have very flat gradients. Segments often have tens or hundreds of elements. Because of vanishing gradients, a hidden state receives little information from distant cells (training gradients are hardly influenced by gradients of distant cells).&lt;/p&gt;
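&lt;p&gt;A back-of-the-envelope illustration of the vanishing gradient (assuming a fixed, typical pre-activation value):&lt;/p&gt;

```python
import math

# The derivative of tanh is 1 - tanh(x)^2: at most 1, and much smaller away
# from zero. Backpropagating through many cells multiplies these factors,
# so the gradient reaching distant cells shrinks geometrically.
def tanh_grad(x):
    t = math.tanh(x)
    return 1.0 - t * t

grad = 1.0
for _ in range(50):           # 50 time steps, pre-activation fixed at 1.5
    grad *= tanh_grad(1.5)

print(grad)  # vanishingly small: distant cells barely contribute
```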
&lt;/div&gt;
&lt;div id=&#34;longshort-term-memory-rnns&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Long/Short Term Memory RNNs&lt;/h3&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;assets/Long_Short-Term_Memory.svg&#34; alt=&#34;&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;&lt;strong&gt;Basic LSTM RNN&lt;/strong&gt; (source: &lt;em&gt;Wikipedia&lt;/em&gt;)&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Long/Short Term Memory RNNs (&lt;em&gt;LSTM&lt;/em&gt;) address this by passing two states:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;a hidden state &lt;span class=&#34;math inline&#34;&gt;\(h\)&lt;/span&gt; as described above trained with non-linearity: this is the &lt;em&gt;short-term memory&lt;/em&gt;; and,&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;another hidden state &lt;span class=&#34;math inline&#34;&gt;\(c\)&lt;/span&gt; (called &lt;em&gt;context&lt;/em&gt;) weighting previous contexts with a simple exponential moving average (in &lt;em&gt;Gated Recurrent Units&lt;/em&gt;) or a slightly more complicated version thereof in the original LSTM model structure. Determining the optimal exponential decay is part of the training process. This minimally processed state is the &lt;em&gt;long-term memory&lt;/em&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
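&lt;p&gt;The exponential moving average of the context can be sketched as follows (a toy illustration: in a real GRU or LSTM the gate value is computed from the input and the hidden state, not fixed):&lt;/p&gt;

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Toy GRU-style context update: a gate f (learned in the real model) sets
# the exponential decay between the old context and the new candidate.
def update_context(c_prev, candidate, gate_preact):
    f = sigmoid(gate_preact)                 # weight in (0, 1)
    return f * c_prev + (1.0 - f) * candidate

c = 0.0
for x in [0.5, -0.2, 0.8]:                   # a toy sequence of inputs
    c = update_context(c, math.tanh(x), gate_preact=1.0)
```

&lt;p&gt;Because the update is a convex combination, the context stays within the range of the candidates: old information decays but is never abruptly discarded.&lt;/p&gt;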
&lt;p&gt;LSTMs can also be made bidirectional.&lt;/p&gt;
&lt;p&gt;Without going into further details, note that each &lt;span class=&#34;math inline&#34;&gt;\(\sigma\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(\tanh\)&lt;/span&gt; orange block represents a matrix of parameters to be learned.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;attention&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Attention&lt;/h3&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;assets/Attention_RNN.svg&#34; alt=&#34;&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;&lt;strong&gt;Attention RNN&lt;/strong&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;RNN were further extended with an &lt;em&gt;attention mechanism&lt;/em&gt;. Blog posts on attention by &lt;a href=&#34;https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/&#34;&gt;Jay Alammar&lt;/a&gt; and &lt;a href=&#34;https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html&#34;&gt;Lilian Weng&lt;/a&gt; are good introductions.&lt;/p&gt;
&lt;p&gt;A multi-layer RNN takes the output of a layer and uses it as input for the next. With the attention mechanism, the outputs first go through an attention unit.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;beyond-lstm-transformers&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Beyond LSTM: Transformers&lt;/h3&gt;
&lt;p&gt;RNNs were then simplified (insert large air quotes) with &lt;em&gt;Transformers&lt;/em&gt; (using what is called &lt;em&gt;self-attention&lt;/em&gt;) that significantly reduce the number of model parameters and can be efficiently parallelised with minimum impact on model performance. For an extremely clear introduction to those significant improvements, you cannot do better than reading the blog post by &lt;a href=&#34;http://www.peterbloem.nl/blog/transformers&#34;&gt;Peter Bloem&lt;/a&gt; on transformers. The following assumes that you are broadly familiar with those ideas.&lt;/p&gt;
&lt;p&gt;The basic transformer structure uses self-attention where, for a given element (the &lt;em&gt;query&lt;/em&gt;), the transformer looks at the other elements of the segment (the &lt;em&gt;keys&lt;/em&gt;) to determine how much ‘attention’ each of them deserves, i.e. how much the other elements of the segment influence the role of the query in changing the hidden state.&lt;/p&gt;
&lt;p&gt;Broadly:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The query is projected in some linear space (a matrix &lt;span class=&#34;math inline&#34;&gt;\(W_q\)&lt;/span&gt;). That’s basically an embedding which is part of the model training.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;All the other elements, the keys, are projected in another linear space (a matrix &lt;span class=&#34;math inline&#34;&gt;\(W_k\)&lt;/span&gt;); another embedding which is part of the model training.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The similarity (perhaps &lt;em&gt;affinity&lt;/em&gt; would be a better word) between the projected query and each projected key is calculated with a dot product / cosine distance. This is exactly the approach of basic recommender systems, with the difference that there the recommendation is between sets of completely different natures (for example, affinity between users and movies). Note that although queries and keys are elements of identical type, they are embedded into different spaces with different projection matrices.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We now have a vector of the same size as the segment length (one cosine distance per input element). It goes through another layer (a matrix &lt;span class=&#34;math inline&#34;&gt;\(W_v\)&lt;/span&gt;) to give a &lt;em&gt;value&lt;/em&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The triplet of &lt;span class=&#34;math inline&#34;&gt;\(\left( W_q, W_k, W_v \right)\)&lt;/span&gt; is called an &lt;em&gt;attention head&lt;/em&gt;. Actual models would include multiple heads (of the order of 10), and the output of a transformer layer could then feed into a new transformer layer.&lt;/p&gt;
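&lt;p&gt;A minimal numerical sketch of a single attention head (random matrices stand in for the trained &lt;span class=&#34;math inline&#34;&gt;\(W_q\)&lt;/span&gt;, &lt;span class=&#34;math inline&#34;&gt;\(W_k\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(W_v\)&lt;/span&gt;):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy single-head self-attention over a segment of 4 elements of size 8.
n, d = 4, 8
X = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v
scores = Q @ K.T / np.sqrt(d)                     # query/key affinities
scores -= scores.max(axis=1, keepdims=True)       # numerical stability
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)     # softmax over the keys
out = weights @ V                                 # one output per element

print(out.shape)  # (4, 8)
```

&lt;p&gt;Note how each row of &lt;code&gt;weights&lt;/code&gt; sums to 1: each query distributes its ‘attention’ across all elements of the segment.&lt;/p&gt;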
&lt;p&gt;This model is great until you notice that the dot product / cosine similarity is commutative and does not reflect whether a key element is located before or after the query element: order is fundamental to sequential information (“quick fly” vs. “fly quick”). To address this, the input elements are always enriched with a positional embedding: the input elements are concatenated with positional information showing where they stand within a segment.&lt;/p&gt;
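&lt;p&gt;By way of illustration, the sinusoidal scheme of the original Transformer paper is one common way to build such a positional signal (whether it is added to or concatenated with the input varies between models):&lt;/p&gt;

```python
import numpy as np

# Sinusoidal absolute positional encoding: even dimensions get a sine,
# odd dimensions a cosine, at geometrically spaced frequencies.
def positional_encoding(n_positions, d):
    pos = np.arange(n_positions)[:, None]
    i = np.arange(d)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

pe = positional_encoding(250, 16)   # one row of position features per element
```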
&lt;p&gt;Note that a transformer layer is trained on a segment using only the information from that segment. This is fine to train on sentences, but it cannot really account for more distant relationships between words within a lengthy paragraph, let alone a full text.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;transformer-xl&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Transformer-XL&lt;/h3&gt;
&lt;p&gt;Transformers have been further improved with &lt;a href=&#34;https://ai.googleblog.com/2019/01/transformer-xl-unleashing-potential-of.html&#34;&gt;Transformer-XL&lt;/a&gt; (XL = extra long), which is trained using hidden states from previous segments, therefore using information from several segments, to improve a model’s memory span.&lt;/p&gt;
&lt;p&gt;Conceptually, this is an obvious extension of the basic transformer to increase its memory span. But there is a fundamental problem. Going back to the basic transformer, each element includes its absolute position within the segment. The position of the first word of the segment is 1, that of the last one is, say, 250. Such a scheme breaks down as soon as the state of the previous segment is taken into account. Word 1 of the current segment obviously comes before word 250, but has to come after word 250 of the previous segment. The absolute position encoding does not reflect the relative position of elements located in different segments.&lt;/p&gt;
&lt;p&gt;The key contribution of the Transformer-XL is to develop a relative positional encoding that allows hidden state information to cross segment boundaries. In their implementation, the authors evaluate that the attention length, being basically how many hidden states are used, is 450% longer than the basic transformer’s. That’s going from sentence length to full paragraph, but still far from a complete book.&lt;/p&gt;
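&lt;p&gt;The idea behind relative positions can be illustrated with a toy offset matrix:&lt;/p&gt;

```python
import numpy as np

# Relative offsets between queries (current segment) and keys (previous
# segment's memory plus current segment). Unlike absolute positions,
# these offsets stay meaningful across the segment boundary.
n_prev, n_cur = 4, 4
positions = np.arange(n_prev + n_cur)        # memory first, then current
queries = positions[n_prev:]                 # queries live in the current segment
rel = queries[:, None] - positions[None, :]  # how far behind each key sits

# First query: positive offsets to the memory, zero to itself,
# negative offsets to later elements of its own segment.
print(rel[0])
```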
&lt;p&gt;A side, but impressive, benefit is that the evaluation speed of the model, i.e. its use once trained, is significantly increased thanks to the relative addressing (the paper states up to a 1,800-fold increase depending on the attention length).&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;compressive-transformers&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Compressive Transformers&lt;/h2&gt;
&lt;p&gt;Full text understanding cannot be achieved by simply lengthening segment sizes from 100s to the &lt;a href=&#34;https://blog.reedsy.com/how-many-words-in-a-novel/&#34;&gt;word count&lt;/a&gt; of a typical novel (about 100,000). When training a model routinely takes 10s of hours on GPU clusters, an increase by 3 orders of magnitude is not realistic.&lt;/p&gt;
&lt;p&gt;In a recent &lt;a href=&#34;https://arxiv.org/abs/1911.05507&#34;&gt;paper&lt;/a&gt;, DeepMind proposes a new RNN model called &lt;em&gt;Compressive Transformers&lt;/em&gt;.&lt;/p&gt;
&lt;div id=&#34;introduction&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Introduction&lt;/h3&gt;
&lt;p&gt;Transformer-XL uses the hidden state of a prior segment (&lt;span class=&#34;math inline&#34;&gt;\(h_{T-1}\)&lt;/span&gt;) to improve the training of the current segment (&lt;span class=&#34;math inline&#34;&gt;\(h_{T}\)&lt;/span&gt;). When moving to the next segment, training (&lt;span class=&#34;math inline&#34;&gt;\(h_{T+1}\)&lt;/span&gt;) now only uses &lt;span class=&#34;math inline&#34;&gt;\(h_{T}\)&lt;/span&gt;, and &lt;span class=&#34;math inline&#34;&gt;\(h_{T-1}\)&lt;/span&gt; is discarded. To increase the memory span, one could train using more past segments at the expense of increased memory usage and computation time (which grows quadratically). The actual Transformer-XL uses the hidden states of several previous segments, but the discarding mechanism remains.&lt;/p&gt;
&lt;p&gt;The key contribution of the Compressive Transformers is the ability to retain salient information from those otherwise discarded past states. Instead of being discarded, they are stored in compressed form.&lt;/p&gt;
&lt;p&gt;Each Transformer-XL layer is now trained with prior hidden states (&lt;em&gt;primary memory&lt;/em&gt;) and the &lt;em&gt;compressed memory&lt;/em&gt; of older hidden states.&lt;/p&gt;
&lt;p&gt;As an aside, although not explicitly mentioned, we should note that the ‘-XL’ aspect of the Transformer-XL and the memory compression mechanics are conceptually independent from the actual type of RNN cell. Simple RNNs, GRUs or LSTMs could be trained using the hidden states of past segments (not dissimilar to state/context peeking into past cells in certain RNN variants). But the performance benefit of Transformer-XL is such that the paper only focuses on the Transformer-XL.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;compression-scheme&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Compression scheme&lt;/h3&gt;
&lt;p&gt;As compared to Transformer-XL, the key difference is the compression scheme. The rest of the model seems identical.&lt;/p&gt;
&lt;div id=&#34;size-parameters&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Size parameters&lt;/h4&gt;
&lt;p&gt;The size of the model is described with a few size parameters:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;span class=&#34;math inline&#34;&gt;\(n_s\)&lt;/span&gt;: size of a segment = the number of cells in a layer.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;span class=&#34;math inline&#34;&gt;\(n_m\)&lt;/span&gt;: number of hidden states in the primary uncompressed memory (like the Transformer-XL). &lt;span class=&#34;math inline&#34;&gt;\(n_m\)&lt;/span&gt; is a multiple of &lt;span class=&#34;math inline&#34;&gt;\(n_s\)&lt;/span&gt;. The primary memory is a FIFO buffer: the first (oldest) memories will be the first to be later compressed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;span class=&#34;math inline&#34;&gt;\(n_{cm}\)&lt;/span&gt;: number of compressed hidden states in the compressed memory. States in the compressed memory will compress an old segment of size &lt;span class=&#34;math inline&#34;&gt;\(n_s\)&lt;/span&gt; dropping out of the primary memory. &lt;span class=&#34;math inline&#34;&gt;\(c\)&lt;/span&gt; is an information compression ratio from &lt;span class=&#34;math inline&#34;&gt;\(n_s\)&lt;/span&gt; primary memory entries into compressed memory entries. There can be two ways of applying this compression ratio, which both reduce the number of hidden states by the same ratio:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;span class=&#34;math inline&#34;&gt;\(c\)&lt;/span&gt; uncompressed hidden states could create a single compressed hidden state of identical size. This merges the information of a group of elements (e.g. &lt;span class=&#34;math inline&#34;&gt;\(c\)&lt;/span&gt; words) into a single hidden state. In this case, &lt;span class=&#34;math inline&#34;&gt;\(n_s\)&lt;/span&gt; is proportional to &lt;span class=&#34;math inline&#34;&gt;\(c\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(n_{cm}\)&lt;/span&gt; is proportional to &lt;span class=&#34;math inline&#34;&gt;\(n_s / c\)&lt;/span&gt;. The authors do not use this approach: it would enforce a sub-segmentation of an uncompressed segment at arbitrary intervals (why group 3 words instead of 5 or 7…).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Instead, the authors use dimension reduction: a single uncompressed hidden state is compressed into a new hidden state that is &lt;span class=&#34;math inline&#34;&gt;\(c\)&lt;/span&gt; times smaller. If the size of the hidden state of a Transformer-XL cell is &lt;span class=&#34;math inline&#34;&gt;\(n_h\)&lt;/span&gt;, hidden states in the primary memory will have the same size, and the compressed memory hidden states will have a size of &lt;span class=&#34;math inline&#34;&gt;\(n_h / c\)&lt;/span&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By way of example, a segment could have 100 cells (&lt;span class=&#34;math inline&#34;&gt;\(n_s = 100\)&lt;/span&gt;). This segment could be trained with the hidden states of the past 3 segments’ training (&lt;span class=&#34;math inline&#34;&gt;\(n_m = 3 * n_s = 300\)&lt;/span&gt;). When training the next segment, an old segment of size 100 becomes available for compression, which will create 100 new (smaller) compressed hidden states.&lt;/p&gt;
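&lt;p&gt;The bookkeeping for one layer can be sketched as follows (a toy illustration: &lt;code&gt;compress&lt;/code&gt; here is a crude stand-in for the learned compression functions of the next section):&lt;/p&gt;

```python
# Toy memory bookkeeping for one layer of a Compressive Transformer.
# n_s, n_m, n_cm and c follow the post's notation, with tiny values.
n_s, n_m, n_cm, c = 4, 8, 6, 2

primary = []      # FIFO of uncompressed hidden states (newest last)
compressed = []   # FIFO of compressed hidden states

def compress(state, c):
    # crude dimension reduction by factor c: average groups of c values
    return [sum(state[i:i + c]) / c for i in range(0, len(state), c)]

def add_segment(segment):
    global primary, compressed
    primary.extend(segment)
    if len(primary) > n_m:
        evicted, primary = primary[:n_s], primary[n_s:]
        compressed.extend(compress(s, c) for s in evicted)
        compressed = compressed[-n_cm:]   # oldest compressed states drop out

for t in range(4):                        # four segments of n_s states of size 8
    add_segment([[float(t)] * 8 for _ in range(n_s)])

print(len(primary), len(compressed))  # 8 6
```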
&lt;p&gt;This example is for a single layer. The same scheme would be replicated for each layer of the model.&lt;/p&gt;
&lt;p&gt;Note that the paper only contemplates a single set of compressed memories. There could also be multiple generations of compressed memories, with the primary memory compressing into generation 1, which in turn compresses into generation 2…&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;compression-functions&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Compression functions&lt;/h4&gt;
&lt;p&gt;A compressed hidden state is created from &lt;span class=&#34;math inline&#34;&gt;\(c\)&lt;/span&gt; primary memory hidden states. When training on texts with word embeddings, the authors used a value of &lt;span class=&#34;math inline&#34;&gt;\(c=3\)&lt;/span&gt; or &lt;span class=&#34;math inline&#34;&gt;\(c=4\)&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;Several compression schemes are explored in the paper:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;max or mean pooling with a stride of &lt;span class=&#34;math inline&#34;&gt;\(c\)&lt;/span&gt;. This is typical of image convolution networks - no explanation required.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;1-dimensional convolution with a stride of &lt;span class=&#34;math inline&#34;&gt;\(c\)&lt;/span&gt;. This is also typical of image convolution networks, apart from being one-dimensional. This requires parameter training.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href=&#34;https://arxiv.org/pdf/1511.07122.pdf&#34;&gt;dilated convolution&lt;/a&gt;. In practice, image convolutions have been shown to be inadequate for sequential information where dependencies can be at both short and long ranges: working at different scales makes sense. Dilated convolutions use convolution filters that are contracted and dilated versions of a template to be trained.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;a &lt;em&gt;most-used&lt;/em&gt; mechanism that identifies and retains part of the hidden states according to their importance in the cells’ training, gauged by the attention they received.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
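&lt;p&gt;The first, parameter-free option can be sketched as pooling over the time axis (an illustrative implementation, not the authors’ code):&lt;/p&gt;

```python
import numpy as np

# Mean or max pooling with stride c over a buffer of hidden states:
# every group of c consecutive states collapses into one.
def pool(states, c, mode="mean"):
    n, d = states.shape
    groups = states[: n - n % c].reshape(n // c, c, d)
    return groups.mean(axis=1) if mode == "mean" else groups.max(axis=1)

states = np.arange(12.0).reshape(6, 2)   # 6 hidden states of size 2
pooled = pool(states, c=3)               # 2 compressed states remain
print(pooled)
```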
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;compression-training&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Compression training&lt;/h3&gt;
&lt;p&gt;Training the compression parameters is done separately from the optimisation of the Transformer-XL cells.&lt;/p&gt;
&lt;p&gt;The purpose of the compressed memory is to provide a compressed and lossy representation of the primary memory (hidden states) or the attention heads’ parameters: the quality of the compression mechanics is assessed by how well the original information can be re-generated from it. In essence, the compressed hidden states play the role of the learned representation vector in an auto-encoder. This is the training mechanism used by the authors.&lt;/p&gt;
&lt;p&gt;As in an auto-encoder, the representation is learned by comparing the original information to its reconstruction. This training is kept completely independent from the training of the transformers: the auto-encoding loss and gradients do not impact the attention heads’ parameters.&lt;/p&gt;
&lt;p&gt;Conversely, the loss and gradients of the attention heads’ training do not flow into the training of the compression scheme.&lt;/p&gt;
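&lt;p&gt;To make the auto-encoding idea concrete, here is a linear toy version (my own stand-in, not the paper’s architecture): a compressor and a reconstructor are trained on the reconstruction error alone, touching nothing else:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(1)

# Linear auto-encoder toy: learn a compressor C and a reconstructor D so
# that H @ C @ D approximates H, by gradient descent on the reconstruction
# MSE. H stands in for old hidden states; the transformer is untouched.
d, d_c = 8, 4
H = rng.normal(size=(100, d))
C = 0.1 * rng.normal(size=(d, d_c))
D = 0.1 * rng.normal(size=(d_c, d))

init_loss = float(((H @ C @ D - H) ** 2).mean())
lr = 0.01
for _ in range(500):
    E = H @ C @ D - H                  # reconstruction error
    grad_D = (H @ C).T @ E / len(H)
    grad_C = H.T @ (E @ D.T) / len(H)
    C -= lr * grad_C
    D -= lr * grad_D

loss = float(((H @ C @ D - H) ** 2).mean())
```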
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;summary&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Summary&lt;/h2&gt;
&lt;p&gt;This was a high level introduction to RNNs all the way up to Compressive Memory mechanics. Next, the algorithm’s nitty-gritty.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Lending Club peer-to-peer loans scoring</title>
      <link>/project/lendingclub/</link>
      <pubDate>Thu, 12 Dec 2019 16:17:27 +0800</pubDate>
      <guid>/project/lendingclub/</guid>
      <description>&lt;p&gt;Click on the &lt;code&gt;pdf&lt;/code&gt; or &lt;code&gt;slides&lt;/code&gt; buttons above to access the materials.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Movielens Recommender System</title>
      <link>/project/movielens/</link>
      <pubDate>Thu, 12 Dec 2019 16:07:46 +0800</pubDate>
      <guid>/project/movielens/</guid>
      <description>&lt;p&gt;Click on the &lt;code&gt;pdf&lt;/code&gt; or &lt;code&gt;slides&lt;/code&gt; buttons above to access the materials.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>HarvardX Gitbooks available</title>
      <link>/post/2019/12/12/harvardx-gitbooks-available/</link>
      <pubDate>Thu, 12 Dec 2019 00:00:00 +0000</pubDate>
      <guid>/post/2019/12/12/harvardx-gitbooks-available/</guid>
      <description>
&lt;script src=&#34;./rmarkdown-libs/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;Both capstones for the HarvardX certificates are now available. Just click on the &lt;code&gt;Projects&lt;/code&gt; link!&lt;/p&gt;
&lt;p&gt;If Gitbooks are not your thing, at the top of their main page, there is a download link to a pdf version.&lt;/p&gt;
&lt;p&gt;They make for a good knock-me-asleep reading…&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>HarvardX Final Report - LendingClub dataset</title>
      <link>/post/2019/12/11/harvardx-final-report-lendingclub-dataset/</link>
      <pubDate>Wed, 11 Dec 2019 00:00:00 +0000</pubDate>
      <guid>/post/2019/12/11/harvardx-final-report-lendingclub-dataset/</guid>
      <description>
&lt;script src=&#34;./rmarkdown-libs/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;After 3 months of work, the final report for the HarvardX Data Science course was submitted.&lt;/p&gt;
&lt;p&gt;It is based on the LendingClub dataset. LendingClub is a peer-to-peer lender: it matches private borrowers with investors. Small amounts, fairly high risk (if they could, borrowers would probably have had a bank involved). Surprisingly, after tapping a market of individual lenders, the biggest lenders are now the banks. To inform the investors, LendingClub makes historical information publicly available.&lt;/p&gt;
&lt;p&gt;This work went through many blind alleys. I won’t list them, they are in the report (post-mortem section). But it was an overall enriching experience. I learned a lot, often about the limitations of what I tried (the dataset is big, with a few million samples (big for an old laptop) and many (ca. 150) misspecified, mixed categorical and numeric variables). The experience will be filed in the &lt;em&gt;‘it-builds-character’&lt;/em&gt; category…&lt;/p&gt;
&lt;p&gt;One point that is still tickling my mind is learning about Conditional Inference Trees used to bin variables. The binning is then used in a logistic regression to predict probabilities of loan default.&lt;/p&gt;
&lt;p&gt;Why are those trees interesting? They are rooted in information theory and measure the information content of a prediction variable for predicting a binary response. The prediction variable is then partitioned into a few intervals (bins). What is great?&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The measurement does NOT rely on the value of the prediction variable. This means that variable NAs go from being a nuisance to being stashed in a bin of their own treated as any other bin.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The logistic regression context predicates binary variables which were perfect for the purpose of this report. But those trees do not require binary outcomes. They rely on what are called &lt;em&gt;Weight of Evidence&lt;/em&gt; (calculated for each bin) and &lt;em&gt;Information Value&lt;/em&gt; (calculated for each variable).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The calculations are very quick (about 1/10th second to bin 1 million samples) with a small memory footprint.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In other words, whatever comes in, we do not have to worry about scaling/z-scoring/filling NAs; it is quickly reformatted into a handful of bins (literally of that order of magnitude, depending on the parameters used) based on their relevance to predicting what needs to come out.&lt;/p&gt;
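&lt;p&gt;For the curious, Weight of Evidence and Information Value boil down to a few lines (the loan counts below are made up, purely illustrative):&lt;/p&gt;

```python
import math

# Weight of Evidence per bin and Information Value for the variable.
# NAs simply get a bin of their own, treated like any other bin.
bins = {            # bin -> (good loans, bad loans)
    "low":  (400, 20),
    "mid":  (300, 60),
    "high": (200, 120),
    "NA":   (100, 100),
}
total_good = sum(g for g, _ in bins.values())
total_bad = sum(b for _, b in bins.values())

iv = 0.0
for good, bad in bins.values():
    p_good, p_bad = good / total_good, bad / total_bad
    woe = math.log(p_good / p_bad)   # how the bin tilts towards good loans
    iv += (p_good - p_bad) * woe     # each term is non-negative

print(round(iv, 3))
```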
&lt;p&gt;If I didn’t know better, this should be called &lt;em&gt;model impedance matching&lt;/em&gt; (electrical engineers can explain)!&lt;/p&gt;
&lt;p&gt;Apart from that, the number of avenues to explore with this dataset (especially using data from other sources) could fill many more months. I included a list of possible techniques in the report’s conclusion. This is what does and will keep banks’ credit risk departments busy and well-staffed…&lt;/p&gt;
&lt;p&gt;I am working on making the MovieLens and LendingClub reports available as gitbooks. To be announced.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Quick Thought: Universal translator and same language translator</title>
      <link>/post/2019/10/25/quick-idea-universal-translator-and-same-language-translator/</link>
      <pubDate>Fri, 25 Oct 2019 00:00:00 +0000</pubDate>
      <guid>/post/2019/10/25/quick-idea-universal-translator-and-same-language-translator/</guid>
      <description>
&lt;script src=&#34;./rmarkdown-libs/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Quick Thoughts are random thoughts looking for comments&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Let’s imagine a universal translator able to translate any language into any language. Sourcing a corpus of paired translations is a major hurdle. However, there is an almost infinite corpus of paired translations: a language with itself; translating English to English is easy, even for a computer.&lt;/p&gt;
&lt;p&gt;Let’s give the blackbox universal translator three inputs: a source text, the language of the source text, and the language of the desired translation. What would be the consequences for the learning system inside the blackbox of the constraint that, if the languages are the same, the output must be identical to the input?&lt;/p&gt;
&lt;p&gt;Obviously, the blackbox could quickly learn that bypassing the translation does the trick. However, that would probably require the internal circuitry to allow for the bypass, and that could be constrained out. So:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Could we expect any interesting result?&lt;/li&gt;
&lt;li&gt;Could the input eventually be forced down to a language-independent universal representation?&lt;/li&gt;
&lt;li&gt;Let’s say there is a language-independent universal representation kernel. If the input comes in without information about the output language, and the output has no information about what the input language was, does it force the network to create a universal representation, or would it just wither away?&lt;/li&gt;
&lt;li&gt;Is it possible to &lt;em&gt;invert&lt;/em&gt; a network? Probably not in a truly bijective way, but to model the fact that text representation &lt;span class=&#34;math inline&#34;&gt;\(\rightarrow\)&lt;/span&gt; universal representation is the &lt;em&gt;inverse&lt;/em&gt; (for some definition of the word) of universal representation &lt;span class=&#34;math inline&#34;&gt;\(\rightarrow\)&lt;/span&gt; text representation of the same language?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Comments welcome&lt;/strong&gt;&lt;/em&gt;.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Neural Network - Incremental Growth</title>
      <link>/post/2019/10/23/neural-network-incremental-growth/</link>
      <pubDate>Wed, 23 Oct 2019 00:00:00 +0000</pubDate>
      <guid>/post/2019/10/23/neural-network-incremental-growth/</guid>
      <description>
&lt;script src=&#34;./rmarkdown-libs/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;

&lt;div id=&#34;TOC&#34;&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#draft-1&#34;&gt;&lt;strong&gt;&lt;em&gt;DRAFT 1&lt;/em&gt;&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#background&#34;&gt;Background&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#singular-matrix-decomposition&#34;&gt;Singular value decomposition&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#where-next&#34;&gt;Where next?&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#back-to-svd&#34;&gt;Back to SVD&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#regularisation&#34;&gt;Regularisation&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#vector-coordinates&#34;&gt;Vector coordinates&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#eigenvalues&#34;&gt;Eigenvalues&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#threshold&#34;&gt;Threshold&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#by-2-decision-matrix&#34;&gt;2-by-2 decision matrix&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#todo-other-principal-components-methods&#34;&gt;[TODO] Other Principal Components methods&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#limitations-and-further-questions&#34;&gt;Limitations and further questions&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#limitations&#34;&gt;Limitations&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#further-questions&#34;&gt;Further questions&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#litterature&#34;&gt;Literature&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;

&lt;hr /&gt;
&lt;div id=&#34;draft-1&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;&lt;strong&gt;&lt;em&gt;DRAFT 1&lt;/em&gt;&lt;/strong&gt;&lt;/h1&gt;
&lt;hr /&gt;
&lt;p&gt;We all have laptops. But let’s face it, even in times of 32GB of RAM and NVMe drives, forget about running any interesting TensorFlow model. You need to get an external GPU, build your own rig, or very quickly pay a small fortune for cloud instances.&lt;/p&gt;
&lt;p&gt;Back in 1993, I read a paper about growing neural networks neuron-by-neuron. I have no other precise recollection of this paper apart from the models considered being on the order of tens of neurons and the weight optimisation being done globally, i.e. not layer-by-layer like backpropagation. Nowadays, it is still too often the case that finding a network structure that solves a particular problem is a random walk: how many layers, with how many neurons, with which activation functions? Regularisation methods? Drop-out rate? Training batch size? The list goes on.&lt;/p&gt;
&lt;p&gt;This got me thinking about how a training heuristic could incrementally modify a network structure given a particular training set and, apart maybe from a few hyperparameters, do that with no external intervention. At regular training intervals, a layer&lt;a href=&#34;#fn1&#34; class=&#34;footnote-ref&#34; id=&#34;fnref1&#34;&gt;&lt;sup&gt;1&lt;/sup&gt;&lt;/a&gt; will be modified depending on what it &lt;em&gt;seems&lt;/em&gt; able or not to achieve. As we will see, we will use unsupervised learning methods to do this: a layer modification will be independent of the actual learning problem and automatic.&lt;/p&gt;
&lt;p&gt;Many others have looked into this. But what I found regarding self-organising networks is pre-2000, and nothing in the context of deep learning. So it seems that the topic has gone out of fashion because of the current amounts of computing power, or has been set aside for reasons unknown (see references at the end). In any event, the question is interesting enough to research.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;background&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Background&lt;/h1&gt;
&lt;p&gt;Let us look at a simple 1-D layer and decompose what it exactly does. Basically a layer does:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
\text{output} = f(M \times \text{input})
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;If the input &lt;span class=&#34;math inline&#34;&gt;\(I\)&lt;/span&gt; has size &lt;span class=&#34;math inline&#34;&gt;\(n_I\)&lt;/span&gt; and the output &lt;span class=&#34;math inline&#34;&gt;\(O\)&lt;/span&gt; has size &lt;span class=&#34;math inline&#34;&gt;\(n_O\)&lt;/span&gt;, with &lt;span class=&#34;math inline&#34;&gt;\(f\)&lt;/span&gt; being the activation function, we have (where &lt;span class=&#34;math inline&#34;&gt;\(\odot\)&lt;/span&gt; represents the element-wise application of a function):&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
O = f \odot (M \times I)
\]&lt;/span&gt;&lt;/p&gt;
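&lt;p&gt;As a minimal sketch of such a layer (using &lt;code&gt;numpy&lt;/code&gt;; the matrix, input vector and &lt;code&gt;tanh&lt;/code&gt; activation are illustrative choices, not taken from any particular network):&lt;/p&gt;

```python
import numpy as np

def layer(M, x, f=np.tanh):
    # A 1-D layer: matrix product followed by the element-wise
    # application of the activation function f.
    return f(M.dot(x))

# Illustrative 2x2 weight matrix and input vector.
M = np.array([[1.0, 0.0],
              [0.5, -0.5]])
x = np.array([1.0, 2.0])
y = layer(M, x)
```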
&lt;p&gt;Then, looking at &lt;span class=&#34;math inline&#34;&gt;\(M\)&lt;/span&gt;, what does it really do? At one extreme, if &lt;span class=&#34;math inline&#34;&gt;\(M\)&lt;/span&gt; was the identity matrix, it would essentially be useless (bar the activation function&lt;a href=&#34;#fn2&#34; class=&#34;footnote-ref&#34; id=&#34;fnref2&#34;&gt;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt;). This would be a layer candidate for deletion. The question is then:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Looking at the matrix representing a layer, can we identify which parts are (1) useless, (2) useful and complex enough, or (3) useful but too simplistic?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here, &lt;em&gt;complex enough&lt;/em&gt; or &lt;em&gt;simplistic&lt;/em&gt; is basically a synonym of “&lt;em&gt;one layer is enough&lt;/em&gt;”, or “&lt;em&gt;more layers are necessary&lt;/em&gt;”.&lt;/p&gt;
&lt;p&gt;The idea is to look for important/complex information, which is where the network needs to grow more complex, and to identify trivial information, which can be discarded or viewed as minor adjustments to improve error rates (basically overfitting…).&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Caveat&lt;/em&gt;: Note that we ignore the activation function. It is key to introducing non-linearity: without it, a network is only a linear function, i.e. of no interest. It has a clear impact on the performance of a network.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;singular-matrix-decomposition&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Singular value decomposition&lt;/h1&gt;
&lt;p&gt;There exist many ways to decompose a matrix. Singular value decomposition (&lt;em&gt;SVD&lt;/em&gt;) &lt;span class=&#34;math inline&#34;&gt;\(M = O \Sigma I^\intercal\)&lt;/span&gt; is an easy and efficient way to interpret what a given matrix does. SVD builds on eigenvectors (expressed in an orthonormal basis) and eigenvalues. (Note that &lt;span class=&#34;math inline&#34;&gt;\(M\)&lt;/span&gt; is real-valued, so we use the transpose notation &lt;span class=&#34;math inline&#34;&gt;\(M^\intercal\)&lt;/span&gt; instead of the conjugate transpose &lt;span class=&#34;math inline&#34;&gt;\(M^*\)&lt;/span&gt;.)&lt;/p&gt;
&lt;p&gt;In a statistical world, SVD (with eigenvalues ordered by decreasing value) is how to do principal component analysis (&lt;em&gt;PCA&lt;/em&gt;).&lt;/p&gt;
&lt;p&gt;In a geometrical context, SVD:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;takes a vector (expressed in the orthonormal basis);&lt;/li&gt;
&lt;li&gt;re-expresses it on a new basis made of the eigenvectors (which would only exceptionally be orthonormal);&lt;/li&gt;
&lt;li&gt;dilates/compresses those components by the relevant eigenvalues;&lt;/li&gt;
&lt;li&gt;and returns this resulting vector expressed back onto the orthonormal basis.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;As presented here, this explanation requires a bit more intellectual gymnastics when the matrix is not square (i.e. when the input and output layers have different dimensions), but the principle remains identical.&lt;/p&gt;
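&lt;p&gt;This rotate/scale/rotate reading can be checked numerically (a &lt;code&gt;numpy&lt;/code&gt; sketch with a random non-square matrix, purely for illustration):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((3, 4))  # a non-square 3x4 "layer" matrix

# Full SVD: U is 3x3, s holds the 3 singular values sorted in
# decreasing order, Vt is 4x4 (the text names U and Vt as O and I
# transposed).
U, s, Vt = np.linalg.svd(M, full_matrices=True)

# Embed s into the rectangular 3x4 Sigma and round-trip the product:
# re-express with Vt, dilate/compress with Sigma, return with U.
Sigma = np.zeros((3, 4))
np.fill_diagonal(Sigma, s)
M_back = U.dot(Sigma).dot(Vt)
```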
&lt;div id=&#34;where-next&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Where next?&lt;/h3&gt;
&lt;p&gt;Taking the statistical and geometrical points of view together, the layer (matrix &lt;span class=&#34;math inline&#34;&gt;\(M\)&lt;/span&gt;) shuffles the input vector in its original space, where some specific directions are more important than others. Those directions are linear combinations of the input neurons, each combination lying along an eigenvector. Those combinations are given more or less importance as expressed by the eigenvalues. (Note that the squares of the eigenvalues express how much information each combination brings to the table.)&lt;/p&gt;
&lt;p&gt;Intuitively, the simplest and most useless &lt;span class=&#34;math inline&#34;&gt;\(M\)&lt;/span&gt; would be the identity matrix (the input units are repeated), or zero matrix (the input units are dropped because useless). Let us repeat the caveat that the activation function is ignored.&lt;/p&gt;
&lt;p&gt;Compared to the identity matrix, the SVD shows that &lt;span class=&#34;math inline&#34;&gt;\(M\)&lt;/span&gt; encodes (at least) two types of important information:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;What are interesting combinations of the input units? This is expressed by how much the input vector is rotated in space.&lt;/li&gt;
&lt;li&gt;Independently of whether a combination is complicated or not (i.e. multiple units, or unit passthrough), how much an input is amplified (as expressed by the eigenvalues).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The idea is then to produce a 2x2 decision matrix with high/low rotation messiness and high/low eigenvalues.&lt;/p&gt;
&lt;p&gt;A picture gives the intuition of what we are after:&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;assets/Network-Incremental-Growth-Matrix-Split.png&#34; alt=&#34;&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Transformation of the Layer Matrix&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Looking from top to bottom at what the “after” matrices would be:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Part of the original layer, immediately followed by a new one (we will see below what that would look like). The intuition is that this layer is really messing things up down the line, or seems very sensitive.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Part of the original layer where the number of units would be increased (here doubled as an example).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Part of the original layer kept &lt;em&gt;functionally&lt;/em&gt; essentially as is.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Delete the rest, which is either not sensitive to input or outputs nothing, within a certain precision. This is basically a form of regularisation preventing the overall model from being too sensitive. I am aware that there are other types of regularisation, but that will go in the limitations category.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The next layer would take as input all the transformed outputs.&lt;/p&gt;
&lt;p&gt;The picture presents the matrices separated for ease of understanding. In reality, the same effect would be achieved if the three dark blue sub-layers were merged into a single layer.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;back-to-svd&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Back to SVD&lt;/h1&gt;
&lt;p&gt;Let us assume that there are &lt;span class=&#34;math inline&#34;&gt;\(n\)&lt;/span&gt; input units and &lt;span class=&#34;math inline&#34;&gt;\(m\)&lt;/span&gt; output units. &lt;span class=&#34;math inline&#34;&gt;\(M\)&lt;/span&gt; then is of dimensions &lt;span class=&#34;math inline&#34;&gt;\(m \times n\)&lt;/span&gt;. The matrices of the SVD have dimensions:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
\begin{matrix}
M          &amp;amp; = &amp;amp; O          &amp;amp; \Sigma     &amp;amp; I^\intercal \\
m \times n &amp;amp;   &amp;amp; m \times m &amp;amp; m \times n &amp;amp; n \times n \\
\end{matrix}
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Note that instead of using &lt;span class=&#34;math inline&#34;&gt;\(U\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(V\)&lt;/span&gt; to name the sub-matrices of the SVD, we use &lt;span class=&#34;math inline&#34;&gt;\(I\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(O\)&lt;/span&gt; to represent &lt;em&gt;input&lt;/em&gt; and &lt;em&gt;output&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;span class=&#34;math inline&#34;&gt;\(I\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(O\)&lt;/span&gt; can be written as:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
I =
\begin{pmatrix} |   &amp;amp;        &amp;amp; |    \\ i_1 &amp;amp; \cdots &amp;amp; i_n  \\ |   &amp;amp;        &amp;amp; |    \\ \end{pmatrix}
\qquad \text{and} \qquad
O =
\begin{pmatrix} |   &amp;amp;       &amp;amp; | \\ o_1 &amp;amp; \cdots &amp;amp; o_m \\ |   &amp;amp;       &amp;amp; | \\ \end{pmatrix}
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Then:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
\begin{aligned}
M &amp;amp; = O \Sigma I^\intercal \\
  &amp;amp; =      \begin{pmatrix}    |  &amp;amp;       &amp;amp;   |    \\
                             o_1 &amp;amp; \dots &amp;amp;  o_m   \\
                              |  &amp;amp;       &amp;amp;   |    \\ \end{pmatrix}                                                    \times \\
  &amp;amp; \times \begin{pmatrix} \sigma_1 \\ &amp;amp; \sigma_2 \\ &amp;amp;&amp;amp; \ddots \\ &amp;amp;&amp;amp;&amp;amp; \sigma_r \\ &amp;amp;&amp;amp;&amp;amp;&amp;amp; 0 \\ &amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp; \ddots \\ &amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp;&amp;amp; 0 \\ \end{pmatrix} \times \\
  &amp;amp; \times \begin{pmatrix}    -  &amp;amp;  i_1   &amp;amp; -     \\
                                 &amp;amp; \vdots &amp;amp;       \\
                              -  &amp;amp;  i_n   &amp;amp; -     \\ \end{pmatrix}
\end{aligned}
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;where &lt;span class=&#34;math inline&#34;&gt;\(\Sigma\)&lt;/span&gt; has &lt;span class=&#34;math inline&#34;&gt;\(r\)&lt;/span&gt; non-zero eigenvalues.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;regularisation&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Regularisation&lt;/h1&gt;
&lt;p&gt;At this stage, we can regularise all components.&lt;/p&gt;
&lt;div id=&#34;vector-coordinates&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Vector coordinates&lt;/h2&gt;
&lt;p&gt;For each vector &lt;span class=&#34;math inline&#34;&gt;\(i_k\)&lt;/span&gt; or &lt;span class=&#34;math inline&#34;&gt;\(o_k\)&lt;/span&gt;, we could zero its coordinates when below a certain threshold (in absolute value). All the coordinates lie between &lt;span class=&#34;math inline&#34;&gt;\(-1\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(1\)&lt;/span&gt; since each vector has norm 1 (&lt;span class=&#34;math inline&#34;&gt;\(I\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(O\)&lt;/span&gt; are orthonormal), therefore all of them will be regularised in similar ways.&lt;/p&gt;
&lt;p&gt;After regularisation, the matrices will not be orthonormal anymore. Each vector can easily be re-normalised by scaling it by the inverse of its new norm, i.e. by &lt;span class=&#34;math inline&#34;&gt;\(\frac{1}{\lVert i_k \rVert}\)&lt;/span&gt; or &lt;span class=&#34;math inline&#34;&gt;\(\frac{1}{\lVert o_k \rVert}\)&lt;/span&gt;. There is no generic way to revert to an orthogonal basis and keep the zeros.&lt;/p&gt;
&lt;p&gt;We need a way to measure the &lt;code&gt;rotation messiness&lt;/code&gt; of each vector. As a shortcut, we can use the proportion of non-zero vector coordinates (after &lt;em&gt;de minimis&lt;/em&gt; regularisation).&lt;/p&gt;
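&lt;p&gt;A possible sketch of this shortcut (a hypothetical &lt;code&gt;messiness&lt;/code&gt; helper with an arbitrary threshold, one of many possible measures):&lt;/p&gt;

```python
import numpy as np

def messiness(v, threshold=0.1):
    # De minimis regularisation: zero the coordinates below the
    # threshold in absolute value, then report the proportion of
    # surviving non-zero coordinates.
    kept = np.greater(np.abs(v), threshold)
    return kept.sum() / v.size

# A pure passthrough direction is barely messy...
low = messiness(np.array([0.0, 1.0, 0.0, 0.0]))
# ...while an even mix of all input units is maximally messy.
high = messiness(np.full(4, 0.5))
```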
&lt;/div&gt;
&lt;div id=&#34;eigenvalues&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Eigenvalues&lt;/h2&gt;
&lt;p&gt;The same can be done for the &lt;span class=&#34;math inline&#34;&gt;\(\sigma\)&lt;/span&gt;s. As an avenue of experimentation, those values could not only be zeroed in places, but the large values could also be rescaled in some non-linear way (e.g. logarithmic or square-root rescaling).&lt;/p&gt;
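&lt;p&gt;As a sketch of this idea (the threshold value and the square-root compression are arbitrary choices for illustration):&lt;/p&gt;

```python
import numpy as np

def regularise_sigmas(s, threshold=0.05):
    # Zero the small singular values and compress the large ones with
    # a square root so that no single direction dominates.
    return np.where(np.greater(s, threshold), np.sqrt(s), 0.0)

sigmas = np.array([4.0, 1.0, 0.01])
reg = regularise_sigmas(sigmas)
```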
&lt;/div&gt;
&lt;div id=&#34;threshold&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Threshold&lt;/h2&gt;
&lt;p&gt;Where to set the threshold is to be experimented with. The mean? The median, since it is more robust? Some quartile?&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;by-2-decision-matrix&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;2-by-2 decision matrix&lt;/h2&gt;
&lt;p&gt;Based on those regularisations, we propose the following:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
\begin{matrix}
                    &amp;amp; \text{low rotation messiness} &amp;amp; \text{high rotation messiness} \\
\text{high } \sigma &amp;amp; \text{Double height}          &amp;amp; \text{Double depth}            \\
\text{low } \sigma  &amp;amp; \text{Delete}                 &amp;amp; \text{Keep identical}          \\
\end{matrix}
\]&lt;/span&gt;&lt;/p&gt;
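&lt;p&gt;In code, the decision for a single direction could look like this (the thresholds and the action labels are placeholders for whatever the experiments suggest):&lt;/p&gt;

```python
import numpy as np

def classify_direction(sigma, mess, sigma_threshold=0.5, mess_threshold=0.5):
    # Map one singular direction onto the 2-by-2 decision matrix above.
    high_sigma = bool(np.greater(sigma, sigma_threshold))
    high_mess = bool(np.greater(mess, mess_threshold))
    if high_sigma and high_mess:
        return "double depth"    # important and complex: insert a layer
    if high_sigma:
        return "double height"   # important but simple: add units
    if high_mess:
        return "keep identical"
    return "delete"              # contributes little: prune it
```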
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;todo-other-principal-components-methods&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;[TODO] Other Principal Components methods&lt;/h1&gt;
&lt;p&gt;SVD is PCA: it projects information onto hyperplanes.&lt;/p&gt;
&lt;p&gt;Reflect on non-linear versions: Principal Curves, Kernel Principal Components, Sparse Principal Components, Independent Component Analysis (&lt;em&gt;The Elements of Statistical Learning&lt;/em&gt;, s. 14.5 et seq.).&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;limitations-and-further-questions&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Limitations and further questions&lt;/h1&gt;
&lt;div id=&#34;limitations&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Limitations&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Only 1-D layers. Higher-order SVD is in principle feasible for higher order tensors. Other methods?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We delete the eigenvectors associated with low eigenvalues and limited rotations. There are other forms of regularisation, e.g. random weight cancelling, that would not care about anything eigen-.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;What is the real impact of ignoring the activation function? PCA requires centred values. Geometrically, uncentred values would mean more limited rotations since samples would sit in a quadrant far from 0.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;further-questions&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Further questions&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The final structure is a direct product of the training set. What if the training is done differently (batches sized or ordered differently)?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;What about training many variants with different subsets of the training set and using ensemble methods?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The eigenvalues could be modified when creating the new layers. By decreasing the highest eigenvalues (in absolute value), we effectively regularise the layers’ outputs. This decrease could bring additional non-linearity if the compression ratio depends on the eigenvalue (e.g. replacing it by its square root). And this non-linearity would not bring additional complexity to the back-propagation algorithm, or auto-differentiated functions: it only modifies the final values of the new matrices.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;litterature&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Literature&lt;/h1&gt;
&lt;p&gt;Here are a few summary literature references related to the topic.&lt;/p&gt;
&lt;div id=&#34;the-elements-of-statistical-learning&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;The Elements of Statistical Learning&lt;/h4&gt;
&lt;p&gt;The ESL (top of p. 409) proposes PCA to interpret layers, i.e. to improve the interpretability of the decisions made by a network.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;neural-network-implementations-for-pca-and-its-extensions&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Neural Network Implementations for PCA and Its Extensions&lt;/h4&gt;
&lt;p&gt;&lt;a href=&#34;http://downloads.hindawi.com/archive/2012/847305.pdf&#34; class=&#34;uri&#34;&gt;http://downloads.hindawi.com/archive/2012/847305.pdf&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Uses neural networks as a substitute for PCA.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;an-incremental-neural-network-construction-algorithm-for-training-multilayer-perceptrons&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;An Incremental Neural Network Construction Algorithm for Training Multilayer Perceptrons&lt;/h4&gt;
&lt;p&gt;Aran, Oya, and Ethem Alpaydin. “An incremental neural network construction algorithm for training multilayer perceptrons.” Artificial Neural Networks and Neural Information Processing. Istanbul, Turkey: ICANN/ICONIP (2003).&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://www.cmpe.boun.edu.tr/~ethem/files/papers/aran03incremental.pdf&#34; class=&#34;uri&#34;&gt;https://www.cmpe.boun.edu.tr/~ethem/files/papers/aran03incremental.pdf&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;kohonen-maps&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Kohonen Maps&lt;/h4&gt;
&lt;p&gt;&lt;a href=&#34;https://en.wikipedia.org/wiki/Self-organizing_map&#34; class=&#34;uri&#34;&gt;https://en.wikipedia.org/wiki/Self-organizing_map&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;self-organising-network&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Self-Organising Network&lt;/h4&gt;
&lt;div id=&#34;a-self-organising-network-that-grows-when-required-2002&#34; class=&#34;section level5&#34;&gt;
&lt;h5&gt;A Self-Organising Network That Grows When Required (2002)&lt;/h5&gt;
&lt;p&gt;&lt;a href=&#34;https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.14.8763&#34; class=&#34;uri&#34;&gt;https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.14.8763&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;the-cascade-correlation-learning-architecture&#34; class=&#34;section level5&#34;&gt;
&lt;h5&gt;The Cascade-Correlation Learning Architecture&lt;/h5&gt;
&lt;p&gt;&lt;a href=&#34;https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.125.6421&#34; class=&#34;uri&#34;&gt;https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.125.6421&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Growth with quick freeze as a way to avoid the expense of back-propagation.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;soinnself-organizing-incremental-neural-network&#34; class=&#34;section level5&#34;&gt;
&lt;h5&gt;SOINN: Self-Organizing Incremental Neural Network&lt;/h5&gt;
&lt;p&gt;&lt;a href=&#34;http://www.haselab.info/soinn-e.html&#34; class=&#34;uri&#34;&gt;http://www.haselab.info/soinn-e.html&lt;/a&gt;
&lt;a href=&#34;https://cs.nju.edu.cn/rinc/SOINN/Tutorial.pdf&#34; class=&#34;uri&#34;&gt;https://cs.nju.edu.cn/rinc/SOINN/Tutorial.pdf&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Seems focused on neuron by neuron evolution.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class=&#34;footnotes&#34;&gt;
&lt;hr /&gt;
&lt;ol&gt;
&lt;li id=&#34;fn1&#34;&gt;&lt;p&gt;We will only consider modifying the network layer by layer, not neuron by neuron.&lt;a href=&#34;#fnref1&#34; class=&#34;footnote-back&#34;&gt;↩︎&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li id=&#34;fn2&#34;&gt;&lt;p&gt;This could actually be a big limitation of this discussion. In reality, even an identity matrix yields changes by piping the inputs through a new round of non-linearity, which is not necessarily identical to the preceding layer.&lt;a href=&#34;#fnref2&#34; class=&#34;footnote-back&#34;&gt;↩︎&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>HarvardX Data Science course - First final project</title>
      <link>/post/2019/10/05/harvardx-data-science-course-first-final-project/</link>
      <pubDate>Sat, 05 Oct 2019 00:00:00 +0000</pubDate>
      <guid>/post/2019/10/05/harvardx-data-science-course-first-final-project/</guid>
      <description>
&lt;script src=&#34;./rmarkdown-libs/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;I recently finished the penultimate &lt;a href=&#34;https://github.com/Emmanuel-R8/capstone-movielens/blob/master/MovieLens.pdf&#34;&gt;final assignment&lt;/a&gt; for the &lt;a href=&#34;https://www.edx.org/professional-certificate/harvardx-data-science&#34;&gt;HarvardX Data Science course&lt;/a&gt;. The Stanford course was clearly machine learning. This one is definitely lighter on the machine learning and much heavier on the data science: how to source, clean and visualise data are key skills. The targeted knowledge is more traditional probability/statistics. Long-established fundamental techniques like inference and polling are there.&lt;/p&gt;
&lt;p&gt;This time, R is the central tool of the course. It makes clear sense. When I started learning it about 15 years ago, I loathed the multiple gotchas. Since then, new libraries have simplified base R and removed its exceptions and exceptions to exceptions. In addition, the &lt;code&gt;Rcpp&lt;/code&gt; library has eased the implementation of efficient algorithms and interfacing with popular libraries. Still not a speed demon, but not the snail it used to be.&lt;/p&gt;
&lt;p&gt;I won’t go through the project and my models. No revolutionary concepts. Just great results. I took half a day to reimplement it in Julia, both as a crosscheck and for personal training. As expected, it was a lot easier to read. But the big surprise was the speed difference. Although I didn’t time it, Julia only felt about twice as fast. Credit to the R project folks (I only used matrix operations, no modelling libraries).&lt;/p&gt;
&lt;p&gt;On this report, I got grades that can’t be improved upon. Happy camper.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Stanford Online - Machine Learning C229 </title>
      <link>/post/2019/08/02/stanford-online-machine-learning-c229/</link>
      <pubDate>Fri, 02 Aug 2019 00:00:00 +0000</pubDate>
      <guid>/post/2019/08/02/stanford-online-machine-learning-c229/</guid>
      <description>
&lt;script src=&#34;./rmarkdown-libs/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;

&lt;div id=&#34;TOC&#34;&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#review&#34;&gt;Review&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#exercises-and-grading&#34;&gt;Exercises and grading&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#summary&#34;&gt;Summary&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;

&lt;div id=&#34;review&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Review&lt;/h1&gt;
&lt;p&gt;I recently completed the Stanford online version of the Machine Learning CS229 course taught by Andrew Ng. There is no need to introduce this course which has reached &lt;em&gt;stardom&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;It often was a trip down memory lane, repeating what I studied in the late 90s. It was interesting that quite a bit has remained relevant. Back then, and I am now talking early 90s, neural networks were still fashionable but computationally intractable past what would hardly be considered a single layer nowadays. Backpropagation was already used, but similarly quickly tedious.&lt;/p&gt;
&lt;p&gt;Enough recalling old times… There was plenty I had not done back then.&lt;/p&gt;
&lt;p&gt;The course was extremely pleasant. The progression made sense, pace was enjoyable. In particular, the blackboard style presentation was great. Following along with pen and paper made things easily stick.&lt;/p&gt;
&lt;p&gt;Every piece of code had to be written in Matlab/Octave. The choice was surprising in this day and age, where R has been a mainstay of statistics and statistical learning, and Python is now the language of choice to glue and interface so many optimised C/C++ libraries (in addition to its natural qualities). But the rationale, Matlab/Octave being very natural for implementing algorithms where matrices are the mathematical object of choice, made sense. The learning curve was easy, code looked very legible and natural. For short scripts, all good. For anybody who thinks that his/her code will one day be maintained by a psychopath who knows his/her address, Matlab/Octave is to be left as a Wikipedia article. Maybe Julia will become a better choice. (Numpy matrix calculations look very far from the mathematical formalism and are easy to bug up.)&lt;/p&gt;
&lt;p&gt;The course was light on the theory side. No surprise: long curriculum, few hours. On the flip side, the recurring emphasis on the ‘what does it mean?’, developing intuitions and, in particular, the hammering about the bias/complexity or bias/variance trade-off would be of great value to anyone entering the field. There is a somewhat prevalent meme that machine learning only works because we now have trainloads of SD cards of data, and that if something doesn’t quite work, just throw more data at it. Hammering that trade-off will hopefully make many become at least sceptical. More data is not a magic wand.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;exercises-and-grading&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Exercises and grading&lt;/h1&gt;
&lt;p&gt;The automated grading system was surprisingly efficient. There were a few gotchas on exact spelling or white spaces. But overall, no complaints. And given the lack of real-people face-to-face time, this was a nice alternative.&lt;/p&gt;
&lt;p&gt;The regular coding exams were interesting and the backend infrastructure worked great. As time progressed, the difficulty of the exams significantly dropped because of the more difficult content (it is harder to draft an exercise that really covers content that was only superficially addressed). The course 6 exam was clearly the hardest for many of the students.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;summary&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Summary&lt;/h1&gt;
&lt;p&gt;Worth it? On a personal level, definitely. And impossible to beat the value for money.&lt;/p&gt;
&lt;p&gt;As a career-enhancing proposition, it remains to be seen, and I’ll need to see it to believe it.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Hello Blogdown!</title>
      <link>/post/2019/08/01/hello-blogdown/</link>
      <pubDate>Thu, 01 Aug 2019 00:00:00 +0000</pubDate>
      <guid>/post/2019/08/01/hello-blogdown/</guid>
      <description>
&lt;script src=&#34;./rmarkdown-libs/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;

&lt;div id=&#34;TOC&#34;&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#blogdown&#34;&gt;Blogdown&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#setup&#34;&gt;Setup&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#themes&#34;&gt;Themes&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;

&lt;div id=&#34;blogdown&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Blogdown&lt;/h1&gt;
&lt;p&gt;I have been a happy user of R markdown and &lt;a href=&#34;https://bookdown.org/&#34;&gt;bookdown&lt;/a&gt; developed by &lt;a href=&#34;https://yixui.name/&#34;&gt;Yihui Xie&lt;/a&gt;. When I decided to start this blog, giving &lt;code&gt;blogdown&lt;/code&gt; a try was a no-brainer. To be honest, it was not my first choice. Jekyll was #1 given its good support by GitHub Pages. Then I took a dive with Pelican. Both are impressive, but both brought equally painful theming: the base theme sort of works, and only sort of, but anyway was not what I wanted. Attempts to use anything else failed. I didn’t have time to dig into the HTML/CSS templates.&lt;/p&gt;
&lt;p&gt;Blogdown just worked out of the box without any &lt;code&gt;sort of&lt;/code&gt; caveat.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;setup&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Setup&lt;/h1&gt;
&lt;p&gt;Basically, I just followed the blogdown documentation. As for all his projects, Yihui’s documentation is clear, didactic and shows how much thought has gone into making his software easy to use, yet powerful.&lt;/p&gt;
&lt;p&gt;By default, blogdown uses &lt;a href=&#34;https://gohugo.io/&#34;&gt;Hugo&lt;/a&gt;, but a Jekyll backend is in beta.&lt;/p&gt;
&lt;p&gt;Great resources are &lt;a href=&#34;https://aurora-mareviv.github.io/talesofr/2017/08/r-blogdown-setup-in-github/&#34;&gt;R Blogdown Setup in GitHub&lt;/a&gt; and its valuable update &lt;a href=&#34;https://aurora-mareviv.github.io/talesofr/2018/02/r-blogdown-setup-in-github-2/&#34;&gt;R Blogdown Setup in GitHub (2)&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;themes&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Themes&lt;/h1&gt;
&lt;p&gt;As with other solutions, theming is never straightforward. Blogdown uses Hugo themes, which cannot always be imported without changes and may need a bit of massaging. Having said that, if you find a theme you like, it is just a matter of running &lt;code&gt;blogdown::install_theme(&#34;REPONAME&#34;)&lt;/code&gt;, and the theme will be downloaded and installed in the &lt;code&gt;themes&lt;/code&gt; subdirectory. &lt;code&gt;blogdown&lt;/code&gt; will automatically change the &lt;code&gt;theme:&lt;/code&gt; parameter in the &lt;code&gt;toml&lt;/code&gt; configuration file and the site will be re-generated. Easy enough? Bonus points for &lt;code&gt;Hugo&lt;/code&gt;, which takes under 100ms to do that job.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
  </channel>
</rss>
