Deep Learning (Aalto University CS-E4890 2022)

Introduction

Content

Linear Classifiers and Multilayer Perceptrons

  • Using the ReLU function makes the MLP a piecewise linear function.
  • Chain rule via Jacobian matrices.
  • Each computational unit comprises the trainable parameters, the inputs and the outputs. Implementation requires the forward computation of the outputs given the inputs and parameters, and the backward computation, which takes the gradients of the loss with respect to the outputs and returns the gradients with respect to the parameters and the inputs.
  • (, ) - Universal approximation theorem for single hidden layer MLPs.
  • (, a) - MLP with a single hidden layer with \(n\) inputs containing \(Lm\) units has \(\mathcal{O}(L^nm^n)\) regions for which the network is a piecewise linear function.
  • (, a) - MLP with \(L\) hidden layers with \(n\) inputs and width \(m > n\) can compute functions with \(\Omega((m/n)^{(L-1)n}m^n)\) linear regions.

Optimization

This lecture introduced many important concepts related to optimization in deep learning, such as intuition behind momentum, batch-normalization, input scaling, and the relevance of the Hessian.

Convolutional Neural Networks (CNNs)

This lecture emphasized key features, tricks and properties related to CNNs such as:

  • Weight sharing.
  • Translation equivariance (\(f(T(x))=T(f(x))\) which is in contrast to invariance \(f(T(x))=f(x)\)).
  • Transposed convolutions (backward computations in convolutional layers).
  • Residual connections, which address the shattered gradients problem, i.e. the problem that in deeper networks without residual connections the gradients (as a function of the input) resemble white noise, so small changes in the input have a significant effect on the gradient, which makes optimization harder.

Regularization

Common methods for regularization include:

  • Limiting model capacity, by either reducing network size, using weight decay or parameter sharing.
  • Early stopping.
  • Ensemble approaches like dropout, or other probabilistic methods.
  • Data augmentation, e.g. noise injection to inputs, applying label-preserving transformations (e.g. translating most images does not change their labels), and adversarial training.

This lecture gives intuition as to how the above methods help, and other details such as Madry's defense for adversarially robust training etc.

Recurrent Neural Networks (RNNs)

We start with the motivation that in some tasks the inputs can be sequences of different sizes. One such task is sentiment analysis, where we want to classify sentences from some natural language into \(C=\{1,\cdots,c\}\), with 1 being the worst sentiment.

A simple, naive way to process such sequences is via a for loop. For example, take the problem of counting the number of zeros in some sequence \((x_{1},\ldots,x_{T_n})\), where \(T_n\) denotes the length of the \(n^{th}\) sample. A simple for loop for this problem is:

h = 0                 # running count of zeros seen so far
for x_i in x:         # x is the input sequence (x_1, ..., x_{T_n})
    if x_i == 0:
        h = h + 1

The computation graph for this would involve a function \(f\) being applied sequentially, where at each step \(i\) the corresponding input \(x_i\) is part of the input in addition to the output from the previous step. Given these inputs, \(f\) takes the previous output, which counts the number of 0s up to \(x_{i-1}\), and increments it by 1 if \(x_i=0\).

For more complex problems, we can still have some function \(f\) applied in a similar manner, however \(f\) is now a function we want to learn from data. This gives rise to the vanilla Recurrent Neural Network (RNN), where we have

\[y = \text{Softmax}(\mathbf{f}(\mathbf{h}_{N-1}, \mathbf{x}_N))\] \[\mathbf{h}_{N-1}=\mathbf{f}(\mathbf{h}_{N-2},\mathbf{x}_{N-1})\]

Where \(\mathbf{f}(\mathbf{h},\mathbf{x})=\sigma(\mathbf{W}\mathbf{h}+\mathbf{U}\mathbf{x}+\mathbf{b})\) for some non-linearity \(\sigma\), the \(\mathbf{h}_i\) are called hidden states, and \(\mathbf{h}_0\) is predefined, e.g. a sample from some distribution. Here the softmax is put at the end assuming that we want to classify sequences.
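
As a concrete illustration, a minimal numpy sketch of this recurrence for sequence classification (the dimensions and the readout matrix V mapping the final hidden state to class scores are illustrative assumptions, not from the lecture):

import numpy as np

def softmax(a):
    a = a - a.max()
    e = np.exp(a)
    return e / e.sum()

def rnn_classify(xs, W, U, b, V, h0):
    """Vanilla RNN: h_t = tanh(W h_{t-1} + U x_t + b), softmax over a readout of the last state."""
    h = h0
    for x in xs:                        # one recurrence step per sequence element
        h = np.tanh(W @ h + U @ x + b)
    return softmax(V @ h)               # class probabilities

rng = np.random.default_rng(0)
W, U, b = 0.1 * rng.normal(size=(4, 4)), 0.1 * rng.normal(size=(4, 3)), np.zeros(4)
V, h0 = rng.normal(size=(2, 4)), np.zeros(4)
print(rnn_classify(rng.normal(size=(5, 3)), W, U, b, V, h0))  # a sequence of 5 inputs of dimension 3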

Compared to a standard feedforward network, which starts with the input and where every layer has its own learnable parameters, an RNN starts from some hidden state \(\mathbf{h}_0\), with parts of the input being applied sequentially, and the learnable parameters are shared throughout the network. (#TODO Are there networks where all of the input is applied at every step in the RNN? I suppose not since otherwise \(f\) cannot be some fixed function to be learnt.)

Training RNNs as described above is similar to how we trained feedforward networks - we can use for e.g. the cross-entropy loss.

Backpropagation in RNNs can be thought of as first assuming each layer has its own parameters and backpropagating, then using those gradients to compute the true gradient with respect to the shared parameters. Since the parameters we assume exist for each layer are the shared parameters for the whole network, the chain rule tells us that we have to sum the gradients over the iterations of the sequence. Another way to think of this is that there is a function that takes the global shared parameters and generates parameters for each layer - here it is the identity function and we backpropagate through that. This is called backpropagation through time. (#TODO Where do the parameters for each layer alone come from?)

To see how recurrence might not be the best idea, consider a vanilla RNN starting with hidden state \(\mathbf{0}\) with no bias term, i.e. \(\mathbf{b}=\mathbf{0}\). At iteration \(t\) the hidden state is then \(\mathbf{h}_t = \sum_{\tau=1}^t \mathbf{W}^{t-\tau}\mathbf{U}\mathbf{x}_\tau\). Assuming \(\mathbf{W}\) is diagonalizable with an orthogonal \(\mathbf{Q}\), i.e. \(\mathbf{W}^{t-\tau} = \mathbf{Q}\Lambda^{t-\tau}\mathbf{Q}^T\), then for \(\mathbf{z}=\mathbf{U}\mathbf{x}_\tau\) we get \(||\mathbf{W}^{t-\tau}\mathbf{z}||^2 = ||\Lambda^{t-\tau}\mathbf{Q}^T\mathbf{z}||^2 = \sum_i (\lambda_i^{t-\tau}\tilde{z}_i)^2\) with \(\tilde{\mathbf{z}} = \mathbf{Q}^T\mathbf{z}\). Here we use the orthogonality of \(\mathbf{Q}\), and we see that if even one eigenvalue \(\lambda_i\) of \(\mathbf{W}\) is such that \(|\lambda_i|>1\) then the norm explodes and we will have very large numbers in forward computations.

The largest absolute value among the eigenvalues of \(\mathbf{W}\) is called the spectral radius. One solution to this is to use the \(\tanh\) non-linearity which bounds the outputs to be in \((-1,1)\).

Audience question: When using mini-batches, what happens if we have sequences of different length in these mini-batches? Lecturer answer: One method is to create a tensor of the same size, then mark the index at the end of each sequence in the batch so that the values in the tensor which come after that index are discarded.

But using \(\tanh\) introduces another problem, namely the gradients are now \[ \frac{\partial \mathcal{L}}{\partial \mathbf{h}_1}^T = \frac{\partial \mathcal{L}}{\partial \mathbf{h}_t}^T \prod_{\tau=t,\cdots,2} \frac{\partial \mathbf{h}_\tau}{\partial \mathbf{h}_{\tau-1}} = \frac{\partial \mathcal{L}}{\partial \mathbf{h}_t}^T \prod_{\tau=t,\cdots,2}\text{diag}(\phi_\tau')\mathbf{W} \] Where the LHS is the (transposed) vector of partial derivatives of the loss with respect to the hidden state \(\mathbf{h}_1\) and \(\phi_\tau' = \phi'(\mathbf{W}\mathbf{h}_{\tau-1}+\mathbf{U}\mathbf{x}_\tau+\mathbf{b})\) is the derivative of the non-linearity (here, \(\tanh\)). Assuming that the neurons are not saturated, which means the pre-activations are close to 0 and \(\phi'\) is bounded away from 0, there is some \(\gamma\) such that \(|\phi'_\tau|\geq \gamma\), and if the spectral radius of \(\mathbf{W}\) is greater than \(\frac{1}{\gamma}\) then the gradient explodes. From this we can conclude that we would want the neurons to be saturated. One solution to this problem is gradient clipping, where if the norm of the gradient is greater than some predefined threshold, we rescale it to have the same magnitude as this threshold.
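
A minimal PyTorch sketch of gradient clipping inside a training step (the max_norm value is an arbitrary choice):

import torch

def training_step(model, loss, optimizer, max_norm=1.0):
    """One optimization step with gradient-norm clipping."""
    optimizer.zero_grad()
    loss.backward()
    # If the global gradient norm exceeds max_norm, rescale it to have norm max_norm.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()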

However, if we manage to have \(0 < |\phi'_\tau| \leq 1\) and the spectral radius of \(\mathbf{W}\) is less than 1, the gradient will decay exponentially in \(t\); this is called the vanishing gradient problem. Avoiding it requires the neurons to be non-saturated, which conflicts with the objective above of preventing exploding gradients. When we have vanishing gradients, it becomes harder to learn long-range dependencies.

In practice, recurrent units with gating mechanisms work better, e.g. Long Short-Term Memory (LSTM) units and Gated Recurrent Units (GRU).

The motivation for GRUs comes from wanting to keep some old values of the hidden state instead of overwriting all of it when performing the forward pass. GRUs implement this by using an additional update gate \(\mathbf{u}_t \in (0,1)\) that controls which states should be updated. \[ \mathbf{h}_t = (1 - \mathbf{u}_t) \odot \mathbf{h}_{t-1} + \mathbf{u}_t \odot \tilde{\mathbf{h}}_t\] \[ \mathbf{u}_t = \sigma(\mathbf{W}_u\mathbf{h}_{t-1} + \mathbf{U}_u\mathbf{x}_t + \mathbf{b}_u)\]

Where \(\sigma\) is the sigmoid function and \(\tilde{\mathbf{h}}_t\) is the vector of new state candidates, computed using only the states selected by a reset gate \(\mathbf{r}_t\), giving: \[\tilde{\mathbf{h}}_t = \phi(\mathbf{W}(\mathbf{r}_t \odot \mathbf{h}_{t-1}) + \mathbf{U}\mathbf{x}_t + \mathbf{b}_h)\] \[\mathbf{r}_t = \sigma(\mathbf{W}_r \mathbf{h}_{t-1} + \mathbf{U}_r \mathbf{x}_t + \mathbf{b}_r)\]
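
A minimal numpy sketch of one GRU step following the equations above (p is just a dict holding the weight matrices and biases; shapes are left implicit):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(h_prev, x, p):
    """One GRU step with update gate u, reset gate r and candidate state h_tilde."""
    u = sigmoid(p["Wu"] @ h_prev + p["Uu"] @ x + p["bu"])            # update gate
    r = sigmoid(p["Wr"] @ h_prev + p["Ur"] @ x + p["br"])            # reset gate
    h_tilde = np.tanh(p["W"] @ (r * h_prev) + p["U"] @ x + p["bh"])  # candidate state
    return (1 - u) * h_prev + u * h_tilde                            # keep old vs. take new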

Writing out the gradients with respect to the hidden state, we see that for the GRU the gradients decay with a rate \(\frac{1}{2}\) rather than \(\gamma\) as in RNNs, and if \(\gamma\) is large, the gradient magnitude grows exponentially as \(\mathcal{O}\big(\big(\frac{\gamma}{4}\big)^t\big)\) compared to \(\mathcal{O}(\gamma^t)\) in RNNs.

Consider the Linear Gaussian model with temporal structure: \[ \mathbb{P}(\mathbf{h}_1) = \mathcal{N}(\mathbf{h}_1|\mathbf{\mu}_1,\mathbf{R}_1)\] \[\mathbb{P}(\mathbf{h}_t|\mathbf{h}_{t-1}) = \mathcal{N}(\mathbf{h}_t|\mathbf{B}\mathbf{h}_{t-1},\mathbf{R})\] \[\mathbb{P}(\mathbf{x}_t|\mathbf{h}_t)=\mathcal{N}(\mathbf{x}_t|\mathbf{A}\mathbf{h}_t,\mathbf{V})\]

When finding the conditional distribution \(\mathbb{P}(\mathbf{h}_t|\mathbf{x}_{1:t})\), using a message passing algorithm yields the Kalman filter, which has 2 steps:

  • Prediction: \(\mathbb{P}(\mathbf{h}_t|\mathbf{x}_{1:t-1})=\mathcal{N}(\mathbf{h}_t|\mathbf{h}_t^*,\mathbf{P}_t)\) where \(\mathbf{h}_t^* = \mathbf{B}\mathbf{h}_{t-1}^{**}\) and \(\mathbf{P}_t = \mathbf{B}\mathbf{\Sigma}_{t-1}\mathbf{B}^T + \mathbf{R}\)
  • Correction: \(\mathbb{P}(\mathbf{h}_t|\mathbf{x}_{1:t}) = \mathcal{N}(\mathbf{h}_t|\mathbf{h}_t^{**},\mathbf{\Sigma}_t)\) where \(\mathbf{h}_t^{**} = \mathbf{h}_t^* + \mathbf{K}_t(\mathbf{x}_t - \mathbf{A}\mathbf{h}_t^{*})\), \(\mathbf{\Sigma}_t = (\mathbf{I} - \mathbf{K}_t\mathbf{A})\mathbf{P}_{t}\) and \(\mathbf{K}_t =\mathbf{P}_{t}\mathbf{A}^T(\mathbf{A}\mathbf{P}_{t}\mathbf{A}^T + \mathbf{V})^{-1}\).

In the one-dimensional case, we can see that the corrected value of the hidden state can be rewritten as \(h^{**}_t = (1-u_t)h^*_t + u_t\frac{x_t}{a}\), where \(u_t\) is a sigmoid of the form \(u_t = \sigma(\log \frac{a^2 p_{t}}{v})\). Comparing with the update rule for the hidden state in the GRU, we see that they are quite similar, which gives some clarity, if not justification, to the gating mechanism, which may seem a bit arbitrary at first.
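
To make the analogy concrete, a small numpy check that the one-dimensional correction step really is a gated update (the numeric values are arbitrary):

import numpy as np

def kalman_1d_correction(h_pred, p_pred, x, a, v):
    """1D Kalman correction written as a gated update between prediction and observation."""
    u = 1.0 / (1.0 + np.exp(-np.log(a ** 2 * p_pred / v)))  # u = sigmoid(log(a^2 p / v))
    h_corr = (1 - u) * h_pred + u * (x / a)                 # gate between h* and x/a
    K = p_pred * a / (a ** 2 * p_pred + v)                  # the usual Kalman gain
    assert np.isclose(h_corr, h_pred + K * (x - a * h_pred))
    return h_corr

print(kalman_1d_correction(h_pred=0.5, p_pred=1.0, x=2.0, a=1.5, v=0.3))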

LSTMs on the other hand introduce a new vector of states \(\mathbf{c}_t\), thought of as a private attribute for an LSTM cell which is updated as: \[ \mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_{t}\odot \phi_c (\mathbf{W}_c\mathbf{h}_{t-1} + \mathbf{U}_c\mathbf{x}_t + \mathbf{b}_c)\] with the forget gate \(\mathbf{f}_t =\sigma(\mathbf{W}_f\mathbf{h}_{t-1} +\mathbf{U}_f\mathbf{x}_t +\mathbf{b}_f)\) and the input gate \(\mathbf{i}_t = \sigma (\mathbf{W}_i\mathbf{h}_{t-1} + \mathbf{U}_i\mathbf{x}_t + \mathbf{b}_i)\in (0,1)\).

The output of the LSTM cell, here called the observed state, is then: \[\mathbf{h}_t = \mathbf{o}_t \odot \phi_h(\mathbf{c}_t)\] where the output gate is \(\mathbf{o}_t = \sigma(\mathbf{W}_o\mathbf{h}_{t-1} + \mathbf{U}_o\mathbf{x}_t + \mathbf{b}_o)\).
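
A minimal numpy sketch of one LSTM step with the gates defined above (again p is just a dict of weight matrices and biases):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(h_prev, c_prev, x, p):
    """One LSTM step: forget gate f, input gate i, output gate o, cell state c."""
    f = sigmoid(p["Wf"] @ h_prev + p["Uf"] @ x + p["bf"])        # forget gate
    i = sigmoid(p["Wi"] @ h_prev + p["Ui"] @ x + p["bi"])        # input gate
    o = sigmoid(p["Wo"] @ h_prev + p["Uo"] @ x + p["bo"])        # output gate
    c_tilde = np.tanh(p["Wc"] @ h_prev + p["Uc"] @ x + p["bc"])  # candidate cell state
    c = f * c_prev + i * c_tilde                                 # private cell state
    h = o * np.tanh(c)                                           # observed state
    return h, c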

The gradient of this state is then \(\frac{\partial \mathbf{c}_t}{\partial \mathbf{c}_{t-1}} = \text{diag}(\mathbf{f}_t)\), and hence if we have \(\mathbf{f}_t=1\) this gradient neither vanishes nor explodes.

When initializing weights for LSTMs, it is common practice to use small random weights and a small \(\mathbf{b}_f\), which sets the forget gate to around \(\frac{1}{2}\) (the sigmoid of a near-zero pre-activation is \(\approx 0.5\)) - but setting \(\mathbf{b}_f\) to large values means the forget gates are closer to 1, which encourages gradients to flow better back through time, allowing us to learn long-range dependencies.

Recurrent units are commonly used in sequence-to-sequence models.

During training, a method called teacher forcing is sometimes used, where the true sequence is substituted for the decoder-generated sequence with some probability.

#TODO Also write about Beam Search.

Attention-based models

Consider the sequence-to-sequence model based on an encoder-decoder framework, where we encode the whole input sequence into a single state (sometimes called the "context"), and a decoder which uses the context alongside a transformed version of it.

Suppose instead of a single state, we want to encode the input sequence into a representation whose length varies with the input, i.e. longer input sequences lead to longer representations of their encoding. A first attempt at this may involve generating \(k\) states for \(k\) inputs, where the intermediate representation evolves based on the immediately previous intermediate representation. (E.g. \(x_1,...,x_k\) generates the representations \(z_1,...,z_k\), with the intermediate representations \(h_1,...,h_k\) such that \(h_i\) depends only on \(h_{i-1}\).) This however does not work well in practice since the intermediate representations only get information from the preceding values, while values at later indices may also carry relevant information. One solution is to use a bi-directional RNN for the intermediate representations (Bahdanau et al. 2014), so the intermediate representation \(h_i\) is now a concatenation of two chains of intermediate representations, one going forward in the sequence order and the other going backward.

Now we would want to pass the varying length input representation to the decoder. This is done via an attention block. The attention block takes a varying length input representation and produces a context of fixed size for each iteration when decoding the sequence. This attention block, which we have not defined yet, can be thought of as a switch which selects the parts of the input representation that are relevant to the current iteration in the decoding sequence. The attention block can take as input the previous hidden state from the decoding sequence as well. Mathematically, at iteration \(i\) of the decoding sequence, we could have the output \(\mathbf{c}\) of the attention block as: \[ \mathbf{c} = \sum_{j} \alpha_j \mathbf{z}_j\] where \(\alpha_j = \frac{\exp(e_j)}{\sum_{j'}\exp(e_{j'})}\) with \(e_j = \text{MLP}(\mathbf{h}_{i-1},\mathbf{z}_j)\), where \(\{\mathbf{z}_j\}_{j=1}^J\) are the varying length input representations. This was introduced in (, a) (#ASK So how does the size of the attention blocks change with different inputs).
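
A minimal numpy sketch of this attention step; the particular one-hidden-layer scoring function \(e_j = \mathbf{v}_a^T\tanh(\mathbf{W}_a\mathbf{h}_{i-1}+\mathbf{U}_a\mathbf{z}_j)\) is the common Bahdanau-style parametrization, an assumption here:

import numpy as np

def softmax(a):
    a = a - a.max()
    e = np.exp(a)
    return e / e.sum()

def additive_attention(h_prev, zs, Wa, Ua, va):
    """Score every encoder representation z_j against the decoder state, then average."""
    e = np.array([va @ np.tanh(Wa @ h_prev + Ua @ z) for z in zs])  # e_j = MLP(h_{i-1}, z_j)
    alpha = softmax(e)                                              # weights sum to one
    return alpha @ zs, alpha                                        # context c and the weights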

(#ASK In the lecture around 21:40 it sounds like he implies the MLP is different for each input from the encoder representation, but the MLP has no index, so I assume it's the same MLP?)

Attention blocks have also been used in image captioning tasks, see the paper "Show, attend and tell".

We now look at the intermediate representation: the sequential nature of the bi-directional RNN means that training could be slow, since this step of the encoding process depends on the size of the input. Gehring et al 2017 adopt a CNN instead of a bi-directional RNN for this step. The CNN can do the computations in parallel, however it cannot take into account whether a position is at the beginning or end of the sequence. To fix this problem, Gehring et al 2017 simply include this information by adding a position embedding value \(\mathbf{p}_j\) to each input word embedding \(\mathbf{x}_j\). (Note that here they add the positional embedding to the input embedding instead of concatenating.)

In the decoder, we can get rid of RNNs by using a CNN of the form \(\mathbf{y}_i = \text{CNN}(\mathbf{y}_{i-1},\cdots,\mathbf{y}_1,\mathbf{z}_1,\cdots,\mathbf{z}_n)\) (an auto-regressive model) where the receptive field of each \(\mathbf{y}_i\) does not include any subsequent elements \(\mathbf{y}_j \: : \: j\geq i\) - this can be done using shifted convolutions. Shifted convolutions produce outputs aligned with the input, where padding is added only at the beginning of the sequence instead of at both ends; they are also called causal convolutions. 1D convolutions are used since here we process sequences. Gehring et al 2017 went one step further and added an attention block to the decoder as well, which finally produces the output sequence \(\mathbf{o}_i\) such that: \[ \mathbf{o}_i = \sum_j \alpha_{ij}(\mathbf{z}_j+\mathbf{x}_j+\mathbf{p}_j)\] Where \(\alpha_{ij}=\frac{\exp(\mathbf{h}_i^T\mathbf{z}_j)}{\sum_{j'=1}^n \exp(\mathbf{h}_i^T\mathbf{z}_{j'})}\). (#TODO Check how this attention mechanism works, there was a typo in slide 18).

Gehring et al 2017 in fact used multiple decoder blocks, skip connections and much more.

Transformers were introduced by Vaswani et al 2017. The general architecture of a transformer is very similar to the convolutional sequence-to-sequence model from Gehring et al 2017, to recall, which had an encoder architecture that takes an input sequence \(\mathbf{x}_1,\cdots,\mathbf{x}_n\), generates from them the contexts \(\mathbf{z}_1,\cdots,\mathbf{z}_n\), which are passed onto a decoder which uses an attention mechanism to produce the final sequence \(\mathbf{y}_1,\cdots,\mathbf{y}_m\).

The transformer network has a different attention mechanism which they call the scaled dot-product attention, where the output sequence only uses \(\mathbf{z}_i\)'s without the input embedding and positional encoding values, and the weights are computed using the formula: \[ \alpha_{ij} = \frac{\exp(\mathbf{z}_j^T\mathbf{h}_i/\sqrt{d_k})}{\sum_{j'=1}^n \exp(\mathbf{z}_{j'}^T\mathbf{h}_i/\sqrt{d_k})}\]

Where \(d_k\) is the dimensionality of \(\mathbf{z}_j\) and \(\mathbf{h}_i\). The authors suggest the following interpretation for this attention block, as a layer that finds values \(\mathbf{v}_j=\mathbf{z}_j\) with keys \(\mathbf{k}_j=\mathbf{z}_j\) closest to the query \(\mathbf{q}_i=\mathbf{h}_i\). Re-writing the scaled dot-product attention with this interpretation we have: \[ \mathbf{o}_i =\sum_j \alpha_{ij}\mathbf{v}_j\] Where \(\alpha_{ij} = \frac{\exp(\mathbf{k}_j^T\mathbf{q}_i/\sqrt{d_k})}{\sum_{j'=1}^n\exp(\mathbf{k}_{j'}^T\mathbf{q}_i/\sqrt{d_k})}\).
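
A minimal numpy sketch of scaled dot-product attention in matrix form (batching and the projection matrices are omitted):

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n, d_k) queries, K: (m, d_k) keys, V: (m, d_v) values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # similarity of each query to each key
    scores = scores - scores.max(axis=-1, keepdims=True)     # for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                                       # (n, d_v) outputs o_i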

In Vaswani et al 2017, the authors in fact use multi-headed attention, whereby keys, queries and values are first projected into lower-dimensional spaces and the scaled dot-product attention is performed there, with the outputs being concatenated. Thus we have for \(\mathbf{V}\in \mathbb{R}^{m\times d_v}, \mathbf{Q}\in\mathbb{R}^{n\times d_k}, \mathbf{K}\in\mathbb{R}^{m\times d_k}\) the multi-headed attention block is \(\text{MultiHead}(\mathbf{Q},\mathbf{K},\mathbf{V}) = \text{Concat}(\text{head}_1,\cdots,\text{head}_h)W^O \in \mathbb{R}^{n\times d_k}\) and \(\text{head}_i = \text{Attention}(\mathbf{Q}W_i^Q, \mathbf{K}W_i^K, \mathbf{V}W_i^V) \in \mathbb{R}^{n\times d_i}\).

For the encoder block, they use the same multi-headed attention mechanism to convert the inputs \(\mathbf{x}_1,\cdots,\mathbf{x}_n\) into \(\mathbf{z}_1,\cdots,\mathbf{z}_n\), where the vectors \(\mathbf{x}_i\) are used as keys, values and queries - this is called self-attention. The first position here affects the representation in the last position and vice-versa after just one layer, in contrast to the CNNs and variants of RNNs used previously. In this form, when we shuffle the inputs of the self-attention block, the resulting output is permuted in the same way, i.e. this block is permutation-equivariant. To combat this they use a hard-coded positional encoding, which for element \(i\) of the encoding is a sine or cosine of \(\frac{p}{10000^{2i/d}}\) depending on whether \(i\) is even or odd respectively. Here \(p\) is the position, and the encoding itself has dimensionality \(d\), the same as the input/output embeddings. In the encoder block they then use skip connections and an MLP as well, and all of these can be stacked on top of each other.
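
A minimal numpy sketch of the sinusoidal positional encoding described above (d is assumed even):

import numpy as np

def positional_encoding(num_positions, d):
    """Sinusoids with wavelength 10000^(2i/d): sin on even dimensions, cos on odd ones."""
    pe = np.zeros((num_positions, d))
    p = np.arange(num_positions)[:, None]        # positions
    div = 10000.0 ** (np.arange(0, d, 2) / d)    # one frequency per sin/cos pair
    pe[:, 0::2] = np.sin(p / div)
    pe[:, 1::2] = np.cos(p / div)
    return pe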

For the decoder block, which must implement an auto-regressive model, they propose to use a so-called masked self-attention block when converting the outputs \(\mathbf{y}_i\) being generated to queries \(\mathbf{q}_i\), which can be implemented as: \[\mathbf{h}_i = \sum_j \alpha_{ij}\mathbf{v}_j\] Where \[ \alpha_{ij} = \frac{\exp(\mathbf{k}_j^T\mathbf{q}_i/\sqrt{d_k} + m_{ij})}{\sum_{j'=1}^n \exp(\mathbf{k}_{j'}^T\mathbf{q}_i/\sqrt{d_k} + m_{ij'})}\]

With the masks \(m_{ij}\) such that \(m_{ij}=-\infty\) if \(j>i\) else 0, thus \(\alpha_{ij}=0\) for \(j>i\), meaning \(\mathbf{v}_{i+1},\cdots,\mathbf{v}_m\) are not used when computing \(\mathbf{h}_i\). The overall decoder thus uses a combination of the masked multi-headed attention, skip connections, multi-headed attention and MLPs, with the decoder blocks being stacked as needed.
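
A minimal numpy sketch of this additive mask (using \(-\infty\) so the corresponding attention weights become exactly zero after the softmax):

import numpy as np

def causal_mask(n):
    """m_ij = 0 if j <= i, -inf if j > i: future positions are hidden from position i."""
    mask = np.zeros((n, n))
    mask[np.triu_indices(n, k=1)] = -np.inf
    return mask

print(causal_mask(4))  # added to the attention scores, this enforces the autoregressive structure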

Training transformers often uses a ramp-up learning-rate schedule, with a linearly growing learning rate followed by a decaying learning rate after some predefined iteration number.

One of the famous models that make use of these ideas is BERT from Devlin et al 2018, which is often used as a pre-trained model that can later be fine-tuned for particular tasks. This model is essentially a transformer encoder, which takes in a single sentence or a pair of sentences (pairs are important for downstream tasks such as question answering) and outputs a sequence of representations of the same length as the input. One pretraining task for BERT is the following: mask a token to corrupt the input sequence and reconstruct this masked token. Another pretraining task is to take 2 sentences and predict (binary output) whether the second sentence follows the first. After these pretraining tasks, one can fine-tune the model for the specific task.

Graph Neural Networks

Prior to this lecture we considered inputs such as vectors, or tensors with 1D or 2D spatial structure, or sequences of varying lengths with temporal structure. However in some applications, the input is represented as a graph. Here we define a graph \(G\) as a 3-tuple \(G=(u,V,E)\) where \(u\) is a global attribute, \(V\) is a set of nodes with attributes \(x_i\) and \(E\) is a set of edges with attributes \(e_{kj}\). We will mostly be concerned with undirected graphs.

An example of a task involving graphs is to predict chemical properties of molecules, such as toxicity, excitation spectra, level of activity of a chemical compound against cancer cells etc. The task is similar to regression, but the inputs are graphs \(\mathcal{G}\) and the outputs are vectors in \(\mathbb{R}^n\) of values we want to predict. The graph representation of a molecule here could be:

  • Global attribute \(u\) is some known property of a molecule.
  • Nodes \(V\) are such that each node corresponds to an atom, and the node attribute \(x_i\) is the atom's identity.
  • Edges \(E\) correspond to the bonds. (#ASK Does assuming edges have no properties here mean the same as assuming they have no attributes?)

Another example is extracting information from documents, where we might want to extract line items from scanned receipts. Here, OCR software can generate segments of text, which can then be turned into a graph representation, where

  • Each node corresponds to a text segment.
  • Edges have properties such as the distance between segments, whether segments are in the same row/column, etc.

The task here can be to classify nodes (i.e. text segments) as to whether they correspond to a line item or not.

We view graphs as an explicit representation of a set of entities and their relations. We want a learning algorithm which models objects and their interactions, and grounds modelling in data. There is no default deep learning component which operates on an arbitrary relational structure. The family of architectures that can handle such inputs is called Graph Neural Networks (GNNs).

A GNN should satisfy the following:

  • Permutation invariance, i.e. the output of the network should be invariant to node permutations.
  • Ability to process graphs with varying number of nodes.
  • Ability to distinguish between different graph topologies.

For a thought experiment, consider an MLP which takes in 2 sets of nodes, which are both different permutations of the same set \(V\). The outputs will be different, thus the MLP is not permutation invariant. CNNs, RNNs and transformer encoders can process sequences of varying lengths, in fact:

  • RNNs can be viewed as a neural network which can process graphs with the chain topology.
  • CNNs can be viewed as a neural network which can process graphs with the grid topology.
  • Transformer encoder can be viewed as a NN which can process fully connected graphs.

We now want an architecture that can generalize to any graph topology.

The Neural Fingerprint Network (Duvenaud et al 2015) converts a graph that represents a molecule to a vector \(\mathbf{f}\) called the fingerprint which is then processed further to make some prediction. The starting point is an existing algorithm called circular fingerprints, which encodes the molecule in a way that is invariant to atom-relabelling. In the Neural Fingerprint Network, they "neuralize" this algorithm by introducing extra learnable parameters which can be trained through backpropagation.

GNNs are also used to learn dynamics of physical systems, e.g. Interaction Networks by Battaglia et al 2016. The task is to predict the next state of a physical system, where we assume each pair of objects are in a directed relationship which is represented using graphs. For e.g. particles in a box are a fully connected graph, with an extra node for the walls. A rope could be represented by discretizing it as a sequence of nodes. The interactions are modelled by first assuming that each edge corresponds to one node (sender) influencing another (receiver), with the effect of this interaction being modelled by some function \(f_R\) that takes as input both nodes of the edge and the edge attribute \(r\). The future state of the receiver node is modelled by another function \(f_O\) that takes as input its current state and the prediction(s) from \(f_R\) from the edge(s) related to the receiver node. This model generalizes in the sense that for e.g. for the \(n\) body simulation problem, a model trained on \(n=6\) works well when testing for \(n=3\).

Another successful application of GNNs is in visual scene understanding (Santoro et al 2017). One dataset for this task is the CLEVR dataset of relational reasoning, where an image containing 4 objects is shown alongside non-relational and relational questions, where the relational question needs explicit reasoning about the relations between the 4 objects in the image, whereas the non-relational question requires reasoning about the attributes of 1 object. In Santoro et al, they decompose each image into patches, and each patch is a node in a fully connected graph. The embedding of the question (produced by an LSTM, for example) is used as a global context for modelling relations (#ASK so each sample has a corresponding question-answer?).

Another e.g. of a GNN architecture is the Graph Convolutional Network (GCN) by Kipf and Welling 2017. The motivation for this model is semi-supervised classification of nodes in a graph. One application is:

  • Nodes are documents
  • Edges are citation links
  • Node attributes \(x_i\) are bag-of-words features of documents
  • Some documents have class labels, e.g. topic of the document

We might want to predict the class labels of documents without labels. When making such a prediction for a node, the attributes and relations from nearby nodes contain information we want to make use of. This problem is somewhat similar to an image segmentation problem, which is often done with convolutional networks for e.g. the U-net. Kipf and Welling generalize the concept of convolution to graphs with arbitrary structure. They make use of the spectral view of convolutions. The graph convolutional layer introduced by them takes as input a graph with \(N\) nodes with \(C\) dimensional attributes, represented in a matrix \(X \in \mathbb{R}^{N \times C}\). The output is a graph with the same structure and a new set of features \(Z \in \mathbb{R}^{N \times F}\). The output is \(Z = \hat{A}XW\), where \(\hat{A}\) is a matrix similar to the adjacency matrix, and \(W \in \mathbb{R}^{C \times F}\) is a matrix of filter parameters. For each node \(i\), signals are combined from all of its neighbours \(j \in \mathcal{N}(i)\), i.e. \(Z_{i,:}=\sum_{j \in \mathcal{N}(i)}\hat{a}_{ij}W^TX_{j,:}\).

Another e.g. model is the Recurrent Relation Network by Palm et al 2018. This network aims to solve tasks that require a chain of interdependent steps of relational inference for e.g. when solving Sudoku. The authors look at the Sudoku problem, where they produce a graph such that each node is connected to all other nodes in the same row, column and 3x3 block it belongs to. Each node has some state, and for \(T\) iterations \(t=1,\cdots,T\), compute "messages" for both directions (since the graph is undirected), i.e. for all edges containing a pair of nodes \(i,j\) compute \(m_{ij}^t=f(h_i^{t-1},h_j^{t-1}), m_{ji}^t=f(h_j^{t-1},h_i^{t-1})\) where \(f\) is for e.g. an MLP. Now for each node, sum all incoming messages \(m_j^t = \sum_{i\in\mathcal{N}(j)}m_{ij}^t\) and update the state of the node \(h_j^t = g(h_j^{t-1},x_j,m_j^t)\) where \(x_j\) is the digit in the cell or a special token if no digit. \(g\) can be for e.g. a GRU cell. For each node, the output is then computed as \(o_j^t = f_o(h_j^t)\) which tries to classify the digit which belongs to the corresponding node, then the loss is computed. \(f_o\) can be for e.g. a linear layer with a 10 class softmax layer at the end.

In general, today we looked at GNN architectures where the input is an undirected graph \(G\), with node features \(x_i\) and edge features \(e_{ij}\), and each node has a hidden state \(\mathbf{h}_i\) which is initialized to \(\mathbf{h}_i^{t=0}\). There are \(T\) iterations which consist of several computations which update the states of the nodes \(\mathbf{h}_i^0 \to \mathbf{h}_i^1 \cdots \to \mathbf{h}_i^T\), followed by a "readout" function (#ASK what is a readout function) that combines all node states to compute a single output \(\mathbf{y}=o(\{\mathbf{h}_i^T|i \in G\})\) if we are interested in making predictions about the whole graph, e.g. in the molecular application.

Roughly speaking, each of the \(T\) iterations involves the following computations (a minimal sketch follows the list below):

  • Each node receives messages from all its neighbours which is passed into some learnable function.
  • Each node aggregates the messages, for e.g. by summation.
  • The state of each node is updated using the aggregated messages using some learnable function.
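
A minimal sketch of one such iteration on an undirected graph, with sum aggregation; f_msg and f_upd stand in for the learnable functions:

import numpy as np

def message_passing_step(h, edges, f_msg, f_upd):
    """h: dict node -> state vector, edges: list of (i, j) pairs (undirected)."""
    messages = {j: np.zeros_like(h[j]) for j in h}
    for i, j in edges:
        messages[j] += f_msg(h[i], h[j])   # message from i to j
        messages[i] += f_msg(h[j], h[i])   # and in the opposite direction
    return {j: f_upd(h[j], messages[j]) for j in h}   # update every node state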

GCNs use simple functions during the message passing phase and the state update phase, which limits their representational power. Gilmer et al 2017 propose to unify several GNN architectures under the banner of Message Passing Neural Networks (MPNNs). Battaglia et al 2018 defined an even more general framework that includes the update of the edge attributes in the first phase of the forward pass. The term message passing comes from the belief propagation algorithm used to perform inference in probabilistic graphical models.

Also look into (, a).

Deep Autoencoders

Today we start moving into the realm of unsupervised learning, where we want computers to learn from unlabelled data \(\{\mathbf{x}_i\}_{i \in I}\). This can be useful for:

  • Representation learning (learning features useful for supervised learning problems)
  • Detect samples that look different from training population (anomaly detection)
  • Visualize data and possibly discover patterns.
  • Generate new samples that look similar to training data (generative models)

For representation learning, we can apply a neural network \(f\) such that \(f(\mathbf{x})=\mathbf{z}\), where the extracted features may work better than raw data, so the downstream supervised learning tasks will be performed on pairs \((\mathbf{z},\mathbf{y})\) rather than \((\mathbf{x},\mathbf{y})\). One drawback is that we might not know the downstream task beforehand. One idea is to treat this as a data compression task, so that the extracted features \(\mathbf{z}\) will probably be useful for most downstream tasks.

Principal Component Analysis (PCA) is one of the simplest unsupervised learning methods, where we first center the data by subtracting its empirical mean, and then find "principal components" by finding a unit vector \(\hat{\mathbf{w}}\) such that the variance of \(y_1 = \hat{\mathbf{w}}^T\mathbf{x}\) is maximized, i.e. the solution vector has the characterization \(\mathbf{w}^* = \text{argmax}_{\mathbf{w}_1} \mathbf{w}_1^T\mathbf{C}_x\mathbf{w}_1 \: : \: ||\mathbf{w}_1||=1\), which is the same as it being the first dominant eigenvector of the covariance matrix \(\mathbf{C}_x\). The second principal component is then found by maximizing the variance in the subspace orthogonal to this eigenvector, and so on.

PCA can be viewed as a minimum mean-square error compression algorithm, that is, finding an \(m\) dimensional subspace spanned by orthonormal bases \(\mathbf{W} = [\mathbf{w}_1,\cdots,\mathbf{w}_m]\) such that \(\mathbf{W}^T\mathbf{W}=\mathbf{I}\), where \(\mathbf{z} = \mathbf{W}^T\mathbf{x}\) and the reconstructed values are then \(\hat{\mathbf{x}} = \mathbf{W}\mathbf{z} = \sum_{i=1}^m (\mathbf{w}_i^T\mathbf{x})\mathbf{w}_i = \mathbf{W}\mathbf{W}^T\mathbf{x}\). Finding such a \(\mathbf{W}\) amounts to solving the optimization problem \(\mathbf{W}_{PCA} = \text{argmin}_{\mathbf{W}:\mathbf{W}^T\mathbf{W}=\mathbf{I}}\mathbb{E}[||\mathbf{x} - \hat{\mathbf{x}}||^2]\).
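
A minimal numpy sketch of PCA as compression and reconstruction (via the eigendecomposition of the empirical covariance):

import numpy as np

def pca_compress(X, m):
    """Project centered data onto the top-m eigenvectors of C_x and reconstruct."""
    mean = X.mean(axis=0)
    Xc = X - mean                                  # center the data
    C = np.cov(Xc, rowvar=False)                   # empirical covariance C_x
    eigvals, eigvecs = np.linalg.eigh(C)           # eigenvalues in ascending order
    W = eigvecs[:, ::-1][:, :m]                    # top-m principal directions as columns
    Z = Xc @ W                                     # codes z = W^T x
    X_hat = Z @ W.T + mean                         # reconstructions x_hat = W z (+ mean)
    return Z, X_hat

rng = np.random.default_rng(0)
Z, X_hat = pca_compress(rng.normal(size=(100, 5)), m=2)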

In general, we can think of an encoder and decoder, such that \(f(\mathbf{x}) = \mathbf{W}_f\mathbf{x}+\mathbf{b}_f\) is the encoder and \(g(\mathbf{z}) = \mathbf{W}_g\mathbf{z}+\mathbf{b}_g\) is the decoder. Often the dimensionality of \(\mathbf{z}\) is smaller than that of \(\mathbf{x}\), in which case \(\mathbf{z}\) is often called a bottleneck.

We can move beyond affine encoders and decoders and use non-linear mappings, and this results in deep autoencoder models. The optimization objective can once again be for e.g. the mean-squared reconstruction error, \(\theta_f,\theta_g = \text{argmin}_{\theta_f,\theta_g} \mathbb{E}[||\mathbf{x}-g(f(\mathbf{x},\theta_f),\theta_g)||^2]\). Nonlinear autoencoders can properly model curved data manifolds, for e.g. the letter S on a 2D plane.

Deep autoencoders are often used for data compression, for e.g. when applying RL algorithms to tasks such as playing the game Doom.

The latent space can be nicely visualized using t-SNE.

Denoising autoencoders corrupt the examples with some noise (e.g. Gaussian noise) before feeding them to the encoder, and use the true (uncorrupted) input when computing the reconstruction error. This can be seen as a way to regularize the autoencoder. For Gaussian corruptions, it can be shown that the optimal denoising is \(d(\tilde{\mathbf{x}}) = \tilde{\mathbf{x}} + \sigma^2 \nabla_{\tilde{\mathbf{x}}} \log p(\tilde{\mathbf{x}})\), assuming the corruption per dimension was \(\mathcal{N}(0,\sigma^2)\) [Alain, Bengio 2014].

We can convert autoencoders into generative models with latent variables. Recall that a generative model is one which learns to represent some data distribution \(p(\mathbf{x})\) and can be used to generate new examples from \(p(\mathbf{x})\), e.g. a Gaussian Mixture Model.

Thinking of the deep autoencoder, we would want to sample a latent variable \(\mathbf{z}\) if we want the model to generate new samples, and for this we would want to put a distribution over the latent space; one example is to set \(\mathbf{z}\sim \mathcal{N}(0,\mathbf{I})\). Even if the distribution over \(\mathbf{z}\) is very simple, we can hope that the non-linearity of the decoder \(g_\theta\) will be able to model the possibly very complex distribution \(p(\mathbf{x})\). Thus we have to tune the parameters \(\theta, \sigma\) (here \(\sigma\) is the noise parameter of the observation model). This is often done by maximum likelihood training, i.e. maximizing \(\log p(x_1,\cdots,x_N|\theta) = \sum_{i=1}^N \log p(\mathbf{x}_i|\theta) = \sum_{i=1}^N \log \int p(\mathbf{x}_i|\mathbf{z}_i,\theta)p(\mathbf{z}_i)d\mathbf{z}_i\). Direct optimization is difficult, so one alternative is to use the Expectation-Maximization algorithm, which is a general approach to estimate parameters \(\theta\) of a latent variable model \(p(\mathbf{x}_1,\cdots,\mathbf{x}_N,\mathbf{z}_1,\cdots,\mathbf{z}_N|\theta)= \prod_{i=1}^N p(\mathbf{x}_i|\mathbf{z}_i,\theta)p(\mathbf{z}_i)\):

  • E-step: Compute posterior probabilities \(p(\mathbf{z}_i|\mathbf{x}_i,\theta)\) given current values of \(\theta\). Specifically, let \(q(\mathbf{z}_1,\cdots,\mathbf{z}_N)=\prod_{i=1}^N q(\mathbf{z}_i)\) and \(q(\mathbf{z}_i)=p(\mathbf{z}_i|\mathbf{x}_i,\theta)\), and update this for each training sample.
  • M-step: Update the values of \(\theta\) using computed values of \(p(\mathbf{z}_i|\mathbf{x}_i,\theta)\). This is done by maximizing \(\mathcal{F}(\theta) = \langle \log p(\mathbf{x}_1,\cdots,\mathbf{x}_N,\mathbf{z}_1,\cdots,\mathbf{z}_N|\theta)\rangle_{q(\mathbf{z}_1,\cdots,\mathbf{z}_N)}\).

Both of these steps improve \(\log p(\mathbf{x}_1,\cdots,\mathbf{x}_N|\theta)\) provably.

However, direct application of the EM algorithm is complex; for example, the true conditional distributions \(q(\mathbf{z}_i)=p(\mathbf{z}_i|\mathbf{x}_i,\theta)\) may be intractable to find, e.g. they could be multi-modal even in simple cases. One solution is to approximate them by choosing their form, e.g. a Gaussian, \(q(\mathbf{z}_i) =\mathcal{N}(\mu_{\mathbf{z}_i},\sigma^2_{\mathbf{z}_i})\). The parameters describing the posterior distributions of the latent variables, \(\theta_q = \{\mu_{\mathbf{z}_i},\sigma^2_{\mathbf{z}_i}\}_{i=1}^N\), are called variational parameters. A popular method to find the approximation is by minimizing the KL divergence between \(q(\mathbf{z}_i)\) and \(p(\mathbf{z}_i|\mathbf{x}_i,\theta)\).

This KL can be minimized by adding an extra term (the entropies of the approximate distributions) to the optimization objective in the M-step: \[ \mathcal{F}(\theta,\theta_q) = \sum_{i=1}^N \int q(\mathbf{z}_i)\log p(\mathbf{x}_i,\mathbf{z}_i|\theta)d\mathbf{z}_i - \int q(\mathbf{z}_i)\log q(\mathbf{z}_i) d\mathbf{z}_i\] \[ =\sum_{i=1}^N \int q(\mathbf{z}_i)\log \frac{p(\mathbf{x}_i,\mathbf{z}_i|\theta)}{q(\mathbf{z}_i)} d\mathbf{z}_i = \sum_{i=1}^N \int q(\mathbf{z}_i)\log \frac{p(\mathbf{z}_i|\mathbf{x}_i,\theta)p(\mathbf{x}_i|\theta)}{q(\mathbf{z}_i)} d\mathbf{z}_i\] \[ = \sum_{i=1}^N -D_{KL}(q(\mathbf{z}_i)||p(\mathbf{z}_i|\mathbf{x}_i,\theta))+\log p(\mathbf{x}_i|\theta)\].

Maximizing \(\mathcal{F}(\theta,\theta_q)\) w.r.t the variational parameters \(\theta_q\) is equivalent to minimizing the KL divergence between \(q(\mathbf{z}_i)\) and \(p(\mathbf{z}_i|\mathbf{x}_i,\theta)\), since the log-likelihood term at the end does not depend on \(q\).

Thus, we can optimize both \(\theta,\theta_q\) jointly without alternating between the E and M steps. The objective function \(\mathcal{F}(\theta,\theta_q) \leq \log p(\mathbf{x}_1,\cdots,\mathbf{x}_N|\theta)\) is called the Evidence Lower BOund (ELBO). Rewriting the ELBO as \[ \mathcal{F}(\theta,\theta_q) = \sum_{i=1}^N \int q(\mathbf{z}_i)\log p(\mathbf{x}_i|\mathbf{z}_i, \theta)d\mathbf{z}_i - \int q(\mathbf{z}_i)\log \frac{q(\mathbf{z}_i)}{p(\mathbf{z}_i)} d\mathbf{z}_i\] The first term above can be interpreted as the reconstruction error, and the second term is a regularization term.

The first learning algorithms for this type of model were proposed by Lappalainen and Honkela 2001.

However, in this approach the number of variational parameters scales linearly with the number of samples. One way to bypass this is to look at how the inference procedure works: for any sample \(\mathbf{x}_i\), we find the distribution \(q(\mathbf{z}_i)\approx p(\mathbf{z}_i|\mathbf{x}_i,\theta)\), which in this case has parameters \(\mu_{\mathbf{z}_i},\sigma^2_{\mathbf{z}_i}\). If instead we can generate \(\mu_{\mathbf{z}},\sigma^2_{\mathbf{z}}\) for each sample and then plug them into \(q(\mathbf{z}_i)\), then we do not have to learn all the variational parameters; we simply have to learn the mapping which generates \(\mu_{\mathbf{z}},\sigma^2_{\mathbf{z}}\) given any input \(\mathbf{x}\). This is called amortized inference. If two samples \(\mathbf{x}_i,\mathbf{x}_j\) are close to each other, the corresponding \(q(\mathbf{z}_i), q(\mathbf{z}_j)\) should be close as well.

This results in the variational autoencoder, which has an encoder which, given an input \(\mathbf{x}\), outputs the parameters of the distribution over the latent space, in this case \(\mu_{\mathbf{z}},\sigma^2_{\mathbf{z}}\), and a decoder which takes a sample from this distribution over the latent space and generates the reconstruction deterministically from there on.

Training involves optimizing the ELBO. The forward pass consists of the following for sample \(\mathbf{x}_i\):

  • Compute \(\mu_{\mathbf{z}_i},\sigma^2_{\mathbf{z}_i}\) using the encoder.
  • Compute the regularization term in the objective function.
  • Draw \(L=1\) samples \(\mathbf{z}_i^{(l)}\) from \(q(\mathbf{z}_i) = \mathcal{N}(\mu_{\mathbf{z}_i},\sigma^2_{\mathbf{z}_i})\).
  • Propagate \(\mathbf{z}_i^{(l)}\) through the decoder and compute the first term.

For backpropagation to work, we need to compute derivatives w.r.t the parameters of the decoder and encoder. Since there is a sampling layer involved, we need an extra trick to propagate the gradients through the encoder, called the reparametrization trick. This can be thought of as introducing an extra block which is a deterministic and differentiable function of the variational parameters \(\mu_{\mathbf{z}},\sigma^2_{\mathbf{z}}\) and some distribution which does not depend on any of the training parameters. In this case, since we are working with Gaussians, we can use the deterministic function \(\mathbf{z} = \mu_{\mathbf{z}}+\sigma_{\mathbf{z}}\mathbf{\epsilon}\) with \(\epsilon \sim \mathcal{N}(0, \mathbf{I})\) which makes \(\mathbf{z}\) have the desired distribution while allowing us to backpropagate through the sampling block and then further through the encoder.
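
A minimal PyTorch sketch of the reparametrization trick (parametrizing the variance through its logarithm is a common implementation choice, not something from the lecture):

import torch

def reparameterized_sample(mu, log_var):
    """Draw z ~ N(mu, sigma^2) as a differentiable function of the variational parameters."""
    std = torch.exp(0.5 * log_var)
    eps = torch.randn_like(std)        # noise independent of any trainable parameter
    return mu + std * eps              # gradients can flow into mu and log_var

mu = torch.zeros(4, requires_grad=True)
log_var = torch.zeros(4, requires_grad=True)
reparameterized_sample(mu, log_var).sum().backward()
print(mu.grad, log_var.grad)           # gradients reach the encoder outputs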

Samples generated from VAEs are often blurry; there has been some work producing better quality images via VAEs, e.g. Vahdat and Kautz 2020.

(Look into UNets, and the use of skip connections in denoising autoencoders, and posterior collapse)

Generative Adversarial Networks

We start with the generative model of the VAE:

  • Start with latent variables \(\mathbf{z} \sim \mathcal{N}(0,\mathbb{I})\)
  • Generates data vectors as \(\mathbf{x} = g(\mathbf{z},\theta)+\epsilon\), i.e. a nonlinear transformation of the latent variable with possible additive noise. (#ASK Why additive noise?)

VAEs are an explicit density model since we have an explicit parametric form of \(p(\mathbf{x})\), namely \(p(\mathbf{x}) = \prod_i \int \mathcal{N}(\mathbf{z}_i|0,\mathbb{I})\mathcal{N}(\mathbf{x}_i|g(\mathbf{z}_i,\theta),\sigma^2\mathbb{I}) d\mathbf{z}_i\).

GANs however are an implicit density model - we do not have an explicit expression for the likelihood however we can sample from the model. They were proposed by Goodfellow et al 2014 (#TODO Cite), and the general idea is to have 2 networks, a generator and a discriminator, where the generator aims to generate samples similar to those in the distribution meanwhile the discriminator is a classifier which aims to distinguish between samples from the dataset and those from the generator.

A popular choice for the GAN generation process is to sample from an isotropic Gaussian, giving \(\mathbf{z} \sim \mathcal{N}(0,\mathbb{I})\), and pass the sample through a deep network \(\mathbf{x}=g(\mathbf{z},\theta)\). Other strategies include applying additive or multiplicative noise to hidden layers, or concatenating noise to hidden layers.

The discriminator \(d(\mathbf{x}) \in [0,1]\) can be thought of as a teacher that guides the generator. (#ASK Lecturer mentioned in practice the discriminator always wins, why is this the case?). It can be trained by the following optimization problem \(\max_d \mathbb{E}_{\mathbf{x}\sim p_\mathcal{D}}[\log d(\mathbf{x})] + \mathbb{E}_{\mathbf{x}\sim p_g}[\log(1-d(\mathbf{x}))]\) - intuitively, when the sample is from the true distribution we want \(d\) to output a higher value, and when it is from the generator we want it to output a smaller value. This objective is the same as \(\max_d \mathbb{E}_{\mathbf{x}\sim p_\mathcal{D}}[\log d(\mathbf{x})] + \mathbb{E}_{\mathbf{z}\sim p_z}[\log(1-d(g(\mathbf{z})))]\).

The generator \(g(\mathbf{z})\) on the other hand wants to maximize the value of \(d(\mathbf{x})\), formulated by the objective \(\max_g \mathbb{E}_{\mathbf{x}\sim p_g}[\log d(\mathbf{x})]\), which again is the same as \(\max_g \mathbb{E}_{\mathbf{z}\sim p_z}[\log d(g(\mathbf{z}))]\); in practice the negative of this is minimized. The generator can, in principle, be trained to minimize the probability of being fake or to maximize the probability of being real, and both formulations result in the same fixed point. In the beginning the discriminator will mostly output values close to 0 for generated samples, so the latter formulation provides better gradients (this can be seen by plotting \(\log(1-d)\) and \(-\log d\) over \(d\in [0,1]\)).

The GAN training procedure is as follows:

  • Update the discriminator:
    • Choose \(N\) samples \(\mathbf{x}_i\) from the training set.
    • Generate \(N\) samples \(g(\mathbf{z}_i)\) using the generator.
    • Compute the binary cross-entropy loss \(\mathcal{L}_d = -\frac{1}{N}\sum_{i=1}^N \log d(\mathbf{x}_i) - \frac{1}{N}\sum_{i=1}^N \log(1-d(g(\mathbf{z}_i)))\).
    • Update \(\theta_d\) using SGD: \(\theta_d \leftarrow \theta_d - \nabla_{\theta_d}\mathcal{L}_d\)
  • Update the generator:
    • Generate \(N\) samples \(g(\mathbf{z}_i)\) from the generator.
    • Compute the loss function \(\mathcal{L}_g = -\frac{1}{N}\sum_{i=1}^N \log d(g(\mathbf{z}_i))\)
    • Update \(\theta_g\) using SGD: \(\theta_g \leftarrow \theta_g - \nabla_{\theta_g}\mathcal{L}_g\), where the gradients flow through the discriminator. (A minimal sketch of this training loop follows below.)
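
A minimal PyTorch sketch of this alternating loop, assuming the discriminator outputs probabilities in \((0,1)\) (the generator, discriminator and optimizers are taken as given):

import torch

def gan_step(generator, discriminator, x_real, opt_d, opt_g, latent_dim):
    """One discriminator update followed by one generator update."""
    n = x_real.shape[0]

    # Discriminator update (generated samples are detached so only d is updated here).
    x_fake = generator(torch.randn(n, latent_dim)).detach()
    loss_d = -(torch.log(discriminator(x_real)).mean()
               + torch.log(1 - discriminator(x_fake)).mean())
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator update (gradients flow through the discriminator into the generator).
    loss_g = -torch.log(discriminator(generator(torch.randn(n, latent_dim)))).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()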

A popular view of the GAN is as a 2-player zero-sum minimax game where the generator tries to fool the discriminator and the discriminator tries to call out the generator's fakes. The game is described by the function \(v(g,d) = \max_d \mathbb{E}_{\mathbf{x}\sim p_\mathcal{D}}[\log d(\mathbf{x})] + \mathbb{E}_{\mathbf{z}\sim p_z}[\log(1-d(g(\mathbf{z})))]\) which determines the objective \(\min_g \max_d v(g,d)\). The (Nash) equilibrium is a saddle point of \(v(g,d)\).

For a fixed generator \(g\), the optimal discriminator is \(d^*_g(\mathbf{x}) = \frac{p_\mathcal{D}(\mathbf{x})}{p_\mathcal{D}(\mathbf{x}) + p_g(\mathbf{x})}\). This can also be shown to relate to minimizing the Jensen-Shannon divergence (JSD).

Training GANs is non-trivial, some problems that occur are:

  • Mode collapse, where the generator produces the same output or slight variations of it, failing to capture the diversity of the data distribution.
  • Mode hopping, where after a mode collapse, the generator simply "hops" on to another similar mode, without capturing the multi-modal nature of the whole distribution.

Deep Convolutional GANs (DCGANs) are a simple and relatively easy to train GAN variant, which uses all convolutional layers.

Wasserstein GANs make use of the Earth-mover distance (also called the Wasserstein-1 distance) rather than the JSD which arises in traditional GANs. (#DOUBT Re-read that sentence again). One motivation is the inability of the JSD to capture even a simple example where the true 1D distribution is a delta at \(x=0\) and the parametric distribution is a delta at \(\theta\) - obviously we need the generator to learn \(\theta = 0\), but plotting the JSD shows that it is constant for all \(\theta \neq 0\), meanwhile the Wasserstein-1 distance is the function \(f(\theta)=|\theta|\).

The Wasserstein-1 distance is defined as \(W(p_\mathcal{D},p_g)=\inf_{\gamma \in \Pi(p_\mathcal{D},p_g)}\mathbb{E}_{(x,y) \sim \gamma}[||x-y||]\), which is intractable, but it has another characterization as \(W(p_\mathcal{D},p_\theta) = \sup_{||f||_L \leq 1}\mathbb{E}_{x \sim p_\mathcal{D}}[f(x)] - \mathbb{E}_{x \sim p_g}[f(x)]\), where the supremum is over 1-Lipschitz functions. Using \(K\)-Lipschitz functions we can compute the distance up to the constant \(K\), and \(f\) is modelled with a neural network \(f_w\) with weights \(w\), which can be clipped so that \(w_i\in [-0.01,0.01]\). The optimization problem then becomes \(\max_w \mathbb{E}_{x \sim p_\mathcal{D}}[f_w(x)]-\mathbb{E}_{x\sim p_g}[f_w(x)]\).

The Wasserstein GAN training procedure is then:

  • Update the discriminator:
    • Choose \(N\) samples \(\mathbf{x}_i\) from the training set.
    • Generate \(N\) samples \(g_\theta(\mathbf{z}_i)\) using the generator.
    • Compute the critic loss \(\mathcal{L}_d = \frac{1}{N}\sum_{i=1}^N f_w(g_\theta(\mathbf{z}_i)) - \frac{1}{N}\sum_{i=1}^N f_w(\mathbf{x}_i)\).
    • Update \(w\) using SGD: \(w \leftarrow w - \nabla_{w}\mathcal{L}_d\)
    • Clip the weights \(w_i \leftarrow \text{clip}(w_i,-c,c)\).
  • Update the generator:
    • Generate \(N\) samples \(g_\theta(\mathbf{z}_i)\) from the generator.
    • Compute the loss function \(\mathcal{L}_g = -\frac{1}{N}\sum_{i=1}^N f_w(g_\theta(\mathbf{z}_i))\)
    • Update \(\theta_g\) using SGD: \(\theta_g \leftarrow \theta_g - \nabla_{\theta_g}\mathcal{L}_g\), gradients flow through discriminator.

The Wasserstein GAN can be improved by penalizing the magnitudes of the gradients instead of clipping, changing the loss function to be \(\mathcal{L}_{d,gp} = \mathcal{L}_d + \lambda \mathbb{E}_{\hat{x}\sim p_{\hat{x}}}(||\nabla_{\hat{x}}f_w(\hat{x})||_2 - 1)^2\). The gradient penalty term is computed at a point between the true and generated samples.
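
A minimal PyTorch sketch of the gradient penalty term, assuming inputs of shape (batch, features); \(\hat{x}\) is sampled on the line between a real and a generated sample:

import torch

def gradient_penalty(critic, x_real, x_fake, lam=10.0):
    """Penalize deviations of the critic's gradient norm from 1 at interpolated points."""
    eps = torch.rand(x_real.shape[0], 1)                       # one mixing coefficient per sample
    x_hat = (eps * x_real + (1 - eps) * x_fake).requires_grad_(True)
    scores = critic(x_hat)
    grads, = torch.autograd.grad(scores.sum(), x_hat, create_graph=True)
    return lam * ((grads.norm(2, dim=1) - 1) ** 2).mean()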

Spectral normalization is a trick used to stabilize the training of GANs. Recall that given the original GAN objective function and a fixed generator, the optimal discriminator is \(d_g^*(\mathbf{x}) = \sigma(f^*(\mathbf{x}))\) where \(\sigma\) is the sigmoid function and \(f^*(\mathbf{x}) =\log p_\mathcal{D}(\mathbf{x}) - \log p_g(\mathbf{x})\). We see that \(\nabla_\mathbf{x} f^*(\mathbf{x}) = \frac{1}{p_\mathcal{D}(\mathbf{x})}\nabla_\mathbf{x}p_\mathcal{D}(\mathbf{x}) - \frac{1}{p_g(\mathbf{x})}\nabla_\mathbf{x}p_g(\mathbf{x})\), which creates instabilities where the densities are small. Spectral normalization addresses this by regularizing the discriminator to be \(d_g \leftarrow \text{argmax}_{||f||_{Lip}\leq K} v(g,d)\) where \(||f||_{Lip}\) is the smallest value \(M\) such that \(||f(\mathbf{x})-f(\mathbf{x}')||/||\mathbf{x}-\mathbf{x}'|| \leq M\) for any \(\mathbf{x},\mathbf{x}'\). Concretely, for a linear discriminator \(f(\mathbf{x}) =\mathbf{W}\mathbf{x}\) we want the spectral norm of the weight matrix to satisfy \(\sigma(\mathbf{W})=1\), which can be done by using \(f(\mathbf{x}) = \frac{\mathbf{W}}{\sigma(\mathbf{W})}\mathbf{x}\). The spectral norm can be efficiently computed in a differentiable manner using a power iteration method, Miyato et al 2018 (#TODO Cite). For other layers such as convolutional layers, the weight tensor \(\mathbf{W}\in \mathbb{R}^{d_{out}\times d_{in} \times h \times w}\) can be treated as a matrix of size \(d_{out}\times d_{in}hw\).
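
A minimal numpy sketch of estimating the spectral norm with power iteration and rescaling the weights by it (PyTorch ships this as torch.nn.utils.spectral_norm):

import numpy as np

def spectral_normalize(W, num_iters=1):
    """Approximate the largest singular value of W by power iteration and divide W by it."""
    u = np.random.default_rng(0).normal(size=W.shape[0])
    for _ in range(num_iters):
        v = W.T @ u
        v = v / (np.linalg.norm(v) + 1e-12)
        u = W @ v
        u = u / (np.linalg.norm(u) + 1e-12)
    sigma = u @ W @ v                  # approximate spectral norm sigma(W)
    return W / sigma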

Mescheder et al 2018 (#TODO Cite) studied the convergence of different GAN models on a Dirac distribution with a 1D linear discriminator. They propose new penalty terms \(R_1,R_2\) which leads to model convergence.

Progressive growing GANs (ProGANs) by Karras et al 2018 (#TODO Cite) generate samples (images) by starting from lower resolution images and then increasing the resolution with subsequent layers. When doubling the resolution, the new layers are faded in smoothly, and when training the discriminator the real images are downscaled to match the current resolution of the network.

Self-Attention GANs (SAGANs) by Zhang et al 2018, Brock et al 2018 (#TODO Cite) use a self-attention module in the generator. Other notable models include StyleGAN, StyleGAN2, StyleGAN3. Conditional GANs on the other hand can be used for interesting applications such as Image-to-Image translation.

(, a) talks about gradients in GANs in good detail.

Autoregressive and Flow based Generative Models

We continue with learning from unlabelled data \(\{\mathbf{x}_i\}_{i \in [n]}\). This lecture is concerned with 2 types of explicit generative models for \(p_\theta(\mathbf{x})\):

  • Autoregressive models, where there is a notion of order so that the density can be written as \(p(\mathbf{x})=\prod_{i=1}^n p(x_i|x_1,\cdots,x_{i-1})\).
  • Flow-based models, where \(\log p_\theta(\mathbf{x}) = \log p_\theta(\mathbf{z}) + \log |\text{det}(\frac{d\mathbf{z}}{d\mathbf{x}})|\).

The sequence-to-sequence models from previous lectures are examples of autoregressive models, where the context was provided by the encoder \(p(\mathbf{y}_i|\mathbf{y}_{i-1},\cdots,\mathbf{y}_1,\mathbf{z}_1,\cdots,\mathbf{z}_n)\), and we considered 3 types of decoders: RNNs, CNNs and transformers.

For unsupervised learning, we do not need to use the context, and the joint distribution is modeled as \(p(\mathbf{y}_1,\cdots,\mathbf{y}_m) = p(\mathbf{y}_1)\prod_{i=2}^m p(\mathbf{y}_i|\mathbf{y}_1,\cdots,\mathbf{y}_{i-1})\).

Starting with 1D data, we can build an autoregressive model \(p(\mathbf{x}_1,\cdots,\mathbf{x}_m) = p(\mathbf{x}_1)\prod_{i=2}^m p(\mathbf{x}_i|\mathbf{x}_1,\cdots,\mathbf{x}_{i-1})\) by modelling each conditional probability in the product with a CNN, using causal (i.e. shifted) convolutional layers to guarantee the autoregressive structure. The WaveNet model (van den Oord et al 2016 #TODO Cite) is an example of an autoregressive model of speech, which uses a stack of dilated causal convolutional layers to increase the receptive field.
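
A minimal PyTorch sketch of a causal (shifted) 1D convolution: padding is added only on the left so that output \(t\) only sees inputs up to \(t\):

import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1D convolution whose receptive field never extends into the future."""
    def __init__(self, channels, kernel_size, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                   # x: (batch, channels, time)
        x = F.pad(x, (self.pad, 0))         # pad the past, not the future
        return self.conv(x)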

Moving on to data with 2D structure, we could treat an \(n\times n\) image as a 1D sequence of \(n^2\) elements and use an autoregressive model of the same type as for 1D data. We will now look at the PixelCNN model. Since the factorization of the joint density has many conditional probabilities, we would like to parallelize their computations (similar to how 1D autoregressive models can compute them with one pass #DOUBT This sentence in parenthesis is not clear to me). PixelCNN (#TODO Cite) makes use of a stack of masked 2D convolutional layers. The kernels of such convolutional layers make sure that future pixel values are not in the receptive field of any fixed pixel, and stacking such masked convolutional layers increases this receptive field whilst maintaining the autoregressive structure. Each conditional distribution in PixelCNN is modelled as a multinomial distribution over 256 possible values, assuming an 8-bit representation of a pixel value. Each pixel is classified into one of the 256 classes, the output layer has a softmax nonlinearity, and the cross entropy loss is used. To generate samples from this model, we simply sample from all the conditional distributions sequentially.
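
A minimal numpy sketch of the convolution-kernel mask used in PixelCNN (the "A"/"B" naming follows the original paper: type "A" also hides the current pixel, type "B" allows it):

import numpy as np

def pixelcnn_mask(kernel_size, mask_type="A"):
    """Zero out kernel entries for the current pixel ('A' only) and all future pixels."""
    k = kernel_size
    mask = np.ones((k, k))
    mask[k // 2, k // 2 + (1 if mask_type == "B" else 0):] = 0.0   # right of (and possibly incl.) the center
    mask[k // 2 + 1:, :] = 0.0                                     # all rows below the center
    return mask

print(pixelcnn_mask(5, "A"))   # multiply the convolution weights by this mask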

VQ-VAEs (van den Oord et al 2018, Razavi et al 2019 #TODO Cite) combine autoregressive models with autoencoders so that we have an encoder to code samples. The simplest idea is to use an autoregressive decoder (e.g. PixelCNN) in a VAE. However this does not work due to posterior collapse, where the decoder ignores the latent code. VQ-VAEs solve this problem by discretizing the latent space and using an autoregressive model for the discrete codes in the latent space. Training involves training an autoencoder with discrete latent codes, then training a PixelCNN on the discrete latent codes. Note that there is a finite set of possible latent codes \(\mathbf{z}_k\) that can be used to encode the input; call the collection of these codes the codebook.

  • The output of the encoder \(f\) is quantized to prototype vectors \(\mathbf{z}_k\) via the quantization function \(Q\) where \(Q(f(\mathbf{x})) = \mathbf{z}_k \: : \: k =\text{argmin}_j ||f(\mathbf{x})-\mathbf{z}_j||\).
  • The decoder \(g\) tries to reconstruct the original input \(\mathbf{x}\) from the quantized representation \(\mathbf{z}_k\) by minimizing the loss \(\mathcal{L}(g) = ||\mathbf{x}-g(\mathbf{z}_k)||_2^2\).
  • The loss optimized is \(\mathcal{L}(f) = ||\mathbf{x}-g(\mathbf{z}_k)||_2^2 +\beta ||f(\mathbf{x})-sg[\mathbf{z}_k]||^2\) where \(sg\) is the stop-gradient operation.
  • The quantization operation is not differentiable, so the straight-through gradient estimator is used, which means that the gradients at the decoder input are simply copied to the encoder output, i.e. \(\frac{d\mathcal{L}}{d\mathbf{z}}=\frac{d\mathcal{L}}{d\mathbf{z}_k}\) (see the sketch after this list).
  • Codebook vectors \(\mathbf{z}_k\) are themselves updated to minimize the loss \(\sum_{i=1}^{N_k} ||sg[f(\mathbf{x}_i)]-\mathbf{z}_k||^2\) where \(k\) is again \(\text{argmin}_j ||f(\mathbf{x}_i)-\mathbf{z}_j||\). In practice the codes are updated using an exponential moving average.
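
Below is a minimal sketch of the quantization step with the straight-through gradient, assuming a flat (batch, d) encoder output and a hypothetical decoder; it only illustrates the loss terms above and is not the reference VQ-VAE code.

import torch

def quantize(z_e, codebook):
    """z_e: encoder outputs (batch, d); codebook: (K, d) prototype vectors.
    Returns quantized codes with a straight-through gradient."""
    # Nearest codebook entry for each encoder output (the argmin in Q).
    distances = torch.cdist(z_e, codebook)          # (batch, K)
    indices = distances.argmin(dim=1)
    z_q = codebook[indices]                         # (batch, d)
    # Straight-through estimator: the forward pass uses z_q, but gradients
    # flow to z_e as if quantization were the identity function.
    z_q_st = z_e + (z_q - z_e).detach()
    return z_q_st, z_q, indices

# Loss terms (beta is the commitment weight from the VQ-VAE objective):
# recon  = ((x - decoder(z_q_st)) ** 2).mean()      # reconstruction term
# commit = ((z_e - z_q.detach()) ** 2).mean()       # the beta * ||f(x) - sg[z_k]||^2 term
# total  = recon + beta * commit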

(#ASK Why does discretizing help - it feels like a good constraint, and I believe Murphy's book actually had a proof relating it to the KL term in the objective if I recall correctly).

VQ-VAE 2 has 2 levels of hierarchy, with the intuition to model local and global information separately.

Transformer-based autoregressive models like the Generative Pre-trained Transformer 2 and 3 (GPT-2, GPT-3) are state of the art in natural language generation. These models operate on the byte level, with only masked self-attention, and are trained on large text corpora. When applying such byte-level models to image data, the attention blocks become computationally infeasible. Sparse Transformers by Child et al 2019 (#TODO Cite) make use of sparse factorizations of the attention matrix to overcome this.

The DALL-E model (Ramesh et al 2021 #TODO Cite) is similar to VQ-VAE, with the difference being in how the codes are quantized. The two stages of training are:

  • Train a discrete VAE to compress each \(256 \times 256\) image into a \(32 \times 32\) grid of image tokens.
  • Concatenate up to 256 BPE-encoded text tokens with the \(32 \times 32\) image tokens and train an autoregressive transformer to model the joint distribution over the text and image tokens.

The model was trained on 250 million text-image pairs from the internet.

Flow-based generative models have the following generative process: \[ \mathbf{z} \sim p_\theta(\mathbf{z}) \] \[ \mathbf{x} = \mathbf{g}_\theta(\mathbf{z})\] The initial distribution is often chosen to be a simple tractable density, and \(\mathbf{g}_\theta\) is invertible. This differs from VAEs, where the decoder was not invertible and had extra additive noise, making the recovery of \(\mathbf{z}\) from \(\mathbf{x}\) non-trivial and requiring variational inference.

An invertible mapping can be produced by composing a sequence of simple invertible functions, and this is called a normalizing flow.

Training is done by maximizing the log-likelihood \(\mathcal{F}(\theta) = \frac{1}{N}\sum_{i=1}^N \log p_\theta(\mathbf{x}_i)\). Since the mapping is invertible we can use the change of variables formula \(p_\theta(\mathbf{x}) = p_\theta(\mathbf{z})|\text{det}\frac{\partial \mathbf{z}}{\partial \mathbf{x}}|\), which for normalizing flows gives the log-likelihood as \(\log p_\theta(\mathbf{z}) + \sum_{k=1}^K \log |\text{det}\frac{\partial \mathbf{h}_k}{\partial \mathbf{h}_{k-1}}|\). Thus we can see why we need the individual invertible mappings to be simple: their Jacobian determinants must be cheap to compute.

Another way to construct an invertible mapping is using so-called coupling layers. Given \((x_1,x_2)\), consider the function \(c\) that maps it to \((y_1,y_2)\): \[y_1=x_1\] \[y_2 = g(x_2,m(x_1))\] where \(g\) is invertible with respect to its first argument given the second one, e.g. \(g(x,y)=x+y\) or \(g(x,y)=xy\) with \(y\neq 0\). Then \(c\) is invertible with \[x_1=y_1\] \[x_2 = g^{-1}(y_2,m(y_1))\] and we can choose \(m\) to be any function we want. This can be generalized to higher-dimensional variables. The Jacobian of this transformation is block triangular, which gives an easy to compute determinant. This however involves splitting the input in half, and we need to alternate the partitioning across layers so that all variables get mixed together. There is also a squeeze operation which helps with mixing of the variables.
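
Below is a minimal sketch of an additive coupling layer, i.e. \(g(x_2,m(x_1)) = x_2 + m(x_1)\), in PyTorch; the network \(m\) and its hidden size are arbitrary choices for illustration, and the additive form has a Jacobian with unit determinant.

import torch
import torch.nn as nn

class AdditiveCoupling(nn.Module):
    """Additive coupling layer: y1 = x1, y2 = x2 + m(x1)."""
    def __init__(self, dim, hidden=128):
        super().__init__()
        self.sizes = [dim // 2, dim - dim // 2]
        self.m = nn.Sequential(nn.Linear(self.sizes[0], hidden), nn.ReLU(),
                               nn.Linear(hidden, self.sizes[1]))

    def forward(self, x):                       # x -> y
        x1, x2 = x.split(self.sizes, dim=-1)
        return torch.cat([x1, x2 + self.m(x1)], dim=-1)

    def inverse(self, y):                       # y -> x
        y1, y2 = y.split(self.sizes, dim=-1)
        return torch.cat([y1, y2 - self.m(y1)], dim=-1)

# In a full normalizing flow, successive coupling layers permute (or squeeze)
# the dimensions so that every variable is transformed at some layer.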

The Glow model (Kingma and Dhariwal, 2018) introduces new layers called Actnorm and invertible \(1\times 1\) convolutions, used alongside the affine coupling layers mentioned above, to achieve better empirical performance.

Deep Learning with Few Labelled Samples

Deep learning is data hungry, requiring thousands of samples for handwritten digit recognition, and millions of samples for natural image classification.

When we have a custom classification task, i.e. classifying images to custom classes not covered by commonly used datasets, what can we do?

  • Collect new samples and label them
    • Expensive and time-consuming.
  • Transfer learning.
    • Take a network trained on some other task in the same domain, and fine-tune layers as necessary. E.g. for image classification it is common to take a network pretrained on ImageNet and fine-tune the last layer (Donahue et al 2013 #TODO Cite).
    • Using a ResNet pretrained on ImageNet is more common nowadays.
    • Vision Transformers (Dosovitskiy et al 2020) are another example of an architecture commonly pretrained on large datasets before being fine-tuned on downstream tasks. To split the input image into patches, we can use a convolutional layer whose kernel size equals its stride.
  • Semi-supervised learning.
    • Here we have few labeled samples and many unlabeled samples. This works when knowledge about \(p(x)\), obtained from the unlabelled samples, carries information that is useful for learning \(p(y|x)\).
    • Ladder networks (Rasmus et al 2015 #TODO Cite) use a denoising autoencoder trained on 2 tasks (denoising and classification), which was very effective for semi-supervised learning. The denoising task is to produce a clean input at the bottom (both labeled and unlabeled samples are used to compute this loss) and the classification task is to produce the correct label at the bottleneck (only labeled samples are used to compute this loss).
    • The \(\Pi\)-model (Laine and Aila 2016) proposes a simplified Ladder network architecture which does not contain the top-down pass. 2 copies of the same network perform computations for \(x_1,x_2\), which are the same training sample under different transformations. The cost for unlabelled data is \(||z_1-z_2||^2=||f(x_1)-f(x_2)||^2\), where the gradient propagates through both networks and \(z\) is the input of the softmax. This model resembles siamese networks: we do not know the correct output for the unlabeled data, but we know the output should not change for a transformed input.
    • Mean teacher (Tarvainen and Valpola, 2017) - instead of using 2 identical networks, one network, called the teacher, is computed as the exponential moving average (EMA) of the other network, called the student: \(\theta'_t = \gamma \theta'_{t-1} + (1-\gamma)\theta_t\). Gradients only propagate through the student network, with the teacher network stabilizing the training. The EMA removes the noise caused by SGD, making the teacher more accurate than the student. The consistency cost for unlabeled samples is the same as in the \(\Pi\)-model.
  • Self-supervised learning.
    • Semi-supervised learning assumes that the unlabeled samples belong to the same classes as in our original task, where we have the labeled samples. However, in many applications we have unlabeled samples which belong to classes not in the set of classes of the labeled dataset. Self-supervised learning aims to circumvent this problem by inventing an auxiliary task that can be learned in an unsupervised manner, and then using the learnt features in the downstream task.
    • The denoising part of ladder networks, and the BERT model which masks tokens are examples of self-supervised learning.
    • Discriminative Unsupervised Feature Learning (Dosovitskiy et al 2014) trains a convolutional neural network by sampling smaller patches from the images and transforming each patch multiple times using a composition of elementary transformations, with the training task being classification of the transformed patches into the class corresponding to the original image the patch comes from. The features are then used as an input to a Support Vector Machine.
    • Contrastive Predictive Coding (CPC) (van den Oord et al 2018) is another approach - the goal is to learn representations that encode underlying shared information between different parts of the signal while discarding low-level information and noise that is local. When predicting further into the future the amount of shared information becomes lower and the model needs to infer more global structure - slow features spanning over large time spans can be more interesting. The CPC model uses a non-linear encoder \(g_e\) that maps an input sequence of observations \(x_t\) to a sequence of latent representations \(z_t=g_e(x_t)\) with a potentially lower temporal resolution. An autoregressive model \(g_a\) then summarizes all \(z_{\leq t}\) and produces a context latent representation \(c_t = g_a(z_{\leq t})\). When training, however, the predictions in the \(z\) space can all be 0, which is easy for the network to do - this phenomenon is called a collapsed representation. To avoid this, the authors suggest a contrastive loss. Given context \(c_t\), select the correct future code \(z_{t+k} = g_e(x_{t+k})\) after \(k\) steps among \(N\) alternatives \(z_{t+k},z_{\tau_1},\cdots,z_{\tau_{N-1}}\). The alternatives \(z_\tau = g_e(x_\tau)\) are encoder outputs produced for inputs \(x_\tau\) randomly selected from the dataset. The loss is the categorical cross-entropy of selecting the correct encoding, \(\mathcal{L} = -\log \frac{\exp(z^T_{t+k}W_kc_t)}{\sum_j \exp(z^T_{\tau_j}W_kc_t)}\), where the logits produced by the classifier are postulated to have the form \(z^T_\tau W_kc_t\).
    • SimCLR (Chen et al 2020a, 2020b) can be seen as siamese networks processing 2 different augmentations of the same image, combined with the CPC contrastive loss. First, a mini-batch of \(N\) samples is chosen, and each sample is augmented with 2 different transformations. Each augmented sample is processed with a deep neural network \(z=g(f(x))\). The training task is, for each image in the minibatch, to select the image that was produced with the other transformation of the same original image amongst the \(2N-1\) alternatives: \(l_{ij} = -\log \frac{\exp(\text{sim}(z_i,z_j)/\tau)}{\sum_{k=1,k\neq i}^{2N}\exp(\text{sim}(z_i,z_k)/\tau)}\), where sim is a similarity metric, here the cosine similarity. When using the learnt representations for other downstream tasks, we take the output of an intermediate layer (e.g. 2 layers before the output). Augmentations were a combination of random cropping followed by resizing to the original size, random color distortions and random Gaussian blurs. Combining augmentations is important since otherwise the auxiliary task becomes too easy.
    • Bootstrap Your Own Latent (BYOL, Grill et al 2020) can be viewed as an extension of the Mean Teacher to the fully unsupervised scenario. The differences are that strong augmentations similar to SimCLR are used, and the output of the student is fed to an extra MLP predictor to get \(q(z)\), which is compared to the teacher output \(z'\) via the consistency cost \(||q(z)-z'||^2\). The extra MLP prevents both the student and teacher networks from collapsing to outputting 0s, for reasons that are not fully understood.
  • Few-shot learning (Meta learning)
    • Machine learning algorithms require thousands of samples to get human level accuracy (Lake et al 2015).
    • The problem of few-shot learning became popular after the Omniglot challenge was released.
    • Consider the task of one-shot classification, i.e. building a classifier from 1 training sample per class. Siamese networks (Bromley et al 1993) for one-shot learning (Koch et al 2015) involve training on pairs of samples to decide whether they belong to the same class. The twin networks accept distinct inputs, and the output is some distance between the highest-level feature representations of the twin networks. The network is trained on pairs of positive (same class) and negative (distinct classes) samples. This works for one-shot learning, and requires tweaking for few-shot learning.
    • Matching Networks (Vinyals et al 2016) aim to map a (small) support set of \(k\) image-label pairs \(S = \{(x_i,y_i)\}_{i=1}^k\) to a classifier \(c_S(\hat{x})\), where \(c_S(\hat{x})\) defines a probability distribution \(\hat{y}\) over the classes in the support set for a query sample \(\hat{x}\). Matching networks parametrize the mapping \(S \to c_S(\hat{x})\) as a neural network. The output of the classifier \(c_S(\hat{x})\) is taken to be \(\hat{y} = \sum_{i=1}^k a(\hat{x},x_i)y_i\), where \(y_i\) is a one-hot representation of the classes and \(a\) is an attention mechanism of the form \(a(\hat{x},x_i) = \frac{\exp(c(f(\hat{x}),g(x_i)))}{\sum_{j=1}^k \exp(c(f(\hat{x}),g(x_j)))}\), where \(f,g\) are neural networks (#ASK and \(c\) here is a notion of similarity?). We are in a sense learning how to learn classifiers. The intuition is that samples whose representations \(g(x_i)\) are close to the representation \(f(\hat{x})\) of the query sample contribute more to the output. The representations are tuned to work well in few-shot learning scenarios. The training procedure for Matching Networks is called episodic training, and one iteration of it consists of:
      • Select \(N\) classes by randomly sampling from the training set.
      • Select a support set by taking \(K\) random samples for each of the selected classes.
      • Select a query set (a few samples from the same classes as in the support set).
      • Do forward computations and compute the classification loss using the query samples.
      • Do backpropagation and update the parameters of the network.
    • Prototypical networks (Snell et al 2017) compute an \(M\)-dimensional representation \(\mathbf{c}_k\), or prototype, of each class through an embedding function \(f_\theta\). Each prototype is the mean vector of the embedded support points belonging to its class: \(\mathbf{c}_k = \frac{1}{|S_k|}\sum_{(x_i,y_i)\in S_k}f_\theta(x_i)\). A distribution over classes for a query point \(x\) is produced based on a distance \(d\) and the softmax function: \(p(y=k|x)=\frac{\exp(-d(f_\theta(x),\mathbf{c}_k))}{\sum_{k'} \exp(-d(f_\theta(x),\mathbf{c}_{k'}))}\) (a minimal code sketch follows after this list). For one-shot learning, prototypical networks are equivalent to matching networks. One iteration of episodic training involves:
      • Support set is used to compute prototypes.
      • The query set is used to compute the loss.
      • Compute the gradients of the loss and update the parameters \(\theta\) of the embedding network.
    • Model-Agnostic Meta-Learning (MAML, Finn et al 2017) considers the task of training a classifier \(y=f_\theta(x)\) to solve a new few-shot learning task. Learning can be done by performing a few iterations of Gradient Descent (GD).

      • In the case of 1 iteration of GD, \(\theta' \leftarrow \theta_0 -\alpha \nabla_\theta L((x_1,y_1),\cdots,(x_k,y_k))\), where the (few) \(k\) samples form the support set, \(L\) is the loss function, \(\alpha\) is the learning rate, and \(\theta_0\) holds the initial parameter values.

      The idea of meta-learning is to learn the initialization \(\theta_0\) and the learning rate \(\alpha\) to minimize the loss on the query set after the GD-adaptation. Reptile (Nichol et al 2018) is a simplification of MAML.

    • CLIP (Radford et al 2021) learns image representations with supervision by natural language (textual descriptions of images), and one of its goals is to achieve good zero-shot capabilities, i.e. to learn to solve new classification tasks without training samples. The image and text are encoded using separate encoders, each producing an embedding in a shared embedding space, and the model is jointly trained to maximize the cosine similarities of the image and text embeddings of the \(N\) real pairs in the batch while minimizing the cosine similarity of the embeddings of the \(N^2-N\) incorrect pairings. At test time, for image classification, the model solves the zero-shot classification task by taking the words of the classes and producing a text prompt "A photo of [object]", which is fed to the trained text encoder to create a text embedding; the actual image to be classified is fed to the image encoder, and the class whose text embedding has the highest cosine similarity with the image embedding is chosen. CLIP achieves 76.2% zero-shot accuracy on ImageNet without seeing ImageNet data. Prompt engineering, i.e. choosing the text prompts like "A photo of [object]", helps improve zero-shot results. Vision transformers work best as the encoder.
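
As referenced in the prototypical networks item above, here is a minimal sketch of one episodic classification step for a prototypical network; the embedding network f_theta, the tensor shapes and the function name are placeholder assumptions for illustration.

import torch

def prototypical_episode_loss(f_theta, support_x, support_y, query_x, query_y, n_classes):
    """One episodic-training loss of a prototypical network (a sketch).
    support_x: (n_support, ...), support_y: class indices in [0, n_classes),
    query_x: (n_query, ...),   query_y: class indices."""
    z_support = f_theta(support_x)                 # (n_support, M)
    z_query = f_theta(query_x)                     # (n_query, M)
    # Prototype of each class = mean of its embedded support points.
    prototypes = torch.stack([z_support[support_y == k].mean(dim=0)
                              for k in range(n_classes)])   # (n_classes, M)
    # Squared Euclidean distances from each query embedding to each prototype.
    dists = torch.cdist(z_query, prototypes) ** 2           # (n_query, n_classes)
    # Softmax over negative distances gives p(y = k | x); use cross-entropy.
    log_p = torch.log_softmax(-dists, dim=1)
    return torch.nn.functional.nll_loss(log_p, query_y)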

Unsupervised learning via denoising

From previous lectures we have seen that denoising can help when learning representations, e.g. denoising autoencoders, where the inputs \(\mathbf{x}\) were corrupted with iid Gaussian noise \(\epsilon \sim \mathcal{N}(0,\sigma^2\mathbf{I})\), giving noisy inputs \(\mathbf{\tilde{x}}\), and the loss function compares the clean inputs with the network outputs for the noisy inputs, \(s(\mathbf{\tilde{x}})\). For such Gaussian noise, the optimal denoising is given by \(s(\mathbf{\tilde{x}})=\mathbf{\tilde{x}}+\sigma^2\nabla_{\mathbf{\tilde{x}}}\log p(\mathbf{\tilde{x}})\), where \(p(\mathbf{\tilde{x}})\) is the distribution of the corrupted data. For small \(\sigma\), \(p(\mathbf{\tilde{x}})\approx p(\mathbf{x})\), hence we implicitly learn properties of \(p(\mathbf{x})\). This kind of modeling is also called denoising score matching.
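
A minimal sketch of this connection is given below, assuming a generic denoiser network: the network is trained to reconstruct the clean input, and an estimate of the score of the corrupted data can then be read off from the identity above.

import torch

def denoising_step_loss(denoiser, x, sigma):
    """Train a denoiser on Gaussian-corrupted inputs; the optimal denoiser
    satisfies s(x_tilde) = x_tilde + sigma^2 * grad log p(x_tilde)."""
    noise = sigma * torch.randn_like(x)
    x_tilde = x + noise
    # The loss compares the clean input with the output for the noisy input.
    return ((denoiser(x_tilde) - x) ** 2).mean()

# Given a trained denoiser, the score of the corrupted data is approximately
# (denoiser(x_tilde) - x_tilde) / sigma**2.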

Reconstructing clean images from noisy images is non-trivial, as we need to understand features commonly present in images. Ladder networks are another model which use the idea of adding noise at each layer, and they inspired modern models for deep semi-supervised learning.

Generative Modeling via denoising score matching (Song and Ermon 2020) introduces an approach for unsupervised learning. If we know the score function \(\nabla_{\mathbf{x}}\log p(\mathbf{x})\), we can sample from the corresponding distribution using Langevin dynamics: \[\mathbf{x}_t \leftarrow \mathbf{x}_{t-1}+\alpha \nabla_{\mathbf{x}}\log p(\mathbf{x}_{t-1})+\sqrt{2\alpha} \mathbf{z}_t\] where \(\mathbf{z}_t \sim \mathcal{N}(0,\mathbf{I})\), \(1\leq t \leq T\) and \(\alpha>0\) is a step size. The initial sample can come from any prior distribution. When \(\alpha\) is sufficiently small and \(T\) sufficiently large, the distribution of \(\mathbf{x}_T\) is close to \(p(\mathbf{x})\) under some regularity conditions. If we have a neural network \(s_\theta(\mathbf{x})\) which is trained such that \(s_\theta(\mathbf{x})\approx \nabla_\mathbf{x}\log p(\mathbf{x})\), we can use this approximation for Langevin sampling. However, the score function may be poorly estimated in regions of low data density due to lack of samples, and the samples from Langevin dynamics do not mix well; e.g. when sampling from a 2-component Gaussian mixture model, the score function does not contain the mixture weights. The authors thus propose different noise levels \(\sigma_1>\sigma_2>\cdots>\sigma_L\), and estimate the score at all noise levels by training a single conditional score network \(s_\theta(\mathbf{x},\sigma)\approx \nabla_\mathbf{x} \log q_\sigma(\mathbf{x})\), where \(q_\sigma\) is the perturbed data distribution. The loss is then \[\mathcal{L}=\frac{1}{L}\sum_{i=1}^L \lambda(\sigma_i)\mathcal{L}_i\] \[ \mathcal{L}_i = \frac{1}{2}\mathbb{E}_{p_{\text{data}}(\mathbf{x})}\mathbb{E}_{\mathbf{\tilde{x}}\sim \mathcal{N}(\mathbf{x},\sigma_i^2\mathbf{I})}\left[||s_\theta(\mathbf{\tilde{x}},\sigma_i)+\frac{\mathbf{\tilde{x}}-\mathbf{x}}{\sigma_i^2}||^2\right]\] where the weighting function \(\lambda\) is chosen empirically. To generate samples, they use annealed Langevin dynamics: for each \(\sigma_i\), \(i \in 1,\cdots,L\), \(T\) update steps are taken with step size \(\alpha_i = \epsilon \sigma_i^2 /\sigma^2_L\), the updates being \(\mathbf{\tilde{x}}_t \leftarrow \mathbf{\tilde{x}}_{t-1}+\frac{\alpha_i}{2} s_\theta(\mathbf{\tilde{x}}_{t-1},\sigma_i)+\sqrt{\alpha_i} \mathbf{z}_t\), and the final samples at \(\sigma_L\approx 0\) will approximately follow \(p_{\text{data}}(\mathbf{x})\). For images, \(s_\theta(\mathbf{x},\sigma)\) was taken to be a U-net with dilated convolutions.
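
Below is a minimal sketch of annealed Langevin sampling, assuming a trained conditional score network score_net(x, sigma); the function and argument names are illustrative, not the authors' implementation.

import torch

def annealed_langevin_sampling(score_net, sigmas, x_init, T=100, eps=2e-5):
    """Annealed Langevin dynamics: run T Langevin steps per noise level,
    with per-level step size alpha_i = eps * sigma_i^2 / sigma_L^2."""
    x = x_init.clone()
    sigma_L = sigmas[-1]                       # smallest noise level
    for sigma_i in sigmas:                     # sigmas sorted from large to small
        alpha_i = eps * (sigma_i ** 2) / (sigma_L ** 2)
        for _ in range(T):
            z = torch.randn_like(x)
            x = x + 0.5 * alpha_i * score_net(x, sigma_i) + (alpha_i ** 0.5) * z
    return x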

Diffusion probabilistic models (Sohl-Dickstein et al 2015) define a forward diffusion process which converts any complex data distribution into a simple, tractable distribution, and we can try to learn the generative model defined by a reversal of this diffusion process. The most popular forward diffusion process progressively adds Gaussian noise: \(q(\mathbf{x}_t|\mathbf{x}_{t-1})=\mathcal{N}(\mathbf{x}_t|\sqrt{1-\beta_t}\mathbf{x}_{t-1},\beta_t\mathbf{I})\), given a sample \(\mathbf{x}_0 \sim q(\mathbf{x}_0)\). We then have that \(q(\mathbf{x}_t|\mathbf{x}_0)=\mathcal{N}(\mathbf{x}_t|\sqrt{\bar{\alpha}_t}\mathbf{x}_0,(1-\bar{\alpha}_t)\mathbf{I})\), where \(\alpha_t=1-\beta_t\) and \(\bar{\alpha}_t = \prod_{j=1}^t\alpha_j\). As \(t \to \infty\) this converges in distribution to a standard multivariate Gaussian. Selecting the \(\beta_t\) such that \(1-\bar{\alpha}_T\) is close to 1 makes \(q(\mathbf{x}_T)\) a good approximation to \(\mathcal{N}(0,\mathbf{I})\). To generate samples, we want to revert this process. The reverse process is defined as \(p(\mathbf{x}_{t-1}|\mathbf{x}_t)=\mathcal{N}(\mathbf{x}_{t-1}|\mathbf{\mu}_\theta(\mathbf{x}_t,t),\mathbf{\Sigma}_\theta(\mathbf{x}_t,t))\), where \(\mathbf{\mu}_\theta,\mathbf{\Sigma}_\theta\) are both neural networks. The task is to learn their parameters to maximize the log-likelihood \(\mathbb{E}[\log p_\theta(\mathbf{x}_0)]\). The authors propose to estimate them by minimizing the variational bound on the negative log-likelihood, \(\mathbb{E}[-\log p_\theta(\mathbf{x}_0)]\leq \mathbb{E}_q[-\log \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T}|\mathbf{x}_0)}] = \mathbb{E}_q[-\log p(\mathbf{x}_T) - \sum_{t\geq 1} \log \frac{p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)}{q(\mathbf{x}_{t}|\mathbf{x}_{t-1})}]\), which can be rewritten as a sum of terms, most of which have an analytical form thanks to closed-form Gaussian densities and analytical expressions for the KL divergence between Gaussians.
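
The closed-form expression for \(q(\mathbf{x}_t|\mathbf{x}_0)\) means a training sample can be corrupted to any timestep in a single step; a minimal sketch, with hypothetical tensor shapes and a 1-based timestep convention, is given below.

import torch

def diffuse(x0, t, betas):
    """Sample x_t ~ q(x_t | x_0) in closed form for a given timestep t.
    betas: 1D tensor of forward-process variances beta_1, ..., beta_T."""
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)[t - 1]   # \bar{alpha}_t (t is 1-based)
    noise = torch.randn_like(x0)
    # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise
    x_t = (alpha_bar ** 0.5) * x0 + ((1.0 - alpha_bar) ** 0.5) * noise
    return x_t, noise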

Denoising Diffusion Probabilistic Models (DDPMs, Ho et al 2020) propose a simplified model, where the variances of the forward process \(\beta_t\) are fixed and the covariance matrix of the reverse process is diagonal, with the variance \(\sigma_t^2=\beta_t\) working well in practice. This further simplifies the optimization objective, and we have a closed-form expression for the target in the reverse process. (#SKIPPED Training and sampling procedure for DDPMs)

Dhariwal and Nichol 2021 refined the DDPM architecture and showed diffusion models outperforming GANs on image synthesis.

Denoising Diffusion Implicit Models (DDIM, Song et al 2021) propose a method to accelerate the generation process by using only a subsequence \(\{\tau_i\}\) of the steps \([1,\cdots,T]\).

Diffusion Autoencoders (Preechakul et al 2021) encode the input in a low-dimensional representation which can be used to modify existing images.

DALL-E 2 (Ramesh et al 2022)
