# Deep Reinforcement Learning (UC Berkeley CS-285 2020)

## Introduction

## Content

### Variational Inference

A probabilistic model is any model that represents a probability distribution. This could be a distribution over samples, or a conditional distribution used, for example, for regression. Policies are also conditional probabilistic models.

Latent variable models are a class of probabilistic models with variables other than the evidence (data) or the query (prediction target). These variables need to be integrated out. One example is a mixture model, where the assignment of a sample to a mixture component follows a categorical distribution; this assignment is a latent variable.

In general, latent variable models (LVMs) specify a distribution over the latent variables \(p(z)\), often chosen to be simple, and a distribution over the observed variables \(p(x|z) = p(x|g(z))\), e.g. \(p(x|z) = \mathcal{ N }(\mu_\theta(z),\sigma_\theta(z))\). Even though the chosen \(p(z)\) and \(p(x|z)\) are simple, \(p(x)=\int p(x|z)p(z)\:dz\) may be very complex. (#DOUBT Is the "complexity" here a result of the neural networks in the conditional distribution? Since two Gaussians should give a Gaussian for \(p(x)\)?)
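On the #DOUBT above: the marginal \(p(x)\) is a continuum of Gaussians whose means \(\mu_\theta(z)\) depend nonlinearly on \(z\), and such a mixture is Gaussian only in degenerate cases. A minimal sketch, where a `tanh` mean function is a hypothetical stand-in for a neural network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simple prior p(z) = N(0, 1) and conditional p(x|z) = N(mu(z), 0.5^2),
# where mu is nonlinear (a stand-in for a neural network).
z = rng.standard_normal(100_000)
mu = 3.0 * np.tanh(5.0 * z)      # pushes the conditional means toward -3 and +3
x = mu + 0.5 * rng.standard_normal(z.shape)

# The marginal p(x) is bimodal, hence non-Gaussian: a Gaussian with this
# mean and variance would concentrate near 0, but here most samples do not.
print(np.mean(np.abs(x) > 1.5))  # large fraction of samples far from the mean
```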

In Reinforcement Learning (RL), conditional LVMs can be used to model multi-modal policies. LVMs are also used in model-based RL to learn a latent state that depends on the actions.

Not all generative models are latent variable models, and not all latent variable models are generative models - but representing generative models as latent variable models has many advantages.

Given data \(x_1,\cdots,x_N\) and an LVM \(p_\theta(x)\), fitting the model is often done by Maximum Likelihood Estimation (MLE). The latent variable, however, makes the optimization objective \(\text{argmax}_\theta \sum_{i=1}^{N} \log \int p_\theta(x_i|z)p(z)\:dz\) intractable, due to the integral over the latent variables.
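To make the objective concrete, consider a toy LVM where the integral is tractable: \(z\sim\mathcal{N}(0,1)\), \(x|z\sim\mathcal{N}(z,1)\), so \(p(x)=\mathcal{N}(0,2)\) in closed form. A naive Monte Carlo estimator that samples \(z\) from the prior works in this easy case, but its variance blows up for realistic models where the posterior differs substantially from the prior (a sketch, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def log_gauss(x, mean, var):
    """Log-density of N(mean, var) at x."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

# Toy LVM: z ~ N(0,1), x|z ~ N(z,1)  =>  p(x) = N(0, 2) in closed form.
x = 1.0
log_px_true = log_gauss(x, 0.0, 2.0)

# Naive estimator: log p(x) ~= log (1/K) sum_k p(x|z_k), z_k ~ p(z).
z = rng.standard_normal(200_000)
log_px_mc = np.log(np.mean(np.exp(log_gauss(x, z, 1.0))))

print(log_px_true, log_px_mc)  # the two agree for this easy case
```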

One alternative is to use the expected log-likelihood, i.e. \(\text{argmax}_\theta \frac{1}{N}\sum_{i=1}^{N}\mathbb{E}_{z\sim p(z|x_i)}[\log p_\theta(x_i,z)]\), which can be thought of as using a guess for the latent variables sampled from the posterior \(p(z|x_i)\). This, however, requires sampling from \(p(z|x_i)\), which is itself intractable. We can instead approximate it with some simpler distribution \(q_i(z) = \mathcal{ N }(\mu_i,\sigma_i)\). Using this approximation allows us to bound \(\log p(x_i)\) - specifically, \(\log p(x_i)\geq \mathbb{ E }_{z\sim q_i(z)}[\log p(x_i|z) + \log p(z)] - \mathbb{ E }_{z\sim q_i(z)}[\log q_i(z)]\), where the last term is the entropy of \(q_i\).
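In a toy model where everything is tractable (\(z\sim\mathcal{N}(0,1)\), \(x|z\sim\mathcal{N}(z,1)\), so \(p(x)=\mathcal{N}(0,2)\) and \(p(z|x)=\mathcal{N}(x/2,1/2)\)), the bound can be checked numerically: it is tight when \(q_i\) equals the true posterior and strictly below \(\log p(x_i)\) otherwise. A sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_gauss(x, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

# Toy LVM: z ~ N(0,1), x|z ~ N(z,1). Then p(x) = N(0,2) and the true
# posterior is p(z|x) = N(x/2, 1/2), both in closed form.
x = 1.0
log_px = log_gauss(x, 0.0, 2.0)

def elbo(mu, var, n=200_000):
    """Monte Carlo estimate of the bound for q(z) = N(mu, var)."""
    z = mu + np.sqrt(var) * rng.standard_normal(n)
    return np.mean(log_gauss(x, z, 1.0)        # E_q[log p(x|z)]
                   + log_gauss(z, 0.0, 1.0)    # E_q[log p(z)]
                   - log_gauss(z, mu, var))    # + entropy of q

elbo_opt = elbo(x / 2, 0.5)   # q = true posterior: bound is tight
elbo_bad = elbo(2.0, 1.0)     # poor q: strictly below log p(x)
print(log_px, elbo_opt, elbo_bad)
```

When \(q\) equals the true posterior, the integrand \(\log\frac{p(x,z)}{q(z)}\) is constant and equal to \(\log p(x)\), so even the Monte Carlo estimate is exact.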

This bound can be visualized as follows:

- Imagine the first term trying to find parameters that place the approximate distribution close to the peaks of \(p(x_i,z)\), while the second term counters this by trying to increase the entropy, which is the same as spreading the probability mass of the approximate distribution over a larger area.

The bound, called the Evidence Lower BOund (ELBO), can also be written in terms of KL divergences: \(\log p(x_i) = \mathcal{ L }_i(p,q_i) + KL(q_i(z)||p(z|x_i))\), so the gap between \(\log p(x_i)\) and the ELBO \(\mathcal{ L }_i\) is exactly the KL divergence between the approximate and true posteriors. Consequently, a good \(q_i(z)\) is one that approximates \(p(z|x_i)\) well; maximizing the ELBO with respect to \(q_i\) minimizes this KL divergence.

During training, for each mini-batch of samples \(x_i\), we estimate \(\nabla_\theta \mathcal{ L }_i(p,q_i)\) by sampling \(z_k\sim q_i(z)\) and using the Monte Carlo estimate \(\nabla_\theta \mathcal{ L}_i(p,q_i)\approx\frac{1}{K}\sum_{k=1}^{K} \nabla_\theta \log p_\theta(x_i|z_k)\), which is used to update \(\theta\). Then \(q_i(z)\) is updated (e.g., by gradient steps on \(\mu_i,\sigma_i\)) to maximize the ELBO. However, since each datapoint has its own \(q_i(z)\), the number of variational parameters grows with the dataset, which may not be feasible.
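As a concrete (hypothetical) instance of the Monte Carlo gradient, take a one-parameter decoder \(p_\theta(x|z)=\mathcal{N}(x;\theta z,1)\), for which \(\nabla_\theta \log p_\theta(x|z)=(x-\theta z)z\) and the expectation under a Gaussian \(q_i\) has a closed form to check against:

```python
import numpy as np

rng = np.random.default_rng(0)

# Decoder p_theta(x|z) = N(x; theta * z, 1), so
# grad_theta log p_theta(x|z) = (x - theta * z) * z.
theta, x = 0.7, 1.0
mu_q, var_q = 0.4, 0.25        # q_i(z) = N(mu_q, var_q)

# Monte Carlo estimate of the gradient with K samples from q_i.
K = 100_000
z = mu_q + np.sqrt(var_q) * rng.standard_normal(K)
grad_mc = np.mean((x - theta * z) * z)

# Closed form: E_q[(x - theta*z) z] = x*mu_q - theta*(mu_q^2 + var_q).
grad_exact = x * mu_q - theta * (mu_q ** 2 + var_q)
print(grad_mc, grad_exact)
```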

To overcome this, notice that we want \(q_i(z)\) to approximate \(p(z|x_i)\) - so we can learn a single network \(q_i(z) = q(z|x_i) \approx p(z|x_i)\) shared across all datapoints. This results in one inference network \(q_\phi(z|x) = \mathcal{N}(z|\mu_\phi(x),\sigma_\phi(x))\), whose samples are fed to the generative network \(p_\theta(x|z)\). This is called amortized variational inference.
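Structurally, this is a pair of networks: an encoder producing \((\mu_\phi(x),\sigma_\phi(x))\) and a decoder mapping \(z\) back to the parameters of \(p_\theta(x|z)\). A minimal forward pass, with single linear layers as hypothetical stand-ins for the networks:

```python
import numpy as np

rng = np.random.default_rng(0)

d_x, d_z = 4, 2                  # data and latent dimensionalities
x = rng.standard_normal(d_x)

# Encoder q_phi(z|x) = N(mu_phi(x), sigma_phi(x)): one linear layer per
# output, with exp() to keep the standard deviation positive.
W_mu = rng.standard_normal((d_z, d_x))
W_logsig = rng.standard_normal((d_z, d_x))
mu_phi, sig_phi = W_mu @ x, np.exp(W_logsig @ x)

# Sample z from the approximate posterior, then decode.
z = mu_phi + sig_phi * rng.standard_normal(d_z)

# Decoder p_theta(x|z) = N(mu_theta(z), 1): another linear layer.
W_dec = rng.standard_normal((d_x, d_z))
mu_theta = W_dec @ z

print(mu_phi.shape, sig_phi.shape, mu_theta.shape)  # (2,) (2,) (4,)
```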

The resulting ELBO is now \(\mathbb{E}_{z\sim q_\phi(z|x_i)}[\log p_\theta(x_i|z)+\log p(z)] + \mathbb{ H }[q_\phi(z|x_i)]\). The training procedure now involves updating the parameters \(\phi\) as well, which requires \(\nabla_\phi \mathcal{L}\) - assuming the entropy can be computed in closed form, this amounts to computing \(\nabla_\phi \mathbb{E}_{z\sim q_\phi(z|x_i)}[\log p_\theta(x_i|z)+\log p(z)]\), which can be done using policy gradients. This estimate, however, is very noisy. Instead, we can use the reparametrization trick, which writes \(z = \mu_\phi(x_i) + \sigma_\phi(x_i)\epsilon\) with \(\epsilon\sim\mathcal{N}(0,1)\) and differentiates through the sample.
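The variance difference is easy to demonstrate on a toy objective. For \(f(z)=z^2\) and \(z\sim\mathcal{N}(\mu,1)\), both the score-function ("policy gradient") estimator \(f(z)\nabla_\mu \log q_\mu(z)\) and the reparametrized estimator \(\nabla_\mu f(\mu+\epsilon) = 2(\mu+\epsilon)\) have the correct mean \(2\mu\), but very different variances (a sketch, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Goal: estimate d/dmu E_{z ~ N(mu,1)}[z^2] = 2*mu (here mu = 1, so 2).
mu, n = 1.0, 100_000
eps = rng.standard_normal(n)
z = mu + eps

# Score-function ("policy gradient") estimator: f(z) * d/dmu log q(z),
# where d/dmu log N(z; mu, 1) = (z - mu).
g_score = z ** 2 * (z - mu)

# Reparametrization trick: z = mu + eps, differentiate through the sample.
g_reparam = 2 * (mu + eps)

print(g_score.mean(), g_score.var())      # mean ~ 2, variance ~ 30
print(g_reparam.mean(), g_reparam.var())  # mean ~ 2, variance ~ 4
```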

The ELBO can also be written as \(\mathbb{ E }_{z\sim q_\phi(z|x_i)}[\log p_\theta(x_i|z)] - KL(q_\phi(z|x_i)||p(z))\). This can be interpreted as a reconstruction term, which maximizes the fit to the data, and a regularization term, which keeps the approximate posterior from drifting too far from the prior.
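For a Gaussian \(q_\phi(z|x_i)=\mathcal{N}(\mu,\sigma^2)\) and a standard normal prior, the regularization term has the well-known closed form \(KL = \frac{1}{2}(\sigma^2 + \mu^2 - 1 - \log\sigma^2)\) per dimension, which can be checked against a Monte Carlo estimate:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_gauss(z, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (z - mean) ** 2 / var)

mu, var = 0.8, 0.3               # q(z) = N(mu, var), prior p(z) = N(0, 1)

# Closed form: KL(N(mu, var) || N(0, 1)) = 0.5 * (var + mu^2 - 1 - log var).
kl_closed = 0.5 * (var + mu ** 2 - 1.0 - np.log(var))

# Monte Carlo: KL = E_q[log q(z) - log p(z)].
z = mu + np.sqrt(var) * rng.standard_normal(200_000)
kl_mc = np.mean(log_gauss(z, mu, var) - log_gauss(z, 0.0, 1.0))

print(kl_closed, kl_mc)
```

Because the KL is available in closed form, VAE implementations typically compute this term analytically and reserve sampling for the reconstruction term.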

The ELBO can be extended to conditional models \(p(y|x)\) as well: with an approximate posterior \(q_\phi(z|x,y)\), the conditional ELBO becomes \(\mathbb{ E }_{z\sim q_\phi(z|x,y)}[\log p_\theta(y|x,z)] - KL(q_\phi(z|x,y)||p(z))\). This is useful, for example, for the multi-modal policies mentioned earlier.