[Paper Review] Auto-Encoding Variational Bayes


1. Introduction

How can we perform efficient approximate inference and learning with directed probabilistic models whose continuous latent variables and/or parameters have intractable posterior distributions?

 

The goal of the paper is to estimate the intractable $p_{Z|X}(z|x) \ (z \ \text{is continuous})$.

My personal take (not stated in the paper)

"In the AEVB algorithm we make inference and learning especially efficient by using the SGVB estimator to optimize a recognition model that allows us to perform very efficient approximate posterior inference using simple ancestral sampling"에서 필자는 "ancestral sampling"을 "$z = g_\phi(\epsilon^{(l)}, x^{(i)})$로 $z$를 sampling 한다."라로 해석하는 것이 타당하다 본다.


Ancestral sampling

A technique that samples $x$ by factorizing $p(x)$ as $\prod_{i=1}^N p(x_i | x_{<i})$ and drawing each $x_i$ in order, conditioned on its already-sampled ancestors $x_{<i}$.
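
A tiny sketch of the idea (my own illustration with a made-up Gaussian chain, not from the paper):

```python
# Ancestral sampling: draw each variable in topological order, conditioning on the
# already-sampled ancestors. The chain p(x_i | x_{<i}) below is a hypothetical example.
import numpy as np

rng = np.random.default_rng(0)

def sample_chain(n_vars=3):
    x = []
    for i in range(n_vars):
        mean = np.mean(x) if x else 0.0          # condition on ancestors x_{<i}
        x.append(rng.normal(loc=mean, scale=1.0))  # x_i ~ p(x_i | x_{<i})
    return np.array(x)

print(sample_chain())  # one joint sample (x_1, x_2, x_3) drawn ancestrally
```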

 

Even while writing this, I am still unsure whether the goal of the paper is (1) "estimating $p_X(x)$ when $p_{Z|X}(z|x) \ (z \ \text{is continuous})$ is intractable" or (2) "estimating the intractable $p_{Z|X}(z|x) \ (z \ \text{is continuous})$ itself". From the context, (2) seems the more plausible reading, so I will proceed assuming the paper's goal is (2).

 

The variational Bayesian (VB) approach involves the optimization of an approximation to the intractable posterior. Unfortunately, the common mean-field approach requires analytical solutions of expectations w.r.t. the approximate posterior, which are also intractable in the general case. We show how a reparameterization of the variational lower bound yields a simple differentiable unbiased estimator of the lower bound.

The VB approach tries to achieve this goal by approximating $p_{Z|X}(z|x)$ with a simpler distribution $p_{Z|X}(z|x; \phi)$, i.e., via variational inference (VI). ($p_{Z|X}(z|x)$ can be far more complex than we might expect, and because $z$ is continuous it is even harder to represent; this is why we turn to VI.) Unfortunately, the mean-field approach requires analytical (closed-form, differentiable) expectations w.r.t. $p_{Z|X}(z|x; \phi)$, which are intractable in the general case. The authors obtain such a differentiable expectation via reparameterization; in other words, they make the variational lower bound (the objective function) differentiable.

 


Variational Bayesian methods are a family of techniques for approximating intractable integrals arising in Bayesian inference and machine learning.

 

In probability theory, mean-field theory (MFT) studies the behavior of high-dimensional random models by studying a simpler model that approximates the original by averaging over degrees of freedom.

 

2. Method

The strategy in this section can be used to derive a lower bound estimator (a stochastic objective function) for a variety of directed graphical models with continuous latent variables. 

 

This section explains how to derive the lower bound estimator, i.e., the objective function, for models with continuous latent variables.

 

2.1. Problem scenario

Let us consider some dataset $X = \{x^{(i)}\}_{i=1}^N$ consisting of $N$ i.i.d. samples of some continuous or discrete variable $x$. We assume that the data are generated by some random process, involving an unobserved continuous random variable $z$. The process consists of two steps: (1) a value $z^{(i)}$ is generated from some prior distribution $p_{\theta^*}(z)$; (2) a value $x^{(i)}$ is generated from some conditional distribution $p_{\theta^*}(x|z)$.

 

The authors assume the data are generated as follows: (1) $z$ is sampled from $p_Z(z)$; (2) $x$ is sampled from $p_{X|Z}(x|z; \theta)$ given the sampled $z$.
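
A minimal sketch of this two-step generative process, with hypothetical Gaussian choices for $p_Z(z)$ and $p_{X|Z}(x|z; \theta)$ (the paper leaves these distributions abstract):

```python
# Two-step generative process assumed by the paper, with toy Gaussian choices.
import numpy as np

rng = np.random.default_rng(0)

def generate(n, theta=(1.5, 0.5)):
    w, noise_std = theta                         # hypothetical decoder parameters theta
    z = rng.standard_normal(n)                   # step (1): z ~ p_Z(z) = N(0, 1)
    x = rng.normal(loc=w * z, scale=noise_std)   # step (2): x ~ p_{X|Z}(x|z; theta)
    return z, x

z, x = generate(5)
print(x)  # we only observe x; z stays latent
```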

 

Very importantly, we do not make the common simplifying assumptions about the marginal or posterior probabilities. Conversely, we are here interested in a general algorithm that even works efficiently in the case of: 1. Intractability: the case where the integral of the marginal likelihood $p_\theta(x) = \int p_\theta (z) p_\theta (x|z)\, dz$ is intractable (so we cannot evaluate or differentiate the marginal likelihood), where the true posterior density $p_\theta(z|x) = p_\theta(x|z)p_\theta(z)/p_\theta(x)$ is intractable, (so the EM algorithm cannot be used), and where the required integrals for any reasonable mean-field VB algorithm are also intractable. 2. A large dataset: we have so much data that batch optimization is too costly. Sampling based solutions, e.g. Monte Carlo EM, would in general be too slow, since it involves a typically expensive sampling loop per datapoint.

 

The authors do not simplify the problem with convenient assumptions. On the contrary, they are interested in a general algorithm that works efficiently even under the following three constraints, and also on large datasets. (1) $p_{X}(x; \theta) = \int p_Z(z)p_{X|Z}(x|z; \theta)\, dz$ is intractable (if this were tractable, the problem would be easy). (2) $p_{Z|X}(z|x)$ is intractable (if this were tractable, we would not need to approximate it with $p_{Z|X}(z|x; \phi)$, i.e., VI would be unnecessary). (3) The integrals required by any reasonable mean-field VB algorithm (e.g. $\int p_{Z|X}(z|x^{(i)}; \phi)f(z)\, dz = \mathbb E_{z \sim p_{Z|X} (z|x^{(i)}; \phi)} \left[f(z)\right]$) are intractable (if these were tractable, reparameterization would be unnecessary).
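
For intuition on constraint (1), here is a sketch (reusing the toy Gaussian model above as an assumption) of estimating the marginal likelihood $p_X(x; \theta) = \mathbb E_{z \sim p_Z(z)}\left[p_{X|Z}(x|z; \theta)\right]$ by brute-force Monte Carlo:

```python
# Naive Monte Carlo estimate of p_X(x; theta) by sampling z from the prior.
import numpy as np

rng = np.random.default_rng(0)
w, noise_std = 1.5, 0.5          # hypothetical theta from the sketch above
x_obs = 2.0

def gaussian_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

for n in (10, 1_000, 100_000):
    z = rng.standard_normal(n)                            # z ~ p_Z(z)
    estimate = gaussian_pdf(x_obs, w * z, noise_std).mean()
    print(n, estimate)                                    # converges slowly to p_X(x_obs)
# With a high-dimensional z and a neural-network decoder there is no closed form and
# this brute-force estimate becomes hopeless -- hence "intractable".
```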

 

We are interested in, and propose a solution to, three related problems in the above scenario: (1) Efficient approximate ML or MAP estimation for the parameters $\theta$. (2) Efficient approximate posterior inference of the latent variable $z$ given an observed value $x$ for a choice of parameters $\theta$. (3) Efficient approximate marginal inference of the variable $x$. For the purpose of solving the above problems, let us introduce a recognition model $q_\phi(z|x)$: an approximation to the intractable true posterior $p_\theta(z|x)$.

 

(1) ML or MAP estimation of $\theta$, i.e., estimating the generative model $p_{X|Z}(x|z; \theta)$. (2) Estimating the recognition model $p_{Z|X}(z|x; \phi)$, i.e., estimating $\phi$. (3) Estimating $p_X(x)$, i.e., estimating both $\phi$ and $\theta$. To solve these three problems, an approximation $p_{Z|X}(z|x; \phi)$ to the true posterior $p_{Z|X}(z|x)$ is introduced.

 

We'll introduce a method for learning the recognition model parameters $\phi$ jointly with the generative model parameters $\theta$.

 

The authors make these three problems solvable by introducing a method that learns $\phi$ jointly with $\theta$.

 

2.2. The variational bound

The marginal likelihood is composed of a sum over the marginal likelihoods of individual datapoints, $\log p_\theta(x^{(1)}, \cdots, x^{(N)}) = \sum_{i=1}^N \log p_\theta(x^{(i)})$, which can each be rewritten as: $\log p_\theta(x^{(i)}) = D_{KL}(q_\phi(z|x^{(i)})||p_\theta(z|x^{(i)})) + \mathcal L(\theta, \phi; x^{(i)})$. Since this KL-divergence is non-negative, the second RHS term $\mathcal L(\theta, \phi; x^{(i)})$ is called the (variational) lower bound on the marginal likelihood of datapoint $i$, and can be written as: $\mathcal L(\theta, \phi; x^{(i)}) = -D_{KL}(q_\phi(z|x^{(i)})||p_\theta(z)) + \mathbb E_{q_\phi(z|x^{(i)})}\left[ \log p_\theta(x^{(i)}|z)\right]$. We want to differentiate and optimize the lower bound $\mathcal L(\theta, \phi; x^{(i)})$ w.r.t. both the variational parameters $\phi$ and the generative parameters $\theta$. However, the gradient of the lower bound w.r.t. $\phi$ is a bit problematic: $\nabla_\phi \mathbb E_{q_\phi(z)}\left[ f(z) \right] = \mathbb E_{q_\phi(z)}\left[ f(z) \nabla_{q_\phi(z)} \log q_\phi(z) \right] \simeq \frac{1}{L} \sum_{l=1}^L f(z) \nabla_{q_\phi(z^{(l)})} \log q_\phi(z^{(l)})$ where $z^{(l)} \sim q_\phi(z|x^{(i)})$. This gradient estimator (the usual Monte Carlo gradient estimator) exhibits very high variance and is impractical for our purposes.

 

Since the first RHS term is a KL divergence (hence non-negative), we can obtain $p_{Z|X}(z|x; \phi)$, $p_{X|Z}(x|z; \theta)$ and $p_X(x)$ by differentiating and optimizing only $\mathcal L(\theta, \phi; x^{(i)}) = -D_{KL}(p_{Z|X}(z|x^{(i)}; \phi)||p_Z(z)) + \mathbb E_{z \sim p_{Z|X}(z|x^{(i)}; \phi)} \left[ \log p_{X|Z}(x^{(i)}|z; \theta) \right]$ w.r.t. $\phi$ and $\theta$. However, the gradient of $\mathbb E_{z \sim p_{Z|X}(z|x^{(i)}; \phi)} \left[ f(z) \right]$ w.r.t. $\phi$ obtained by (1) differentiating under the expectation and (2) plugging the result into the usual Monte Carlo estimator exhibits very high variance, so it is not practical.
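
To see the problem concretely, here is a small NumPy sketch (my own, with a toy $f(z) = z^2$ and a univariate Gaussian $q_\phi$) of the score-function form of this estimator and its empirical spread:

```python
# Naive score-function (REINFORCE-style) gradient estimator:
# grad_phi E_q[f(z)] = E_q[ f(z) * grad_phi log q_phi(z) ].
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 1.0          # phi = (mu, sigma); q_phi(z) = N(mu, sigma^2)
f = lambda z: z ** 2          # E_q[f(z)] = mu^2 + sigma^2, so d/dmu = 2*mu = 2.0

L = 100
estimates = []
for _ in range(1_000):
    z = rng.normal(mu, sigma, size=L)
    score = (z - mu) / sigma ** 2            # grad_mu log q_phi(z)
    estimates.append(np.mean(f(z) * score))  # one L-sample Monte Carlo estimate

print("true grad :", 2 * mu)
print("mean est. :", np.mean(estimates))     # unbiased ...
print("std. dev. :", np.std(estimates))      # ... but with a large spread even at L = 100
```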

 

2.3. The SGVB estimator and AEVB algorithm

In this section we introduce a practical estimator of the lower bound and its derivatives w.r.t. the parameters. Under certain mild conditions outlined in section 2.4 for a chosen approximate posterior $q_\phi(z|x)$ we can reparameterize the random variable $z \sim q_\phi(z|x)$ using a differentiable transformation $g_\phi(\epsilon, x)$ of an (auxiliary) noise variable $\epsilon$: $z = g_\phi(\epsilon, x)$ with $\epsilon \sim p(\epsilon)$. See section 2.4 for general strategies for choosing such an appropriate distribution $p(\epsilon)$ and function $g_\phi(\epsilon, x)$. We can now form Monte Carlo estimates of expectations of some function $f(z)$ w.r.t. $q_\phi(z|x)$ as follows: $\mathbb E_{q_\phi(z|x^{(i)})} \left[ f(z) \right] = \mathbb E_{p(\epsilon)} \left[ f(g_\phi(\epsilon, x^{(i)})) \right] \simeq \frac{1}{L} \sum_{l=1}^L f(g_\phi(\epsilon^{(l)}, x^{(i)})) \quad \text{where} \quad \epsilon^{(l)} \sim p(\epsilon)$

 

(Using one of the strategies explained in section 2.4, e.g. assuming $p_{Z|X}(\cdot)$ and $p_\epsilon(\epsilon)$ are Gaussian,) we can reparameterize $z$ as $g_\phi(\epsilon, x)$ with $\epsilon \sim p(\epsilon)$ and derive the new expectation $\mathbb E_{p(\epsilon)} \left[ f(g_\phi(\epsilon, x^{(i)})) \right]$. Applying Monte Carlo estimation to this expression gives the very practical estimator $\frac{1}{L} \sum_{l=1}^L f(g_\phi(\epsilon^{(l)}, x^{(i)}))$ where $\epsilon^{(l)} \sim p(\epsilon)$. (The details are covered in section 2.4.)
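
For comparison with the score-function sketch above, here is the same toy gradient estimated through the reparameterization $z = \mu + \sigma\epsilon$ (again my own toy setup, not the paper's code):

```python
# Pathwise (reparameterized) gradient: differentiate f(g_phi(eps, x)) directly.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 1.0
f = lambda z: z ** 2          # d f(mu + sigma*eps) / d mu = 2 * (mu + sigma * eps)

L = 100
estimates = []
for _ in range(1_000):
    eps = rng.standard_normal(L)
    z = mu + sigma * eps                       # z = g_phi(eps, x)
    estimates.append(np.mean(2 * z))           # pathwise gradient estimate w.r.t. mu

print("true grad :", 2 * mu)
print("mean est. :", np.mean(estimates))
print("std. dev. :", np.std(estimates))        # noticeably smaller spread than the
                                               # score-function estimator above
```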

 

Often, the KL-divergence $D_{KL}(q_\phi(z|x^{(i)})||p_\theta(z))$ can be integrated analytically, such that only the expected reconstruction error $\mathbb E_{q_\phi(z|x^{(i)})}\left[\log p_\theta(x^{(i)}|z)\right]$ requires estimation by sampling. This yields a second version of the SGVB estimator $\tilde{\mathcal L}^B(\theta, \phi; x^{(i)}) \simeq \mathcal L(\theta, \phi; x^{(i)})$: $-D_{KL}(q_\phi(z|x^{(i)})||p_\theta(z)) + \frac{1}{L} \sum_{l=1}^L \log p_\theta(x^{(i)}|z^{(i, l)})$ where $z^{(i, l)} = g_\phi(\epsilon^{(i, l)}, x^{(i)})$ and $\epsilon^{(l)} \sim p(\epsilon)$

 

Often the KL divergence $D_{KL}(p_{Z|X}(z|x^{(i)}; \phi)||p_Z(z))$ can be integrated analytically, so in $\tilde{\mathcal L}^B(\theta, \phi; x^{(i)})$ only the reconstruction error needs sampling-based (Monte Carlo) estimation: $\tilde{\mathcal L}^B(\theta, \phi; x^{(i)}) = -D_{KL}(p_{Z|X}(z|x^{(i)}; \phi)||p_Z(z)) + \frac{1}{L} \sum_{l = 1}^L \log p_{X|Z}(x^{(i)}|z^{(i, l)}; \theta)$ where $z^{(i, l)} = g_\phi(\epsilon^{(i, l)}, x^{(i)})$ and $\epsilon^{(l)} \sim p(\epsilon)$. (Eq. 7)
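
As a concrete reference for estimator (7), here is a small NumPy sketch; the diagonal-Gaussian posterior and the toy `log_p_x_given_z` stand-in for the decoder term are my own assumptions.

```python
# SGVB estimator, version B: analytic KL against an N(0, I) prior plus a Monte Carlo
# estimate of the expected reconstruction term.
import numpy as np

rng = np.random.default_rng(0)

def sgvb_b(mu, log_var, log_p_x_given_z, L=1):
    """-KL(q_phi(z|x) || p(z)) + (1/L) sum_l log p_theta(x | z^(l)), z = mu + sigma*eps."""
    sigma = np.exp(0.5 * log_var)
    # analytic KL between N(mu, diag(sigma^2)) and N(0, I):
    kl = -0.5 * np.sum(1 + log_var - mu ** 2 - np.exp(log_var))
    recon = 0.0
    for _ in range(L):
        eps = rng.standard_normal(mu.shape)      # eps^(l) ~ p(eps) = N(0, I)
        z = mu + sigma * eps                     # z^(i,l) = g_phi(eps^(l), x^(i))
        recon += log_p_x_given_z(z) / L
    return -kl + recon                           # lower bound estimate for x^(i)

# usage with a made-up decoder log-likelihood (placeholder only):
mu, log_var = np.array([0.3, -0.2]), np.array([-1.0, -0.5])
print(sgvb_b(mu, log_var, log_p_x_given_z=lambda z: -0.5 * np.sum(z ** 2)))
```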

 

A connection with auto-encoders becomes clear when looking at the objective function given at eq. (7). The first term acts as a regularizer, while the second term is an expected negative reconstruction error. The function $g_\phi(\cdot)$ is chosen such that it maps a datapoint $x^{(i)}$ and a random noise vector $\epsilon^{(l)}$ to a sample from the approximate posterior for that datapoint: $z^{(i, l)} = g_\phi(\epsilon^{(l)}, x^{(i)})$ where $z^{(i, l)} \sim q_\phi(z|x^{(i)})$. Subsequently, the sample $z^{(i,l)}$ is then input to function $\log p_\theta(x^{(i)}|z^{(i, l)})$, which equals the probability density of datapoint $x^{(i)}$ under the generative model, given $z^{(i, l)}$.

 

The objective $\tilde{\mathcal L}^B(\theta, \phi; x^{(i)})$ is closely related to an auto-encoder. The first term acts as a regularizer on $\phi$; the second term is the reconstruction error. (1) $g_\phi(\epsilon, x)$ (encoder + sampling): given a datapoint $x^{(i)}$ and a random noise vector $\epsilon^{(l)}$, it outputs $z^{(i, l)}$ sampled from $p_{Z|X}(z|x^{(i)}; \phi)$. (e.g. (1.1) From $x^{(i)}$, compute $p_{Z|X}(z|x; \phi) = \mathcal N(\mu, \sigma^2)$, i.e., $\mu$ and $\sigma$. (1.2) Sample $z$ via $z = \mu + \sigma\epsilon$.) (2) $p_{X|Z}(x|z; \theta)$ (decoder): given the sampled $z^{(i, l)}$, it outputs the probability (density) of $x^{(i)}$.

 

2.4. The reparameterization trick

It is then often possible to express the random variable $z$ as a deterministic variable $z = g_\phi(\epsilon, x)$, where $\epsilon$ is an auxiliary variable with independent marginal $p(\epsilon)$ and $g_\phi(\cdot)$ is some vector-valued function parameterized by $\phi$. This reparameterization is useful for our case since it can be used to rewrite an expectation w.r.t. $q_\phi(z|x)$ such that the Monte Carlo estimate of the expectation is differentiable w.r.t. $\phi$. Take, for example, the univariate Gaussian case: let $z \sim p(z|x) = \mathcal N(\mu, \sigma^2)$. In this case, a valid reparameterization is $z = \mu + \sigma\epsilon$, where $\epsilon$ is an auxiliary noise variable $\epsilon \sim \mathcal N(0, 1)$. Therefore, $\mathbb E_{\mathcal N(z; \mu, \sigma^2)} \left[ f(z) \right] = \mathbb E_{\mathcal N(\epsilon; 0, 1)}\left[ f(\mu + \sigma\epsilon) \right] \simeq \frac{1}{L} \sum_{l=1}^L f(\mu + \sigma \epsilon^{(l)})$ where $\epsilon^{(l)} \sim \mathcal N(0, 1)$.

If $z \sim p_{Z|X}(z|x; \phi)$, i.e., $z$ is a random variable, backpropagation (differentiation) cannot reach $\phi$, so learning is impossible. But if $z = g_\phi(\epsilon, x) \ \text{where} \ \epsilon \sim p(\epsilon)$, i.e., $z$ is a deterministic variable (given $\epsilon$), gradients do flow back to $\phi$ and learning is possible. We therefore reparameterize $z$ from a random variable into a deterministic one. E.g., assuming $z \sim p_{Z|X}(z|x) = \mathcal N(\mu, \sigma^2)$, a valid reparameterization is $z = g_\phi(\epsilon, x) = \mu + \sigma\epsilon \quad \text{where} \quad \epsilon \sim \mathcal N(0, 1)$. (Why? If $p_{Z|X}(\cdot)$ and $p_\epsilon(\cdot)$ are Gaussian, then $p_{Z|X}(z|x^{(i)}; \phi)\, dz = p_\epsilon(\epsilon)\, d\epsilon$ under this change of variables, so sampling $\epsilon^{(l)}$ from $p_\epsilon(\cdot)$ and mapping it through $g_\phi$ is equivalent to sampling $z^{(i, l)}$ from $p_{Z|X}(z|x^{(i)}; \phi)$ directly.) As a result, the expectation of $f(z)$ becomes $\mathbb E_{\mathcal N(z; \mu, \sigma^2)} \left[ f(z) \right] = \mathbb E_{\mathcal N(\epsilon; 0, 1)} \left[ f(\mu + \sigma\epsilon) \right] \simeq \frac{1}{L} \sum_{l=1}^L f(\mu + \sigma\epsilon^{(l)}) \ \text{where} \ \epsilon^{(l)} \sim \mathcal N(0, 1)$.
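
A small PyTorch sketch (my own, not the paper's code) of this point: because $z = \mu + \sigma\epsilon$ is deterministic given $\epsilon$, autograd can push gradients of a Monte Carlo objective back to $\mu$ and $\sigma$.

```python
# Gradient flow through the reparameterized sample z = mu + sigma * eps.
import torch

mu = torch.tensor(1.0, requires_grad=True)
log_sigma = torch.tensor(0.0, requires_grad=True)    # parameterize sigma > 0 via exp

L = 10_000
eps = torch.randn(L)                                 # eps ~ N(0, 1), parameter-free noise
z = mu + torch.exp(log_sigma) * eps                  # z = g_phi(eps, x): deterministic in phi
loss = (z ** 2).mean()                               # Monte Carlo estimate of E[f(z)], f(z) = z^2

loss.backward()
print(mu.grad)         # ~ 2*mu = 2.0       (since E[z^2] = mu^2 + sigma^2)
print(log_sigma.grad)  # ~ 2*sigma^2 = 2.0  (chain rule through sigma = exp(log_sigma))
```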

 

3. Example: Variational Auto-Encoder

Let the prior over the latent variables be the centered isotropic multivariate Gaussian $p_\theta(z) = \mathcal N(z; 0, I)$. We let $p_\theta(x|z)$ be a multivariate Gaussian (in case of real-valued data) or Bernoulli (in case of binary data) whose distribution parameters are computed from $z$ with a MLP (a fully-connected neural network with a single hidden layer). We can let the variational approximate posterior be a multivariate Gaussian with a diagonal covariance structure: $\log q_\phi(z|x^{(i)}) = \log \mathcal N(z; \mu^{(i)}, \sigma^{2(i)}I)$ where the mean and s.d. are outputs of the encoding MLP, i.e. nonlinear functions of datapoint $x^{(i)}$ and the variational parameters $\phi$. We sample from the posterior $z^{(i, l)} \sim q_\phi(z|x^{(i)})$ using $z^{(i, l)} = g_\phi(x^{(i)}, \epsilon^{(l)}) = \mu^{(i)} + \sigma^{(i)} \odot \epsilon^{(l)}$ where $\epsilon^{(l)} \sim \mathcal N(0, I)$. The resulting estimator for this model and datapoint $x^{(i)}$ is: $\mathcal L(\theta, \phi; x^{(i)}) \simeq \frac{1}{2} \sum_{j=1}^J \left( 1 + \log ({(\sigma_j^{(i)})}^2) - {(\mu_j^{(i)})}^2 - {(\sigma_j^{(i)})}^2 \right) + $ $\frac{1}{L} \sum_{l=1}^L \log p_\theta(x^{(i)}|z^{(i, l)})$ where $z^{(i, l)} = \mu^{(i)} + \sigma^{(i)} \odot \epsilon^{(l)}$ and $\epsilon^{(l)} \sim \mathcal N(0, I)$.

 

Assume $p_Z(z)$, $p_{Z|X}(z|x; \phi)$ and $p_{X|Z}(x|z; \theta)$ are multivariate Gaussians (for binary data, $p_{X|Z}$ is Bernoulli instead). (1) Estimate $p_{Z|X}(z|x^{(i)}; \phi)$, i.e., $\mu^{(i)}$ and $\sigma^{(i)}$ (encoder = MLP). (2) Sample $z^{(i, l)} = \mu^{(i)} + \sigma^{(i)} \odot \epsilon^{(l)} \ \text{where} \ \epsilon^{(l)} \sim \mathcal N(0, I)$. (3) Feed the sampled $z^{(i, l)}$ into the decoder MLP to obtain $p_{X|Z}(x|z^{(i, l)}; \theta)$, then evaluate the (log-)probability of $x^{(i)}$ under it.
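
As a concrete reference, here is a minimal VAE sketch in PyTorch following this setup (MLP encoder and decoder, $\mathcal N(0, I)$ prior, Bernoulli decoder for binarized data, $L = 1$); the layer sizes, activation, optimizer and the random minibatch are my own placeholder choices, not the paper's exact configuration.

```python
# Minimal VAE sketch: encoder MLP -> (mu, log sigma^2), reparameterized sample,
# decoder MLP -> Bernoulli logits; loss = analytic KL + reconstruction error.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim=784, h_dim=400, z_dim=20):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.Tanh())
        self.enc_mu = nn.Linear(h_dim, z_dim)        # mu^(i)
        self.enc_logvar = nn.Linear(h_dim, z_dim)    # log sigma^2(i)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.Tanh(),
                                 nn.Linear(h_dim, x_dim))  # Bernoulli logits

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        eps = torch.randn_like(mu)                   # eps^(l) ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * eps       # reparameterization (L = 1)
        return self.dec(z), mu, logvar

def neg_elbo(x, logits, mu, logvar):
    # expected negative reconstruction error (Bernoulli log-likelihood) ...
    recon = F.binary_cross_entropy_with_logits(logits, x, reduction='sum')
    # ... plus the analytic KL term -1/2 * sum(1 + log sigma^2 - mu^2 - sigma^2)
    kl = -0.5 * torch.sum(1 + logvar - mu ** 2 - logvar.exp())
    return recon + kl                                # minimizing this maximizes the bound

# one AEVB-style update on a random minibatch (stand-in for real binarized data):
model = VAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.bernoulli(torch.rand(128, 784))
logits, mu, logvar = model(x)
loss = neg_elbo(x, logits, mu, logvar)
opt.zero_grad(); loss.backward(); opt.step()
print(loss.item())
```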

 

6. Conclusion

We have introduced a novel estimator of the variational lower bound, Stochastic Gradient VB (SGVB), for efficient approximate inference with continuous latent variables.