Evidence lower bound

Lower bound on the log-likelihood of some observed data

In variational Bayesian methods, the evidence lower bound (often abbreviated ELBO, also sometimes called the variational lower bound[1] or negative variational free energy) is a useful lower bound on the log-likelihood of some observed data.

The ELBO is useful because it bounds from below the log-likelihood that a model (e.g. $p(X)$) assigns to a set of data. The actual log-likelihood may be higher (indicating an even better fit to the data), because the ELBO subtracts a Kullback–Leibler divergence (KL divergence) term that is positive whenever an internal component of the model is inaccurate, even if the model as a whole fits well. Improving the ELBO therefore means improving the likelihood of the model $p(X)$, the fit of the internal component, or both. This makes the ELBO (or its negative, used as a loss function) a good training objective, e.g. for training a deep neural network to improve both the model overall and the internal component. (The internal component is $q_\phi(\cdot|x)$, defined in detail later in this article.)

Definition

Let $X$ and $Z$ be random variables, jointly distributed with distribution $p_\theta$. For example, $p_\theta(X)$ is the marginal distribution of $X$, and $p_\theta(Z\mid X)$ is the conditional distribution of $Z$ given $X$. Then, for a sample $x\sim p_\theta$, and any distribution $q_\phi$, the ELBO is defined as

$$L(\phi,\theta;x) := \mathbb{E}_{z\sim q_\phi(\cdot|x)}\left[\ln\frac{p_\theta(x,z)}{q_\phi(z|x)}\right].$$
The ELBO can equivalently be written as[2]

$$\begin{aligned}L(\phi,\theta;x) &= \mathbb{E}_{z\sim q_\phi(\cdot|x)}\left[\ln p_\theta(x,z)\right] + H[q_\phi(z|x)]\\ &= \ln p_\theta(x) - D_{KL}(q_\phi(z|x)\,\|\,p_\theta(z|x)).\end{aligned}$$

In the first line, $H[q_\phi(z|x)]$ is the entropy of $q_\phi$, which relates the ELBO to the Helmholtz free energy.[3] In the second line, $\ln p_\theta(x)$ is called the evidence for $x$, and $D_{KL}(q_\phi(z|x)\,\|\,p_\theta(z|x))$ is the Kullback–Leibler divergence between the approximation $q_\phi(z|x)$ and the posterior $p_\theta(z|x)$. Since the Kullback–Leibler divergence is non-negative, $L(\phi,\theta;x)$ forms a lower bound on the evidence (the ELBO inequality):

$$\ln p_\theta(x) \geq \mathbb{E}_{z\sim q_\phi(\cdot|x)}\left[\ln\frac{p_\theta(x,z)}{q_\phi(z|x)}\right].$$
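The inequality can be checked numerically on a model whose evidence is known in closed form. The following is a minimal sketch, not a reference implementation. It assumes a toy model chosen only for illustration: $z\sim\mathcal{N}(0,1)$ and $x\mid z\sim\mathcal{N}(z,1)$, so that the marginal of $x$ is $\mathcal{N}(0,2)$ and the posterior of $z$ is $\mathcal{N}(x/2,1/2)$. All function names are hypothetical.

```python
import math
import random


def log_normal_pdf(v, mean, std):
    """Log density of N(mean, std^2) evaluated at v."""
    return -0.5 * math.log(2 * math.pi * std**2) - 0.5 * ((v - mean) / std) ** 2


def log_joint(x, z):
    """ln p(x, z) for the assumed toy model z ~ N(0, 1), x | z ~ N(z, 1)."""
    return log_normal_pdf(z, 0.0, 1.0) + log_normal_pdf(x, z, 1.0)


def elbo_monte_carlo(x, q_mean, q_std, num_samples=20000, seed=0):
    """Average of ln p(x, z) - ln q(z | x) over samples z ~ q(. | x) = N(q_mean, q_std^2)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(q_mean, q_std)
        total += log_joint(x, z) - log_normal_pdf(z, q_mean, q_std)
    return total / num_samples


x = 1.5
# Evidence of the toy model: the marginal of x is N(0, 2).
log_evidence = log_normal_pdf(x, 0.0, math.sqrt(2.0))
# A poor choice of q (the prior) gives a strictly smaller value.
elbo_prior_q = elbo_monte_carlo(x, 0.0, 1.0)
# The exact posterior N(x/2, 1/2) closes the gap.
elbo_exact_q = elbo_monte_carlo(x, x / 2, math.sqrt(0.5))

print(log_evidence, elbo_prior_q, elbo_exact_q)
```

With the exact posterior as $q_\phi$, every sample of the log-ratio equals $\ln p_\theta(x)$, so the estimate has no Monte Carlo noise; with any other $q_\phi$ the estimate falls below the evidence by the KL divergence.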

Motivation

Variational Bayesian inference

Suppose we have an observable random variable $X$, and we want to find its true distribution $p^*$. This would allow us to generate data by sampling, and estimate probabilities of future events. In general, it is impossible to find $p^*$ exactly, forcing us to search for a good approximation.

That is, we define a sufficiently large parametric family $\{p_\theta\}_{\theta\in\Theta}$ of distributions, then solve $\min_\theta L(p_\theta,p^*)$ for some loss function $L$. One possible way to solve this is to consider a small variation from $p_\theta$ to $p_{\theta+\delta\theta}$ and solve $L(p_\theta,p^*)-L(p_{\theta+\delta\theta},p^*)=0$. This is a problem in the calculus of variations, hence the name variational method.

Since there are not many explicitly parametrized distribution families (all the classical distribution families, such as the normal distribution, the Gumbel distribution, etc., are far too simplistic to model the true distribution), we consider implicitly parametrized probability distributions:

  • First, define a simple distribution $p(z)$ over a latent random variable $Z$. Usually a normal distribution or a uniform distribution suffices.
  • Next, define a family of complicated functions $f_\theta$ (such as a deep neural network) parametrized by $\theta$.
  • Finally, define a way to convert any $f_\theta(z)$ into a simple distribution over the observable random variable $X$. For example, let $f_\theta(z)=(f_1(z),f_2(z))$ have two outputs, then we can define the corresponding distribution over $X$ to be the normal distribution $\mathcal{N}(f_1(z),e^{f_2(z)})$.

This defines a family of joint distributions $p_\theta$ over $(X,Z)$. It is very easy to sample $(x,z)\sim p_\theta$: simply sample $z\sim p$, then compute $f_\theta(z)$, and finally sample $x\sim p_\theta(\cdot|z)$ using $f_\theta(z)$.
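The sampling procedure just described can be sketched in a few lines. This is only an illustration under stated assumptions: the function `f_theta`, its `tanh` form, and the parameter tuple `theta` are hypothetical stand-ins for a deep neural network, the prior is taken to be standard normal, and $e^{f_2(z)}$ is read as the variance of the normal distribution over $X$.

```python
import math
import random


def f_theta(z, theta):
    """Hypothetical stand-in for a neural network: returns (f1(z), f2(z))."""
    a, b, c = theta
    mean = a * math.tanh(z) + b  # f1(z): mean of x given z
    log_var = c                  # f2(z): log-variance of x given z (constant here)
    return mean, log_var


def sample_joint(theta, rng):
    """Draw one pair (x, z) from p_theta by ancestral sampling."""
    z = rng.gauss(0.0, 1.0)                       # step 1: z ~ p(z) = N(0, 1)
    mean, log_var = f_theta(z, theta)             # step 2: compute f_theta(z)
    x = rng.gauss(mean, math.exp(0.5 * log_var))  # step 3: x ~ N(f1, e^{f2})
    return x, z


rng = random.Random(0)
theta = (2.0, 1.0, -1.0)
samples = [sample_joint(theta, rng) for _ in range(5000)]
print(samples[:3])
```

Sampling is cheap because it only runs the model forward; evaluating the marginal density of $x$ is the hard direction.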

In other words, we have a generative model for both the observable and the latent. Now, we consider a distribution $p_\theta$ good if it is a close approximation of $p^*$:

$$p_\theta(X)\approx p^*(X)$$
Since the distribution on the right side is over $X$ only, the distribution on the left side must marginalize the latent variable $Z$ away.
In general, the integral $p_\theta(x)=\int p_\theta(x|z)p(z)\,dz$ is intractable, forcing us to perform another approximation.

Since $p_\theta(x)=\frac{p_\theta(x|z)p(z)}{p_\theta(z|x)}$ (Bayes' rule), it suffices to find a good approximation of $p_\theta(z|x)$. So define another distribution family $q_\phi(z|x)$ and use it to approximate $p_\theta(z|x)$. This is a discriminative model for the latent.

The entire situation is summarized in the following table:

| $X$: observable | $X, Z$ | $Z$: latent |
|---|---|---|
| $p^*(x)\approx p_\theta(x)\approx\frac{p_\theta(x\mid z)p(z)}{q_\phi(z\mid x)}$, approximable | | $p(z)$, easy |
| | $p_\theta(x\mid z)p(z)$, easy | |
| $p_\theta(z\mid x)\approx q_\phi(z\mid x)$, approximable | | $p_\theta(x\mid z)$, easy |

In Bayesian language, $X$ is the observed evidence, and $Z$ is the latent/unobserved. The distribution $p$ over $Z$ is the prior distribution over $Z$, $p_\theta(x|z)$ is the likelihood function, and $p_\theta(z|x)$ is the posterior distribution over $Z$.

Given an observation $x$, we can infer what $z$ likely gave rise to $x$ by computing $p_\theta(z|x)$. The usual Bayesian method is to estimate the integral $p_\theta(x)=\int p_\theta(x|z)p(z)\,dz$, then compute by Bayes' rule $p_\theta(z|x)=\frac{p_\theta(x|z)p(z)}{p_\theta(x)}$. This is expensive to perform in general, but if we can simply find a good approximation $q_\phi(z|x)\approx p_\theta(z|x)$ for most $x,z$, then we can infer $z$ from $x$ cheaply. Thus, the search for a good $q_\phi$ is also called amortized inference.

All in all, we have arrived at a problem of variational Bayesian inference.

Deriving the ELBO

A basic result in variational inference is that minimizing the Kullback–Leibler divergence (KL-divergence) is equivalent to maximizing the log-likelihood:

$$\mathbb{E}_{x\sim p^*(x)}[\ln p_\theta(x)] = -H(p^*) - D_{KL}(p^*(x)\,\|\,p_\theta(x))$$
where $H(p^*)=-\mathbb{E}_{x\sim p^*}[\ln p^*(x)]$ is the entropy of the true distribution. So if we can maximize $\mathbb{E}_{x\sim p^*(x)}[\ln p_\theta(x)]$, we can minimize $D_{KL}(p^*(x)\,\|\,p_\theta(x))$, and consequently find an accurate approximation $p_\theta\approx p^*$.

To maximize $\mathbb{E}_{x\sim p^*(x)}[\ln p_\theta(x)]$, we simply sample many $x_i\sim p^*(x)$ and replace the expectation by a sample average, i.e. use a Monte Carlo estimate:

$$N\max_\theta\mathbb{E}_{x\sim p^*(x)}[\ln p_\theta(x)]\approx\max_\theta\sum_i\ln p_\theta(x_i)$$
where $N$ is the number of samples drawn from the true distribution. This approximation can be seen as overfitting.[note 1]

In order to maximize $\sum_i\ln p_\theta(x_i)$, it is necessary to find $\ln p_\theta(x)$:

$$\ln p_\theta(x)=\ln\int p_\theta(x|z)p(z)\,dz$$
This usually has no closed form and must be estimated. The usual way to estimate integrals is Monte Carlo integration with importance sampling:
$$\int p_\theta(x|z)p(z)\,dz=\mathbb{E}_{z\sim q_\phi(\cdot|x)}\left[\frac{p_\theta(x,z)}{q_\phi(z|x)}\right]$$
where $q_\phi(z|x)$ is a sampling distribution over $z$ that we use to perform the Monte Carlo integration.

So we see that if we sample $z\sim q_\phi(\cdot|x)$, then $\frac{p_\theta(x,z)}{q_\phi(z|x)}$ is an unbiased estimator of $p_\theta(x)$. Unfortunately, this does not give us an unbiased estimator of $\ln p_\theta(x)$, because $\ln$ is nonlinear. Indeed, we have by Jensen's inequality,

$$\ln p_\theta(x)=\ln\mathbb{E}_{z\sim q_\phi(\cdot|x)}\left[\frac{p_\theta(x,z)}{q_\phi(z|x)}\right]\geq\mathbb{E}_{z\sim q_\phi(\cdot|x)}\left[\ln\frac{p_\theta(x,z)}{q_\phi(z|x)}\right]$$
In fact, all the obvious estimators of $\ln p_\theta(x)$ are biased downwards, because no matter how many samples $z_i\sim q_\phi(\cdot|x)$ we take, we have by Jensen's inequality:
$$\mathbb{E}_{z_i\sim q_\phi(\cdot|x)}\left[\ln\left(\frac{1}{N}\sum_i\frac{p_\theta(x,z_i)}{q_\phi(z_i|x)}\right)\right]\leq\ln\mathbb{E}_{z_i\sim q_\phi(\cdot|x)}\left[\frac{1}{N}\sum_i\frac{p_\theta(x,z_i)}{q_\phi(z_i|x)}\right]=\ln p_\theta(x)$$
Subtracting the right side, we see that the problem comes down to a biased estimator of zero:
$$\mathbb{E}_{z_i\sim q_\phi(\cdot|x)}\left[\ln\left(\frac{1}{N}\sum_i\frac{p_\theta(z_i|x)}{q_\phi(z_i|x)}\right)\right]\leq 0$$
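The downward bias, and the way it shrinks as more samples $z_i$ are averaged inside the logarithm, can be observed numerically. The sketch below is illustrative only. It assumes a toy model that is not part of the article ($z\sim\mathcal{N}(0,1)$, $x\mid z\sim\mathcal{N}(z,1)$, hence $p_\theta(x)=\mathcal{N}(x;0,2)$) and takes the prior as the sampling distribution $q_\phi$, so that each importance weight reduces to $p_\theta(x|z_i)$. All names are hypothetical.

```python
import math
import random


def log_normal_pdf(v, mean, std):
    """Log density of N(mean, std^2) evaluated at v."""
    return -0.5 * math.log(2 * math.pi * std**2) - 0.5 * ((v - mean) / std) ** 2


def log_weight(x, z):
    """ln [p(x, z) / q(z | x)] when q(z | x) = p(z): the ratio is p(x | z)."""
    return log_normal_pdf(x, z, 1.0)


def mean_log_estimate(x, num_z, repeats, rng):
    """Average, over `repeats` runs, of ln((1/num_z) * sum_i w_i) with z_i ~ q."""
    total = 0.0
    for _ in range(repeats):
        weights = [math.exp(log_weight(x, rng.gauss(0.0, 1.0))) for _ in range(num_z)]
        total += math.log(sum(weights) / num_z)
    return total / repeats


rng = random.Random(0)
x = 1.5
log_evidence = log_normal_pdf(x, 0.0, math.sqrt(2.0))
# Expected value of the log-estimator for 1, 5 and 50 inner samples.
estimates = {n: mean_log_estimate(x, n, 4000, rng) for n in (1, 5, 50)}

print(log_evidence, estimates)
```

In this setup the averaged estimates should sit below `log_evidence` and move towards it as the number of inner samples grows; the single-sample case is the ELBO itself.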
At this point, we could branch off towards the development of an importance-weighted autoencoder[note 2], but we will instead continue with the simplest case, $N=1$:
$$\ln p_\theta(x)=\ln\mathbb{E}_{z\sim q_\phi(\cdot|x)}\left[\frac{p_\theta(x,z)}{q_\phi(z|x)}\right]\geq\mathbb{E}_{z\sim q_\phi(\cdot|x)}\left[\ln\frac{p_\theta(x,z)}{q_\phi(z|x)}\right]$$
The gap in this inequality has a closed form:
$$\ln p_\theta(x)-\mathbb{E}_{z\sim q_\phi(\cdot|x)}\left[\ln\frac{p_\theta(x,z)}{q_\phi(z|x)}\right]=D_{KL}(q_\phi(\cdot|x)\,\|\,p_\theta(\cdot|x))\geq 0$$
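The closed form can be checked by factoring the joint density as $p_\theta(x,z)=p_\theta(z|x)\,p_\theta(x)$ and noting that $\ln p_\theta(x)$ does not depend on $z$. The following short derivation is a sketch that uses only the quantities already defined:

```latex
\begin{aligned}
\mathbb{E}_{z\sim q_\phi(\cdot|x)}\left[\ln\frac{p_\theta(x,z)}{q_\phi(z|x)}\right]
&= \mathbb{E}_{z\sim q_\phi(\cdot|x)}\left[\ln p_\theta(x) + \ln\frac{p_\theta(z|x)}{q_\phi(z|x)}\right] \\
&= \ln p_\theta(x) - \mathbb{E}_{z\sim q_\phi(\cdot|x)}\left[\ln\frac{q_\phi(z|x)}{p_\theta(z|x)}\right] \\
&= \ln p_\theta(x) - D_{KL}\left(q_\phi(\cdot|x)\,\|\,p_\theta(\cdot|x)\right).
\end{aligned}
```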
We have thus obtained the ELBO function:
$$L(\phi,\theta;x):=\ln p_\theta(x)-D_{KL}(q_\phi(\cdot|x)\,\|\,p_\theta(\cdot|x))$$

Maximizing the ELBO

For fixed $x$, the optimization $\max_{\theta,\phi}L(\phi,\theta;x)$ simultaneously attempts to maximize $\ln p_\theta(x)$ and minimize $D_{KL}(q_\phi(\cdot|x)\,\|\,p_\theta(\cdot|x))$. If the parametrizations of $p_\theta$ and $q_\phi$ are flexible enough, we would obtain some $\hat\phi,\hat\theta$ such that we have simultaneously

$$\ln p_{\hat\theta}(x)\approx\max_\theta\ln p_\theta(x);\quad q_{\hat\phi}(\cdot|x)\approx p_{\hat\theta}(\cdot|x)$$
Since
$$\mathbb{E}_{x\sim p^*(x)}[\ln p_\theta(x)]=-H(p^*)-D_{KL}(p^*(x)\,\|\,p_\theta(x))$$
we have, after averaging over $x\sim p^*$,
$$\mathbb{E}_{x\sim p^*(x)}[\ln p_{\hat\theta}(x)]\approx\max_\theta\left(-H(p^*)-D_{KL}(p^*(x)\,\|\,p_\theta(x))\right)$$
and so
$$\hat\theta\approx\arg\min_\theta D_{KL}(p^*(x)\,\|\,p_\theta(x))$$
In other words, maximizing the ELBO would simultaneously allow us to obtain an accurate generative model $p_{\hat\theta}\approx p^*$ and an accurate discriminative model $q_{\hat\phi}(\cdot|x)\approx p_{\hat\theta}(\cdot|x)$.[5]
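The effect of maximizing the ELBO over $\phi$ can be illustrated on a model where everything is available in closed form. The sketch below makes several assumptions that are not part of the article: the generative model is fixed to $z\sim\mathcal{N}(0,1)$, $x\mid z\sim\mathcal{N}(z,1)$ (so only $\phi$ is optimized, not $\theta$), the approximate posterior is $q_\phi(\cdot|x)=\mathcal{N}(m,s^2)$ with $\phi=(m,\ln s)$, and plain gradient ascent with hand-derived gradients is used in place of a real optimizer.

```python
import math


def elbo(x, m, s):
    """Closed-form ELBO for the assumed toy model with q(z | x) = N(m, s^2)."""
    reconstruction = -0.5 * math.log(2 * math.pi) - 0.5 * ((x - m) ** 2 + s**2)
    kl_to_prior = 0.5 * (s**2 + m**2 - 1.0) - math.log(s)
    return reconstruction - kl_to_prior


def fit_q(x, steps=500, lr=0.1):
    """Gradient ascent on the ELBO over phi = (m, rho), where rho = ln s."""
    m, rho = 0.0, 0.0
    for _ in range(steps):
        s = math.exp(rho)
        grad_m = x - 2.0 * m         # d ELBO / d m
        grad_rho = 1.0 - 2.0 * s**2  # d ELBO / d rho, by the chain rule through s = exp(rho)
        m += lr * grad_m
        rho += lr * grad_rho
    return m, math.exp(rho)


x = 1.5
m_hat, s_hat = fit_q(x)
# Evidence of the toy model: the marginal of x is N(0, 2).
log_evidence = -0.5 * math.log(2 * math.pi * 2.0) - x**2 / 4.0
gap = log_evidence - elbo(x, m_hat, s_hat)

print(m_hat, s_hat, gap)
```

For this toy model the true posterior is $\mathcal{N}(x/2,1/2)$, which lies inside the chosen family, so the fitted $q_{\hat\phi}$ should match it and the gap to the evidence should vanish. With a less flexible family the gap would stay positive.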

Main forms

The ELBO has many possible expressions, each with a different emphasis.

$$\mathbb{E}_{z\sim q_\phi(\cdot|x)}\left[\ln\frac{p_\theta(x,z)}{q_\phi(z|x)}\right]=\int q_\phi(z|x)\ln\frac{p_\theta(x,z)}{q_\phi(z|x)}\,dz$$

This form shows that if we sample $z\sim q_\phi(\cdot|x)$, then $\ln\frac{p_\theta(x,z)}{q_\phi(z|x)}$ is an unbiased estimator of the ELBO.

$$\ln p_\theta(x)-D_{KL}(q_\phi(\cdot|x)\;\|\;p_\theta(\cdot|x))$$

This form shows that the ELBO is a lower bound on the evidence $\ln p_\theta(x)$, and that maximizing the ELBO with respect to $\phi$ is equivalent to minimizing the KL-divergence from $p_\theta(\cdot|x)$ to $q_\phi(\cdot|x)$.

$$\mathbb{E}_{z\sim q_\phi(\cdot|x)}[\ln p_\theta(x|z)]-D_{KL}(q_\phi(\cdot|x)\;\|\;p)$$

This form shows that maximizing the ELBO simultaneously attempts to keep $q_\phi(\cdot|x)$ close to $p$ and to concentrate $q_\phi(\cdot|x)$ on those $z$ that maximize $\ln p_\theta(x|z)$. That is, the approximate posterior $q_\phi(\cdot|x)$ balances between staying close to the prior $p$ and moving towards the maximum likelihood $\arg\max_z\ln p_\theta(x|z)$.

$$H(q_\phi(\cdot|x))+\mathbb{E}_{z\sim q_\phi(\cdot|x)}[\ln p_\theta(z|x)]+\ln p_\theta(x)$$

This form shows that maximizing the ELBO simultaneously attempts to keep the entropy of $q_\phi(\cdot|x)$ high and to concentrate $q_\phi(\cdot|x)$ on those $z$ that maximize $\ln p_\theta(z|x)$. That is, the approximate posterior $q_\phi(\cdot|x)$ balances between being a uniform distribution and moving towards the maximum a posteriori $\arg\max_z\ln p_\theta(z|x)$.
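That these expressions describe the same quantity can be verified on a model where each term has a closed form. The sketch below is an illustration under assumptions not found in the article: a toy model $z\sim\mathcal{N}(0,1)$, $x\mid z\sim\mathcal{N}(z,1)$ with posterior $\mathcal{N}(x/2,1/2)$, a Gaussian $q_\phi(\cdot|x)=\mathcal{N}(m,s^2)$ with arbitrarily chosen $m$ and $s$, and hypothetical helper names.

```python
import math


def gaussian_kl(m1, s1, m2, s2):
    """D_KL( N(m1, s1^2) || N(m2, s2^2) )."""
    return math.log(s2 / s1) + (s1**2 + (m1 - m2) ** 2) / (2 * s2**2) - 0.5


def gaussian_entropy(s):
    """Differential entropy of N(., s^2)."""
    return 0.5 * math.log(2 * math.pi * math.e * s**2)


def expected_log_normal(m, s, mean, std):
    """E_{z ~ N(m, s^2)}[ ln N(z; mean, std^2) ]."""
    return -0.5 * math.log(2 * math.pi * std**2) - ((m - mean) ** 2 + s**2) / (2 * std**2)


x, m, s = 1.5, 0.3, 0.8                       # observation and an arbitrary q = N(m, s^2)
post_mean, post_std = x / 2, math.sqrt(0.5)   # exact posterior of the toy model
log_evidence = -0.5 * math.log(2 * math.pi * 2.0) - x**2 / 4.0

# Evidence minus KL to the posterior.
form_evidence_minus_kl = log_evidence - gaussian_kl(m, s, post_mean, post_std)
# Expected log-likelihood minus KL to the prior; ln p(x | z) = ln N(z; x, 1) by symmetry.
form_reconstruction = expected_log_normal(m, s, x, 1.0) - gaussian_kl(m, s, 0.0, 1.0)
# Entropy of q plus expected log-posterior plus evidence.
form_entropy = (gaussian_entropy(s)
                + expected_log_normal(m, s, post_mean, post_std)
                + log_evidence)

print(form_evidence_minus_kl, form_reconstruction, form_entropy)
```

The three values should agree up to floating-point rounding, and all lie below `log_evidence` because the chosen $q_\phi$ differs from the posterior.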

Data-processing inequality

Suppose we take $N$ independent samples from $p^*$ and collect them in the dataset $D=\{x_1,...,x_N\}$; then we have the empirical distribution $q_D(x)=\frac{1}{N}\sum_i\delta_{x_i}$.

Fitting $p_\theta(x)$ to $q_D(x)$ can be done, as usual, by maximizing the log-likelihood $\ln p_\theta(D)$:

$$D_{KL}(q_D(x)\,\|\,p_\theta(x))=-\frac{1}{N}\sum_i\ln p_\theta(x_i)-H(q_D)=-\frac{1}{N}\ln p_\theta(D)-H(q_D)$$
Now, by the ELBO inequality, we can bound $\ln p_\theta(D)$, and thus
$$D_{KL}(q_D(x)\,\|\,p_\theta(x))\leq-\frac{1}{N}L(\phi,\theta;D)-H(q_D)$$
The right-hand side simplifies to a KL-divergence between joint distributions over $(x,z)$, where $q_{D,\phi}(x,z):=q_D(x)\,q_\phi(z|x)$, and so we get:
$$D_{KL}(q_D(x)\,\|\,p_\theta(x))\leq-\frac{1}{N}\sum_i L(\phi,\theta;x_i)-H(q_D)=D_{KL}(q_{D,\phi}(x,z)\,\|\,p_\theta(x,z))$$
This result can be interpreted as a special case of the data processing inequality.

In this interpretation, maximizing $L(\phi,\theta;D)=\sum_i L(\phi,\theta;x_i)$ is minimizing $D_{KL}(q_{D,\phi}(x,z)\,\|\,p_\theta(x,z))$, which upper-bounds the real quantity of interest $D_{KL}(q_D(x)\,\|\,p_\theta(x))$ via the data-processing inequality. That is, we append a latent space to the observable space, paying the price of a weaker inequality for the sake of more computationally efficient minimization of the KL-divergence.[6]
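The identification of the upper bound with a joint KL-divergence can be spelled out in a few steps. The derivation below is a sketch; it assumes that the joint $q_{D,\phi}(x,z)$ denotes $q_D(x)\,q_\phi(z|x)$, and it treats $H(q_D)$ formally, as the rest of this section does.

```latex
\begin{aligned}
D_{KL}\left(q_{D,\phi}(x,z)\,\|\,p_\theta(x,z)\right)
&= \mathbb{E}_{x\sim q_D}\,\mathbb{E}_{z\sim q_\phi(\cdot|x)}\left[\ln q_D(x) + \ln\frac{q_\phi(z|x)}{p_\theta(x,z)}\right] \\
&= -H(q_D) - \mathbb{E}_{x\sim q_D}\left[L(\phi,\theta;x)\right] \\
&= -H(q_D) - \frac{1}{N}\sum_i L(\phi,\theta;x_i).
\end{aligned}
```

Marginalizing $z$ out of both joint distributions gives $q_D(x)$ and $p_\theta(x)$, and a KL-divergence cannot increase under such a marginalization, which is the data-processing step.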

References

  1. Kingma, Diederik P.; Welling, Max (2014-05-01). "Auto-Encoding Variational Bayes". arXiv:1312.6114 [stat.ML].
  2. Goodfellow, Ian; Bengio, Yoshua; Courville, Aaron (2016). "Chapter 19". Deep Learning. Adaptive Computation and Machine Learning. Cambridge, Mass: MIT Press. ISBN 978-0-262-03561-3.
  3. Hinton, Geoffrey E; Zemel, Richard (1993). "Autoencoders, Minimum Description Length and Helmholtz Free Energy". Advances in Neural Information Processing Systems. 6. Morgan-Kaufmann.
  4. Burda, Yuri; Grosse, Roger; Salakhutdinov, Ruslan (2015-09-01). "Importance Weighted Autoencoders". arXiv:1509.00519 [stat.ML].
  5. Neal, Radford M.; Hinton, Geoffrey E. (1998), "A View of the EM Algorithm that Justifies Incremental, Sparse, and other Variants", Learning in Graphical Models, Dordrecht: Springer Netherlands, pp. 355–368, doi:10.1007/978-94-011-5014-9_12, ISBN 978-94-010-6104-9, S2CID 17947141
  6. Kingma, Diederik P.; Welling, Max (2019-11-27). "An Introduction to Variational Autoencoders". Foundations and Trends in Machine Learning. 12 (4). Section 2.7. arXiv:1906.02691. doi:10.1561/2200000056. ISSN 1935-8237. S2CID 174802445.

Notes

  1. In fact, by Jensen's inequality,
    $$\mathbb{E}_{x\sim p^*(x)}\left[\max_\theta\sum_i\ln p_\theta(x_i)\right]\geq\max_\theta\mathbb{E}_{x\sim p^*(x)}\left[\sum_i\ln p_\theta(x_i)\right]=N\max_\theta\mathbb{E}_{x\sim p^*(x)}[\ln p_\theta(x)]$$
    The estimator is biased upwards. This can be seen as overfitting: for some finite set of sampled data $x_i$, there is usually some $\theta$ that fits them better than the entire $p^*$ distribution.
  2. By the delta method, we have
    $$\mathbb{E}_{z_i\sim q_\phi(\cdot|x)}\left[\ln\left(\frac{1}{N}\sum_i\frac{p_\theta(z_i|x)}{q_\phi(z_i|x)}\right)\right]\approx-\frac{1}{2N}\mathbb{V}_{z\sim q_\phi(\cdot|x)}\left[\frac{p_\theta(z|x)}{q_\phi(z|x)}\right]=O(N^{-1})$$
    If we continue with this, we would obtain the importance-weighted autoencoder.[4]