Fundamental Bayesian Samplers

by Eric · Published June 11, 2019 · Updated April 9, 2020

Introduction

The posterior probability distribution is the heart of Bayesian statistics and a fundamental tool for Bayesian parameter estimation. Naturally, how to infer and build these distributions is a widely examined topic, the scope of which cannot fit in one blog.

We can, however, start to build a better understanding of sampling by examining three basic, but fundamental techniques:

The importance sampler
Metropolis-Hastings sampler
Gibbs sampler

These samplers fall into the family of Monte Carlo samplers. Monte Carlo samplers:

Use repeated random draws to approximate a target probability distribution.
Produce a sequence of draws that can be used to estimate unknown parameters.

The Metropolis-Hastings and Gibbs samplers are also both Markov Chain Monte Carlo (MCMC). Markov Chain Monte Carlo samplers:

Are a type of Monte Carlo sampler.
Build a Markov Chain by starting at a specific state and making random changes to the state during each iteration.
Result in a Markov Chain that converges to the target distribution.

The importance sampler

Importance samplers use weighted draws from a proposed importance distribution to approximate characteristics of a different target distribution.

Importance sampling is useful when the area we are interested in may lie in a region that has a small probability of occurrence. In these cases, other sampling techniques may fail to even draw from that area.

Importance sampling overcomes this issue by sampling from a distribution which overweights the region of interest.

Preparing for sampling

In order to implement the importance sampler we must decide on our importance distribution and our target distribution. The importance distribution is what we will draw from during each iteration.

There are a few things to consider when deciding on our importance distribution (Koop, Poirier & Tobias, 2007):

The fewer draws you make, the more crucial it is to have an importance distribution that is a close approximation to the target distribution.
In general, importance distributions should have fatter tails than the target distribution. For example, today we will use a t-distribution as our importance distribution for estimating the mean and variance of a standard normal distribution.
Choosing an importance distribution is not as obvious or simple as our toy example suggestion. In reality, there are statistical techniques to help choose the optimal importance distribution.

The algorithm

The importance sampler algorithm iterates over the following steps:

Draw a single sample, $X$, from the importance distribution. For example, let's suppose we draw -0.01014 from a $t(0, 1, 2)$ distribution.
Calculate the importance sampling weight. This is a likelihood ratio of the probability of $X$ being drawn from the target distribution, $p(x)$, and probability of $X$ being drawn from the importance distribution, $q(X)$.
$$p(X) = \text{Probability of -0.010139 from } N(0, 1) = 0.3989$$ $$q(X) = \text{Probability of -0.010139 from } t(0, 1, 2) = 0.3535$$ $$w = \frac{p(X)}{q(X)} = \frac{0.3989}{0.3535} = 1.1284$$
Multiply draw $X$ by the importance weight. $$ w * X = 1.1284 * -0.01014 = -0.01144 $$
Repeat steps 1-3 until a sample of the desired size is created.
Use the sample collected in steps 1-4 to compute the expected value and variance of the target distribution.

Example*

Now let's use the iterative importance sampler to estimate the mean and variance of the standard normal distribution, $N(0,1)$.

We can implement this importance sampler using the importanceSamplerTDist function found in the GAUSS samplerlib library.

This function takes one required input and three optional inputs:

keep_draws: Scalar, the total number of draws to be kept.
dof_is: Optional input, Scalar, the degrees of freedom of the t-distribution. Default = 2.
mean_is: Optional input, Scalar, the mean of the importance function. Default = 0.
scale_is: Optional input, Scalar, the scale factor of the importance function. Default = 1.

We'll run the importanceSamplerTDist using 50,000 repetitions and the default dof_is, mean_is, and scale_is:

new;
library samplerlib;

rndseed 34532;

// Number of iterations 
keep_draws = 50000;

// Call the function
struct importanceOut iOut;
iOut = importanceSamplerTdist(keep_draws);

The importanceSamplerTDist function returns a filled instance of the importanceOut structure. The structure iOut contains the following relevant members:

iOut.theta_draw_is: Vector, sequence of draws from the importance sampler.
iOut.theta_mean_is: Scalar, importance sampler posterior mean.
iOut.theta_std_is: Scalar, importance sampler posterior standard deviation.
iOut.w: Vector, the importance weights.
iOut.wmean: Scalar, the mean of importance sampling weights.
iOut.wstd: Scalar, standard deviation of importance sampling weights.

The function prints the posterior distribution mean and variance-covariance matrix as output:

Importance Sampling Posterior Mean and Standard Deviation
  -0.00064283983       0.99621825

Mean and standard deviation of importance sampling weights
       1.0000216       0.38137929

We can see that our estimated mean of -0.0006 and variance of 0.99 are fairly close to the true mean of 0 and variance of 1.

Additionally, the mean and standard deviation of our importance weights provide some insight into the relationship between our importance distribution and the target distribution.

Advantages

The importance sampler:

Allows us to solve problems that may not be feasible using other sampling methods.
Can be used to study one distribution using samples generated from another distribution.
By choosing the appropriate importance distribution we can sample from interesting or important regions other Monte Carlo sampling techniques may overlook.
Examples include Bayesian inference, rare event simulation in finance or insurance, and high energy physics.

Disadvantages

There are a few disadvantages of the importance sampler to consider:

It does not work well in high dimensions. The variance of the samples increases as the dimensionality increases.
It can be difficult to identify a suitable importance distribution that is easy-to-sample from.

Gibbs Sampling

Gibbs sampling is a Markov Chain Monte Carlo technique used to sample from distributions with at least two dimensions.

The Gibbs sampler draws iteratively from posterior conditional distributions rather than drawing directly from the joint posterior distribution. By iteration, we build a chain of draws, with each current draw depending on the previous draw.

The Gibbs Sampler is particularly useful, as the joint posterior is not always easy to work with. However, we can solve the joint estimation problem as a series of smaller, easier estimation problems.

Preparing for sampling

Before running the Gibbs sampler we must find the joint posteriors. Today we are going to use the Gibbs sampler to estimate two parameters from a bivariate normal posterior distribution such that

$$ \begin{eqnarray*} \begin{pmatrix} \theta_1\\ \theta_2 \end{pmatrix} & \sim & N\left[\left(\begin{array}{c} 0\\ 0 \end{array}\right),\left(\begin{array}{ccc} 1 & \rho\\ \rho & 1 \end{array}\right)\right]\\ \end{eqnarray*} $$

where $\theta_1$ and $\theta_2$ are unknown parameters of the model, while $\rho$ is the known posterior correlation between $\theta_1$ and $\theta_2$.

Using the properties of the multivariate normal distribution we can define the conditional distributions necessary for our sampler

$$\theta_1|\theta_2,\: y \sim N(\rho\theta_2,\: 1-\rho^2) \sim \rho\theta_2 + \sqrt{1-\rho^2}N(0,\:1)$$

and

$$\theta_2|\theta_1,\: y \sim N(\rho\theta_1,\: 1-\rho^2) \sim \rho\theta_1 + \sqrt{1-\rho^2}N(0,\:1) .$$

The algorithm

The Gibbs sampler algorithm is dependent on the number of dimensions in our model. Let's consider a bivariate normal sampler with two parameters, $\theta_1$ and $\theta_2$.

Determine the conditional distributions for each variable, $\theta_1$ and $\theta_2$.
Choose a starting values $\theta_1^{(0)}$ and $\theta_2^{(0)}$. $$\theta_1^{(0)} = 0$$ $$\theta_2^{(0)} = 0$$
Draw $\theta_2^{(r)}$ from $\:p(\theta_2|y,\:\theta_1^{(r-1)})$. $$\theta_2^{(1)}|\theta_1^{(0)} = \rho\theta_1^{(0)} + \sqrt{1-\rho^2}N(0,\:1) =\\ \rho*0 + \sqrt{1-\rho^2}N(0,\:1) = -0.15107$$.
Draw $\theta_1^{(r)}$ from $\:p(\theta_1|y,\:\theta_2^{(r)})$.
$$\theta_1^{(1)}|\theta_2^{(1)} = \rho\theta_2^{(1)} + \sqrt{1-\rho^2}N(0,\:1) =\\ \rho*-0.15107 + \sqrt{1-\rho^2}N(0,\:1) = -0.65117$$.
Repeat 2-3 for desired number of iterations.

Example*

To implement this Gibbs sampler we can use the gibbsSamplerBiN function found in the GAUSS samplerlib library.

This function takes two required inputs and three optional inputs:

keep_draws: Scalar, the total number of draws to be kept.
rho: Scalar, the correlation parameter.
burn_in: Optional input, Scalar, the number of burn-in iterations. Default = 10% of kept draws.
theta_1_init: Optional input, Scalar, the initial value of $\theta_1$. Default = 0.
theta_2_init: Optional input, Scalar, the initial value of $\theta_2$. Default = 0.

We'll run the gibbsSamplerBiN using $\rho = 0.6$, 1000 burn-in iterations, and keeping 10,000 repetitions:

new;
library samplerlib;

// Set random seed for repeatable random numbers
rndseed 45352;

/*
** Gibbs Sampler Specifications
** Known correlation parameter
*/
rho = 0.6;      

// Burn-in for the Gibbs sampler
burn_in = 1000;        

// Draws to keep from sampler
keep_draws = 10000;  

// Call Gibbs sampler
struct gibbsBiNOut gOut;
gOut = gibbsSamplerBiN(keep_draws, rho, burn_in);

The gibbsSamplerBiN function returns a filled instance of the gibbsBiNOut structure. The structure gOut contains the following relevant members:

gOut.theta_keep_gibbs: Matrix, sequence of kept draws from the Gibbs sampler.
gOut.theta_mean_gibbs: Matrix, Gibbs sampler posterior mean.
gOut.theta_std_gibbs: Scalar, Gibbs sampler posterior standard deviation.

The functions prints the posterior distribution mean an variance-covariance matrix as output:

---------------------------------------------------
Variance-Covariance Matrix from Gibbs Sampler
---------------------------------------------------
Sample Mean:
    0.0060170250
    0.0032377421

Sample Variance-Covariance Matrix:
      0.97698091       0.58179239
      0.58179239       0.98180547

In this case, we actually knew the true standard values of our correlation parameter and means, which gives us some insight into the performance of our sampler.

Note that our sample estimates $\hat{\rho} = 0.58179239$ while our true correlation parameter is $\rho = 6$. In addition, the true $\theta_1$ and $\theta_2$ were both equal to 0, while our sampler estimates $\hat{\theta}_1=0.006$ and $\hat{\theta}_2=0.003$.

Advantages

The Gibbs sampler is a powerful sampler with a number of advantages:

Joint distributions may be complex to draw directly from but we may be able to sample directly from less complicated conditional distributions.
Conditional distributions are lower in dimension than joint distributions and may be more suitable for applying other sampling techniques.

Disadvantages

The Gibbs sampler is not without disadvantages, though:

To perform the Gibbs sampler we must be able to find posterior conditionals for each of the variables.
As the correlation between variables increases, the performance of the Gibbs sampler decreases. This is because the sequence of draws from the Gibbs sampler becomes more correlated.
Even if we can extract the conditional distributions they may not be known forms, so we could not draw from them.
Drawing from multiple conditional distributions may be slow and inefficient.

Metropolis-Hastings sampler

Like the Gibbs sampler, the Metropolis-Hastings sampler is a MCMC sampler. While the Gibbs sampler relies on conditional distributions, the Metropolis-Hastings sampler uses a full joint density distribution to generate a candidate draws.

The candidate draws are not automatically added to the chain but rather an acceptance probability distribution is used to accept or reject candidate draws.

Preparing for sampling

In order to implement the Metropolis-Hastings sampler two things must be chosen:

The proposal distribution which defines the probability of a candidate draw given the previous draw. In our example we will use the normal distribution, $ N\big(0,\text{ }d^2\big)$, to generate the candidate draws.
The acceptance probability which is the probability that we accept the candidate draw. In our example we will use

$$ \alpha\big(\theta^{(r-1)}, \theta^*\big) = min \bigg[exp\Big[\frac{1}{2}\Big( | \theta^{(r-1)}| - | \theta^*| - \bigg(\frac{\theta^{(r-1)}}{d}\bigg)^2 + \bigg(\frac{\theta^*}{d}\bigg)^2\Big) \Big],1\bigg]$$

where $d$ is the performance parameter, which we will set to $d=6$.

The algorithm

The general Metropolis-Hastings algorithm can be broken down into simple steps:

Choose a starting value for $\theta$. $$\theta^{(0)} = 1$$
Draw $\theta^*$ from the candidate generating density. $$\theta^* = N\big(0,\text{ }36\big) = -1.3847$$
Calculate the acceptance probability $\alpha(\theta^{(r-1)}, \theta^*)$.
$$ \alpha \big( 1, -1.3847 \big) = min \bigg[ exp \Big[ \frac{1}{2} \Big( |1| - |-1.3847| - \bigg( \frac{1}{6} \bigg)^2 + \bigg( \frac{-1.3847}{6} \bigg)^2 \Big) \Big], 1 \bigg] = 0.8356$$
Set $\theta^{(r)} = \theta^*$ with probability $\alpha(\theta^{(r-1)}, \theta^*)$, or else set $\theta^{(r)} = \theta^{(r-1)}$. $$ 0.8356 > U(0,1) \rightarrow \theta^{(1)} = -1.3847$$
Repeat steps 3-5 for $r = 1, ... , R$.

Example*

In this example we will use the Metropolis-Hastings sampler to make inference about a parameter, $\theta$, which we assume has a posterior distribution given by

$$ p(\theta|y) \propto exp\bigg(-\frac{1}{2}|(\theta)|\bigg) $$

This a special case of the Laplace distribution which has a true mean of 0 and a variance of 8.

Though we know the true mean and variance, we are going to see how to use the independent chain Metropolis-Hastings sampler to estimate both.

We will implement this Metropolis-Hastings sampler using the metropolisHastingsIC function found in the GAUSS samplerlib library.

This function takes one required input and three optional inputs:

keep_draws: Scalar, the total number of draws to be kept.
burn_in: Optional input, Scalar, the number of burn-in iterations. Default = 10% of kept draws.
sd_ic: Optional input, Scalar, standard deviation of Metropolis-Hastings draw. Default = 1.
theta_init: Optional input, Scalar, the initial value of $\theta$. Default = 0.

We'll run the metropolisHastingsIC using 100 burn-in iterations, and keeping 10,000 repetitions, sd_ic = 6, and theta_init = 1:

new;
library samplerlib;

// Set random seed for repeatable random numbers
rndseed 97980;

/*
** Set up Metropolis Hastings chain
** Discard r0 burnin replications
*/
burn_in = 100;

// Specify the total number of replications
keep_draws = 10000;

/*
** Standard deviation for the increment 
** in the independent chain M-H algorithm
*/
sd_ic = 6;

// Set initial theta
theta_init = 1;

// Call Metropolis-Hastings sampler
struct metropolisOut mOut;
mOut = metropolisHastingsIC(keep_draws, burn_in, sd_ic, theta_init);

The metropolisHastingsIC function returns a filled instance of the metropolisOut structure. The structure mOut contains the following relevant members:

mOut.theta_draw_mh: Vector, the vector of accepted draws in the sample.
mOut.theta_mean_mh: Scalar, the Metropolis-Hastings posterior mean.
mOut.theta_std_mh: Matrix, the Metropolis-Hastings posterior standard deviation.
mOut.accepted_count_mh: Scalar, the proportion of accepted candidate draws.

The functions prints the posterior distribution mean and variance-covariance matrix as output:

Posterior Mean and Variance
     0.038943707
       8.1069691

Proportion of accepted candidate draws: Independence chain M-H
      0.48455446

Advantages

The Metropolis-Hastings sampler has number of advantages:

It is simple to implement. There is no need to determine conditional distributions.
Reasonable for sampling from correlated, high dimensional distributions.

Disadvantages

Can have a poor convergence rate.
Struggles with multi-modal distributions.
The sampler is sensitive to the step size between draws. Either too large or too small of a step size can have a negative impact on convergence.

Conclusion

Congratulations! Today we've learned about three fundamental types of Bayesian samplers, the importance sampler, the Gibbs sampler, and the Metropolis-Hastings sampler.

In particular, we looked at:

The algorithms of each.
Some of the disadvantages and advantages of the samplers.
Examples of how to implement the samplers using the GAUSS samplerlib library.

References

Our examples today are based on examples provided in the Bayesian Econometric Methods textbook by Gary Koop, Dale Poirer, and Justin Tobias.

Koop, G., Poirier, D. J., & Tobias, J. L. (2007). Bayesian econometric methods. Cambridge University Press.

Eric( Director of Applications and Training at Aptech Systems, Inc. )

Eric has been working to build, distribute, and strengthen the GAUSS universe since 2012. He is an economist skilled in data analysis and software development. He has earned a B.A. and MSc in economics and engineering and has over 18 years of combined industry and academic experience in data analysis and research.