Metropolis-Hastings MCMC from Scratch in Python

Jonathan Fraine
13 min read · Feb 16, 2023



This is the second instalment in a series about Bayesian Inference with Python. The primer for the series starts with Bayesian reasoning and finishes with a rudimentary MCMC implementation.

We will recap the bare minimum from that article here, in order to explore the Metropolis-Hastings (MH) Algorithm for MCMC. The MH flavour of MCMC — implemented below — is a more direct algorithm because it keeps the stepsize static over the entire chain. Some MCMC methods use adaptive refinement to adjust the step sizes to either strengthen or speed up convergence (or both). There are dozens of other flavours of MCMC (or just MC), which we will discuss and diagnose in future articles.

Bayes to MCMC: Recap from Primer

Bayesian analysis is a statistical framework that uses Bayes’ theorem to update beliefs about an event by building inferences from hypotheses, models, or quantities with regard to observed data. Furthermore, Bayesian inference is the component of Bayesian analysis that uses Bayes’ theorem to infer conclusions when presented with data, hypotheses, and prior knowledge.

Multi-layered Venn diagram of Bayesian Inference, Monte Carlo, Markov Chains, and MCMC. Image created by Author

Down the chain, Monte Carlo (MC) is the set of techniques within statistical analysis that use random samples over a distribution to estimate solutions or perform simulations. Markov Chain Monte Carlo (MCMC) is a flavour of MC in which each new sample is generated from a probability distribution conditioned on the previous sample in the chain. MCMC is popular because it is a computationally tractable application of Bayesian inference.

The above figure is a multi-layered Venn diagram, which shows that Markov Chains and Monte Carlo techniques do not require Bayesian statistics. In fact, the formative work on MCMC (Metropolis et al., 1953) does not mention Bayes at all. As such, Markov Chains can be designed without Bayesian inference. Furthermore, many Monte Carlo techniques exist that do not involve Markov Chains or Bayesian inference. The diagram above is a visual representation that the context discussed here is a subset of MCMC within the framework of Bayesian Inference.

Bayes’ theorem

Bayes’ theorem is a mathematical formula for determining the conditional probability of an event — P(X|Y) — given prior knowledge — P(X) — about the event. The theorem is written as:

P(Event | Data) = P(Data | Event) * P(Event) / P(Data)

This equation asks the question: what is the probability that “a thing happened” (Event), given that we have observed a set of data (Data)?
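
As a quick numeric illustration, the update is a one-line computation. The probability values below are hypothetical, chosen only to show the arithmetic:

# Hypothetical probabilities, for illustration only
p_data_given_event = 0.9  # P(Data | Event): the likelihood
p_event = 0.01            # P(Event): the prior
p_data = 0.05             # P(Data): the evidence

p_event_given_data = p_data_given_event * p_event / p_data
print(p_event_given_data)  # 0.18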

Metropolis-Hastings: Issues with Direct Calculations

Metropolis-Hastings (MH) is a common method of executing an MCMC, which is not too complex to implement or understand. The underlying principle of MCMC is that the chain is a sequence of states created through iterative estimations of new states. States are sets of parameters that define the likelihood. Most commonly, states are the parameters of an analytic model over which the MCMC is iterating in order to maximise the Posterior.

A Markov chain requires that each proposed (new) state is generated only from the previous state. In the Bayesian context, the acceptance of a new state often involves comparing the likelihood of the proposal state to the likelihood of the current state, with respect to a uniformly sampled acceptance threshold. This last part is often confusing to say out loud, but easy to code.

We will now detail a step-by-step procedure for updating a Markov Chain by comparing the likelihoods between the current state and a proposal (new) state — see previous instalment for details about Bayesian Updating. In our context below, the new state will be accepted if the ratio of the proposal likelihood over the current likelihood is greater than a threshold, randomly sampled from a standard Uniform distribution: U[0, 1).

Metropolis-Hastings for a single parameter system

[Note that {Name}, below, symbolizes “sampling from the Named distribution”]

  1. Sample the prior to start with a random state as the initial parameter(s), and assign it as the current state: x_init = curr = {Prior}
  2. Compute the likelihood of the current state: L_curr = N(mu, var)(curr)
    For a Gaussian Likelihood, to compute the likelihood is to input x into the following equation — see the short sketch after this list:
    L(x; mu, var) = exp(-(x - mu)^2 / (2*var)) / sqrt(2*pi*var)
  3. Propose a new state within the prior distribution by sampling from the transition distribution — i.e., N(curr, stepsize) for a Normal transition distribution: proposed = {N(curr, stepsize)}
  4. Compute the likelihood of the proposed state:
    L_prop = N(mu, var)(proposed)
  5. Compute the acceptance criterion as the ratio of the proposed likelihood over the current likelihood: accept_crit = L_prop / L_curr
  6. Sample a uniform distribution from zero to one as the acceptance threshold: accept_threshold = {U[0,1)}
  7. If the acceptance criterion is greater than the acceptance threshold, then the proposed state is stored as the current state:
    if accept_crit > accept_threshold: curr = proposed
  8. Repeat steps 3–7 until a predetermined number of iterations have been completed or convergence has been achieved.
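
As a minimal sketch of steps 1 and 2, assuming a uniform prior over [-10, 10] (that prior range is a hypothetical choice for illustration):

import numpy as np

def gaussian_likelihood(x, mu=0.0, var=1.0):
    # L(x; mu, var) = exp(-(x - mu)^2 / (2*var)) / sqrt(2*pi*var)
    return np.exp(-(x - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

curr = np.random.uniform(-10, 10)   # Step 1: sample the (assumed) prior
L_curr = gaussian_likelihood(curr)  # Step 2: likelihood of the current state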

Let us assume that our likelihood distribution is a standard normal, with a mean of zero and a width of one. This assumption is often quite normal (pun!) when fitting residuals of an analytic model.

To understand the acceptance criterion, if our proposed parameter is 5 (proposed = 5) and the likelihood distribution is a standard normal — N(0,1) — then our proposal is 5-sigma away from the mode and sustains a very low likelihood (L_prop).

Furthermore, if our current parameter is 3 (curr = 3), then our current likelihood (L_curr) has a much larger value than the proposed likelihood (L_prop).

As such, the acceptance criterion is equal to the proposed likelihood over the current likelihood, which would be a very small number; for our example here: accept_crit = 0.000335 or 0.0335%

acceptance criterion = L(proposed) / L(current)
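
We can check these numbers directly; the Gaussian normalisations cancel, and the ratio reduces to exp((3^2 - 5^2)/2) = exp(-8):

import numpy as np

L_prop = np.exp(-5**2 / 2) / np.sqrt(2 * np.pi)  # proposed = 5
L_curr = np.exp(-3**2 / 2) / np.sqrt(2 * np.pi)  # curr = 3

# exp(-(25 - 9) / 2) = exp(-8)
print(L_prop / L_curr)  # ~0.000335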

Following which, we would draw the acceptance threshold from a uniform sample from zero to one: U[0, 1).

If acceptance criterion > acceptance threshold, then accept the new state.

If (proposed likelihood / current likelihood) > U[0, 1):

then: (accept) current = proposed

Else: (reject) keep the current state.

If the acceptance criterion is small — as in the example above — then there is not much space or “wiggle room” in the uniform probability for the acceptance criterion to be greater than the acceptance threshold.

From the example above, if the criterion is 0.0003, then only 0.03% of the range of U[0,1) could viably accept the proposed state. In clear, although imprecise, terms: there is a 0.03% chance that the example proposal state above could be accepted.
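
As a quick sanity check (not part of the algorithm itself), we can simulate how often a uniformly sampled threshold falls below this criterion:

import numpy as np

np.random.seed(42)

accept_crit = 0.000335
thresholds = np.random.uniform(0, 1, size=1_000_000)

# The fraction of thresholds below the criterion approximates
# the acceptance probability
print((accept_crit > thresholds).mean())  # ~0.0003, i.e. ~0.03%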

The following code shows the process, per step in the chain, by which the MH algorithm updates the Bayesian posterior samples when evaluating a newly proposed state. The mcmc_updater function below processes steps 3–7 in the algorithmic process above:

import numpy as np

def mcmc_updater(curr_state, curr_likeli, likelihood, proposal_distribution):
    """ Propose a new state and compare the likelihoods

    Given the current state (initially random), the current likelihood,
    the likelihood function, and the transition (proposal) distribution,
    `mcmc_updater` generates a new proposal, evaluates its likelihood,
    compares that to the current likelihood with a uniformly sampled
    threshold, then returns either the new or the current state
    in the MCMC chain.

    Args:
        curr_state (float): the current parameter/state value
        curr_likeli (float): the current likelihood estimate
        likelihood (function): a function handle to compute the likelihood
        proposal_distribution (function): a function handle to compute the
            next proposal state

    Returns:
        (tuple): either the current state or the new state
            and its corresponding likelihood
    """
    # Generate a proposal state using the proposal distribution
    # Proposal state == new guess state to be compared to current
    proposal_state = proposal_distribution(curr_state)

    # Calculate the acceptance criterion
    prop_likeli = likelihood(proposal_state)
    accept_crit = prop_likeli / curr_likeli

    # Generate a random number between 0 and 1
    accept_threshold = np.random.uniform(0, 1)

    # If the acceptance criterion is greater than the random number,
    # accept the proposal state as the current state
    if accept_crit > accept_threshold:
        return proposal_state, prop_likeli

    # Else: reject the proposal and keep the current state
    return curr_state, curr_likeli

Metropolis-Hastings from Scratch in Python

The following metropolis_hastings routine takes as input the required functions, hyperparameters, and initial conditions, then iterates the chain for up to num_samples iterations to sample the posterior distribution.

import numpy as np

def metropolis_hastings(
        likelihood, proposal_distribution, initial_state,
        num_samples, stepsize=0.5, burnin=0.2):
    """ Compute the Markov Chain Monte Carlo

    Args:
        likelihood (function): a function handle to compute the likelihood
        proposal_distribution (function): a function handle to compute the
            next proposal state
        initial_state (float): The initial condition to start the chain
        num_samples (integer): The number of samples to compute,
            or length of the chain
        stepsize (float): The standard deviation of the proposal
            distribution, i.e. the scale of each proposed jump
        burnin (float): a float value from 0 to 1.
            The percentage of chain considered to be the burnin length

    Returns:
        samples (list): The Markov Chain,
            samples from the posterior distribution
    """
    samples = []

    # The number of samples in the burn in phase
    idx_burnin = int(burnin * num_samples)

    # Bind the stepsize into the proposal distribution
    def proposal(state):
        return proposal_distribution(state, stepsize=stepsize)

    # Set the current state to the initial state
    curr_state = initial_state
    curr_likeli = likelihood(curr_state)

    for i in range(num_samples):
        # The proposal distribution sampling and comparison
        # occur within the mcmc_updater routine
        curr_state, curr_likeli = mcmc_updater(
            curr_state=curr_state,
            curr_likeli=curr_likeli,
            likelihood=likelihood,
            proposal_distribution=proposal
        )

        # Append the current state to the list of samples
        if i >= idx_burnin:
            # Only append after the burnin to avoid including
            # parts of the chain that are prior-dominated
            samples.append(curr_state)

    return samples

We will discuss the burnin input parameter of the above code later. In short, we only want to compute estimators over the posterior region of the chain, which is often considered to be the latter ~80% of the samples for a converged chain.

To use the functions above, we need to define the likelihood distribution and transition proposal distribution. In an experimentation context, the likelihood distribution is the distribution of probabilities that the proposed state (i.e., model parameters) explains the data. In addition, the proposal distribution is a function that generates proposed (new) states for the Markov chain, by sampling a Normal distribution around the current state.

For example, if we want to establish the likelihood as a standard normal distribution — i.e., a normal distribution with a mean of zero and a standard deviation of one — then, our likelihood distribution would be defined as:

def likelihood(x):
    # Standard Normal Distribution
    # An underlying assumption of linear regression is that the residuals
    # are Gaussian Normal distributed; often, Standard Normal distributed
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

Next, we must define the proposal distribution — i.e., how to generate the next guess. For our direct MH algorithm here, we chose a normal distribution with a mean equal to the current state and a fixed standard deviation, the step size: N(curr, stepsize). This is equivalent to adding a zero-mean Normal sample to the current state:

proposed = {N(curr, stepsize)} = curr + {N(0, stepsize)}

Sampling over the (transition) proposal distribution ensures that the next guess (proposed state) is near the current state and that it is randomly selected from a Gaussian (Normal) distribution.

The standard deviation of the proposal distribution is often referred to as the “step size”. The step size should be selected based on the scale of the parameter space being probed. For example, if we are measuring the temperature in degrees, then a standard deviation of half-degree jumps is reasonable. In contrast, if we are measuring the height of humans in meters, then a standard deviation of half-meter jumps is too large.

def proposal_distribution(x, stepsize=0.5):
    # Select the proposed state (new guess) from a Gaussian distribution
    # centered at the current state, with a width of `stepsize`
    return np.random.normal(x, stepsize)
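
Per the relation above, an equivalent formulation adds zero-mean Gaussian noise to the current state; either version can be passed as the proposal distribution:

def proposal_distribution_alt(x, stepsize=0.5):
    # Equivalent: perturb the current state with zero-mean Gaussian noise
    return x + np.random.normal(0, stepsize)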

Given the above functions, we can generate samples over the posterior distribution using the metropolis_hastings function above:

np.random.seed(42)

initial_state = 0 # Trivial case, starting at the mode of the likelihood
num_samples = int(1e4)
burnin = 0.2

samples = metropolis_hastings(
    likelihood,
    proposal_distribution,
    initial_state,
    num_samples,
    burnin=burnin
)

MCMC Estimators and the Burnin Region

The MCMC estimator is often taken to be the mean over the chain, with a Bayesian credible region of plus or minus the standard deviation of the chain.

estimator = mean(chain) +/- std(chain)

Because the initial conditions are set by the prior, we refer to early samples in the chain as prior-dominated. The posterior distribution is, necessarily, a subregion of the prior distribution. If the posterior presses against large areas of the prior boundaries, then the prior may be too restrictive or the likelihood may not be well designed to model the data.

Because the posterior is expected to sample a subregion of the prior’s extent — and a well-constrained MCMC chain is expected to iteratively sample the posterior — we tend to only accept the latter 80% of the Markov Chain when computing the estimators: mean +/- standard deviation over the chain. The initial 20% is referred to as the “burn-in” phase, which is often excluded from estimators over the final chain.

As we will see later, estimating after excluding the burn-in is equivalent to taking the estimator over the chain after idx_burnin = int(burnin * num_samples) samples:

estimator = mean(chain[idx_burnin:]) +/- std(chain[idx_burnin:])
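
Because the metropolis_hastings routine above already discards the burn-in before storing samples, the estimator is simply the mean and standard deviation over the returned chain:

samples = np.asarray(samples)

estimate = samples.mean()
uncertainty = samples.std()

print(f'Estimator: {estimate:.3f} +/- {uncertainty:.3f}')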

Visualisations for Intuition Building

To gain an intuitive grasp of MCMC chains and posterior distributions, the below figures exhibit the chain with

  • (pink) quantile boundaries,
  • (purple) coloured sections for “initial guess”,
  • (orange) “burn-in region”,
  • (magenta) “posterior samples”, and
  • (same colours) density plots in the right-hand figure.

The code to generate these chains and figures is included in the GitHub and Colab notebooks for this article.

plot_chain(samples, burnin=burnin, initial=burnin/10)
An MCMC chain probing a standard normal distribution. Figure created by Author; code available in GitHub and Colab Notebook

The chain in the figure above was configured to probe the likelihood of each state within the prior. For simplicity, this example started at an initial state of x=0 (the trivial solution).

The chain probed the posterior, which is — by design here — a standard normal: N(0,1). The right-hand side of the figure above exhibits three density estimates over each section of the chain: initial, burn-in, and posterior. The density estimates evolve from too tight (initial: purple) to a close approximation of a standard normal (posterior: magenta).

When fitting a model, the underlying assumption is that the residuals — data minus model — are Gaussian distributed. Often, this assumption is simplified to a standard normal distribution after normalising the feature vectors (input data). As such, the Normality of the MCMC chains is an expected result when the data is “well fit” by the model and the chains have converged. There is a full literature on “goodness of fit” that we will not review here. Instead, we will show a sequence of examples to illustrate the frequently observed behaviours.

Examples of MCMC Cases

Above, we showed the trivial case where the initial condition was set to the pre-defined mode over the likelihood. In the following examples, we elaborate on the purpose of excluding the prior-dominated and burn-in regions of the MCMC chain to minimise contamination from improper initial conditions and maximise probing of the posterior distribution.

Two MCMC chains probing a standard normal distribution. Figure created by Author; code available in GitHub and Colab Notebook

The two figures above were generated from the same simplified MCMC, probing a standard normal likelihood function. The first figure (left) was generated from a chain with an initial condition of x=5, which is therefore 5-sigma away from the mode of the likelihood function (x=0). The chain efficiently converges to a standard normal distribution within a few hundred steps.

The second figure (right) was generated from a chain with an initial condition of x = 40, which is thus 40-sigma away from the mean of the likelihood function (x=0). The chain also converges to a standard normal distribution within a few hundred steps.

The initial behaviour of the latter chain (starting at x=40) shows two essential characteristics of MCMC chains that illuminate the burn-in region and why we exclude it from our final solution. In particular, at 40-sigma away from the standard normal likelihood function (i.e., a very poor fit), the chain is flat for up to 100 steps. This flat behaviour is expected when the initial condition is too far away from the mode of a Normal likelihood distribution. We can understand that better by examining the MH algorithm again.

The algorithm computes two exponentials when comparing the current and proposal likelihoods. Either could be computationally close enough to zero that it underflows the digital (floating-point) representation. The acceptance criterion would therefore be either zero (the proposal likelihood underflows) or NaN (zero divided by zero, when both underflow), which offers no wiggle room for the acceptance threshold to accept the proposal. This results in no updates for up to 100 proposed samples.
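
For example, at 40-sigma the standard normal likelihood underflows to exactly zero in 64-bit floating point:

import numpy as np

# exp(-40**2 / 2) = exp(-800) is smaller than the smallest float64 value,
# so the likelihood evaluates to exactly zero
print(np.exp(-40**2 / 2) / np.sqrt(2 * np.pi))  # 0.0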

We narrow in on this example to illuminate that if an initial condition is too far from the maximum likelihood, then the MCMC may not converge, especially within the number of steps defined by the user.

Simplistic Ensemble MCMC

Because the initial conditions may cause the chain to fail to converge, it is often very useful to track multiple chains, after randomly sampling over the prior distribution to establish each of their initial conditions.

The simplistic ensemble method below provides multiple, synchronous experiments to probe the same likelihood and posterior distributions. The overlapping convergence of multiple MCMC chains — from randomly selected initial conditions — provides greater evidence that the specific model fits the data well.
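
A minimal sketch of such an ensemble, reusing the functions above and assuming a uniform prior over [-10, 10] for the initial conditions (that prior range is a hypothetical choice for illustration):

np.random.seed(42)

n_chains = 10
chains = []
for _ in range(n_chains):
    # Sample each chain's initial condition from the (assumed) prior
    initial_state = np.random.uniform(-10, 10)
    chains.append(metropolis_hastings(
        likelihood,
        proposal_distribution,
        initial_state,
        num_samples=int(1e4),
        burnin=0.2
    ))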

Simplistic Ensemble Methods: 10 MCMC chains probing the same likelihood and posterior distributions.

The figure above shows 10 uniquely initialised MCMC chains probing the same likelihood and posterior distributions. Note that one of the chains exhibited the initially flat behaviour from a very low initial likelihood evaluation.

Excluding the burn-in region (n < 200), all 10 chains visually converge together to form 10 standard normal distributed MCMC chains — see the right-hand side density estimate plot above. We will discuss more robust ensemble MCMC techniques in future articles — as well as convergence criteria.

The code for this article can be found in the GitHub and Colab notebook.
