Bayes to MCMC with Examples in Python

Jonathan Fraine
12 min read · Jan 8, 2023


Image Provided by WikiCommons under CC0

Bayesian inference is a method of statistical inference in which Bayes’ theorem is applied to update the probability for a hypothesis as more “evidence” — i.e., data — becomes available. Bayesian inference is an important technique in statistical modelling, scientific experiments, machine learning, and artificial intelligence.

Bayesian inference is used to estimate parameters and uncertainties — i.e., Bayesian Credible Intervals — in far too many fields to list here: physics, biology, sociology, economics, chemistry, astronomy, and engineering. It is even used in AI art, where stable diffusion models embed Bayesian inference within their neural networks.

Statistical Inference is the process of using statistical models and data to draw conclusions about a population or other aspects of reality. It is a central part of statistical analysis and is used to make predictions, estimate quantities and uncertainty regions, and test hypotheses.

Bayesian inference is based on the concept of updating our beliefs about the probability distribution that generated the observations (posterior) as we observe more data (evidence). In statistics, Belief is the quantitative constraint on uncertainty — e.g., the variance of a distribution. In Bayesian inference, we start with an initial set of beliefs about the probability distribution, represented by a prior distribution. We then update these beliefs using Bayes’ theorem as we gather more data (evidence). The posterior distribution represents our updated beliefs: the probability of the model or hypothesis, given the observed data.

Bayesian inference allows us to incorporate subjective beliefs and prior knowledge into our analysis, and then quantify the uncertainty in our inferences using probability distributions. Because Bayesian inference incorporates new evidence to update its priors — new data to reduce uncertainty over the models — we can use it as an iterative process to improve our estimates of the population with updated constraints or improved sets of data.

Bayes’ theorem

Bayesian inference is a statistical analysis technique that implements updates according to Bayes’ theorem. Bayes’ theorem is a mathematical formula for determining the conditional probability of an event, given prior knowledge or evidence about the event. It is written as:

P(Event | Data) = (P(Data | Event) * P(Event)) / P(Data)

This equation asks the question: what is the probability that “a thing happened” (Event), given that we have observed a set of data (Data)? Note that the Data is the “evidence that a thing happened”; as such, P(Data) is referred to as the evidence.

The terms of the theorem are:

  • P(Event | Data) is the posterior probability or probability of the event given the data
  • P(Data | Event) is the likelihood of the Data given the Event: i.e., the probability that the Data would be generated if the Event occurred
  • P(Event) is the prior probability of the Event: the probability that this Event could happen as described, before seeing any Data
  • P(Data) is the evidence: the probability of observing the Data at all, i.e., its marginal probability over all possible Events.
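
In Python, the theorem reduces to a single expression. Here is a minimal sketch, where bayes_posterior is an illustrative helper name for combining the three right-hand-side terms:

def bayes_posterior(likelihood, prior, evidence):
    # P(Event | Data) = P(Data | Event) * P(Event) / P(Data)
    return likelihood * prior / evidence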

Bayesian inference involves using Bayes’ theorem to update the probability for a hypothesis as more evidence (i.e., Data) becomes available.

Dan Foreman-Mackey of the Flatiron Institute interprets Bayesian inference as “estimating the Physics given the Data”:

P(Physics | Data) = P(Data | Physics) * P(Physics) / P(Data)

Where Physics is humanity’s logical interpretation of Reality.

Qualitative Example

As an example, suppose we want to know whether the temperature tomorrow (a January day) will be warm enough to wear shorts in Lagos. We start with a set of prior beliefs about the probability that it will be warm tomorrow, based on past January weather patterns in Lagos, Nigeria.

If we then receive new information — such as a prediction from our weather app of choice — we can apply Bayes’ theorem to update our prior belief and produce a more accurate estimate of the probability of shorts weather tomorrow.

  • P(Event | Data) is the posterior probability that tomorrow will be shorts weather, given the forecast
  • P(Data | Event) is the likelihood of receiving this forecast, given that tomorrow really is shorts weather
  • P(Event) is the prior probability of shorts weather, based on past Januaries in Lagos
  • P(Data) is the evidence: the probability of receiving this forecast at all

Quantitative Example

If we want to know whether we have Covid19, given a positive antigen rapid test result and a 70% chance of direct exposure (prior), then we can use Bayesian inference to estimate the probability of having Covid19.

  • The prior probability of exposure: P(Event) = 70%
  • Likelihood of a positive test result if we have Covid (true positive):
    P(Data | Event) = 90%
  • The probability that the test reports positive, whether true or false:
    P(Data) = 95% (90% True positive rate and 5% false positive rate)

Now we compute the posterior probability by introducing the information above into Bayes’ theorem:

P(Event | Data) = (P(Data | Event) * P(Event)) / P(Data)

This allows us to infer that the probability that we have Covid19 is:

P(Event | Data) = (0.9 * 0.7) / 0.95 = 0.663 or 66.3%

Now that we believe (66.3% probability) that we have Covid19, our flatmates can use that evidence to update their prior when testing themselves.

P(Event | Data) = (0.9 * 0.663) / 0.95 = 0.628 or 62.8%

Note that using 66.3% for our prior here assumes that we have had transmissive contact with our flatmates.

The use of existing knowledge (we probably have Covid19) to inform future observations (our flatmates probably have Covid19) is referred to as Bayesian Updating.
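
We can verify both calculations with a few lines of Python, reusing the bayes_posterior sketch from above with the probabilities quoted in this example:

# Probabilities from the example above
true_positive_rate = 0.90  # P(Data | Event)
p_positive_test = 0.95     # P(Data): 90% true positives + 5% false positives
prior_exposure = 0.70      # P(Event)

# Our posterior probability of having Covid19
our_posterior = bayes_posterior(true_positive_rate, prior_exposure, p_positive_test)
print(f'{our_posterior:.3f}')  # 0.663

# Our flatmates reuse our posterior as their prior (Bayesian Updating)
flatmate_posterior = bayes_posterior(true_positive_rate, our_posterior, p_positive_test)
print(f'{flatmate_posterior:.3f}')  # 0.628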

Bayesian Updating

Bayesian inference is an efficient framework for iterative computation because the underlying principle of Bayes’ theorem is that adding more data (evidence) should improve our constraint on the uncertainty of a given model (hypothesis).

Here is a procedure to implement a Bayesian update:

  1. Start with a set of possible explanations (hypotheses or models) for some observation or event. Additionally, we need a prior probability for each hypothesis; i.e., domain knowledge about the events and models. These prior probabilities reflect our initial beliefs about the likelihood of each hypothesis before we see any data. In the example above, the temperature in Lagos, Nigeria is highly unlikely to reach zero Celsius.
  2. Collect data and compute the likelihood of each hypothesis (model) given the data; e.g. the history of temperatures over years on the same day, the radar imagery showing wind patterns in nearby regions or waters, or a set of thermometers in distributed locations.
  3. Compute the posterior probability of each hypothesis (model) using Bayes’ theorem — the probability of the event given the data.
  4. The posterior probabilities are the update in our beliefs about future events; e.g., the temperature tomorrow or next week.

Bayesian Updating is an iterative process. We can continue to include new data (evidence) as it becomes available. In turn, this new data will likely update our beliefs further — improving our prediction and uncertainty estimations. Bayesian Updating allows us to continuously refine our understanding of an event or parameter when new information becomes accessible, as opposed to estimating once with all of the data after the fact.
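
As a sketch of steps 1–4 for a discrete set of hypotheses: the posterior for each hypothesis is its prior multiplied by its likelihood, normalised so that the posteriors sum to one (the normalising sum plays the role of the evidence). The hypothesis labels and probability values below are purely illustrative:

import numpy as np

def bayesian_update(priors, likelihoods):
    # Posterior is proportional to likelihood * prior;
    # the sum over hypotheses plays the role of the evidence
    unnormalised = np.asarray(priors) * np.asarray(likelihoods)
    return unnormalised / unnormalised.sum()

# Illustrative: two hypotheses about tomorrow's weather in Lagos
priors = [0.8, 0.2]        # P(warm), P(cool) before seeing the forecast
likelihoods = [0.9, 0.3]   # P(forecast says warm | each hypothesis)

posteriors = bayesian_update(priors, likelihoods)
print(posteriors)  # approx [0.923, 0.077]: the forecast strengthens our belief in warm

Each update’s posteriors can then serve as the priors for the next round of data, which is exactly the iterative refinement described above.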

Bayes to MCMC

Bayesian analysis is a statistical framework that uses Bayes’ theorem to update beliefs about an event by building inferences about hypotheses, models, or quantities from observed data. Bayesian inference, in turn, is the component of Bayesian analysis that applies Bayes’ theorem to draw conclusions when presented with data, hypotheses, and prior knowledge.

Image created by Author

Down the chain, Monte Carlo is the set of techniques within statistical analysis that use samples drawn from a distribution to estimate solutions or perform simulations. Markov Chain Monte Carlo (MCMC) is a flavour of this technique in which each new sample is generated from a probability distribution conditioned on the previous sample in the chain. MCMC is popular because it is a computationally tractable application of Bayesian inference.

Markov Chain: New sample = P(Previous Sample, Variance)

The evidence — i.e., P(Data) in Bayes’ theorem — is generally computationally intractable using standard numerical integration methods, because it technically requires integrating over all of reality. As such, techniques were derived for approximating this integral iteratively. The most well-established techniques are Markov Chain Monte Carlo and Nested Sampling — the latter to be discussed in a future article.

Monte Carlo methods use random sampling to generate a large number of samples (proposals): predictions or simulations. These proposals estimate probabilities, expectation values, and other statistical quantities. This approach is particularly useful when the underlying system or process that is being modelled is too complex or computationally expensive to analyse directly.

Markov chains are a mathematical technique for proposing sequential states of a system according to probabilistic rules. These states are then selected by comparing likelihoods against random draws to transition the system to an updated state. Markov chains are considered memoryless because the probability of transitioning to any particular state depends solely on the current state, not on the sequence of states that preceded it.

A Markov chain is the sequence of states proposed, selected, and transitioned to by the Monte Carlo method. An MCMC is thus defined by a set of states (e.g., model parameter values), a set of transition probabilities (i.e., the proposal distribution), and a likelihood associated with each state. The set of transition probabilities — or step distribution — specifies the probability of transitioning (stepping) from one state to another. The ratio of likelihoods between the current and proposed states then determines the probability of accepting the transition to the next state.

MCMC Example

Assume that the likelihood is standard normal distributed — i.e., N(0,1) a Gaussian with a mean of zero and a standard deviation of one — and that the prior is an open, uniform distribution: U(-inf, inf). As a result, a converged chain is likely to estimate a quantity within the region of zero +/- one.

To generate an MCMC chain for this example, start with an initial proposal (initial condition) init_proposal = 5 and define a transition probability of N(x, stepsize), with stepsize = 0.5. From these two quantities (and the likelihood) we can start the chain and iterate for a predefined number of samples (states) to probe the posterior by sampling the likelihood within the prior.

Likelihood: N(0,1)(x) = exp(-x²/2) / sqrt(2*pi)

When we compute the likelihood of the initial (or any newly proposed) state, we are determining the probability of observing that state under the given likelihood (a standard normal). Quick example: if we want to measure the melting temperature of ice, then what is the likelihood of observing a piece of ice at 50 degrees Celsius?
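
We can evaluate this likelihood numerically; plugging in the initial proposal reproduces the value quoted in step 1 of the walkthrough below:

import numpy as np

def likelihood(x):
    # Standard normal density: N(0,1)(x) = exp(-x²/2) / sqrt(2*pi)
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

print(likelihood(5.0))  # approx 1.487e-6: the initial proposal sits far in the tail
print(likelihood(0.0))  # approx 0.399: the peak of the standard normal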

The transition probability proposes a new state that is Gaussian distributed around the current state, with a standard deviation of stepsize = 0.5.

New proposal = N(Current proposal, step size)

Implementing an MCMC chain with too small a stepsize may take much longer to converge because each transition barely moves the state. In contrast, too large a stepsize causes most proposals to be rejected, so the chain may not converge within a practical run time.
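
To see this trade-off numerically, here is a short, illustrative experiment (reusing the likelihood function sketched above) that measures the fraction of accepted proposals for several stepsizes; the exact numbers will vary from run to run:

import numpy as np

def acceptance_fraction(stepsize, n_samples=5000, start=5.0):
    # Run a short Metropolis chain and count how often proposals are accepted
    state = start
    n_accepted = 0
    for _ in range(n_samples):
        proposal = np.random.normal(state, stepsize)
        if likelihood(proposal) / likelihood(state) > np.random.uniform(0, 1):
            state = proposal
            n_accepted += 1
    return n_accepted / n_samples

for stepsize in [0.05, 0.5, 5.0, 50.0]:
    # Tiny steps are almost always accepted but explore slowly;
    # huge steps are almost always rejected, so the chain barely moves
    print(stepsize, acceptance_fraction(stepsize))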

MCMC Chain in action

  1. Inputs: init_proposal = 5, proposal_distribution = N(x, 0.5), and likelihood N(0,1)(x). Compute the likelihood of our initial proposal:
    L_init_prop = 1.487e-6 = L_curr_prop
  2. Propose a new state: new_proposal = N(init_proposal, 0.5) = 5.248
  3. Compute the likelihood of this proposed state: L_new_prop = 4.164e-7
  4. Compute the acceptance probability (ratio of likelihoods):
    accept_prob = L_new_prop / L_curr_prop = 0.280
  5. Compute the acceptance threshold from a uniform distribution U[0,1)
    accept_threshold = 0.732
  6. Compare the acceptance probability to the acceptance threshold.
    If the acceptance probability is greater than the acceptance threshold, then keep the proposed state; else, keep the current state:
    accept_prob > accept_threshold = False
  7. Repeat steps 2–6 until the predefined number of samples has been computed, or the chain has converged.

Proposal | Likelihood | Acceptance Prob | Acceptance Threshold | Keep?
======================================================================
5.000 | 1.487e-6 | 0.000 | 0.000 | Init
5.248 | 4.164e-7 | 0.280 | 0.732 | False
4.931 | 2.096e-6 | 1.410 | 0.599 | True
... | ... | ... | ... | ...
0.120 | 0.396 | 1.031 | 0.746 | True
0.602 | 0.333 | 0.840 | 0.658 | True
0.808 | 0.288 | 0.865 | 0.568 | True
... | ... | ... | ... | ...
0.001 | 0.399 | 1.244 | 0.046 | True

The table above shows that — with iterative proposals, acceptance checking, and probabilistic state transitions — the MCMC chain has converged to a standard normal distribution: N(0,1).

Comparing MCMC to Deterministic MLE

It is important to understand that if the new likelihood is greater than the current likelihood, then the acceptance ratio — the ratio of the new likelihood to the current likelihood — will be larger than one. Because the acceptance threshold is a sample from the uniform distribution U[0,1), the comparison will always pass. Therefore, if the new likelihood is greater than the current likelihood, the proposal will always be accepted. Additionally, if the chain has converged, then successive iterations will generate samples randomly distributed around zero with a standard deviation of one.

In contrast to deterministic MLE procedures, if the new likelihood is less than the current likelihood, there is still a chance that the acceptance threshold — a sample from U[0,1) — will be smaller than the acceptance ratio. As such, there is a non-zero probability that a less likely proposal will be accepted. This mechanism allows the MCMC technique to avoid becoming, or remaining, stuck at local maxima.

Furthermore, deterministic MLE techniques consider the algorithm to have converged once it determines that successive iterations will not increase the likelihood estimate, which may trap the deterministic MLE algorithm at a local maximum.

These contrasts illuminate how MCMC is able to approach the global maximum by overcoming the trap of local maxima. A deterministic MLE often stops iterating when a local maximum has been found, hoping that it is the global maximum; an MCMC may continue iterating to find better local maxima or the global maximum.

Code to Generate Examples

To generate the table above, we used the following Python procedures:

import numpy as np


def metropolis_hastings_iterator(curr_state, verbose=False):
    # Propose a new state from a Gaussian centred on the current state
    proposal = np.random.normal(curr_state, 0.5)

    # Standard normal likelihoods of the current and proposed states
    l_curr = np.exp(-curr_state**2 / 2) / np.sqrt(2 * np.pi)
    l_prop = np.exp(-proposal**2 / 2) / np.sqrt(2 * np.pi)

    # Acceptance probability (ratio of likelihoods) vs a U[0,1) threshold
    accept_prob = l_prop / l_curr
    accept_thresh = np.random.uniform(0, 1)

    if verbose:
        print(
            proposal,
            l_prop,
            accept_prob,
            accept_thresh,
            accept_prob > accept_thresh
        )

    if accept_prob > accept_thresh:
        return proposal

    return curr_state


def mcmc(initial_state=0, n_samples=1, verbose=False):
    chain = []
    curr_state = initial_state
    for _ in range(n_samples):
        curr_state = metropolis_hastings_iterator(
            curr_state,
            verbose=verbose
        )
        chain.append(curr_state)

    return chain


if __name__ == '__main__':
    np.random.seed(42)
    initial_state = 5.0
    n_samples = 1000

    chain = mcmc(
        initial_state=initial_state,
        n_samples=n_samples,
        verbose=False
    )

Below is a visualisation of the MCMC chain for 1000 steps. It starts at init_proposal = 5 and converges to N(0,1) after about 200 steps. The states before convergence (x < 200) are referred to as the burn-in region. This section is often excluded when computing any estimator because it is dominated by the arbitrary initial condition — representing the steps before the chain has achieved convergence. We left the burn-in region in the density figure below for illustration purposes.

Figure created by Author

The distribution plot on the right represents the probability density over the full chain — including the burn-in region. Because the burn-in is approximately 200 steps, the majority of the states sampled (x > 200) are post-convergence. As such, the density plot represents a standard normal N(0,1), with a long tail towards larger values to represent the burn-in region — left in for illustrative purposes only.
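
In practice, we would drop the burn-in region before computing any estimators. A minimal sketch, assuming the roughly 200-step burn-in read off the trace above and the chain list returned by mcmc in the code above:

import numpy as np

burn_in = 200  # estimated visually from the trace of the chain
posterior_samples = np.asarray(chain[burn_in:])

print(posterior_samples.mean())  # approx 0: the mean of N(0,1)
print(posterior_samples.std())   # approx 1: the standard deviation of N(0,1)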

Here is the Python code for plotting the chain with an attached density plot:

from plotly import express as px
from plotly import graph_objs as go


def plot_chain(chain):
    title = 'Example MCMC Chain with Standard Normal Likelihood'

    # Trace of the chain over iterations
    fig = px.line(y=chain, title=title)

    fig.update_layout({
        'plot_bgcolor': 'rgba(0,0,0,0)',
        'paper_bgcolor': 'rgba(0,0,0,0)',
    })

    # Attach a half-violin as the density plot on a secondary axis
    fig.add_trace(
        go.Violin(
            y=chain,
            side='positive',
            xaxis='x2',
            yaxis='y2',
            showlegend=False
        )
    )

    width = 1600
    height = 800
    margins = {'l': 20, 'r': 20, 't': 50, 'b': 20}

    layout = {
        'showlegend': False,
        'xaxis': {'domain': [0, 0.88]},
        'xaxis2': {'domain': [0.9, 1]},
        'yaxis2': {'anchor': 'x2'},
        'margin': margins,
        'width': width,
        'height': height
    }
    fig.update_layout(layout)
    fig.update_traces(line_color='orange', line_width=5)

    fig.show()


plot_chain(chain)

The code for this tutorial can be found on GitHub.


