Markov chain Monte Carlo (MCMC)
Markov chain Monte Carlo (MCMC) algorithms, also called samplers, are numerical approximation algorithms. There are too many MCMC algorithms to review here. Popular families (which often overlap) include Gibbs sampling, Metropolis-Hastings, Random-Walk Metropolis (RWM), and slice sampling, as well as hybrid algorithms such as Hamiltonian Monte Carlo and Metropolis-within-Gibbs (Tierney 1994), and algorithms for specific methods, such as Updating Adaptive Metropolis-within-Gibbs for state-space models (SSMs). RWM was developed first (Metropolis et al. 1953), and Metropolis-Hastings generalized RWM (Hastings 1970). Many MCMC algorithms can be viewed as special cases of the Metropolis-Hastings algorithm. Regardless of the algorithm, the goal in Bayesian inference is to sample from the unnormalized joint posterior distribution and to collect samples of the target distributions, which are marginal posterior distributions, to be used later for inference.
The most generalizable MCMC algorithm is the Metropolis-Hastings (MH) generalization of the RWM algorithm. The MH algorithm extended RWM to allow asymmetric proposal distributions. For years, the main disadvantage of the RWM and MH algorithms was that the proposal variance (see below) had to be tuned manually, and therefore other MCMC algorithms became popular because they do not need manual tuning.
Gibbs sampling became popular for Bayesian inference, though it requires sampling from full conditional distributions, which is straightforward only with conjugacy, so in its purest form it is precluded from non-conjugate models. Gibbs sampling also suffers under high correlations (Gilks and Roberts 1996). Due to these limitations, Gibbs sampling is less generalizable than RWM, though RWM and other algorithms are not immune to problems with correlation. The original slice sampling algorithm of Neal (1997) is a special case of Gibbs sampling that samples a distribution by sampling uniformly from the region under the plot of its density function, and is more appropriate for bounded distributions that cannot approach infinity, though the improved slice sampler of Neal (2003) is available here.
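To fix ideas, here is a minimal sketch of univariate slice sampling in the stepping-out-and-shrinkage style of Neal (2003), written in Python rather than the package's own R; all names and the target density are illustrative assumptions, not part of LaplacesDemon.

```python
import math
import random

def slice_sample(x0, log_f, w=1.0, max_steps=50, rng=random):
    """One univariate slice-sampling update (stepping out, then shrinkage).

    `log_f` is the log density, known only up to a constant.
    """
    log_y = log_f(x0) + math.log(rng.random())   # slice level under the density
    # Stepping out: grow an interval around x0 until both ends fall below the slice.
    left = x0 - w * rng.random()
    right = left + w
    for _ in range(max_steps):
        if log_f(left) < log_y:
            break
        left -= w
    for _ in range(max_steps):
        if log_f(right) < log_y:
            break
        right += w
    # Shrinkage: sample uniformly from the interval, shrinking it on rejection.
    while True:
        x1 = left + (right - left) * rng.random()
        if log_f(x1) >= log_y:
            return x1
        if x1 < x0:
            left = x1
        else:
            right = x1

# Illustrative target: an unnormalized standard normal density.
random.seed(7)
log_f = lambda x: -0.5 * x * x
chain, x = [], 0.0
for _ in range(3000):
    x = slice_sample(x, log_f)
    chain.append(x)
```

Because each update draws from the region under the density itself, slice sampling needs no proposal variance to tune, which is the property the text highlights.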
There are valid ways to tune the proposal variance of the RWM algorithm as it updates. This practice is known by many names, including adaptive Metropolis and adaptive MCMC, among others.
Usually, there is more than one target distribution, in which case it must be determined whether it is best to sample from target distributions individually, in groups, or all at once. Block updating refers to splitting a multivariate vector into groups called blocks, and each block is sampled separately. A block may contain one or more parameters.
Parameters are usually grouped into blocks such that parameters within a block are as correlated as possible, and parameters between blocks are as independent as possible. This strategy retains as much of the parameter correlation as possible for blockwise sampling, as opposed to componentwise sampling, where parameter correlation is ignored. The PosteriorChecks function can be used on the output of previous runs to find highly correlated parameters, and the Blocks function may be used to create blocks based on posterior correlation.
Blockwise sampling has several advantages: a different MCMC algorithm may be used for each block (or even each parameter), allowing a more specialized approach (though different algorithms by block are not supported here); the acceptance rate of a newly proposed state is likely to be higher than when sampling from all target distributions at once in high dimensions; and large proposal covariance matrices can be reduced in size, which again is most helpful in high dimensions.
Disadvantages of blockwise sampling are that correlations probably exist between parameters in different blocks, yet each block is updated while the other blocks are held constant, ignoring these between-block correlations. Without simultaneously taking everything into account, the algorithm may converge slowly or never arrive at the proper solution. However, there are instances when it may be best not to take everything into account at once, such as in state-space models. Also, as the number of blocks increases, more computation is required, which slows the algorithm. In general, blockwise sampling allows a more specialized approach at the expense of accuracy, generalization, and speed. Blockwise sampling is offered in the following algorithms: Adaptive-Mixture Metropolis (AMM), Elliptical Slice Sampling (ESS), Hit-And-Run Metropolis (HARM), and Random-Walk Metropolis (RWM).
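The blockwise update described above can be sketched as a toy in Python (not the package's implementation; the block structure, scales, and names are illustrative assumptions): each block is proposed and accepted or rejected while the remaining parameters are held fixed.

```python
import math
import random

def blockwise_rwm_step(theta, log_post, blocks, scales, rng=random):
    """One sweep of blockwise random-walk Metropolis.

    `blocks` is a list of index lists; each block gets its own proposal
    scale and is updated while the other parameters are held constant.
    """
    theta = list(theta)
    for block, scale in zip(blocks, scales):
        proposal = list(theta)
        for i in block:                                # perturb only this block
            proposal[i] += rng.gauss(0.0, scale)
        log_alpha = log_post(proposal) - log_post(theta)
        if math.log(rng.random()) < log_alpha:
            theta = proposal                           # accept the block proposal
    return theta

# Illustrative target: three independent standard normals in two blocks.
log_post = lambda th: -0.5 * sum(x * x for x in th)
random.seed(2)
theta, chain2 = [0.0, 0.0, 0.0], []
for _ in range(3000):
    theta = blockwise_rwm_step(theta, log_post, blocks=[[0, 1], [2]], scales=[1.0, 2.4])
    chain2.append(theta[2])
```

Note that each sweep evaluates the model once per block, which is the computational cost the text attributes to increasing the number of blocks.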
In MCMC algorithms, each iterative estimate of a parameter is part of a changing state. The succession of states or iterations constitutes a Markov chain when the current state is influenced only by the previous state. In random-walk Metropolis (RWM), a proposed future state, called a proposal or candidate, is drawn, and the ratio of the proposed to the current unnormalized joint posterior density, called α, determines acceptance. In practice, the logarithm of the unnormalized joint posterior density is used, so log(α) is the proposed density minus the current density. The proposed state is accepted, replacing the current state, with probability 1 when it is an improvement over the current state, and may still be accepted otherwise if the logarithm of a random draw from the uniform distribution on (0,1) is less than log(α). Otherwise, the proposed state is rejected, and the current state is repeated so that another proposal may be attempted at the next iteration. By comparing log(α) to the log of a uniform random number when the proposal is not an improvement, random-walk behavior is included in the algorithm, and it is possible for the algorithm to backtrack while it explores.
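The accept/reject step just described can be written compactly; the following is a minimal Python sketch (an illustration, not LaplacesDemon's R code, and the target and names are assumptions) for a scalar parameter.

```python
import math
import random

def rwm_step(theta, log_post, scale, rng=random):
    """One random-walk Metropolis step for a scalar parameter.

    `log_post` is the log of the unnormalized posterior density.
    Returns the new state and whether the proposal was accepted.
    """
    proposal = theta + rng.gauss(0.0, scale)          # symmetric random-walk proposal
    log_alpha = log_post(proposal) - log_post(theta)  # log acceptance ratio
    if math.log(rng.random()) < log_alpha:            # accept with probability min(1, alpha)
        return proposal, True
    return theta, False                               # reject: repeat the current state

# Illustrative target: an unnormalized standard normal density.
log_post = lambda x: -0.5 * x * x
random.seed(1)
chain = [0.0]
for _ in range(5000):
    theta, _ = rwm_step(chain[-1], log_post, scale=2.4)
    chain.append(theta)
```

An improvement (log(α) ≥ 0) is always accepted because log of a uniform draw is negative; a worse proposal is accepted only with probability α, which is the random-walk behavior discussed next.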
Random-walk behavior is desirable because it allows the algorithm to explore, and hopefully avoid getting trapped in undesirable regions. On the other hand, random-walk behavior is undesirable because it takes longer to converge to the target distribution while the algorithm explores. The algorithm generally progresses in the right direction, but may periodically wander away. Such exploration may uncover multimodal target distributions, which other algorithms may fail to recognize, and then converge incorrectly. With enough iterations, RWM is guaranteed theoretically to converge to the correct target distribution, regardless of the starting point of each parameter, provided the proposal variance for each proposal of a target distribution is sensible. Nonetheless, multimodal target distributions are often problematic.
Multiple parameters usually exist, and therefore correlations may occur between the parameters. Many MCMC algorithms attempt to estimate multivariate proposals, usually taking correlations into account through a covariance matrix.
The acceptance or rejection of each proposal should be observed at the completion of the RWM algorithm as the acceptance rate, which is the number of acceptances divided by the total number of iterations. If the acceptance rate is too high, then the proposal variance or covariance is too small. In this case, the algorithm will take longer than necessary to find the target distribution and the samples will be highly autocorrelated. If the acceptance rate is too low, then the proposal variance or covariance is too large, and the algorithm is ineffective at exploration. In the worst case scenario, no proposals are accepted and the algorithm fails to move. Under theoretical conditions, the optimal acceptance rate for a sole, independent and identically distributed (IID), Gaussian, marginal posterior distribution is 0.44 or 44%. The optimal acceptance rate for an infinite number of distributions that are IID and Gaussian is 0.234 or 23.4%.
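The relationship between proposal variance and acceptance rate can be demonstrated with a small experiment; this Python sketch (illustrative only, not the package's code) runs RWM on a one-dimensional Gaussian target with three proposal scales.

```python
import math
import random

def acceptance_rate(scale, n=4000, seed=3):
    """Run random-walk Metropolis on a standard-normal target and
    return the observed acceptance rate for a given proposal scale."""
    rng = random.Random(seed)
    log_post = lambda x: -0.5 * x * x   # unnormalized log density
    theta, accepts = 0.0, 0
    for _ in range(n):
        proposal = theta + rng.gauss(0.0, scale)
        if math.log(rng.random()) < log_post(proposal) - log_post(theta):
            theta, accepts = proposal, accepts + 1
    return accepts / n

small = acceptance_rate(0.1)   # steps too small: almost everything accepted
tuned = acceptance_rate(2.4)   # near the 0.44 optimum for a 1-D Gaussian target
large = acceptance_rate(50.0)  # steps too large: almost everything rejected
```

The tiny scale yields high acceptance but highly autocorrelated samples, and the huge scale yields an algorithm that barely moves, matching the two failure modes described above.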
RWM may be used without any algorithm specifications, or by supplying a list of blocks for blockwise sampling. Blockwise RWM sampling also requires a list of proposal covariance matrices.
Markov Chain Properties
Only the basics of Markov chain properties are introduced here. A Markov chain is Markovian when the current iteration depends only on the previous iteration. Many (but not all) adaptive algorithms are merely chains, not Markov chains, when the adaptation is based on the history of the chains rather than just the previous iteration. A Markov chain is said to be aperiodic when it does not repeat a fixed cycle of states. A Markov chain is considered irreducible when it is possible to go from any state to any other state, though not necessarily in one iteration. A Markov chain is said to be recurrent if it will eventually return to a given state with probability 1; it is positive recurrent if the expected return time is finite, and null recurrent otherwise. A Markov chain is ergodic when it is aperiodic, irreducible, and positive recurrent, and the ergodic theorem then guarantees that averages over the chain converge to expectations under the target distribution.
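A toy finite-state example illustrates ergodicity: an irreducible, aperiodic two-state chain converges to a unique stationary distribution from any starting state. The transition matrix below is an arbitrary illustration, not drawn from the text.

```python
# Rows are current states, columns are next states; rows sum to 1.
# Both states communicate and self-transitions are possible, so the
# chain is irreducible and aperiodic, hence ergodic.
P = [[0.9, 0.1],
     [0.4, 0.6]]

def step(dist, P):
    """Advance a probability distribution one transition (dist times P)."""
    return [sum(dist[i] * P[i][j] for i in range(len(P))) for j in range(len(P))]

dist = [1.0, 0.0]          # start entirely in state 0
for _ in range(200):
    dist = step(dist, P)
# The stationary distribution solves pi = pi P; here pi = (0.8, 0.2),
# and the iterates converge to it regardless of the start.
```

Repeating the experiment from `dist = [0.0, 1.0]` reaches the same limit, which is what "regardless of the starting point" means for an ergodic chain.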
The non-Markovian chains of an adaptive algorithm that adapts based on the history of the chains must satisfy two conditions: containment and diminishing adaptation. Containment is difficult to verify in practice. The condition of diminishing adaptation is fulfilled when the amount of adaptation diminishes with the length of the chain. Diminishing adaptation can be achieved by making the adjustments to the proposal variance smaller, or by decreasing the probability of performing adaptations, as the number of iterations grows (Roberts and Rosenthal 2007).
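One common way to achieve diminishing adaptation is a Robbins-Monro-style update whose step size shrinks with the iteration number. The following Python sketch (illustrative names and constants; not one of the package's adaptive algorithms) adapts a proposal scale toward the 0.44 acceptance-rate target for a one-dimensional Gaussian.

```python
import math
import random

def adaptive_rwm(log_post, n=5000, seed=4):
    """Random-walk Metropolis with a diminishing adaptation of the scale.

    The adaptation step size shrinks like 1/sqrt(t), so the amount of
    adaptation diminishes and the chain is asymptotically Markovian.
    """
    rng = random.Random(seed)
    theta, log_scale, accepts = 0.0, 0.0, 0
    for t in range(1, n + 1):
        proposal = theta + rng.gauss(0.0, math.exp(log_scale))
        log_alpha = log_post(proposal) - log_post(theta)
        alpha = min(1.0, math.exp(log_alpha))
        if rng.random() < alpha:
            theta, accepts = proposal, accepts + 1
        # Nudge the log scale toward the 0.44 acceptance target, with a
        # step size that vanishes as the chain grows.
        log_scale += (alpha - 0.44) / math.sqrt(t)
    return theta, math.exp(log_scale), accepts / n

_, scale, rate = adaptive_rwm(lambda x: -0.5 * x * x)
```

Because the adjustment at iteration t is bounded by 1/sqrt(t), the total adaptation per unit time goes to zero, which is the diminishing-adaptation condition in the sense of Roberts and Rosenthal (2007).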
The MCMC algorithms in the LaplacesDemon package are now presented alphabetically:
- Adaptive Directional Metropolis-within-Gibbs (ADMG)
- Adaptive Griddy-Gibbs (AGG)
- Adaptive Hamiltonian Monte Carlo (AHMC)
- Adaptive Metropolis (AM)
- Adaptive Metropolis-within-Gibbs (AMWG)
- Adaptive-Mixture Metropolis (AMM)
- Affine-Invariant Ensemble Sampler (AIES)
- Componentwise Hit-And-Run Metropolis (CHARM)
- Delayed Rejection Adaptive Metropolis (DRAM)
- Delayed Rejection Metropolis (DRM)
- Differential Evolution Markov Chain (DEMC)
- Elliptical Slice Sampling (ESS)
- Griddy-Gibbs (GG)
- Hamiltonian Monte Carlo (HMC)
- Hamiltonian Monte Carlo with Dual-Averaging (HMCDA)
- Hit-And-Run Metropolis (HARM)
- Independence Metropolis (IM)
- Interchain Adaptation (INCA)
- Metropolis-Adjusted Langevin Algorithm (MALA)
- Metropolis-within-Gibbs (MWG)
- No-U-Turn Sampler (NUTS)
- Random-Walk Metropolis (RWM)
- Reversible-Jump (RJ)
- Robust Adaptive Metropolis (RAM)
- Sequential Adaptive Metropolis-within-Gibbs (SAMWG)
- Sequential Metropolis-within-Gibbs (SMWG)
- Slice Sampling (Slice)
- Stochastic Gradient Langevin Dynamics (SGLD)
- Tempered Hamiltonian Monte Carlo (THMC)
- t-walk (twalk)
- Updating Sequential Adaptive Metropolis-within-Gibbs (USAMWG)
- Updating Sequential Metropolis-within-Gibbs (USMWG)
Algorithm Summary Table
Each algorithm, presented above, has an algorithm summary table that is meant to provide a quick overview and facilitate comparisons between algorithms. Below is a description of the contents of these tables.
- Acceptance Rate: The acceptance rate is the number of accepted proposals divided by the total number of iterations. When known, the optimal acceptance rate and a recommended interval are reported.
- Applications: Since no algorithm suits every problem, this provides guidance on the applicability of the algorithm.
- Difficulty: The difficulty of using the algorithm is considered from a beginner's perspective. Considerations include the number of algorithm specifications, tuning the algorithm, whether or not it may be used as a final algorithm, and general experience with use.
- Final Algorithm?: This description is usually 'Yes', 'No', or 'User Discretion', followed by an explanation. When a 'final algorithm' seems to have converged, the samples may be used for inference. When an algorithm is not a final algorithm, one is suggested.
- Proposal: This describes how the proposal is generated, usually 'Multivariate' or 'Componentwise', though other descriptions exist as well. In general, a multivariate proposal means that at each iteration a proposal is generated for all parameters, taking correlation into account, usually from a multivariate normal distribution with a proposal covariance matrix. Componentwise proposals usually indicate that a proposal is made for each parameter separately, without considering correlation. A componentwise algorithm must evaluate the model once per parameter per iteration, so componentwise algorithms are considerably slower per iteration, and they slow further as the number of parameters increases.
After an MCMC algorithm is selected, a model is specified, and the model is updated over numerous iterations, with the priors updated given the likelihood. A question with MCMC is: when should updating stop? Many stopping rules have been proposed. The most popular single-chain stopping rule involves the Monte Carlo Standard Error (MCSE), and the most popular parallel-chain stopping rule uses the Gelman-Rubin diagnostic.
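The MCSE-based stopping rule can be sketched with a batch-means estimator; this Python toy (an illustration, not the package's MCSE function) estimates the Monte Carlo standard error of the chain mean while allowing for autocorrelation within batches.

```python
import math
import random

def mcse_batch_means(chain, n_batches=20):
    """Monte Carlo standard error of the chain mean via batch means.

    The chain is cut into batches; the variability of the batch means
    estimates the sampling error of the overall mean even when samples
    within a batch are autocorrelated.
    """
    n = len(chain) // n_batches * n_batches
    size = n // n_batches
    means = [sum(chain[i * size:(i + 1) * size]) / size for i in range(n_batches)]
    grand = sum(means) / n_batches
    var = sum((m - grand) ** 2 for m in means) / (n_batches - 1)
    return math.sqrt(var / n_batches)

# Illustrative chain: independent standard-normal draws, for which the
# true standard error of the mean is 1/sqrt(10000) = 0.01.
rng = random.Random(5)
chain = [rng.gauss(0.0, 1.0) for _ in range(10000)]
se = mcse_batch_means(chain)
```

A typical rule of this kind keeps updating until the MCSE falls below a small fraction (for example, 5 percent) of the estimated posterior standard deviation.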
After the user decides to stop iterating MCMC and updating the model, MCMC diagnostics are performed to assess non-convergence. Markov (and non-Markovian) chains cannot be confirmed to have converged, so diagnostics are used to look for signs of non-convergence. Numerous MCMC diagnostics have been proposed. It is customary to observe trace plots and autocorrelation plots of the chains, assess stationarity with the Geweke diagnostic, and search for signs of non-convergence with other diagnostics. The LaplacesDemon software package offers a variety of MCMC diagnostics.
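As an illustration of one such diagnostic, the following is a simplified Geweke-style comparison of the early and late segments of a chain. The real diagnostic estimates segment variances from the spectral density at frequency zero; this Python sketch (illustrative names throughout) uses naive variances, which is adequate only for chains with low autocorrelation.

```python
import math
import random

def geweke_z(chain, first=0.1, last=0.5):
    """Simplified Geweke diagnostic: z-score comparing the mean of the
    first 10% of the chain with the mean of the last 50%.

    Under stationarity the two means should agree, so |z| well above 2
    is a sign of non-convergence.
    """
    a = chain[: int(len(chain) * first)]
    b = chain[int(len(chain) * (1 - last)):]
    mean = lambda x: sum(x) / len(x)
    var = lambda x: sum((v - mean(x)) ** 2 for v in x) / (len(x) - 1)
    return (mean(a) - mean(b)) / math.sqrt(var(a) / len(a) + var(b) / len(b))

# Illustrative stationary chain: independent standard-normal draws.
rng = random.Random(6)
stationary = [rng.gauss(0.0, 1.0) for _ in range(5000)]
z = geweke_z(stationary)
```

For a chain still drifting away from its starting point, the early-segment mean differs from the late-segment mean and |z| grows large, which is the sign of non-convergence the diagnostic looks for.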
- Gilks W, Roberts G (1996). "Strategies for Improving MCMC." In W Gilks, S Richardson, D Spiegelhalter (eds.), Markov Chain Monte Carlo in Practice, pp. 89-114. Chapman & Hall, Boca Raton, FL.
- Hastings W (1970). "Monte Carlo Sampling Methods Using Markov Chains and Their Applications." Biometrika, 57(1), 97-109.
- Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E (1953). "Equation of State Calculations by Fast Computing Machines." Journal of Chemical Physics, 21, 1087-1092.
- Neal R (1997). "Markov Chain Monte Carlo Methods Based on Slicing the Density Function." Technical Report, University of Toronto.
- Neal R (2003). "Slice Sampling (with Discussion)." Annals of Statistics, 31(3), 705-767.
- Roberts G, Rosenthal J (2007). "Coupling and Ergodicity of Adaptive Markov Chain Monte Carlo Algorithms." Journal of Applied Probability, 44, 458-475.
- Tierney L. (1994). "Markov Chains for Exploring Posterior Distributions." The Annals of Statistics, 22(4), 1701-1762. With discussion and a rejoinder by the author.