Markov chain Monte Carlo (MCMC)
Markov chain Monte Carlo (MCMC) algorithms, also called samplers, are numerical approximation algorithms. There are a large number of MCMC algorithms, too many to review here. Popular families (which are often non-distinct) include Gibbs sampling, Metropolis-Hastings, slice sampling, Hamiltonian Monte Carlo, and many others. Though the name is misleading, Metropolis-within-Gibbs (MWG) was developed first (Metropolis, Rosenbluth, and Teller 1953), and Metropolis-Hastings was a generalization of MWG (Hastings 1970). All MCMC algorithms are known as special cases of the Metropolis-Hastings algorithm. Regardless of the algorithm, the goal in Bayesian inference is to maximize the unnormalized joint posterior distribution and collect samples of the target distributions, which are marginal posterior distributions, later to be used for inference.
The most generalizable MCMC algorithm is the Metropolis-Hastings (MH) generalization of the MWG algorithm. The MH algorithm extended MWG to include asymmetric proposal distributions. For years, the main disadvantage of the MWG and MH algorithms was that the proposal variance (see below) had to be tuned manually, and therefore other MCMC algorithms have become popular because they do not need to be tuned.
Gibbs sampling became popular for Bayesian inference, though it requires conditional sampling of conjugate distributions, so it is precluded from non-conjugate sampling in its purest form. Gibbs sampling also suffers under high correlations (Gilks and Roberts 1996). Due to these limitations, Gibbs sampling is less generalizable than RWM, though RWM and other algorithms are not immune to problems with correlation. The original slice sampling algorithm of Neal (1997) is a special case of Gibbs sampling that samples a distribution by sampling uniformly from the region under the plot of its density function, and is more appropriate with bounded distributions that cannot approach infinity, though the improved Slice sampler of Neal (2003) is available here.
Usually, there is more than one target distribution, in which case it must be determined whether it is best to sample from target distributions individually, in groups, or all at once. Block updating refers to splitting a multivariate vector into groups called blocks, and each block is sampled separately. A block may contain one or more parameters.
Parameters are usually grouped into blocks such that parameters within a block are as correlated as possible, and parameters between blocks are as independent as possible. This strategy retains as much of the parameter correlation as possible for blockwise sampling, as opposed to componentwise sampling where parameter correlation is ignored. The
PosteriorChecks function can be used on the output of previous runs to find highly correlated parameters, and the
Blocks function may be used to create blocks based on posterior correlation.
Advantages of blockwise sampling are that a different MCMC algorithm may be used for each block (or parameter, for that matter), creating a more specialized approach (though different algorithms by block are not supported here), the acceptance of a newly proposed state is likely to be higher than sampling from all target distributions at once in high dimensions, and large proposal covariance matrices can be reduced in size, which is most helpful again in high dimensions.
Disadvantages of blockwise sampling are that correlations probably exist between parameters between blocks, and each block is updated while holding the other blocks constant, ignoring these correlations of parameters between blocks. Without simultaneously taking everything into account, the algorithm may converge slowly or never arrive at the proper solution. However, there are instances when it may be best when everything is not taken into account at once, such as in state-space models. Also, as the number of blocks increases, more computation is required, which slows the algorithm. In general, blockwise sampling allows a more specialized approach at the expense of accuracy, generalization, and speed. Blockwise sampling is offered in the following algorithms: Adaptive-Mixture Metropolis (AMM), Elliptical Slice Sampler (ESS), Hit-And-Run Metropolis (HARM), and Random-Walk Metropolis (RWM).
Markov chain Properties
Only the basics of Markov chain properties are introduced here. A Markov chain is Markovian when the current iteration depends only on the previous iteration. Many (but not all) adaptive algorithms are merely chains but not Markov chains when the adaptation is based on the history of the chains, not just the previous iteration. A Markov chain is said to be aperiodic when it is not repeating a cycle. A Markov chain is considered irreducible when it is possible to go from any state to any other state, though not necessarily in one iteration. A Markov chain is said to be recurrent if it will eventually return to a given state with probability 1, and it is positive recurrent if the expected return time is finite, and null recurrent otherwise. The ergodic theorem states that a Markov chain is ergodic when it is aperiodic, irreducible, and positive recurrent.
The non-Markovian chains of an adaptive algorithm that adapt based on the history of the chains should have two conditions: containment and diminishing adaptation. Containment is difficult to implement. The condition of diminishing adaptation is fulfilled when the amount of adaptation diminishes with the length of the chain. Diminishing adaptation can be achieved when the proposal variances become smaller or by decreasing the probability of performing adaptations with more iterations (Roberts and Rosenthal 2007).
The MCMC algorithms in the LaplacesDemon package are now presented alphabetically:
- Adaptive Directional Metropolis-within-Gibbs (ADMG)
- Adaptive Griddy-Gibbs (AGG)
- Adaptive Hamiltonian Monte Carlo (AHMC)
- Adaptive Metropolis (AM)
- Adaptive Metropolis-within-Gibbs (AMWG)
- Adaptive-Mixture Metropolis (AMM)
- Affine-Invariant Ensemble Sampler (AIES)
- Componentwise Hit-And-Run Metropolis (CHARM)
- Delayed Rejection Adaptive Metropolis (DRAM)
- Delayed Rejection Metropolis (DRM)
- Differential Evolution Markov Chain (DEMC)
- Elliptical Slice Sampler (ESS)
- Gibbs Sampler (Gibbs)
- Griddy-Gibbs (GG)
- Hamiltonian Monte Carlo (HMC)
- Hamiltonian Monte Carlo with Dual-Averaging (HMCDA)
- Hit-And-Run Metropolis (HARM)
- Independence Metropolis (IM)
- Interchain Adaptation (INCA)
- Metropolis-Adjusted Langevin Algorithm (MALA)
- Metropolis-Coupled Markov Chain Monte Carlo (MCMCMC)
- Metropolis-within-Gibbs (MWG)
- Multiple-Try Metropolis (MTM)
- No-U-Turn Sampler (NUTS)
- Oblique Hyperrectangle Slice Sampler (OHSS)
- Random-Walk Metropolis (RWM)
- Random Dive Metropolis-Hastings (RDMH)
- Reflective Slice Sampler (RSS)
- Refractive Sampler (Refractive)
- Reversible-Jump (RJ)
- Robust Adaptive Metropolis (RAM)
- Sequential Adaptive Metropolis-within-Gibbs (SAMWG)
- Sequential Metropolis-within-Gibbs (SMWG)
- Slice Sampler (Slice)
- Stochastic Gradient Langevin Dynamics (SGLD)
- Tempered Hamiltonian Monte Carlo (THMC)
- t-walk (twalk)
- Univariate Eigenvector Slice Sampler (UESS)
- Updating Sequential Adaptive Metropolis-within-Gibbs (USAMWG)
- Updating Sequential Metropolis-within-Gibbs (USMWG)
Algorithm Summary Table
Each algorithm, presented above, has an algorithm summary table that is meant to provide a quick overview and facilitate comparisons between algorithms. Below is a description of the contents of these tables.
|Acceptance Rate||The acceptance rate is a consideration of the number of accepted proposals and the number of rejected proposals. It is usually a ratio of the number of proposals to the number of iterations. When known, the optimal acceptance rate and recommended interval is reported.|
|Applications||Since no algorithm is suitable to every problem, this is meant to provide guidance to the applicability of the algorithm.|
|Difficulty||The difficulty of using the algorithm is considered for a beginner. Considerations include the number of algorithm specifications, tuning the algorithm, whether or not it may be used as a final algorithm, and general experiences with use.|
|Final Algorithm?||This description is usually 'Yes', 'No', or 'User Discretion', followed by an explanation. When a 'final algorithm' seems to have converged, the samples may be used for inference. When it is not a final algorithm, then one is suggested.|
|Proposal||This describes how the proposal is generated, and is usually 'Multivariate' or 'Componentwise', though other descriptions exist as well. In general, a multivariate proposal means that each iteration, a proposal is generated for all parameters that takes correlation into account, usually from a multivariate normal distribution and a proposal covariance matrix. Componentwise proposals usually indicate that a proposal is made for each parameter, without considering correlation. A componentwise algorithm must evaluate the model a number of times equal to the number of parameters, per iteration. Componentwise algorithms are consierably slower per iteration, and speed per iteration decreases as the number of parameters increases.|
After a MCMC algorithm is selected, a model is specified and the priors are updated given the likelihood, and numerous iterations. A question with MCMC is: when to stop iterating/updating? Many stopping rules have been proposed. The most popular single-chain stopping rule involves Monte Carlo Standard Error (MCSE), and the most popular parallel-chain stopping rule uses the Gelman-Rubin Diagnostic.
After the user decides to stop iterating MCMC and updating the model, MCMC diagnostics are performed to assess non-convergence. Markov (and non-Markovian) Chains cannot be confirmed to have converged, so diagnostics are used to look for signs of non-convergence. Numerous MCMC diagnostics have been proposed. It is customary to observe trace plots and autocorrelation plots of the chains, assess stationarity with the BMK Diagnostic, and search for signs of non-convergence with other diagnostics. The LaplacesDemon software package offers a variety of MCMC diagnostics.
- Gilks W, Roberts G (1996). "Strategies for Improving MCMC." In W Gilks, S Richardson, D Spiegelhalter (eds.), Markov Chain Monte Carlo in Practice, pp. 89-114. Chapman & Hall, Boca Raton, FL.
- Hastings W (1970). "Monte Carlo Sampling Methods Using Markov Chains and Their Applications." Biometrika, 57(1), 97-109.
- Metropolis N, Rosenbluth A, MN R, Teller E (1953). "Equation of State Calculations by Fast Computing Machines." Journal of Chemical Physics, 21, 1087-1092.
- Neal R (1997). "Markov Chain Monte Carlo Methods Based on Slicing the Density Function." Technical Report, University of Toronto.
- Neal R (2003). "Slice Sampling (with Discussion)." Annals of Statistics, 31(3), 705-767.
- Roberts G, Rosenthal J (2007). "Coupling and Ergodicity of Adaptive Markov Chain Monte Carlo Algorithms." Journal of Applied Probability, 44, 458-475.