**1. General state space Markov chains **

Most applications of Markov Chain Monte Carlo algorithms (MCMC) are concerned with continuous random variables, i.e the corresponding Markov chain has a continuous state space *S*. In this section we will give a brief overview of the theory underlying Markov chains with general state spaces. Although the basic principles are not much different from the discrete case, the study of general state Markov chainns involves many more technicalities and subtleties. Though this section is concerned with general state spaces we will notationally assume that the state space is We need a definition of a Markov chain, to be a stochastic process in which, conditionally on the present, the past and the future are independent. In the discrete case we formalised this idea using the conditional probability of given different collections of past events.

In a general state space it can be that all events of the type have probability 0, as it is the case for a process with a continuous state space. A process with a continuous state space spreads the probability so thinly that the probability of hitting one given state is 0 for all states. Thus we have to work with conditional probabilities of sets of states, rather than individual states.

Definition 1 (Markov chain). Let X be a stochastic process in discrete time with general state space S. X is called a Markov chain if X satisfies the Markov property

for all measurable sets .

In the following we will assume that the Markov chain is *homogeneous*. i.e. the probabilities are independent of t. For the remainder of this section we shall also assume that we can express the probability from definition 1 using a *transition kernel*

where the integration is with respect to a suitable dominating measure, i.e. for example with respect to the Lebesgue measure if S = . The transition kernel K(x,y) is thus just the conditional probability density of given . We obtain the special case of the definition of a transition kernel.

Definition 2The matrix is called the transition kernel (or transition matrix) of the homogeneous Markov chain X.

We will see that together with the initial distribution, whcih we might write as a vector the transition kernel fully specifies the distribution of a homogeneous Markov chain. However, we start by stating two basic properties of the transition kernel **K**:

- The entries of the transition kernel are non-negative (they are probabilities).
- Each row of the kernel sums to 1, as

We obtain the special case of definition 1.8 by setting K(i,j) = , where is the (i,j)-th element of the transition matrix . For a discrete state space the dominating measure is the counting measure, so integration just corresponds to summation, i.e. equation 3 is equivalent to We have for measurable set that thus the m-step transition kernel is The m-step transition kernel allows for expressing the m-step transition probabilities more conveniently:

Let us consider an example.

Example 1Consider the Gaussian random walk on . Consider the random walk on defined by where , i.e. the probability density function of is . This is equivalent to assuming that We also assume that is independent of . Suppose that . In contrast to the random walk on the state space of the Gaussian random walk is . We have that where A – . Thus X is indeed a Markov chain. Furthermore we have that Thus the transition kernel (which is nothing other than the conditional density of ) is thus To find the m-step transition kernel we could use equation 2. However, the resulting integral is difficult to compute. Rather we exploit the fact that where the boxed formula is approximately thus we can write Comparing this with 2 we can identify as m-step transition kernel \qed

We need the powerful probabilistic notion of irreducibility.

Definition 3 (Irreducibility)Given a distribution on the states S, a Markov chain is said to be -irreducible if for all sets A with and for all , there exists an such that If the number of steps m=1 then for all A, then the chain is said to be strongly -irreducible.

Example 2In the example above we had that As the range of the Gaussian distribution is , we have that > 0 for all sets A of non-zero Lebegue measure. Thus the chain is strongly irreducible with the respect to any continuous distribution. \qed

We can extend the concepts of periodicity, recurrence, and transience from the discrete case to the general case. However this requires additional technical concepts like *atoms* and *small sets* one can see ‘Robert and Casella, 2004’ for a rigorous treatment of these concepts. Let us define a recurrent discrete Markov chain.

Definition 4A discrete Markov chain is recurrent, if all states (on average) are visited inifinitely often.

For more general state spaces, we need to consider the number of visits to a set of states rather than single states. Let be the number of visits the chain makes to states in the set . We then define the expected number of visits in , when we start the chain in : This allows us to define recurrence for general state spaces. We start with defining recurrence of sets before extending the definition of recurrence of an entire Markov chain.

Definition 5(a) A set A is said to be recurrent for a Markov chain X if for all (b) A Markov chain to be recurrent, if

- The chain is -irreducible for some distribution .
- Every measurable set with is recurrent.

According to the definition a set is recurrent if on average it is visited infinitely often. This is already the case if there is a non-zero probability of visiting the set infinitely often. A stronger concept of recurrence can be obtained if we require that the set is visited infinitely often with probability 1. This type of recurrence is referred to as *Harris recurrence.*

Definition 6 (Harris Recurrence). (a) A set is said to be Harris-recurrent for a Markov chain X if for all (b) A Markov chain is said to be Harris-recurrent, if

- The chain is -irreducible for some distribution .
- Every measurable set with 0 is Harris-recurrent.

It is easy to see that Harris-recurrence implies recurrence. For discrete state spaces the two concepts are equivalent. Checking recurrence or Harris recurrence can be very difficult. We will state (without) proof a proposition which establishes that if a Markov chain is irreducible and has a unique invariant distribution, then the chain is also recurrent.

However, before, we can state this proposition, we need to define invariant distributions for general state spaces.

Definition 7(Invariant Distribution). A distribution with density function is said to be the invariant distribution of a Markov chain X with transition kernel K if for almost all

Proposition 8Suppose that X is a -irreducible Markov chain having as unique invariant distibution. Then X is also recurrent.

Checking the invariance condition of definition7 requires computing an integral, but this can be cumbersome, so an alternative condition is the simpler (sufficient but not necessary) condition of detailed balance.

Definition 9 (Detailed balance). A transition kernel K is said to be in detailed balance with a distribution with denisity if for almost all x,y

In complete analogy with theorem 1.22 one can also show in the general case that if the transition kernel of a Markov chain is in detailed balance with a distribution , then the chain is time-reversible and has as its invariant distribution.

** 1.1. Ergodic theorems **

In this section we will study the question of whether we can use observations from a Markov chain to make inferences about its invariant distribution. We will see that under some regularity conditions it is even enough to follow a single sample path of the Markov chain.

For independently identically distributed data the Law of Large Numbers is used to justify estimating the expected value of a functional using empirical averages. A similar result can be obtained for Markov chains. This result is the reason why MCMC methods work: it allows us to set up simulation algorithms to generate a Markov chain, whose sample path we can then use for estimating various quantities of interest.

Theorem 10 (Ergodic Theorem). Let X be a -irreducible, recurrent -valued Markov chain with invariant distribution . Then we have for any integrable function that with probability 1 for almost every starting value . If X is Harris-recurrent this holds for every starting value x.

*Proof:* For a proof see (Roberts and Rosenthal, 2004, fact 5)

We conclude with an example that illustates that the condition of irreducibility and recurrence are necessary in theorem 10. These conditions ensure that the chain is permamently exploring the entire state space, which is a necessary condition for the convergence of ergodic averages.

Example 3Consider a discrete chain with two states and transition matrix Any distribution on is an invariant distribution, as for all . However the chain is not irreducible (or recurrent): we cannot get from state 1 to state 2 and vice versa. If the inital distribution is with then for every we have that By observing one sample path (which is either 1,1,1,… or 2,2,2,…) we can make no inference about the distribution of or the parameter . The reason for this is that the chain fails to explore the whole space space. To clarify the chain fails to switch between the states 1 and 2. In order to estimate the parameter we would need to look at more than one sample path. \qed

**2. Monte Carlo Methods **

** 2.1. What are Monte Carlo Methods? **

This collection of lectures is concerned with Monte Carlo methods, which are sometimes referred to as *stochastic simulation*. Examples of Monte Carlo methods include stochastic integration, where we use a simulation-based method to evaluate an integral, Monte Carlo tests, where we resort to simulation in order to computer the p-value, and Markov-Chain Monte Carlo (MCMC), where we construct a Markov chain which (hopefully) converges to the distribution of interest.

A formal definition of Monte Carlo methods was given (amongst others) by Halton (1970)\footnote{Halton, J.H. A retrospective and prospective survey of the Monte Carlo method. SIAM Review, **12**, 1-63.} He defined a Monte Carlo method as “representing the solution of a problem as a parameter of a hypothetical population, and using a random sequence of numbers to construct a sample of the population, from which statistical estimates of the parameter can be obtained.”

** 2.2. Introductory examples **

Example 4 (A raindrop experiment for computing )Assume we want to compute an Monte Carlo estimate of using a simple experiment. Assume that we could produce “uniform rain” on the square , such that the probability of a raindrop falling in to a region is proportional to the area of but independent of the position of . It is easy to see that this is the case iff the two coordinates X,Y are i.i.d. realisations of uniform distribution on the interval (in short Now consider the probability that a raindrop falls into the unit circle. It is In other words, i.e. we found a way of expressing the desired quantity as a function of a probability. We can estimate the probability using our raindrop experiment. If we observe n raindrops, then the number of raindrops Z that fall inside the circle is a binomial random variable: Thus we can estimate p by its maximum -likelihood estimate and we can estimate by Assume we have observed that 77 of the 100 raindrops were inside the circle. In our case our estimate of is which is relatively poor.

However thelaw of large numbersguarantees that our estimate converges almost surely to . As n increases, our estimate improves. We can assess the quality of our estimate by computing a confidence interval for . As we have and we use the approximation that Hence, N(p,p(1-p)/100) and we can obtain a 95 confidence interval for p using this normal approximation = , As our estimate of is four times the estimate of , we now also have a confidence interval for : Historically, the main drawback of Monte Carlo methods was that they used to be expensive to carry out. Physically random experiments (for example an experiment to examine ‘Buffon’s Needle’ were difficult to perform and so was the numerical processing of their results. This changed fundamentally with the advent of the digital computer. Amongst the first to realize this potential were John von Neuman and Stanislaw Ulam. For any Monte-Carlo simulation we need to be able to reproduce randomness by a deterministic Computer Algorithm. Clearly this is a philosophical paradox, but lots of work has been done on this, and the statistical language R has a lot of ‘random number generators’ see in GNU R for further details.