Markov Chains and Monte Carlo Algorithms


1. General state space Markov chains

Most applications of Markov Chain Monte Carlo (MCMC) algorithms are concerned with continuous random variables, i.e. the corresponding Markov chain has a continuous state space S. In this section we give a brief overview of the theory underlying Markov chains with general state spaces. Although the basic principles are not much different from the discrete case, the study of general state space Markov chains involves many more technicalities and subtleties. Though this section is concerned with general state spaces, we will notationally assume that the state space is { S = \mathbb{R}^{d}}. We want to define a Markov chain to be a stochastic process in which, conditionally on the present, the past and the future are independent. In the discrete case we formalised this idea using the conditional probability of {\{X_t = j\}} given different collections of past events.
In a general state space it can happen that all events of the type {\{X_t = j\}} have probability 0, as is the case for a process with a continuous state space. Such a process spreads the probability so thinly that the probability of hitting any one given state is 0 for every state. Thus we have to work with conditional probabilities of sets of states, rather than of individual states.

Definition 1 (Markov chain). Let X be a stochastic process in discrete time with general state space S. X is called a Markov chain if X satisfies the Markov property

\displaystyle  \mathbb{P}(X_{t+1} \in A|X_0 = x_0, \cdots , X_t = x_t) = \mathbb{P}(X_{t+1} \in A| X_t = x_t) \ \ \ \ \ (1)

for all measurable sets {A \subset S}.

In the following we will assume that the Markov chain is homogeneous, i.e. that the probabilities {\mathbb{P}(X_{t+1} \in A| X_t = x_t)} are independent of t. For the remainder of this section we shall also assume that we can express the probability from definition 1 using a transition kernel {K: S \times S \rightarrow \mathbb{R}^{+}_{0}}:

\displaystyle  \mathbb{P}(X_{t+1} \in A| X_t = x_t) = \int_A K(x_t,x_{t+1})dx_{t+1} \ \ \ \ \ (2)

where the integration is with respect to a suitable dominating measure, for example the Lebesgue measure if {S = \mathbb{R}^d}. The transition kernel K(x,y) is thus just the conditional probability density of {X_{t+1}} given {X_t = x}. In the discrete case the transition kernel reduces to the familiar transition matrix.

Definition 2 The matrix { \mathbf{K} = (k_{ij})_{ij}} with {k_{ij} = \mathbb{P}(X_{t+1} = j|X_{t} = i)} is called the transition kernel (or transition matrix) of the homogeneous Markov chain X.

We will see that, together with the initial distribution, which we might write as a vector {\mathbf{\lambda_{0}} = (\mathbb{P}(X_{0} = i))_{i \in S}}, the transition kernel {\mathbf{K}} fully specifies the distribution of a homogeneous Markov chain. However, we start by stating two basic properties of the transition kernel {\mathbf{K}}:

  • The entries of the transition kernel are non-negative (they are probabilities).
  • Each row of the kernel sums to 1, as

    \displaystyle  \sum_{j} k_{ij} = \sum_{j} \mathbb{P}(X_{t+1} = j|X_{t} = i) = \mathbb{P}(X_{t+1} \in S|X_{t} = i) = 1 \ \ \ \ \ (3)

We recover the discrete case of definition 2 by setting {K(i,j) = k_{ij}}, where {k_{ij}} is the (i,j)-th element of the transition matrix {\mathbf{K}}. For a discrete state space the dominating measure is the counting measure, so integration just corresponds to summation, i.e. equation 2 is equivalent to

\displaystyle \mathbb{P}(X_{t+1} \in A|X_{t} = x_t) = \sum_{x_{t+1} \in A} k_{x_{t},x_{t+1}}

For a measurable set {A \subset S} we have

\displaystyle \mathbb{P} (X_{t+m} \in A|X_t = x_t) = \int_A \int_S \cdots \int_S K(x_t,x_{t+1})K(x_{t+1},x_{t+2}) \cdots K(x_{t+m-1},x_{t+m})dx_{t+1} \cdots dx_{t+m-1}dx_{t+m},

thus the m-step transition kernel is

\displaystyle K^{(m)}(x_0,x_m) = \int_S \cdots \int_S K(x_0,x_1)\cdots K(x_{m-1},x_{m})dx_{m-1}\cdots dx_1

The m-step transition kernel allows us to express the m-step transition probabilities more conveniently:

\displaystyle \mathbb{P} (X_{t+m} \in A|X_t = x_t) = \int_A K^{(m)}(x_t,x_{t+m})dx_{t+m}
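For a discrete state space the m-step kernel is simply the m-th power of the transition matrix, which is easy to check numerically. A minimal sketch in Python (the two-state matrix is an invented example):

```python
import numpy as np

# A hypothetical two-state transition matrix (each row sums to 1).
K = np.array([[0.9, 0.1],
              [0.4, 0.6]])

m = 3
# In the discrete case the m-step kernel K^(m) is the m-th matrix power.
K_m = np.linalg.matrix_power(K, m)

# Each row of K^(m) is again a probability distribution over S.
print(K_m)
print(K_m.sum(axis=1))
```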

Let us consider an example.

Example 1 (Gaussian random walk) Consider the random walk on {\mathbb{R}} defined by

\displaystyle X_{t+1} = X_{t} + E_{t},

where {E_{t}\sim N(0,1)}, i.e. the probability density function of {E_t} is {\phi(z) = \frac{1}{\sqrt{2\pi}}\exp(-\frac{z^{2}}{2})}. This is equivalent to assuming that {X_{t+1}|X_t = x_t \sim N(x_t, 1)}. We also assume that {E_t} is independent of {X_0,E_1,\cdots,E_{t-1}}, and that {X_0 \sim N(0,1)}. In contrast to the random walk on {\mathbb{Z}}, the state space of the Gaussian random walk is {\mathbb{R}}. We have that

\displaystyle \mathbb{P}(X_{t+1} \in A|X_t = x_t,\cdots, X_0 = x_0) = \mathbb{P}(E_t \in A - x_t| X_t =x_t, \cdots , X_0 = x_0) = \mathbb{P}(E_t \in A - x_t) = \mathbb{P}(X_{t+1} \in A| X_t = x_t),

where {A - x_t = \{a - x_t : a \in A\}}. Thus X is indeed a Markov chain. Furthermore we have that

\displaystyle \mathbb{P}(X_{t+1} \in A| X_t = x_t) = \mathbb{P}(E_t \in A - x_t) = \int_A \phi(x_{t+1} - x_{t})dx_{t+1}

Thus the transition kernel (which is nothing other than the conditional density of {X_{t+1}| X_t = x_t}) is

\displaystyle K(x_t,x_{t+1}) = \phi(x_{t+1} -x_{t})

To find the m-step transition kernel we could use equation 2. However, the resulting integral is difficult to compute directly. Rather we exploit the fact that

\displaystyle X_{t+m} = X_{t} + \underbrace{E_{t} + \cdots + E_{t+m-1}}_{\sim N(0,m)},

thus {X_{t+m}|X_{t} = x_t \sim N(x_{t},m)}, and so

\displaystyle \mathbb{P}(X_{t+m} \in A|X_t = x_t) = \mathbb{P}(X_{t+m} - X_{t}\in A - x_{t}) = \int_A \frac{1}{\sqrt{m}}\phi \left(\frac{x_{t+m} -x_{t}}{\sqrt{m}} \right)dx_{t+m}

Comparing this with 2 we can identify

\displaystyle K^{(m)}(x_t, x_{t+m}) = \frac{1}{\sqrt{m}}\phi \left(\frac{x_{t+m} -x_{t}}{\sqrt{m}} \right)

as the m-step transition kernel. \qed
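We can check the m-step kernel of example 1 by simulation: the increment {X_{t+m} - X_{t}} should be N(0,m) distributed. A short sketch (sample size and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

n_paths, m = 100_000, 5
# Simulate m steps of the Gaussian random walk X_{t+1} = X_t + E_t
# with E_t ~ N(0,1), for many independent paths started at X_t = 0.
increments = rng.standard_normal((n_paths, m))
X_m = increments.sum(axis=1)  # X_{t+m} - X_t = E_t + ... + E_{t+m-1}

# The m-step kernel says X_{t+m} - X_t ~ N(0, m).
print(X_m.mean())  # close to 0
print(X_m.var())   # close to m = 5
```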

We need the powerful probabilistic notion of irreducibility.

Definition 3 (Irreducibility) Given a distribution {\mu} on the states S, a Markov chain is said to be {\mu}-irreducible if for all sets A with {\mu(A)> 0} and for all {x \in S}, there exists an {m \in \mathbb{N}_{0}} such that

\displaystyle \mathbb{P}(X_{t+m} \in A| X_t = x) = \int_A K^{(m)} (x,y) dy > 0

If this holds with m = 1 for all such A, then the chain is said to be strongly {\mu}-irreducible.

Example 2 In the example above we had that {X_{t+1} | X_t = x_t \sim N(x_{t},1)}. As the support of the Gaussian distribution is {\mathbb{R}}, we have that {\mathbb{P}(X_{t+1} \in A | X_t =x_t) > 0} for all sets A of non-zero Lebesgue measure. Thus the chain is strongly irreducible with respect to any continuous distribution. \qed

We can extend the concepts of periodicity, recurrence, and transience from the discrete case to the general case. However, this requires additional technical concepts such as atoms and small sets; see Robert and Casella (2004) for a rigorous treatment. Let us first recall the definition of a recurrent discrete Markov chain.

Definition 4 A discrete Markov chain is recurrent if all states are (on average) visited infinitely often.

For more general state spaces we need to consider the number of visits to a set of states rather than to single states. Let {V_A = \sum^{+\infty}_{t=0}\mathbb{1}_{\{X_t \in A\}}} be the number of visits the chain makes to states in the set {A \subset S}. The expected number of visits to {A \subset S}, when we start the chain in {x \in S}, is then

\displaystyle \mathbb{E}(V_{A}|X_0 = x) = \mathbb{E}\left(\sum^{+\infty}_{t=0} \mathbb{1}_{\{X_t \in A\}}\Big|X_0 = x\right) = \sum_{t=0}^{+\infty} \int_A K^{(t)}(x,y)dy

This allows us to define recurrence for general state spaces. We start by defining recurrence of sets before extending the definition to an entire Markov chain.

Definition 5 (a) A set {A \subset S} is said to be recurrent for a Markov chain X if for all {x \in A}

\displaystyle \mathbb{E}(V_A|X_0 = x) = +\infty.

(b) A Markov chain is said to be recurrent, if

  • The chain is {\mu}-irreducible for some distribution {\mu}.
  • Every measurable set {A \;\subset\;S} with {\mu(A) > 0} is recurrent.

According to the definition a set is recurrent if on average it is visited infinitely often. This is already the case if there is a non-zero probability of visiting the set infinitely often. A stronger concept of recurrence can be obtained if we require that the set is visited infinitely often with probability 1. This type of recurrence is referred to as Harris recurrence.

Definition 6 (Harris recurrence). (a) A set {A \subset S} is said to be Harris-recurrent for a Markov chain X if for all {x \in A} {\mathbb{P} \left(V_{A} = +\infty | X_0 = x \right) = 1.} (b) A Markov chain is said to be Harris-recurrent, if

  • The chain is {\mu}-irreducible for some distribution {\mu}.
  • Every measurable set {A \; \subset \; S} with {\mu(A)} {>} 0 is Harris-recurrent.

It is easy to see that Harris recurrence implies recurrence. For discrete state spaces the two concepts are equivalent. Checking recurrence or Harris recurrence can be very difficult. We will state (without proof) a proposition which establishes that if a Markov chain is irreducible and has a unique invariant distribution, then the chain is also recurrent.
However, before, we can state this proposition, we need to define invariant distributions for general state spaces.

Definition 7 (Invariant Distribution). A distribution {\mu} with density function {f_{\mu}} is said to be the invariant distribution of a Markov chain X with transition kernel K if f_{\mu} (y) = \int_{S} f_{\mu} (x) K(x,y) dx for almost all {y \;\in\;S.}

Proposition 8 Suppose that X is a {\mu}-irreducible Markov chain having {\mu} as its unique invariant distribution. Then X is also recurrent.

Checking the invariance condition of definition 7 requires computing an integral, which can be cumbersome. A simpler (sufficient, but not necessary) condition is detailed balance.

Definition 9 (Detailed balance). A transition kernel K is said to be in detailed balance with a distribution {\mu} with density {f_{\mu}} if for almost all x,y {\in\;S}

f_{\mu}(x)K(x,y) = f_{\mu}(y)K(y,x). In complete analogy with theorem 1.22 one can also show in the general case that if the transition kernel of a Markov chain is in detailed balance with a distribution {\mu}, then the chain is time-reversible and has {\mu} as its invariant distribution.
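In the discrete case detailed balance reads {\mu_i k_{ij} = \mu_j k_{ji}}, and the implication "detailed balance implies invariance" can be verified numerically. A small sketch with an invented reversible three-state chain:

```python
import numpy as np

# A hypothetical distribution mu and a transition matrix K constructed
# to be in detailed balance with mu (rows of K sum to 1).
mu = np.array([0.2, 0.3, 0.5])
K = np.array([[0.70, 0.30, 0.00],
              [0.20, 0.55, 0.25],
              [0.00, 0.15, 0.85]])

# Detailed balance: mu_i * K_ij = mu_j * K_ji for all i, j,
# i.e. the matrix of probability flows is symmetric.
flows = mu[:, None] * K
print(np.allclose(flows, flows.T))  # True

# Detailed balance implies invariance: mu' K = mu'.
print(mu @ K)  # equals mu
```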

1.1. Ergodic theorems

In this section we will study the question of whether we can use observations from a Markov chain to make inferences about its invariant distribution. We will see that under some regularity conditions it is even enough to follow a single sample path of the Markov chain.
For independently identically distributed data the Law of Large Numbers is used to justify estimating the expected value of a functional using empirical averages. A similar result can be obtained for Markov chains. This result is the reason why MCMC methods work: it allows us to set up simulation algorithms to generate a Markov chain, whose sample path we can then use for estimating various quantities of interest.

Theorem 10 (Ergodic theorem). Let X be a {\mu}-irreducible, recurrent {\mathbb{R}^{d}}-valued Markov chain with invariant distribution {\mu}. Then we have for any integrable function {g: \mathbb{R}^{d} \rightarrow \mathbb{R}} that with probability 1

\displaystyle \lim_{t \rightarrow \infty} \frac{1}{t} \sum^{t}_{i=1} g(X_{i}) = \mathbb{E}_{\mu}(g(X)) = \int_S g(x)f_{\mu}(x)dx

for almost every starting value {X_0 = x}. If X is Harris-recurrent, this holds for every starting value x.

Proof: For a proof see (Roberts and Rosenthal, 2004, fact 5) \Box

We conclude with an example illustrating that the conditions of irreducibility and recurrence are necessary in theorem 10. These conditions ensure that the chain permanently explores the entire state space, which is a necessary condition for the convergence of ergodic averages.

Example 3 Consider a discrete chain with two states {S = \left\lbrace 1,2 \right\rbrace} and transition matrix {\mathbf{K} = \mathbf{I}}, the identity matrix. Any distribution {\mu} on {\{1,2\}} is an invariant distribution, as

\displaystyle \mathbf{\mu ' K = \mu ' I = \mu '}

for all {\mu}. However, the chain is not irreducible (or recurrent): we cannot get from state 1 to state 2 or vice versa. If the initial distribution is {\mu = (\alpha, 1 - \alpha)'} with {\alpha \in [0,1]}, then for every {t \in \mathbb{N}_{0}} we have that

\displaystyle \mathbb{P}(X_t = 1) = \alpha \;\;\;\;\;\;\;\;\; \mathbb{P}(X_t = 2) = 1 - \alpha

By observing one sample path (which is either 1,1,1,… or 2,2,2,…) we can make no inference about the distribution of {X_t} or the parameter {\alpha}. The reason is that the chain fails to explore the whole state space: it never switches between the states 1 and 2. In order to estimate the parameter {\alpha} we would need to look at more than one sample path. \qed
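The failure in example 3 is easy to see by simulation: for the identity kernel the ergodic average of {\mathbb{1}_{\{X_t = 1\}}} is frozen at its starting value, whereas for an irreducible two-state chain it converges to the invariant probability. A sketch (the mixing matrix below is an invented example with invariant distribution (0.6, 0.4)):

```python
import numpy as np

rng = np.random.default_rng(1)
T = 20_000

# Reducible chain (K = I): the path never leaves its starting state,
# so the ergodic average of 1{X_t = 1} is 0 or 1 whatever alpha is.
frozen_path = np.full(T, 1)
print(np.mean(frozen_path == 1))  # exactly 1.0

# Irreducible chain on states {0, 1}: the ergodic theorem applies.
K = np.array([[0.8, 0.2],
              [0.3, 0.7]])  # invariant distribution (0.6, 0.4)
x, visits = 0, np.empty(T, dtype=int)
for t in range(T):
    visits[t] = x
    x = rng.choice(2, p=K[x])
print(np.mean(visits == 0))  # close to 0.6
```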

2. Monte Carlo Methods

2.1. What are Monte Carlo Methods?

This collection of lectures is concerned with Monte Carlo methods, which are sometimes referred to as stochastic simulation. Examples of Monte Carlo methods include stochastic integration, where we use a simulation-based method to evaluate an integral, Monte Carlo tests, where we resort to simulation in order to compute the p-value, and Markov Chain Monte Carlo (MCMC), where we construct a Markov chain which (hopefully) converges to the distribution of interest.
A formal definition of Monte Carlo methods was given (amongst others) by Halton (1970)\footnote{Halton, J.H. A retrospective and prospective survey of the Monte Carlo method. SIAM Review, 12, 1-63.} He defined a Monte Carlo method as “representing the solution of a problem as a parameter of a hypothetical population, and using a random sequence of numbers to construct a sample of the population, from which statistical estimates of the parameter can be obtained.”

2.2. Introductory examples

Example 4 (A raindrop experiment for computing {\pi}) Assume we want to compute a Monte Carlo estimate of {\pi} using a simple experiment. Assume that we could produce “uniform rain” on the square {[-1,1] \times [-1,1]}, such that the probability of a raindrop falling into a region {\mathcal{R} \subset [-1,1]^{2}} is proportional to the area of {\mathcal{R}}, but independent of the position of {\mathcal{R}}. It is easy to see that this is the case iff the two coordinates X,Y are i.i.d. realisations of the uniform distribution on the interval {[-1,1]} (in short {X,Y \overset{i.i.d.}{\sim} U[-1,1]}). Now consider the probability that a raindrop falls into the unit circle. It is

\displaystyle \mathbb{P}(\textrm{drop within the circle}) = \frac{\textrm{area of the unit circle}}{\textrm{area of the square}} = \frac{\int \int _{\{x^2 + y^2 \leq 1\}} 1\, dxdy}{\int \int _{\{-1 \leq x,y \leq 1\}} 1\, dxdy} = \frac{\pi}{2 \cdot 2} = \frac{\pi}{4}

In other words,

\displaystyle \pi = 4\cdot\mathbb{P}(\textrm{drop within circle}),

i.e. we have found a way of expressing the desired quantity {\pi} as a function of a probability, which we can estimate using our raindrop experiment. If we observe n raindrops, then the number of raindrops Z that fall inside the circle is a binomial random variable:

\displaystyle Z \sim B(n,p) \;\;\;\;\;\;\; \textrm{with } p = \mathbb{P}(\textrm{drop within circle}).

Thus we can estimate p by its maximum-likelihood estimate \hat{p} = \frac{Z}{n}, and we can estimate {\pi} by \hat{\pi} = 4\hat{p} = 4\cdot\frac{Z}{n}. Assume we have observed that 77 of the 100 raindrops were inside the circle. Our estimate of {\pi} is then

\displaystyle \hat{\pi} = \frac{4 \cdot 77}{100} = 3.08,

which is relatively poor.
However, the law of large numbers guarantees that our estimate {\hat{\pi}} converges almost surely to {\pi}: as n increases, our estimate improves. We can assess the quality of our estimate by computing a confidence interval for {\pi}. As we have {Z \sim B(100,p)} and {\hat{p} = \frac{Z}{n}}, we use the approximation {Z \sim N(100p,100p(1-p))}. Hence {\hat{p} \sim N(p,p(1-p)/100)}, and we can obtain a 95{\%} confidence interval for p using this normal approximation:

\displaystyle \left[ 0.77 - 1.96\cdot \sqrt{\frac{0.77\cdot(1-0.77)}{100}},\; 0.77 + 1.96\cdot \sqrt{\frac{0.77\cdot(1-0.77)}{100}}\right] = [0.6875,0.8525].

As our estimate of {\pi} is four times the estimate of {p}, we now also have a confidence interval for {\pi}: [2.750,3.410]. Historically, the main drawback of Monte Carlo methods was that they used to be expensive to carry out. Physically random experiments (for example an experiment to examine ‘Buffon’s needle’) were difficult to perform, and so was the numerical processing of their results. This changed fundamentally with the advent of the digital computer. Amongst the first to realise this potential were John von Neumann and Stanislaw Ulam. For any Monte Carlo simulation we need to be able to reproduce randomness by a deterministic computer algorithm. Clearly this is a philosophical paradox, but a lot of work has been done on it, and the statistical language R provides many random number generators; see {?RNGkind} in GNU R for further details.
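The raindrop experiment is straightforward to carry out on a computer. A minimal sketch (sample size and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(42)

n = 1_000_000
# "Uniform rain" on the square [-1,1] x [-1,1].
x = rng.uniform(-1, 1, n)
y = rng.uniform(-1, 1, n)

# The proportion of drops inside the unit circle estimates p = pi/4.
z = np.sum(x**2 + y**2 <= 1)
p_hat = z / n
pi_hat = 4 * p_hat

# 95% confidence interval for pi via the normal approximation.
half = 1.96 * np.sqrt(p_hat * (1 - p_hat) / n)
print(pi_hat)
print((4 * (p_hat - half), 4 * (p_hat + half)))
```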

Transfinite Induction



1. Introduction

This is intended as a primer inspired by a piece by Hilbert Levitz. The theory of transfinite ordinals is a part of set theory. While the concept is tied up with the completed infinite and high cardinalities, we’ll emphasise the more constructive aspects of the theory. There have been applications of constructive treatments of ordinals to recursive function theory and proof theory. Pretty much everything we need to know about ordinals follows from the following three properties of the class of all ordinals O.

2. Basic Properties of the System of Ordinal Numbers.

  1. O has a well ordering, which will be denoted by {<}.
  2. Any well ordered set whatsoever is order isomorphic to (a unique) initial segment of O. The ordinal determining the segment is called the ordinal of the set. This makes the system of ordinals in some sense the “mother of all well ordered sets”
  3. Any set of members of O has a strict upper bound (and therefore by well ordering, a least upper bound) in O. Warning: The collection of all ordinals itself is not a “set”. You’ll get hit with a famous paradox if you treat it like one. Sub-collections of initial segments of the ordinals are sets, and these are all we use. As a consequence of the well ordering, we see easily that:
    • One can do arguments by induction over the ordinals, or over any initial segment of them, if we use the {<} form of the induction principle. Recall that for natural numbers we also have induction in the {n} to {n+1} form, but that fails as soon as one has an object greater than every natural number. Induction in the {<} form frequently goes under the fancy name “transfinite induction”.
    • We let 0 denote the smallest ordinal. Every ordinal has an immediate successor, so by taking repeated successors of zero we generate all the natural numbers as an initial segment of the system of ordinals. Not all non-zero ordinals have an immediate predecessor; those without one are called limit ordinals. The smallest ordinal to follow all the natural numbers is denoted by {\omega}, and it is a limit ordinal. Each limit ordinal is the least upper bound (‘supremum’) of the set of smaller ordinals.
    • On account of the well ordering we can define by recursion operations of addition, multiplication, and exponentiation. The only added feature over recursion on natural numbers is that we need to specify what to do at limit ordinals.

    \displaystyle  x + 0 = x \ \ \ \ \ (1)

    \displaystyle  x + y' = (x + y)' \ \ \ \ \ (2)

    \displaystyle  x + \bar{y} = \sup_{y < \bar{y}}x+y \ \ \ \ \ (3)

    when {\bar{y}} is a limit ordinal.

    \displaystyle  x \times 0 = 0

    \displaystyle  x \times y' = x \times y + x \ \ \ \ \ (4)

    \displaystyle  x \times \bar{y} = \sup_{y < \bar{y}}x \times y \ \ \ \ \ (5)

    when {\bar{y}} is a limit ordinal.

    \displaystyle  x^{0} = 1 \ \ \ \ \ (6)

    \displaystyle  x^{y'} = x^{y} \times x \ \ \ \ \ (7)

    \displaystyle  x^{\bar{y}} = \sup_{y < \bar{y}} x^{y} \ \ \ \ \ (8)

    when {\bar{y}} is a limit ordinal. It turns out that:

    • Addition and multiplication are associative. Neither is commutative. Multiplication distributes from the left over addition; that is {x \times (y + z) = x \times y + x \times z}. Right distributivity fails, but we do have the inequality {(x + y)\times z \leq x \times z + y \times z}.
    • Addition, multiplication and exponentiation are (with trivial exceptions): i) continuous, strictly increasing functions of the right argument; ii) weakly monotone increasing functions of the left argument.

3. Cantor Normal Form

The theorem of ordinary number theory that justifies writing any non-zero number to a number base, like base 10 or base 2, applies to transfinite ordinals as well.

Theorem 1 Any non-zero ordinal can be written uniquely as a polynomial to any base greater than 1 with descending exponents and coefficients less than the base. The coefficients are written to the right of the base. Such a representation is called a Cantor normal form.

It’s common to use as the base the number {\omega}, in which case the coefficients are natural numbers, so a typical normal form looks like:

\displaystyle \omega^{\alpha_{1}} \times n_{1} + \omega^{\alpha_{2}} \times n_{2} + ... + \omega^{\alpha_{k}} \times n_{k}

where {\alpha_{1} > \alpha_{2} > ... > \alpha_{k}}

  • The rule for comparing two normal forms to see which represents the bigger ordinal is as follows: One first looks to see which has the highest leading exponent, if they are the same then you look to see which has the highest leading coefficient, etc. Such a method could go into an infinite regress on account of the fact (shown in the next section) that some ordinals can equal their own leading exponent.
  • Numbers of the form {\omega^{x}} determine initial segments closed under addition, and numbers of the form {\omega^{\omega^{x}}} determine initial segments of the ordinals closed under multiplication.
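The comparison rule can be sketched in code. A Cantor normal form to base {\omega} (for ordinals below {\epsilon_{0}}) can be represented as a list of (exponent, coefficient) pairs with strictly descending exponents, where the exponents are themselves such lists; the representation and helper below are illustrative choices, not standard notation:

```python
def cnf_cmp(a, b):
    """Compare two ordinals in Cantor normal form; return -1, 0 or 1.

    An ordinal is a list of (exponent, coefficient) pairs with strictly
    descending exponents; 0 is the empty list, and the natural number n
    is [([], n)] (i.e. omega^0 * n).
    """
    if not a and not b:
        return 0
    if not a:
        return -1
    if not b:
        return 1
    (ea, na), (eb, nb) = a[0], b[0]
    c = cnf_cmp(ea, eb)           # first compare leading exponents
    if c != 0:
        return c
    if na != nb:                  # then leading coefficients
        return -1 if na < nb else 1
    return cnf_cmp(a[1:], b[1:])  # then the remaining terms

one = [([], 1)]
omega = [(one, 1)]   # omega^1 * 1
three = [([], 3)]    # omega^0 * 3

print(cnf_cmp(omega, three))  # 1: omega > 3
print(cnf_cmp(three, omega))  # -1
print(cnf_cmp(omega, omega))  # 0
```

Below {\epsilon_{0}} the recursion always terminates, since the exponents are strictly smaller ordinals; the infinite regress mentioned above only threatens once an ordinal can equal its own leading exponent.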

4. Definition of the ordinal {\epsilon_{0}}

  • The limit of the sequence {\omega, \omega^{\omega},\omega^{\omega^{\omega}},\ldots} was named {\epsilon_{0}} by Cantor. We can see that it’s a solution of the equation {\omega^{x} = x} by the following simple argument: consider the recursively defined sequence
    1. {a_{0} = 0}
    2. {a_{n+1} = \omega^{a_{n}}}

    then {\omega^{\lim a_{n}} = \lim \omega^{a_{n}} = \lim a_{n+1} = \lim a_{n}}. Another way to say that a number is a solution of {\omega^{x} = x} is to say that it’s a fixed point of the function {\omega^{x}}.

  • {\epsilon_{0}} is the smallest ordinal bigger than {\omega} that determines an initial segment closed under ordinal addition, multiplication, and exponentiation. From this it follows that you can’t denote it using just the symbols 0, {\omega}, plus, times, and exponentiation. One can show using Cantor normal form that anything smaller can be so represented.
  • Actually a similar argument can be used to show that there are additional fixed points, and they run clear through the ordinals. They can be arranged in a transfinite sequence {\epsilon_0, \epsilon_1, \epsilon_2, \ldots, \epsilon_{\omega},\ldots}

Next time we will consider computation with ordinals on real computers.

Information Retrieval


Attention conservation notice: 680 words about Information Retrieval, and highly unoriginal.
The following is very much inspired by a course by Cosma Shalizi, but I felt it was worth rewriting to get to grips with the concepts. This is the first of what is hopefully a series of posts on ‘Information Retrieval’ and applications of mathematics to ‘Data Mining’.
1. Textual features

I’d like to introduce the concepts of how features are extracted from text; we treat these features as proxies for actual meaning. One classic representation is the bag-of-words (BoW): we list all of the distinct words in the document together with how often each one appears. This is easy to calculate from the text.

Vectors. One way to code up the bag-of-words is to use vectors. Let each component of the vector correspond to a different word in the total lexicon of our document collection, in a fixed, standardised order. The value of the component is the number of times the word appears, possibly zero. We use this vector bag-of-words representation of documents for two big reasons:

  • There is a huge pre-existing technology for vectors: people have worked out, in excruciating detail, how to compare them, compose them, simplify them, etc. Why not exploit that, rather than coming up with stuff from scratch?
  • In practice, it’s proved to work pretty well.

We can store data from a corpus in the form of a matrix. Each row corresponds to a distinct case (or instance, unit, subject, …) – here, a document – and each column to a distinct feature. Conventionally, the number of cases is n and the number of features is p. It is no coincidence that this is the same format as the data matrix X in linear regression.
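As a toy illustration, here is how the bag-of-words data matrix might be built in Python (the three “documents” are invented):

```python
from collections import Counter

# A toy corpus of three "documents".
docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "the cat saw the dog"]

# The lexicon: all distinct words, in a fixed, standardised order.
lexicon = sorted({w for d in docs for w in d.split()})

# Data matrix: n = 3 rows (documents), p = len(lexicon) columns (words);
# entry (i, j) counts how often word j appears in document i.
X = [[Counter(d.split())[w] for w in lexicon] for d in docs]

print(lexicon)
for row in X:
    print(row)
```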

2. Measuring Similarity

Right now we are interested in saying which documents are similar to each other because we want to do a search by content. But measuring similarity – or equivalently measuring dissimilarity or distance – is fundamental to data mining. Most of what we will do relies on having a sensible way of saying how similar different objects are to each other, or how close they are in some geometric setting. Getting the right measure of closeness will have a huge impact on our results. This is where representing the data as vectors comes in so handy. We already know a nice way of saying how far apart two vectors are: the ordinary or Euclidean distance, which we can calculate with the Pythagorean formula:

\displaystyle \|\bar{x} - \bar{y}\| = \sqrt{\sum_{i=1}^{p}(x_i - y_i)^2}

where {x_i}, {y_i} are the {i^{th}} components of {\bar{x}} and {\bar{y}}. Remember that for bag-of-words vectors each distinct word – each entry in the lexicon – is a component or feature. The Euclidean norm of a vector is {\|\bar{x}\| = \sqrt{\sum_{i=1}^{p}x^{2}_{i}}}, so the distance between two vectors is the norm of their difference {\bar{x} - \bar{y}}. Equivalently, the norm of a vector is the distance from it to the origin {\bar{0}}. Of course, one can also look up a topology textbook and remind oneself of other metrics, such as the taxicab metric.

2.1. Normalisation

Just looking at the Euclidean distances between document vectors doesn’t work, at least if the documents are at all different in size. Instead, we need to normalise by document size, so that we can fairly compare short texts with long ones. There are (at least) two ways of doing this.
Document length normalisation. Divide the word counts by the total number of words in the document. In symbols,

\displaystyle  \bar{x} \mapsto \frac{\bar{x}}{\sum_{i=1}^{p} x_{i}}\ \ \ \ \ (1)

Notice that all the entries in the normalised vector are non-negative fractions, which sum to 1. The i-th component is thus the probability that if we pick a word out of the bag at random, it’s the i-th entry in the lexicon.

Cosine ‘distance’ is actually a similarity measure, not a distance:

\displaystyle  d_{\cos}(\bar{x}, \bar{y}) = \frac{\sum_{i} x_{i}y_{i}}{\|\bar{x}\|\|\bar{y}\|} \ \ \ \ \ (2)

It’s the cosine of the angle between the vectors {\bar{x}} and {\bar{y}}.
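A short sketch of both measures (the two bag-of-words vectors are invented; the second is the first doubled, as if the document were pasted in twice):

```python
import math

def euclidean(x, y):
    """Euclidean distance between vectors x and y."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_similarity(x, y):
    """Cosine of the angle between vectors x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

x = [2, 1, 0, 3]
y = [4, 2, 0, 6]   # same word proportions, twice the length

# Euclidean distance separates the two documents...
print(euclidean(x, y))
# ...but cosine similarity is insensitive to document length.
print(cosine_similarity(x, y))  # approximately 1
```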

Examples of PDE


1. PDE Exercises

Let {f: \mathbb{R}^n \rightarrow \mathbb{C}} be given, and let {\tilde{M}f} be the solution of the homogeneous Cauchy problem with {\tilde{M}f(0,x) = 0}, {\frac{\partial \tilde{M}f}{\partial t}(0,x) = f(x)}. Let {g:[0,\infty) \times \mathbb{R}^n \rightarrow \mathbb{C}} be sufficiently often continuously differentiable. Show, without using any explicit formula for {\tilde{M}f}, that the function Qg defined by

\displaystyle Qg(t,x):= \int^{t}_{0} \tilde{M}g_s(t-s,x)ds, \quad \textrm{where } g_s(x):= g(s,x),

is the solution of the Cauchy problem for the inhomogeneous wave equation with right hand side g and zero initial conditions. That is, we want to show that {\Box Qg = g} and {Qg(0,x)= \frac{\partial Qg}{\partial t}(0,x) = 0}. For n = 1 this gives an alternative solution of Example 1, since

\displaystyle \tilde{M}f(t,x) = \frac{1}{2c} \int^{x+ct}_{x-ct}f(\tau)d\tau.

This implies

\displaystyle  Qg(t,x) = \frac{1}{2c} \int^{t}_{0} \int^{x+c(t-s)}_{x-c(t-s)}g(s,\tau)d\tau ds \ \ \ \ \ (1)

We can then write {Qg(0,x) = \int_0^0 \cdots ds = 0}. Differentiating under the integral sign gives

\displaystyle \frac{\partial Qg}{\partial t}(t,x) = \tilde{M} g_s(t-s,x)\Big|_{s=t} + \int^t_0 \frac{\partial}{\partial t} \tilde{M}g_s(t-s,x)ds.

Since {\tilde{M}g_t(0,x) = 0}, the boundary term vanishes, so:

\displaystyle  \frac{\partial Qg}{\partial t}(t,x) = \int^{t}_0 \frac{\partial}{\partial t} \tilde{M}g_s(t-s,x)ds \ \ \ \ \ (2)

In particular, for t=0 we get {\frac{\partial Qg}{\partial t} (0,x) = \int^0_0 \cdots = 0}. It remains to consider the box operator {\Box = \frac{\partial^2}{\partial t^2} - \triangle_x}, which we handle by applying {\frac{\partial}{\partial t}} once more.

Differentiating the parameter dependent integral (2) again,

\displaystyle \frac{\partial^2 Qg}{\partial t^2}(t,x) = \frac{\partial}{\partial t} \tilde{M}g_s (t-s,x)\Big|_{s=t} + \int^{t}_0 \frac{\partial^2}{\partial t^2} \tilde{M}g_s(t-s,x)ds = g(t,x) + \int^t_0 \frac{\partial^2}{\partial t^2} \tilde{M}g_s(t-s,x)ds,

using {\frac{\partial \tilde{M}g_t}{\partial t}(0,x) = g_t(x) = g(t,x)}. Likewise, differentiating under the integral sign,

\displaystyle \triangle_x Qg(t,x) =\int^t_0 \triangle_x \tilde{M}g_s(t-s,x) ds,

and hence

\displaystyle \Box Qg(t,x) = g(t,x) + \int^t_0 \Box \tilde{M}g_s(t-s,x)ds.

The integral vanishes because each {\tilde{M}g_s} solves the homogeneous wave equation. Therefore we can write

\displaystyle  \Box Qg(t,x) = g(t,x)

We only needed the defining properties of {\tilde{M}g_s}; this should be seen as an exercise in parameter dependent integrals.

2. Examples of distributions and their order

Let us recall the criterion: a linear map {T:C^{\infty}_{c}(U) \rightarrow \mathbb{C}} is a distribution {\iff} for every compact {K \subset U} there exist {p(K) \in \mathbb{N}_0} and {C(K) \in [0,\infty)} such that for all {\phi \in C^{\infty}_{c}(U)} with {\textrm{supp}\,\phi \subset K}

\displaystyle |T(\phi)| \leq C(K) \sum_{|\alpha|\leq p(K)}\|D^{\alpha}\phi\|_{\infty},

where the last norm is the supremum norm.

Example: the Dirac distribution {\delta_{x_{0}}}. We have {|\delta_{x_{0}}(\phi)| = |\phi(x_{0})| \leq \|\phi \|_{\infty}} (C=1, p = 0, independent of K).

Example: {f \in L^1_{loc}(U)} with {T_f(\phi) = \int_U f(x) \phi (x) dx}. Then {|T_f(\phi)| = |\int_U f(x) \phi(x) dx| \leq \int_U|f(x)||\phi(x)|dx}.

Assume {K \subset U} is compact and {\textrm{supp}\,\phi \subset K}. Then the last integral equals {\int_K|f(x)||\phi(x)|dx \leq \|\phi\|_\infty \int_K |f(x)|dx}; the integral {\int_K |f(x)|dx} can be taken as C(K), and it is finite (p=0, independent of K).

Example: {T(\phi) = \int_M \phi|_M \cdot \omega}, where {M \subset U} is a k-dimensional oriented submanifold and {\omega} a k-form (a top form on M). There is a trick:

In local coordinates {y_1,\cdots, y_k} on an open set {V\subset M}, with {y_i: V \rightarrow \mathbb{R}}, we have {\omega = g(y_1,\cdots, y_k)dy_1\wedge \cdots \wedge dy_k}. We are in fact using the pullback, but we may sloppily write

\displaystyle \int_V \phi|_M\, \omega = \int_V \phi(y_1, \cdots, y_k)g(y_1,\cdots,y_k)dy_1\cdots dy_k.

The integral {\int_M \phi|_M\, \omega} is then a sum over open sets on which we have coordinates, using a partition of unity. We need a volume form, so we take the modulus of {g}: if we define, in local coordinates of the correct orientation, {|\omega|=|g(y_1,\cdots,y_k)|dy_1 \wedge \cdots \wedge dy_k}, then {|\omega|} is well-defined independently of the coordinates. We can then write

\displaystyle \Big|\int_M \phi|_M \cdot \omega\Big| \leq \int_M \big|\phi|_M\big| \cdot |\omega| = \int_{K \cap M} \big|\phi|_{M}\big|\cdot|\omega| \leq \|\phi\|_\infty \cdot \int_{K \cap M} |\omega|,

where we used {\textrm{supp}\,\phi \subset K}. The integral {\int_{K \cap M} |\omega|} serves as C(K) and is finite, since it is the integral of a smooth form over a compact set. This is indeed a distribution, and clearly of order 0.