Interview with a Data Scientist: Brad Klingenberg



Brad Klingenberg is the Director of Styling Algorithms at Stitch Fix in San Francisco. His team uses data and algorithms to improve the selection of merchandise sent to clients. Prior to joining Stitch Fix Brad worked with data and predictive analytics at financial and technology companies. He studied applied mathematics at the University of Colorado at Boulder and earned his PhD in Statistics at Stanford University in 2012.


1. What project have you worked on do you wish you could go back to, and do better?


Nearly everything! A common theme would be not taking the framing of a problem for granted. Even seemingly basic questions like how to measure success can have subtleties. As a concrete example, I work at Stitch Fix, an online personal styling service for women. One of the problems that we study is predicting the probability that a client will love an item that we select and send to her. I have definitely tricked myself in the past by trying to optimize a measure of prediction error like AUC.

This is trickier than it seems because there are some sources of variance that are not useful for making recommendations. For example, if I can predict the marginal probability that a given client will love any item then that model may give me a great AUC when making predictions over many clients, because some clients may be more likely to love things than others and the model will capture this. But if the model has no other information it will be useless for making recommendations because it doesn’t even depend on the item. Despite its AUC, such a model is therefore useless for ranking items for a given client. It is important to think carefully about what you are really measuring.


2. What advice do you have to younger analytics professionals and in particular PhD students in the Sciences and Social Sciences?


Focus on learning the basic tools of applied statistics. It can be tempting to assume that more complicated means better, but you will be well-served by investing time in learning workhorse tools like basic inference, model selection and linear models with their modern extensions. It is very important to be practical. Start with simple things.

Learn enough computer science and software engineering to be able to get things done. Some tools and best practices from engineering, like careful version control, go a long ways. Try to write clean, reusable code. Popular tools in R and Python are great for starting to work with data. Learn about convex optimization so you can fit your own models when you need to – it’s extremely useful to be able to cast statistical estimates as the solution to optimization problems.

Finally, try to get experience framing problems. Talk with colleagues about problems they are solving. What tools did they choose? Why? How should did they measure success? Being comfortable with ambiguity and successfully framing problems is a great way to differentiate yourself. You will get better with experience – try to seek out opportunities.


3. What do you wish you knew earlier about being a data scientist?


I have always had trouble identifying as a data scientist – almost everything I do with data can be considered applied statistics or (very) basic software engineering. When starting my career I was worried that there must be something more to it – surely, there had to be some magic that I was missing. There’s not. There is no magic. A great majority of what an effective data scientist does comes back to the basic elements of looking at data, framing problems, and designing experiments. Very often the most important part is framing problems and choosing a reasonable model so that you can estimate its parameters or make inferences about them.


4. How do you respond when you hear the phrase ‘big data’?


I tend to lose interest. It’s a very over-used phrase. Perhaps more importantly I find it to be a poor proxy for problems that are interesting. It can be true that big data brings engineering challenges, but data science is generally made more interesting by having data with high information content rather than by sheer scale. Having lots of data does not necessarily mean that there are interesting questions to answer or that those answers will be important to your business or application. That said, there are some applications like computer vision where it can be important to have a very large amount of data.


5. What is the most exciting thing about your field?


While “big data” is overhyped, a positive side effect has been an increased awareness of the benefits of learning from data, especially in tech companies. The range of opportunities for data scientists today is very exciting. The abundance of opportunities makes it easier to be picky and to find the problems you are most excited to work on. An important aspect of this is to look in places you might not expect. I work at Stitch Fix, an online personal styling service for women. I never imagined working in women’s apparel, but due to the many interesting problems I get to work on it has been the most exciting work of my career.


6. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough?


As I mentioned previously, it can be helpful to start framing a problem by thinking about how you would measure success. This will often help you figure out what to focus on. You will also seldom go wrong by starting simple. Even if you eventually find that another approach is more effective a simple model can be a hugely helpful benchmark. This will also help you understand how well you can reasonably expect your ultimate approach to perform. In industry, it is not uncommon to find problems where (1) it is just not worth the effort to do more than something simple, or (2) no plausible method will do well enough to be considered successful. Of course, measuring these trade-offs depends on the context of your problem, but a quick pass with a simple model can often help you make an assessment.


7. How do you explain to C-level execs the importance of Data Science? How do you deal with the ‘educated selling’ parts of the job? In particular – how does this differ from sports and industry?


It is usually better if you are not the first to evangelize the use of data. That said, data scientists will be most successful if they put themselves in situations where they have value to offer a business. Not all problems that are statistically interesting are important to a business. If you can deliver insights, products or predictions that have the potential to help the business then people will usually listen. Of course this is most effective when the data scientist clearly articulates the problem they are solving and what its impact will be.

The perceived importance of data science is also a critical aspect of choosing where to work – you should ask yourself if the company values what you will be working on and whether data science can really make it better. If this is the case then things will be much easier.


8. What is the most exciting thing you’ve been working on lately and tell us a bit about it.


I lead the styling algorithms team at Stitch Fix. Among the problems we work on is making recommendations to our stylists, human experts who curate our recommendations for our clients. Making recommendations with humans in the loop is fascinating problem because it introduces an extra layer of feedback – the selections made by our stylists. Combining this feedback with direct feedback from our clients to make better recommendations is an interesting and challenging problem.


9. What is the biggest challenge of leading a data science team?


Hiring and growing a team are constant challenges, not least because there is not much consensus around what data science even is. In my experience a successful data science team needs people with a variety of skills. Hiring people with a command of applied statistics fundamentals is a key element, but having enough engineering experience and domain knowledge can also be important. At Stitch Fix we are fortunate to partner with a very strong data platform team, and this enables us to handle the engineering work that comes with taking on ever more ambitious problems.


Interviews with Data Scientists: David J. Hand


I recently reached out as part of my Data Science interview series to David J. Hand.

David has an impressive biography and has contributed a lot to fraud detection and data mining. His answers are insightful and from a statistical point of view. I feel that these academics have a lot to teach us practicing data scientists.

  1. What project have you worked on do you wish you could go back to, and do better?

I think I always have this feeling about most of the things I have worked on – that, had I been able to spend more time on it, I could have done better. Unfortunately, there are so many things crying out for one’s attention that one has to do the best one can in the time available. Quality of projects probably also has a diminishing returns aspect – spend another day/week/year on a project and you reduce the gap between its current quality and perfection by a half. Which means you never achieve perfection.

  1. What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?

I generally advise PhD students to find a project which interests them, which is solvable or on which significant headway can be made in the time they have available, and which other people (but not too many) care about. That last point means that others will be interested in the results you get, while the qualification means that there are not also thousands of others working on the problem (because that would mean you would probably be pipped to the post).

  1. What do you wish you knew earlier about being a statistician? What do you think industrial data scientists have to learn from this?

I think it is important that people recognise that statistics is not a branch of mathematics. Certainly statistics is a mathematical discipline, but so are engineering, physics, and surveying, and we don’t regard them as parts of mathematics. To be a competent professional statistician one needs to understand the mathematics underlying the tools, but one also needs to understand something about the area in which one is applying those tools. And then there are other aspects: it may be necessary, for example, to use a suboptimal method if this means that others can understand and buy in to what you have done. Industrial data scientists need to recognise the fundamental aim of a data scientist is to solve a problem, and to do this one should adopt the best approach for the job, be it a significance test, a likelihood function, or a Bayesian analysis. Data scientists must be pragmatic, not dogmatic. But I’m sure that most practicing data scientists do recognise this.

  1. How do you respond when you hear the phrase ‘big data’?

Probably a resigned sigh. ‘Big data’ is proclaimed as the answer to humanity’s problems. However, while it’s true that large data sets, a consequence of modern data capture technologies, do hold great promise for interesting and valuable advances, we should not fail to recognise that they also come with considerable technical challenges. The easiest of these lie in the data manipulation aspects of data science (the searching, sorting, and matching of large sets) while the toughest lie in the essentially statistical inferential aspects. The notion that one nowadays has ‘all’ of the data for any particular context is seldom true or relevant. And big data come with the data quality challenges of small data along with new challenges of its own.

  1. What is the most exciting thing about your field?

Where to begin! The eminent statistician John Tukey once said ‘the great thing about statistics is that you get to play in everyone’s back yard’, meaning that statisticians can work in medicine, physics, government, economics, finance, education, and so on. The point is that data are evidence, and to extract meaning, information, and knowledge from data you need statistics. The world truly is the statistician’s oyster.

  1. Do you feel universities will have to adapt to ‘data science’? What do you think will have to be done in say mathematical education to keep up with these trends?

Yes, and you can see that this is happening, with many universities establishing data science courses. Data science is mostly statistics, but with a leavening of relevant parts of computer science – some knowledge of databases, search algorithms, matching methods, parallel processing, and so on.


Professor David J. Hand

Imperial College, London

Bio: David Hand is Senior Research Investigator and Emeritus Professor of Mathematics at Imperial College, London, and Chief Scientific Advisor to Winton Capital Management. He is a Fellow of the British Academy, and a recipient of the Guy Medal of the Royal Statistical Society. He has served (twice) as President of the Royal Statistical Society, and is on the Board of the UK Statistics Authority. He has published 300 scientific papers and 26 books. He has broad research interests in areas including classification, data mining, anomaly detection, and the foundations of statistics. His applications interests include psychology, physics, and the retail credit industry – he and his research group won the 2012 Credit Collections and Risk Award for Contributions to the Credit Industry. He was made OBE for services to research and innovation in 2013.

Data Science as a Process


Hilary Mason one of the shining lights of the world of data science Tweeted recently 

‘Data people: What is the very first thing you do when you get your hands on a new data set?’ 

What I do when I get a new dataset is a recent article on the Simple Statistics blog, is a response to this. 

I’ve been thinking about my own data science process. My academic background is in Physics and Mathematics, so I am influenced by those disciplines. This is a personal blog post, just to document my own Data Science Process. 

0) Try to understand the data set: I must admit there have been projects that I’ve forgotten this during. I’ve been so eager to apply ‘Cool algorithm X’ or ‘Cool model Y’ to a problem that I’ve forgotten that Exploratory Data Analysis – which always strikes me as low tech is extremely valuable. Now I love this period, I plot graphs, I look for trends, I clean out outliers. I try to figure out what sort of distribution the data follows. I try to figure out anything interesting or novel. This often highlights duplicates, or possible duplicates. And I find myself coming back to this period. 

So I have three interesting heuristics to follow which I find work

  • Talk to the Business Analysts or domain experts: Never be so arrogant as a data scientist that you think that you can ignore an experienced Business Analyst. When I was at Amazon I learned countless things from chatting to Supply Chain Analysts or Financial Analysts. Some of these included business logic, or ‘what does that acronym’ mean and some of these things involved an explanation of a process that was new. Remember that data in lots of companies is designed for operations not data analysis. 
  • Import the data into a database: Building a database doesn’t take too long, and often you are given data in some sort of ugly csv or log format. You can’t underestimate the effects of running a few groupby and count statements to see if the data is corrupt. 
  • Examine the heads and tails. In R this is sapply(df, class); head(df), tail(df). You can do similar things in Python – Pandas. This generally gives you a chance to look at outliers or NA’s. 

1) Learn about what kind of elephant you are trying to eat.

One of my old mentors at Amazon told me that all problems at Amazon are huge elephants. At a dynamic and exciting company like that, he was right. So we used to talk about ‘eating the elephant, piece by piece’. I would add a corollary, you need to know what kind of ‘elephant’ you are dealing with. For example what data sources are you dealing with, what is the business process? Are there quirks to this business process that you’ll need a domain specific expert to help you understand it.
Here chatting to business analysts helps as well. Unless I know the domain really well, I find this period takes some time. Sometimes you’ll have stakeholders who are impatient about this period. But you must be frank with them that this period is needed. That this period is valuable for the later on analysis. And if you aren’t allowed to do this, you can just update your linkedin profile and find another job 🙂 

I consider the outliers checks and reading column headings to be part of this as well.

You can learn a lot just by documenting what the column headers are.

The final part of this step is to generate a question about the data. Write this down in a document, with the steps your taking. It doesn’t have to be scientific but this is a good way to keep you focused on the business questions. 

2) Clean the data 

I don’t necessarily enjoy this part. It is painful but necessary, and involves a lot of tedious and sometimes annoying work.

  • Look for encoding issues
  • Test boring hypothesis. Like if you have telecommunication data that num_of_foreign_calls <= total_number_of_calls
  • Sanity checks

3) Plot the data

Now I plot the data and try to understand it. Ggplot and Matplotlib are your friends here – until they aren’t. I am currently investing time in learning about the visualization libraries in Python and R. I suggest you do the same. Yet histograms and simple plots work best here. The purpose of plotting data is to lead to insight. The sexy stuff can come later. 

Scatter plots, density plots, historgrams, etc etc.

    Here is an example of a scatter plot from a logistic regression model.
    Logistic Regression

    Logistic Regression

    4)In step 1 we came up with a hypothesis such as ‘correlation between X and revenue’ answer this simple hypothesis

    Now you should know enough about the data to do an analysis. It could be a simple machine learning model or a regression model. 




Markov Chains and Monte Carlo Algorithms


1. General state space Markov chains

Most applications of Markov Chain Monte Carlo algorithms (MCMC) are concerned with continuous random variables, i.e the corresponding Markov chain has a continuous state space S. In this section we will give a brief overview of the theory underlying Markov chains with general state spaces. Although the basic principles are not much different from the discrete case, the study of general state Markov chainns involves many more technicalities and subtleties. Though this section is concerned with general state spaces we will notationally assume that the state space is { S = \mathbb{R}^{d}} We need a definition of a Markov chain, to be a stochastic process in which, conditionally on the present, the past and the future are independent. In the discrete case we formalised this idea using the conditional probability of {{X_t = j}} given different collections of past events.
In a general state space it can be that all events of the type {{X_t = j}} have probability 0, as it is the case for a process with a continuous state space. A process with a continuous state space spreads the probability so thinly that the probability of hitting one given state is 0 for all states. Thus we have to work with conditional probabilities of sets of states, rather than individual states.

Definition 1 (Markov chain) . Let X be a stochastic process in discrete time with general state space S. X is called a Markov chain if X satisfies the Markov property

\displaystyle  \mathbb{P}(X_{t+1} \in A|X_0 = x_0, \cdots , X_t = x_t) = \mathbb{P}(X_{t+1} \in A| X_t = x_t) \ \ \ \ \ (1)

for all measurable sets {A \subset S}.

In the following we will assume that the Markov chain is homogeneous. i.e. the probabilities {\mathbb{P}} {(X_{t+1} \in A| X_t = x_t)} are independent of t. For the remainder of this section we shall also assume that we can express the probability from definition 1 using a transition kernel {K: S \times S \rightarrow \mathbb{R}^{+}_{0}:}

\displaystyle  \mathbb{P}(X_{t+1} \in A| X_t = x_t) = \int_A K(x_t,x_{t+1})dx_{t+1} \ \ \ \ \ (2)

where the integration is with respect to a suitable dominating measure, i.e. for example with respect to the Lebesgue measure if S = {\mathbb{R}^d}. The transition kernel K(x,y) is thus just the conditional probability density of {X_{t+1}} given {X_t = x_t}. We obtain the special case of the definition of a transition kernel.

Definition 2 The matrix { \mathbf{K} = (k_{ij})_{ij} = \mathbb{P}(X_{t+1} = j|X_{t} = i)} is called the transition kernel (or transition matrix) of the homogeneous Markov chain X.

We will see that together with the initial distribution, whcih we might write as a vector {\mathbf{\lambda_{0}} = (\mathbb{P}(X_{0} = i) _{(i \in S)}} the transition kernel {\mathbf{K}} fully specifies the distribution of a homogeneous Markov chain. However, we start by stating two basic properties of the transition kernel K:

  • The entries of the transition kernel are non-negative (they are probabilities).
  • Each row of the kernel sums to 1, as

    \displaystyle  \sum_{j} k_{ij} = \sum_{j} \mathbb{P}(X_{t+1} = j|X_{t} = i) = \mathbb{P}(X_{t+1} \in S|X_{t} = i) = 1 \ \ \ \ \ (3)

We obtain the special case of definition 1.8 by setting K(i,j) = {k_{ij}}, where {k_{ij}} is the (i,j)-th element of the transition matrix {\mathbb{K}}. For a discrete state space the dominating measure is the counting measure, so integration just corresponds to summation, i.e. equation 3 is equivalent to \mathbb{P}(X_{t+1} \in A|X_{t} = x_t) = \sum_{x_{t+1} \in A} k_{x_{t},x_{t+1}} We have for measurable set {A \subset S} that \mathbb{P} (X_{t+m} \in A|X_t = x_t) = \int_A \int_S \cdots \int_S K(x_t,x_{t+1}K(x_{t+1},x_{t+2}) \cdots K(x_{t+m-1},x_{t+m})dx_{t+1} \cdots dx_{t+m-1}dx_{t+m}, thus the m-step transition kernel is K^{(m)}(x_0,x_m) = \int_S \cdots \int_S K(x_0,x_1)\cdots K(x_{m-1},x_{m})dx_{m-1}\cdots dx_1 The m-step transition kernel allows for expressing the m-step transition probabilities more conveniently: \mathbb{P} (X_{t+m} \in A|X_t = x_t) = \int_A K^{(m)}(x_t,x_{t+m})dx_{t+m}

Let us consider an example.

Example 1 Consider the Gaussian random walk on {\mathbb{R}}. Consider the random walk on {\mathbb{R}} defined by X_{t+1} = X_{t} + E_{t} where {E_{t}\cong N(0,1)}, i.e. the probability density function of {E_t} is {\phi(z) = \frac{1}{\sqrt{2\pi}}exp(-\frac{z^{2}}{2})}. This is equivalent to assuming that {X_{t+1}|X_t = x_t \cong N(x_t, 1)} We also assume that {E_t} is independent of {X_0,E_1,\cdots,E_{t-1}}. Suppose that {X_0 \cong N(0,1)}. In contrast to the random walk on {\mathbb{Z}} the state space of the Gaussian random walk is {\mathbb{R}}. We have that \mathbb{P}(X_{t+1} \in A|X_t = x_t,\cdots X_0 = x_0) = \mathbb{P}(E_t \in A - x_t| X_t =x_t, \cdots , X_0 = x_0) {=\mathbb{P}(E_t \in A - x_t) = \mathbb{P}(X_{t+1} \in A| X_t = x_t),} where A – {x_t = {a - x_t : a \in A}}. Thus X is indeed a Markov chain. Furthermore we have that \mathbb{P}(X_{t+1} \in A| X_t = x_t) = \mathbb{P}(E_t \in A - x_t) = \int_A \phi(x_{t+1} - x_{t})dx_{t+1} Thus the transition kernel (which is nothing other than the conditional density of {X_{t+1}| X_t = x_t}) is thus K(x_t,x_{t+1}) = \phi(x_{t+1} -x_{t}) To find the m-step transition kernel we could use equation 2. However, the resulting integral is difficult to compute. Rather we exploit the fact that X_{t+m} = X_{t} + \boxed{E_{t} + \cdots + E_{t+m-1}}, where the boxed formula is approximately { N(0,m)} thus we can write {X_{t+m}|X_{t} ~ N(x_{t},m)} \mathbb{P}(X_{t+m} \in A|X_t = x_t) = \mathbb{P}(X_{t+m} - X_{t}\in A - x_{t}) = \int_A \frac{1}{\sqrt{m}}\phi (\frac{x_{t+m} -x_{t}}{\sqrt{m}} )dx_{t+m} Comparing this with 2 we can identify K^{(m)}(x_t, x_{t+m}) = \frac{1}{\sqrt{m}}\phi (\frac{x_{t+m} -x_{t}}{\sqrt{m}} as m-step transition kernel \qed

We need the powerful probabilistic notion of irreducibility.

Definition 3 (Irreducibility) Given a distribution {\mu} on the states S, a Markov chain is said to be {\mu}-irreducible if for all sets A with {\mu(A)> 0} and for all {x \in S}, there exists an {m} {\in \mathbb{N}_{0}} such that \mathbb{P}(X_{t+m} \in A| X_t = x) = \int_A K^{(m)} (x,y) dy > 0 If the number of steps m=1 then for all A, then the chain is said to be strongly {\mu}-irreducible.

Example 2 In the example above we had that {X_{t+1} | X_t = x_t \cong N(x_{t},1).} As the range of the Gaussian distribution is {\mathbb{R}}, we have that {\mathbb{P}(X_{t+1} \in A | X_t =x_t)} > 0 for all sets A of non-zero Lebegue measure. Thus the chain is strongly irreducible with the respect to any continuous distribution. \qed

We can extend the concepts of periodicity, recurrence, and transience from the discrete case to the general case. However this requires additional technical concepts like atoms and small sets one can see ‘Robert and Casella, 2004’ for a rigorous treatment of these concepts. Let us define a recurrent discrete Markov chain.

Definition 4 A discrete Markov chain is recurrent, if all states (on average) are visited inifinitely often.

For more general state spaces, we need to consider the number of visits to a set of states rather than single states. Let {V_A = \sum^{+\infty}_{t=0}\mathbb{\oldstylenums{1}}_{{X_t \in A}}} be the number of visits the chain makes to states in the set {A \subset S}. We then define the expected number of visits in {A \subset S}, when we start the chain in {x \in S}: \mathbb{E}(V_{A}|X_0 = x) = \mathbb{E}(\sum^{+\infty}_{t=0} \mathbb{\oldstylenums{1}}_{{X_t \in A}}|X_0 = x) = \sum_{t=0}^{+\infty} \int_A K^{(t)}(x,y)dy This allows us to define recurrence for general state spaces. We start with defining recurrence of sets before extending the definition of recurrence of an entire Markov chain.

Definition 5 (a) A set A {\subset S} is said to be recurrent for a Markov chain X if for all {x\;\in\; A} \mathbb{E}(V_A|X_0 = x) = +\infty, (b) A Markov chain to be recurrent, if

  • The chain is {\mu}-irreducible for some distribution {\mu}.
  • Every measurable set {A \;\subset\;S} with {\mu(A) > 0} is recurrent.

According to the definition a set is recurrent if on average it is visited infinitely often. This is already the case if there is a non-zero probability of visiting the set infinitely often. A stronger concept of recurrence can be obtained if we require that the set is visited infinitely often with probability 1. This type of recurrence is referred to as Harris recurrence.

Definition 6 (Harris Recurrence) . (a) A set {A \;\subset\; S} is said to be Harris-recurrent for a Markov chain X if for all {x\;\in \; A} {\mathbb{P} \left(V_{A} = +\infty | X_0 = x \right) = 1,} (b) A Markov chain is said to be Harris-recurrent, if

  • The chain is {\mu}-irreducible for some distribution {\mu}.
  • Every measurable set {A \; \subset \; S} with {\mu(A)} {>} 0 is Harris-recurrent.

It is easy to see that Harris-recurrence implies recurrence. For discrete state spaces the two concepts are equivalent. Checking recurrence or Harris recurrence can be very difficult. We will state (without) proof a proposition which establishes that if a Markov chain is irreducible and has a unique invariant distribution, then the chain is also recurrent.
However, before, we can state this proposition, we need to define invariant distributions for general state spaces.

Definition 7 (Invariant Distribution). A distribution {\mu} with density function {f_{\mu}} is said to be the invariant distribution of a Markov chain X with transition kernel K if f_{\mu} (y) = \int_{S} f_{\mu} (x) K(x,y) dx for almost all {y \;\in\;S.}

Proposition 8 Suppose that X is a {\mu}-irreducible Markov chain having {\mu} as unique invariant distibution. Then X is also recurrent.

Checking the invariance condition of definition7 requires computing an integral, but this can be cumbersome, so an alternative condition is the simpler (sufficient but not necessary) condition of detailed balance.

Definition 9 (Detailed balance) . A transition kernel K is said to be in detailed balance with a distribution {\mu} with denisity {f_{\mu}} if for almost all x,y {\in\;S}

f_{\mu}(x)K(x,y) = f_{\mu}(y)K(y,x). In complete analogy with theorem 1.22 one can also show in the general case that if the transition kernel of a Markov chain is in detailed balance with a distribution {\mu}, then the chain is time-reversible and has {\mu} as its invariant distribution.

1.1. Ergodic theorems

In this section we will study the question of whether we can use observations from a Markov chain to make inferences about its invariant distribution. We will see that under some regularity conditions it is even enough to follow a single sample path of the Markov chain.
For independently identically distributed data the Law of Large Numbers is used to justify estimating the expected value of a functional using empirical averages. A similar result can be obtained for Markov chains. This result is the reason why MCMC methods work: it allows us to set up simulation algorithms to generate a Markov chain, whose sample path we can then use for estimating various quantities of interest.

Theorem 10 (Ergodic Theorem) . Let X be a {\mu}-irreducible, recurrent {\mathbb{R}^{d}}-valued Markov chain with invariant distribution {\mu}. Then we have for any integrable function {g: \mathbb{R}^{d} \rightarrow \mathbb{R}} that with probability 1 lim_{t \rightarrow \infty} \frac{1}{t} \sum^{t}_{i=1} g(X_{i}) \rightarrow \mathbb{E}){\mu}(g(X)) = \int_S g(x)f_{\mu}(x)dx for almost every starting value {X_0 = x}. If X is Harris-recurrent this holds for every starting value x.

Proof: For a proof see (Roberts and Rosenthal, 2004, fact 5) \Box

We conclude with an example that illustates that the condition of irreducibility and recurrence are necessary in theorem 10. These conditions ensure that the chain is permamently exploring the entire state space, which is a necessary condition for the convergence of ergodic averages.

Example 3 Consider a discrete chain with two states {S = \left\lbrace 1,2 \right\rbrace} and transition matrix Any distribution {\mu} on {{1,2}} is an invariant distribution, as \mathbf{\mu ' K = \mu ' I = \mu '} for all {\mu}. However the chain is not irreducible (or recurrent): we cannot get from state 1 to state 2 and vice versa. If the inital distribution is {\mu = (\alpha, 1 - \alpha)'} with {\alpha \in [0,1]} then for every {t \in \mathbb{N}_{0}} we have that \mathbb{P}(X_t = 1) = \alpha \;\;\;\;\;\;\;\;\; \mathbb{P}(X_t = 2) = 1 - \alpha By observing one sample path (which is either 1,1,1,… or 2,2,2,…) we can make no inference about the distribution of {X_t} or the parameter {\alpha}. The reason for this is that the chain fails to explore the whole space space. To clarify the chain fails to switch between the states 1 and 2. In order to estimate the parameter {\alpha} we would need to look at more than one sample path. \qed

2. Monte Carlo Methods

2.1. What are Monte Carlo Methods?

This collection of lectures is concerned with Monte Carlo methods, which are sometimes referred to as stochastic simulation. Examples of Monte Carlo methods include stochastic integration, where we use a simulation-based method to evaluate an integral, Monte Carlo tests, where we resort to simulation in order to computer the p-value, and Markov-Chain Monte Carlo (MCMC), where we construct a Markov chain which (hopefully) converges to the distribution of interest.
A formal definition of Monte Carlo methods was given (amongst others) by Halton (1970)\footnote{Halton, J.H. A retrospective and prospective survey of the Monte Carlo method. SIAM Review, 12, 1-63.} He defined a Monte Carlo method as “representing the solution of a problem as a parameter of a hypothetical population, and using a random sequence of numbers to construct a sample of the population, from which statistical estimates of the parameter can be obtained.”

2.2. Introductory examples

Example 4 (A raindrop experiment for computing {\pi}) Assume we want to compute an Monte Carlo estimate of {\pi} using a simple experiment. Assume that we could produce “uniform rain” on the square {[-1,1] \times [-1,1]}, such that the probability of a raindrop falling in to a region {\mathcal{R} \;\subset\;[-1,1]^{2}} is proportional to the area of {\mathcal{R},} but independent of the position of {\mathcal{R}}. It is easy to see that this is the case iff the two coordinates X,Y are i.i.d. realisations of uniform distribution on the interval {[-1,1]}(in short {X,Y i.i.d\sim \cup[-1,1]).} Now consider the probability that a raindrop falls into the unit circle. It is \mathbb{P}(drop\; within\; the\; circle) = \frac{area\; of \;the\; unit\; circle}{area\; of\; the \;square} = \frac{\int \int _{{x^2 + y^2 \leq 1}} 1 dxdy}{\int \int _{{-1 \leq x,y \leq 1} }1 dxdy} = \frac{\pi}{2.2} = \frac{\pi}{4} In other words, \pi = 4.\mathbb{P}(drop\;within\;circle), i.e. we found a way of expressing the desired quantity {\pi} as a function of a probability. We can estimate the probability using our raindrop experiment. If we observe n raindrops, then the number of raindrops Z that fall inside the circle is a binomial random variable: Z \sim B(n,p) \;\;\;\;\;\;\; with\;p\;=\;\mathbb{P}(drop\;within\;circle). Thus we can estimate p by its maximum -likelihood estimate \hat{p} = \frac{Z}{n} and we can estimate {\pi} by \hat{\pi} = 4\hat{p} = 4\cdot\frac{Z}{n}. Assume we have observed that 77 of the 100 raindrops were inside the circle. In our case our estimate of {\pi} is \hat{\pi} = \frac{4 \cdot 77}{100} = 3.08 which is relatively poor.
However the law of large numbers guarantees that our estimate {\hat{\pi}} converges almost surely to {\pi}. As n increases, our estimate improves. We can assess the quality of our estimate by computing a confidence interval for {\pi}. As we have {Z \sim B(100,p)} and {\hat{p} = \frac{Z}{n},} we use the approximation that {Z \;\sim\;N(100p,100p(1-p)). } Hence, {\hat{p} \sim } N(p,p(1-p)/100){,} and we can obtain a 95{\%} confidence interval for p using this normal approximation \left[ 0.77 - 1.96\cdot \sqrt{\frac{0.77\cdot(1-0.77)}{100}}, 0.77 1.96\cdot \sqrt{\frac{0.77\cdot(1-0.77)}{100}}\right] = {[0.6875,0.8525]}, As our estimate of {\pi} is four times the estimate of {p}, we now also have a confidence interval for {\pi}: [2.750,3.410] Historically, the main drawback of Monte Carlo methods was that they used to be expensive to carry out. Physically random experiments (for example an experiment to examine ‘Buffon’s Needle’ were difficult to perform and so was the numerical processing of their results. This changed fundamentally with the advent of the digital computer. Amongst the first to realize this potential were John von Neuman and Stanislaw Ulam. For any Monte-Carlo simulation we need to be able to reproduce randomness by a deterministic Computer Algorithm. Clearly this is a philosophical paradox, but lots of work has been done on this, and the statistical language R has a lot of ‘random number generators’ see {(?RNGkind)} in GNU R for further details.

Information Retrieval


Attention conservation notice: 680 words about Information Retrieval, and highly unoriginal.
The following is very much inspired by a course by Cosma Shalizi but I felt it was worth rewriting to get to grips with the concepts. This is the first of what is hopefully a series of posts on ‘Information Retrieval’, and applications of Mathematics to ‘Data Mining’.
1. Textual features

I’d like to introduce the concepts of how features are extracted, and we shall consider these proxies of actual meanings. One classic representation we can use is bag-of-words (BoW). Let us define this representation, this means we list all of the distinct words in the document together with how often each one appears. This is easy to calculate from the text. Vectors One way we could try to code up the bag-of-words is to use vectors. Let each component of the vector correspond to a different word in the total lexicon of our document collection, in a fixed, standardised order. The value of the component would be the number of times the word appears, possibly including zero. We use this vector bag-of-words representation of documents for two big reasons:

  • There is a huge pre-existing technology for vectors: people have worked out, in excruciating detail, how to compare them, compose them, simplify them, etc. Why not exploit that, rather than coming up with stuff from scratch?
  • In practice, it’s proved to work pretty well.

We can store data from a corpus in the form of a matrix. Each row corresponds to a distinct case (or instance instance, unit, subject,…) – here, a document – and each column to a distinct feature. Conventionally, the number of cases is n and the number of features is p. It is no coincidence that this is the same format as the data matrix X in linear regression.

2. Measuring Similarity

Right now, we are interested in saying which documents are similar to each other because we want to do a search by content. But measuring similarity – or equivalently measuring dissimilarity or distance</b> – is fundamental to data mining. Most of what we will do will rely on having a sensible way of saying how similar to each other different objects are, or how close they are in some geometric setting. Getting the right measure of closeness will have a huge impact on our results. \paragraph{} This is where representing the data as vectors comes in so handy. We already know a nice way of saying how far apart two vectors are, the ordinary or Euclidean distance, which we can calculate with the Pythagorean formula:

\displaystyle \|\bar{x} - \bar{y}\| = \sqrt{\sum_{i=1}^{p}(x_i - y_i)^2}

where {x_i}, {y_i} are the {i^{th}} components of {\bar{x}} and {\bar{y}}. Remember that for bag-of-words vectors each distinct word – each entry in the lexicon – is a component or a fecture. We can also use our Linear Algebra skills to calculate the Euclidean norm or Euclidean distance . Of any vector this is {\|\bar{x}\| = \sqrt{\sum_{i=1}^{p}x^{2}_{i}}} so the distance between two vectors is the norm of their distance {\bar{x} - \bar{y}}. Equivalently, the norm of a vector is the distance from it to the origin, {\bar{0}} Obviously, one can just look up a topology textbook and remind oneself of other metrics such as the taxicab metric.

2.1. Normalisation

Just looking at the Euclidean distances between document vectors doesn’t work, at least if the documents are at all different in size. Instead, we need to normalise by document size, so that we can fairly compare short texts with long ones. There are (at least) two ways of doing this.
Document length normalisation Divide the word counts by the total number of words in the document. In symbols,

\displaystyle  \bar{x} \mapsto \frac{\bar{x}}{\sum_{i=1}^{p} x_{i}}\ \ \ \ \ (1)

Notice that all the entries in the normalised vector are non-negative fractions, which sum to 1. The i-th component is thus the probability that if we pick a word out of the bag at random, it’s the i-th entry in the lexicon.

Cosine ‘distance’ is actually a similarity measure, not a distance`:

\displaystyle  d_{cos} \bar{x}, \bar{y} = \frac{\sum_{i} x_{i}y_{i}}{\|\bar{x}\|\|\bar{y}\|} \ \ \ \ \ (2)

It’s the cosine of the angle between the vectors {\bar{x}} and {\bar{y}}.