# On Obamacare – and the failure of projects.

http://www.arnoldkling.com/blog/the-obamacare-suitsgeeks-divide/
The failure of Obamacare's major software product shows the complexity of designing good software. This was not some simplistic website, but a combination of technically demanding backend systems and intense negotiation with insurance brokerages. It is a simplistic dichotomy to say that MBA types don't speak the same language as software engineers. Yet the gap is real and difficult to bridge, and unless there is some sort of 'agile' involvement of technical people with the business unit, the project will fail. It staggers me how some incredibly bright people have no appreciation of the risks and opportunities in software development, particularly in 'mission-critical' systems, and how bad techies are at communicating in a language that business people understand. This is less a software problem and more a management problem.

# On Bangladesh

Clearly the Bangladeshi people deserve better enforcement of their laws, and clearly this is an incredible human tragedy. But boycotting products from Bangladesh (as some on the left argue) would be horrendous for Bangladesh. Prosperity takes time, and Bangladesh is a very poor country.

This is fundamentally a problem about the news: rural poverty (about 125k children under 5 died last year from poverty in Bangladesh) does not make the news, whereas a horrendous garment factory disaster does. It was fundamentally not a problem about multinationals but a problem about the practice of law. This is further evidence of why corporations are not always the enemy; in this case the culprits were corrupt government and the interests of local owners.

It poses a question: is the best way to help poor regions to invest in companies that will be working or expanding there, as opposed to, say, charity?

# On Climate Change

"The right has a point. Climate Change has taken on quasi-religious characteristics: self-denying lifestyle changes, self-imposed thriftiness, asceticism and moral superiority" – Benjamin Harrison

My friend Ben pretty much nailed it, I think.

# It is for your own good – in defence of paternalism

Classical liberals like myself often feel that freedom is a virtue.

While it is logically sound that we need good reasons for coercion, what about the cognitive-bias research programme, which shows we are really bad at making decisions and dealing with probabilities?

The famous research of Tversky and Kahneman highlighted that, when faced with complicated probabilities, we use heuristics to save on computational effort. One famous example is the 'availability heuristic': we judge how likely an event is by how easily examples of it come to mind.

So what does a classical liberal, and one who adored J.S. Mill while studying Political Philosophy, do when faced with the extensive empirical research presented by the social sciences? It is certainly true that some things are beyond introspection, and beyond the scope of our reason. The world, as many a physicist who has encountered the social sciences knows, is far too complex! We are, frankly, not as competent choosers as Mill would have us believe.

This reminds me of a quote by Franzen:

“It’s all circling around the same problem of personal liberties,” Walter said. “People came to this country for either money or freedom. If you don’t have money, you cling to your freedoms all the more angrily. Even if smoking kills you, even if you can’t afford to feed your kids, even if your kids are getting shot down by maniacs with assault rifles. You may be poor, but the one thing nobody can take away from you is the freedom to fuck up your life whatever way you want to.”
― Jonathan Franzen, Freedom

We regularly see adults make horrendous financial decisions (the Irish housing bubble is testament to that), and many of us have lovers who smoke or consume too much sugar. How are we to respect the notion of freedom and yet, in light of social science research, help people make optimal choices?

I cannot do better than Cass Sunstein, so I shall quote him and include a link to the book review below.

Until now, we have lacked a serious philosophical discussion of whether and how recent behavioral findings undermine Mill’s harm principle and thus open the way toward paternalism. Sarah Conly’s illuminating book Against Autonomy provides such a discussion. Her starting point is that in light of the recent findings, we should be able to agree that Mill was quite wrong about the competence of human beings as choosers. “We are too fat, we are too much in debt, and we save too little for the future.” With that claim in mind, Conly insists that coercion should not be ruled out of bounds. She wants to go far beyond nudges. In her view, the appropriate government response to human errors depends not on high-level abstractions about the value of choice, but on pragmatic judgments about the costs and benefits of paternalistic interventions. Even when there is only harm to self, she thinks that government may and indeed must act paternalistically so long as the benefits justify the costs.

http://www.nybooks.com/articles/archives/2013/mar/07/its-your-own-good/?pagination=false

The bottom line is: freedom isn't always what it's cracked up to be, and Mayor Bloomberg was right with his proposed soda ban to try to protect New Yorkers from themselves.

# On Aaron Swartz

I unfortunately have exams to study for at the moment, but the sad news of the tragic death of Aaron Swartz, internet activist, defender of civil liberties and coder, compelled me to write. This is horrendously sad. I had an intense admiration for Aaron Swartz, especially his recent work on elites. He was brilliant, erudite and a leading light of the internet age. And it is incredibly important that people realize we are all part of the internet age: the web and technology are politically and socially very important.

The world is a poorer place without Aaron, and he certainly made it a better place, which is admirable and commendable for only 26 years. I'm sure many things will not be done without his presence, and he is a heroic reminder to us all. Now I'm going to get back to work on my own intellectual pursuits.

Thank you, Aaron, for everything. Your sad departure is evidence that we need to take mental illness more seriously, but we also need to take the tech community more seriously; the overzealous prosecution by the DOJ in the US of Aaron's downloading of academic documents was wrong.

Aaron RIP, you shall be missed.


# Markov Chains and Monte Carlo Algorithms

1. General state space Markov chains

Most applications of Markov Chain Monte Carlo (MCMC) algorithms are concerned with continuous random variables, i.e. the corresponding Markov chain has a continuous state space S. In this section we will give a brief overview of the theory underlying Markov chains with general state spaces. Although the basic principles are not much different from the discrete case, the study of general state Markov chains involves many more technicalities and subtleties. Though this section is concerned with general state spaces, we will notationally assume that the state space is ${S = \mathbb{R}^{d}}$. We want the definition of a Markov chain to capture the idea of a stochastic process in which, conditionally on the present, the past and the future are independent. In the discrete case we formalised this idea using the conditional probability of ${\{X_t = j\}}$ given different collections of past events.
In a general state space it can be that all events of the type ${\{X_t = j\}}$ have probability 0, as is the case for a process with a continuous state space: such a process spreads the probability so thinly that the probability of hitting any one given state is 0. Thus we have to work with conditional probabilities of sets of states, rather than of individual states.

Definition 1 (Markov chain) . Let X be a stochastic process in discrete time with general state space S. X is called a Markov chain if X satisfies the Markov property

$\displaystyle \mathbb{P}(X_{t+1} \in A|X_0 = x_0, \cdots , X_t = x_t) = \mathbb{P}(X_{t+1} \in A| X_t = x_t) \ \ \ \ \ (1)$

for all measurable sets ${A \subset S}$.

In the following we will assume that the Markov chain is homogeneous, i.e. that the probabilities ${\mathbb{P}(X_{t+1} \in A| X_t = x_t)}$ are independent of t. For the remainder of this section we shall also assume that we can express the probability from definition 1 using a transition kernel ${K: S \times S \rightarrow \mathbb{R}^{+}_{0}}$:

$\displaystyle \mathbb{P}(X_{t+1} \in A| X_t = x_t) = \int_A K(x_t,x_{t+1})dx_{t+1} \ \ \ \ \ (2)$

where the integration is with respect to a suitable dominating measure, for example the Lebesgue measure if ${S = \mathbb{R}^d}$. The transition kernel K(x,y) is thus just the conditional probability density of ${X_{t+1}}$ given ${X_t = x_t}$. For a discrete state space we recover the transition matrix as a special case.

Definition 2 The matrix ${ \mathbf{K} = (k_{ij})_{ij} = \mathbb{P}(X_{t+1} = j|X_{t} = i)}$ is called the transition kernel (or transition matrix) of the homogeneous Markov chain X.

We will see that, together with the initial distribution, which we might write as a vector ${\mathbf{\lambda_{0}} = (\mathbb{P}(X_{0} = i))_{i \in S}}$, the transition kernel ${\mathbf{K}}$ fully specifies the distribution of a homogeneous Markov chain. However, we start by stating two basic properties of the transition kernel K:

We obtain the special case of definition 2 by setting ${K(i,j) = k_{ij}}$, where ${k_{ij}}$ is the (i,j)-th element of the transition matrix ${\mathbf{K}}$. For a discrete state space the dominating measure is the counting measure, so integration just corresponds to summation, i.e. equation 2 is equivalent to

$\displaystyle \mathbb{P}(X_{t+1} \in A|X_{t} = x_t) = \sum_{x_{t+1} \in A} k_{x_{t},x_{t+1}}$

We have for measurable sets ${A \subset S}$ that

$\displaystyle \mathbb{P} (X_{t+m} \in A|X_t = x_t) = \int_A \int_S \cdots \int_S K(x_t,x_{t+1})K(x_{t+1},x_{t+2}) \cdots K(x_{t+m-1},x_{t+m})\,dx_{t+1} \cdots dx_{t+m-1}\,dx_{t+m},$

thus the m-step transition kernel is

$\displaystyle K^{(m)}(x_0,x_m) = \int_S \cdots \int_S K(x_0,x_1)\cdots K(x_{m-1},x_{m})\,dx_{m-1}\cdots dx_1$

The m-step transition kernel allows us to express the m-step transition probabilities more conveniently:

$\displaystyle \mathbb{P} (X_{t+m} \in A|X_t = x_t) = \int_A K^{(m)}(x_t,x_{t+m})\,dx_{t+m}$
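In the discrete case the nested integrals above collapse to sums over intermediate states, so the m-step kernel is simply the m-th matrix power of the transition matrix. A minimal sketch in Python (the 3-state transition matrix is a made-up example):

```python
# Hypothetical 3-state transition matrix (rows sum to 1).
K = [[0.5, 0.3, 0.2],
     [0.1, 0.6, 0.3],
     [0.2, 0.2, 0.6]]

def matmul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def m_step_kernel(K, m):
    """K^(m): sum over all intermediate states, the discrete
    analogue of the nested integrals above."""
    result = K
    for _ in range(m - 1):
        result = matmul(result, K)
    return result

K4 = m_step_kernel(K, 4)
# Each row of K^(4) is still a probability distribution.
for row in K4:
    assert abs(sum(row) - 1.0) < 1e-12
```

Each entry of `K4` is the probability of moving from state i to state j in exactly 4 steps, summed over all intermediate paths.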

Let us consider an example.

Example 1 (Gaussian random walk) Consider the random walk on ${\mathbb{R}}$ defined by $X_{t+1} = X_{t} + E_{t}$ where ${E_{t}\sim N(0,1)}$, i.e. the probability density function of ${E_t}$ is ${\phi(z) = \frac{1}{\sqrt{2\pi}}\exp(-\frac{z^{2}}{2})}$. This is equivalent to assuming that ${X_{t+1}|X_t = x_t \sim N(x_t, 1)}$. We also assume that ${E_t}$ is independent of ${X_0,E_1,\cdots,E_{t-1}}$. Suppose that ${X_0 \sim N(0,1)}$. In contrast to the random walk on ${\mathbb{Z}}$, the state space of the Gaussian random walk is ${\mathbb{R}}$. We have that $\mathbb{P}(X_{t+1} \in A|X_t = x_t,\cdots, X_0 = x_0) = \mathbb{P}(E_t \in A - x_t| X_t =x_t, \cdots , X_0 = x_0)$ ${=\mathbb{P}(E_t \in A - x_t) = \mathbb{P}(X_{t+1} \in A| X_t = x_t),}$ where ${A - x_t = \{a - x_t : a \in A\}}$. Thus X is indeed a Markov chain. Furthermore we have that $\mathbb{P}(X_{t+1} \in A| X_t = x_t) = \mathbb{P}(E_t \in A - x_t) = \int_A \phi(x_{t+1} - x_{t})dx_{t+1}$ Thus the transition kernel (which is nothing other than the conditional density of ${X_{t+1}| X_t = x_t}$) is $K(x_t,x_{t+1}) = \phi(x_{t+1} -x_{t})$ To find the m-step transition kernel we could use equation 2. However, the resulting integral is difficult to compute. Rather, we exploit the fact that $X_{t+m} = X_{t} + E_{t} + \cdots + E_{t+m-1},$ where ${E_{t} + \cdots + E_{t+m-1} \sim N(0,m)}$ as a sum of m independent N(0,1) variables, thus ${X_{t+m}|X_{t} = x_t \sim N(x_{t},m)}$ and $\mathbb{P}(X_{t+m} \in A|X_t = x_t) = \mathbb{P}(X_{t+m} - X_{t}\in A - x_{t}) = \int_A \frac{1}{\sqrt{m}}\phi \left(\frac{x_{t+m} -x_{t}}{\sqrt{m}} \right)dx_{t+m}$ Comparing this with equation 2 we can identify $K^{(m)}(x_t, x_{t+m}) = \frac{1}{\sqrt{m}}\phi \left(\frac{x_{t+m} -x_{t}}{\sqrt{m}}\right)$ as the m-step transition kernel. \qed
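The m-step kernel just derived says that ${X_{t+m} | X_t = x_t \sim N(x_t, m)}$, which is easy to check by simulation. A sketch using only the Python standard library (the values of m and the number of replications are arbitrary choices):

```python
import random

random.seed(42)

def walk_increment(m):
    """Sum of m i.i.d. N(0,1) innovations: X_{t+m} - X_t."""
    return sum(random.gauss(0.0, 1.0) for _ in range(m))

# The m-step kernel says the increment over m steps should have
# mean 0 and variance m.
m, n_reps = 5, 20000
increments = [walk_increment(m) for _ in range(n_reps)]
mean = sum(increments) / n_reps
var = sum((z - mean) ** 2 for z in increments) / n_reps
print(round(mean, 3), round(var, 3))  # mean near 0, variance near 5
```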

We need the powerful probabilistic notion of irreducibility.

Definition 3 (Irreducibility) Given a distribution ${\mu}$ on the states S, a Markov chain is said to be ${\mu}$-irreducible if for all sets A with ${\mu(A)> 0}$ and for all ${x \in S}$, there exists an ${m \in \mathbb{N}_{0}}$ such that $\mathbb{P}(X_{t+m} \in A| X_t = x) = \int_A K^{(m)} (x,y) dy > 0$ If m = 1 works for all such A and x, the chain is said to be strongly ${\mu}$-irreducible.

Example 2 In the example above we had that ${X_{t+1} | X_t = x_t \sim N(x_{t},1)}$. As the support of the Gaussian distribution is ${\mathbb{R}}$, we have that ${\mathbb{P}(X_{t+1} \in A | X_t =x_t) > 0}$ for all sets A of non-zero Lebesgue measure. Thus the chain is strongly irreducible with respect to any continuous distribution. \qed

We can extend the concepts of periodicity, recurrence, and transience from the discrete case to the general case. However, this requires additional technical concepts such as atoms and small sets; see Robert and Casella (2004) for a rigorous treatment. Let us first recall recurrence for a discrete Markov chain.

Definition 4 A discrete Markov chain is recurrent if all states are (on average) visited infinitely often.

For more general state spaces, we need to consider the number of visits to a set of states rather than to single states. Let ${V_A = \sum^{+\infty}_{t=0}\mathbf{1}_{\{X_t \in A\}}}$ be the number of visits the chain makes to states in the set ${A \subset S}$. We then define the expected number of visits in ${A \subset S}$, when we start the chain in ${x \in S}$: $\mathbb{E}(V_{A}|X_0 = x) = \mathbb{E}\left(\sum^{+\infty}_{t=0} \mathbf{1}_{\{X_t \in A\}}\Big|X_0 = x\right) = \sum_{t=0}^{+\infty} \int_A K^{(t)}(x,y)dy$ This allows us to define recurrence for general state spaces. We start by defining recurrence of sets before extending the definition to recurrence of an entire Markov chain.

Definition 5 (a) A set ${A \subset S}$ is said to be recurrent for a Markov chain X if for all ${x\;\in\; A}$ $\mathbb{E}(V_A|X_0 = x) = +\infty.$ (b) A Markov chain is said to be recurrent if

• The chain is ${\mu}$-irreducible for some distribution ${\mu}$.
• Every measurable set ${A \;\subset\;S}$ with ${\mu(A) > 0}$ is recurrent.

According to the definition a set is recurrent if on average it is visited infinitely often. This is already the case if there is a non-zero probability of visiting the set infinitely often. A stronger concept of recurrence can be obtained if we require that the set is visited infinitely often with probability 1. This type of recurrence is referred to as Harris recurrence.

Definition 6 (Harris Recurrence) . (a) A set ${A \;\subset\; S}$ is said to be Harris-recurrent for a Markov chain X if for all ${x\;\in \; A}$ ${\mathbb{P} \left(V_{A} = +\infty | X_0 = x \right) = 1,}$ (b) A Markov chain is said to be Harris-recurrent, if

• The chain is ${\mu}$-irreducible for some distribution ${\mu}$.
• Every measurable set ${A \; \subset \; S}$ with ${\mu(A)}$ ${>}$ 0 is Harris-recurrent.

It is easy to see that Harris recurrence implies recurrence; for discrete state spaces the two concepts are equivalent. Checking recurrence or Harris recurrence can be very difficult. We will state (without proof) a proposition which establishes that if a Markov chain is irreducible and has a unique invariant distribution, then the chain is also recurrent.
However, before we can state this proposition, we need to define invariant distributions for general state spaces.

Definition 7 (Invariant Distribution). A distribution ${\mu}$ with density function ${f_{\mu}}$ is said to be the invariant distribution of a Markov chain X with transition kernel K if $f_{\mu} (y) = \int_{S} f_{\mu} (x) K(x,y) dx$ for almost all ${y \;\in\;S.}$

Proposition 8 Suppose that X is a ${\mu}$-irreducible Markov chain having ${\mu}$ as its unique invariant distribution. Then X is also recurrent.

Checking the invariance condition of definition 7 requires computing an integral, which can be cumbersome. A simpler alternative is the (sufficient, but not necessary) condition of detailed balance.

Definition 9 (Detailed balance) . A transition kernel K is said to be in detailed balance with a distribution ${\mu}$ with density ${f_{\mu}}$ if for almost all x,y ${\in\;S}$

$f_{\mu}(x)K(x,y) = f_{\mu}(y)K(y,x).$ In complete analogy with theorem 1.22 one can also show in the general case that if the transition kernel of a Markov chain is in detailed balance with a distribution ${\mu}$, then the chain is time-reversible and has ${\mu}$ as its invariant distribution.
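For a discrete chain, detailed balance is easy to verify numerically. A sketch with a made-up birth-death transition matrix on three states (birth-death chains are always reversible, so detailed balance holds with their stationary distribution):

```python
# Assumed birth-death chain on {0, 1, 2}: only moves to adjacent
# states (or stays put) are allowed.
K = [[0.7, 0.3, 0.0],
     [0.2, 0.5, 0.3],
     [0.0, 0.4, 0.6]]

# Solve for the stationary distribution using detailed balance:
# mu(x) K(x, x+1) = mu(x+1) K(x+1, x).
mu = [1.0]
mu.append(mu[0] * K[0][1] / K[1][0])  # = 1.5
mu.append(mu[1] * K[1][2] / K[2][1])  # = 1.125
total = sum(mu)
mu = [m / total for m in mu]

# Detailed balance: mu(x) K(x,y) == mu(y) K(y,x) for all x, y.
for x in range(3):
    for y in range(3):
        assert abs(mu[x] * K[x][y] - mu[y] * K[y][x]) < 1e-12

# Detailed balance implies invariance: mu' K = mu'.
mu_K = [sum(mu[x] * K[x][y] for x in range(3)) for y in range(3)]
assert all(abs(a - b) < 1e-12 for a, b in zip(mu_K, mu))
```

The final assertion illustrates the claim in the text: a distribution in detailed balance with K is automatically an invariant distribution.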

1.1. Ergodic theorems

In this section we will study the question of whether we can use observations from a Markov chain to make inferences about its invariant distribution. We will see that under some regularity conditions it is even enough to follow a single sample path of the Markov chain.
For independently identically distributed data the Law of Large Numbers is used to justify estimating the expected value of a functional using empirical averages. A similar result can be obtained for Markov chains. This result is the reason why MCMC methods work: it allows us to set up simulation algorithms to generate a Markov chain, whose sample path we can then use for estimating various quantities of interest.

Theorem 10 (Ergodic Theorem) . Let X be a ${\mu}$-irreducible, recurrent ${\mathbb{R}^{d}}$-valued Markov chain with invariant distribution ${\mu}$. Then we have for any integrable function ${g: \mathbb{R}^{d} \rightarrow \mathbb{R}}$ that with probability 1 $\lim_{t \rightarrow \infty} \frac{1}{t} \sum^{t}_{i=1} g(X_{i}) = \mathbb{E}_{\mu}(g(X)) = \int_S g(x)f_{\mu}(x)dx$ for almost every starting value ${X_0 = x}$. If X is Harris-recurrent this holds for every starting value x.

Proof: For a proof see (Roberts and Rosenthal, 2004, fact 5) $\Box$
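The theorem can be illustrated with a simple simulation: for an assumed irreducible two-state chain, the ergodic average of ${g}$ along a single sample path approaches ${\mathbb{E}_{\mu}(g(X))}$. A sketch (the transition matrix is an illustrative choice, not from the text):

```python
import random

random.seed(1)

# Irreducible two-state chain on {0, 1}; solving mu' K = mu'
# gives the invariant distribution mu = (2/3, 1/3).
K = [[0.9, 0.1],
     [0.2, 0.8]]

def step(x):
    """Draw the next state from row x of K."""
    return 0 if random.random() < K[x][0] else 1

# Ergodic average of g(x) = x along one sample path;
# by the ergodic theorem it should approach E_mu(g) = mu(1) = 1/3.
x, total, t = 0, 0, 200_000
for _ in range(t):
    x = step(x)
    total += x
print(round(total / t, 3))  # close to 1/3
```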

We conclude with an example illustrating that the conditions of irreducibility and recurrence in theorem 10 are necessary. These conditions ensure that the chain is permanently exploring the entire state space, which is necessary for the convergence of ergodic averages.

Example 3 Consider a discrete chain with two states ${S = \left\lbrace 1,2 \right\rbrace}$ and transition matrix $\mathbf{K} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$ Any distribution ${\mu}$ on ${\{1,2\}}$ is an invariant distribution, as $\mathbf{\mu ' K = \mu ' I = \mu '}$ for all ${\mu}$. However, the chain is not irreducible (or recurrent): we cannot get from state 1 to state 2 and vice versa. If the initial distribution is ${\mu = (\alpha, 1 - \alpha)'}$ with ${\alpha \in [0,1]}$ then for every ${t \in \mathbb{N}_{0}}$ we have that $\mathbb{P}(X_t = 1) = \alpha \;\;\;\;\;\;\;\;\; \mathbb{P}(X_t = 2) = 1 - \alpha$ By observing one sample path (which is either 1,1,1,… or 2,2,2,…) we can make no inference about the distribution of ${X_t}$ or the parameter ${\alpha}$. The reason is that the chain fails to explore the whole state space: it never switches between states 1 and 2. In order to estimate the parameter ${\alpha}$ we would need to look at more than one sample path. \qed

2. Monte Carlo Methods

2.1. What are Monte Carlo Methods?

This collection of lectures is concerned with Monte Carlo methods, which are sometimes referred to as stochastic simulation. Examples of Monte Carlo methods include stochastic integration, where we use a simulation-based method to evaluate an integral, Monte Carlo tests, where we resort to simulation in order to compute the p-value, and Markov Chain Monte Carlo (MCMC), where we construct a Markov chain which (hopefully) converges to the distribution of interest.
A formal definition of Monte Carlo methods was given (amongst others) by Halton (1970) (Halton, J.H. A retrospective and prospective survey of the Monte Carlo method. SIAM Review, 12, 1-63). He defined a Monte Carlo method as “representing the solution of a problem as a parameter of a hypothetical population, and using a random sequence of numbers to construct a sample of the population, from which statistical estimates of the parameter can be obtained.”

2.2. Introductory examples

Example 4 (A raindrop experiment for computing ${\pi}$) Assume we want to compute a Monte Carlo estimate of ${\pi}$ using a simple experiment. Assume that we could produce “uniform rain” on the square ${[-1,1] \times [-1,1]}$, such that the probability of a raindrop falling into a region ${\mathcal{R} \;\subset\;[-1,1]^{2}}$ is proportional to the area of ${\mathcal{R}}$, but independent of the position of ${\mathcal{R}}$. It is easy to see that this is the case iff the two coordinates X, Y are i.i.d. realisations of the uniform distribution on the interval ${[-1,1]}$ (in short ${X,Y \overset{i.i.d.}{\sim} U[-1,1]}$). Now consider the probability that a raindrop falls into the unit circle. It is $\mathbb{P}(\text{drop within circle}) = \frac{\text{area of the unit circle}}{\text{area of the square}} = \frac{\int \int _{\{x^2 + y^2 \leq 1\}} 1 \,dxdy}{\int \int _{\{-1 \leq x,y \leq 1\} }1 \,dxdy} = \frac{\pi}{2\cdot 2} = \frac{\pi}{4}$ In other words, $\pi = 4\cdot\mathbb{P}(\text{drop within circle}),$ i.e. we have found a way of expressing the desired quantity ${\pi}$ as a function of a probability. We can estimate this probability using our raindrop experiment. If we observe n raindrops, then the number of raindrops Z that fall inside the circle is a binomial random variable: $Z \sim B(n,p) \;\;\;\;\;\;\; \text{with } p = \mathbb{P}(\text{drop within circle}).$ Thus we can estimate p by its maximum-likelihood estimate $\hat{p} = \frac{Z}{n}$ and we can estimate ${\pi}$ by $\hat{\pi} = 4\hat{p} = 4\cdot\frac{Z}{n}.$ Assume we have observed that 77 of the 100 raindrops were inside the circle. In this case our estimate of ${\pi}$ is $\hat{\pi} = \frac{4 \cdot 77}{100} = 3.08,$ which is relatively poor.
However, the law of large numbers guarantees that our estimate ${\hat{\pi}}$ converges almost surely to ${\pi}$: as n increases, our estimate improves. We can assess the quality of our estimate by computing a confidence interval for ${\pi}$. As we have ${Z \sim B(100,p)}$ and ${\hat{p} = \frac{Z}{n}}$, we use the approximation ${Z \sim N(100p,100p(1-p))}$. Hence ${\hat{p} \sim N(p,p(1-p)/100)}$, and we can obtain a 95${\%}$ confidence interval for p using this normal approximation: $\left[ 0.77 - 1.96\cdot \sqrt{\frac{0.77\cdot(1-0.77)}{100}},\; 0.77 + 1.96\cdot \sqrt{\frac{0.77\cdot(1-0.77)}{100}}\right] = [0.6875,0.8525].$ As our estimate of ${\pi}$ is four times the estimate of ${p}$, we also obtain a confidence interval for ${\pi}$: $[2.750,3.410]$

Historically, the main drawback of Monte Carlo methods was that they used to be expensive to carry out. Physically random experiments (for example an experiment to examine Buffon's needle) were difficult to perform, and so was the numerical processing of their results. This changed fundamentally with the advent of the digital computer. Amongst the first to realize this potential were John von Neumann and Stanislaw Ulam. For any Monte Carlo simulation we need to be able to reproduce randomness with a deterministic computer algorithm. Strictly speaking this is a paradox, but a great deal of work has been done on pseudo-random number generation; see ${(?RNGkind)}$ in GNU R for the generators available there.
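The raindrop experiment above is straightforward to simulate on a computer. A sketch using only the Python standard library (the sample size of 100,000 is an arbitrary choice); it reproduces both the estimator ${\hat{\pi} = 4\hat{p}}$ and the normal-approximation confidence interval:

```python
import math
import random

random.seed(0)

def estimate_pi(n):
    """Raindrop experiment: n uniform drops on [-1,1]^2."""
    z = sum(1 for _ in range(n)
            if random.uniform(-1, 1) ** 2 + random.uniform(-1, 1) ** 2 <= 1)
    p_hat = z / n                      # ML estimate of P(drop in circle)
    half_width = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)
    # Scale the 95% normal-approximation CI for p by 4 to get one for pi.
    return 4 * p_hat, (4 * (p_hat - half_width), 4 * (p_hat + half_width))

pi_hat, (lo, hi) = estimate_pi(100_000)
print(pi_hat, (lo, hi))
```

With 100,000 drops the interval is far narrower than the one computed above from 100 drops, illustrating the ${1/\sqrt{n}}$ convergence of Monte Carlo estimates.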
