Tuesday, November 15, 2011

Lecture 12

The last lecture was on Tuesday November 15. I have enjoyed giving this course and I hope you have enjoyed it too.

The notes are now finalized and probably safe for you to download as a final version. I made a change to page 48 around 9am today. If there are any more changes then I will put a remark on the course page. You can now also look at the overhead projector slides that I sometimes used in lectures, such as the ones I used today to summarise four Part II courses that you may like to study next year.

Some students say that the notation is one of the most difficult things about this course. I recommend that you make for yourself a one-page crib sheet of all the notation: 
(X_n)_{n≥0},   I,   Markov(λ, P),   P = (p_ij),   P^(n) = (p_ij^(n)),   h_i^A,   k_i^A,   H_i,   T_i,   V_i,   V_i(n),   f_i,   λ,   π,   γ_i^k,   m_i .
Write a little explanation for yourself as to what each notation means, and how it is used in our theorems about right-hand equations, recurrence/transience, left-hand equations, existence/uniqueness of invariant measure, aperiodicity/periodicity, positive/null recurrence and detailed balance. It should all seem pretty straightforward and memorable once you summarise it on one page and make some notes to place it in context.


Of course I could easily typeset a page like this for you — but I think that you'll learn more, and it will be more memorable for you personally, if you create this crib sheet yourself!


In my discussion of random walk and electrical networks in Section 12.4 I appealed to Rayleigh's Monotonicity Law: "if some resistances of a circuit are increased (decreased) the resistance between any two points of the circuit can only increase (decrease)." A proof of this "obvious" fact can be constructed by (i) proving Thomson's Principle: "Flows determined by Kirchhoff's Laws minimize energy dissipation", and then (ii) showing that Thomson's Principle implies Rayleigh's Monotonicity Law. You can read the details of this in Doyle and Snell, Random Walks and Electric Networks, pages 51–52.
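
If you would like to play with this numerically, here is a small Python sketch (the 4-node network and the numbers are just ones I have made up for illustration). It computes the effective resistance between two nodes from the pseudoinverse of the graph Laplacian, and you can check that increasing any one resistance can only increase it.

import numpy as np

def effective_resistance(C, a, b):
    # C[i][j] is the conductance (1/resistance) of the edge between nodes i and j.
    L = np.diag(C.sum(axis=1)) - C          # graph Laplacian
    Lplus = np.linalg.pinv(L)               # Moore-Penrose pseudoinverse
    e = np.zeros(len(C))
    e[a], e[b] = 1.0, -1.0
    return e @ Lplus @ e                    # R_ab = (e_a - e_b)^T L^+ (e_a - e_b)

# Unit resistors on a square 0-1-2-3-0, plus a diagonal edge 1-3.
C = np.array([[0., 1., 0., 1.],
              [1., 0., 1., 1.],
              [0., 1., 0., 1.],
              [1., 1., 1., 0.]])
print(effective_resistance(C, 0, 1))

# Increase the resistance of edge (1,3) from 1 to 2, i.e. halve its conductance.
C2 = C.copy()
C2[1, 3] = C2[3, 1] = 0.5
print(effective_resistance(C2, 0, 1))       # can only be larger, as Rayleigh says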


In Section 12.2 I mentioned Burke's output theorem (1956), which says that the output process of an M/M/1 queue in equilibrium is a Poisson process with the same rate as the input. In writing "M/M/1" the "M"s mean Markovian (i.e. a Poisson input process of rate λ and i.i.d. exponentially distributed service times with parameter μ, where μ > λ), and the "1" means a single server. In queueing theory this very useful notation is known as Kendall's notation.
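
Here is a quick simulation sketch in Python, if you want to see Burke's theorem at work (the rates λ = 1 and μ = 2 are just values I picked). The queue is started in its equilibrium (geometric) distribution, and the inter-departure times then look like i.i.d. exponentials of rate λ.

import numpy as np

rng = np.random.default_rng(3)
lam, mu, T = 1.0, 2.0, 50_000.0
rho = lam / mu

q = rng.geometric(1 - rho) - 1        # stationary queue length: P(Q = n) = (1 - rho) rho^n, n >= 0
t, departures = 0.0, []
while t < T:
    rate = lam + (mu if q > 0 else 0.0)
    t += rng.exponential(1 / rate)    # time until the next event
    if q > 0 and rng.random() < mu / rate:
        q -= 1                        # a departure
        departures.append(t)
    else:
        q += 1                        # an arrival

gaps = np.diff(departures)
print(gaps.mean(), gaps.std())        # both close to 1/lam = 1 if the output is Poisson of rate lam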


I remarked (just for fun) that queueing is the only common word in the OED with five vowels in a row. Obscure words are ones like "miaoued" (what the cat did).


I once proved a generalization of Burke's output theorem that holds even when the queue has not reached equilibrium (see: The interchangeability of ·/M/1 queues in series, Weber, 1979). Suppose we have two single-server queues in series, which we might write as /M/1 → /M/1. The customers' service times in the first queue are i.i.d. exponentially distributed with parameter λ and in the second queue they are i.i.d. exponentially distributed with parameter μ. On finishing service in the first queue a customer immediately joins the second queue. Suppose the system starts with N customers in the first (upstream) queue and no customers in the second (downstream) queue. My theorem says that all statistics that we might measure about the departure process from the second queue are the same if λ and μ are interchanged. Thus by observing the process of departures from the second queue we cannot figure out which way around the two /M/1 servers are ordered. For example, the time at which we see the first departure leave the second queue has expected value 1/λ + 1/μ (which is symmetric in λ and μ). All other statistics are also symmetric in λ and μ. Burke's theorem is a corollary of this that can be obtained by thinking about N tending to infinity (can you see how?).
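
If you would like to check the symmetry empirically, here is a small Python sketch (N = 5 customers and the rates are just illustrative choices, and tandem_departures is my own made-up helper). Swapping λ and μ leaves the distribution of every departure time from the second queue unchanged.

import numpy as np

rng = np.random.default_rng(1)

def tandem_departures(lam, mu, N):
    # Start with N customers in the first queue, none in the second;
    # return the N departure times from the second queue.
    s1 = rng.exponential(1 / lam, N)   # service times at the first queue
    s2 = rng.exponential(1 / mu, N)    # service times at the second queue
    d1 = np.cumsum(s1)                 # the first queue never empties, so its departures are a cumulative sum
    d2 = np.empty(N)
    free = 0.0                         # time at which the second server next becomes free
    for k in range(N):
        start = max(d1[k], free)       # customer k must have arrived and must find the server free
        d2[k] = start + s2[k]
        free = d2[k]
    return d2

runs = 50_000
a = np.array([tandem_departures(1.0, 3.0, 5) for _ in range(runs)])
b = np.array([tandem_departures(3.0, 1.0, 5) for _ in range(runs)])
print(a[:, 0].mean(), b[:, 0].mean())  # first departure: both close to 1/1 + 1/3
print(a[:, 2].mean(), b[:, 2].mean())  # third departure: again the same for both orderings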

Thursday, November 10, 2011

Lecture 11


I started the lecture by mentioning the second law of thermodynamics, which says that entropy is a nondecreasing function of time.

For a distribution p = (p1, ..., pk) the entropy, H(p) = –∑_i p_i log p_i, is a measure of disorder, or of how surprising, on average, an outcome chosen according to this distribution is. For example, the outcome of the toss of a biased coin is less surprising on average than the outcome of a toss of a fair coin, and this is expressed by the inequality
– p log p – (1–p) log(1–p) ≤ – (1/2) log(1/2) – (1/2) log(1/2).
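
Just to see the numbers, here is a small check of this inequality in Python (taking p = 0.9 for the biased coin, an arbitrary choice).

import numpy as np

def entropy(p, base=2):
    # H(p) = -sum_i p_i log p_i, with 0 log 0 taken as 0.
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -(p * np.log(p)).sum() / np.log(base)

print(entropy([0.9, 0.1]))   # biased coin: about 0.47 bits
print(entropy([0.5, 0.5]))   # fair coin: 1 bit, the maximum for two outcomes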

When the log is taken base 2, H(p) is a lower bound on the average number of bits that would be required to communicate an outcome that is chosen as one of k possible outcomes according to this distribution. The bound can be achieved if and only if every component of p is one of 1/2, 1/4, 1/8, ... . If that is not the case then we might consider taking m i.i.d. samples from this distribution, and then try to communicate these m results optimally as one block. There exists a coding that does this and needs only m·H(p) bits, asymptotically as m tends to infinity.
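
The block-coding idea can be illustrated with a short Python sketch. Huffman coding is not the construction used in the usual proof of this result, but it is optimal for each block, so it gives the flavour: with p = (0.9, 0.1), for which H(p) ≈ 0.47, the expected number of bits per symbol falls towards H(p) as the block length m grows.

import heapq
import itertools
import numpy as np

def huffman_expected_length(probs):
    # Expected codeword length of an optimal (Huffman) binary code for this distribution.
    # Each merge of two subtrees adds one bit to every outcome beneath them, so the
    # expected length is the sum of the merged probabilities over all merges.
    heap = [(p, i) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    next_id = len(probs)
    total = 0.0
    while len(heap) > 1:
        p1, _ = heapq.heappop(heap)
        p2, _ = heapq.heappop(heap)
        total += p1 + p2
        heapq.heappush(heap, (p1 + p2, next_id))
        next_id += 1
    return total

p = [0.9, 0.1]
H = -sum(q * np.log2(q) for q in p)
print("H(p) =", round(H, 4), "bits")
for m in (1, 2, 4, 8):
    block = [float(np.prod(c)) for c in itertools.product(p, repeat=m)]
    print(m, round(huffman_expected_length(block) / m, 4))   # bits per symbol, falling towards H(p)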

I gave an example of a 4-state Markov chain which starts in state 1, and thus has entropy H((1,0,0,0)) = 0. As n increases the distribution given by the first row of P^n tends to (1/4, 1/4, 1/4, 1/4), and the entropy increases monotonically to 2 (using log base 2). The point of this example was to motivate the idea that reversibility is only going to make sense once our Markov chain has reached its equilibrium. Otherwise the process will look different when reversed in time because it will appear that entropy is decreasing.

In fact, I cheated a little here, because it is not always the case that the entropy of p_1^{(n)} = (p_{1j}^{(n)}, j = 1, ..., k) is monotonically increasing in n. This is true if and only if the equilibrium distribution of the Markov chain is the uniform distribution (as it was in my example). Nonetheless the same point is valid. Unless we start the chain in equilibrium we will be able to detect some difference between the process run forward in time and the process run backwards in time.
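
Here is the kind of computation I had in mind, as a Python sketch. The 4×4 matrix below is not the one from the lecture; it is a doubly stochastic example I have made up, so its equilibrium distribution is uniform and the entropy of the first row of P^n climbs monotonically to 2 bits.

import numpy as np

def entropy_bits(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

# A doubly stochastic transition matrix (each row and each column sums to 1).
P = np.array([[0.5, 0.3, 0.1, 0.1],
              [0.1, 0.5, 0.3, 0.1],
              [0.1, 0.1, 0.5, 0.3],
              [0.3, 0.1, 0.1, 0.5]])

row = np.array([1.0, 0.0, 0.0, 0.0])       # start in state 1, entropy 0
for n in range(1, 13):
    row = row @ P
    print(n, round(entropy_bits(row), 4))  # increases towards 2 bits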

There is a Part II course called Coding and Cryptography in which you can learn more about entropy as it relates to efficient communications. Not surprisingly, it also crops up in the Cosmology and Statistical Physics courses.

Following the lecture I made a change in the notes to the proof of Theorem 11.4, and a small correction in Example 11.5.

Tuesday, November 8, 2011

Lecture 10


We only proved the first part of Theorem 10.2 (Ergodic theorem). The second part is a simple corollary of the first part. For details you can look at page 3 in Section 1.10 of James Norris's notes.

The material on the random target lemma and Kemeny's constant is non-examinable, but I have presented it because I think it is fun. It is surprising (don't you think?) that the expected time to reach equilibrium (in the sense of this lemma) is independent of the starting state. (As well as being a co-author with Laurie Snell of the book Finite Markov Chains, John Kemeny was President of Dartmouth College, and one of the inventors of the BASIC programming language.)

Of course there are other ways to think about the time that is required for a Markov chain to reach equilibrium. One is the mixing time, τ(ε), which is defined for ε > 0 as the least time such that
max_i ∑_j | p_ij(n) − π_j | < ε for all n ≥ τ(ε).
This is closely related to the magnitude of the second-largest eigenvalue of P, say λ2. The smaller |λ2| is, the smaller is the mixing time. In fact, one can prove bounds such as
|λ2| log(1/(2ε)) / (2(1−|λ2|)) ≤ τ(ε) ≤ log(1/(π* ε)) / (1−|λ2|)
where π* is the smallest component of π (the invariant distribution).
Other quantities that are interesting are the coupling time (the time for two independent copies of the Markov chain to couple) and the cover time (the time for the Markov chain to visit every state at least once).
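
Here is a rough Python sketch of the relationship between τ(ε) and |λ2| (the 3-state transition matrix is just something I made up; perturb it and watch the two quantities move together).

import numpy as np

P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.3, 0.3, 0.4]])

# Invariant distribution: the left eigenvector of P with eigenvalue 1.
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmax(np.real(vals))])
pi = pi / pi.sum()

lam2 = sorted(np.abs(vals))[-2]          # modulus of the second-largest eigenvalue

# tau(eps): d(n) = max_i sum_j |p_ij(n) - pi_j| is non-increasing in n, so the
# first n with d(n) < eps works for all later n too.
eps = 1e-3
Pn, n = P.copy(), 1
while np.abs(Pn - pi).sum(axis=1).max() >= eps:
    Pn = Pn @ P
    n += 1
print("tau(eps) =", n, "  |lambda_2| =", round(lam2, 4))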

I concluded the lecture with a demonstration of how Engel's probabilistic abacus can be used to calculate the invariant distribution of a finite-state Markov chain in which all p_{ij} are rational numbers. The puzzle is "why does this algorithm always terminate in a finite number of steps?" There is more about this algorithm in Example Sheet 2, #15 and Section 12.5 of the notes. I first heard about this problem from Laurie Snell circa 1975. At that time he said that the question of why it works was still open.

Thursday, November 3, 2011

Lecture 9


Today's lecture was a high point in our course: the proof by coupling that for an ergodic Markov chain P(X_n = i) tends to the equilibrium probability π_i as n tends to infinity.

In response to a good question that a student asked during the lecture I have slightly modified the wording in the notes of the proof of Theorem 9.1, for the part (iii) implies (i).

I mentioned Vincent Doblin (1915–40) [also known as Wolfgang Doeblin], to whom is due the coupling proof of Theorem 9.8. There is a good article about his life, "Revealed: the maths genius of the Maginot line", published in The Telegraph in 2001. Some of Doblin's work was only discovered in summer 2000, having been sealed in an envelope for 60 years.
I quoted J. Michael Steele on coupling:
Coupling is one of the most powerful of the "genuinely probabilistic" techniques. Here by "genuinely probabilistic" we mean something that works directly with random variables rather than with their analytical co-travelers (like distributions, densities, or characteristic functions).
I like the label "analytical co-travelers". The coupling proof should convince you that Probability is not merely a branch of Analysis. The above quote is taken from Mike's blog for his graduate course on Advanced Probability at Wharton (see here). Mike has a fascinating and very entertaining web site that is full of "semi-random rants" and "favorite quotes" that are both funny and wise (such as his "advice to graduate students"). I highly recommend browsing his web site. It was by reading the blogs for his courses that I had the idea of trying something similar myself this year. So if you have been enjoying this course blog, that is partly due to him. Here is a picture of Mike and me with Thomas Bruss at the German Open Conference in Probability and Statistics 2010, in Leipzig.

The playing cards for the magic trick that I did in the lecture were "dealt" with the help of the playing card shuffler at random.org. By the way, there is a recent interview with Persi Diaconis in the October 27 (2011) issue of Nature, entitled "The Mathemagician".

The following is a little sidebar for those of you who like algebra and number theory.

In the proof of Lemma 9.5 we used the fact that if the greatest common divisor of n1,..., nk is 1 then for all sufficiently large n there exist some non-negative integers a1,..., ak such that
n = a1 n1 + ··· + ak nk.     (*)
Proof. Think about the smallest positive integer d that can be expressed as d = b1 n1 + ··· + bk nk for some integers b1,..., bk (which may be negative as well as non-negative). Notice that the remainder of n1 divided by d is also of this form, since it is r = n1 − m (b1 n1 + ··· + bk nk) for m = ⌊n1/d⌋. If d does not divide n1 then 0 < r < d, and d fails to be the smallest positive integer that can be expressed in this form. Thus we must conclude that d divides n1. The same must be true for every other nj, so d is a common divisor of n1,..., nk; and since any common divisor of the nj divides d (d being an integer combination of them), we have d = gcd(n1,...,nk) = 1. So now we know that it is possible to write 1 = b1 n1 + ··· + bk nk, and hence j = j (b1 n1 + ··· + bk nk), for all j = 1,..., n1. These representations may involve negative coefficients, but by adding to each of them a sufficiently large multiple of n1 we can make every coefficient non-negative. Thus for some large N we can write all of N, N+1, N+2,..., N+n1 in the required form (*), and hence we can also express in form (*) all integers N + m n1 + j, where m and j are non-negative integers, i.e. we can do this for all integers n ≥ N. (This is a proof that I cooked up on the basis of my rather limited expertise in algebra. Perhaps one of you knows a quicker or more elegant way to prove this little fact? If so, please let me know.)
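
A brute-force check in Python agrees with this. (The numbers 6, 10, 15 below are an arbitrary example with gcd 1.) Only finitely many integers fail to be representable in form (*); for these particular numbers the check shows the largest failure is 29.

from math import gcd
from functools import reduce

def representable(n, nums):
    # Can n be written as a non-negative integer combination of nums?  (simple dynamic programme)
    ok = [False] * (n + 1)
    ok[0] = True
    for m in range(1, n + 1):
        ok[m] = any(m >= a and ok[m - a] for a in nums)
    return ok[n]

nums = [6, 10, 15]
assert reduce(gcd, nums) == 1
print([n for n in range(1, 200) if not representable(n, nums)])
# the list is finite and ends at 29, so every n >= 30 is representable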

Tuesday, November 1, 2011

Lecture 8


Following the lecture I have made a correction to Theorem 8.4 (iii), which should be "0 < γ_i^k < ∞ for all i". Earlier, I made a correction to page 31, line 4. I have also added the words "under P_k" to the fourth line of the proof of Theorem 8.4. As I commented in lectures, the hypothesis that P is recurrent is used in this proof, since the proof relies on the fact that T_k < ∞. This comment was written at 2pm Tuesday, so re-download the notes if your copy was made before then.

Notice that the final bits of the proofs of both Theorems 8.4 and 8.5 follow from Theorem 8.3. In Theorem 8.4, (iii) follows from Theorem 8.3 once we have proved (i) and (ii). And in Theorem 8.5 the fact that μ = 0 follows from Theorem 8.3 because μ_k = 0. However, in the notes the proofs of Theorems 8.4 and 8.5 are self-contained and repeat the arguments used in Theorem 8.3. (These proofs are taken directly from James Norris's book.)

In today's lecture we proved that if the Markov chain is irreducible and recurrent then an invariant measure exists and is unique (up to a constant multiple) and is essentially γ^k, where γ_i^k is the expected number of hits on i between successive hits on k. The existence of a unique positive left-hand (row) eigenvector is also guaranteed by Perron-Frobenius theory when P is irreducible (see the blog on Lecture 2). This again points up the fact that many results in the theory of Markov chains can be proved either by a probabilistic method or by a matrix-algebraic method.
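
Here is a Monte Carlo sketch in Python of this result at work (the 3-state chain is invented for the purpose). It estimates γ_i^k as the average number of visits to i strictly before the return time T_k, starting from k, and compares it with the invariant measure normalised so that its k-th component is 1.

import numpy as np

rng = np.random.default_rng(2)

P = np.array([[0.2, 0.5, 0.3],
              [0.4, 0.2, 0.4],
              [0.3, 0.5, 0.2]])
k = 0

runs = 50_000
counts = np.zeros(len(P))
for _ in range(runs):
    i = k
    while True:
        counts[i] += 1                       # visits at times 0, 1, ..., T_k - 1
        i = rng.choice(len(P), p=P[i])
        if i == k:
            break
gamma = counts / runs
print(gamma)                                 # gamma_k^k = 1 by construction

# Invariant measure via the left eigenvector of P, scaled so that component k is 1.
vals, vecs = np.linalg.eig(P.T)
mu = np.real(vecs[:, np.argmax(np.real(vals))])
print(mu / mu[k])                            # agrees with gamma up to simulation error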

If the state space is finite, then clearly ∑_i γ_i^k < ∞, so there exists an invariant distribution. If the state space is infinite, ∑_i γ_i^k may be < ∞ or = ∞, and these correspond to the cases of positive and null recurrence, which we will discuss further in Lecture 9.

We looked at a very simple model of Google PageRank. This is just one example of a type of algorithm that has become very important in today's web-based multi-agent systems. Microsoft, Yahoo, and others have proprietary algorithms for their search engines. Similarly, Amazon and eBay run so-called recommender and reputation systems. Ranking, reputation, recommender, and trust systems are all concerned with aggregating agents' reviews of one another, and of external events, into valuable information. Markov chain models can help in designing these systems and in forecasting how well they should work.
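
For anyone curious, here is a sketch in Python of the sort of "very simple model" I mean. The link matrix is a hypothetical four-page web, and 0.85 is the usual damping factor: with probability 0.85 the surfer follows a random outgoing link, and otherwise jumps to a uniformly random page, which makes the chain irreducible and aperiodic so that an equilibrium distribution (the PageRank) exists.

import numpy as np

def pagerank(A, d=0.85, tol=1e-12):
    # A[i][j] = 1 if page i links to page j.  Pages with no outgoing links jump uniformly.
    A = np.asarray(A, dtype=float)
    n = len(A)
    out = A.sum(axis=1)
    P = np.array([A[i] / out[i] if out[i] > 0 else np.full(n, 1.0 / n) for i in range(n)])
    G = d * P + (1 - d) / n                  # the "Google matrix": a positive stochastic matrix
    pi = np.full(n, 1.0 / n)
    while True:                              # power iteration: pi converges to the equilibrium of G
        new = pi @ G
        if np.abs(new - pi).sum() < tol:
            return new
        pi = new

A = [[0, 1, 1, 0],
     [0, 0, 1, 0],
     [1, 0, 0, 1],
     [0, 0, 1, 0]]
print(pagerank(A))                           # the pages' equilibrium probabilities, i.e. their ranks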