Tuesday, November 13, 2012

Feedback

This year's course finished on 13 November 2012. I handed out the standard faculty feedback forms in the lecture on 8 November, and 63 completed forms were returned.

You can also give feedback using my equivalent on-line form. You may find it easier to tick the radio buttons and type into this form than to handwrite onto the paper form - and you have space to write interesting comments. 8 of these have been received so far.

Responses to the on-line form are emailed to me anonymously. After reading them, I will forward printed copies to the Faculty Office.

Notation

Some students say that the notation is one of the most difficult things about this course. I recommend that you make for yourself a one-page crib sheet of all the notation:

\begin{align*}
&(X_n)_{n\geq 0},\ I,\ \text{Markov}(\lambda,P),\ P = (p_{ij}),\ P(n) = (p_{ij}^{(n)}),\\[12pt]
&h_i^A,\ k_i^A,\ H_i,\ T_i,\ V_i,\ V_i^{(n)},\ f_i,\ \lambda,\ \pi,\ \gamma_i^k,\ m_i,\ m_{ij} .
\end{align*}
Write a little explanation for yourself as to what each notation means, and how it is used in our theorems about right-hand equations, recurrence/transience, left-hand equations, existence/uniqueness of invariant measure, aperiodicity/periodicity, positive/null recurrence and detailed balance. It should all seem pretty straightforward and memorable once you summarise it on one page and make some notes to place it in context.

Of course I could easily typeset a page like this for you — but I think that you'll learn more, and it will be more memorable for you personally, if you create this crib sheet yourself!

Lecture 12

The last lecture was on Tuesday November 13. I have enjoyed giving this course and I hope you have enjoyed it too. Thank you for your feedback.

The notes at the end of the course are slightly improved from those posted at the time of Lecture 1. I thank those of you who have spotted some typos.

You can now also look at the overhead projector slides that I sometimes used in lectures, such as the ones I used today to summarise four Part II courses that you may like to study next year: Probability and Measure, Applied Probability, Optimization and Control, and Stochastic Financial Models.

In my discussion of random walk and electrical networks in Section 12.4 I appealed to Rayleigh's Monotonicity Law: "if some resistances of a circuit are increased (decreased) the resistance between any two points of the circuit can only increase (decrease)." A proof of this "obvious" fact can be constructed by (i) proving Thomson's Principle: "Flows determined by Kirchhoff's Laws minimize energy dissipation", and then (ii) showing that Thomson's Principle implies Rayleigh's Monotonicity Law. You can read the details in Doyle and Snell, Random walks and electric networks, pages 51–52.
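If you would like to see the Monotonicity Law in action numerically, here is a minimal Python sketch. The circuit is an arbitrary example of my own, and the effective resistance is computed from the pseudoinverse of the weighted graph Laplacian, a standard identity equivalent to solving Kirchhoff's equations.

```python
# A numerical sanity check of Rayleigh's Monotonicity Law on a small
# circuit (an arbitrary example): effective resistance between two nodes,
# computed via the pseudoinverse of the weighted graph Laplacian.
import numpy as np

def effective_resistance(C, a, b):
    """C[i,j] = conductance (1/resistance) of edge ij."""
    L = np.diag(C.sum(axis=1)) - C    # weighted graph Laplacian
    Lplus = np.linalg.pinv(L)         # Moore-Penrose pseudoinverse
    e = np.zeros(len(C))
    e[a], e[b] = 1.0, -1.0
    return e @ Lplus @ e              # R_eff(a,b) = (e_a - e_b)' L^+ (e_a - e_b)

# A square of unit resistors with one diagonal (nodes 0,1,2,3).
C = np.array([[0., 1., 0., 1.],
              [1., 0., 1., 1.],
              [0., 1., 0., 1.],
              [1., 1., 1., 0.]])
print(effective_resistance(C, 0, 2))

# Increase the resistance of edge (0,1), i.e. halve its conductance:
C[0, 1] = C[1, 0] = 0.5
print(effective_resistance(C, 0, 2))  # can only have increased
```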

In Section 12.2 I mentioned Burke's output theorem (1956), which says that the output process of an $M/M/1$ queue in equilibrium is a Poisson process with the same rate as the input. In writing "$M/M/1$" the "$M$"s mean Markovian (i.e. a Poisson input process of rate $\lambda$, and i.i.d. exponentially distributed service times with parameter $\mu$, where $\mu > \lambda$), and the "$1$" means a single server. In queueing theory this very useful notation is known as Kendall's notation. For example, a $D/G/m$ queue is one in which arrivals are separated by a deterministic time, service times have a general distribution, and there are $m$ servers working in parallel.

I remarked (just for fun) that queueing is the only common word in the OED with five vowels in a row. Obscure words are ones like "miaoued" (what the cat did).

I once proved a generalization of Burke's output theorem that holds even when the queue has not reached equilibrium (see: The interchangeability of $\cdot/M/1$ queues in series, Weber, 1979). Suppose we have two single-server first-in-first-out queues in series, which we might write as $\cdot/M/1 \to /M/1$. The customers' service times in the first queue are i.i.d. exponentially distributed with parameter $\lambda$ and in the second queue they are i.i.d. exponentially distributed with parameter $\mu$. On finishing service in the first queue a customer immediately joins the second queue. Suppose the system starts with $N$ customers in the first (upstream) queue and no customers in the second (downstream) queue. My theorem says that all statistics that we might measure about the departure process from the second queue are the same if $\lambda$ and $\mu$ are interchanged. Thus by observing the process of departures from the second queue we cannot figure out which way around the two $/M/1$ servers are ordered. For example, the time at which we see the first departure leave the second queue has expected value $1/\lambda + 1/\mu$ (which is symmetric in $\lambda$ and $\mu$). Its moment generating function is $\lambda\mu(\theta-\lambda)^{-1}(\theta-\mu)^{-1}$. The expected time at which the second departure takes place is
$$
\frac{2 \lambda^2+3 \lambda\mu+2 \mu^2}{\lambda^2 \mu+\lambda \mu^2}.
$$
All other statistics are also symmetric in $\lambda$ and $\mu$. It was by computing such quantities that I first formulated the conjecture that this might be true. Burke's theorem is a corollary of my result that can be obtained by thinking about $N$ tending to infinity (can you see how?).
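You can check the interchangeability claim by simulation. Here is a minimal Python sketch (the particular values of $\lambda$, $\mu$ and $N$ are my own choices): it simulates the two queues in series using the Lindley-type recursion for tandem FIFO queues, and estimates the expected times of the first two departures from the second queue for both orderings of the servers.

```python
# Monte Carlo check of the interchangeability theorem for two ./M/1 queues
# in series: N customers start in the first queue, the second is empty.
import random

def mean_departures(lam, mu, N, runs=100_000):
    """Estimate E[time of 1st and 2nd departures from the second queue]."""
    tot1 = tot2 = 0.0
    for _ in range(runs):
        d1 = d2 = 0.0
        first = second = None
        for j in range(N):
            d1 += random.expovariate(lam)              # departs queue 1
            d2 = max(d1, d2) + random.expovariate(mu)  # departs queue 2
            if j == 0: first = d2
            if j == 1: second = d2
        tot1 += first
        tot2 += second
    return tot1 / runs, tot2 / runs

lam, mu, N = 1.0, 2.0, 5
print(mean_departures(lam, mu, N))  # approx (1.5, 2.667), matching
print(mean_departures(mu, lam, N))  # 1/lam + 1/mu and the formula above;
                                    # both orderings agree, as the theorem says
```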

Markov chains have been studied for about 100 years, but there are still interesting things to discover. An important on-going research area is Markov chain Monte Carlo.

Thursday, November 8, 2012

Fast retries lemma

There are many interesting consequences of reversibility. Here is one that I talked about at the end of today's lecture and which has a very practical application.

Theorem. If $P$ is the transition matrix of a reversible Markov chain on $N$ states then for all $i$,
$$
P_i(X_n=i)=p_{ii}^{(n)} = \sum_{k=1}^N a_k \lambda_k^n,
$$where all $a_k\geq 0$ and all $\lambda_k$ are real and $\lambda_k\leq 1$.

Hint. To prove this, let $\pi$ be the invariant distribution and consider $A=VPV^{-1}$, where $V$ is the diagonal matrix with $V_{ii}=\sqrt{\pi_i}$. Show that $A$ is symmetric, and think about what that implies.
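If you want to see the theorem at work numerically, here is a small sketch in Python/numpy (the reversible chain is an arbitrary example of mine: a random walk on a weighted triangle). It carries out exactly the construction in the hint and recovers $p_{ii}^{(n)}$ from the eigenvalues of the symmetrized matrix $A$.

```python
# Check the spectral representation p_ii^(n) = sum_k a_k lam_k^n, a_k >= 0,
# for a reversible chain: a random walk on a weighted triangle.
import numpy as np

W = np.array([[0., 1., 2.],   # symmetric edge weights
              [1., 0., 3.],
              [2., 3., 0.]])
P = W / W.sum(axis=1, keepdims=True)   # transition matrix
pi = W.sum(axis=1) / W.sum()           # invariant distribution

V = np.diag(np.sqrt(pi))
A = V @ P @ np.linalg.inv(V)           # symmetric, by detailed balance
lam, U = np.linalg.eigh(A)             # real eigenvalues, orthonormal U

i, n = 0, 7
a = U[i, :] ** 2                       # the a_k: squares, hence >= 0
print(np.sum(a * lam ** n))            # spectral formula for p_ii^(n)
print(np.linalg.matrix_power(P, n)[i, i])  # agrees with direct computation
```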

Similarly, for the continuous-time version of a Markov chain (a so-called Markov process),
$$
P_i(X(t)=i)=p_{ii}(t) = \sum_{k=1}^N a_k e^{-\lambda_k t}.
$$where all $a_k\geq 0$ and all $\lambda_k$ are real and $\lambda_k\geq 0$.

I have used this fact to prove what I call the fast retries lemma (Theorem 1 in the paper Optimal call routing in VoIP). Suppose that you ring up your friend and find that his phone is engaged. There is a minimum time that it takes to retry the call, say $\tau$, so you could try to reach him at times $0,\tau,2\tau,\dotso$ until you get through. But might the expected time to get through be less if you retry at $\rho,2\rho,\dotso$, for some $\rho>\tau$? You might get through on the second attempt rather than the third, and $2\rho< 3\tau$.

It turns out the answer is no. Your best policy is to attempt contact at $0,\tau,2\tau,\dotso$, provided that the state of your friend can be modelled as the state of a reversible Markov process, and that you get through to him as soon as you ring and find that he is not in the state (engaged) that he was in at time $0$. For example, your friend might be performing a random walk on a graph whose vertices are "engaged on the phone", "cooking", "sleeping", "watching tv", etc.

The expected time to connect is $ET$ where
$$
ET = t + p_{00}(t)\,ET, \quad\text{and so}\quad ET = \frac{t}{1-p_{00}(t)}.
$$
So the lemma is saying that, subject to the constraint $t\geq \tau$, $ET$ is minimized by $ t=\tau$.
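Here is a little Python sketch that illustrates the lemma numerically. The generator $Q$ below is an arbitrary reversible example of my own (birth-death processes are automatically reversible), with state 0 playing the role of "engaged".

```python
# Numerical illustration of the fast retries lemma: ET(t) = t/(1 - p00(t))
# is nondecreasing in t, so subject to t >= tau it is minimized at t = tau.
import numpy as np
from scipy.linalg import expm

Q = np.array([[-1.0,  1.0,  0.0],   # generator of a birth-death process;
              [ 2.0, -3.0,  1.0],   # state 0 is "engaged"
              [ 0.0,  2.0, -2.0]])

def ET(t):
    p00 = expm(Q * t)[0, 0]         # p_00(t) from the matrix exponential
    return t / (1.0 - p00)

for t in [0.1, 0.5, 1.0, 2.0, 4.0]:
    print(t, ET(t))                 # the values increase with t
```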

Lecture 11

I started the lecture by mentioning the second law of thermodynamics, which says that entropy is a nondecreasing function of time.

For a distribution $p=(p_1,\dotsc, p_k)$ the entropy, $H(p) = -\sum_i p_i \log p_i$, is a measure of disorder, or of how surprising, on average, an outcome chosen according to this distribution is. For example, the outcome of the toss of a biased coin is less surprising on average than the outcome of a toss of a fair coin, and this is expressed by the inequality
$$
- p \log_2 p - (1-p) \log_2(1-p) \leq\ - (1/2) \log_2(1/2) - (1/2) \log_2(1/2) = 1.
$$
When the $\log$ is taken base 2 then $H(p)$ is a lower bound on the average number of binary bits that would be required to communicate an outcome that is chosen as one of $k$ possible outcomes according to this distribution. The bound can be achieved if and only if every component of $p$ is one of $1/2, 1/4, 1/8,\dotsc$. If that is not the case then we might consider taking $m$ i.i.d. samples from this distribution, and then try to communicate these $m$ results optimally as one block. There exists a coding that does this and needs only about $m\cdot H(p)$ bits, asymptotically as $m$ tends to infinity.

I gave an example of a 4-state Markov chain which starts in state 1, and thus has entropy $H((1,0,0,0))=0$. As $n$ increases the distribution given by the first row of $P^n$ tends to $(1/4,1/4,1/4,1/4)$, and the entropy increases monotonically to 2 (using log base 2). The point of this example was to motivate the idea that reversibility is only going to make sense once our Markov chain has reached its equilibrium. Otherwise the process will look different when reversed in time because it will appear that entropy is decreasing.

In fact, I cheated a little here, because it is not always the case that the entropy of $p_1^{(n)}=(p_{1j}^{(n)},\ j=1,\dotsc,k)$ is monotonically increasing in $n$. This is true if and only if the equilibrium distribution of the Markov chain is the uniform distribution (as it was in my example). Nonetheless the same point is valid. Unless we start the chain in equilibrium we will be able to detect some difference between the process run forward in time and the process run backwards in time.
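You can watch the entropy increase with a few lines of Python. The chain below is a stand-in of my own for the lecture's example: a lazy random walk on a 4-cycle, which is doubly stochastic, so its equilibrium distribution is uniform and the entropy of the first row of $P^n$ increases monotonically to 2 bits.

```python
# Entropy of the distribution of X_n for a chain started in state 1.
import numpy as np

def entropy(p):
    p = p[p > 0]                        # convention: 0 log 0 = 0
    return -np.sum(p * np.log2(p))

I = np.eye(4)
P = 0.5 * I + 0.25 * (np.roll(I, 1, axis=1) + np.roll(I, -1, axis=1))

row = np.array([1.0, 0.0, 0.0, 0.0])    # start in state 1: entropy 0
for n in range(12):
    print(n, entropy(row))              # increases monotonically towards 2
    row = row @ P
```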

There is a Part II course called Coding and Cryptography in which you can learn more about entropy as it relates to efficient communications. Not surprisingly, it also crops up in the Cosmology and Statistical Physics courses.

Tuesday, November 6, 2012

Lecture 10

We only proved the first part of Theorem 10.2 (Ergodic theorem). The second part is a simple corollary of the first part. For details you can look at page 3 in Section 1.10 of James Norris's notes.

In this lecture I gave a demonstration of how Engel's probabilistic abacus can be used to calculate the invariant distribution of a finite-state Markov chain in which all $p_{ij}$ are rational numbers. The puzzle is "why does this algorithm always terminate in a finite number of steps?" There is more about this algorithm in Example Sheet 2, #15 and Section 12.5 of the notes. I first heard about this problem from Laurie Snell circa 1975. At that time he said that the question of why it works was still open.

The material on the random target lemma and Kemeny's constant is non-examinable, but I included it for fun. It is surprising (don't you think?) that the expected time to reach equilibrium (in the sense of this lemma) is independent of the starting state. (As well as being a co-author with Laurie Snell of the book Finite Markov Chains, John Kemeny was President of Dartmouth College, and one of the inventors of the BASIC programming language.)

Of course there are other ways to think about the time that is required for a Markov chain to reach equilibrium. One is the mixing time, $\tau(\epsilon)$, which is defined for $\epsilon > 0$ as the least time such that
\[
\max_i \sum_j | p_{ij}^{(n)} - \pi_j | < \epsilon\quad\text{ for all }n \geq \tau(\epsilon).
\]
This is closely related to the magnitude of the second-largest eigenvalue of $P$, say $\lambda_2$. The smaller $|\lambda_2|$ is, the smaller is the mixing time. In fact, one can prove bounds such as
\[
|\lambda_2| \frac{\log(1/(2 \epsilon))}{2(1-|\lambda_2|)} \leq \tau(\epsilon) \leq \frac{\log(1/(\pi^* \epsilon))}{1-|\lambda_2|}
\]where $\pi^*$ is the smallest component of $\pi$ (the invariant distribution).
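Here is a small numerical illustration (a sketch, using an arbitrary reversible chain of my own): it computes $\tau(\epsilon)$ by direct search, so you can see how it grows as $\epsilon$ shrinks, roughly like $\log(1/\epsilon)/(1-|\lambda_2|)$.

```python
# Mixing time tau(eps) by direct search, for a random walk on a weighted
# triangle, compared with the scale suggested by the eigenvalue bounds.
import numpy as np

W = np.array([[0., 1., 2.],
              [1., 0., 3.],
              [2., 3., 0.]])
P = W / W.sum(axis=1, keepdims=True)
pi = W.sum(axis=1) / W.sum()

def tau(eps, nmax=100_000):
    Pn = np.eye(3)
    for n in range(1, nmax):
        Pn = Pn @ P
        # max_i sum_j |p_ij^(n) - pi_j| is non-increasing in n, so the
        # first n satisfying the condition is tau(eps)
        if np.max(np.abs(Pn - pi).sum(axis=1)) < eps:
            return n

lam2 = sorted(np.abs(np.linalg.eigvals(P)))[-2]   # second-largest modulus
for eps in [1e-1, 1e-2, 1e-4, 1e-8]:
    print(eps, tau(eps), np.log(1 / eps) / (1 - lam2))
```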
Other quantities that are interesting are the coupling time (the time for two independent copies of the Markov chain to couple) and the cover time (the time for the Markov chain to visit every state at least once).

Recreational talk today

I will be giving the following talk today.

Hide and search games with multiple objects

Richard Weber

Talk at Sidney Sussex College (6.30pm, Old Library) November 6, 2012

Slides for this talk

You have mislaid \(k\) objects (such as your keys, wallet, mobile phone, etc). Each is lost in one of \(n\) (\(>k\)) locations (such as coat pocket, desk drawer, under bed, gym locker, etc.). It will cost \(c_i\) to search location \(i\) and you want to minimize the expected cost of finding all your objects. “Sod’s law” predicts that the objects are distributed amongst the locations with whatever probability distribution makes your task most difficult. What is this “Sod’s law” distribution? In what order should you search the locations? I will discuss this and some similar problems.

Another cute one is this: “In how many places should a forgetful squirrel hide his nuts?”

Practical and worthwhile search problems include “rescue at sea”, “searching for a cancer cure”, and “oil exploration”.

Thursday, November 1, 2012

Lecture 9

Today's lecture was a high point in our course: the proof by coupling that for an ergodic Markov chain $P(X_n=i)\to\pi_i$ as $n\to\infty$.
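For anyone who wants to see coupling in action, here is a minimal simulation sketch (the 3-state chain and its invariant distribution below are my own toy example). One copy starts from a fixed state and an independent copy starts in equilibrium; the proof works by gluing the two trajectories together after the random time $T$ at which they first meet.

```python
# The coupling in the proof of P(X_n = i) -> pi_i: run two independent
# copies, one from a fixed state, one started in equilibrium, until they meet.
import random

P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]
pi = [0.25, 0.5, 0.25]      # invariant: easy to check that pi P = pi

def step(i):
    return random.choices(range(3), weights=P[i])[0]

times = []
for _ in range(10_000):
    x = 0                                         # fixed starting state
    y = random.choices(range(3), weights=pi)[0]   # start in equilibrium
    n = 0
    while x != y:
        x, y, n = step(x), step(y), n + 1
    times.append(n)
print(sum(times) / len(times))   # E[T]: coupling happens quickly here
```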

A couple of points raised by questions that students asked me about at the end:

  • If a finite state space Markov chain is aperiodic and irreducible then it is regular. I think I forgot to include the words "finite state space" when lecturing, but they are in the notes. This is not true for an infinite state space chain. Consider the Markov chain on the non-negative integers with $p_{i,i+1}=p_{i,0}=1/2$. Then $p_{100,100}^{(n)}>0$ only if $n\geq 101$. So there is no $n$ for which $P^n>0$, i.e. $p_{ij}^{(n)}>0$ for all $i,j$.
  • In the proof of Theorem 9.8 the assumption of positive recurrence guarantees that there exists an invariant distribution $\pi$. From this we can see that $\pi\times \pi$ is the invariant distribution of the chain with state $W_n=(X_n,Y_n)$, and its existence implies that this chain is recurrent, and hence that $P(T<\infty)=1$. If we only knew that $X_n$ were recurrent (and so possibly null recurrent), we would only be able to claim the existence of an invariant measure $\lambda$, and $\lambda\times\lambda$ would be an invariant measure for $W_n$. But existence of an invariant measure does not imply recurrence (see Example 9.3).

I mentioned Vincent Doblin (1915-40) [also known as Wolfgang Doeblin], to whom the coupling proof of Theorem 9.8 is due. There is a good 2001 article about his life in The Telegraph: Revealed: the maths genius of the Maginot line. Some of Doblin's work was only discovered in summer 2000, having been sealed in an envelope for 60 years.

I quoted J. Michael Steele on coupling:

Coupling is one of the most powerful of the "genuinely probabilistic" techniques. Here by "genuinely probabilistic" we mean something that works directly with random variables rather than with their analytical co-travelers (like distributions, densities, or characteristic functions).

I like the label "analytical co-travelers". The coupling proof should convince you that Probability is not merely a branch of Analysis. The above quote is taken from Mike's blog for his graduate course on Advanced Probability at Wharton (see here). Mike has a fascinating and very entertaining web site that is full of "semi-random rants" and "favorite quotes" that are both funny and wise (such as his "advice to graduate students"). I highly recommend browsing his web site. It was by reading the blogs for his courses that I had the idea of trying something similar myself. So if you have been enjoying this course blog - that is partly due to him. Here is a picture of Mike and me with Thomas Bruss at the German Open Conference in Probability and Statistics 2010, in Leipzig.

The playing cards for the magic trick that I did in the lecture were "dealt" with the help of the playing card shuffler at random.org.

The following is a little sidebar for those of you who like algebra and number theory.

In the proof of Lemma 9.5 we used the fact that if the greatest common divisor of $n_1,\dotsc, n_k$ is $1$ then for all sufficiently large $n$ there exist some non-negative integers $a_1,\dotsc, a_k$ such that
\begin{align}
n &= a_1 n_1 + \cdots + a_k n_k.\tag{*}
\end{align}
Proof. Think about the smallest positive integer $d$ that can be expressed as $d = b_1 n_1 + \cdots + b_k n_k$ for some integers $b_1,\dotsc, b_k$ (which may be negative as well as non-negative). Notice that the remainder of $n_1$ divided by $d$ is also of this form, since it is $r = n_1 - m(b_1 n_1 + \cdots + b_k n_k)$ for $m=\lfloor n_1/d\rfloor$.

If $d$ does not divide $n_1$ then $0 < r < d$, and $d$ fails to be the smallest positive integer that can be expressed in this form. So $d$ divides $n_1$. The same must be true for every other $n_j$, and so $d=\text{gcd}(n_1,\dotsc,n_k)=1$.

So now we know that it is possible to write $1 = b_1 n_1 + \cdots + b_k n_k$, and so we can also write $j = j(b_1 n_1 + \cdots + b_k n_k)$ for all $j=1,\dotsc, n_1$. Now take $N = n_1(|b_1| n_1 + \cdots + |b_k| n_k)$. For each $j=0,1,\dotsc,n_1$ the coefficient of $n_i$ in $N+j$ is $n_1|b_i| + j b_i \geq 0$, so all of $N, N+1, N+2,\dotsc, N+n_1$ can be written in the required form (*). Hence we can also express in form (*) all integers $N + m n_1 + j$, where $m$ and $j$ are non-negative integers, i.e. we can do this for all integers $n\geq N$.
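If you like, you can check this computationally. Here is a brute-force sketch in Python, using the set $\{6, 10, 15\}$ (gcd $1$) as an arbitrary example of my own; it finds the finitely many integers that cannot be written in form (*).

```python
# Which n are NOT a non-negative integer combination of 6, 10 and 15?
from itertools import product

ns = (6, 10, 15)   # gcd is 1, so only finitely many gaps

def representable(n):
    ranges = [range(n // m + 1) for m in ns]
    return any(sum(a * m for a, m in zip(coeffs, ns)) == n
               for coeffs in product(*ranges))

print([n for n in range(1, 100) if not representable(n)])
# the list stops at 29: every n >= 30 is representable
```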