I started the lecture by mentioning the second law of thermodynamics, which says that entropy is a nondecreasing function of time.
For a distribution p=(p1, ..., pk) the entropy, H(p) = –sum_i p_i log(p_i), is a measure of disorder, or how surprising on average would be an outcome that is chosen according to this distribution. For example, the outcome of the toss of a biased coin is less surprising on average than the outcome of a toss of a fair coin, and this is expressed by the inequality
– p log p – (1–p) log(1–p) ≤ – (1/2) log(1/2) – (1/2) log(1/2).
When the log is taken base 2 then H(p) is a lower bound on the average number of binary bits that would be required to communicate an outcome that is chosen as one of k possible outcomes according to this distribution. The bound can be achieved if and only if every component of p is one of 1/2, 1/4, 1/8, ... . If that is not the case then we might consider taking m i.i.d. samples from this distribution, and then try to communicate these m results optimally as one block. There exists a coding that does this and needs only m·H(p) bits, asymptotically as as m tends to infinity.
I gave an example of a 4 state Markov chain which starts in state 1, and thus having entropy H((1,0,0,0))=0. As n increases the distribution given by the first row of P^n tends to (1/4,1/4/,1/4,1/4), and the entropy increases monotonically to 2 (using log base 2). The point of this example was to motivate the idea that reversibility is only going to make sense once our Markov chain has reached its equilibrium. Otherwise the process will look different when reversed in time because it will appear that entropy is decreasing.
In fact, I cheated a little here, because it is not always the case that the entropy of p_1^{(n)}=(p_{1j}^{(n)}, j=1, ....,k) is monotonically increasing in n. This is true if and only if the equilibrium distribution of the Markov chain is the uniform distribution (as it was in my example). Nonetheless the same point is valid. Unless we start the chain in equilibrium we will be able to detect some difference between the process run forward in time and the process run backwards in time.
There is a Part II course called Coding and Cryptography in which you can learn more about entropy as it relates to efficient communications. Not surprisingly, it also crops up in the Cosmology and Statistical Physicscourses.
Following the lecture I made a change in the notes to the proof of Theorem 11.4, and a small correction in Example 11.5.