The last two Uncertainty Wednesdays we have looked at entropy as a measure of uncertainty. Today I want to tie that back to the original communications context in which Shannon introduced it. We can use our really simple model of a world with two states and two signal values to do this. We have analyzed an example where the signal was a test. Now think of the signal as actually being an electric impulse sent down a wire by a local observer. The observer gets to see perfectly which state of the world has been realized and needs to communicate that state to us.
With two states and two signal values there is a super simple communication scheme available: if the observed state is A send a 1 (or an "H") and if the observed state is B send a 0 instead. So with a single message transmission our observer can give us perfect information about the state they observed (meaning once we receive the signal, there is no uncertainty left and we know exactly which state was observed). Important aside: we are assuming here that our observer's incentives are aligned with ours, so that the observer will truthfully communicate the state they have observed (we will get to incentive problems later in this series).
Now you might think that this is all there is to it and surely we can't do any better than this simple communication scheme. And that's right for a one-off observation. But now imagine that the state keeps changing (sometimes it is A and sometimes it is B) and we want our observer to keep signaling to us which state has occurred. In this repeated setup, is our simple communication scheme still the best? Somewhat surprisingly (if you have not worked on communication before), the answer is no. We can do better. And how much better turns out to be captured precisely by entropy!
To understand this let's first develop some intuition by going to the extreme case. Assume P(A) = 1 (and hence P(B) = 0). In this case we don't need a signal from our observer at all. No matter how many times a "new" state happens, we know it will always be A. Now imagine moving away from that just a tad to, say, P(A) = 0.999 and P(B) = 0.001, meaning B has a 1 in 1,000 chance of happening. We would expect to see long runs of state A, and our observer would be sending us long sequences of 111111111 with only an occasional 0 interspersed. That means we could compress the communication through the use of a protocol. For instance, our observer could wait until state B occurs and then send us the number of As observed up to that point, encoded in binary. That affords a lot of compression: here are 32 As uncompressed, "11111111111111111111111111111111", and here is the number 32 in binary, "100000". Obviously there would be some extra transmission overhead (after all, how would we tell where a number starts and stops?) but you can see that we can do a lot better.
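If you want to play with this idea, here is a minimal Python sketch of such a run-length style protocol. This is my own illustration, not a protocol from Shannon or anywhere else; the function name run_length_encode and the message format (just a list of binary counts, with the delimiting problem ignored) are assumptions made for the example.

```python
def run_length_encode(states):
    """Toy compression: whenever a B is observed, transmit the number of
    As seen since the last B, written in binary. The B itself is implicit.
    A real protocol would also need a way to delimit the numbers."""
    messages, run = [], 0
    for s in states:
        if s == "A":
            run += 1
        else:  # a B arrived: send the length of the run of As
            messages.append(format(run, "b"))
            run = 0
    return messages

# 32 As followed by a B: one 6-digit binary number instead of 32 raw signals.
print(run_length_encode("A" * 32 + "B"))  # ['100000']
```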
Now when I say that we can do better there is an important caveat here that often gets lost when this is discussed. The only way we can do better is by having the observer “buffer” some observations and then apply the compression algorithm. If we need to know each state *with certainty* immediately as it occurs, then we cannot do better than our super simple algorithm. But if we can afford to wait, well then we can do a lot better (I chose the word “buffer” on purpose here because this is related to the phenomenon at play when you start streaming audio or video on your computer).
Let's proceed and actually devise a protocol to explore compression a bit more. We will use a simple prefix code in the spirit of a Huffman code, in which we choose the shortest code word for the most frequent sequence and longer code words for less frequent sequences. To keep this really simple we will only compress two subsequent observations. Let's assume P(A) = 0.9 and hence P(B) = 0.1. Then for two subsequent observations, assuming independence, we have the following probabilities
P(AA) = 0.81
P(AB) = 0.09
P(BA) = 0.09
P(BB) = 0.01
and we will assign codes as follows
AA -> 1
AB -> 01
BA -> 001
BB -> 0001
You will notice how a 1 always completes a code word, which is what lets the receiver tell where one code word ends and the next begins. We can implement both the sender and the receiver for this encoding as simple finite state machines (a small sketch in code follows after the computation below). Now let's compute our expected average signal length
0.81 * 1 + 0.09 * 2 + 0.09 * 3 + 0.01 * 4 = 1.3
which is clearly shorter than the signal length of 2 which we would get from our simple scheme. And if we divide by 2 we see that our signal length per observed state is 0.65 — meaning we use on average less than a single signal to transmit a state!
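To make this concrete, here is a small Python sketch of the sender and receiver for the code above, together with a simulation that checks the average signal length. It is my own illustration under the assumptions of the post (independent states, P(A) = 0.9); the names CODE, encode_pairs and decode are mine, and instead of an explicit finite state machine it uses a lookup table and a loop, which comes to the same thing here.

```python
import random

# The prefix code from above: each pair of states maps to one code word,
# and a 1 always terminates a code word.
CODE = {"AA": "1", "AB": "01", "BA": "001", "BB": "0001"}

def encode_pairs(states):
    """Sender: buffer the states two at a time and emit the code words."""
    assert len(states) % 2 == 0, "this toy code works on pairs of observations"
    return "".join(CODE[states[i:i + 2]] for i in range(0, len(states), 2))

def decode(bits):
    """Receiver: read bits until a 1 arrives, which completes a code word."""
    reverse = {word: pair for pair, word in CODE.items()}
    pairs, word = [], ""
    for b in bits:
        word += b
        if b == "1":  # a 1 always ends a code word
            pairs.append(reverse[word])
            word = ""
    return "".join(pairs)

random.seed(0)
# Simulate a long run of states with P(A) = 0.9 and P(B) = 0.1.
states = "".join(random.choices("AB", weights=[0.9, 0.1], k=100_000))
bits = encode_pairs(states)
assert decode(bits) == states  # the round trip is lossless
print(f"{len(bits) / len(states):.3f} bits per state")  # close to 0.65
```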
Finally let’s compute entropy and see what Shannon’s work tells us is the best we could do
- 0.9 * log2(0.9) - 0.1 * log2(0.1) ≈ 0.469
By the way, we choose base 2 for the logarithm to reflect that we have two possible signal values. When we do that we measure entropy in "bits," where a bit is a signal that has two possible values (generally assumed to be 0 and 1; for more on bits see my previous series, Tech Tuesdays).
So Shannon says that the best we could possibly do for two states with p = 0.9 is 0.469 bits per state. By combining two successive states with our simple prefix code we were able to achieve 0.65 bits per state, which is not bad. If we expanded the size of our blocks to three, four or more states we could do even better, slowly approaching the theoretical minimum.
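For the curious, here is a sketch in Python (again my own, not from Shannon's paper) that builds an actual Huffman code for blocks of one, two, three or more observations and compares the resulting bits per state with the entropy bound. The function names huffman_code_lengths and bits_per_state are mine. For blocks of two a true Huffman code reaches 0.645 bits per state, a touch better than our hand-rolled 0.65, and larger blocks creep toward 0.469.

```python
import heapq
import itertools
import math

def huffman_code_lengths(probs):
    """Build a Huffman code for {symbol: probability} and return
    {symbol: code word length in bits}."""
    counter = itertools.count()  # tie breaker so the heap never compares lists
    heap = [(p, next(counter), [(sym, 0)]) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least probable groups; every symbol in them
        # gets a code word one bit longer.
        p1, _, group1 = heapq.heappop(heap)
        p2, _, group2 = heapq.heappop(heap)
        merged = [(sym, depth + 1) for sym, depth in group1 + group2]
        heapq.heappush(heap, (p1 + p2, next(counter), merged))
    return {sym: depth for sym, depth in heap[0][2]}

def bits_per_state(p_a, block_size):
    """Expected Huffman code length per observed state when we encode
    blocks of `block_size` independent observations with P(A) = p_a."""
    probs = {}
    for block in itertools.product("AB", repeat=block_size):
        p = 1.0
        for s in block:
            p *= p_a if s == "A" else 1 - p_a
        probs["".join(block)] = p
    lengths = huffman_code_lengths(probs)
    expected = sum(probs[sym] * lengths[sym] for sym in probs)
    return expected / block_size

entropy = -0.9 * math.log2(0.9) - 0.1 * math.log2(0.1)
print(f"entropy bound: {entropy:.3f} bits per state")
for n in range(1, 6):
    print(f"block size {n}: {bits_per_state(0.9, n):.3f} bits per state")
```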
Now, as you let this all sink in, please keep in mind that the compression is only possible here if we can wait to combine multiple state observations into a block which we then compress before transmission. This is an important assumption and may not always hold. For instance, suppose that the state of the world we are interested in is whether a particular share price is up or down and we want to trade on that. Obviously we will not want to wait to accumulate a bunch of share price movements just so we can save on communication. Instead we will invest heavily to give ourselves more communication capacity.