Today in Uncertainty Wednesday I want to start on the idea that just the shape of the probability distribution alone contains some measure of uncertainty. Let’s think about the simplest case again with just two states A and B and P(A) = p1 with P(B) = p2 = 1 - p1. I am using indices because we will shortly expand the number of states.
If p1 = 1 there is no uncertainty at all because you are certain that the world is in state A. The same holds true for p1 = 0, except that now you are certain it is state B (because p2 = 1). If we let p1 move continuously from 1 to 0, uncertainty first increases and then decreases again. As we will see later (and as should be intuitive), uncertainty is maximized at p1 = p2 = 0.5.
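As a quick numeric sketch of this rise-and-fall pattern, here is the two-state case computed with the entropy measure the post arrives at below (in bits, i.e. with log base 2):

```python
import math

def binary_entropy(p1):
    """Uncertainty (in bits) of a two-state distribution with P(A) = p1."""
    p2 = 1 - p1
    # By convention 0 * log(0) = 0, so a certain outcome contributes nothing.
    return -sum(p * math.log2(p) for p in (p1, p2) if p > 0)

for p1 in (1.0, 0.9, 0.7, 0.5, 0.3, 0.1, 0.0):
    print(f"p1 = {p1}: U = {binary_entropy(p1):.4f}")
```

The printed values start at 0 (certainty), climb to exactly 1.0 at p1 = 0.5, and fall back to 0, symmetrically in p1 and 1 - p1.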
More generally for a probability distribution p1, p2, … pn we would like to have our measure of uncertainty such that
U(p1, p2, … pn) is continuous in p1, p2, … pn
Meaning if you make an infinitesimally small change in two of the p’s (remember, you can’t just change one of them because they all need to add up to 1), you get only an infinitesimally small change in U.
Now compare the following two situations to each other. You can either face two equally likely states A and B or three equally likely states A, B, C. It seems intuitive that we would say that there is more uncertainty when there are three equally likely states, even if we know nothing else. This requirement can be expressed as follows (with some abuse of notation):
f(n) := U(1/n, 1/n … 1/n) is monotonically increasing in n
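We can check numerically that the entropy measure the post arrives at below satisfies this requirement, since for n equally likely states it works out to log n:

```python
import math

def uniform_entropy(n):
    """f(n) = U(1/n, ..., 1/n) under the entropy measure (in bits)."""
    # Summing n identical terms; this simplifies to log2(n).
    return -sum((1 / n) * math.log2(1 / n) for _ in range(n))

values = [uniform_entropy(n) for n in range(1, 7)]
# Each additional equally likely state adds uncertainty, with
# diminishing increments (log is concave).
```

The list comes out strictly increasing: 0 for one state, 1 bit for two, about 1.585 for three, 2 bits for four, and so on.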
Finally, a good measure of uncertainty will have a straightforward approach to composability: if you first face one uncertainty and then a second one, it should be easy to combine the measures for each into an overall measure of uncertainty.
To make this more concrete, imagine the following setup. There are four states of the world A, B, C and D. Now imagine that the true state of the world is revealed to you in two stages: first you find out whether it is in {A, B} or {C, D}, and then you find out the actual state. Let’s call the first step X and the second step Y. Then we would like our measure of uncertainty to behave as follows
U(XY) = U(X) + ∑ P(Xi) * U(Y|Xi) where the sum ∑ runs over the possible outcomes Xi of the first step X
and where U(Y|Xi) is the measure of the remaining uncertainty conditional on outcome Xi of the first step. This requirement says that the total uncertainty is a probability weighted sum of the uncertainties of each step (the first step having probability 1).
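Here is a numeric check of this decomposition for the four-state setup, with illustrative probabilities I picked for the example (0.1, 0.2, 0.3, 0.4), using the entropy measure the post arrives at below:

```python
import math

def entropy(probs):
    """Entropy in bits of a probability distribution given as a list."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Illustrative probabilities for states A, B, C, D.
total = entropy([0.1, 0.2, 0.3, 0.4])

# Step X: find out whether the state is in {A, B} (prob 0.3)
# or in {C, D} (prob 0.7).
step_x = entropy([0.3, 0.7])

# Step Y: the conditional distributions within each group.
y_given_ab = entropy([0.1 / 0.3, 0.2 / 0.3])
y_given_cd = entropy([0.3 / 0.7, 0.4 / 0.7])

# Probability weighted sum over the outcomes of the first step.
combined = step_x + 0.3 * y_given_ab + 0.7 * y_given_cd
# total and combined agree (up to floating point rounding)
```

The same decomposition works for any grouping of the states into a first step, not just {A, B} vs. {C, D}.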
In his groundbreaking 1948 paper “A Mathematical Theory of Communication” Claude Shannon showed that the only measure of uncertainty that fulfills all three of these requirements is
H = - K ∑ pi log pi where the sum ∑ runs over the states i = 1…n of the probability distribution (and K is a positive constant that sets the units)
which is known as the Shannon entropy or just entropy of the probability distribution.
It is important to emphasize again that this is a measure of uncertainty that operates solely at the level of the probability distribution. Nothing in its definition refers to outcomes or even further to the impacts of outcomes on different actors. See the Intro to Measuring Uncertainty from two weeks ago for an explanation of the difference.
Next week we will look at entropy for some probability distributions to get more of a feel for what this measure captures.