The link between information, probability and entropy

Probability is an assignment of numbers (between 0 and 1) to repeatable events according to certain rules.  The usual interpretation of a probability is that it is the value to which "tends" the ratio of the number of times the event happens, over the number of trials.  Probability hence makes only "real world sense" if the event is repeatable and the settings in which that event can or cannot be realized, are well-defined.

As such, throwing a dice is a well-defined repeatable thing to do, and the event "I've thrown less than 3" is hence an event for which one can reasonably define a probability.  However, "the probability that you were born" as such, cannot be assigned any probability, because of course in practice there's nothing repeatable to your birth ; and even in principle, one hasn't defined any imaginary setting of a hypothetical (even if not practical) repeatable trial from which the event "you are born" could result or not.

However, probability has implicitly also another meaning.  We normally only use the concept of probability when we don't know the outcome.  Using the above definition of probability, we could say something like "when we count from 0 upwards, the probability to have a number ending in 3 or 4 is equal to 0.2".  Indeed, when you count, you will encounter two numbers ending in 3 or 4 for every 10 numbers you will be counting, and so the ratio of "success" when you count a lot will tend to 0.2.  But people usually don't use the concept of probability in this case, because "when you're counting, of course you know what the next number is going to be".  If you arrive at 327 when counting, you're not going to say that "the probability for the next number to end in a 3 or a 4 is 0.2".  Of course not.  Because the next number will be 328 and not end in a 3 or a 4.  Although all the concepts of the mathematical theory of probability are respected, people don't use probability "when they know for sure".  When we say that the probability for a number to end in a 3 or a 4 is 0.2, we actually mean something 'random' that 'we can't know in advance'.

So instead of "the average rate of success when we repeat it often" which is the basis of probability, we actually mean something else, which is ignorance.  Of course, ignorance doesn't only depend on the event as such, but also about the "entity that ignores".

Ignorance about an event is entropy.  Entropy originated in thermodynamics and had at first sight, nothing to do with anything like information or probability, but it turns out that information entropy and thermodynamic entropy are very closely related - and at a sufficiently abstract level, in fact, the same concept.  Thermodynamic entropy can be seen as the ignorance one has about the exact microscopic state of a physical system.  Information entropy is the measure of ignorance one has about the exact outcome of an event with a certain probability.

Information, in the end, is the diminishing of entropy, the reduction of ignorance.  If in the end, I know the outcome, then there's no entropy left, and I gained as much information about the event, as I had entropy before I learned the result.

Elementary mathematics of entropy and information.

We are used to having information in bits.  A bit is the most elementary form of information, or ignorance.  It is the ignorance related to flipping a fair coin.  The outcome can be heads, or tails, and the probability of each is 0.5.  That last statement means: flipping a coin is repeatable, and if we do that a lot of times, on average, half of the outcomes will be heads and half of the outcomes will be tails.  The "flipping" is a verbal description of "doing something of which we cannot foresee the outcome".  Indeed, we could also consider "turning over a coin".  Then we would also have probabilities of 0.5 and 0.5 for heads and tails.  But we would of course know "the next outcome".  If we have tails now, then after "turning over" we would have heads.  And after turning over again, we would have tails again.    We want to state that "there's no way of knowing the outcome before we've flipped it",which is why we use the word "flipping a coin".

The ignorance of one single "flip of a fair coin" is called a bit.  The ignorance can be reduced to 0, if we learn that the outcome was "head" or "tail".  We need to learn the actual value of something that can have two values.  That's a bit.  A 0 or a 1.  A yes or a no.  A true or a false.  A high or a low.

Now, the next thing we would like to have, is that the information in 2 coin flips is twice the information in one coin flip.  That the information in 5 coin flips is 5 times the information in 1 coin flip.  However, the probability to obtain a given sequence of 5 coin flips (say, head-head-tail-head) equals 1/32.  1/2^5.  The relationship that gives us the entropy (or the information needed to eradicate ignorance) is:

H = log2 1/p

where p is the probability of an event, assuming all events having the same probability.

One extends this formula to the less intuitive case when the outcomes do not all have the same probability:

H = Σ pi log2 1/pi

The above formula gives us the entropy, that is, the ignorance, we have of the outcome of a trial, when the different outcomes have probabilities pi and we assume that these probabilities indicate "something random and un-knowable in advance".  The last part is essential, and sometimes hard to grasp or even difficult to prove.  Entropy exists when probabilities indicate our pre-event ignorance on top of their meaning of average rate of success in the long run.  The knowledge of the outcome carries of course exactly that amount of information so as to reduce our ignorance to 0, that is, to reduce the entropy to 0.  As such, the amount of entropy is exactly equal to the amount of information contained in the outcome of the trial.

If the entropy of "5 future fair coin flips" is 5 bit, then learning about the outcome of 5 coin flips contains 5 bits of information.  Indeed, in order to communicate the outcome of 5 fair coin flips, one has to transmit 5 bits.

An entropy source, or equivalently, an information source, is something that realizes an outcome of a trial F times per second.  In as much as the individual trials are statistically independent, we have that this source then generates an entropy flux of F.H.

What is more interesting, is how much information we learn about a random event X, if we fully learn about a correlated event Y.  This is the typical setting for a communication: on the sender side, there is a source of information where a message X is drawn and sent on the communication channel.  At the receiver side, the channel is a source of information where a message Y is received.

Initially, the source at the sender side has an entropy H(X) before the message is "drawn".   The ignorance about the sent message is H(X).  When the message is composed and sent out, of course, on the sender side, all of that ignorance is gone, so the message that goes into the communication channel has an information content H(X).

On the receiver side, however, there is a message Y coming out of the communication channel, with information content H(Y).  But what the receiver wants to know, is not what came out of the channel, but what the sender put into the channel.  The question is now, how much ignorance of X the information of Y helps us to diminish.  That is, how much information about X does the knowledge of Y contain ?  If the communication channel is perfect, then of course the received message is exactly the sent message, and Y = X.  In that case, there's no ignorance about X left when we know Y.  But when we have a noisy channel, then Y is not entirely X.

One can argue that the information gained about X is the information contained in X, minus the information that remains in X when we know Y.

I(X;Y) = H(X) - H(X|Y)

It can be shown that this equals:

I(X;Y) = H(X) + H(Y) - H(X,Y)

It is interesting to see that this expression is symmetrical: the information that we learn about X by knowing Y, is exactly equal to the information we learn about Y by knowing X.

When reduced to probabilities, the expression becomes:

I(X;Y) = Σi Σj p(xi , yj) log2 [ p(xi , yj) /  ( p(xi) . p(yj)  ) ]

Here, the probabilities are those of the individual possible messages x and y as a joint probability distribution.  We see for instance that when x and y are uncorrelated (that is, when the received message has nothing to do with the sent message) that the expression inside the logarithm reduces always to 1, and hence each logarithm reduces to 0.  As such, receiving Y doesn't learn us anything about X, and I(X;Y) is zero, which is logical.  On the other hand, with a perfect channel, p(xi , yj ) is different from zero only when i = j, and then it is equal to p(xi ) = p(yi ).  At that point one can see that I(X ; Y) reduces to H(X), which it should.  With a perfect channel, upon reception of a message, we there's no ignorance left of the sent message, so we learned all the information that was contained in the sent message.

Some special cases and maximum entropy distributions

With a given probability distribution comes a given entropy.  The discrete formulas can easily be extended to continuous distributions ; in that case one speaks about differential entropy and differential information.   There is a problem with the information content of a continuous distribution, when the event is realized.  Indeed, there is an infinite amount of information in the specification of a real number.  The differential entropy is defined "per unit of the variable".  Grossly, one can say that the differential entropy is the ignorance of where the real number will be, up to one unit.

The differential entropy of a Gaussian distribution with standard deviation s is equal to 1/2 log2 (2.π.e.s2)

The Gaussian distribution is the distribution which has the highest differential entropy amongst all distributions with the same standard deviation s.  In other words, one can say that for a given standard deviation, a Gaussian distributed variable is the "most unknown".  In other words, if the only thing we know about a random variable is its mean and its standard deviation, then by assuming that the random variable is distributed according to a Gaussian is leaving the most ignorance to the variable.   Any other assumption, without knowing it, would add erroneous information.  Some people claim that this is the reason why the Gaussian distribution is so ubiquitous: it is the distribution that corresponds to maximal ignorance, when mean and standard deviation are given.

The differential entropy of a uniform distribution over an interval with length a is equal to log2 a .  In the same way, this is the maximum entropy distribution if it is given that the random variable is limited to the interval [0, a].  If the only thing we know about a random variable, is that it lies within the interval [0,a], then the assumption that leaves most ignorance to the random variable, is that the distribution is uniform.

The differential entropy of an exponential distribution with mean m is given by log2 (e.m)

The exponential distribution is the distribution with the maximum entropy amongst all distributions of positive numbers with mean m.

The fact that the uniform distribution, the Gaussian distribution and the exponential distribution are maximum entropy distributions, makes that they are the natural choice if we don't know anything more about the random variable, than that it is respectively limited to an interval, has a given standard deviation, or is positive and has a given mean.