The problem

The question to be solved is: what is the information flux that an analogue channel can sustain, given a certain available signal power, a given noise contaminating the channel, and a limited bandwidth of the channel (or of the signal)?

The information flux can be defined as the highest amount of information per second that can be obtained from an entropy source with a well-chosen probability density at the sending side.  As we are talking about an analogue signal (for instance, a voltage), one needs to bound the voltage swing of the sent signal.  One also needs to define the noise level, that is, the statistical fluctuations that the channel adds to the input signal.  The exact way in which these quantities are described will influence the result.

Historical estimates

Historically, the first accurate estimate of the information capacity of an analogue link was made by Hartley.  Hartley considered two aspects.  The first aspect was the bandwidth of the channel: if the channel accepts all frequencies up to B, and rejects all frequencies above B, we say that B is the bandwidth of the channel.  Nyquist's theorem then tells us that such a channel can transport 2 B totally independent voltage values per second.

The second aspect Hartley considered was the number of perfectly distinguishable voltage levels that one can transmit over the channel, given the bounds on the signal strength and the noise level.  Essentially, this comes down to assuming a uniform noise distribution over an interval d, and a maximal signal swing D.  The number of distinguishable levels is then M = D/d + 1.

If the input message then has a uniform distribution over the M different levels, the entropy of a single level choice is clearly log2 M.  As we can make 2 B such level choices per second, an entropy rate of 2 B log2 M can be signalled over the channel.

The Hartley capacity of a channel is hence: C = 2 B log2 M

This simplifies to the trivial result when M = 2, that is, a channel that can only reliably transport 2 possible values.  In that case, one can send 2 B values per second, and the channel capacity is 2 B bits per second.  If the channel has, say, a bandwidth of 1 GHz, then the channel capacity is 2 Gb/s.
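As a quick numerical check, here is a minimal sketch in Python of the Hartley capacity (the function name and the example values are merely illustrative):

```python
import math

def hartley_capacity(bandwidth_hz, swing, noise_interval):
    """Hartley capacity C = 2 B log2(M), with M = D/d + 1
    distinguishable levels for a maximal signal swing D and a
    uniform noise interval d."""
    m_levels = swing / noise_interval + 1
    return 2 * bandwidth_hz * math.log2(m_levels)

# The binary channel from the text: D/d = 1 gives M = 2, so C = 2 B.
print(hartley_capacity(1e9, 1.0, 1.0))  # 2e9 bits per second

# More distinguishable levels raise the capacity only logarithmically.
print(hartley_capacity(1e9, 7.0, 1.0))  # 2 GHz * log2(8) = 6e9 b/s
```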

Shannon

If we have a channel with a sharp frequency cut-off so that the bandwidth is B, as in the Hartley case, but this time the signal power is given by S and the noise power is given by N, and all we know about the noise is that it has power N, then Shannon has shown that the channel capacity is given by:

C = B log2 (1 + S / N)

Our 1 GHz bandwidth channel, in which we can put a signal with a power of 10 milliwatt against a noise power of 1 microwatt, will have a capacity of about 13.3 Gb/s.  Given that only the noise power is specified, the maximum entropy principle gives us a Gaussian noise distribution, which is what Shannon's derivation uses.
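The same number can be reproduced in a one-liner; a minimal sketch (again, the function name is just for illustration):

```python
import math

def shannon_capacity(bandwidth_hz, signal_w, noise_w):
    """Shannon capacity C = B log2(1 + S/N), in bits per second."""
    return bandwidth_hz * math.log2(1 + signal_w / noise_w)

# 1 GHz bandwidth, 10 mW of signal, 1 uW of noise: S/N = 10^4.
print(shannon_capacity(1e9, 10e-3, 1e-6) / 1e9)  # ~13.3 Gb/s
```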

In order to exploit the full capacity of a channel, the source needs to have the right ensemble, and the receiver may need a certain sophistication to transform the received voltage samples into "data" points drawn from a well-chosen ensemble.

If the signal and the noise power are frequency-dependent, then the channel capacity is simply the integral over the frequency spectrum within the bandwidth of the channel:

C = ∫ log2(1 + S(f)/N(f)) df
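When S(f) and N(f) are only known numerically, the integral can be approximated by a simple sum over frequency bins; a sketch with invented spectral densities:

```python
import numpy as np

# Hypothetical spectral densities over a 1 GHz band, 1 MHz bins.
f = np.linspace(0, 1e9, 1001)        # frequency grid in Hz
s = np.full_like(f, 10e-12)          # flat signal density, W/Hz
n = 1e-15 * (1 + (f / 1e9) ** 2)     # noise density rising with f

# C = integral of log2(1 + S(f)/N(f)) df, as a Riemann sum.
df = f[1] - f[0]
c = np.sum(np.log2(1 + s / n)) * df
print(f"capacity ~ {c / 1e9:.1f} Gb/s")
```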

These formulas are independent of the physical nature of the channel: they apply whether we are dealing with optical fibres, wire links, or radio links, and whether we communicate "on board" a device or over large distances.

The channel capacity does not mean that this information transmission rate can be reached with any given source: both the source and the receiver have to be well-adapted.  How to adapt a source and a receiver to a specific channel so as to optimize the information transmission is an interesting engineering question.

The frequency-dependent formulation can also be used if the samples of the noise or of the signal are correlated in time.  Indeed, correlation in time is nothing else but a non-flat frequency spectrum (the power spectrum being the Fourier transform of the auto-correlation function).  This resolves the most obvious objection one can formulate against Hartley's naive approach, namely that the noise, as well as the message, is supposed to have independent samples.  Correlation is simply handled by non-flat frequency spectra.
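A quick numerical illustration of that statement; the 8-tap moving average used here to create the correlation is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(0)
white = rng.standard_normal(1 << 16)
# Introduce correlation in time with a simple 8-tap moving average.
corr = np.convolve(white, np.ones(8) / 8, mode="valid")

def psd(x):
    """Crude power-spectrum estimate (periodogram)."""
    return np.abs(np.fft.rfft(x)) ** 2 / len(x)

# The uncorrelated sequence has a statistically flat spectrum; the
# time-correlated one is concentrated at low frequencies: exactly
# the non-flat N(f) that the integral formula accounts for.
p_white, p_corr = psd(white), psd(corr)
print(p_white[:1000].mean() / p_white[-1000:].mean())  # ~1, flat
print(p_corr[:1000].mean() / p_corr[-1000:].mean())    # >> 1, low-pass
```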

It is also obvious from the formula that one should put the signal in the frequency band where the noise is lowest (that is, if one has control over S(f)).  The highest contributions to the channel capacity come from those parts of the frequency spectrum where the noise density is low.  Note that high noise levels in frequency bands where there is no signal do not influence the channel capacity.  Adapting a source to a communication channel can hence consist in modulating the information such that it makes better use of the low-noise parts of the channel (shaping S(f)).
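The standard answer to that optimization, under a total-power constraint, is the classical "water-filling" allocation; a minimal sketch over a handful of frequency bins (the noise values are invented):

```python
import numpy as np

def waterfill(noise, total_power):
    """Give each bin S_i = max(0, mu - N_i), with the water level mu
    found by bisection so that the S_i sum to total_power."""
    lo, hi = noise.min(), noise.max() + total_power
    for _ in range(60):
        mu = (lo + hi) / 2
        if np.maximum(0.0, mu - noise).sum() < total_power:
            lo = mu          # water level too low
        else:
            hi = mu          # water level too high
    return np.maximum(0.0, mu - noise)

noise = np.array([1.0, 0.2, 0.05, 0.5, 3.0])   # per-bin noise power
signal = waterfill(noise, total_power=2.0)
print(signal)                             # quiet bins get the power
print(np.log2(1 + signal / noise).sum())  # capacity in bits per use
```

Note that the noisiest bin receives no signal power at all, which is the quantitative form of "put the signal where the noise is lowest".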

There are a few caveats to Shannon's formula, however.  The most important one is that the noise has to be "unknowable noise", that is, the noise power is the description of a true maximum-entropy distribution (a Gaussian distribution in this case, as only the power, and hence the standard deviation, is given), and there is no way to know the "next throw of the dice", or we don't want to consider that knowledge.  In other words, the noise has to be true entropy.

For instance, if a channel has pick-up noise, say, hum from the power mains, and we know the form of that noise, then one could in principle correct for it.  This noise power shouldn't enter Shannon's formula if we intend to correct for it.  As such, there can be ways to render a communication channel noisy for some, and high-capacity for others: by adding "noise known to some" to it.  If one adds a "signal that looks random" to the channel, then in as much as a receiver doesn't know that signal and is not interested in it, this signal power will have to be counted as noise.  However, for a receiver that knows that signal in advance, or through another channel, this signal can be subtracted, and hence is not part of the noise power.
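As a toy illustration of this last point (all numbers invented, and the hum lumped into the total noise power for simplicity): a receiver that must treat a known 1 mW mains pick-up as noise sees far less capacity than one that can subtract it first.

```python
import math

def shannon_capacity(b, s, n):
    return b * math.log2(1 + s / n)

b, s = 1e9, 10e-3            # 1 GHz bandwidth, 10 mW of signal
thermal, hum = 1e-6, 1e-3    # true noise, plus a known mains pick-up

naive = shannon_capacity(b, s, thermal + hum)  # hum counted as noise
informed = shannon_capacity(b, s, thermal)     # hum subtracted first
print(naive / 1e9, informed / 1e9)             # ~3.5 vs ~13.3 Gb/s
```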