Utkarsh's Notes

Information Theory part 1

I was introduced to Information theory through the idea of KL Divergence. Or well, the idea of minimizing it. Here's a brief attempt at explaining and detailing everything I'm currently learning.

Entropy

Entropy is a measure of uncertainty or variability of a variable's potential outcomes. It's a way of quantifying the volatility of a probability distribution. It's mathematically defined as:

H(X) = -Σₓ p(x) log₂ p(x)

Here:

H(X) stands for the entropy of the variable X, x represents the possible outcomes of the random variable X, and p(x) represents the probability of each outcome.

The reason we use base 2 instead of base 10 is that the result can then be expressed directly in bits, the unit of the binary system, which makes it easier to reason about transmitting information in a digital system.

As for why we take the negative of the expression:

  1. We want H(X), or entropy, to be positive when the event is unlikely, and closer to 0 when it's more probable. In probability theory, probabilities p(x) are always between 0 and 1, so their logarithms are always non-positive. Without the negative sign, the information content would be non-positive for every event, which would be counter-intuitive.

TLDR: p(x) is always between 0 and 1. Taking the log of this gives a negative number, hence the product would be negative. We want information to be positive!
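To make this concrete, here's a minimal Python sketch (the entropy helper is my own, not from any particular library) that applies the formula to a fair coin and to a heavily biased one:

    import math

    def entropy(probs):
        # Shannon entropy in bits: H(X) = -sum over x of p(x) * log2(p(x)).
        # Zero-probability outcomes are skipped, treating 0 * log(0) as 0.
        return -sum(p * math.log2(p) for p in probs if p > 0)

    # A fair coin is maximally uncertain for two outcomes: 1 bit of entropy.
    print(entropy([0.5, 0.5]))    # 1.0
    # A heavily biased coin is far more predictable, so its entropy is much lower.
    print(entropy([0.99, 0.01]))  # ~0.081

The more lopsided the distribution, the lower the entropy, which matches the idea of entropy as uncertainty.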

Differential Entropy

When our probability distributions are continuous instead of discrete, we extend the definition from the discrete case to a continuous one by replacing the sum with an integral. For a continuous random variable X with density f(x):

H(X) = -∫ f(x) log(f(x)) dx

Here f(x) stands for the probability density function (PDF) of X. It's also important to note that the integral runs over the entire support of the distribution, generally from -∞ to +∞, or over a specific range.
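As a sanity check, here's a small sketch (assuming scipy and numpy are available) that computes the differential entropy of a standard normal both by numerically integrating the formula above and via the known closed form 0.5·log(2πeσ²), reported in nats since the natural log is used:

    import numpy as np
    from scipy.integrate import quad
    from scipy.stats import norm

    # Numerical integration of H(X) = -∫ f(x) log(f(x)) dx over the full support.
    h_numeric, _ = quad(lambda x: -norm.pdf(x) * norm.logpdf(x), -np.inf, np.inf)

    # Closed form for a Gaussian with sigma = 1: 0.5 * log(2 * pi * e).
    h_closed = 0.5 * np.log(2 * np.pi * np.e)

    print(h_numeric, h_closed)   # both ≈ 1.4189 nats

scipy's norm.entropy() returns the same value, which is a convenient cross-check.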

Differences between the two

There are two marked conceptual differences between differential entropy and discrete entropy. First, differential entropy measures the 'spread' of the entire continuous probability distribution. Hence, unlike discrete entropy, which only takes on non-negative values, differential entropy can be negative. What a negative differential entropy signals conceptually is a 'highly concentrated' probability distribution.

The tighter the distribution, the higher the values of f(x), and the more negative the entropy becomes. This is different from discrete entropy, where the probabilities are normalized to be less than or equal to 1. Now, as to why this is possible:

The logarithm log(f(x)) becomes positive when f(x) > 1, so the term -f(x) log(f(x)) inside the integral is negative wherever f(x) > 1. If the density exceeds 1 over enough of its support, the entropy as a whole comes out negative.

The intuition or reasoning for this is entirely grounded in how discrete and continuous probabilities work. In continuous distributions, the PDF f(x) does not represent actual probabilities; it represents densities. For a highly concentrated PDF, f(x) can be much larger than 1 in some regions.

As for why f(x) can be ≥ 1:

A PDF f(x) must satisfy two properties:

  1. f(x) ≥ 0 for all x (no negative densities).

  2. The integral over the support must equal 1: ∫ f(x) dx = 1.

These rules do not restrict f(x) to values ≤ 1. For instance, if a distribution is very narrow (i.e., most of its probability mass is concentrated in a small interval), the height of the PDF f(x) can exceed 1.
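A narrow uniform distribution illustrates both points at once. The sketch below (using scipy; entropies are in nats, i.e., natural log) has a density of 2 everywhere on [0, 0.5], yet it still integrates to 1, and its differential entropy is negative:

    from scipy.stats import uniform

    # Uniform on [0, 0.5]: the density is a constant 1/0.5 = 2 > 1,
    # but the total area under it is still 0.5 * 2 = 1, so it is a valid PDF.
    narrow = uniform(loc=0, scale=0.5)
    print(narrow.pdf(0.25))   # 2.0
    print(narrow.entropy())   # log(0.5) ≈ -0.693 (negative differential entropy)

    # A wider uniform on [0, 2] has density 0.5 and positive differential entropy.
    wide = uniform(loc=0, scale=2.0)
    print(wide.pdf(1.0))      # 0.5
    print(wide.entropy())     # log(2) ≈ 0.693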

The other important difference to understand is that differential entropy is relative to the scale of the continuous variable. To understand this sentence, let's first define what scale means. Here, scale refers to the measurement unit of the continuous random variable. For continuous random variables, the probability density function (PDF) is not a direct measure of probability at a specific point; it's a density. In contrast, discrete probabilities are unitless and independent of how the variable's outcome is measured.

For continuous variables, because the graph is a density graph rather than a graph of specific outcomes, it scales with the resolution of the variable. This means differential entropy changes when the scale of the variable changes. Hence, changing X to kX changes the entropy of the variable.

If you transform a continuous random variable X by a scaling factor k, i.e., Y = kX, the PDF of Y changes as:

f_Y(y) = (1/|k|) f_X(y/k)

The differential entropy of Y is:

H(Y) = H(X) + log|k|

Hence, this shows that differential entropy is not invariant under changes of scale, i.e., scaling X by a factor k shifts the differential entropy by log|k|.
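Here's a quick numerical check of that identity (a sketch assuming scipy; it uses the fact that scaling a standard normal X by k just gives a normal with standard deviation k):

    import numpy as np
    from scipy.stats import norm

    k = 3.0
    X = norm(loc=0, scale=1)   # standard normal
    Y = norm(loc=0, scale=k)   # Y = kX is normal with standard deviation k

    # H(Y) should equal H(X) + log|k|, in nats.
    print(Y.entropy())                    # ≈ 2.5176
    print(X.entropy() + np.log(abs(k)))   # ≈ 2.5176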