Recently, while studying AI, I have naturally come to think a lot from the perspective of Information Theory. In the process, I took a deep look at the theory itself again, and since my blog had been quiet for a while, I thought this would be a good opportunity to organize and post what I’ve studied, based on the textbook “Elements of Information Theory”. I plan to keep it as easy to follow as possible, centering the explanations on intuitive examples and adding formulas and proofs only where they are truly needed.
One thing to note: coverage of Information Theory usually starts with Chapter 2 (Entropy, Relative Entropy, and Mutual Information), but there are already plenty of well-organized resources on that material. I will therefore replace it with the notes I studied and organized, attached right below. So the first post on Information Theory will start from Chapter 3: the Asymptotic Equipartition Property (AEP), which is deeply related to data compression!!
Elements of Information Theory - Chapter 2 Notes
Asymptotic Equipartition Property (AEP)
The term Asymptotic Equipartition Property (AEP) might look difficult, but simplified, you can understand it as the Law of Large Numbers applied to entropy; the AEP is what makes that idea mathematically precise. Let me explain, starting with the formal statement.
Let’s assume we draw $N$ i.i.d. samples $X_1, \ldots, X_N$ from a distribution $p(x)$, i.e., $X_i \sim p(x)$. Then we can state the AEP as follows:
Theorem 1:
\[\begin{align*} -\frac{1}{N}\log p(X_1, X_2 , \ldots ,X_N) &= -\frac{1}{N}\sum_ {i=1}^N \log p(X_i) \\ &\rightarrow - \mathbb{E}\left[ \log p(X) \right] = H(X) \;\; \scriptsize{\text{(in probability)}} \end{align*}\]Let’s examine what this means. The first equality holds because the samples are independent, so the joint probability factorizes as $p(X_1, \ldots, X_N) = \prod_ {i=1}^N p(X_i)$, and the logarithm turns the product into a sum. The remaining quantity, $-\frac{1}{N}\sum_ {i=1}^N \log p(X_i)$, is simply the sample mean of the random variable $-\log p(X)$. Since the samples are identically distributed, this sample mean converges to the expectation $-\mathbb{E}\left[ \log p(X) \right]$, which is exactly the entropy $H(X) = -\sum_ {x} p(x)\log p(x)$; the weights $p(x)$ in the entropy formula do not need to be written explicitly, because they are supplied automatically by how often each value appears in the sample. In other words, the AEP says that, given an i.i.d. sequence of random variables, the normalized log-probability of the whole sequence converges to the entropy.
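To make this concrete, here is a small simulation (my own sketch, not from the textbook) that draws i.i.d. samples from an assumed Bernoulli source and checks that $-\frac{1}{N}\log p(X_1, \ldots, X_N)$ approaches $H(X)$ as $N$ grows:

```python
import math
import random

# Empirical check of the AEP: for i.i.d. samples from an assumed
# Bernoulli(p) source, -1/N * log2 p(X_1, ..., X_N) should approach H(X).
p = 0.3  # P(X = 1); an illustrative choice, not from the text
H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)  # true entropy, ~0.8813 bits

random.seed(0)
for N in (10, 100, 10_000, 1_000_000):
    xs = [1 if random.random() < p else 0 for _ in range(N)]
    # log2 p(x_1, ..., x_N) = sum_i log2 p(x_i), by independence
    log_p = sum(math.log2(p) if x == 1 else math.log2(1 - p) for x in xs)
    print(N, round(-log_p / N, 4), "vs H(X) =", round(H, 4))
```

As $N$ grows, the printed estimates settle onto $H(X)$, which is exactly the convergence the theorem states.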
Remark 3.1: In Machine Learning, taking an expectation is usually treated the same as simply computing a sample average. The reason is that we usually assume the data are sampled i.i.d.
Actually, there is one more part to examine mathematically here. In the formula above, $\frac{1}{N}\sum_ {i=1}^N \log p(X_i)$ is a sample mean, while $\mathbb{E}\left[ \log p(X) \right]$ is the mean under the true distribution $p(x)$. In other words, a value computed from a finite sample is not mathematically identical to the true entropy $H(X)$. The reason this substitution still works is that the sample mean is known to converge to the true mean as $N$ grows. Let’s take a look at how this can be proven. (This is why the term Law of Large Numbers appeared earlier in the explanation of AEP.)
Proof 1:
(I will write the proof section in English for convenience.)
Let $X_1, X_2, \ldots, X_N$ be a random sample of $N$ i.i.d. variables from the distribution $p(x)$. Assume the population mean is $\mathbb{E}[X] = \mu$ and that the variance $\text{Var}[X] = \sigma^2$ is finite. Let us first prove the claim about the expectation of the sample mean:
Let the sample mean, denoted by $\bar{X}_ N$, be defined as:
\[\begin{equation*} \bar{X}_ {N} = \frac{1}{N} \sum_ {i=1}^N X_{i} \end{equation*}\]Then, the expectation of the sample mean is given by:
\[\begin{align*} \mathbb{E}\left[\bar{X}_ N \right] = \mathbb{E}\left[\frac{1}{N} \sum_ {i=1}^N X_i \right] = \frac{1}{N}\sum_ {i=1}^N \mathbb{E}\left[ X_i \right] \end{align*}\]Here, since the expectation of each random variable $X_i$ is $\mu$, the expectation of the sample mean becomes:
\[\begin{equation*} \mathbb{E}\left[\bar{X}_ N \right] = \frac{1}{N}\sum_ {i=1}^N \mathbb{E}\left[ X_i \right] = \frac{1}{N} \times N \times \mu = \mu \end{equation*}\]This shows that the expectation of the sample mean equals the true population mean (i.e., the sample mean is an unbiased estimator of the population mean).
Next, the proof of convergence of the sample mean to the true mean is twofold—based on the Weak Law of Large Numbers (WLLN) and the Strong Law of Large Numbers (SLLN). In this case, we will focus only on WLLN, which is already sufficient for use in the context of the Asymptotic Equipartition Property (AEP).
Weak Law of Large Numbers (WLLN)
First, the WLLN states that the sample mean converges in probability to the population mean. This means that for any small positive number $\epsilon$, the probability that the sample mean $\bar{X}_ N$ deviates from the true mean $\mu$ by $\epsilon$ or more becomes arbitrarily small as $N$ grows. This can be formally expressed as:
\[\begin{equation*} \lim_{N \rightarrow \infty} P(\left| \bar{X}_ N - \mu \right| \geq \epsilon ) = 0 \end{equation*}\]This can be proven using Chebyshev’s Inequality, assuming finite variance $\sigma^2$. Let us first recall Chebyshev’s Inequality, which states that for any random variable $Y$ with mean $\mathbb{E}\left[Y\right]$ and variance $\text{Var}\left[ Y \right]$:
\[\begin{equation*} P(\left| Y - \mathbb{E}\left[Y\right] \right| \geq \epsilon) \leq \frac{Var\left[ Y \right]}{\epsilon^2} \end{equation*}\]Now, let $Y = \bar{X}_ N$, and we can derive the variance of the sample mean $\bar{X}_ N$ using properties of variance:
\[\begin{align*} Var \left[ \bar{X}_ N \right] &= Var\left[ \frac{1}{N}\sum_ {i=1}^N X_i \right] \\ &= \frac{1}{N^2} Var \left[ \sum_ {i=1}^N X_i \right] \\ &= \frac{1}{N^2} \sum_ {i=1}^N Var\left[X_i \right] \\ &= \frac{1}{N^2} \sum_ {i=1}^N \sigma^2 \;\; (Var[X_i] = \sigma^2 \; \text{for all i}) \\ &= \frac{\sigma^2}{N} \end{align*}\]Since we already know $\mathbb{E}\left[\bar{X}_ N\right] = \mu$ and ${Var}\left[\bar{X}_ N \right] = \frac{\sigma^2}{N}$, we can apply Chebyshev’s Inequality to the sample mean $\bar{X}_ N$:
\[\begin{equation*} P(\left| \bar{X}_ N - \mu \right| \geq \epsilon) \leq \frac{\sigma^2}{N \epsilon^2} \end{equation*}\]As the number of samples $N \to \infty$, the right-hand side approaches 0:
\[\begin{equation*} \lim_{N \rightarrow \infty} P( \left| \bar{X}_ N - \mu \right| \geq \epsilon) = 0 \end{equation*}\]This tells us that the probability of the sample mean deviating from the true mean by more than $\epsilon$ becomes arbitrarily small as the sample size grows. Since probabilities are non-negative, this is equivalent to:
\[\begin{equation*} \lim_{N \rightarrow \infty} P(\left| \bar{X}_ N - \mu \right| < \epsilon) = 1 \end{equation*}\]This completes the proof of convergence in probability of the sample mean to the population mean, as stated by the WLLN. $\Box$
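The WLLN and the Chebyshev bound above can also be checked numerically. The sketch below (my own illustration; the fair-die distribution is an assumed example) estimates $P(\left|\bar{X}_ N - \mu\right| \geq \epsilon)$ by Monte Carlo and compares it against $\frac{\sigma^2}{N\epsilon^2}$:

```python
import random
import statistics

# Numerical check of the WLLN via the Chebyshev bound sigma^2 / (N * eps^2):
# estimate P(|sample mean - mu| >= eps) for fair die rolls (an assumed
# illustrative distribution) and compare it against the bound.
random.seed(1)
mu = 3.5
var = statistics.pvariance(range(1, 7))  # fair die variance = 35/12
eps, trials = 0.2, 2000

for N in (50, 200, 800):
    deviations = 0
    for _ in range(trials):
        xbar = sum(random.randint(1, 6) for _ in range(N)) / N
        if abs(xbar - mu) >= eps:
            deviations += 1
    print(N, deviations / trials, "Chebyshev bound:", round(var / (N * eps**2), 3))
```

The empirical deviation frequency shrinks toward 0 as $N$ grows and always stays below the (rather loose) Chebyshev bound, which is all the WLLN proof requires.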
Typical Set
We worked hard to look at the concept called AEP earlier, but the reason this concept emerged is actually to explain the Typical Set. The Typical Set refers to the representative characteristics that a long sequence is expected to possess. Saying it like this might be a bit confusing, so let me explain with an example of a coin toss.
Example 1:
Let’s assume we toss a coin a total of 100 times. Since each toss is independent, there are $2^{100}$ possible outcome sequences of Heads (H) and Tails (T). Here, we can describe typical and non-typical results as follows:
- Non-typical result: Cases where Heads appears all 100 times (H:100) or Tails appears all 100 times (T:100), or perfectly regular patterns such as Heads and Tails alternating exactly 50 times each (HTHT…).
- Typical result: Cases where the overall result has a ratio of Heads to Tails close to half and half, such as (H:51, T:49).
In other words, in the case of a coin toss, the Typical Set refers to the set of results where the ratio of Heads to Tails is close to 50:50.
To explain it more informally, Typical Set can be expressed as a set of probabilistic results that are acceptable within the bounds of common sense.
Now, let’s express the definition of the Typical Set formally. If we express the Typical Set mathematically, it can be written as follows:
Definition 1:
Let $(x_1, \ldots x_n) \in \mathcal{X}^n$ denote a sequence of length $n$, and let $\epsilon$ be a very small positive number. Then the typical set ${A}^{(n)}_\epsilon$ with respect to $p(x)$ is the set of sequences with the property
\[\begin{align*} 2^{-n(H(X)+\epsilon)} \leq p(x_1, x_2, &\ldots, x_n)\leq 2^{-n(H(X)-\epsilon)} \end{align*}\]Taking logarithms, if $(x_1, \ldots x_n) \in A_\epsilon^{(n)}$, then
\[\begin{align*} {-n(H(X)+\epsilon)} &\leq \log p(X_1, X_2, \ldots, X_n)\leq {-n(H(X)-\epsilon)} \\ {H(X)-\epsilon} &\leq -\frac{1}{n}\log p(X_1, X_2, \ldots, X_n)\leq {H(X)+\epsilon} \end{align*}\]This means that the per-symbol log-probability $-\frac{1}{n}\log p(X_1, X_2, \ldots, X_n)$ stays within $\epsilon$ of the entropy $H(X)$. To express this more precisely, such a set is called the $\epsilon$-Typical Set.
However, if you look closely here, the form of $-\frac{1}{n}\sum_i \log p(X_i)$ in the middle shows a close relationship with the AEP explained above. Based on AEP, we can establish the properties of the typical set $A_\epsilon^{(n)}$. Then, let’s examine what properties $A_\epsilon^{(n)}$ has and the proofs for them.
Properties 1:
- For sufficiently large sequence length $n$, $P(A_\epsilon^{(n)}) \geq 1 - \epsilon$
- Let $\rvert A \rvert$ denote the number of elements in the set $A$; then $\rvert A_\epsilon^{(n)} \rvert\leq 2^{n(H(X) + \epsilon)}$
- For sufficiently large sequence length $n$, $\rvert A_\epsilon^{(n)} \rvert \geq (1-\epsilon)2^{n(H(X) - \epsilon)}$
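Before the proofs, all three properties can be verified numerically for an assumed Bernoulli source. The sketch below (my own check, with illustrative parameters $n=200$, $p=0.7$, $\epsilon=0.1$) exploits the fact that all sequences with the same number of ones share the same probability, so we can enumerate counts instead of all $2^n$ sequences:

```python
import math

# Exact numerical check of the three typical-set properties for an i.i.d.
# Bernoulli(p) source, grouping sequences by their number of ones k
# (all such sequences share the same probability). Parameters are my own
# illustrative choices, not from the text.
n, p, eps = 200, 0.7, 0.1
H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)

size, prob = 0, 0.0   # |A_eps^(n)| and P(A_eps^(n))
for k in range(n + 1):
    log2_p_seq = k * math.log2(p) + (n - k) * math.log2(1 - p)
    # membership test: 2^{-n(H+eps)} <= p(x^n) <= 2^{-n(H-eps)}
    if -n * (H + eps) <= log2_p_seq <= -n * (H - eps):
        size += math.comb(n, k)
        prob += math.comb(n, k) * 2.0 ** log2_p_seq

print(round(prob, 4))                            # property 1: >= 1 - eps
print(size <= 2 ** (n * (H + eps)))              # property 2
print(size >= (1 - eps) * 2 ** (n * (H - eps)))  # property 3
```

With these parameters the typical set already carries almost all of the probability mass while its size sits comfortably inside the stated bounds.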
Proof 2:
Proof 2.1:
From the Asymptotic Equipartition Property (AEP), we know that
\[\begin{equation*} -\frac{1}{n}\sum_ {i=1}^n \log p(X_i) \rightarrow H(X) \end{equation*}\]in probability. This implies that when $n$ is sufficiently large,
\[\begin{align*} P \underbrace{\left(\left|-\frac{1}{n}\log p(X_1, \ldots, X_n) - H(X) \right| \leq \epsilon \right)}_ {X^n \in A_ \epsilon^{(n)}} \geq 1 - \delta, \end{align*}\]where $\delta$ is a small positive number; this is exactly the statement of convergence in probability, and the event inside is precisely the event that $X^n \in A_ \epsilon^{(n)}$. Setting $\delta = \epsilon$ yields the first property. $\Box$
Proof 2.2:
We can derive an upper bound on the size of the typical set using the lower bound from its definition:
\[\begin{align*} 1 &= \sum_ {\mathbf{x} \in \mathcal{X}^n} p(\mathbf{x}) \\ &\geq \sum_ {\mathbf{x} \in A^{(n)}_ \epsilon} p(\mathbf{x}) \\ &\geq \sum_ {\mathbf{x} \in A^{(n)}_ \epsilon} 2^{-n(H(X) +\epsilon)} \scriptsize{\text{(by the lower bound of the Typical Set definition)}}\\ &= 2^{-n(H(X) +\epsilon)} |A^{(n)}_ \epsilon|. \end{align*}\]Multiplying both sides by $2^{n(H(X) +\epsilon)}$ yields $\rvert A^{(n)}_ \epsilon\rvert \leq 2^{n(H(X) +\epsilon)},$ which satisfies the second property. $\Box$
Proof 2.3:
From the first property, we can derive a lower bound on the size of the typical set as follows:
\[\begin{align*} 1 - \epsilon &\leq P(A_\epsilon^{(n)}) \\ &\leq \sum_ {\mathbf{x} \in A^{(n)}_ \epsilon} 2^{-n(H(X) -\epsilon)} \scriptsize{\text{(by the upper bound of the Typical Set definition)}}\\ &= 2^{-n(H(X) -\epsilon)} |A^{(n)}_ \epsilon|. \end{align*}\]Multiplying both sides by $2^{n(H(X) - \epsilon)}$ gives $\rvert A^{(n)}_ \epsilon\rvert \geq (1-\epsilon)2^{n(H(X) - \epsilon)},$ which is the third property. $\Box$
Consequences of AEP: Data Compression
We have examined the concept of AEP and the $\epsilon$-Typical Set. So what can we do with these properties? They have several applications, but the most useful among them is data compression. In other words, AEP makes near-lossless compression of data possible. Let’s examine how through an example.
Example 2:
Let’s use a coin toss as an example again. To make it more extreme, let’s assume the probabilities are not equal: Heads (H) with probability 0.9 ($P(H)=0.9$) and Tails (T) with probability 0.1 ($P(T)=0.1$), and suppose we toss the coin 3 times. The possible outcomes are as follows:
- $P(HHH) = 0.9 \times 0.9 \times 0.9 = 0.729$
- $P(H\times2, T) = 0.9 \times 0.9 \times 0.1 = 0.081$ (3 cases)
- $P(H, T\times2) = 0.9 \times 0.1 \times 0.1 = 0.009$ (3 cases)
- $P(TTT) = 0.1 \times 0.1 \times 0.1 = 0.001$
Next, thinking simply, when tossing the coin 3 times, the expected number of Heads (H) is $3 \times 0.9 = 2.7$. Taking the range slightly wider, we can say Heads comes up about 2 or 3 times out of 3 tosses (lower bound = 2, upper bound = 3). In other words, getting Heads (H) 2 or 3 times can be called typical. Now, calculating the probability $P(A^{(3)})$ of these typical outcomes gives:
$P(A^{(3)}) = P(HHH) + P(H\times2,T)\times3 = 0.729 + 0.081 \times 3 = 0.972$
That is, intuitively, when finding the typical set, the probability of that set occurring can be seen as about 97.2%. It is virtually close to 100%.
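This 3-toss calculation can be confirmed by brute force; the snippet below (my own sketch) enumerates all $2^3$ outcomes and sums the probability of those with 2 or 3 Heads:

```python
from itertools import product

# Brute-force check of the 3-toss example above: enumerate all 2^3 outcomes
# and sum the probabilities of the "typical" ones (2 or 3 Heads).
p_heads = 0.9
prob_typical = 0.0
for outcome in product("HT", repeat=3):
    prob = 1.0
    for c in outcome:
        prob *= p_heads if c == "H" else 1 - p_heads
    if outcome.count("H") >= 2:
        prob_typical += prob
print(round(prob_typical, 3))  # 0.972
```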
However, the current example with $n=3$ is too small to substitute into the AEP and typical-set properties defined above, because those properties assume $n$ is sufficiently large. Therefore, keeping the intuition above, let’s examine mathematically how things turn out when we increase to $n=100$.
From now on, since we need to use the value of the entropy directly in the formulas, let’s calculate the entropy $H$ for the probability of Heads. The binary entropy $H(p)$ of a Bernoulli trial is:
\[\begin{align*} H(p) &= -p\log p - (1-p)\log (1-p) \;\; \scriptsize{\text{(Binary entropy of a Bernoulli trial)}}\\ H(0.9) &= -0.9 \log 0.9 - 0.1\log0.1 \\ &\approx 0.1368 + 0.3322\\ &\approx 0.469 \end{align*}\]In other words, the entropy per toss, based on Heads (H), is about ${H(0.9) = 0.469}$ bits. Then now, substituting into the typical set size $A_{\epsilon=0}^{(n)}$ gives:
\[\begin{equation*} |A^{(n)}_{\epsilon=0}| = 2^{nH} = 2^{100\times0.469} = 2^{46.9} \approx 1.3\times10^{14} \end{equation*}\]Let’s examine the meaning of this result from the perspective of Information Theory and bits.
First, as in the example above, let’s calculate the probability of the typical set occurring. When $n=3$, it was hard to set a meaningful bound, so we simply took 2 or 3 Heads without choosing $\epsilon$ explicitly. Now that $n=100$, we can set the lower and upper bounds of the typical set via $\epsilon$. For convenient calculation, note the expected count is $np = 100 \times 0.9 = 90$, and let’s take a bound of $\pm 5$. (Range of expected Heads: $N_H \in [85, 95]$)
\[\begin{align*} p(A^{(100)}_ \epsilon) &= \sum_ {k=85}^{95} p(N_H=k)\\ &=\sum_ {k=85}^{95}\binom{100}{k}p^k (1-p)^{100-k} \\ &\approx 0.936 \end{align*}\]This means that the sum of cases belonging to the typical set among the total $2^{100}$ possible cases accounts for about 93.6% of the total probability mass. Although not 100%, most cases can be seen as included in the typical set.
Now, imagine communicating the result of the experiment above to someone. The total number of possible outcomes of the 100 coin tosses is $2^{100}$ (the size of the full outcome set $\mathcal{X}^{100}$). Transmitting an arbitrary outcome in bits would therefore require $\log 2^{100} = 100$ bits (excluding any prefix). However, the typical set calculated above can be indexed with only about $\log 2^{46.9} = 46.9 \approx 47$ bits. (We refer to this as indexing.) If we transmit only indices into the Typical Set, we save about $53$ bits, i.e., roughly $53\%$ compression, which is a very large saving from a communication perspective.
To summarize, if we say we are communicating information about an event where 100 coin tosses were performed, based on the typical set, it means we can communicate events corresponding to about 93.6% of the total probability mass with about 47 bits, that is, compressed by 53% fewer bits than before.
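The numbers in this example — $H(0.9) \approx 0.469$, the $\approx 46.9$-bit index, and the $93.6\%$ probability mass — can be reproduced with a short calculation (my own sketch):

```python
import math

# Reproducing the numbers in this example: the binary entropy H(0.9), the
# indexing cost of the typical set, and the probability mass of N_H in [85, 95].
n, p = 100, 0.9
H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
print(round(H, 3))      # 0.469 bits per toss
print(round(n * H, 1))  # 46.9 bits to index ~2^{nH} typical sequences

# Binomial probability that the number of Heads falls in [85, 95]
mass = sum(math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(85, 96))
print(round(mass, 3))   # 0.936 of the total probability mass
```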
One-line summary: Let's compress the frequently occurring sets, and leave the rarely occurring sets as they are!
Now let’s look at this formally from the perspective of sequence length. First, the typical set $A^{(n)}_ \epsilon$ defined earlier satisfies the following property:
\[\begin{equation*} |A^{(n)}_ \epsilon| \leq 2^{n(H + \epsilon)} \end{equation*}\]In other words, most sequences are included in this typical set, and their number is at most $2^{n(H+\epsilon)}$. As explained in the example, this means a sequence in the typical set can be indexed with at most $n(H + \epsilon) + 1$ bits. (The $+1$ accounts for the case where $n(H + \epsilon)$ is not an integer.) Additionally, one more bit is needed to flag whether the sequence is typical or not, so in the end a typical sequence can be represented within $n(H + \epsilon) + 2$ bits. Now there is one more point to examine: as explained above, we compress the typical set (indexing) and leave the non-typical set as is. The code length in each case can be bounded as follows:
- Typical Set: $l(x^n) \leq \log \rvert A^{(n)}_ \epsilon\rvert + 2 \leq n(H + \epsilon) + 2$
- Non-typical Set: $l(x^n) \leq \log \rvert \mathcal{X}^n\rvert + 2 = n\log \rvert \mathcal{X}\rvert + 2$
Now we can calculate the expected value of the sequence length. First, let $x^n$ denote the sequence $(x_1, \ldots x_n)$, and denote the codeword length for $x^n$ by $l(x^n)$. Then the expected codeword length can be expressed as follows by taking the expectation:
\[\begin{align*} \mathbb{E}\left[ l(X^n) \right] &= \sum_ {x^n} p(x^n)l(x^n)\\ &= \sum_ {x^n \in A^{(n)}_ \epsilon } p(x^n)l(x^n) + \sum_ {x^n \in A^{(n)^c}_ \epsilon} p(x^n)l(x^n)\\ &\leq \sum_ {x^n \in A^{(n)}_ \epsilon } p(x^n)\left(n(H + \epsilon) + 2\right) + \sum_ {x^n \in A^{(n)^c}_ \epsilon} p(x^n)\left(n\log |\mathcal{X}| + 2 \right)\\ &= p(A^{(n)}_ \epsilon )\left(n(H + \epsilon) + 2\right) + p(A^{(n)^c}_ \epsilon )\left(n\log |\mathcal{X}| + 2 \right)\\ \end{align*}\]Now, another part to consider here is that we want to know not only how many bits are used for the entire sequence $X^n$, but how many bits are required for each $x_i$. That is, we need the average number of bits (bit rate) $\frac{1}{n} \mathbb{E}[l(X^n)]$, not the total number of bits $\mathbb{E}[l(X^n)]$. Regarding this, we can summarize as follows:
Theorem 2
Let $X^n \sim p(x)$ be i.i.d. and let $\epsilon > 0$. Then there exists a code that maps sequences $x^n$ of length $n$ into binary strings, one-to-one, such that for sufficiently large $n$:
\[\begin{equation*} \mathbb{E}\left[\frac{1}{n}l(X^n) \right] \leq H(X) + \epsilon \end{equation*}\]This Theorem can be derived from the expectation calculated just above. First, as $n\rightarrow \infty$, the first property of the typical set gives $p(A^{(n)}_ \epsilon) \rightarrow 1$ and therefore $p(A^{(n)^c}_ \epsilon) \rightarrow 0$, so the non-typical term vanishes. This leaves $\mathbb{E}\left[ l(X^n) \right] \leq n(H + \epsilon) + 2$, and dividing both sides by $n$ yields Theorem 2, with the vanishing $\frac{2}{n}$ term absorbed into $\epsilon$.
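To see Theorem 2 at work, the sketch below (my own illustration with assumed parameters) computes the exact expected per-symbol length of the two-part code for a Bernoulli source, charging $n(H+\epsilon)+2$ bits for typical sequences and $n\log\rvert\mathcal{X}\rvert+2$ bits for everything else:

```python
import math

# Sketch of the two-part code's expected length for an assumed Bernoulli(p)
# source: typical sequences cost n(H + eps) + 2 bits (flag + index), all
# other sequences cost n * log2|X| + 2 bits (flag + raw symbols).
n, p, eps = 500, 0.7, 0.05
H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# P(typical): sum the binomial probabilities of the k-counts whose sequences
# satisfy 2^{-n(H+eps)} <= p(x^n) <= 2^{-n(H-eps)}
prob_typical = 0.0
for k in range(n + 1):
    log2_p_seq = k * math.log2(p) + (n - k) * math.log2(1 - p)
    if -n * (H + eps) <= log2_p_seq <= -n * (H - eps):
        prob_typical += math.comb(n, k) * 2.0 ** log2_p_seq

expected_len = (prob_typical * (n * (H + eps) + 2)
                + (1 - prob_typical) * (n * 1.0 + 2))  # |X| = 2, log2|X| = 1
print(round(expected_len / n, 4), "bits/symbol vs H(X) =", round(H, 4))
```

With these illustrative parameters the rate already sits noticeably below the 1 bit/symbol of the raw encoding, and it tends toward $H(X) + \epsilon$ as $n$ grows, matching Theorem 2.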
In conclusion, as $n\rightarrow\infty$, the entire sequence $X^n$ can be compressed to approximately $nH$ bits on average.
Conclusion
To briefly summarize the covered content:
AEP (Theorem 1):
\[\begin{equation*}
-\frac{1}{N}\sum_ {i=1}^N \log p(X_i) \rightarrow H(X)
\end{equation*}\]
Typical Set (Definition 1, Properties 1):
Definition 1:
\[\begin{equation*}
2^{-n(H(X)+\epsilon)} \leq p(x_1, x_2, \ldots, x_n) \leq 2^{-n(H(X)-\epsilon)}
\end{equation*}\]
Properties 1:
- For sufficiently large sequence length $n$, $P(A_\epsilon^{(n)}) \geq 1 - \epsilon$
- Let $\rvert A\rvert $ denote the number of elements in the set $A$; $\rvert A_\epsilon^{(n)}\rvert \leq 2^{n(H(X) + \epsilon)}$
- For sufficiently large sequence length $n$, $\rvert A_\epsilon^{(n)}\rvert \geq (1-\epsilon)2^{n(H(X) - \epsilon)}$
Convergence of Sequence Length (Theorem 2)
\[\begin{equation*}
\mathbb{E}\left[\frac{1}{n} l(X^n) \right] \leq H(X) + \epsilon
\end{equation*}\]
In this chapter, we examined, with intuitive explanations and the underlying mathematics, how information can be efficiently compressed and transmitted. Whenever I have time in the future, I plan to keep organizing the key concepts from Elements of Information Theory, especially content that plays an important role from an AI research perspective.
Then I’ll see you again in the next post!