Mutual information is a measure used extensively across many fields. This is intended to be a short, non-rigorous introduction to it.

Intuition

First, before getting into definitions and nuance, we can get a general sense for what we are trying to measure. Given two sets of observations, we might wonder whether they are independent. One indication that they may not be independent is if they are correlated. However, correlation only captures linear dependence; for example, if \(B = A^2\) with \(A\) distributed symmetrically about zero, the correlation is zero even though \(B\) is completely determined by \(A\). What we are really after is whether knowing the value of an observation from one random variable provides any information about an observation from a second random variable.

This intuition tells us that for two independent variables, the joint distribution should simply be the product of the two marginal distributions. If it has been a while since your last statistics course: a joint distribution might look like a scatter plot of pairs of observations \((a, b)\) drawn from random variables A and B, and the marginal distributions are just the histograms of the individual random variables. As such, independence means

\[P_{AB}(a,b) = P_A(a) \cdot P_B(b)\]

Equivalently

\[P_{A \mid B}(a \mid b) = P_A(a)\]

Which reads “the probability of a given b is the probability of a”. In more technical jargon, we might say “conditioned on observing b, the probability of a doesn’t change”. Thus, if the above relations don’t hold for two random variables, then they have some dependence! It’s worth noting that this definition does not restrict what the nature of the dependence might be (linear, quadratic, etc.). In general, our goal is to come up with a method of generating a number (preferably between 0 and 1) which tells us how far away from complete independence two distributions are. This is precisely what mutual information is!

Entropy is a quantity which, at a high level, is meant to convey how much “information” a distribution contains. As such, it is designed to be additive across independent distributions: if A and B share no information, then the total information is just the sum of the information in the individual variables. The common way to denote the entropy “function” is \(H(A)\), and mutual information is written \(I(A,B)\). Thus, we might guess that some measure of the dependence between two variables looks something like:

\[I(A,B) = \frac{H(P_{AB})}{H(P_A) + H(P_B)}\]

Which might read something along the lines of “the dependence measure equals the entropy of the joint distribution divided by the sum of the entropies of the marginal distributions”. In this case a value of 1 indicates A and B are independent. If the numerator is much smaller than the denominator (making the ratio < 1), then the variables are more dependent.
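For reference, the entropy of a discrete distribution is the (negative) expected log probability,

\[H(A) = H(P_A) = -\sum_{a} P_A(a) \log\left(P_A(a)\right)\]

with the joint entropy \(H(P_{AB})\) defined the same way over pairs \((a, b)\).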

Formal Definition

The above intuition requires a bit of background to formalize. Specifically, we have to motivate a new quantity called the Kullback-Leibler divergence (KL divergence).

KL divergence

Again, we can motivate the term by asking a general question:

Given a distribution Y, and approximations X1, X2, and X3, how do we pick the best approximation?

One way to answer this question is to define a quantity measuring the information lost by using the approximate distribution to encode the true distribution. Indeed, that is exactly what we will do. However, while it feels like we are motivating a distance metric between two distributions, it is important to squash that intuition here, because there is no reason this function needs to be symmetric. One intuitive way to see the asymmetry: one distribution may be continuous and the other discrete, in which case it seems reasonable that the information loss in one direction is not the same as in the other.

Anyway, the information loss has a natural definition following the definition of entropy. Whereas entropy can be considered the (negative) expected value of the log probabilities, our new quantity, \(D_{KL}\), is simply the expected value of the difference between the log probabilities, taken under the true distribution. This is written as:

\[D_{KL}(Y \| X) = \sum_{i=1}^{N} p_Y(x_i) \cdot \left( \log(p_Y(x_i)) - \log(p_X(x_i)) \right)\]

This is essentially how much information we expect to lose through the encoding. The more common way of writing this function uses the rules of logs to turn the difference into a ratio 1.
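In that ratio form, the same quantity reads

\[D_{KL}(Y \| X) = \sum_{i=1}^{N} p_Y(x_i) \cdot \log\left(\frac{p_Y(x_i)}{p_X(x_i)}\right)\]

which is exactly the shape we will see again in the definition of mutual information below.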

MI Definition

Alright, now that we have our tool, we are ready to define the dependence between two distributions.

\[I(X;Y) = D_{\mathrm{KL}}( P_{(X,Y)} \| P_{X} \otimes P_{Y} )\]

This can be read as the expected information loss for encoding the joint distribution as the product of marginal distributions. In the case of discrete variables, this can be written as:

\[\operatorname{I}(X;Y) = \sum_{y \in \mathcal Y} \sum_{x \in \mathcal X} { p_{(X,Y)}(x,y) \log{ \left(\frac{p_{(X,Y)}(x,y)}{p_X(x)\,p_Y(y)} \right) }}\]

Where \(p_{(X,Y)}\) is the joint distribution of \(X\) and \(Y\), and \(p_X\) and \(p_Y\) are the marginal probability mass functions of \(X\) and \(Y\) respectively 2. Personally, I think this looks something like what our intuition told us it should.
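As a quick sanity check of the discrete formula, consider two fair coins that always land the same way, so \(p_{(X,Y)}(h,h) = p_{(X,Y)}(t,t) = \tfrac{1}{2}\) and both marginals are uniform. Then

\[\operatorname{I}(X;Y) = 2 \cdot \tfrac{1}{2} \log\!\left(\frac{1/2}{(1/2)(1/2)}\right) = \log 2\]

which is one bit: observing one coin tells us everything about the other. For independent coins every term has \(p_{(X,Y)} = p_X p_Y\), the log is zero, and the mutual information vanishes.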

We probably could have jumped to another intuitive definition if we had motivated the conditional entropy, denoted \(H(X \vert Y)\). It provides another common reformulation of the above definition, \(I(X;Y)=H(X)-H(X \vert Y)\), implying that:

Mutual information can be thought of as the amount of entropy removed (on average) from X by observing Y 3

Taking this a step further, mutual information can equivalently be written as

\[\operatorname{I}(X;Y) \equiv H(X, Y) - H(X \vert Y) - H(Y \vert X)\]

This is intuitively read as the entropy of the joint distribution minus the conditional entropies! The above definitions seem easy enough to apply. However, one quickly realizes when using them that estimating the probability mass (or density) functions from the finite amount of data observed in practice requires picking a bin size (or kernel smoothing parameter!). One might hope that there is a quick and natural choice, or that the bin size doesn’t change the results too much, but unfortunately, neither of those things is the case!

Python Implementation

I originally wrote these sections in MATLAB, but I thought Python might be more inclusive for those without licenses. Following the above definition, I wrote some code to calculate the mutual information for two random variables (A and B):

import numpy as np

def mutual_information(A, B, n=6):              # n = number of bins for making the PMFs
    g1, _ = np.histogram(A, bins=n)             # bin the data
    g2, _ = np.histogram(B, bins=n)
    g3, _, _ = np.histogram2d(A, B, bins=n)     # joint counts: g3[i, j] ~ (A in bin i, B in bin j)
    pmf1 = g1 / g1.sum()                        # change counts to probabilities (marginal PMFs)
    pmf2 = g2 / g2.sum()
    pmf3 = g3 / g3.sum()                        # joint PMF
    prod = np.outer(pmf1, pmf2)                 # product of marginals on the same grid
    mask = pmf3 > 0                             # remove 0 entries (0 * log 0 -> 0)
    return np.sum(pmf3[mask] * np.log(pmf3[mask] / prod[mask]))  # MI score (in nats)

Note that I naively chose a fixed number of uniformly spaced bins. From here we can generate some test data and see the results!
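As a minimal usage sketch (an illustrative sanity check with made-up data, not the test data behind the figures below), the estimator can be run on two independent samples, where the result should be close to zero up to binning bias:

rng = np.random.default_rng(0)      # hypothetical example data, not from the original experiments
A = rng.normal(size=1000)           # two independent Gaussian samples
B = rng.normal(size=1000)
print(mutual_information(A, B))     # should be near 0; it grows once A and B are coupled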

Circle

Bin Spacing Modification

It is important to note that, to give our algorithm the best chance of success, I use non-uniform bin sizes, with bin edges chosen from the joint distribution. We can then look at how the estimate depends on the bin size.
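One common non-uniform choice, and a reasonable stand-in for the exact scheme behind the figures (which isn't reproduced here), is quantile (equal-occupancy) bins, with edges taken from the empirical quantiles of the data. A minimal sketch of that idea, reusing numpy and the arrays A and B from above:

def mutual_information_quantile(A, B, n=6):
    # Illustrative variant: bin edges at empirical quantiles of A and B
    # (equal-occupancy bins) instead of uniformly spaced edges.
    a_edges = np.quantile(A, np.linspace(0, 1, n + 1))
    b_edges = np.quantile(B, np.linspace(0, 1, n + 1))
    g3, _, _ = np.histogram2d(A, B, bins=[a_edges, b_edges])  # joint counts on the quantile grid
    pmf3 = g3 / g3.sum()                                      # joint PMF
    pmf1 = pmf3.sum(axis=1)                                   # marginal of A
    pmf2 = pmf3.sum(axis=0)                                   # marginal of B
    prod = np.outer(pmf1, pmf2)
    mask = pmf3 > 0
    return np.sum(pmf3[mask] * np.log(pmf3[mask] / prod[mask]))

Sweeping n in a function like this is one way to see how strongly the estimate depends on the binning.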

KNN Approach
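For reference, k-nearest-neighbor estimators (e.g., Kraskov et al.) sidestep binning entirely by using distances to nearby samples. A minimal sketch using scikit-learn's KNN-based estimator, assuming scikit-learn is available and reusing A and B from above:

from sklearn.feature_selection import mutual_info_regression

# KNN-based MI estimate between A and B; n_neighbors plays a role
# loosely analogous to the bin count in the histogram approach above.
mi_knn = mutual_info_regression(A.reshape(-1, 1), B, n_neighbors=3)[0]
print(mi_knn)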

Applications

A 2018 paper which I personally think is fascinating makes the claim that:

Framing word embedding as metric recovery of a semantic space unifies existing word embedding algorithms, ties them to manifold learning, and demonstrates that existing algorithms are consistent metric recovery methods given co-occurrence counts from random walks. 4

The connection between metric recovery and mutual information is that a seminal paper demonstrated that human word ratings are linearly related to distributionally derived pointwise mutual information (PMI) values 5. Following this motivation, the paper notes that, more recently, many current distributional models have been shown to be related to each other through PMI. To quote the paper:

In terms of algorithms, Levy and Goldberg (2014b) demonstrated that the global minimum of the skip-gram method with negative sampling of Mikolov et al. (2013b) implicitly factorizes a shifted version of the PMI matrix of word-context pairs.
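For reference, the pointwise mutual information of a particular pair \((x, y)\) is just the log ratio inside the MI sum from earlier,

\[\operatorname{PMI}(x;y) = \log\left(\frac{p_{(X,Y)}(x,y)}{p_X(x)\,p_Y(y)}\right)\]

so mutual information is the expected PMI over the joint distribution.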

I mention these results here because I think that with the prevalence of interesting results using mutual information in so many fields, we should be aware of the nuance in actually calculating its value!

References

  1. https://www.countbayesie.com/blog/2017/5/9/kullback-leibler-divergence-explained 

  2. https://en.wikipedia.org/wiki/Mutual_information#In_terms_of_PMFs_for_discrete_distributions 

  3. https://web.stanford.edu/class/stats311/Lectures/lec-02.pdf 

  4. Hashimoto, T. B., Alvarez-Melis, D., & Jaakkola, T. S. (2018). Word Embeddings as Metric Recovery in Semantic Spaces. Transactions of the Association for Computational Linguistics, 4, 273–286. doi: 10.1162/tacl_a_00098 

  5. Kenneth Ward Church and Patrick Hanks. 1990. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29.