**Memo's Island**, and kindly contributed to R-bloggers)

KL divergence (Kullback-Leibler57) or KL distance is non-symmetric measure of difference between two probability distributions. It is related to mutual information and can be used to measure the association between two random variables.

Figure: Distance between two distributions. (Wikipedia) |

In this short tutorial, I show how to compute KL divergence and mutual information for two categorical variables, interpreted as discrete random variables.

${\bf Definition}$: Kullback-Leibler (KL) Distance on Discrete Distributions

Given two discrete probability distributions ${\it P}(A)$ and ${\it Q}(B)$ with discrete random variates, $A$ and $B$, having realisations $A=a_{j}$ and $B=b_{j}$, over $n$ singletons $j=1,…,n$. KL divergence or distance $D_{KL}$ in between $P$ and $Q$ is defined as follows:

$D_{KL} = D_{KL}\big({\it P}(A) || {\it Q}(B)\big)=\sum_{j=1}^{n} {\it P}(A=a_{j}) \log \Big( \cfrac{{\it P}(A=a_{j})}{{\it Q}(B=b_{j})} \Big)$

$\log$ is in base $e$.

${\bf Definition}$: Mutual Information on Discrete Random Variates

Two discrete random variables $X$ and $Y$, having realisations $X=x_{k}$ and $Y=y_{l}$, over $m$ and $n$ singletons $k=1,…,n$ and $l=1,…,m$ respectively, are given. Mutual information, $I(X;Y)$ is defined as follows,

$I(X;Y) = \sum_{k=1}^{n} \sum_{l=1}^{m} {\it R}(X=x_{k}, Y=y_{l}) \log \Big(\cfrac{ {\it R}(X=x_{k}, Y=y_{l}) }{ {\it R}(X=x_{k}) {\it R}(Y=y_{l})} \Big)$

$\log$ is in base $e$ and $R$ denotes probabilities.

${\bf Definition}$: Mutual Information on Discrete Random Variates with KL distance

Two discrete random variables $X$ and $Y$. Mutual information, $I(X;Y)$ is defined as follows,

$I(X;Y) = D_{KL} \Big( {\it R}(X, Y) || {\it R}(X) {\it R}(Y) \Big)$

${\bf Problem: Example}$: Given two discrete random variables $X$ and $Y$ defined on the following sample spaces $(a,b,c)$ and $(d,e)$ respectively. Write down the expression for the mutual information, I(X;Y), expanding summation.

${\bf Solution: Example}$:

$I(X;Y) = {\it R}(X=a, Y=d) \log \Big(\cfrac{ {\it R}(X=a, Y=d) }{ {\it R}(X=a) {\it R}(Y=d)} \Big) +$

${\it R}(X=b, Y=d) \log \Big(\cfrac{ {\it R}(X=b, Y=d) }{ {\it R}(X=b) {\it R}(Y=d)} \Big)+$

${\it R}(X=c, Y=d) \log \Big(\cfrac{ {\it R}(X=c, Y=d) }{ {\it R}(X=c) {\it R}(Y=d)} \Big)+$

${\it R}(X=a, Y=e) \log \Big(\cfrac{ {\it R}(X=a, Y=e) }{ {\it R}(X=a) {\it R}(Y=e)} \Big)+$

${\it R}(X=b, Y=e) \log \Big(\cfrac{ {\it R}(X=b, Y=e) }{ {\it R}(X=b) {\it R}(Y=e)} \Big)+$

${\it R}(X=c, Y=e) \log \Big(\cfrac{ {\it R}(X=c, Y=e) }{ {\it R}(X=c) {\it R}(Y=e)} \Big)$

${\bf Definition}$: Two discrete random variables $X$ and $Y$, having realisations $X=x_{k}$ and $Y=y_{l}$, over $m$ and $n$ singletons $k=1,…,n$ and $l=1,…,m$ respectively, are given. Two-way contingency table of realisations of $X$ and $Y$ along the same measured data points, $N$ observations, can be constructed by counting co-occurances, or cross-classifying factors. Normalizing the resulting frequency table will produce joint and marginal probabilities.

Y=y_{1} … Y=y_{l} R(X)

X=x_{1} R(X=x_{1}, Y=y_{1}) … R(X=x_{1}, Y=y_{l}) R(X=x_{1})

X=x_{2} R(X=x_{2}, Y=y_{1}) … R(X=x_{2}, Y=y_{l}) R(X=x_{2})

… … … … …

X=x_{k} R(X=x_{k}, Y=y_{1}) … R(X=x_{k}, Y=y_{1} R(X=x_{k})

R(Y) R(Y=y{1}) … R(Y=y_{l})

**Simulated example with R**

1 |
set.seed(3951824) |

**References**

(KullbackLeibler57) Kullback, S.; Leibler, R.A. (1951). “On information and sufficiency”. Annals of Mathematical Statistics 22 (1): 79–86.

**leave a comment**for the author, please follow the link and comment on their blog:

**Memo's Island**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...