# Fairness and discrimination, PhD Course, #4 Wasserstein Distances and Optimal Transport

This article was first published on **R-english – Freakonometrics**, and kindly contributed to R-bloggers.


For the fourth course, we will discuss the Wasserstein distance and optimal transport. Last week, we mentioned distances, dissimilarities and divergences. But before talking about Wasserstein, we should mention the Cramér distance.

#### Cramer and Wasserstein distances

The definition of the Cramér distance, for \(k\geq1\), is \(d_k(F_{{A}},F_{{B}})=\left(\int_{\mathbb{R}}\big|F_{{A}}(x)-F_{{B}}(x)\big|^k\,dx\right)^{1/k}\)

while the Wasserstein distance will be (also for \(k\geq1\)) \(W_k(F_{{A}},F_{{B}})=\left(\int_0^1\big|F_{{A}}^{-1}(u)-F_{{B}}^{-1}(u)\big|^k\,du\right)^{1/k}\)

If we consider cumulative distribution functions, for the first one (Cramer), we consider some sort of “vertical” distance, while for the second one (Wasserstein), we consider some “horizontal” one,

Obviously, when \(k=1\), the two distances are identical

```
c1 = function(x) abs(pnorm(x,0,1)-pnorm(x,1,2))
w1 = function(x) abs(qnorm(x,0,1)-qnorm(x,1,2))
integrate(c1,-Inf,Inf)$value
[1] 1.166631
integrate(w1,0,1)$value
[1] 1.166636
```

But when \(k>1\), it is no longer the case.

```
c2 = function(x) (pnorm(x,0,1)-pnorm(x,1,2))^2
w2 = function(u) (qnorm(u,0,1)-qnorm(u,1,2))^2
sqrt(integrate(c2,-Inf,Inf)$value)
[1] 0.5167714
sqrt(integrate(w2,0,1)$value)
[1] 1.414214
```

For instance, we can illustrate with a simple multinomial distribution, and its distance to some binomial one, with parametric inference based on distance minimization, \(\theta^\star=\text{argmin}\{d(p,q_{\theta})\}\) (where here the multinomial distribution has parameters \(\boldsymbol{p}=(.5,.1,.4)\), taking values respectively in \(\{0,1,10\}\), while the binomial distribution has probabilities \(\boldsymbol{q}_{\theta}=(1-\theta,\theta)\), taking values in \(\{0,10\}\))

One can prove that

while

When \(k=1\), observe that the distance is easy to compute when distributions are ordered
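As a quick sanity check (a base-R sketch, not part of the original post): when the two samples have the same size, the empirical \(W_1\) distance is simply the mean absolute difference between the sorted samples,

```r
# empirical W_1: match observations by rank, then average the absolute gaps
set.seed(42)
xA = rnorm(1e5, 0, 1)   # sample from N(0,1)
xB = rnorm(1e5, 1, 2)   # sample from N(1,4)
mean(abs(sort(xA) - sort(xB)))   # close to 1.1666, the value integrated above
```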

When \(k=2\), the two distances are not equal

In the Gaussian (and the Bernoulli) case, we can get an expression for the distance (and much more as we will see later on)
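For instance, in the univariate Gaussian case, \(W_2^2=(\mu_{{A}}-\mu_{{B}})^2+(\sigma_{{A}}-\sigma_{{B}})^2\), which we can check against the numerical value \(1.414214\) obtained above (a small sketch, not from the post),

```r
# closed form for W_2 between N(muA, sA^2) and N(muB, sB^2)
muA = 0; sA = 1
muB = 1; sB = 2
sqrt((muA - muB)^2 + (sA - sB)^2)
# [1] 1.414214  (i.e. sqrt(2), matching the numerical integration above)
```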

There are several representations for \(W_2\)

And finally, we can also discuss \(W_{\infty}\)

#### Wasserstein distances, and optimal transport

The Wasserstein distance can also be written using some sort of expected value, when considering random variables instead of distributions, and some best-case scenario, or cheapest transportation cost,

which leads to the so-called Kantorovich problem

An alternative way to look at this problem is to consider a transport map, and a push-forward measure

This is simply

Of course, such a mapping exists

We can then consider the Monge problem

And interestingly, those two problems are (somehow) equivalent

#### Discrete case

If \(\boldsymbol{a}_{{A}}\in\mathbb{R}_+^{\color{red}{n_{{A}}}}\) and \(\boldsymbol{a}_{{B}}\in\mathbb{R}_+^{\color{blue}{n_{{B}}}}\), define \(U(\boldsymbol{a}_{{A}},\boldsymbol{a}_{{B}})=\big\lbrace M\in\mathbb{R}_+^{\color{red}{n_{{A}}}\times\color{blue}{n_{{B}}}}:M\boldsymbol{1}_{\color{blue}{n_{{B}}}}=\boldsymbol{a}_{A}\text{ and }{M}^\top\boldsymbol{1}_{\color{red}{n_{{A}}}}=\boldsymbol{a}_{B}\big\rbrace\). For convenience, let \(U_{\color{red}{n_{{A}}},\color{blue}{n_{{B}}}}\) denote \(\displaystyle{U\left(\boldsymbol{1}_{n_{{A}}},\frac{\color{red}{n_{{A}}}}{\color{blue}{n_{{B}}}}\boldsymbol{1}_{n_{{B}}}\right)}\) (so that \(U_{\color{red}{n},\color{blue}{n}}\) is the set of permutation matrices associated with \(\mathcal{S}_n\)). Let \(C_{i,j}=d(x_i,y_{j})^k\), so that \(W_k^k(\boldsymbol{x},\boldsymbol{y}) = \underset{P\in U_{\color{red}{n_{{A}}},\color{blue}{n_{{B}}}}}{\text{min}} \Big\lbrace \langle P,C\rangle \Big\rbrace\), where \(\langle P,C\rangle = \sum_{i=1}^{\color{red}{n_{{A}}}} \sum_{j=1}^{\color{blue}{n_{{B}}}} P_{i,j}C_{i,j}\), and consider an optimal plan \(P^\star \in \underset{P\in U_{\color{red}{n_A},\color{blue}{n_B}}}{\text{argmin}} \Big\lbrace \langle P,C\rangle \Big\rbrace\). As in the slides, if we have the same sample size in the two groups, we have
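With equal group sizes, the problem can be solved by brute force for tiny examples, since \(U_{n,n}\) only contains the \(n!\) permutation matrices (a base-R sketch with made-up points, not from the post),

```r
# enumerate S_3 and pick the permutation minimizing <P, C>
x = c(.1, .5, .9)
y = c(.6, .2, .8)
C = outer(x, y, function(u, v) (u - v)^2)   # cost matrix, here k = 2
perms = rbind(c(1,2,3), c(1,3,2), c(2,1,3), c(2,3,1), c(3,1,2), c(3,2,1))
costs = apply(perms, 1, function(s) sum(C[cbind(1:3, s)]))
perms[which.min(costs), ]
# [1] 2 1 3   (the rank-based matching, with total cost 0.03)
```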

we can illustrate below (with costs, or distances)

And with different group sizes,

i.e., if we consider real datasets

And as usual, we can consider some penalized version. Define \(\mathcal{E}(P) = -\sum_{i=1}^{\color{red}{n_{{A}}}} \sum_{j=1}^{\color{blue}{n_{{B}}}} P_{i,j}\log P_{i,j}\) or \(\mathcal{E}'(P) = -\sum_{i=1}^{\color{red}{n_{{A}}}} \sum_{j=1}^{\color{blue}{n_{{B}}}} P_{i,j}\big[\log P_{i,j}-1\big]\). Then define \(P^\star_\gamma = \underset{P\in U_{\color{red}{n_{{A}}},\color{blue}{n_{{B}}}}}{\text{argmin}} \Big\lbrace \langle P,C\rangle -\gamma \mathcal{E}(P) \Big\rbrace\). The penalized problem is strictly convex.

#### Sinkhorn relaxation

This idea is related to the following theorem
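To make the relaxation concrete, here is a minimal hand-rolled Sinkhorn iteration (a base-R sketch, independent of the T4transport functions used below): starting from the Gibbs kernel \(K=e^{-C/\gamma}\), alternately rescale rows and columns until both marginal constraints hold,

```r
sinkhorn_plan = function(C, a, b, gamma = .01, niter = 500) {
  K = exp(-C / gamma)                    # Gibbs kernel
  u = rep(1, length(a)); v = rep(1, length(b))
  for (i in 1:niter) {                   # alternating marginal rescalings
    u = a / as.vector(K %*% v)
    v = b / as.vector(t(K) %*% u)
  }
  diag(u) %*% K %*% diag(v)              # transport plan P_gamma
}
x = c(.1, .5, .9)
y = c(.6, .2, .8)
C = outer(x, y, function(s, t) (s - t)^2)
P = sinkhorn_plan(C, rep(1/3, 3), rep(1/3, 3))
round(P, 3)   # close to (1/3 times) the rank-based permutation matrix
```

As \(\gamma\to0\), the plan concentrates on the optimal permutation; larger \(\gamma\) spreads the mass, as in the `lambda` examples below.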

Consider a simple optimal transportation problem from 6 points to 6 other points,

```
set.seed(123)
x = (1:6)/7
y = runif(9)
x
[1] 0.14 0.29 0.43 0.57 0.71 0.86
y[1:6]
[1] 0.29 0.79 0.41 0.88 0.94 0.05
library(T4transport)
Wxy = wasserstein(x,y[1:6])
Wxy$plan
```

that we can visualize below (the first observation of \(\boldsymbol{x}\) is matched with the last one of \(\boldsymbol{y}\), the second observation of \(\boldsymbol{x}\) is matched with the first one of \(\boldsymbol{y}\), etc)

We observe that we simply match according to ranks.

But we can also use a penalized version

```
Sxy = sinkhorn(x, y[1:6], p = 2, lambda = 0.001)
Sxy$plan
```

here with a very small penalty

or a larger one

```
Sxy = sinkhorn(x, y[1:6], p = 2, lambda = 0.05)
Sxy$plan
```

In the discrete case, optimal transport can be related to the Hardy–Littlewood–Pólya inequality, which is related to the idea of matching based on ranks (corresponding to a monotone mapping function)
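We can verify this numerically (a base-R sketch): the monotone, rank-based matching never costs more than any other one-to-one matching,

```r
# Hardy-Littlewood-Polya: the rank-based matching minimizes quadratic cost
set.seed(123)
x = runif(8); y = runif(8)
monotone = sum((sort(x) - sort(y))^2)                   # match by ranks
random_costs = replicate(1000, sum((x - sample(y))^2))  # random matchings
monotone <= min(random_costs)
# [1] TRUE
```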

We have then

In the bivariate discrete case, we have

#### Optimal mapping

We have mentioned that, in the univariate setting, the optimal map is the quantile transform \(\mathcal{T}^\star=F_{{B}}^{-1}\circ F_{{A}}\)

and clearly, \(\mathcal{T}^\star\) is increasing. In the Gaussian case, for example, \(x_{{B}}=\mathcal{T}^\star(x_{{A}})= \mu_{{B}}+\sigma_{{B}}\sigma_{{A}}^{-1} (x_A-\mu_{{A}})\). In the multivariate case, we need a more general concept of monotonicity to define an “increasing” mapping \(\mathcal{T}^\star:\mathbb{R}^k\to\mathbb{R}^k\).
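A quick check (base-R, not from the post) that the quantile transform \(F_{{B}}^{-1}\circ F_{{A}}\) reduces to this linear formula in the Gaussian case,

```r
# T*(x) = F_B^{-1}(F_A(x)) coincides with mu_B + (s_B / s_A)(x - mu_A)
muA = 0; sA = 1; muB = 1; sB = 2
x = seq(-2, 2, by = .5)
T1 = qnorm(pnorm(x, muA, sA), muB, sB)   # quantile transform
T2 = muB + sB / sA * (x - muA)           # closed-form linear map
max(abs(T1 - T2))                        # essentially zero
```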

In the Gaussian case, for example, we have a linear mapping, \(\boldsymbol{x}_{{B}} = \mathcal{T}^\star(\boldsymbol{x}_{{A}})=\boldsymbol{\mu}_{{B}} + \boldsymbol{A}(\boldsymbol{x}_{{A}}-\boldsymbol{\mu}_{{A}})\), where \(\boldsymbol{A}\) is a symmetric positive matrix that satisfies \(\boldsymbol{A}\boldsymbol{\Sigma}_{{A}}\boldsymbol{A}=\boldsymbol{\Sigma}_{{B}}\), which has a unique solution given by \(\boldsymbol{A}=\boldsymbol{\Sigma}_{{A}}^{-1/2}\big(\boldsymbol{\Sigma}_{{A}}^{1/2}\boldsymbol{\Sigma}_{{B}}\boldsymbol{\Sigma}_{{A}}^{1/2}\big)^{1/2}\boldsymbol{\Sigma}_{{A}}^{-1/2}\), where \(\boldsymbol{M}^{1/2}\) is the square root of the symmetric positive matrix \(\boldsymbol{M}\), based on the Schur decomposition (\(\boldsymbol{M}^{1/2}\) is itself a symmetric positive matrix). In R, for example, use the expm package,

```
M = expm::sqrtm(matrix(c(1,1.2,1.2,2),2,2))
M
          [,1]      [,2]
[1,] 0.8244771 0.5658953
[2,] 0.5658953 1.2960565
M %*% M
     [,1] [,2]
[1,]  1.0  1.2
[2,]  1.2  2.0
```
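Putting the pieces together (a base-R sketch, with a made-up \(\boldsymbol{\Sigma}_{{B}}\), and a spectral decomposition instead of expm::sqrtm so the snippet is self-contained), we can compute \(\boldsymbol{A}\) and check that \(\boldsymbol{A}\boldsymbol{\Sigma}_{{A}}\boldsymbol{A}=\boldsymbol{\Sigma}_{{B}}\),

```r
# matrix square root via the spectral decomposition
msqrt = function(M) {
  e = eigen(M, symmetric = TRUE)
  e$vectors %*% diag(sqrt(e$values)) %*% t(e$vectors)
}
SA = matrix(c(1, 1.2, 1.2, 2), 2, 2)     # Sigma_A (as above)
SB = matrix(c(2, -.5, -.5, 1), 2, 2)     # Sigma_B (hypothetical example)
SAh  = msqrt(SA)                         # Sigma_A^{1/2}
SAhi = solve(SAh)                        # Sigma_A^{-1/2}
A = SAhi %*% msqrt(SAh %*% SB %*% SAh) %*% SAhi
max(abs(A %*% SA %*% A - SB))            # ~ 0: A Sigma_A A = Sigma_B
```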

#### Optimal mapping, on real data

To illustrate, it is possible to consider the optimal matching, between the height of \(n\) men and \(n\) women,

Another example (discussed in Optimal Transport for Counterfactual Estimation: A Method for Causal Inference – with a nice R notebook created by Ewen), consider Black and non-Black mothers in the U.S.

or the joint mapping, in dimension 2

We will spend more time on those functions (and the related concept) in a few weeks, when discussing barycenters and geodesics… More details in the slides (online) and in the forthcoming textbook,
