
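The Mahalanobis distance measures how far a point lies from the center of a multivariate distribution (or from another point), taking the variances and correlations of the variables into account. It’s frequently used to detect outliers in statistical analyses involving several variables.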

This tutorial describes how to calculate the Mahalanobis distance in R.


## Mahalanobis Distance in R

First, we need to create a data frame.

### Step 1: Create the dataset

The example dataset describes 20 students: their exam score (`score`), the number of hours they spent studying (`hours`), a preparation count (`prep`), and their current grade (`grade`).


```
# Create the dataset
data <- data.frame(score = c(81, 83, 92, 87, 96, 73, 68, 77, 78, 97, 99, 86, 84, 96, 70, 80, 83, 83, 73, 70),
                   hours = c(7, 8, 3, 1, 4, 3, 2, 5, 5, 5, 2, 3, 4, 8, 3, 3, 7, 3, 4, 1),
                   prep = c(3, 4, 0, 3, 5, 0, 1, 2, 1, 2, 3, 5, 3, 2, 2, 1, 5, 3, 2, 3),
                   grade = c(80, 78, 80, 80, 84, 85, 88, 94, 91, 95, 79, 82, 95, 84, 81, 93, 83, 80, 89, 79))

# View the first six rows
head(data)
  score hours prep grade
1    81     7    3    80
2    83     8    4    78
3    92     3    0    80
4    87     1    3    80
5    96     4    5    84
6    73     3    0    85
```

### Step 2: Calculate the Mahalanobis distance for each observation

We can use the built-in mahalanobis() function in R.

Its syntax is as follows:

`mahalanobis(x, center, cov)`


where:

x: the matrix (or data frame) of data, with one row per observation

center: the mean vector of the distribution

cov: the covariance matrix of the distribution
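To make the three arguments concrete, here is a minimal sketch on two toy points. The values and the variable names (`x`, `center`, `d2_identity`, `d2_scaled`) are made up for illustration and are not part of the tutorial's dataset.

```r
# Two toy points in 2-D (hypothetical values, one point per row)
x <- rbind(c(3, 4), c(0, 0))
center <- c(0, 0)

# With an identity covariance matrix, the result is the squared
# Euclidean distance from the center: 3^2 + 4^2 = 25 for the first point
d2_identity <- mahalanobis(x, center, diag(2))
print(d2_identity)

# With variances 9 and 16, each squared coordinate is divided by its
# variance: 9/9 + 16/16 = 2 for the first point
d2_scaled <- mahalanobis(x, center, diag(c(9, 16)))
print(d2_scaled)
```

The second point sits exactly at the center, so its distance is 0 in both cases.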

Now we can calculate the distance for each observation. Note that `mahalanobis()` returns the squared Mahalanobis distance.

```
mahalanobis(data, colMeans(data), cov(data))
 3.3431887 5.7202321 7.3521513 3.1990061 4.2208239 3.4181516 3.1017453 2.8156955 1.9605904 5.6692191 5.3856421 3.5954695 3.9963068 5.9551989 2.4928251 2.4151973 4.3417003 0.9334786 1.4406139 4.6427634
```
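As a cross-check on what the function computes, the sketch below rebuilds the squared distance by hand as (x - mean)' S^-1 (x - mean) for every row and compares it with the built-in result. The helper names (`X`, `mu`, `S_inv`, `centered`, `d2_manual`, `d2_builtin`) are illustrative, and the data frame is repeated so the block runs on its own.

```r
# Same dataset as in Step 1, repeated so this block is self-contained
data <- data.frame(
  score = c(81, 83, 92, 87, 96, 73, 68, 77, 78, 97, 99, 86, 84, 96, 70, 80, 83, 83, 73, 70),
  hours = c(7, 8, 3, 1, 4, 3, 2, 5, 5, 5, 2, 3, 4, 8, 3, 3, 7, 3, 4, 1),
  prep  = c(3, 4, 0, 3, 5, 0, 1, 2, 1, 2, 3, 5, 3, 2, 2, 1, 5, 3, 2, 3),
  grade = c(80, 78, 80, 80, 84, 85, 88, 94, 91, 95, 79, 82, 95, 84, 81, 93, 83, 80, 89, 79)
)

X <- as.matrix(data)
mu <- colMeans(X)           # mean vector
S_inv <- solve(cov(X))      # inverse of the sample covariance matrix

# Subtract the column means from every row
centered <- sweep(X, 2, mu)

# Row-wise quadratic form: (x - mu)' S^-1 (x - mu)
d2_manual <- rowSums((centered %*% S_inv) * centered)

# Built-in version for comparison
d2_builtin <- mahalanobis(data, colMeans(data), cov(data))

print(all.equal(unname(d2_manual), unname(d2_builtin)))
```

If the two agree, it confirms that the built-in values are squared distances rather than their square roots.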

### Step 3: Calculate the p-value

Based on the Step 2 result, some of the distances are much larger than others. To determine whether any of them is statistically significant, we need to calculate p-values.


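The p-value for each observation is the upper-tail probability of its squared Mahalanobis distance under a Chi-Square distribution. Under multivariate normality, the squared distance approximately follows a Chi-Square distribution with k degrees of freedom, where k is the number of variables (here k = 4). The code in this step uses df = 3 (that is, k - 1), which yields somewhat smaller p-values than df = 4 would.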
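To see what this lookup does for a single value, here is a small sketch using the largest squared distance from the Step 2 output (7.3521513, observation 3). The variable names are illustrative. It also compares df = 3 with df = 4, the number of variables.

```r
# Squared Mahalanobis distance of observation 3, copied from Step 2
d2 <- 7.3521513

# Upper-tail Chi-Square probability with 3 degrees of freedom
p_df3 <- pchisq(d2, df = 3, lower.tail = FALSE)

# The same quantity written as 1 minus the lower tail
p_df3_alt <- 1 - pchisq(d2, df = 3)

# Upper-tail probability with 4 degrees of freedom (one per variable)
p_df4 <- pchisq(d2, df = 4, lower.tail = FALSE)

print(c(df3 = p_df3, df3_alt = p_df3_alt, df4 = p_df4))
```

With df = 3 the result is about 0.0615. With df = 4 the p-value is larger, so the choice of degrees of freedom affects how readily an observation is flagged.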

```
# Store the squared Mahalanobis distance of each observation in a new column
data$mahalnobis <- mahalanobis(data, colMeans(data), cov(data))

# View the data frame
data
   score hours prep grade mahalnobis
1     81     7    3    80  3.3431887
2     83     8    4    78  5.7202321
3     92     3    0    80  7.3521513
4     87     1    3    80  3.1990061
5     96     4    5    84  4.2208239
6     73     3    0    85  3.4181516
7     68     2    1    88  3.1017453
8     77     5    2    94  2.8156955
9     78     5    1    91  1.9605904
10    97     5    2    95  5.6692191
11    99     2    3    79  5.3856421
12    86     3    5    82  3.5954695
13    84     4    3    95  3.9963068
14    96     8    2    84  5.9551989
15    70     3    2    81  2.4928251
16    80     3    1    93  2.4151973
17    83     7    5    83  4.3417003
18    83     3    3    80  0.9334786
19    73     4    2    89  1.4406139
20    70     1    3    79  4.6427634
```

Now let’s calculate the p-value for each distance.


```
# Upper-tail Chi-Square probability of each squared distance
data$pvalue <- pchisq(data$mahalnobis, df = 3, lower.tail = FALSE)
data
   score hours prep grade mahalnobis     pvalue
1     81     7    3    80  3.3431887 0.34167668
2     83     8    4    78  5.7202321 0.12604387
3     92     3    0    80  7.3521513 0.06148152
4     87     1    3    80  3.1990061 0.36194826
5     96     4    5    84  4.2208239 0.23858527
6     73     3    0    85  3.4181516 0.33153375
7     68     2    1    88  3.1017453 0.37620253
8     77     5    2    94  2.8156955 0.42092267
9     78     5    1    91  1.9605904 0.58062647
10    97     5    2    95  5.6692191 0.12886057
11    99     2    3    79  5.3856421 0.14564075
12    86     3    5    82  3.5954695 0.30858950
13    84     4    3    95  3.9963068 0.26186321
14    96     8    2    84  5.9551989 0.11381036
15    70     3    2    81  2.4928251 0.47658914
16    80     3    1    93  2.4151973 0.49081192
17    83     7    5    83  4.3417003 0.22685238
18    83     3    3    80  0.9334786 0.81734205
19    73     4    2    89  1.4406139 0.69604281
20    70     1    3    79  4.6427634 0.19990417
```

In general, an observation whose p-value is less than 0.001 is considered an outlier. In this case, all the p-values are greater than 0.001, so none of the observations is flagged as an outlier.
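The 0.001 rule can also be applied programmatically. The sketch below flags outliers in two equivalent ways: by comparing p-values with the threshold, and by comparing the squared distances with the matching Chi-Square critical value from `qchisq()`. The names `alpha`, `df_used`, `cutoff`, `outliers_by_p` and `outliers_by_cutoff` are illustrative, and `df_used = 3` simply mirrors the value used in Step 3.

```r
# Same dataset as in Step 1, repeated so this block is self-contained
data <- data.frame(
  score = c(81, 83, 92, 87, 96, 73, 68, 77, 78, 97, 99, 86, 84, 96, 70, 80, 83, 83, 73, 70),
  hours = c(7, 8, 3, 1, 4, 3, 2, 5, 5, 5, 2, 3, 4, 8, 3, 3, 7, 3, 4, 1),
  prep  = c(3, 4, 0, 3, 5, 0, 1, 2, 1, 2, 3, 5, 3, 2, 2, 1, 5, 3, 2, 3),
  grade = c(80, 78, 80, 80, 84, 85, 88, 94, 91, 95, 79, 82, 95, 84, 81, 93, 83, 80, 89, 79)
)

alpha <- 0.001    # outlier threshold from the text
df_used <- 3      # degrees of freedom as used in Step 3 (assumption carried over)

# Squared distances and their upper-tail p-values
d2 <- mahalanobis(data, colMeans(data), cov(data))
pvalue <- pchisq(d2, df = df_used, lower.tail = FALSE)

# Approach 1: rows whose p-value falls below the threshold
outliers_by_p <- which(pvalue < alpha)

# Approach 2: rows whose squared distance exceeds the critical value
cutoff <- qchisq(1 - alpha, df = df_used)
outliers_by_cutoff <- which(d2 > cutoff)

print(cutoff)
print(outliers_by_p)
print(outliers_by_cutoff)
```

Both approaches return an empty index vector for this dataset, which agrees with the conclusion that no observation is an outlier at the 0.001 level.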


The post How to Calculate Mahalanobis Distance in R appeared first on finnstats.