How to apply the Mann-Whitney U Test in R


In statistics, the Mann–Whitney U test (also called Wilcoxon rank-sum test) is a nonparametric test of the null hypothesis that it is equally likely that a randomly selected value from one population will be less than or greater than a randomly selected value from a second population. This test can be used to investigate whether two independent samples were selected from populations having the same distribution.

Some investigators interpret this test as comparing the medians between the two populations. Recall that the parametric test compares the means ( \(H_0: \mu_1=\mu_2\) ) between independent groups.

In contrast, the null and two-sided research hypotheses for the nonparametric test are stated as follows:

\(H_0\): The two populations are equal versus

\(H_1\): The two populations are not equal

This test is often performed as a two-sided test and, thus, the research hypothesis indicates that the populations are not equal as opposed to specifying directionality. A one-sided research hypothesis is used if interest lies in detecting a positive or negative shift in one population as compared to the other. The procedure for the test involves pooling the observations from the two samples into one combined sample, keeping track of which sample each observation comes from, and then ranking the pooled observations from lowest to highest, from 1 to \(n_1+n_2\).
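To make the ranking procedure concrete, the small sketch below (with made-up numbers, for illustration only) pools two samples, ranks the pooled values, and checks that the U statistic computed from the ranks matches the W statistic reported by R's wilcox.test():

# two small, made-up samples (hypothetical values, for illustration only)
x <- c(7.1, 8.4, 9.9, 6.3, 10.2)
y <- c(5.0, 6.1, 5.8, 7.0)

# pool the observations and rank them from 1 to n1 + n2
pooled_ranks <- rank(c(x, y))

# U statistic of the first sample: sum of its ranks minus n1*(n1+1)/2
n1 <- length(x)
U <- sum(pooled_ranks[seq_len(n1)]) - n1 * (n1 + 1) / 2
U

# wilcox.test() reports the same quantity as its W statistic
wilcox.test(x, y)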

Mann-Whitney U Test on the Breast Cancer Dataset

We will work with the Breast Cancer Wisconsin dataset, where we will apply the Mann-Whitney U test to every independent variable, comparing the patients diagnosed with a malignant tumor against those with a benign one.

We produce a report with the Mean, the Standard Deviation, the Median, the Difference in Medians, as well as the p-value of the Mann-Whitney U test for each variable.

library(tidyverse)


# the column names of the dataset
names <- c('id_number', 'diagnosis', 'radius_mean', 
           'texture_mean', 'perimeter_mean', 'area_mean', 
           'smoothness_mean', 'compactness_mean', 
           'concavity_mean','concave_points_mean', 
           'symmetry_mean', 'fractal_dimension_mean',
           'radius_se', 'texture_se', 'perimeter_se', 
           'area_se', 'smoothness_se', 'compactness_se', 
           'concavity_se', 'concave_points_se', 
           'symmetry_se', 'fractal_dimension_se', 
           'radius_worst', 'texture_worst', 
           'perimeter_worst', 'area_worst', 
           'smoothness_worst', 'compactness_worst', 
           'concavity_worst', 'concave_points_worst', 
           'symmetry_worst', 'fractal_dimension_worst')

# get the data from the URL and assign the column names
# (the wdbc.data file has no header row, so we set header = FALSE)
df <- read.csv(url("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data"),
               header = FALSE, col.names = names)

# remove the ID number
df <- df %>% select(-id_number)


# get the means of all the variables
means <- df %>% group_by(diagnosis) %>% summarise_all(list(mean), na.rm = TRUE) %>%
  gather("Variable", "Mean", -diagnosis) %>% spread(diagnosis, Mean) %>%
  rename("Mean_M" = "M", "Mean_B" = "B")

# get the standard deviations of all the variables
sds <- df %>% group_by(diagnosis) %>% summarise_all(list(sd), na.rm = TRUE) %>%
  gather("Variable", "SD", -diagnosis) %>% spread(diagnosis, SD) %>%
  rename("SD_M" = "M", "SD_B" = "B")

# get the medians of all the variables
medians <- df %>% group_by(diagnosis) %>% summarise_all(list(median), na.rm = TRUE) %>%
  gather("Variable", "Median", -diagnosis) %>% spread(diagnosis, Median) %>%
  rename("Median_M" = "M", "Median_B" = "B")

# join the tables and compute the difference in medians
summary_report <- means %>% inner_join(sds, by = "Variable") %>%
  inner_join(medians, by = "Variable") %>%
  mutate(DiffInMedians = Median_M - Median_B)



# now apply the Mann-Whitney U test to every variable
variables <- colnames(df)[2:ncol(df)]

pvals <- c()
vars <- c()

for (i in variables) {

  xxx <- df %>% select(c("diagnosis", i))

  # split the variable into the malignant (M) and benign (B) groups
  x1 <- xxx %>% filter(diagnosis == "M") %>% dplyr::select(c(2)) %>% na.omit() %>% pull()
  x2 <- xxx %>% filter(diagnosis == "B") %>% dplyr::select(c(2)) %>% na.omit() %>% pull()

  # two-sided Mann-Whitney U (Wilcoxon rank-sum) test
  wc <- wilcox.test(x1, x2)

  pvals <- c(pvals, round(wc$p.value, 4))
  vars <- c(vars, i)
}

wc_df <- data.frame(Variable = vars, pvalues = pvals)
wc_df$Variable <- as.character(wc_df$Variable)

summary_report <- summary_report %>% inner_join(wc_df, by = "Variable")

If we run the R script above, we get the following output. As we can see, almost all of the variables appear to be statistically significant (p-values < 0.05) between the two groups (Malignant and Benign). The only variables that do not appear to be statistically significant are fractal_dimension_mean, smoothness_se and texture_se.


Variable Mean_B Mean_M SD_B SD_M Median_B Median_M DiffInMedians pvalues
area_mean 462.7902 978.2692 134.2871 368.8097 458.4 930.9 472.5 0
area_se 21.13515 72.28981 8.843472 61.24716 19.63 58.38 38.75 0
area_worst 558.8994 1419.458 163.6014 597.967 547.4 1302 754.6 0
compactness_mean 0.0800846 0.1445602 0.03375 0.0533352 0.07529 0.1319 0.05661 0
compactness_se 0.0214383 0.0322017 0.0163515 0.0183944 0.01631 0.02855 0.01224 0
compactness_worst 0.1826725 0.373446 0.09218 0.1695886 0.1698 0.3559 0.1861 0
concave_points_mean 0.0257174 0.0877099 0.0159088 0.0342122 0.02344 0.08624 0.0628 0
concave_points_se 0.0098577 0.0150566 0.0057086 0.0055302 0.009061 0.0142 0.005139 0
concave_points_worst 0.0744443 0.1818432 0.0357974 0.0460601 0.07431 0.182 0.10769 0
concavity_mean 0.0460576 0.1601144 0.0434422 0.0745776 0.03709 0.1508 0.11371 0
concavity_se 0.0259967 0.0417676 0.0329182 0.0216391 0.0184 0.0371 0.0187 0
concavity_worst 0.1662377 0.4493672 0.1403677 0.1810384 0.1412 0.4029 0.2617 0
fractal_dimension_mean 0.0628674 0.0626041 0.0067473 0.0075099 0.06154 0.06149 -0.00005 0.4788
fractal_dimension_se 0.0036361 0.0040523 0.0029382 0.002041 0.002808 0.003739 0.000931 0
fractal_dimension_worst 0.0794421 0.0914002 0.0138041 0.021521 0.07712 0.08758 0.01046 0
perimeter_mean 78.07541 115.3301 11.80744 21.90059 78.18 114.2 36.02 0
perimeter_se 2.000321 4.303716 0.7711692 2.557696 1.851 3.654 1.803 0
perimeter_worst 87.00594 141.1655 13.52709 29.37531 86.92 137.9 50.98 0
radius_mean 12.14652 17.46033 1.780512 3.211384 12.2 17.3 5.1 0
radius_se 0.2840824 0.6067796 0.1125696 0.3442221 0.2575 0.5449 0.2874 0
radius_worst 13.3798 21.11469 1.981368 4.283704 13.35 20.58 7.23 0
smoothness_mean 0.0924777 0.102825 0.0134461 0.0125927 0.09076 0.102 0.01124 0
smoothness_se 0.0071959 0.0067819 0.0030606 0.0028972 0.00653 0.006208 -0.000322 0.2134
smoothness_worst 0.1249595 0.144763 0.0200135 0.021889 0.1254 0.1434 0.018 0
symmetry_mean 0.174186 0.1926768 0.0248068 0.0274958 0.1714 0.1896 0.0182 0
symmetry_se 0.0205838 0.0204271 0.0069985 0.0100671 0.01909 0.01768 -0.00141 0.0225
symmetry_worst 0.2702459 0.3228204 0.0417448 0.0742636 0.2687 0.3103 0.0416 0
texture_mean 17.91476 21.6581 3.995125 3.708042 17.39 21.46 4.07 0
texture_se 1.22038 1.212363 0.5891797 0.4838656 1.108 1.127 0.019 0.6195
texture_worst 23.51507 29.37502 5.493955 5.384249 22.82 29.02 6.2 0
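
For convenience, the non-significant variables can also be picked out directly from the report; for example (a small sketch using the summary_report data frame built above):

# variables whose Mann-Whitney p-value exceeds the 5% threshold
summary_report %>%
  filter(pvalues >= 0.05) %>%
  select(Variable, Median_B, Median_M, DiffInMedians, pvalues)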

Discussion

Generally, it is a good idea to apply the Mann-Whitney U test during the Exploratory Data Analysis phase, since it gives us an idea of which variables may be significant for the final machine learning model. Most importantly, because it is a non-parametric test, we do not need to make any assumption about the distribution of the variables. For example, instead of Student's t-test we can apply the Mann-Whitney U test without worrying about the normality assumption.
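
For instance, for a single variable such as area_mean, the two tests can be run side by side (a quick sketch, assuming the df data frame loaded above):

m <- df %>% filter(diagnosis == "M") %>% pull(area_mean)
b <- df %>% filter(diagnosis == "B") %>% pull(area_mean)

# parametric comparison of the means (relies on approximate normality)
t.test(m, b)

# non-parametric Mann-Whitney U / Wilcoxon rank-sum test (no normality assumption)
wilcox.test(m, b)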
