Identifying gaps in your data

[This article was first published on R – The Past by Numbers, and kindly contributed to R-bloggers.]

One of the first things you want to do when you explore a new dataset is to identify possible gaps. Sample size and the number of variables are relevant but…how many observations do you have for each variable? This distinction is even more relevant for archaeologists because (if we are being honest…) most of our data has huge gaps.

Just to make the post clear:
– The sample is the set of entities you collected.
– Variables are measures and properties of this sample.
– Observations are the values of the variables for each item in your sample.

The identification of variables with a decent number of observations is crucial for several processes. Let’s say that you have a bunch of archaeological sites and you want to create a map where the size of each dot (i.e. site) is proportional to the area of the site. This would be a bad idea if 90% of your sample does not have an assigned area because these points will be ignored.

This is even more relevant if you want to do some modelling (e.g. a linear regression). Many statistical models silently drop any observation that has a missing value in one of the model's variables, so you have to be very careful about it. Let's see how we can explore this issue.
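A minimal sketch of this behaviour, using a small toy data frame rather than any real dataset: `lm()` quietly omits rows with missing predictor values, and `nobs()` reveals how many observations the model actually used.

```r
# Toy data: one predictor with a gap (NA) in the third row
d <- data.frame(x = c(1, 2, NA, 4, 5),
                y = c(2.1, 3.9, 6.2, 8.1, 9.8))

m <- lm(y ~ x, data = d)  # the row with NA in x is silently dropped
nobs(m)                   # 4, not 5
```

If most of a variable is missing, the model ends up fitted on a small fraction of your sample without any loud warning.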

Example: Arrowheads in the UK

Loading the dataset

We downloaded a dataset of arrowheads collected by the Portable Antiquities Scheme:

arrowheads <- read.csv("https://git.io/v9JJd", na.strings = "")

As you can see, I specified that empty strings of text should be read as NA (i.e. Not Available). If you don't do that, they will be read as empty strings, which is different from not having a value.

If we take a look at the newly created arrowheads data frame we will see a bunch of interesting metrics:

str(arrowheads)

You should get something like:

'data.frame':   1079 obs. of  13 variables:
 $ id               : int  522443 179174 233283 199204 106547 45059 508485 646649 401936 133638 ...
 $ classification   : Factor w/ 81 levels "Arrowhead","Barb and Tanged",..: NA NA 10 NA NA NA NA NA NA NA ...
 $ subClassification: Factor w/ 19 levels "barbed and tanged",..: NA NA NA NA NA NA NA NA NA NA ...
 $ length           : num  28 55.2 45.3 100.3 39 ...
 $ width            : num  18 11 25 7.6 11 ...
 $ thickness        : num  3 2 5.11 6.7 NA 3.54 3.47 6.5 1 NA ...
 $ diameter         : num  2.2 4.5 4.58 5.1 6 6.82 6.96 7.5 8 8 ...
 $ weight           : num  NA 6.1 2.78 16.69 4.76 ...
 $ quantity         : int  1 1 1 1 1 1 1 1 1 1 ...
 $ broadperiod      : Factor w/ 12 levels "BRONZE AGE","EARLY MEDIEVAL",..: 1 4 1 4 1 4 4 4 4 4 ...
 $ fromdate         : int  -2150 NA -2150 1250 -2150 1066 1066 1200 1066 1200 ...
 $ todate           : int  -1500 NA -1500 1499 -800 1350 1400 1300 1500 1499 ...
 $ district         : Factor w/ 188 levels "Arun","Ashford",..: 45 181 74 NA 26 93 177 137 186 53 ...

See all these NA values? These are the gaps in our data. We can already suspect that diameter or subClassification will not be very useful, but with over 1000 arrowheads it is difficult to know by eye which variables should be used in the analysis.
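Before any plotting, a quick numeric summary already helps: `colSums(is.na(...))` counts the missing values per variable and `colMeans(is.na(...))` turns them into proportions. A sketch on a toy frame standing in for the arrowheads data:

```r
# Toy stand-in for the arrowheads data frame
toy <- data.frame(
  length   = c(28, 55.2, NA, 100.3),
  diameter = c(NA, NA, NA, 5.1),
  district = c("Arun", NA, "Ashford", "Arun")
)

colSums(is.na(toy))                    # absolute count of NAs per variable
round(colMeans(is.na(toy)) * 100, 1)  # percentage missing per variable
```

Run on the real `arrowheads` data frame, the same two lines give you the exact missingness figures behind the map below.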

Visualizing gaps

How can you identify these gaps? My preferred method is to visualize them using the Amelia package (yes, awesome name for an R package on missing data…). Its use is straightforward:

install.packages("Amelia")
library("Amelia")
missmap(arrowheads)

Missingness map

The structure is classic R: rows are sample units, while columns are variables. Red cells are the ones that contain observed values, while the lighter ones are missing.

Interpretation

The map of missing values allows us to make informed decisions on how to proceed with the analysis. In this case:
– We should not use diameter for analysis because it is missing for most of the sample.
– We have almost complete information on broad spatial and temporal coordinates (broadperiod and district).
– classification and subClassification are of little use here.
– The measures that can be used are weight, thickness, width and length.
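Once you have chosen the usable variables, `complete.cases()` lets you keep only the rows that have all of them observed. A sketch on a toy frame (not the real data; variable names match the dataset, row values are invented):

```r
# Toy stand-in: three usable measures, each with one gap
toy <- data.frame(
  length = c(28, NA, 45.3, 100.3),
  width  = c(18, 11, NA, 7.6),
  weight = c(NA, 6.1, 2.78, 16.69)
)

# Keep only rows complete on the chosen variables
usable <- toy[complete.cases(toy[, c("length", "width", "weight")]), ]
nrow(usable)  # 1 -- only the last row has all three measures
```

Doing this subsetting explicitly, before modelling, makes it obvious how many observations you are actually working with.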

Impact

You can easily visualize the impact by creating one plot that uses diameter and another one that does not:

library(ggplot2)
ggplot(arrowheads, aes(x = width, y = diameter, col = broadperiod)) +
  geom_point() + theme_bw() + facet_wrap(~broadperiod)

Scatterplot width vs diameter

Not looking good… ggplot2 even warns you that 1051 rows containing missing values were removed, and most of the periods are gone with them. Compare it with:

ggplot(arrowheads, aes(x = width, y = length, col = broadperiod)) +
  geom_point() + theme_bw() + facet_wrap(~broadperiod)

Scatterplot width vs length

Only 84 rows contained missing values, much better!
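The counts that ggplot2 reports are simply the rows with a missing value in one of the plotted aesthetics, so you can predict them before plotting. A toy sketch (invented values, not the real arrowheads data):

```r
# Toy frame: one NA in each of the two plotted variables
toy <- data.frame(width  = c(18, 11, NA, 7.6),
                  length = c(28, NA, 45.3, 100.3))

# Rows ggplot2 would drop from a width-vs-length scatterplot
sum(!complete.cases(toy[, c("width", "length")]))  # 2
```

Running the same one-liner on `arrowheads` with the variable pairs above reproduces the 1051 and 84 figures from the warnings.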
