
Alright. Time to take a crack at another data set and see if I can hone some of those data cleaning skills that I’ve just learned on Coursera.

The data sets I’m using are from the World Bank’s report, with data compiled for 1995, 2000, and 2005. The data were all combined in one Excel workbook, so I saved the three annual data sets in separate files and then loaded them in R as the variables data95, data20, and data05 for 1995, 2000, and 2005, respectively.

Given that the data are not exactly arranged in a tidy manner, some cleaning had to be done. I experimented with combining the data frames in a list so that I could loop over all of them at once. I’ve been told that the plyr package provides that functionality, but I’m not used to it yet; besides, I’m still going through all the online tutorials.
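As a small illustration of the list idea, here’s a minimal base-R sketch using lapply() and do.call(rbind, ...), which is essentially what plyr’s ldply() wraps. The toy data frames and values below are hypothetical, just to show the pattern:

```r
# Toy stand-ins for the annual data frames (hypothetical values)
data95 <- data.frame(Economy = c("A", "B"), Total = c("1,000", "2,000"))
data05 <- data.frame(Economy = c("A", "B"), Total = c("1,500", "2,500"))

# Collect the data frames in a list...
dflist <- list(data95, data05)

# ...so one cleaning step can be applied to every element at once,
# instead of repeating the code for each data frame
dflist <- lapply(dflist, function(df) {
  df$Total <- gsub(",", "", df$Total)  # strip the thousands separators
  df
})

# Stack the cleaned frames into one data frame;
# plyr::ldply(dflist) does the same job in one call
combined <- do.call(rbind, dflist)
```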

Below is the code I used to combine the data sets and clean them by removing the rows and columns that are not needed.

library(plyr)
library(dplyr)
library(stringr)

#adding the year column to each data frame
#adding the year column to each data frame
data95$year = "1995"
data20$year = "2000"
data05$year = "2005"

#combine all data sets in a list
dflist = list(data95, data20, data05)

for(i in 1:length(dflist)){
  #remove all rows where the second column shows no data
  for(j in nrow(dflist[[i]]):1){
    if(nchar(dflist[[i]][j,2]) == 0){dflist[[i]] = dflist[[i]][-j,]}
  }
  #remove all columns where all values are NA
  dflist[[i]] = naAll.omit(dflist[[i]])
  #remove all rows where there is an NA value
  dflist[[i]] = na.omit(dflist[[i]])
  #remove all commas from all columns
  for(k in ncol(dflist[[i]]):1){
    dflist[[i]][,k] = str_replace_all(dflist[[i]][,k], ",", "")
  }
}

#convert from a list to a data frame
final = ldply(dflist)

The psych R package has the helpful describeBy() function, which runs summary statistics for every unique value of a particular variable in a given data frame. And although you can subset the output to extract any particular group’s summary stats, I didn’t find it intuitive enough for manipulation. So I thought making a custom function would be more helpful, at least for me. Below is a function that returns a data frame of summary stats for all the groups. Its parameters are the data frame and the name of the grouping column, and the resulting data frame was assigned to the variable sumFinal. Note that the describe() function used inside comes from the psych package, so that package needs to be loaded as well.

library(psych)

deStats = function(data, column){
  #index number of the column name
  gIndex = which(colnames(data) == column)
  #assign the column's unique values to a variable
  groups = unique(data[,gIndex])
  #create an empty list, to be filled later
  clist = list()
  #a loop to fill the list with each group's summary data frame
  for(i in 1:length(groups)){
    x = which(colnames(data) == column)
    y = groups[i]
    z = subset(data, data[,x] == y)
    a = describe(z)
    #add columns that include the group name and the measurements
    a$group = y
    a$measurements = rownames(a)
    #add data frames to the empty list
    clist[[i]] = a
  }
  #convert the list to a data frame
  b = ldply(clist)
  limit1 = ncol(b)-1
  limit2 = ncol(a)-2
  cPost = c(limit1:ncol(b))
  #re-position the group and measurement columns
  b = b[,c(cPost,1:limit2)]
  #remove row names
  rownames(b) = NULL
  return(b)
}

sumFinal = deStats(final, "Region")

Most of the NA values that were generated are a result of running summary stats on the character variables. You can find the data output at the end of this post.

I thought it might also be a good time for me to try out the explData() function I mentioned in my first post, and see how it fares with this tidy data. It kept giving me an error, as I seem to have not placed the character class filters in the function’s code. So I had to run the function on only the numeric columns, like so…

explData(final[,5:21])

…which gives me the following output:

To elaborate on the plot once more: the blue boxes represent a positive relationship, and the strength of that relationship is signified by the color’s intensity. The crossed-out boxes simply mean that the relationship is not statistically significant at an alpha level of 0.05.

I was also interested in checking whether I could derive the percentage change for each variable from 1995 to 2005. The only problem I had with this is that not all countries have data for both 1995 and 2005. With that in mind, the loops would definitely create quite a number of NA values during the analysis. I also noticed that I’m using a whole bunch of loops in my data cleaning and summarization; something I’m hoping I will eventually stop doing once I get the hang of the plyr and dplyr packages. Anyway, here’s the code for calculating the percentages, working on the final tidy data set that was created earlier.

#list of unique countries
uniCount = unique(final$Economy)

#create empty data frame to fill in later using loop
perDF = data.frame()

for(i in 1:length(uniCount)){

  x = final[final$Economy == uniCount[i] & final$year == "1995",]
  y = final[final$Economy == uniCount[i] & final$year == "2005",]

  for(j in 1:ncol(y)){
    if(class(x[1,j]) != "character"){
      perDF[i,j] = (y[1,j] - x[1,j])/x[1,j]
    } else {
      if(j <= 4){
        perDF[i,j] = x[1,j]
      } else {
        perDF[i,j] = NA
      }
    }
  }
}

for(i in nrow(perDF):1){
  if(sum(is.na(perDF[i,])) == ncol(perDF)){perDF = perDF[-i,]}
}

colnames(perDF) = make.names(colnames(final))
perDF = naAll.omit(perDF)
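Incidentally, the same 1995-to-2005 comparison can be done without the nested loops, by splitting the data by year and joining on country with base R’s merge(); countries missing either year simply drop out of the join, which sidesteps much of the NA clean-up afterwards. Here’s a minimal sketch with hypothetical toy data (a single numeric column, to keep it short):

```r
# Toy version of the tidy data (hypothetical values)
final_toy <- data.frame(
  Economy = c("A", "A", "B", "B", "C"),
  year    = c("1995", "2005", "1995", "2005", "1995"),
  Pasture = c(100, 120, 200, 150, 50)
)

# Split into the two years and join on country; country "C" has no
# 2005 row, so merge() drops it automatically
y95 <- final_toy[final_toy$year == "1995", c("Economy", "Pasture")]
y05 <- final_toy[final_toy$year == "2005", c("Economy", "Pasture")]
both <- merge(y95, y05, by = "Economy", suffixes = c(".95", ".05"))

# Percentage change, computed for all countries in one vectorized step
both$pct_change <- (both$Pasture.05 - both$Pasture.95) / both$Pasture.95
```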

Using the ever so wonderful ggplot2 package (and also with the help of this), I’ve plotted the percentage change in each country’s pasture land wealth, color-coded by the country’s region. Below is the code and the output:

library(ggplot2)

plot = ggplot(perDF, aes(x = reorder(Economy,Pasture.Land), y = Pasture.Land, fill = Region)) +
geom_bar(stat = "identity")

plot + coord_flip() + labs(x = "Country", y = "% Change") +
theme(axis.text.y=element_text(face = "bold", color = "black"),
axis.text.x=element_text(face = "bold", color = "black")) +
ggtitle("Percentage Change in Pasture Land Wealth - 1995 to 2005") +
theme(plot.title = element_text(face = "bold", size=rel(1.5)))
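A quick note on the reorder() call in the aes() mapping above: it re-levels a factor by the values of a second variable, which is what makes the bars come out sorted by percentage change rather than alphabetically. A tiny standalone demonstration with made-up values:

```r
# reorder() sorts a factor's levels by another variable (ascending)
econ   <- c("A", "B", "C")
change <- c(0.3, -0.1, 0.2)
f <- reorder(econ, change)
levels(f)  # levels now ordered by change: "B", "C", "A"
```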


Interesting note on the above plot: African countries situated in the sub-Saharan region seem to have experienced the largest reduction in wealth derived from pasture land, while the opposite seems to hold true for countries situated in Latin America and the Caribbean.

There’s actually still a whole lot more room for exploration in this data set, but this is all I’m going to post, because the code for exploring the rest is pretty much the same.

If you’re interested in the data sets I’ve used and the outputs that resulted from the summarization, I’ve uploaded the files below:

Raw datasets:

Outputs:

Tagged: correlation, data, functions, programming, r, rstats, tidy data, world bank