Freshwater access in rural regions, using d3Network to explore similarities

January 1, 2014

(This article was first published on dataism » R, and kindly contributed to R-bloggers)

This post describes the construction of a similarity matrix and its use in creating grouped network graphs to examine freshwater access in rural regions of 194 countries around the world. The data come from the WHO/UNICEF Joint Monitoring Programme (JMP) for Water Supply and Sanitation and were downloaded from The World Bank on December 26, 2013.

Dataset construction

To run the code, you’ll need Christopher Gandrud’s d3Network package.

setwd("C:/_Rproject/ForceDirected")
require('d3Network',lib.loc="c:/r/packages/")

The following code snippet reads a .csv file containing two columns from the table linked above, Country (after removing any accents and diacritical marks) and Access_Rural. It strips leading and trailing whitespace from the fields, treats "." as a missing value, and creates a data frame called water.

water <- read.csv(file="3.5_Freshwater_useForCooccurrence_clean.csv", strip.white=TRUE, header=TRUE, sep=",", na.strings=c("."), colClasses=c('character','numeric'))

The metadata, available with the table linked earlier, contains the table name, income group, currency, region, and other fields for each country. The following commands load the metadata into a data frame and subset it to three columns of interest. I wanted High Income countries in one group, regardless of OECD membership status, so the group names are cleaned before Income.Group is converted into an ordered integer code (ecogrp) and the result is merged into the final source data frame, water.

meta <- read.csv(file="FreshwaterMeta.csv", strip.white=TRUE, header=TRUE, sep=",", na.strings=c(" "))

meta <- subset(meta,Income.Group != "",select=c("Table.Name","Income.Group","Region"))

meta[2] <- lapply(meta[2], as.character)

meta$inc <-ifelse(substr(meta$Income.Group,1,1) =='H',"High income",meta$Income.Group)

meta$ecogrp <- as.integer(factor(meta$inc, levels=c("Low income","Lower middle income","Upper middle income","High income")))

water <-merge(water,meta, by.x = "Country", by.y = "Table.Name", all.x = TRUE)
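As a quick check of the recoding step, the sketch below applies the same logic to a small made-up data frame. The rows are invented for illustration, and the two "High income" labels are assumed to match the spelling used in the World Bank metadata.

```r
# Toy stand-in for the metadata; the rows are invented for illustration.
toy <- data.frame(
  Table.Name   = c("A", "B", "C", "D"),
  Income.Group = c("High income: OECD", "High income: nonOECD",
                   "Low income", "Upper middle income"),
  stringsAsFactors = FALSE
)

# Collapse every label that starts with "H" into a single "High income" group.
toy$inc <- ifelse(substr(toy$Income.Group, 1, 1) == "H",
                  "High income", toy$Income.Group)

# Map the four ordered income levels to the integers 1 to 4.
income_levels <- c("Low income", "Lower middle income",
                   "Upper middle income", "High income")
toy$ecogrp <- as.integer(factor(toy$inc, levels = income_levels))

print(toy[, c("Table.Name", "inc", "ecogrp")])
```

Any label not listed in income_levels would become NA in ecogrp, so a check such as any(is.na(toy$ecogrp)) is a cheap way to catch unexpected spellings.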

Given the size of my drawing area, between 800 and 1000 pixels, I divided the data frame by region to restrict the number of countries in each graph to a range of 50-70. The following command creates the data frame for one region, Europe & Central Asia, and restricts it to records with non-missing Access_Rural values. Other regional data frames were created in the same manner.

waterECA <- subset(water,Region=="Europe & Central Asia" & !is.na(Access_Rural))
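A data frame that spans more than one region could be built the same way with %in%. This is a minimal sketch on invented rows, not the code used for the graphs in this post; water_toy, regions, and water_sub are hypothetical names.

```r
# Toy stand-in for the merged data frame; values are invented for illustration.
water_toy <- data.frame(
  Country      = c("A", "B", "C", "D"),
  Region       = c("Europe & Central Asia", "East Asia & Pacific",
                   "South Asia", "Europe & Central Asia"),
  Access_Rural = c(98, NA, 85, 71),
  stringsAsFactors = FALSE
)

# Keep rows from either region that have a non-missing Access_Rural value.
regions <- c("Europe & Central Asia", "East Asia & Pacific")
water_sub <- subset(water_toy, Region %in% regions & !is.na(Access_Rural))

print(water_sub)
```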

Matrix Construction

To create the similarity matrix, I began with a square matrix of zeros with a row for each country.

m <- matrix(0, nrow=nrow(waterECA), ncol=nrow(waterECA))

The waterECA data frame can now be used to populate m with a set of non-negative values, bounded between 0 and 100, that reflect how far apart each pair of countries is. In the matrix m, each element (i,j) represents the absolute difference in percentages between country i and country j, so smaller values indicate greater similarity.

for (i in seq_along(waterECA$Country)) {
  for (j in seq_along(waterECA$Country)) {
    m[i, j] <- abs(waterECA$Access_Rural[i] - waterECA$Access_Rural[j])
  }
}

rownames(m) <- waterECA$Country

colnames(m) <- waterECA$Country
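The double loop above could also be written without explicit iteration. A minimal sketch using base R's outer() on three invented values follows; countries, access, and m_toy are hypothetical names.

```r
# Invented access percentages for three hypothetical countries.
countries <- c("A", "B", "C")
access    <- c(100, 90, 65)

# outer() evaluates access[i] - access[j] for every (i, j) pair at once.
m_toy <- abs(outer(access, access, "-"))
dimnames(m_toy) <- list(countries, countries)

print(m_toy)
```

The result is symmetric with zeros on the diagonal, which is why only one triangle is needed for the edge list.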

Only the elements above or below m’s diagonal are needed to create the set of edges for the graph. These next steps set m’s upper triangle and diagonal elements to NA, coerce m into a table of distinct country pairs and their corresponding similarity estimate, and subset the resulting data frame, links, to non-missing values.

m[upper.tri(m, diag=TRUE)] <- NA

links <- as.data.frame(as.table(m))

colnames(links)<-c("source","target","value")

links <- subset(links, !is.na(value))

Before passing links to d3Network, these next steps assign zero-based integer indices to the source and target countries, since D3 identifies each node by its position in the node list. By default, the levels of “source” and “target” in this case are the unique, alphabetically sorted country names from the same file (waterECA), so I used R's internal integer codes for these factors, obtained with as.integer(), and subtracted 1 from each.

links$sourceN <- as.integer(links$source) - 1 # shift to a zero-based index

links$targetN <- as.integer(links$target) - 1 # shift to a zero-based index

links <- subset(links,sourceN != targetN)

links <- subset(links,select=c("sourceN","targetN","value"))
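To see what the edge list looks like, here is the same sequence of steps run on a three-country toy matrix. The names and values are invented; with n countries the result should have n(n-1)/2 rows, one per distinct pair.

```r
# Toy dissimilarity matrix for three hypothetical countries.
countries <- c("A", "B", "C")
access    <- c(100, 90, 65)
m_toy <- abs(outer(access, access, "-"))
dimnames(m_toy) <- list(countries, countries)

# Blank out the upper triangle and diagonal, then reshape to long format.
m_toy[upper.tri(m_toy, diag = TRUE)] <- NA
links_toy <- as.data.frame(as.table(m_toy))
colnames(links_toy) <- c("source", "target", "value")
links_toy <- subset(links_toy, !is.na(value))

# Zero-based indices derived from the factor codes.
links_toy$sourceN <- as.integer(links_toy$source) - 1
links_toy$targetN <- as.integer(links_toy$target) - 1

print(links_toy)
```

Three countries give three rows: B-A with value 10, C-A with value 35, and C-B with value 25.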

The nodes data frame was created from unique values of the waterECA data frame.

nodes <-as.data.frame(unique(waterECA[,c("Country","ecogrp")]))
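Because the zero-based indices are interpreted as row positions in the node list, the rows of nodes need to be in the same order as the factor levels used for the links. A small self-contained check of that idea, on invented data with hypothetical names, might look like this:

```r
# Invented node table, assumed to be in the same order as the factor levels.
countries <- c("A", "B", "C")
nodes_toy <- data.frame(Country = countries, ecogrp = c(4L, 2L, 3L),
                        stringsAsFactors = FALSE)

# Invented link sources, stored as a factor over the same country names.
src <- factor(c("B", "C", "C"), levels = countries)
sourceN <- as.integer(src) - 1

# Row (index + 1) of the node table should give back the original name.
recovered <- nodes_toy$Country[sourceN + 1]
stopifnot(identical(recovered, as.character(src)))
```

If the check fails on real data, the links would point at the wrong countries even though the graph still renders.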

Graphing

d3Network’s d3ForceNetwork function will send the contents of the HTML file that displays the graph to the console unless the output is redirected. Since I have to modify the code slightly to render the graph in WordPress and make some other adjustments (described later), I called the sink function first to divert the output to a text file in my working directory.


sink("d3force-waterECA.txt")

d3ForceNetwork(Links = links, Nodes = nodes,
               Source = "sourceN", Target = "targetN",
               Value = "value", NodeID = "Country",
               Group = "ecogrp", width = 800, height = 800,
               opacity = 0.9)

sink() # stop diverting output and close the text file

The output of d3ForceNetwork can be easily customized. For example, by default, the link distance is fixed and the values in the set of edges determine the stroke width. Because each node in this data is connected to every other node, the unmodified graph looks like this.

[Image: NetworkBall]

Opening the text output file and varying the force layout's linkDistance and charge attributes helped make the graph more readable.

Original output:

var force = d3.layout.force()

.nodes(d3.values(nodes))

.links(links)

.size([width, height])

.linkDistance(50)

.charge(-120)

.on("tick", tick)

.start();

Sample modification:

.linkDistance(function(d) { return (d.value +1)*9; })
.charge(-1*Math.pow(nodes.length, 2))
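To get a feel for the sample modification, the same arithmetic can be reproduced in R. The edge values and the node count below are invented for illustration.

```r
# Link distance under the modified rule: (value + 1) * 9.
value <- c(0, 5, 20, 100)   # absolute differences in percentage points
link_distance <- (value + 1) * 9
print(link_distance)        # 9 54 189 909

# Charge under the modified rule: -(number of nodes)^2.
n_nodes <- 50               # hypothetical node count
charge <- -1 * n_nodes^2
print(charge)               # -2500
```

Countries with identical access levels end up 9 units apart rather than on top of each other, and the repulsion grows with the square of the number of nodes.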

An example of the graph produced using this method appears here: Europe and Central Asia. Since I chose not to show the connecting lines in the final graph, I set their opacity value to 0. I also replaced the d3ForceNetwork default nodes information, which can only contain the node name and group level for now, with a JSON-formatted list containing a third variable, the percentage of freshwater access, and added this field to the node's text element.
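One way to produce such a list is to build the JSON text directly in base R. This is a minimal sketch and not the code used for the published graph: the field names name and group are assumed to mirror what d3ForceNetwork emits, access is a hypothetical name for the third variable, and the rows are invented.

```r
# Invented node table with the extra Access_Rural column.
nodes_toy <- data.frame(
  Country      = c("A", "B"),
  ecogrp       = c(4L, 2L),
  Access_Rural = c(100, 87),
  stringsAsFactors = FALSE
)

# One JSON object per node; "access" is a hypothetical field name.
node_json <- sprintf('{"name":"%s","group":%d,"access":%s}',
                     nodes_toy$Country,
                     nodes_toy$ecogrp,
                     as.character(nodes_toy$Access_Rural))

# Join the objects into a JSON array.
json <- paste0("[", paste(node_json, collapse = ","), "]")
cat(json, "\n")
```

Country names that contain double quotes or backslashes would need escaping before being pasted into the string.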
