**CYBAEA Data and Analysis**, and kindly contributed to R-bloggers)

We are interested in Social Network Analysis using the statistical analysis and computing platform R. The documentation for R is voluminous but typically not very good, so this entry is part of a series where we document what we learn as we explore the tool and the packages.

In our previous post on SNA we gave up on using the `statnet`

package because it was not able to handle our data volumes. In this entry we have better success with the `igraph`

package.

The task we are considering is still how to load the network data into the R package’s internal representation. We will assume that the raw data for our analysis is in a transactional format that is typical at least in the Telecommunications and Finance industries. In the former the terminology is Call Detail Record (CDR) and an extract may look a little like the following:

src, dest, start, duration,type,...+447000000005,+447000000006,1238510028, 52,call,... +447000000006,+447000000009,1238510627, 154,call,... +447000000009,+447000000007,1238511103, 48,call,... +447000000006,+447000000005,1238511145, 49,call,... +447000000006,+447000000005,1238511678, 12,call,... +447000000001,+447000000006,1238511735, 147,call,... +447000000007,+447000000009,1238511806, 26,call,... +447000000000,+447000000008,1238511825, 19,call,... +447000000009,+447000000008,1238511900, 28,call,... ...

Here a record indicates that the customer identified as `src` called (`type`=call) the customer `dest` at the given time `start` and the call lasted `duration` seconds. In general, there will be (many) more attributes describing the transaction which are represented by the `...`. In a Financial Services example, the records may be money transfers between accounts.

## Loading the data in the `igraph`

package

We are able to load the previous test data with 51 million records easily:

> library("igraph") > m <- matrix(scan(bzfile("cdr.51M.csv.bz2", open="r"), + what=integer(0), skip=1, sep=','), + ncol=4, byrow=TRUE) Read 205266564 items > ### Vertices are numbered from zero in the igraph library > m[,1] <- m[,1]-1; m[,2] <- m[,2]-1 > g <- graph.edgelist(m[,c(2,1)]) > E(g)$start <- as.POSIXct(m[,3], origin="1970-01-01", tz="UTC") > E(g)$duration <- m[,4] > ns <- neighborhood.size(g, 1)

Time to up the ante! We have a file with simulated call data records containing over 700 million entries where we suspect the algorithm used is under-estimating nodes with small connections. Let’s check on the first ½ billion records (which seems to more-or-less fit in our available memory on this workstation):

> library("igraph") ### Note that R can only handle 2^31-1 elements in a vector (on any ### platform, including 64-bit), so we need to read this file as a ### list. > s <- scan("cdr.1e6x1e1.csv", what=rep(list(integer(0)),4), skip=1, sep=',', multi.line=FALSE) Read 700466826 records > m <- as.vector(rbind(s[[2]], s[[1]])) > print(length(m)) [1] 1400933652 > length(m) <- 1e9 > g <- graph(m, directed=TRUE) > ns <- neighborhood.size(g, 1) > summary(ns) Min. 1st Qu. Median Mean 3rd Qu. Max. 1.00 35.00 40.00 42.92 47.00 101.00 > hist(ns, xlab="Neighborhood size", main="Distribution of neighborhood size", sub="From cdr.1e6x1e1.1e9")

As we suspected, the Monte Carlo algorithm does not provide enough customers with low calling circle sizes. Fortunately it is very easy to add these separately: the hard part is modelling the larger calling circles. A mix of these two algorithms provide a reasonably good fit to actual customer behaviour. (The cut-off at 100 is a parameter to our Monte Carlo simulation program which indeed was 100 for this run.)

## Problems

However, it is not all perfect. When we attempt to add the edge parameters in the obvious way it fails:

> length(s[[3]]) <- 0.5e9 > length(s[[4]]) <- 0.5e9 > E(g)$start <- s[[3]] Error: cannot allocate vector of size 3.7 Gb Execution halted > E(g)$duration <- s[[4]]

So we are just at the limit. Probably 100 million records is OK in this environment. But the core igraph library is accessible from C so better performance can probably be achieved this way and certainly pointers are 8 byte structures on this machine so we should not have the silly limits that R imposes on us.

**leave a comment**for the author, please follow the link and comment on his blog:

**CYBAEA Data and Analysis**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...