We are interested in Social Network Analysis using the statistical analysis and computing platform R. As usual with R, the documentation is pretty bad, so this series collects our notes as we learn more about the available packages and how they work. We use here the statnet group of packages, which seems to be the most comprehensive and most actively maintained set of network analysis packages.
The first task, which we consider in this post, is to load our data into a network object, which is how all the statnet packages represent a network. Typically for R, the documentation is voluminous but not always as helpful as one could want.
We will assume that the raw data for our analysis is in a transactional format that is typical at least in the Telecommunications and Finance industries. In the former the terminology is Call Detail Record (CDR) and an extract may look a little like the following:
src, dest, start, duration,type,...
+447000000005,+447000000006,1238510028, 52,call,...
+447000000006,+447000000009,1238510627, 154,call,...
+447000000009,+447000000007,1238511103, 48,call,...
+447000000006,+447000000005,1238511145, 49,call,...
+447000000006,+447000000005,1238511678, 12,call,...
+447000000001,+447000000006,1238511735, 147,call,...
+447000000007,+447000000009,1238511806, 26,call,...
+447000000000,+447000000008,1238511825, 19,call,...
+447000000009,+447000000008,1238511900, 28,call,...
...
Here a record indicates that the customer identified as src called (type=call) the customer dest at the given time start and the call lasted duration seconds. In general, there will be (many) more attributes describing the transaction, which are represented here by the trailing "...". In a Financial Services example, the records may be money transfers between accounts.
Implementation in the network package
In the naive implementation of this data as a network, we would have the sources and destinations (broadly speaking: people) as vertices and the calls as edges. That broadly seems to make sense: people are connected by the calls they make, and that is the social relationship we wish to model.
In the terminology of the network class, that means that our network will be directed (calls and money transfers have a direction from one person to another) and will need to allow multiple edges between the same endpoints (because any one person can, and indeed usually will, make several calls to the same other person).
We could consider dropping the multiple attribute of the network and instead represent the fact that A has called B with a single edge and perhaps have the number of calls and their total duration as an edge attribute. We will investigate this another time, but it is surely a less faithful representation of the data that we have (and we would need to drop information like the time of call).
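As a rough illustration of that alternative, the sketch below collapses repeated calls into one row per ordered (src, dest) pair using only base R. The rows are taken from the samples in this post, and the column name calls is our own choice; note how the start times have nowhere to go once the calls are merged.

```r
# A few call records using the renumbered vertex ids from this post.
cdr <- data.frame(
  src      = c(5, 6, 9, 6, 6),
  dest     = c(6, 9, 7, 5, 5),
  start    = c(1238510028, 1238510627, 1238511103, 1238511145, 1238511678),
  duration = c(52, 154, 48, 49, 12)
)

# One row per call, so summing a column of ones counts the calls.
cdr$calls <- 1L

# Collapse to one row per ordered (src, dest) pair: number of calls and
# total duration. The individual start times are lost in the process.
simple_edges <- aggregate(cbind(calls, duration) ~ src + dest,
                          data = cdr, FUN = sum)
print(simple_edges)
```

The two calls from 6 to 5 end up as a single row with calls = 2, while the call from 5 to 6 stays separate because the pairs are ordered.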
Mapping customer identifiers to network vertex numbers
One thing they seem to forget to tell you in the documentation is that when you import your data, your vertex identifiers (which in our case are customer or account numbers) must be replaced by vertex numbers, and that this numbering must be sequential and start from 1. Being used to an environment where the vertex identifiers are arbitrary (and arrays usually start from 0), this one had me puzzled for a while. The error message that tells you your vertex numbering is not what the package expected is spectacularly unhelpful:
> n <- network(m, matrix.type="edgelist", directed=TRUE, multiple=TRUE)
Error in add.edges(g, as.list(x[, 1]), as.list(x[, 2]), edge.check = edge.check) :
  (edge check) Illegal vertex reference in addEdges_R. Exiting.
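One way to get a more readable diagnosis is to inspect the edge list yourself before handing it to the package. The helper below is hypothetical (check_vertex_ids is our own name, not part of network) and only tests the conditions described above: whole numbers, starting at 1, with no gaps.

```r
# Hypothetical helper, not part of the network package: list the reasons
# why the first two columns of an edge list break the numbering rules
# described in this post.
check_vertex_ids <- function(m) {
  ids <- c(m[, 1], m[, 2])
  problems <- character(0)
  if (anyNA(ids)) {
    problems <- c(problems, "missing vertex ids")
    ids <- ids[!is.na(ids)]
  }
  if (any(ids != round(ids))) problems <- c(problems, "non-integer vertex ids")
  if (any(ids < 1)) problems <- c(problems, "vertex ids below 1")
  # With whole numbers >= 1, the ids are exactly 1..n only if the number
  # of distinct ids equals the largest id.
  if (length(problems) == 0 && length(ids) > 0 &&
      length(unique(ids)) != max(ids)) {
    problems <- c(problems, "vertex ids are not the consecutive numbers 1..n")
  }
  problems
}

# Raw phone numbers used as ids: far from 1..n.
raw <- cbind(c(447000000005, 447000000006), c(447000000006, 447000000009))
print(check_vertex_ids(raw))

# Zero-based numbering, as many other environments would produce.
zero_based <- cbind(c(0, 1), c(1, 2))
print(check_vertex_ids(zero_based))
```

An empty result means the edge list passes these particular checks; it does not guarantee that the package will accept it.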
For the discussion that follows, we will assume that you have changed your identifiers externally to R.
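If you would rather do the renumbering inside R, a minimal sketch with base R could look like the following. It assumes the identifiers were read as character strings; the column names src_id and dest_id and the lookup vector customers are our own. The resulting numbers differ from the ones used in the rest of this post, which simply keep the last digit of each phone number.

```r
# Sample identifiers from the CDR extract, kept as strings.
cdr <- data.frame(
  src  = c("+447000000005", "+447000000006", "+447000000009", "+447000000006"),
  dest = c("+447000000006", "+447000000009", "+447000000007", "+447000000005"),
  stringsAsFactors = FALSE
)

# Every customer that appears as a source or a destination, once each.
customers <- sort(unique(c(cdr$src, cdr$dest)))

# The position in the lookup vector becomes the vertex number (1..n).
cdr$src_id  <- match(cdr$src,  customers)
cdr$dest_id <- match(cdr$dest, customers)
print(cdr)

# Going back from a vertex number to the customer is a plain index lookup.
print(customers[cdr$src_id[1]])
```

Keeping customers around is the important part: it is the only record of which vertex number belongs to which customer.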
Loading the data
The good news is that our data is essentially in a format that the network package calls edge list and which it can import directly.
I say “essentially” because for some strange reason the package expects the destination to come before the source, which seems ass-backwards to me. But assume we have our data in a file cdr.csv like this (we only have calls here):
src, dest, start, duration
5, 6,1238510028, 52
6, 9,1238510627, 154
9, 7,1238511103, 48
6, 5,1238511145, 49
...
Then we can load the data into R easily:
> library("network")
> m <- matrix(scan(file="cdr.csv", what=integer(0), skip=1, sep=','), ncol=4, byrow=TRUE)
Read 1896 items
> # Swap columns for ass-backward network package
> m[,c(1,2)] <- m[,c(2,1)]
> # Create network
> net <- network(m, matrix.type="edgelist", directed=TRUE, multiple=TRUE)
> summary(net)
Network attributes:
  vertices = 10
  directed = TRUE
  hyper = FALSE
  loops = FALSE
  multiple = TRUE
  bipartite = FALSE
  total edges = 474
    missing edges = 0
    non-missing edges = 474
  density = 5.266667

Vertex attributes:
  vertex.names:
    character valued attribute
    10 valid vertex names

No edge attributes

Network adjacency matrix:
Error in as.matrix.network.adjacency(x = x, attrname = attrname, ...) :
  Multigraphs not currently supported in as.matrix.network.adjacency. Exiting.
In addition: Warning message:
In network.density(x) :
  Network is multiplex - no general way to define density. Returning value for a non-multiplex network (hope that's what you wanted).
OK, that’s a lot of warnings, but it basically worked. We have figured out how to load our network data into the network package in R.
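The density warning is easier to understand with the arithmetic spelled out. The reported figure matches the usual formula for a directed network without loops, edges divided by the number of ordered vertex pairs, applied to all 474 calls. This is our reading of the numbers in the summary, not something we have checked in the package source.

```r
# Figures taken from the summary output in this post.
n_vertices <- 10
n_edges    <- 474

# Ordered pairs of distinct vertices: the most edges a directed network
# without loops and without multiple edges could have.
possible_pairs <- n_vertices * (n_vertices - 1)

# Every call counts as an edge, so the ratio can exceed 1.
density <- n_edges / possible_pairs
print(density)
```

A density above 1 is what you get when a multigraph is pushed through a formula meant for simple graphs; counting distinct (src, dest) pairs instead of calls would bring it back into the range 0 to 1.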
We can’t do an exhaustive performance review now, but let us at least make sure we can load medium-sized networks. We change our CDR simulator to emit the destination before the source, just like network likes it, and let it run.
The first file has 2,645,288 (simulated) CDR lines from 100k customers and it loads OK on our small development workstation even with the default settings:
> library("network")
> n <- network(matrix(scan(file="cdr.1e5x1e0.csv", what=integer(0), skip=1, sep=','), ncol=4, byrow=TRUE), matrix.type="edgelist", directed=TRUE, multiple=TRUE)
Read 10581152 items
> proc.time()
   user  system elapsed
138.304   1.597 140.878
> save(n, file="n.RData", ascii=FALSE, compress=FALSE)
The size of the saved network object is 373MB (only 27MB compressed).
We can potentially save some time and memory by explicitly not performing the edge check (again: the documentation frustrates us and is silent on what the defaults are for the network call we used above). We try this for our next file, with 51,316,641 lines of CDR data (again for 100k customers), which also saves us some column swapping:
> library("network")
> m <- matrix(scan(file="cdr.51M.csv", what=integer(0), skip=1, sep=','), ncol=4, byrow=TRUE)
Read 205266564 items
> num_vert <- max(m[,1], m[,2])
> num_vert
[1] 100000
> n <- network.initialize(n=num_vert, directed=TRUE, multiple=TRUE)
> add.edges(n, tail=m[,2], head=m[,1], edge.check=FALSE)
> proc.time()
(several hours: I’ll let you know when it is done)
> rm(m)
> save(n, file="n.RData", ascii=FALSE, compress=TRUE)
Our attempted optimization did not seem to matter, and this network is too big for the machine and the network package. Building the network was painful as I was working on the workstation at the same time. The machine has 16GB of installed RAM, but it was clearly running out and swapping extensively.
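A back-of-the-envelope estimate suggests why. The sketch below scales the uncompressed size of the first saved network linearly to the larger file. It assumes that 373MB means 373 * 10^6 bytes, that the saved size is a fair proxy for what the object needs in memory, and that the cost per edge stays constant; none of that has been verified, so treat the result as an order of magnitude only.

```r
# Figures from this post.
edges_small <- 2645288    # CDR lines in the first file
saved_small <- 373e6      # uncompressed save size; assumes MB = 10^6 bytes
edges_large <- 51316641   # CDR lines in the second file

# Approximate cost per edge in the saved network object.
bytes_per_edge <- saved_small / edges_small

# Linear extrapolation to the larger file (an assumption, see above).
estimate_large <- bytes_per_edge * edges_large

# The raw integer matrix m alone: 4 columns of 4-byte integers per line.
matrix_bytes <- 4 * edges_large * 4

cat(sprintf("per edge: about %.0f bytes\n", bytes_per_edge))
cat(sprintf("network object: about %.1f GB\n", estimate_large / 1e9))
cat(sprintf("matrix m: about %.2f GB\n", matrix_bytes / 1e9))
```

On these assumptions the network object alone would take up roughly 7 GB, on top of the matrix m, which stays in memory until the rm(m) call. Any temporary copies R makes while adding the edges come on top of that, which would fit with the swapping we saw on a 16GB machine.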
51 million might be a reasonable size data set for some Financial Services applications but it is clearly a trivial number of records for Telecommunications. I’ll need to do some more digging around.
Does anybody have any SNA benchmarks? I like the KXEN implementation for its simplicity and speed so I might get a copy and try it out. Any R performance experts who could make suggestions in the comments? How big are your networks?