igraph and structured text exploration

June 29, 2012

(This article was first published on TRinker's R Blog » R, and kindly contributed to R-bloggers)

I am slowly developing a package, qdap, to bridge structured text formats (e.g., classroom transcripts) with the many great R packages that visualize and analyze quantitative data. If you care to play with a rough build of this package, see: https://github.com/trinker/qdap. One of the packages qdap will bridge to is igraph.

A while back I came across a blog post on igraph and word statistics (LINK). It inspired me to learn a little about graphing and the igraph package, and it served as a nice introduction. As I play with this terrific package, I feel it is my duty to share my experiences with others who are also just starting out with igraph. The following post is a script, together with the plots it creates, built from a word frequency matrix (similar to a term document matrix from the tm package) and igraph:

Build a word frequency matrix and convert to an adjacency matrix

set.seed(10)
X <- matrix(rpois(100, 1), 10, 10)
colnames(X) <- paste0("Guy_", 1:10)
rownames(X) <- c('The', 'quick', 'brown', 'fox', 'jumps',
    'over', 'a', 'bot', 'named', 'Dason')
X #word frequency matrix
Y <- X >= 1
Y <- apply(Y, 2, as, "numeric") #boolean matrix
rownames(Y) <- rownames(X)
Z <- t(Y) %*% Y  #adjacency matrix
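
To make the adjacency matrix less of a black box, here is a minimal sketch on a tiny hand-made matrix (X_toy and its values are hypothetical, not the random X above). It suggests how to read Z: the diagonal holds the number of distinct words each person used, and each off-diagonal cell holds the number of words two people share.

```r
# Toy word-by-speaker counts (hypothetical data for illustration only)
X_toy <- matrix(c(2, 0, 1,
                  0, 3, 1,
                  1, 1, 0),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("fox", "bot", "jumps"),
                                c("Guy_1", "Guy_2", "Guy_3")))

# 1 if the speaker used the word at all, 0 otherwise.
# Multiplying a logical matrix by 1 keeps the row and column names.
Y_toy <- (X_toy >= 1) * 1

# crossprod(Y_toy) is the same product as t(Y_toy) %*% Y_toy
Z_toy <- crossprod(Y_toy)

diag(Z_toy)               # distinct words per speaker, equal to colSums(Y_toy)
Z_toy["Guy_1", "Guy_2"]   # words used by both Guy_1 and Guy_2 (only "jumps")
```

Because the diagonal counts each person's overlap with himself, it shows up as a loop in the graph, which is why the loops get removed in the next step.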

Build a graph from the above matrix

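library(igraph)
# build an undirected, weighted graph from the adjacency matrix
# (igraph >= 1.0 also offers this as graph_from_adjacency_matrix())
g <- graph.adjacency(Z, weighted=TRUE, mode='undirected')
# remove loops (the diagonal of Z links each person to himself)
g <- simplify(g)
# set labels and degrees of vertices
V(g)$label <- V(g)$name
V(g)$degree <- degree(g)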

#Plot a Graph
set.seed(3952)
layout1 <- layout.auto(g)
#for more on layout see:
browseURL("http://finzi.psych.upenn.edu/R/library/igraph/html/layout.html")
opar <- par()$mar; par(mar=rep(0, 4)) #Give the graph lots of room
plot(g, layout=layout1)


Alter widths of edges based on similarity of people’s dialogue

#adjust the widths of the edges and add similarity measure labels
#use 1 - binary distance (?dist): the proportion of words two people
#share out of all the words either of them used
#1 is perfect overlap and 0 is no overlap
#note: this assumes E(g) lists pairs in the same order as dist()

edge.weight <- 7  #a maximizing thickness constant
z1 <- edge.weight*(1-dist(t(X), method="binary"))
E(g)$width <- c(z1)[c(z1) != 0] #remove 0s: these won't have an edge
z2 <- round(1-dist(t(X), method="binary"), 2)
E(g)$label <- c(z2)[c(z2) != 0]
plot(g, layout=layout1) #check it out!
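
The code above relies on the edges in E(g) coming out in the same pair order as the values in dist(). I have not verified that this always holds, so here is a hedged sketch of an alternative that looks up each edge's similarity by the names of its two endpoints. The toy matrix is hypothetical, and graph_from_adjacency_matrix() and ends() are the igraph (version 1.0 and later) functions I am assuming are available.

```r
library(igraph)

# Hypothetical toy counts: Guy_4 shares no words with anyone
X_toy <- matrix(c(2, 0, 1, 0,
                  0, 3, 1, 0,
                  1, 1, 2, 0,
                  0, 0, 0, 4),
                nrow = 4, byrow = TRUE,
                dimnames = list(c("fox", "bot", "jumps", "over"),
                                paste0("Guy_", 1:4)))

Y_toy <- (X_toy >= 1) * 1
Z_toy <- crossprod(Y_toy)

# diag = FALSE skips the loops, so simplify() is not needed here
g_toy <- graph_from_adjacency_matrix(Z_toy, mode = "undirected",
                                     weighted = TRUE, diag = FALSE)

# Full similarity matrix: shared words / words used by either person
sim <- 1 - as.matrix(dist(t(X_toy), method = "binary"))

# One row per edge, holding the names of the two people it connects
ep <- ends(g_toy, E(g_toy))

# Index the similarity matrix by (name, name) pairs, one per edge
edge.weight <- 7
E(g_toy)$width <- edge.weight * sim[ep]
E(g_toy)$label <- round(sim[ep], 2)
```

Guy_1 and Guy_2 share one word ("jumps") out of three used between them, so their edge gets a similarity of 1/3. Guy_4 overlaps with nobody and ends up as an isolated vertex with no edges to label.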


Scale the label cex based on distinct word counts

SUMS <- diag(Z) #number of distinct words per person (same as colSums(Y))
label.size <- .5 #a baseline (minimum) label size constant
V(g)$label.cex <- (log(SUMS)/max(log(SUMS))) + label.size
plot(g, layout=layout1) #check it out!
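
The log scaling above maps the counts onto label sizes between the baseline and the baseline plus one. As a small sketch of the same arithmetic, here it is wrapped in a helper (scale_label_cex is a hypothetical name of mine, not an igraph function). It also guards the one case I can see going wrong: if every count is 1, max(log(counts)) is 0 and the division gives NaN.

```r
# Hypothetical helper: map positive counts to sizes in [base, base + 1]
scale_label_cex <- function(counts, base = 0.5) {
  stopifnot(all(counts > 0))      # log() of 0 would give -Inf
  logs <- log(counts)
  if (max(logs) == 0) {
    # every count is 1: avoid 0/0 and give all labels the same size
    # (using the largest size here is an arbitrary choice)
    return(rep(base + 1, length(counts)))
  }
  logs / max(logs) + base
}

scale_label_cex(c(1, 4, 16))   # sizes 0.5, 1.0 and 1.5
scale_label_cex(c(1, 1, 1))    # all 1.5 instead of NaN
```

The result can be assigned to V(g)$label.cex just like the one-liner above.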
 


Add vertex coloring based on a factor

#add factor information via vertex color
#(gender is simulated here just to have a grouping variable)
set.seed(15)
V(g)$gender <- rbinom(10, 1, .4)
V(g)$color <- ifelse(V(g)$gender==0, "pink", "lightblue")

plot(g, layout=layout1) #check it out!
plot(g, layout=layout1, edge.curved = TRUE) #curve it up

par(mar=opar) #reset margins 
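
The ifelse() approach works for two groups. For a factor with more levels, one option is a named color vector used as a lookup table. This is only a sketch: the roles and colors below are made up for illustration.

```r
# Hypothetical grouping variable with three levels
group <- factor(c("teacher", "student", "student", "aide", "teacher"))

# Named vector: level name -> color
pal <- c(aide = "khaki", student = "lightblue", teacher = "pink")

# Look up each vertex's color by its level name
vertex_col <- unname(pal[as.character(group)])
vertex_col

# With a graph g whose vertices are in the same order as group:
# V(g)$color <- vertex_col
# plot(g, layout=layout1)
# legend("topleft", legend = names(pal), fill = pal)
```

Indexing by as.character(group) rather than by the factor's integer codes keeps the mapping correct even if the names in pal are listed in a different order than the factor levels.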



Try it interactively with tkplot

#interactive version
tkplot(g)
tkplot(g, edge.curved=TRUE) #with curved edges

This is just scratching the surface of igraph’s capabilities; the igraph documentation covers much more.

This post was me toying with different ideas and concepts. If you see a way to improve the code/thinking please leave a comment.

For a .txt version of this demonstration click here

