Site icon R-bloggers

igraph and structured text exploration

[This article was first published on TRinker's R Blog » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I am in the slow process of developing a package to bridge structured text formats (i.e. classroom transcripts)  with the tons of great R packages that visualize and analyze quantitative data (If you care to play with a rough build of this package (qdap) see: https://github.com/trinker/qdap). One of the packages qdap will bridge to is igraph.

A while back I came across a blog post on igraph and word statistics (LINK).  It inspired me to learn a little bit about graphing and the igraph package and provided a nice intro to learn.  As I play with this terrific package I feel it is my duty to share my experiences with others who are just starting out with igraph as well.   The following post is a script and the plots created with a word frequency matrix (similar to a term document matrix from the tm package) and igraph:

Build a word frequency matrix and covert to an adjacency matrix

set.seed(10)
X <- matrix(rpois(100, 1), 10, 10)
colnames(X) <- paste0("Guy_", 1:10)
rownames(X) <- c('The', 'quick', 'brown', 'fox', 'jumps',
    'over', 'a', 'bot', 'named', 'Dason')
X #word frequency matrix
Y <- X >= 1
Y <- apply(Y, 2, as, "numeric") #boolean matrix
rownames(Y) <- rownames(X)
Z <- t(Y) %*% Y  #adjacency matrix

Build a graph from the above matrix

 g <- graph.adjacency(Z, weighted=TRUE, mode ='undirected')
# remove loops
library(igraph)
g <- simplify(g)
# set labels and degrees of vertices
V(g)$label <- V(g)$name
V(g)$degree <- degree(g)

#Plot a Graph
set.seed(3952)
layout1 <- layout.auto(g)
#for more on layout see:
browseURL("http://finzi.psych.upenn.edu/R/library/igraph/html/layout.html")
opar <- par()$mar; par(mar=rep(0, 4)) #Give the graph lots of room
plot(g, layout=layout1)


Alter widths of edges based on dissimilarity of people’s dialogue

 #adjust the widths of the edges and add distance measure labels
#use 1 - binary (?dist) a proportion distance of two vectors
#1 is perfect and 0 is no overlap (using 1 - binary)

edge.weight <- 7  #a maximizing thickness constant
z1 <- edge.weight*(1-dist(t(X), method="binary"))
E(g)$width <- c(z1)[c(z1) != 0] #remove 0s: these won't have an edge
z2 <- round(1-dist(t(X), method="binary"), 2)
E(g)$label <- c(z2)[c(z2) != 0]
plot(g, layout=layout1) #check it out! 


Scale the label cex based on word counts

 SUMS <- diag(Z) #frequency (same as colSums(X))
label.size <- .5 #a maximizing label size constant
V(g)$label.cex <- (log(SUMS)/max(log(SUMS))) + label.size
plot(g, layout=layout1) #check it out!
 


Add vertex coloring based on factoring

 #add factor information via vertex color
set.seed(15)
V(g)$gender <- rbinom(10, 1, .4)
V(g)$color <- ifelse(V(g)$gender==0, "pink", "lightblue")

plot(g, layout=layout1) #check it out!
plot(g, layout=layout1, edge.curved = TRUE) #curve it up

par(mar=opar) #reset margins 



Try it interactively with tkplot

#interactive version
tkplot(g)  #an interactive version of the graph
tkplot(g, edge.curved =TRUE) 

This is just scratching the surface of igraph’s capabilities. Click here for a link to more igraph documentation.

This post was me toying with different ideas and concepts. If you see a way to improve the code/thinking please leave a comment.

For a .txt version of this demonstration click here


To leave a comment for the author, please follow the link and comment on their blog: TRinker's R Blog » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.