# Experimenting With iGraph – and a Hint Towards Ways of Measuring Engagement?

January 27, 2012

(This article was first published on OUseful.Info, the blog... » Rstats, and kindly contributed to R-bloggers)

For fear of being left way behind as Martin Hawksey starts to get to grips with R (see, for example, how he’s using R to automate the annotation of Google Spreadsheets with calculations that don’t come readily or efficiently to hand in Google Spreadsheets itself), I thought I’d better try to get to grips with R’s igraph library…

So here’s a script that gives me some hints as to how to start migrating chunks of my clunky Python script into R, as well as some ideas about how to start reporting on the structure of hashtag communities in a graphical as well as stats analytical way.

require(igraph)

#load in a graph from a graphml file
#(the original file path did not survive; 'path/to/graph.graphml' is a placeholder)
g=read.graph('path/to/graph.graphml',format='graphml')
summary(g)
#Vertices are obtained via V(g). The summary() tells us what attributes are available.
#So for example, inspect the label attribute
V(g)$label
#in and out degree counts for each (labelled) node/vertex
df=data.frame(name=V(g)$label,indegree=degree(g,mode='in'),outdegree=degree(g,mode='out'))

#inspect the top 10 nodes sorted by indegree
#the plyr arrange function makes sorting dataframes a doddle...
require(plyr)
df2=arrange(df,desc(indegree))
head(df2,10)

#get ready to do some plots
require(ggplot2)

#It might be interesting to look at the in-degree and out-degree distributions
#out-degree, because we see how promiscuous folk are in their following behaviour
#h/t to @mhawksey for pointing out the mode argument to me.. doh!
ddout=degree.distribution(g,mode='out')
#degree.distribution() returns "a numeric vector of the same length as the maximum degree plus one. The first element is the relative frequency zero degree vertices, the second vertices with degree one, etc."
#We can use the vector vals as the y-value, but x is unspecified/implied by the position in the vector
#So we need to generate the x vals explicitly: element i corresponds to degree i-1
ggplot()+geom_point(aes(seq_along(ddout)-1,ddout))
#If we want to ignore the outdegree==0 value, we can skip the first item in the vector
ggplot()+geom_point(aes((seq_along(ddout)-1)[-1],ddout[-1]))

#in-degree
ddin=degree.distribution(g,mode='in')
#as before, element i of the vector corresponds to degree i-1
ggplot()+geom_point(aes(seq_along(ddin)-1,ddin))
ggplot()+geom_point(aes((seq_along(ddin)-1)[-1],ddin[-1]))

#We can also plot indegree and outdegree together
#Use colour to distinguish the points, and make the upper layer smaller in case we overplot
ggplot() + geom_point(aes((seq_along(ddin)-1)[-1],ddin[-1]),colour='red') + geom_point(aes((seq_along(ddout)-1)[-1],ddout[-1]),colour='blue',size=1)

Note that I really should have labelled the axes – x-axis is “in (or out) degree”, y-axis is “proportion of nodes with corresponding in (or out) degree”.
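To make the indexing explicit, here is a minimal base-R sketch of how a degree distribution vector lines up with the degree values it describes. It uses made-up degree counts rather than the #bbcqt data, and the names `deg`, `dd` and `dd_df` are purely illustrative.

```r
# Hypothetical degree counts for seven nodes (made-up values, not the #bbcqt graph)
deg <- c(0, 0, 1, 1, 1, 3, 3)

# Relative frequency of each degree from 0 up to the maximum degree.
# tabulate() counts the integers 1..nbins, so shift the degrees up by one.
dd <- tabulate(deg + 1, nbins = max(deg) + 1) / length(deg)

# Pair each proportion with the degree it describes: element i is degree i - 1
dd_df <- data.frame(degree = seq_along(dd) - 1, proportion = dd)
print(dd_df)
```

With explicit columns like these, something along the lines of `ggplot(dd_df, aes(degree, proportion)) + geom_point()` should also pick up sensible axis labels from the column names.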

Out-degree:

Out-degree (except out-degree==0):

In-degree:

In-degree (except in-degree==0):

One thing I notice about the in-degree is that there is a very high number of very low in-degree nodes, which tail off quickly, and then another head at in-degree 25 which then tails off. This is an artefact of the way the graph file was pre-processed – I generated a friends network of hashtag users, then filtered the network to only include nodes that had indegree of at least 25 and/or outdegree of at least 25. The nodes with in-degree between 1 and 25 are nodes corresponding to hashtaggers that are friended by other hashtaggers.

In- (red) and out- (blue) degree:

Reflecting on the in-degree graph, we have a way of identifying those folk who used the hashtag and are connected to other hashtaggers:

arrange(subset(df,subset=(outdegree>0 & indegree>0)),desc(indegree))

The dataset I’m using is based on folk using the #bbcqt hashtag. Here are the hashtaggers most linked to by other hashtaggers:

> head(arrange(subset(df,subset=(outdegree>0)),desc(indegree)))
             name indegree outdegree
1 bbcquestiontime      190       102
2       DIMBLEBOT       76        61
3   markinreading       34       121
4 politicalhackuk       27       236
5          10anta       25        73
6 Parlez_me_nTory       24        63

So now I’m wondering… does this hint at a way of measuring some sort of engagement with the Twitter account set up to promote the programme and, presumably, the hashtag???

If we consider @bbcquestiontime, the high indegree tells us that the account is being followed by a significant number of the hashtag users. (We could find out what proportion by dividing through by the number of folk with out-degree > 0, minus 1; minus 1 because @bbcquestiontime is itself one of those hashtaggers.) That @bbcquestiontime has outdegree > 0 tells us it was sampled as a user of the hashtag (the graph was originally generated with directed edges from folk who used the tag to their friends). The high (ish?!) out-degree tells us that this account is linking to a reasonable number of folk who are popularly followed by users of the #bbcqt hashtag, or who used the hashtag themselves; so @bbcquestiontime is listening to folk that the #bbcqt taggers listen to, which is probably a good thing.

I guess what we could do here is compare the outdegree of the @bbcquestiontime account with its total friend count (ie, with the total number of accounts it follows). Because if the account was following 1000 people or so, and only 10% of them were being followed by #bbcqt hashtaggers, we might wonder whether they’re interested in different things? Once again, we could also normalise the out-degree number with respect to one less than the number of accounts with indegree > 0 (again, we subtract one to account for the self reference) to get the proportion of folk being followed by hashtaggers that are also being followed by @bbcquestiontime. This gives us some idea of the extent to which @bbcquestiontime is listening to folk that the #bbcqt hashtaggers are listening to.

Let’s try that latter normalisation to get a feel for what the proportions are…

#Count the number of rows where folk have indegree, or outdegree, as required, > 0
df$outReach=df$outdegree/(nrow(subset(df,df$indegree>0))-1)
df$inReach=df$indegree/(nrow(subset(df,df$outdegree>0))-1)
#First let's see who reaches furthest out into the interest community
> head(arrange(df,desc(outReach)))
             name indegree outdegree  outReach     inReach
1        Damientg        5       341 0.4782609 0.013054830
2      danmknight        9       265 0.3716690 0.023498695
3         martysm        1       261 0.3660589 0.002610966
4       MrJacHart       18       257 0.3604488 0.046997389
5           VMcAV        5       237 0.3323983 0.013054830
6 politicalhackuk       27       236 0.3309958 0.070496084

#now let's see who is touched by most of the community
> head(arrange(df,desc(inReach)))
             name indegree outdegree   outReach    inReach
1 bbcquestiontime      190       102 0.14305750 0.49608355
2       DIMBLEBOT       76        61 0.08555400 0.19843342
3   markinreading       34       121 0.16970547 0.08877285
4 politicalhackuk       27       236 0.33099579 0.07049608
5          10anta       25        73 0.10238429 0.06527415
6 Parlez_me_nTory       24        63 0.08835905 0.06266319

So, from that, we see that @Damientg is following a large number of the folk popularly followed by users of the #bbcqt hashtag or who used the hashtag. I don’t think this is interesting. However, the fact that @bbcquestiontime is followed by about half the folk who used the #bbcqt tag (in the sample I grabbed) is maybe useful as a measure of how engaged the hashtaggers may be with the programme Twitter account?
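As a sanity check on the two normalisations, here is a small base-R sketch on a made-up four-account network (three hashtaggers plus one account that is only followed). The account names and counts are hypothetical; only the inReach and outReach formulas follow the ones used above.

```r
# Hypothetical toy network: three hashtaggers (outdegree > 0) and one
# account, "celeb", that is followed but did not use the tag
df <- data.frame(
  name      = c("progAccount", "taggerA", "taggerB", "celeb"),
  indegree  = c(2, 1, 1, 2),
  outdegree = c(2, 2, 2, 0),
  stringsAsFactors = FALSE
)

# Denominators: one less than the number of hashtaggers, and one less than
# the number of followed accounts (the "minus 1" removes the self reference)
n_taggers  <- sum(df$outdegree > 0) - 1   # 3 - 1 = 2
n_followed <- sum(df$indegree > 0) - 1    # 4 - 1 = 3

df$outReach <- df$outdegree / n_followed
df$inReach  <- df$indegree / n_taggers

# Sort by inReach, largest first
print(df[order(-df$inReach), ])
```

In this toy case progAccount is followed by both of the other hashtaggers, so its inReach is 1. For comparison, in the #bbcqt tables the @bbcquestiontime figures (indegree 190, inReach 0.49608355) imply a denominator of about 383 hashtaggers.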

The latter report also brings to mind another question – how many of the hashtaggers does any particular account follow – that is, how connected is any particular account to folk who used the hashtag (which is the set of folk with outdegree>0)? This is important I think – distinguishing between hashtaggers who link to each other as part of a conversation, and other accounts they follow en masse but who aren’t engaging in conversation via the hashtag?
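One way this might be approached, sketched here in base R on a made-up edge list rather than the #bbcqt graph, is to count only those edges that point at accounts which themselves have outdegree > 0. The `edges` data frame and the account names are hypothetical.

```r
# Hypothetical edge list: "from" follows "to" (made-up accounts)
edges <- data.frame(
  from = c("progAccount", "progAccount", "taggerA", "taggerA", "taggerB", "taggerB"),
  to   = c("taggerA", "celeb", "progAccount", "taggerB", "progAccount", "celeb"),
  stringsAsFactors = FALSE
)

# Hashtaggers are the accounts with outdegree > 0, i.e. those appearing in "from"
taggers <- unique(edges$from)

# Keep only the edges that point at another hashtagger
to_taggers <- edges[edges$to %in% taggers, ]

# For each hashtagger, count how many hashtaggers they follow
# (the factor levels keep zero counts for anyone following no hashtaggers)
tagger_follow_count <- table(factor(to_taggers$from, levels = taggers))
print(tagger_follow_count)
```

Dividing these counts by one less than the number of hashtaggers would give a proportion comparable to the reach measures above, although whether that actually separates conversation from en-masse following is exactly the open question.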

Hmmm…something to ponder over the weekend I think;-)
