(This article was first published on **Peter's stats stuff - R**, and kindly contributed to R-bloggers)

Branko Milanovic asked on Twitter:

Idea for a paper: “Homogamy” on Twitter. Do people with more followers follow people with more followers?

I don’t have time to write a paper but I was sufficiently interested to want to blog about it. The consensus in the replies was “of course they do”, with the claim that Twitter is well known for being assortative. Oh, here’s some terminology:

- *homogamy*: in-breeding, or (in sociology) marriage between individuals who are, in some culturally important way, similar to each other
- *assortative*: a preference for a network’s nodes to attach to others that are similar in some way.

The idea of assortativity in a network is basically that popular people (or whatever a node is) hang out with each other and form lots of connections.

First, let’s cut to the chase. Turns out that in fact, looking at the trimmed mean number of followers of the people that Twitter users follow, the relationship with the number of followers of the original person is not straightforward:

In fact, you could say that generally, the more followers you have, the *fewer* followers the people you follow have, at least up to a threshold of around 100,000 followers, which makes you one of the giants of Twitter (for a reference point that might mean something to readers of this blog, Hadley Wickham has around 58,000 followers, so more than 100,000 is really a lot). This finding holds with two different methods of sampling people from Twitter, and is confirmed by more sophisticated modelling that looks simultaneously at the number of people one is following and the number of followers they have.

One plausible explanation is that “number of followers” is a proxy for length of engagement on Twitter. When you start out on Twitter, you are presented with a bunch of suggested popular accounts to follow (eg sports teams, entertainment celebrities, famous media outlets), typically with very high numbers of followers themselves. So on day one, the average number of followers of the people *you* follow is very high. Over time, the regular user of Twitter follows people (eg friends) with fewer followers, and the average number of followers of the people they follow declines. Only for the super-famous is everyone they know or care about also famous, so the homogamy/assortativity thesis kicks in, but *iff* they’re a bit selective about who they follow and don’t just follow back all their non-entity fans.

The issue (I think) that stops Twitter being highly assortative is the one-way nature of following someone. Katy Perry’s 109 million followers don’t need to ask her permission to follow her; this makes the forming of edges between nodes in the Twitter network fundamentally different to marriage.

So what about the references mentioned in the Twitter exchange after Milanovic’s first tweet on this? What I see in the literature (admittedly after a very cursory glance – I’m pressed for time as I write this) is a bit different. There’s a couple of pieces suggesting *happiness* is assortative – people who write happy tweets associate with others who do the same, and the same applies to people who write unhappy tweets. Also, the action of replying and retweeting is assortative. But contrary to what some said in the Twitter exchange, I don’t see articles showing that the follow/followed network is assortative.

Let’s go into how I did this.

## A sampling problem

The first challenge is to get a sample of Twitter users. This is harder than it might seem at first, if the aim is (as it should be) to be representative of the population at large. The first difficulty is defining that population. Do we mean every Twitter account, every human Twitter account, or every account that is used for actively tweeting (and why would you restrict it to this, when even reading tweets should surely count)?

There’s no conveniently published population of Twitter users. I’m aware of three broad ways one might go about getting a sample of users:

- You could do the equivalent of “random digit dialling”, making up numeric Twitter identification numbers and checking them in the Twitter API for existence. This method is in fact what you find if you google “how do I get a sample of Twitter users” but Twitter have made it effectively impossible by the (hidden) way they assign IDs. I observed ID numbers up to 10^17 and as low as 10^6, and sampling random numbers between those extremes hoping to hit one of the 300 million or so actual users sounds like a recipe for getting nowhere.
- You can pick a node and use a snowball sampling method; that is, follow an edge (either a person the first node follows, or a node who follows them) to another node, record what you need to about that person, then follow another edge to a third node, and so on until you have enough people. This is what I did with the sample labelled “snowball sampling along network”.
- You can sample a bunch of actual tweets, and treat their authors as your sample. This is what I did with the sample labelled “sample of today’s tweeters”.
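To put a rough number on why the first method gets nowhere, here is a back-of-the-envelope calculation in R. It assumes, purely for illustration, that IDs are spread uniformly over the observed range, which Twitter’s hidden ID scheme almost certainly does not do.

```r
# Rough hit rate for "random digit dialling" of Twitter IDs.
# Assumption (illustration only): real IDs are spread uniformly between
# the lowest and highest ID values observed.
n_users <- 3e8     # roughly 300 million accounts
id_low  <- 1e6     # lowest ID observed
id_high <- 1e17    # highest ID observed

p_hit <- n_users / (id_high - id_low)   # chance that a random ID is a real account
guesses_per_hit <- 1 / p_hit            # expected number of guesses per account found

p_hit
guesses_per_hit
```

Under that assumption you would need hundreds of millions of API calls for every real account found, which is hopeless against a rate-limited API.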

Method 1 is I think infeasible. Method 2 will oversample users with lots of followers and who follow lots of people – basically, the more networked you are, the more likely you are to be sampled. On the plus side for Method 2, quiet users who lurk but aren’t tweeting these days will have a chance of selection. Method 3 on the other hand will give a very particular slice of users; but it has the advantage of less obvious dependence between the nodes we pick (all they have in common is tweeting in the last couple of minutes of when we harvest them). I was unsure enough about sampling strategy to try them both.
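The oversampling in Method 2 is easy to see in a toy simulation. The sketch below is not the harvesting code; it builds a made-up undirected network (the real follow network is directed) in which some nodes are much better connected than others, walks along it at random, and compares the average number of connections of the visited nodes with the population average. All names and parameters are invented for the illustration.

```r
set.seed(123)

# Toy network: 2,000 nodes, 10,000 edges, with endpoints drawn in
# proportion to a skewed "popularity" weight.
n <- 2000
w <- rlnorm(n, sdlog = 1)
from <- sample(n, 10000, replace = TRUE, prob = w)
to   <- sample(n, 10000, replace = TRUE, prob = w)
keep <- from != to                      # drop self-loops
from <- from[keep]
to   <- to[keep]

# Neighbour list, treating every edge as two-way
nb  <- split(c(to, from), c(from, to))
deg <- lengths(nb)                      # number of connections per node

# Snowball-style walk: start somewhere, hop to a random neighbour 1,000 times
current <- names(deg)[which.max(deg)]
visited <- character(1000)
for (i in seq_along(visited)) {
  nbrs <- nb[[current]]
  current <- as.character(nbrs[sample.int(length(nbrs), 1)])
  visited[i] <- current
}

mean(deg)            # average connections across all connected nodes
mean(deg[visited])   # average connections of the nodes the walk sampled
```

The walk lands on well-connected nodes far more often than their share of the population would suggest, which is the bias described above.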

Here’s code to do the sampling, using Jeff Gentry’s `twitteR` R package. Note that the code that follows isn’t very robust, and took days to run because of Twitter’s rules on maximum downloads via the public API. There’s lots that can go wrong with grabbing data from the Twitter API, hence the extensive and undisciplined use of `try()` in the code below in an effort to keep the data harvesting going in the face of various quirks.

First, setup. This requires four different pass phrases which are associated with your Twitter account (for obvious reasons, blanked out in the below, which stops this script being fully reproducible as-is).

As I’m interested in the “average” number of followers of the people a sampled user follows, for each user I sample I’m going to need to find out everyone they follow and estimate how many followers *they* have. This is the thing that takes time. It also exposes a problem; Twitter won’t let you look at more than a certain number of users at once. That number turns out to be 75,000, as I find out from this experiment with Todd Carey:

The problem here is that as Carey follows so many people (1.28 million) it’s infeasible to get the details of them all. In fact, it’s infeasible even just to get all their 1.28 million screen names and then sample from those (I’m very happy to estimate “average number of followers of people X follows” from a sample). I’ll have to deal with getting 75,000 users he follows, then sampling from those 75,000. The big problem here is that I think those are the most recent 75,000 people he’s followed. All else being equal, these are likely to be newer Twitter users than the overall population of people he follows, and hence likely to bias downwards my estimate of the average number of followers *they* have (as newer users will almost certainly have fewer followers on average).

[As an aside, one might wonder what is the point of following 1.28 million people on Twitter; it’s presumably part of a strategy, automated or not, of attracting followers by implicitly agreeing to follow them back. It’s not fully automatic – I established this by following him to see what happened, and nothing did.]

I don’t see what can be done about this, other than note the people who follow more than 75,000 people as potentially suspect in subsequent analysis. There’s not that many of them in my sample anyway.

Anyway, here’s the code that does the snowball sampling. I start with myself. I take three types of average number of followers of the people X follows: mean (which is highly vulnerable to an arbitrarily large number getting in the sample), 20% trimmed mean and 50% trimmed mean (ie median); but I’m satisfied that 20% trimmed mean is robust and a good measure.
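To see why the choice of average matters, here is a small illustration with invented follower counts for the accounts one user follows, including a single celebrity. It assumes “20% trimmed” means R’s convention for the `trim` argument of `mean()`, which drops that fraction from *each* end of the sorted values.

```r
# Hypothetical follower counts for ten accounts that one user follows,
# one of which is a celebrity with a huge following.
followers <- c(12, 40, 85, 150, 300, 520, 900, 2500, 8000, 109000000)

mean(followers)               # dominated by the single celebrity
mean(followers, trim = 0.2)   # drops the two lowest and two highest values first
mean(followers, trim = 0.5)   # trimming 50% from each end gives the median
median(followers)
```

The plain mean is over ten million, while the 20% trimmed mean (742.5) and the median (410) describe the typical account this user follows.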

This gets me a sample that looks like the below. Note that this method is prone to sampling individuals more than once, particularly highly networked users (ie those with lots of followers and/or who follow lots of people). I’ll deal with that later by averaging the estimates for any user sampled more than once.
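Collapsing repeat visits to one row per user might look something like the sketch below. The data frame and its column names (`user`, `followers`, `tr_mean_20`) are made up for the illustration and are not the names used in the actual harvest.

```r
# Hypothetical snowball sample in which "alice" was picked up twice,
# with a slightly different estimate each time.
snowball <- data.frame(
  user       = c("alice", "bob", "alice", "carol"),
  followers  = c(1500, 200, 1500, 58000),
  tr_mean_20 = c(12000, 90000, 14000, 3000),
  stringsAsFactors = FALSE
)

# One row per user, averaging the estimates for anyone sampled more than once
deduped <- aggregate(cbind(followers, tr_mean_20) ~ user,
                     data = snowball, FUN = mean)
deduped
```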

The snapshot sampling method is a bit simpler. All I need is 1,000 random tweets, which I get by searching for tweets containing the letter “e” (as `twitteR` doesn’t facilitate a completely open search as far as I can see). This isn’t great (I think it eliminates people using some character sets) but is good enough for a blog.

## Results

The resulting numbers are all very skewed distributions. Visually, they look good when you take the logarithm of the original number plus 1 (needed to avoid people with 0 friends or 0 followers becoming `-Inf`). I can justify “+ 1” by saying that, in a way, everyone follows themselves and is followed by themselves.
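A quick illustration of the “+ 1” with made-up counts. Base 10 is used here only because it is easy to read; which base was used for the charts isn’t stated.

```r
# Hypothetical follower counts, including a user with none
followers <- c(0, 9, 99, 58000)

log10(followers)       # the zero becomes -Inf and would drop off any plot
log10(followers + 1)   # the zero maps to 0; 9 and 99 map to 1 and 2
```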

Here’s the distributions and relationships of the key variables, when transformed this way:

We see a strong positive relationship between number of your followers and number of people you are following – until we get to people with many followers (10,000 or more), when the relationship breaks down (visible only in the snowball sample, as the simpler “today’s tweeters” sampling method harvests few such people). Partly this comes from reciprocal follow-back arrangements, partly it is a general indicator of longevity.

We also see a strongly negative relationship between number of people one is following and the average number of their followers. This makes sense and fits in with the notion that if you want to be following people with lots of followers, you have to be quite selective in who you follow, and it’s best to follow big names in sport, entertainment and perhaps politics. Consider Trump supporter @TheRoyalPosts as an extreme example (not actually included in the final sample). With 41,700 followers of her own, she follows only 120 prominent political accounts, with an astonishing average number of followers themselves of 7 million.

At this point let’s have another look at the graphic I started the post with:

From these last couple of charts, I’m actually pretty happy with both of my sampling methods.

Here’s the code that combines the two samples and produces the graphics above.

## Modelling

Finally, I wanted to see if the apparent u-shaped relationship between number of followers and the average number of followers of people one follows was robust to a model that simultaneously modelled the strongly negative relationship between the number of people one follows and that same response variable. It turns out that this is the case. Here are the partial effects of the two variables in a generalized additive model without an interaction term:

And here is how the relationship looks when there is an interaction term. First, as a three-dimensional perspective plot:

… then, more usefully as a heatmap.

The interaction is significant so we’d keep it in even though it complicates the interpretation.
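For readers who want to try this kind of comparison, here is a minimal sketch using the `mgcv` package on simulated data. The variable names and the simulated relationship are invented for the illustration, and using a `ti()` term for the interaction is one reasonable choice rather than necessarily the specification behind the charts above.

```r
library(mgcv)
set.seed(42)

# Simulated stand-ins for the log-transformed variables
n <- 500
log_followers <- runif(n, 0, 6)
log_following <- runif(n, 0, 5)

# Made-up response: u-shaped in followers, declining in following,
# plus an interaction between the two and some noise
log_avg <- 5 + 0.3 * (log_followers - 3)^2 - 0.5 * log_following -
  0.2 * log_followers * log_following + rnorm(n, sd = 0.5)
d <- data.frame(log_followers, log_following, log_avg)

# Additive model: one smooth term per explanatory variable
mod_additive <- gam(log_avg ~ s(log_followers) + s(log_following), data = d)

# Same main effects plus a tensor product interaction term
mod_interact <- gam(log_avg ~ s(log_followers) + s(log_following) +
                      ti(log_followers, log_following), data = d)

summary(mod_interact)$s.table      # approximate test for each smooth, including ti()
AIC(mod_additive, mod_interact)    # lower AIC favours keeping the interaction
```

From here, `plot(mod_additive, pages = 1)` draws partial effects, and `vis.gam(mod_interact)` gives a perspective plot, or a contour view with `plot.type = "contour"`.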

I think the heatmap is the best representation of the data. It shows clearly that the average number of followers of the people one follows is:

- high for people with few followers
- high for people with many followers who *don’t* follow many themselves
- low for people with many followers who *do* follow lots of people themselves

Additionally, there are very few or no users who follow hundreds of thousands of accounts but have a low number of their own followers (hence the white space in the top left corner).

Here’s the code for the modelling:
