Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I first became interested in networks when reading Matthew O’Jackson’s 2010 paper describing their application to economics. During the 2014 ebola outbreak, there was a lot of concern over the disease spreading to the U.S.. I was caught up with work/classes at the time, but decided to use airline flight data to at least explore the question.

The source for the data can be found in a previous post on spatial data visualization.

I assumed that the disease had a single origin (Liberia) and wanted to explore the question of how the disease could travel to the U.S. through a network of airline flights.

A visualization of the network can be seen below. Each node is a country and each edge represents an existing airline route from one country to another. Flights that take off and land in the same country are omitted to avoid clutter.

Communities and Homophily

I used a spinglass algorithm to detect “communities” of countries, i.e. sets of countries with many flights between themselves, but few flights between themselves and countries not in the set. Roughly speaking, the algorithm tended to group countries in the same continent together. However, this is not always the case. For example, France was was placed in the same community as several African countries, due to the close relationships with its former colonies. Roughly speaking, this network seems to exhibit homophily – countries on the same continent tend to be connected more with each other than with countries off their continent.

Liberia, the US, and Degree Distribution

The labels are unclear in the plot, but Liberia and the U.S. are in two separate communities, which may lead us to believe that ebola spreading from the former to the latter would be unlikely. In fact, the degrees (number of countries a given country is connected to) of the countries differ greatly, which would also support this intuition. The US is connected to 186 other countries, whereas Liberia is connected to only 12. The full degree distribution is shown below. It roughly follows a power law, which, according to wikipedia, is what we should expect. Note that the approximation is asymptotic, which could be why this finite sample is off. According to the degree distribution, half of all countries are connected to 27 other countries. Liberia falls far below the median and the US falls far above the median.

Small Worlds

Let’s zoom in and look at Liberia’s second-degree connections:

Even though they’re in two different communities, Liberia and the U.S. are only two degrees of separation away. This is generally the case for all countries. If, for each node, we calculated the shortest path between it and every other node, the average shortest distance would be about 2 (specifically 2.3) hops. This is called the small world phenomenon. On average, every country is 2 hops away from every other country. Many networks exhibit this phenomenon largely due to “hubs” – countries (more generally, nodes) with lots of connections to other countries. For example, one can imagine that Charles de Gaulle airport in France is a hub that links countries in the US, Eastern Europe, Asia, and Africa. The existence of these hubs makes it possible to get from one country to another with very few transfers.

Contagion

The close-up network above shows that if ebola were to spread to the U.S., it might be through Nigeria, Ghana, Morocco, and Belgium. If we knew the proportion of flights that went from Liberia to each of these countries and from each of these countries to the US, we could estimate the probability of Ebola spreading along each route. I don’t have time to do this, although it’s certainly possible given the data.

Of course, this is a huge simplification for many reasons. Even though Sierra Leon, for example, doesn’t have a direct connection to the US, it could be connected to other countries that are connected to the US. This route could have a very high proportion of flights landing in the U.S..

There are also several epidemiological parameters that could change how quickly the disease spreads. For example, the length of time from infection to detectable symptoms is important. If the infected don’t show symptoms until a week after infection, then they cannot be screened and contained as easily. They could infect many others before showing symptoms.

The deadliness of the disease is also important. If patients die within hours of being infected, then the disease can’t spread as far. To take the extreme, consider a patient dies within a second of being infected. Then there’s very little time for him to infect others.

Finally, we assumed a single origin. If the disease is already present in several countries, then we would need to adjust the analysis.

routes=read.table('.../routes.dat',sep=',')

library(igraph)

# for each flight, get the country of the airport the plane took off from and landed at.
ports=ports[,c('V4','V5')]
names(ports)=c('country','airport')

routes=routes[,c('V3','V5')]
names(routes)=c('from','to')

m=merge(routes,ports,all.x=TRUE,by.x=c('from'),by.y=c('airport'))
names(m)[3]=c('from_c')
m=merge(m,ports,all.x=TRUE,by.x=c('to'),by.y=c('airport'))
names(m)[4]=c('to_c')

m$count=1 # create a unique country to country from/to route ID m$id=paste(m$from_c,m$to_c,sep=',')

# see which routes are flown most frequently
a=aggregate(m$count,list(m$id),sum)
names(a)=c('id','flights')
a$fr=substr(a$id,1,regexpr(',',a$id)-1) a$to=substr(a$id,regexpr(',',a$id)+1,100)
a=a[,2:4]

a$perc=(a$flights/sum(a$flights))*100 # create directed network graph a=a[!(a[,2]==a[,3]),] mat=as.matrix(a[,2:3]) g=graph.data.frame(mat, directed = T) edges=get.edgelist(g) deg=degree(g,directed=TRUE) vv=V(g) # use spinglass algo to detect community set.seed(9) sgc = spinglass.community(g) V(g)$membership=sgc$membership table(V(g)$membership)

V(g)[membership==1]$color = 'pink' V(g)[membership==2]$color = 'darkblue'
V(g)[membership==3]$color = 'darkred' V(g)[membership==4]$color = 'purple'
V(g)[membership==5]$color = 'darkgreen' plot(g, main='Airline Routes Connecting Countries', vertex.size=5, edge.arrow.size=.1, edge.arrow.width=.1, vertex.label=ifelse(V(g)$name %in% c('Liberia','United States'),V(g)$name,''), vertex.label.color='black') legend('bottomright',fill=c('darkgreen','darkblue', 'darkred', 'pink', 'purple'), c('Africa', 'Europe', 'Asia/Middle East', 'Kiribati, Marshall Islands, Nauru', 'Americas'), bty='n') # plot degree distribution dplot=degree.distribution(g,cumulative = TRUE) plot(dplot,type='l',xlab='Degree',ylab='Frequency',main='Degree Distribution of Airline Network',lty=1) lines((1:length(dplot))^(-.7),type='l',lty=2) legend('topright',lty=c(1,2),c('Degree Distribution','Power Law with x^(-.7)'),bty='n') # explore membership...regional patterns exist cc=cbind(V(g)$name,V(g)$membership) tt=cc[cc[,2]==5,] # explort connection from Liberia to United States m=mat[mat[,1]=='Liberia',] t=mat[mat[,1] %in% m[,2],] tt=t[t[,2]=='United States',] # assess probabilities lib=a[a$fr=='Liberia',]
lib$prob=lib$flights/sum(lib$flights) # most probable route from liberia is Ghana vec=c(tt[,1],'Liberia') names(vec)=NULL g2=graph.data.frame(mat[(mat[,1] %in% vec & mat[,2] == 'United States') | (mat[,1]=='Liberia'),], directed = T) V(g2)$color=c('darkblue','darkgreen','darkgreen','darkgreen','darkgreen','purple','darkgreen','darkgreen')

plot(g2,
main='Airline Connections from Liberia to the United States',
vertex.size=5,
edge.arrow.size=1,
edge.arrow.width=.5,
vertex.label.color='black')
legend('bottomright',fill=c('darkgreen','darkblue','purple'),
c('Africa', 'Europe', 'Americas'),
bty='n')

aa=a[a$fr %in% tt[,1],] sum=aggregate(aa$flights,list(aa$fr),sum) bb=a[a$fr %in% tt[,1] & a$to=='United States',] fin=data.frame(bb$fr,sum$x,bb$flights,bb$flights/sum$x)

s=shortest.paths(g)
mean(s)