Exploring Global Internet Performance Data Using R

[This article was first published on Revolutions, and kindly contributed to R-bloggers.]

by Lourdes O. Montenegro

Lourdes O. Montenegro is a PhD candidate at the Lee Kuan Yew School of Public Policy, National University of Singapore. Her research interests cover the intersection of applied data science, technology, economics and public policy.

Many of us now find it hard to live without a good quality internet connection. As a result, there is growing interest in characterizing and comparing internet performance metrics. For example, when planning to switch internet service providers or considering a move to a new city or country, internet users may want to research in advance what to expect in terms of download speed or latency. Cloud companies may want to provision adequately for different markets with varying levels of internet quality. And governments may want to benchmark their communications infrastructure and invest accordingly. Whatever the purpose, a consortium of research, industry and public interest organizations called Measurement Lab has made available the largest open and verifiable internet performance dataset on the planet. With the help of a combination of packages, R users can easily query, explore and visualize this large dataset at no cost.

In the example that follows, we use the bigrquery package to query and download results from the Network Diagnostic Tool (NDT) used by the U.S. FCC. The bigrquery package provides an interface to Google BigQuery, which hosts NDT results along with several other Measurement Lab (M-Lab) datasets. However, R users will first need to set up a BigQuery account and join the M-Lab mailing list to authenticate. Detailed instructions are provided on the M-Lab website. Once done, SQL-like queries can be run from within R. The results are saved as a dataframe on which further analysis can be performed. Aside from the convenience of working within the R environment, the bigrquery package has another advantage: the only limit on the size of query results that can be saved for further exploration is the amount of available RAM. In contrast, the BigQuery web interface only allows query results of 16,000 rows or fewer to be exported to .csv format.
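The basic workflow can be sketched as below. Here `run_mlab_query` is a hypothetical convenience wrapper (not part of bigrquery), and the first actual call will trigger the browser-based OAuth authentication mentioned above, so it is left commented out:

```r
# Hypothetical wrapper around bigrquery's query_exec(); assumes a configured
# BigQuery account that has joined the M-Lab mailing list
run_mlab_query <- function(sql, billing_project = "measurement-lab") {
  # query_exec() returns results as a data frame; max_pages = Inf fetches
  # every page, so the only size limit is available RAM
  bigrquery::query_exec(sql, project = billing_project, max_pages = Inf)
}

# A toy query against the legacy NDT table, for illustration only
sample_sql <- "SELECT COUNT(*) AS tests FROM [plx.google:m_lab.ndt.all];"
# result <- run_mlab_query(sample_sql)  # requires BigQuery authentication
```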

The following R script gives us the average download speed (in Mbps) per country in 2015.[1] The SQL-like query can be modified to return other internet performance metrics that may be of interest to the R user such as upload speed, round-trip time (latency) and packet re-transmission rates.

# Querying average download speed per country in 2015
library(bigrquery)
downquery_template <- "SELECT
connection_spec.client_geolocation.country_code AS country,  
AVG(8 * web100_log_entry.snap.HCThruOctetsAcked/ (web100_log_entry.snap.SndLimTimeRwin +
web100_log_entry.snap.SndLimTimeCwnd + web100_log_entry.snap.SndLimTimeSnd)) AS downloadThroughput,
COUNT(DISTINCT test_id) AS tests
FROM [plx.google:m_lab.ndt.all]
WHERE IS_EXPLICITLY_DEFINED(connection_spec.client_geolocation.country_code)
AND IS_EXPLICITLY_DEFINED(web100_log_entry.connection_spec.remote_ip)
AND IS_EXPLICITLY_DEFINED(web100_log_entry.connection_spec.local_ip)
AND IS_EXPLICITLY_DEFINED(web100_log_entry.snap.HCThruOctetsAcked)
AND IS_EXPLICITLY_DEFINED(web100_log_entry.snap.SndLimTimeRwin)
AND IS_EXPLICITLY_DEFINED(web100_log_entry.snap.SndLimTimeCwnd)
AND IS_EXPLICITLY_DEFINED(web100_log_entry.snap.SndLimTimeSnd)
AND project = 0
AND IS_EXPLICITLY_DEFINED(connection_spec.data_direction)
AND connection_spec.data_direction = 1
AND web100_log_entry.snap.HCThruOctetsAcked >= 8192
AND (web100_log_entry.snap.SndLimTimeRwin +
web100_log_entry.snap.SndLimTimeCwnd +
web100_log_entry.snap.SndLimTimeSnd) >= 9000000
AND (web100_log_entry.snap.SndLimTimeRwin +
web100_log_entry.snap.SndLimTimeCwnd +  
web100_log_entry.snap.SndLimTimeSnd) < 3600000000
AND IS_EXPLICITLY_DEFINED(web100_log_entry.snap.CongSignals)
AND web100_log_entry.snap.CongSignals > 0
AND (web100_log_entry.snap.State == 1 OR (web100_log_entry.snap.State >= 5 AND web100_log_entry.snap.State <= 11))
AND web100_log_entry.log_time >= PARSE_UTC_USEC('2015-01-01 00:00:00') / POW(10, 6) 
AND web100_log_entry.log_time < PARSE_UTC_USEC('2016-01-01 00:00:00') / POW(10, 6)
GROUP BY country
ORDER BY country ASC;"
downresult <- query_exec(downquery_template, project="measurement-lab", max_pages=Inf)

Once we have the query results in a dataframe, we can proceed to visualize and map average download speeds for each country. To do this, we can use the rworldmap package, which offers a relatively simple way to map country-level and gridded user datasets. Mapping is done mainly through two functions: (1) joinCountryData2Map joins the query results with shapefiles of country boundaries; (2) mapCountryData plots the choropleth map. Note that the join is best effected using either two- or three-letter ISO country codes, although the rworldmap package also allows join columns filled with country names.

In order to make the choropleth map prettier and more comprehensible, we can combine the classInt package, to calculate natural breaks in the range of download speed results, with the RColorBrewer package, for a wider selection of color schemes. In the succeeding R script, we specify the Jenks method to cluster download speed results so that deviation from the class mean is minimized within each class while deviation across class means is maximized. Compared to other methods for clustering download speed results, the Jenks method draws a sharper picture of countries clocking greater than 25 Mbps on average.

downloadmap <- joinCountryData2Map(downresult
                              , joinCode='ISO2'
                              , nameJoinColumn='country'
                              , verbose=TRUE)
# getting class intervals using a 'jenks' classification in the classInt package
classInt <- classInt::classIntervals(downloadmap[["downloadThroughput"]], n=5, style="jenks")
catMethod = classInt[["brks"]]
# getting a colour scheme from the RColorBrewer package
colourPalette <- RColorBrewer::brewer.pal(5, 'RdPu')
# plot the map, suppressing the default legend so a custom one can be added
mapParams <- mapCountryData(downloadmap, nameColumnToPlot="downloadThroughput",
                            mapTitle="Download Speed (Mbps)", catMethod=catMethod,
                            colourPalette=colourPalette, addLegend=FALSE)
do.call(addMapLegend, c(mapParams, legendWidth=0.5, legendLabels="all",
                        legendIntervals="data", legendMar=2))


Looking at the map, we see that the UK, Japan, Romania, Sweden, Taiwan, The Netherlands, Denmark and Singapore (if we squint!) are the best places to be for internet speed addicts. Until further investigation, we can safely discount the suspiciously high results for North Korea, since the number of observations is too low. In contrast, average download speeds in South Korea might be grossly underestimated when measured from foreign servers, as may be the case with NDT results, since most Koreans access locally hosted content. There are, of course, a number of caveats worth mentioning before drawing any conclusions about the causes of varying internet performance between countries. Confounding factors such as the distance from the client to the test server, the client's operating system, and the proportion of fixed broadband to wireless connections will need to be controlled for. Despite these caveats, this tentative exploration already reveals interesting patterns in global internet performance that are worth a closer look.
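The low-observation problem behind the North Korea result can be screened out before mapping by filtering on the `tests` column the query already returns. The data frame below is mocked with invented numbers, and the 100-test threshold is an arbitrary illustrative cutoff, not an M-Lab recommendation:

```r
# Mock of the columns returned by the download-speed query;
# the values here are invented for illustration
mock_result <- data.frame(
  country = c("KP", "SG", "SE"),
  downloadThroughput = c(60.0, 30.2, 27.9),
  tests = c(12L, 250000L, 180000L),
  stringsAsFactors = FALSE
)

# Keep only countries with enough tests for the average to be meaningful
reliable <- subset(mock_result, tests >= 100)
reliable$country
# -> "SG" "SE"
```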

[1]    Thanks to Stephen McInerney and Chris Ritzo for code advice.
