Speeding up the Bradley Terry Model in R



I am currently developing my first R package, which has confronted me again and again with the question: “How can I speed up my code?”.

I did some “research” and read a lot of general articles about speeding up code, as well as a few posts specifically about speeding up R code. While I mostly got the main points, I always found the example use cases slightly contrived. So I decided to write a little something about one of my own use cases, which touches on many of the points that I think are important when trying to speed up your code. The method we want to implement in this post is the so-called Bradley Terry model. If you do not care about the theoretical part, you can jump directly to the implementation section.

The Bradley Terry Model

The Bradley Terry model is a fairly popular probabilistic model for paired comparisons in different contexts. In sports, for example, paired comparisons of course mean matches between teams or individual players. Given a pair of teams/players i and j, we can estimate a probability that “i beats j”, or i>j, as

$$P(i>j)=\frac{p_i}{p_i+p_j}.$$

The values p_i and p_j are skill ratings of the teams or players, which have to be estimated somehow (we come to that in a moment). Given the observed match outcomes, the log-likelihood of the skill vector p is

$$l(p)=\sum\limits_{i=1}^n\sum\limits_{j=1}^n \left(w_{ij}\log(p_i)-w_{ij}\log(p_i+p_j)\right),$$

where w_{ij} is the number of times i has beaten j. The log-likelihood can be maximized iteratively with

$$p_i^{(k+1)}=W_i\left(\sum\limits_{j \neq i} \frac{N_{ij}}{p_i^{(k)}+p_j^{(k)}}\right)^{-1},$$

where W_i is the number of wins by i and N_{ij}=w_{ij}+w_{ji} is the number of games between i and j. After each iteration, the skill values are scaled to sum to one. The procedure is stopped as soon as
$$\Vert p^{(k+1)}-p^{(k)}\Vert \leq \epsilon$$
for some small \epsilon > 0.

The alert mathematician might notice that there is a slight problem with the iterative procedure. If there is a team/player i that does not win any game, the procedure fails. A safety net is to give every team a teeny tiny victory against every other team. In math terms, we make the win matrix w irreducible so that the iterative procedure converges.

Once the process converges, we have an estimate of player/team skills that can be used to form a ranking.
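
To make the update rule concrete before wrapping it into a proper function, here is a minimal sketch of the iteration on a hypothetical 3-player win matrix (the entries are made up purely for illustration and are not part of the data used later):

#hypothetical toy example: w[i,j] = number of times player i beat player j
w=matrix(c(0,3,1,
           2,0,2,
           1,4,0),nrow=3,byrow=TRUE)
N=w+t(w)        #number of games between i and j
W=rowSums(w)    #total wins of player i
n=nrow(w)

p=rep(1/n,n)    #start with equal skills
for(k in 1:100){
  p_old=p
  for(i in 1:n){
    p[i]=W[i]/sum(N[i,-i]/(p_old[i]+p_old[-i]))
  }
  p=p/sum(p)    #rescale skills to sum to one
  if(sqrt(sum((p-p_old)^2))<=10^-8) break
}
p               #estimated skills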

A first implementation

The iterative process for p_i^{(k+1)} is easily implemented in R.

library(igraph)

BTm.1=function(matches,max.iter=100,tol=10^-8){
  #matches should have two columns, where the first column should always be the winner
  g=graph_from_edgelist(as.matrix(matches),directed=T)

  E(g)$weight=1
  g=igraph::simplify(g,edge.attr.comb = "sum")
  A=get.adjacency(g,attr = "weight",sparse=F)
  n=dim(A)[1]
  # sparse correction
  if(!is.connected(g,mode = "strong")){
    eps=1/n
    A=A+matrix(eps,n,n)-diag(eps,n)
  }
  #number of games between each pair of players
  N=A+t(A)
  #number of wins for each player
  W=rowSums(A)

  #initialisation
  w_0=rep(1/n,n)
  w_old=rep(n,n)
  iter=0

  while(iter<=max.iter & sum((w_old-w_0)^2)>tol){
    iter=iter+1
    w_old=w_0
    for(i in 1:n){
      tmp=0
      for(j in 1:n){
        if(i!=j){
          tmp=tmp+N[i,j]/(w_0[i]+w_0[j])
        }
      }
      w_0[i]=W[i]*tmp^(-1)
    }
    w_0=w_0/sum(w_0)
  }
  return(data.frame(player=V(g)$name,skill=w_0))
}

The construction of the game matrix N might seem a bit awkward, but I am a network analyst, so constructing the game graph feels natural and was the quickest solution I could come up with. After constructing the matrix, we check whether it is irreducible. If not, we add a small eps to every off-diagonal entry. The heart of the function is the while loop, where the iterative process is carried out.
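
If the graph-based construction looks opaque, here is a tiny example of what it produces; the match results are invented purely to illustrate the bookkeeping:

#four invented matches; the first column is always the winner
m=cbind(winner=c("A","A","A","B"),
        loser =c("B","B","C","C"))
g=graph_from_edgelist(m,directed=T)
E(g)$weight=1
g=igraph::simplify(g,edge.attr.comb = "sum")  #merge repeated pairings, summing the weights
A=get.adjacency(g,attr = "weight",sparse=F)
A        #A[i,j]: number of times i beat j
A+t(A)   #N[i,j]: number of games between i and j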

To benchmark the function, I will use some tennis data from a previous post, gathered from here and here. To begin with, we use all ATP matches played in 2016.

#packages used in this post
library(readr)
library(dplyr)
library(ggplot2)

atp=read_csv("atp.csv")
data1<-atp %>% 
  dplyr::filter(date>as.Date("2016-01-01")) %>% 
  select(winner,loser)
glimpse(data1)

## Observations: 2,626
## Variables: 2
## $ winner <chr> "Grigor Dimitrov", "Denis Kudla", "Tobias Kamke", "Hyeo...
## $ loser  <chr> "Gilles Simon", "John Patrick Smith", "Benjamin Mitchel...

How fast is our implementation?

system.time(BTm.1(data1))

##    user  system elapsed 
##  11.452   0.000  11.371

11 seconds seems a bit slow for only 2626 observations. Let’s see what we can do about that.

Maybe there is a package for it?

Actually, this is a question you should ask yourself before implementing a potentially complex function in R. Chances are that someone, somewhere, has already done it, and probably better than you (no offense!). For our problem, there is the BradleyTerry2 package.

library(BradleyTerry2)
library(microbenchmark)

#prepare input data for the Package function
data2<-data.frame(player1=factor(data1$winner,levels=unique(c(data1$winner,data1$loser))),
                  player2=factor(data1$loser,levels=unique(c(data1$winner,data1$loser))),
                  outcome=rep(1,nrow(data1)))
res<-microbenchmark(BTm.1=BTm.1(data1),
                    package=BTm(data2$outcome,data2$player1,data2$player2),
                    times=10)

print(res)

## Unit: seconds
##     expr       min       lq      mean    median       uq       max neval
##    BTm.1 10.844524 10.93058 11.274145 11.181254 11.40294 12.486450    10
##  package  4.604094  4.66283  4.823965  4.753485  4.99618  5.274492    10

res %>% 
  group_by(expr) %>% 
  dplyr::summarise(mean.t=mean(time)/10^9) %>% 
  ggplot(aes(x=expr,y=mean.t))+
  geom_col()+
  labs(x="Expression",y="Mean Time in seconds")

(Plot: mean runtime in seconds of BTm.1 vs. the package function.)
The package function is more than twice as fast, which was to be expected.

Using functions from existing packages is usually a good idea. Why reinvent the wheel?
But sometimes you might reconsider, especially when you want to include a function in your own R package.
I personally do not want my package to depend on too many others, since I don't want to force users to install a pile of packages that they might not need otherwise. Also, other packages might use different data structures, forcing you to adopt them as well, and you will find yourself transforming data back and forth, slowing down your code again.

If you decide to use your own implementation, you can use existing packages as a benchmark for speed and also for accuracy.

res.1=BTm.1(data1)
res.2=BTm(data2$outcome,data2$player1,data2$player2)

#top five ATP players of 2016 according to BTm.1
res.1$player[order(res.1$skill,decreasing=T)[1:5]]

## [1] "Andy Murray"    "Novak Djokovic" "Milos Raonic"   "Roger Federer" 
## [5] "Kei Nishikori"

#top five ATP players of 2016 according to BTm
names(sort(res.2$coefficients,decreasing=T))[1:5]

## [1] "..Andy Murray"    "..Novak Djokovic" "..Milos Raonic"  
## [4] "..Roger Federer"  "..Kei Nishikori"

While this is certainly not a very robust test of whether the two functions produce the same outcome, it shows that they at least agree on the five best players of 2016. I am not a big tennis connoisseur, but this top 5 looks very reasonable to me. If you plan on using a self-implemented function more frequently, you should definitely make sure that it is correct.
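
A slightly more thorough (but still informal) check, which was not part of the comparison above, is to look at the rank correlation of the two fits across all players. BTm() reports log-abilities whose names carry a ".." prefix and omit the reference player, so we strip the prefix and compare only the players present in both fits; if the implementations agree, the rank correlation should be close to one.

#informal consistency check: rank correlation of the two fits
skills1=setNames(res.1$skill,as.character(res.1$player))
abil2=res.2$coefficients
names(abil2)=sub("^\\.\\.","",names(abil2))  #drop the ".." prefix
common=intersect(names(skills1),names(abil2))
cor(skills1[common],abil2[common],method="spearman")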

Can we speed up our implementation?

Now that we have decided to use our own implementation, it is time to think about speeding up the code.
A very easy way to do so is to use a byte code compiler. Explaining the details is out of scope for this blog post. A post about the topic can be found here.

BTm.1.jit=compiler::cmpfun(BTm.1)
res<-microbenchmark(BTm.1=BTm.1(data1),
                    BTm.1.jit=BTm.1.jit(data1),
                    package=BTm(data2$outcome,data2$player1,data2$player2),
                    times=10)

print(res)

## Unit: seconds
##       expr       min        lq      mean    median        uq       max
##      BTm.1 10.809827 10.908028 10.925320 10.928710 10.958387 10.981341
##  BTm.1.jit  4.468739  4.489647  4.553959  4.538044  4.605068  4.697536
##    package  4.512514  4.591104  4.607991  4.615787  4.634948  4.725338
##  neval
##     10
##     10
##     10

res %>% 
  group_by(expr) %>% 
  dplyr::summarise(mean.t=median(time)/10^9) %>% 
  ggplot(aes(x=expr,y=mean.t))+
  geom_col()+
  labs(x="Expression",y="Median Time in seconds")

(Plot: median runtime in seconds of BTm.1, BTm.1.jit and the package function.)
With just one line of code we sped up our function so much that it now matches the runtime of the dedicated package!

Can we speed things up even more?

Although for loops are easy to write and to read, they usually have a tremendous (negative) effect on runtime in R.
This stackoverflow thread discusses loops in R in some detail.

The key to fast R code is vectorization. The downside of vectorization is that you have to invest some time into it, and it requires some knowledge of matrix algebra. Carefully investigating our code, we notice that the line tmp=tmp+N[i,j]/(w_0[i]+w_0[j]) looks suspiciously like something we could rewrite in terms of matrices. The denominator can be expressed as a matrix W_0 with W_0[i,j]=w_0[i]+w_0[j]. This matrix could be calculated with a for loop as before, but a more efficient way is to use the outer function.
Here is a small example.

#for loops
f.for=function(a){
  n=length(a)
  A=matrix(0,n,n)
  for(i in 1:n){
    for(j in 1:n){
      A[i,j]=a[i]+a[j]
    }
  }
  return(A)
}

#using outer
f.outer=function(a) {return(outer(a,a,"+"))}

#are the results identical?
n=25
a=runif(n)
identical(f.for(a),f.outer(a))

## [1] TRUE

#runtime
n=2500
a=runif(n)
res<-microbenchmark::microbenchmark(f.for=f.for(a),f.outer=f.outer(a),times=10)

print(res)

## Unit: milliseconds
##     expr        min         lq      mean   median        uq       max
##    f.for 8314.93612 8440.52418 8480.5129 8462.018 8571.8711 8616.8158
##  f.outer   32.84932   33.16897  128.5399  167.679  172.0675  300.8795
##  neval
##     10
##     10

res %>% 
  group_by(expr) %>% 
  dplyr::summarise(mean.t=median(time)/10^9) %>% 
  ggplot(aes(x=expr,y=mean.t))+
  geom_col()+
  labs(x="Expression",y="Median Time in seconds")

(Plot: median runtime in seconds of f.for vs. f.outer.)
The difference between the for loops and outer is enormous. Not only does outer cut the required lines of code to a bare minimum, it is also about 60x faster in this case!

Using the outer function in our implementation allows us to reduce each iteration to a single rowSums call.
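
Before plugging this in, here is a quick check, on random inputs, that the one-liner really computes the same sums as the double loop. One subtlety: rowSums also includes the j=i terms, but the diagonal of our game matrix N is zero (nobody plays against themselves), so those terms contribute nothing.

#check on random inputs that the vectorized step matches the double loop
set.seed(1)
n=10
N=matrix(rpois(n*n,3),n,n)
diag(N)=0   #as in our game matrix: no games against oneself
w=runif(n)

by.loop=sapply(1:n,function(i) sum(N[i,-i]/(w[i]+w[-i])))
by.outer=rowSums(N/outer(w,w,"+"))
all.equal(by.loop,by.outer)

With that out of the way, the vectorized implementation looks as follows.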

BTm.2=function(matches,max.iter=100,tol=10^-8){
  #matches should have two columns, where the first column should always be the winner
  g=graph_from_edgelist(as.matrix(matches),directed=T)

  E(g)$weight=1
  g=igraph::simplify(g,edge.attr.comb = "sum")
  A=get.adjacency(g,attr = "weight",sparse=F)
  n=dim(A)[1]

  # sparse correction
  if(!is.connected(g,mode = "strong")){
    eps=1/n
    A=A+matrix(eps,n,n)-diag(eps,n)
  }
  #number of games between each pair of players
  N=A+t(A)
  #number of wins for each player
  W=rowSums(A)

  #initialisation
  w_0=rep(1/n,n)
  w_old=rep(n,n)
  iter=0
  while(iter<=max.iter & sqrt(sum((w_old-w_0)^2))>tol){
    iter=iter+1
    w_old=w_0
    w_0=W*rowSums(N/outer(w_0,w_0,"+"))^(-1)
    w_0=w_0/sum(w_0)
  }

  return(data.frame(player=V(g)$name,skill=w_0))

}

Notice how the vectorized code shrinks the while loop significantly. Admittedly, it also makes the code a bit harder to understand. I have learned (the hard way) that when you vectorize your code, you should always keep the non-vectorized version around as a reference.
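
A simple way to use the old code as a reference is to run both versions on the same data and compare the results. The two are not bit-identical (the loop version updates each skill in place while the vectorized version updates them all at once, and the stopping criteria differ slightly), so small numerical differences are to be expected.

#reference check: the vectorized version should closely reproduce the loop version
res.loop=BTm.1(data1)
res.vec=BTm.2(data1)
identical(as.character(res.loop$player),as.character(res.vec$player))
all.equal(res.loop$skill,res.vec$skill)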

How much runtime do we gain?

res<-microbenchmark(BTm.1.jit=BTm.1.jit(data1),
                    BTm.2=BTm.2(data1),
                    package=BTm(data2$outcome,data2$player1,data2$player2),
                    times=10)

print(res)

## Unit: milliseconds
##       expr       min        lq      mean    median        uq       max
##  BTm.1.jit 4454.3048 4473.1272 4515.7012 4485.3681 4595.2181 4615.3122
##      BTm.2  431.4755  445.0542  480.6702  458.8493  477.7227  595.9364
##    package 4305.5614 4517.1449 4594.6055 4609.5872 4632.7367 4863.8251
##  neval
##     10
##     10
##     10

res %>% 
  group_by(expr) %>% 
  dplyr::summarise(mean.t=median(time)/10^9) %>% 
  ggplot(aes(x=expr,y=mean.t))+
  geom_col()+
  labs(x="Expression",y="Median Time in seconds")

(Plot: median runtime in seconds of BTm.1.jit, BTm.2 and the package function.)
The speed-up of the vectorized code is extraordinary. It is even about an order of magnitude faster than the code from the BradleyTerry2 package, most likely because the BTm function is more generic and also produces a variety of diagnostics that we completely ignored. The code of BTm.2 is highly specific, but it does exactly what we want.

Disclaimer: It might well be that the BTm function could be sped up with the proper options, but I did not spend too much time looking into the details of that function.

Scaling it up

A speed up is always nice, especially when it is as large as in our case. But a speed up for a single run is not all we need. It is also important how the runtime scales with the size of the input.
Below I extend the time window of matches passed to BTm.2 from 2016 gradually back to 2000, thereby including more and more matches.

years=2016:2000
res.scale=data.frame(runtime=rep(0,length(years)),
                     observations=rep(0,length(years)))
k=0
for(y in years){
  data3<-atp %>% 
    dplyr::filter(lubridate::year(date)>=y) %>% 
    select(winner,loser)
  k=k+1
  tmp<-microbenchmark(BTm.2=BTm.2(data3),times=5)
  res.scale$runtime[k]=median(tmp$time)/10^9
  res.scale$observations[k]=nrow(data3)
}

ggplot(res.scale,aes(x=observations,y=runtime))+
  geom_point()+
  geom_line()+
  labs(x="Observations",y="Median time in seconds")

(Plot: median runtime in seconds of BTm.2 against the number of observations.)
I am not sure what is going on in the middle (I should maybe have done more than five runs), but apart from that, I think the code scales quite reasonably.

Wrap up

The key to speeding up R code is, without a doubt, to vectorize it. This will hardly be news for more experienced R users. Although it was nothing new for me either, things get a bit more real when you try to develop a package. Waiting a couple of seconds (or minutes, if you are patient) for results might be no problem for you, but users might have bigger inputs than you do, so scalability becomes an issue too.

I have always been a huge fan of the outer function, since it appeals to my inner mathematician. I will cover it in more detail in the next post, because you can do impressive things with it.

I deliberately did not tackle the option of implementing functions in C++, which will also be left for the next post. But rest assured that even vectorized R code can be orders of magnitude slower than a naive C/C++ implementation.

The code of this post can be found on Github.

By the way, for those who are interested in the ranking of tennis players from 2000-2016: here is what we get.

Rank  Player                  Skill
   1  Novak Djokovic          0.0154895
   2  Rafael Nadal            0.0131991
   3  Roger Federer           0.0128205
   4  Andy Murray             0.0097758
   5  Andre Agassi            0.0061090
   6  Andy Roddick            0.0060375
   7  Juan Martin Del Potro   0.0059108
   8  Jo Wilfried Tsonga      0.0051155
   9  Lleyton Hewitt          0.0047013
  10  Milos Raonic            0.0044824
