Using R to study tennis players

February 13, 2017
By

(This article was first published on R and Finance, and kindly contributed to R-bloggers)

Looking at winning records of ATP top players and Brazilian tennists –

In the previous post about tennis, we studied how changes in ball’s composition in hard and grass courts affected the game back in 2000. In this post, we will analyse a different dataset from the same repository, and look at the players winning records in ATP matches. We will again separate the result by the type of court.

The data

I’m again using the great repository of tennis data of Jeff
Sackmann
. In this case, however, I’m
using the ATP repository
that contains ATP match data since 1968 until 2017. Again, I thank Jeff
Sackmann for making this dataset publicly available.

First, let’s download and unzip the dataset. This is a large file with
35MB and it takes some time.

zip.file <- 'TennisData_ATP.zip'

# download file
if (!file.exists(zip.file)){
  download.file('https://github.com/JeffSackmann/tennis_atp/archive/master.zip', destfile = zip.file)
  
}

# unzip it
dir.out <- 'tennis_atp-master/'
if (!dir.exists(dir.out)) unzip(zip.file)

Lets have a look at the contents of the zip file.

my.f <- list.files(dir.out, pattern = '*.csv', full.names = T)
print(my.f)

##   [1] "tennis_atp-master/atp_matches_1968.csv"           
##   [2] "tennis_atp-master/atp_matches_1969.csv"           
##   [3] "tennis_atp-master/atp_matches_1970.csv"           
##   [4] "tennis_atp-master/atp_matches_1971.csv"           
##   [5] "tennis_atp-master/atp_matches_1972.csv"           
##   [6] "tennis_atp-master/atp_matches_1973.csv"           
##   [7] "tennis_atp-master/atp_matches_1974.csv"           
##   [8] "tennis_atp-master/atp_matches_1975.csv"           
##   [9] "tennis_atp-master/atp_matches_1976.csv"           
##  [10] "tennis_atp-master/atp_matches_1977.csv"           
##  [11] "tennis_atp-master/atp_matches_1978.csv"           
##  [12] "tennis_atp-master/atp_matches_1979.csv"           
##  [13] "tennis_atp-master/atp_matches_1980.csv"           
##  [14] "tennis_atp-master/atp_matches_1981.csv"           
##  [15] "tennis_atp-master/atp_matches_1982.csv"           
##  [16] "tennis_atp-master/atp_matches_1983.csv"           
##  [17] "tennis_atp-master/atp_matches_1984.csv"           
##  [18] "tennis_atp-master/atp_matches_1985.csv"           
##  [19] "tennis_atp-master/atp_matches_1986.csv"           
##  [20] "tennis_atp-master/atp_matches_1987.csv"           
##  [21] "tennis_atp-master/atp_matches_1988.csv"           
##  [22] "tennis_atp-master/atp_matches_1989.csv"           
##  [23] "tennis_atp-master/atp_matches_1990.csv"           
##  [24] "tennis_atp-master/atp_matches_1991.csv"           
##  [25] "tennis_atp-master/atp_matches_1992.csv"           
##  [26] "tennis_atp-master/atp_matches_1993.csv"           
##  [27] "tennis_atp-master/atp_matches_1994.csv"           
##  [28] "tennis_atp-master/atp_matches_1995.csv"           
##  [29] "tennis_atp-master/atp_matches_1996.csv"           
##  [30] "tennis_atp-master/atp_matches_1997.csv"           
##  [31] "tennis_atp-master/atp_matches_1998.csv"           
##  [32] "tennis_atp-master/atp_matches_1999.csv"           
##  [33] "tennis_atp-master/atp_matches_2000.csv"           
##  [34] "tennis_atp-master/atp_matches_2001.csv"           
##  [35] "tennis_atp-master/atp_matches_2002.csv"           
##  [36] "tennis_atp-master/atp_matches_2003.csv"           
##  [37] "tennis_atp-master/atp_matches_2004.csv"           
##  [38] "tennis_atp-master/atp_matches_2005.csv"           
##  [39] "tennis_atp-master/atp_matches_2006.csv"           
##  [40] "tennis_atp-master/atp_matches_2007.csv"           
##  [41] "tennis_atp-master/atp_matches_2008.csv"           
##  [42] "tennis_atp-master/atp_matches_2009.csv"           
##  [43] "tennis_atp-master/atp_matches_2010.csv"           
##  [44] "tennis_atp-master/atp_matches_2011.csv"           
##  [45] "tennis_atp-master/atp_matches_2012.csv"           
##  [46] "tennis_atp-master/atp_matches_2013.csv"           
##  [47] "tennis_atp-master/atp_matches_2014.csv"           
##  [48] "tennis_atp-master/atp_matches_2015.csv"           
##  [49] "tennis_atp-master/atp_matches_2016.csv"           
##  [50] "tennis_atp-master/atp_matches_2017.csv"           
##  [51] "tennis_atp-master/atp_matches_futures_1991.csv"   
##  [52] "tennis_atp-master/atp_matches_futures_1992.csv"   
##  [53] "tennis_atp-master/atp_matches_futures_1993.csv"   
##  [54] "tennis_atp-master/atp_matches_futures_1994.csv"   
##  [55] "tennis_atp-master/atp_matches_futures_1995.csv"   
##  [56] "tennis_atp-master/atp_matches_futures_1996.csv"   
##  [57] "tennis_atp-master/atp_matches_futures_1997.csv"   
##  [58] "tennis_atp-master/atp_matches_futures_1998.csv"   
##  [59] "tennis_atp-master/atp_matches_futures_1999.csv"   
##  [60] "tennis_atp-master/atp_matches_futures_2000.csv"   
##  [61] "tennis_atp-master/atp_matches_futures_2001.csv"   
##  [62] "tennis_atp-master/atp_matches_futures_2002.csv"   
##  [63] "tennis_atp-master/atp_matches_futures_2003.csv"   
##  [64] "tennis_atp-master/atp_matches_futures_2004.csv"   
##  [65] "tennis_atp-master/atp_matches_futures_2005.csv"   
##  [66] "tennis_atp-master/atp_matches_futures_2006.csv"   
##  [67] "tennis_atp-master/atp_matches_futures_2007.csv"   
##  [68] "tennis_atp-master/atp_matches_futures_2008.csv"   
##  [69] "tennis_atp-master/atp_matches_futures_2009.csv"   
##  [70] "tennis_atp-master/atp_matches_futures_2010.csv"   
##  [71] "tennis_atp-master/atp_matches_futures_2011.csv"   
##  [72] "tennis_atp-master/atp_matches_futures_2012.csv"   
##  [73] "tennis_atp-master/atp_matches_futures_2013.csv"   
##  [74] "tennis_atp-master/atp_matches_futures_2014.csv"   
##  [75] "tennis_atp-master/atp_matches_futures_2015.csv"   
##  [76] "tennis_atp-master/atp_matches_futures_2016.csv"   
##  [77] "tennis_atp-master/atp_matches_futures_2017.csv"   
##  [78] "tennis_atp-master/atp_matches_qual_chall_1991.csv"
##  [79] "tennis_atp-master/atp_matches_qual_chall_1992.csv"
##  [80] "tennis_atp-master/atp_matches_qual_chall_1993.csv"
##  [81] "tennis_atp-master/atp_matches_qual_chall_1994.csv"
##  [82] "tennis_atp-master/atp_matches_qual_chall_1995.csv"
##  [83] "tennis_atp-master/atp_matches_qual_chall_1996.csv"
##  [84] "tennis_atp-master/atp_matches_qual_chall_1997.csv"
##  [85] "tennis_atp-master/atp_matches_qual_chall_1998.csv"
##  [86] "tennis_atp-master/atp_matches_qual_chall_1999.csv"
##  [87] "tennis_atp-master/atp_matches_qual_chall_2000.csv"
##  [88] "tennis_atp-master/atp_matches_qual_chall_2001.csv"
##  [89] "tennis_atp-master/atp_matches_qual_chall_2002.csv"
##  [90] "tennis_atp-master/atp_matches_qual_chall_2003.csv"
##  [91] "tennis_atp-master/atp_matches_qual_chall_2004.csv"
##  [92] "tennis_atp-master/atp_matches_qual_chall_2005.csv"
##  [93] "tennis_atp-master/atp_matches_qual_chall_2006.csv"
##  [94] "tennis_atp-master/atp_matches_qual_chall_2007.csv"
##  [95] "tennis_atp-master/atp_matches_qual_chall_2008.csv"
##  [96] "tennis_atp-master/atp_matches_qual_chall_2009.csv"
##  [97] "tennis_atp-master/atp_matches_qual_chall_2010.csv"
##  [98] "tennis_atp-master/atp_matches_qual_chall_2011.csv"
##  [99] "tennis_atp-master/atp_matches_qual_chall_2012.csv"
## [100] "tennis_atp-master/atp_matches_qual_chall_2013.csv"
## [101] "tennis_atp-master/atp_matches_qual_chall_2014.csv"
## [102] "tennis_atp-master/atp_matches_qual_chall_2015.csv"
## [103] "tennis_atp-master/atp_matches_qual_chall_2016.csv"
## [104] "tennis_atp-master/atp_matches_qual_chall_2017.csv"
## [105] "tennis_atp-master/atp_players.csv"                
## [106] "tennis_atp-master/atp_rankings_00s.csv"           
## [107] "tennis_atp-master/atp_rankings_10s.csv"           
## [108] "tennis_atp-master/atp_rankings_70s.csv"           
## [109] "tennis_atp-master/atp_rankings_80s.csv"           
## [110] "tennis_atp-master/atp_rankings_90s.csv"           
## [111] "tennis_atp-master/atp_rankings_current.csv"

Again, same as with the other post, we see a lot of files. The names are
quite suggesting and it is clear that not all files contains matches
data. Let’s restrict the analysis just for the files with the string
atp_matches. This includes main matches, qualifications and futures
games.

library(stringr)

# restrict just for main atp matches
my.f <- my.f[str_detect(my.f, 'matches')]
my.f

##   [1] "tennis_atp-master/atp_matches_1968.csv"           
##   [2] "tennis_atp-master/atp_matches_1969.csv"           
##   [3] "tennis_atp-master/atp_matches_1970.csv"           
##   [4] "tennis_atp-master/atp_matches_1971.csv"           
##   [5] "tennis_atp-master/atp_matches_1972.csv"           
##   [6] "tennis_atp-master/atp_matches_1973.csv"           
##   [7] "tennis_atp-master/atp_matches_1974.csv"           
##   [8] "tennis_atp-master/atp_matches_1975.csv"           
##   [9] "tennis_atp-master/atp_matches_1976.csv"           
##  [10] "tennis_atp-master/atp_matches_1977.csv"           
##  [11] "tennis_atp-master/atp_matches_1978.csv"           
##  [12] "tennis_atp-master/atp_matches_1979.csv"           
##  [13] "tennis_atp-master/atp_matches_1980.csv"           
##  [14] "tennis_atp-master/atp_matches_1981.csv"           
##  [15] "tennis_atp-master/atp_matches_1982.csv"           
##  [16] "tennis_atp-master/atp_matches_1983.csv"           
##  [17] "tennis_atp-master/atp_matches_1984.csv"           
##  [18] "tennis_atp-master/atp_matches_1985.csv"           
##  [19] "tennis_atp-master/atp_matches_1986.csv"           
##  [20] "tennis_atp-master/atp_matches_1987.csv"           
##  [21] "tennis_atp-master/atp_matches_1988.csv"           
##  [22] "tennis_atp-master/atp_matches_1989.csv"           
##  [23] "tennis_atp-master/atp_matches_1990.csv"           
##  [24] "tennis_atp-master/atp_matches_1991.csv"           
##  [25] "tennis_atp-master/atp_matches_1992.csv"           
##  [26] "tennis_atp-master/atp_matches_1993.csv"           
##  [27] "tennis_atp-master/atp_matches_1994.csv"           
##  [28] "tennis_atp-master/atp_matches_1995.csv"           
##  [29] "tennis_atp-master/atp_matches_1996.csv"           
##  [30] "tennis_atp-master/atp_matches_1997.csv"           
##  [31] "tennis_atp-master/atp_matches_1998.csv"           
##  [32] "tennis_atp-master/atp_matches_1999.csv"           
##  [33] "tennis_atp-master/atp_matches_2000.csv"           
##  [34] "tennis_atp-master/atp_matches_2001.csv"           
##  [35] "tennis_atp-master/atp_matches_2002.csv"           
##  [36] "tennis_atp-master/atp_matches_2003.csv"           
##  [37] "tennis_atp-master/atp_matches_2004.csv"           
##  [38] "tennis_atp-master/atp_matches_2005.csv"           
##  [39] "tennis_atp-master/atp_matches_2006.csv"           
##  [40] "tennis_atp-master/atp_matches_2007.csv"           
##  [41] "tennis_atp-master/atp_matches_2008.csv"           
##  [42] "tennis_atp-master/atp_matches_2009.csv"           
##  [43] "tennis_atp-master/atp_matches_2010.csv"           
##  [44] "tennis_atp-master/atp_matches_2011.csv"           
##  [45] "tennis_atp-master/atp_matches_2012.csv"           
##  [46] "tennis_atp-master/atp_matches_2013.csv"           
##  [47] "tennis_atp-master/atp_matches_2014.csv"           
##  [48] "tennis_atp-master/atp_matches_2015.csv"           
##  [49] "tennis_atp-master/atp_matches_2016.csv"           
##  [50] "tennis_atp-master/atp_matches_2017.csv"           
##  [51] "tennis_atp-master/atp_matches_futures_1991.csv"   
##  [52] "tennis_atp-master/atp_matches_futures_1992.csv"   
##  [53] "tennis_atp-master/atp_matches_futures_1993.csv"   
##  [54] "tennis_atp-master/atp_matches_futures_1994.csv"   
##  [55] "tennis_atp-master/atp_matches_futures_1995.csv"   
##  [56] "tennis_atp-master/atp_matches_futures_1996.csv"   
##  [57] "tennis_atp-master/atp_matches_futures_1997.csv"   
##  [58] "tennis_atp-master/atp_matches_futures_1998.csv"   
##  [59] "tennis_atp-master/atp_matches_futures_1999.csv"   
##  [60] "tennis_atp-master/atp_matches_futures_2000.csv"   
##  [61] "tennis_atp-master/atp_matches_futures_2001.csv"   
##  [62] "tennis_atp-master/atp_matches_futures_2002.csv"   
##  [63] "tennis_atp-master/atp_matches_futures_2003.csv"   
##  [64] "tennis_atp-master/atp_matches_futures_2004.csv"   
##  [65] "tennis_atp-master/atp_matches_futures_2005.csv"   
##  [66] "tennis_atp-master/atp_matches_futures_2006.csv"   
##  [67] "tennis_atp-master/atp_matches_futures_2007.csv"   
##  [68] "tennis_atp-master/atp_matches_futures_2008.csv"   
##  [69] "tennis_atp-master/atp_matches_futures_2009.csv"   
##  [70] "tennis_atp-master/atp_matches_futures_2010.csv"   
##  [71] "tennis_atp-master/atp_matches_futures_2011.csv"   
##  [72] "tennis_atp-master/atp_matches_futures_2012.csv"   
##  [73] "tennis_atp-master/atp_matches_futures_2013.csv"   
##  [74] "tennis_atp-master/atp_matches_futures_2014.csv"   
##  [75] "tennis_atp-master/atp_matches_futures_2015.csv"   
##  [76] "tennis_atp-master/atp_matches_futures_2016.csv"   
##  [77] "tennis_atp-master/atp_matches_futures_2017.csv"   
##  [78] "tennis_atp-master/atp_matches_qual_chall_1991.csv"
##  [79] "tennis_atp-master/atp_matches_qual_chall_1992.csv"
##  [80] "tennis_atp-master/atp_matches_qual_chall_1993.csv"
##  [81] "tennis_atp-master/atp_matches_qual_chall_1994.csv"
##  [82] "tennis_atp-master/atp_matches_qual_chall_1995.csv"
##  [83] "tennis_atp-master/atp_matches_qual_chall_1996.csv"
##  [84] "tennis_atp-master/atp_matches_qual_chall_1997.csv"
##  [85] "tennis_atp-master/atp_matches_qual_chall_1998.csv"
##  [86] "tennis_atp-master/atp_matches_qual_chall_1999.csv"
##  [87] "tennis_atp-master/atp_matches_qual_chall_2000.csv"
##  [88] "tennis_atp-master/atp_matches_qual_chall_2001.csv"
##  [89] "tennis_atp-master/atp_matches_qual_chall_2002.csv"
##  [90] "tennis_atp-master/atp_matches_qual_chall_2003.csv"
##  [91] "tennis_atp-master/atp_matches_qual_chall_2004.csv"
##  [92] "tennis_atp-master/atp_matches_qual_chall_2005.csv"
##  [93] "tennis_atp-master/atp_matches_qual_chall_2006.csv"
##  [94] "tennis_atp-master/atp_matches_qual_chall_2007.csv"
##  [95] "tennis_atp-master/atp_matches_qual_chall_2008.csv"
##  [96] "tennis_atp-master/atp_matches_qual_chall_2009.csv"
##  [97] "tennis_atp-master/atp_matches_qual_chall_2010.csv"
##  [98] "tennis_atp-master/atp_matches_qual_chall_2011.csv"
##  [99] "tennis_atp-master/atp_matches_qual_chall_2012.csv"
## [100] "tennis_atp-master/atp_matches_qual_chall_2013.csv"
## [101] "tennis_atp-master/atp_matches_qual_chall_2014.csv"
## [102] "tennis_atp-master/atp_matches_qual_chall_2015.csv"
## [103] "tennis_atp-master/atp_matches_qual_chall_2016.csv"
## [104] "tennis_atp-master/atp_matches_qual_chall_2017.csv"

As you can see, I used string matching function str_detect from
stringr to find and keep just the csv files with the string matches
in its name. Now, let’s load all files into a single dataframe.

library(readr)
library(dplyr)

# set cols (missing some)
my.cols <- cols(
  .default = col_integer(),
  tourney_id = col_character(),
  tourney_name = col_character(),
  surface = col_character(),
  tourney_level = col_character(),
  winner_entry = col_character(),
  winner_name = col_character(),
  winner_hand = col_character(),
  winner_ioc = col_character(),
  winner_age = col_double(),
  loser_entry = col_character(),
  loser_name = col_character(),
  loser_hand = col_character(),
  loser_ioc = col_character(),
  loser_age = col_double(),
  score = col_character(),
  round = col_character()
)

# load all files with lapply and do.call (some cols don't match in all files)
df.matches <- do.call(bind_rows,lapply(my.f, read_csv, col_types = my.cols))

# create year column
df.matches$Date <- as.Date(as.character(df.matches$tourney_date),'%Y%m%d')
df.matches$Year <- format(df.matches$Date,'%Y')

Let’s see what the data offers.

str(df.matches)

## Classes 'tbl_df', 'tbl' and 'data.frame':    667965 obs. of  51 variables:
##  $ tourney_id        : chr  "1968-580" "1968-580" "1968-580" "1968-580" ...
##  $ tourney_name      : chr  "Australian Chps." "Australian Chps." "Australian Chps." "Australian Chps." ...
##  $ surface           : chr  "Grass" "Grass" "Grass" "Grass" ...
##  $ draw_size         : int  64 64 64 64 64 64 64 64 64 64 ...
##  $ tourney_level     : chr  "G" "G" "G" "G" ...
##  $ tourney_date      : int  19680119 19680119 19680119 19680119 19680119 19680119 19680119 19680119 19680119 19680119 ...
##  $ match_num         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ winner_id         : int  110023 109803 100257 100105 109966 107759 100101 100025 108519 109799 ...
##  $ winner_seed       : int  NA NA NA 5 NA NA 12 3 NA NA ...
##  $ winner_entry      : chr  NA NA NA NA ...
##  $ winner_name       : chr  "Richard Coulthard" "John Brown" "Ross Case" "Allan Stone" ...
##  $ winner_hand       : chr  "R" "R" "R" "R" ...
##  $ winner_ht         : int  NA NA NA NA NA NA NA 173 NA NA ...
##  $ winner_ioc        : chr  "AUS" "AUS" "AUS" "AUS" ...
##  $ winner_age        : num  NA 27.5 16.2 22.3 29.9 ...
##  $ winner_rank       : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ winner_rank_points: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ loser_id          : int  107760 106964 110024 110025 110026 110027 110028 108430 110029 110030 ...
##  $ loser_seed        : int  NA NA 15 NA NA NA NA NA NA NA ...
##  $ loser_entry       : chr  NA NA NA NA ...
##  $ loser_name        : chr  "Max Senior" "Ernie Mccabe" "Gondo Widjojo" "Robert Layton" ...
##  $ loser_hand        : chr  "R" "R" "R" "R" ...
##  $ loser_ht          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ loser_ioc         : chr  "AUS" "AUS" "INA" "AUS" ...
##  $ loser_age         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ loser_rank        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ loser_rank_points : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ score             : chr  "12-10 7-5 4-6 7-5" "6-3 6-2 6-4" "6-4 3-6 6-3 7-5" "6-4 6-2 6-1" ...
##  $ best_of           : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ round             : chr  "R64" "R64" "R64" "R64" ...
##  $ minutes           : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ w_ace             : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ w_df              : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ w_svpt            : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ w_1stIn           : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ w_1stWon          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ w_2ndWon          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ w_SvGms           : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ w_bpSaved         : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ w_bpFaced         : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ l_ace             : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ l_df              : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ l_svpt            : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ l_1stIn           : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ l_1stWon          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ l_2ndWon          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ l_SvGms           : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ l_bpSaved         : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ l_bpFaced         : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Date              : Date, format: "1968-01-19" "1968-01-19" ...
##  $ Year              : chr  "1968" "1968" "1968" "1968" ...

Again, lots of information. Looks like the games are separated by rows,
where the columns have a lot of information about the matches. This
dataset is not as detailed as the last one with point by point data, but
still quite impressive.

Looking at top players winning percentages

For our first analysis, lets look at the winning percentages of the top
10 players from the ATP rankings. For that, I will use package rvest
to scrape the information about the top ATP players from the ATP
website
.

library(rvest)

## Loading required package: xml2

## 
## Attaching package: 'rvest'

## The following object is masked from 'package:readr':
## 
##     guess_encoding

library(stringr)

n.players <- 10

my.url <- 'http://www.atpworldtour.com/en/rankings/singles'

atp.players <- read_html(my.url) %>%
  html_nodes(".player-cell") %>%
  html_text() %>%
  str_replace_all('\\n','') %>%
  str_trim()

name.player <- atp.players[1:n.players]
print(name.player)

##  [1] "Andy Murray"    "Novak Djokovic" "Stan Wawrinka"  "Milos Raonic"  
##  [5] "Kei Nishikori"  "Rafael Nadal"   "Marin Cilic"    "Dominic Thiem" 
##  [9] "Roger Federer"  "Gael Monfils"

As you can see, we have to usual suspects. Andy Murray is now leading in
the first position, which was held previously by Djokovic for a long
period of time. I really like to see Dominic Thiem in the top ten as he
is the youngest with 23 year, a solid game, and lots of potential.
Federer is the oldest of the list, with 35 years old, but the most
talented. It is amazing how he can still stay competitive given the age
difference.

Back to the R analysis, a problem with this approach of scraping the
names from the ATP website is that they don’t necessarily match the
names in the ATP files. As an example, the swiss player Stan Wawrinka
is named Stanislas Wawrinka in the csv files, while in the ATP website
is Stan Wawrinka. This means that directly trying to match the names
in name.player with the names in df.matches may not work. This is
classic case of data from difference sources that don’t share the same
unique identifiers.

I usually solve this problem using a lookup table built manually in a
csv file, where I store the matching identifiers in the different
datasets. If the number of unique cases is small, it is quite doable.
For this analysis, however, we have 12325 players, which is definitely
too much. Manually creating a lookup table would be very demanding.

The solution I often in this case is to make the computer look for the
best imperfect match among the different identifiers (names). This is
also known as calculating string distances – the number of require
edits so that a string matches other, and can be accomplished in many
different ways. In this example I’ll use function amatch from package
stringdist, with the basic options.

library(stringdist)

# get unique names from losers and winners
unique.players <- unique(c(df.matches$winner_name, 
                           df.matches$loser_name ))

# find the index of the closest string for each name in name.player
idx <- amatch(name.player, unique.players, maxDist = 5)

Lets look at the result:

print(data.frame(name.from.web = name.player,
                 name.from.atp = unique.players[idx] ))

##     name.from.web      name.from.atp
## 1     Andy Murray        Andy Murray
## 2  Novak Djokovic     Novak Djokovic
## 3   Stan Wawrinka Stanislas Wawrinka
## 4    Milos Raonic       Milos Raonic
## 5   Kei Nishikori      Kei Nishikori
## 6    Rafael Nadal       Rafael Nadal
## 7     Marin Cilic        Marin Cilic
## 8   Dominic Thiem      Dominic Thiem
## 9   Roger Federer      Roger Federer
## 10   Gael Monfils       Gael Monfils

Beautiful! For our case of just 10 players, it worked perfectly. The
name of Stan Wawrinka_ was matched correctly. Be aware, however, that
this procedure of string matching is not guaranteed to work 100% of the
cases. You could also set NA values using the options in amatch but it
is still not guaranteed to work well. You should always manually check
the result.

Going forward, let’s copy the players name from one to the other and use
a loop and dplyr to calculate yearly winning percentages of each
player, also separating by the type of surface. Using a loop here might
look odd, but this was the simplest solution I’ve found as you cannot
use dplyr directly. The dataframe df.matches is unique in the sense
that it has two informations in each row, you have a win for someone and
a loss to another player (see columns winner_name and loser_name).
If I use dplyr directly, I would miss the games where two players in
the list played each other. There is probably a way to adjust the
dataframe, but I just find it easier to loop over the players.

# copy players name
name.player <- unique.players[idx]

# create table with players
my.tab <- tibble()
for (i.player in name.player) {
  
  temp <- filter(df.matches, (winner_name==i.player)|(loser_name==i.player))
  temp$Name <- i.player
  temp.tab <- temp %>% 
    group_by(Name, Year,surface) %>%
    summarise(`Percentage of wins` = sum(winner_name == Name)/n(),
              `Number of matches` = n()) %>%
    filter(`Number of matches` > 0 ) %>%
    filter(surface %in% c('Hard','Clay'))
  
  my.tab <- bind_rows(my.tab,
                      temp.tab)
    
}

# use atp ranking in names e.g. Andy Murray (1)
my.tab$Name <- paste0(my.tab$Name, 
                      ' (', 
                      match(my.tab$Name, name.player),
                      ')' )

my.tab <- my.tab[complete.cases(my.tab), ]

Lets plot it!

library(ggplot2)

p <- ggplot(my.tab, aes(x = as.numeric(Year), y = `Percentage of wins`, color=surface))
p <- p + geom_point() + geom_smooth()
p <- p + facet_grid(surface~Name)
p <- p + ylim(c(0.25,1)) + xlim(c(2005,2016))
print(p)

We can see a few patterns. First, the winning consistency of top ten
player is higher in hard courts. For example, look at Murray and
Djokovic. It is easier for them to keep the same proportion of wins in
hard courts than it is in clay. My best guess is that hard courts are
more predictable and the ball is faster and bounces lower, making it
easier for top players to hit winners and other point-finishing strokes
such as drop shots. Be aware that this could also be due to the
different number of matches in each court. We could test the difference
of volatility statistically but I rather keep the analysis simple and
visual.

Also, not surprisingly, Nadal, the king of
clay
, presents the highest
winning percentage in clay, before 2012. His game strategy uses strength
and heavy top spin balls that are difficult to counter. See
here for a
more detailed analysis on why he is so dominant on clay. The bad side of
his strategy and type of play is that it is based heavily on his
muscular strength. As his age increases, it is difficult to maintain the
same level. This explains the increased proportion of defeats after
2012.

The effect of age is also perceived for Roger Federer. But, given that
his style of play is not based just on strength, I think that he can
still beat a lot of players. Just look at the his recent title win at
the 2017 Australian open.

Now, look how Dojokovic is consistent in clay and hard courts. He is
definitely the player with the best stats. I predict that, if he works
for it, he will have no problem in taking the #1 spot once again from
Murray.

Looking at Brazilian tennis players

In this section we will analyse the stats of Brazilian tennis players in
ATP rankings. The following code can be easily executed for any country
as the data is scraped from the ATP website. Just change the object
my.country for your case.

my.country <- 'BRA'
n.players <- 5
my.url <- paste0('http://www.atpworldtour.com/en/rankings/singles?rankDate=2017-02-06&rankRange=0-500&countryCode=',my.country)

atp.players <- read_html(my.url) %>%
  html_nodes(".player-cell") %>%
  html_text() %>%
  str_replace_all('\\n','') %>%
  str_trim()

name.player <- atp.players[1:n.players]
name.player

## [1] "Thiago Monteiro"     "Rogerio Dutra Silva" "Thomaz Bellucci"    
## [4] "Joao Souza"          "Andre Ghem"

The top five Brazilian players in ATP are Thiago Monteiro, Rogerio Dutra
Silva, Thomas Belluci, João Souza and Andre Ghem, in that order. We
should be proud of all of them. I can only imagine the dedication and
effort necessary to be a professional tennis player. You can see more
details about each
here.

At this point, it makes sense to create a function to analyse the
dataframe, as we will use it again later.

plot.winning.perc <- function(name.player, 
                              df.matches, 
                              min.games.per.year = 0,
                              min.year = 2005,
                              max.year = 2016){
  require(stringdist)
  require(stringr)
  require(dplyr)
  require(ggplot2)
  
  # create year column
  df.matches$Date <- as.Date(as.character(df.matches$tourney_date),'%Y%m%d')
  df.matches$Year <- format(df.matches$Date,'%Y')
  
  # get unique names from losers and winners
  unique.players <- unique(c(df.matches$winner_name, 
                             df.matches$loser_name ))
  
  # find the index of the closest string for each name in name.player
  idx <- amatch(name.player, unique.players, maxDist = 5)
  #print(data.frame(name.from.web = name.player,
   #                name.from.atp = unique.players[idx] ))
  
  my.tab <- tibble()
  for (i.player in name.player) {
    
    temp <- filter(df.matches, (winner_name==i.player)|(loser_name==i.player))
    temp$Name <- i.player
    temp.tab <- temp %>% 
      group_by(Name, Year,surface) %>%
      summarise(`Percentage of wins` = sum(winner_name == Name)/n(),
                `Number of matches` = n()) %>%
      filter(`Number of matches` > min.games.per.year ) %>%
      filter(surface %in% c('Hard','Clay'))
    
    my.tab <- bind_rows(my.tab,
                        temp.tab)
    
  }
  
  p <- ggplot(my.tab, aes(x = as.numeric(Year), y = `Percentage of wins`, color=surface))
  p <- p + geom_point() + geom_smooth()
  p <- p + facet_grid(surface~Name)
  p <- p + ylim(c(0.25,1)) + xlim(c(min.year,max.year))
  
  return(p)
  
}

Now, we use the previous function for the new names of players.

# plot winnign percentages
p <- plot.winning.perc(name.player, df.matches)
print(p)

From this list, the youngest Brazilian player, and with most potential
in my opinion, is Thiago Monteiro. He is 22 and has lots of time to
improve his game. In 2016 he beated Tsonga in a great match here in
Brazil
.

The records of Guga (Gustavo Kurten)

For last, lets look at stats of the the best and most cherished singles
player from Brazil, Gustavo Kuerten –
Guga
. Guga is
internationally famous and reached #1 in the ATP ranking in 2000. A
very charismatic player, with a fantastic backhand. It is not far
fetched to say that Guga brought Tennis to Brazil, making it a more
popular sport.

Guga had a hip injury back in 2001. He tried surgery, but it didn’t
worked out and he wasn’t able to maintain the higher level of tennis
that is necessary in the professional circuit. He retired in february of
2008. His story is very inspiring, growing up in the state of Santa
Catarina, in a time where tennis was not a popular sport in Brazil and
funding opportunities were scarce. His biography (in
portuguese)
is an
excellent read.

Let’s have a look in the data.

name.player <- 'Gustavo Kuerten'

p <- plot.winning.perc(name.player, df.matches, min.year = 1990)
print(p)

As expected, a winning journey until 2001, when losses start to
accumulate.

Conlusion

The data repository of Jeff Sackmann is great. I really only scratched
the surface with some simple graphical analysis. I think there is a lot
of space for modeling as well, that is, creating a model that can
predict the winning records of a player. I’m thinking about an
econometric model using historical statistics of players such as past
winning records, forehand/backhand winners, and so on. But I’m sure you
can use other approaches as well.

From the computational side, it would be interesting to create a CRAN
package to analyse this dataset. Just like in this post, all data could
be fetched from the ATP website and the repository of Jeff. One could
check the winning records of players with just a simple function call. I
just checked and there is nothing like it in CRAN. A shiny app would
also be interesting. I’ll keep it in mind for the future.

I hope you liked this post. Any comments or suggestions, please share in
the comments section.

To leave a comment for the author, please follow the link and comment on their blog: R and Finance.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)