
Several months ago, I used R to analyze professional soccer players based on their attributes from the video game FIFA 14. Now that FIFA 15 is upon us, let’s take a similar look.

FIFA 15 is a video game by EA Sports that mimics the experience of managing and playing for a soccer team. The game uses the likenesses and attributes of real players and this is part of the appeal. Although I rarely play video games, I am an avid soccer player and got curious about what could be learned by taking a closer look at the game-assigned player attributes.

www.futhead.com is a good source of FIFA 15 data. I scraped the HTML from the two hundred-plus pages of player attributes and then munged it into a useful table. Each player has an overall rating and six specific stats: pace, shooting, passing, dribbling, defending, and physicality (which replaces last year’s “heading”). Each player also has an assigned position; I collapsed the positions into a “type” category (Defense, Midfield, Forward). The modern game effectively has four lines of players, but the position names still carry the naming conventions from the days of three-line formations such as 4-4-2.

# Player Positions and Position Types

Below is a chart summarizing player rating by position. The chart is sorted by ascending median rating. There is a great deal of spread, but generally the center midfielders and fullbacks rate a bit lower than the wingers and wingbacks.
The collapsed view below corresponds with the above chart: there is a slight upward bias as the position becomes more offensive-minded, but nothing dramatic.
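For readers who want to reproduce a chart sorted this way, here is a minimal sketch in base R. The data frame `d` is made up for illustration (it is not the scraped table), and the original charts may well have been drawn with ggplot2 instead.

```r
# Made-up ratings for four positions; NOT the scraped futhead data.
set.seed(6)
pos <- c("CB", "CM", "ST", "LW")
d <- data.frame(Position = sample(pos, 400, replace = TRUE))
d$RAT <- round(rnorm(400, mean = 65 + match(d$Position, pos), sd = 7))

# Reorder the factor levels by median rating so the boxes plot in ascending order.
d$Position <- reorder(factor(d$Position), d$RAT, FUN = median)

# Median rating per position, in the same order as the plotted boxes.
med <- tapply(d$RAT, d$Position, median)
print(med)

boxplot(RAT ~ Position, data = d,
        xlab = "Position", ylab = "Overall rating")
```

`reorder()` only changes the level order of the factor, so the underlying rows are untouched.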

# Modeling Player Ratings

I built a linear model for each position “type” and found R-squared values ranging from 89% to 96%. Each model used all six attributes as predictors, with overall rating as the dependent variable. I speculate that player age/experience may account for the unexplained variance. Below is a look at the performance of each position type’s model. Both images visually support the models’ validity.
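As a rough sketch of the per-type modeling step, the snippet below fits one `lm` per position type and reads off R-squared. The column names match the scraping script's `attribs` table, but the values here are randomly generated and the rating rule is invented, so the printed numbers say nothing about the real game.

```r
# Synthetic stand-in for the scraped table; every number here is made up.
set.seed(1)
stats <- c("PAC", "SHO", "PAS", "DRI", "DEF", "PHY")
n <- 300
attribs <- data.frame(
  Type = rep(c("Forward", "Midfield", "Defense"), each = n / 3),
  matrix(sample(40:95, n * 6, replace = TRUE), ncol = 6,
         dimnames = list(NULL, stats))
)
# Invented rating rule (plain average plus noise), only so there is something to fit.
attribs$RAT <- rowMeans(attribs[stats]) + rnorm(n, sd = 2)

# One linear model per position type, all six attributes as predictors.
models <- lapply(split(attribs, attribs$Type), function(d) {
  lm(RAT ~ PAC + SHO + PAS + DRI + DEF + PHY, data = d)
})

# R-squared of each model.
r2 <- sapply(models, function(m) summary(m)$r.squared)
print(round(r2, 3))
```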

# Position Type Models

Each position type’s model naturally has a different mix of attribute weights. Below are charts showing these weights.
Forwards need to be good at shooting and this is expressed in the above graph. Interestingly, passing is actually negatively correlated with a forward’s rating. I can think of several great forwards I have played with who fit this category!
Midfield ratings are more balanced than those of defense, but dribbling and passing are the two most important skills for this position type.
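The attribute weights discussed above are simply the fitted coefficients of each model. A small sketch of how they can be collected side by side follows; the helper `make_type` and the weights fed into it are hypothetical, chosen only so that the three fitted models visibly differ.

```r
set.seed(2)
stats <- c("PAC", "SHO", "PAS", "DRI", "DEF", "PHY")

# Hypothetical helper: simulate players of one type from invented weights `w`.
make_type <- function(type, w, n = 200) {
  d <- as.data.frame(matrix(sample(40:95, n * 6, replace = TRUE), ncol = 6))
  names(d) <- stats
  d$RAT <- as.vector(as.matrix(d) %*% w) + rnorm(n, sd = 1)
  d$Type <- type
  d
}

attribs <- rbind(
  make_type("Forward",  c(0.15, 0.45, 0.00, 0.25, 0.00, 0.15)),
  make_type("Midfield", c(0.10, 0.10, 0.35, 0.35, 0.05, 0.05)),
  make_type("Defense",  c(0.10, 0.00, 0.05, 0.05, 0.60, 0.20))
)

# Fit one model per type and keep the six slope coefficients (drop the intercept).
weights <- sapply(split(attribs, attribs$Type), function(d) {
  coef(lm(RAT ~ PAC + SHO + PAS + DRI + DEF + PHY, data = d))[-1]
})
print(round(weights, 2))
```

The result is a 6 x 3 matrix (attributes in rows, position types in columns), which is a convenient shape for a grouped bar chart.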

# Mismatches

Each player’s position is assigned in the database. This raises the possibility that a player would theoretically rate higher in a different position. I found some evidence of this. Below is a table of the top three mismatches by position.
| Best Rating | Assigned Defense | Assigned Midfield | Assigned Forward |
|---|---|---|---|
| Forward | Leonel Vangioni, Denis Epstein, Mohammed Qasim | Neymar, Theo Walcott, Taison | |
| Midfield | Marcelo, Dani Alves, David Alaba | | Ricardo Alvarez, Sebastian Giovinco, Jérémy Ménez |
| Defense | | Philipp Lahm, Sami Khedira, Sergio Busquets | Darius Charles, Andrew Weideman, Arkadiusz Gajewski |

After comparing with last year’s version of this mismatch table, it looks like Neymar and Lahm are hard to pigeonhole:  last year, the model thought they should both be midfielders; this year, it puts them back to forward and defense, respectively. Theo Walcott, once healthy, will want to show these statistics to Arsene Wenger in a bid to move from winger to forward.
As someone who has watched countless matches, I venture that positions should be thought of in terms of where the player is expected to defend, not necessarily where he is expected to attack; it is common for wingers to cut inside and act like forwards once the opponent’s defenders are occupied by the true forwards. Likewise, the rise of offensive-minded wing backs can cause trouble for defenses that have to cope with a late runner joining the attack.
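One plausible way to find such mismatches is to score every player with every type's model and compare the best prediction with the prediction at the assigned type. The sketch below does this on made-up data; the weight matrix `W`, the player names, and the `Best` and `Gain` columns are all invented for illustration, and this is not necessarily how the table above was produced.

```r
set.seed(3)
stats <- c("PAC", "SHO", "PAS", "DRI", "DEF", "PHY")
types <- c("Forward", "Midfield", "Defense")
n <- 300

# Made-up players with random attributes; NOT the scraped data.
attribs <- data.frame(
  Name = paste("Player", seq_len(n)),
  Type = rep(types, each = n / 3),
  matrix(sample(40:95, n * 6, replace = TRUE), ncol = 6,
         dimnames = list(NULL, stats))
)

# Invented per-type weights (rows follow `stats`), used to simulate ratings.
W <- cbind(Forward  = c(0.15, 0.45, 0.00, 0.25, 0.00, 0.15),
           Midfield = c(0.10, 0.10, 0.35, 0.35, 0.05, 0.05),
           Defense  = c(0.10, 0.00, 0.05, 0.05, 0.60, 0.20))
own <- cbind(seq_len(n), match(attribs$Type, colnames(W)))
attribs$RAT <- (as.matrix(attribs[stats]) %*% W)[own] + rnorm(n)

# One model per assigned type, then predict every player under every model.
models <- lapply(split(attribs, attribs$Type), function(d) {
  lm(reformulate(stats, response = "RAT"), data = d)
})
pred <- sapply(models, predict, newdata = attribs)  # n x 3 matrix

# Best type per player, and the rating gained by moving there.
assigned <- pred[cbind(seq_len(n), match(attribs$Type, colnames(pred)))]
attribs$Best <- colnames(pred)[max.col(pred, ties.method = "first")]
attribs$Gain <- apply(pred, 1, max) - assigned

mismatch <- attribs[attribs$Best != attribs$Type, ]
print(head(mismatch[order(-mismatch$Gain), c("Name", "Type", "Best", "Gain")], 3))
```

Because the attributes here are purely random, far more players come out as mismatched than would with real data.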

# Model Outliers

The model does a good job of predicting a player’s overall rating, but there are a few exceptions.
| | At Assigned Position | At Best Position |
|---|---|---|
| Better than Predicted | Stefan Reinartz, Raoul Cedric Loé, José Cañas | Borja Fernández, Jesús Navas, Marco Rojas |
| Worse than Predicted | Murat Akin, Musharraf Al Ruwaili, Geir Ludvig Fevang | Francesco Totti, Ian Harte, Ruslan Adzhindzhal |
The players in the top row must have magic not captured in the regular six attributes; one might call this the X Factor. Unbelievably, the top left box did not change from last year! There must be more to these players than their underlying attributes suggest. Game developer friends, perhaps? They are all defensive midfielders, but I am not sure what other commonalities they have.
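Outliers of this kind fall out of the model residuals (actual rating minus predicted rating). Here is a minimal sketch on made-up data, assuming a single position type for brevity; the player names and the rating rule are invented.

```r
set.seed(4)
stats <- c("PAC", "SHO", "PAS", "DRI", "DEF", "PHY")
n <- 200

# Made-up players of a single type; NOT the scraped data.
d <- data.frame(
  Name = paste("Player", seq_len(n)),
  matrix(sample(40:95, n * 6, replace = TRUE), ncol = 6,
         dimnames = list(NULL, stats))
)
d$RAT <- rowMeans(d[stats]) + rnorm(n, sd = 2)  # invented rating rule

fit <- lm(RAT ~ PAC + SHO + PAS + DRI + DEF + PHY, data = d)

# Positive residual: rated better than the six attributes predict.
d$Resid <- residuals(fit)
better <- head(d[order(-d$Resid), c("Name", "RAT", "Resid")], 3)
worse  <- head(d[order(d$Resid),  c("Name", "RAT", "Resid")], 3)
print(better)
print(worse)
```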

# Clustering

There is some evidence that the player attributes lead to a few common clusters. Below is a chart showing the within-cluster sum of squares (WSS) for a given cluster count. This gives some visual confirmation that there are three or four general styles of player; past that, the WSS does not change as much.
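A WSS-versus-cluster-count curve like the one described can be sketched with `kmeans` from base R. The three cluster centers below are invented to mimic rough "forward", "midfield", and "defense" attribute profiles; they are an assumption for illustration, not values taken from the game.

```r
set.seed(5)
# Invented attribute profiles (PAC, SHO, PAS, DRI, DEF, PHY) for three styles.
centers <- rbind(c(85, 80, 60, 80, 35, 65),
                 c(70, 65, 80, 78, 65, 70),
                 c(60, 35, 55, 50, 80, 80))

# 100 simulated players around each profile.
x <- do.call(rbind, lapply(1:3, function(i) {
  matrix(rnorm(100 * 6, mean = rep(centers[i, ], each = 100), sd = 5), ncol = 6)
}))

max_k <- 8
wss <- numeric(max_k)
wss[1] <- sum(scale(x, scale = FALSE)^2)  # one cluster: total sum of squares
for (k in 2:max_k) {
  wss[k] <- kmeans(x, centers = k, nstart = 10)$tot.withinss
}

plot(1:max_k, wss, type = "b",
     xlab = "Number of clusters", ylab = "Within-cluster sum of squares")
```

The "elbow" is the cluster count after which adding another cluster stops reducing the WSS by much.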

# Player Tree

Finally, I clustered the top field players (overall rating at least 84) hierarchically. What developed was an insightful way to visualize how different players are stylistically related to each other.

Football / Soccer’s very own family tree. The most interesting leaves are the players positionally mixed in with other positions.  Philipp Lahm resurfaces as the midfielder who should be a defender, or vice-versa.  Maybe Germany should consider moving Mats Hummels to midfield and restoring Lahm to defense. Sounds crazy, but this same analysis pointed to moving Lahm to midfield even though many thought him to be the best right back in the world and, sure enough, Pep moved him to midfield. I’m sure Pep had been thinking about that move far earlier.

Santi Cazorla and Vincent Kompany look to be the furthest apart. That sounds spot on!
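A tree of this kind can be sketched with `dist` and `hclust`. The six players and their attribute values below are invented, and complete linkage on scaled attributes is an assumption on my part rather than a record of the settings used for the chart above.

```r
# Six invented players; attribute values are made up for illustration.
top <- data.frame(
  PAC = c(92, 90, 70, 68, 60, 62),
  SHO = c(88, 85, 72, 70, 40, 45),
  PAS = c(78, 75, 88, 86, 60, 62),
  DRI = c(90, 88, 86, 85, 55, 58),
  DEF = c(30, 35, 60, 62, 88, 86),
  PHY = c(65, 68, 66, 70, 85, 84),
  row.names = c("Forward A", "Forward B", "Midfielder A",
                "Midfielder B", "Defender A", "Defender B")
)

# Euclidean distances on standardized attributes, then hierarchical clustering.
d <- dist(scale(top))
tree <- hclust(d, method = "complete")
plot(tree, main = "Player tree (made-up data)", xlab = "", sub = "")

# Cut the tree into three groups of stylistically similar players.
groups <- cutree(tree, k = 3)
print(groups)

# The two players that are furthest apart in attribute space.
m <- as.matrix(d)
far <- which(m == max(m), arr.ind = TRUE)[1, ]
far_pair <- rownames(m)[far]
print(far_pair)
```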

Below is the code for screen-scraping. Be nice to others’ sites when scraping.

 ### Prepare ####
pkg <- c("cluster","fpc","digest","ggplot2","foreign","ggdendro","reshape2")
inst <- pkg %in% installed.packages()
if(length(pkg[!inst]) > 0) install.packages(pkg[!inst])
lapply(pkg,library,character.only=TRUE)
rm(inst,pkg)
set.seed(4444)

### Control panel for screen-scraping ####
sleep.time <- 0.01
pagecount <- 237
pc.ignore <- 0
names.page <- 48
names.lastpage <- 38
name.gaplines <- 71
namLine1 <- 1723
posLine1 <- 1723+5
RATLine1 <- 1723+10
PACLine1 <- 1723+15
SHOLine1 <- 1723+20
PASLine1 <- 1723+25
DRILine1 <- 1723+30
DEFLine1 <- 1723+34
PHYLine1 <- 1723+37

### Create custom urls to scrape ####
pageSeq <- seq(from=1,to=pagecount,by=1)
urls.df <- data.frame(pageSeq)
for(i in 1:length(urls.df$pageSeq)){
  urls.df$url[i] <- paste0("http://www.futhead.com/15/players/?page=",
                           urls.df$pageSeq[i], "&sort_direction=desc")
}

### Scrape html from custom urls ####
pages <- as.list("na")
for(j in 1:length(urls.df$pageSeq)){
  pages[[j]] <- urls.df$pageSeq[j]
}
for(j in 1:length(urls.df$pageSeq)){
  # Save each page to "<page number>.txt" and pause between requests
  download.file(urls.df$url[j],destfile=paste0(urls.df$pageSeq[j],".txt"))
  Sys.sleep(sleep.time)
}

### Identify which lines store player statistics ####
namSeq <- seq(from=namLine1,by=name.gaplines,length.out=names.page)
posSeq <- seq(from=posLine1,by=name.gaplines,length.out=names.page)
RATSeq <- seq(from=RATLine1,by=name.gaplines,length.out=names.page)
PACSeq <- seq(from=PACLine1,by=name.gaplines,length.out=names.page)
SHOSeq <- seq(from=SHOLine1,by=name.gaplines,length.out=names.page)
PASSeq <- seq(from=PASLine1,by=name.gaplines,length.out=names.page)
DRISeq <- seq(from=DRILine1,by=name.gaplines,length.out=names.page)
DEFSeq <- seq(from=DEFLine1,by=name.gaplines,length.out=names.page)
PHYSeq <- seq(from=PHYLine1,by=name.gaplines,length.out=names.page)

### Create empty dataframe for storing player stats
attribs <- data.frame(matrix(nrow=names.page*(pagecount-1)+names.lastpage,ncol=9))
colnames(attribs) <- c("Name","Position","RAT","PAC","SHO","PAS","DRI","DEF","PHY")

### Store lines from full pages containing player stats to dataframe ####
for(m in 1:(pagecount-1-pc.ignore)){
  page <- readLines(paste0(urls.df$pageSeq[m],".txt"))
  for(k in 1:names.page){
    n <- (m-1)*names.page+k   # row of attribs for player k on page m
    attribs$Name[n] <- page[namSeq[k]]
    attribs$Position[n] <- page[posSeq[k]]
    attribs$RAT[n] <- page[RATSeq[k]]
    attribs$PAC[n] <- page[PACSeq[k]]
    attribs$SHO[n] <- page[SHOSeq[k]]
    attribs$PAS[n] <- page[PASSeq[k]]
    attribs$DRI[n] <- page[DRISeq[k]]
    attribs$DEF[n] <- page[DEFSeq[k]]
    attribs$PHY[n] <- page[PHYSeq[k]]
  }
}

### Store lines from partial last page containing player stats to dataframe ####
pagelast <- readLines(paste0(urls.df$pageSeq[pagecount],".txt"))
for(p in 1:names.lastpage){
  q <- (pagecount-1)*names.page+p   # row of attribs for player p on the last page
  attribs$Name[q] <- pagelast[namSeq[p]]
  attribs$Position[q] <- pagelast[posSeq[p]]
  attribs$RAT[q] <- pagelast[RATSeq[p]]
  attribs$PAC[q] <- pagelast[PACSeq[p]]
  attribs$SHO[q] <- pagelast[SHOSeq[p]]
  attribs$PAS[q] <- pagelast[PASSeq[p]]
  attribs$DRI[q] <- pagelast[DRISeq[p]]
  attribs$DEF[q] <- pagelast[DEFSeq[p]]
  attribs$PHY[q] <- pagelast[PHYSeq[p]]
}

### Remove html wrapped around player stats in each line ####
# Inner quotes are escaped so each pattern is a valid R string;
# "^\\s+|\\s+$" trims leading and trailing whitespace.
attribs$Name <- gsub("^.*<span class=\"name\">","",attribs$Name)
attribs$Name <- gsub("</span>.*$","",attribs$Name)
attribs$Name <- gsub("^\\s+|\\s+$","",attribs$Name)
attribs$Position <- gsub("^ *","",attribs$Position)
attribs$Position <- gsub("^\\s+|\\s+$","",attribs$Position)
attribs$RAT <- gsub("^.*<span>","",attribs$RAT)
attribs$RAT <- gsub("</span>.*$","",attribs$RAT)
attribs$RAT <- gsub("^\\s+|\\s+$","",attribs$RAT)
attribs$PAC <- gsub("^.*<span class=\"attribute\">","",attribs$PAC)
attribs$PAC <- gsub("</span>.*$","",attribs$PAC)
attribs$PAC <- gsub("^\\s+|\\s+$","",attribs$PAC)
attribs$SHO <- gsub("^.*<span class=\"attribute\">","",attribs$SHO)
attribs$SHO <- gsub("</span>.*$","",attribs$SHO)
attribs$SHO <- gsub("^\\s+|\\s+$","",attribs$SHO)
attribs$PAS <- gsub("^.*<span class=\"attribute\">","",attribs$PAS)
attribs$PAS <- gsub("</span>.*$","",attribs$PAS)
attribs$PAS <- gsub("^\\s+|\\s+$","",attribs$PAS)
attribs$DRI <- gsub("^.*<span class=\"attribute\">","",attribs$DRI)
attribs$DRI <- gsub("</span>.*$","",attribs$DRI)
attribs$DRI <- gsub("^\\s+|\\s+$","",attribs$DRI)
attribs$DEF <- gsub("^.*<span class=\"attribute\">","",attribs$DEF)
attribs$DEF <- gsub("</span>.*$","",attribs$DEF)
attribs$DEF <- gsub("^\\s+|\\s+$","",attribs$DEF)
attribs$PHY <- gsub("^.*<span class=\"attribute\">","",attribs$PHY)
attribs$PHY <- gsub("</span>.*$","",attribs$PHY)
attribs$PHY <- gsub("^\\s+|\\s+$","",attribs$PHY)

### Remove statistics from duplicated players ####
attribs <- attribs[!(attribs$Name=="Cristiano Ronaldo"&attribs$RAT=="93"),]
attribs <- attribs[!duplicated(attribs$Name),]
rownames(attribs) <- NULL

### Clean up foreign characters in names ####
Encoding(attribs$Name) <- "UTF-8"
attribs$Name <- iconv(attribs$Name,"UTF-8","UTF-8",sub='')

### Create general position type ####
attribs$Type[attribs$Position %in% c("CF","LF","RF","ST")] <- "Forward"
attribs$Type[attribs$Position %in% c("LM","RM","CDM","CM","CAM","LW","RW")] <- "Midfield"
attribs$Type[attribs$Position %in% c("LB","RB","CB","LWB","RWB")] <- "Defense"
attribs$Type[attribs$Position %in% c("GK")] <- "Keeper"

### Change each stat to the appropriate data type ####
attribs$Name <- as.character(attribs$Name)
attribs$Position <- as.factor(attribs$Position)
attribs$RAT <- as.integer(attribs$RAT)
attribs$PAC <- as.integer(attribs$PAC)
attribs$SHO <- as.integer(attribs$SHO)
attribs$PAS <- as.integer(attribs$PAS)
attribs$DRI <- as.integer(attribs$DRI)
attribs$DEF <- as.integer(attribs$DEF)
attribs$PHY <- as.integer(attribs$PHY)
attribs$Type <- ordered(attribs$Type,levels=c("Forward","Midfield","Defense","Keeper"))