Dimensionality Reduction Methods Using FIFA 18 Player Data

November 18, 2017

(This article was first published on schochastics, and kindly contributed to R-bloggers)

In this post, I will introduce three different methods for dimensionality reduction of large datasets.

#used packages
library(tidyverse)  # for data wrangling
library(stringr)    # for string manipulations
library(ggbiplot)   # pca biplot with ggplot
library(Rtsne)      # implements the t-SNE algorithm
library(kohonen)    # implements self organizing maps
library(hrbrthemes) # nice themes for ggplot
library(GGally)     # to produce scatterplot matrices

Data

The data we use comes from Kaggle
and contains around 18,000 players of the game FIFA 18
with 75 features per player.

glimpse(fifa_tbl)
## Observations: 17,981
## Variables: 75
## $ X1                     0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12...
## $ Name                   "Cristiano Ronaldo", "L. Messi", "Neymar...
## $ Age                    32, 30, 25, 30, 31, 28, 26, 26, 27, 29, ...
## $ Photo                  "https://cdn.sofifa.org/48/18/players/20...
## $ Nationality            "Portugal", "Argentina", "Brazil", "Urug...
## $ Flag                   "https://cdn.sofifa.org/flags/38.png", "...
## $ Overall                94, 93, 92, 92, 92, 91, 90, 90, 90, 90, ...
## $ Potential              94, 93, 94, 92, 92, 91, 92, 91, 90, 90, ...
## $ Club                   "Real Madrid CF", "FC Barcelona", "Paris...
## $ `Club Logo`            "https://cdn.sofifa.org/24/18/teams/243....
## $ Value                  "€95.5M", "€105M", "€123M", "€97M", "€61...
## $ Wage                   "€565K", "€565K", "€280K", "€510K", "€23...
## $ Special                2228, 2154, 2100, 2291, 1493, 2143, 1458...
## $ Acceleration           89, 92, 94, 88, 58, 79, 57, 93, 60, 78, ...
## $ Aggression             63, 48, 56, 78, 29, 80, 38, 54, 60, 50, ...
## $ Agility                89, 90, 96, 86, 52, 78, 60, 93, 71, 75, ...
## $ Balance                63, 95, 82, 60, 35, 80, 43, 91, 69, 69, ...
## $ `Ball control`         93, 95, 95, 91, 48, 89, 42, 92, 89, 85, ...
## $ Composure              95, 96, 92, 83, 70, 87, 64, 87, 85, 86, ...
## $ Crossing               85, 77, 75, 77, 15, 62, 17, 80, 85, 68, ...
## $ Curve                  81, 89, 81, 86, 14, 77, 21, 82, 85, 74, ...
## $ Dribbling              91, 97, 96, 86, 30, 85, 18, 93, 79, 84, ...
## $ Finishing              94, 95, 89, 94, 13, 91, 13, 83, 76, 91, ...
## $ `Free kick accuracy`   76, 90, 84, 84, 11, 84, 19, 79, 84, 62, ...
## $ `GK diving`            7, 6, 9, 27, 91, 15, 90, 11, 10, 5, 11, ...
## $ `GK handling`          11, 11, 9, 25, 90, 6, 85, 12, 11, 12, 8,...
## $ `GK kicking`           15, 15, 15, 31, 95, 12, 87, 6, 13, 7, 9,...
## $ `GK positioning`       14, 14, 15, 33, 91, 8, 86, 8, 7, 5, 7, 1...
## $ `GK reflexes`          11, 8, 11, 37, 89, 10, 90, 8, 10, 10, 11...
## $ `Heading accuracy`     88, 71, 62, 77, 25, 85, 21, 57, 54, 86, ...
## $ Interceptions          29, 22, 36, 41, 30, 39, 30, 41, 85, 20, ...
## $ Jumping                95, 68, 61, 69, 78, 84, 67, 59, 32, 79, ...
## $ `Long passing`         77, 87, 75, 64, 59, 65, 51, 81, 93, 59, ...
## $ `Long shots`           92, 88, 77, 86, 16, 83, 12, 82, 90, 82, ...
## $ Marking                22, 13, 21, 30, 10, 25, 13, 25, 63, 12, ...
## $ Penalties              85, 74, 81, 85, 47, 81, 40, 86, 73, 70, ...
## $ Positioning            95, 93, 90, 92, 12, 91, 12, 85, 79, 92, ...
## $ Reactions              96, 95, 88, 93, 85, 91, 88, 85, 86, 88, ...
## $ `Short passing`        83, 88, 81, 83, 55, 83, 50, 86, 90, 75, ...
## $ `Shot power`           94, 85, 80, 87, 25, 88, 31, 79, 87, 88, ...
## $ `Sliding tackle`       23, 26, 33, 38, 11, 19, 13, 22, 69, 18, ...
## $ `Sprint speed`         91, 87, 90, 77, 61, 83, 58, 87, 52, 80, ...
## $ Stamina                92, 73, 78, 89, 44, 79, 40, 79, 77, 72, ...
## $ `Standing tackle`      31, 28, 24, 45, 10, 42, 21, 27, 82, 22, ...
## $ Strength               80, 59, 53, 80, 83, 84, 64, 65, 74, 85, ...
## $ Vision                 85, 90, 80, 84, 70, 78, 68, 86, 88, 70, ...
## $ Volleys                88, 85, 83, 88, 11, 87, 13, 79, 82, 88, ...
## $ CAM                    89, 92, 88, 87, NA, 84, NA, 88, 83, 81, ...
## $ CB                     53, 45, 46, 58, NA, 57, NA, 47, 72, 46, ...
## $ CDM                    62, 59, 59, 65, NA, 62, NA, 61, 82, 52, ...
## $ CF                     91, 92, 88, 88, NA, 87, NA, 87, 81, 84, ...
## $ CM                     82, 84, 79, 80, NA, 78, NA, 81, 87, 71, ...
## $ ID                     20801, 158023, 190871, 176580, 167495, 1...
## $ LAM                    89, 92, 88, 87, NA, 84, NA, 88, 83, 81, ...
## $ LB                     61, 57, 59, 64, NA, 58, NA, 59, 76, 51, ...
## $ LCB                    53, 45, 46, 58, NA, 57, NA, 47, 72, 46, ...
## $ LCM                    82, 84, 79, 80, NA, 78, NA, 81, 87, 71, ...
## $ LDM                    62, 59, 59, 65, NA, 62, NA, 61, 82, 52, ...
## $ LF                     91, 92, 88, 88, NA, 87, NA, 87, 81, 84, ...
## $ LM                     89, 90, 87, 85, NA, 82, NA, 87, 81, 79, ...
## $ LS                     92, 88, 84, 88, NA, 88, NA, 82, 77, 87, ...
## $ LW                     91, 91, 89, 87, NA, 84, NA, 88, 80, 82, ...
## $ LWB                    66, 62, 64, 68, NA, 61, NA, 64, 78, 55, ...
## $ `Preferred Positions`  "ST LW", "RW", "LW", "ST", "GK", "ST", "...
## $ RAM                    89, 92, 88, 87, NA, 84, NA, 88, 83, 81, ...
## $ RB                     61, 57, 59, 64, NA, 58, NA, 59, 76, 51, ...
## $ RCB                    53, 45, 46, 58, NA, 57, NA, 47, 72, 46, ...
## $ RCM                    82, 84, 79, 80, NA, 78, NA, 81, 87, 71, ...
## $ RDM                    62, 59, 59, 65, NA, 62, NA, 61, 82, 52, ...
## $ RF                     91, 92, 88, 88, NA, 87, NA, 87, 81, 84, ...
## $ RM                     89, 90, 87, 85, NA, 82, NA, 87, 81, 79, ...
## $ RS                     92, 88, 84, 88, NA, 88, NA, 82, 77, 87, ...
## $ RW                     91, 91, 89, 87, NA, 84, NA, 88, 80, 82, ...
## $ RWB                    66, 62, 64, 68, NA, 61, NA, 64, 78, 55, ...
## $ ST                     92, 88, 84, 88, NA, 88, NA, 82, 77, 87, ...

In this post, we are only interested in the attributes and the preferred position
of the players.

fifa_tbl <- fifa_tbl %>% 
  select(Acceleration:Volleys,`Preferred Positions`)

head(fifa_tbl$`Preferred Positions`)
## [1] "ST LW" "RW"    "LW"    "ST"    "GK"    "ST"

Notice that the Preferred Positions column may contain several positions. We
will simply split those entries and use the first given one. Additionally,
we create a column indicating if the position is in defense, midfield or offense.
Goalkeepers are treated separately.

fifa_tbl <- fifa_tbl %>% 
  mutate(position = word(`Preferred Positions`,1)) %>% 
  mutate(position = factor(position,
                           levels = c("GK","CB","RB","LB","RWB","LWB","CDM",
                                      "CM","RM","LM","CAM",
                                      "CF","RW","LW","ST")))

defense  <- c("CB","RB","LB","RWB","LWB")
midfield <- c("CDM","CM","RM","LM","CAM")
offense  <- c("CF","RW","LW","ST")
    
fifa_tbl <- fifa_tbl %>% 
  mutate(position2 = ifelse(position %in% defense,"D",
                     ifelse(position %in% midfield,"M",
                     ifelse(position %in% offense,"O","GK")))) %>% 
  mutate(position2 = factor(position2,levels = c("GK","D","M","O"))) %>% 
  select(-`Preferred Positions`)
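
To see what these two steps do, here is a minimal sketch on a handful of made-up entries. The vector `preferred` and the lookup table `group_of` are illustrative names and not part of the original analysis. The sketch swaps the nested `ifelse()` calls for a named lookup vector; with this variant, an unknown position ends up as `NA` instead of falling through to the goalkeeper group.

```r
library(stringr)

# Made-up examples in the same format as `Preferred Positions`
preferred <- c("ST LW", "RW", "GK", "CDM CM", "CB", "XX")

# word(x, 1) keeps the first whitespace-separated token
first_pos <- word(preferred, 1)

# Hypothetical lookup table: position -> coarse group
group_of <- c(GK = "GK",
              CB = "D", RB = "D", LB = "D", RWB = "D", LWB = "D",
              CDM = "M", CM = "M", RM = "M", LM = "M", CAM = "M",
              CF = "O", RW = "O", LW = "O", ST = "O")

# Positions missing from the lookup (here "XX") become NA
position2 <- factor(unname(group_of[first_pos]),
                    levels = c("GK", "D", "M", "O"))

data.frame(preferred, first_pos, position2)
```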

Why reduce the dimension?

There are many good reasons for reducing the dimension of data sets with a large number of
features, for example getting rid of correlated variables or speeding up computations.
In this post, we focus on the application in data exploration. Any data analysis
task (should) start with “getting a feeling for the data”.

Among the first things to do is to look at scatterplots of pairs of features.
Below you see the scatterplots for six features using the ggpairs() function
of the GGally package.

## ggpairs() doesn't understand whitespace in column names
names(fifa_tbl) <- str_replace_all(names(fifa_tbl)," ","_")

fifa_tbl %>% 
  select(Acceleration:Volleys,position2) %>% 
  ggpairs(columns = c(1:5,12),aes(col=position2))

names(fifa_tbl) <- str_replace_all(names(fifa_tbl),"_"," ")

These few scatterplots already reveal that goalkeepers have rather different
skills than other players. This certainly does not come as a surprise. There
also seem to be some patterns for defenders/midfielders and strikers, but to get a
better picture, we should be looking at all scatterplots. However, our complete
data set has 34 features, meaning that we would be forced to explore
\(34\cdot33/2=561\) scatterplots. Ain’t nobody got time for that!
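
The count is simply the number of unordered pairs of distinct features, which can be double-checked in R:

```r
n_features <- 34

# number of unordered pairs of distinct features: n * (n - 1) / 2
n_pairs <- choose(n_features, 2)
n_pairs
#> [1] 561
```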

That’s where dimension reduction techniques come into play. The ultimate goal is
to reduce our high dimensional data to 2 dimensions without (much) loss of information.
In this way, we can explore the entire data set with only one scatterplot.

Principal Component Analysis

If you know a bit of statistics and you are asked for dimension reduction methods,
I bet your immediate answer is “PCA!”. It is one of the oldest and certainly one of the most
commonly used methods to reduce high dimensional data sets. There are many excellent
introductory posts for PCA using R (1, 2, 3, 4),
so I will not spend much time on the fundamentals.

Very briefly, principal components are linear combinations of the original variables
which capture the variance in the data set. The first principal component captures
the largest share of the variability. The larger this share is,
the more information is contained in it. In geometric terms, it describes the line
which is closest to the data and thus minimizes the sum of squared distances between
the data points and the line. The second principal component captures as much of
the remaining variability as possible while being orthogonal to the first. The more of the total
variability is captured by these two components, the more information the
scatterplot of these two components contains.
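
The following minimal sketch illustrates this description on simulated toy data (not the FIFA data; the variables `x1`, `x2` and `x3` are made up). It assumes the standard formulation in which the principal directions are the eigenvectors of the covariance matrix and the eigenvalues are the variances captured along them, and compares that with the output of prcomp().

```r
set.seed(1)

# Toy data: two correlated variables plus one independent noise variable
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)
x3 <- rnorm(n)
X  <- scale(cbind(x1, x2, x3))  # center and scale, as prcomp(scale. = TRUE) does

# Principal directions = eigenvectors of the covariance matrix,
# eigenvalues = variance captured along each direction (sorted decreasingly)
eig <- eigen(cov(X))

pca <- prcomp(X)

# prcomp() reports standard deviations, hence the square
all.equal(eig$values, pca$sdev^2)

# Scores = data projected onto the principal directions;
# their variances reproduce the eigenvalues
scores <- X %*% eig$vectors
apply(scores, 2, var)
```

The components may differ in sign between the two approaches, since a direction and its negative describe the same line.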

We use the prcomp() function from the stats package for our PCA.

fifa_pca <- fifa_tbl %>% 
  select(Acceleration:Volleys) %>%
  prcomp(center=TRUE,scale.=TRUE)

Note that scaling your variables is usually very important in order not to overemphasize
features with large values. In our case, it is less critical, since
all attributes lie in the interval [0,100].
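
A minimal sketch of why scaling matters, using simulated data rather than the FIFA attributes (the variables `big` and `small` are purely illustrative). Without scaling, the first component is driven almost entirely by the variable with the larger variance.

```r
set.seed(1)

# Toy data: two unrelated variables on very different scales
n     <- 500
big   <- rnorm(n, sd = 100)
small <- rnorm(n, sd = 1)
toy   <- data.frame(big, small)

# Without scaling, PC1 is essentially the high-variance variable
pca_raw <- prcomp(toy, center = TRUE, scale. = FALSE)
abs(pca_raw$rotation[, "PC1"])

# With scaling, both variables get the same weight in PC1
pca_scaled <- prcomp(toy, center = TRUE, scale. = TRUE)
abs(pca_scaled$rotation[, "PC1"])
```

Attributes that share a common range can still differ in variance, so scaled and unscaled results need not coincide exactly even then.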

Besides the components, the function returns the standard deviation captured by
each component. We can thus compute the variances and check how much information
is captured by each component.

tibble(sd = fifa_pca$sdev, 
       pc = seq_along(sd)) %>% 
  mutate(cumvar = cumsum((sd^2)/sum(sd^2))) %>% 
  ggplot(aes(pc,cumvar))+geom_line()+geom_point()+
  labs(x="Principal Component",y="Cumulative Proportion of Variance Explained")+
  theme_ipsum_rc()

Roughly 70% of the variance is explained by the first two components. To explain
90% of the variance, we have to go up to the 8th component.
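
Such thresholds can also be read off programmatically instead of from the plot. A minimal sketch on the built-in USArrests data (a stand-in, not the FIFA data; `threshold` and `n_needed` are illustrative names):

```r
# Toy example on a built-in data set
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

var_explained <- pca$sdev^2 / sum(pca$sdev^2)  # proportion per component
cumvar        <- cumsum(var_explained)         # cumulative proportion

# Smallest number of components whose cumulative share reaches the threshold
threshold <- 0.9
n_needed  <- which(cumvar >= threshold)[1]

round(cumvar, 3)
n_needed
```

The same two lines applied to `fifa_pca$sdev` should give the numbers quoted above.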

To visualize the result of the PCA, we use the ggbiplot() function from the
ggbiplot package. As the name suggests, it creates a biplot in ggplot style.

ggbiplot(fifa_pca, obs.scale = 1, var.scale = 1, alpha = 0.01,
         groups = fifa_tbl$position2, varname.size = 4, varname.adjust = 2,
         ellipse = TRUE, circle = FALSE) +
  scale_color_discrete(name = '') +
  scale_x_continuous(limits = c(-20,20))+
  scale_y_continuous(limits = c(-10,10))+
  theme_ipsum_rc()+
  theme(legend.direction = 'horizontal', legend.position = 'bottom')

Since we have a lot of features, this unfortunately results in some overplotting.
However, general patterns are still visible. The first component clearly distinguishes
goalkeepers from the rest of the players, which we already expected from examining a
few scatterplots.

The second component reveals that the remaining positions are also separated
fairly well. While defenders and offensive players are well separated, the midfielders
lie somewhere in between. Just as on the pitch!

t-SNE

t-SNE stands for t-distributed stochastic neighbor embedding and
was introduced in 2008. A comprehensive introduction to the method can be found in
this or this post.

In non-technical terms, the idea is in fact quite simple. t-SNE is a non-linear
dimensionality reduction algorithm that seeks to find patterns in the data by
identifying clusters based on similarity of data points. Note that this does not
make it a clustering algorithm! It still is “only” a dimensionality reduction algorithm.
Nonetheless, the results can be quite impressive and in many cases are superior to
a PCA. Check this post on
recognizing hand drawn digits or this
post if you are into Pokemon.

The t-SNE algorithm is implemented in the package Rtsne. To run it, several
hyperparameters have to be set. The two most important ones are perplexity
and max_iter. While the latter should be self-explanatory, the former is not.
Perplexity roughly indicates how to balance local and global aspects of the data.
The parameter is an estimate for the number of close neighbors for each point.
The original authors state that “The performance of SNE is fairly robust to
changes in the perplexity, and typical values are between 5 and 50.”
But other sources say that the output
can be heavily influenced by the choice of the perplexity parameter. Here, we
choose a perplexity of 50 and 1000 iterations.

set.seed(12)

fifa_tsne <- fifa_tbl %>%
  select(Acceleration:Volleys) %>%
  Rtsne(perplexity = 50, max_iter = 1000, check_duplicates = FALSE)

tibble(x = fifa_tsne$Y[,1],
       y = fifa_tsne$Y[,2],
       position = fifa_tbl$position2) %>%
  ggplot(aes(x,y)) + geom_point(aes(col = position), alpha = 0.25) +
  theme_ipsum_rc()+
  theme(legend.position = "bottom")+
  labs(x="",y="")

The plot again shows that goalkeepers are easily distinguished from other players.
Also, the other positions are somewhat clustered together.

Overall we can observe that, for our data set, the t-SNE algorithm does not significantly
extend the results we have already obtained from the PCA.
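
Since the choice of perplexity can influence the output, one way to probe it is to rerun the embedding for a few values and compare the plots. The sketch below does this on the small built-in iris data (a stand-in for the FIFA data, chosen so that it runs in seconds); the perplexity values and the reduced `max_iter` are arbitrary choices for illustration.

```r
library(Rtsne)

# Toy data: the four numeric iris columns, duplicate rows removed
X <- unique(as.matrix(iris[, 1:4]))

# Run t-SNE for several perplexity values, each time from the same seed
perplexities <- c(5, 15, 30)
embeddings <- lapply(perplexities, function(p) {
  set.seed(12)
  Rtsne(X, perplexity = p, max_iter = 500)$Y
})
names(embeddings) <- paste0("perplexity_", perplexities)

# Each embedding is an n x 2 matrix that can be plotted and compared
sapply(embeddings, dim)
```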

Pros and Cons

As said before, there are some amazing examples for the effectiveness of the t-SNE
algorithm. However, several things have to be kept in mind so as not to glorify the algorithm
too much.

  • The algorithm is computationally rather complex. While a PCA can be done in under a second
    even for very large data sets, the t-SNE algorithm will take considerably longer. For the
    FIFA 18 player data, the algorithm took several minutes to run.
  • PCA is deterministic, meaning that each run on the same data gives the same results.
    t-SNE is not. Different runs with the same hyperparameters may produce different results
    unless the random seed is fixed.
  • The algorithm is so good that it may even find patterns in random noise, which, of
    course, is not really desirable. It is therefore advisable to run the algorithm
    several times with different sets of hyperparameters before deciding if a pattern
    exists in the data.
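
The difference between the deterministic PCA and the stochastic t-SNE can be checked directly. The sketch below again uses the small iris data as a stand-in; `run_tsne()` is a hypothetical helper defined only for this illustration. It relies on Rtsne drawing from R's random number generator, so fixing the seed with set.seed() is expected to reproduce a run.

```r
library(Rtsne)

X <- unique(as.matrix(iris[, 1:4]))  # toy data, not the FIFA data

# Hypothetical helper: one t-SNE run started from a given seed
run_tsne <- function(seed) {
  set.seed(seed)
  Rtsne(X, perplexity = 30, max_iter = 500)$Y
}

same_a <- run_tsne(1)
same_b <- run_tsne(1)
other  <- run_tsne(2)

# Same seed: the embedding is expected to be reproduced
isTRUE(all.equal(same_a, same_b))

# Different seed: the coordinates generally differ
isTRUE(all.equal(same_a, other))

# PCA has no random component, so repeated runs agree
isTRUE(all.equal(prcomp(X)$x, prcomp(X)$x))
```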

A big drawback is the interpretation of the result. While we may end up with well
separated groups of players, we do not know which features are the most decisive.

Self-Organizing Maps

A self-organizing map (SOM)
is an artificial neural network that is trained using unsupervised learning.
Since neural networks are super popular right now, it is only natural to look at
one that is used for dimensionality reduction.

A SOM is made up of multiple “nodes”, where each node has the following properties.

  • A fixed position on the SOM grid.
  • A weight vector of the same dimension as the input space.
  • Associated data points. Each data point is mapped to a node on the map grid.

The key feature of SOMs is that the topological features of the input data are
preserved on the map. In our case, players with similar attributes are placed
close together on the grid. Some more in-depth posts on SOMs can be found here
and here.
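
To make the node properties more concrete, here is a minimal sketch of a single training step of a textbook-style SOM in base R. It is illustrative only and not how the kohonen package is implemented; the function `som_step()`, the rectangular 3×3 grid, the Gaussian neighbourhood and the values of `alpha` and `radius` are all assumptions made for this example.

```r
set.seed(1)

grid    <- expand.grid(x = 1:3, y = 1:3)      # fixed node positions on the grid
n_dim   <- 4                                  # dimension of the input space
weights <- matrix(runif(nrow(grid) * n_dim),  # one weight vector (row) per node
                  nrow = nrow(grid))

som_step <- function(weights, grid, x, alpha = 0.5, radius = 1) {
  # 1. best matching unit (BMU): node whose weight vector is closest to x
  delta <- sweep(weights, 2, x)               # weights - x, row by row
  bmu   <- which.min(rowSums(delta^2))

  # 2. neighbourhood: Gaussian in the squared grid distance to the BMU
  d_grid <- rowSums(sweep(as.matrix(grid), 2, unlist(grid[bmu, ]))^2)
  h      <- exp(-d_grid / (2 * radius^2))

  # 3. pull the BMU and its grid neighbours towards x
  list(weights = weights - alpha * h * delta, bmu = bmu)
}

x   <- runif(n_dim)   # a single made-up data point
fit <- som_step(weights, grid, x)
fit$bmu               # the node this data point is mapped to
```

Because neighbouring nodes are updated together, nearby nodes end up with similar weight vectors, which is what keeps similar data points close on the map.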

We use the som() function from the kohonen package to produce a SOM for the
FIFA 18 player data. We choose a 20×20 hexagonal grid and 300 iteration steps.

fifa_som <- fifa_tbl %>% 
  select(Acceleration:Volleys) %>%
  scale() %>%
  som(grid = somgrid(20, 20, "hexagonal"), rlen = 300)

The package comes with a variety of visualization options for the resulting SOM.
First, we can check if the training process was “successful”.

plot(fifa_som, type="changes")

Ideally, we reach a plateau at one point. If the curve is still decreasing, it might
be advisable to increase the rlen parameter.

We can also visualize how many players are contained in each grid node.

plot(fifa_som, type="count", shape = "straight")

If you notice that the counts are very imbalanced, you might want to consider increasing
your grid size.
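
Both diagnostics can also be inspected numerically. The sketch below fits a small SOM on the built-in iris data (a stand-in for the FIFA data; the 4×4 grid and `rlen = 100` are arbitrary choices) and reads the training progress and the node counts from the fitted object.

```r
library(kohonen)

set.seed(1)

# Toy data: scaled numeric iris columns on a small grid
X       <- scale(as.matrix(iris[, 1:4]))
toy_som <- som(X, grid = somgrid(4, 4, "hexagonal"), rlen = 100)

# Training progress per iteration, the quantity behind plot(type = "changes")
head(toy_som$changes)
tail(toy_som$changes)

# Number of data points mapped to each of the 16 nodes,
# the quantity behind the count plot (empty nodes show up as 0)
node_counts <- table(factor(toy_som$unit.classif, levels = 1:16))
node_counts
```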

Finally, we plot the distribution of players over the grid nodes and the weight
vectors associated with each grid node.

par(mfrow=c(1,2))
plot(fifa_som, type="mapping", pch=20,
     col = c("#F8766D","#7CAE00","#00B0B5","#C77CFF")[as.integer(fifa_tbl$position2)],
     shape = "straight")
plot(fifa_som, type="codes",shape="straight")

Again, we notice a clear cut between goalkeepers and other players. Also the other
positions are fairly well separated.

Pros and Cons

I must admit that I don’t know too much about SOMs yet. But so far they seem pretty
neat. I don’t really see any major drawback, except that they may require a bit
more computation time than a simple PCA, but they seem much faster than t-SNE.
The results are also easier to interpret than the ones from t-SNE, since we get
a weight vector for each grid node.

Summary

Exploring your data is a very important first step for any kind of data analysis task.
Dimensionality reduction methods can help by producing two-dimensional representations,
which, in the best case, display meaningful patterns in the higher dimensional data.

For the FIFA 18 player data, we learned that it is possible to distinguish positions
of players by player attributes. This knowledge will be used in a later post to
predict player positions.
