SOMs and ggplot

January 23, 2018
By

(This article was first published on schochastics, and kindly contributed to R-bloggers)

#used packages
library(tidyverse)  # for data wrangling
library(stringr)    # for string manipulations
library(kohonen)    # implements self organizing maps
library(ggforce)    # for additional ggplot features

I introduced self-organizing maps (SOM) in a previous post and since then
I am using the kohonen package on a daily basis. However, I prefer the ggplot style plotting,
so I reimplemented the SOM plots of the package with the ggplot2 package. But don’t get
me wrong, the kohonen package does an amazing job in visualizing SOMs. I really just
prefer ggplot. Also, I needed some customized features. For illustration purposes, I will
use the FIFA 18 player data again.
For the data preparation steps, refer to that post.

We will, however, only use a random sample of the 75,000 players, for computational convenience.

r_sample <- sample(1:nrow(fifa_tbl),2000)
fifa_tbl <- fifa_tbl[r_sample,]
glimpse(fifa_tbl)
## Observations: 2,000
## Variables: 36
## $ Acceleration          52, 53, 30, 76, 46, 70, 47, 29, 45, 84, 7...
## $ Aggression            49, 79, 27, 62, 24, 75, 80, 76, 62, 63, 4...
## $ Agility               47, 58, 25, 77, 45, 63, 58, 51, 52, 70, 8...
## $ Balance               65, 50, 45, 80, 56, 51, 55, 63, 37, 70, 8...
## $ `Ball control`        31, 57, 16, 77, 18, 62, 64, 51, 57, 67, 6...
## $ Composure             44, 46, 32, 82, 23, 58, 68, 61, 48, 59, 6...
## $ Crossing              28, 45, 11, 76, 13, 46, 57, 60, 31, 74, 5...
## $ Curve                 26, 32, 13, 75, 19, 46, 57, 68, 30, 51, 5...
## $ Dribbling             25, 45, 9, 77, 17, 58, 57, 39, 54, 77, 66...
## $ Finishing             20, 33, 9, 67, 20, 32, 47, 46, 68, 68, 57...
## $ `Free kick accuracy`  23, 33, 12, 67, 13, 55, 64, 47, 24, 38, 4...
## $ `GK diving`           12, 9, 53, 14, 66, 13, 15, 12, 5, 13, 12,...
## $ `GK handling`         8, 10, 51, 8, 60, 5, 7, 11, 10, 12, 14, 1...
## $ `GK kicking`          12, 11, 54, 9, 61, 12, 9, 9, 6, 15, 10, 1...
## $ `GK positioning`      14, 16, 50, 13, 65, 12, 8, 6, 10, 5, 8, 1...
## $ `GK reflexes`         11, 6, 54, 15, 71, 14, 6, 7, 13, 7, 6, 1,...
## $ `Heading accuracy`    49, 68, 11, 48, 21, 52, 66, 73, 74, 53, 6...
## $ Interceptions         50, 67, 10, 60, 20, 69, 65, 74, 22, 67, 1...
## $ Jumping               73, 70, 57, 64, 62, 63, 72, 67, 70, 53, 6...
## $ `Long passing`        25, 53, 18, 68, 19, 66, 63, 64, 31, 52, 5...
## $ `Long shots`          19, 43, 9, 68, 19, 47, 69, 57, 57, 57, 62...
## $ Marking               47, 66, 9, 58, 15, 57, 60, 70, 21, 58, 38...
## $ Penalties             40, 48, 15, 70, 25, 48, 54, 47, 55, 57, 6...
## $ Positioning           26, 42, 5, 74, 14, 60, 59, 35, 59, 71, 58...
## $ Reactions             42, 66, 49, 77, 53, 58, 68, 66, 51, 64, 5...
## $ `Short passing`       31, 63, 18, 77, 29, 67, 67, 66, 45, 68, 6...
## $ `Shot power`          41, 50, 22, 76, 23, 49, 70, 65, 61, 63, 6...
## $ `Sliding tackle`      56, 68, 14, 61, 19, 64, 59, 64, 17, 56, 2...
## $ `Sprint speed`        54, 54, 26, 76, 48, 65, 43, 30, 64, 91, 7...
## $ Stamina               64, 66, 21, 75, 33, 77, 61, 64, 61, 84, 6...
## $ `Standing tackle`     61, 67, 12, 59, 17, 68, 63, 70, 20, 60, 3...
## $ Strength              55, 75, 52, 72, 70, 72, 73, 75, 83, 67, 5...
## $ Vision                34, 36, 34, 75, 13, 64, 58, 41, 43, 71, 5...
## $ Volleys               25, 30, 7, 76, 18, 48, 37, 32, 56, 46, 58...
## $ position              CB, CB, GK, RB, GK, CM, CB, CB, ST, RB, ...
## $ position2             D, D, GK, D, GK, M, D, D, O, D, O, D, GK...

SOM data structure

We start by computing the SOM for the random sample.

fifa_som <- fifa_tbl %>%
  select(Acceleration:Volleys) %>%
  scale() %>%
  som(grid = somgrid(15, 15, "hexagonal","gaussian"), rlen = 800)

To use this output with ggplot now, we have to dig into the output fifa_som.

glimpse(fifa_som)
## List of 13
##  $ data            :List of 1
##   ..$ : num [1:2000, 1:34] -0.905 -0.835 -2.457 0.788 -1.329 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. ..- attr(*, "scaled:center")= Named num [1:34] 64.8 56.7 63.5 63.8 59.1 ...
##   .. .. ..- attr(*, "names")= chr [1:34] "Acceleration" "Aggression" "Agility" "Balance" ...
##   .. ..- attr(*, "scaled:scale")= Named num [1:34] 14.2 17.2 14.3 13.9 16.1 ...
##   .. .. ..- attr(*, "names")= chr [1:34] "Acceleration" "Aggression" "Agility" "Balance" ...
##  $ unit.classif    : num [1:2000] 137 47 196 75 183 68 17 31 224 7 ...
##  $ distances       : num [1:2000] 3.11 3.31 1.55 1.82 3.19 ...
##  $ grid            :List of 6
##   ..$ pts              : num [1:225, 1:2] 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5 10.5 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ xdim             : num 15
##   ..$ ydim             : num 15
##   ..$ topo             : chr "hexagonal"
##   ..$ neighbourhood.fct: Factor w/ 2 levels "bubble","gaussian": 2
##   ..$ toroidal         : logi FALSE
##   ..- attr(*, "class")= chr "somgrid"
##  $ codes           :List of 1
##   ..$ : num [1:225, 1:34] -0.6751 -0.7856 -0.1953 -0.0218 0.3022 ...
##   .. ..- attr(*, "dimnames")=List of 2
##  $ changes         : num [1:800, 1] 0.018 0.0181 0.0179 0.018 0.0181 ...
##  $ alpha           : num [1:2] 0.05 0.01
##  $ radius          : Named num [1:2] 8.89 0
##   ..- attr(*, "names")= chr [1:2] "66.66667%" ""
##  $ user.weights    : num 1
##  $ distance.weights: num 1
##  $ whatmap         : int 1
##  $ maxNA.fraction  : int 0
##  $ dist.fcts       : chr "sumofsquares"
##  - attr(*, "class")= chr "kohonen"

This might look a bit overwhelming on first glance so let’s take it step by step.
As a basic layer, we need the information of grid nodes. These are stored fifa_som[[4]].

som_grid <- fifa_som[[4]]$pts %>%
  as_tibble %>% 
  mutate(id=row_number())
som_grid
## # A tibble: 225 x 3
##        x     y    id
##      
##  1  1.50 0.866     1
##  2  2.50 0.866     2
##  3  3.50 0.866     3
##  4  4.50 0.866     4
##  5  5.50 0.866     5
##  6  6.50 0.866     6
##  7  7.50 0.866     7
##  8  8.50 0.866     8
##  9  9.50 0.866     9
## 10 10.5  0.866    10
## # ... with 215 more rows

The information in which node a player ended up is stored in fifa_som[[2]]. Its
distance to the weight vector in fifa_som[[3]].

som_pts <- tibble(id = fifa_som[[2]],
                  dist = fifa_som[[3]],
                  type = fifa_tbl$position2)

To get the coordinates for the players, we join som_pts and som_grid.

som_pts <- som_pts %>% left_join(som_grid,by="id")

Mapping plot

The code below produces the mapping plot with the kohonen package. If you forgot
what that was, take a peak here.

#GK=red,D=green,M=blue,O=purple
plot(fifa_som, type="mapping", pch=20,
     col = c("#F8766D","#7CAE00","#00B0B5","#C77CFF")[as.integer(fifa_tbl$position2)],
     shape = "round")

Lets start by ggplotting the basic “circle” layer. Unfortunately, there is no easy way to draw circles with the basic
ggplot2. Luckily there is ggforce, a package with really nice extras for ggplot, including
also geom_circle.

p <- som_grid %>% 
  ggplot(aes(x0=x,y0=y))+
  geom_circle(aes(r=0.5))+
  theme(panel.background = element_blank(),
        axis.ticks = element_blank(),
        panel.grid = element_blank(),
        axis.text = element_blank(),
        axis.title = element_blank(),
        legend.position = "bottom")
p

There we go! Now we can continue putting the players in the right node.

p+geom_jitter(data=som_pts,aes(x,y,col=type),alpha=0.5)+
  scale_color_manual(values=c("#F8766D","#7CAE00","#00B0B5","#C77CFF"),name="Position")

Note that I used geom_jitter because otherwise all players would have been mapped
to the center of the grid point.

Except for the legend, the plot looks fairly like the standard plot. But what if you want to
assess the fraction of each type per grid node? Basically, turning the grid nodes into pie chart.
(Something, something pie charts are awful). For this, we need to count the occurrences
of positions in each grid node. Then, we can use a geom_arc_bar layer to plot the fractions.

pts <- som_pts %>% 
        group_by(id,x,y) %>% 
        count(type) %>% 
        ungroup() %>%
        dplyr::filter(!is.na(type))
    
    
p + 
  geom_arc_bar(data=pts,
               aes(x0 = x, y0 = y, r0 = 0, r = 0.5, amount = n, 
                   fill = type),
               stat = 'pie')+
  scale_fill_manual(values = c("#F8766D","#7CAE00","#00B0B5","#C77CFF"), 
                    name = "Position")

I think you can see more easily how homogeneous the grid nodes are with this plot.

Other SOM ggplots

The kohonen package contains many more visualizations, which can be transferred to
ggplot much in the same way as the mapping plot. Let’s look at “dist.neighbours”,
that is how close neighbouring grid points are. For this, we need to first calculate
the average distances.

ndist <- unit.distances(fifa_som$grid)
cddist <- as.matrix(object.distances(fifa_som, type = "codes"))
cddist[abs(ndist - 1) > .001] <- NA
neigh.dists <- colMeans(cddist, na.rm = TRUE)

som_grid <- som_grid %>% mutate(dist=neigh.dists)

This very much the same code as used in the package.

p <- som_grid %>% 
     ggplot(aes(x0=x,y0=y))+ggforce::geom_circle(aes(r=0.5,fill=dist))+
     scale_fill_gradient(low="white",high="gray25",name="Distance")+
     theme(panel.background = element_blank(),
           axis.ticks = element_blank(),
           panel.grid = element_blank(),
           axis.text = element_blank(),
           axis.title = element_blank(),
           legend.position = "bottom")
p

Below is the standard plot.

plot(fifa_som, type="dist.neighbours",shape = "round",palette.name=gray.colors)

Again, quite the same, except for the inverted color palette.

Thanks to ggplots layer system, we can combine this plot with the mapping plot.

p+geom_jitter(data = som_pts,aes(x,y,col=type),alpha = 1)+
  scale_color_manual(values = c("#F8766D","#7CAE00","#00B0B5","#C77CFF"),name="Position")

As I said, there are some more plot features in the kohonen, but I think you get the gist now
on how to turn the base R plots into ggplot and can built in your on features

To leave a comment for the author, please follow the link and comment on their blog: schochastics.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)