Clustering NHL Goalies

[This article was first published on Dan Garmat's Blog -- R, and kindly contributed to R-bloggers].

This has been a great Stanley Cup playoffs for Washington Capitals fans such as myself. With so much breath holding, I’ve paid more attention this year than in recent years. As a former goalie, I grew curious: who are these goalies who are so much better than me that they beat me to the NHL? Who’s a hero, and who’s maybe not a keeper?

What better way to understand how they shake out than clustering their regular season statistics? This is also an opportunity to work with tibbleColumns by Hoyt Emerson, a new package that adds some intriguing functionality to dplyr, and dendextend by Tal Galili, which adds options to hierarchical clustering diagrams. The best data I found came from Rob Vollman at http://www.hockeyabstract.com/testimonials.

Bottom line up front: unsupervised learning here taught me more about my data set and less about the world it represents. Where it did teach about the world of NHL goalies, it showed that this guy stands out:

Frederik Andersen

By David from Washington, DC – _25A9839, CC BY 2.0

1. Loading data

How does hockeyabstract’s data look? Let’s load some packages we’ll be using in this analysis and take an initial glimpse.

<span class="n">library</span><span class="p">(</span><span class="n">tidyverse</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">readxl</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">dbscan</span><span class="p">)</span><span class="w">
</span><span class="c1">#devtools::install_github("nhemerson/tibbleColumns") # requires new-ish version of R</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">tibbleColumns</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">GGally</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">broom</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">naniar</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">ggfortify</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">RColorBrewer</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">scales</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">dendextend</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">viridis</span><span class="p">)</span><span class="w">

</span><span class="n">goalies</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">read_excel</span><span class="p">(</span><span class="s1">'NHL Goalies 2017-18.xls'</span><span class="p">,</span><span class="w"> </span><span class="n">sheet</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Goalies'</span><span class="p">)</span><span class="w">
</span><span class="n">glimpse</span><span class="p">(</span><span class="n">goalies</span><span class="p">)</span><span class="w">

</span><span class="c1">#Observations: 95</span><span class="w">
</span><span class="c1">#Variables: 132</span><span class="w">
</span><span class="c1">#$ X__1         <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, ...</span><span class="w">
</span><span class="c1">#$ DOB          <dttm> 1998-09-20, 1982-09-18, 1990-03-02, 1988-01-05, 1988-02-11, 1989-09-16, 1988-09-20...</span><span class="w">
</span><span class="c1">#$ `Birth City` <chr> "Lantzville", "Banská Bystrica", "Barrie", "Norrtälje", "Surrey", "Lloydminster", "...</span><span class="w">
</span><span class="c1">#$ `S/P`        <chr> "BC", NA, "ON", NA, "BC", "SK", NA, "VA", "ON", "MI", "MA", NA, "MN", "QC", "NB", "...</span><span class="w">
</span><span class="c1">#$ Cntry        <chr> "CAN", "SVK", "CAN", "SWE", "CAN", "CAN", "RUS", "USA", "CAN", "USA", "USA", "SWE",...</span><span class="w">
</span><span class="c1">#$ Nat          <chr> "CAN", "SVK", "CAN", "SWE", "CAN", "CAN", "RUS", "USA", "CAN", "USA", "USA", "SWE",...</span><span class="w">
</span><span class="c1">#$ Ht           <dbl> 73, 73, 75, 76, 74, 74, 74, 78, 78, 75, 74, 78, 73, 74, 73, 76, 75, 74, 76, 73, 73,...</span><span class="w">
</span><span class="c1">#$ Wt           <dbl> 189, 196, 202, 187, 215, 211, 182, 232, 220, 173, 195, 229, 182, 180, 200, 200, 195...</span><span class="w">
</span><span class="c1">#$ Sh           <chr> "L", "L", "R", "L", "L", "L", "L", "L", "L", "L", "L", "L", "R", "L", "L", "L", "L"...</span><span class="w">
</span>

At 95 x 132, this data set has fewer players than variables! The first column is just a row number, so remove it. Then take a look at the distribution of games played.

1.1. Distribution of Games Played (GP)

<span class="n">goalies</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">goalies</span><span class="p">[</span><span class="w"> </span><span class="p">,</span><span class="w"> </span><span class="m">-1</span><span class="p">]</span><span class="w">

</span><span class="n">ggplot</span><span class="p">(</span><span class="n">goalies</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">GP</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> 
  </span><span class="n">geom_histogram</span><span class="p">(</span><span class="n">bins</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">)</span><span class="w">
</span>

[Figure: histogram of games played (GP)]

Games played (GP) is fairly uniformly distributed, with somewhat fewer players as GP goes up.

Defining a starter as a goalie who plays 35+ games in an 82-game regular season, we see 39 such starters, more than the 31 NHL teams. Some teams have 2 starters. Are they duplicates or shared starters?

<span class="n">goalies</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">filter</span><span class="p">(</span><span class="n">GP</span><span class="w"> </span><span class="o">>=</span><span class="w"> </span><span class="m">35</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">tbl_out</span><span class="p">(</span><span class="s1">'starters'</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">   </span><span class="c1"># tbl_out saves a data frame and allows a pipe to continue</span><span class="w">
  </span><span class="n">count</span><span class="p">()</span><span class="w">
</span><span class="c1">## A tibble: 1 x 1</span><span class="w">
</span><span class="c1">#      n</span><span class="w">
</span><span class="c1">#  <int></span><span class="w">
</span><span class="c1">#1    39</span><span class="w">
  
</span><span class="n">starters</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">group_by</span><span class="p">(</span><span class="n">`Team(s)`</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">count</span><span class="p">()</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">arrange</span><span class="p">(</span><span class="n">desc</span><span class="p">(</span><span class="n">n</span><span class="p">))</span><span class="w">
</span><span class="c1">## A tibble: 32 x 2</span><span class="w">
</span><span class="c1">## Groups:   Team(s) [32]</span><span class="w">
</span><span class="c1">#   `Team(s)`     n</span><span class="w">
</span><span class="c1">#   <chr>     <int></span><span class="w">
</span><span class="c1"># 1 BUF           2</span><span class="w">
</span><span class="c1"># 2 CAR           2</span><span class="w">
</span><span class="c1"># 3 COL           2</span><span class="w">
</span><span class="c1"># 4 DAL           2</span><span class="w">
</span><span class="c1"># 5 FLA           2</span><span class="w">
</span><span class="c1"># 6 NJD           2</span><span class="w">
</span><span class="c1"># 7 WSH           2</span><span class="w">
</span><span class="c1"># 8 ANA           1</span><span class="w">
</span><span class="c1"># 9 ARI           1</span><span class="w">
</span><span class="c1">#10 BOS           1</span><span class="w">
</span><span class="c1">## ... with 22 more rows</span><span class="w">
  
</span><span class="n">starters</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">group_by</span><span class="p">(</span><span class="n">`Team(s)`</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">count</span><span class="p">()</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">filter</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">ungroup</span><span class="p">()</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">left_join</span><span class="p">(</span><span class="n">starters</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Team(s)'</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">select</span><span class="p">(</span><span class="n">`Team(s)`</span><span class="p">,</span><span class="w"> </span><span class="n">`First Name`</span><span class="p">,</span><span class="w"> </span><span class="n">`Last Name`</span><span class="p">)</span><span class="w">
</span><span class="c1">## A tibble: 14 x 3</span><span class="w">
</span><span class="c1">#   `Team(s)` `First Name` `Last Name`</span><span class="w">
</span><span class="c1">#   <chr>     <chr>        <chr>      </span><span class="w">
</span><span class="c1"># 1 BUF       Robin        Lehner     </span><span class="w">
</span><span class="c1"># 2 BUF       Chad         Johnson    </span><span class="w">
</span><span class="c1"># 3 CAR       Scott        Darling    </span><span class="w">
</span><span class="c1"># 4 CAR       Cam          Ward       </span><span class="w">
</span><span class="c1"># 5 COL       Jonathan     Bernier    </span><span class="w">
</span><span class="c1"># 6 COL       Semyon       Varlamov   </span><span class="w">
</span><span class="c1"># 7 DAL       Kari         Lehtonen   </span><span class="w">
</span><span class="c1"># 8 DAL       Ben          Bishop     </span><span class="w">
</span><span class="c1"># 9 FLA       Roberto      Luongo     </span><span class="w">
</span><span class="c1">#10 FLA       James        Reimer     </span><span class="w">
</span><span class="c1">#11 NJD       Keith        Kinkaid    </span><span class="w">
</span><span class="c1">#12 NJD       Cory         Schneider  </span><span class="w">
</span><span class="c1">#13 WSH       Braden       Holtby     </span><span class="w">
</span><span class="c1">#14 WSH       Philipp      Grubauer   </span><span class="w">
</span><span class="c1">## don't see any dupes</span><span class="w">

</span><span class="c1"># make sure no dupes in general</span><span class="w">
</span><span class="n">goalies</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">group_by</span><span class="p">(</span><span class="n">`Last Name`</span><span class="p">,</span><span class="w"> </span><span class="n">`First Name`</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">count</span><span class="p">()</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">filter</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">></span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">arrange</span><span class="p">(</span><span class="n">desc</span><span class="p">(</span><span class="n">n</span><span class="p">))</span><span class="w"> 
</span><span class="c1">## A tibble: 0 x 3</span><span class="w">
</span><span class="c1"># Groups:   Last Name, First Name [0]</span><span class="w">
</span><span class="c1"># ... with 3 variables: `Last Name` <chr>, `First Name` <chr>, n <int></span><span class="w">

</span><span class="n">goalies</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">select</span><span class="p">(</span><span class="n">`Last Name`</span><span class="p">,</span><span class="w"> </span><span class="n">`First Name`</span><span class="p">,</span><span class="w"> </span><span class="n">GP</span><span class="p">,</span><span class="w"> </span><span class="n">`Team(s)`</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">filter</span><span class="p">(</span><span class="n">str_detect</span><span class="p">(</span><span class="n">`Team(s)`</span><span class="p">,</span><span class="w"> </span><span class="s1">','</span><span class="p">))</span><span class="w">
</span><span class="c1">## A tibble: 6 x 4</span><span class="w">
</span><span class="c1">#  `Last Name` `First Name`    GP `Team(s)`    </span><span class="w">
</span><span class="c1">#  <chr>       <chr>        <dbl> <chr>        </span><span class="w">
</span><span class="c1">#1 Lack        Eddie            8 CGY, NJD     </span><span class="w">
</span><span class="c1">#2 Kuemper     Darcy           29 LAK, ARI     </span><span class="w">
</span><span class="c1">#3 Montoya     Al              13 MTL, EDM     </span><span class="w">
</span><span class="c1">#4 Domingue    Louis           19 ARI, TBL     </span><span class="w">
</span><span class="c1">#5 Mrazek      Petr            39 DET, PHI     </span><span class="w">
</span><span class="c1">#6 Niemi       Antti           24 PIT, FLA, MTL</span><span class="w">
</span>

So there are no duplicates. The Team(s) field can hold more than one team, separated by commas, rather than showing only the last team the goalie played for in 2018. So these look like true shared starters.
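If the comma-separated Team(s) values ever need to be analysed per team, one option is to split them into one row per goalie-team pair. The sketch below uses base R only and three rows copied from the output above; the simplified column names `last_name` and `teams` are stand-ins for `Last Name` and `Team(s)`, not names from the data set.

```r
# Three multi-team goalies copied from the output above.
# Column names are simplified stand-ins, not the source's names.
moved <- data.frame(
  last_name = c("Lack", "Mrazek", "Niemi"),
  teams     = c("CGY, NJD", "DET, PHI", "PIT, FLA, MTL"),
  stringsAsFactors = FALSE
)

# Split each string on a comma followed by optional whitespace.
team_list <- strsplit(moved$teams, ",\\s*")

# Number of teams per goalie, named by goalie.
n_teams <- setNames(lengths(team_list), moved$last_name)

# One row per goalie-team pair.
long <- data.frame(
  last_name = rep(moved$last_name, lengths(team_list)),
  team      = unlist(team_list),
  stringsAsFactors = FALSE
)
long
```

Grouping `long` by `team` would then count a traded goalie once for each club he appeared with, which may or may not be what a given question calls for.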

1.2. Distribution of Heights

Heights of NHL goalies are ridiculous these days! This picture by falsegrit shows Ben Bishop (6’7’’), who currently plays for the Dallas Stars, being interviewed by retired NHL goalie Darren Pang (5’5’’).
[Photo: Ben Bishop being interviewed by Darren Pang]

To be fair, Bishop is the tallest NHL netminder ever while Pang is the second shortest. But would Pang be the shortest by a lot today?

<span class="n">goalies</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">`Height in Feet`</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Ht</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="m">12</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">`Height in Feet`</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">GP</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> 
  </span><span class="n">geom_jitter</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">.5</span><span class="p">),</span><span class="w"> </span><span class="n">width</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">.01</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">scale_alpha</span><span class="p">(</span><span class="n">guide</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_smooth</span><span class="p">()</span><span class="w">
</span>

[Figure: games played (GP) versus height in feet, with a smoothed trend line]

Sorry Darren, less than 6 feet need not apply, it seems.

Who are these new very short netminders on the left getting game time?

<span class="n">goalies</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">filter</span><span class="p">(</span><span class="n">Ht</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="m">6.0</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">12</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">select</span><span class="p">(</span><span class="n">`First Name`</span><span class="p">,</span><span class="w"> </span><span class="n">`Last Name`</span><span class="p">,</span><span class="w"> </span><span class="n">Ht</span><span class="p">,</span><span class="w"> </span><span class="n">GP</span><span class="p">,</span><span class="w"> </span><span class="n">`Team(s)`</span><span class="p">)</span><span class="w">
</span><span class="c1">## A tibble: 3 x 5</span><span class="w">
</span><span class="c1">#  `First Name` `Last Name`    Ht    GP `Team(s)`</span><span class="w">
</span><span class="c1">#  <chr>        <chr>       <dbl> <dbl> <chr>    </span><span class="w">
</span><span class="c1">#1 Juuse        Saros          71    26 NSH      </span><span class="w">
</span><span class="c1">#2 Jaroslav     Halak          71    54 NYI      </span><span class="w">
</span><span class="c1">#3 Anton        Khudobin       71    31 BOS      </span><span class="w">

</span><span class="n">paste0</span><span class="p">(</span><span class="m">71</span><span class="w"> </span><span class="o">%/%</span><span class="w"> </span><span class="m">12</span><span class="p">,</span><span class="w"> </span><span class="s1">'\''</span><span class="p">,</span><span class="w"> </span><span class="m">71</span><span class="w"> </span><span class="o">%%</span><span class="w"> </span><span class="m">12</span><span class="p">,</span><span class="w"> </span><span class="s1">'\'\', what a bunch of short people!'</span><span class="p">)</span><span class="w">
</span><span class="c1"># [1] "5'11'', what a bunch of short people!"</span><span class="w">
</span>

5 foot 11 inches. Well, being a bit shorter than that, I finally know the primary reason I’m not in the NHL!
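The inches-to-feet arithmetic above (integer division for feet, remainder for inches) can be wrapped in a small helper. `format_height` is a hypothetical name introduced here for illustration; it is not part of the article's code or of any package.

```r
# Hypothetical helper: format a height in inches as feet'inches''.
# %/% is integer division (whole feet), %% is the remainder (leftover inches).
format_height <- function(inches) {
  paste0(inches %/% 12, "'", inches %% 12, "''")
}

# 71 in. is the shortest height in the table; 79 and 65 correspond to the
# 6'7'' and 5'5'' quoted earlier for Bishop and Pang.
format_height(c(71, 79, 65))
# [1] "5'11''" "6'7''"  "5'5''"
```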

2. Initial Clustering

Let’s start with something simple.

2.1. k = 2

Since we’ve looked at Games Played and Height, let’s add another key statistic, Save Percentage, and run k-means with k = 2.

<span class="n">clusters_HGS</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">goalies</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">select</span><span class="p">(</span><span class="n">Ht</span><span class="p">,</span><span class="w"> </span><span class="n">GP</span><span class="p">,</span><span class="w"> </span><span class="n">`SV%`</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">kmeans</span><span class="p">(</span><span class="n">centers</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="c1"># Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)</span><span class="w">

</span><span class="c1"># we have some NAs</span><span class="w">
</span><span class="n">goalies</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">select</span><span class="p">(</span><span class="n">Ht</span><span class="p">,</span><span class="w"> </span><span class="n">GP</span><span class="p">,</span><span class="w"> </span><span class="n">`SV%`</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="nf">is.na</span><span class="p">()</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="nf">sum</span><span class="p">()</span><span class="w">
</span><span class="c1">#[1] 1</span><span class="w">

</span><span class="c1"># Just 1, who is it?</span><span class="w">
</span><span class="n">goalies</span><span class="p">[</span><span class="nf">is.na</span><span class="p">(</span><span class="n">goalies</span><span class="o">$</span><span class="n">Ht</span><span class="p">),</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">select</span><span class="p">(</span><span class="n">`Last Name`</span><span class="p">,</span><span class="w"> </span><span class="n">`First Name`</span><span class="p">,</span><span class="w"> </span><span class="n">GP</span><span class="p">,</span><span class="w"> </span><span class="n">Ht</span><span class="p">,</span><span class="w"> </span><span class="n">`SV%`</span><span class="p">)</span><span class="w">
</span><span class="c1">#  `Last Name` `First Name`    GP    Ht `SV%`</span><span class="w">
</span><span class="c1">#  <chr>       <chr>        <dbl> <dbl> <dbl></span><span class="w">
</span><span class="c1">#1 Foster      Scott            1    NA     1</span><span class="w">
</span>

R’s kmeans() returns an error because of an NA. Who is this NA? Scott Foster. The Chicago accountant may be the most famous goalie after Lester Patrick to play a single NHL game. Classy of hockeyabstract to include him. His height appears to be 6’0’’, so we’ll add it.
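More generally, it can help to look for missing values before calling kmeans(). The following is a minimal sketch on made-up rows: the values are invented for illustration, and the column name `SV` is a placeholder for the data set's SV% field.

```r
# Made-up rows standing in for the three clustering columns.
# The single NA mimics the one missing height in the real data.
toy <- data.frame(
  Ht = c(73, 75, NA, 71, 76, 74),
  GP = c(60, 12, 1, 54, 35, 8),
  SV = c(0.915, 0.905, 1.000, 0.908, 0.920, 0.899)
)

# Missing values per column.
na_per_column <- colSums(is.na(toy))

# Row indices that kmeans() cannot use as-is.
incomplete_rows <- which(!complete.cases(toy))

# One option: drop incomplete rows before clustering.
toy_complete <- toy[complete.cases(toy), ]
fit <- kmeans(toy_complete, centers = 2)
```

Dropping rows is only one choice. When the missing value can be looked up, as with the single height here, filling it in keeps the player in the analysis.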

<span class="n">goalies</span><span class="p">[</span><span class="nf">is.na</span><span class="p">(</span><span class="n">goalies</span><span class="o">$</span><span class="n">Ht</span><span class="p">),</span><span class="w"> </span><span class="s1">'Ht'</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">6</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">12</span><span class="w">

</span><span class="c1"># try again</span><span class="w">
</span><span class="n">clusters_HGS</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">goalies</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">select</span><span class="p">(</span><span class="n">Ht</span><span class="p">,</span><span class="w"> </span><span class="n">GP</span><span class="p">,</span><span class="w"> </span><span class="n">`SV%`</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">kmeans</span><span class="p">(</span><span class="n">centers</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">nstart</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1000</span><span class="p">)</span><span class="w">

</span><span class="n">goalies</span><span class="o">$</span><span class="n">cluster_2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">factor</span><span class="p">(</span><span class="n">clusters_HGS</span><span class="o">$</span><span class="n">cluster</span><span class="p">)</span><span class="w">

</span><span class="n">goalies</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">select</span><span class="p">(</span><span class="n">Ht</span><span class="p">,</span><span class="w"> </span><span class="n">GP</span><span class="p">,</span><span class="w"> </span><span class="n">`SV%`</span><span class="p">,</span><span class="w"> </span><span class="n">cluster_2</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">ggpairs</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">color</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cluster_2</span><span class="p">,</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.4</span><span class="p">))</span><span class="w">
</span>

[Figure: pairs plot of Ht, GP, and SV%, colored by k = 2 cluster]

Looking along the diagonal, this clustering splits almost entirely along Games Played. This suggests that the split between backup and starter may be the strongest distinction in these three fields. Out of curiosity, what value of GP does that split suggest as a good cutoff for a starter?
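One caveat worth keeping in mind: kmeans() works on Euclidean distances, so a column measured on a much wider numeric scale contributes most of the distance. A split along GP may therefore partly reflect units rather than structure. The sketch below illustrates this on synthetic data only; the means, spreads, and seed are assumptions chosen to loosely resemble Ht, GP, and SV%, and nothing here reproduces the article's results.

```r
set.seed(42)  # arbitrary seed, for reproducibility of the sketch only

# Synthetic data, NOT the goalie table: ranges loosely resemble Ht, GP, SV%.
n <- 90
sim <- data.frame(
  Ht = rnorm(n, mean = 74, sd = 1.5),
  GP = runif(n, min = 1, max = 67),
  SV = rnorm(n, mean = 0.91, sd = 0.015)
)

# Share of the total variance contributed by each raw column.
col_var <- sapply(sim, var)
variance_share <- col_var / sum(col_var)
round(variance_share, 4)

# k-means on raw values versus on standardized (mean 0, sd 1) columns.
fit_raw    <- kmeans(sim, centers = 2, nstart = 25)
fit_scaled <- kmeans(scale(sim), centers = 2, nstart = 25)

# GP range within each raw cluster, to see how the raw split lines up with GP.
tapply(sim$GP, fit_raw$cluster, range)
```

Comparing `fit_raw$cluster` with `fit_scaled$cluster` on the real data would show how much of the backup/starter split survives once every column is given equal weight.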

<span class="n">goalies</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">group_by</span><span class="p">(</span><span class="n">cluster_2</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">summarise</span><span class="p">(</span><span class="nf">min</span><span class="p">(</span><span class="n">GP</span><span class="p">),</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="n">GP</span><span class="p">))</span><span class="w">
</span><span class="c1">## A tibble: 2 x 3</span><span class="w">
</span><span class="c1">#  cluster_2 `min(GP)` `max(GP)`</span><span class="w">
</span><span class="c1">#  <fct>         <dbl>     <dbl></span><span class="w">
</span><span class="c1">#1 1                 1        32</span><span class="w">
</span><span class="c1">#2 2                35        67</span><span class="w">
</span>

The split falls between 32 GP and 35 GP, so somewhere in that range would be a fair rule of thumb for when a goalie becomes a starter in the 2017-2018 NHL regular season.

2.2. Better k than k = 2?

Do these three fields present more than 2 clusters? A scree plot shows how much within-cluster variance remains as clusters are added. In it, k = 2 and, debatably, k = 3 appear as “elbows”, suggesting these are good numbers of clusters for these data.
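As a sketch of what the scree plot tracks: for k = 1, the within-groups sum of squares is the total squared distance of every point to the overall column means, which equals (n - 1) times the sum of the column variances. For larger k, kmeans() reports the same quantity summed over clusters as `tot.withinss`. The data below are synthetic and the seed is arbitrary.

```r
set.seed(1)  # arbitrary seed

# Synthetic two-column data, not the goalie table.
x <- cbind(a = rnorm(50), b = rnorm(50, sd = 3))

# k = 1, computed two ways: centre the columns and sum the squares,
# or use (n - 1) * sum of column variances.
wss_k1_manual  <- sum(scale(x, scale = FALSE)^2)
wss_k1_formula <- (nrow(x) - 1) * sum(apply(x, 2, var))
all.equal(wss_k1_manual, wss_k1_formula)

# k = 2..5: total within-cluster sum of squares from kmeans().
wss <- c(
  wss_k1_formula,
  sapply(2:5, function(k) kmeans(x, centers = k, nstart = 20)$tot.withinss)
)
wss
```

An "elbow" is the k after which further clusters stop reducing `wss` by much; reading it off a plot remains a judgment call.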

<span class="n">wssplot</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">df</span><span class="p">,</span><span class="w"> </span><span class="n">nc</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">15</span><span class="p">,</span><span class="w"> </span><span class="n">seed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1234</span><span class="p">,</span><span class="w"> </span><span class="n">nstart</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">){</span><span class="w">
  </span><span class="n">wss</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="p">(</span><span class="n">nrow</span><span class="p">(</span><span class="n">df</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">apply</span><span class="p">(</span><span class="n">df</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">var</span><span class="p">))</span><span class="w">
  </span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">2</span><span class="o">:</span><span class="n">nc</span><span class="p">){</span><span class="w">
    </span><span class="n">set.seed</span><span class="p">(</span><span class="n">seed</span><span class="p">)</span><span class="w">
    </span><span class="n">wss</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">kmeans</span><span class="p">(</span><span class="n">df</span><span class="p">,</span><span class="w"> </span><span class="n">centers</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="n">nstart</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nstart</span><span class="p">)</span><span class="o">$</span><span class="n">withinss</span><span class="p">)}</span><span class="w">
  </span><span class="n">qplot</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="n">nc</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">wss</span><span class="p">,</span><span class="w">  </span><span class="n">xlab</span><span class="o">=</span><span class="s2">"Number of Clusters"</span><span class="p">,</span><span class="w">
        </span><span class="n">ylab</span><span class="o">=</span><span class="s2">"Within groups sum of squares"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_line</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
    </span><span class="n">theme</span><span class="p">(</span><span class="n">axis.title.y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_text</span><span class="p">(</span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rel</span><span class="p">(</span><span class="m">.8</span><span class="p">),</span><span class="w"> </span><span class="n">angle</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">90</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
    </span><span class="n">theme</span><span class="p">(</span><span class="n">axis.title.x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_text</span><span class="p">(</span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rel</span><span class="p">(</span><span class="m">.8</span><span class="p">),</span><span class="w"> </span><span class="n">angle</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">00</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
    </span><span class="n">theme</span><span class="p">(</span><span class="n">axis.text.x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_text</span><span class="p">(</span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rel</span><span class="p">(</span><span class="m">.8</span><span class="p">)))</span><span class="w"> </span><span class="o">+</span><span class="w">
    </span><span class="n">theme</span><span class="p">(</span><span class="n">axis.text.y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_text</span><span class="p">(</span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rel</span><span class="p">(</span><span class="m">.8</span><span class="p">)))</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="n">goalies</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">select</span><span class="p">(</span><span class="n">Ht</span><span class="p">,</span><span class="w"> </span><span class="n">GP</span><span class="p">,</span><span class="w"> </span><span class="n">`SV%`</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">wssplot</span><span class="p">(</span><span class="n">nstart</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1000</span><span class="p">)</span><span class="w">
</span>

goalies06

Importantly, though, these data aren’t scaled. Centering and scaling remove the effect of units. For example, is a change from 20 to 30 degrees Fahrenheit equal to a change from 20 to 30 degrees Celsius? No! In one case you can still play hockey, but in the other it’s too hot. So let’s scale the variables and retry the scree plot.
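Before rerunning the plot, here is a minimal sketch of the units point. The temperatures are made up for illustration, and `temps_c`, `temps_f`, `z_c`, and `z_f` are hypothetical names, not part of the goalie data. It shows that the same readings in two units have different raw spreads but the same z-scores after `scale()`.

```r
# Toy temperatures (made up for illustration): the same readings in two units.
temps_c <- c(-10, 0, 10, 20, 30)
temps_f <- temps_c * 9 / 5 + 32

# scale() centers to mean 0 and divides by the standard deviation.
z_c <- as.numeric(scale(temps_c))
z_f <- as.numeric(scale(temps_f))

# Raw spreads depend on the unit; the z-scores do not.
diff(range(temps_c))   # spread in Celsius
diff(range(temps_f))   # spread in Fahrenheit
all.equal(z_c, z_f)    # TRUE once both are scaled
```

Because k-means works on distances, a variable measured in larger units would otherwise dominate the clustering.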

<span class="c1"># so what if we scale it?</span><span class="w">
</span><span class="n">goalies</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">select</span><span class="p">(</span><span class="n">Ht</span><span class="p">,</span><span class="w"> </span><span class="n">GP</span><span class="p">,</span><span class="w"> </span><span class="n">`SV%`</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">scale</span><span class="p">()</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">tibble_out</span><span class="p">(</span><span class="s1">'scaled_3_vars'</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">wssplot</span><span class="p">(</span><span class="n">nstart</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1000</span><span class="p">)</span><span class="w">
</span>

goalies07

Now the best bet for an elbow in the within-groups sum of squares looks to be about k = 4 clusters.
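Reading an elbow off a plot is a judgment call. One hedged way to put numbers on it is to look at the proportional drop in within-groups sum of squares at each step. The sketch below uses synthetic data with four deliberately well-separated blobs (an assumption for illustration, not the goalie table); `toy`, `blob_centers`, `wss`, and `rel_drop` are hypothetical names.

```r
set.seed(42)

# Synthetic data (assumption): four well-separated 2-D blobs of 25 points each.
blob_centers <- rbind(c(0, 0), c(10, 0), c(0, 10), c(10, 10))
toy <- do.call(rbind, lapply(1:4, function(i) {
  cbind(rnorm(25, blob_centers[i, 1]), rnorm(25, blob_centers[i, 2]))
}))

# Total within-cluster sum of squares for k = 1..8.
wss <- sapply(1:8, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

# Proportional drop in WSS when moving from k - 1 to k clusters.
rel_drop <- -diff(wss) / head(wss, -1)
round(rel_drop, 2)
```

The elbow is where the drops stop being large. On real data such as the goalie statistics the pattern is rarely this clean, which is why the estimate above is hedged as "about" 4.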

So rerunning the pairs plot with 4 clusters on the scaled variables:

<span class="n">set.seed</span><span class="p">(</span><span class="m">1001</span><span class="p">)</span><span class="w"> </span><span class="c1"># so cluster assignments stay the same</span><span class="w">
</span><span class="n">scaled_3_vars</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">do</span><span class="p">(</span><span class="n">clusters_HGS_scaled</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">kmeans</span><span class="p">(</span><span class="n">.</span><span class="p">,</span><span class="w"> </span><span class="n">centers</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="n">nstart</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1000</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">tbl_module</span><span class="p">((</span><span class="n">.</span><span class="o">$</span><span class="n">clusters_HGS_scaled</span><span class="p">[[</span><span class="m">1</span><span class="p">]]</span><span class="o">$</span><span class="n">centers</span><span class="p">),</span><span class="w"> </span><span class="s1">'scaled_centers'</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">do</span><span class="p">(</span><span class="n">augment</span><span class="p">(</span><span class="n">.</span><span class="o">$</span><span class="n">clusters_HGS_scaled</span><span class="p">[[</span><span class="m">1</span><span class="p">]],</span><span class="w"> </span><span class="n">scaled_3_vars</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">select</span><span class="p">(</span><span class="s1">'cluster_scaled'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">`.cluster`</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">bind_cols</span><span class="p">(</span><span class="n">goalies</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">tibble_out</span><span class="p">(</span><span class="s1">'goalies'</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">select</span><span class="p">(</span><span class="n">Ht</span><span class="p">,</span><span class="w"> </span><span class="n">GP</span><span class="p">,</span><span class="w"> </span><span class="n">`SV%`</span><span class="p">,</span><span class="w"> </span><span class="n">cluster_scaled</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">ggpairs</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">color</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cluster_scaled</span><span class="p">,</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.4</span><span class="p">))</span><span class="w">
</span>

goalies08

Ht vs GP shows some good separation. Save percentage (SV%) doesn’t do much except separate out one “really bad” showing in green, a goalie with a 50% SV%. I’m not saying that I, as a sub-5’11’’ individual, could do better, but who is that?

<span class="n">goalies</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">filter</span><span class="p">(</span><span class="n">cluster_scaled</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">3</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">select</span><span class="p">(</span><span class="n">`First Name`</span><span class="p">,</span><span class="w"> </span><span class="n">`Last Name`</span><span class="p">,</span><span class="w"> </span><span class="n">GP</span><span class="p">,</span><span class="w"> </span><span class="n">Ht</span><span class="p">,</span><span class="w"> </span><span class="n">`SV%`</span><span class="p">,</span><span class="w"> </span><span class="n">cluster_scaled</span><span class="p">,</span><span class="w"> </span><span class="n">SA</span><span class="p">,</span><span class="w"> </span><span class="n">MIN</span><span class="p">)</span><span class="w">
</span><span class="c1">## A tibble: 1 x 8</span><span class="w">
</span><span class="c1">#  `First Name` `Last Name`    GP    Ht `SV%` cluster_scaled    SA   MIN</span><span class="w">
</span><span class="c1">#  <chr>        <chr>       <dbl> <dbl> <dbl> <fct>          <dbl> <dbl></span><span class="w">
</span><span class="c1">#1 Dylan        Ferguson        1    73   0.5 3                  2   554</span><span class="w">
</span>

Dylan Ferguson apparently faced 2 shots against (SA) in 554 minutes. Either that is amazing defense, or the units are actually seconds, not minutes. Wikipedia verifies he played a little over 9 minutes, which matches 554 / 60. Will let hockeyabstract know! Let’s fix it anyway.

<span class="n">goalies</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">goalies</span><span class="p">,</span><span class="w"> </span><span class="n">MIN</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">MIN</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="m">60</span><span class="p">)</span><span class="w">
</span>
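The units argument can be checked with quick arithmetic. This is a minimal sketch using only the two numbers printed above (554 and 2 shots against); the variable names are hypothetical.

```r
# Values copied from the printed row for Dylan Ferguson.
min_reported  <- 554  # MIN as reported in the source data
shots_against <- 2    # SA

# If MIN really were minutes, the shot rate per 60 minutes would be tiny.
shots_per_60_if_minutes <- shots_against / min_reported * 60

# If MIN is seconds, playing time is a little over 9 minutes.
minutes_if_seconds <- min_reported / 60
shots_per_60_if_seconds <- shots_against / minutes_if_seconds * 60

round(minutes_if_seconds, 2)  # about 9.23
c(if_minutes = shots_per_60_if_minutes, if_seconds = shots_per_60_if_seconds)
```

The seconds interpretation gives a playing time in line with the Wikipedia figure and a far more plausible shot rate.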

2.3. Prototypical Members

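Who are the prototypical members of each cluster? That is, who is closest to each centroid? Algorithmically: for each center, compute every player’s distance to that center on the scaled variables, then return the player with the smallest distance. Note that the code below sums absolute differences (Manhattan distance) rather than computing a true Euclidean distance, which would square the differences before summing.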

<span class="n">vars</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">colnames</span><span class="p">(</span><span class="n">scaled_3_vars</span><span class="p">)</span><span class="w">
</span><span class="n">distances</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">map</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">rowSums</span><span class="p">(</span><span class="nf">abs</span><span class="p">(</span><span class="n">scaled_3_vars</span><span class="p">[</span><span class="w"> </span><span class="p">,</span><span class="w"> </span><span class="n">vars</span><span class="p">]</span><span class="w"> </span><span class="o">-</span><span class="w"> 
      </span><span class="n">as.data.frame</span><span class="p">(</span><span class="n">t</span><span class="p">(</span><span class="n">scaled_centers</span><span class="p">[</span><span class="n">.x</span><span class="p">,</span><span class="w"> </span><span class="n">vars</span><span class="p">]))</span><span class="w">
       </span><span class="p">[</span><span class="nf">rep</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">each</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nrow</span><span class="p">(</span><span class="n">scaled_3_vars</span><span class="p">)),]</span><span class="w"> </span><span class="p">)))</span><span class="w">
</span><span class="c1"># this gives a list of 4 distance vectors, 1 per cluster</span><span class="w">
</span><span class="c1"># for each cluster of distances, which player is the min?</span><span class="w">
</span><span class="n">prototype_player_nums</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">map</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">which.min</span><span class="p">(</span><span class="n">distances</span><span class="p">[[</span><span class="n">.x</span><span class="p">]]))</span><span class="w">

</span><span class="c1"># for each prototype, who is it?</span><span class="w">
</span><span class="n">prototypes</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">map_df</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">goalies</span><span class="p">[</span><span class="n">prototype_player_nums</span><span class="p">[[</span><span class="n">.x</span><span class="p">]],</span><span class="w"> 
  </span><span class="nf">c</span><span class="p">(</span><span class="s1">'cluster_scaled'</span><span class="p">,</span><span class="w"> </span><span class="s1">'First Name'</span><span class="p">,</span><span class="w"> </span><span class="s1">'Last Name'</span><span class="p">,</span><span class="w"> </span><span class="n">vars</span><span class="p">,</span><span class="w"> </span><span class="s1">'Team(s)'</span><span class="p">)])</span><span class="w">


</span><span class="c1"># and now plot them on the ggpairs to understand</span><span class="w">
</span><span class="n">pm</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">goalies</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">select</span><span class="p">(</span><span class="n">Ht</span><span class="p">,</span><span class="w"> </span><span class="n">GP</span><span class="p">,</span><span class="w"> </span><span class="n">`SV%`</span><span class="p">,</span><span class="w"> </span><span class="n">cluster_scaled</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">ggpairs</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">color</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cluster_scaled</span><span class="p">,</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.4</span><span class="p">),</span><span class="w"> 
          </span><span class="n">columns</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">vars</span><span class="p">)</span><span class="w">

</span><span class="c1"># which plots to add points?</span><span class="w">
</span><span class="n">sps</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">pm</span><span class="p">[</span><span class="m">2</span><span class="p">,</span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="n">pm</span><span class="p">[</span><span class="m">3</span><span class="p">,</span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="n">pm</span><span class="p">[</span><span class="m">3</span><span class="p">,</span><span class="m">2</span><span class="p">])</span><span class="w">
</span><span class="n">sps2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">map</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="nf">length</span><span class="p">(</span><span class="n">sps</span><span class="p">),</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">sps</span><span class="p">[[</span><span class="n">.x</span><span class="p">]]</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_point</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">prototypes</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> 
  </span><span class="n">color</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">.5</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_text</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">prototypes</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">label</span><span class="w"> </span><span class="o">=</span><span class="w"> 
  </span><span class="n">paste0</span><span class="p">(</span><span class="n">`First Name`</span><span class="p">,</span><span class="w"> </span><span class="s2">" "</span><span class="p">,</span><span class="w"> </span><span class="n">`Last Name`</span><span class="p">)),</span><span class="w"> </span><span class="n">color</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">vjust</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">-0.5</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> 
  </span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">.5</span><span class="p">))</span><span class="w">
</span><span class="n">pm</span><span class="p">[</span><span class="m">2</span><span class="p">,</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sps2</span><span class="p">[[</span><span class="m">1</span><span class="p">]]</span><span class="w">
</span><span class="n">pm</span><span class="p">[</span><span class="m">3</span><span class="p">,</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sps2</span><span class="p">[[</span><span class="m">2</span><span class="p">]]</span><span class="w">
</span><span class="n">pm</span><span class="p">[</span><span class="m">3</span><span class="p">,</span><span class="m">2</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sps2</span><span class="p">[[</span><span class="m">3</span><span class="p">]]</span><span class="w">
</span><span class="n">pm</span><span class="w">
</span>

goalies09

What do these four clusters teach us about these data? Looking at the scatterplot with the best separation, row 2, column 1, Ht vs. GP, we can see that the red group in the top-right corner of the plot, represented by Tuukka Rask, is taller than average and made up mostly of starters. Backups, in the lower half of the plot, can be “short” (purple) or tall (green). Then there’s the 50% save percentage group, which we already know about. Sorry, Ferguson.
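The prototype search described in this section can also be sketched with a true Euclidean distance. This is a minimal sketch on made-up data, not the post's pipeline: `toy`, `dist_to_center`, and `prototype_rows` are illustrative names, and the random matrix only stands in for `scaled_3_vars`.

```r
set.seed(1001)

# Toy stand-in for the scaled data (assumption; not the goalie table).
toy <- scale(matrix(rnorm(60), ncol = 3,
                    dimnames = list(NULL, c("Ht", "GP", "SV%"))))
km <- kmeans(toy, centers = 4, nstart = 50)

# Euclidean distance from every row of `toy` to one center.
dist_to_center <- function(center) {
  sqrt(rowSums(sweep(toy, 2, center)^2))
}

# Matrix of distances: one row per player, one column per cluster center.
distances <- apply(km$centers, 1, dist_to_center)

# Row index of the member closest to each center.
prototype_rows <- apply(distances, 2, which.min)
prototype_rows
```

With only three scaled variables, Manhattan and Euclidean distance will often pick the same prototype, but they are not guaranteed to agree.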

3. Clustering on All Data

OK, let’s add in as much of the data as we can quickly make use of.

3.1. How Many Clusters?

Ignoring categorical data for now, how many columns are numeric?

<span class="n">goalies</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">map_lgl</span><span class="p">(</span><span class="n">is.numeric</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">mean</span><span class="p">()</span><span class="w">
</span><span class="c1">#[1] 0.8120301</span><span class="w">
</span>

81% of columns are numeric. Let’s save just those.
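As an aside, the same selection can be written with dplyr alone. This is a hedged alternative to the indexing used in this post, shown on a made-up table; `toy` and `toy_stats` are hypothetical names, and `where()` assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Toy table (made up) mixing character and numeric columns.
toy <- tibble(
  `Last Name` = c("A", "B", "C"),
  Ht          = c(73, 75, 76),
  `SV%`       = c(0.91, 0.92, 0.90)
)

# Share of numeric columns, computed the same way as in the post.
mean(vapply(toy, is.numeric, logical(1)))

# Keep only the numeric columns.
toy_stats <- select(toy, where(is.numeric))
names(toy_stats)
```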

goalies[, map_lgl(goalies, is.numeric)] %>% 
  tibble_out('goalies_stats') %>% 
  glimpse()
#Observations: 95
#Variables: 108
#$ Ht          <dbl> 73, 73, 75, 76, 74, 74, 74, 78, 78, 75, 74, 78, 73, 74, 73, 76, 75, 7...
#$ Wt          <dbl> 189, 196, 202, 187, 215, 211, 182, 232, 220, 173, 195, 229, 182, 180,...
#$ `Dft Yr`    <dbl> 2017, 2001, 2008, NA, NA, 2008, NA, 2007, NA, 1999, NA, 2009, NA, 200...
#...
#$ MIN__1      <dbl> 9, 20144, 5414, 7996, 3092, 20678, 22814, 6557, 1092, 42926, 7073, 57...
#$ QS__2       <dbl> 0, 117, 41, 62, 33, 211, 229, 54, 6, 338, 56, 38, 12, 319, 38, 134, 4...
#$ RBS__1      <dbl> 0, 41, 12, 22, 5, 45, 46, 13, 8, 72, 22, 16, 5, 78, 13, 27, 13, 7, 26...
#$ GPS__1      <dbl> -0.1, 50.7, 14.3, 20.5, 10.8, 68.2, 77.7, 17.9, 1.1, 143.4, 17.4, 15....

These last 13 columns are all career stats. Each has a double-underscore suffix (__) because its name duplicates another field’s. If included with this year’s numbers, they make veteran Henrik Lundqvist look the best. Honestly, I didn’t even notice this until trying to figure out why he looked the best in these numbers but not in any 2017-2018 stats posted at nhl.com.

We could leave them in for some questions, but since ours are limited to 2017-2018 regular season performance, results are more interpretable if we take these 13 columns out.

<span class="n">goalies_stats</span><span class="p">[</span><span class="w"> </span><span class="p">,</span><span class="w"> </span><span class="nf">c</span><span class="p">((</span><span class="m">108-12</span><span class="p">)</span><span class="o">:</span><span class="m">108</span><span class="p">)]</span><span class="w">
</span><span class="c1">## A tibble: 95 x 13</span><span class="w">
</span><span class="c1">#   GP__1 GS__1  W__1  L__1 OTL__1 GA__2 SA__2 SO__1 PIM__1 MIN__1 QS__2 RBS__1 GPS__1</span><span class="w">
</span><span class="c1">#   <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl> <dbl>  <dbl>  <dbl></span><span class="w">
</span><span class="c1"># 1     1     0     0     0      0     1     2     0      0      9     0      0   -0.1</span><span class="w">
</span><span class="c1"># 2   365   246   158   131     40   903  9418    18     20  20144   117     41   50.7</span><span class="w">
</span><span class="c1"># 3   102    87    43    39     11   239  2644     3      0   5414    41     12   14.3</span><span class="w">
</span><span class="c1"># 4   144   126    56    55     18   349  3832     9      2   7996    62     22   20.5</span><span class="w">
</span><span class="c1"># 5    56    49    27    15      6   119  1549     4      2   3092    33      5   10.8</span><span class="w">
</span><span class="c1"># 6   361   353   225    89     35   831 10306    32     19  20678   211     45   68.2</span><span class="w">
</span><span class="c1"># 7   395   385   218   129     36   929 11607    24     18  22814   229     46   77.7</span><span class="w">
</span><span class="c1"># 8   118   104    52    38     16   292  3251     4      2   6557    54     13   17.9</span><span class="w">
</span><span class="c1"># 9    21    21     5     9      4    68   565     2      0   1092     6      8    1.1</span><span class="w">
</span><span class="c1">#10   737   593   370   268     80  1863 21999    43     48  42926   338     72  143. </span><span class="w">
</span><span class="c1">## ... with 85 more rows</span><span class="w">

</span><span class="n">goalies_stats</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">goalies_stats</span><span class="p">[</span><span class="w"> </span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="nf">c</span><span class="p">((</span><span class="m">108-12</span><span class="p">)</span><span class="o">:</span><span class="m">108</span><span class="p">)]</span><span class="w">
</span>
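Dropping columns by position works, but it breaks silently if the column order ever changes. A hedged alternative is to drop them by name. The sketch below copies the 13 names from the tibble header printed above and applies them to a made-up stand-in table; `career_cols`, `toy`, and `toy_season` are hypothetical names. Matching on the `__` suffix alone looks unsafe here, since `QS__1` is not among the 13 career columns.

```r
library(dplyr)

# The 13 career column names, copied from the printed tibble header.
career_cols <- c("GP__1", "GS__1", "W__1", "L__1", "OTL__1", "GA__2", "SA__2",
                 "SO__1", "PIM__1", "MIN__1", "QS__2", "RBS__1", "GPS__1")

# Toy stand-in (made-up values): two season columns plus the career columns.
toy <- as_tibble(c(
  list(Ht = c(73, 75), QS__1 = c(10, 20)),
  setNames(rep(list(c(0, 1)), length(career_cols)), career_cols)
))

# Drop by name; all_of() raises an error if a listed column is missing.
toy_season <- select(toy, -all_of(career_cols))
names(toy_season)
```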

This leaves 95 numeric columns. Can we answer how many clusters there are in these 95 variables?

<span class="n">goalies_stats</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">scale</span><span class="p">()</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">tibble_out</span><span class="p">(</span><span class="s1">'scaled_all_vars'</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">wssplot</span><span class="p">(</span><span class="n">nstart</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1000</span><span class="p">)</span><span class="w">
</span><span class="c1"># Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1) </span><span class="w">
</span>

We have more NAs. Let’s take a look at missingness.

<span class="n">vis_miss</span><span class="p">(</span><span class="n">goalies_stats</span><span class="p">)</span><span class="w">
</span>

It looks like only a handful of variables have issues. Which are they?
goalies10

<span class="n">sort</span><span class="p">(</span><span class="n">unlist</span><span class="p">(</span><span class="n">lapply</span><span class="p">(</span><span class="n">goalies_stats</span><span class="p">,</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="nf">is.na</span><span class="p">(</span><span class="n">x</span><span class="p">)))))</span><span class="w">
</span><span class="c1">#     SO__1    PIM__1    MIN__1     QS__2    RBS__1    GPS__1        Wt     StMin      StSV </span><span class="w">
</span><span class="c1">#        0         0         0         0         0         0         1         5         5 </span><span class="w">
</span><span class="c1">#     StGA     QS__1       RBS      Pull    Dft Yr        Rd      Ovrl      Ginj      CHIP </span><span class="w">
</span><span class="c1">#        5         5         5         5        21        21        21        44        44 </span><span class="w">
</span>
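The same per-column count can be written more compactly with `colSums(is.na(...))`. This is a minimal sketch on a made-up table, not the goalie data; `toy`, `na_counts`, and `complete_cols` are hypothetical names. How the post goes on to handle the missing values is not shown here.

```r
# Toy table (made-up values) with a few missing entries.
toy <- data.frame(
  Wt   = c(189, NA, 202, 187),
  Ginj = c(NA, NA, 3, NA),
  GP   = c(1, 40, 22, 60)
)

# NA count per column, showing only the columns that have gaps.
na_counts <- colSums(is.na(toy))
sort(na_counts[na_counts > 0])

# Columns with no missing values, which kmeans() would accept as-is.
complete_cols <- names(na_counts)[na_counts == 0]
complete_cols
```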

Looks like CHIP (Cap Hit of Injured Player) and Ginj (Games Injured), as well as three Draft variables. Are those players who haven’t had an injury or be...

To leave a comment for the author, please follow the link and comment on their blog: Dan Garmat's Blog -- R.
