My KISS Attempt at the rstatsgoes10k Contest

[This article was first published on Jkunst - R category, and kindly contributed to R-bloggers].

Last year eoda.de
launched a contest to predict when the number of R packages on CRAN will reach 10k.
This is a really good opportunity to (finally) use the forecastHybrid package developed by
Peter Ellis and David Shaub.

This will be a really KISS, naive, simple, raw attempt to get a
reasonable prediction: no transformations, no CV, etc. But you can do better!
The writer.

Let’s load the packages!

library(dplyr)
library(rvest)
library(janitor)
library(lubridate)
library(highcharter)
library(forecast)
library(forecastHybrid)
library(zoo)
options(highcharter.theme = hc_theme_smpl())

The data will be extracted from CRAN's list of packages by date.
Then we'll do some wrangling to get the cumulative number of packages
by day.

packages <- "https://cran.r-project.org/web/packages/available_packages_by_date.html" %>%
  read_html() %>%
  html_table() %>%
  .[[1]] %>%
  tbl_df()

names(packages) <- tolower(names(packages))

packages <- mutate(packages, date = ymd(date))

glimpse(packages)

## Observations: 9,858
## Variables: 3
## $ date    <date> 2017-01-07, 2017-01-07, 2017-01-07, 2017-01-07, 2017-...
## $ package <chr> "AER", "c212", "caseMatch", "clustRcompaR", "dat", "gd...
## $ title   <chr> "Applied Econometrics with R", "Methods for Detecting ...

c(min(packages$date), max(packages$date))

## [1] "2005-10-29" "2017-01-07"

data <- packages %>%
  group_by(date) %>%
  summarise(n = n()) %>%
  left_join(data_frame(date = seq(min(packages$date), max(packages$date), by = 1)),
            ., by = "date")

data <- data %>%
  mutate(
    n = ifelse(is.na(n), 0, n),
    cumsum = cumsum(n)
  )

tail(data)

date n cumsum
2017-01-02 14 9710
2017-01-03 29 9739
2017-01-04 28 9767
2017-01-05 37 9804
2017-01-06 46 9850
2017-01-07 8 9858
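
The wrangling step can be seen in miniature with base R only. This is just a sketch on made-up data (the `releases` data frame and its values are hypothetical, not CRAN data): build the full calendar of days, count releases per day with zeros where nothing was published, and then accumulate.

```r
# Hypothetical release log: one row per package, with gaps between dates
releases <- data.frame(
  date = as.Date(c("2017-01-01", "2017-01-01", "2017-01-03", "2017-01-06")),
  package = c("a", "b", "c", "d")
)

# Every calendar day between the first and the last release
all_days <- seq(min(releases$date), max(releases$date), by = "day")

# Count releases per day; days without releases get a zero because
# the factor levels cover the whole calendar
n <- as.integer(table(factor(as.character(releases$date),
                             levels = as.character(all_days))))

# Daily counts plus the running total
daily <- data.frame(date = all_days, n = n, cumsum = cumsum(n))
daily
```

Filling the empty days matters because the forecasting step treats the series as regularly spaced: one observation per day.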

hchart(data, "line", hcaes(date, cumsum)) %>%
  hc_title(text = "Just in CRAN, what if we sum GH, BioC? How many would be?")


The effect in 2014 is a little weird. Let's drop some of the older
data and create some auxiliary variables.

data <- filter(data, year(date) >= 2013)

date_first <- first(data$date)
date_last <- last(data$date)


To use the package we first need a time series object:

z <- zooreg(data$cumsum, start = date_first, frequency = 1)
tail(z)

## 2017-01-02 2017-01-03 2017-01-04 2017-01-05 2017-01-06 2017-01-07 
##       9710       9739       9767       9804       9850       9858

Now we can use the forecastHybrid::hybridModel function. In this case
I removed the tbats model because of the long time it takes to fit, the long time to
run CV and the long, long time to make the predictions (in my previous tests).
So, in the spirit of being parsimonious and KISS, we will leave this model
out of the fit.

# The fit is slow, so it was run once and cached; later runs just load the saved model
# hm <- hybridModel(z, models = "aefns", weights = "cv.errors", errorMethod = "MASE")
# saveRDS(hm, "data/rstatsgoes10k/hm.rds")
hm <- readRDS("data/rstatsgoes10k/hm.rds")
hm

## Hybrid forecast model comprised of the following models: auto.arima, ets, thetam, nnetar
## ############
## auto.arima with weight 0.368 
## ############
## ets with weight 0.37 
## ############
## thetam with weight 0.2 
## ############
## nnetar with weight 0.061
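
For intuition about what these weights do: a hybrid point forecast is a weighted average of the component forecasts, and error-based weights favour the models with the smaller error. The sketch below assumes weights inversely proportional to each model's error, and every number in it (`cv_error`, `component_fc`) is made up for illustration; it is not the package's internal code.

```r
# Hypothetical cross-validated errors per component model (smaller is better)
cv_error <- c(auto.arima = 0.5, ets = 0.5, thetam = 1, nnetar = 2)

# Assumption: weights inversely proportional to the error, normalised to sum to 1
w <- (1 / cv_error) / sum(1 / cv_error)

# Hypothetical one-day-ahead point forecasts from each component
component_fc <- c(auto.arima = 9880, ets = 9882, thetam = 9875, nnetar = 9860)

# The combined point forecast is the weighted average of the components
hybrid_fc <- sum(w * component_fc)

round(w, 3)
hybrid_fc
```

Under this scheme a model with four times the error of another gets a quarter of its weight, which is consistent with nnetar contributing so little in the fit above.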

It is really simple to get the forecasts. After calculating them we will
create a data_frame to filter and see on what day R will have 10k
packages according to this methodology.

H <- 20
fc <- forecast(hm, h = H)

data_fc <- fc %>%
  as_data_frame() %>%
  mutate(date = date_last + days(1:H)) %>%
  clean_names() %>%
  tbl_df()


So let's see the point forecast and the optimistic prediction,
which is the upper limit of the 95% interval.

data_preds <- bind_rows(
  data_fc %>%
    filter(point_forecast >= 10000) %>%
    mutate(name = "Prediction") %>%
    head(1) %>%
    rename(y = point_forecast),
  data_fc %>%
    filter(hi_95 >= 10000) %>%
    mutate(name = "Optimistic prediction") %>%
    head(1) %>%
    rename(y = hi_95)
)

select(data_preds, name, date, y)

name date y
Prediction 2017-01-16 10008
Optimistic prediction 2017-01-11 10008

So soon!! (warning: according to this).
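
The "first day at or above 10k" lookup can also be written as a small base R helper. This is only a sketch: `first_hit()` and the `fc_toy` table are hypothetical names with made-up linear forecasts, not the model output. It makes one edge case explicit: if the horizon is too short, there is no such day.

```r
# Hypothetical forecast table: cumulative package counts for the next 10 days
fc_toy <- data.frame(
  date = as.Date("2017-01-07") + 1:10,
  point_forecast = 9858 + 20 * (1:10),
  hi_95 = 9858 + 50 * (1:10)
)

# Return the first date whose value reaches the threshold,
# or NA when the threshold is never reached within the horizon
first_hit <- function(dates, values, threshold) {
  idx <- which(values >= threshold)
  if (length(idx) == 0) {
    return(as.Date(NA))
  }
  dates[idx[1]]
}

first_hit(fc_toy$date, fc_toy$point_forecast, 10000)
first_hit(fc_toy$date, fc_toy$hi_95, 10000)
first_hit(fc_toy$date, fc_toy$point_forecast, 20000)
```

With the `filter()` and `head(1)` approach, a horizon that is too short silently yields zero rows, so an explicit NA can be easier to notice.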

Now, let’s visualize the result.

highchart() %>%
  hc_title(text = "rstatsgoes10k") %>%
  hc_subtitle(text = "Predictions via <code>forecastHybrid</code> package", useHTML = TRUE) %>%
  hc_xAxis(type = "datetime",
           crosshair = list(zIndex = 5, dashStyle = "dot",
                            snap = FALSE, color = "gray"
           )) %>%
  hc_add_series(filter(data, date >= ymd(20161001)), "line", hcaes(date, cumsum),
                name = "Packages") %>%
  hc_add_series(data_fc, "line", hcaes(date, point_forecast),
                name = "Prediction", color = "#75aadb") %>%
  hc_add_series(data_fc, "arearange", hcaes(date, low = lo_95, high = hi_95),
                name = "Prediction Interval (95%)", color = "#75aadb", fillOpacity = 0.3) %>%
  hc_add_series(data_preds, "scatter", hcaes(date, y, name = name),
                name = "Predictions to 10K", color = "blue",
                tooltip = list(pointFormat = ""), zIndex = -4,
                marker = list(symbol = "circle", lineWidth = 1, radius = 4,
                              fillColor = "transparent", lineColor = NULL),
                dataLabels = list(enabled = TRUE, format = "{point.name}<br>{point.x:%Y-%m-%d}",
                                  style = list(fontWeight = "normal"))) %>%
  hc_tooltip(shared = TRUE, valueDecimals = 0)


Simple, right? Maybe that will not be the day when R hits 10k packages, but it
doesn't matter. The really important fact here is that all this is the product of many, many
developers joined by the community, and some Rheroes like HW, JO, JB, GC, JC, BR, MA,
DR, JS, KR and many others who have astonished us package by package or shown our work
on social media. Thanks to everybody, I can use this versatile and powerful language
happily day by day.
