# My KISS Attempt to rstatsgoes10k Contest

**Jkunst - R category**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Last year eoda.de
launched a contest to predict when the R packages will be 10k. So this
is a really good opportunity to use (finally) the `forecastHybrid`

package developed by
Peter Ellis and David Shaub.

This will be a really KISS-naive-simply-raw try to get a reasonable prediction. No transformations, no CV. etc. But you can do better! – The writer.

Let’s load the packages!

```
library(dplyr)
library(rvest)
library(janitor)
library(lubridate)
library(highcharter)
library(forecast)
library(forecastHybrid)
library(zoo)
options(highcharter.theme = hc_theme_smpl())
```

The data will be extracted from the list of packages by date from CRAN. Then we’ll make some wrangling to get the cumulative sum of the packages by day.

```
packages <- "https://cran.r-project.org/web/packages/available_packages_by_date.html" %>%
read_html() %>%
html_table() %>%
.[[1]] %>%
tbl_df()
names(packages) <- tolower(names(packages))
packages <- mutate(packages, date = ymd(date))
glimpse(packages)
```

```
## Observations: 9,858
## Variables: 3
## $ date <date> 2017-01-07, 2017-01-07, 2017-01-07, 2017-01-07, 2017-...
## $ package <chr> "AER", "c212", "caseMatch", "clustRcompaR", "dat", "gd...
## $ title <chr> "Applied Econometrics with R", "Methods for Detecting ...
```

```
c(min(packages$date), max(packages$date))
```

```
## [1] "2005-10-29" "2017-01-07"
```

```
data <- packages %>%
group_by(date) %>%
summarise(n = n()) %>%
left_join(data_frame(date = seq(min(packages$date), max(packages$date), by = 1)),
., by = "date")
data <- data %>%
mutate(
n = ifelse(is.na(n), 0, n),
cumsum = cumsum(n)
)
tail(data)
```

date | n | cumsum |
---|---|---|

2017-01-02 | 14 | 9710 |

2017-01-03 | 29 | 9739 |

2017-01-04 | 28 | 9767 |

2017-01-05 | 37 | 9804 |

2017-01-06 | 46 | 9850 |

2017-01-07 | 8 | 9858 |

```
hchart(data, "line", hcaes(date, cumsum)) %>%
hc_title(text = "Just in CRAN, what if we sum GH, BioC? How many would be?")
```

A little weird the effect in the 2014. Let’s drop some past information and create some auxiliar variables.

```
data <- filter(data, year(date) >= 2013)
date_first <- first(data$date)
date_last <- last(data$date)
```

To use the package we need first a time series object:

```
z <- zooreg(data$cumsum, start = date_first, frequency = 1)
tail(z)
```

```
## 2017-01-02 2017-01-03 2017-01-04 2017-01-05 2017-01-06 2017-01-07
## 9710 9739 9767 9804 9850 9858
```

Now we can use the `forecastHybrid::hybridModel`

function. In this case
I removed the `tbats`

model due the long time to fit, the long time to
make CV and the long long time to make the predictions (in my previous tests).
So, in the spirit to be parsimonious and KISS we will remove this model
from the fit.

```
# hm <- hybridModel(z, models = "aefns", weights = "cv.errors", errorMethod = "MASE")
# saveRDS(hm, "data/rstatsgoes10k/hm.rds")
hm <- readRDS("data/rstatsgoes10k/hm.rds")
hm
```

```
## Hybrid forecast model comprised of the following models: auto.arima, ets, thetam, nnetar
## ############
## auto.arima with weight 0.368
## ############
## ets with weight 0.37
## ############
## thetam with weight 0.2
## ############
## nnetar with weight 0.061
```

It is really simple to get the forecasts. After the calculate them we will
create a `data_frame`

to filter and see what day R will have 10k
packages according this methodology.

```
H <- 20
fc <- forecast(hm, h = H)
data_fc <- fc %>%
as_data_frame() %>%
mutate(date = date_last + days(1:H)) %>%
clean_names() %>%
tbl_df()
```

So let’t see the point forecast and the *optimistic* prediction
which is the upper limit from the 95% interval.

```
data_preds <- bind_rows(
data_fc %>%
filter(point_forecast >= 10000) %>%
mutate(name = "Prediction") %>%
head(1) %>%
rename(y = point_forecast),
data_fc %>%
filter(hi_95 >= 10000) %>%
mutate(name = "Optimitstic prediction") %>%
head(1) %>%
rename(y = hi_95)
)
select(data_preds, name, date, y)
```

name | date | y |
---|---|---|

Prediction | 2017-01-16 | 10008 |

Optimitstic prediction | 2017-01-11 | 10008 |

So soon!! (warning: according to this).

Now, let’s visualize the result.

```
highchart() %>%
hc_title(text = "rstatsgoes10k") %>%
hc_subtitle(text = "Predictions via <code>forecastHybrid</code> package", useHTML = TRUE) %>%
hc_xAxis(type = "datetime",
crosshair = list(zIndex = 5, dashStyle = "dot",
snap = FALSE, color = "gray"
)) %>%
hc_add_series(filter(data, date >= ymd(20161001)), "line", hcaes(date, cumsum),
name = "Packages") %>%
hc_add_series(data_fc, "line", hcaes(date, point_forecast),
name = "Prediction", color = "#75aadb") %>%
hc_add_series(data_fc, "arearange", hcaes(date, low = lo_95, high = hi_95),
name = "Prediction Interval (95%)", color = "#75aadb", fillOpacity = 0.3) %>%
hc_add_series(data_preds, "scatter", hcaes(date, y, name = name),
name = "Predicctions to 10K", color = "blue",
tooltip = list(pointFormat = ""), zIndex = -4,
marker = list(symbol = "circle", lineWidth= 1, radius= 4,
fillColor= "transparent", lineColor= NULL),
dataLabels = list(enabled = TRUE, format = "{point.name}<br>{point.x:%Y-%m-%d}",
style = list(fontWeight = "normal"))) %>%
hc_tooltip(shared = TRUE, valueDecimals = 0)
```

Simple, right? Maybe the that will not be the day where R hit 10k packages but its
doesn’t matter. The **really important fact here** is all this is product of many many
developers joined by the community, and some Rheroes like HW, JO, JB, GC, JC, BR, MA,
DR, JS, KR and many others who have astonished us package by package or show our work
in the social media . Thanks to everybody I can using this versatile and powerful language
happily day by day.

**leave a comment**for the author, please follow the link and comment on their blog:

**Jkunst - R category**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.