**Jkunst - R category**, and kindly contributed to R-bloggers)

Last year eoda.de

launched a contest to predict when the R packages will be 10k. So this

is a really good opportunity to use (finally) the `forecastHybrid`

package developed by

Peter Ellis and David Shaub.

This will be a really KISS-naive-simply-raw try to get a

reasonable prediction. No transformations, no CV. etc. But you can do better!

– The writer.

Let’s load the packages!

```
library(dplyr)
library(rvest)
library(janitor)
library(lubridate)
library(highcharter)
library(forecast)
library(forecastHybrid)
library(zoo)
options(highcharter.theme = hc_theme_smpl())
```

The data will be extracted from the list of packages by date from CRAN.

Then we’ll make some wrangling to get the cumulative sum of the packages

by day.

```
packages <- "https://cran.r-project.org/web/packages/available_packages_by_date.html" %>%
read_html() %>%
html_table() %>%
.[[1]] %>%
tbl_df()
names(packages) <- tolower(names(packages))
packages <- mutate(packages, date = ymd(date))
glimpse(packages)
```

```
## Observations: 9,858
## Variables: 3
## $ date
``` 2017-01-07, 2017-01-07, 2017-01-07, 2017-01-07, 2017-...
## $ package "AER", "c212", "caseMatch", "clustRcompaR", "dat", "gd...
## $ title "Applied Econometrics with R", "Methods for Detecting ...

```
c(min(packages$date), max(packages$date))
```

```
## [1] "2005-10-29" "2017-01-07"
```

```
data <- packages %>%
group_by(date) %>%
summarise(n = n()) %>%
left_join(data_frame(date = seq(min(packages$date), max(packages$date), by = 1)),
., by = "date")
data <- data %>%
mutate(
n = ifelse(is.na(n), 0, n),
cumsum = cumsum(n)
)
tail(data)
```

date | n | cumsum |
---|---|---|

2017-01-02 | 14 | 9710 |

2017-01-03 | 29 | 9739 |

2017-01-04 | 28 | 9767 |

2017-01-05 | 37 | 9804 |

2017-01-06 | 46 | 9850 |

2017-01-07 | 8 | 9858 |

```
hchart(data, "line", hcaes(date, cumsum)) %>%
hc_title(text = "Just in CRAN, what if we sum GH, BioC? How many would be?")
```

A little weird the effect in the 2014. Let’s drop some past

information and create some auxiliar variables.

```
data <- filter(data, year(date) >= 2013)
date_first <- first(data$date)
date_last <- last(data$date)
```

To use the package we need first a time series object:

```
z <- zooreg(data$cumsum, start = date_first, frequency = 1)
tail(z)
```

```
## 2017-01-02 2017-01-03 2017-01-04 2017-01-05 2017-01-06 2017-01-07
## 9710 9739 9767 9804 9850 9858
```

Now we can use the `forecastHybrid::hybridModel`

function. In this case

I removed the `tbats`

model due the long time to fit, the long time to

make CV and the long long time to make the predictions (in my previous tests).

So, in the spirit to be parsimonious and KISS we will remove this model

from the fit.

```
# hm <- hybridModel(z, models = "aefns", weights = "cv.errors", errorMethod = "MASE")
# saveRDS(hm, "data/rstatsgoes10k/hm.rds")
hm <- readRDS("data/rstatsgoes10k/hm.rds")
hm
```

```
## Hybrid forecast model comprised of the following models: auto.arima, ets, thetam, nnetar
## ############
## auto.arima with weight 0.368
## ############
## ets with weight 0.37
## ############
## thetam with weight 0.2
## ############
## nnetar with weight 0.061
```

It is really simple to get the forecasts. After the calculate them we will

create a `data_frame`

to filter and see what day R will have 10k

packages according this methodology.

```
H <- 20
fc <- forecast(hm, h = H)
data_fc <- fc %>%
as_data_frame() %>%
mutate(date = date_last + days(1:H)) %>%
clean_names() %>%
tbl_df()
```

So let’t see the point forecast and the *optimistic* prediction

which is the upper limit from the 95% interval.

```
data_preds <- bind_rows(
data_fc %>%
filter(point_forecast >= 10000) %>%
mutate(name = "Prediction") %>%
head(1) %>%
rename(y = point_forecast),
data_fc %>%
filter(hi_95 >= 10000) %>%
mutate(name = "Optimitstic prediction") %>%
head(1) %>%
rename(y = hi_95)
)
select(data_preds, name, date, y)
```

name | date | y |
---|---|---|

Prediction | 2017-01-16 | 10008 |

Optimitstic prediction | 2017-01-11 | 10008 |

So soon!! (warning: according to this).

Now, let’s visualize the result.

```
highchart() %>%
hc_title(text = "rstatsgoes10k") %>%
hc_subtitle(text = "Predictions via
````forecastHybrid`

package", useHTML = TRUE) %>%
hc_xAxis(type = "datetime",
crosshair = list(zIndex = 5, dashStyle = "dot",
snap = FALSE, color = "gray"
)) %>%
hc_add_series(filter(data, date >= ymd(20161001)), "line", hcaes(date, cumsum),
name = "Packages") %>%
hc_add_series(data_fc, "line", hcaes(date, point_forecast),
name = "Prediction", color = "#75aadb") %>%
hc_add_series(data_fc, "arearange", hcaes(date, low = lo_95, high = hi_95),
name = "Prediction Interval (95%)", color = "#75aadb", fillOpacity = 0.3) %>%
hc_add_series(data_preds, "scatter", hcaes(date, y, name = name),
name = "Predicctions to 10K", color = "blue",
tooltip = list(pointFormat = ""), zIndex = -4,
marker = list(symbol = "circle", lineWidth= 1, radius= 4,
fillColor= "transparent", lineColor= NULL),
dataLabels = list(enabled = TRUE, format = "{point.name}

{point.x:%Y-%m-%d}",
style = list(fontWeight = "normal"))) %>%
hc_tooltip(shared = TRUE, valueDecimals = 0)

Simple, right? Maybe the that will not be the day where R hit 10k packages but its

doesn’t matter. The **really important fact here** is all this is product of many many

developers joined by the community, and some Rheroes like HW, JO, JB, GC, JC, BR, MA,

DR, JS, KR and many others who have astonished us package by package or show our work

in the social media . Thanks to everybody I can using this versatile and powerful language

happily day by day.

**leave a comment**for the author, please follow the link and comment on their blog:

**Jkunst - R category**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...