My KISS Attempt to rstatsgoes10k Contest

January 6, 2017
By

(This article was first published on Jkunst - R category, and kindly contributed to R-bloggers)

Last year eoda.de
launched a contest to predict when the R packages will be 10k. So this
is a really good opportunity to use (finally) the forecastHybrid package developed by
Peter Ellis and David Shaub.

This will be a really KISS-naive-simply-raw try to get a
reasonable prediction. No transformations, no CV. etc. But you can do better!
The writer.

Let’s load the packages!

library(dplyr)
library(rvest)
library(janitor)
library(lubridate)
library(highcharter)
library(forecast)
library(forecastHybrid)
library(zoo)
options(highcharter.theme = hc_theme_smpl())

The data will be extracted from the list of packages by date from CRAN.
Then we’ll make some wrangling to get the cumulative sum of the packages
by day.

packages <- "https://cran.r-project.org/web/packages/available_packages_by_date.html" %>% 
  read_html() %>% 
  html_table() %>%
  .[[1]] %>% 
  tbl_df()

names(packages) <- tolower(names(packages))

packages <- mutate(packages, date = ymd(date))

glimpse(packages)
## Observations: 9,858
## Variables: 3
## $ date     2017-01-07, 2017-01-07, 2017-01-07, 2017-01-07, 2017-...
## $ package  "AER", "c212", "caseMatch", "clustRcompaR", "dat", "gd...
## $ title    "Applied Econometrics with R", "Methods for Detecting ...
c(min(packages$date), max(packages$date))
## [1] "2005-10-29" "2017-01-07"
data <- packages %>% 
  group_by(date) %>% 
  summarise(n = n()) %>% 
  left_join(data_frame(date = seq(min(packages$date), max(packages$date), by = 1)),
            ., by = "date")

data <- data %>% 
  mutate(
    n = ifelse(is.na(n), 0, n),
    cumsum = cumsum(n)
  )

tail(data)
date n cumsum
2017-01-02 14 9710
2017-01-03 29 9739
2017-01-04 28 9767
2017-01-05 37 9804
2017-01-06 46 9850
2017-01-07 8 9858
hchart(data, "line", hcaes(date, cumsum)) %>% 
  hc_title(text = "Just in CRAN, what if we sum GH, BioC? How many would be?")

open

A little weird the effect in the 2014. Let’s drop some past
information and create some auxiliar variables.

data <- filter(data, year(date) >= 2013)

date_first <- first(data$date)
date_last <- last(data$date)

To use the package we need first a time series object:

z <- zooreg(data$cumsum, start = date_first, frequency = 1)
tail(z)
## 2017-01-02 2017-01-03 2017-01-04 2017-01-05 2017-01-06 2017-01-07 
##       9710       9739       9767       9804       9850       9858

Now we can use the forecastHybrid::hybridModel function. In this case
I removed the tbats model due the long time to fit, the long time to
make CV and the long long time to make the predictions (in my previous tests).
So, in the spirit to be parsimonious and KISS we will remove this model
from the fit.

# hm <- hybridModel(z, models = "aefns", weights = "cv.errors", errorMethod = "MASE")
# saveRDS(hm, "data/rstatsgoes10k/hm.rds")
hm <- readRDS("data/rstatsgoes10k/hm.rds")
hm
## Hybrid forecast model comprised of the following models: auto.arima, ets, thetam, nnetar
## ############
## auto.arima with weight 0.368 
## ############
## ets with weight 0.37 
## ############
## thetam with weight 0.2 
## ############
## nnetar with weight 0.061

It is really simple to get the forecasts. After the calculate them we will
create a data_frame to filter and see what day R will have 10k
packages according this methodology.

H <- 20
fc <- forecast(hm, h = H)

data_fc <- fc %>% 
  as_data_frame() %>% 
  mutate(date = date_last + days(1:H)) %>% 
  clean_names() %>% 
  tbl_df()

So let’t see the point forecast and the optimistic prediction
which is the upper limit from the 95% interval.

data_preds <- bind_rows(
  data_fc %>%
    filter(point_forecast >= 10000) %>%
    mutate(name = "Prediction") %>%
    head(1) %>% 
    rename(y = point_forecast),
  data_fc %>%
    filter(hi_95 >= 10000) %>%
    mutate(name = "Optimitstic prediction") %>%
    head(1) %>% 
    rename(y = hi_95)
)

select(data_preds, name, date, y)
name date y
Prediction 2017-01-16 10008
Optimitstic prediction 2017-01-11 10008

So soon!! (warning: according to this).

Now, let’s visualize the result.

highchart() %>% 
  hc_title(text = "rstatsgoes10k") %>%
  hc_subtitle(text = "Predictions via forecastHybrid package", useHTML = TRUE) %>% 
  hc_xAxis(type = "datetime",
           crosshair = list(zIndex = 5, dashStyle = "dot",
                            snap = FALSE, color = "gray"
           )) %>% 
  hc_add_series(filter(data, date >= ymd(20161001)), "line", hcaes(date, cumsum),
                name = "Packages") %>% 
  hc_add_series(data_fc, "line", hcaes(date, point_forecast),
                name = "Prediction", color = "#75aadb") %>% 
  hc_add_series(data_fc, "arearange", hcaes(date, low = lo_95, high = hi_95),
                name = "Prediction Interval (95%)", color = "#75aadb", fillOpacity = 0.3) %>% 
  hc_add_series(data_preds, "scatter", hcaes(date, y, name = name),
                name = "Predicctions to 10K", color = "blue",
                tooltip = list(pointFormat = ""), zIndex = -4,
                marker = list(symbol = "circle", lineWidth= 1, radius= 4,
                              fillColor= "transparent", lineColor= NULL),
                dataLabels = list(enabled = TRUE, format = "{point.name}
{point.x:%Y-%m-%d}"
, style = list(fontWeight = "normal"))) %>% hc_tooltip(shared = TRUE, valueDecimals = 0)

open

Simple, right? Maybe the that will not be the day where R hit 10k packages but its
doesn’t matter. The really important fact here is all this is product of many many
developers joined by the community, and some Rheroes like HW, JO, JB, GC, JC, BR, MA,
DR, JS, KR and many others who have astonished us package by package or show our work
in the social media . Thanks to everybody I can using this versatile and powerful language
happily day by day.

To leave a comment for the author, please follow the link and comment on their blog: Jkunst - R category.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)