A new open source data set for anomaly detection

March 31, 2015

(This article was first published on Hyndsight » R, and kindly contributed to R-bloggers)

Yahoo Labs has just released an interesting new data set useful for research on detecting anomalies (or outliers) in time series data. There are many contexts in which anomaly detection is important. For Yahoo, the main use case is in detecting unusual traffic on Yahoo servers.

The data set comprises real traffic to Yahoo services, along with some synthetic data. There are 367 time series in the data set, each of which contains between 741 and 1680 observations recorded at regular intervals. Each series is accompanied by an indicator series with a 1 if the observation was an anomaly, and 0 otherwise. The anomalies in the real data were determined by human judgement, while those in the synthetic data were generated algorithmically. For the synthetic data, some information about the components used to construct the data is also provided.
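Because each series comes with a 0/1 indicator series, a detector's output can be scored directly against the labels. The following is a minimal Python sketch of that comparison. The function name `score_detector` and the example values are made up for illustration and are not taken from the Yahoo data.

```python
def score_detector(flags, labels):
    """Compare a detector's 0/1 flags with a 0/1 anomaly indicator series."""
    if len(flags) != len(labels):
        raise ValueError("flags and labels must have the same length")
    pairs = list(zip(flags, labels))
    tp = sum(1 for f, y in pairs if f == 1 and y == 1)  # anomalies found
    fp = sum(1 for f, y in pairs if f == 1 and y == 0)  # false alarms
    fn = sum(1 for f, y in pairs if f == 0 and y == 1)  # anomalies missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall}


# Hypothetical indicator series and detector output (illustrative values only).
labels = [0, 0, 1, 0, 0, 0, 1, 0]
flags = [0, 0, 1, 0, 1, 0, 0, 0]
result = score_detector(flags, labels)
print(result)
```

Precision and recall are used here rather than plain accuracy because anomalies are rare by nature, so a detector that never flags anything would still score well on accuracy.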

Although the Yahoo announcement claims that the data are publicly available, they are in fact available only to people with a .edu address. You also have to apply to use them, and approval takes about 24 hours. I have suggested that they remove these restrictions and make the data available to anyone who wants to use them.

Research on anomaly detection in time series seems to be growing in popularity. Twitter has also released its own AnomalyDetection R package. Its approach has some similarities with my own tsoutliers() function in the forecast package. The tso() function in the tsoutliers package is another approach to the same problem.
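To make the general idea concrete, one simple pattern for time series outlier detection is to smooth the series, take the residuals, and flag residuals that fall far outside their interquartile range. The Python sketch below illustrates that pattern only. It is not the code of any package mentioned above, and the function names, the running-median smoother, the window width and the multiplier `k` are all assumptions chosen for the example.

```python
from statistics import median


def quantile(sorted_vals, q):
    """Linear-interpolation quantile of an already sorted list."""
    pos = (len(sorted_vals) - 1) * q
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def flag_outliers(x, window=5, k=3.0):
    """Return 0/1 flags for points whose residual from a running median
    lies more than k interquartile ranges outside the quartiles."""
    half = window // 2
    # Running median; the window is truncated at both ends of the series.
    smooth = [median(x[max(0, i - half): i + half + 1]) for i in range(len(x))]
    resid = [xi - si for xi, si in zip(x, smooth)]
    ordered = sorted(resid)
    q1, q3 = quantile(ordered, 0.25), quantile(ordered, 0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [1 if (e < lower or e > upper) else 0 for e in resid]


# Made-up series with a single spike at position 6 (zero-based).
series = [10, 11, 10, 12, 11, 10, 50, 11, 12, 10, 11, 12, 10, 11]
flags = flag_outliers(series)
print(flags)  # only the spike at position 6 is flagged
```

A running median is used as the smoother because a single spike barely moves it, so the spike shows up clearly in the residuals. Real traffic data with trend and seasonality would need a more careful decomposition before a residual rule like this is applied.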

Hopefully having a large public data set available will lead to improvements in time series outlier detection methods, at least for detecting outliers in internet traffic data.
