A new open source data set for anomaly detection

Posted on March 31, 2015 by Rob J Hyndman in R bloggers | 0 Comments

[This article was first published on Hyndsight » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Yahoo Labs has just released an interesting new data set useful for research on detecting anomalies (or outliers) in time series data. There are many contexts in which anomaly detection is important. For Yahoo, the main use case is in detecting unusual traffic on Yahoo servers.

The data set comprises real traffic to Yahoo services, along with some synthetic data. There are 367 time series in the data set, each of which contains between 741 and 1680 observations recorded at regular intervals. Each series is accompanied by an indicator series with a 1 if the observation was an anomaly, and 0 otherwise. The anomalies in the real data were determined by human judgement, while those in the synthetic data were generated algorithmically. For the synthetic data, some information about the components used to construct the data is also provided.

Although the Yahoo announcement claims that the data are publicly available, in fact they are only available to people with an edu address. Further, you have to apply to use them, and it takes about 24 hours before approval is granted. I have suggested that they remove these restrictions, and make the data available without restriction to anyone who wants to use them.

Research on anomaly detection in time series seems to be growing in popularity. Twitter has also released their own Anomaly Detection R package. Their approach has some similarities with my own tsoutliers function in the forecast package. The tso function in the tsoutliers package is another approach to the same problem.

Hopefully having a large public data set available will lead to improvements in time series outlier detection methods, at least for detecting outliers in internet traffic data.

To leave a comment for the author, please follow the link and comment on their blog: Hyndsight » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

A new open source data set for anomaly detection

Related

Related

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)