Preparing Big Data for Analysis in R

July 15, 2014
(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Yaniv Mor, Co-founder & CEO of Xplenty

How do you get Big Data ready for R? Gigabytes or terabytes of raw data may need to be combined, cleaned, and aggregated before they can be analyzed. Processing such large amounts of data used to require installing Hadoop on a cluster of servers, not to mention coding MapReduce jobs in Pig or Java. Those days are over.

This post is going to show how raw data can be prepared for analysis in R without any code or server installations. Instead, we’ll use Xplenty’s data integration-as-a-service to design a data flow, create a cluster, and run the job all via a friendly user interface.

For this demo we’ll use 1.5 GB of raw web logs (uncompressed) from the servers that hosted the “Star Wars Kid” video. A remix of the video was also hosted there, along with the usual fare of HTML pages, images, and more. Here’s an example log line:

208.63.63.94 - - [11/Apr/2003:12:36:39 -0700] "GET /archive/2003/04/03/typo_pop.shtml HTTP/1.1" 200 28361 "http://www.kottke.org/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.0.3705)"

Log line format:

  1. Source IP/domain
  2. User Identifier (blank)
  3. UserID (blank)
  4. Date - in the format of dd/MMM/yyyy:HH:mm:ss Z
  5. HTTP request - type, URL, HTTP version
  6. HTTP code
  7. Bytes transferred
  8. Referrer
  9. User agent
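To see what splitting those fields actually involves, here is a minimal R sketch that parses one log line with a single regular expression. The helper name `parse_log_line` and the field names are mine, not part of Xplenty’s product:

```r
# Parse one web log line into named fields (hypothetical helper).
parse_log_line <- function(line) {
  pattern <- paste0(
    '^(\\S+) (\\S+) (\\S+) \\[([^]]+)\\] ',  # ip, identifier, userid, date
    '"([^"]*)" (\\d{3}) (\\S+) ',            # request, status, bytes
    '"([^"]*)" "?(.*?)"?$'                   # referrer, user agent
  )
  m <- regmatches(line, regexec(pattern, line, perl = TRUE))[[1]]
  if (length(m) == 0) return(NULL)  # line did not match the format
  setNames(as.list(m[-1]),
           c("ip", "identifier", "userid", "date", "request",
             "status", "bytes", "referrer", "agent"))
}

line <- paste0('208.63.63.94 - - [11/Apr/2003:12:36:39 -0700] ',
               '"GET /archive/2003/04/03/typo_pop.shtml HTTP/1.1" ',
               '200 28361 "http://www.kottke.org/" ',
               '"Mozilla/4.0 (compatible; MSIE 6.0)"')
fields <- parse_log_line(line)
fields$ip      # "208.63.63.94"
fields$status  # "200"
```

Multiply this by every field, every edge case, and every malformed line in 1.5 GB of logs, and the appeal of a ready-made pipeline becomes clear.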

Let’s say we only want to analyze requests to the original “Star Wars Kid” video by source IP, date, and referrer. Imagine what it would be like to set up the servers and write the code, and the hours spent writing and debugging a relatively simple dataflow. Feel the stress building? Let it go. Here’s what such a dataflow looks like in Xplenty:

(Dataflow diagram: Source → Select → Filter → Destination)

Let’s take a closer look at how it works:

Source - loads the data from Amazon S3 and splits it into fields. The data is publicly available on S3 at xplenty.public/weblogs/star_wars_kid.log.gz. If you’d like to take a look at the data, download it via the web, or create an AWS account and use a tool such as S3Browser to access the above path.

Select - keeps only the ip, date, url, and referrer fields while leaving the rest of the data out. Note that the date field also contains the time, and that the request field also contains the request type and HTTP version. Both are cleaned in the select component using a regular expression.

Filter - matches Star_Wars_Kid.wmv in the URL field and removes any other log lines.

Destination - stores the results back into Amazon S3. 

No setup or installation is needed. A few clicks create a new cluster, and one more screen gets the job running.

The results - about 120 MB (uncompressed) of log lines for video file requests, with IPs, dates, URLs, and referrers now ready for analysis. Job running time - about 3 minutes. The full results are available in the xplenty.dumpster bucket at starwarskid/videos.gz. Here are a few sample lines:

66.142.89.235   09/May/2003     /random/video/Star_Wars_Kid.wmv     http://www.waxy.org/
63.195.36.218   09/May/2003     /random/video/Star_Wars_Kid.wmv -
66.27.235.199   09/May/2003     /random/video/Star_Wars_Kid.wmv     http://www.kuro5hin.org/story/2003/5/2/16116/46048
24.81.67.79     09/May/2003     /random/video/Star_Wars_Kid.wmv     http://www.waxy.org/archive/2003/04/29/star_war.shtml
12.149.141.14   09/May/2003     /random/video/Star_Wars_Kid.wmv     http://www.waxy.org/

Now, we can finally analyze the data in R. Here’s sample code which generates a traffic graph by date for Star_Wars_Kid.wmv:

library(ggplot2)

# Load the prepared log lines and name the columns
df <- read.table('star-wars-kid.tsv', fill = TRUE)
colnames(df) <- c('ip', 'date', 'url', 'referrer')

# Parse dates such as 09/May/2003
df$date <- as.Date(df$date, "%d/%b/%Y")

# Count requests per day and plot traffic over time
reqs <- as.data.frame(table(df$date))
ggplot(data = reqs, aes(x = as.Date(Var1), y = Freq)) +
  geom_line() +
  xlab('Date') + ylab('Requests') +
  ggtitle('Traffic to Star Wars Kid Video') +
  theme(legend.position = 'none')

(Plot: daily requests for Star_Wars_Kid.wmv over time)

Additional components could easily be added to the dataflow for joining several sources, sorting data, extracting strings with regular expressions, and more. The same dataflow could be used to process 1.5 TB of data, or a directory that contains many big files. Would you like to prepare your data for analysis in R? Get a free Xplenty account and start crunching your data.
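As a small taste of the string-extraction idea, here is how referrer hosts could be pulled out in R to tabulate where the traffic came from. The sample values match the output lines above; the rest is illustrative:

```r
# Extract the referrer's host so traffic sources can be tabulated.
referrers <- c("http://www.waxy.org/",
               "http://www.kuro5hin.org/story/2003/5/2/16116/46048",
               "-")  # "-" marks a request with no referrer
host <- sub("^https?://([^/]+).*$", "\\1", referrers)
host[!grepl("^https?://", referrers)] <- NA  # no-referrer lines
table(host)
```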
