by Yaniv Mor, Co-founder & CEO of Xplenty
How do you get Big Data ready for R? Gigabytes or terabytes of raw data may need to be combined, cleaned, and aggregated before they can be analyzed. Processing such large amounts of data used to require installing Hadoop on a cluster of servers, not to mention coding MapReduce jobs in Pig or Java. Those days are over.
This post shows how raw data can be prepared for analysis in R without any code or server installations. Instead, we’ll use Xplenty’s data integration-as-a-service to design a data flow, create a cluster, and run the job, all via a friendly user interface.
For this demo we’ll use 1.5 GB of raw web logs (uncompressed) from the servers that hosted the “Star Wars Kid” video. A remix of the video was also hosted there, along with the usual fare of HTML pages, images, and more. Here’s an example log line:
22.214.171.124 - - [11/Apr/2003:12:36:39 -0700] "GET /archive/2003/04/03/typo_pop.shtml HTTP/1.1" 200 28361 "http://www.kottke.org/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.0.3705)"
Log line format:
- Source IP/domain
- User Identifier (blank)
- UserID (blank)
- Date - in the format of dd/MMM/yyyy:HH:mm:ss Z
- HTTP request - type, URL, HTTP version
- HTTP code
- Bytes transferred
- Referrer
- User agent
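To make the format concrete, here is a minimal R sketch that splits the sample line into its fields with a regular expression. It is only an illustration, not how Xplenty parses the data: the pattern, the field names, and the helper `parse_log_line` are assumptions made for this example, and the user agent is assumed to be wrapped in double quotes. Note that the sample line also carries a referrer between the bytes and the user agent.

```r
# Sample line from the post, assuming the user agent is enclosed in quotes.
log_line <- paste0(
  '22.214.171.124 - - [11/Apr/2003:12:36:39 -0700] ',
  '"GET /archive/2003/04/03/typo_pop.shtml HTTP/1.1" 200 28361 ',
  '"http://www.kottke.org/" ',
  '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.0.3705)"'
)

# One capture group per field (pattern is an assumption for this sketch).
log_pattern <- paste0(
  '^(\\S+) (\\S+) (\\S+) \\[([^]]+)\\] ',   # ip, identifier, user id, date
  '"(\\S+) (\\S+) (\\S+)" ',                # request type, URL, HTTP version
  '(\\d{3}) (\\S+) ',                       # HTTP code, bytes transferred
  '"([^"]*)" "([^"]*)"$'                    # referrer, user agent
)

# Hypothetical helper: returns a named list of fields, or NULL if no match.
parse_log_line <- function(line) {
  m <- regmatches(line, regexec(log_pattern, line))[[1]]
  if (length(m) == 0) return(NULL)
  fields <- m[-1]  # drop the full match, keep the capture groups
  names(fields) <- c('ip', 'ident', 'user', 'date', 'method', 'url',
                     'http_version', 'status', 'bytes', 'referrer', 'user_agent')
  as.list(fields)
}

parsed <- parse_log_line(log_line)

# dd/MMM/yyyy:HH:mm:ss Z corresponds to this strptime format in R.
# %b is locale-dependent, so this assumes English month abbreviations.
timestamp <- as.POSIXct(strptime(parsed$date, '%d/%b/%Y:%H:%M:%S %z', tz = 'UTC'))

print(parsed[c('ip', 'date', 'url', 'status', 'referrer')])
```

Running a regular expression like this over 1.5 GB of logs on a single machine is exactly the kind of work the dataflow is meant to take off your hands.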
Let’s say we would only like to analyze requests to the original “Star Wars Kid” video by source IP, date, and referrer. Imagine what it would be like to set up the servers and write the code - the hours spent writing and debugging a relatively simple dataflow. Feel the stress building? Let it go. Here’s what such a dataflow looks like in Xplenty:
Let’s take a closer look at how it works:
Source - loads the data from Amazon S3 and splits it into fields. The data is publicly available on S3 at xplenty.public/weblogs/star_wars_kid.log.gz. If you’d like to take a look at the data, download it via the web, or create an AWS account and use a tool such as S3Browser to access the above path.
Select - keeps only the ip, date, url, and referrer fields and leaves the rest of the data out. Note that the date field also contains the time, and that the request field also contains the request type and HTTP version. Both are cleaned in the select component using regular expressions.
Filter - matches Star_Wars_Kid.wmv in the URL field and removes any other log lines.
Destination - stores the results back into Amazon S3.
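For readers who think in R, the Select and Filter steps can be sketched on a tiny in-memory sample. This is an illustration under stated assumptions, not Xplenty’s implementation: the three records below are hypothetical (the timestamps of the video requests are made up), and the regular expressions are one plausible way to strip the time from the date and the request type and HTTP version from the request.

```r
# Hypothetical parsed records: one page request and two video requests.
logs <- data.frame(
  ip = c('22.214.171.124', '126.96.36.199', '188.8.131.52'),
  date = c('11/Apr/2003:12:36:39 -0700',
           '09/May/2003:10:15:00 -0700',
           '09/May/2003:10:16:30 -0700'),
  request = c('GET /archive/2003/04/03/typo_pop.shtml HTTP/1.1',
              'GET /random/video/Star_Wars_Kid.wmv HTTP/1.1',
              'GET /random/video/Star_Wars_Kid.wmv HTTP/1.0'),
  referrer = c('http://www.kottke.org/', 'http://www.waxy.org/', '-'),
  stringsAsFactors = FALSE
)

# "Select": keep four fields and clean two of them with regular expressions.
selected <- data.frame(
  ip = logs$ip,
  # Drop everything from the first colon on, leaving dd/MMM/yyyy.
  date = sub(':.*$', '', logs$date),
  # Keep only the middle token of "TYPE URL VERSION".
  url = sub('^\\S+ (\\S+) \\S+$', '\\1', logs$request),
  referrer = logs$referrer,
  stringsAsFactors = FALSE
)

# "Filter": keep only rows whose URL contains the video file name.
videos <- selected[grepl('Star_Wars_Kid.wmv', selected$url, fixed = TRUE), ]

print(videos)
```

The filter uses `fixed = TRUE` so that the dot in the file name is matched literally rather than as a regular expression wildcard.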
The results - about 120 MB (uncompressed) of log lines for video file requests, with IPs, dates, URLs, and referrers, now ready for analysis. Job running time - about 3 minutes. The full results are available in the xplenty.dumpster bucket at starwarskid/videos.gz. Here are a few sample lines:
126.96.36.199 09/May/2003 /random/video/Star_Wars_Kid.wmv http://www.waxy.org/
188.8.131.52 09/May/2003 /random/video/Star_Wars_Kid.wmv -
184.108.40.206 09/May/2003 /random/video/Star_Wars_Kid.wmv http://www.kuro5hin.org/story/2003/5/2/16116/46048
220.127.116.11 09/May/2003 /random/video/Star_Wars_Kid.wmv http://www.waxy.org/archive/2003/04/29/star_war.shtml
18.104.22.168 09/May/2003 /random/video/Star_Wars_Kid.wmv http://www.waxy.org/
Now, we can finally analyze the data in R. Here’s sample code which generates a traffic graph by date for Star_Wars_Kid.wmv:
library(ggplot2)

# Load the job output; fill = TRUE pads rows that are missing fields
df <- read.table('star-wars-kid.tsv', fill = TRUE)
colnames(df) <- c('ip', 'date', 'url', 'referrer')
# Parse dates such as 09/May/2003 (%b assumes English month names)
df$date <- as.Date(df$date, "%d/%b/%Y")
# Count requests per day
reqs <- as.data.frame(table(df$date))
ggplot(data = reqs, aes(x = as.Date(Var1), y = Freq)) +
  geom_line() +
  xlab('Date') + ylab('Requests') +
  ggtitle('Traffic to Star Wars Kid Video')
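Since the goal was to look at requests by referrer as well as by date, the same data frame can be summarized in a few more lines. The sketch below uses only base R and the five sample lines shown earlier, so the counts are illustrative and say nothing about the full result set; the column names are the same ones assumed in the plotting code.

```r
# The five sample result lines from the post, inlined for a self-contained example.
sample_lines <- c(
  '126.96.36.199 09/May/2003 /random/video/Star_Wars_Kid.wmv http://www.waxy.org/',
  '188.8.131.52 09/May/2003 /random/video/Star_Wars_Kid.wmv -',
  '184.108.40.206 09/May/2003 /random/video/Star_Wars_Kid.wmv http://www.kuro5hin.org/story/2003/5/2/16116/46048',
  '220.127.116.11 09/May/2003 /random/video/Star_Wars_Kid.wmv http://www.waxy.org/archive/2003/04/29/star_war.shtml',
  '18.104.22.168 09/May/2003 /random/video/Star_Wars_Kid.wmv http://www.waxy.org/'
)

df <- read.table(text = sample_lines,
                 col.names = c('ip', 'date', 'url', 'referrer'),
                 stringsAsFactors = FALSE)

# Count requests per referrer, most frequent first ('-' means no referrer).
referrer_counts <- sort(table(df$referrer), decreasing = TRUE)

print(head(referrer_counts, 3))
```

On the full output, the same `table()` call would rank the sites that sent the most traffic to the video.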
Additional components could easily be added to the dataflow for joining several sources, sorting data, extracting strings with regular expressions, and more. The same dataflow could be used to process even 1.5 TB of data, or a directory that contains many big files. Would you like to prepare your data for analysis in R? Get a free Xplenty account and start crunching your data.