Big Data, R and HANA: Analyze 200 Million Data Points and Later Visualize in HTML5 Using D3 – Part II

April 25, 2012
By

(This article was first published on All Things R, and kindly contributed to R-bloggers)


In my last blog, Big Data, R and SAP HANA: Analyze 200 Million Data Points and Later Visualize Using Google Maps, I analyzed historical airlines performance data set using R and SAP HANA and put the aggregated analysis on Google Maps.  Undoubtedly, Map is a pretty exciting canvas to view and analyze big data sets. One could draw shapes (circles, polygons) on the map under a marker pin, providing pin-point information and display aggregated information in the info-window when a marker is clicked.  So I enjoyed doing all of that, but I was craving for some old fashion bubble charts and other types of charts to provide comparative information on big data sets.  Ultimately, all big data sets get aggregated into smaller analytical sets for viewing, sharing and reporting.  An old fashioned chart is the best way to tell a visual story!

On bubble charts, one could display four dimensional data for comparative analysis. In this blog analysis, I used the same data-set which had 200M data points and went deeper looking at finer slices of information.  I leveraged D3, R and SAP HANA for this blog post.  Here I am publishing some of this work:  

In this first graphics, the performance of top airlines is compared for 2008.  As expected, Southwest, the largest airlines (when using total number of flights as a proxy), performed well for its size (1.2M flights, 64 destinations but average delay was ~10 mins.)  Some of the other airlines like American and Continental were the worst performers along with Skywest.  Note, I didn't remove outliers from this analysis.  Click here to interact with this example (view source to get D3 code).


In the second analysis, I replaced airlines dimension with airports dimension but kept all the other dimensions the same.  To my disbelief, Newark airport is the worst performing airport when it comes to departure delays.  Chicago O'Hare, SFO and JFK follow.  Atlanta airport is the largest airport but it has the best performance. What are they doing differently at ATL?  Click here to interact with this example (view source to get D3 code).


It was hell of a fun playing with D3, R and HANA, good intellectual stimulation if nothing else!  Happy Analyzing and remember possibilities are endless!

As always, my R modules are fairly simple and straightforward:
###########################################################################################  
#ETL - Read the AIRPORT Information, get major aiport informatoin extracted and upload this 
#transfromed dataset into HANA
###########################################################################################
major.airports <- data.table(read.csv("MajorAirports.csv",  header=TRUE, sep=",", stringsAsFactors=FALSE))
setkey(major.airports, iata)

all.airports <- data.table(read.csv("AllAirports.csv",  header=TRUE, sep=",", stringsAsFactors=FALSE)) 
setkey(all.airports, iata)

airports.2008.hp <- data.table(read.csv("2008.csv",  header=TRUE, sep=",", stringsAsFactors=FALSE)) 
setkey(airports.2008.hp, Origin, UniqueCarrier)

#Merge two datasets
airports.2008.hp <- major.airports[airports.2008.hp,]


###########################################################################################  
# Get airport statisitics for all airports
###########################################################################################
airports.2008.hp.summary <- airports.2008.hp[major.airports,     
    list(AvgDepDelay=round(mean(DepDelay, na.rm=TRUE), digits=2),
    TotalMiles=prettyNum(sum(Distance, na.rm=TRUE), big.mark=","),
    TotalFlights=length(Month),
    TotalDestinations=length(unique(Dest)),
    URL=paste("http://www.fly", Origin, ".com",sep="")), 
                    by=list(Origin)][order(-TotalFlights)]
setkey(airports.2008.hp.summary, Origin)
#merge two data tables
airports.2008.hp.summary <- major.airports[airports.2008.hp.summary, 
                                                     list(Airport=airport, 
                                                          AvgDepDelay, TotalMiles, TotalFlights, TotalDestinations, 
                                                          Address=paste(airport, city, state, sep=", "), 
                                                          Lat=lat, Lng=long, URL)][order(-TotalFlights)]

airports.2008.hp.summary.json <- getRowWiseJson(airports.2008.hp.summary)
writeLines(airports.2008.hp.summary.json, "airports.2008.hp.summary.json")                 
write.csv(airports.2008.hp.summary, "airports.2008.hp.summary.csv", row.names=FALSE)

To leave a comment for the author, please follow the link and comment on his blog: All Things R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.