A simple Big Data analysis using the RevoScaleR package in Revolution R

May 24, 2011

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

This post from Stephen Weller is part of a series from members of the Revolution Analytics Engineering team. Learn more about the RevoScaleR package, available free to academics as part of Revolution R Enterprise — ed.

The RevoScaleR package, installed with Revolution R Enterprise, offers parallel external memory algorithms that help R break through memory and performance limitations.

RevoScaleR contains:

  • The .xdf data file format, designed for fast processing of blocks of data, and
  • A growing number of external memory implementations of the statistical algorithms most commonly used with large data sets
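To make the idea of an external memory algorithm concrete, here is a minimal base R sketch (not RevoScaleR code). It computes a mean one block at a time, keeping only a running sum and count. The helper name chunked_mean is hypothetical, and the blocks are taken from an in-memory vector purely for illustration; a real external memory routine would read each block from disk.

```r
# Hypothetical helper: compute a mean block by block, holding only
# a running total and a running count between blocks.
chunked_mean <- function(x, rows_per_read) {
  # Assign each element to a block of at most rows_per_read elements.
  blocks <- split(x, ceiling(seq_along(x) / rows_per_read))
  total <- 0
  n <- 0
  for (block in blocks) {
    total <- total + sum(block)   # accumulate the sufficient statistics
    n <- n + length(block)
  }
  total / n
}

set.seed(1)
x <- rnorm(1000)
chunk_result <- chunked_mean(x, rows_per_read = 300)
all.equal(chunk_result, mean(x))
```

Because only the running totals survive from one block to the next, memory use depends on the block size rather than on the total number of rows.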

Here is a sample RevoScaleR analysis that uses a subset of the airline on-time data reported each month to the U.S. Department of Transportation (DOT) and Bureau of Transportation Statistics (BTS) by the 16 U.S. air carriers.  This data contains three columns: two numeric variables, ArrDelay and CRSDepTime, and a categorical variable, DayOfWeek. It is located in the SampleData folder of the RevoScaleR package, so you can easily run this example in your Revolution R Enterprise session.

  1. Import the sample airline data from a comma-delimited text file to an .xdf file.  When we import the data, we convert the string variable to a (categorical) factor variable using stringsAsFactors:

          inFile <- file.path(rxGetOption("sampleDataDir"), "AirlineDemoSmall.csv")

          rxTextToXdf(inFile = inFile, outFile = "airline.xdf",  stringsAsFactors = TRUE, rowsPerRead = 200000)
 
There are 600,000 rows in the data file. Specifying the rowsPerRead argument lets us read and write the data in 3 blocks of 200,000 rows each.
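The block arithmetic can be checked in plain R. This sketch only reproduces the row counts quoted above; the variable names are illustrative and it does not call RevoScaleR.

```r
n_rows <- 600000          # rows in the sample file, as stated above
rows_per_read <- 200000   # value passed as rowsPerRead

# Number of blocks needed to cover every row.
n_blocks <- ceiling(n_rows / rows_per_read)

# First and last row of each block; the last block is clipped to n_rows.
block_start <- seq(1, n_rows, by = rows_per_read)
block_end <- pmin(block_start + rows_per_read - 1, n_rows)

blocks <- data.frame(block = seq_len(n_blocks),
                     start = block_start,
                     end = block_end)
blocks
```

If the row count were not an exact multiple of rows_per_read, the final block would simply be shorter than the others.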

  2. View basic data information. The rxGetInfoXdf function allows you to quickly view some basic information about the data set and variables.
 
      rxGetInfoXdf("airline.xdf", getVarInfo = TRUE, numRows = 20)
 
Setting the 'numRows' argument allows you to retrieve and display the first portion of the data.
 
  3. Explore the data. Use the rxHistogram function to show the distribution of arrival delay by day of the week:
 
            rxHistogram( ~ ArrDelay|DayOfWeek, data = "airline.xdf")
 
Next, we compute summary statistics for the arrival delay variable:
 
            rxSummary( ~ ArrDelay, data = "airline.xdf")
 
  4. Estimate a linear model. Next, we fit a linear model in RevoScaleR using the 'rxLinMod()' function, passing as input the newly created .xdf data file. The purpose of fitting the model is to compute group means of arrival delay for each scheduled departure hour, separately for weekdays and weekends. We subsequently use this information to create a 'lattice-style' conditioned line plot of the data.

We use the RevoScaleR 'F()' function here, which tells the rxLinMod() function to treat a variable as a 'factor' variable. We also use the ability to create new variables "on-the-fly" by using the transforms argument to create the variable "Weekend":

test.linmod.fit <- rxLinMod(ArrDelay ~  F(Weekend) : F(CRSDepTime),
  transforms=list(Weekend = (DayOfWeek == "Saturday") | (DayOfWeek == "Sunday")),
  cube = TRUE, data = "airline.xdf")
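For readers without RevoScaleR at hand, a rough base R analogue of the two ideas in this call may help. The toy data frame below is made up, and the sketch assumes that CRSDepTime is recorded as a decimal hour and that F() bins it by its integer part; both are assumptions for illustration, not statements about the package internals.

```r
# Made-up rows that mimic the three columns of the sample data.
toy <- data.frame(
  DayOfWeek  = factor(c("Monday", "Saturday", "Sunday", "Friday")),
  CRSDepTime = c(6.50, 6.75, 17.20, 17.90),
  ArrDelay   = c(4, -2, 30, 12)
)

# Analogue of the transforms argument: derive a logical Weekend flag.
toy$Weekend <- toy$DayOfWeek %in% c("Saturday", "Sunday")

# Assumed analogue of F(CRSDepTime): one factor level per whole hour.
toy$DepHour <- factor(floor(toy$CRSDepTime))

# Group means of arrival delay for each Weekend-by-hour cell.
cell_means <- aggregate(ArrDelay ~ Weekend + DepHour, data = toy, FUN = mean)
cell_means
```

The difference in practice is that base R needs the whole data frame in memory, whereas the transforms argument applies the expression to each block as it is read.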

The 'test.linmod.fit$countDF' component contains the group means and cell counts.  Since the independent variables in our regression were all categorical, the group means are the same as the coefficients.  We can do a quick check by taking the sum of the differences:

         linModDF <- test.linmod.fit$countDF
         sum(linModDF$ArrDelay - coef(test.linmod.fit))
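The equality of group means and coefficients can be reproduced with base R's lm() on simulated data. This is a sketch of the same principle, not of rxLinMod itself: the data are random, and the model is written as a cell-means model (no intercept, one coefficient per cell).

```r
set.seed(42)
toy <- data.frame(
  Weekend  = factor(rep(c(FALSE, TRUE), each = 6)),
  Hour     = factor(rep(c(6, 12, 18), times = 4)),
  ArrDelay = rnorm(12, mean = 10, sd = 5)
)

# One factor level per Weekend-by-Hour cell.
toy$cell <- interaction(toy$Weekend, toy$Hour, drop = TRUE)

# Cell-means model: dropping the intercept gives one coefficient per cell.
fit <- lm(ArrDelay ~ 0 + cell, data = toy)

# Group means computed directly, in the same level order as the coefficients.
cell_means <- tapply(toy$ArrDelay, toy$cell, mean)

# Largest absolute discrepancy between coefficients and group means.
max_abs_diff <- max(abs(unname(coef(fit)) - as.vector(cell_means)))
max_abs_diff
```

Taking the maximum absolute difference is a slightly stricter check than a plain sum, since positive and negative differences cannot cancel each other out.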

The output from our linear model estimation includes standard errors of the coefficient estimates.  We can use these to create confidence bounds around the estimated coefficients.  Let's add them as additional variables in our data frame:

     linModDF$coef.std.error <- as.vector(test.linmod.fit$coef.std.error)
     linModDF$lowerConfBound <- linModDF$ArrDelay - 2*linModDF$coef.std.error
     linModDF$upperConfBound <- linModDF$ArrDelay + 2*linModDF$coef.std.error
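The multiplier of 2 is a common rounding of the 0.975 quantile of the standard normal distribution, so these bounds are approximately a 95% interval when the normal approximation is reasonable. The short sketch below shows the arithmetic; the estimates and standard errors in it are made-up numbers, not output from the airline model.

```r
# Hypothetical estimates and standard errors, for illustration only.
est <- c(5.2, 12.8)
se  <- c(0.4, 3.1)

bounds <- data.frame(
  estimate = est,
  lower    = est - 2 * se,
  upper    = est + 2 * se
)
bounds

# The exact normal multiplier that the factor of 2 approximates.
z <- qnorm(0.975)
z
```

A cell with a large standard error, like the second row here, produces a visibly wider band than a well-populated cell.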

We'll make two more changes before exploring the data graphically: create an integer variable from the factor variable created by the F() function, and give labels to the "weekend" factor variable.
 
     linModDF$DepartureHour <- as.integer(levels(linModDF$F.CRSDepTime.))[linModDF$F.CRSDepTime.]
     levels(linModDF$F.Weekend.) <- c("Weekday", "Weekend")
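The indexing idiom used for DepartureHour is worth a closer look, because calling as.integer() directly on a factor returns its internal level codes rather than the values printed as labels. A small base R illustration with made-up values:

```r
# A factor whose labels are hours.
f <- factor(c(6, 12, 18, 12))

# Internal codes: positions of each value within levels(f).
codes <- as.integer(f)

# The idiom from the post: convert the labels, then index by the factor.
hours <- as.integer(levels(f))[f]

codes
hours

# Relabeling a two-level factor, as done for the weekend indicator.
w <- factor(c(FALSE, TRUE, FALSE))
levels(w) <- c("Weekday", "Weekend")
w
```

Relabeling through levels() relies on the existing level order, which here is "FALSE" followed by "TRUE".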

  5. Plot the results. We can use rxLinePlot to create a conditioned plot, with weekdays shown in one panel and weekends in the other. Here is the call to produce the line plot:

     rxLinePlot( lowerConfBound + upperConfBound + ArrDelay ~ DepartureHour | F.Weekend.,
          data = linModDF, lineColor = c("Blue1", "Blue2", "Red"),
          title = "Arrival Delay by Departure Hour: Weekdays and Weekends")
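A similar conditioned plot can be sketched with the lattice package, which is distributed with R as a recommended package. This is an analogue for readers who want to see the formula structure, not a description of how rxLinePlot works internally, and the data frame below contains made-up values shaped like linModDF.

```r
library(lattice)

# Hypothetical summary table shaped like linModDF (values are made up).
plot_df <- data.frame(
  DepartureHour = rep(c(6, 12, 18), times = 2),
  DayType       = factor(rep(c("Weekday", "Weekend"), each = 3)),
  ArrDelay      = c(2, 8, 15, 1, 6, 12)
)
plot_df$lowerConfBound <- plot_df$ArrDelay - 2
plot_df$upperConfBound <- plot_df$ArrDelay + 2

# Several responses on the left of ~ are drawn as separate lines;
# the term after | produces one panel per level of DayType.
p <- xyplot(lowerConfBound + upperConfBound + ArrDelay ~ DepartureHour | DayType,
            data = plot_df, type = "l",
            par.settings = list(superpose.line = list(col = c("blue", "blue", "red"))),
            main = "Arrival Delay by Departure Hour: Weekdays and Weekends")

# print(p) renders the plot on the active graphics device.
```

The object p is a trellis object, so nothing is drawn until it is printed.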

The line plot is informative, as it clearly shows that our estimates of arrival delays in the early hours of the morning are not very precise because of the small number of observations. 

[Figure: Arrival Delay by Departure Hour: Weekdays and Weekends]
