# Importing a log file with rxImport()

May 1, 2014
By

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Joseph Rickert

Tuesday's post on a new Kaggle contest mentioned that Revolution Analytics offers a free trial for using Revolution R Enterprise in the Amazon cloud. One reason this might be of interest to contestants is the rxImport() function which reads delimited text data, fixed format text data, and with an appropriate ODBC driver, data stored in a database. (rxImport() also directly reads SAS and SPSS files, but I'm guessing that this feature is not lilely to be of interest to contestants). As it turns out, rxImport()is also useful for dealing for semistructure text data such as log files. For example, here are the first three lines of internet log file complements of gVim.

190.12.51.140 - - [24/Feb/2013:01:44:32 -0600] "GET /bin/macosx/leopard/contrib/2.12/FGN_1.5.tgz HTTP/1.0" 200 510166 "-" "R (2.12.2 x86_64-apple-darwin9.8.0 x86_64 darwin9.8.0)"
190.12.51.140 - - [24/Feb/2013:01:44:39 -0600] "GET /bin/macosx/leopard/contrib/2.12/fgui_1.0-2.tgz HTTP/1.0" 200 404275 "-" "R (2.12.2 x86_64-apple-darwin9.8.0 x86_64 darwin9.8.0)"
190.12.51.140 - - [24/Feb/2013:01:44:45 -0600] "GET /bin/macosx/leopard/contrib/2.12/fields_6.6.tgz HTTP/1.0" 200 2852202 "-" "R (2.12.2 x86_64-apple-darwin9.8.0 x86_64 darwin9.8.0)"

It is not quite space delimited, but it appears that the spaces may be useful. Indeed, reading the first five lines of the file, one line at a time, using just default parameters for rxImport()

# Point to file
#-------------------------------------------
# Read 5 rows to see how rxImport handles things
rxImport(inData=file,outFile="test",
numRows=5,
overwrite=TRUE)

rxGetInfo(data="test",getVarInfo=TRUE,numRows=2)

produces a binary XDF file in the following form:

Number of observations: 5
Number of variables: 10
Number of blocks: 5
Compression type: zlib
Variable information:
Var 1: V1, Type: character
Var 2: V2, Type: integer, Low/High: (NA, NA)
Var 3: V3, Type: integer, Low/High: (NA, NA)
Var 4: V4, Type: character
Var 5: V5, Type: character
Var 6: V6, Type: character
Var 7: V7, Type: integer, Low/High: (200, 404)
Var 8: V8, Type: integer, Low/High: (1051, 2852202)
Var 9: V9, Type: character
Var 10: V10, Type: character
Data (2 rows starting with row 1):
V1 V2 V3                    V4     V5
1 190.12.51.140 NA NA [24/Feb/2013:01:44:32 -0600]
2 190.12.51.140 NA NA [24/Feb/2013:01:44:39 -0600]
V6  V7     V8 V9
1    GET /bin/macosx/leopard/contrib/2.12/FGN_1.5.tgz HTTP/1.0 200 510166  -
2 GET /bin/macosx/leopard/contrib/2.12/fgui_1.0-2.tgz HTTP/1.0 200 404275  -
V10
1 R (2.12.2 x86_64-apple-darwin9.8.0 x86_64 darwin9.8.0)
2 R (2.12.2 x86_64-apple-darwin9.8.0 x86_64 darwin9.8.0)

Not perfect - but clearly useful! Moreover, an added bonus is that rxImport() assigns variable names to the columns: "V1", "V2", etc. which can be used in the import process. The following code, imports the file and does a bit of cleaning along the way, removing some columns and renaming others.

# Import data
colX <- list("V1" = list(type="character",newName = "IP"),
"V7" = list(type="character", newName = "Status"),
"V8" = list(type="integer", newName = "NoClue"),
"V10" = list(type="character", newName = "R_version"))

rxImport(inData=file,outFile="logData",
colInfo=colX,
varsToDrop=c("V2","V3","V9"),
transformVars = c("V4","V5"),
transforms=list(Date = substr(V4,2,12),
UTC = substr(V4,14,21),
Offset = as.numeric(substr(V5,1,5))),
overwrite=TRUE)

rxGetInfo(data="logData",getVarInfo=TRUE,numRows=2)

Notice that the "transforms" parameter is doing some elementary text processing on each chunk of data that is being read. The output from this step looks like:

IP                    V4     V5
1 190.12.51.140 [24/Feb/2013:01:44:32 -0600]
2 190.12.51.140 [24/Feb/2013:01:44:39 -0600]
V6 Status NoClue
1    GET /bin/macosx/leopard/contrib/2.12/FGN_1.5.tgz HTTP/1.0    200 510166
2 GET /bin/macosx/leopard/contrib/2.12/fgui_1.0-2.tgz HTTP/1.0    200 404275
R_version        Date      UTC
1 R (2.12.2 x86_64-apple-darwin9.8.0 x86_64 darwin9.8.0) 24/Feb/2013 01:44:32
2 R (2.12.2 x86_64-apple-darwin9.8.0 x86_64 darwin9.8.0) 24/Feb/2013 01:44:39
Offset
1   -600
2   -600

Now we are getting somewhere. We can use the rxDataStep() function to remove the columns V4 and V5, which are no longer needed, and further process the data. The following code uses a transform function in the data step to break apart the V6 column into some meaningful fields.

rxDataStep(inData="logData",outFile="logData_2",
varsToDrop=c("V4","V5"),
transformVars = c("V6"),
transformFunc = function(data) {
temp <- unlist(strsplit(data$V6, ' ')); temp.1 <- seq(from = 1, to = length(temp), by = 3); temp.2 <- seq(from = 2, to = length(temp), by = 3); temp.3 <- seq(from = 3, to = length(temp), by = 3); data$Command <- temp[temp.1];
data$File <- temp[temp.2]; data$Protocol <- temp[temp.3];
data },
overwrite=TRUE)

IP Status NoClue
1 190.12.51.140    200 510166
2 190.12.51.140    200 404275
R_version        Date      UTC
1 R (2.12.2 x86_64-apple-darwin9.8.0 x86_64 darwin9.8.0) 24/Feb/2013 01:44:32
2 R (2.12.2 x86_64-apple-darwin9.8.0 x86_64 darwin9.8.0) 24/Feb/2013 01:44:39
Offset Command                                            File Protocol
1   -600     GET    /bin/macosx/leopard/contrib/2.12/FGN_1.5.tgz HTTP/1.0
2   -600     GET /bin/macosx/leopard/contrib/2.12/fgui_1.0-2.tgz HTTP/1.0

The Transform function, transformFunc(), in the last block of code may look a bit mysterious. The key to understanding what it does is to realize that rxDataStep()reads a big file a chunk at a time. Each chunk holds the data in a list, and processing must take this structure into account. If the structure of the list is not clear, it is easy enough to print things out and take a look. The following code reads in 5 lines of the file 4 lines to a chunk and prints out the contents of the chunk.

# Look at what is going on in the chunks
rxImport(inData=file,outFile="test",
transformFunc = function(data) {
print(data);
# Internal variables can tell you aboutthe chunk
print(paste("chunk starts with row",.rxStartRow,"of file"));
print(paste("chunk number = ",.rxChunkNum));
print(paste("number of rows read = ",.rxNumRows));
data },
numRows=5, # only read 5 rows from the file
overwrite=TRUE) # overwrite the file if it exists

The code also points out some internal variables that may be useful in writing transform functions to process each chunk.

.rxStartRow contains the row of the file that begins the chunk.
.rxChunkNum contains the number of the chunk
.rxNumRows contains the number of rows in a chunk

to have a look at chunk printed out by the last block of code. In a future post, I'll look into squeezing some information out of this file.