Log File Analysis with R

February 21, 2012

(This article was first published on R-Chart, and kindly contributed to R-bloggers)


 

R often comes up in discussions of heavy-duty scientific and statistical analysis (and so it should). However, it is also incredibly handy for a variety of more routine developer activities. And so I give you… log file analysis with R!

I was just involved in the launch of gradesquare.com (go ahead – click on the link and check it out. We will still be here later!). With the flurry of recent activity, I needed a way to visualize and communicate site activity to the rest of the team. It only takes a few lines of R to read in a log file (of a reasonable size), format the data, and generate some usable charts. Like most good ideas - it is not new. Most log files follow a similar format (such as common log format), so you may need some minor variations on the following exercise for your own logs.

The only library that I used for this example was ggplot2 for charts.  
library(ggplot2)

Read the Log File
A sample of the log file (miserably wrapped - my apologies):


66.12.71.25 - - [21/Feb/2012 23:44:11] "GET /course/1894/detail HTTP/1.1" 200 7017 5.0829
66.12.71.21 - - [21/Feb/2012 23:44:39] "GET /search_by_author?search_learn_exp=Khan+Academy&page=193 HTTP/1.1" 200 8019 0.3288
66.12.71.25 - - [21/Feb/2012 23:45:21] "GET /course/19/detail HTTP/1.1" 200 6851 0.1213
18.4.5.14 - - [21/Feb/2012 23:45:59] "GET /search_by_subject?search_learn_exp=algebra-i-worked-examples HTTP/1.1" 200 7939 0.0370



If you can't make that out - just know that it is a relatively typical log file that includes the IP address of the client making the request, the date and time, the HTTP method and URL path, the HTTP response status code, a count of bytes returned and the time required to process the request.



The log file can be read into a data frame as follows.

df = read.table('webapp.log')

There are a lot of different options available - and you might want to take advantage of these to minimize the amount of additional cleanup required after loading the file.  For details:

help(read.table)
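As a rough sketch of what those options can do, the call below names the columns at read time and turns off a couple of defaults that can trip over URLs. It reads the four sample lines shown above through the text argument instead of a file, purely so the example is self-contained. The choice of quote, comment.char and stringsAsFactors is my assumption about what suits a log like this, not something the original setup required.

```r
# Sample lines copied from the log excerpt above.
log_lines <- c(
  '66.12.71.25 - - [21/Feb/2012 23:44:11] "GET /course/1894/detail HTTP/1.1" 200 7017 5.0829',
  '66.12.71.21 - - [21/Feb/2012 23:44:39] "GET /search_by_author?search_learn_exp=Khan+Academy&page=193 HTTP/1.1" 200 8019 0.3288',
  '66.12.71.25 - - [21/Feb/2012 23:45:21] "GET /course/19/detail HTTP/1.1" 200 6851 0.1213',
  '18.4.5.14 - - [21/Feb/2012 23:45:59] "GET /search_by_subject?search_learn_exp=algebra-i-worked-examples HTTP/1.1" 200 7939 0.0370'
)

df <- read.table(
  text = log_lines,          # swap for the file name when reading a real log
  col.names = c('host', 'ident', 'authuser', 'date', 'time',
                'request', 'status', 'bytes', 'duration'),
  quote = '"',               # only double quotes delimit the request field
  comment.char = '',         # a '#' inside a URL should not start a comment
  stringsAsFactors = FALSE   # keep text columns as plain character vectors
)

str(df)
```

With the columns named here, the separate colnames step becomes optional.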






Clean Up and Format 
I chose to clean up manually after the fact.  To start, we name the columns in the data frame.


colnames(df)=c('host','ident','authuser','date','time','request','status','bytes','duration')


The date and time were split up when read in above.  I am not concerned with the time at this point but do want the date to be cast to a date type.

df$date=as.Date(df$date,"[%d/%b/%Y")
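If you do end up wanting the time of day as well, one option is to paste the two raw fields back together and parse them in one go. This is only a sketch: it works on the raw strings as they come out of read.table (before the date column is converted), the variable names are made up for the example, and the %b month abbreviation assumes an English locale.

```r
# Raw date and time fields as they appear after read.table splits them.
date_field <- c('[21/Feb/2012', '[21/Feb/2012')
time_field <- c('23:44:11]', '23:45:59]')

# The brackets are matched literally by the format string.
timestamp <- as.POSIXct(
  paste(date_field, time_field),
  format = '[%d/%b/%Y %H:%M:%S]',
  tz = 'UTC'                      # assumed; use the server's time zone
)

# Hour of day, handy for an hourly traffic breakdown.
hour_of_day <- format(timestamp, '%H')
print(timestamp)
print(hour_of_day)
```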


To see the column names and first few rows of our data frame...
head(df)

There are a number of different ways of getting a quick handle on the data - you could do a summary, for instance. One item that you might want to have is the number of requests per HTTP status.

table(df$status)
 

But the item of immediate interest is simply the number of requests.  The following will provide the number of requests by date.
reqs=as.data.frame(table(df$date))
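It is worth knowing what that data frame looks like, since the default column names are not obvious. The sketch below uses a tiny made-up date column (the dates are illustrative, not taken from the real log) to show that the result has a factor column called Var1 and a count column called Freq. It also shows one way to pick friendlier names, which is an optional variation and not what the rest of this post uses.

```r
# Illustrative dates only: one request on the 20th, three on the 21st.
df <- data.frame(
  date = as.Date(c('2012-02-20', '2012-02-21', '2012-02-21', '2012-02-21'))
)

# Default naming: Var1 (a factor of the dates) and Freq (the counts).
reqs <- as.data.frame(table(df$date))
print(reqs)
str(reqs)

# Var1 is a factor, so convert it back before treating it as a date.
reqs_dates <- as.Date(as.character(reqs$Var1))

# Optional variation: name the columns while building the data frame.
reqs_named <- as.data.frame(table(date = df$date), responseName = 'requests')
print(reqs_named)
```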

R is really great for these quick summarizations, and if you memorize a few functions you will be able to address most needs easily. Beyond a certain point I find data problems easier to think through in SQL, and that is when I reach for the sqldf library. For now - on to some charts using ggplot2.
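Before the charts, here is a small sketch of the kind of "few functions" I have in mind: tapply, aggregate and order, all from base R. The data frame is built by hand from the host, bytes and duration values in the four sample log lines, so the numbers are tiny, and the summaries chosen (mean duration per host, total bytes per host, slowest request) are just examples.

```r
# Values taken from the four sample log lines shown earlier.
df <- data.frame(
  host = c('66.12.71.25', '66.12.71.21', '66.12.71.25', '18.4.5.14'),
  bytes = c(7017L, 8019L, 6851L, 7939L),
  duration = c(5.0829, 0.3288, 0.1213, 0.0370)
)

# Mean request duration per client host.
mean_duration <- tapply(df$duration, df$host, mean)
print(mean_duration)

# Total bytes returned per client host.
bytes_by_host <- aggregate(bytes ~ host, data = df, FUN = sum)
print(bytes_by_host)

# The single slowest request.
slowest <- df[order(-df$duration), ][1, ]
print(slowest)
```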

Make Some Charts


One "gotcha" that I hit fairly often with R and ggplot2 is the need to cast variables in a way that allows them to be treated as either continuous or discrete. In the following, casting the Var1 field as a Date allows it to be treated as continuous, so geom_line() renders a line as intended.

ggplot(data=reqs, aes(x=as.Date(Var1), y=Freq)) + geom_line() + xlab('Date') + ylab('Requests') + ggtitle('Traffic to Site')




On the other hand, the format function is used in this example to cause the (http) status value to be treated as discrete.

ggplot(data=df, aes(x=format(status))) + geom_bar() + xlab('Status') + ylab('Count') + ggtitle('Status')
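To see why this works, it helps to look at what format does to the values outside of ggplot2. The sketch below uses a made-up vector of status codes. It also shows factor as an alternative way of getting a discrete variable. That is my suggestion, not what the chart above uses, and it avoids the padding that format adds when values have different widths.

```r
# Made-up status codes for illustration.
status <- c(200L, 200L, 404L, 500L)

# format() returns character strings, which are treated as discrete.
as_text <- format(status)
print(class(as_text))

# factor() is another way to mark the values as discrete categories.
as_factor <- factor(status)
print(levels(as_factor))

# Caveat: format() pads values to a common width, factor() does not.
padded <- format(c(5L, 200L))
print(padded)
print(levels(factor(c(5L, 200L))))
```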


By the way, the images were exported as pngs for the blog by assigning the chart to a variable p and printing like so:



png("imagename.png")
print(p)
dev.off()

So there you have it - functional, useful R that addresses a practical everyday need of web developers. It is also a simple, relevant exercise that can introduce you to R while providing immediate value.

The next time Google Analytics falls short, pull out R and give it a try!


