Log File Analysis with R

[This article was first published on R-Chart, and kindly contributed to R-bloggers.]


R often comes up in discussions of heavy-duty scientific and statistical analysis (and so it should). However, it is also incredibly handy for a variety of more routine developer activities. And so I give you… log file analysis with R!

I was just involved in the launch of gradesquare.com (go ahead, click on the link and check it out. We will still be here later!). With the flurry of recent activity, I needed a way to visualize and communicate site activity to the rest of the team. It only takes a few lines of R to read in a log file (of a reasonable size), format the data, and generate some usable charts. Like most good ideas, it is not new. Most log files follow a similar format (such as the common log format), so the following exercise should need only minor variations for other logs.

The only library that I used for this example was ggplot2 for charts.  

Read the Log File
A sample of the log file:

- - [21/Feb/2012 23:44:11] "GET /course/1894/detail HTTP/1.1" 200 7017 5.0829
- - [21/Feb/2012 23:44:39] "GET /search_by_author?search_learn_exp=Khan+Academy&page=193 HTTP/1.1" 200 8019 0.3288
- - [21/Feb/2012 23:45:21] "GET /course/19/detail HTTP/1.1" 200 6851 0.1213
- - [21/Feb/2012 23:45:59] "GET /search_by_subject?search_learn_exp=algebra-i-worked-examples HTTP/1.1" 200 7939 0.0370

In short, it is a relatively typical log file. Each entry includes the IP address of the client making the request (not shown in the sample above), the date and time, the HTTP method and URL path, the HTTP response status code, a count of bytes returned, and the time required to process the request.

The log file can be read into a data frame as follows.

df = read.table('webapp.log')

There are a lot of different options available, and you might want to take advantage of these to minimize the amount of additional cleanup required after loading the file. For details, see the help page for read.table (?read.table).
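
As a rough sketch of what those options can do, the call below names and types the columns while reading, so less cleanup is needed afterwards. The column names other than status, the placeholder host 127.0.0.1 (the sample does not show the client address), and the inline text= input are assumptions for illustration. With a real file you would pass 'webapp.log' as the first argument instead.

```r
# Two lines shaped like the sample log. The host 127.0.0.1 is a placeholder
# (an assumption), because the sample does not show the client address.
log_lines <- c(
  '127.0.0.1 - - [21/Feb/2012 23:44:11] "GET /course/1894/detail HTTP/1.1" 200 7017 5.0829',
  '127.0.0.1 - - [21/Feb/2012 23:45:21] "GET /course/19/detail HTTP/1.1" 200 6851 0.1213'
)

df <- read.table(
  text = log_lines,      # stands in for the file name here
  col.names = c('host', 'ident', 'authuser', 'date', 'time',
                'request', 'status', 'bytes', 'duration'),
  colClasses = c(rep('character', 6), 'integer', 'integer', 'numeric'),
  quote = '"',           # only double quotes delimit the request field
  comment.char = ''      # do not treat '#' inside a URL as a comment
)

str(df)
```

Typing bytes as an integer would fail on logs that write "-" when no body is returned, so this may need adjusting for other servers.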


Clean Up and Format 
I chose to clean up manually after the fact.  To start, we name the columns in the data frame.
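
A minimal sketch of that naming step might look like the following. Only status is confirmed by the chart code later in the post; the other column names and the placeholder host 127.0.0.1 are assumptions made for illustration.

```r
# Two lines shaped like the sample log; 127.0.0.1 is a placeholder host
# (an assumption), because the sample does not show the client address.
log_lines <- c(
  '127.0.0.1 - - [21/Feb/2012 23:44:11] "GET /course/1894/detail HTTP/1.1" 200 7017 5.0829',
  '127.0.0.1 - - [21/Feb/2012 23:45:21] "GET /course/19/detail HTTP/1.1" 200 6851 0.1213'
)
df <- read.table(text = log_lines, stringsAsFactors = FALSE)

# read.table() splits on whitespace, so the bracketed timestamp arrives as two
# columns (V4 and V5), while the quoted request stays together as one.
colnames(df) <- c('host', 'ident', 'authuser', 'date', 'time',
                  'request', 'status', 'bytes', 'duration')
colnames(df)
```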


The date and time were split up when read in above.  I am not concerned with the time at this point but do want the date to be cast to a date type.
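
One way to do that cast, sketched under the same assumptions as before (hypothetical column names apart from status, and a placeholder host), is to include the leading bracket literally in the format string:

```r
# Two lines shaped like the sample log; 127.0.0.1 is a placeholder host.
log_lines <- c(
  '127.0.0.1 - - [21/Feb/2012 23:44:11] "GET /course/1894/detail HTTP/1.1" 200 7017 5.0829',
  '127.0.0.1 - - [21/Feb/2012 23:45:21] "GET /course/19/detail HTTP/1.1" 200 6851 0.1213'
)
df <- read.table(text = log_lines, stringsAsFactors = FALSE)
colnames(df) <- c('host', 'ident', 'authuser', 'date', 'time',
                  'request', 'status', 'bytes', 'duration')

# The date field still carries the opening bracket, e.g. "[21/Feb/2012",
# so the bracket is matched literally by the format string.
df$date <- as.Date(df$date, format = '[%d/%b/%Y')
class(df$date)
```

Note that %b matches abbreviated month names in the current locale, so an English-style log may need an English locale to parse every month.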


To see the column names and first few rows of our data frame…
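
A quick way to do this is head(), shown here as a sketch on the sample lines (the column names other than status and the host 127.0.0.1 are assumptions):

```r
# Two lines shaped like the sample log; 127.0.0.1 is a placeholder host.
log_lines <- c(
  '127.0.0.1 - - [21/Feb/2012 23:44:11] "GET /course/1894/detail HTTP/1.1" 200 7017 5.0829',
  '127.0.0.1 - - [21/Feb/2012 23:45:21] "GET /course/19/detail HTTP/1.1" 200 6851 0.1213'
)
df <- read.table(text = log_lines, stringsAsFactors = FALSE)
colnames(df) <- c('host', 'ident', 'authuser', 'date', 'time',
                  'request', 'status', 'bytes', 'duration')

# Prints the column names as a header, followed by up to six rows.
first_rows <- head(df)
first_rows
```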

There are a number of different ways of getting a quick handle on the data; you could do a summary, for instance. One item that you might want is the number of requests for each HTTP status.
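
A count per status can be sketched with table(). The example uses the four sample lines, with a placeholder host and column names that (apart from status) are assumptions:

```r
# The four sample lines; 127.0.0.1 is a placeholder host (an assumption).
log_lines <- c(
  '127.0.0.1 - - [21/Feb/2012 23:44:11] "GET /course/1894/detail HTTP/1.1" 200 7017 5.0829',
  '127.0.0.1 - - [21/Feb/2012 23:44:39] "GET /search_by_author?search_learn_exp=Khan+Academy&page=193 HTTP/1.1" 200 8019 0.3288',
  '127.0.0.1 - - [21/Feb/2012 23:45:21] "GET /course/19/detail HTTP/1.1" 200 6851 0.1213',
  '127.0.0.1 - - [21/Feb/2012 23:45:59] "GET /search_by_subject?search_learn_exp=algebra-i-worked-examples HTTP/1.1" 200 7939 0.0370'
)
df <- read.table(text = log_lines, stringsAsFactors = FALSE)
colnames(df) <- c('host', 'ident', 'authuser', 'date', 'time',
                  'request', 'status', 'bytes', 'duration')

# One count per distinct status code.
status_counts <- table(df$status)
status_counts

# summary() gives a broader overview of every column.
summary(df)
```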


But the item of immediate interest is simply the number of requests.  The following will provide the number of requests by date.
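
The chart code further down refers to a data frame reqs with columns Var1 and Freq, which are the default names as.data.frame() gives a one-way table, so the step was presumably something like this sketch. The column names other than status and the placeholder host are assumptions.

```r
# The four sample lines; 127.0.0.1 is a placeholder host (an assumption).
log_lines <- c(
  '127.0.0.1 - - [21/Feb/2012 23:44:11] "GET /course/1894/detail HTTP/1.1" 200 7017 5.0829',
  '127.0.0.1 - - [21/Feb/2012 23:44:39] "GET /search_by_author?search_learn_exp=Khan+Academy&page=193 HTTP/1.1" 200 8019 0.3288',
  '127.0.0.1 - - [21/Feb/2012 23:45:21] "GET /course/19/detail HTTP/1.1" 200 6851 0.1213',
  '127.0.0.1 - - [21/Feb/2012 23:45:59] "GET /search_by_subject?search_learn_exp=algebra-i-worked-examples HTTP/1.1" 200 7939 0.0370'
)
df <- read.table(text = log_lines, stringsAsFactors = FALSE)
colnames(df) <- c('host', 'ident', 'authuser', 'date', 'time',
                  'request', 'status', 'bytes', 'duration')
df$date <- as.Date(df$date, format = '[%d/%b/%Y')

# Count requests per date, then turn the table into a data frame.
# Var1 holds the date (as a factor) and Freq the number of requests.
reqs <- as.data.frame(table(df$date))
reqs
```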

R is really great for these quick summarizations, and if you memorize a few functions you will be able to address most needs easily.  At a certain point, I can better visualize data problems using SQL, and so use the sqldf library.  For now – on to some charts using ggplot2.

Make Some Charts

One “gotcha” that I hit fairly often with R and ggplot2 is the need to cast variables in a way that allows them to be treated as either continuous or discrete. In the following, casting the Var1 field as a Date allows it to be treated as continuous, and geom_line() renders a line as intended.

ggplot(data=reqs, aes(x=as.Date(Var1), y=Freq)) + geom_line() + xlab('Date') + ylab('Requests') + ggtitle('Traffic to Site')

On the other hand, the format function is used in this example to cause the (http) status value to be treated as discrete.

ggplot(data=df, aes(x=format(status))) + geom_bar() + xlab('Status') + ylab('Count') + ggtitle('Status')

By the way, the images were exported as pngs for the blog by assigning the chart to a variable p and printing like so:
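
A sketch of that export, assuming the base png() device and a hypothetical output file name, using the status codes from the four sample lines:

```r
library(ggplot2)

# Status codes taken from the four sample log lines.
df <- data.frame(status = c(200, 200, 200, 200))

# Assign the chart to p instead of letting it print immediately.
p <- ggplot(data = df, aes(x = format(status))) + geom_bar() +
  xlab('Status') + ylab('Count') + ggtitle('Status')

# Hypothetical file name; written to the session's temporary directory.
out_file <- file.path(tempdir(), 'status.png')

png(out_file, width = 600, height = 400)  # open the png device
print(p)                                  # draw the chart onto it
dev.off()                                 # close the device to write the file
```

The explicit print(p) matters inside scripts and functions, where a ggplot object is not drawn automatically.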


So there you have it: functional, useful R that addresses a practical everyday need of web developers. It is also a great way to introduce yourself to R, since it is a simple, relevant exercise that provides immediate value.

The next time Google Analytics falls short, pull out R and give it a try!
