R often comes up in discussions of heavy-duty scientific and statistical analysis (and so it should). However, it is also incredibly handy for a variety of more routine developer activities. And so I give you… log file analysis with R!
I was just involved in the launch of gradesquare.com (go ahead – click on the link and check it out. We will still be here later!). With the flurry of recent activity, I needed a way to visualize and communicate site activity to the rest of the team. It only takes a few lines of R to read in a log file (of a reasonable size), format the data, and generate some usable charts. Like most good ideas – it is not new. Most log files follow a similar format (such as the Common Log Format), so the following exercise should carry over to other logs with only minor variations.
The only library that I used for this example was ggplot2 for charts.
Read the Log File
A sample of the log file (miserably wrapped – my apologies):
188.8.131.52 - - [21/Feb/2012 23:44:11] "GET /course/1894/detail HTTP/1.1" 200 7017 5.0829
184.108.40.206 - - [21/Feb/2012 23:44:39] "GET /search_by_author?search_learn_exp=Khan+Academy&page=193 HTTP/1.1" 200 8019 0.3288
220.127.116.11 - - [21/Feb/2012 23:45:21] "GET /course/19/detail HTTP/1.1" 200 6851 0.1213
18.104.22.168 - - [21/Feb/2012 23:45:59] "GET /search_by_subject?search_learn_exp=algebra-i-worked-examples HTTP/1.1" 200 7939 0.0370
If you can’t make that out – just know that it is a relatively typical log file that includes the IP address of the client, the date and time, the HTTP method and URL path, the HTTP response status code, a count of bytes returned, and the time required to process the request.
The log file can be read into a data frame as follows.
df = read.table('webapp.log')
There are a lot of different options available – and you might want to take advantage of these to minimize the amount of additional cleanup required after loading the file. For details, see the help page for read.table (run ?read.table at the R prompt).
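As a rough sketch of what those options can do, the following names and types the columns while reading. The column names (other than status, which the charts below rely on) and the column types are my assumptions, and two lines from the sample stand in for the real file.

```r
# Two lines from the sample above stand in for webapp.log
log_lines = c(
  '188.8.131.52 - - [21/Feb/2012 23:44:11] "GET /course/1894/detail HTTP/1.1" 200 7017 5.0829',
  '220.127.116.11 - - [21/Feb/2012 23:45:21] "GET /course/19/detail HTTP/1.1" 200 6851 0.1213'
)

df = read.table(
  text = log_lines,              # for the real file: read.table('webapp.log', ...)
  col.names = c('ip', 'ident', 'user', 'date', 'time',
                'request', 'status', 'bytes', 'duration'),   # assumed names
  colClasses = c(rep('character', 6), 'integer', 'integer', 'numeric'),
  stringsAsFactors = FALSE       # keep text columns as plain strings
)
str(df)
```

Setting colClasses up front avoids surprises such as text columns arriving as factors on older versions of R.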
Clean Up and Format
I chose to clean up manually after the fact. To start, we name the columns in the data frame.
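Something along these lines does the naming. Only status is confirmed by the chart code later in the post; the other names are my own placeholders, and two sample lines stand in for webapp.log.

```r
# Two lines from the sample above stand in for webapp.log
log_lines = c(
  '188.8.131.52 - - [21/Feb/2012 23:44:11] "GET /course/1894/detail HTTP/1.1" 200 7017 5.0829',
  '220.127.116.11 - - [21/Feb/2012 23:45:21] "GET /course/19/detail HTTP/1.1" 200 6851 0.1213'
)
df = read.table(text = log_lines)   # columns arrive named V1 through V9

# Assumed names: the quoted request is one field, date and time are two
colnames(df) = c('ip', 'ident', 'user', 'date', 'time',
                 'request', 'status', 'bytes', 'duration')
```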
The date and time were split up when read in above. I am not concerned with the time at this point but do want the date to be cast to a date type.
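A minimal sketch of the cast, assuming the date column is called date and still carries the leading bracket from the log (for example [21/Feb/2012). The %b month code is locale dependent, so this also assumes an English locale.

```r
# Two lines from the sample above stand in for webapp.log
log_lines = c(
  '188.8.131.52 - - [21/Feb/2012 23:44:11] "GET /course/1894/detail HTTP/1.1" 200 7017 5.0829',
  '220.127.116.11 - - [21/Feb/2012 23:45:21] "GET /course/19/detail HTTP/1.1" 200 6851 0.1213'
)
df = read.table(text = log_lines)
colnames(df) = c('ip', 'ident', 'user', 'date', 'time',
                 'request', 'status', 'bytes', 'duration')   # assumed names

# The '[' in the format string matches the literal bracket in the field
df$date = as.Date(df$date, format = '[%d/%b/%Y')
class(df$date)
```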
To see the column names and first few rows of our data frame…
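The usual tool for this is head(). The sketch below rebuilds a tiny data frame from two sample lines, with my assumed column names, so that it runs on its own.

```r
# Two lines from the sample above stand in for webapp.log
log_lines = c(
  '188.8.131.52 - - [21/Feb/2012 23:44:11] "GET /course/1894/detail HTTP/1.1" 200 7017 5.0829',
  '220.127.116.11 - - [21/Feb/2012 23:45:21] "GET /course/19/detail HTTP/1.1" 200 6851 0.1213'
)
df = read.table(text = log_lines)
colnames(df) = c('ip', 'ident', 'user', 'date', 'time',
                 'request', 'status', 'bytes', 'duration')   # assumed names

first_rows = head(df)   # column names plus up to the first six rows
print(first_rows)
```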
There are a number of different ways of getting a quick handle on the data – you could do a summary, for instance. One item that you might want to have is the number of requests for each HTTP status.
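One way to get that count is table(), sketched here against two sample lines with my assumed column names.

```r
# Two lines from the sample above stand in for webapp.log
log_lines = c(
  '188.8.131.52 - - [21/Feb/2012 23:44:11] "GET /course/1894/detail HTTP/1.1" 200 7017 5.0829',
  '220.127.116.11 - - [21/Feb/2012 23:45:21] "GET /course/19/detail HTTP/1.1" 200 6851 0.1213'
)
df = read.table(text = log_lines)
colnames(df) = c('ip', 'ident', 'user', 'date', 'time',
                 'request', 'status', 'bytes', 'duration')   # assumed names

summary(df$bytes)                  # quick numeric summary of one column
status_counts = table(df$status)   # number of requests per HTTP status
print(status_counts)
```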
But the item of immediate interest is simply the number of requests. The following will provide the number of requests by date.
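The chart code further down expects a data frame called reqs with columns Var1 and Freq, which is what as.data.frame() returns for a one-way table. A sketch under that assumption, again with two sample lines and my assumed column names:

```r
# Two lines from the sample above stand in for webapp.log
log_lines = c(
  '188.8.131.52 - - [21/Feb/2012 23:44:11] "GET /course/1894/detail HTTP/1.1" 200 7017 5.0829',
  '220.127.116.11 - - [21/Feb/2012 23:45:21] "GET /course/19/detail HTTP/1.1" 200 6851 0.1213'
)
df = read.table(text = log_lines)
colnames(df) = c('ip', 'ident', 'user', 'date', 'time',
                 'request', 'status', 'bytes', 'duration')   # assumed names
df$date = as.Date(df$date, format = '[%d/%b/%Y')             # assumes English month names

# One row per date: Var1 holds the date (as a factor), Freq the request count
reqs = as.data.frame(table(df$date))
print(reqs)
```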
R is really great for these quick summarizations, and if you memorize a few functions you will be able to address most needs easily. At a certain point, I can better visualize data problems using SQL, and so use the sqldf library. For now – on to some charts using ggplot2.
Make Some Charts
One “gotcha” that I hit fairly often with R and ggplot2 is the need to cast variables in a way that allows them to be treated as either continuous or discrete. In the following, casting the Var1 field as a Date allows it to be treated as continuous, and geom_line() renders a line as intended.
library(ggplot2)
ggplot(data=reqs, aes(x=as.Date(Var1), y=Freq)) + geom_line() + xlab('Date') + ylab('Requests') + ggtitle('Traffic to Site')
On the other hand, the format function is used in this example to cause the (http) status value to be treated as discrete.
ggplot(data=df, aes(x=format(status))) + geom_bar() + xlab('Status') + ylab('Count') + ggtitle('Status')
By the way, the images were exported as pngs for the blog by assigning the chart to a variable p and printing like so:
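A sketch of that pattern with the base png() device follows. The file name status.png and the image dimensions are hypothetical, and the data frame is rebuilt from two sample lines so the snippet runs on its own.

```r
library(ggplot2)

# Two lines from the sample above stand in for webapp.log
log_lines = c(
  '188.8.131.52 - - [21/Feb/2012 23:44:11] "GET /course/1894/detail HTTP/1.1" 200 7017 5.0829',
  '220.127.116.11 - - [21/Feb/2012 23:45:21] "GET /course/19/detail HTTP/1.1" 200 6851 0.1213'
)
df = read.table(text = log_lines)
colnames(df) = c('ip', 'ident', 'user', 'date', 'time',
                 'request', 'status', 'bytes', 'duration')   # assumed names

# Assign the chart to p instead of drawing it straight away
p = ggplot(data = df, aes(x = format(status))) + geom_bar() +
  xlab('Status') + ylab('Count') + ggtitle('Status')

png('status.png', width = 600, height = 400)   # hypothetical name and size
print(p)     # draws the chart into the open png device
dev.off()    # closes the device and writes the file
```

ggplot2 also provides ggsave(), which wraps the same steps in a single call.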