Analyzing OWASP Mod Security Audit Log with R

October 1, 2018
(This article was first published on Programming Utility codes, and kindly contributed to R-bloggers)

Web servers are frequently scanned or attacked by remote HTTP clients. Attackers send HTTP requests either algorithmically or by direct “human” intervention. To protect web servers against these threats, automated protection systems block potentially malicious HTTP requests. Every time a request is blocked, its description is logged in an audit file. Audit files are often large, difficult-to-read text files, which makes it hard to get a clear overview of the threats blocked by the security mechanisms.
In this tutorial I describe an algorithm that I wrote to analyse the HTTP requests blocked by ModSecurity for Apache HTTPD. In my environment ModSecurity is interfaced with the OWASP Core Rule Set version 3, but the algorithm should work with other rule sets.

1. Parse the Audit File

The ModSecurity audit log file is organised in sections. Each section starts on a new line and can span multiple lines. Sections are separated by a line specifying the transaction ID and the section code. A description of each section can be found here: https://www.feistyduck.com

 --38c0471f-A--  
[04/Apr/2018:15:35:55 +0000] WsTw20sPICSQr9pUrF5lpQAAAAs x.x.x.x 46351 x.x.x.x 443
--38c0471f-B--
GET /sitemap.xml HTTP/1.1
Host: 155.198.29.141
Accept-Encoding: identity

For performance reasons, we decided to read the audit file line by line in R. This means that our algorithm acts much like a “human” reader would: a set of functions “understands” the content of the current line given the information retrieved from the previous lines. For example, if the previous line was

 --38c0471f-A-- 

Then the current line must indicate the time of the request, its ID, and the source and destination IP addresses and ports (this is because -A- is an Audit Log Header):

 [04/Apr/2018:15:35:55 +0000] WsTw20sPICSQr9pUrF5lpQAAAAs x.x.x.x 46351 x.x.x.x 443  
Using regular expressions, we created a set of functions that validate the type of line and extract the relevant content.

1.a Check and extract the content of the Section header and the Audit Log Header (A)

To check for the presence of a specific text pattern, or to extract text matching a pattern, we use regular expressions. We imported the package “stringr”, a great library that makes our life easier. The first function below evaluates whether the current line announces a new section. The second function extracts the transaction ID. The third function extracts the section type. The function “auditLogGetTransactionDate” retrieves the date of the transaction as an R Date object. The last function retrieves the source and destination IPs and ports as two strings (i.e. [“ipSource portSource”, “ipDest portDest”]).

 # Load Regular Expression Library
require('stringr')

# Returns TRUE if the line is a section header (e.g. an Audit Log Header)
auditLogIsTransaction <- function(line){
  return(str_detect(line, "--[:xdigit:]{8}-[ABCDEFGHIJKZ]--"))
}

# Returns the transaction ID from the section header
auditLogGetTransactionId <- function(line){
  return(str_extract(line, "[:xdigit:]{8}"))
}

# Returns the section type from the section header
auditLogGetTransactionSection <- function(line){
  return(gsub("-", "", str_extract(line, "-([ABCDEFGHIJKZ])-")))
}

# Returns the transaction date
auditLogGetTransactionDate <- function(line){
  extract <- str_extract(line, "[0123][0-9]/(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)/20(17|18|19|20)")
  extractDate <- as.Date(extract, "%d/%b/%Y")
  return(extractDate)
}

# Returns the IP addresses and ports as an array (i.e. ["ipSource portSource", "ipDest portDest"])
auditLogGetIPAddressesAndPorts <- function(line){
  return(str_extract_all(line, "[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\s[0-9]{2,5}"))
}

In our example, we can now recognize and extract the relevant pieces of information: the section headers, the date, the transaction ID, and the IP addresses and ports.

 --38c0471f-A--
[04/Apr/2018:15:35:55 +0000] WsTw20sPICSQr9pUrF5lpQAAAAs 66.x.x.x 46351 x.x.x.x 443
--38c0471f-B--
GET /sitemap.xml HTTP/1.1
Host: 155.198.29.141
Accept-Encoding: identity
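As a quick sanity check, the patterns above can be exercised directly on the sample header line (a minimal sketch re-using only the regular expressions already shown):

```r
# Exercise the regular expressions on the sample section header
library(stringr)

header <- "--38c0471f-A--"
str_detect(header, "--[:xdigit:]{8}-[ABCDEFGHIJKZ]--")   # TRUE: line is a section header
str_extract(header, "[:xdigit:]{8}")                     # "38c0471f": the transaction ID
gsub("-", "", str_extract(header, "-([ABCDEFGHIJKZ])-")) # "A": the section type
```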

We now need to extract the bit of information explaining why the HTTP request was deemed potentially malicious by Mod Security.

1.b Check and extract OWASP Rule information in Section H

The section H of the Audit log gives the reason why an HTTP request was blocked. It gives both the rule’s ID and the rule’s description.  
 --38c0471f-H--  
Message:...
[id "920350"] [rev "2"] [msg "Host header is a numeric IP address"] ...
Apache-Error: [file "apache2_util.c"] ... [client %s] ModSecurity: %s%s [uri "%s"]%s
Stopwatch: 1522856155328261 2511 (- - -)
Stopwatch2: 1522856155328261 2511; combined=1679,..., sr=122, sw=0, l=0, gc=0
Response-Body-Transformed: Dechunked
Producer: ModSecurity for Apache/2.9.1 (http://www.modsecurity.org/); OWASP_CRS/3.0.2.
Server: Apache
Engine-Mode: "ENABLED"

If the last section header read by the algorithm was of type “H”, then the current line could potentially contain the rule’s ID and the rule’s message. Testing for the presence of, and extracting, the rule’s ID is much easier than the rule’s message.

The two following functions test whether the line contains a rule ID, and extract that ID.

 # Returns TRUE if the line contains a rule ID
auditLogHasErrorCodeMessage <- function(line){
  require('stringr')
  return(str_detect(line, '\\[id "\\d{3,10}"\\]'))
}

# Returns the ID of the rule
auditLogGetErrorCodeMessage <- function(line){
  require('stringr')
  return(str_extract(str_extract(line, '\\[id "[0-9]{3,10}"\\]'), "[0-9]{3,10}"))
}
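The same sanity check works on a Section H message line (a sketch using a shortened version of the sample message shown earlier):

```r
# Detect and extract a rule ID from a sample Section H message line
library(stringr)

msg <- 'Message: Warning. [id "920350"] [msg "Host header is a numeric IP address"]'
str_detect(msg, '\\[id "\\d{3,10}"\\]')                                  # TRUE
str_extract(str_extract(msg, '\\[id "[0-9]{3,10}"\\]'), "[0-9]{3,10}")   # "920350"
```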

1.c Read and Make Sense of the Current Line

Now we put all the functions described above together in order to make sense of the current line. A data frame called “data” acts as the memory of the reader. At the first line, “data” is NULL and a new data frame needs to be created. Line by line, the reader gains new information that is stored in “data”, and the information already in “data” helps the reader understand the next line.

   
processAuditLogLine <- function(fileLine, data){
  # Create structured storage if absent
  if(is.null(data)){
    data <- data.frame(txId = NA, txSection = NA, txDate = NA, txSourceIp = NA,
                       txSourceIpLocation = NA, txSourcePort = NA, txDestIp = NA,
                       txDestPort = NA, txRule = NA)
  }
  # Check if the line is a section header
  if(auditLogIsTransaction(fileLine)){
    # Retrieve transaction ID
    txId <- auditLogGetTransactionId(fileLine)
    # Retrieve section type
    txSection <- auditLogGetTransactionSection(fileLine)
    if(txSection == 'A'){
      data <- rbind(data, data.frame(txId = txId, txSection = txSection, txDate = NA,
                                     txSourceIp = NA, txSourceIpLocation = NA,
                                     txSourcePort = NA, txDestIp = NA, txDestPort = NA,
                                     txRule = NA))
    }else{
      # Retrieve current data row
      txRow <- dim(data)[1]
      data[txRow,'txSection'] <- txSection
    }

  }else{

    # Retrieve current data row
    txRow <- dim(data)[1]

    # Check that the line is not empty
    if(fileLine != ''){

      # Retrieve current section type
      currentSection <- data[txRow,'txSection']
      # Check if the section starts a new HTTP request (A)
      if(currentSection == 'A'){
        # Retrieve date
        data[txRow,'txDate'] <- auditLogGetTransactionDate(fileLine)
        # Retrieve IP addresses and ports
        ipAddressesArr <- unlist(auditLogGetIPAddressesAndPorts(fileLine))
        source <- unlist(strsplit(ipAddressesArr[1], " "))
        dest <- unlist(strsplit(ipAddressesArr[2], " "))
        # Store IP addresses and ports in data
        data[txRow,'txSourceIp'] <- source[1]
        data[txRow,'txSourcePort'] <- source[2]
        data[txRow,'txDestIp'] <- dest[1]
        data[txRow,'txDestPort'] <- dest[2]
      }
      # Check if the section gives information about the rule (H)
      if(currentSection == 'H'){
        if(startsWith(fileLine, 'Message: ')){
          data[txRow,'txRule'] <- auditLogGetErrorCodeMessage(fileLine)
        }
      }
    }
  }
  return(data)
}

1.d Read the file 

So far, we have seen how to recognize and extract the information in R, and how, in one function, to read and make sense of a line given the knowledge gained from the document so far. The function below shows how the file is actually read.

For every line of the file, we execute the function passed as “method” and pass the data as a parameter. Note that data is first declared as NULL.

 processLogFile <- function(filepath, method) {
  con <- file(filepath, "r")
  data <- NULL
  while(TRUE){
    line <- readLines(con, n = 1)
    if(length(line) == 0){
      break
    }
    # Execute method on the line
    data <- method(line, data)
  }
  close(con)

  return(data)
}
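The read loop can be sketched on an in-memory connection, without a real audit file. In this toy stand-in the “method” just counts characters per line rather than parsing ModSecurity sections:

```r
# Toy run of the read loop over an in-memory connection instead of a file
con <- textConnection(c("line one", "line two"))
data <- NULL
while(TRUE){
  line <- readLines(con, n = 1)
  if(length(line) == 0) break
  data <- c(data, nchar(line))   # stand-in for processAuditLogLine
}
close(con)
data
# [1] 8 8
```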

Now to get a data frame containing all the transactions of the audit log we can call:

 # File Location  
pathAudit <- 'C:/Temp/Logs/modsec_audit_20181001.log'
# Parse/Read Audit Log
data <- processLogFile(pathAudit, processAuditLogLine)
# Display data
head(data, 9)
The first 9 rows are displayed below. Of course, the source IP addresses have been changed to protect the anonymity of our attackers, and ours (a bit of GDPR compliance in this savage world would not hurt!).

txId txSection txDate txSourceIp txSourceIpLocation txSourcePort txDestIp txDestPort txRule
1 bde4ba7f Z 17801 184.x.x.x NA 41434 x.x.x.x 443 920350
2 b6c7467b Z 17801 198.x.x.x NA 22434 x.x.x.x 443 920440
3 ad702d79 Z 17802 78.x.x.x NA 15438 x.x.x.x 443 920280
4 5efba84c Z 17802 139.x.x.x NA 40714 x.x.x.x 443 920350
5 d358e435 Z 17802 220.x.x.x NA 2580 x.x.x.x 443 930130
6 6f227755 Z 17802 220.x.x.x NA 54458 x.x.x.x 443 920350
7 c207da7e Z 17802 220.x.x.x NA 50019 x.x.x.x 443 920280
8 30e96b28 Z 17802 198.x.x.x NA 54382 x.x.x.x 443 920280
9 bdf1496c Z 17802 60.x.x.x NA 58764 x.x.x.x 443 920350
In this table we can see that all sections have been read up to section Z (in the column “txSection”); that’s because section Z announces the end of a transaction/HTTP request. The dates in column txDate are stored as the number of days since “1970-01-01”.
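Those day counts can be turned back into readable dates with base R’s as.Date and an explicit origin; for example, for the value 17801 appearing in the first row:

```r
# Convert a day count back into a readable date
as.Date(17801, origin = "1970-01-01")
# [1] "2018-09-27"
```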

We already have a lot of information, but we would also like to retrieve the country where each source IP address is registered.

2.a Add Additional Insight from an API (IP Address Geolocation)

We would like to extend the data set obtained previously by adding the IP addresses’ geolocations. In order to do so, we fetch the country names from an online IP geolocation API.

In this tutorial we use the API offered by http://api.ipstack.com/, which worked really well. You will need to replace API_Key with your own key in order to use the API. The free account provides 10,000 API requests per month.

 retrieveIPsLocations <- function(threatsData){
  # Library
  require('jsonlite')

  # Find unique IPs
  unIps <- unique(threatsData$txSourceIp)
  countriesForIpsDataFrame <- data.frame(txSourceIp = unIps,
                                         txSourceCountry = rep(NA, length(unIps)))
  # Using https://ipstack.com
  API_Key <- "########################"
  countriesForIpsDataFrame$txSourceCountry <- unlist(apply(countriesForIpsDataFrame, 1, FUN = function(x){
    # API URL for a specific IP
    URL <- paste("http://api.ipstack.com/", x[1], "?access_key=", API_Key, sep = "")
    jsonObj <- fromJSON(URL)
    if(is.null(jsonObj) || is.null(jsonObj$country_name))
      return(NA)
    else
      return(jsonObj$country_name)
  }))

  threatsData <- merge(threatsData, countriesForIpsDataFrame, by = 'txSourceIp')
  return(threatsData)
}

We can think of other API uses here. For example, we could check which IP addresses appear on blacklists; many online APIs provide this information for a small fee.

3. Analyzing the Data

By running the code above, we obtain one row per HTTP request, each storing the country, the date and the security rule that fired.
It is time to unleash our data analysis skills!

3.a Analyzing the Geography of the attacks

Using the package “rworldmap”, a simple world-map visualization library, we can easily plot the number of requests by country.

The function below does just that.

 mapThreats <- function(threathDataset){
  require('rworldmap')
  require('dplyr')
  # Aggregate by country
  threathDataset <- threathDataset %>% group_by(txSourceCountry) %>%
    summarise(ThreatsCount = n()) %>% arrange(desc(ThreatsCount))
  spMapDF <- joinCountryData2Map(threathDataset, joinCode = "NAME",
                                 nameJoinColumn = "txSourceCountry")
  mapCountryData(spMapDF, nameColumnToPlot = "ThreatsCount")
}

# Visualise threats count by country
mapThreats(threats)

For our audit log we got, unsurprisingly, the following graph:

The rule 920350 prevents any remote client from accessing the web server by its IP address alone. While it is a really good rule for protecting against malicious web vulnerability scanners, it is prone to false positives. Imagine, for example, that Google is trying to index your website by using its IP address.

In order to have a clearer picture, we keep only the requests that were blocked by rules other than 920350. We use the “dplyr” package for filtering, but any other package (or none) could be used instead.

 # Filter for IP Rule   
threats <- threats %>% filter(txRule!=920350)
# Visualise threats count by country
mapThreats(threats)

We can see that most requests have been filtered out: the country with the highest number of blocked requests went from 2,253 requests to just 102. The results, however, are consistent. Most of the requests originate from the USA, China, Sweden, Poland, Romania, France, Germany and the Netherlands.

3.b Analyzing the Type of attack

From section 3.a it is obvious that, by far, the most frequently triggered security rule is 920350. We will now visualize the other rules in a bar plot.
 barPlotModSecRules <- function(threatsDataset){
  require('ggplot2')
  require('dplyr')
  rules <- as.data.frame(threatsDataset %>% group_by(txRule) %>%
                           summarise(ThreatsCount = n()) %>% arrange(desc(ThreatsCount)))
  rules$txRule <- as.factor(rules$txRule)
  ggplot(data = rules[1:10,], aes(x = txRule, y = ThreatsCount)) +
    geom_bar(stat = "identity", width = 0.5)
}
# Visualize rules in a Bar plot
barPlotModSecRules(threats)



The description and count of the rules in the bar plot are given below:

Rule     Message                                                        Warnings
920350   Host header is a numeric IP address (could be search engines)      3723
930130   Restricted File Access Attempt                                       98
920280   Request Missing a Host Header                                        95
980130   Anomaly score correlation rule                                       42
941100   XSS Attack Detected via libinjection                                 34
920210   Multiple/Conflicting Connection Header Data Found                    28
920440   URL file extension is restricted by policy                           24
913100   Found User-Agent associated with security scanner                    18
920170   GET or HEAD Request with Body Content                                 6
920100   Invalid HTTP Request Line                                             4
930110   Path Traversal Attack (/../)                                          3
933160   PHP Injection Attack: High-Risk PHP Function Call Found               3
920340   Request Containing Content, but Missing Content-Type header           1
932150   Remote Command Execution: Direct Unix Command Execution               1
It seems that attackers mainly attempt to access restricted files. In total, over the period covered by the log, 95 requests were made without a proper Host header; this can be the work of scanners and hackers, or of misconfigured HTTP clients (false positives).

3.c Analyzing the chronology of attacks

Using our data frame, we can visualize the chronology of the attacks and search for seasonal patterns. In addition, we can look for trends or test for stationarity.

 # Count blocked HTTP requests for each day in our dataset
dataAtHand <- threats %>% group_by(txDate) %>% summarise(count = n()) %>% arrange(txDate)
# Create a data set with all days between the first and last day of the log
allDates <- data.frame(txDate = seq(as.Date(unlist(head(dataAtHand, 1)[1]), origin = "1970-01-01"),
                                    as.Date(unlist(tail(dataAtHand, 1)[1]), origin = "1970-01-01"),
                                    by = "days"))
# Merge the data frame containing all the dates with the data frame containing the counts
datesData <- merge(allDates, dataAtHand, by = "txDate", all.x = TRUE) %>% arrange(txDate)
# Set the count to zero on days with no blocked requests
datesData[is.na(datesData$count), 'count'] <- 0

# Plot the time series
par(mar = c(8, 4.1, 4.1, 2.1))
plot(ts(datesData$count), axes = FALSE, ylab = "Warnings", xlab = "")
title(xlab = "Time", line = 6, cex.lab = 1.2)
labelsIndexes <- seq(1, length(datesData$txDate), by = 30)
axis(1, at = labelsIndexes, labels = datesData$txDate[labelsIndexes], las = 3)
axis(2)

The code above returns the following time series plot:

No clear seasonality or trend can be found by inspecting the graph visually. Formal test statistics could be used to assess whether a seasonal or trend effect is present.
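As one such check, base R’s Ljung-Box test (Box.test) looks for autocorrelation in the daily counts, which would hint at seasonality or persistence. This is a sketch: the counts below are synthetic stand-ins for datesData$count.

```r
# Ljung-Box test for autocorrelation in the daily counts (base R, no extra packages)
set.seed(1)
counts <- rpois(120, lambda = 5)   # synthetic stand-in for datesData$count
Box.test(counts, lag = 7, type = "Ljung-Box")
```

A small p-value would suggest significant autocorrelation at up to 7-day lags; on the real series, replace the synthetic counts with datesData$count.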

4. Conclusion

It is difficult to get an overview of the threats by looking at the log files directly. With the code above, we can not only get an overview of the attacks but also analyse our attackers and their behaviour: we can monitor the time of their attacks, the type of attack and their geographic location.
The resulting data set can be used to test statistical hypotheses and to improve our defences by understanding our attackers.
