For a project hosted at the ETH Risk Center, a fellow PhD student and I are currently trying to extract and vectorize a large number of cartographic features from a collection of scanned maps. Though the project is still in an early phase, we were already able to do a lot of interesting stuff, and I was particularly surprised about how far we got using R (which is not known for its image processing capabilities) as our primary programming tool.
One of the steps for which R turned out to be a particularly suited option is image classification, i.e., separating the pixels representing the feature of interest (e.g. roads) from everything else. In this first post, I’ll try to demonstrate how easily one can perform some basic image classification in R. To this end, I’ll use the example of extracting pixels representing oil pipelines from a map of oil infrastructure in Iraq (which is not directly part of the project mentioned above, but still relevant to my research).
The problem we’re dealing with here is essentially the following: On the basis of a raster image (which a scanned map eventually is), categorize each pixel into two or more classes. This is a well-known problem in remote sensing, where typically multispectral satellite imagery is processed and pixels are assigned to land-cover classes (different types of vegetation, built-up areas, etc.). Luckily, for our application here, we’re facing a very simple variant of this: our base material (the scanned map) is already highly standardized (satellite imagery usually doesn’t come with a legend…), and the information we use for classification is only 3 dimensional (the three color bands making up the RGB image).
Image classification procedures usually fall into one of two categories. In supervised classification, we first fit a model to manually pre-classified training data (i.e., the model is initially provided the “right” answers, thus “supervised”), which we then use to classify all other pixels in the image. In unsupervised classification, data is clustered into categories with little or no prior knowledge about how the latter should look like. For our pipeline problem, we are clearly going for supervised classification, since the categories of interest are perfectly well defined from the outset (pipeline-pixels versus the rest), and, as we will see below, creating training data to fit the model is trivial.
Support Vector Machines
The remote sensing literature suggests a wide array of well-established methods for performing supervised classification, but in this post I’ll demonstrate one of the more recent alternatives. Specifically, I will use support vector machines (SVMs) for classification. SVMs are linear classifiers with very desirable properties, and are apparently one of the hottest topics in machine learning. Admittedly, I only have basic knowledge of the method, and I’m still not fully comfortable with the underlying math, but the intuition behind SVMs is essentially to place a separating hyperplane into the feature-space such there is a maximal margin between the hyperplane and the data points of either class. Additionally, there’s some kernel magic going on that allows for non-linear classification, but I’m not competent enough to discuss the details. If you’re interested in the topic, I recommend the respective chapter in Hastie et al.’s machine learning bible (see http://stanford.io/hmBvL), which features a thorough introduction to SVMs, and is particularly readable if you have a background in econometrics.
Implementation in R
So let’s try and implement this in R. The image to be classified is the following map of Iraqi oil infrastructure in 2003, which I obtained from the Perry-Castaneda library map collection (http://bit.ly/xlvSN), and was apparently first published in the 2003 CIA country profile of Iraq:
Quite obviously, for this image in particular, simply vectorizing the pipelines by hand would be much faster than relying on an automated approach. For maps with higher resolution and more detailed features, however, automation pays off quickly.
Loading the scanned map (which is in JPEG format) and getting a look at it is easily achieved using the Raster package:
# Load and plot the scanned map require(raster) iraq.img <- brick("iraq_oil_2003.jpg") plotRGB(iraq.img)
Further, apart from the original image, we need some pre-classified samples of pipeline pixels and background pixels for the initial fitting of the SVM. In order to obtain these, I simply opened the original map in GIMP and created two JPEG images, one containing some randomly selected pipeline sections, and one containing some patches of background. These, too, can be loaded into R using the Raster package:
# Load images with the (pre-classified) training data training.ppl.img <- brick("iraq_pipelines_train.jpg") training.noppl.img <- brick("iraq_nopipelines_train.jpg")
And this is what the training data looks like:
In a next step, we transform these pixel images into properly formatted data frames, such that they can be used as training data. For this purpose, I get the 3-dimensional pixel values with the getValues function in the Raster package, pack them in a data frame, and remove all rows representing white background pixels. Furthermore, we assign a factor variable “pipeline”:
# Put training data into data frame training.ppl.df <- data.frame(getValues(training.ppl.img)) names(training.ppl.df) <- c("r", "g", "b") # Remove white background pixels training.ppl.df <- training.ppl.df[(training.ppl.df$r < 254 & training.ppl.df$g < 254 & training.ppl.df$b < 254),] # Create new variable indicating pipeline pixels training.ppl.df$pipeline <- "P" # Do the same for the non-pipeline image training.noppl.df <- data.frame(getValues(training.noppl.img)) # etc...
At this point, it is useful to have a look at the just created data frames with the training data. It turns out that while we have only 120 classified pipeline pixels, we have over 20’000 background pixels. This isn’t a problem per-se, but for the sake of computational performance, I will only perform model fitting using a random sample of 5’000 background pixels (which does not seem to affect the quality of the end result):
# Combine data frames and subset only 5000 random values from the non-pipeline training data training.df <- rbind(training.ppl.df, training.noppl.df[sample(nrow(training.noppl.df), 5000),]) # Turn classification variable into factor training.df$pipeline <- as.factor(training.df$pipeline)
Now, before we fit an SVM to all the training data, we might want to have some indication about how well image classification is going to work with this setup. For this purpose, I divide the training data into a training subset and a testing subset, fit an SVM to the training subset, and calculate the model’s predictive accuracy using out-of-sample predictions on the testing subset. To fit the SVM, I rely on the e1071 package, which is said to feature the fastest SVM implementation for R. I use the default settings (which use radial-basis kernels to allow for non-linear classification), and perform a grid-search for the gamma and cost parameters in order to pick the best model by minimizing classification error via 10-fold cross-validation. As mentioned before, we classify purely on the basis of pixel color, hence using the RGB bands as predictor variables:
# Divide training data into a train-subset and a test-subset train <- sample(nrow(training.df), round((nrow(training.df) - 1) / 2, 0)) test <- c(1:nrow(training.df))[!(c(1:nrow(training.df)) %in% train)] trainset.df <- training.df[train,] testset.df <- training.df[test,] # Fit best SVM using tuning require(e1071) svm.fit <- best.svm(pipeline~., data = trainset.df, gamma = 10^(-6:-1), cost = 10^(-1:1)) # Fit predictions and print error matrix svm.pred <- predict(svm.fit, testset.df[,1:3]) svm.tab <- table(pred = svm.pred, true = testset.df[,4]) print(svm.tab)
The resulting error matrix (which may vary due to the random sampling above) shows that the SVM does not perform perfectly, but ok: There are no false positives, but a fair amount of false negatives, indicating that the model is rather conservative in coding pipelines.
So let’s get to the real thing: First, we fit an SVM to the entire training data set, and then we use the fitted SVM for predicting the class of each pixel in the original map. In order to visualize the result, we create a copy of the raster object containing the original map, and assign zeros and ones (instead of the original RGB bands) to the pixels on the basis of the pipeline/background predictions obtained by the SVM:
# Fit tuned SVM to entire training set svm.fit <- best.svm(pipeline~., data = training.df, gamma = 10^(-6:-1), cost = 10^(-1:1)) # Prepare Iraq map for predictions iraq.df <- data.frame(getValues(iraq.img)) names(iraq.df) <- c("r", "g", "b") # Assign predicted values to target map iraq.pred <- predict(svm.fit, iraq.df) iraq.class <- ifelse(iraq.pred == "P", 1, 0) classified.img <- iraq.img[] values(classified.img) <- iraq.class
Finally, I plot the classified image using the image function (from the graphics package), and slightly change the color settings in order to optimize visibility of the pipeline pixels. I use image instead of the plot function from the Raster package for esthetic reasons (see http://bit.ly/Y23odn):
The result surely isn’t perfect, but given the difficulty of the problem (classifying green pipelines in a map where essentially everything is green), it’s more than acceptable. Though the SVM appears to systematically underestimate the number of pipeline pixels, as already indicated by the non-trivial number of false negatives in the error matrix generated above, the pipelines are clearly visible, and further image processing should very well be possible.