Featurizing images: the shallow end of deep learning

[This article was first published on Revolutions, and kindly contributed to R-bloggers.]

by Bob Horton and Vanja Paunic, Microsoft AI and Research Data Group

Training deep learning models from scratch requires large data sets and significant computational resources. Using pre-trained deep neural network models to extract relevant features from images allows us to build classifiers using standard machine learning approaches that work well for relatively small data sets. In this context, a deep learning solution can be thought of as incorporating layers that compute features, followed by layers that map these features to outcomes; here we’ll just map the features to outcomes ourselves.

We explore an example of using a pre-trained deep learning image classifier to generate features for use with traditional machine learning approaches to address a problem the original model was never trained on (see the blog post “Image featurization with a pre-trained deep neural network model” for other examples). This approach allows us to quickly and easily create a custom classifier for a specialized task, using only a relatively small training set. We use the image featurization abilities of Microsoft R Server 9.1 (MRS) to create a classifier for different types of knots in lumber. These images were made publicly available by the laboratory of Prof. Dr. Olli Silven, University of Oulu, Finland, in 1995. Please note that we are using this problem as an academic example of an image classification task with clear industrial implications, but we are not really trying to raise the bar in this well-established field.

We characterize the performance of the machine learning model and describe how it might fit into the framework of a lumber grading system. Knowing the strengths and weaknesses of the classifier, we discuss how it could be used to triage additional image data for labeling by human experts, so that the system can be iteratively improved.

The pre-trained deep learning models used here are optional components that can be installed alongside Microsoft R Server 9.1; directions are here.

Lumber grading

In the sawmill industry, lumber grading is an important step of the manufacturing process. Improved grading accuracy and better control of quality variation in production lead directly to improved profits. Grading has traditionally been done by visual inspection, in which a (human) grader marks each piece of lumber as it leaves the mill, according to factors like the size, category, and position of knots, cracks, species of tree, etc. Visual inspection is often an error-prone and laborious task. Certain defect classes may be difficult to distinguish, even for a human expert. To that end, a number of automated lumber grading systems have been developed which aim to improve the accuracy and the efficiency of lumber grading [2-7].

Loading and visualizing data

Let us start by downloading the data.

DATA_DIR <- file.path(getwd(), 'data')

# Download and unpack the images only if the data directory does not exist yet
if (!dir.exists(DATA_DIR)) {
  dir.create(DATA_DIR)
  knots_url <- 'http://www.ee.oulu.fi/research/imag/knots/KNOTS/knots.zip'
  names_url <- 'http://www.ee.oulu.fi/research/imag/knots/KNOTS/names.txt'
  # mode = 'wb' keeps the zip archive intact on Windows
  download.file(knots_url, destfile = file.path(DATA_DIR, 'knots.zip'), mode = 'wb')
  download.file(names_url, destfile = file.path(DATA_DIR, 'names.txt'))
  unzip(file.path(DATA_DIR, 'knots.zip'), exdir = file.path(DATA_DIR, 'knot_images'))
}


Let’s now load the data from the downloaded files and look at some of those files.

# The first two columns of names.txt hold the file name and the knot class
knot_info <- read.delim(file.path(DATA_DIR, "names.txt"), sep=" ", header=FALSE, stringsAsFactors=FALSE)[1:2]
names(knot_info) <- c("file", "knot_class")
knot_info$path <- file.path(DATA_DIR, "knot_images", knot_info$file)

names(knot_info)
## [1] "file"       "knot_class" "path"

We’ll be trying to predict knot_class, and here are the counts of the categories:

table(knot_info$knot_class)
## decayed_knot     dry_knot    edge_knot encased_knot    horn_knot 
##           14           69           65           29           35 
##    leaf_knot   sound_knot 
##           47          179

Four of these labels relate to how the knot is integrated into the structure of the surrounding wood; these are the descriptions from the README file:

  • sound: “A knot grown firmly into the surrounding wood material and does not contain any bark or signs of decay. The color may be very close to the color of sound wood.”

  • dry: “A firm or partially firm knot, and has not taken part to the vital processes of growing wood, and does not contain any bark or signs of decay. The color is usually darker than the color of sound wood, and a thin dark ring or a partial ring surrounds the knot.”

  • encased: “A knot surrounded totally or partially by a bark ring. Compared to dry knot, the ring around the knot is thicker.”

  • decayed: “A knot containing decay. Decay is difficult to recognize, as it usually affects only the strength of the knot.”

Edge, horn and leaf knots are related to the orientation of the knot relative to the cutting plane, the position of the knot on the board relative to the edge, or both. These attributes are of different character than the ones related to structural integration. Theoretically, you could have an encased horn knot, or a decayed leaf knot, etc., though there do not happen to be any in this dataset. Unless otherwise specified, we have to assume the knot is sound, cross-cut, and not near an edge. This means that including these other labels makes the knot_class column ‘untidy’, in that it refers to more than one different kind of characteristic. We could try to split out distributed attributes for orientation and position, as well as for structural integration, but to keep this example simple we’ll just filter the data to keep only the sound, dry, encased, and decayed knots that are cross-cut and not near an edge. (It turns out that you can use featurization to fit image classification models that recognize position and orientation quite well, but we’ll leave that as an exercise for the reader.)
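
To make the idea of splitting the labels into separate attributes concrete, here is a minimal sketch of one possible tidy encoding. The column names `integration` and `cut_or_position`, and the placeholder value "plain", are our own illustrative assumptions; they do not come from the dataset or its README.

```r
# Hypothetical tidy encoding of the original labels; the column names
# `integration` and `cut_or_position` are illustrative assumptions.
knot_class <- c("sound_knot", "dry_knot", "edge_knot", "encased_knot",
                "horn_knot", "leaf_knot", "decayed_knot")

integration_labels <- c("sound", "dry", "encased", "decayed")
kind <- sub("_knot$", "", knot_class)

tidy_labels <- data.frame(
  knot_class = knot_class,
  # integration defaults to "sound" when the label does not mention it
  integration = ifelse(kind %in% integration_labels, kind, "sound"),
  # orientation/position defaults to "plain" (cross-cut, away from the edge)
  cut_or_position = ifelse(kind %in% integration_labels, "plain", kind),
  stringsAsFactors = FALSE
)
print(tidy_labels)
```

With labels in this shape, each attribute could in principle get its own classifier, although the defaults encode an assumption rather than an observation.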

knot_classes_to_keep <- c("sound_knot", "dry_knot", "encased_knot", "decayed_knot")

knot_info <- knot_info[knot_info$knot_class %in% knot_classes_to_keep,]
knot_info$knot_class <- factor(knot_info$knot_class, levels=knot_classes_to_keep)

We kept the knot_class column as character data while we were deciding which classes to keep, but now we can make that a factor.
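
The order of these two steps matters more than it might seem. The toy example below (made-up labels, base R only) sketches what happens if the conversion to a factor comes before the filtering instead of after it.

```r
labels <- c("sound_knot", "edge_knot", "dry_knot", "sound_knot", "leaf_knot")
keep <- c("sound_knot", "dry_knot")

# Converting first and filtering afterwards leaves empty levels behind
early <- factor(labels)
early_filtered <- early[early %in% keep]
table(early_filtered)   # still lists edge_knot and leaf_knot, with count 0

# Filtering the character vector first gives exactly the levels we want,
# in the order we specify
late <- factor(labels[labels %in% keep], levels = keep)
table(late)
```

Empty levels are not an error, but they tend to show up later as empty rows in tables and plots, so it is tidier to avoid them up front.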

Here are a few examples of images from each of the four classes:



library(pixmap)  # provides read.pnm

samples_per_category <- 3

op <- par(mfrow=c(1, samples_per_category), oma=c(2,2,2,2), no.readonly = TRUE)

for (kc in levels(knot_info$knot_class)){
  kc_files <- knot_info[knot_info$knot_class == kc, "path"]
  kc_examples <- sample(kc_files, samples_per_category)
  for (kc_ex in kc_examples){
    pnm <- read.pnm(kc_ex)
    # label each image with its knot number, taken from the file name
    plot(pnm, xlab=gsub(".*(knot[0-9]+).*", "\\1", kc_ex))
    mtext(text=gsub("_", " ", kc), side=3, line=0, outer=TRUE, cex=1.7)
  }
}

par(op)  # restore the previous graphics settings

Featurizing images

Here we use the rxFeaturize function from Microsoft R Server, which allows us to perform a number of transformations on the knot images in order to produce numerical features. We first resize the images to fit the dimensions required by the pre-trained deep neural network model we will use, then extract the pixels to form a numerical data set, then run that data set through the pre-trained DNN model. The result of the image featurization is a numeric vector (“feature vector”) that represents key characteristics of that image.

Image featurization here is accomplished by using a deep neural network (DNN) model that has already been pre-trained on millions of images. Currently, MRS supports four types of DNNs – three ResNet models (18, 50, 101)[1] and AlexNet [8].

library(MicrosoftML)  # provides rxFeaturize and the image transforms

knot_data_df <- rxFeaturize(data = knot_info,
                            mlTransforms = list(loadImage(vars = list(Image = "path")),
                                                resizeImage(vars = list(Features = "Image"), 
                                                            width = 224, height = 224, 
                                                            resizingOption = "IsoPad"),
                                                extractPixels(vars = "Features"),
                                                featurizeImage(var = "Features", 
                                                               dnnModel = "resnet101")),
                            mlTransformVars = c("path", "knot_class"))
## Elapsed time: 00:02:59.8362547

We have chosen the “resnet101” DNN model, which is 101 layers deep; the other ResNet options (18 or 50 layers) generate features more quickly (well under 2 minutes each for this dataset, as opposed to several minutes for ResNet-101), but we found that the features from 101 layers work better for this example.

We have placed the features in a dataframe, which lets us use any R algorithm to build a classifier. Alternatively, we could have saved them directly to an XDF file (the native file format for Microsoft R Server, suitable for large datasets that would not fit in memory, or that you want to distribute on HDFS), or generated them dynamically when training the model (see examples in the earlier blog post).

Since the featurization process takes a while, let’s save the results up to this point. We’ll put them in a CSV file so that, if you are so inspired, you can open it in Excel and admire the 2048 numerical features the deep neural net model has created to describe each image. Once we have the features, training models on this data will be relatively fast.

write.csv(knot_data_df, "knot_data_df.csv", row.names=FALSE)
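
One thing to keep in mind when reloading the features in a later session: a CSV file does not store factor levels, so the class column needs to be converted back. Here is a minimal sketch using a toy stand-in for knot_data_df; the feature column names and values are invented for illustration.

```r
# Toy stand-in for knot_data_df; the real one has 2048 feature columns
toy_df <- data.frame(
  path = c("a.ppm", "b.ppm", "c.ppm"),
  knot_class = factor(c("dry_knot", "sound_knot", "decayed_knot"),
                      levels = c("sound_knot", "dry_knot", "encased_knot", "decayed_knot")),
  Feature.0 = c(0.12, 0.50, 0.33),
  Feature.1 = c(1.75, 0.00, 2.25)
)

csv_file <- tempfile(fileext = ".csv")
write.csv(toy_df, csv_file, row.names = FALSE)

# Factor levels are not stored in a CSV, so restore them after reading
restored <- read.csv(csv_file, stringsAsFactors = FALSE)
restored$knot_class <- factor(restored$knot_class, levels = levels(toy_df$knot_class))

str(restored)
```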

Select training and test data

Now that we have extracted numeric features from each image, we can use traditional machine learning algorithms to train a classifier.

in_training_set <- sample(c(TRUE, FALSE), nrow(knot_data_df), replace=TRUE, prob=c(2/3, 1/3))
in_test_set <- !in_training_set

We put about two thirds of the examples (203) into the training set and the rest (88) in the test set.
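
A purely random split like this one can leave very few examples of a rare class in one of the two sets. As a hedged alternative (not what we did above), the sketch below does the split within each class; it uses only the class counts reported earlier, and the seed is arbitrary.

```r
set.seed(42)  # arbitrary seed, only to make the sketch reproducible

# Class sizes taken from the counts reported above
knot_class <- factor(rep(c("sound_knot", "dry_knot", "encased_knot", "decayed_knot"),
                         times = c(179, 69, 29, 14)))

# Within each class, mark roughly two thirds of the rows as training rows.
# Every class here has more than one row, so sample(idx, n) draws from idx itself.
in_training_strat <- logical(length(knot_class))
for (idx in split(seq_along(knot_class), knot_class)) {
  n_train <- round(length(idx) * 2 / 3)
  in_training_strat[sample(idx, n_train)] <- TRUE
}

table(knot_class, in_training_strat)
```

This guarantees that each class contributes to both sets in about the same proportion, at the cost of a slightly more involved split.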

Fit Random Forest model

We will use the popular randomForest package, since it handles both “wide” data (with more features than cases) and multiclass outcomes.


library(randomForest)

# Build a formula of the form knot_class ~ <feature 1> + <feature 2> + ...
form <- formula(paste("knot_class" , paste(grep("Feature", names(knot_data_df), value=TRUE), collapse=" + "), sep=" ~ "))

training_set <- knot_data_df[in_training_set,]
test_set <- knot_data_df[in_test_set,]

fit_rf <- randomForest(form, training_set)

pred_rf <- as.data.frame(predict(fit_rf, test_set, type="prob"))
pred_rf$knot_class <- knot_data_df[in_test_set, "knot_class"]
pred_rf$pred_class <- predict(fit_rf, knot_data_df[in_test_set, ], type="response")

# Accuracy
with(pred_rf, sum(pred_class == knot_class)/length(knot_class))
## [1] 0.8068182

Let’s see how the classifier did for all four classes. Here is the confusion matrix, as well as a mosaic plot showing predicted class (pred_class) against the actual class (knot_class).

with(pred_rf, table(knot_class, pred_class))
##               pred_class
## knot_class     sound_knot dry_knot encased_knot decayed_knot
##   sound_knot           51        0            0            0
##   dry_knot              3       19            0            0
##   encased_knot          6        5            1            0
##   decayed_knot          2        1            0            0
mycolors <- c("yellow3", "yellow2", "thistle3", "thistle2")
mosaicplot(table(pred_rf[,c("knot_class", "pred_class")]), col = mycolors, las = 1, cex.axis = .55, 
           main = NULL, xlab = "Actual Class", ylab = "Predicted Class")
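
The confusion matrix above can also be summarized per class. As a small worked example, the sketch below re-enters the counts by hand and computes the overall accuracy together with the recall and precision of each class, using base R only.

```r
# Confusion matrix copied from the output above (rows: actual, columns: predicted)
classes <- c("sound_knot", "dry_knot", "encased_knot", "decayed_knot")
conf_mat <- matrix(c(51,  0, 0, 0,
                      3, 19, 0, 0,
                      6,  5, 1, 0,
                      2,  1, 0, 0),
                   nrow = 4, byrow = TRUE,
                   dimnames = list(knot_class = classes, pred_class = classes))

accuracy  <- sum(diag(conf_mat)) / sum(conf_mat)
recall    <- diag(conf_mat) / rowSums(conf_mat)  # fraction of each actual class found
precision <- diag(conf_mat) / colSums(conf_mat)  # NaN when a class is never predicted

round(accuracy, 4)
round(recall, 3)
round(precision, 3)
```

The accuracy matches the value computed earlier; the precision for decayed_knot is undefined (NaN) simply because that class is never predicted on this test set.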


It looks like the classifier performs really well on the sound knots. On the other hand, all of the decayed knots in the test set are misclassified. This is a small class, so let’s look at those misclassified samples.


library(dplyr)  # for %>% and filter

pred_rf$path <- test_set$path

# misclassified knots
misclassified <- pred_rf %>% filter(knot_class != pred_class)

# number of misclassified decayed knots
num_example <- sum(misclassified$knot_class == "decayed_knot")

op <- par(mfrow = c(1, num_example), mar=c(2,1,0,1), no.readonly = TRUE)

for (i in which(misclassified$knot_class == "decayed_knot")){
  example <- misclassified[i, ]
  pnm <- read.pnm(example$path)
  plot(pnm)  # draw the image before annotating it
  mtext(text=sprintf("predicted as\n%s", example$pred_class), side=1, line=-2)
}

mtext(text='Misclassified Decayed Knots', side = 3, outer = TRUE, line = -3, cex=1.2)
par(op)  # restore the previous graphics settings



We were warned about the decayed knots in the README file; they are more difficult to visually classify. Interestingly, the ones classified as sound actually do appear to be well-integrated into the surrounding wood; they also happen to look somewhat rotten. Also, the decayed knot classified as dry does have a dark border, and looks like a dry knot. These knots appear to be two decayed sound knots and a decayed dry knot; in other words, maybe decay should be represented as a separate attribute that is independent of the structural integration of the knot. We may have found an issue with the labels, rather than a problem with the classifier.

Let’s look at the performance of the classifier more closely. Using the scores returned by the classifier, we can plot an ROC curve for each of the individual classes.


library(pROC)  # provides roc and its plot method

# Plot a one-vs-rest ROC curve for a single target class
plot_target_roc <- function(target_class, outcome, test_data, multiclass_predictions){
  is_target <- test_data[[outcome]] == target_class
  roc_obj <- roc(is_target, multiclass_predictions[[target_class]])
  plot(roc_obj, print.auc=TRUE, main=target_class)
  text(x=0.2, y=0.2, labels=sprintf("total cases: %d", sum(is_target)), pos=3)
}

op <- par(mfrow=c(2,2), oma=c(2,2,2,2), no.readonly = TRUE)

for (knot_category in levels(test_set$knot_class)){
  plot_target_roc(knot_category, "knot_class", test_set, pred_rf)
}

mtext(text="ROC curves for individual classes", side=3, line=0, outer=TRUE, cex=1.5)
par(op)  # restore the previous graphics settings
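
The area under a one-vs-rest ROC curve can also be computed directly from the scores, which helps in seeing what the number means: it is the fraction of (target, non-target) pairs in which the target case gets the higher score. The helper below, `auc_one_vs_rest`, is a hypothetical base-R sketch of that calculation and is not part of pROC; the scores are toy values.

```r
# AUC for one class against the rest, computed from ranks (Mann-Whitney U).
# `auc_one_vs_rest` is a hypothetical helper, not part of pROC.
auc_one_vs_rest <- function(is_target, score) {
  n_pos <- sum(is_target)
  n_neg <- sum(!is_target)
  r <- rank(score)  # average ranks handle ties
  (sum(r[is_target]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy scores: higher should mean "more likely to be the target class"
is_target <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)
score     <- c(0.9,  0.8,  0.7,   0.4,  0.3,   0.1)
auc_one_vs_rest(is_target, score)  # 8 of the 9 pairs are ordered correctly
```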



These ROC curves show that the classifier scores can be used to provide significant enrichment for any of the classes. With sound knots, for example, the score can unambiguously identify a large majority of the cases that are not of that class. Another way to look at this is to consider the ranges of each classification score for each actual class:


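library(dplyr)    # select and %>%
library(tidyr)    # gather
library(ggplot2)

# Reshape to one row per (knot, prediction class) score, then plot by actual class
pred_rf %>% 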
  select(-path, -pred_class) %>%
  gather(prediction, score, -knot_class) %>% 
  ggplot(aes(x=knot_class, y=score, col=prediction)) + geom_boxplot()


Note that the ranges of the sound knot and dry knot scores tend to be higher for all four classes. But when you consider the scores for a given prediction class across all the actual classes, the scores tend to be higher when they match than when they don’t. For example, even though the decayed knot score never goes very high, it tends to be higher for the decayed knots than for other classes. Here’s a boxplot of just the decayed_knot scores for all four actual classes:

boxplot(decayed_knot ~ knot_class, pred_rf, xlab="actual class", ylab="decayed_knot score", main="decayed_knot score for all classes")


Even though the multi-class classifier did not correctly identify any of the three decayed knots in the test set, this does not mean that it is useless for finding decayed knots. In fact, the decayed knots had higher scores for the decayed predictor than most of the other knots did (as shown in the boxplots above), and this score is able to correctly determine that almost 80% of the knots are not decayed. This means that this classifier could be used to screen for knots that are more likely to be decayed, or alternatively, you could use these scores to isolate a collection of cases that are unlikely to be any of the kinds of knots your classifier is good at recognizing. This ability to focus on things we are not good at might be helpful if we needed to search through a large collection of images to find more of the kinds of knots for which we need more training data. We could show those selected knots to expert human labelers, so they wouldn’t need to label all the examples in the entire database. This would help us get started with an active learning process, where we use a model to help develop a better model.
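
To illustrate the screening idea, here is a minimal sketch on made-up scores (not the scores from our model). It takes the lowest decayed_knot score seen on a known decayed knot as a threshold, and sets aside everything that scores below it.

```r
# Toy decayed_knot scores; values and labels are made up for illustration
screen_df <- data.frame(
  is_decayed    = c(TRUE, TRUE, TRUE, rep(FALSE, 7)),
  decayed_score = c(0.20, 0.12, 0.08, 0.15, 0.06, 0.05, 0.04, 0.02, 0.01, 0.00)
)

# Lowest score seen on a truly decayed knot: anything below it can be
# screened out without losing any of the known decayed cases
threshold <- min(screen_df$decayed_score[screen_df$is_decayed])

screened_out <- screen_df$decayed_score < threshold
needs_review <- screen_df[!screened_out, ]

mean(screened_out)   # fraction of knots we can skip
nrow(needs_review)   # knots to pass to an expert
```

In practice the threshold would be chosen on held-out data, since a cutoff tuned on a handful of positive cases is likely to be optimistic.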

Here we’ve shown an open-source random forest model trained on features that have been explicitly written to a dataframe. The model classes in the MicrosoftML package can also generate these features dynamically, so that the trained model can accept a file name as input and automatically generate the features on demand during scoring. For examples see the earlier blog post. In general, these data sets are “wide”, in that they have a large number of features; in this example, they have far more columns of features than rows of cases. This means that you need to select algorithms that can handle wide data, like elastic net regularized linear models, or, as we’ve seen here, random forests. Many of the MicrosoftML algorithms are well suited for wide datasets.


Image featurization allows us to make effective use of relatively small sets of labeled images that would not be sufficient to train a deep network from scratch. This is because we can re-use the lower-level features that the pre-trained model had learned for more general image classification tasks.

This custom classifier was constructed quite quickly; although the featurizer required several minutes to run, after the features were generated all training was done on relatively small models and data sets (a few hundred cases with a few thousand features), where training a given model takes on the order of seconds, and tuning hyperparameters is relatively straightforward. Industrial applications commonly require specialized classification or scoring models that can be used to address specific technical or regulatory concerns, so having a rapid and adaptable approach to generate such models should have considerable practical utility.

Featurization is the low-hanging fruit of transfer learning. More general transfer learning allows the specialization of deeper and deeper layers in a pre-trained general model, but the deeper you want to go, the more labeled training data you will generally need. We often find ourselves awash in data, but limited by the availability of quality labels. Having a partially effective classifier can let you bootstrap an active learning approach, where a crude model is used to triage images, classifying the obvious cases directly and referring only those with uncertain scores to expert labelers. The larger set of labeled images is then used to build a better model, which can do more effective triage, and so on, leading to iterative model improvement. Companies like CrowdFlower use these kinds of approaches to optimize data labeling and model building, but there are many use cases where general crowdsourcing may not be adequate (such as when the labeling must be done by experts), so having a straightforward way to bootstrap that process could be quite useful.
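
One triage step of this kind could look roughly like the sketch below. The probabilities are toy values in the same shape that predict(..., type = "prob") returns, and the cutoff of 0.8 is a hypothetical choice that would need tuning against the available labeling budget.

```r
# Toy class probabilities, in the shape predict(..., type = "prob") returns
probs <- matrix(c(0.95, 0.03, 0.01, 0.01,
                  0.45, 0.40, 0.10, 0.05,
                  0.10, 0.85, 0.03, 0.02,
                  0.35, 0.30, 0.20, 0.15),
                ncol = 4, byrow = TRUE,
                dimnames = list(NULL, c("sound_knot", "dry_knot",
                                        "encased_knot", "decayed_knot")))

confidence_cutoff <- 0.8  # hypothetical value; tune it to the labeling budget

top_score <- apply(probs, 1, max)
top_class <- colnames(probs)[max.col(probs, ties.method = "first")]

# Confident cases are classified directly; the rest go to an expert labeler
triage <- data.frame(
  top_class = top_class,
  top_score = top_score,
  action = ifelse(top_score >= confidence_cutoff, "auto-classify", "refer to expert")
)
print(triage)
```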

The labels we have may not be tidy, that is, they may not refer to distinct characteristics. In this case, the decayed knot does not seem to really be an alternative to “encased”, “dry”, or “sound” knots; rather, it seems that any of these categories of knots might possibly be decayed. In many applications, it is not always obvious what a properly distributed outcome labeling should include. This is one reason that a quick and easy approach is valuable; it can help you to clarify these questions with iterative solutions that can help frame the discussion with domain experts. In a subsequent iteration we may want to consider “decayed” as a separate attribute from “sound”, “dry” or “encased”.

Image featurization gives us a simple way to build domain-specific image classifiers using relatively small training sets. In the common use case where data is plentiful but good labels are not, this provides a rapid and straightforward first step toward getting the labeled cases we need to build more sophisticated models; with a few iterations, we might even learn to recognize decayed knots. Seriously, wood that knot be cool?


References

  1. Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. (2016) Deep Residual Learning for Image Recognition. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770-778.

  2. Kauppinen H. and Silven O.: A Color Vision Approach for Grading Lumber. In Theory & Applications of Image Processing II – Selected papers from the 9th Scandinavian Conference on Image Analysis (ed. G. Borgefors), World Scientific, pp. 367-379, 1995.

  3. Silven O. and Kauppinen H.: Recent Developments in Wood Inspection. International Journal on Pattern Recognition and Artificial Intelligence, IJPRAI, pp. 83-95, 1996.

  4. Rinnhofer, Alfred & Jakob, Gerhard & Deutschl, Edwin & Benesova, Wanda & Andreu, Jean-Philippe & Parziale, Geppy & Niel, Albert. (2006). A multi-sensor system for texture based high-speed hardwood lumber inspection. Proceedings of SPIE – The International Society for Optical Engineering. 5672. 34-43. 10.1117/12.588199.

  5. Irene Yu-Hua Gu, Henrik Andersson, Raul Vicen. (2010) Wood defect classification based on image analysis and support vector machines. Wood Science and Technology 44:4, 693-704. Online publication date: 1-Nov-2010.

  6. Kauppinen H, Silven O & Piirainen T. (1999) Self-organizing map based user interface for visual surface inspection. Proc. 11th Scandinavian Conference on Image Analysis (SCIA’99), June 7-11, Kangerlussuaq, Greenland, 801-808.

  7. Kauppinen H, Rautio H & Silven O (1999). Nonsegmenting defect detection and SOM-based classification for surface inspection using color vision. Proc. EUROPTO Conf. on Polarization and Color Techniques in Industrial Inspection (SPIE Vol. 3826), June 17-18, Munich, Germany, 270-280.

  8. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. (2012) ImageNet Classification with Deep Convolutional Neural Networks. In NIPS, 2012.
