A comment on preparing data for classifiers

[This article was first published on Win-Vector Blog » R, and kindly contributed to R-bloggers.]

I have been working through (with some honest appreciation) a recent article comparing many classifiers on many data sets: “Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?” Manuel Fernández-Delgado, Eva Cernadas, Senén Barro, Dinani Amorim; Journal of Machine Learning Research 15(Oct):3133−3181, 2014 (which we will call “the DWN paper” in this note). This paper applies 179 popular classifiers to around 120 data sets (mostly from the UCI Machine Learning Repository). The work looks good and interesting, but we do have one quibble with the data prep on 8 of the 123 shared data sets. Given the paper is already out (not just in pre-print), I think it is appropriate to comment publicly.

The DWN paper is an interesting empirical study that measures the performance of a good number of popular classifiers (179 by their own account) on about 120 data sets (mostly from UCI).

This actually represents a bit of work, as the UCI data sets are not all in exactly the same format. The data sets have varying file names, varying separators, varying missing-value symbols, varying quoting/escaping conventions, and non-machine-readable headers; some data sets have row-ids, the column to be predicted sits in varying positions, some data is in zip files, and there are many other painful variations. I have always described UCI as “not quite machine readable.” Working with any one data set is easy, but the prospect of building an adapter for each of a large number of such data sets is unappealing. Because the data sets are often small, and often artificial/synthetic (designed to show off one particular inference method), few people work with more than a few of them. The authors of DWN worked with well over 100 and shared their fully machine-readable results (.arff and apparently standardized *_R.dat files) in a convenient single downloadable tar file (see their paper for the URL).

The stated conclusion of the paper is comforting, and not entirely unexpected: random forest methods are usually in the top 3 classifiers in terms of accuracy.

The problem is: we are always more accepting of an expected outcome. To confirm such a conclusion we will, of course, need more studies (on larger and more industry-typical data sets), better measures than accuracy (see here for some details), and a lot of digging into methodology (including data preparation).

To be clear: I like the paper. The authors (as good scientists) publicly shared their data and a bit of their preparation code. This is something most authors do not do, and should in fact be our standard for accepting work for evaluation.

But, let us get down to quibbles. Let’s unpack the data and look at an example. Suppose we start with “car,” a synthetic data set we have often used as an example. The UCI repository supplies 3 files: car.c45-names, car.data, and car.names.

  • car.names Free-form description of the data-set and format.
  • car.data Comma separated data (without header).
  • car.c45-names Presumably a machine-readable header for C4.5 packages.

The standard way to deal with this data is to (by hand) inspect car.names or car.c45-names and hand-build a custom command to load the data. Example R code to do this is given below:

library(RCurl)
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data"
tab <- read.table(text=getURL(url,write=basicTextGatherer()),
   header=F,sep=',')
colnames(tab) <- c('buying', 'maint', 'doors', 'persons',
   'lug_boot', 'safety', 'class')
options(width=50)
print(summary(tab))

Which (assuming RCurl is properly installed) yields:

   buying      maint        doors      persons   
 high :432   high :432   2    :432   2   :576  
 low  :432   low  :432   3    :432   4   :576  
 med  :432   med  :432   4    :432   more:576  
 vhigh:432   vhigh:432   5more:432             
   lug_boot     safety       class     
 big  :576   high:576   acc  : 384  
 med  :576   low :576   good :  69  
 small:576   med :576   unacc:1210  
                         vgood:  65  

For any one data set having to read the documentation and adapt that into custom loading code is not a big deal. However, having to do this for over 100 data sets is an effort. Let’s look into how the DWN paper did this.

The DWN paper car directory has 9 items:

  • car.data original file from UCI.
  • car.names original file from UCI.
  • le_datos.m Matlab custom data loading code.
  • car.txt Facts about the data set.
  • car.arff Derived .arff format version of the data set.
  • car.cost Pricing of classification errors.
  • car_R.dat Derived standard tab separated values file with header.
  • conxuntos.dat Likely a result file.
  • conxuntos_kfold.dat Likely a result file.

The files I am interested in are car_R.dat and le_datos.m. car_R.dat looks to be a TSV (tab separated values) file with header, likely intended to be read into R. The file is in a very regular format: row numbers, feature columns first (named f*), and the category to be predicted last (named clase and re-encoded as an integer). Notice that all features (which in this case were originally strings or factors) have been re-encoded as floating point numbers. That is potentially a problem. Let’s try to dig into how this conversion may have been done. We look into le_datos.m and see the following code fragment:

for i_fich=1:n_fich
  f=fopen(fich{i_fich}, 'r');
  if -1==f
    error('erro en fopen abrindo %s\n', fich{i_fich});
  end
  for i=1:n_patrons(i_fich)
    fprintf(2,'%5.1f%%\r', 100*n_iter++/n_patrons_total);
    for j = 1:n_entradas
      t= fscanf(f,'%s',1);
      if j==1 || j==2
        val={'vhigh', 'high', 'med', 'low'};
      elseif j==3
        val={'2', '3', '4', '5-more'};
      elseif j==4
        val={'2', '4', 'more'};
      elseif j==5
        val={'small', 'med', 'big'};
      elseif j==6
        val={'low', 'med', 'high'};
      end
      n=length(val);
      a=2/(n-1);
      b=(1+n)/(1-n);
      for k=1:n
        if strcmp(t,val{k})
          x(i_fich,i,j)=a*k+b;
          break
        end
      end
    end
    t = fscanf(f,'%s',1);  % lectura da clase (read the class)
    for j=1:n_clases
      if strcmp(t,clase{j})
        cl(i_fich,i)=j;
        break
      end
    end
  end
  fclose(f);
end

It looks like for each categorical variable the researchers have hand-coded an ordered choice of levels. Then each level is replaced by an equally spaced code-number running from -1 through 1 (using the linear rule x(i_fich,i,j)=a*k+b). Then (in code not shown) possibly more transformations are applied to the numeric variables (such as centering and scaling to unit variance). This changes the original data, which looks like this:

  buying maint doors persons lug_boot safety class
1  vhigh vhigh     2       2    small    low unacc
2  vhigh vhigh     2       2    small    med unacc
3  vhigh vhigh     2       2    small   high unacc
4  vhigh vhigh     2       2      med    low unacc
5  vhigh vhigh     2       2      med    med unacc
6  vhigh vhigh     2       2      med   high unacc

To this:

        f1       f2       f3       f4       f5       f6 clase
1 -1.34125 -1.34125 -1.52084 -1.22439 -1.22439 -1.22439     1
2 -1.34125 -1.34125 -1.52084 -1.22439 -1.22439        0     1
3 -1.34125 -1.34125 -1.52084 -1.22439 -1.22439  1.22439     1
4 -1.34125 -1.34125 -1.52084 -1.22439        0 -1.22439     1
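For reference, here is a small R sketch (our own reading of le_datos.m, not the authors’ code) of that linear re-coding rule applied to a single column. The later centering and scaling (which we believe is what produces values like -1.22439 and -1.34125 above) is not shown.

# Sketch of the a*k+b rule: n ordered levels are mapped to equally
# spaced values running from -1 to 1 (our reconstruction, assumed).
encodeLevels <- function(column, orderedLevels) {
  n <- length(orderedLevels)
  a <- 2/(n-1)
  b <- (1+n)/(1-n)
  a*match(as.character(column), orderedLevels) + b
}
# e.g. safety: low -> -1, med -> 0, high -> 1
print(head(encodeLevels(tab$safety, c('low','med','high'))))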

It appears as if one of the machine learning libraries the authors are using only accepts numeric features (I think some of the Python scikit-learn methods have this limitation), or the authors believe they are using such a package. Whoever prepared this data seemed to be unaware that the standard way to convert categorical variables to numeric is the introduction of multiple indicator variables (see page 33 of chapter 2 of Practical Data Science with R for more details).

[Figure: indicator variables encoding US Census reported levels of education.]
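For comparison, here is a minimal sketch (not part of the DWN paper’s pipeline) of that standard indicator-variable encoding in R, using model.matrix on the car data loaded above. Each factor expands into one indicator column per non-reference level.

# Expand each categorical column into 0/1 indicator columns
# (one reference level per factor is dropped, plus an intercept column).
indicators <- model.matrix(~ buying + maint + doors + persons + lug_boot + safety,
   data=tab)
print(dim(indicators))
print(colnames(indicators))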

The point is: encoding multiple levels of a categorical variable into a single number may seem reversible to a person (as it is a 1-1 map), but some machine learning methods cannot undo the geometric detail lost in such an encoding. For example: with a linear method (be it regression, logistic regression, a linear SVM, or so on) we lose explanatory power unless the encoding has properly guessed both the correct order of the attributes and their relative magnitudes. Even tree-based methods (like decision trees, or even random forests) waste part of their explanatory power (roughly, degrees of freedom) trying to invert the encoding (leaving less power remaining to explain the original relation in the data). This sort of ad-hoc encoding may not cause much harm in this one example, but it is exactly what you don’t want to do when there are a great number of levels, when the order isn’t obvious, or when you are comparing different methods (as different methods are damaged to different degrees by this encoding).

This sort of converting categorical features to numbers through an arbitrary function is something we have seen a few times. It is one of the reasons we explicitly discuss indicator variables in “Practical Data Science with R,” despite the common wisdom that “everybody already knows about them.” When you are trying to get the best possible results for a client, you don’t want to inflict avoidable errors in your data transforms.

If you absolutely don’t want to use indicator variables, consider impact coding or a safe automated transform such as vtreat. In both cases the actual training data is used to try to estimate the order and relative magnitudes of an encoding that would be useful for downstream modeling.
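For example, a hedged sketch of what that might look like with the vtreat package (assuming its documented designTreatmentsC()/prepare() interface; in real work you would design the treatments only on training data, or use cross-validated frames, to avoid leakage):

# Let vtreat design an outcome-driven encoding of the categorical columns
# (illustration only; treatments are designed on all of tab here).
library(vtreat)
treatments <- designTreatmentsC(tab,
   c('buying','maint','doors','persons','lug_boot','safety'),
   outcomename='class', outcometarget='unacc')
tabTreated <- prepare(treatments, tab, pruneSig=NULL)
print(colnames(tabTreated))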

Is there any actual damage in this encoding? Let’s load the processed data set and see.

url2 <- 'http://winvector.github.io/uciCar/car_R.dat'
dTreated <- read.table(url2,sep='\t',header=TRUE)

The original data set supports a pretty good logistic regression model for unacceptable cars:

set.seed(32353)
train <- rbinom(dim(tab)[[1]],1,0.5)==1
m1 <- glm(class=='unacc'~buying+maint+doors+persons+lug_boot+safety,
   family=binomial(link='logit'),
   data=tab[train,])
tab$pred <- predict(m1,newdata=tab,type="response")
print(table(class=tab[!train,'class'],
   unnacPred=tab[!train,'pred']>0.5))
##        unnacPred
## class   FALSE TRUE
##   acc     181   18
##   good     30    0
##   unacc    22  577
##   vgood    35    0

The transformed data set does not support as good a logistic regression model.

m2 <- glm(clase==1~f1+f2+f3+f4+f5,
   family=binomial(link='logit'),
   data=dTreated[train,])
dTreated$pred <- predict(m2,newdata=dTreated,type="response")
print(table(class=dTreated[!train,'clase'],
   unnacPred=dTreated[!train,'pred']>0.5))
##        unnacPred
## class   FALSE TRUE
##   0        28    7
##   1        69  530
##   2        64  135
##   3        23    7

Now obviously some modeling methods are more sensitive to this mis-coding than others. In fact, for a moderate number of levels you would expect random forest methods to largely invert the coding. But the fact that some methods are more affected than others is one reason you don’t want to perform this encoding before making comparisons. As to the question of why use logistic regression at all: when you have a proper encoding of the data and the model structure is in fact somewhat linear, logistic regression can be a very good method.
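As a rough check of that claim, here is our own sketch (not from the original analysis) fitting a random forest on both encodings with the same train/test split used above; the gap between the held-out confusion tables should be smaller than it was for logistic regression. This assumes the randomForest package is installed.

# Fit random forests on the original factor encoding and on the
# numeric a*k+b encoding, then compare held-out confusion tables.
library(randomForest)
set.seed(32353)
rfOrig <- randomForest(factor(class=='unacc')~buying+maint+doors+persons+lug_boot+safety,
   data=tab[train,])
rfEnc <- randomForest(factor(clase==1)~f1+f2+f3+f4+f5+f6,
   data=dTreated[train,])
print(table(class=tab[!train,'class'],
   unaccPred=predict(rfOrig,newdata=tab[!train,])))
print(table(class=dTreated[!train,'clase'],
   unaccPred=predict(rfEnc,newdata=dTreated[!train,])))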

In the DWN paper 8 data sets (out of 123) have the a*k+b fragment in their le_datos.m file. So likely the study was largely driven by data sets that natively have only numeric features. Also, we emphasize the DWN paper shared its data and a bit of its methods, which puts it light-years ahead of most published empirical studies. The only reason we can’t similarly critique other authors is that many other authors don’t share their work.
