
R provides us with excellent resources to mine data, and there are some good overviews out there.

And there are other tools out there for data mining, like Weka.

Weka has a GUI, can also be run from the command line via Java, and includes a large variety of algorithms. If, for whatever reason, you do not find the algorithm you need implemented in R, Weka might be the place to go. And the RWeka package marries R and Weka.

I am not an expert in R, in Weka, or in data mining. But I happen to play around with them, and I’d like to share a starter on how to work with them. There is good documentation out there (e.g. Open-Source Machine Learning: R Meets Weka or RWeka Odds and Ends), but sometimes you want to document your own steps and ways of working, and this is what I do here.

So, I want to build a classification model for the iris dataset, based on a tree classifier. My choice is the C4.5 algorithm, which I did not find implemented in any standard R package (can anybody help me out?).

We want to predict the species of a flower based on its attributes, namely sepal and petal width and length. The three species we have are “setosa”, “versicolor” and “virginica”. A short summary is given below.

Prediction with J48 (aka C4.5)

We first look at a summary of the data, then load the RWeka package.

summary(iris)

##   Sepal.Length   Sepal.Width    Petal.Length   Petal.Width
##  Min.   :4.30   Min.   :2.00   Min.   :1.00   Min.   :0.1
##  1st Qu.:5.10   1st Qu.:2.80   1st Qu.:1.60   1st Qu.:0.3
##  Median :5.80   Median :3.00   Median :4.35   Median :1.3
##  Mean   :5.84   Mean   :3.06   Mean   :3.76   Mean   :1.2
##  3rd Qu.:6.40   3rd Qu.:3.30   3rd Qu.:5.10   3rd Qu.:1.8
##  Max.   :7.90   Max.   :4.40   Max.   :6.90   Max.   :2.5
##        Species
##  setosa    :50
##  versicolor:50
##  virginica :50
##
##
##

library(RWeka)


We now build the classifier with the J48(.) function:

iris_j48 <- J48(Species ~ ., data = iris)
iris_j48

## J48 pruned tree
## ------------------
##
## Petal.Width <= 0.6: setosa (50.0)
## Petal.Width > 0.6
## |   Petal.Width <= 1.7
## |   |   Petal.Length <= 4.9: versicolor (48.0/1.0)
## |   |   Petal.Length > 4.9
## |   |   |   Petal.Width <= 1.5: virginica (3.0)
## |   |   |   Petal.Width > 1.5: versicolor (3.0/1.0)
## |   Petal.Width > 1.7: virginica (46.0/1.0)
##
## Number of Leaves  :  5
##
## Size of the tree :   9

summary(iris_j48)

##
## === Summary ===
##
## Correctly Classified Instances         147               98      %
## Incorrectly Classified Instances         3                2      %
## Kappa statistic                          0.97
## Mean absolute error                      0.0233
## Root mean squared error                  0.108
## Relative absolute error                  5.2482 %
## Root relative squared error             22.9089 %
## Coverage of cases (0.95 level)          98.6667 %
## Mean rel. region size (0.95 level)      34      %
## Total Number of Instances              150
##
## === Confusion Matrix ===
##
##   a  b  c   <-- classified as
##  50  0  0 |  a = setosa
##   0 49  1 |  b = versicolor
##   0  2 48 |  c = virginica

plot(iris_j48)


We can assign the model to an object: printing the object gives us the tree in “Weka-Output”, summary(.) gives us the summary of the classification on the training set (again, in Weka-style), and plot(.) lets us plot the tree nicely.
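The headline figures in the summary can be recomputed by hand from the confusion matrix, which helps when reading the Weka-style output. The following is a minimal base-R sketch, not part of RWeka: the matrix is typed in from the output above, and the names `conf`, `accuracy` and `cohen_kappa` are purely illustrative.

```r
# Training-set confusion matrix copied from summary(iris_j48) above
# (rows = actual class, columns = predicted class).
classes <- c("setosa", "versicolor", "virginica")
conf <- matrix(c(50,  0,  0,
                  0, 49,  1,
                  0,  2, 48),
               nrow = 3, byrow = TRUE,
               dimnames = list(actual = classes, predicted = classes))

n <- sum(conf)

# Observed agreement: share of instances on the diagonal.
accuracy <- sum(diag(conf)) / n

# Agreement expected by chance, from the row and column totals.
expected <- sum(rowSums(conf) * colSums(conf)) / n^2

# Cohen's kappa: agreement beyond chance, relative to the maximum possible.
cohen_kappa <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, kappa = cohen_kappa), 2)
```

The two values should match the “Correctly Classified Instances” share (98 %) and the “Kappa statistic” (0.97) reported by Weka.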

Evaluation in Weka

Well, so far we used the whole dataset both for training and for evaluation, which gives an optimistic picture. We might actually want to perform cross-validation. This can be done like this:

eval_j48 <- evaluate_Weka_classifier(iris_j48, numFolds = 10, complexity = FALSE,
seed = 1, class = TRUE)
eval_j48

## === 10 Fold Cross Validation ===
##
## === Summary ===
##
## Correctly Classified Instances         144               96      %
## Incorrectly Classified Instances         6                4      %
## Kappa statistic                          0.94
## Mean absolute error                      0.035
## Root mean squared error                  0.1586
## Relative absolute error                  7.8705 %
## Root relative squared error             33.6353 %
## Coverage of cases (0.95 level)          96.6667 %
## Mean rel. region size (0.95 level)      33.7778 %
## Total Number of Instances              150
##
## === Detailed Accuracy By Class ===
##
##                  TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
##                  0.980    0.000    1.000      0.980    0.990      0.985    0.990     0.987     setosa
##                  0.940    0.030    0.940      0.940    0.940      0.910    0.952     0.880     versicolor
##                  0.960    0.030    0.941      0.960    0.950      0.925    0.961     0.905     virginica
## Weighted Avg.    0.960    0.020    0.960      0.960    0.960      0.940    0.968     0.924
##
## === Confusion Matrix ===
##
##   a  b  c   <-- classified as
##  49  1  0 |  a = setosa
##   0 47  3 |  b = versicolor
##   0  2 48 |  c = virginica


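We see slightly worse results now (96 % instead of 98 % correctly classified), as you would expect, since each instance is now predicted by a model that did not see it during training.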
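The “Detailed Accuracy By Class” table can be traced back to the confusion matrix in the same way. Here is a small base-R sketch under the assumption that rows are actual classes and columns are predicted classes, as the “classified as” header suggests; the variable names are mine, not RWeka's.

```r
# Cross-validation confusion matrix copied from eval_j48 above
# (rows = actual class, columns = predicted class).
classes <- c("setosa", "versicolor", "virginica")
cv_conf <- matrix(c(49,  1,  0,
                     0, 47,  3,
                     0,  2, 48),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(actual = classes, predicted = classes))

true_pos  <- diag(cv_conf)

# Recall (TP rate): share of each actual class that was found.
recall    <- true_pos / rowSums(cv_conf)

# Precision: share of each predicted class that is correct.
precision <- true_pos / colSums(cv_conf)

# F-measure: harmonic mean of precision and recall.
f_measure <- 2 * precision * recall / (precision + recall)

round(cbind(recall, precision, f_measure), 3)
```

The rounded columns should agree with the TP Rate, Precision and F-Measure columns in the Weka output.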

Using Weka-controls

We used the standard options for the J48 classifier, but Weka allows more. You can list them with the WOW-function:

WOW("J48")

## -U      Use unpruned tree.
## -O      Do not collapse tree.
## -C <pruning confidence>
##         Set confidence threshold for pruning.  (default 0.25)
##  Number of arguments: 1.
## -M <minimum number of instances>
##         Set minimum number of instances per leaf.  (default 2)
##  Number of arguments: 1.
## -R      Use reduced error pruning.
## -N <number of folds>
##         Set number of folds for reduced error pruning. One fold is
##         used as pruning set.  (default 3)
##  Number of arguments: 1.
## -B      Use binary splits only.
## -S      Don't perform subtree raising.
## -L      Do not clean up after the tree has been built.
## -A      Laplace smoothing for predicted probabilities.
## -J      Do not use MDL correction for info gain on numeric
##         attributes.
## -Q <seed>
##         Seed for random data shuffling (default 1).
##  Number of arguments: 1.


If, for example, we want a tree with at least 10 instances in each leaf, we change the command as follows:

j48_control <- J48(Species ~ ., data = iris, control = Weka_control(M = 10))
j48_control

## J48 pruned tree
## ------------------
##
## Petal.Width <= 0.6: setosa (50.0)
## Petal.Width > 0.6
## |   Petal.Width <= 1.7: versicolor (54.0/5.0)
## |   Petal.Width > 1.7: virginica (46.0/1.0)
##
## Number of Leaves  :  3
##
## Size of the tree :   5


And you see the tree is different (well, it just does not go as deep as the other one: 3 leaves instead of 5).

Building cost-sensitive classifiers

You might want to include a cost matrix, i.e. you want to penalize some wrong classifications more than others. If, for example, you think that classifying a versicolor wrongly is very harmful, you can penalize exactly that kind of error. This is easy to do: you just have to choose a different classifier, namely the “Cost-sensitive classifier” in Weka:

# `cost-matrix` is not a syntactic R name, so it has to be quoted with backticks
csc <- CostSensitiveClassifier(Species ~ ., data = iris,
    control = Weka_control(`cost-matrix` = matrix(c(0, 10, 0, 0, 0, 0, 0, 10, 0), ncol = 3),
        W = "weka.classifiers.trees.J48", M = TRUE))


But you have to tell the “cost-sensitive-classifier” that you want to use J48 as the algorithm, and you have to tell it which cost matrix you want to apply, namely a matrix of the form

matrix(c(0, 1, 0, 0, 0, 0, 0, 1, 0), ncol = 3)

##      [,1] [,2] [,3]
## [1,]    0    0    0
## [2,]    1    0    1
## [3,]    0    0    0


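where you penalize “versicolor” being falsely classified as one of the others. The matrix printed here uses a cost of 1 to show the layout; the call above puts 10 in the same two cells, i.e. a penalty of factor 10.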
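To make the layout easier to read, the cost matrix can be given row and column labels. This is a base-R sketch only; the labels assume that rows stand for the actual class and columns for the predicted class, which is my reading of the Weka convention and of the description above, and the name `cost` is illustrative.

```r
classes <- c("setosa", "versicolor", "virginica")

# Same values as in the CostSensitiveClassifier call above.
# matrix() fills column by column, so each line below is one column.
cost <- matrix(c(0, 10, 0,
                 0,  0, 0,
                 0, 10, 0),
               ncol = 3,
               dimnames = list(actual = classes, predicted = classes))
cost

# Cost of calling an actual versicolor a virginica:
cost["versicolor", "virginica"]
```

Only the versicolor row carries non-zero costs, so all other kinds of error are treated as free by this matrix.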

And again we evaluate on 10-fold CV:

eval_csc <- evaluate_Weka_classifier(csc, numFolds = 10, complexity = FALSE,
seed = 1, class = TRUE)
eval_csc

## === 10 Fold Cross Validation ===
##
## === Summary ===
##
## Correctly Classified Instances          98               65.3333 %
## Incorrectly Classified Instances        52               34.6667 %
## Kappa statistic                          0.48
## Mean absolute error                      0.2311
## Root mean squared error                  0.4807
## Relative absolute error                 52      %
## Root relative squared error            101.9804 %
## Coverage of cases (0.95 level)          65.3333 %
## Mean rel. region size (0.95 level)      33.3333 %
## Total Number of Instances              150
##
## === Detailed Accuracy By Class ===
##
##                  TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
##                  0.980    0.070    0.875      0.980    0.925      0.887    0.955     0.864     setosa
##                  0.980    0.450    0.521      0.980    0.681      0.517    0.765     0.518     versicolor
##                  0.000    0.000    0.000      0.000    0.000      0.000    0.500     0.333     virginica
## Weighted Avg.    0.653    0.173    0.465      0.653    0.535      0.468    0.740     0.572
##
## === Confusion Matrix ===
##
##   a  b  c   <-- classified as
##  49  1  0 |  a = setosa
##   1 49  0 |  b = versicolor
##   6 44  0 |  c = virginica


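We see that the “versicolors” are now better predicted (only one wrong, compared to 3 in the normal J48 earlier). But this happened at the expense of “virginica”: according to the confusion matrix, none of its 50 instances is classified correctly any more (6 end up as setosa and 44 as versicolor), compared to only 2 errors before. Overall accuracy drops from 96 % to about 65 %.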
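One way to compare the two models on the terms set by the cost matrix is to weight each cell of the confusion matrix by its cost and add everything up. The sketch below does this in base R with the two cross-validation matrices typed in from the outputs above; `total_cost` is a hypothetical helper, not an RWeka function, and rows are again assumed to be actual classes.

```r
classes <- c("setosa", "versicolor", "virginica")
labels  <- list(actual = classes, predicted = classes)

# Cost matrix used for the cost-sensitive classifier (rows = actual class).
cost <- matrix(c( 0, 0,  0,
                 10, 0, 10,
                  0, 0,  0),
               nrow = 3, byrow = TRUE, dimnames = labels)

# 10-fold CV confusion matrices copied from eval_j48 and eval_csc.
conf_j48 <- matrix(c(49,  1,  0,
                      0, 47,  3,
                      0,  2, 48),
                   nrow = 3, byrow = TRUE, dimnames = labels)
conf_csc <- matrix(c(49,  1,  0,
                      1, 49,  0,
                      6, 44,  0),
                   nrow = 3, byrow = TRUE, dimnames = labels)

# Hypothetical helper: sum of cell counts weighted by cell costs.
total_cost <- function(conf, cost) sum(conf * cost)

c(j48 = total_cost(conf_j48, cost), csc = total_cost(conf_csc, cost))
c(j48 = sum(diag(conf_j48)), csc = sum(diag(conf_csc))) / 150
```

Under this cost matrix the cost-sensitive model comes out cheaper (10 versus 30), even though its plain accuracy is much lower, because errors on virginica cost nothing here.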

Alright, this is just a short starter. I suggest you check out the very good introductions I referred to earlier to explore the full wealth of RWeka… Have fun!