Decision Trees: A Three Kingdoms Showdown

 This post was kindly contributed by 数据科学与R语言 - go there to comment and to read the full post.

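```r
library(C50)
library(rpart)
library(party)
library(reshape2)
library(ggplot2)

# Loads churnTrain and churnTest (shipped with older C50 releases;
# newer releases may require the modeldata package instead)
data(churn)

# Accuracy of each method over 100 repetitions of 10-fold cross-validation
rate.c <- rate.r <- rate.p <- rep(0, 100)

for (j in 1:100) {
    # Randomly assign every row to one of 10 folds
    num <- sample(1:10, nrow(churnTrain), replace = TRUE)
    # One 2x2 confusion matrix per fold, for each method
    res.c <- res.r <- res.p <- array(0, dim = c(2, 2, 10))

    for (i in 1:10) {
        train <- churnTrain[num != i, ]
        test <- churnTrain[num == i, ]

        # C5.0 (column 20 is the response, churn)
        model.c <- C5.0(churn ~ ., data = train)
        pre <- predict(model.c, test[, -20])
        res.c[, , i] <- as.matrix(table(pre, test[, 20]))

        # Conditional inference tree from party
        model.p <- ctree(churn ~ ., data = train)
        pre <- predict(model.p, test[, -20])
        res.p[, , i] <- as.matrix(table(pre, test[, 20]))

        # CART from rpart
        model.r <- rpart(churn ~ ., data = train)
        pre <- predict(model.r, test[, -20], type = 'class')
        res.r[, , i] <- as.matrix(table(pre, test[, 20]))
    }

    # Pool the 10 fold-level confusion matrices, then compute accuracy
    table.c <- apply(res.c, MARGIN = c(1, 2), sum)
    rate.c[j] <- sum(diag(table.c)) / sum(table.c)

    table.p <- apply(res.p, MARGIN = c(1, 2), sum)
    rate.p[j] <- sum(diag(table.p)) / sum(table.p)

    table.r <- apply(res.r, MARGIN = c(1, 2), sum)
    rate.r[j] <- sum(diag(table.r)) / sum(table.r)
}

# Reshape to long format and compare the three accuracy distributions
data <- data.frame(c50 = rate.c, rpart = rate.r, party = rate.p)
data.melt <- melt(data)

p <- ggplot(data.melt, aes(variable, value, color = variable))
p + geom_point(position = 'jitter') +
    geom_violin(alpha = 0.4)
```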
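The loop above assigns folds with `sample(1:10, n, replace = TRUE)`, so fold sizes differ from run to run. If balanced folds are preferred, one possible alternative is sketched below in base R. The helper name `make_folds` is hypothetical and not part of the original post, and `n = 3333` is used only because that is the number of rows in `churnTrain`.

```r
# Hypothetical helper (not from the original post): assign n rows to k folds
# of nearly equal size by shuffling a repeated 1..k label vector.
make_folds <- function(n, k = 10) {
    sample(rep(seq_len(k), length.out = n))
}

set.seed(1)
folds <- make_folds(3333, k = 10)

# Fold sizes differ by at most one row
table(folds)
```

In the loop, `num <- make_folds(nrow(churnTrain))` would be a drop-in replacement for the `sample()` call; whether this changes the comparison noticeably is untested here.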

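C5.0 improves on C4.5 in the following ways:
• Noticeably faster
• Lower memory usage
• More compact tree models
• Support for boosting
• Support for case weights and cost matrices
• Support for variable selection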
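As a rough illustration of the last three items, the sketch below shows how boosting, variable selection, cost matrices, and case weights can be requested through the `C50` package. It uses the built-in `iris` data only so that it is self-contained; the cost and weight values are made up for illustration, and the row/column convention of the cost matrix should be checked against `?C5.0`.

```r
library(C50)

# iris stands in for the churn data purely to keep the sketch self-contained
x <- iris[, -5]
y <- iris$Species
lv <- levels(y)

# Boosting: trials = 10 requests up to 10 boosting iterations.
# Variable selection: winnow = TRUE lets C5.0 pre-screen the predictors.
fit_boost <- C5.0(x = x, y = y, trials = 10,
                  control = C5.0Control(winnow = TRUE))

# Cost matrix (illustrative values only). Dimnames must match the class levels.
# Assumption: rows are predicted classes and columns are actual classes.
cost_mat <- matrix(1, nrow = 3, ncol = 3, dimnames = list(lv, lv))
diag(cost_mat) <- 0
cost_mat["versicolor", "virginica"] <- 5
fit_cost <- C5.0(x = x, y = y, costs = cost_mat)

# Case weights (illustrative): count virginica rows twice as heavily
w <- ifelse(y == "virginica", 2, 1)
fit_wt <- C5.0(x = x, y = y, weights = w)

# Resubstitution predictions from the boosted model
pred <- predict(fit_boost, x)
```

Calling `summary(fit_boost)` prints the fitted trees together with attribute usage, which shows which predictors survived winnowing.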