April 3, 2011

Few can tell you what goes into a chicken nugget, but most will agree that it's good for your brain. If you're a little sluggish and can't focus, what do you normally do? That's right, you pop a couple of chicken nuggets. Like our brains, our algorithms need food to think properly. It's not a simple matter for them, though. They can't drive to a fast-food establishment, so they rely on us to make the nuggets for them. With R and a little awk, it's actually quite easy to do this at home. And when you make brain food from scratch, you're one of the few people who actually know what goes into it.

We are going to prepare some information nuggets for libsvm, a C/C++ library of SVM algorithms. It is open-source and award-winning, so it's a good brain for our purposes. It aims to predict an outcome given a specific set of conditions. Before it can get down to the business of prediction, it needs to learn from examples. It needs to eat some chicken nuggets, and it needs those nuggets presented in a specific way. The libsvm format for data is:

label   index1:value1   index2:value2   ...

The label is the answer. What it holds depends on what you want to predict. Let's say we want to predict whether a particular stock will close up or down from yesterday. In this case, the label will be either 1 or -1. This is what a line looks like when the answer is an up day.

1   index1:value1   index2:value2   ...

We could say TRUE or FALSE, but libsvm needs a numerical representation. The index:value pairs that follow the label are what you think the brain needs to get the answer correct. Each line follows this format:

1 1: "value of 50-day SMA" 2: "value of RSI" ...
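
To make the format concrete, here is a minimal sketch in Python (chosen only because libsvm ships Python helper scripts; the rest of this post uses R and awk). The function name `to_libsvm_line` is made up for illustration and is not part of libsvm. It assumes features are numbered from 1 in ascending order, as the format above shows.

```python
def to_libsvm_line(label, values):
    """Format one training example as 'label 1:v1 2:v2 ...'.

    Hypothetical helper for illustration; not part of libsvm.
    """
    # Number the features 1, 2, 3, ... in the order they are given
    pairs = ["%d:%s" % (i, v) for i, v in enumerate(values, start=1)]
    return " ".join([str(label)] + pairs)


# An up day (label 1) with two made-up feature values
print(to_libsvm_line(1, [44.71, 60.54]))
# 1 1:44.71 2:60.54
```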

We need to replace the quoted string with an actual number and that's where R comes in. I chose a Dow 30 stock as my example. The following R code gets us most of the way there.

require("quantmod")
getSymbols("MCD")

# Indicators are lagged one day, so each row pairs yesterday's values with today's answer
MCD$Cl.sma_10       <- Lag(SMA(Cl(MCD), n=10))    #yesterday's value
MCD$Cl.sma_30       <- Lag(SMA(Cl(MCD), n=30))    #yesterday's value
MCD$Vo.sma_10       <- Lag(SMA(Vo(MCD), n=10))    #yesterday's value
MCD$Vo.sma_30       <- Lag(SMA(Vo(MCD), n=30))    #yesterday's value
MCD$Cl.rsi          <- Lag(RSI(Cl(MCD)))          #yesterday's value
MCD$Cl.return.daily <- Lag(Delt(Cl(MCD)))         #yesterday's value
MCD$Cl.return.10    <- Lag(Delt(Cl(MCD), k=10))   #yesterday's value
MCD$Cl.return.30    <- Lag(Delt(Cl(MCD), k=30))   #yesterday's value
MCD$pre_answer      <- Delt(Cl(MCD))              #today's pre_answer

# Map a daily return to a label: 1 for an up day, -1 otherwise
squish <- function(x){
  if(x > 0) return(1) else return(-1)
}

MCD <- na.locf(MCD, na.rm=TRUE)
# Column 15 is pre_answer. Assigning the whole cbind() result appends a second
# copy of the 15 columns plus the label, which is why the output has 31 columns.
MCD$answer <- cbind(MCD, apply(MCD, 1, function(x) squish(x[15])))
write.table(MCD, "~/Desktop/goo", row.names=FALSE, col.names=FALSE)
Try this yourself. You should get a text file on your desktop called goo. Here is what the first row looks like, but a warning first. It's not pretty. Remember what we're making here.
45 45.38 44.86 45.32 6806600 39.83 44.71 44.2873333333333 4683760 6206100 60.5405839422285 -0.000888494002665663 0.0112410071942446 0.0253020287212218 0.00755891507336592 45 45.38 44.86 45.32 6806600 39.83 44.71 44.2873333333333 4683760 6206100 60.5405839422285 -0.000888494002665663 0.0112410071942446 0.0253020287212218 0.00755891507336592 1
I would have preferred doing a little more prep in the R code, but some mysterious goings-on created new columns when I tried indexing out the data I wasn't interested in. I suppose it has to do with not being able to delete a column whose value creates a column you want to keep. Not sure about this one. I've turned to a little awk wizardry to get the values I truly want, and to get the format just so. Here, we convert the goo file to a paste file. This is all on one line, run from the command line in the directory where your goo file is located.
$ awk '{print $31 " 1:" $7 " 2:" $8 " 3:" $9 " 4:" $10 " 5:" $11 " 6:" $12 " 7:" $13 " 8:" $14}' goo > paste
This is the first line of the paste file. Still not appropriate for visually sensitive people, so be careful.
1 1:44.71 2:44.2873333333333 3:4683760 4:6206100 5:60.5405839422285 6:-0.000888494002665663 7:0.0112410071942446 8:0.0253020287212218
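
If awk isn't your thing, the same column picking can be sketched in Python. This is only an illustration under the same assumptions as the awk one-liner (label in column 31, features in columns 7 through 14, counting from 1); the function name `goo_row_to_libsvm` is made up. The sample row is the first row of goo shown earlier.

```python
def goo_row_to_libsvm(row, label_col=31, feature_cols=range(7, 15)):
    """Convert one whitespace-separated row of goo into a libsvm line.

    Column numbers are 1-based to match awk's $31, $7, ..., $14.
    Hypothetical helper for illustration.
    """
    fields = row.split()
    label = fields[label_col - 1]
    # Renumber the chosen columns as features 1, 2, 3, ...
    pairs = ["%d:%s" % (i, fields[c - 1])
             for i, c in enumerate(feature_cols, start=1)]
    return " ".join([label] + pairs)


# The first row of goo: the same 15 values twice, then the label
half = ("45 45.38 44.86 45.32 6806600 39.83 44.71 44.2873333333333 "
        "4683760 6206100 60.5405839422285 -0.000888494002665663 "
        "0.0112410071942446 0.0253020287212218 0.00755891507336592")
row = half + " " + half + " 1"
print(goo_row_to_libsvm(row))
```

Because the fields are passed through as text, the printed line matches the first line of the paste file character for character.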
I didn't mention Python earlier because I didn't want to make this sound too complicated right off the bat. But there is actually a little Python script, checkdata.py, that checks whether the format is satisfactory. It comes with libsvm.
$ python checkdata.py paste
No error.
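
For a feel of what a format check involves, here is a simplified sketch. It is not the code of checkdata.py, which may apply different or additional rules; the function name `check_libsvm_line` and the three rules it enforces (numeric label, index:value pairs, ascending indices starting at 1) are assumptions made for illustration.

```python
def check_libsvm_line(line):
    """Return None if the line looks well-formed, else an error message.

    Simplified illustration; not the logic of libsvm's checkdata.py.
    """
    fields = line.split()
    if not fields:
        return "empty line"
    try:
        float(fields[0])  # the label must be a number
    except ValueError:
        return "label %r is not a number" % fields[0]
    previous_index = 0
    for field in fields[1:]:
        index_text, sep, value_text = field.partition(":")
        if not sep:
            return "feature %r is missing ':'" % field
        try:
            index = int(index_text)
            float(value_text)
        except ValueError:
            return "feature %r is not index:number" % field
        if index <= previous_index:
            return "index %d is not in ascending order" % index
        previous_index = index
    return None


print(check_libsvm_line("1 1:44.71 2:44.2873333333333"))  # None
print(check_libsvm_line("1 2:0.5 1:0.3"))  # index 1 is not in ascending order
```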

There is still a little work to do with scaling the data, so don't feed this raw paste to your beloved algorithm just yet. You still need to change the bubblegum color and pasteurize it. The README that comes with libsvm explains it well.
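
As a rough idea of what scaling means here, the sketch below linearly rescales each feature column to the range [-1, 1], so that a column like volume (in the millions) doesn't drown out a column like daily return. It is an illustration only, not libsvm's own scaling tool; the function name `scale_columns` and the choice to map a constant column to 0.0 are assumptions.

```python
def scale_columns(rows, lower=-1.0, upper=1.0):
    """Linearly rescale each feature column to [lower, upper].

    rows is a list of equal-length lists of floats (features only, no label).
    Hypothetical helper for illustration.
    """
    columns = list(zip(*rows))
    mins = [min(c) for c in columns]
    maxs = [max(c) for c in columns]
    scaled = []
    for row in rows:
        new_row = []
        for v, lo, hi in zip(row, mins, maxs):
            if hi == lo:
                # Constant column: nothing to spread out (assumed choice)
                new_row.append(0.0)
            else:
                new_row.append(lower + (upper - lower) * (v - lo) / (hi - lo))
        scaled.append(new_row)
    return scaled


# Made-up feature rows with very different magnitudes per column
rows = [[10.0, 100.0], [20.0, 300.0], [30.0, 200.0]]
print(scale_columns(rows))
# [[-1.0, -1.0], [0.0, 1.0], [1.0, 0.0]]
```

Whatever minimums and maximums you compute on the training data should be reused on any data you later ask the brain to predict, so both are measured on the same ruler.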

The elegance of this approach to feeding your brain is that not only do you control the ingredients, but you can experiment with those ingredients to find the best chicken nugget recipe of all time. Ever.
