I’ve spent a lot of the last month or so on projects at work that aren’t statistics related, hence the lack of posts! In the interim, I had to do some serious research on handling datasets bigger than the last one I worked with (the one that kept threatening to max out my 8 gigs of RAM!). I kept practicing with R packages like bigmemory and ff (with its ffdf data frames), but nothing completely satisfied my need to handle a big dataset with different data types in different columns. So, after reading up on different commercial stats packages, I determined that getting Statistica would be best for my supervisor and me (she’s insanely busy and wouldn’t have the time for the learning curve of Revolution R, if we were to buy that).
When I spoke with my supervisor about Statistica, she mentioned that it can interface with R. So once we got our copies of Version 11 Advanced, I went ahead and learned how the interface works.
Setup/Installation: The setup and installation of the R integration was really annoying. There is a COM server application you have to download and install, and you have to run that installation in administrator mode. R itself also has to be installed in administrator mode. Finally, you need the rscproxy package in R, installed in the R home directory that sits in your Program Files folder. It was quite a hassle. Statistica put a white paper on their website explaining the process.
Memory Usage: When you actively use the R integration in Statistica, take a look at your memory usage (I’m using a Windows 7 computer for work). You will notice that any time you run an R function in Statistica, the R connector program takes up more and more memory, because data is being passed from Statistica to R to be processed. The upshot is that you should be careful about how much data you pass to an R procedure from Statistica, so that you don’t max out your memory.
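One way to get a rough feel for this from the R side is to ask R how big an object is once it has arrived. The sketch below uses a made-up stand-in for ActiveDataSet (inside Statistica the integration supplies it), and object.size only reports R's own estimate for that one object, not the total memory held by the connector program.

```r
# Hypothetical stand-in for the data Statistica passes to R.
ActiveDataSet <- data.frame(
  id = 1:100000,
  value = rnorm(100000)
)

# R's estimate of the memory taken up by the data frame itself.
size_bytes <- object.size(ActiveDataSet)

# Show the estimate in megabytes.
print(format(size_bytes, units = "MB"))
```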
Syntax: Check out the screenshot below. Typing R syntax into Statistica is, thankfully, pretty easy. As you can see in the screenshot, if you want to access the active dataset to do something with it, you treat it as a data frame named ActiveDataSet, and then you can use the $ sign followed by the variable name from your Statistica dataset, just as you would in R. The only catch seems to be variables with spaces in their names. For those, it seems that you have to refer to them by column number instead of by name.
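As a rough sketch of the two access styles just described, here is some plain R. Inside Statistica the ActiveDataSet data frame is supplied by the integration; the stand-in data frame and its column names (Age, Blood Pressure) are made up for illustration so that the snippet runs on its own.

```r
# Stand-in for the data frame Statistica hands to R; the columns are hypothetical.
# check.names = FALSE keeps the space in "Blood Pressure" instead of turning it into a dot.
ActiveDataSet <- data.frame(
  Age = c(34, 51, 46),
  "Blood Pressure" = c(118, 135, 127),
  check.names = FALSE
)

# A variable without spaces can be reached with the $ sign, as in ordinary R.
mean_age <- mean(ActiveDataSet$Age)

# A variable with a space in its name is reached here by its column number.
mean_bp <- mean(ActiveDataSet[[2]])

print(mean_age)
print(mean_bp)
```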
Functionality: So far, it looks like data only flows one way: from the Statistica spreadsheet to R, and then back to either the Statistica report output or a new Statistica spreadsheet. It would be nice if I could modify the data in an existing spreadsheet from R, but that seems to be out of the question.
Main advantage: Being a commercial product, Statistica doesn’t come with every statistical procedure included; the good folks at StatSoft aren’t just going to give them all away for free. For example, since I now have Statistica Advanced, I can do some cool multivariate procedures, but I can’t generate random forests unless I get Statistica Data Miner. The advantage that the R integration brings, then, is that I get advanced statistical procedures like random forests, or even graphing packages like ggplot2, without having to pay extra. I show an example of having used a random forest procedure in Statistica using R in the screenshot above.
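To make that concrete, here is a minimal sketch of the kind of R code that could be used to fit a random forest. It assumes the randomForest package is installed in the copy of R that Statistica talks to, and it uses R's built-in iris data as a stand-in for ActiveDataSet, so the variable names are illustrative only.

```r
# Assumes the randomForest package has been installed (install.packages("randomForest")).
library(randomForest)

# Stand-in for the Statistica data; inside Statistica this would be ActiveDataSet.
ActiveDataSet <- iris

set.seed(42)  # make the forest reproducible

# Predict Species from all other columns using 200 trees.
fit <- randomForest(Species ~ ., data = ActiveDataSet, ntree = 200)

# Printing the fit shows the out-of-bag error estimate and confusion matrix.
print(fit)

# Variable importance for each predictor.
print(importance(fit))
```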