Thoughts on SPSS and R Integration

[This article was first published on Gage Theory » R, and kindly contributed to R-bloggers.]

As part of evaluating SPSS as a modeling platform, I wanted to test SPSS’s integration with R. What I found is that getting SPSS to work with R isn’t embarrassingly obvious. Worse, I had a hard time finding documentation, whether from SPSS, online, or from the SPSS developer community, on how to get over the logical jumps that come up. In the interest of saving someone else time and encouraging SPSS to take advantage of R, here’s how to get the two talking to each other, along with my thoughts on the integration as it stands.

Sequence for Getting SPSS and R Integrated

First off, you must have SPSS Modeler, SPSS Statistics, and SPSS R Essentials installed. The integration works by creating an SPSS Statistics node in Modeler. But you’ll notice if you try to create an “SPSS Statistics” node in Modeler you’ll get the following error:

“Please run the IBM SPSS Statistics License Location Utility and restart this program before using this node.”

There’s nothing wrong with your license for SPSS Statistics or Modeler. The issue is Modeler doesn’t know where the Statistics license is. To resolve the error:

  1. In Modeler go to Tools > Options > Helper Applications
  2. Browse to the location of the SPSS Statistics executable called “stats.exe”
  3. Inside the Helper Applications dialog box run the “IBM SPSS Statistics License Location Utility” and click Okay

Now you’ve got the license error handled, but the SPSS Statistics node can only be used in one way:

  1. To use any SPSS Statistics nodes a “Filter” node must be used before the Statistics node
  2. In the “Filter” node in the “Filter” tab click on the “Filter” icon and click “Rename for IBM SPSS Statistics”
  3. In the dialog box click “Underscore” and click Okay
  4. Connect the “Filter” node to a “Statistics Output” node, go to the “Syntax” tab, and select the “Syntax Editor” radio button to enter R code
  • Note: There’s another “gotcha” here. SPSS Statistics code that wraps R code must have the data step removed, because Modeler controls the data delivered to the Statistics Output node

Example: R data -> SPSS

Now let’s try to do something useful and pass data generated in R back to SPSS.

Here’s the process:

  • Create an SPSS meta-data dictionary
    • dict <- spssdictionary.CreateSPSSDictionary(a.dict, b.dict)
  • Set up the dictionary in SPSS
    • spssdictionary.SetDictionaryToSPSS("results", dict)
  • Then send the data to SPSS
    • spssdata.SetDataToSPSS("results", example)

For defining the meta-data for each variable the syntax is as follows:

  • varName = Variable Name
  • varLabel = Variable Label
  • varType = Type of variable
    • 0 for numeric
    • For strings, an integer giving the number of characters in the string
  • varFormat = Format of the variable
    • Aw – for strings only, where “w” is the number of characters, e.g. A4
    • Fw.d – for numerics only, where “w” is the total field width (counting the decimal point) and “d” is the number of decimals, e.g. F5.2
  • varMeasurementLevel = nominal, scale, or ordinal
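As a rough illustration of these five fields for a string variable, here is a minimal sketch in plain R. The data frame, column names, and labels are made up for the example, and nothing here calls SPSS; it only builds the vectors that would later be handed to spssdictionary.CreateSPSSDictionary.

```r
# Hypothetical example data: one string column and one numeric column
example <- data.frame(
  city  = c("Oslo", "Rome", "Lima"),
  score = c(12.5, 7.25, 99),
  stringsAsFactors = FALSE
)

# String variable: varType is the character width, varFormat is "A" plus that width
city.width <- max(nchar(example$city))
city.dict <- c("city", "City name", city.width,
               paste("A", city.width, sep = ""), "nominal")

# Numeric variable: varType is 0, varFormat is an F format
score.dict <- c("score", "Test score", 0, "F5.2", "scale")

print(city.dict)
print(score.dict)
```

Because c() mixes strings and numbers, R stores every field as a character string, which is also what happens in the numeric-only definitions used in this post.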

Here’s the actual code that goes into the SPSS Statistics Output Syntax Editor:

BEGIN PROGRAM R.
# Create the definition for the columns
# The format is c("variable name", "variable label", "variable type", "variable format", "variable measurement level")
a.dict <- c("alpha","alpha",0,"F1","scale")
b.dict <- c("beta","beta",0,"F1","scale")
# Create the dictionary for all the variables
dict <- spssdictionary.CreateSPSSDictionary(a.dict, b.dict)
# Apply the dictionary to an unpopulated, but named SPSS data set
spssdictionary.SetDictionaryToSPSS("results", dict)
# Create the actual data
example <- data.frame("alpha" = c(1, 2, 3), "beta" = c(2, 3, 4))
# Populate the data into the SPSS data set
spssdata.SetDataToSPSS("results",example)
spssdictionary.EndDataStep()
END PROGRAM.

General Thoughts

After spending quite a few hours playing around with the SPSS & R integration, I have a pretty distinct opinion on the subject. Here are responses to the questions I had and the questions I received from others:

Does SPSS Modeler integrate with R?

Yes.

Is the SPSS Modeler integration with R fluid?

No. Or if it is, I have yet to comprehend its beauty.

How does the integration work?

Modeler -> SPSS Statistics -> R. Modeler is indirectly integrated. Modeler talks to R through Statistics.

Is it possible to pass data from Modeler to R?

Yes. In fact, it looks like that is the only way to run R code, even if you don’t need to pass data to R. The “Statistics Output” node must have a “Filter” node preceding it, and a “Filter” node must have a “Source” node attached to run – even if you don’t need the data.

How is the speed?

Slow. Even operations that are nearly instantaneous in R or SPSS with small data sets took several seconds. I don’t know where the slowdown occurs, but it’s very noticeable.

Where is R output displayed?

R output is sent to an SPSS Statistics Output Viewer. Data can be printed, SPSS pivot tables can be printed, and graphs can be printed.

Is it possible to pass data back to SPSS Statistics?

Yes.

Is it possible to directly pass data back to Modeler?

I’m unclear on this. It is definitely possible to indirectly pass data back to Modeler by having R or SPSS Statistics write data to a source that Modeler can pick up with one of its “Source” nodes. I have not found a way to directly pass any SPSS Statistics or R output back to Modeler. It seems like this should be easy, but I haven’t seen any built-in functionality like that. This is one of the elements where it hurts not to have a good, active, online community for SPSS Modeler.
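The indirect route can be sketched from the R side as writing a delimited file that a Modeler “Source” node then reads. This is only a sketch: the file name and location are placeholders (a temporary directory is used so the example runs anywhere), and the Modeler side still has to be pointed at the same path by hand.

```r
# Hypothetical output location; replace with a folder Modeler can read
out.file <- file.path(tempdir(), "r_results.csv")

results <- data.frame(alpha = c(1, 2, 3), beta = c(2, 3, 4))

# Write a plain CSV without row names so the header row matches the columns
write.csv(results, file = out.file, row.names = FALSE)

# Read the file back to confirm the round trip before wiring up a Source node
check <- read.csv(out.file)
print(check)
```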

Can you change data in SPSS or add a new variable without exporting the whole SPSS table?

R cannot modify data in the SPSS environment. A new variable could be created in R, sent to SPSS Statistics, and then attached to the SPSS Statistics data set using the SPSS Statistics command language, but it’s not as simple as appending a column to a Statistics data set.

How is the process for passing data back and forth between SPSS and R?

Cumbersome. Getting SPSS data into R is an easy one-liner. Moving data from R back to SPSS Statistics is not. SPSS will only accept a data frame from R. Moving the data frame requires defining five pieces of meta-data for every column in the data frame, a step to create this meta-data dictionary for SPSS, a step to write the meta-data to SPSS, and only then can the data frame actually be written to SPSS. I don’t think it’d be that hard to define some R helper functions to speed the process up a bit, but it’d be a project.
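One possible shape for such a helper is sketched below. The function names make.dict.entry and make.dict.entries are hypothetical, the default numeric format F8.2 is an assumption, and the variable name is reused as the label. The helper itself is plain R. The commented-out lines at the end show how its result might be handed to the SPSS functions used earlier in this post, and that hand-off has not been tested.

```r
# Hypothetical helper: build the five-field meta-data vector for one column
make.dict.entry <- function(name, column) {
  if (is.character(column) || is.factor(column)) {
    # Strings: varType is the widest value, varFormat is "A" plus that width
    width <- max(nchar(as.character(column)), 1, na.rm = TRUE)
    c(name, name, width, paste("A", width, sep = ""), "nominal")
  } else {
    # Numerics: varType is 0; F8.2 is an assumed default, adjust as needed
    c(name, name, 0, "F8.2", "scale")
  }
}

# Hypothetical helper: one meta-data vector per column of a data frame
make.dict.entries <- function(df) {
  entries <- lapply(names(df), function(n) make.dict.entry(n, df[[n]]))
  names(entries) <- names(df)
  entries
}

example <- data.frame(alpha = c(1, 2, 3), label = c("a", "bb", "ccc"),
                      stringsAsFactors = FALSE)
entries <- make.dict.entries(example)
print(entries)

# Inside BEGIN PROGRAM R ... END PROGRAM the entries might then be passed on
# (untested assumption that CreateSPSSDictionary accepts them positionally):
# dict <- do.call(spssdictionary.CreateSPSSDictionary, unname(entries))
# spssdictionary.SetDictionaryToSPSS("results", dict)
# spssdata.SetDataToSPSS("results", example)
# spssdictionary.EndDataStep()
```

Measurement level is guessed from the column type here, so ordinal variables would still need to be corrected by hand.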

Must every variable referenced in the R code already be present as named in SPSS?

Nope. You can pass data from SPSS into R and then do whatever you normally would inside of R. That includes referencing data from other locations. When you pass in data from SPSS the column names of the table in R are the same as the table in SPSS, but you can do anything you want to those names once the data exists inside of R.

Can SPSS work with the twitteR package in R?

No. The twitteR package requires the RJSONIO package which will only run in R version 2.13 or higher and SPSS Statistics only integrates with R version 2.12. If SPSS makes it to version 2.13 it should work fine, but all the previous caveats about data passing still apply.
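A version mismatch like this can be checked from inside R before a script tries to load a package. Here is a minimal sketch using base R’s getRversion(); the 2.13 threshold is the one quoted above, and the variable names are only for illustration.

```r
# Minimum R version, taken from the requirement described in the text
required <- "2.13.0"

# getRversion() returns the running R version and compares against strings
version.ok <- getRversion() >= required

if (version.ok) {
  message("R ", as.character(getRversion()), " meets the ", required, " requirement")
} else {
  message("R ", as.character(getRversion()), " is older than ", required)
}
```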

What’s your overall opinion of the SPSS & R integration?

A bit disappointed. Strangely, though, most of my disappointment comes from the SPSS Statistics & SPSS Modeler integration. The SPSS Statistics & R integration works to the extent that it is relatively easy to pass data from Statistics to R and view R’s output inside of SPSS Statistics, just as you would with any of the GUI menu options. That part works. What seems to be lacking are return paths for data from R to SPSS Statistics and from SPSS Statistics to SPSS Modeler. I’m new to Modeler, so I can’t guarantee the SPSS Statistics -> Modeler communication is poor, but I haven’t seen it work. I can see how to incorporate R scripts into a Modeler workflow by writing a full-blown extension to SPSS Statistics that Modeler can call, but that really limits the use of R as an exploratory tool inside of the Modeler workflow. The overhead of writing an extension is too much for anything less than a piece of analysis you know you will use many times.

How’s the SPSS & Python integration?

I don’t know and after my experience with SPSS & R, I don’t intend to find out. Grinding through SPSS & R wasn’t very enjoyable and took a lot more time than I expected it to. I would only dive into Python (which I know far less well than R) if I knew there was something I absolutely wanted to do with Python & SPSS.
