In this post, I describe how to prepare your data for migrating between SPSS/Excel and R. This is the third post in a series, the first two of which can be found here and here. Don’t forget, this is primarily aimed at those working on datasets for psychology experiments, as that’s what I do.
Datasets in SPSS/Excel
One of the golden rules of working with datasets in SPSS is that you need to have one row for each participant. I know there are some exceptions to this, but it’s an important general rule for SPSS.
The main consequence of this is that, when you’re dealing with any form of within-subjects data, your dataset quickly becomes very wide indeed. Let’s look at an example below. Here, we have 10 participants, involved in two experimental sessions. For each session, we’ve measured the Reaction Time (RT).
That’s not too messy (note that I just pasted in 1200 for the values as this is just an illustration). But let’s make things worse. Let’s have 10 experimental sessions, each with three different blocks of trials, each representing a different within-subjects condition. What does it look like now?
Well, we can’t fit it all into a single screenshot, as the dataset has a large number of columns. This is an illustration of what gets referred to as a wide data format – you have a large number of columns mapping on to various factors, variables, etc.
R does things differently, for most of the statistical tests that I’ll be discussing: it uses the long data format instead.
Long Datasets in R
When you think about it, wide datasets can be a real pain. I’ve seen people spend hours running pivot tables and then having to drag columns around to get their datasets in a format that SPSS will be happy with.
With R, things are significantly easier: for many tests, such as t-tests and ANOVAs of various forms, you only need to use a single layout: the long data format. You can probably guess what this is already, but let’s do a direct comparison using the first example dataset described above.
Again, let’s say we have Reaction Times (RTs) for 10 participants involved in two sessions of experimental trials. In wide format, these data look like this:
In the long format, these data look like this:
Here you can see the difference: in the long format, the one row per participant rule does not apply. Instead, you have one row for each combination of factors under examination.
What if your Datasets are all in the Wide Format?
There are a number of options that you can use to convert between the two different formats. I’ve covered perhaps one of the easiest methods, in the form of the reshape package, in a previous post. You’ll need to install the reshape2 package to do this, using the package installation guide I presented previously.
Just to give an example, let’s work through the dataset I’ve been describing above.
First, let’s create some data:
session1 <- rnorm(mean=1500, sd=250, 10) session2 <- rnorm(mean=1000, sd=250, 10) ppt <- seq(1:10) wide<- data.frame(ppt, session1, session2)
That gives us a dataframe called wide. How do we reshape the dataset to the long format that we want? Simple, by using the following:
long<- melt(wide, id=c("ppt"))
This then gives us a dataframe called long, arranged in the format we want.
In many cases, if you want to avoid having to do this, it’s best to make sure your datasets are in the long format beforehand – it’s a simple case of planning ahead and knowing that you can do things differently.
Summary and Next Steps
This post illustrated how to get your data organised for use in R for those who are used to using SPSS/Excel. There are many useful ways to re-organise your data, and I’ve covered one of them here (the reshape package). The next steps include aggregating your data and then running statistical tests.