In this post, I give an outline for those interested in migrating from using SPSS and Excel for data processing/analysis across to using R for data processing/analysis. This will be the first post in a small series: it’s aimed at psychology researchers – as that’s what I am, but I’m sure much of this will apply to people from other fields/disciplines. For the purposes of this, I’ll assume that you do your data manipulation (e.g., pivot tables and organising datasets) using Excel, and your stats using SPSS. I also assume you use either SPSS or Excel, or perhaps an alternative package such as SigmaPlot, to make your graphs for publications.
Before you Start
Before we get into any details, I need to point out something important. Whatever you do, DON’T sit down with R in front of you, clutching a dataset in your hand, and expect to be able to do everything you could in SPSS/Excel straight away. I’ve seen many people try this, expecting R to be identical to what they are used to from SPSS/Excel, and, not surprisingly, they give up very quickly. Just remember this: you’ve probably spent years working with SPSS and Excel, beginning as an undergrad, so learning R will take time too – but that is no reason to give up. The investment of time is more than worth it!
After you download R and have it installed, I recommend picking up RStudio and installing that. Load up RStudio. It will look something like this:
Now, there are several points worth considering when you are faced with your new setup. First, you can now officially forget about the annoyances of having to switch between programs to do different things: gone are your days of using Excel to manipulate data, then importing it to SPSS, then graphing it, and then realising there’s something horribly wrong with it. I’ve had that happen plenty of times – spotting an error at the very end of the long chain of processes. The worst part is that you then need to go back to step 1 and start from scratch. It’s no fun at all.
R changes all of that. You have SPSS, Excel and SigmaPlot all rolled into one – with oodles of other features packed in besides (plus you have the option of adding more features and packages yourself). More importantly, as you will be writing scripts to do all your data processing and graphs, if you do detect an error, it’s just a case of tweaking your script and then re-running it. Gone will be the days when you had to laboriously click through pivot tables, pasting into loads of new worksheets, losing your place and getting confused…and so on. You’ll find yourself wondering why you ever did things like that in the first place.
I’ll now go through what you have in front of you – taking each part step by step.
The Scripting/Data Tabs and Console Window
In the scripting tab(s) area (yellow), you have access to the scripts that you write. A script is a text file of R commands that you can send to R to run, either one line at a time or several at once.
The Console (highlighted red) is a bit like the output window you get from SPSS. However, unlike the SPSS output window, it’s interactive, so you can type commands into it directly rather than just seeing buckets of text spewed out. This is useful if you want to test something out or just run something once. For example, type:
40+2
in the console and press Return. It nicely gives you the answer.
To see what the script tab is for, and what scripts are for in general, type the same command into a script. All you need to type is 40+2. Then click anywhere on the line where you typed your command, and click the Run button (see image below). This sends the command from that line of your script directly to the console. You’ll see other buttons nearby that have related functions – hover over them for tooltips.
If you find yourself clicking run, and noticing the console output just adding more ‘>’ signs to the output, it means you haven’t clicked the line containing your command. Try clicking the line containing 40+2 with your mouse and running the command again.
Next up, try adding some additional lines of basic maths to your script, highlight the lines with your mouse, and then hit run again. If you have multiple lines of commands in your script to run, then you need to highlight them with your mouse. When you run them, you’ll see that these lines have all been sent to the console, with the answers coming out at the appropriate times.
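As a small illustration, a script with a few lines of basic maths might look like the following. The particular calculations are arbitrary examples chosen for this sketch; any arithmetic will do.

```r
# A few lines of basic arithmetic to highlight and run together
40 + 2       # addition
10 / 4       # division
3 * 7        # multiplication
2^5          # exponentiation
sqrt(144)    # built-in square root function
```

Highlight all five lines and click Run, and each command should appear in the console followed by its result.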
In Excel, you were used to having multiple worksheets with different datasets in them. These quickly become difficult to use when you have large numbers of worksheets to sift through. In SPSS, you need to have one main SPSS window open for each dataset. This could rapidly lead to having many SPSS windows open, causing annoyance and confusion about which one to work from.
R and RStudio take a different approach entirely. In R, you have a workspace (users of MATLAB will be familiar with this), and RStudio allows you to get an overview of your workspace at a glance. Gone are the days of many worksheets from Excel or many data windows in SPSS. The workspace in RStudio is highlighted below.
Let’s begin by demonstrating the purpose of the workspace. Type the following into a script and run it, or into the console itself:
answer <- 42
You will then see answer appear in the workspace with a value of 42.
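Once a value is stored, you can refer to it by name. The short sketch below shows a few things you might try next; the division is just an arbitrary example, and ls() is the base R function that lists what is in the workspace.

```r
# Store a value under the name "answer"
answer <- 42

# Typing the name on its own prints the stored value
answer

# Stored values can be reused in later calculations
answer / 2

# ls() lists the names of everything currently in the workspace
ls()
```

The names printed by ls() should match what RStudio shows in its workspace pane.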
But this works not only with individual values – you can access, store and manipulate entire sets of data like this as well. Let’s create a little dataset to begin with. In R we typically work with dataframes, so that’s what I’ll call them from now on. Just think of a dataframe as a bit like a data window in SPSS, or a worksheet in Excel.
Next, run the following command:
testdf <- data.frame(values = rnorm(n = 100, mean = 5, sd = 1))
What this does is create a dataframe called testdf – don’t worry about the details too much just yet, as I’ll get to handling data in the next post in this series. What this dataframe contains is a list of 100 values randomly selected from a normal distribution with a mean of five and standard deviation of one. When you run the command, you’ll notice that testdf gets added to your workspace:
One of the useful features here is that it gives us information about the number of rows (“100 obs.”) and the number of columns (“1 variables”) in our dataframe. This is often helpful for diagnostics and making sure we know what we’ve done with our data when it’s been copied/merged/transformed/etc.
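The same information can also be pulled out with commands, which is handy inside a script. This is a minimal sketch using base R functions; the set.seed() call is an assumption added here only so the random values are repeatable, and the dataframe is recreated so the snippet runs on its own.

```r
# Recreate the example dataframe so this snippet is self-contained
set.seed(1)  # assumption: fixed seed, only to make the example repeatable
testdf <- data.frame(values = rnorm(n = 100, mean = 5, sd = 1))

nrow(testdf)   # number of rows (observations)
ncol(testdf)   # number of columns (variables)
str(testdf)    # compact summary of the dataframe's structure
head(testdf)   # first six rows of the data
```

Checks like nrow() are worth running after a merge or transformation, to confirm the dataframe is still the size you expect.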
Finally, when you want to take a look at your dataframe, all you need to do is click the name of it in the workspace – a single-click is all you need. That will then open a new tab showing your data, like so:
This now allows you to look at all of your data – though unlike SPSS and Excel, only the data you have chosen to look at is open in tabs, rather than everything. Personally I feel this makes everything seem cleaner and less cluttered, but that may just be me! I know that everyone has their own personal way of dealing with things like this.
Let’s end with making a very simple plot. For now, we’ll make a histogram of our dataframe. To do that, enter the following command:
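A minimal sketch of such a command, assuming base R's hist() function and the values column of the testdf dataframe created earlier (other plotting functions would work too). The set.seed() call and the axis labels are additions for this example only.

```r
# Recreate the example dataframe so this snippet runs on its own
set.seed(42)  # assumption: fixed seed, only so the plot is repeatable
testdf <- data.frame(values = rnorm(n = 100, mean = 5, sd = 1))

# Draw a histogram of the single column; hist() also returns
# the bin information invisibly, stored here in h
h <- hist(testdf$values,
          main = "Histogram of testdf$values",
          xlab = "values")
```

The testdf$values part picks out the column named values from the dataframe, since hist() expects a set of numbers rather than a whole dataframe.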
You can see the plot shown in the Plots area in the bottom-right-hand corner of the display – shown in the image below. My randomly-selected values show a reasonably normal distribution!
That’s it for now. Hopefully this will serve as a basic orientation for using R with RStudio. For the next posts in the series, I’ll finish a tour of R, and then there will be a discussion on how to organise your data for R – in some cases, it’s a slightly different approach to SPSS, so is worth discussing in detail.
UPDATE: As Ian Fellows kindly points out in the comments, there are other GUIs available for R besides RStudio – including the excellent Deducer. I also discussed some of the other GUIs available in a previous post from when I was just starting to learn R.