R Tutorial Series: Introduction to The R Project for Statistical Computing (Part 2)

October 15, 2009
(This article was first published on R Tutorial Series, and kindly contributed to R-bloggers)

Welcome to part two of the Introduction to The R Project for Statistical Computing tutorial. If you missed part one, it can be found here. In this segment, we will explore the following topics.

  • Importing Data
  • Variables
  • Workspace Files
  • Console Files
  • Finding Help

Tutorial Files

Before we start, you may want to download the sample data (.csv) used in this tutorial. Be sure to right-click and save the file to your R working directory.

Importing Data

While values can be input directly into R, the most common method for obtaining data is to import it from preexisting sources. Most spreadsheets can be converted to CSV (comma-separated values) files, which are recommended for use with R. However, by way of the foreign package, a variety of alternative data files can be imported, such as ones generated in SPSS. Below are examples demonstrating how to import data using both methods.

To import data from a CSV file, use the read.csv("FILENAME") command, where FILENAME is the name of the file that you would like to import. If the file is not in your R working directory, FILENAME must include the path to it.

  > read.csv("intro_pt2_data.csv")

When a file is read, the console displays its contents, as depicted in the screenshot below.
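The import step can be sketched end to end as follows. In case the sample file is not in your working directory, this sketch first writes the 20 Age and Income values shown later in this tutorial to a temporary CSV file. The names sampleData and csvPath, and the temporary location, are assumptions made for illustration only.

```r
# Build a stand-in for intro_pt2_data.csv from the values shown in this tutorial
sampleData <- data.frame(
  Age = c(10, 25, 43, 32, 70, 19, 5, 21, 35, 24,
          12, 14, 49, 62, 48, 40, 33, 67, 9, 28),
  Income = c(0, 35000, 75000, 55000, 25000, 20000, 0, 20000, 60000, 30000,
             0, 10000, 35000, 80000, 80000, 0, 0, 55000, 0, 100000)
)

# Hypothetical temporary file that plays the role of the downloaded CSV
csvPath <- tempfile(fileext = ".csv")
write.csv(sampleData, csvPath, row.names = FALSE)

# Reading without assigning the result prints the whole file to the console
print(read.csv(csvPath))

# head() shows only the first few rows, which is handy for larger files
print(head(read.csv(csvPath), n = 5))
```

With the real sample file in your working directory, read.csv("intro_pt2_data.csv") replaces the temporary-file steps.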

Similarly, the foreign package can be used to import files from other spreadsheet and statistical analysis programs. A hypothetical example of loading data from an SPSS file (.sav) follows.

  > #first, load the foreign package
  > library(foreign)
  > #then, import the data file
  > read.spss("newData.sav")

Note that there are a variety of read.FUNCTION commands available in R. Depending on your source file, you may be better off using a different version of the command than what has been presented here. Nonetheless, the process of importing data will remain the same.
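As one illustration of the other read commands, the sketch below reads a tab-separated text file with read.delim() and with the more general read.table(). Both are base R functions. The file contents and the names tsvPath, tabData, and sameData are hypothetical and exist only for this example.

```r
# Write a small tab-separated file to a temporary location (hypothetical data)
tsvPath <- tempfile(fileext = ".txt")
writeLines(c("Age\tIncome", "10\t0", "25\t35000", "43\t75000"), tsvPath)

# read.delim() expects tab-separated values with a header row
tabData <- read.delim(tsvPath)

# read.table() is the general form; the header and separator are stated explicitly
sameData <- read.table(tsvPath, header = TRUE, sep = "\t")

print(tabData)
```

Whichever read command fits your file, the pattern is the same: pass the file name and any format options.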

Variables

Creating Variables

An important aspect of conducting statistical analyses in R concerns the use of variables. As with other programming languages, variables can be thought of as containers that store information and allow it to be manipulated. This contrasts with merely displaying information, as happened in the earlier demonstrations of the read commands. For example, when the command read.csv("intro_pt2_data.csv") was used, age and income data for 20 subjects were read into and displayed in the console. The numbers can be seen, but they are not kept anywhere, so no statistical analyses can be conducted on them. To analyze the data, you have to save the information into a variable using the <- operator. The <- characters are used to set a variable to a certain value and can be remembered as meaning "is equal to the contents of." Accordingly, the format for creating a variable is NAME <- VALUE or, in words, "the variable named NAME is equal to the contents of the value VALUE."

  > dataSet <- read.csv("intro_pt2_data.csv")

Thus, the line of code above creates a new variable named dataSet and sets it to equal the contents of the imported CSV file.
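To make the difference between displaying and storing concrete, here is a minimal sketch. It assumes a three-row excerpt of the sample data supplied inline through the text argument of read.csv(), so that it runs without the file. With the real file, the first line would be the assignment shown above.

```r
# Stand-in for read.csv("intro_pt2_data.csv"): first three rows supplied inline
dataSet <- read.csv(text = "Age,Income\n10,0\n25,35000\n43,75000")

# Nothing is printed by the assignment itself; printing the variable shows its contents
print(dataSet)

# Because the data is stored, it can be passed to other functions
print(nrow(dataSet))     # number of subjects (rows)
print(summary(dataSet))  # basic descriptive statistics for each column
```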

Accessing Data Stored In Variables

Now that the contents of the spreadsheet have been stored in a variable, the individual data elements can be accessed. In the sample provided, age and income values were collected for 20 subjects and entered into a two-column spreadsheet. Since both age and income have their own column of values, each can be accessed individually using the format DATASET$COLUMN, where DATASET is the name of the variable that contains all of the data (here, dataSet) and COLUMN is the name of a column within the data (here, Age or Income). Column names are case sensitive. The following code demonstrates how individual variables within a dataset can be accessed and displayed.

  > dataSet$Age
  [1] 10 25 43 32 70 19 5 21 35 24 12 14 49 62 48 40 33 67 9 28
  > dataSet$Income
  [1] 0 35000 75000 55000 25000 20000 0 20000 60000 30000 0 10000 35000 80000 80000 0 0 55000 0 100000
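Once a column can be accessed, it can be passed directly to a function. The sketch below rebuilds the sample values inline (an assumption, so that it runs without the CSV file) and then uses the base R function mean() and the [[ ]] form of column access, neither of which appears in the original walkthrough.

```r
# Inline stand-in for the imported sample data (values copied from the output above)
dataSet <- data.frame(
  Age = c(10, 25, 43, 32, 70, 19, 5, 21, 35, 24,
          12, 14, 49, 62, 48, 40, 33, 67, 9, 28),
  Income = c(0, 35000, 75000, 55000, 25000, 20000, 0, 20000, 60000, 30000,
             0, 10000, 35000, 80000, 80000, 0, 0, 55000, 0, 100000)
)

# $ returns the column as a vector
ages <- dataSet$Age

# [[ ]] with the column name in quotes returns the same kind of vector
incomes <- dataSet[["Income"]]

# A column can be handed straight to a function
print(mean(dataSet$Age))
print(mean(incomes))
```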

Data Frames

Data can also be stored as a data frame, R's standard structure for tabular data, in which each column is a variable and each row is an observation. Data frames are displayed in a tabular format when printed in the R console, and their columns can be accessed with the $ operator as shown above. In fact, read.csv() already returns a data frame, so the dataSet variable created earlier is one; you can confirm this with class(dataSet). The data.frame() command is most useful when your data starts out in another form, such as separate vectors or a matrix; calling it on an existing data frame simply returns an equivalent copy. The uses of data frames will become more apparent in future tutorials. For now, know that you can create a data frame via the data.frame(DATASET) command, where DATASET is the name of the variable containing the data.

  > dataFrame <- data.frame(dataSet)
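As a quick check, class() reports what kind of object a variable holds. The sketch below assumes a three-row inline excerpt of the sample data and the hypothetical vectors ages and incomes. It shows that the result of read.csv() is a data frame and that data.frame() can also build one from separate vectors.

```r
# Inline stand-in for the imported file (first three rows only)
dataSet <- read.csv(text = "Age,Income\n10,0\n25,35000\n43,75000")

# read.csv() returns a data frame
print(class(dataSet))

# data.frame() can assemble a data frame from separate vectors of equal length
ages <- c(10, 25, 43)
incomes <- c(0, 35000, 75000)
dataFrame <- data.frame(Age = ages, Income = incomes)

# str() gives a compact description of the columns and their types
str(dataFrame)
```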

Attaching Data Variables

A convenient method for accessing variables comes thanks to the ability to attach datasets in R. This is accomplished through the attach(NAME) command, where NAME is the name of the dataset variable that you want to attach. This allows you to refer to variables within the dataset without the need to list the name of the dataset and the $ symbol. Hence, the example below accomplishes the same tasks as in the previous section, but with less code.

  > #first, attach the dataset
  > attach(dataSet)
  > #now you can access variables using the shorthand method
  > Age
  [1] 10 25 43 32 70 19 5 21 35 24 12 14 49 62 48 40 33 67 9 28
  > Income
  [1] 0 35000 75000 55000 25000 20000 0 20000 60000 30000 0 10000 35000 80000 80000 0 0 55000 0 100000

Note that each time R is run, the dataset must be reattached. This method is most useful when you know that you will be working with a single dataset for an entire session. Furthermore, a data frame can be attached and used in the same manner as a dataset.
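An attached dataset stays on R's search path until it is detached or the session ends. The sketch below, which assumes a three-row excerpt of the sample data, pairs attach() with detach(). It also shows the base R function with(), which is not covered in the original text and offers the same shorthand for a single expression without changing the search path.

```r
# Three-row excerpt of the sample data (an assumption, to keep the sketch self-contained)
dataSet <- data.frame(Age = c(10, 25, 43), Income = c(0, 35000, 75000))

# Attach, use the shorthand names, then detach when finished
attach(dataSet)
print(mean(Age))
detach(dataSet)

# with() evaluates one expression using the columns of dataSet
meanIncome <- with(dataSet, mean(Income))
print(meanIncome)
```

Detaching when you are done helps avoid confusion if two attached datasets share column names.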

Workspace Files

Every time that you create a variable to store values in R, it is saved to the current Workspace. A Workspace is a repository for all of the objects managed during a session. For instance, when you assigned the contents of the sample CSV file to the variable "dataSet", the dataSet object, complete with Age and Income data, was entered into the R Workspace. A Workspace can be saved at any time and loaded during a future session. Workspace files conventionally end with the extension ".RData" and are a useful way to pick up your work where you left off at the end of a previous session. The essential functions related to Workspaces are demonstrated below.

To save a Workspace file, use the save.image("PATH/FILENAME.RData") command, where PATH represents the directory path where you would like to save the new file (the working directory is used by default) and FILENAME is the name of the new file.

  > save.image("/Users/Admin/Desktop/NewSaveFile.RData")

Similarly, to load a Workspace file, use the load("PATH/FILENAME.RData") command, where PATH represents the directory path to the previously saved file (the working directory is used by default) and FILENAME is the name of the previously saved file.

  > load("/Users/Admin/Desktop/PreviouslySavedFile.RData")

Furthermore, a list of all of the objects currently held in the Workspace can be displayed via the ls() function.

  > ls()
  [1] "dataSet"

Note that R also features a Workspace menu where each of the above tasks can be handled. The Workspace Browser (pictured) is especially useful for visualizing the contents of your current Workspace.
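The save, load, and list steps can be combined into one round trip. This is a minimal sketch that assumes it is run at the top level of an R session, since save.image() saves the objects in the global Workspace. The temporary path workspacePath and the three-row dataSet are hypothetical stand-ins.

```r
# Create an object so that the Workspace has something to save
dataSet <- data.frame(Age = c(10, 25, 43), Income = c(0, 35000, 75000))

# Hypothetical location; any writable path ending in .RData would do
workspacePath <- file.path(tempdir(), "NewSaveFile.RData")

# Save every object in the Workspace to the file
save.image(workspacePath)

# Remove the object to imitate starting a fresh session
rm(dataSet)

# Load the saved Workspace and list what it restored
load(workspacePath)
print(ls())
```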

Console Files

As discussed in part one of this tutorial, the R Console is where commands are issued and subsequent outputs are displayed. In contrast to the Workspace, where all of the objects in use are being stored, the Console is the complete history of the actions taken by those objects.

Consider a meeting between people as an analogy to further explain the relationship between the Workspace and the Console. All of the individuals who attend the meeting are contained in a single room (i.e. the Workspace). Everything that the participants do and say is recorded in the meeting minutes (i.e. the Console). Thus, the Workspace contains objects (such as the people who attend a meeting) and the Console consists of a log of interactions between objects (such as what people say to each other during a meeting).

The contents of the Console can be saved to a text file using File > Save As… from the menu. In fact, the same procedure can be executed from the Quartz window to produce a PDF of a particular graphic. Moreover, the contents in any of the R windows can be copied and pasted into another program, such as a word processor. Unlike a Workspace, which may be saved and reloaded from session to session to continue work, a Console is most useful for keeping track of what you have done in previous sessions. This history can be a reminder of where you left off during the last session, the results of prior analyses, how to execute certain functions, or an array of other items. A sample Console output is pictured below. Take notice of the contrast between this and the previous image of the Workspace Browser.
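Console output can also be captured from code rather than from the menu. The sketch below uses the base R function sink(), an alternative technique that the original text does not cover, to redirect printed output to a text file. Note that sink() records output only, not the commands that produced it. The file name console_log.txt is an assumption.

```r
# Hypothetical log file in the session's temporary directory
logPath <- file.path(tempdir(), "console_log.txt")

# Start redirecting printed output to the file
sink(logPath)
print(summary(c(10, 25, 43)))

# Stop redirecting; output returns to the console
sink()

# Read the log back in to confirm what was recorded
logLines <- readLines(logPath)
print(logLines)
```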

Finding Help

When getting started with R for the first time, or when exploring new facets of the program, it can be useful to get help from more experienced users. Fortunately, R has a large community with a strong online presence. Help documentation, FAQs, tutorials, and discussions can be found covering nearly every aspect of R that one would ever need or want to become familiar with. The following list represents just a few of the excellent R resources that have assisted me thus far.

In spite of the abundance of R information available online, I have decided to create a series of my own tutorials for three main reasons. First, the R knowledge base is scattered across the internet, making it difficult for users to find what they need, when they need it. Second, information about R has been written by many people, in many places, at many times, causing inconsistencies in language and format to exist that challenge users' ability to easily comprehend and apply the solutions that they find. Third, there is no cohesive set of R tutorials that appeals directly to my own (and others') usage of the program, which leaves me searching for small bits of answers in many different places rather than finding holistic solutions. Thus, my goal in creating this series of tutorials is to provide fellow researchers with a coherent and unified set of essential statistical analyses that can be applied to diverse projects using the R system.
