The most difficult part of the learning curve in R is often getting going – many datasets come pre-installed in packages and already organised, so it is difficult to see how to import your own data into R. This post takes you step by step through the process of making a table from a spreadsheet and then a simple graph.
The first thing is to get some data. A .csv file is a common spreadsheet-like file format. Currently I’m working with some air quality data downloaded from the UK air quality archive. The data I’ve downloaded are from 2009 in Nottingham, UK, and contain automated measurements of Nitric Oxide, NO2, Ozone, and Sulphur Dioxide. The file is here. You can cut and paste the code below into R.
First, copy the spreadsheet file into your working directory. The next step is to put the data into a variable, called data, using the read.csv function.
We also skip the first 7 lines of the file (if you look at the file in Notepad, you’ll see that the first 7 lines are descriptions and a header). I wanted my own headers, which I set in the columns vector. stringsAsFactors = FALSE is important – without it, versions of R before 4.0.0 turn text columns into factors, which can cause confusing errors later. Note that R is case sensitive, so the argument has to be written stringsAsFactors.
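The import step described above can be sketched roughly as follows. This is a minimal sketch under assumptions: the file name, the column names (apart from NO and ozone, which are used later in this post), and the sample rows are all made up so that the example runs on its own. With the real download you would point read.csv at the file in your working directory instead.

```r
# Sketch only: the file name, most column names, and all values below are
# placeholders for illustration, not the real Nottingham measurements.

# Write a small stand-in file: 6 description lines plus 1 header line,
# so that the first 7 lines need to be skipped.
csv_path <- file.path(tempdir(), "air_quality_example.csv")
writeLines(c(
  "Example air quality export",
  "Site: example site",
  "Year: 2009",
  "Units: placeholder",
  "Status: placeholder",
  "",
  "Date,Time,Nitric Oxide,Nitrogen Dioxide,Ozone,Sulphur Dioxide",
  "01/01/2009,01:00,12,30,40,3",
  "01/01/2009,02:00,15,28,38,4",
  "01/01/2009,03:00,9,25,44,2"
), csv_path)

# Our own column headers (names are assumptions)
columns <- c("date", "time", "NO", "NO2", "ozone", "SO2")

# Skip the 7 description/header lines and apply our own headers
data <- read.csv(csv_path,
                 header = FALSE,
                 skip = 7,
                 col.names = columns,
                 stringsAsFactors = FALSE)

# Check the column types: date and time should be character, not factor
str(data)
```

Setting header = FALSE matters here because the file's own header line has already been skipped; otherwise read.csv would treat the first row of measurements as column names.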
You can look at the data we’ve just imported using, for example, data[1:10, ], which shows the first 10 rows of the data (and all the columns). R has lots of ways to access data from a table. For example, we can look at the 5th to 10th measurements of NO using data$NO[5:10].
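A few equivalent ways of picking out rows and columns are sketched below. The small table here is a made-up stand-in (an assumption, so the example runs by itself); with the imported air quality data you would run the same indexing commands on data directly.

```r
# Stand-in table with placeholder values; only the column names NO and
# ozone match the ones used in this post.
data <- data.frame(NO = 1:12, ozone = 50:39)

data[1:10, ]       # first 10 rows, all columns
head(data, 10)     # the same rows, using head()

data$NO[5:10]      # 5th to 10th NO measurements
data[5:10, "NO"]   # the same values, by row range and column name
```

The empty space after the comma in data[1:10, ] means "all columns"; leaving out the comma entirely would select columns instead of rows.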
So, let’s now do a plot. A simple plot is to see what happens to NO levels over the whole dataset. In that case, all you have to do is call plot(data$NO).
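As a small illustration, the simple plot might look like this with axis labels added. The values are placeholders (an assumption, so the example runs without the real file), and the label text is just a suggestion.

```r
# Placeholder NO values standing in for the imported column
data <- data.frame(NO = c(12, 15, 9, 20, 31, 28, 17, 11, 8, 14, 22, 19))

# Plotting a single vector puts the measurement number on the x axis
# and the value on the y axis
plot(data$NO, xlab = "Measurement number", ylab = "NO", pch = 20)
```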
For a more complex graph:
## start by saving the original graphical parameters
def.par <- par(no.readonly = TRUE)
x <- data$NO
y <- data$ozone
xlabel <- "NO"
ylabel <- "ozone"
layout(matrix(c(2,1,1,3,1,1), 2, 3, byrow = TRUE))
plot(x, y, xlab = xlabel, ylab = ylabel, pch = 20)
plot(x, xlab = NA, ylab = xlabel, pch = 20)
plot(y, xlab = NA, ylab = ylabel, pch = 20)
## reset the graphics display to default
par(def.par)
You should get something like:
So, what we’ve done here is use the layout command. We’ve defined a matrix with 2 rows and 3 columns. The numbers in the matrix tell R where the plots should go: each cell holds the number of the plot that is drawn there. The matrix command which does this is:
matrix(c(2,1,1,3,1,1), 2, 3, byrow = TRUE)
and the output you get from this is:
     [,1] [,2] [,3]
[1,]    2    1    1
[2,]    3    1    1
This shows that the second plot will be in the top left, the third in the bottom left, and the first spread over the 4 cells on the right. The actual plots are simple. We’ve defined x to be the NO data (using x <- data$NO ) and y to be ozone. Then we’ve plotted x and y against each other, and also plotted each one in its own pane against measurement number, like a time series. It’s worth playing with the numbers in this command to change the layout of the graph – can you stack the 3 graphs into a column?
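One possible answer to the stacking question is sketched below: a matrix with 3 rows and 1 column puts each plot in its own row. The x and y values are placeholders (an assumption, so the example runs on its own), and the variable name stacked is only for illustration.

```r
# Placeholder values standing in for data$NO and data$ozone
x <- c(12, 15, 9, 20, 31, 28, 17, 11)
y <- c(40, 38, 44, 35, 27, 30, 37, 42)
xlabel <- "NO"
ylabel <- "ozone"

## save the original graphical parameters
def.par <- par(no.readonly = TRUE)

# 3 rows, 1 column: plot 1 on top, plot 2 in the middle, plot 3 at the bottom
stacked <- matrix(c(1, 2, 3), nrow = 3, ncol = 1)
layout(stacked)

plot(x, y, xlab = xlabel, ylab = ylabel, pch = 20)
plot(x, xlab = NA, ylab = xlabel, pch = 20)
plot(y, xlab = NA, ylab = ylabel, pch = 20)

## reset the graphics display to default
par(def.par)
```

If the plotting window is very small, R may complain that the figure margins are too large; enlarging the window before running the code usually sorts that out.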
Well, that’s got us going for now.
There are of course much more complex plots which we can use, and other ways to work with data, but those are for a later post.