Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
NOTE: In case you plan to go through the queries in sequence, please run data(iris) before each query, unless otherwise specified.
1. How to name or rename a column in a data frame?
3. How to export a data frame so that it can be used in other applications?
The best way is to export a csv file since most applications accept that format.
write.csv(iris, ‘iris.csv’, row.names= FALSE)
4. How to select a particular row/column in a data frame?
The easiest way to do this is to use the indexing notation .
To select the first column only
To select first column and put contents in a new vector
new.vec <- iris[, 1]
To select multiple columns, say 1st, 2nd and 5th, and put them in a data frame
new.data <- iris[, c(1, 2, 5)]
To select the first row only
To select first row and 3rd column
To select multiple rows from the 3rd column
iris[c 1=”4,” 2=”10,” 3=”111),” 4=”3″ language=”(1,”][/c]
5. How to aggregate a data set based on a variable? Similar to group by in proc sql.
Say we want to aggregate the entire iris data set by Species such that the new data set will have only 3 rows and the columns will have the mean value of the respective column.
We can use the aggregate function to do this.
iris.agg <- aggregate(iris[, c(1, 2, 3, 4)], by= list(iris$Species), FUN= mean)
Group.1 Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 5.006 3.428 1.462 0.246
2 versicolor 5.936 2.770 4.260 1.326
3 virginica 6.588 2.974 5.552 2.026
In case a different function needs to be applied to different columns, we need to use the powerful ddply() function from plyr.
iris.agg <- ddply(iris, .variables= .(Species), summarise,
sl.mean = mean(Sepal.Length),
sw.median = median(Sepal.Width),
pl.max = max(Petal.Length),
pw.sd = sd(Petal.Width), .progress= ‘text’)
Species sl.mean sw.median pl.max pw.sd
1 setosa 5.006 3.4 1.9 0.1053856
2 versicolor 5.936 2.8 5.1 0.1977527
3 virginica 6.588 3.0 6.9 0.2746501
6. How to create deciles of a particular variable? Similar to proc rank in SAS.
We can use the cut() function for this.
For example, to create deciles or 10 bins from Sepal.Length, do
iris$sp.decile <- cut(iris$Sepal.Length,
probs= seq(0, 1, by= 0.1)),
include.lowest= TRUE, labels= c(1:10))
1 2 3 4 5 6 7 8 9 10
16 16 13 20 15 15 13 12 17 13
Here we create ‘sp.decile’ as another column in the iris data set. After this, in case, we need to determine, say, the mean of each variable, we can use the aggregate function as below.
var.means.by.spdecile <- aggregate(iris[, c(1, 2, 3, 4)], by=
list(iris$sp.decile), FUN= mean)
7. How to deal with missing values?
Dealing with missing values in R is not very difficult, provided we use the correct notation.
Suppose we know the form of missing values in our file and it is . (period), i.e., for each observation that has a missing value, there is a . (period) in that cell. Then while importing the data, do
data.set <- read.csv('filename.csv', na.strings= '.')
In case the missing value is #N/A, then do
data.set <- read.csv('filename.csv', na.strings= '#N/A')
Similarly for other cases, we can substitute the missing value notation in the na.strings argument.
In case we are not sure of the missing values, then we first need to import the data and have a look at the values to decide.
These are some of the common doubts that I have come across. I’ll keep adding to the list as I keep getting newer ones. Please do let me know if there something you believe should be added to this. I’ll do it right away.