**We think therefore we R**, and kindly contributed to R-bloggers)

Sepal.Length Sepal.Width Petal.Length Petal.Width Species

1 5.1 3.5 1.4 0.2 setosa

2 4.9 3.0 1.4 0.2 setosa

3 4.7 3.2 1.3 0.2 setosa

4 4.6 3.1 1.5 0.2 setosa

5 5.0 3.6 1.4 0.2 setosa

6 5.4 3.9 1.7 0.4 setosa

NOTE: In case you plan to go through the queries in sequence, please run data(iris) before each query, unless otherwise specified.

**1. How to name or rename a column in a data frame?**

**Ans:**

**2.**

**How to determine the column information like names, type, missing values etc. in R? Similar to proc contents in SAS.**

**Ans:**

**3.** How to export a data frame so that it can be used in other applications?**Ans:**

The best way is to export a csv file since most applications accept that format.

write.csv(iris, ‘iris.csv’, row.names= FALSE)

**4.** **How to select a particular row/column in a data frame?****Ans:**

The easiest way to do this is to use the indexing notation [].

To select the first column only

iris[, 1]

To select first column and put contents in a new vector

new.vec <- iris[, 1]

To select multiple columns, say 1st, 2nd and 5th, and put them in a data frame

new.data <- iris[, c(1, 2, 5)]

To select the first row only

iris[1, ]

To select first row and 3rd column

iris[1, 3]

To select multiple rows from the 3rd column

iris

**5.** **How to aggregate a data set based on a variable? Similar to group by in proc sql.****Ans:**

Say we want to aggregate the entire iris data set by Species such that the new data set will have only 3 rows and the columns will have the mean value of the respective column.

We can use the aggregate function to do this.

iris.agg <- aggregate(iris[, c(1, 2, 3, 4)], by= list(iris$Species), FUN= mean)

iris.agg

Group.1 Sepal.Length Sepal.Width Petal.Length Petal.Width

1 setosa 5.006 3.428 1.462 0.246

2 versicolor 5.936 2.770 4.260 1.326

3 virginica 6.588 2.974 5.552 2.026

In case a different function needs to be applied to different columns, we need to use the powerful ddply() function from plyr.

library(plyr)

iris.agg <- ddply(iris, .variables= .(Species), summarise,

sl.mean = mean(Sepal.Length),

sw.median = median(Sepal.Width),

pl.max = max(Petal.Length),

pw.sd = sd(Petal.Width), .progress= ‘text’)

iris.agg

Species sl.mean sw.median pl.max pw.sd

1 setosa 5.006 3.4 1.9 0.1053856

2 versicolor 5.936 2.8 5.1 0.1977527

3 virginica 6.588 3.0 6.9 0.2746501

**6.** **How to create deciles of a particular variable? Similar to proc rank in SAS.****Ans:**

We can use the cut() function for this.

For example, to create deciles or 10 bins from Sepal.Length, do

iris$sp.decile <- cut(iris$Sepal.Length,

breaks= quantile(iris$Sepal.Length,

probs= seq(0, 1, by= 0.1)),

include.lowest= TRUE, labels= c(1:10))

table(iris$sp.decile)

1 2 3 4 5 6 7 8 9 10

16 16 13 20 15 15 13 12 17 13

Here we create ‘sp.decile’ as another column in the iris data set. After this, in case, we need to determine, say, the mean of each variable, we can use the aggregate function as below.

var.means.by.spdecile <- aggregate(iris[, c(1, 2, 3, 4)], by=

list(iris$sp.decile), FUN= mean)**7.** **How to deal with missing values? ****Ans:**

Dealing with missing values in R is not very difficult, provided we use the correct notation.

Suppose we know the form of missing values in our file and it is . (period), i.e., for each observation that has a missing value, there is a . (period) in that cell. Then while importing the data, do

data.set <- read.csv(‘filename.csv’, na.strings= ‘.’)

In case the missing value is #N/A, then do

data.set <- read.csv(‘filename.csv’, na.strings= ‘#N/A’)

Similarly for other cases, we can substitute the missing value notation in the na.strings argument.

In case we are not sure of the missing values, then we first need to import the data and have a look at the values to decide.

These are some of the common doubts that I have come across. I’ll keep adding to the list as I keep getting newer ones. Please do let me know if there something you believe should be added to this. I’ll do it right away.

**leave a comment**for the author, please follow the link and comment on their blog:

**We think therefore we R**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...