R – Sorting a data frame by the contents of a column

February 12, 2010
By

(This article was first published on Developmentality » R, and kindly contributed to R-bloggers)

Let’s examine how to sort the contents of a data frame by the value of a column

> numPeople = 10
> sex=sample(c("male","female"),numPeople,replace=T)
> age = sample(14:102, numPeople, replace=T)
> income = sample(20:150, numPeople, replace=T)
> minor = age<18

This last statement might look surprising if you’re used to Java or a traditional programming language. Rather than becoming a single boolean/truth value, minor actually becomes a vector of truth values, one per row in the age column.  It’s equivalent to the much more verbose code in Java:

int[] age= ...;
for (int i = 0; i < income.length; i++) {
   minor[i] = age[i] < 18;
}

Just as expected, the value of minor is a vector:

> mode(minor)
[1] "logical"
> minor
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE

Next we create a data frame, which groups together our various vectors into the columns of a data structure:

> population = data.frame(sex=sex, age=age, income=income, minor=minor)
> population
 sex age income minor
1    male  68    150 FALSE
2    male  48     21 FALSE
3  female  68     58 FALSE
4  female  27    124 FALSE
5  female  84    103 FALSE
6    male  92    112 FALSE
7    male  35     65 FALSE
8  female  15    117  TRUE
9    male  89     95 FALSE
10   male  26     54 FALSE

The arguments (sex=sex, age=age, income=income, minor=minor) assign the same names to the columns as I originally named the vectors; I could just as easily call them anything.  For instance,

> data.frame(a=sex, b=age, c=income, minor=minor)
 a  b   c minor
1    male 68 150 FALSE
2    male 48  21 FALSE
3  female 68  58 FALSE
4  female 27 124 FALSE
5  female 84 103 FALSE
6    male 92 112 FALSE
7    male 35  65 FALSE
8  female 15 117  TRUE
9    male 89  95 FALSE
10   male 26  54 FALSE

But I prefer the more descriptive labels I gave previously.

> population
     sex   age income minor
1    male  68    150 FALSE
2    male  48     21 FALSE
3  female  68     58 FALSE
4  female  27    124 FALSE
5  female  84    103 FALSE
6    male  92    112 FALSE
7    male  35     65 FALSE
8  female  15    117  TRUE
9    male  89     95 FALSE
10   male  26     54 FALSE

Now let's say we want to order by the age of the people. To do that is a one liner:

> population[order(population$age),]
 sex age income minor
8  female  15    117  TRUE
10   male  26     54 FALSE
4  female  27    124 FALSE
7    male  35     65 FALSE
2    male  48     21 FALSE
1    male  68    150 FALSE
3  female  68     58 FALSE
5  female  84    103 FALSE
9    male  89     95 FALSE
6    male  92    112 FALSE

This is not magic; you can select arbitrary rows from any data frame  with the same syntax:

> population
 sex age income minor
1   male  68    150 FALSE
2   male  48     21 FALSE
3 female  68     58 FALSE

The order function merely returns the indices of the rows in sorted order.

> order(population$age)
 [1]  8 10  4  7  2  1  3  5  9  6

Note the $ syntax; you select columns of a data frame by using a dollar sign and the name of the column. You can retrieve the names of the columns of a data frame with the names function.

> names(population)
[1] "sex"    "age"    "income" "minor" 

> population$income
 [1] 150  21  58 124 103 112  65 117  95  54
> income
 [1] 150  21  58 124 103 112  65 117  95  54

As you can see, they are exactly the same.

So what we're really doing with the command

population[order(population$age),]

is

population

Note the trailing comma; what this means is to take all the columns. If we only wanted certain columns, we could specify after this comma.

> population[order(population$age),c(1,2)]
 sex age
8  female  15
10   male  26
4  female  27
7    male  35
2    male  48
1    male  68
3  female  68
5  female  84
9    male  89
6    male  92

To leave a comment for the author, please follow the link and comment on their blog: Developmentality » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)