This post will consider some useful functions for dealing with data frames during data processing and validation.

Consider an artificial data set, created using the **expand.grid** function, in which some rows of the data frame are duplicated.

    > des = expand.grid(A = c(2,2,3,4), B = c(1,3,5,5,7))
    > des
       A B
    1  2 1
    2  2 1
    3  3 1
    4  4 1
    5  2 3
    6  2 3
    7  3 3
    8  4 3
    9  2 5
    10 2 5
    11 3 5
    12 4 5
    13 2 5
    14 2 5
    15 3 5
    16 4 5
    17 2 7
    18 2 7
    19 3 7
    20 4 7

If we want to identify rows that are duplicates, the **duplicated** function comes in handy. It returns a logical vector that is TRUE for every row that repeats an earlier row:

    > duplicated(des)
    [1] FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE TRUE TRUE TRUE TRUE FALSE TRUE FALSE FALSE
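
As a small sketch of how this logical vector might be used during validation, the code below counts the duplicated rows and lists their positions. The variable names (`dup`, `n_dup`, `dup_rows`, `in_dup_group`) are purely illustrative, and the `fromLast = TRUE` call is an optional extra for flagging every copy of a repeated row, including the first one.

```r
# Rebuild the example data frame so the snippet is self-contained
des <- expand.grid(A = c(2, 2, 3, 4), B = c(1, 3, 5, 5, 7))

# TRUE for each row that repeats an earlier row
dup <- duplicated(des)

# Number of duplicated rows and their row positions
n_dup <- sum(dup)        # 8 for this data set
dup_rows <- which(dup)   # 2 6 10 13 14 15 16 18

# Scanning from the bottom as well flags every member of a
# duplicate group, not just the second and later copies
in_dup_group <- dup | duplicated(des, fromLast = TRUE)
sum(in_dup_group)        # 14 for this data set
```

Counting duplicates first can be a quick sanity check before deciding whether to drop them.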

We can pick out the unique rows of the data frame with the following code:

    > des[!duplicated(des),]
       A B
    1  2 1
    3  3 1
    4  4 1
    5  2 3
    7  3 3
    8  4 3
    9  2 5
    11 3 5
    12 4 5
    17 2 7
    19 3 7
    20 4 7
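
The same subset can also be obtained with the base R function `unique`, which has a method for data frames. The sketch below compares the two approaches; the names `by_index`, `by_unique` and `renumbered` are illustrative only, and resetting the row names is an optional step that the original example does not perform.

```r
# Rebuild the example data frame so the snippet is self-contained
des <- expand.grid(A = c(2, 2, 3, 4), B = c(1, 3, 5, 5, 7))

# Two ways of keeping only the first occurrence of each row
by_index  <- des[!duplicated(des), ]
by_unique <- unique(des)

# Both give the same 12 rows, with the original row names retained
identical(by_index, by_unique)
nrow(by_unique)

# Optionally renumber the rows 1..12 after dropping duplicates
renumbered <- by_unique
rownames(renumbered) <- NULL
```

Indexing with `!duplicated(des)` is handy when the logical vector is needed for something else too, while `unique(des)` is the shorter option when only the de-duplicated data frame is wanted.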

After loading a large file into a data frame, we might want to check some of the data to ensure that it is as expected. Rather than printing out the entirety of the data frame, we can use the **head** and **tail** functions to view the top or bottom few rows. An example using the rock data set that ships with R:

    > head(rock)
      area    peri     shape perm
    1 4990 2791.90 0.0903296  6.3
    2 7002 3892.60 0.1486220  6.3
    3 7558 3930.66 0.1833120  6.3
    4 7352 3869.32 0.1170630  6.3
    5 7943 3948.54 0.1224170 17.1
    6 7979 4010.15 0.1670450 17.1
    > tail(rock)
       area     peri    shape perm
    43 5605 1145.690 0.464125 1300
    44 8793 2280.490 0.420477 1300
    45 3475 1174.110 0.200744  580
    46 1651  597.808 0.262651  580
    47 5514 1455.880 0.182453  580
    48 9718 1485.580 0.200447  580
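
Both functions show six rows by default, and the `n` argument changes how many are returned. The following is a minimal sketch of that, again using the rock data set; the variable names `top3` and `bottom2` are illustrative, and the `dim` call is added here only as a companion check on the overall size.

```r
# Overall size of the data frame: 48 rows and 4 columns
dim(rock)

# First three rows only
top3 <- head(rock, n = 3)

# Last two rows only; the original row names are kept
bottom2 <- tail(rock, n = 2)

top3
bottom2
```

Keeping `n` small is useful when the data frame has many columns and each printed row wraps across several lines.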

