Combining dataframes when the columns don’t match

May 13, 2013
By

(This article was first published on Amy Whitehead's Research » R, and kindly contributed to R-bloggers)

Most of my work recently has involved downloading large datasets of species occurrences from online databases and attempting to smoodge1 them together to create distribution maps for parts of Australia. Online databases typically have a ridiculous number of columns with obscure names which can make the smoodging process quite difficult.

For example, I was trying to combine data from two different regions into one file, where one region had 72 columns of data and another region had 75 columns. If you try and do this using rbind, you get an error but going through and identifying non-matching columns manually would be quite tedious and error-prone.

Here’s an example of the function in use with some imaginary data. You’ll note that Database One and Two have unequal number of columns (5 versus 6), a number of shared columns (species, latitude, longitude, database) and some unshared columns (method, data.source).

  species latitude longitude        method     database
1       y   -32.14     150.3   camera trap database.one
2       x   -32.32     153.6 live trapping database.one
3       i   -33.28     151.0   camera trap database.one
4       b   -33.38     152.5 live trapping database.one
5       e   -31.30     149.7   camera trap database.one
6       w   -34.15     151.1 live trapping database.one
      database species latitude longitude data.source accuracy
1 database.two       v   -33.53     150.6   herbarium   8.5843
2 database.two       s   -32.74     150.5      museum  13.0638
3 database.two       u   -33.45     150.3   herbarium   6.1510
4 database.two       y   -33.17     151.6      museum   1.9278
5 database.two       r   -32.49     151.9   herbarium   8.9160
6 database.two       g   -32.98     150.4      museum   0.9405
rbind(database.one, database.two)
Error: numbers of columns of arguments do not match

So I created a function that can be used to combine the data from two dataframes, keeping only the columns that have the same names (I don’t care about the other ones). I’m sure there are other fancier ways of doing this but here’s how my function works.

The basics steps
1. Specify the input dataframes
2. Calculate which dataframe has the greatest number of columns
3. Identify which columns in the smaller dataframe match the columns in the larger dataframe
4. Create a vector of the column names that occur in both dataframes
5. Combine the data from both dataframes matching the listed column names using rbind
6. Return the combined data

rbind.match.columns <- function(input1, input2) {
    n.input1 <- ncol(input1)
    n.input2 <- ncol(input2)

    if (n.input2 < n.input1) {
        TF.names <- which(names(input2) %in% names(input1))
        column.names <- names(input2[, TF.names])
    } else {
        TF.names <- which(names(input1) %in% names(input2))
        column.names <- names(input1[, TF.names])
    }

    return(rbind(input1[, column.names], input2[, column.names]))
}

rbind.match.columns(database.one, database.two)
   species latitude longitude     database
1        y   -32.14     150.3 database.one
2        x   -32.32     153.6 database.one
3        i   -33.28     151.0 database.one
4        b   -33.38     152.5 database.one
5        e   -31.30     149.7 database.one
6        w   -34.15     151.1 database.one
7        v   -33.53     150.6 database.two
8        s   -32.74     150.5 database.two
9        u   -33.45     150.3 database.two
10       y   -33.17     151.6 database.two
11       r   -32.49     151.9 database.two
12       g   -32.98     150.4 database.two

Running the function gives us a new dataframe with the four shared columns and twelve records, reflecting the combined data. Awesome!

1 A high technical and scientific term!

Bought to you by the powers of knitr & RWordpress

To leave a comment for the author, please follow the link and comment on their blog: Amy Whitehead's Research » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers


Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)