Clean, Consistent Column Names

July 7, 2019
By

(This article was first published on R – William Doane, and kindly contributed to R-bloggers)

I like to standardize the column names of data I’m reading into R so that I don’t have to match column names from one dataset that has an i.d. column and another that has an id column or maybe an ID column. Keep it simple: lower case with a single underscore separator between words.

My current variation on this takes pains to accept a data.frame or a vector of character values and to return the same data type. I convert some of the more common punctuation marks to abbreviated words (% to pct, # to cnt) so that datasets with both population # and population % columns are distinctly named as population_cnt and population_pct. I also convert camelCase to camel_case for consistency.

I’ve also stuck with base R to limit dependencies.

clean_names <- function(.data, unique = FALSE) {
  n <- if (is.data.frame(.data)) colnames(.data) else .data

  n <- gsub("%+", "_pct_", n)
  n <- gsub("\\$+", "_dollars_", n)
  n <- gsub("\\++", "_plus_", n)
  n <- gsub("-+", "_minus_", n)
  n <- gsub("\\*+", "_star_", n)
  n <- gsub("#+", "_cnt_", n)
  n <- gsub("&+", "_and_", n)
  n <- gsub("@+", "_at_", n)

  n <- gsub("[^a-zA-Z0-9_]+", "_", n)
  n <- gsub("([A-Z][a-z])", "_\\1", n)
  n <- tolower(trimws(n))
  
  n <- gsub("(^_+|_+$)", "", n)
  
  n <- gsub("_+", "_", n)
  
  if (unique) n <- make.unique(n, sep = "_")
  
  if (is.data.frame(.data)) {
    colnames(.data) <- n
    .data
  } else {
    n
  }
}

So I can then see the effects of this:

clean_names(
  c(
    "  a", "a  ",
    "a %", "a", "$a", "$$$a", "GDP ($)", "GDP (us$)",
    "a (#)", "a & b", "#", "$",
    "a_cnt",
    "Aa&Bb", "camelCasePhrases",
    "AlphaBetaGamma", "Alpha       Beta", "Beta  !!! Gamma",
    "a + b", "a - b", "a * b"
  )
)
 [1] "a"                  "a"                 
 [3] "a_pct"              "a"                 
 [5] "dollars_a"          "dollars_a"         
 [7] "gdp_dollars"        "gdp_us_dollars"    
 [9] "a_cnt"              "a_and_b"           
[11] "cnt"                "dollars"           
[13] "a_cnt"              "aa_and_bb"         
[15] "camel_case_phrases" "alpha_beta_gamma"  
[17] "alpha_beta"         "beta_gamma"        
[19] "a_plus_b"           "a_minus_b"         
[21] "a_star_b"  

Typically, I pipe my data import directly into my clean_names() function so that all my data starts with the same naming convention:

library(dplyr)

df <- read.csv("data-raw/source_data") %>%
  clean_names()

To leave a comment for the author, please follow the link and comment on their blog: R – William Doane.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)