Clean, Consistent Column Names
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
I like to standardize the column names of data I’m reading into R so that I don’t have to match column names from one dataset that has an i.d.
column and another that has an id
column or maybe an ID
column. Keep it simple: lower case with a single underscore separator between words.
My current variation on this takes pains to accept a data.frame or a vector of character values and to return the same data type. I convert some of the more common punctuation marks to abbreviated words (% to pct, # to cnt) so that datasets with both population #
and population %
columns are distinctly named as population_cnt
and population_pct
. I also convert camelCase
to camel_case
for consistency.
I’ve also stuck with base R to limit dependencies.
clean_names <- function(.data, unique = FALSE) { n <- if (is.data.frame(.data)) colnames(.data) else .data n <- gsub("%+", "_pct_", n) n <- gsub("\\$+", "_dollars_", n) n <- gsub("\\++", "_plus_", n) n <- gsub("-+", "_minus_", n) n <- gsub("\\*+", "_star_", n) n <- gsub("#+", "_cnt_", n) n <- gsub("&+", "_and_", n) n <- gsub("@+", "_at_", n) n <- gsub("[^a-zA-Z0-9_]+", "_", n) n <- gsub("([A-Z][a-z])", "_\\1", n) n <- tolower(trimws(n)) n <- gsub("(^_+|_+$)", "", n) n <- gsub("_+", "_", n) if (unique) n <- make.unique(n, sep = "_") if (is.data.frame(.data)) { colnames(.data) <- n .data } else { n } }
So I can then see the effects of this:
clean_names( c( " a", "a ", "a %", "a", "$a", "$$$a", "GDP ($)", "GDP (us$)", "a (#)", "a & b", "#", "$", "a_cnt", "Aa&Bb", "camelCasePhrases", "AlphaBetaGamma", "Alpha Beta", "Beta !!! Gamma", "a + b", "a - b", "a * b" ) ) [1] "a" "a" [3] "a_pct" "a" [5] "dollars_a" "dollars_a" [7] "gdp_dollars" "gdp_us_dollars" [9] "a_cnt" "a_and_b" [11] "cnt" "dollars" [13] "a_cnt" "aa_and_bb" [15] "camel_case_phrases" "alpha_beta_gamma" [17] "alpha_beta" "beta_gamma" [19] "a_plus_b" "a_minus_b" [21] "a_star_b"
Typically, I pipe my data import directly into my clean_names() function so that all my data starts with the same naming convention:
library(dplyr) df <- read.csv("data-raw/source_data") %>% clean_names()
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.