Selecting a subset of variables in a data frame

July 24, 2013

(This article was first published on R (en) - Analytik dat, and kindly contributed to R-bloggers)

I frequently work with datasets that have many variables, and I often need to apply a function to only a subset of the variables in a data frame. To simplify this task I wrote a short function that lets me specify which variables to include and which to exclude.


I choose the subset of variables based on the following types of conditions:

  • variable/column type (factor, numeric, character) (I know there are other types; feel free to improve the function)
  • column name pattern (columns describing similar concepts usually share a prefix)
  • the variable is not excluded (I do not want some variables to be part of the result)
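As a rough illustration of how these three conditions combine, each one can be written as a logical vector over the columns and the three joined with &. The data frame toy and its column names are invented for this sketch and are not part of the original post.

```r
# Hypothetical toy data frame, invented for this illustration.
toy <- data.frame(
  cust_age    = c(31, 45, 27),
  cust_income = c(2100, 3900, 1800),
  cust_region = factor(c("north", "south", "north")),
  note        = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

# Condition 1: the column has one of the wanted types (numeric or factor here).
is_wanted_type <- sapply(toy, is.numeric) | sapply(toy, is.factor)

# Condition 2: the column name matches a pattern (names starting with "cust").
matches_name <- grepl("^cust", names(toy))

# Condition 3: the column is not on the exclusion list.
not_excluded <- !(names(toy) %in% "cust_income")

# A column is selected only if all three conditions hold.
selected <- names(toy)[is_wanted_type & matches_name & not_excluded]
print(selected)
```

Here selected holds cust_age and cust_region: note fails the type and name conditions, and cust_income is excluded explicitly.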

With R it was surprisingly easy to write the function varlist():

varlist <- function (df=NULL,type=c("numeric","factor","character"), pattern="", exclude=NULL) {
   # Collect column names by type; the result is grouped by type, not by column order
   vars <- character(0)
   if (any(type %in% "numeric")) {
     vars <- c(vars,names(df)[sapply(df,is.numeric)])
   }
   if (any(type %in% "factor")) {
     vars <- c(vars,names(df)[sapply(df,is.factor)])
   }
   if (any(type %in% "character")) {
     vars <- c(vars,names(df)[sapply(df,is.character)])
   }
   # Keep names that are not excluded and that match the regular expression
   vars[(!vars %in% exclude) & grepl(vars,pattern=pattern)]
}

The function has the following parameters:

  • df (the data frame to select columns from)
  • type (column type: numeric, factor, character, or any combination given as a vector; all three by default)
  • pattern (a regular expression used to filter matching variable names; the default "" matches every name)
  • exclude (a vector of names to exclude)

I will demonstrate how this works on the “German Credit Data” dataset:

german_data <- read.table(file="", sep=" ", header=FALSE, stringsAsFactors=TRUE)

names(german_data) <- c('ca_status','duration','credit_history','purpose','credit_amount','savings', 'present_employment_since','installment_rate_income','status_sex','other_debtors','present_residence_since','property','age','other_installment','housing','existing_credits', 'job','liable_maintenance_people','telephone','foreign_worker','gb')

Now we can start playing with varlist():

## All variables starting with cred
## All numeric variables
## All factor variables except gb and variables starting with c
## Same as the previous one, only using pattern instead of c()
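The calls that belong to these four comments are not shown here, so the following is only a sketch of what matching calls could look like, not necessarily the author's exact code. It runs against german_like, a small made-up stand-in with a few of the column names above (and gb assumed to be a factor), so that it works without the data file. varlist() is repeated so the block runs on its own.

```r
# varlist() from the article, repeated so this example is self-contained.
varlist <- function(df = NULL, type = c("numeric", "factor", "character"),
                    pattern = "", exclude = NULL) {
  vars <- character(0)
  if (any(type %in% "numeric")) {
    vars <- c(vars, names(df)[sapply(df, is.numeric)])
  }
  if (any(type %in% "factor")) {
    vars <- c(vars, names(df)[sapply(df, is.factor)])
  }
  if (any(type %in% "character")) {
    vars <- c(vars, names(df)[sapply(df, is.character)])
  }
  vars[(!vars %in% exclude) & grepl(pattern, vars)]
}

# Hypothetical stand-in for german_data: illustrative values, subset of columns.
german_like <- data.frame(
  ca_status      = factor(c("A11", "A12", "A14")),
  duration       = c(6, 48, 12),
  credit_history = factor(c("A34", "A32", "A34")),
  credit_amount  = c(1169, 5951, 2096),
  savings        = factor(c("A65", "A61", "A61")),
  gb             = factor(c(1, 2, 1))
)

## All variables starting with cred
starts_cred <- varlist(german_like, pattern = "^cred")

## All numeric variables
numeric_vars <- varlist(german_like, type = "numeric")

## All factor variables except gb and variables starting with c
factors_a <- varlist(german_like, type = "factor",
                     exclude = c("gb", varlist(german_like, pattern = "^c")))

## Same as the previous one, only using pattern instead of c()
factors_b <- varlist(german_like, type = "factor", pattern = "^[^c]",
                     exclude = "gb")

print(list(starts_cred, numeric_vars, factors_a, factors_b))
```

The ^ anchor matters: pattern="^cred" matches only names that begin with cred, while an unanchored pattern such as "credit" also matches names like existing_credits.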

Once we have the list of column names, it is easy to use sapply to do the real work:

> sapply(german_data[,varlist(german_data,type="numeric",pattern="credit")], summary)
        credit_amount existing_credits
Min.              250            1.000
1st Qu.          1366            1.000
Median           2320            1.000
Mean             3271            1.407
3rd Qu.          3972            2.000
Max.            18420            4.000

Of course, we can also pass our own function to sapply:

> sapply(german_data[,varlist(german_data,type="numeric",pattern="credit")], function (x) length(unique(x)))
   credit_amount existing_credits 
             921                4 
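One caveat with the df[, names] form used above: when only a single name is selected, R drops the result to a plain vector, and sapply then runs the function once per value instead of once per column. The sketch below shows the difference on a made-up stand-in data frame; one_var is a hypothetical placeholder for a varlist() result with a single match.

```r
# Hypothetical stand-in for german_data, with illustrative values.
german_like <- data.frame(
  duration      = c(6, 48, 12),
  credit_amount = c(1169, 5951, 2096)
)

# Placeholder for a varlist() result that matched exactly one column.
one_var <- "credit_amount"

# df[, one_var] drops to a numeric vector, so sapply() visits each value in turn.
per_value <- sapply(german_like[, one_var], function(x) length(unique(x)))

# df[one_var] stays a data frame, so sapply() visits the column as a whole.
per_column <- sapply(german_like[one_var], function(x) length(unique(x)))

print(per_value)
print(per_column)
```

Indexing with german_data[varlist(...)] or german_data[, varlist(...), drop=FALSE] keeps the result a data frame regardless of how many names are returned.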

Let me know if you find this useful or have other solutions when dealing with many variables.
