Keeping rows containing particular strings in R

January 27, 2015

(This article was first published on Bearded Analytics » R, and kindly contributed to R-bloggers)

I recently needed to filter my dataset, keeping only the rows that contained certain strings. Specifically, I needed to retain any row that had “utm_source” and “utm_medium” and “utm_campaign”. Each row in my dataset was a single string, so the idea is to search each string for the terms of interest. My approach was to use grepl and check each string for each condition that I needed it to satisfy. I consulted with my co-blogger, Jeremy, to see if he had a more intelligent way of approaching this problem. He tackled it with a regular expression using a look-ahead. You can see my ‘checker’ function below, along with Jeremy’s function ‘checker2’. Both seem to perform the required task correctly, so now it is simply a matter of performance.

#Sample Data
querystrings <- c("skuId=34567-02-S&qty=1&continueShoppingUrl=[email protected]person.invalid&codes-processed=true&qtyAvailableWithCartContents=True&basketcode=mybasket1&OrderEventCreateDateTimeLocal=2014-10-2011:06:04.937", 
"skuId=6950K-02-S&qty=1&continueShoppingUrl=[email protected]&codes-processed=true&qtyAvailableWithCartContents=True&basketcode=mybasket2&OrderEventCreateDateTimeLocal=2014-10-2011:06:04.937")

mydf <- data.frame(querystrings, stringsAsFactors = FALSE)

# This should return TRUE when all conditions have been satisfied
checker <- function(foo){
  grepl(pattern="utm_source", x=foo) &
    grepl(pattern="utm_medium", x=foo) &
    grepl(pattern="utm_campaign", x=foo)
}

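checker2 <- function(foo){
  # Assumed look-ahead pattern: each (?=...) requires one term somewhere in the string
  grepl(pattern="^(?=.*utm_source)(?=.*utm_medium)(?=.*utm_campaign)",
        x=foo, perl=TRUE)
}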
# This is the loop that was run to repeatedly test each function with a much larger dataset
# Yes, I know this is not an efficient way to do this but it is easy to read.
ttime <- NULL
for( i in 1:100){
  tt <- system.time(  tresult <- mydf[checker(mydf[, 1]), ]  )
  ttime <- rbind(ttime, tt[3])
}

jtime <- NULL
for( i in 1:100){
  jt <- system.time(  jresult <- mydf[checker2(mydf[, 1]), ] )
  jtime <- rbind(jtime, jt[3])
}

# Average elapsed time per run, in seconds
mean(ttime)
mean(jtime)
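
A self-contained version of this comparison can be sketched as follows. This is a minimal sketch rather than the original benchmark: `test_strings` is made-up stand-in data, the look-ahead pattern in `checker2` is one plausible form of such a regular expression (an assumption, not necessarily the one Jeremy wrote), and `time_it` is a hypothetical helper. It first confirms that both functions flag the same rows and then averages the elapsed time over repeated runs.

```r
# Hypothetical stand-in data: the real dataset is not public.
test_strings <- rep(c(
  "a=1&utm_source=news&utm_medium=email&utm_campaign=spring",  # all three present
  "a=1&utm_source=news&utm_medium=email",                      # campaign missing
  "skuId=1&qty=1"                                              # none present
), times = 1000)

checker <- function(foo) {
  grepl("utm_source", foo) & grepl("utm_medium", foo) & grepl("utm_campaign", foo)
}

# Assumed look-ahead pattern: each (?=...) asserts that a term occurs somewhere.
checker2 <- function(foo) {
  grepl("^(?=.*utm_source)(?=.*utm_medium)(?=.*utm_campaign)", foo, perl = TRUE)
}

# Both functions should flag exactly the same rows.
stopifnot(identical(checker(test_strings), checker2(test_strings)))

# Hypothetical helper: elapsed seconds for each of n_runs calls of f on x.
time_it <- function(f, x, n_runs = 100) {
  vapply(seq_len(n_runs), function(i) system.time(f(x))[["elapsed"]], numeric(1))
}

mean_times <- c(checker  = mean(time_it(checker, test_strings)),
                checker2 = mean(time_it(checker2, test_strings)))
print(mean_times)
```

Timings from this sketch depend on the machine and on the made-up data, so they should not be expected to match the figures reported for the real dataset.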


I am not able to share the full dataset that I was using, due to privacy concerns. The dataset that I tested both functions against had 26,746 rows. The ‘checker’ function that I wrote took 0.0801 seconds on average, and Jeremy’s approach took 0.1488 seconds. I decided to stick with my checker function, but not because of speed: I would have happily accepted the increased computation time if the times had been reversed. The reason is that I find mine easier to read, which means there is a chance that I could come back to this code in 6 months and have a clue about what it is supposed to be doing. Regular expressions can sometimes be quite hard to come back to and say, “Oh yeah, I wanted to check if all the characters that occupy prime-numbered positions in my string are vowels!” I think that my simplistic grepl statement will be easier to change if that becomes needed in the future, so I will stick with the ‘checker’ approach. Do you have a better way to approach this using R? If so, make sure to post a comment.
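
On the point about ease of change, one option is to keep the grepl approach but hold the required terms in a vector, so that adding or removing a term is a one-line edit. The following is a sketch of that alternative, not the function used above: `contains_all`, `required_terms`, and `example_strings` are hypothetical names, and `fixed = TRUE` assumes the terms are literal strings rather than regular expressions.

```r
required_terms <- c("utm_source", "utm_medium", "utm_campaign")

contains_all <- function(x, terms) {
  # One logical vector per term, TRUE where that term occurs in x
  hits <- lapply(terms, grepl, x = x, fixed = TRUE)
  # Element-wise AND across all terms
  Reduce(`&`, hits)
}

# Made-up strings: the first has all three terms, the second lacks utm_campaign
example_strings <- c(
  "utm_source=a&utm_medium=b&utm_campaign=c",
  "utm_source=a&utm_medium=b"
)
contains_all(example_strings, required_terms)  # TRUE for the first string, FALSE for the second
```

The result can be used for row selection in the same way as ‘checker’, for example `mydf[contains_all(mydf[, 1], required_terms), ]`.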
