Working with strings

April 10, 2012
By

(This article was first published on factbased, and kindly contributed to R-bloggers)

R has a lot of string functions, many of them can be found with ls("package:base", pattern="str"). Additionally, there are add-on packages such as stringr, gsubfn and brew that enhance R string processing capabilities. As a statistical language and environment, R has an edge compared to other programming languages when it comes to text mining algorithms or natural language processing. There is even a taskview for this on CRAN.

I am currently playing with markdown files in R, which eventually will result in a new version of mdtools, and collected or created some string functions I like to present in this blogpost. The source code of the functions is at the end of the post, first I show how to use these functions.

Head and tail for strings

The idea for the first two functions I had earlier, and I had to learn that providing a S3 method for head and tail is not an good idea. But strhead and strtail did prove as handy. Here are some usage examples:

> strhead("egghead", 3)
[1] "egg"
> strhead("beagle", -1) # negative index
[1] "beagl"
> strtail(c("bowl", "snowboard"), 3) # vector-able in the first argument
[1] "owl" "ard"

These functions are only syntactic sugar, hopefully easy to memorize because of their similarity to existing R functions. For packages, they are probably not worth introducing an extra dependency. I thought about defining an replacement function like substr does, but I did not try it because head and tail do not have replacement functions.

Bare minimum template

With sprintf, format and pretty, there are powerful functions for formatting strings. However, sometimes I miss the named template syntax as in Python or in Makefiles. So I implemented this in R. Here are some usage examples:

> strsubst(
+   "$(WHAT) is $(HEIGHT) meters high.", 
+   list(
+     WHAT="Berlin's teletower",
+     HEIGHT=348
+   )
+ )
[1] "Berlin's teletower is 348 meters high."
> d <- strptime("2012-03-18", "%Y-%m-%d")
> strsubst(c(
+   "Be careful with dates.",
+   "$(NO_CONV) shows a list.",
+   "$(CONV) is more helpful."),
+   list(
+     NO_CONV=d,
+     CONV= as.character(d)
+   )
+ )
[1] "Be careful with dates."                                                                                        
[2] "list(sec = 0, min = 0, hour = 0, mday = 18, mon = 2, year = 112, wday = 0, yday = 77, isdst = 0) shows a list."
[3] "2012-03-18 is more helpful."                                                                                   

The first argument can be string or a vector of strings such as the output of readLines. The second argument can be any indexable object (i.e. with working [ operator) such as lists. Environments are not indexable hence won’t work.

Parse raw text

Frequently, I need to extract parts from raw text data. For instance, few weeks ago I had to parse a SPSS script (some variable labels were hard-coded theree and not in the .sav file). The script contained lines VARIABLE LABELS some_var "<some_label>". I was interested in some_var and <some_label>. The examples from the R documentation on regexpr gave me the direction and led me to the strparse function that is applied as follows:

> lines <- c(
+     'VARIABLE LABELS weight "weight".',
+     'VARIABLE LABELS altq "Year of birth".',
+     'VARIABLE LABELS hhg "Household size".',
+     'missing values all (-1).',
+     'EXECUTE.'
+ )
> pat <- 'VARIABLE LABELS (?<name>[^\\s]+) \\"(?<lbl>.*)\\".$'
> matches <- grepl(pat, lines, perl=TRUE)
> strparse(pat, lines[matches])
name     lbl             
[1,] "weight" "weight"        
[2,] "altq"   "Year of birth" 
[3,] "hhg"    "Household size"

The function returns a vector if one line was parsed and a matrix otherwise. It supports named groups.

Recoding with regular expressions

Sometimes I need to recode a vector of strings in a way that I find all mathces for a particular regular expression and replace these matches with one string. The I match all remaining strings with a second regular expression and replace the hits with a second replacement. And so on. I wrote the strrecode function to support this operation. The function can be seen as an generalisation of the gsub function. It is the only function without test code. Here is a made-up example analysing process information from the task manager:

> dat <- data.frame(
+     wtitle=c(paste(c("Inbox", "Starred", "All"), "- Google Mail"), paste("file", 1:4, "- Notepad++")),
+     pid=sample.int(9999,7),
+     exe=c(rep("chrome.exe",3), rep("notepad++.exe", 4))
+ )
> dat <- transform(
+     dat,
+     usage=strrecode(c("Google Mail$|Microsoft Outlook$", " - Notepad\\+\\+$|Microsoft Word$"), c("Mail", "Text"), dat$wtitle)
+ )
> dat
wtitle  pid           exe usage
1   Inbox - Google Mail 6810    chrome.exe  Mail
2 Starred - Google Mail 2488    chrome.exe  Mail
3     All - Google Mail 4086    chrome.exe  Mail
4    file 1 - Notepad++ 2946 notepad++.exe  Text
5    file 2 - Notepad++  112 notepad++.exe  Text
6    file 3 - Notepad++ 1176 notepad++.exe  Text
7    file 4 - Notepad++ 8881 notepad++.exe  Text

Interested in the source code of these helper functions? Read on.

Read more »

To leave a comment for the author, please follow the link and comment on his blog: factbased.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.