R has a lot of string functions; many of them can be found with ls("package:base", pattern="str"). Additionally, add-on packages such as stringr, gsubfn and brew enhance R's string processing capabilities. As a statistical language and environment, R has an edge over other programming languages when it comes to text mining algorithms or natural language processing. There is even a task view for this on CRAN.
I am currently playing with markdown files in R, which will eventually result in a new version of mdtools, and I have collected or written some string functions that I would like to present in this blog post. The source code of the functions is at the end of the post; first I show how to use them.
Head and tail for strings
I had the idea for the first two functions earlier, and I had to learn that providing an S3 method for head and tail is not a good idea. But strhead and strtail did prove handy. Here are some usage examples:
> strhead("egghead", 3)
[1] "egg"
> strhead("beagle", -1) # negative index
[1] "beagl"
> strtail(c("bowl", "snowboard"), 3) # vectorized over the first argument
[1] "owl" "ard"
These functions are only syntactic sugar, hopefully easy to memorize because of their similarity to existing R functions. For packages, they are probably not worth introducing an extra dependency. I thought about defining a replacement function as substr has, but I did not try it because head and tail do not have replacement functions.
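As a rough illustration of the behaviour shown above, the two helpers could be written on top of substr and nchar. This is a minimal sketch and not necessarily the original source; in particular, the handling of a negative n in strtail (dropping the first characters, by analogy with tail) is an assumption.

```r
# Hypothetical re-implementation for illustration only.
strhead <- function(s, n) {
  if (n < 0) {
    # negative n: drop the last |n| characters
    substr(s, 1, nchar(s) + n)
  } else {
    # positive n: keep the first n characters
    substr(s, 1, n)
  }
}

strtail <- function(s, n) {
  if (n < 0) {
    # negative n: drop the first |n| characters (assumed, by analogy with tail)
    substr(s, 1 - n, nchar(s))
  } else {
    # positive n: keep the last n characters
    substr(s, nchar(s) - n + 1, nchar(s))
  }
}
```

Because nchar works element-wise and substr accepts one start and stop per string, both sketches accept a character vector as the first argument.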
Bare minimum template
With sprintf, format and pretty, there are powerful functions for formatting strings. However, sometimes I miss the named template syntax as in Python or in Makefiles. So I implemented this in R. Here are some usage examples:
> strsubst(
+ "$(WHAT) is $(HEIGHT) meters high.",
+ list(
+ WHAT="Berlin's teletower",
+ HEIGHT=348
+ )
+ )
[1] "Berlin's teletower is 348 meters high."
> d <- strptime("2012-03-18", "%Y-%m-%d")
> strsubst(c(
+ "Be careful with dates.",
+ "$(NO_CONV) shows a list.",
+ "$(CONV) is more helpful."),
+ list(
+ NO_CONV=d,
+ CONV= as.character(d)
+ )
+ )
[1] "Be careful with dates."
[2] "list(sec = 0, min = 0, hour = 0, mday = 18, mon = 2, year = 112, wday = 0, yday = 77, isdst = 0) shows a list."
[3] "2012-03-18 is more helpful."
The first argument can be a string or a vector of strings, such as the output of readLines. The second argument can be any indexable object (i.e. one with a working [ operator), such as a list. Environments are not indexable and hence won't work.
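To make the idea concrete, here is a minimal sketch of such a template function. It is an illustration under assumptions, not necessarily the original implementation: it looks values up by name with names() and [[, silently leaves unknown placeholders untouched, and converts each value with as.character, so values are expected to have length one.

```r
# Hypothetical re-implementation for illustration only.
strsubst <- function(template, map) {
  # collect all distinct placeholders of the form $(NAME)
  m <- gregexpr("\\$\\([^)]+\\)", template)
  keys <- unique(unlist(regmatches(template, m)))
  for (key in keys) {
    name <- substr(key, 3, nchar(key) - 1)  # strip "$(" and ")"
    if (!(name %in% names(map))) next       # assumed: leave unknown names as-is
    value <- as.character(map[[name]])
    # fixed = TRUE: treat the placeholder as literal text, not as a regex
    template <- gsub(key, value, template, fixed = TRUE)
  }
  template
}
```

A quick check with the first example from above: strsubst("$(WHAT) is $(HEIGHT) meters high.", list(WHAT="Berlin's teletower", HEIGHT=348)) fills in both placeholders.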
Parse raw text
Frequently, I need to extract parts from raw text data. For instance, a few weeks ago I had to parse an SPSS script (some variable labels were hard-coded there and not in the .sav file). The script contained lines of the form VARIABLE LABELS some_var "<some_label>". I was interested in some_var and <some_label>. The examples from the R documentation on regexpr pointed me in the right direction and led me to the strparse function, which is applied as follows:
> lines <- c(
+ 'VARIABLE LABELS weight "weight".',
+ 'VARIABLE LABELS altq "Year of birth".',
+ 'VARIABLE LABELS hhg "Household size".',
+ 'missing values all (-1).',
+ 'EXECUTE.'
+ )
> pat <- 'VARIABLE LABELS (?<name>[^\\s]+) \\"(?<lbl>.*)\\".$'
> matches <- grepl(pat, lines, perl=TRUE)
> strparse(pat, lines[matches])
     name     lbl
[1,] "weight" "weight"
[2,] "altq"   "Year of birth"
[3,] "hhg"    "Household size"
The function returns a vector if one line was parsed and a matrix otherwise. It supports named groups.
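The mechanism behind this can be sketched with the capture attributes that regexpr returns when called with perl=TRUE (capture.start, capture.length and capture.names). The following is a simplified sketch, not necessarily the original source; unlike the function described above, it always returns a matrix, even for a single input line, and yields empty strings for lines that do not match.

```r
# Hypothetical re-implementation for illustration only.
strparse <- function(pat, x) {
  m <- regexpr(pat, x, perl = TRUE)
  starts  <- attr(m, "capture.start")   # matrix: one row per string, one column per group
  lengths <- attr(m, "capture.length")
  # extract every capture group of every string in one vectorized call;
  # x is recycled column-wise, so row i always refers to x[i]
  res <- substring(x, starts, starts + lengths - 1L)
  dim(res) <- dim(starts)
  colnames(res) <- attr(m, "capture.names")
  res
}
```

Named groups such as (?<name>...) end up as column names of the result.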
Recoding with regular expressions
Sometimes I need to recode a vector of strings in such a way that I find all matches for a particular regular expression and replace these matches with one string. Then I match all remaining strings with a second regular expression and replace the hits with a second replacement. And so on. I wrote the strrecode function to support this operation. The function can be seen as a generalisation of the gsub function. It is the only function without test code. Here is a made-up example analysing process information from the task manager:
> dat <- data.frame(
+ wtitle=c(paste(c("Inbox", "Starred", "All"), "- Google Mail"), paste("file", 1:4, "- Notepad++")),
+ pid=sample.int(9999,7),
+ exe=c(rep("chrome.exe",3), rep("notepad++.exe", 4))
+ )
> dat <- transform(
+ dat,
+ usage=strrecode(c("Google Mail$|Microsoft Outlook$", " - Notepad\\+\\+$|Microsoft Word$"), c("Mail", "Text"), dat$wtitle)
+ )
> dat
                 wtitle  pid           exe usage
1   Inbox - Google Mail 6810    chrome.exe  Mail
2 Starred - Google Mail 2488    chrome.exe  Mail
3     All - Google Mail 4086    chrome.exe  Mail
4    file 1 - Notepad++ 2946 notepad++.exe  Text
5    file 2 - Notepad++  112 notepad++.exe  Text
6    file 3 - Notepad++ 1176 notepad++.exe  Text
7    file 4 - Notepad++ 8881 notepad++.exe  Text
Interested in the source code of these helper functions? Read on.
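For orientation, the recoding logic described in this section can be sketched in a few lines. This is a hypothetical illustration and not the original source: it assumes that a matching string is replaced as a whole (as in the usage column above), that earlier patterns take precedence over later ones, and that strings matching no pattern are returned unchanged.

```r
# Hypothetical re-implementation for illustration only.
strrecode <- function(pats, repls, x) {
  res  <- as.character(x)            # also handles factor columns
  done <- rep(FALSE, length(res))    # strings that have already been recoded
  for (i in seq_along(pats)) {
    # only strings not yet recoded are candidates for pattern i
    hit <- !done & grepl(pats[i], res)
    res[hit] <- repls[i]
    done <- done | hit
  }
  res
}
```

The done mask ensures that a replacement value is never matched again by a later pattern.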