In my current job, I study HIV at the genetic and biochemical levels. Thus, I often work with data involving the sequences of nucleotides or amino acids of various patient samples of HIV, and this type of work involves a lot of manipulating text. (Strictly speaking, I analyze sequences of nucleotides from DNA that are reverse-transcribed from the HIV’s RNA.) In this post, I describe some common functions in R that I often use for text processing.
Obtaining Basic Information about Character Variables
To check whether a variable is a character variable, use the is.character() function.
> year = 2014
> is.character(year)
[1] FALSE
If a variable is not a character variable, you can convert it to a character variable using the as.character() function.
> year.char = as.character(year)
> is.character(year.char)
[1] TRUE
A basic piece of information about a character variable is the number of characters in the string. Use the nchar() function to obtain this information.
> nchar(year.char)
[1] 4
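These functions are vectorized, so they also work on a whole vector of sequences at once. Here is a minimal sketch; the vector "seqs" and the other variable names are made up for illustration and do not come from any real sample.

```r
# Hypothetical example sequences (not real patient data)
seqs = c('ATCG', 'GGACTCTAAATCC', 'CTATCGGGTAGCT')

# is.character() gives one answer for the whole vector
is.character(seqs)

# nchar() gives one length per element
seq.lengths = nchar(seqs)
seq.lengths

# as.character() converts every element of a numeric vector
years.char = as.character(c(2013, 2014))
nchar(years.char)
```

This is handy for a quick sanity check on sequence lengths before doing any pattern matching.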
Pattern Matching and Manipulation
I often need to combine several character variables into one string, and the paste() function is useful for that. Notice my use of the “sep =” option to specify that I want to separate the variables with 1 space.
> first = 'The'
> second = 'Chemical'
> third = 'Statistician'
> my.name = paste(first, second, third, sep = ' ')
> my.name
[1] "The Chemical Statistician"
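Two close relatives of the call above are worth knowing: paste0(), which joins with no separator, and the "collapse =" option, which joins the elements of a single vector into one string. The sketch below reuses the three variables from above; "words", "collapsed", "no.space" and "ids" are names I made up for illustration.

```r
first = 'The'
second = 'Chemical'
third = 'Statistician'

# paste0() joins its arguments with no separator at all
no.space = paste0(first, second, third)
no.space

# "collapse" joins the elements of ONE vector into a single string
words = c(first, second, third)
collapsed = paste(words, collapse = ' ')
collapsed

# "sep" works element-wise across vectors, so this builds 3 labels
ids = paste('sample', 1:3, sep = '_')
ids
```

In short, "sep" goes between the arguments of paste(), while "collapse" goes between the elements of the result.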
A common task in my job is determining whether a sequence of nucleotides or amino acids is present in a much longer sequence. Essentially, I want to determine whether a pattern of text exists in a character variable. The grepl() function is useful for that; in fact, the pattern can be searched for in multiple character variables simultaneously: just combine the variables using the c() function!
> x = 'ATCG'
> y = 'GGACTCTAAATCCGTACTATCGTCATCGTTTTTCCT'
> z = 'CTATCGGGTAGCT'
> grepl(x, c(y, z))
[1] TRUE TRUE
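By default, grepl() treats the pattern as a regular expression. For a plain sequence like 'ATCG' that makes no difference, but setting "fixed = TRUE" makes the literal matching explicit. The sketch below also shows what a non-match looks like; the sequence "w" is a made-up example that does not contain the pattern.

```r
x = 'ATCG'
y = 'GGACTCTAAATCCGTACTATCGTCATCGTTTTTCCT'
z = 'CTATCGGGTAGCT'
w = 'GGGGCCCC'   # hypothetical sequence that does not contain x

all.seqs = c(y, z, w)

# fixed = TRUE matches the pattern literally, not as a regular expression
found = grepl(x, all.seqs, fixed = TRUE)
found

# The logical vector can be used to keep only the sequences that match
matches = all.seqs[found]
matches
```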
If you want to determine precisely where “x” is located along “y” and along “z”, use the gregexpr() function.
> gregexpr(x, c(y, z))
[[1]]
[1] 19 25
attr(,"match.length")
[1] 4 4
attr(,"useBytes")
[1] TRUE

[[2]]
[1] 3
attr(,"match.length")
[1] 4
attr(,"useBytes")
[1] TRUE
The output of gregexpr(x, c(y, z)) is a list of 2 objects.
- The first object contains the positional information about the pattern “x” in the variable “y”.
  - “x” appears twice in the variable “y” – at positions 19 and 25. (Specifically, the “A” in x = ‘ATCG’ appears at positions 19 and 25.)
- The second object contains the positional information about the pattern “x” in the variable “z”.
To extract these positions, you must first slice the list into its 2 objects – use double square brackets to do this. Then, you can extract the positions from each object – use single square brackets to do this. For simplicity, let’s assign the output of gregexpr(x, c(y, z)) to a variable named “pos”.
> pos = gregexpr(x, c(y, z))
> pos[[1]]
[1] 19 25
attr(,"match.length")
[1] 4 4
attr(,"useBytes")
[1] TRUE
> pos[[1]][1]
[1] 19
> pos[[1]][2]
[1] 25
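If you only want the positions without the attributes, one option is to apply as.integer() to each element of the list. The sketch below does that and also counts the matches per sequence. Note that gregexpr() reports -1 when the pattern is not found; the sequence "w" is a made-up example to show this, and "positions" and "n.matches" are names I chose for illustration.

```r
x = 'ATCG'
y = 'GGACTCTAAATCCGTACTATCGTCATCGTTTTTCCT'
z = 'CTATCGGGTAGCT'
w = 'GGGGCCCC'   # hypothetical sequence that does not contain x

pos = gregexpr(x, c(y, z, w), fixed = TRUE)

# as.integer() drops the attributes, leaving only the start positions
positions = lapply(pos, as.integer)
positions

# Count the matches in each sequence; -1 means "no match", so count only p > 0
n.matches = sapply(pos, function(p) sum(p > 0))
n.matches
```

Keep in mind that gregexpr() is documented as returning disjoint matches, so overlapping occurrences of a pattern would not all be reported.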
If you want to extract a portion of a string, use the substr() function. For example, if I know that the first 3 nucleotides of a particular DNA sequence are junk, I would want to discard them and extract the rest of that sequence only. Let’s use the variable “y” to illustrate this.
> y
[1] "GGACTCTAAATCCGTACTATCGTCATCGTTTTTCCT"
> substr(y, 4, nchar(y))
[1] "CTCTAAATCCGTACTATCGTCATCGTTTTTCCT"
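A related function, substring(), runs to the end of the string by default, and it accepts vectors of start and stop positions. The sketch below uses it to pull out the two occurrences of 'ATCG' found earlier at positions 19 and 25; "trimmed", "starts" and "hits" are names I made up for illustration.

```r
y = 'GGACTCTAAATCCGTACTATCGTCATCGTTTTTCCT'

# With no stop position, substring() keeps everything from position 4 onward
trimmed = substring(y, 4)
trimmed

# Start positions of 'ATCG' in y, as reported by gregexpr() earlier
starts = c(19, 25)

# The pattern has 4 characters, so each match ends at start + 3
hits = substring(y, starts, starts + 3)
hits
```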
John Myles White, who co-wrote the excellent “Machine Learning for Hackers” with Drew Conway, has a nice blog entry on some other useful functions for text processing in R. If you have any more suggestions, please share them in the comments!