How to Extract a String Between 2 Characters in R and SAS
Introduction
I recently needed to work with date values that look like this:
| mydate |
|---|
| Jan 23/2 |
| Aug 5/20 |
| Dec 17/2 |
I wanted to extract the day, and the obvious strategy is to extract the text between the space and the slash. I needed to think carefully about how to program this in both R and SAS, because
- the length of the day could be 1 or 2 characters long
- I needed code that adapts to this varying length from observation to observation
- there is no function in either language that is suited exactly for this purpose.
In this tutorial, I will show you how to do this in both R and SAS. I will write a function in R and a macro program in SAS to do so, and you can use the function and the macro program as you please!
Extracting a String Between 2 Characters in R
I will write a function called getstr() in R to extract a string between 2 characters. The strategy is simple:
- Find the position of the initial character and add 1 to it – that is the initial position of the desired string.
- Find the position of the final character and subtract 1 from it – that is the final position of the desired string.
- Use the substr() function to extract the desired string inclusively between the initial position and final position as found in Steps 1-2.
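Before wrapping these steps in a function, here is a minimal sketch of them applied to a single string. The variable names x, first, last and day are my own choices for illustration, and I am assuming that the characters should be matched literally, which is why regexpr() is called with fixed = TRUE.

```r
# one example date
x <- 'Aug 5/2011'

# Step 1: position of the first space, plus 1
first <- regexpr(' ', x, fixed = TRUE)[1] + 1

# Step 2: position of the first slash, minus 1
last <- regexpr('/', x, fixed = TRUE)[1] - 1

# Step 3: extract the text between those 2 positions, inclusively
day <- substr(x, first, last)
day
```

For this date, the space is at position 4 and the slash is at position 6, so both first and last equal 5, and day is the single character "5".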
##### Extracting a String Between 2 Characters in R
##### By Eric Cai - The Chemical Statistician
# clear all variables in workspace
rm(list=ls(all=TRUE))
# create a vector of 3 example dates
mydate = c('Jan 23/2012', 'Aug 5/2011', 'Dec 17/2011')
# getstr() is my customized function
# it extracts a string between 2 characters in a string variable
getstr = function(mystring, initial.character, final.character)
{
# check that all 3 inputs are character variables
if (!is.character(mystring))
{
stop('The parent string must be a character variable.')
}
if (!is.character(initial.character))
{
stop('The initial character must be a character variable.')
}
if (!is.character(final.character))
{
stop('The final character must be a character variable.')
}
# pre-allocate a character vector to store the extracted strings
snippet = character(length(mystring))
for (i in seq_along(mystring))
{
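# find the position just after the first occurrence of the initial character
# (fixed = TRUE matches the character literally rather than as a regular expression)
initial.position = gregexpr(initial.character, mystring[i], fixed = TRUE)[[1]][1] + 1
# find the position just before the first occurrence of the final character
final.position = gregexpr(final.character, mystring[i], fixed = TRUE)[[1]][1] - 1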
# extract the substring between the initial and final positions, inclusively
snippet[i] = substr(mystring[i], initial.position, final.position)
}
return(snippet)
}
# use the getstr() function to extract the day between the space and the slash in "mydate"
getstr(mydate, ' ', '/')
Here is the output from getstr() on the vector “mydate”:
> getstr(mydate, ' ', '/')
[1] "23" "5"  "17"
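Since regexpr() and substr() are both vectorized, the same strategy can also be written without a loop. The sketch below is one possible variant rather than a replacement for getstr(). The name getstr_vec() is hypothetical, and returning NA when either character is missing is my own assumption about what you would want in that case.

```r
# getstr_vec() is a hypothetical, loop-free variant of getstr()
getstr_vec <- function(mystring, initial.character, final.character)
{
     # position of the first literal match of each character (-1 if absent)
     initial.match <- as.integer(regexpr(initial.character, mystring, fixed = TRUE))
     final.match   <- as.integer(regexpr(final.character, mystring, fixed = TRUE))

     # substr() is vectorized over the strings and over both positions
     snippet <- substr(mystring, initial.match + 1, final.match - 1)

     # assumption: return NA when either character is not found
     snippet[initial.match == -1 | final.match == -1] <- NA_character_

     snippet
}

# the last element has a space but no slash, so it yields NA
days <- getstr_vec(c('Jan 23/2012', 'Aug 5/2011', 'Dec 17/2011', 'Jan 2012'), ' ', '/')
days
```

The first 3 elements of days match the output of getstr() above, and the fourth is NA because that string contains no slash.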
Extracting a String Between 2 Characters in SAS
I will write a macro program called %getstr(). It will accept a data set and the string variable as inputs, and it will create a new data set with the day extracted as a new variable.
The only tricky part in this macro program was creating a new data set name. The input data set is called “dates”, and I wanted to create a new data set called “dates2”. I accomplished that by writing &dataset.2 within the macro: the period marks the end of the macro variable reference, so &dataset.2 resolves to “dates2”.
First, let’s create the input data set. Notice my use of the “#” as a delimiter when inputting the dates.
data dates;
infile datalines dlm = '#';
input mydate $;
datalines;
Jan 23/2015#
Aug 5/2001#
Dec 17/2007
;
run;
Let’s now write the macro program %getstr(). It will create a new data set whose name is the input data set’s name with the suffix “2”.
%macro getstr(dataset, string_variable);
data &dataset.2;
set &dataset;
* search the string for the position of the space after the month;
space_position = INDEX(&string_variable, ' ');
* search the string for the position of the slash after the day;
slash_position = INDEX(&string_variable, '/');
* calculate the length between the space and the slash;
space_to_slash = slash_position - space_position;
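* extract the day from the original string (the character(s) between the space and the slash);
* start 1 position after the space and take 1 character fewer than the space-to-slash distance;
day = substr(&string_variable, space_position + 1, space_to_slash - 1);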
run;
%mend getstr;
Let’s use the %getstr() macro program to create a new data set called “dates2” that contains the day of each date. I’ll print the results afterward.
%getstr(dates, mydate);
proc print data = dates2 noobs;
run;
Here is the output; if you prefer, you can modify the macro program to drop the variables “space_position” and “substring_afterspace”.
| mydate | space_position | substring_afterspace | day |
|---|---|---|---|
| Jan 23/2 | 4 | 23/2 | 23 |
| Aug 5/20 | 4 | 5/20 | 5 |
| Dec 17/2 | 4 | 17/2 | 17 |