[This article was first published on The Chemical Statistician » R programming, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

#### Introduction

I recently needed to work with date values that look like this:

 mydate Jan 23/2 Aug 5/20 Dec 17/2

I wanted to extract the day, and the obvious strategy is to extract the text between the space and the slash.  I needed to think about how to program this carefully in both R and SAS, because

1. the length of the day could be 1 or 2 characters long
2. I needed a code that adapted to this varying length from observation to observation
3. there is no function in either language that is suited exactly for this purpose.

In this tutorial, I will show you how to do this in both R and SAS.  I will write a function in R and a macro program in SAS to do so, and you can use the function and the macro program as you please!

#### Extracting a String Between 2 Characters in R

I will write a function called getstr() in R to extract a string between 2 characters.  The strategy is simple:

1. Find the position of the initial character and add 1 to it – that is the initial position of the desired string.
2. Find the position of the final character and subtract 1 from it – that is the final position of the desired string.
3. Use the substr() function to extract the desired string inclusively between the initial position and final position as found in Steps 1-2.

##### Extracting a String Between 2 Characters in R
##### By Eric Cai - The Chemical Statistician

# clear all variables in workspace
rm(list=ls(all=TRUE))

# create a vector of 3 example dates
mydate = c('Jan 23/2012', 'Aug 5/2011', 'Dec 17/2011')

# getstr() is my customized function
# it extracts a string between 2 characters in a string variable
getstr = function(mystring, initial.character, final.character)
{

# check that all 3 inputs are character variables
if (!is.character(mystring))
{
stop('The parent string must be a character variable.')
}

if (!is.character(initial.character))
{
stop('The initial character must be a character variable.')
}

if (!is.character(final.character))
{
stop('The final character must be a character variable.')
}

# pre-allocate a vector to store the extracted strings
snippet = rep(0, length(mystring))

for (i in 1:length(mystring))
{
# extract the initial position
initial.position = gregexpr(initial.character, mystring[i])[[1]][1] + 1

# extract the final position
final.position = gregexpr(final.character, mystring[i])[[1]][1] - 1

# extract the substring between the initial and final positions, inclusively
snippet[i] = substr(mystring[i], initial.position, final.position)
}

return(snippet)
}

# use the getstr() function to extract the day between the comma and the slash in "mydate"
getstr(mydate, ' ', '/')

Here is the output from getstr() on the vector “mydate”

> getstr(mydate, ' ', '/')
[1] "23" "5" "17"

#### Extracting a String Between 2 Characters in SAS

I will write a macro program called %getstr().  It will accept a data set and the string variable as inputs, and it will create a new data set with the day extracted as a new variable.

The only tricky part in this macro program was creating a new data set name.  The input data set is called “dates”, and I wanted to create a new data set called “dates2″.  I accomplished that by appending %dataset with “.2″ within the macro.

First, let’s create the input data set.  Notice my use of the “#” as a delimiter when inputting the dates.

data dates;
infile datalines dlm = '#';
input mydate \$;
datalines;
Jan 23/2015#
Aug 5/2001#
Dec 17/2007
;
run;


Let’s now write the macro program %getstr().  It will create a new data set with the appendix “2”.

%macro getstr(dataset, string_variable);

data &dataset.2;
set &dataset;

* search the string for the position of the space after the month;
space_position = INDEX(&string_variable, ' ');

* search the string for the position of the slash after the month;
slash_position = INDEX(&string_variable, '/');

* calculate the length between the space and the slash;
space_to_slash = slash_position - space_position;

* extract the day from the original string (the character(s) between the space and the slash;
day = substr(&string_variable, space_position, space_to_slash);
run;

%mend getstr;


Let’s use the %getstr() macro program to create a new data set called “dates2″ that contains the day of each date.  I’ll print the results afteward.

%getstr(dates, mydate);
proc print
data = dates2
noobs;
run;

Here is the output; if you prefer, you can modify the macro program to drop the variables “space_position” and “substring_afterspace”.

mydate space_position substring_afterspace day
Jan 23/2 4 23/2 23
Aug 5/20 4 5/20 5
Dec 17/2 4 17/2 17

Filed under: Data Analysis, R programming, SAS Programming Tagged: dates, macro, macro program, R, R programming, SAS, text, text processing