How to Extract a String Between 2 Characters in R and SAS

[This article was first published on The Chemical Statistician » R programming, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Introduction

I recently needed to work with date values that look like this:

mydate
Jan 23/2
Aug 5/20
Dec 17/2

I wanted to extract the day, and the obvious strategy is to extract the text between the space and the slash.  I needed to think about how to program this carefully in both R and SAS, because

  1. the length of the day could be 1 or 2 characters long
  2. I needed a code that adapted to this varying length from observation to observation
  3. there is no function in either language that is suited exactly for this purpose.

In this tutorial, I will show you how to do this in both R and SAS.  I will write a function in R and a macro program in SAS to do so, and you can use the function and the macro program as you please!

Extracting a String Between 2 Characters in R

I will write a function called getstr() in R to extract a string between 2 characters.  The strategy is simple:

  1. Find the position of the initial character and add 1 to it – that is the initial position of the desired string.
  2. Find the position of the final character and subtract 1 from it – that is the final position of the desired string.
  3. Use the substr() function to extract the desired string inclusively between the initial position and final position as found in Steps 1-2.

 

##### Extracting a String Between 2 Characters in R
##### By Eric Cai - The Chemical Statistician

# clear all variables in workspace
rm(list=ls(all=TRUE))

# create a vector of 3 example dates
mydate = c('Jan 23/2012', 'Aug 5/2011', 'Dec 17/2011')

# getstr() is my customized function
# it extracts a string between 2 characters in a string variable
getstr = function(mystring, initial.character, final.character)
{
 
     # check that all 3 inputs are character variables
     if (!is.character(mystring))
     {
          stop('The parent string must be a character variable.')
     }
 
     if (!is.character(initial.character))
     {
          stop('The initial character must be a character variable.')
     }
 
 
     if (!is.character(final.character))
     {
          stop('The final character must be a character variable.')
     }

  
 
     # pre-allocate a vector to store the extracted strings
     snippet = rep(0, length(mystring))
 
 
 
     for (i in 1:length(mystring))
     {
          # extract the initial position
          initial.position = gregexpr(initial.character, mystring[i])[[1]][1] + 1
  
          # extract the final position
          final.position = gregexpr(final.character, mystring[i])[[1]][1] - 1
 
          # extract the substring between the initial and final positions, inclusively
          snippet[i] = substr(mystring[i], initial.position, final.position)
     }
 
     return(snippet)
}


# use the getstr() function to extract the day between the comma and the slash in "mydate"
getstr(mydate, ' ', '/')

 

Here is the output from getstr() on the vector “mydate”

> getstr(mydate, ' ', '/')
 [1] "23" "5" "17"

 

 

Extracting a String Between 2 Characters in SAS

I will write a macro program called %getstr().  It will accept a data set and the string variable as inputs, and it will create a new data set with the day extracted as a new variable.

The only tricky part in this macro program was creating a new data set name.  The input data set is called “dates”, and I wanted to create a new data set called “dates2″.  I accomplished that by appending %dataset with “.2″ within the macro.

 

First, let’s create the input data set.  Notice my use of the “#” as a delimiter when inputting the dates.

 

data dates;
     infile datalines dlm = '#';
     input mydate $;
     datalines;
          Jan 23/2015#
          Aug 5/2001#
          Dec 17/2007
     ;
run;

 

Let’s now write the macro program %getstr().  It will create a new data set with the appendix “2”.

%macro getstr(dataset, string_variable);

data &dataset.2;
          set &dataset;

          * search the string for the position of the space after the month;
           space_position = INDEX(&string_variable, ' ');

          * search the string for the position of the slash after the month;
           slash_position = INDEX(&string_variable, '/');

          * calculate the length between the space and the slash;
           space_to_slash = slash_position - space_position;

          * extract the day from the original string (the character(s) between the space and the slash;
           day = substr(&string_variable, space_position, space_to_slash);
run;

%mend getstr;

 

Let’s use the %getstr() macro program to create a new data set called “dates2″ that contains the day of each date.  I’ll print the results afteward.

%getstr(dates, mydate);
proc print
     data = dates2
          noobs;
 run;

 

Here is the output; if you prefer, you can modify the macro program to drop the variables “space_position” and “substring_afterspace”.

mydate space_position substring_afterspace day
Jan 23/2 4 23/2 23
Aug 5/20 4 5/20 5
Dec 17/2 4 17/2 17

Filed under: Data Analysis, R programming, SAS Programming Tagged: dates, macro, macro program, R, R programming, SAS, text, text processing

To leave a comment for the author, please follow the link and comment on their blog: The Chemical Statistician » R programming.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)