Data sets of all kinds may contain data stored as text which needs to be processed before it can be useful. Often, the first step is to split up the text to make use of the individual elements. One example that requires splitting is timestamps, which you may need to break up to compute duration. Another common case is where data from multiple variables has been stored as comma-separated text values because of some data combining or restructuring process. I’ll go over how to create a new variable by splitting text strings using R.
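As a quick illustration of the timestamp case, here is a minimal sketch that assumes times are stored as "HH:MM:SS" text. The values and the `to_seconds()` helper are made up for illustration and are not part of any package.

```r
# Hypothetical start and end times stored as "HH:MM:SS" text
start <- "09:15:30"
end   <- "11:05:10"

# Illustrative helper: convert an "HH:MM:SS" string to seconds since midnight
to_seconds <- function(ts) {
  # Split on ":" and convert the three pieces to numbers
  parts <- as.numeric(strsplit(ts, ":", fixed = TRUE)[[1]])
  # Weight hours, minutes, and seconds, then add them up
  sum(parts * c(3600, 60, 1))
}

# Duration in seconds between the two timestamps
duration <- to_seconds(end) - to_seconds(start)
duration
```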
Example of string splitting
In this post we consider a simple example, where people’s responses to an awareness question on soft drinks have been stored in a single variable. Commas separate the 1st mention, 2nd mention, and so on. The raw data looks like this:
Respondent #16 has three responses to the awareness question. Splitting that response by a comma produces three separate bits of information, which can then be stored separately and processed. Below we consider how to get each respondent’s 1st Mention.
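A minimal sketch of that split for a single respondent is shown below. The response string and brand names are made up for illustration and are not taken from the data set.

```r
# Hypothetical response for one respondent; the brand names are illustrative
response <- "Coke,Pepsi,Fanta"

# strsplit() returns a list with one character vector per input string,
# so [[1]] pulls out the vector for this single response
mentions <- strsplit(response, ",")[[1]]

length(mentions)  # number of mentions in the response
mentions[1]       # the 1st mention
```

If the commas in your data are followed by spaces, wrapping the result in `trimws()` should remove the stray whitespace.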
Splitting text strings with an R variable
In Displayr, the split text may be added as new variables in our data set, meaning that each person’s split data will be available side-by-side with the original data. To do so, follow these steps:
- Select Insert > Variables > New R.
- Select Text Variable. Note that if your calculation aims to create numeric values, select Numeric Variable instead.
- Enter your code under Properties > R CODE on the right of the screen.
- Click Calculate.
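To illustrate the Numeric Variable case, the sketch below counts how many brands each respondent mentioned. It assumes, as in the example later in this post, that the value of the last line is what becomes the variable. The `awareness` vector here is a made-up stand-in for the real variable.

```r
# Hypothetical stand-in for the awareness variable; values are illustrative
awareness <- c("Coke,Pepsi,Fanta", "Pepsi", "Sprite,Coke")

# Split each response on commas, then count the pieces per respondent
n_mentions <- lengths(strsplit(awareness, ",", fixed = TRUE))

n_mentions  # one count per respondent
```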
If your goal is not to add the data into the data set, you can instead use Insert > R Output. This is also a good option to use when prototyping your code before you create new variables.
The code used in this example is:
    x <- strsplit(awareness, ",")
    # get max length
    n = max(sapply(x, length))
    for (j in 1:length(x))
        length(x[[j]]) <- n
    z = do.call(rbind, x)
    z[is.na(z)] <- "" # Replace NAs with blanks
    colnames(z) = paste0("Mention: ", 1:ncol(z))
    z[,1] # Show only first column of results
This code does the following things:
- Uses the strsplit() function to split the text. This function returns a list of vectors, where each vector contains the elements of the split text.
- Sets the length of each vector to the length of the longest one, which pads the shorter vectors with NA. This is done so that the data can be coerced into a matrix.
- Uses do.call() as a convenient way to rbind() (combine as rows) all of the split elements.
- Ensures any NA values introduced are converted to blank strings.
- Extracts the first column of the tabulated data.
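To see the intermediate results of these steps outside Displayr, the sketch below runs the same logic on a small made-up vector. The `awareness` values are hypothetical, and `seq_along()` and `seq_len()` are swapped in for the `1:length()` and `1:ncol()` ranges purely as a stylistic choice.

```r
# Hypothetical stand-in for the awareness variable; values are illustrative
awareness <- c("Coke,Pepsi,Fanta", "Pepsi", "Sprite,Coke")

# Step 1: split each response; the result is a list of character vectors
x <- strsplit(awareness, ",")

# Step 2: pad every vector with NA up to the length of the longest response
n <- max(sapply(x, length))
for (j in seq_along(x)) {
  length(x[[j]]) <- n
}

# Step 3: bind the padded vectors together as the rows of a character matrix
z <- do.call(rbind, x)

# Step 4: replace the NA padding with blank strings
z[is.na(z)] <- ""
colnames(z) <- paste0("Mention: ", seq_len(ncol(z)))

z       # full table: one row per respondent, one column per mention
z[, 1]  # 1st mention only
```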
Variables for the 2nd brand mentioned, the 3rd, and so on can be added by repeating the process and changing the last line of code to refer to column 2, 3, etc. of the table of split elements.
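One way to avoid repeating the whole block for each mention is to wrap the logic in a small function. The sketch below is only an illustration: `get_mention()` is a hypothetical helper, not a Displayr or base R function, and the `awareness` values are made up.

```r
# Hypothetical helper (not part of Displayr or base R):
# returns the k-th comma-separated mention for each respondent
get_mention <- function(text, k, sep = ",") {
  parts <- strsplit(text, sep, fixed = TRUE)
  # Take the k-th piece of each split, or "" when there are fewer than k mentions
  vapply(parts, function(p) if (length(p) >= k) p[k] else "", character(1))
}

# Hypothetical stand-in for the awareness variable; values are illustrative
awareness <- c("Coke,Pepsi,Fanta", "Pepsi", "Sprite,Coke")

get_mention(awareness, 2)  # 2nd mention for each respondent
```

Each new variable would then need only one line, such as `get_mention(awareness, 3)` for the 3rd mention.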
Want to be able to do more in Displayr using R? Check out the R in Displayr section of our blog.