How to Split Text Strings in Displayr

September 6, 2018

(This article was first published on R – Displayr, and kindly contributed to R-bloggers)

Data sets of all kinds may contain data stored as text which needs to be processed before it can be useful. Often, the first step is to split up the text to make use of the individual elements. One example that requires splitting is timestamps, which you may need to break up to compute duration. Another common case is where data from multiple variables has been stored as comma-separated text values because of some data combining or restructuring process. I’ll go over how to create a new variable by splitting text strings using R.

Example of string splitting

In this post we consider a simple example in which people’s responses to an awareness question about soft drinks have been stored in a single variable. Commas separate the 1st mention, 2nd mention, and so on. The raw data looks like this:

[Figure: Split Text Strings]

Respondent #16 has three responses to the awareness question. Splitting that response by a comma produces three separate bits of information, which can then be stored separately and processed. Below we consider how to get each respondent’s 1st Mention.
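As a minimal sketch of what such a split looks like in R, assume a respondent gave the hypothetical answer below. The brand names are made up for illustration and are not taken from the data set.

```r
# Hypothetical response text; not taken from the actual data
response <- "Coke,Pepsi,Fanta"

# strsplit() returns a list with one character vector per input string,
# so [[1]] pulls out the vector for this single response
mentions <- strsplit(response, ",")[[1]]

mentions      # the three separate mentions
mentions[1]   # the 1st mention
```

Each element of the resulting vector can then be stored or processed on its own.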

Splitting text strings with an R variable

In Displayr, the split text may be added as new variables in our data set, meaning that each person’s split data will be available side-by-side with the original data. To do so, follow these steps:

  1. Select Insert > Variables > New R.
  2. Select Text Variable. If your calculation creates numeric values, select Numeric Variable instead.
  3. Enter your code under Properties > R CODE on the right of the screen.
  4. Click Calculate.

If your goal is not to add the data into the data set, you can instead use Insert > R Output. This is also a good option to use when prototyping your code before you create new variables.

The code used in this example is:

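# Split each response on commas; the result is a list of character vectors
x <- strsplit(awareness, ",")
# Find the largest number of mentions given by any respondent
n <- max(lengths(x))
# Pad every vector to length n (shorter ones are filled with NA)
for (j in seq_along(x))
    length(x[[j]]) <- n
# Stack the padded vectors as the rows of a character matrix
z <- do.call(rbind, x)
z[is.na(z)] <- "" # Replace NAs with blanks
colnames(z) <- paste0("Mention: ", seq_len(ncol(z)))
z[, 1] # Show only first column of results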

This code does the following things:

  • Uses the strsplit() function to split the text. This function returns a list of vectors, where each vector contains the elements of the split text.
  • Resets the length of each vector so they are all equal. This is done so that the data can be coerced into a matrix.
  • Uses do.call() as a convenient way to rbind() (combine as rows) all of the split elements.
  • Ensures any NA values introduced are converted to blank strings.
  • Extracts the first column of the tabulated data.
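To see these steps on concrete values, here is a small sketch that runs them on a made-up vector standing in for the awareness variable. The three responses are hypothetical, and lapply() is used in place of an explicit loop for the padding step, which is an equivalent variant rather than the article's exact code.

```r
# Hypothetical stand-in for the awareness variable
awareness <- c("Coke,Pepsi,Fanta", "Pepsi", "Sprite,Coke")

# Split on commas: a list with one character vector per respondent
x <- strsplit(awareness, ",")

# Pad every vector with NA up to the longest response
n <- max(lengths(x))
x <- lapply(x, function(v) { length(v) <- n; v })

# Bind the equal-length vectors as rows of a character matrix
z <- do.call(rbind, x)
z[is.na(z)] <- ""   # blanks instead of NA
colnames(z) <- paste0("Mention: ", seq_len(ncol(z)))

z         # one row per respondent, one column per mention
z[, 1]    # the 1st mention for every respondent
```

The padding matters because rbind() recycles shorter vectors when lengths differ, so skipping it would repeat mentions rather than leave the later columns blank.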

Variables for the 2nd mention, 3rd mention, and so on can be added by repeating the process and changing the last line of code to refer to column 2, 3, etc. of the table of split elements.
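If you need several of these variables, one option is to wrap the logic in a small function that takes the mention number as an argument. The sketch below is an assumption-laden illustration: get_mention() is a hypothetical helper, not a Displayr or base R function, the sample data is made up, and the trimws() call is an extra step not in the code above.

```r
# Hypothetical helper: return the k-th comma-separated mention of each response
get_mention <- function(text, k) {
    parts <- strsplit(text, ",")
    # Indexing past the end of a vector yields NA, so short responses give NA
    out <- vapply(parts, function(v) v[k], character(1))
    out[is.na(out)] <- ""   # blanks instead of NA
    trimws(out)             # extra step: drop spaces left after commas
}

# Made-up responses standing in for the awareness variable
awareness <- c("Coke,Pepsi,Fanta", "Pepsi", "Sprite, Coke")

get_mention(awareness, 1)   # 1st mention
get_mention(awareness, 2)   # 2nd mention
```

Each new R variable would then end with a single call such as get_mention(awareness, 2), so only the mention number changes from one variable to the next.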

[Figure: Split Text Strings First Mention]

Want to be able to do more in Displayr using R? Check out the R in Displayr section of our blog.
