String Manipulation in R

[This article was first published on finnstats and kindly contributed to R-bloggers.]

In this article, we’ll show you how to manipulate strings in the R programming language using a range of functions.

To begin, we’ll read text from a file so that we have some strings to demonstrate the operations on.

data<-readLines("D:/RStudio/Binning/TextData.txt")
head(data)

The “data” variable now holds a character vector with five elements, one for each of the five lines of the document.

The output of “head” shows those lines:

[1] "Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from noisy, structured and unstructured data,[1][2] and apply knowledge and actionable insights from data across a broad range of application domains. Data science is related to data mining, machine learning and big data."                                                                                                                                                                                                                                                                                                                                                   
[2] ""                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
[3] "Data science is a \"concept to unify statistics, data analysis, informatics, and their related methods\" in order to \"understand and analyze actual phenomena\" with data.[3] It uses techniques and theories drawn from many fields within the context of mathematics, statistics, computer science, information science, and domain knowledge. However, data science is different from computer science and information science. Turing Award winner Jim Gray imagined data science as a \"fourth paradigm\" of science (empirical, theoretical, computational, and now data-driven) and asserted that \"everything about science is changing because of the impact of information technology\" and the data deluge.[4][5]"
[4] ""                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
[5] "A data scientist is someone who creates programming code, and combines it with statistical knowledge to create insights from data.[6]"    
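If you don’t have the text file at hand, the reading step can be reproduced with a temporary file. This is only a sketch: the sample lines and the names “sample_lines”, “tmp” and “demo” are made up for illustration and are not the article’s TextData.txt.

```r
# Hypothetical sample text standing in for the article's file.
sample_lines <- c("Data science is an interdisciplinary field.",
                  "",
                  "A data scientist creates insights from data.")

# Write the lines to a temporary file, then read them back.
tmp <- tempfile(fileext = ".txt")
writeLines(sample_lines, tmp)
demo <- readLines(tmp)

length(demo)   # one element per line, blank line included
demo[2]        # the blank line comes back as the empty string ""

unlink(tmp)    # remove the temporary file
```

Note that blank lines in the file become empty strings in the vector, which is why elements 2 and 4 of “data” are empty.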

String Manipulation in R

You can use the “nchar” function to count the number of characters in a string by passing the string as an argument.

nchar(data[1])
[1] 362

Our vector’s first element is a 362-character string, as you can see.
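As a small sketch, “nchar” also works on a whole character vector at once and returns one count per element. The vector “x” below is made up for illustration.

```r
# Hypothetical example vector; the last element is an empty string.
x <- c("Data", "science", "")

nchar(x)   # 4 7 0
```

Applied to the article’s “data” vector, the same call would return 0 for the two blank lines.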

The “toupper” function can be used to convert all the characters in a string to upper case.

toupper(data[1])
[1] "DATA SCIENCE IS AN INTERDISCIPLINARY FIELD THAT USES SCIENTIFIC METHODS, PROCESSES, ALGORITHMS AND SYSTEMS TO EXTRACT KNOWLEDGE AND INSIGHTS FROM NOISY, STRUCTURED AND UNSTRUCTURED DATA,[1][2] AND APPLY KNOWLEDGE AND ACTIONABLE INSIGHTS FROM DATA ACROSS A BROAD RANGE OF APPLICATION DOMAINS. DATA SCIENCE IS RELATED TO DATA MINING, MACHINE LEARNING AND BIG DATA."

The output above shows the whole string converted to upper case.

Similarly, you can use the “tolower” function if you’d like to change all the string’s characters to lower case.

tolower(data[1])
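One common use of these two functions is comparing strings without regard to case. The following is a minimal sketch with made-up strings “a” and “b”.

```r
# Two hypothetical strings that differ only in letter case.
a <- "Data Science"
b <- "DATA science"

a == b                     # FALSE: comparison is case-sensitive
tolower(a) == tolower(b)   # TRUE: both become "data science"
```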

The “chartr” function can be used to replace a certain set of characters in a string.

chartr(" ","-",data[1])

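The first argument is a string containing the characters that should be replaced. The second argument is a string containing the replacement characters, matched by position: the first character of the first string is replaced by the first character of the second, and so on.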

The last argument is the string upon which the operation should be applied. You can see how the function replaced every space character with a hyphen in the output.

[1] "Data-science-is-an-interdisciplinary-field-that-uses-scientific-methods,-processes,-algorithms-and-systems-to-extract-knowledge-and-insights-from-noisy,-structured-and-unstructured-data,[1][2]-and-apply-knowledge-and-actionable-insights-from-data-across-a-broad-range-of-application-domains.-Data-science-is-related-to-data-mining,-machine-learning-and-big-data."
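“chartr” can translate several characters in one call, because the old and new characters are paired by position. Here is a short sketch with a made-up string “x”.

```r
# Hypothetical input string.
x <- "data,[1][2] and big data."

# "a" becomes "A" and "b" becomes "B"; all other characters are untouched.
chartr("ab", "AB", x)   # "dAtA,[1][2] And Big dAtA."
```

Keep in mind that “chartr” works on single characters, not on whole words or substrings.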

The “strsplit” function splits a string into pieces wherever a given pattern matches.

The syntax looks like this:

# named words_list so that the base function list() is not masked
words_list<-strsplit(data[1]," ")

The first argument is the string we want to split, and the second is the pattern to split on, which is treated as a regular expression by default.

Here the space character is used to break up the string. The result is a list, so we’ll use the “unlist” function to turn it into a character vector.

list1<-unlist(words_list)

Because each word in the original string was separated by a space character, you’ll note that the vector contains one element per word when you look at the output.

 [1] "Data"              "science"           "is"                "an"               
 [5] "interdisciplinary" "field"             "that"              "uses"             
 [9] "scientific"        "methods,"          "processes,"        "algorithms"       
[13] "and"               "systems"           "to"                "extract"          
[17] "knowledge"         "and"               "insights"          "from"             
[21] "noisy,"            "structured"        "and"               "unstructured"     
[25] "data,[1][2]"       "and"               "apply"             "knowledge"        
[29] "and"               "actionable"        "insights"          "from"             
[33] "data"              "across"            "a"                 "broad"            
[37] "range"             "of"                "application"       "domains."         
[41] "Data"              "science"           "is"                "related"          
[45] "to"                "data"              "mining,"           "machine"          
[49] "learning"          "and"               "big"               "data." 
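Because the split pattern is a regular expression, it can do more than match a single space. The sketch below uses a made-up string “x” with a double space to show the difference between splitting on one space and splitting on any run of whitespace.

```r
# Hypothetical input with two spaces after "machine".
x <- "machine  learning, big data"

# Splitting on a single space leaves an empty element between the two spaces.
strsplit(x, " ")[[1]]
# "machine" "" "learning," "big" "data"

# Splitting on one or more whitespace characters avoids the empty element.
strsplit(x, "[[:space:]]+")[[1]]
# "machine" "learning," "big" "data"
```

Indexing the result with [[1]] is an alternative to “unlist” when only one string was split.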

By feeding the “list1” vector we just produced into the “sort” function, we can sort it as well.

sorting<-sort(list1)

As a result, the elements will be sorted alphabetically. How upper- and lower-case letters are ordered relative to each other depends on your locale.
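The following sketch, using a made-up vector “w”, shows two variations on “sort”: reversing the order, and requesting the radix method, which orders character vectors as in the C locale so that upper-case letters come before lower-case ones.

```r
# Hypothetical vector of words with mixed case.
w <- c("data", "Data", "big", "and")

sort(w)                      # order of "data" vs "Data" depends on the locale
sort(w, decreasing = TRUE)   # same collation, reversed

# Radix sort uses C-locale ordering for character vectors.
sort(w, method = "radix")    # "Data" "and" "big" "data"
```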

The “paste” function can also be used to concatenate the elements of a character vector.

paste(sorting,collapse=" ")

The “collapse” argument sets the string that is placed between the elements when they are joined into a single string.

[1] "a across actionable algorithms an and and and and and and application apply big broad data data Data Data data,[1][2] data. domains. extract field from from insights insights interdisciplinary is is knowledge knowledge learning machine methods, mining, noisy, of processes, range related science science scientific structured systems that to to unstructured uses"

We’ll simply use a single space character to separate them in our situation. Our alphabetically sorted list is represented by a single string in this output.
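It is easy to confuse “collapse” with the “sep” argument of “paste”. As a rough sketch with a made-up vector “parts”: “sep” goes between the separate arguments of the call, while “collapse” joins the elements of the resulting vector.

```r
# Hypothetical two-element vector.
parts <- c("big", "data")

paste(parts, collapse = " ")      # "big data"
paste(parts, collapse = "")       # "bigdata"

# sep separates the arguments of one call.
paste("big", "data", sep = "-")   # "big-data"

# Both together: sep is applied element-wise first, then collapse joins the results.
paste(parts, "x", sep = "_", collapse = "+")   # "big_x+data_x"
```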

The “substr” function can be used to isolate a specified portion of a string.

subs<-substr(data[1],start=3,stop=30)
subs

Pass the segment’s start and stop positions, which are 1-based and inclusive, and the function returns that contiguous section.

"ta science is an interdiscip"
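To make the indexing concrete, here is a short sketch on a made-up string “x”. Positions start at 1, both ends are included, and a stop position past the end of the string is simply cut off at the last character.

```r
# Hypothetical 12-character string.
x <- "Data science"

substr(x, start = 1, stop = 4)   # "Data"
substr(x, 6, nchar(x))           # "science"
substr(x, 6, 100)                # "science" (stop beyond the end is truncated)
```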

This particular substring happens to have no leading or trailing space character, but a substring cut at arbitrary positions often does.

We can get rid of such spaces by using the “trimws” function, which removes whitespace from a string’s beginning and end.
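As an illustration, the sketch below cuts a substring that does start and end with a space and then trims it. The string “x” and the positions are made up for this example; the “which” argument restricts trimming to one side.

```r
# Hypothetical string; positions 5 and 13 are both spaces.
x <- "Data science is"
piece <- substr(x, 5, 13)        # " science "

trimws(piece)                    # "science"
trimws(piece, which = "left")    # "science "
trimws(piece, which = "right")   # " science"
```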

You may also want to build a substring by counting backward from the end of the string.

For example, you might want the last five characters. The “str_sub” function from the “stringr” package supports this.

library(stringr)
str_sub(data[1],-5,-1)

Notice that the start and end arguments are both negative here.

Negative positions count from the end of the string, so the start is the fifth character from the end, and the end, -1, is the last character.

[1] "data."

The output shows that the final five characters were successfully recovered.
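If the “stringr” package is not available, the same result can be obtained in base R by computing the start position from the string length. This is a sketch of an alternative, not the approach used above; the string “x” is made up for illustration.

```r
# Hypothetical 30-character string.
x <- "machine learning and big data."
n <- nchar(x)

# Last five characters: positions n - 4 through n.
substr(x, n - 4, n)   # "data."
```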

You should now be able to change the characters in a string, split a string into a vector, and retrieve specific substrings.
