Hacking Strings with stringi

July 28, 2017

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

In the last set of exercises, we worked on the basic concepts of string manipulation with stringr. In this one, we go further into the string-hacking universe and learn how to use the stringi package. Note that stringi acts as the backend of stringr but has many more string manipulation functions than stringr, so it is well worth knowing for text manipulation.

Answers to the exercises are available here.
If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1
Create two strings:
c1 <- "a quick brown fox jumps over a lazy dog"
c2 <- "a quick brown fox jump over a lazy dog"
stringi comes with many functions, and wrappers around functions, to check whether two strings are equivalent. Check whether c1 and c2 are equivalent with stri_compare and %s<=%, and try to reason about the answers.
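As a warm-up on a different pair of strings, here is a minimal sketch of how the comparison functions behave. The strings s1 and s2 are made up for illustration and are not part of the exercise. stri_compare returns -1, 0 or 1 depending on how the first string sorts relative to the second, while the %s<=% and %s==% operators return logical values.

```r
library(stringi)

# Illustrative strings (hypothetical, not the ones from the exercise)
s1 <- "apple pie"
s2 <- "apple pies"

# -1: s1 sorts before s2, 0: equivalent, 1: s1 sorts after s2
cmp <- stri_compare(s1, s2)

# The operator forms return TRUE/FALSE instead of -1/0/1
is_le <- s1 %s<=% s2   # does s1 sort before, or equal to, s2?
is_eq <- s1 %s==% s2   # are the two strings equivalent?

print(cmp)
print(c(is_le, is_eq))
```

Because s1 is a prefix of s2, it sorts first, so the comparison is negative and the strings are not equivalent.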

Exercise 2

How would you find the number of words in c1 and c2? It's pretty easy with stringi. Find out how.

Exercise 3

Similarly, how would you find all the words in c1 and c2? Again, it's pretty straightforward with stringi. Find out how.
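One possible approach to both word exercises, sketched on a made-up sentence rather than on c1 and c2: stri_count_words counts the words in each string, and stri_extract_all_words returns them as a list with one element per input string. The variable sentence is hypothetical.

```r
library(stringi)

# Hypothetical example sentence
sentence <- "The five boxing wizards jump quickly"

# Number of words in each input string
n_words <- stri_count_words(sentence)

# The result is a list with one character vector per input string,
# so take the first element to get the words of our single sentence
words <- stri_extract_all_words(sentence)[[1]]

print(n_words)
print(words)
```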

Exercise 4
Let's say you have a vector that contains famous mathematicians:
genius <- c("Godel", "Hilbert", "Cantor", "Gauss", "Godel", "Fermat", "Gauss")
Find the duplicates.
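A minimal sketch of duplicate detection, assuming stri_duplicated and stri_unique are the functions you want here. The planets vector is made up for illustration. stri_duplicated flags every element that has already appeared earlier in the vector, much like base R's duplicated.

```r
library(stringi)

# Hypothetical vector with repeated entries
planets <- c("Mars", "Venus", "Mars", "Earth", "Venus")

# TRUE for each element that repeats an earlier one
dup_flags <- stri_duplicated(planets)

# The repeated values themselves, and the vector with repeats removed
dup_values <- planets[dup_flags]
distinct <- stri_unique(planets)

print(dup_flags)
print(dup_values)
print(distinct)
```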

Exercise 5

Find the number of characters in the genius vector using a stringi function.
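As a sketch of one way to count characters, stri_length returns the number of characters in each element of a character vector. The tokens vector below is hypothetical.

```r
library(stringi)

# Hypothetical vector, including an empty string
tokens <- c("pi", "euler", "")

# One count per element; the empty string has length 0
n_chars <- stri_length(tokens)

# Total number of characters across the whole vector
total_chars <- sum(n_chars)

print(n_chars)
print(total_chars)
```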

Exercise 6
It's important to keep the characters of a set of strings in a consistent representation. Suppose you have a vector:
Genius1 <- c("Godel", "Hilbert", "Cantor", "Gauss", "Gödel", "Fermat", "Gauss")
Now, Godel and Gödel are the same person, but the two spellings use different characters (a plain o versus ö), so a naive comparison treats them as different strings. For the sake of consistency, we should convert them to the same representation. Find out how.

Hint – use “Latin-ASCII” transliterator in stri_trans* like function.
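To illustrate the hint on different names, here is a small sketch using stri_trans_general with the "Latin-ASCII" transliterator, which replaces accented Latin letters with plain ASCII equivalents. The names vector is made up, and the accented letters are written as \u escapes so the snippet does not depend on the encoding of the source file.

```r
library(stringi)

# Hypothetical names with accented letters:
# "Poincaré" (\u00e9 is e with acute) and "Erdős" (\u0151 is o with double acute)
names_accented <- c("Poincar\u00e9", "Erd\u0151s")

# Transliterate accented Latin characters to their ASCII counterparts
names_ascii <- stri_trans_general(names_accented, "Latin-ASCII")

print(names_ascii)
```

After transliteration, strings that differ only in diacritics compare as equal with ordinary equality tests.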

Exercise 7
How do we collapse the LETTERS vector in R so that it looks like this?
“A-B_C-D_E-F_G-H_I-J_K-L_M-N_O-P_Q-R_S-T_U-V_W-X_Y-Z_”
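A sketch of the underlying idea on a shorter, made-up vector: stri_join pastes its arguments together element by element, recycling the shorter one, and then collapses the pieces into a single string. The inputs parts and seps are hypothetical.

```r
library(stringi)

# Hypothetical inputs: four items and two alternating separators
parts <- c("a", "b", "c", "d")
seps <- c("+", "/")

# seps is recycled to the length of parts, giving "a+", "b/", "c+", "d/";
# collapse = "" then glues these pieces into one string
joined <- stri_join(parts, seps, collapse = "")

print(joined)
```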

Exercise 8
Suppose you have a string of words like c1, which we created earlier. You might want to know the start and end indices of the first word and of the last word. The start index of the first word and the end index of the last word are obvious, but the end index of the first word and the start index of the last word are not. How would you find them?
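One way to approach this, sketched on a made-up string: stri_locate_first_words and stri_locate_last_words each return an integer matrix with a start and an end column, one row per input string. The variable greeting is hypothetical.

```r
library(stringi)

# Hypothetical string: "hello" spans positions 1-5, "world" spans 17-21
greeting <- "hello brave new world"

# Each call returns a matrix with columns "start" and "end"
first_word <- stri_locate_first_words(greeting)
last_word <- stri_locate_last_words(greeting)

print(first_word)
print(last_word)
```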

Exercise 9
Suppose I have a string:
pun <- "A statistician can have his head in an oven and his feet in ice, and he will say that on the average he feels fine"
Suppose I want to replace statistician and average with mathematician and median in the string pun. How can I achieve that?
Hint – use a stri_replace* function.
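As a sketch of how several replacements can be applied to one string, stri_replace_all_fixed accepts vectors of patterns and replacements. With vectorize_all = FALSE, every pattern is applied in turn to the same string; with the default vectorize_all = TRUE, you would instead get one output string per pattern. The example sentence is made up.

```r
library(stringi)

# Hypothetical sentence
story <- "the cat chased the mouse"

# Apply both fixed-pattern replacements to the same string, one after the other
rewritten <- stri_replace_all_fixed(
  story,
  pattern = c("cat", "mouse"),
  replacement = c("dog", "ball"),
  vectorize_all = FALSE
)

print(rewritten)
```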

Exercise 10
My string x is:
x <- "I AM SAM. I AM SAM. SAM I AM"
Replace the last SAM with ADAM.
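The stri_replace family also has variants that touch only the first or only the last match. A minimal sketch with stri_replace_last_fixed on a made-up string:

```r
library(stringi)

# Hypothetical string with a repeated word
counting <- "one two one two one"

# Only the final occurrence of the fixed pattern is replaced
updated <- stri_replace_last_fixed(counting, "one", "three")

print(updated)
```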
