Capitalization of Names using R code

[This article was first published on K & L Fintech Modeling, and kindly contributed to R-bloggers].

This post shows a simple R trick for capitalizing names that may contain a delimiter.


Problem


The problem is to capitalize names whose parts are separated by a punctuation mark (“.”). For example, “BABACAR.THIOMBANE” is to be converted to “Babacar.Thiombane”, as follows.

           name(input)                      output
     BABACAR.THIOMBANE           Babacar.Thiombane
         DAMEN.THACKER               Damen.Thacker
         GABE.QUINNETT   ==>         Gabe.Quinnett
      JAVARY.CHRISTMAS            Javary.Christmas
         SCOTT.BLAKNEY               Scott.Blakney
 BABACAR.THIOMBANE.AAA       Babacar.Thiombane.Aaa
               BABACAR                     Babacar


The obvious first attempt is the str_to_title() function from the stringr package, but applying it directly gives the wrong output, as follows.

> df$wrong <- str_to_title(df$name)
> print(df)
                   name                 wrong
1     BABACAR.THIOMBANE     Babacar.thiombane
2         DAMEN.THACKER         Damen.thacker
3         GABE.QUINNETT         Gabe.quinnett
4      JAVARY.CHRISTMAS      Javary.christmas
5         SCOTT.BLAKNEY         Scott.blakney
6 BABACAR.THIOMBANE.AAA Babacar.thiombane.aaa
7               BABACAR               Babacar

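This happens because str_to_title() does not treat a “.” between letters as a word boundary, so each dotted name is capitalized as if it were a single word. The delimiter therefore has to be handled explicitly.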
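As a quick check of this explanation, the short sketch below (an illustration added here, not part of the original listing) applies str_to_title() to the same name written once with “.” and once with a space. It assumes the stringr package is installed; the variable names are chosen only for this example.

```r
library(stringr)

x <- "BABACAR.THIOMBANE"

# With "." as the separator, the whole string is handled as one word.
with_dot <- str_to_title(x)

# With a space as the separator, each part is capitalized on its own.
with_space <- str_to_title(gsub("[.]", " ", x))

print(with_dot)    # "Babacar.thiombane"
print(with_space)  # "Babacar Thiombane"
```

The second result is what we want, apart from the separator, which is the idea behind the space-substitution approach used later in this post.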

Useful R functions


The following functions are used for solving the above problem.

  • scan(text = "BABACAR.THIOMBANE", sep = ".", what = "")
    –> [1] "BABACAR" "THIOMBANE"
  • gsub("[.]", " ", "BABACAR.THIOMBANE")
    –> [1] "BABACAR THIOMBANE"
  • gsub(" ", ".", "BABACAR THIOMBANE")
    –> [1] "BABACAR.THIOMBANE"
  • paste(c("BABACAR", "THIOMBANE"), collapse = ".")
    –> [1] "BABACAR.THIOMBANE"

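The scan() function reads data into a vector or list, splitting the input on a delimiter. The gsub(a, b, x) function replaces every match of a with b in x. Because a is a regular expression, special characters such as “.” must be wrapped in “[]” when they are meant literally. The paste() function concatenates strings; with the collapse argument it joins the elements of a vector into a single string using the given delimiter (the default sep is a space).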
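The point about special characters is easy to get wrong, so here is a minimal base R sketch of what happens with and without the brackets. It is an illustration only; the variable names are made up for this example, and strsplit() is shown as a possible alternative to scan() rather than something the code in this post uses.

```r
x <- "BABACAR.THIOMBANE"

# An unescaped "." is a regex wildcard, so every character is replaced.
all_replaced <- gsub(".", " ", x)

# Three equivalent ways to match a literal dot.
bracketed <- gsub("[.]", " ", x)              # character class
escaped   <- gsub("\\.", " ", x)              # backslash escape
literal   <- gsub(".", " ", x, fixed = TRUE)  # no regex at all

# strsplit() with fixed = TRUE splits on a literal "." in base R.
parts <- strsplit(x, ".", fixed = TRUE)[[1]]

print(bracketed)  # "BABACAR THIOMBANE"
print(parts)      # "BABACAR" "THIOMBANE"
```

Here all_replaced is a string of 17 spaces, one for each character of the input, which is why the brackets (or fixed = TRUE) matter.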

Using these functions, we can implement the following R code.

library(stringr) # str_to_title

# input data as one column
df <- as.data.frame(
        c("BABACAR.THIOMBANE",
          "DAMEN.THACKER",
          "GABE.QUINNETT",
          "JAVARY.CHRISTMAS",
          "SCOTT.BLAKNEY",
          "BABACAR.THIOMBANE.AAA",
          "BABACAR"))

colnames(df) <- "name"
print(df)

# Wrong method: "." is not treated as a word boundary
df$wrong <- str_to_title(df$name)
print(df)

# Method 1: split on ".", capitalize each part, rejoin with "."
# (quiet = TRUE suppresses scan()'s "Read N items" messages)
df$method1 <- sapply(df$name, function(x) {
                    paste(str_to_title(
                        scan(text = x, sep = ".", what = "", quiet = TRUE)),
                        collapse = ".")})

# Method 2: replace "." with a space, capitalize, then restore "."
df$method2 <- gsub(" ", ".", str_to_title(gsub("[.]", " ", df$name)))
print(df)



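We implement two methods. Method 1 uses sapply() to process the names one at a time: each name is split into its parts with scan(), each part is capitalized with str_to_title(), and the parts are rejoined with paste(). Method 2 wraps two gsub() calls around a single vectorized str_to_title() call, which is simpler than Method 1. Note that Method 2 assumes the names contain no spaces of their own, since every space is turned back into “.” at the end.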

The output below shows that both methods give the correct result.

> print(df[,c(1,3,4)])
                   name               method1               method2
1     BABACAR.THIOMBANE     Babacar.Thiombane     Babacar.Thiombane
2         DAMEN.THACKER         Damen.Thacker         Damen.Thacker
3         GABE.QUINNETT         Gabe.Quinnett         Gabe.Quinnett
4      JAVARY.CHRISTMAS      Javary.Christmas      Javary.Christmas
5         SCOTT.BLAKNEY         Scott.Blakney         Scott.Blakney
6 BABACAR.THIOMBANE.AAA Babacar.Thiombane.Aaa Babacar.Thiombane.Aaa
7               BABACAR               Babacar               Babacar
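For comparison, the same result can probably be reached without stringr. The sketch below is not one of the two methods above but a third, base R variant. The helper name capitalize_dotted is hypothetical, and the pattern assumes plain ASCII letters, so accented names would need a different approach.

```r
# Hypothetical helper (not from the original post): capitalize each
# "."-separated part of a name using base R only.
capitalize_dotted <- function(x) {
  # Lowercase everything first, then uppercase the first letter of the
  # string and every letter that directly follows a literal ".".
  # "\\U" uppercases the rest of the replacement and needs perl = TRUE.
  gsub("(^|[.])([a-z])", "\\1\\U\\2", tolower(x), perl = TRUE)
}

names_in <- c("BABACAR.THIOMBANE", "DAMEN.THACKER",
              "BABACAR.THIOMBANE.AAA", "BABACAR")
names_out <- capitalize_dotted(names_in)
print(names_out)
```

Because gsub() and tolower() are vectorized, the helper can be applied to a whole data frame column at once, for example capitalize_dotted(df$name).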


We may encounter similar or more difficult problems which require complicated and time-consuming data manipulation.

The example above is only the tip of the iceberg. R helps us when we use it appropriately. \(\blacksquare\)

