Character strings in R

[This article was first published on Thiago G. Martins » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This post deals with the basics of character strings in R. My main reference has been Gaston Sanchez‘s ebook [1], which is excellent and you should read it if interested in manipulating text in R. I got the encoding’s section from [2], which is also a nice reference to have nearby. Text analysis will be one topic of interest to this Blog, so expect more posts about it in the near future.

Creating character strings

The class of an object that holds character strings in R is “character”. A string in R can be created using single quotes or double quotes.

chr = 'this is a string'
chr = "this is a string"

chr = "this 'is' valid"
chr = 'this "is" valid'

We can create an empty string with empty_str = "" or an empty character vector with empty_chr = character(0). Both have class “character” but the empty string has length equal to 1 while the empty character vector has length equal to zero.

empty_str = ""
empty_chr = character(0)

class(empty_str)
[1] "character"
class(empty_chr)
[1] "character"

length(empty_str)
[1] 1
length(empty_chr)
[1] 0

The function character() will create a character vector with as many empty strings as we want. We can add new components to the character vector just by assigning it to an index outside the current valid range. The index does not need to be consecutive, in which case R will auto-complete it with NA elements.

chr_vector = character(2) # create char vector
chr_vector
[1] "" ""

chr_vector[3] = "three" # add new element
chr_vector
[1] ""      ""      "three"

chr_vector[5] = "five" # do not need to 
                       # be consecutive
chr_vector
[1] ""      ""      "three" NA      "five" 

Auxiliary functions

The functions as.character() and is.character() can be used to convert non-character objects into character strings and to test if a object is of type “character”, respectively.

Strings and data objects

R has five main types of objects to store data: vector, factor, multi-dimensional array, data.frame and list. It is interesting to know how these objects behave when exposed to different types of data (e.g. character, numeric, logical).

  • vector: Vectors must have their values all of the same mode. If we combine mixed types of data in vectors, strings will dominate.
  • arrays: A matrix, which is a 2-dimensional array, have the same behavior found in vectors.
  • data.frame: By default, a column that contains a character string in it is converted to factors. If we want to turn this default behavior off we can use the argument stringsAsFactors = FALSE when constructing the data.frame object.
  • list: Each element on the list will maintain its corresponding mode.
# character dominates vector
c(1, 2, "text") 
[1] "1"    "2"    "text"

# character dominates arrays
rbind(1:3, letters[1:3]) 
    [,1] [,2] [,3]
[1,] "1"  "2"  "3" 
[2,] "a"  "b"  "c" 

# data.frame with stringsAsFactors = TRUE (default)
df1 = data.frame(numbers = 1:3, letters = letters[1:3])
df1
  numbers letters
1       1       a
2       2       b
3       3       c

str(df1, vec.len=1)
'data.frame':  3 obs. of  2 variables:
  $ numbers: int  1 2 ...
  $ letters: Factor w/ 3 levels "a","b","c": 1 2 ...

# data.frame with stringsAsFactors = FALSE
df2 = data.frame(numbers = 1:3, letters = letters[1:3], 
                 stringsAsFactors = FALSE)
df2
  numbers letters
1       1       a
2       2       b
3       3       c

str(df2, vec.len=1)
'data.frame':  3 obs. of  2 variables:
  $ numbers: int  1 2 ...
  $ letters: chr  "a" ...

# Each element in a list has its own type
list(1:3, letters[1:3])
[[1]]
[1] 1 2 3

[[2]]
[1] "a" "b" "c"

Character encoding

R provides functions to deal with various set of encoding schemes. The Encoding() function returns the encoding of a string. iconv() converts the encoding.

chr = "lá lá"
Encoding(chr)
[1] "UTF-8"

chr = iconv(chr, from = "UTF-8", 
            to = "latin1")
Encoding(chr)
[1] "latin1"

References:

[1] Gaston Sanchez’s ebook on Handling and Processing Strings in R.
[2] R Programming/Text Processing webpage.

To leave a comment for the author, please follow the link and comment on their blog: Thiago G. Martins » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)