Where do letters occur in words

[This article was first published on 56north | Skræddersyet dataanalyse » Renglish, and kindly contributed to R-bloggers].

A while back I encountered an interesting graphic showing where letters are located in English words (http://www.prooffreader.com/2014/05/graphing-distribution-of-english.html). The other day I decided to make a similar one for letters in Danish words, and for this I used R.

I downloaded all abstracts from the Danish Wikipedia and made my own version, as you can see here:

[Figure: Bogstavsplaceringer (letter placements)]

Here is how you can do it:

# First you need to load in some text

library(rvest)

# I’ll grab an article from FiveThirtyEight.com as a showcase.
# I did my analysis on all the Danish abstracts from Wikipedia (took a while!)
# When you do your final analysis you’ll want as much text as possible too.

# We grab the html data (html() is deprecated in rvest in favour of read_html())
html_data <- read_html("http://fivethirtyeight.com/features/how-to-read-the-mind-of-a-supreme-court-justice/")

# We extract some text
textfile <- html_data %>% html_nodes("p") %>% html_text(trim = TRUE)

# We collapse it into a single string
textfile <- paste(textfile, collapse = " ")

# Then we need to do a little string manipulation

library(stringr)

# We set all text to lower case
textfile <- str_to_lower(textfile)

# We remove all punctuation and all digits
textfile <- str_replace_all(textfile, "[[:punct:]]|[[:digit:]]", "")

# Then we split the string into individual words and drop the empty strings left by repeated spaces
words <- unique(unlist(str_split(textfile, " ")))
words <- words[words != ""]

# And we count the letters in each word
word_length <- nchar(words)

# And we split each word into its individual letters
split_words <- str_split(words, "")
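
To see what the cleaning and splitting steps above produce, here is a minimal sketch on a made-up sentence. It uses base R equivalents (tolower, gsub, strsplit) instead of stringr so that it runs without extra packages; the sample sentence and the sample_* names are assumptions for illustration only and are not part of the original analysis.

```r
# Hypothetical sample sentence, used only for illustration
sample_text <- "Hello, World! It is 2015."

# Lower-case the text, then strip punctuation and digits
clean_text <- gsub("[[:punct:]]|[[:digit:]]", "", tolower(sample_text))

# Split on runs of whitespace, keep each distinct word once, drop empty strings
sample_words <- unique(strsplit(clean_text, "\\s+")[[1]])
sample_words <- sample_words[sample_words != ""]

# Word lengths and the letters of each word
sample_lengths <- nchar(sample_words)
sample_letters <- strsplit(sample_words, "")

print(sample_words)
print(sample_lengths)
```

Splitting on "\\s+" rather than a single space is a small design choice in this sketch: it avoids empty "words" when the cleaned text contains consecutive spaces.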

# Then we create a loop to find the position of each letter in each word
# If you have national letters like we do in Denmark, you include them like this: for(i in c(letters, "æ", "ø", "å"))

for(i in letters){ # We loop through all the letters

  # Create an empty vector to hold data later
  letter_place.list <- c()

  # We find the position of each letter in the words (that we split apart)
  letter_data <- lapply(split_words, function(x) which(x == i))

  # A nested loop calculates the relative position of the letter in each word
  for(y in seq_along(word_length)){

    # We find the relative position
    letter_place <- letter_data[[y]] / word_length[y]

    # We add that position to a list of positions
    letter_place.list <- c(letter_place.list, letter_place)
  }

  # We create a new list to hold all the data and we then add the results from the loop
  if(!exists("letter_place.data")) letter_place.data <- list(letter_place.list) else letter_place.data <- append(letter_place.data, list(letter_place.list))

  # We make sure to name each list properly
  names(letter_place.data)[length(letter_place.data)] <- i

}
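
The relative position computed in the loop above is simply the letter's position divided by the word's length, so the first letter of a six-letter word gets 1/6 and the last letter always gets 1. As a rough sketch of the same calculation without the inner loop, the helper below handles one letter at a time; relative_positions and the demo_* objects are hypothetical names introduced here, not part of the original script.

```r
# Hypothetical word list, used only for illustration
demo_words <- c("letter", "data", "r")
demo_split <- strsplit(demo_words, "")
demo_lengths <- nchar(demo_words)

# Relative positions of one letter across all words:
# position of each match divided by the length of the word it occurs in
relative_positions <- function(letter, split_words, word_length) {
  unlist(Map(function(chars, n) which(chars == letter) / n,
             split_words, word_length))
}

# "t" is letter 3 and 4 of "letter" (6 letters) and letter 3 of "data" (4 letters)
print(relative_positions("t", demo_split, demo_lengths))
```

A letter that never occurs yields an empty vector, which is the case the later conversion step has to cope with.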

# Now we have a nested list with the data we need, but first we’ll convert it to a long form data frame

# We create an empty data frame to hold the data
letter_place.data.df <- data.frame()

# Then we create a loop to put the data from each letter list into the data frame
for(z in seq_along(letter_place.data)){ # We loop through each nested list

  tryCatch({ # I add the tryCatch so the loop doesn’t break if there is an error (this can occur if a letter is missing)

    # Here we extract the data from the letter list and create a data frame
    loop_data <- data.frame(letter = names(letter_place.data)[z], value = letter_place.data[[z]], stringsAsFactors = FALSE)

    # We then bind all the data frames together
    letter_place.data.df <- rbind(letter_place.data.df, loop_data)

  }, error = function(e){}) # Ends the tryCatch
}
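
The conversion above grows the data frame one letter at a time and relies on tryCatch to skip letters with no data. A possible alternative, sketched here on a made-up list in the same shape as letter_place.data, is to drop the empty entries first and build the long data frame in one step; the demo_list and demo_df names and values are assumptions for illustration.

```r
# Hypothetical nested list in the same shape as letter_place.data;
# "b" has no observations, as happens when a letter is missing from the text
demo_list <- list(a = c(0.5, 1), b = numeric(0), c = 0.25)

# Drop letters without observations so no error handling is needed
demo_list <- demo_list[lengths(demo_list) > 0]

# Repeat each letter name once per observation and flatten the values
demo_df <- data.frame(
  letter = rep(names(demo_list), lengths(demo_list)),
  value = unlist(demo_list, use.names = FALSE),
  stringsAsFactors = FALSE
)

print(demo_df)
```

On large inputs this should also avoid the repeated copying that rbind inside a loop causes, though for a single article the difference is unlikely to matter.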

# We check to see if we have all the letters
unique(letter_place.data.df$letter)

# We change the letters back to upper case for aesthetics in the graphic
letter_place.data.df$letter <- str_to_upper(letter_place.data.df$letter)

library(ggplot2)

# We create a density plot with free y scales to show the distribution. We choose a red fill colour and facet wrap it to show each individual letter
p <- ggplot(letter_place.data.df, aes(x = value)) + geom_density(aes(fill = "red")) + facet_wrap(~ letter, scales = "free_y")

# We add appropriate text to titles and axes
p <- p + labs(title = "Where do letters typically appear in English words", y = "Appearance", x = "Word length", fill = "")

# We set a deeper red, choose the minimal theme, remove axis markers and grid, and remove the legend
p <- p + scale_fill_brewer(palette = "Set1") + theme_minimal() +
  theme(axis.ticks = element_blank(), axis.text.y = element_blank(), axis.text.x = element_blank(),
        legend.position = "none", panel.grid.major = element_blank(), panel.grid.minor = element_blank())

# Voila! Here it is
p
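
If you want a numeric check to go with the plot, one option is to compute the mean relative position per letter: values near 1 suggest a letter that tends to sit late in words, smaller values one that tends to sit early. The sketch below uses a made-up data frame with the same columns as letter_place.data.df; the demo values are assumptions and not results from the analysis.

```r
# Hypothetical long-form data with the same columns as letter_place.data.df
demo_df <- data.frame(
  letter = c("A", "A", "B", "B", "B"),
  value = c(0.2, 0.4, 1, 0.5, 0.75),
  stringsAsFactors = FALSE
)

# Mean relative position per letter, sorted from earliest to latest
mean_position <- sort(vapply(split(demo_df$value, demo_df$letter), mean, numeric(1)))

print(round(mean_position, 2))
```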
