Where do letters occur in words

July 26, 2015
(This article was first published on 56north | Skræddersyet dataanalyse » Renglish, and kindly contributed to R-bloggers)

A while back I encountered an interesting graphic showing where letters are located in English words (http://www.prooffreader.com/2014/05/graphing-distribution-of-english.html). The other day I decided to make a similar one for letters in Danish words, and for this I used R.

I downloaded all abstracts from the Danish Wikipedia and made my own version, as you can see here:

[Figure: Bogstavsplaceringer (letter placements)]

Here is how you can do it:

# First you need to load in some text

library(rvest)

# I'll grab an article from FiveThirtyEight.com as a showcase.
# I did my analysis on all the Danish abstracts from Wikipedia (took a while!)
# When you do your final analysis you'll want as much text as possible too.

# We grab the html data (read_html() replaces html(), which newer versions of rvest no longer provide)
html_data <- read_html("http://fivethirtyeight.com/features/how-to-read-the-mind-of-a-supreme-court-justice/")

# We extract some text
textfile <- html_data %>% html_nodes("p") %>% html_text(trim=TRUE)

# We collapse it into a single string
textfile <- paste(textfile, collapse = " ")

# Then we need to do a little string manipulation

library(stringr)

# We set all text to lower case
textfile <- str_to_lower(textfile)

# We remove all punctuation and all digits
textfile <- str_replace_all(textfile, "[[:punct:]]|[[:digit:]]", "")

# Then we split the string into individual words (on any run of whitespace, dropping empty strings)
words <- unique(unlist(str_split(textfile, "\\s+")))
words <- words[words != ""]

# And we count the letters in each word (nchar() is vectorised, so no loop is needed)
word_length <- nchar(words)

# And we split each word into its individual letters
split_words <- str_split(words, "")

# Then we create a loop to find the position of each letter in each word
# If you have national letters like we do in Denmark, you include them like this: for(i in c(letters, "æ", "ø", "å"))

# We create an empty list to hold the results for all letters
# (starting from a fresh list means re-running the script does not append duplicates)
letter_place.data <- list()

for(i in letters){ # We loop through all the letters

  # Create an empty vector to hold the positions for this letter
  letter_place.list <- c()

  # We find the position of each letter in the words (that we split apart)
  letter_data <- lapply(split_words, function(x) which(x == i))

  # A nested loop calculates the relative position of the letter in each word
  for(y in seq_along(word_length)){

    # We find the relative position (1 means the last letter of the word)
    letter_place <- letter_data[[y]] / word_length[y]

    # We add that position to the vector of positions
    letter_place.list <- c(letter_place.list, letter_place)
  }

  # We add the results from the loop to the list, named after the letter
  letter_place.data[i] <- list(letter_place.list)

}

# Now we have a nested list with the data we need, but first we’ll convert it to a long form data frame

# We create an empty data frame to hold the data
letter_place.data.df <- data.frame()

# Then we create a loop to put the data from each letter list into the data frame
for(z in seq_along(letter_place.data)){ # We loop through each nested list

  tryCatch({ # I add the tryCatch so the loop doesn't break if there is an error (this can occur if a letter is missing from the text)

    # Here we extract the data from the letter list and create a data frame
    loop_data <- data.frame(letter = names(letter_place.data)[z], value = letter_place.data[[z]], stringsAsFactors = FALSE)

    # We then bind all the data frames together
    letter_place.data.df <- rbind(letter_place.data.df, loop_data)

  }, error=function(e){}) # Ends the tryCatch
}

# We check to see if we have all the letters
unique(letter_place.data.df$letter)

# We change the letters back to upper case for aesthetics in the graphic
letter_place.data.df$letter <- str_to_upper(letter_place.data.df$letter)

library(ggplot2)

# We create a density plot with free y scales to show the distribution. We choose a red fill colour and facet wrap the plot to show each individual letter
p <- ggplot(letter_place.data.df, aes(x=value)) + geom_density(aes(fill="red")) + facet_wrap( ~ letter, scales="free_y")

# We add appropriate text to the title and axes
p <- p + labs(title = "Where do letters typically appear in English words", y = "Appearance", x = "Word length", fill="")

# We set a deeper red, choose the minimal theme, remove axis markers and grid, and remove the legend
p <- p + scale_fill_brewer(palette = "Set1") + theme_minimal() +
theme(axis.ticks = element_blank(), axis.text.y = element_blank(), axis.text.x = element_blank(),
legend.position="none", panel.grid.major = element_blank(), panel.grid.minor = element_blank())

# Voila! Here it is
p
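
If the nested loops get slow on a large corpus, the same relative positions can probably be computed in one vectorised pass. The following is only a sketch in base R, not the code I used for the graphic: relative_positions() and its alphabet argument are hypothetical names made up for this illustration, and it assumes you pass in a character vector of lower-case words like the one built earlier.

```r
# A minimal sketch: relative letter positions without nested loops.
# relative_positions() is a hypothetical helper, not part of the original script.
relative_positions <- function(words, alphabet = letters) {
  words <- words[nchar(words) > 0]   # drop empty strings
  chars <- strsplit(words, "")       # one character vector per word
  lens  <- lengths(chars)            # number of letters in each word

  out <- data.frame(
    letter = unlist(chars),
    # position of each letter within its word, divided by that word's length
    value  = unlist(lapply(lens, seq_len)) / rep(lens, times = lens),
    stringsAsFactors = FALSE
  )

  # keep only the letters we care about (add "æ", "ø", "å" to alphabet for Danish)
  out[out$letter %in% alphabet, ]
}

# Tiny worked example: "cat" gives positions 1/3, 2/3 and 1; "at" gives 1/2 and 1
demo <- relative_positions(c("cat", "at"))
demo
```

The result is already in long form (one row per letter occurrence), so it could stand in for letter_place.data.df before the letters are changed to upper case and plotted.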
