[This article was first published on rdata.lu Blog | Data science with R, and kindly contributed to R-bloggers].

This is part 2 of a 3-part blog post. It uses the data that we scraped in part 1 and prepares it for further analysis; this post is quite technical. If you’re only interested in the results of the analysis, skip ahead to part 3!

First, let’s load the data that we prepared in part 1. Let’s start with the full text:

library("tidyverse")
library("tidytext")
renert = readRDS("renert_full.rds")

I want to study the frequencies of words, so I will use a function from the tidytext package called unnest_tokens(), which breaks the text down into tokens. Each token is a word, which will then make it possible to compute word frequencies.

So, let’s unnest the tokens:

renert = renert %>%
  unnest_tokens(word, text)
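To see what unnest_tokens() does, here is a minimal, self-contained illustration on an invented two-line snippet (the sentences are made up for the example):

```r
library(tidyverse)
library(tidytext)

# a tiny made-up corpus: one row per line of text
toy = tibble(line = 1:2,
             text = c("De Fuuss war am Besch",
                      "D'Kaz war doheem"))

# one row per token; unnest_tokens() also lowercases words by default
toy_tokens = toy %>%
  unnest_tokens(word, text)

head(toy_tokens)
```

Each row of the result is a single (lowercased) word, ready for counting.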

We still need to do some cleaning before continuing. In Luxembourgish, the article the is written d’ for feminine nouns: for example, d’Kaz means the cat. There are also a number of ’ts in the text, which means it. For example, the second line of the first song:

’T stung Alles an der Bléi,

Everything (it) was in bloom,

We can remove these with a couple of lines of code:

renert_tokenized = renert %>%
  mutate(word = str_replace_all(word, "d'", "")) %>%
  mutate(word = str_replace_all(word, "'t", ""))

But that’s not all! We still need to remove so-called stop words. Stop words are very frequent words, such as “and”, that usually do not add anything to the analysis. There are no set rules for defining a list of stop words, so I took inspiration from the English and German stop word lists and created my own, which you can get on GitHub.

stopwords = read.csv("stopwords_lu.csv", header = TRUE)

For my Luxembourgish-speaking compatriots: I’d be glad to get help making this list better! It is far from perfect, certainly contains typos, and may even include words that have no reason to be there. Please help!

Using this list of stop words, I can remove words that don’t add anything to the analysis. Creating a list of stop words for the Luxembourgish language is very challenging, because the same stop word can have several variants: there is “awer”, from the German “aber”, meaning but, but you could also use “mä”, from the French mais, which also means but. Plus, as kids, we never really learned how to write Luxembourgish. Actually, most Luxembourgers don’t know how to write Luxembourgish 100% correctly. This is because for a very long time, Luxembourgish was used for oral communication, and French for formal written correspondence. This is changing, and more and more people are learning how to write it correctly. I definitely have a lot to learn! Thus, I have certainly missed a lot of stop words in the list, but I am hopeful that others will contribute to the list and make it better. In the meantime, this is what I’m going to use.
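If you want to experiment with variants like “mä” yourself, adding candidates to the list is just a bind_rows() away. Here is a sketch with a toy list standing in for stopwords_lu.csv (the words are only examples):

```r
library(tidyverse)

# toy stand-in for the real stopwords_lu.csv
stopwords = tibble(word = c("an", "ass", "awer"))

# candidate additions; "mä" is the French-derived variant of "awer"
extra = tibble(word = c("mä", "awer"))

stopwords = bind_rows(stopwords, extra) %>%
  distinct()

nrow(stopwords)  # 4: the duplicated "awer" is kept only once
```

distinct() ensures that contributing an already-listed word is harmless.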

Let’s take a look at some lines of the stop words data frame:

head(stopwords, 20)
##          word
## 1           a
## 2           à
## 3         äis
## 4          är
## 5         ärt
## 6        äert
## 7        ären
## 8         all
## 9       allem
## 10      alles
## 11   alleguer
## 12        als
## 13       also
## 14         am
## 15         an
## 16 anerefalls
## 17        ass
## 18        aus
## 19       awer
## 20        bei

We can remove the stop words from our tokens using an anti_join():

renert_tokenized = renert_tokenized %>%
  anti_join(stopwords)
## Joining, by = "word"
## Warning: Column word joining character vector and factor, coercing into
## character vector
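The coercion warning appears because read.csv() turned the word column into a factor (the default before R 4.0). Passing stringsAsFactors = FALSE avoids it; here is a self-contained sketch that simulates the CSV with a text connection instead of the real file:

```r
# simulate a small CSV so the example runs without the real file
csv_text = "word\na\nan\nass"

stopwords2 = read.csv(textConnection(csv_text),
                      header = TRUE,
                      stringsAsFactors = FALSE)

is.character(stopwords2$word)  # TRUE, so anti_join() would not warn
```

With a character column on both sides, the join proceeds without any coercion.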

Let’s save this for later use:

saveRDS(renert_tokenized, "renert_tokenized.rds")

I now have to do the same for the data that is stored by song. Because this is a list where each element is a data frame, I have to use purrr::map() to apply each of the functions I used before to every data frame:

renert_songs = readRDS("renert_songs_df.rds")
renert_songs = map(renert_songs, ~unnest_tokens(., word, text))
renert_songs = map(renert_songs, ~mutate(., word = str_replace_all(word, "d'", "")))
renert_songs = map(renert_songs, ~mutate(., word = str_replace_all(word, "'t", "")))
renert_songs = map(renert_songs, ~anti_join(., stopwords))
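The per-song steps can also be bundled into a single helper, so the list is traversed once instead of four times. clean_song() is a hypothetical name, and the toy data below is invented for the example:

```r
library(tidyverse)
library(tidytext)

# all per-song steps in one function, in the same order as the full-text pipeline
clean_song = function(df, stopwords) {
  df %>%
    unnest_tokens(word, text) %>%
    mutate(word = str_replace_all(word, "d'", "")) %>%
    mutate(word = str_replace_all(word, "'t", "")) %>%
    anti_join(stopwords, by = "word")
}

# toy inputs to show it runs
toy_song  = tibble(text = "D'Kaz ass an der Stuff")
stopwords = tibble(word = c("ass", "an", "der"))

result = map(list(toy_song), clean_song, stopwords = stopwords)
result[[1]]
```

Passing by = "word" explicitly also silences the "Joining, by" message that would otherwise print once per song.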

Let’s take a look at the object we have:

head(renert_songs[[1]])
## # A tibble: 6 x 1
##        word
##       <chr>
## 1   éischte
## 2    gesank
## 3      edit
## 4 päischten
## 5     stung
## 6      bléi

Looks pretty nice! But I can make it nicer by adding a column indicating which song the data refers to. Indeed, the first cell of each data frame contains the number of the song. I can extract this information and add it to each data set:

renert_songs = map(renert_songs, ~mutate(., gesank = pull(.[1,1])))
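The pull(.[1, 1]) idiom takes the first cell of the data frame (here, the token naming the song) and extracts it as a plain character value rather than a 1×1 tibble, so mutate() can recycle it down the whole column. A minimal illustration with invented data:

```r
library(tidyverse)

df = tibble(word = c("éischte", "gesank", "edit"))

# df[1, 1] is a 1x1 tibble; pull() unwraps it into a plain vector
label = pull(df[1, 1])  # "éischte"

df = mutate(df, gesank = label)
df
```

Every row now carries the same song label, exactly as in the output above.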

Let’s take a look again:

head(renert_songs[[1]])
## # A tibble: 6 x 2
##        word  gesank
##       <chr>   <chr>
## 1   éischte éischte
## 2    gesank éischte
## 3      edit éischte
## 4 päischten éischte
## 5     stung éischte
## 6      bléi éischte

Now I can save this object for later use:

saveRDS(renert_songs, "renert_songs_tokenized.rds")

In the final part of this series, I will use the tokenized data as well as the list of songs to create a couple of visualizations!