Analysis of the Renert – Part 1: Scraping

Posted on January 21, 2018 by rdata.lu Blog | Data science with R in R bloggers | 0 Comments

[This article was first published on rdata.lu Blog | Data science with R, and kindly contributed to R-bloggers.]

This is part 1 of a three-part blog post. This post introduces the Luxembourgish language and the literary work I am going to analyze using the R programming language. Part 2 deals with preparing the data for analysis, and part 3 is the analysis itself. Hope you enjoy!

Luxembourg and the Luxembourgish language

Luxembourg is a small European country, squeezed between France, Belgium and Germany. Over the course of its history, it’s been invaded over and over by either France or Prussia (later Germany). It eventually became a state under the personal possession of William I of the Netherlands in 1815, with a… Prussian garrison to guard its capital, Luxembourg City, from further French invasions. After the Belgian Revolution, the Treaty of London of 1839 ceded the purely French-speaking part of the country to Belgium, and the Luxembourgish-speaking part became what is known today as the Grand-Duchy of Luxembourg. What’s a Grand-Duchy, you might wonder?

Luxembourg is the only remaining Grand-Duchy in the world. A Grand-Duchy is like a Kingdom, but instead of a King, we have a Grand Duke. The current monarch is Henri. Luxembourg is a constitutional monarchy, so the Grand Duke is the head of state, while the head of government is the prime minister, Xavier Bettel. As you can imagine, Luxembourg’s history has had a very important impact on the languages we speak today in the country; there are three official languages: French, German, and Luxembourgish. Unlike other countries with several official languages, Luxembourg has no separate French-, German-, or Luxembourgish-speaking part. Instead, you use one of the three languages depending on context.

For example, the laws are all written in French, and French is mostly the language used for official or formal written correspondence. German has traditionally been the language of the press and the police. And finally, Luxembourgish is the language Luxembourgers use to speak with one another. This means that on a given day, most people here might switch between these three languages; of course, add English to the pile, which is rapidly growing in the country due to all the English-speaking expats that come here to work (coughbrexitcough).

There is also a sizable Portuguese community in Luxembourg, so you’ll hear a lot of Portuguese on the streets too, as well as Italian. Around 50% of the inhabitants of Luxembourg are foreign born, mostly from other EU countries. Italians, Portuguese and many others have immigrated to Luxembourg since the 60s to work in the metallurgic sector and, later, in the construction sector. The children of these immigrants usually speak five languages: their mother tongue (say, Portuguese), the three official languages of the country, and finally English.

You might wonder what Luxembourgish sounds like? Here is a video of our Prime Minister talking in Luxembourgish:

Here is another video of him speaking French:

Here he’s speaking German:

And here English:

In the English video, you might notice the typical accent Luxembourgers have when speaking English.

The text we’re analysing

The text I’ll be analyzing is called Renert oder de Fuuss am Frack an a Maansgréisst, published in 1872 by Michel Rodange. My high school was named after Michel Rodange, by the way! Renert is a fable featuring a sly fox as the main character, called Renert. He gets in trouble because of his shenanigans and gets sentenced to death by the Lion King. However, through further lies and deceptions, he manages to escape. After some tribulations, he proves his worth to the King by winning a duel against the wolf and becomes an aristocrat. Because it was written in the 19th century, the spelling of some words may differ from how we write them in modern Luxembourgish, which might create some problems when analyzing the text.

Now starts the technical part. If you’re only interested in the results, you can skip to part 3!

Scraping the data

First of all, let’s load (or install if you don’t have them) the needed packages:

install.packages(c("tidyverse",
                   "tidytext",
                   "janitor"))
library("tidyverse")
library("tidytext")
library("janitor")
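
If you would rather not reinstall packages you already have, a small check can find the missing ones first. The following is only a sketch: `missing_packages()` is a hypothetical helper written for this illustration, not something from the packages above, and the install call is left commented out.

```r
# Hypothetical helper (not part of the original post): returns the
# packages from `pkgs` that cannot be found on this machine.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("tidyverse", "tidytext", "janitor")
to_install <- missing_packages(needed)

# Uncomment to install only what is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

Note that `requireNamespace()` loads a package's namespace as a side effect when the package is found, which is harmless here since the packages get attached right afterwards anyway.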

The tidyverse is a collection of packages that are very useful for a lot of different tasks. If you are not familiar with these packages, check out the tidyverse website.

tidytext is a package that applies the same principles as the tidyverse to text analysis. You can learn more about it here, in the book I took inspiration from for this series of blog posts.

The full text of the Renert is available here, so I’m going to use rvest to get the text into R:

renert_link = "https://wikisource.org/wiki/Renert"

renert_raw = renert_link %>%
  xml2::read_html() %>%
  rvest::html_nodes(".mw-parser-output") %>%
  rvest::html_text() %>%
  str_split("\n", simplify = TRUE) %>%
  .[1, -c(1:24)]

I download the text using read_html() from the xml2 package (which gets installed with the tidyverse) and then find the nodes that interest me, in this case the elements with the class mw-parser-output. Then I extract the text from this node and split it on the \n character, to get a big vector where each element is a line of text. I also remove the first 24 lines, which are mostly blank. Let’s take a look at the first five lines:

renert_raw[1:5]
## [1] "Éischte Gesank.[edit]"       ""                           
## [3] "Et war esou ëm d'Päischten," "'T stung Alles an der Bléi,"
## [5] "An d'Villercher di songen"
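
To see what the split-and-drop step does without hitting the network, here is a minimal base R sketch on a made-up string. The variable names and the toy text are assumptions for illustration, and strsplit() stands in for the str_split() call used above.

```r
# Toy stand-in for the scraped page text (made up, not the real page).
page_text <- "header line\n\nFirst verse line\nSecond verse line"

# strsplit() returns a list with one character vector per input string,
# so [[1]] extracts the lines of our single string.
all_lines <- strsplit(page_text, "\n", fixed = TRUE)[[1]]

# Drop the first 2 lines, much like the post drops the first 24.
verse_lines <- all_lines[-(1:2)]
verse_lines
```

The number of lines to drop depends on the page layout at the time of scraping, so it is worth checking the first few elements again if the page has changed.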

The Renert is divided into 14 songs, so I’d like to create a list with 14 elements, where each element is the text of a song. Every song is titled “First Song”, “Second Song”, etc., so I first check on which lines I find the word Gesank, which identifies the start of a song.

# Positions of the lines that contain "Gesank", i.e. the song titles
(indices = which(grepl("Gesank", renert_raw)))
##  [1]    1  605  885 1172 1555 1906 2441 2664 2995 3686 4214 4625 5116 5963
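
The search for title lines can be tried on a toy vector. This is a small sketch with made-up lines; it also shows that base R's grep() returns the same positions in one step.

```r
# Made-up lines standing in for renert_raw.
toy_lines <- c("1. Gesank", "line a", "line b", "2. Gesank", "line c")

# grepl() returns one TRUE/FALSE per element ...
is_title <- grepl("Gesank", toy_lines, fixed = TRUE)
# ... and which() converts that logical vector into positions.
starts <- which(is_title)

# grep() does both steps at once and returns the positions directly.
starts_grep <- grep("Gesank", toy_lines, fixed = TRUE)
```

Both `starts` and `starts_grep` hold the positions of the two title lines, here the first and the fourth element.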

indices contains the positions where the songs start, so I also need the positions where the songs end. If you think about it, the first song ends where the second song begins, minus 1. So I create a new vector of indices by first removing the index for the first song, subtracting 1, and then adding the index for the last line (using length(renert_raw)).

(indices2 = c(indices[-1] - 1, length(renert_raw)))
##  [1]  604  884 1171 1554 1905 2440 2663 2994 3685 4213 4624 5115 5962 6506
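
The same arithmetic on a tiny example makes the off-by-one logic easy to check by hand. The numbers below are made up: three songs in a text of 10 lines.

```r
# Toy start positions for three songs in a text of 10 lines (made-up numbers).
starts <- c(1L, 4L, 8L)
n_lines <- 10L

# Each song ends one line before the next one starts;
# the last song ends at the last line of the text.
ends <- c(starts[-1] - 1L, n_lines)

data.frame(start = starts, end = ends)
```

A quick sanity check is that every end is at least as large as its start and that the last end equals the length of the text.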

I can now create a list of sequences, called song_lines, which contains the indices for all the songs:

song_lines = map2(indices, indices2, ~seq(.x, .y))

And using this list of indices, I can now extract the songs into a list:

renert_songs = map(song_lines, ~`[`(renert_raw, .))
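
For readers who want to see these two steps without purrr, here is a rough base R equivalent on toy data. Map() and lapply() are swapped in for map2() and map(), and all names and values are made up for the illustration.

```r
# Made-up lines and song boundaries.
toy_lines <- c("1. Gesank", "line a", "line b", "2. Gesank", "line c")
starts <- c(1L, 4L)
ends <- c(3L, 5L)

# Map() walks over starts and ends in parallel, like purrr::map2().
toy_song_lines <- Map(seq, starts, ends)

# lapply() then subsets the text once per song, like purrr::map().
toy_songs <- lapply(toy_song_lines, function(idx) toy_lines[idx])
toy_songs
```

Each element of `toy_songs` is a character vector that starts with the song's title line, which mirrors the structure of renert_songs.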

I’ll save this object for later use, using saveRDS():

saveRDS(renert_songs, "renert_songs.rds")

I will also save a version of the above list, but where each element of the list is a data frame. This will make analysis much easier later.

# tibble() replaces data_frame(), which is deprecated in recent tibble versions
renert_songs_df = map(renert_songs, ~tibble(text = .))
saveRDS(renert_songs_df, "renert_songs_df.rds")
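
To read the saved objects back in a later session, readRDS() is the counterpart of saveRDS(). The sketch below does a round trip on a toy list, using a temporary file (an assumption made so the example leaves nothing behind) instead of the file names above.

```r
# Made-up list standing in for renert_songs.
toy_songs <- list(c("1. Gesank", "line a"), c("2. Gesank", "line b"))

# Write to a temporary file so this sketch leaves nothing behind.
path <- tempfile(fileext = ".rds")
saveRDS(toy_songs, path)

# readRDS() returns the object; you choose the name it is assigned to.
restored <- readRDS(path)
identical(restored, toy_songs)

unlink(path)
```

Unlike save() and load(), this pair stores a single object without its name, which is convenient when the object should get a different name in the next script.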

I also need to have the full text as a single character object, so I reduce my list into a single object and also save it:

renert_full = reduce(renert_songs, c)

# tibble() replaces data_frame(), which is deprecated in recent tibble versions
renert_full = tibble(text = renert_full) %>%
  filter(!grepl("Gesank", text))

saveRDS(renert_full, "renert_full.rds")
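
The flatten-then-filter step can also be checked on toy data. This is a base R sketch under made-up inputs, with unlist() standing in for reduce(renert_songs, c) and plain subsetting standing in for filter().

```r
# Made-up list standing in for renert_songs.
toy_songs <- list(c("1. Gesank", "line a"), c("2. Gesank", "line b"))

# unlist() concatenates the per-song vectors into one character vector.
toy_full <- unlist(toy_songs)

# Put the lines in a data frame and keep every line that is not a song title.
toy_full <- data.frame(text = toy_full, stringsAsFactors = FALSE)
toy_full <- toy_full[!grepl("Gesank", toy_full$text, fixed = TRUE), , drop = FALSE]
toy_full
```

Only the two verse lines remain. Note that this pattern would also drop any verse that happens to contain the word Gesank, so it is worth checking the row count against the 14 expected title lines.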

This is the end of part 1. In part 2, we are going to prepare the data for analysis, and in part 3 we are going to analyze it.

Don’t hesitate to follow us on twitter @rdata_lu and to subscribe to our youtube channel.
You can also contact us if you have any comments or suggestions. See you for the next post!
