Learning Italian with rvest and Duolingo
By Aimee Gott, R Consultant, Mango Solutions
Over the last month I have found multiple reasons to scrape web pages for information. This started with wanting to create a simple database of movie data for a training course. Of course I turned to IMDB for the data, and it turned out that the examples included in rvest also use IMDB. rvest made it so simple to get data quickly that I decided to try out a slightly different application.
For the last year I have been trying to learn Italian in my spare time, and part of that has been using Duolingo (www.duolingo.com). If you haven’t come across it before, it’s a free-to-use language learning site that gamifies the process of language learning. The site includes many features to help you learn, one of which (for some languages, not all) is a listing of all the words that you have been introduced to during lessons. Apparently I have been introduced to 2219 words, and for some time now I have wanted to extract that information to use in other ways, including adding it to my vocab list of words from outside Duolingo so I can practice them more.
Now that I have the tools to do this in R in the form of rvest, I thought I would give it a go. Unfortunately it wasn’t quite as simple as scraping from IMDB. Duolingo use a large amount of JavaScript on their site, including for the component that gives the complete table of words. It turned out that the easiest way to get the data was to save the HTML of the page from the browser and read that file, rather than point rvest at the URL; the saved file contains the complete table of words.
Getting the words wasn’t that simple either, as the page has quite a complex structure and the html_table function proved unhelpful in this case. But eventually I managed to extract each row of the table, then the component in the row that contained the word, and finally the word itself, all in four calls to rvest functions!
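library(rvest)
# Parse the locally saved page; read_html() replaces html(), which later rvest releases deprecated
page <- read_html("~/Duolingo_ Words.html")
# Select each word row, then the hidden span in that row, then the span's text
vocab <- html_nodes(page, "#vocab-list tr.word-cell") %>%
  html_node("td span.hidden") %>%
  html_text()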
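To see what each step returns without needing the saved Duolingo page, here is a minimal sketch that runs the same selectors against a small inline HTML string. The markup in toy_html is a hypothetical stand-in built only to match the selectors used above; the real page structure is more complex, and the names toy_html, toy_page, rows, cells and toy_vocab are made up for this illustration.

```r
library(rvest)

# Hypothetical stand-in for the saved page. The id and class names mirror the
# selectors used above, but the rest of the markup is an assumption.
toy_html <- '
<table id="vocab-list">
  <tr class="header"><td>Word</td></tr>
  <tr class="word-cell"><td><span class="hidden">ciao</span><span>ciao</span></td></tr>
  <tr class="word-cell"><td><span class="hidden">grazie</span><span>grazie</span></td></tr>
</table>'

# read_html() also accepts literal markup, so no file is needed here
toy_page <- read_html(toy_html)

# Step 1: every row of the table that holds a word
rows <- html_nodes(toy_page, "#vocab-list tr.word-cell")

# Step 2: within each row, the first span carrying the word
cells <- html_node(rows, "td span.hidden")

# Step 3: the text of each span
toy_vocab <- html_text(cells)
toy_vocab
```

Using html_node() rather than html_nodes() in the second step returns exactly one result per row, so the final vector stays aligned with the rows of the table.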
So now I have all of the Italian words that I “know”. The next challenge is to work out how to extract the English translations, which appear when you hover over the Italian. But having the Italian words is certainly a useful starting point to help my learning, and I can even see which words I have practiced most recently by visualising a sample with wordcloud, although maybe plotting the ones I haven’t practiced recently would be more useful…
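As a rough sketch of that last idea, the snippet below draws a word cloud in which the words left unpracticed for longest appear largest. It assumes the wordcloud package is installed. The practice data frame and its days_since column are hypothetical values invented for illustration; the scraping code above only returns the words themselves, so the practice dates would have to come from somewhere else.

```r
library(wordcloud)

# Hypothetical data: a handful of words with made-up days since last practice.
# The scrape above does not return these values.
practice <- data.frame(
  word = c("ciao", "grazie", "ragazzo", "mela", "acqua", "libro"),
  days_since = c(1, 3, 10, 30, 2, 60),
  stringsAsFactors = FALSE
)

set.seed(42)  # word placement is partly random, so fix the seed

# Use days since last practice as the "frequency", so neglected words are
# drawn largest; min.freq = 1 keeps every word in this small sample.
wordcloud(
  words = practice$word,
  freq = practice$days_since,
  min.freq = 1,
  scale = c(3, 0.5),
  random.order = FALSE
)
```

Swapping days_since for a recency score that is larger for recently practiced words would give the original version of the plot instead.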