By Aimee Gott, R Consultant, Mango Solutions
Over the last month I have found several reasons to scrape web pages for information. It started with wanting to create a simple database of movie data for a training course. Of course I turned to IMDB for the data, and it turned out that the examples included in rvest also use IMDB. The package was so simple to use, and so quick to get data with, that I decided to try it on a slightly different application.
For the last year I have been trying to learn Italian in my spare time, and part of that has been using Duolingo (www.duolingo.com). If you haven’t come across it before, it’s a free-to-use language learning site that gamifies the process of learning a language. The site includes many features to help you learn, one of which (for some languages, not all) is a listing of all the words that you have been introduced to during lessons. Apparently I have been introduced to 2219 words, and for some time now I have wanted to extract that list to use in other ways, including adding to it the vocab I have picked up outside of Duolingo so that I can practice those words more.
Getting the words wasn’t that simple, as the page has quite a complex structure and the html_table function proved to be unhelpful in this case. But eventually I managed to extract each row of the table, then the component in the row that contained the word, and finally the word itself, all in four calls to rvest functions!
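library(rvest)
# Parse the saved copy of the words page (newer rvest releases use read_html() in place of html())
page <- html("~/Duolingo_ Words.html")
vocab <- html_nodes(page, "#vocab-list tr.word-cell") %>%
  html_node("td span.hidden") %>%
  html_text()  # assumed fourth call: the original listing stops at the pipe above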
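To see the row, cell, word chain in isolation, here is a minimal sketch run against a tiny inline snippet rather than the real page. The HTML below is a made-up stand-in that only mimics the selectors used above (the actual Duolingo markup is more complex), and read_html() is used on the assumption that a recent rvest release is installed.

```r
library(rvest)

# Hypothetical stand-in for the saved page; only the ids and classes match the selectors above
snippet <- '<table id="vocab-list">
  <tr class="word-cell"><td><span class="hidden">ragazzo</span></td></tr>
  <tr class="word-cell"><td><span class="hidden">mela</span></td></tr>
  <tr class="header"><td>Word</td></tr>
</table>'

page  <- read_html(snippet)                           # 1. parse the HTML
rows  <- html_nodes(page, "#vocab-list tr.word-cell") # 2. every row that holds a word
cells <- html_node(rows, "td span.hidden")            # 3. the span inside each row
words <- html_text(cells)                             # 4. the text of each span

print(words)
```

Written without the pipe, each of the four calls gets its own line, which makes it easier to inspect the intermediate objects when a selector returns nothing.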
So now that I have all of the Italian words that I “know”, the next challenge is to work out how to extract the English translations, which appear when you hover over the Italian. Still, having the Italian words is certainly a useful starting point for my learning, and I can even see which words I have practiced most recently by visualising a sample with wordcloud, although maybe plotting the ones I haven’t practiced recently would be more useful…
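As a rough sketch of that last idea, the snippet below draws a word cloud from a random sample, sizing each word by how recently it was practiced. Both the word list and the days_since vector are hypothetical placeholders, since the post does not show how the recency information was obtained; only the wordcloud() call itself comes from the wordcloud package.

```r
library(wordcloud)

# Hypothetical stand-ins: a few scraped words and days since each was last practiced
vocab      <- c("ciao", "grazie", "ragazzo", "mela", "acqua", "libro", "gatto", "cane")
days_since <- c(1, 3, 10, 2, 30, 7, 14, 5)

# Take a random sample so the plot stays readable
set.seed(42)
idx <- sample(seq_along(vocab), 5)

# Fewer days since practice gives a larger weight, so recent words are drawn bigger
weights <- max(days_since[idx]) - days_since[idx] + 1

wordcloud(words = vocab[idx], freq = weights, min.freq = 1, random.order = FALSE)
```

Passing days_since[idx] directly as freq would flip the emphasis and make the words that have gone longest without practice the largest.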