Learning Italian with rvest and Duolingo

September 1, 2015
(This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

By Aimee Gott, R Consultant, Mango Solutions

Over the last month I have found multiple reasons to scrape web pages for information. It started with wanting to create a simple database of movie data for a training course. Naturally I turned to IMDb for the data, and it turned out that the examples included in rvest also use IMDb. The package made it so simple to get data quickly that I decided to try out a slightly different application.

For the last year I have been trying to learn Italian in my spare time, and part of that has been using Duolingo (www.duolingo.com). If you haven’t come across it before, it’s a free-to-use language learning site that gamifies the process of language learning. The site includes many features to help you learn, one of which (for some languages, not all) is a listing of all the words that you have been introduced to during lessons. Apparently I have been introduced to 2219 words. For some time now I have wanted to extract that information to use in other ways, including adding the words to my vocab list of words from outside Duolingo so I can practice them more.

Now that I have the tools to do this in R, in the form of rvest, I thought I would give it a go. Unfortunately it wasn’t quite as simple as scraping from IMDb. Duolingo uses a large amount of JavaScript on its site, including for the component that gives the complete table of words. It turned out that the easiest way to get the data was to save the HTML of the loaded page, which then includes the complete table of words, and read that file rather than point to a URL.
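To illustrate why the saved file helps, here is a minimal sketch. Both HTML snippets are invented stand-ins, not Duolingo's real markup: one imitates a page as it might arrive before any JavaScript has run, the other imitates the same page after the table has been filled in. The selector is the one used in this post's own code.

```r
library(rvest)

# Hypothetical markup: roughly what a server might send before JavaScript runs
served <- read_html('<html><body><table id="vocab-list"></table></body></html>')

# Hypothetical markup: the same page saved after the table has been filled in
saved <- read_html('<html><body><table id="vocab-list">
  <tr class="word-cell"><td><span class="hidden">ciao</span></td></tr>
  <tr class="word-cell"><td><span class="hidden">grazie</span></td></tr>
</table></body></html>')

# The same selector finds no rows in the first document and two in the second
rows_served <- html_nodes(served, "#vocab-list tr.word-cell")
rows_saved  <- html_nodes(saved,  "#vocab-list tr.word-cell")

length(rows_served)
length(rows_saved)
```

rvest only parses the HTML it is given and does not execute JavaScript, so anything a script adds later is visible only in a copy saved after the page has finished loading.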

Getting the words wasn’t that simple either, as the page has quite a complex structure and the html_table function proved unhelpful in this case. But eventually I managed to extract each row of the table, then the component in the row that contained the word, and finally the word itself, all in four calls to rvest functions!

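library(rvest)

# Parse the locally saved copy of the Duolingo words page
page <- read_html("~/Duolingo_ Words.html")

# Take each table row, then the hidden span in that row, then its text
vocab <- html_nodes(page, "#vocab-list tr.word-cell") %>%
  html_node("td span.hidden") %>%
  html_text()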
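For anyone without a saved Duolingo page to hand, the same chain can be tried on a small inline document. The markup below is hypothetical, loosely modelled on the selectors above, and the second row deliberately lacks the hidden span to show how html_node (singular) behaves on a set of rows.

```r
library(rvest)

# Hypothetical markup, not the real Duolingo page
page <- read_html('<table id="vocab-list">
  <tr class="word-cell"><td><span class="hidden">gatto</span><span>gatto</span></td></tr>
  <tr class="word-cell"><td><span>no hidden span here</span></td></tr>
  <tr class="word-cell"><td><span class="hidden">libro</span></td></tr>
</table>')

# html_nodes() returns every matching row
rows <- html_nodes(page, "#vocab-list tr.word-cell")

# html_node() returns one result per row, so the output stays aligned with rows
vocab <- rows %>%
  html_node("td span.hidden") %>%
  html_text()

vocab
```

Because html_node keeps one entry per row, a row without a match should come back as NA rather than being dropped, which keeps the vector the same length as the table.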

So now I have all of the Italian words that I “know”. The next challenge is to work out how to extract the English translations, which appear when you hover over the Italian. Still, having the Italian words is certainly a useful starting point to help my learning, and I can even see which words I have practiced most recently by visualising a sample with wordcloud, although maybe plotting the ones I haven’t practiced recently would be more useful…
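One possible direction for the translations, offered only as a sketch: if the hover text happened to be stored in an attribute of the word's element, html_attr could read it. The title attribute and the markup below are assumptions made for illustration. The real page may well build its hover text with JavaScript, in which case this would not apply.

```r
library(rvest)

# Hypothetical markup: assumes the hover text lives in a title attribute
page <- read_html('<table id="vocab-list">
  <tr class="word-cell"><td><span class="hidden" title="cat">gatto</span></td></tr>
  <tr class="word-cell"><td><span class="hidden" title="book">libro</span></td></tr>
</table>')

# One span per row, as in the extraction of the Italian words
spans <- html_nodes(page, "#vocab-list tr.word-cell") %>%
  html_node("td span.hidden")

# Pair each word with the value of its (assumed) title attribute
translations <- data.frame(
  italian = html_text(spans),
  english = html_attr(spans, "title"),
  stringsAsFactors = FALSE
)

translations
```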
