Webscraping with R using a Raspberry Pi

[This article was first published on databait, and kindly contributed to R-bloggers.]

Setting up the Raspberry Pi

After the basic setup, I started to install the R packages I usually need for my cron-job tasks (mostly webscraping). I ran into problems with the rvest package because several of its dependencies could not be installed. Maybe there is a more efficient way, but I took the following steps:

Install packages for webscraping

To install the XML-related R packages that rvest depends on, I needed libxml2 on the system. Although apt-get already had it, I installed it manually from source:

wget ftp://xmlsoft.org/libxml2/libxml2-2.9.2.tar.gz
tar -xzvf libxml2-2.9.2.tar.gz
cd libxml2-2.9.2/

I also needed python-dev to make libxml2 compile.

sudo apt-get update
sudo apt-get install python-dev

Then I built libxml2:

./configure --prefix=/usr --disable-static --with-history && make
sudo make install

I also had problems with the curl package. The failed installation suggested installing libcurl4-openssl-dev, so:

sudo apt-get install libcurl4-openssl-dev

The last problem was the openssl package. Again, I followed the suggestion from the failed R package installation and installed libssl-dev:

sudo apt-get install libssl-dev

After that, rvest installed nicely. However, it took quite a while for the Pi to install all dependencies.
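To see which packages still fail before starting another long install run, a small check like the following can help. This is only a sketch: check_packages is a hypothetical helper, and the package names (xml2, curl, openssl, rvest) are my assumption about which R packages the system libraries above correspond to.

```r
# Hypothetical helper: report which of the given packages can be found.
# requireNamespace() returns TRUE or FALSE without attaching the package.
check_packages <- function(pkgs) {
  vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
}

# Assumed package names; adjust to whatever your scripts actually need.
status <- check_packages(c("xml2", "curl", "openssl", "rvest"))
print(status)

# Install only what is still missing (uncomment to run; slow on a Pi).
# install.packages(names(status)[!status])
```

Anything reported as FALSE still needs its system library or another install attempt.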

Webscraping Example – A simple frost warning for my plants

One simple task my Raspberry Pi does for me is sending a frost warning to my email at 6 pm if the forecast temperature for the night drops below 3 °C. For this I got an API key at openweathermap.org. Mind that openweathermap.org does not like frequent requests (keep it to fewer than one per 10 minutes); at the beginning I got blocked.

You can then request some JSON for your city ID using your APPID (API Key):

library(jsonlite)
wd_json <- fromJSON("http://api.openweathermap.org/data/2.5/forecast/city?id=CITY_ID_GOES_HERE&APPID=YOUR_API_KEY_GOES_HERE")
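Since too many requests got me blocked, one way to stay under the limit while testing is to cache the last response on disk and only hit the API again once the cache is older than 10 minutes. The following is a minimal sketch, not something from my original script: cached_forecast, its arguments and the cache file name are all hypothetical.

```r
# Hypothetical helper: return the cached result if the cache file is younger
# than max_age_secs, otherwise call fetch() and store what it returns.
cached_forecast <- function(fetch, cache_file, max_age_secs = 600) {
  if (file.exists(cache_file)) {
    age <- as.numeric(difftime(Sys.time(), file.mtime(cache_file), units = "secs"))
    if (age < max_age_secs) {
      return(readRDS(cache_file))
    }
  }
  result <- fetch()
  saveRDS(result, cache_file)
  result
}

# Usage (same URL placeholders as above; the file name is an assumption):
# url <- "http://api.openweathermap.org/data/2.5/forecast/city?id=CITY_ID_GOES_HERE&APPID=YOUR_API_KEY_GOES_HERE"
# wd_json <- cached_forecast(function() jsonlite::fromJSON(url),
#                            cache_file = "forecast_cache.rds")
```

Passing the request as a function keeps the helper independent of the API, so it can be tried out without a network connection.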

Then tidy up and extract the values needed. Temperatures are in kelvin, so we need to convert them to degrees Celsius. I convert the Unix timestamp to a POSIX date-time.

wd <- wd_json$list
wd$Datum <- as.character(as.POSIXct(wd$dt, origin="1970-01-01", tz="Europe/Berlin"))
wd$Celsius_min <- wd$main$temp_min-273.15
wd$Celsius_max <- wd$main$temp_max-273.15
wd$Celsius_mean <- wd$main$temp-273.15
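To try these conversions without an API key, here is a small sketch on made-up data. The three rows are invented and only mimic the parts of the response used above (a dt timestamp and a nested main data frame); the real response has more columns.

```r
# Made-up stand-in for wd_json$list: three forecast rows with a Unix
# timestamp (dt) and a nested data frame of temperatures in kelvin (main).
wd <- data.frame(dt = c(1456768800, 1456779600, 1456790400))
wd$main <- data.frame(
  temp     = c(276.15, 274.15, 272.15),
  temp_min = c(275.15, 273.15, 271.15),
  temp_max = c(277.15, 275.15, 273.15)
)

# Unix seconds since 1970-01-01 -> local date-time, stored as text
wd$Datum <- as.character(as.POSIXct(wd$dt, origin = "1970-01-01", tz = "Europe/Berlin"))

# kelvin -> degrees Celsius: subtract 273.15
wd$Celsius_min <- wd$main$temp_min - 273.15

print(wd[, c("Datum", "Celsius_min")])
```

A minimum of 273.15 K is exactly 0 °C, so the second row sits right at the freezing point.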

Sending results via email

Now for the part that sends the mail:

library(sendmailR)
library(xtable)
# Keep only forecasts within the next 24 hours (86400 seconds).
wd <- wd[as.POSIXct(Sys.time()+86400)>wd$Datum,]
# Send a mail only if any minimum temperature is below 3 degrees Celsius.
if(any(wd$Celsius_min < 3)) {
  # HTML table of the affected forecast rows.
  dispatch <- print(xtable(wd[wd$Celsius_min<3,c("Datum","Celsius_min","Celsius_mean","Celsius_max")]),type="html")
  # "Frostwarnung" is German for "frost warning".
  msg <- mime_part(paste0('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0
                          Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
                          <html xmlns="http://www.w3.org/1999/xhtml">
                          <head>
                          <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
                          <meta name="viewport" content="width=device-width, initial-scale=1.0"/>
                          <title>HTML demo</title>
                          <style type="text/css">
                          </style>
                          </head>
                          <body><h2>Frostwarnung</h2>',
                          dispatch,
                          '</body>
                          </html>'))
  ## Override content type.
  msg[["headers"]][["Content-Type"]] <- "text/html"
  from <- sprintf("<sendmailR@%s>", Sys.info()[4])
  to <- "<YOUR@EMAIL_GOES_HERE.COM>"
  subject <- paste("Frostwarnung",date())
  body    <- list(msg)
  sendmail(from, to, subject, body,control=list(smtpServer="ASPMX.L.GOOGLE.COM"))
}
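The decision whether to send a mail can be checked on its own, without mailing anything. The sketch below wraps the same two conditions (within the next 24 hours, minimum below 3 °C) in a function; frost_rows and the demo rows are hypothetical and not part of my script. It converts the Datum strings back to date-times explicitly, which I assume is a bit more robust than comparing a time against text.

```r
# Hypothetical helper: rows within the next `hours` hours whose minimum
# temperature is below `threshold` degrees Celsius.
frost_rows <- function(wd, now = Sys.time(), hours = 24, threshold = 3) {
  when <- as.POSIXct(wd$Datum, tz = "Europe/Berlin")
  wd[when < now + hours * 3600 & wd$Celsius_min < threshold, , drop = FALSE]
}

# Made-up forecast rows for illustration only.
now <- as.POSIXct("2016-02-29 18:00:00", tz = "Europe/Berlin")
demo <- data.frame(
  Datum       = c("2016-02-29 22:00:00", "2016-03-01 04:00:00", "2016-03-02 04:00:00"),
  Celsius_min = c(4.5, 1.2, -0.5),
  stringsAsFactors = FALSE
)

hits <- frost_rows(demo, now = now)
print(hits)
if (nrow(hits) > 0) message("Frost expected, a warning mail would be sent.")
```

Only the second demo row qualifies: the first is too warm and the third lies more than 24 hours ahead.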

Finally, we have to tell the Raspberry Pi to run the script daily in the early evening. Save the .R file and add it to your crontab:

crontab -e

The first time you use crontab, you are asked to choose an editor. The easiest to use (at least for me) is nano.
Add the following line:

00 18 * * * Rscript ~/path_to_your/script.R

This adds the script to your cron jobs and runs it at 18:00 every day of every month. The five fields are minute, hour, day of month, month and day of week.
