Web Scraping and “invalid multibyte string”

August 2, 2016

(This article was first published on R – Exegetic Analytics, and kindly contributed to R-bloggers)

A couple of my collaborators have had trouble using read_html() from the xml2 package to access this Wikipedia page. Specifically, they have been getting errors like this:

Error in utils::type.convert(out[, i], as.is = TRUE, dec = dec) :
  invalid multibyte string at '<e2>€<94>'
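
The bytes quoted in that message are worth decoding. The sketch below uses base R only and assumes that the "€" in the message stands in for the byte 0x80 (which is how Windows-1252 displays it); under that assumption the offending sequence is e2 80 94, the UTF-8 encoding of a single dash-like punctuation character, U+2014.

```r
# Bytes quoted in the error message (assumption: "€" represents byte 0x80).
bytes <- as.raw(c(0xe2, 0x80, 0x94))

# Treat the bytes as UTF-8 and look up the code point they encode.
x <- rawToChar(bytes)
Encoding(x) <- "UTF-8"
validUTF8(x)                      # TRUE if this is well-formed UTF-8
sprintf("U+%04X", utf8ToInt(x))   # code point in hexadecimal

# Round trip: build U+2014 and show its UTF-8 bytes.
charToRaw(intToUtf8(0x2014))
```

So the text itself is valid UTF-8; the complaint arises when those bytes are interpreted under a different multibyte encoding.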

Since I couldn’t reproduce these errors on my machine, the problem appeared to be related to their particular setup. Looking at their locale provided a clue:

> Sys.getlocale()
[1] "LC_COLLATE=Korean_Korea.949;LC_CTYPE=Korean_Korea.949;LC_MONETARY=Korean_Korea.949;
LC_NUMERIC=C;LC_TIME=Korean_Korea.949"

whereas on my machine I have:

> Sys.getlocale()
[1] "LC_CTYPE=en_ZA.UTF-8;LC_NUMERIC=C;LC_TIME=en_ZA.UTF-8;LC_COLLATE=en_ZA.UTF-8;
LC_MONETARY=en_ZA.UTF-8;LC_MESSAGES=en_ZA.UTF-8;LC_PAPER=en_ZA.UTF-8;LC_NAME=C;LC_ADDRESS=C;
LC_TELEPHONE=C;LC_MEASUREMENT=en_ZA.UTF-8;LC_IDENTIFICATION=C"
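
Rather than reading through the full locale string, the same comparison can be made more directly. This is a minimal sketch using base R's l10n_info(); the variable name info is just for illustration.

```r
# Summarise the localisation settings of the current R session.
info <- l10n_info()

info[["MBCS"]]    # TRUE if the locale uses a multibyte character set
info[["UTF-8"]]   # TRUE if the locale's encoding is UTF-8

# The character-type category is the one that governs how bytes are interpreted.
Sys.getlocale("LC_CTYPE")
```

On a machine like mine the "UTF-8" element should be TRUE; on a machine using a Korean code page it would be FALSE even though "MBCS" is TRUE.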

The document that they were trying to scrape is encoded in UTF-8, which appears in my locale but not in theirs (their locale uses code page 949, a Korean encoding). Perhaps changing the locale will sort out the problem? Since the en_ZA locale is a bit of an acquired taste (unless you’re South African, in which case it’s de rigueur!), the following should resolve the problem:

> Sys.setlocale("LC_CTYPE", "en_US.UTF-8")

This might produce a warning stating that the request cannot be honoured by your system. Do not fear, all is not lost. Try the following (which seems to work almost universally on Windows, where locale names take this form):

> Sys.setlocale("LC_ALL", "English")

Try scraping again. Your issues should be resolved.
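
The two attempts above can be folded into one small helper. This is only a sketch: set_first_locale() is a hypothetical function name, not part of any package, and it applies every candidate to a single category (LC_CTYPE by default) rather than mixing LC_CTYPE and LC_ALL as in the calls above. It relies on the fact that Sys.setlocale() returns an empty string when the request cannot be honoured.

```r
# Hypothetical helper: try locale names in order and keep the first one
# that the operating system accepts.
set_first_locale <- function(candidates, category = "LC_CTYPE") {
  for (loc in candidates) {
    # Sys.setlocale() warns and returns "" if the locale is unavailable.
    result <- suppressWarnings(Sys.setlocale(category, loc))
    if (nzchar(result)) return(result)
  }
  warning("none of the candidate locales could be set")
  ""
}

# Remember the current setting so that it can be restored afterwards.
old <- Sys.getlocale("LC_CTYPE")

chosen <- set_first_locale(c("en_US.UTF-8", "English"))

# ... run the scraping code here ...

# Restore the original locale.
invisible(Sys.setlocale("LC_CTYPE", old))
```

Restoring the previous locale afterwards keeps the change from leaking into the rest of the session.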
