Blog Archives

Historical newspaper scraping with {tesseract} and R

I have been playing around with historical newspaper data for some months now. The “obvious” type of analysis to do is NLP, but historical newspapers also contain a lot of numerical data. For instance, you can find tables showing the market prices of the day in the L’Indépendance Luxembourgeoise. I wanted to see how easy it was to...

Get text from pdfs or images using OCR: a tutorial with {tesseract} and {magick}

In this blog post I’m going to show you how to extract text from scanned pdf files, or pdf files where no text recognition was performed. (For pdfs where text recognition was already performed, you can read my other blog post.) The pdf I’m going to use can be downloaded from here. It’s a poem titled D’Léierchen (Dem Léiweckerche säi Lidd), written by Michel...
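A minimal sketch of this kind of workflow with {magick} (which wraps the {tesseract} OCR engine); the filename "scan.png" is only a placeholder:

```r
library(magick)

# Read the scanned page; "scan.png" is an illustrative filename
img <- image_read("scan.png")

# Light pre-processing often improves OCR accuracy:
# convert to greyscale and boost the contrast
img <- image_convert(img, colorspace = "gray")
img <- image_contrast(img)

# image_ocr() runs the {tesseract} engine on the image
text <- image_ocr(img)
cat(text)
```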

Pivoting data frames just got easier thanks to `pivot_wide()` and `pivot_long()`

There’s a lot going on in the development version of {tidyr}. New functions for pivoting data frames, pivot_wide() and pivot_long(), are coming and will replace the current functions, spread() and gather(). spread() and gather() will remain in the package, though: You may have heard a rumour that gather/spread are going away. This is simply not true (they’ll stay around forever) but I...
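A minimal sketch of the new pivoting interface (note: in the version of {tidyr} that was eventually released, these functions were renamed pivot_longer() and pivot_wider(), which is what the sketch below uses; the data is purely illustrative):

```r
library(tidyr)

# Illustrative wide data: one column per year
wide <- tibble::tibble(
  country = c("lux", "bel"),
  `2018`  = c(1.5, 2.1),
  `2019`  = c(1.7, 2.3)
)

# Wide to long (the gather() equivalent)
long <- pivot_longer(wide, cols = c(`2018`, `2019`),
                     names_to = "year", values_to = "growth")

# And back to wide (the spread() equivalent)
pivot_wider(long, names_from = "year", values_from = "growth")
```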

Classification of historical newspapers content: a tutorial combining R, bash and Vowpal Wabbit, part 2

In part 1 of this series I set up Vowpal Wabbit to classify newspaper content. Now, let’s use the model to make predictions and see whether and how we can improve it. Then, let’s train the model on the whole data. Step 1: prepare the data. The first step consists of importing the test data and preparing it. The test data need...

Classification of historical newspapers content: a tutorial combining R, bash and Vowpal Wabbit

Can I ever get enough of historical newspaper data? It seems I can’t. I already wrote four blog posts (1, 2, 3 and 4), but there’s still a lot to explore. This blog post uses a new batch of data announced on Twitter: “For all who love to analyse text, the BnL released half a million of processed newspaper articles. Historical news from 1841-1878. They directly...”

Manipulating strings with the {stringr} package

This blog post is an excerpt of my ebook Modern R with the tidyverse, which you can read for free here. It is taken from Chapter 4, in which I introduce the {stringr} package. Manipulate strings with {stringr}: {stringr} contains functions to manipulate strings. In Chapter 10, I will teach you about regular expressions, but the functions contained in {stringr} already allow you to...
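A minimal sketch of the kind of string manipulation {stringr} makes easy (the example strings are illustrative):

```r
library(stringr)

strings <- c("The 19th century", "L'Union, 1860", "no digits here")

# Which strings contain a digit?
str_detect(strings, "[0-9]")            # TRUE TRUE FALSE

# Extract the first run of digits (NA when there is none)
str_extract(strings, "[0-9]+")          # "19" "1860" NA

# Replace every digit, and convert case
str_replace_all(strings, "[0-9]", "#")
str_to_upper(strings)
```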

Building a shiny app to explore historical newspapers: a step-by-step guide

Introduction: I started off this year by exploring a world that was unknown to me, the world of historical newspapers. I did not know that historical newspaper data was a thing, and I have been thoroughly enjoying myself exploring the different datasets published by the National Library of Luxembourg. You can find the data here. In my first blog post, I analyzed data from the L’Indépendance Luxembourgeoise....

Using Data Science to read 10 years of Luxembourguish newspapers from the 19th century

I have been playing around with historical newspaper data (see here and here). I have extracted the data from the largest archive available, as described in the previous blog post, and have now created a shiny dashboard where it is possible to visualize the most common words per article, as well as read a summary of each article. The summary was made using a method called...

Making sense of the METS and ALTO XML standards

Last week I wrote a blog post in which I analyzed one year of newspaper ads from 19th century newspapers. The data is made available by the National Library of Luxembourg. In this blog post, which is part 1 of a 2-part series, I extract data from the 257GB archive, which contains 10 years of publications of the L’Union, another 19th century Luxembourguish...

Looking into 19th century ads from a Luxembourguish newspaper with R

The National Library of Luxembourg published some very interesting datasets: scans of historical newspapers! There are several datasets you can download, ranging from 250MB up to 257GB. I decided to take a look at the 32GB “ML Starter Pack”. It contains high-quality scans of one year of the L’Indépendance Luxembourgeoise (Luxembourguish independence), from the year 1877. To make life easier...
