Extracting the Data from Static Images of Graphs with magick

[This article was first published on R Tutorials – Omni Analytics Group, and kindly contributed to R-bloggers.]

Prior to the era of reproducible research, it was quite common for published graphs, charts, and other figures to be released solely as static images such as PNGs or JPEGs. Often they came without accompanying code or a separate download of the plot data, making it difficult to either reproduce or validate the findings.

We’ve talked about the virtues of the magick package in the past, and it turns out magick provides us with a way of extracting data from images. The exact details vary depending on the properties of the plot, including its saturation, lightness, and hue, but some general themes emerge. We wanted to briefly document one particular instance of this problem.

Consider the following image:

How can we reproduce this chart by extracting the data points and ultimately plotting them in ggplot2? First, we read the image in with magick:

library(tidyverse)
library(magick)

im <- image_read("line.jpg")

The first thing we will do is extract the saturation channel. Depending on the image, we may want to choose a different channel here - in this case, however, the lines are the most saturated part of the image, and therefore the saturation is the most distinguishable characteristic of the chart. The following code extracts the channel and prints the resulting image:

im_proc <- im %>%
    image_channel("saturation")

im_proc

You'll see that we now have a grayscale image in which the points, and in particular the lines we are interested in, show up as the darkest regions. Next, we will threshold this image at 30%, so that any pixel whose value is over 30% gets set to white - this will effectively eliminate all pixels aside from the lines.

im_proc2 <- im_proc %>%
    image_threshold("white", "30%")

im_proc2

Finally, we will negate the image so that the lines are white and the rest is black. This ensures that any pixel value in the image higher than 0 (as white is) represents a point which we will choose to extract.

im_proc3 <- im_proc2 %>%
    image_negate()

im_proc3
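To make the threshold-then-negate logic concrete, here is a minimal base R sketch on a made-up matrix of grayscale values between 0 and 1. It is only an analogy for what the two magick calls do to the pixel values, not how magick implements them, and the names gray, thresholded, negated, and line_pixels are invented for this illustration.

```r
# Toy 3 x 4 "grayscale image" with values in [0, 1] (made-up numbers).
# Low values stand in for the dark line pixels, high values for the
# light background.
gray <- matrix(c(0.90, 0.10, 0.80,
                 0.70, 0.20, 0.95,
                 0.10, 0.85, 0.90,
                 0.20, 0.90, 1.00), nrow = 3)

# "White" threshold at 30%: every value above 0.3 is forced to white (1),
# values at or below the threshold are left unchanged.
thresholded <- ifelse(gray > 0.3, 1, gray)

# Negate: white (1) becomes black (0), so only the formerly dark pixels
# keep a value above 0.
negated <- 1 - thresholded

# Matrix positions of the surviving candidate line pixels
line_pixels <- which(negated > 0, arr.ind = TRUE)
line_pixels
```

Only the four entries that started at or below 0.3 survive, which is the property the extraction step relies on.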

Our final step is to perform some data manipulation with the tidyverse in order to convert the pixel values to points suitable for plotting in ggplot2.

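# image_data() returns a raw array indexed as [channel, x, y]; keep the first channel
dat <- image_data(im_proc3)[1,,] %>%
    as.data.frame() %>%
    mutate(Row = 1:nrow(.)) %>%                          # x position of each pixel
    select(Row, everything()) %>%
    mutate_all(as.character) %>%                         # raw bytes become hex strings; "00" is black
    gather(key = Column, value = value, 2:ncol(.)) %>%
    mutate(Column = as.numeric(gsub("V", "", Column)),   # y position, taken from the V1, V2, ... names
           Row = as.numeric(Row),
           value = ifelse(value == "00", NA, 1)) %>%
    filter(!is.na(value))                                # keep only the non-black (line) pixels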
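The same reshaping can also be sketched in base R with which(arr.ind = TRUE). This is an alternative offered for illustration, not the approach used above. Here toy_mask is a made-up raw matrix standing in for one channel of the processed image, with rows indexing x and columns indexing y. It also shows why the pipeline compares against "00": that is how a zero raw byte prints as a character.

```r
# Made-up 5 x 4 raw matrix standing in for image_data(...)[1,,]:
# all black (00) except two white (ff) pixels.
toy_mask <- matrix(as.raw(0), nrow = 5, ncol = 4)
toy_mask[2, 3] <- as.raw(255)
toy_mask[4, 1] <- as.raw(255)

# A zero raw byte converts to the string "00", a white one to "ff"
as.character(as.raw(0))
as.character(as.raw(255))

# Row/column indices of every non-black pixel
idx <- which(toy_mask != as.raw(0), arr.ind = TRUE)

# Same column names as the data frame built in the pipeline
toy_points <- data.frame(Row = idx[, "row"], Column = idx[, "col"])
toy_points
```

On a real image this avoids the round trip through character columns, at the cost of stepping outside the tidyverse pipeline.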

And then we plot:

ggplot(data = dat, aes(x = Row, y = Column, colour = (Column < 300))) +
    geom_point() +
    scale_y_continuous(trans = "reverse") +
    scale_colour_manual(values = c("red4", "blue4")) +
    theme(legend.position = "none")
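Note that Row and Column are still pixel positions rather than values on the chart's original axes. One possible enhancement, sketched here under assumptions, is a linear rescale based on two reference pixels per axis whose axis values can be read off the image (for example two labelled tick marks). The helper pixel_to_value and every calibration number below are hypothetical and would need to be measured on the actual image; the sketch also assumes both axes are linear.

```r
# Linear map from a pixel position to an axis value, given two reference
# pixels (pixel_ref) whose axis values (value_ref) are known.
# Hypothetical helper, not part of magick or ggplot2.
pixel_to_value <- function(pixel, pixel_ref, value_ref) {
  slope <- diff(value_ref) / diff(pixel_ref)
  value_ref[1] + (pixel - pixel_ref[1]) * slope
}

# Made-up calibration: x pixels 50 and 450 correspond to axis values 0 and 10
x_vals <- pixel_to_value(c(50, 250, 450),
                         pixel_ref = c(50, 450), value_ref = c(0, 10))

# Made-up calibration: y pixels 400 and 40 correspond to axis values 0 and 100.
# Pixel rows count downwards, so the slope comes out negative and the
# reversed y scale is no longer needed after rescaling.
y_vals <- pixel_to_value(c(400, 220, 40),
                         pixel_ref = c(400, 40), value_ref = c(0, 100))

x_vals
y_vals
```

Applied to the Row and Column columns, a mapping like this would put the extracted points back on the scale of the original chart.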

And there you have it! We'd love any feedback or enhancements to this method, especially ones that may pertain to more general cases. We hope you enjoyed!
