Site icon R-bloggers

Goodreads API

[This article was first published on max humber, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

It’s December 23rd and I’ve only read 49 books. Whoops. There’s still time, but it’s definitely getting dicey. I’m about halfway through three books right now so I think I’ll be able to pull it off. Fingers crossed.

Of course, last year I did 52 books in 52 weeks and remember sitting pretty just before Christmas.

As I’ve been logging all my activity on Goodreads I thought it would be neat to plug into the API and compare my reading progress between the years. To see if I read at the same pace and whether there’s some sort of seasonality in my reading habits.

If you happen to use Goodreads and want to do the same here’s how I did it:

Setup

(Man, I love Hadley)

# load packages
library(httr)
library(tidyverse)
library(stringr)
library(xml2)
library(viridis)
library(knitr)

# knitr options
opts_chunk$set(cache = TRUE, warning = FALSE, message = FALSE)

API Guide

I’ve censored my API_KEY and GR_ID but if you replace the "XXXXXXXXXXXXX"s with your KEY and your ID you should be good to go!

# API_KEY <- "XXXXXXXXXXXXX"
# GR_ID   <- "XXXXXXXXXXXXX"

URL <- "https://www.goodreads.com/review/list?"

Get Shelf

This is where the heavy lifting GETs done. I’m leaning on httr and XML2 to parse the API responses.

get_shelf <- function(GR_ID) {
    shelf <- GET(URL, query = list(
    	v = 2, key = API_KEY, id = GR_ID, shelf = "read", per_page = 200))
    shelf_contents <- content(shelf, as = "parsed")
    return(shelf_contents)
}

shelf <- get_shelf(GR_ID)

get_df <- function(shelf) {

    title <- shelf %>% 
        xml_find_all("//title") %>% 
        xml_text()
    
    rating <- shelf %>% 
        xml_find_all("//rating") %>% 
        xml_text()
    
    added <- shelf %>% 
        xml_find_all("//date_added") %>% 
        xml_text()
    
    started <- shelf %>% 
        xml_find_all("//started_at") %>% 
        xml_text()
    
    read <- shelf %>% 
        xml_find_all("//read_at") %>% 
        xml_text()
    
    df <- tibble(
        title, rating, added, started, read)
    
    return(df)
}

df <- get_df(shelf)

Clean

After getting the XML data into my IDE I tabled and cleaned the data with dplyr and tidyr.

get_books <- function(df) {

    books <- df %>% 
        gather(date_type, date, -title, -rating) %>% 
        separate(date, 
            into = c("weekday", "month", "day", "time", "zone", "year"), 
            sep = "\\s", fill = "right") %>% 
        mutate(date = str_c(year, "-", month, "-", day)) %>% 
        select(title, rating, date_type, date) %>% 
        mutate(date = as.Date(date, format = "%Y-%b-%d")) %>% 
        spread(date_type, date) %>% 
        mutate(title = str_replace(title, "\\:.*$|\\(.*$|\\-.*$", "")) %>% 
        mutate(started = ifelse(
        	is.na(started), as.character(added), as.character(started))) %>% 
        mutate(started = as.Date(started)) %>% 
        mutate(rating = as.integer(rating))
    
    return(books)
}

books <- get_books(df)

Compare

All of that get to this graph:

It’s funny to see that I started strong in both years and fell off sometime around March. Though I recovered somewhat in 2015, Spring 2016 was a bad season for reading, apparently.

Looks like I was finished 52 books by December 21st last year. Whoops. Oh well, I still think I can mad rush it to the finish line.

To leave a comment for the author, please follow the link and comment on their blog: max humber.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.