Scraping for a Booklist of the Chinese Classics

(This article was first published on Posts on Anything Data, and kindly contributed to R-bloggers)

Last week I was considering a project that would be interesting and unique. I decided I would like to do a text analysis on classical Chinese texts, but wasn’t sure what kind of analysis regarding which texts. I decided to keep it small – and use five of the “core” Chinese classics – The Analects, The Mengzi, Dao De Jing, Zhuangzi, and Mozi. While there are many books in Confucianism, Daoism, and Moism, these texts are often used as the most representative examples of each “genre”.

Of course, the first key question was, from where can I get the data? One website with a rich amount of Chinese text data regarding the classics is ctext.org.

A screenshot of Ctext.org

A screenshot of Ctext.org

But when looking at the site design, I wondered “How can I get this in R?” Scraping wasn’t entirely feasible due to the terms outlawing this practice. Secondly, scraping is a bit of a delicate operation – if text isn’t formated uniformly across pages then you might be in for a headache. You also don’t want to give the server unnecessary stress. In the end, if you opt for it, you’ll have to make your functions work with the site structure – and as evidenced by the screenshot, it seemed a bit… messy. (Well, it turns out that actually it wasn’t.)

To get the text of the Chinese classics into R, the solution was to build an API. There is an API avaialble on ctext.org’s website, but it’s made in Python. I’ve never built an API or proto-API functions before, but the latter was easier than I thought. Right now I’ll save that for a future post.

To wrap up this post – Many of the key functions in the site API revolve around passing a book or chapter as the args. So, it turned out scraping was a necessary evil. Therefore I kept it limited and not too demanding.

Without ado, here is the (very limited) scraping I did to create a book list with chapters, which I put to use later in my homemade API.

library(tidyverse)
library(rvest)
library(stringr)

First Scrape

## 1st Scrape - Get list of books available on ctext website. 
url <- "https://ctext.org/"

path <- read_html(url)
genre_data <- path %>%
  html_nodes(css = ".container > .etext") %>%
  html_attr("href")

##Delete first observation which is not a genre
genre_data <- genre_data[-1] %>% tibble("genre" = .)
##Append the base url to the sub-links
genre_data <- genre_data %>%
  mutate(genre_links = paste("https://ctext.org", "/", genre_data[[1]], sep = ""))
## Warning: package 'bindrcpp' was built under R version 3.3.2

Next I set up a scraping function which needs to iterate over each book from the “genre_data” dataframe just created. Note the “Sys.sleep” call at the end to avoid overloading the server and play nicely with the website.

Function – Preparing for the 2nd scrape

##2nd Scrape - Make function to apply to each book, to get chapters
scraping_function <- function(genre, genre_links) {
  url <- genre_links[[1]]
  path <- read_html(url)
  
  data <- path %>%
    html_nodes(css = "#content3 > a") %>%
    html_attr("href")
  
  genre <- genre
  data <- data_frame(data, genre)
  
  ##Some string cleaning with stringr and mutate commands
  data <- data %>% mutate(book = str_extract(data, "^[a-z].*[\\/]")) %>%
    mutate(book = str_replace(book, "\\/", ""))
  data <- data %>%
    mutate(chapter = str_extract(data, "[\\/].*$")) %>%
    mutate(chapter = str_replace(chapter, "/", ""))
  data <- data %>%
    mutate(links = paste("https://ctext.org/", book, "/", chapter, sep = ""))
  data <- data %>% select(-data) %>%
    filter(complete.cases(.))

  Sys.sleep(2.5)
  data
}

If there was one takeaway from writing that function, it was that I should deepen my proficiency in regex. Finding the right regular expressions to capture the book and chapter names wasn’t HARD, but I did have to make several attempts before getting it all right. Previously web content was clean enough that I didn’t have to do this. Anyway, let’s apply the hard work to our original genre dataframe so that we can get a dataframe of books and their chapters. It’s going to be a big one.

Apply the function and get the data.. I have come to love purrr for this.

##Apply function to genre_data dataframe, create a data frame of books and chapters

all_works <- map2(genre_data$genre, genre_data$genre_links, ~ scraping_function(..1, ..2))

book_list <- all_works %>% do.call(rbind, .)

And here it is. The final variable “book_list” is a collection of books and chapters of each book, as listed on Ctext.org.

head(book_list)
## # A tibble: 6 x 4
##          genre     book       chapter
##                       
## 1 confucianism analects        xue-er
## 2 confucianism analects     wei-zheng
## 3 confucianism analects         ba-yi
## 4 confucianism analects        li-ren
## 5 confucianism analects gong-ye-chang
## 6 confucianism analects       yong-ye
## # ... with 1 more variables: links 

It is clearly in long format (convenient but not necessary, in fact this more a side effect of my scraping)

str(book_list)
## Classes 'tbl_df', 'tbl' and 'data.frame':    5869 obs. of  4 variables:
##  $ genre  : chr  "confucianism" "confucianism" "confucianism" "confucianism" ...
##  $ book   : chr  "analects" "analects" "analects" "analects" ...
##  $ chapter: chr  "xue-er" "wei-zheng" "ba-yi" "li-ren" ...
##  $ links  : chr  "https://ctext.org/analects/xue-er" "https://ctext.org/analects/wei-zheng" "https://ctext.org/analects/ba-yi" "https://ctext.org/analects/li-ren" ...

It is quite lengthy at nearly 6,000 rows and 130 different books. And this is an important dataframe which I will use in my API that I make, to pull textual data into R from Ctext.org.

Next post, I plan on sharing the process and results of my Chinese Classics text analysis.

To leave a comment for the author, please follow the link and comment on their blog: Posts on Anything Data.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)