Writing Code to Read Quotes About Writing Code

October 11, 2018

(This article was first published on George J. Mount, and kindly contributed to R-bloggers)

A recent project of mine has been setting up a Twitter bot that posts quotes about innovation. I enjoy the project because, in addition to curating a great set of content and growing an audience around it, I have learned a lot about coding.

From web scraping to regular expressions to social media automation, collecting a list of over 30,000 innovation quotes has taught me a great deal.

Lately I’ve been turning my attention to finding quotes about computer programming, as digital savvy is crucial to innovation today. These exercises make great blog post material and are quite “meta,” too: writing code to read quotes about writing code. Below I cover the first of what I hope will become a series. For this example…

Scraping DevTopics.com’s “101 Great Computer Programming Quotes”

This is a nice set of quotes, but we can’t simply copy and paste them into a .csv file: doing so splits each quote across multiple rows and prefixes it with its numeric position. I also want to strip the quotation marks and parentheses from these quotations, since stylistically I tend to avoid them on Twitter.

That first attempt might make us despair about the orderliness of the page, but make no mistake: there is well-reasoned structure in the HTML underneath it, and that is where we need to go instead.

Part I: Scrape

To do this I will load the rvest package for R and the SelectorGadget extension for Chrome.

I want to identify the HTML nodes which hold the quotes we want, then collect that text. To do that, I will initialize the SelectorGadget, then hover and click on the first quote.

In the bottom toolbar we see the value is set as li, a common HTML tag for items of a list.

Knowing this, we will use rvest’s html_nodes function to select those nodes, then html_text to extract the text they hold.

Doing this returns a character vector, which I will convert to a data frame for ease of manipulation.

Our code thus far is below.

#initialize packages and URL
library(rvest)
library(tidyverse)
library(stringr)

link <- c("http://www.devtopics.com/101-great-computer-programming-quotes/")

#read in our url
quotes <- read_html(link)

#gather text held in the "li" html nodes
quote <- quotes %>% 
  html_nodes("li") %>% 
  html_text()

is.vector(quote)

#convert to data frame
quote <- as.data.frame(quote)
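
To see what html_nodes and html_text return without touching the network, here is a minimal sketch run on a small HTML fragment. The fragment and the names page and items are invented for illustration; this is not the actual DevTopics markup.

```r
library(rvest)

# Made-up HTML standing in for the real page (an assumption, not DevTopics markup):
# one navigation list item plus two "quote" list items
page <- read_html("<html><body>
  <ul><li>Home</li></ul>
  <ol>
    <li>Sample quote one. (Author A)</li>
    <li>Sample quote two. (Author B)</li>
  </ol>
</body></html>")

# Select every li node, then pull out the text each one holds
items <- page %>%
  html_nodes("li") %>%
  html_text()

is.vector(items)   # html_text returns a character vector
length(items)      # every li is matched, including the navigation item
```

Note that the selector picks up every li on the page, not only the quotes, so a real page will likely need some trimming afterwards.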

Part II: Clean

Gathering our quotes via rvest rather than copying and pasting, we get one quote per line, which makes them easier to store in our final workbook. We have also left the numeric position of each quote behind. But some issues with the text remain.

First off, looking through the gathered text, I see that not all text held in li nodes is a quote. This takes some manual inspection to spot, but here I will use dplyr’s slice function to keep only rows 26 through 126 (corresponding to the 101 quotes).
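
As a quick check on that row range, here is a small sketch using a toy data frame. The toy and kept objects are hypothetical stand-ins for the scraped text.

```r
library(dplyr)

# Toy data frame with 130 placeholder rows (hypothetical stand-in for the scraped text)
toy <- data.frame(text = paste("item", 1:130), stringsAsFactors = FALSE)

# slice keeps rows by position; both endpoints of 26:126 are included
kept <- slice(toy, 26:126)

nrow(kept)       # 126 - 26 + 1 rows
kept$text[1]     # first row kept is the 26th original row
```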

We still want to eliminate the parentheses and quotation marks, and to do this I will use regular expression functions from stringr to replace them.

a. Replace "(", ")", and the opening quotation mark (“) with ""

This is not meant as a comprehensive guide to the notorious regular expression; if you are not familiar with them, I suggest Chapter 14 of R for Data Science. I assume some familiarity here, as otherwise the explanation becomes quite tedious.

Because “(” and “)” are both metacharacters, we need to escape them, which in an R string means a double backslash, as in \\(. Joining the three characters with the “or” pipe (|), we then use the str_replace_all function to replace anything matching any of the three with the empty string, "".
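
A minimal sketch of this step on one invented line follows. The sample text is made up, and it assumes the scraped text has no space between the closing quotation mark and the author. The curly quotation marks are written as the escapes \u201c and \u201d so the snippet does not depend on file encoding.

```r
library(stringr)

# Invented sample line; \u201c and \u201d are the curly opening and closing quotation marks
sample_line <- "\u201cA made-up quote.\u201d(Some Author)"

# Escaped parentheses and the opening quotation mark, joined with the "or" pipe
pattern <- "\\(|\\)|\u201c"
cleaned <- str_replace_all(sample_line, pattern, "")
cleaned

# An equivalent pattern written as a character class; inside [ ] the
# parentheses are literal and need no escaping
cleaned_alt <- str_replace_all(sample_line, "[()\u201c]", "")
identical(cleaned, cleaned_alt)
```

Only the closing quotation mark is left in place, which is the job of the next step.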

b. Replace the closing quotation mark (”) with " "

The end of a quotation is handled differently because we need a space between the quotation and the author. This replacement therefore gets its own call, where we use str_replace, which replaces only the first match in each string, to swap the closing mark for " ".
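
Here is a small sketch of that second replacement, again on invented text and again assuming no space sits between the closing mark and the author. The second call illustrates that str_replace touches only the first match, whereas str_replace_all would touch every one.

```r
library(stringr)

# Invented line as it might look after the parentheses and opening mark are gone;
# \u201d is the curly closing quotation mark
half_clean <- "A made-up quote.\u201dSome Author"

# Swap the closing mark for a space so quote and author are separated
with_space <- str_replace(half_clean, "\u201d", " ")
with_space

# str_replace changes only the first match in each string
first_only <- str_replace("x\u201dy\u201d", "\u201d", " ")
first_only
```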

Bonus: Set it up for social media

Because I intend to send these quotes to Twitter, I will put a couple of finishing touches on them here.

First, using the paste function from base R, I will concatenate our quotes with a couple select hashtags.

Next, I use dplyr’s filter function, together with another stringr function, str_length, to exclude lines of 240 characters or more.
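
The two finishing touches can be sketched on a toy data frame as follows. The rows are invented, and the column is named quote to mirror the post. Because the hashtags are pasted on first, they count toward the length limit.

```r
library(dplyr)
library(stringr)

# Two invented rows: one short line and one 300-character line
toy <- data.frame(
  quote = c("A short made-up quote. Some Author", str_dup("x", 300)),
  stringsAsFactors = FALSE
)

# paste joins its arguments with a single space by default
toy$quote <- paste(toy$quote, "#quote #coding")

# Inside filter, quote refers to the column, not to the data frame
toy <- filter(toy, str_length(quote) < 240)

nrow(toy)    # only the short line survives
toy$quote
```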

The code for Part II is displayed below.

#get the rows I want
quote <- slice(quote, 26:126)

#delete the characters I don't want

charsd <- c("\\(|\\)|“")

quote$quote <- str_replace_all(quote$quote,charsd,"")

quote$quote <- str_replace(quote$quote,"”"," ")

#add hashtags
quote$quote <- paste(quote$quote, "#quote #coding")

#keep only lines shorter than 240 characters
quote <- filter(quote, str_length(quote) < 240)

#write csv
write.csv(quote,"C:/RFiles/tech2quotes.csv")

Run in order, the two code listings above make up the complete script.

From web scraping to data frame manipulation to regular expressions, this exercise packs a punch in dealing with real-world unstructured text data, and it comes with some enjoyable reading, too.

I hope this post inspires you to tackle the world of text, and I plan to walk through a couple more of these.
