Follow-Up: Making a Word Cloud for a Search Result from GScholarScraper_3.1

August 30, 2012
By Kay Cichini

(This article was first published on theBioBucket*, and kindly contributed to R-bloggers)


Here's a short follow-up on how to produce a word cloud for a search result from GScholarScraper_3.1:






# File-Name: GScholarScraper_3.1.R
# Date: 2012-08-22
# Author: Kay Cichini
# Email: [email protected]
# Purpose: Scrape Google Scholar search results
# Packages used: XML
# Licence: CC BY-SA-NC
#
# Arguments:
# (1) input:
# A search string as used in Google Scholar search dialog
#
# (2) write:
# Logical; should a table be written to the user's default (working) directory?
# If TRUE ("T"), a CSV file with hyperlinks to the publications will be created.
#
# Differences to version 3:
# (3) added "since" argument - the year since which publications should be returned;
#     defaults to 1900.
#
# (4) added "citation" argument - logical; if "0", citations are included;
#     defaults to "1", i.e. no citations will be included.
#     Also added field "YEAR" to the output.
#
# Caveat: if a submitted search string gives more than 1000 hits there seem
# to be some problems (I guess I'm being stopped by Google for scraping the site like a robot).
#
# And, there is an issue with this error message:
# > Error in htmlParse(URL):
# > error in creating parser for http://scholar.google.com/scholar?q
# I haven't figured this one out yet.. most likely also a Google blocking mechanism..
# Reconnecting / getting a new IP address helps (a small retry sketch is given
# after the function definition below).


GScholar_Scraper <- function(input, since = 1900, write = F, citation = 1) {

    require(XML)

    # putting together the search URL:
    URL <- paste("http://scholar.google.com/scholar?q=", input, "&as_sdt=1,5&as_vis=",
                 citation, "&as_ylo=", since, sep = "")
    cat("\nThe URL used is: ", "\n----\n", paste("* ", URL, " *", sep = ""))

    # get content and parse it:
    doc <- htmlParse(URL)

    # number of hits:
    h1 <- xpathSApply(doc, "//div[@id='gs_ab_md']", xmlValue)
    h2 <- strsplit(h1, " ")[[1]][2]
    num <- as.integer(sub("[[:punct:]]", "", h2))
    cat("\n\nNumber of hits: ", num, "\n----\n",
        "If this number is far from the returned results\nsomething might have gone wrong..\n\n", sep = "")

    # if there are no results, stop and throw an error message:
    if (is.na(num) | num == 0) {
        stop("\n\n...There is no result for the submitted search string!")
    }

    pages.max <- ceiling(num/100)

    # 'start' parameter as used in the URL (0, 100, 200, ...):
    start <- 100 * (1:pages.max) - 100

    # collect the URLs of all result pages as a character vector:
    URLs <- paste("http://scholar.google.com/scholar?start=", start, "&q=", input,
                  "&num=100&as_sdt=1,5&as_vis=", citation, "&as_ylo=", since, sep = "")

    scraper_internal <- function(x) {

        doc <- htmlParse(x, encoding = "UTF-8")

        # titles:
        tit <- xpathSApply(doc, "//h3[@class='gs_rt']", xmlValue)

        # publication info (authors, source, year):
        pub <- xpathSApply(doc, "//div[@class='gs_a']", xmlValue)

        # links:
        lin <- xpathSApply(doc, "//h3[@class='gs_rt']/a", xmlAttrs)

        # summaries are truncated and thus won't be used..
        # abst <- xpathSApply(doc, "//div[@class='gs_rs']", xmlValue)
        # ..to be extended for individual needs
        options(warn = (-1))
        dat <- data.frame(TITLES = tit, PUBLICATION = pub,
                          YEAR = as.integer(gsub(".*\\s(\\d{4})\\s.*", "\\1", pub)),
                          LINKS = lin)
        options(warn = 0)
        return(dat)
    }

    result <- do.call("rbind", lapply(URLs, scraper_internal))

    if (write == T) {
        # wrap the links in a spreadsheet HYPERLINK formula and write a
        # semicolon-separated CSV to the working directory; shell.exec()
        # opens the file and works on Windows only:
        result$LINKS <- paste("=Hyperlink(", "\"", result$LINKS, "\"", ")", sep = "")
        write.table(result, "GScholar_Output.CSV", sep = ";",
                    row.names = F, quote = F)
        shell.exec("GScholar_Output.CSV")
    } else {
        return(result)
    }
}
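
As mentioned in the header comments, htmlParse() sometimes fails with "error in creating parser", most likely because Google temporarily blocks the requests. As a minimal sketch (not part of the original script), one could wrap the parsing step in a small retry helper - parse_with_retry(), max_tries and pause are made-up names used here only for illustration:

# Hypothetical helper (not in GScholarScraper_3.1): retry htmlParse() a few
# times with a pause in between; this only softens, not solves, the blocking
# issue described in the header comments.
parse_with_retry <- function(url, max_tries = 3, pause = 5) {
    for (i in seq_len(max_tries)) {
        doc <- tryCatch(htmlParse(url, encoding = "UTF-8"),
                        error = function(e) NULL)
        if (!is.null(doc)) return(doc)
        Sys.sleep(pause)  # wait before trying again
    }
    stop("htmlParse() failed after ", max_tries, " attempts for: ", url)
}

Inside the scraper one would then call parse_with_retry(URL) instead of htmlParse(URL).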

# EXAMPLE:

input <- "allintitle:amphibian+diversity"
df <- GScholar_Scraper(input, since = 1980, citation = 1)

#install.packages("tm")
library(tm)

#install.packages("wordcloud")
library(wordcloud)

corpus <- Corpus(VectorSource(df$TITLES))
corpus <- tm_map(corpus, function(x) removeWords(x, c(stopwords(),
                 "PDF", "B", "DOC", "HTML", "BOOK", "CITATION")))
corpus <- tm_map(corpus, removePunctuation)
tdm <- TermDocumentMatrix(corpus)
m <- as.matrix(tdm)
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)

# remove words containing numbers (grepl keeps all rows when nothing matches):
d <- d[!grepl("[0-9]", d$word), ]

# print wordcloud:
wordcloud(d$word, d$freq)
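
If you want to keep the result, the cloud can also be written to a file. The following lines are an optional extension, not part of the original post: they save the plot as a PNG and colour the terms with an RColorBrewer palette (the file name, image size and min.freq cutoff are arbitrary choices):

# optional: save the word cloud to a PNG and colour terms by frequency;
# file name, size and min.freq are arbitrary
# install.packages("RColorBrewer")
library(RColorBrewer)

png("GScholar_Wordcloud.png", width = 800, height = 800)
wordcloud(d$word, d$freq, min.freq = 2, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))
dev.off()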


To leave a comment for the author, please follow the link and comment on his blog: theBioBucket*.
