Follow-Up: Making a Word Cloud for a Search Result from GScholarScraper_3.1
[This article was first published on theBioBucket*, and kindly contributed to R-bloggers.]
Here’s a short follow-up on how to produce a word cloud for a search result from GScholarScraper_3.1:
# File-Name: GScholarScraper_3.1.R
# Date: 2012-08-22
# Author: Kay Cichini
# Email: [email protected]
# Purpose: Scrape Google Scholar search results
# Packages used: XML
# Licence: CC BY-SA-NC
#
# Arguments:
# (1) input:
# A search string as used in the Google Scholar search dialog
#
# (2) write:
# Logical, should a table be written to the user's default directory?
# If TRUE ("T"), a CSV-file with hyperlinks to the publications will be created.
#
# Difference to version 3:
# (3) added "since" argument - defines the year since when publications
# should be returned; defaults to 1900.
#
# (4) added "citation" argument - logical; if "0", citations are included.
# Defaults to "1", and no citations will be included.
# Added field "YEAR" to the output.
#
# Caveat: if a submitted search string gives more than 1000 hits, there seem
# to be some problems (I guess I'm being stopped by Google for roboting the site..)
#
# And, there is an issue with this error message:
# > Error in htmlParse(URL):
# > error in creating parser for http://scholar.google.com/scholar?q
# I haven't figured this one out yet.. most likely also a Google blocking
# mechanism.. Reconnecting / a new IP-address helps.

GScholar_Scraper <- function(input, since = 1900, write = F, citation = 1) {

    require(XML)

    # putting together the search-URL:
    URL <- paste("http://scholar.google.com/scholar?q=", input,
                 "&as_sdt=1,5&as_vis=", citation, "&as_ylo=", since, sep = "")
    cat("\nThe URL used is: ", "\n----\n",
        paste("* ", "http://scholar.google.com/scholar?q=", input,
              "&as_sdt=1,5&as_vis=", citation, "&as_ylo=", since, " *", sep = ""))

    # get content and parse it:
    doc <- htmlParse(URL)

    # number of hits:
    h1 <- xpathSApply(doc, "//div[@id='gs_ab_md']", xmlValue)
    h2 <- strsplit(h1, " ")[[1]][2]
    num <- as.integer(sub("[[:punct:]]", "", h2))
    cat("\n\nNumber of hits: ", num, "\n----\n",
        "If this number is far from the returned results\nsomething might have gone wrong..\n\n",
        sep = "")

    # if there are no results, stop and throw an error message:
    if (num == 0 | is.na(num)) {
        stop("\n\n...There is no result for the submitted search string!")
    }

    pages.max <- ceiling(num/100)

    # 'start' as used in the URL:
    start <- 100 * 1:pages.max - 100

    # collect the result-page URLs:
    URLs <- paste("http://scholar.google.com/scholar?start=", start,
                  "&q=", input, "&num=100&as_sdt=1,5&as_vis=", citation,
                  "&as_ylo=", since, sep = "")

    scraper_internal <- function(x) {
        doc <- htmlParse(x, encoding = "UTF-8")

        # titles:
        tit <- xpathSApply(doc, "//h3[@class='gs_rt']", xmlValue)

        # publication:
        pub <- xpathSApply(doc, "//div[@class='gs_a']", xmlValue)

        # links:
        lin <- xpathSApply(doc, "//h3[@class='gs_rt']/a", xmlAttrs)

        # summaries are truncated and thus won't be used..
        # abst <- xpathSApply(doc, "//div[@class='gs_rs']", xmlValue)
        # ..to be extended for individual needs

        # YEAR is a 4-digit number extracted from the publication string:
        options(warn = (-1))
        dat <- data.frame(TITLES = tit, PUBLICATION = pub,
                          YEAR = as.integer(gsub(".*\\s(\\d{4})\\s.*", "\\1", pub)),
                          LINKS = lin)
        options(warn = 0)
        return(dat)
    }

    result <- do.call("rbind", lapply(URLs, scraper_internal))

    if (write == T) {
        result$LINKS <- paste("=Hyperlink(", "\"", result$LINKS, "\"", ")", sep = "")
        write.table(result, "GScholar_Output.CSV", sep = ";",
                    row.names = F, quote = F)
        shell.exec("GScholar_Output.CSV")  # note: shell.exec() is Windows-only
    } else {
        return(result)
    }
}

# EXAMPLE:
input <- "allintitle:amphibian+diversity"
df <- GScholar_Scraper(input, since = 1980, citation = 1)

# install.packages("tm")
library(tm)
# install.packages("wordcloud")
library(wordcloud)

# build a corpus from the scraped titles and clean it up:
corpus <- Corpus(VectorSource(df$TITLES))
corpus <- tm_map(corpus, function(x) removeWords(x, c(stopwords(), "PDF", "B",
                 "DOC", "HTML", "BOOK", "CITATION")))
corpus <- tm_map(corpus, removePunctuation)

# term frequencies, sorted in decreasing order:
tdm <- TermDocumentMatrix(corpus)
m <- as.matrix(tdm)
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)

# remove numbers from strings (invert = TRUE also handles the case of no matches):
d <- d[grep("[0-9]", d$word, invert = TRUE), ]

# print wordcloud:
wordcloud(d$word, d$freq)
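The last call draws the cloud with wordcloud's defaults (black text, random word placement). The function also takes tuning arguments if you want more control; here is a minimal sketch with assumed parameter values, not part of the original script, that fixes the size range, drops words occurring only once, and adds a Brewer palette:

# a sketch with assumed parameter values - adjust to taste:
library(RColorBrewer)
wordcloud(d$word, d$freq, scale = c(4, 0.5), min.freq = 2,
          random.order = FALSE, colors = brewer.pal(8, "Dark2"))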
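One more thought on the caveat in the script header: if the htmlParse() failures really are a Google blocking mechanism triggered by request rate (the comments above only guess at the cause), pausing between page requests might help. A minimal, untested sketch that would replace the do.call()/lapply() line inside GScholar_Scraper; the 5-second delay is an assumption:

# hypothetical rate-limited page loop; Sys.sleep() pauses before each request
result <- do.call("rbind", lapply(URLs, function(u) {
    Sys.sleep(5)  # assumed delay between requests, tune as needed
    scraper_internal(u)
}))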