
qdapRegex 0.2.0 and qdapTools 1.1.0 have been released to CRAN. This post highlights some of the packages’ updates and features and provides an integrated demonstration of extracting and viewing in-text APA 6 style citations from an MS Word (.docx) document.

qdapRegex 0.2.0

The qdapRegex package is meant to provide access to canned, common regular expression patterns that can be used within qdapRegex, with R’s own regular expression functions, or with add-on string manipulation packages such as stringr and stringi. The package serves a dual purpose: it is both functional and educational.

New Features/Changes

Here are a select few of the new features:

• is.regex added as a logical check of a regular expression’s validity (whether it conforms to R’s regular expression rules).
• Case wrapper functions, TC (title case), U (upper case), and L (lower case) added for convenient case manipulation.
• rm_citation_tex added to remove/extract/replace bibkey citations from a .tex (LaTeX) file.
• regex_cheat data set and cheat function added to act as a quick reference for common regex tasks such as lookaheads.
• explain added to view a visual representation of a regular expression using http://www.regexper.com and http://rick.measham.id.au/paste/explain. It also takes named regular expressions from regex_usa or another supplied dictionary.

The last two additions, regex_cheat (with its cheat function) and explain, are educational regex tools. regex_cheat provides a cheatsheet of common regex elements, while explain interfaces with http://www.regexper.com and http://rick.measham.id.au/paste/explain.
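To make the validity check and the lookahead idea concrete, here is a minimal base R sketch. It is not the qdapRegex implementation: regex_is_valid is a hypothetical helper that simply asks R’s regex engine whether a pattern compiles, which is one plausible way to perform such a check.

```r
# Hypothetical helper (not part of qdapRegex): a pattern is treated as valid
# if R's PCRE engine can compile and run it without throwing an error.
regex_is_valid <- function(pattern) {
    out <- suppressWarnings(
        try(grepl(pattern, "", perl = TRUE), silent = TRUE)
    )
    !inherits(out, "try-error")
}

valid_lookahead <- regex_is_valid("q(?=u)")     # well-formed lookahead
invalid_group   <- regex_is_valid("(unclosed")  # missing closing parenthesis

# A lookahead tests what follows without consuming it:
# "q" only matches when it is immediately followed by "u".
has_qu <- grepl("q(?=u)", c("quit", "qat"), perl = TRUE)
```

Note that lookaheads require perl = TRUE in base R functions such as grepl and gsub.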

qdapTools 1.1.0

qdapTools is an R package that contains tools associated with the qdap package that may be useful outside of the context of text analysis.

New Features/Changes

• loc_split added to split data forms (list, vector, data.frame, matrix) on a vector of integer locations.
• matrix2long makes a long format data.frame. It takes a matrix object, stacks all columns and adds identifying columns by repeating row and column names accordingly.
• read_docx added to read in .docx documents.
• split_vector gains a regex argument that allows a regular expression search for the break location.
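As a rough illustration of two of the ideas above, splitting on integer locations and stacking a matrix into long format, here is a base R sketch. It only approximates the concepts; the helper logic, the column names (cols, rows, vals), and the choice to start a new piece at each location are assumptions, and the output of loc_split and matrix2long may be shaped differently.

```r
# Splitting a vector on integer locations (assumption: each location
# starts a new piece).
x <- letters[1:10]
locs <- c(4, 8)
pieces <- unname(split(x, cumsum(seq_along(x) %in% locs)))

# Stacking a matrix into a long data.frame: every cell becomes one row,
# identified by its column name and row name.
m <- matrix(1:6, nrow = 2,
    dimnames = list(c("r1", "r2"), c("a", "b", "c")))
long <- data.frame(
    cols = rep(colnames(m), each = nrow(m)),
    rows = rep(rownames(m), times = ncol(m)),
    vals = c(m),   # c() flattens the matrix column by column
    stringsAsFactors = FALSE
)
```

Here pieces is a list of three character vectors and long has one row per matrix cell.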

Integrated Demonstration

In this demonstration we will use url_dl to grab a .docx file from the Internet. We’ll then read this document in with read_docx. We’ll use split_vector to split the text from the .docx into a main body and a references section. rm_citation will be used to extract in-text APA 6 style citations. Finally, we will view frequencies and a visualization of the distribution of the citations using ggplot2.

First we’ll make sure we have the correct versions of the packages, install them if necessary, and load the required packages for the demonstration:

Map(function(x, y) {
    if (!x %in% list.files(.libPaths())){
        install.packages(x)
    } else {
        if (packageVersion(x) < y) {
            install.packages(x)
        } else {
            message(sprintf("Version of %s is suitable for demonstration", x))
        }
    }
}, c("qdapRegex", "qdapTools"), c("0.2.0", "1.1.0"))

lapply(c("qdapRegex", "qdapTools", "ggplot2", "qdap"), require, character.only=TRUE)
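The version check above works because packageVersion returns a version object rather than a plain string. The sketch below shows the difference; has_min_version is a hypothetical helper introduced only for illustration, and the stats package is used simply because it ships with every R installation.

```r
# Version objects compare numerically, component by component,
# so 0.2.0 is correctly older than 0.10.0.
numeric_cmp <- package_version("0.2.0") < "0.10.0"

# Plain character comparison goes character by character and gets this wrong.
string_cmp <- "0.2.0" < "0.10.0"

# Hypothetical helper: TRUE when `pkg` is installed at `min_version` or newer.
has_min_version <- function(pkg, min_version) {
    requireNamespace(pkg, quietly = TRUE) &&
        packageVersion(pkg) >= min_version
}
base_ok <- has_min_version("stats", "1.0.0")
```

requireNamespace is one alternative to scanning list.files(.libPaths()) for the package name.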


Now let’s grab the .docx document, read it in, and split into body/references sections:

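## Download .docx
## NOTE: the download call and its URL are missing from this copy of the
## post; `docx_path` is a hypothetical placeholder for a local .docx file.
docx_path <- "path/to/document.docx"

## Read in .docx
txt <- read_docx(docx_path)

## Remove non ascii characters
txt <- rm_non_ascii(txt)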

## Split into body/references sections
parts <- split_vector(txt, split = "References", include = TRUE, regex=TRUE)

## View body
parts[[1]]

## View references
parts[[2]]
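The body/references split can be pictured with a small base R sketch on made-up text. This is not how split_vector is implemented; txt_demo and the sentences in it are invented for illustration, and the break line is kept with the second piece as an assumption.

```r
# Toy stand-in for the paragraphs read from a .docx (invented text).
txt_demo <- c(
    "Intro paragraph (Smith, 2001).",
    "More body text.",
    "References",
    "Smith, J. (2001). A title."
)

# Find the first element matching the break pattern ...
brk <- grep("^References", txt_demo)[1]

# ... and cut the vector there, keeping the break line with the second piece.
parts_demo <- list(
    body       = txt_demo[seq_len(brk - 1)],
    references = txt_demo[brk:length(txt_demo)]
)
```

A regular expression such as "^References" is stricter than a fixed string, since it will not break on a sentence that merely mentions the word mid-line.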


Now we can extract the in-text APA 6 citations and view frequencies:

## Extract citations in order of appearance
rm_citation(unbag(parts[[1]]), extract=TRUE)[[1]]

## Extract citations by section
rm_citation(parts[[1]], extract=TRUE)

## Frequency
left_just(cites <- list2df(sort(table(rm_citation(unbag(parts[[1]]),
    extract=TRUE)), TRUE), "freq", "citation")[2:1])


##    citation                                                   freq
## 1  Walker, 2008                                               14
## 2  Flesch (1955)                                              2
## 3  Adams (1990)                                               1
## 4  Anderson, Hiebert, Scott, and Wilkinson (1985)             1
## 5  Baumann & Hoffman, 1998                                    1
## 6  Baumann, 1998                                              1
## 7  Bond and Dykstra (1967)                                    1
## 8  Chall (1967)                                               1
## 9  Clay (1979)                                                1
## 10 Goodman and Goodman (1979)                                 1
## 11 McCormick & Braithwaite, 2008                              1
## 12 Read Adams (1990)                                          1
## 13 Stahl and Miller (1989)                                    1
## 14 Stahl and Millers (1989)                                   1
## 15 Word Perception Intrinsic Phonics Instruction Gates (1951) 1
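For readers curious about what citation extraction involves, here is a deliberately simplified base R sketch. The pattern cite_pat is an assumption written for this example and is far less thorough than the expression behind rm_citation; it only handles one or two capitalized surnames followed by either "(YYYY)" or ", YYYY". The sample sentence is invented.

```r
# Simplified, hypothetical citation pattern (not the qdapRegex pattern):
#   one surname, optionally "and"/"&" plus a second surname,
#   then either " (YYYY)" or ", YYYY".
cite_pat <- paste0(
    "\\b[A-Z][A-Za-z]+(?: (?:and|&) [A-Z][A-Za-z]+)?",
    "(?: \\(\\d{4}\\)|, \\d{4})"
)

body_demo <- paste(
    "Reading improved (Walker, 2008).",
    "Flesch (1955) disagreed, as did Bond and Dykstra (1967).",
    "Phonics returned (Walker, 2008)."
)

# Extract every match in order of appearance, then tabulate.
found <- regmatches(body_demo, gregexpr(cite_pat, body_demo, perl = TRUE))[[1]]
freq <- sort(table(found), decreasing = TRUE)
```

Entries such as "Read Adams (1990)" in the output above show that even a carefully built pattern can pick up a stray capitalized word before a surname, so extracted citations are worth a quick visual check.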

Now we can find the locations of the citations in the text and plot a distribution of the in-text citations throughout the text:

## Distribution of citations (find locations)
cite_locs <- do.call(rbind, lapply(cites[[1]], function(x){
    m <- gregexpr(x, unbag(parts[[1]]), fixed=TRUE)
    data.frame(
        citation=x,
        start = m[[1]] - 5,
        end = m[[1]] + 5 + attributes(m[[1]])[["match.length"]]
    )
}))

## Plot the distribution
ggplot(cite_locs) +
    geom_segment(aes(x=start, xend=end, y=citation, yend=citation), size=3,
        color="yellow") +
    xlab("Duration") +
    scale_x_continuous(expand = c(0,0),
        limits = c(0, nchar(unbag(parts[[1]])) + 25)) +
    theme_grey() +
    theme(
        panel.grid.major=element_line(color="grey20"),
        panel.grid.minor=element_line(color="grey20"),
        plot.background = element_rect(fill="black"),
        panel.background = element_rect(fill="black"),
        panel.border = element_rect(colour = "grey50", fill=NA, size=1),
        axis.text=element_text(color="grey50"),
        axis.title=element_text(color="grey50")
    )
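The location step above relies on gregexpr, which returns the start position of every match along with a match.length attribute. The following base R sketch shows that mechanism on an invented sentence; loc_demo and locs are names made up for this example, and no padding is added around the matches.

```r
# Invented text containing the same citation twice.
loc_demo <- "Walker, 2008 argued X. Later, Walker, 2008 argued Y."

# fixed = TRUE treats the citation as a literal string, so characters such
# as "(" or "&" inside a citation are not interpreted as regex syntax.
m <- gregexpr("Walker, 2008", loc_demo, fixed = TRUE)[[1]]

# Start positions come from the match itself; end positions are derived from
# the match.length attribute. (When nothing matches, gregexpr returns -1.)
locs <- data.frame(
    start = as.integer(m),
    end   = as.integer(m) + attr(m, "match.length") - 1
)

# Sanity check: cutting the text at those positions recovers the citation.
recovered <- substring(loc_demo, locs$start, locs$end)
```

In the demonstration the start and end are widened by 5 characters on each side, which makes the short segments easier to see in the plot.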