Site icon R-bloggers

Visualizing APA 6 Citations: qdapRegex 0.2.0 & qdapTools 1.1.0

[This article was first published on TRinker's R Blog » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

qdapRegex 0.2.0 & qdapTools 1.1.0 have been released to CRAN.  This post will provide some of the packages’ updates/features and provide an integrate demonstration of extracting and viewing in-text APA 6 style citations from an MS Word (.docx) document.

qdapRegex 0.2.0

The qdapRegex package is meant to provide access to canned, common regular expression patterns that can be used within qdapRegex, with R‘s own regular expression functions, or add on string manipulation packages such as stringr and stringi.  The qdapRegex package serves a dual purpose of being both functional and educational.

New Features/Changes

Here are a select few new features.  For a complete list of changes CLICK HERE:

The last two functions regex_cheat & explain provide educational regex tools. regex_cheat provides a cheatsheet of common regex elements. explain interfaces with  http://www.regexper.com & http://rick.measham.id.au/paste/explain.

qdapTools 1.1.0

 qdapTools is an R package that contains tools associated with the qdap package that may be useful outside of the context of text analysis.

New Features/Changes

Integrated Demonstration

In this demonstration we will use dl_url to grab a .docx file from the Internet. We’ll then read this document in with read_docx. We’ll use split_vector to split the text from the .docx into main body and a references section. rm_citations will be utilize to extract in-text APA 6 style citations. Last we will view frequencies and a visualization of the distribution of the citations using ggplot2. For a complete script of this R code used in this blog post CLICK HERE.

First we’ll make sure we have the correct versions of the packages, install them if necessary, and load the required packages for the demonstration:

Map(function(x, y) {
    if (!x %in% list.files(.libPaths())){
        install.packages(x)   
    } else {
        if (packageVersion(x) < y) {
            install.packages(x)   
        } else {
            message(sprintf("Version of %s is suitable for demonstration", x))
        }
    }
}, c("qdapRegex", "qdapTools"), c("0.2.0", "1.1.0"))

lapply(c("qdapRegex", "qdapTools", "ggplot2", "qdap"), require, character.only=TRUE)

Now let’s grab the .docx document, read it in, and split into body/references sections:

## Download .docx
url_dl("http://umlreading.weebly.com/uploads/2/5/2/5/25253346/whole_language_timeline-updated.docx")

## Read in .docx
txt <- read_docx("whole_language_timeline-updated.docx")

## Remove non ascii characters
txt <- rm_non_ascii(txt) 

## Split into body/references sections
parts <- split_vector(txt, split = "References", include = TRUE, regex=TRUE)

## View body
parts[[1]]

## View references
parts[[2]]

Now we can extract the in-text APA 6 citations and view frequencies:

## Extract citations in order of appearance
rm_citation(unbag(parts[[1]]), extract=TRUE)[[1]]

## Extract citations by section 
rm_citation(parts[[1]], extract=TRUE)

## Frequency
left_just(cites <- list2df(sort(table(rm_citation(unbag(parts[[1]]),
    extract=TRUE)), TRUE), "freq", "citation")[2:1])

## citation freq ## 1 Walker, 2008 14 ## 2 Flesch (1955) 2 ## 3 Adams (1990) 1 ## 4 Anderson, Hiebert, Scott, and Wilkinson (1985) 1 ## 5 Baumann & Hoffman, 1998 1 ## 6 Baumann, 1998 1 ## 7 Bond and Dykstra (1967) 1 ## 8 Chall (1967) 1 ## 9 Clay (1979) 1 ## 10 Goodman and Goodman (1979) 1 ## 11 McCormick & Braithwaite, 2008 1 ## 12 Read Adams (1990) 1 ## 13 Stahl and Miller (1989) 1 ## 14 Stahl and Millers (1989) 1 ## 15 Word Perception Intrinsic Phonics Instruction Gates (1951) 1

Now we can find the locations of the citations in the text and plot a distribution of the in-text citations throughout the text:

## Distribution of citations (find locations)
cite_locs <- do.call(rbind, lapply(cites[[1]], function(x){
    m <- gregexpr(x, unbag(parts[[1]]), fixed=TRUE)
    data.frame(
        citation=x,
        start = m[[1]] -5,
        end =  m[[1]] + 5 + attributes(m[[1]])[["match.length"]]
    )
}))

## Plot the distribution
ggplot(cite_locs) +
    geom_segment(aes(x=start, xend=end, y=citation, yend=citation), size=3,
        color="yellow") +
    xlab("Duration") +
    scale_x_continuous(expand = c(0,0),
        limits = c(0, nchar(unbag(parts[[1]])) + 25)) +
    theme_grey() +
    theme(
        panel.grid.major=element_line(color="grey20"),
        panel.grid.minor=element_line(color="grey20"),
        plot.background = element_rect(fill="black"),
        panel.background = element_rect(fill="black"),
        panel.border = element_rect(colour = "grey50", fill=NA, size=1),
        axis.text=element_text(color="grey50"),    
        axis.title=element_text(color="grey50")  
    )


To leave a comment for the author, please follow the link and comment on their blog: TRinker's R Blog » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.