Removing Uncited References in a Tex File (with R)

[This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers.]

Last week, with @3wen, we were working on the revised version of our work on smoothing densities of spatial processes (with edge correction). Usually, once you have revised a paper, some references have been added and others dropped, so you need to spend some time checking that every reference in the list is actually cited in the paper. For instance, consider the following compiled tex file:

Only three references are actually cited in the document, so we need to update the reference list (by removing the first three). If you use a bib file, this is very simple: only cited references appear in the list. The problem here is that we used bibitems, which are all printed whether or not they are cited.

I wanted to do that manually this weekend, but @3wen suggested writing a simple R function that scans the tex file (and the aux file, actually) to remove uncited references. The idea is the following. First, let us scan the two files,

> library(stringr)
> setwd("/home/tex/")
> file_tex <- scan("file_test.tex", what = "character", sep = "\n")
Read 15 items
> file_aux <- scan("file_test.aux", what = "character", sep = "\n")
Read 21 items
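To try the steps without the original files, one could write a small stand-in pair to a temporary directory. This is only a sketch: the file names `toy.tex` and `toy.aux` and their contents are hypothetical, not the authors' actual paper.

```r
# Hypothetical stand-ins for file_test.tex and file_test.aux
tex_lines <- c(
  "\\documentclass{article}",
  "\\begin{document}",
  "See \\cite{Scott} and \\cite{Wand}.",
  "\\begin{thebibliography}{9}",
  "\\bibitem[Ripley(1981)]{Ripley} Ripley, B. 1981. Spatial Statistics.",
  "\\bibitem[Scott(1992)]{Scott} Scott, D W 1992 Multivariate Density Estimation.",
  "\\bibitem[Wand \\& Jones(1995)]{Wand} Wand, M.P; Jones, M.C. (1995).",
  "Kernel Smoothing.",
  "\\end{thebibliography}",
  "\\end{document}"
)
aux_lines <- c("\\relax", "\\citation{Scott}", "\\citation{Wand}")

tmp <- tempdir()
writeLines(tex_lines, file.path(tmp, "toy.tex"))
writeLines(aux_lines, file.path(tmp, "toy.aux"))

# readLines() returns one element per line; unlike scan() with its
# default settings, it also keeps empty lines
toy_tex <- readLines(file.path(tmp, "toy.tex"))
toy_aux <- readLines(file.path(tmp, "toy.aux"))
```

Either reader works for what follows, since every later step only needs a character vector with one line per element.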

Then, we locate the part related to the bibliography,

> beg_file <- which(str_detect(string = file_tex, pattern = "\\\\begin\\{thebibliography\\}"))
> end_file <- which(str_detect(string = file_tex, pattern = "\\\\end\\{thebibliography\\}"))

References here are the following lines

> biblio <- file_tex[seq(beg_file+1, end_file-1)]
> biblio
[1] "\\bibitem[Cressie(1991)]{Cressie} Cressie, N. (1991). Statistics for Spatial Data. New York: John Wiley \\& Sons"
[2] "\\bibitem[Diggle (2002)]{Diggle} Diggle, P., Heagerty, P., Liang, K.Y. \\& Zeger, S. 2002. Analysis of Longitudinal Data. Oxford University Press."
[3] "\\bibitem[Ripley(1981)]{Ripley} Ripley, B. 1981. Spatial Statistics, Wiley, New York."
[4] "\\bibitem[Scott(1992)]{Scott} Scott, D W 1992 Multivariate Density Estimation: Theory, Practice, and Visualization. New York, John Wiley and Sons."
[5] "\\bibitem[Silverman(2004)]{Silverman} Silverman B W 1986 Density Estimation for Statistics and Data Analysis."
[6] "London, Chapman \\& Hall."
[7] "\\bibitem[Wand \\& Jones(1995)]{Wand} Wand, M.P; Jones, M.C. (1995). Kernel Smoothing. London: Chapman \\& Hall/CRC. "

If you look carefully at the output, you can see that the fifth reference spans two lines, which can happen frequently. So we need to determine precisely where each reference starts and where it ends.

> beg_bibitem <- which(str_detect(string = biblio, pattern = "\\\\bibitem"))
> go_through <- cbind(beg_bibitem, c(beg_bibitem[-1]-1,length(biblio)))
> go_through
     beg_bibitem  
[1,]           1 1
[2,]           2 2
[3,]           3 3
[4,]           4 4
[5,]           5 6
[6,]           7 7
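The start/end bookkeeping can also be sketched in base R on a toy vector. Here `grep(..., fixed = TRUE)` is swapped in for `str_detect()` so that no regex escaping is needed; the vector `biblio_toy` and its entries are made up for illustration.

```r
# Hypothetical bibliography lines; the second entry spans two lines
biblio_toy <- c("\\bibitem{A} First entry.",
                "\\bibitem{B} Second entry,",
                "continued on a second line.",
                "\\bibitem{C} Third entry.")

# fixed = TRUE matches the literal text \bibitem
starts <- grep("\\bibitem", biblio_toy, fixed = TRUE)

# An entry ends just before the next one starts; the last one ends
# on the last line of the bibliography
ends <- c(starts[-1] - 1, length(biblio_toy))
spans <- data.frame(beg = starts, end = ends)

# Reassemble each entry as a single string
entries <- mapply(function(b, e) paste(biblio_toy[b:e], collapse = " "),
                  spans$beg, spans$end)
```

The same idea carries over to the real `biblio` vector: any line that does not contain `\bibitem` is attached to the entry that precedes it.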

Actually, we should also check whether a reference has already been commented out: sometimes, a reference starts with a comment sign (%).

> go_through <- data.frame(beg = beg_bibitem, end = rep(NA, length(beg_bibitem)))
> for(i in seq_len(length(beg_bibitem) - 1)){
+   go_through[i,2] <- beg_bibitem[i+1]-1
+ }
> go_through[nrow(go_through), 2] <- length(biblio)
> go_through$comment <- str_detect(biblio[beg_bibitem], "^%")
> go_through
  beg end comment
1   1   1   FALSE
2   2   2   FALSE
3   3   3   FALSE
4   4   4   FALSE
5   5   6   FALSE
6   7   7   FALSE
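In the example no entry is commented out, so the `comment` column is FALSE everywhere. A minimal sketch of what the check would flag, on made-up lines and in base R, is given below; allowing leading whitespace before the `%` is an assumption that goes slightly beyond the pattern `"^%"` used above.

```r
# Hypothetical entries: one active, two already commented out
biblio_toy <- c("\\bibitem{A} First entry.",
                "%\\bibitem{B} Second entry, already commented out.",
                "  % \\bibitem{C} Indented and commented out.")

beg <- grep("\\bibitem", biblio_toy, fixed = TRUE)

# TRUE when the first non-blank character of the line is a comment sign
is_comment <- grepl("^[[:space:]]*%", biblio_toy[beg])
```

With the stricter pattern `"^%"`, the third line would not be flagged, because its comment sign is preceded by spaces.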

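Let us now extract the labels (the citation keys) of all the references.

> extract_ref_cite <- function(bibitem, file){
+   entree <- file[bibitem]
+   # With an optional label, \bibitem[label]{key}, the key follows "]{"
+   if(str_detect(entree, "bibitem\\[.*\\]\\{")){
+     nom_citation <- str_extract(entree, "\\]\\{(.*?)\\}")
+   }else{
+     # Otherwise (\bibitem{key} or \citation{key}), take the first {...}
+     nom_citation <- str_extract(entree, "\\{(.*?)\\}")
+   }
+   # Strip the delimiters to keep the bare key
+   str_replace_all(string = nom_citation, pattern = "\\{|\\}|\\]", replacement = "")
+ }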
> bibitems_ref <- unlist(lapply(beg_bibitem, extract_ref_cite, biblio))
> bibitems_ref
[1] "Cressie"   "Diggle"    "Ripley"    "Scott"     "Silverman" "Wand"

We have six references, with those labels (as expected).
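As a cross-check, the key extraction can be sketched with a single base R substitution. This is an alternative to `extract_ref_cite()`, not the function used in this post, and the three entries below are shortened versions of the ones above.

```r
entries <- c(
  "\\bibitem[Cressie(1991)]{Cressie} Cressie, N. (1991). Statistics for Spatial Data.",
  "\\bibitem{Scott} Scott, D W 1992 Multivariate Density Estimation.",
  "\\bibitem[Wand \\& Jones(1995)]{Wand} Wand, M.P; Jones, M.C. (1995). Kernel Smoothing."
)

# \bibitem, then an optional [label], then {key}; keep only the key.
# The lazy .*? stops at the first "]" that is directly followed by "{".
key_pattern <- "^.*\\\\bibitem(\\[.*?\\])?\\{([^}]*)\\}.*$"
keys <- sub(key_pattern, "\\2", entries, perl = TRUE)
```

One pattern handles both the `\bibitem[label]{key}` and the plain `\bibitem{key}` forms, because the label group is optional.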

Now, let us look at the aux file to see which references are cited in the text,

> ind_cite <- which(str_detect(string = file_aux, pattern = "\\\\citation"))
> bibitems_cite_names <- unlist(lapply(ind_cite, extract_ref_cite, file_aux))
> bibitems_cite_names
[1] "Scott"     "Scott"     "Silverman" "Silverman" "Wand"      "Wand"     
[7] "Scott"     "Scott"

Note that each reference is mentioned twice (at least): once for the author’s name, once for the year of publication. Since we just need to know which ones actually appear in the aux file, we can use

> bibitems_cite_names <- unique(bibitems_cite_names)
> bibitems_cite_names
[1] "Scott"     "Silverman" "Wand"
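One case the example does not exercise: depending on the citation package, a command such as `\cite{Silverman,Wand}` may be written to the aux file as a single `\citation{Silverman,Wand}` line, in which case the extracted name would be the whole comma-separated list. A minimal base R sketch that splits such lists, on made-up aux lines:

```r
# Hypothetical aux content; the \bibcite line must be ignored
aux_toy <- c("\\relax",
             "\\citation{Scott}",
             "\\citation{Silverman,Wand}",
             "\\bibcite{Scott}{1}")

# Keep only the \citation{...} lines and extract what is inside the braces
cite_lines <- grep("^\\\\citation\\{", aux_toy, value = TRUE, perl = TRUE)
inside <- sub("^\\\\citation\\{([^}]*)\\}.*$", "\\1", cite_lines, perl = TRUE)

# Split comma-separated lists, trim spaces, and drop duplicates
cited_keys <- unique(trimws(unlist(strsplit(inside, ",", fixed = TRUE))))
```

If every `\citation` line holds a single key, as in the aux file used here, the split changes nothing.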

Now, we can see which references are cited,

> go_through$keep <- bibitems_ref %in% bibitems_cite_names
> go_through
  beg end comment  keep
1   1   1   FALSE FALSE
2   2   2   FALSE FALSE
3   3   3   FALSE FALSE
4   4   4   FALSE  TRUE
5   5   6   FALSE  TRUE
6   7   7   FALSE  TRUE
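The same comparison can be run in both directions with `setdiff()`, which also reveals citations that have no matching bibitem. The two vectors below simply repeat the labels obtained above; the variable names are chosen for this sketch.

```r
bib_keys   <- c("Cressie", "Diggle", "Ripley", "Scott", "Silverman", "Wand")
cited_keys <- c("Scott", "Silverman", "Wand")

# Defined in the bibliography but never cited: candidates for removal
never_cited <- setdiff(bib_keys, cited_keys)

# Cited in the text but missing from the bibliography: these would
# show up as undefined citations when compiling
missing_entries <- setdiff(cited_keys, bib_keys)
```

Here `never_cited` holds the first three labels and `missing_entries` is empty, which matches the `keep` column of the table.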

Based on that table, we can write a simple function: references that we do not need are turned into comments, while those that are cited are left as they are, so that they appear in the reference list.

> return_cite <- function(one_ligne){
+   citation <- str_c(biblio[one_ligne[1,"beg"]:one_ligne[1,"end"]], collapse = "\n")
+   if(!one_ligne[1,"keep"] && !str_detect(citation, "^%")){
+     # Comment out every line of an uncited entry, including the first
+     citation <- str_c("%", str_replace_all(citation, pattern = "\n", replacement = "\n%"))
+   }
+   citation
+ }

For instance,

> return_cite(go_through[1,])
[1] "%\\bibitem[Cressie(1991)]{Cressie} Cressie, N. (1991). Statistics for Spatial Data. New York: John Wiley \\& Sons"

since the first reference does not appear in the text, while

> return_cite(go_through[4,])
[1] "\\bibitem[Scott(1992)]{Scott} Scott, D W 1992 Multivariate Density Estimation: Theory, Practice, and Visualization. New York, John Wiley and Sons."

Now, we can easily generate our bibliography in LaTeX,

> cat(unlist(lapply(1:nrow(go_through), function(x) return_cite(go_through[x,]))), sep = "\n\n")
%\bibitem[Cressie(1991)]{Cressie} Cressie, N. (1991). Statistics for Spatial Data. New York: John Wiley \& Sons

%\bibitem[Diggle (2002)]{Diggle} Diggle, P., Heagerty, P., Liang, K.Y. \& Zeger, S. 2002. Analysis of Longitudinal Data. Oxford University Press.

%\bibitem[Ripley(1981)]{Ripley} Ripley, B. 1981. Spatial Statistics, Wiley, New York.

\bibitem[Scott(1992)]{Scott} Scott, D W 1992 Multivariate Density Estimation: Theory, Practice, and Visualization. New York, John Wiley and Sons.

\bibitem[Silverman(2004)]{Silverman} Silverman B W 1986 Density Estimation for Statistics and Data Analysis.
London, Chapman \& Hall.

\bibitem[Wand \& Jones(1995)]{Wand} Wand, M.P; Jones, M.C. (1995). Kernel Smoothing. London: Chapman \& Hall/CRC.

We simply need to copy that list and paste it in our LaTeX file. Nice, isn’t it?
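Instead of copying and pasting, the whole procedure could be wrapped in one function that returns the modified tex lines, ready to be written to a new file. The sketch below uses base R only; the function name `comment_uncited`, the toy input, and the output file name are all hypothetical, and it assumes a single `thebibliography` environment.

```r
# Hypothetical helper: comment out the bibitems that the aux file never cites
comment_uncited <- function(tex, aux) {
  beg <- grep("\\begin{thebibliography}", tex, fixed = TRUE)
  end <- grep("\\end{thebibliography}", tex, fixed = TRUE)
  stopifnot(length(beg) == 1, length(end) == 1, end > beg + 1)
  idx <- (beg + 1):(end - 1)          # lines inside the environment
  biblio <- tex[idx]

  # First line of each entry, and the key that entry defines
  starts <- grep("\\bibitem", biblio, fixed = TRUE)
  keys <- sub("^.*\\\\bibitem(\\[.*?\\])?\\{([^}]*)\\}.*$", "\\2",
              biblio[starts], perl = TRUE)

  # Keys cited in the aux file (comma-separated lists are split)
  cite_lines <- grep("^\\\\citation\\{", aux, value = TRUE, perl = TRUE)
  inside <- sub("^\\\\citation\\{([^}]*)\\}.*$", "\\1", cite_lines, perl = TRUE)
  cited <- unique(trimws(unlist(strsplit(inside, ",", fixed = TRUE))))

  # Entry number of every bibliography line (0 before the first bibitem)
  entry_of_line <- findInterval(seq_along(biblio), starts)
  has_entry <- entry_of_line > 0
  uncited <- logical(length(biblio))
  uncited[has_entry] <- !(keys[entry_of_line[has_entry]] %in% cited)

  # Prefix uncited lines with %, unless they already are comments
  already <- grepl("^[[:space:]]*%", biblio)
  to_comment <- uncited & !already & nzchar(biblio)
  biblio[to_comment] <- paste0("%", biblio[to_comment])

  tex[idx] <- biblio
  tex
}

# Toy input: Ripley is never cited and spans two lines
tex <- c("See \\cite{Scott}.",
         "\\begin{thebibliography}{9}",
         "\\bibitem[Ripley(1981)]{Ripley} Ripley, B. 1981.",
         "Spatial Statistics, Wiley, New York.",
         "\\bibitem[Scott(1992)]{Scott} Scott, D W 1992.",
         "\\end{thebibliography}")
aux <- c("\\relax", "\\citation{Scott}")

cleaned <- comment_uncited(tex, aux)
writeLines(cleaned, file.path(tempdir(), "file_test_clean.tex"))
```

Writing to a new file rather than overwriting the original keeps the source intact in case an entry is commented out by mistake.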
