# Mining CRAN DESCRIPTION Files

[This article was first published on data science ish, and kindly contributed to R-bloggers].

A couple of weeks ago, I saw on Dirk Eddelbuettel’s blog that R 3.4.0 was going to include a function for obtaining information about packages currently on CRAN, including basically everything in DESCRIPTION files. When R 3.4.0 was released, this was one of the things I was most immediately excited about exploring, because although I recently dabbled in scraping CRAN to try to get this kind of information, it was rather onerous.
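The function in question is `tools::CRAN_package_db()`, which returns a data frame with one row per package and one column per DESCRIPTION field. Here is a minimal sketch of loading it. The wrapper name `load_cran_db()`, the injectable `fetch` argument, and the defensive removal of duplicated column names are my own assumptions for illustration, and `fake_fetch()` is a hypothetical stand-in so the example runs without network access.

```r
# Sketch: load the CRAN package database, dropping any repeated column names.
# `fetch` defaults to the real function; a fake can be passed for offline use.
load_cran_db <- function(fetch = tools::CRAN_package_db) {
  db <- fetch()
  db[, !duplicated(names(db)), drop = FALSE]
}

# Real usage (requires network access to a CRAN mirror):
# cran <- load_cran_db()
# dim(cran)

# Offline demonstration with a hypothetical two-package database
fake_fetch <- function() {
  data.frame(
    Package = c("pkgA", "pkgB"),
    MD5sum  = c("x", "y"),
    MD5sum  = c("x", "y"),
    check.names = FALSE,
    stringsAsFactors = FALSE
  )
}

toy <- load_cran_db(fake_fetch)
names(toy)
```

Passing the fetcher as an argument keeps the download separate from the cleanup, which makes the cleanup easy to check on a small example.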

There you go, all the packages currently on CRAN!

## Practices of CRAN maintainers

Some of the fields in the DESCRIPTION file of an R package tell us a bit about how a CRAN maintainer works, and in aggregate we can see how R package developers are operating.

How many packages have a URL, a place to go like GitHub to see the code and check out what is going on?

What about a URL for bug reports?
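Both questions come down to the share of packages whose field is not `NA`. Here is a sketch in base R, using a hypothetical four-package data frame `cran_toy` in place of the real CRAN data; the helper name `share_present()` is also made up for this example.

```r
# Hypothetical stand-in for the data frame returned by tools::CRAN_package_db()
cran_toy <- data.frame(
  Package    = c("pkgA", "pkgB", "pkgC", "pkgD"),
  URL        = c("https://github.com/example/pkgA", NA,
                 "https://example.org/pkgC", NA),
  BugReports = c("https://github.com/example/pkgA/issues", NA, NA, NA),
  stringsAsFactors = FALSE
)

# Proportion of entries that are filled in (not NA)
share_present <- function(x) mean(!is.na(x))

share_present(cran_toy$URL)         # 0.5
share_present(cran_toy$BugReports)  # 0.25
```

On the real data, the same two calls would be applied to `cran$URL` and `cran$BugReports`.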

How many packages have a package designated as a VignetteBuilder?
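One way to look at this is to tabulate the `VignetteBuilder` column, keeping `NA` as its own category. The sketch below uses a hypothetical five-package data frame `cran_toy`, not real CRAN data.

```r
# Hypothetical stand-in for the CRAN data frame
cran_toy <- data.frame(
  Package         = c("pkgA", "pkgB", "pkgC", "pkgD", "pkgE"),
  VignetteBuilder = c("knitr", NA, "knitr", "R.rsp", NA),
  stringsAsFactors = FALSE
)

# Count each declared builder, treating a missing field as a category
builder_counts <- sort(
  table(cran_toy$VignetteBuilder, useNA = "ifany"),
  decreasing = TRUE
)
builder_counts

# Share of packages that declare any builder at all
has_builder <- mean(!is.na(cran_toy$VignetteBuilder))
has_builder  # 0.6
```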

Are there packages that have vignettes but also have NA for VignetteBuilder? Yes, those would be packages that use Sweave, the built-in vignette engine that comes with R. This must be biased toward older packages, and it can’t be a large proportion of the total, given that CRAN has grown fastest in recent years. I know there are still packages with Sweave vignettes, but these days, having something in VignetteBuilder is at least somewhat indicative of whether a package has a vignette. There isn’t anything else in the DESCRIPTION file, to my knowledge, that indicates whether a package has a vignette or not.

How many packages use testthat or RUnit for unit tests?
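A simple way to check is to search the `Suggests` field for either package name. This is a sketch on a hypothetical data frame `cran_toy`; the regular expression is my own choice, with word boundaries so that a package whose name merely contains "testthat" does not match.

```r
# Hypothetical stand-in for the CRAN data frame; Suggests can contain
# version constraints and line breaks, or be missing entirely
cran_toy <- data.frame(
  Package  = c("pkgA", "pkgB", "pkgC", "pkgD"),
  Suggests = c("testthat (>= 1.0.0), knitr", "RUnit", NA, "knitr,\nrmarkdown"),
  stringsAsFactors = FALSE
)

# grepl() returns FALSE for NA input, so missing fields count as "no tests"
uses_tests <- grepl("\\b(testthat|RUnit)\\b", cran_toy$Suggests, perl = TRUE)

mean(uses_tests)  # 0.5
```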

(Another handful of packages have these testing suites in Imports or Depends, but not enough to change that proportion much.)

Is it the same ~20% of packages that are embracing the practices of unit tests, building vignettes, and providing a URL for bug reports?
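To see how the three practices overlap, each one can be turned into a logical column and the combinations counted. The sketch below does this in base R on a hypothetical six-package data frame `cran_toy`; the column names in `practices` are made up for this example.

```r
# Hypothetical stand-in for the CRAN data frame
cran_toy <- data.frame(
  Package         = c("pkgA", "pkgB", "pkgC", "pkgD", "pkgE", "pkgF"),
  Suggests        = c("testthat, knitr", "testthat", NA, NA, "knitr", "RUnit"),
  VignetteBuilder = c("knitr", NA, NA, NA, "knitr", NA),
  BugReports      = c("https://example.org/a/issues",
                      "https://example.org/b/issues",
                      NA, NA, NA,
                      "https://example.org/f/issues"),
  stringsAsFactors = FALSE
)

# One logical flag per practice
practices <- data.frame(
  tests            = grepl("\\b(testthat|RUnit)\\b", cran_toy$Suggests, perl = TRUE),
  vignette_builder = !is.na(cran_toy$VignetteBuilder),
  bug_reports      = !is.na(cran_toy$BugReports)
)

# Count every TRUE/FALSE combination, keep the ones that occur, largest first
combos <- as.data.frame(table(practices), responseName = "n")
combos <- combos[combos$n > 0, ]
combos <- combos[order(-combos$n), ]
combos
```

Dividing `combos$n` by `sum(combos$n)` gives the proportion of packages in each bin.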

Huh, so no, actually. I would have guessed that there would have been more packages in the TRUE/TRUE/TRUE bin in this data frame and fewer in the bins that are mixes of TRUE and FALSE. What does that distribution look like?

Maybe I should not be surprised, since a package that I myself maintain has unit tests and a URL for bug reports but no vignette. And remember that a few of the “No vignette builder” packages belong to maintainers who choose to produce vignettes via Sweave, OLD SCHOOL.

## Yo dawg I heard you like Descriptions in your DESCRIPTION

One of the fields in the DESCRIPTION file for an R package is the Description for the package.

Let’s use the tidytext package that I have developed with David Robinson to take a look at the words maintainers use to describe their packages. What words do they use the most often?
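The basic steps are to split each description into words, drop stop words, and count what is left. With tidytext the splitting is done by `unnest_tokens()`; the sketch below is a dependency-free base R stand-in for that workflow, not the tidytext code itself. The descriptions, the tiny stop-word list `stop_words_toy`, and the helper `tokenize()` are all hypothetical.

```r
# Hypothetical stand-in for the Package and Description columns
cran_toy <- data.frame(
  Package     = c("pkgA", "pkgB", "pkgC"),
  Description = c("Tools for fitting regression models to data.",
                  "Functions for the analysis of time series data.",
                  "Bayesian regression models and the analysis of data."),
  stringsAsFactors = FALSE
)

# A deliberately tiny stop-word list, for illustration only
stop_words_toy <- c("for", "to", "the", "of", "and", "a", "an", "in")

# Lowercase, split on anything that is not a letter, drop stop words
tokenize <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z]+"))
  words[nzchar(words) & !words %in% stop_words_toy]
}

tokens <- lapply(cran_toy$Description, tokenize)
word_counts <- sort(table(unlist(tokens)), decreasing = TRUE)
head(word_counts)
```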

Now let’s see what the relationships between all these description words are. Let’s look at how words are correlated with one another within description fields and make a word network.
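One way to get at this is to record, for each word, which descriptions it appears in, and then correlate those presence/absence indicators between words. The sketch below does that in base R with `cor()` on a binary document-term matrix. It is a swapped-in illustration on hypothetical descriptions, with a hypothetical stop-word list and frequency filter, and it stops at the edge list that a network plot would be drawn from.

```r
# Hypothetical package descriptions
descriptions <- c(
  "Tools for fitting regression models to data.",
  "Functions for the analysis of time series data.",
  "Bayesian regression models and the analysis of data.",
  "Plotting helpers for time series."
)

stop_words_toy <- c("for", "to", "the", "of", "and", "a", "an", "in")

tokenize <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z]+"))
  words[nzchar(words) & !words %in% stop_words_toy]
}

# Distinct words per description, and the overall vocabulary
tokens <- lapply(descriptions, function(d) unique(tokenize(d)))
vocab  <- sort(unique(unlist(tokens)))

# Binary document-term matrix: rows are descriptions, columns are words
dtm <- t(vapply(tokens,
                function(tk) as.numeric(vocab %in% tk),
                numeric(length(vocab))))
colnames(dtm) <- vocab

# Keep words found in at least two descriptions but not in all of them
# (a word present everywhere has zero variance and no defined correlation)
doc_freq <- colSums(dtm)
keep     <- doc_freq >= 2 & doc_freq < nrow(dtm)
word_cor <- cor(dtm[, keep, drop = FALSE])

# Flatten the upper triangle into an edge list, strongest pairs first
idx <- which(upper.tri(word_cor), arr.ind = TRUE)
edges <- data.frame(
  item1       = rownames(word_cor)[idx[, "row"]],
  item2       = colnames(word_cor)[idx[, "col"]],
  correlation = word_cor[idx],
  stringsAsFactors = FALSE
)
edges <- edges[order(-edges$correlation), ]
head(edges)
```

Each row of `edges` is a candidate link in the word network, and in practice only pairs above some correlation threshold would be kept before plotting.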

## The End

If you are interested in this approach to text analysis in R, check out the book Dave and I are publishing with O’Reilly, to be released this summer, available online as well. I found it really interesting to get a glimpse into this ecosystem that is such an important part of my professional and open-source life, both to see the overlap with the areas that I work in and the vast areas that I do not! The R Markdown file used to make this blog post is available here. I am very happy to hear feedback and questions!