brew: Creating Repetitive Reports

September 9, 2009
By

(This article was first published on Learning R, and kindly contributed to R-bloggers)


United Nations report World Population Prospects: The 2008 Revision (highlights available here) provides data about the historical and forecasted population of the country. In exploring the future and past population trends it is relatively easy to subset the dataset by your selected variable.

> file <- c("UNdata_Population.csv")
> population <- read.csv(file)
> names(population) <- c("code", "country", "year",
+     "variant", "value")
> df <- subset(population, year <= 2005)

Likewise, it is straightforward to produce a 263 page pdf-file that shows the population trend between 1950 – 2005 for all the countries in the dataset.

> pdf("population_growth.pdf", paper = "a4")
> d_ply(df, .(country), function(x) plot(x$year,
+     x$value, type = "l", main = unique(x$country)))
> dev.off()

However, if one would like to create a report showing some summary tables and graphs along with some textual description for all the countries, then the process becomes more complicated. This is exactly what the brew() package written by Jeffrey Horner was designed to help the users with. brew “implements a templating framework for mixing text and R code for report generation” and makes it very easy to generate repetitive reports which is of great help for example in performing exploratory analysis on a large dataset with many variables.

The report-generation can be split into three parts:

  1. Prepare the data.
    1. Add Regional Information to Population Data.
    2. Generate Graphs and Data to be included in the report
  2. Prepare the report template.
    1. Helper Functions
    2. Brew template – population.brew
  3. Produce the report.

Data Preparation

Data preparation script is saved in the popreportdata.R file which is later sourced into the brew template.

Add Regional Information to Population Data

In fact I would like to explore the data by continent rather than by country, so in order to do this the first step is to group the countries by continent with the help of Isocodes package that contains standard country or area codes and geographical regions used by UN.

> library(ggplot2)
> library(ISOcodes)
> data("UN_M.49_Countries")
> data("UN_M.49_Regions")
> Regions <- subset(UN_M.49_Regions, Type == "Region")
> regionsdf <- ddply(Regions, .(Code, Name, Parent),
+     function(x) {
+         df <- data.frame(strsplit(x$Children,
+             ", "))
+         names(df) <- "countrycode"
+         df
+     })
> countries <- merge(regionsdf, UN_M.49_Countries,
+     by.x = "countrycode", by.y = "Code")
> countries <- rename(countries, c(Name.x = "region",
+     Name.y = "country", Code = "regioncode", Parent = "parentcode"))
> countries <- merge(countries, Regions[, 1:2],
+     by.x = "parentcode", by.y = "Code")
> countries <- rename(countries, c(Name = "continent"))
> countries$countrycode <- as.numeric(as.character(countries$countrycode))

countries dataframe contains now regional classification for each country. Next step is to merge a subset of this information with the population data.

> population <- merge(population, countries[, c("countrycode",
+     "continent")], by.x = "code", by.y = "countrycode")
> population$value <- population$value/1000

Generate Graphs and Data to be included in the report

In this step the graphs are saved to the disk using ggsave, and a list of lists with four dataframes about each continent is returned.

> popreportdata <- dlply(population, .(continent),
+     function(df) {
+         continent <- gsub(" ", "_", unique(df$continent))
+         filename <- function(y) {
+             paste("graphs", continent, y, ".pdf",
+                 sep = "")
+         }
+         forecast <- subset(df, variant != "Estimate variant")
+         forecast$variant <- forecast$variant[,
+             drop = TRUE]
+         historic <- subset(df, variant == "Estimate variant")
+         historic <- ddply(historic, .(continent,
+             year), transform, cont_value = sum(value))
+         current <- subset(df, year == 2005)
+         growthrate <- function(df) {
+             rng <- range(df$year)
+             min_value <- df[df$year == rng[1],
+                 "value"]
+             max_value <- df[df$year == rng[2],
+                 "value"]
+             abs_growth <- max_value/min_value
+             yr5_growth <- abs_growth^(1/length(df$year))
+             growthdf <- data.frame(min_value,
+                 max_value, abs_growth, yr5_growth)
+             names(growthdf)[1:2] <- c(rng[1],
+                 rng[2])
+             growthdf
+         }
+         growth <- ddply(forecast, .(continent,
+             country, variant), growthrate)
+         growth$variant <- factor(growth$variant,
+             levels = c("Constant-fertility scenario",
+                 "High variant", "Medium variant",
+                 "Low variant"))
+         growth <- sort_df(growth, vars = c("continent",
+             "variant", "abs_growth"))
+         blabel <- c(0.01, 0.1, 1, 10, 100)
+         alabel <- "Population (in millions)"
+         phist <- ggplot(current, aes(value)) +
+             geom_histogram(binwidth = 0.5, fill = NA,
+                 colour = "black") + scale_x_log10(breaks = blabel,
+             labels = blabel) + labs(x = alabel)
+         ggsave(filename("_hist"), phist, dpi = 100)
+         prank <- ggplot(current, aes(seq_along(country),
+             rev(sort(value)))) + geom_point() +
+             scale_y_log10(breaks = blabel, labels = blabel) +
+             labs(x = "Rank", y = alabel)
+         ggsave(filename("_rank"), prank, dpi = 100)
+         pbox <- ggplot(historic, aes(factor(year),
+             value)) + geom_boxplot() + labs(x = "",
+             y = alabel)
+         ggsave(filename("_box"), pbox, dpi = 100)
+         ptrend <- ggplot(historic, aes(year, value)) +
+             stat_summary(fun.y = "sum", geom = "line",
+                 colour = "red", size = 1) + stat_summary(data = forecast,
+             aes(y = value, group = variant, colour = variant),
+             fun.y = "sum", geom = "line", size = 1) +
+             labs(y = alabel, x = "") + opts(legend.position = c(0.8,
+             0.3), legend.background = theme_blank(),
+             legend.key = theme_blank()) + scale_colour_hue("Forecast")
+         ggsave(filename("_trend"), ptrend, width = 8,
+             height = 4, dpi = 100)
+         pgrowth <- ggplot(growth, aes(variant,
+             abs_growth, colour = variant)) + geom_boxplot() +
+             xlab("") + opts(legend.position = "none")
+         ggsave(filename("_abs_growth"), pgrowth,
+             dpi = 100)
+         pann_growth <- ggplot(growth, aes(variant,
+             yr5_growth, colour = variant)) + geom_boxplot() +
+             xlab("") + opts(legend.position = "none")
+         ggsave(filename("_ann_growth"), pann_growth,
+             dpi = 100)
+         current <- current[with(current, order(-value)),
+             ]
+         top <- head(current, 5)[, c("country",
+             "value")]
+         bottom <- tail(current, 5)[, c("country",
+             "value")]
+         growth <- growth[growth$variant == "Medium variant",
+             c(2, 4:7)]
+         growth <- growth[order(-growth$abs_growth),
+             ]
+         names(growth)[c(4:5)] <- c("Abs.Growth",
+             "Compound Growth")
+         growthtop <- head(growth, 5)
+         growthbottom <- tail(growth, 5)
+         list(top = top, bottom = bottom, growthtop = growthtop,
+             growthbottom = growthbottom)
+     })

Report Template

My report template is essentially a latex document which includes an R code loop wrapped in brew syntax. To make it easier to generate the latex statements for inclusion of graphs and tables, a few helper functions are defined first.

Helper functions

> include_graph <- function(width = 1, filename) {
+     paste("includegraphics[width=", width, "linewidth]{",
+         filename, "}", sep = "")
+ }
> include_tbl <- function(width = 1, filename) {
+     print(xtable(filename), table.placement = "",
+         latex.environments = "", include.rownames = FALSE,
+         floating = FALSE)
+ }
> subfloat_graph <- function(width, filename, caption = "") {
+     paste("subfloat[", caption, "]{", "begin{minipage}[h]{",
+         width, "linewidth}centering", include_graph(width = 1,
+             filename), "end{minipage}}", sep = "")
+ }
> subfloat_tbl <- function(width, filename, caption) {
+     paste("subfloat[", caption, "]{", "begin{minipage}[h]{",
+         width, "linewidth}centering", print(xtable(filename),
+             file = stderr(), table.placement = "",
+             latex.environments = "", include.rownames = FALSE,
+             floating = FALSE), "end{minipage}}",
+         sep = "")
+ }

Brew template – population.brew

Brew syntax is really very simple to use. From the help file:

  1. All text that falls outside of the delimiters is printed as-is.
  2. R expressions between the <% and %> delimiters are executed in-place.
  3. The value of the R expression between the <%= and %> delimiters is printed
documentclass[oneside]{article}

usepackage[margin=2cm,nohead]{geometry}
usepackage[pdftex]{graphicx}
usepackage{subfig}
usepackage{float}
usepackage{verbatim}

usepackage{hyperref}
hypersetup{
  colorlinks=true,
  pdfauthor={http://learnr.wordpress.com}
  }

graphicspath{{./graphs/}}

title{World Population Trends}
author{url{http://learnr.wordpress.com}}
date{today}
raggedbottom
setcounter{tocdepth}{1}

begin{document}

maketitle

This report has been compiled based on the United Nations report World Population Prospects: The 2008 Revision (highlights available href{http://www.un.org/esa/population/publications/wpp2008/wpp2008_highlights.pdf}{here}). The dataset can be accessed href{http://data.un.org/Data.aspx?d=PopDiv&f=variableID%3a12&c=1,2,4,6,7&s=_crEngNameOrderBy:asc,_timeEngNameOrderBy:desc,_varEngNameOrderBy:asc&v=1}{here}.

tableofcontents

<% library(xtable); library(ggplot2)%>

<% for (i in seq_along(names(popreportdata))) { -%>

pagebreak

<% i = names(popreportdata)[i] %>
<% reportlist <- popreportdata[match(i,names(popreportdata))][[1]] %>
<% filename <- function(y){paste(gsub(" ", "_", i) , y, ".pdf", sep="")} %>

<%=cat("section{", i, "}", sep="") %>

begin{figure}[H]
  centering
  <%= include_graph(width = 1, filename("_trend")) %>
  <%= subfloat_graph(0.33, filename("_hist"), "Histogram") %>
  <%= subfloat_graph(0.33, filename("_rank"), "Rank Curve") %>
  <%= subfloat_graph(0.33, filename("_box"), "Boxplot") %>
  caption{Distribution plots}
end{figure}

begin{table}[h]
  centering
  <%= subfloat_tbl(0.4, reportlist[[1]], "Top 5 Countries") %>
  <%= subfloat_tbl(0.4, reportlist[[2]], "Bottom 5 Countries") %>
  caption{Population in 2005}
end{table}

begin{figure}
  centering
  <%= subfloat_graph(0.5, filename("_abs_growth"), "Absolute Growth") %>
  <%= subfloat_graph(0.5, filename("_ann_growth"), "Annual Compound Growth") %>
  caption{Growth charts 2010 - 2050}
end{figure}

begin{table}[H]
  centering
  <%= subfloat_tbl(1, reportlist[[3]], "Top 5 Growing Countries") %>
  quad
  <%= subfloat_tbl(1, reportlist[[4]], "Bottom 5 Growing Countries") %>
  caption{Growth tables 2010 - 2050}
end{table}

<% } -%>

end{document}

Produce the report

> library(tools)
> library(brew)
> brew("population.brew", "population.tex")
> texi2dvi("population.tex", pdf = TRUE)

The resulting pdf-file can be seen here

To leave a comment for the author, please follow the link and comment on his blog: Learning R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Tags: , , , ,

Comments are closed.