Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Over the last few years, I’ve experimented on and off with various recipes for creating text reports from tabular data sets, (spreadsheet plugins are also starting to appear with a similar aim in mind). There are several issues associated with this, including:

• identifying what data or insight you want to report from your dataset;
• (automatically deriving the insights);
• constructing appropriate sentences from the data;
• organising the sentences into some sort of narrative structure;
• making the sentences read well together.

Another approach to humanising the reporting of tabular data is to generate templated webpages that review and report on the contents of a dataset; this has certain similarities to dashboard style reporting, mixing tables and charts, although some simple templated text may also be generated to populate the page.

In a business context, reporting often happens via Powerpoint presentations. Slides within the presentation deck may include content pulled from a templated spreadsheet, which itself may automatically generate tables and charts for such reuse from a new dataset. In this case, the recipe may look something like:

#render via: http://blockdiag.com/en/blockdiag/demo.html
{
X1[label='macro']
X2[label='macro']

Y1[label='Powerpoint slide']
Y2[label='Powerpoint slide']

data -> Excel -> Chart -> X1 -> Y1;
Excel -> Table -> X2 -> Y2 ;
}

In the previous couple of posts, the observant amongst you may have noticed I’ve been exploring a couple of components for a recipe that can be used to generate reveal.js browser based presentations from the 20% that account for the 80%.

The dataset I’ve been tinkering with is a set of monthly transparency spending data from the Isle of Wight Council. Recent releases have the form:

So as hinted at previously, it’s possible to use the following sort of process to automatically generate reveal.js slideshows from a Jupyter notebook with appropriately configured slide cells (actually, normal cells with an appropriate metadata element set) used as an intermediate representation.

{
X1[label="text"]
X2[label="Jupyter notebook\n(slide mode)"]
X3[label="reveal.js\npresentation"]

Y1[label="text"]
Y2[label="text"]
Y3[label="text"]

data -> "pandas dataframe" -> X1  -> X2 ->X3
"pandas dataframe" -> Y1,Y2,Y3  -> X2 ->X3

Y2 [shape = "dots"];
}


There’s an example slideshow based on October 2016 data here. Note that some slides have “subslides”, that is, slides underneath them, so watch the arrow indicators bottom left to keep track of when they’re available. Note also that the scrolling is a bit hit and miss – ideally, a new slide would always be scrolled to the top, and for fragments inserted into a slide one at a time the slide should scroll down to follow them).

The structure of the presentation is broadly as follows:

For example, here’s a summary slide of the spends by directorate – note that we can embed charts easily enough. (The charts are styled using seaborn, so a range of alternative themes are trivially available). The separate directorate items are brought in one at a time as fragments.

The next slide reviews the capital versus expenditure revenue spend for a particular directorate, broken down by expenses type (corresponding slides are generated for all other directorates). (I also did a breakdown for each directorate by service area.)

The items listed are ordered by value, and taken together account for at least 80% of the spend in the corresponding area. Any further items contributing more than 5%(?) of the corresponding spend are also listed.

Notice that subslides are available going down from this slide, rather than across the mains slides in the deck. This 1.5D structure means we can put an element of flexible narrative design into the presentation, giving the reader an opportunity to explore the data, but in a constrained way.

In this case, I generated subslides for each major contributing expenses type to the capital and revenue pots, and then added a breakdown of the major suppliers for that spending area.

This just represents a first pass at generating a 1.5D slide deck from a tabular dataset. A Pareto (80/20) heurstic is used to try to prioritise to the information displayed in order to account for 80% of spend in different areas, or other significant contributions.

Applying this principle repeatedly allows us to identify major spending areas, and then major suppliers within those spending areas.

The next step is to look at other ways of segmenting and structuring the data in order to produce reports that might actually be useful…

If you have any ideas, please let me know via the comments, or get in touch directly…

PS FWIW, it should be easy enough to run any other dataset that looks broadly like the example at the top through the same code with only a couple of minor tweaks…