
I’ve just finished a slideshow about a preliminary translation of a simulation of the UK economy from Python into R. The simulation is the one I’ve been blogging about in recent posts. It’s a “microeconomic” model, meaning that it simulates at the level of individual people. In contrast, a “macroeconomic” model (which this is not), would simulate bulk economic variables such as inflation and unemployment.

The model reads data about the expenditure and income of British households. It contains rules describing the tax and benefits system, and a front-end through which the user can change these: for example, by increasing VAT on alcohol or by decreasing child benefit. With these, it calculates how the changes would affect each household’s net expenditure and income.

The model works with various different data sets. For our initial translation, we’re concentrating on the Government’s Family Resources Survey, which surveys 20,000 households each year. See “Introduction to the Family Resources Survey” for an overview of the 2013-2014 survey. This and other documentation, including details of the columns in the data tables, are available at the “Family Resources Survey, 2013-2014” data catalogue page.

Tax-benefit models like ours can be used in various ways. One is evaluation of Government economic policy. If the Government increases income tax, how will this affect net incomes? Will it push a lot of low-income people into poverty? Or will it extract income mainly from the better off? To answer such questions, we need to summarise how a change affects the population as a whole, via tools such as graphs of income distribution.

Another use is education: about economics, about policy, and about statistics and data analysis. We believe it’s an important application, where R will be very helpful.

Yet a third is as a “reference standard”. The benefits system is notoriously complicated, making it difficult for claimants to know what they’re entitled to. To help, some organisations have written their own guides to the system. For example, the Child Poverty Action Group publishes its Welfare Benefits and Tax Credits Handbook: “an essential resource for all professional advisers serious about giving the best and most accurate advice to their clients”. We would love R-Taxben to become a reference standard for the benefits system, against which interpretations of the benefits rules could be tested: an electronic Handbook. We think this is feasible. It will, of course, require our code to be very, very readable. Like this, perhaps:

# Returns a family_type. See the
# family_type document.
# age_2 and sex_2 are only defined
# if is_couple.
#
family_type <- function( age_1, sex_1
                       , age_2, sex_2
                       , is_couple
                       , num_children
                       )
{
  ad1_is_pensioner <- of_pensionable_age( age_1, sex_1 )
  ad2_is_pensioner <- of_pensionable_age( age_2, sex_2 )

  is_single <- !is_couple

  case_when(
    is_single & ad1_is_pensioner  ~ 'single_pensioner'
  , is_couple & num_children == 0 ~ 'couple_no_children'
  , is_single & num_children == 0 ~ 'single_no_children'
  , is_couple & num_children > 0  ~ 'couple_children'
  , is_single & num_children > 0  ~ 'single_children'
  )
}


I wrote my slideshow for a non-R-speaking colleague, and then decided I should make it public. It presents R a fragment at a time, with screenshots to show what the interaction is like. Having demonstrated some elementary features such as assignment and function calls, the slideshow moves on to more advanced things that I use in the model, or that will be useful for users of its R interface when probing economic data.

I see this as helping the users “reify” economic concepts and data: making them more “thing-like”, more “manipulable”, like toys in a playground.

Indeed, as the slides explain, R-Taxben could be a playground for the mind: an “educational microworld” where students can experiment and learn in a non-threatening and stimulating environment. This idea was promoted by Seymour Papert, co-designer of Logo, the language with which children learned mathematics by programming robot “turtles” to draw squares and other figures. There’s an introduction (from 1983) in Education Week’s “Seymour Papert’s ‘Microworld’: An Educational Utopia” by Charlie Euchner. A survey of Logo and other microworlds can be found in “Microworlds” by Lloyd P. Rieber. To quote:

However, another, much smaller collection of software, known as microworlds, is based on very different principles, those of invention, play, and discovery. Instead of seeking to give students knowledge passed down from one generation to the next as efficiently as possible, the aim is to give students the resources to build and refine their own knowledge in personal and meaningful ways.

Rieber describes their effect on learning thus:

Similar to the idea that the best way to learn Spanish is to go and live in Spain, Papert conjectured that learning mathematics via Logo was similar to having students visit a computerized Mathland where the inhabitants (i.e., the turtle) speak only Logo. And because mathematics is the language of Logo, children would learn mathematics naturally by using it to communicate to the turtle. In Mathland, people do not just study mathematics, according to Papert, they “live” mathematics.

So with R-Taxben, could we build an EconomicsLand?

Returning to R, a lot of the slideshow’s more advanced demonstrations feature the Tidyverse — hat-tip to its inventor Hadley Wickham. This cornucopia of data-manipulation tools can be confusing, because it’s very unlike base R; but as I blogged in “Should I Use the Tidyverse?”, I think the confusion is worth it. I’d also like to give something back to the Tidyverse team, by showing lots of ways it can be useful in such a project.

One of these ways is in analysing and displaying data. The plots below are of FRS household income versus ethnic group.
They were suggested to me by an exercise in “SC504: Statistical Analysis: ASSIGNMENT ONE: Spring 2008”, an assignment for sociology students at the Essex University School of Health and Social Care. I did the plots using the Tidyverse’s ggplot function, with the “ggthemes” package to display them in the style of The Economist. As you can see from the screenshot, I didn’t need to write very much code, and I’ve got a lovely publication-quality plot.
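Something in that spirit can be sketched with invented data standing in for the FRS (the group labels and income figures below are made up for illustration, not taken from the survey):

```r
library(ggplot2)
library(ggthemes)   # supplies theme_economist()

set.seed(1)

# Invented stand-in for FRS household records.
households <- data.frame(
  ethnic_group = rep(c("White", "Asian", "Black", "Mixed"), each = 50),
  income       = c(rnorm(50, 480, 120), rnorm(50, 430, 130),
                   rnorm(50, 410, 110), rnorm(50, 450, 125))
)

# Box plots of income by group, styled after The Economist.
p <- ggplot(households, aes(x = ethnic_group, y = income)) +
  geom_boxplot() +
  labs(title = "Weekly household income by ethnic group",
       x = "Ethnic group", y = "Income (GBP per week)") +
  theme_economist()

# ggsave("income_by_group.png", p)  # write out a publication-quality image
```

The whole style change is a single `theme_economist()` call at the end of the pipeline, which is why so little code is needed.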

The above also shows that even the raw FRS data, untransformed by our model, can be educational. Another example of analysis and display comes from my slide on “Probing Data: Multi-level Grouping”. This uses the Tidyverse’s summarise function to count and histogram the number of “benefit units” — adult plus spouse if any, plus children — per household.
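That kind of two-level count can be sketched as follows; the table and the column names (`hh_id`, `benunit`) are illustrative, not the FRS’s actual ones:

```r
library(dplyr)

# Illustrative stand-in for FRS person records: each row is one person,
# identified by household (hh_id) and benefit unit within it (benunit).
people <- data.frame(
  hh_id   = c(1, 1, 1, 2, 3, 3, 3),
  benunit = c(1, 1, 2, 1, 1, 2, 3)
)

# First level: count distinct benefit units within each household.
units_per_household <- people %>%
  group_by(hh_id) %>%
  summarise(n_units = n_distinct(benunit))

# Second level: count how many households have each number of units,
# which is the table one would then histogram.
distribution <- units_per_household %>%
  count(n_units)
```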

Interestingly, the counts almost follow a logarithmic distribution.

R-Taxben is not yet complete: what I have now is a proof of concept. It was funded by Landman Economics, and we’ll be looking for grants to finish the work. Some indication of where the emphasis lies is that my slideshow has nine slides about testing. We want this model to be reliable and accurate, and that means rigorous testing. That’s all the more necessary because R lacks static typing, which is what makes languages like Pascal and Ada safe.
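By way of illustration, here is the shape such a test takes, using only base R’s stopifnot. This is a sketch: the real suite uses a proper testing framework, and the pension age of 65 used here is a deliberate simplification, not the model’s actual rule.

```r
# Simplified rule under test: pensionable at 65, regardless of sex,
# and never pensionable if the age is missing.
of_pensionable_age <- function(age, sex) {
  !is.na(age) && age >= 65
}

# Each test pins down the expected output for a known input.
stopifnot(  of_pensionable_age(70, "female") )
stopifnot( !of_pensionable_age(30, "male")   )
stopifnot( !of_pensionable_age(NA, "male")   )   # missing data must not crash
```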

(On the other hand, the freedom to put any kind of value almost anywhere does make prototyping easier. As Michael Clarkson points out in his lecture on static versus dynamic checking, there are trade-offs. But as he also points out, few people now believe that “strong types are for weak minds”. Humans are really bad at avoiding bugs, and we need all the help we can get!)

Not only do we not want surprises when R-Taxben runs, we don’t want users to be surprised by the assumptions built into it. Traditionally, economic models have been opaque, their assumptions known only to the few who implemented them. R programmers can probe R-Taxben’s data using R. But I’ve also implemented a novel web-based interface which “visualises” the model as a network of data nodes connected by functions, so that even non-programmers can peer inside: I think that’s essential.

And, though I agree with Donald Knuth that premature optimisation is the root of all evil, R-Taxben does have to be fast enough to be usable.

One thing we may need is help merging files. There are a lot of FRS files: one for the households themselves, but others for data such as mortgage payments; accommodation rentals; house purchases; payments into pension schemes; benefits claims; jobs; and share dividends and other non-job income. We need to translate each of these into a form the model can use, then merge them with the households. To make my code easy to read and maintain, I’ve written a separate function for translating each file, and I then JOIN the results. But I suspect this won’t be terribly fast, and almost certainly not as fast as the Python code, which looped over the files in parallel, merging records one by one.
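The translate-then-join approach looks roughly like this; the file contents, column names, and the translate_mortgages function are all invented for the sketch:

```r
library(dplyr)

# Illustrative stand-ins for two FRS files, keyed by household id.
households <- data.frame(hh_id  = c(1, 2, 3),
                         region = c("North", "South", "Wales"))
mortgages  <- data.frame(hh_id           = c(1, 3),
                         monthly_payment = c(650, 820))

# One translation function per file keeps each file's logic readable
# in isolation.
translate_mortgages <- function(raw) {
  raw %>% rename(mortgage_monthly = monthly_payment)
}

# Then join each translated table onto the households.
merged <- households %>%
  left_join(translate_mortgages(mortgages), by = "hh_id")

# Household 2 has no mortgage record, so its mortgage_monthly is NA.
```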

Another is something I blogged under the cryptic subtitle “Why Douglas T. Ross would hate nest(), unnest(), gather() and spread()”. Douglas Ross was one of the first to propose what he called the Uniform Referent Principle: that code for extracting or changing data should be independent of how the data is stored. Following the principle means you can change storage without affecting the rest of your program; not following it means every change bleeds into the rest of your code, with all the consequent time-wasting updates, typos, and re-testing.

Why? Look at the tables below. They represent four different ways of storing my income data.

Person   Income_Type   Income_Value
Alice    Wages         37000
Alice    Bonuses       0
Alice    Benefits      0
Bob      Wages         14000
Bob      Bonuses       1000
Bob      Benefits      6000

Person   Income_Wages   Income_Bonuses   Income_Benefits
Alice    37000          0                0
Bob      14000          1000             6000

Person   Income
Alice    Type       Value
         Wages      37000
         Bonuses    0
         Benefits   0
Bob      Type       Value
         Wages      14000
         Bonuses    1000
         Benefits   6000

Person   Income
Alice    Wages   Bonuses   Benefits
         37000   0         0
Bob      Wages   Bonuses   Benefits
         14000   1000      6000
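For what it’s worth, tidyr can convert between the first two (flat) representations. Here is a sketch using pivot_longer() and pivot_wider(), the successors to gather() and spread():

```r
library(tidyr)

# The "wide" representation: one column per income type.
wide <- data.frame(Person          = c("Alice", "Bob"),
                   Income_Wages    = c(37000, 14000),
                   Income_Bonuses  = c(0, 1000),
                   Income_Benefits = c(0, 6000))

# Reshape to the "long" representation: one row per (person, type) pair.
long <- pivot_longer(wide,
                     cols         = starts_with("Income_"),
                     names_to     = "Income_Type",
                     names_prefix = "Income_",
                     values_to    = "Income_Value")

# And back again, restoring the original column layout.
wide_again <- pivot_wider(long,
                          names_from   = "Income_Type",
                          values_from  = "Income_Value",
                          names_prefix = "Income_")
```

The round trip is lossless here, but note that every call names the storage layout explicitly, which is exactly what the Uniform Referent Principle objects to.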

Abstractly, the data is the same in each case. But the tables are implemented in very different ways. If you access their elements with $ or an equivalent, and you then change the implementation, you have to reprogram all those accesses. I’ve written some code which hides implementation details, so that I can access the different representations without having to change the interface, but again, it may not be efficient. It may also not work well with vectorisation, the way R implicitly loops over entire vectors. It would be great to have R experts, even R implementors, who were willing to advise on this, and even to collaborate on our grant applications.
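A minimal base-R sketch of the idea (income_of and both representations are invented for illustration): callers always go through one accessor and never touch $ themselves, so the storage can change behind it without reprogramming every access.

```r
# Representation 1: a "long" data frame.
long <- data.frame(Person = c("Alice", "Alice", "Bob", "Bob"),
                   Type   = c("Wages", "Bonuses", "Wages", "Bonuses"),
                   Value  = c(37000, 0, 14000, 1000))

# Representation 2: a named list of named vectors.
nested <- list(Alice = c(Wages = 37000, Bonuses = 0),
               Bob   = c(Wages = 14000, Bonuses = 1000))

# One interface, dispatching on how the data happens to be stored.
income_of <- function(data, person, type) {
  if (is.data.frame(data)) {
    data$Value[data$Person == person & data$Type == type]
  } else {
    unname(data[[person]][[type]])
  }
}

income_of(long,   "Bob", "Wages")   # 14000
income_of(nested, "Bob", "Wages")   # 14000: same call, different storage
```

In real code S3 method dispatch would be tidier than the if/else, but the point is the same: the caller’s code is identical for every representation.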

The slideshow was written in Hakim El Hattab et al.’s reveal.js. This is a JavaScript system for building web-based slideshows. A demonstration can be seen at revealjs.com.

The contents page was implemented with Frederick Feibel’s Presentable plugin.