Data Carpentry builds communities to teach data literacy, primarily to master's and doctoral researchers, but the skills taught are applicable in a wide variety of areas. The social science curriculum uses the SAFI (Studying African Farmer-Led Irrigation) data set to teach participants how to collect clean data using a spreadsheet, clean messy data using OpenRefine, and examine the data using RStudio. The workshop can be taught entirely using open source software, and the same workflow can be used for other data sets. The World Bank Boost Open Budget portal provides access to budgetary information from 35 regions, which are listed here. One of these is Kenya, for which budget information between 2006 and 2017 is provided here under a Creative Commons CC-BY 4.0 license.
The initial analysis of the data seeks to answer the following questions:
- What portion of budgeted expenditure is at the subnational level relative to the national level?
- What portion of the budget is used for recurrent expenditure compared to development expenditure?
These questions are of interest because China’s rapid development over the last 40 years has been attributed both to competition among its regions (see, for example, Hofman, “Reflections on 40 years of China’s reforms”) and to investment in public infrastructure.
To analyze this data, one downloads the Excel file, which is composed of three sheets:
- data from 2006-2012 (730896 records)
- data from 2013-2017 (730927 records)
- an informational sheet
The data is relatively large for a laptop with 4 GB of RAM, so to process it in RStudio, it is best to first export each of the sheets from LibreOffice to tab-separated value files. The files obtained are still relatively large, so one can use the cut command available within the Linux shell to extract only the columns of interest. The data can then be read into RStudio.
Examining the data allows one to ensure column entries are consistent, in particular the names of data categories. One can also see that the data is not complete: data from the period 2007-2010 has actual expenditure information in addition to allocated expenditure information, whereas the data in the period 2013-2018 has mostly allocated expenditure information and very little actual expenditure information. Kenya adopted a new constitution in 2010, under which counties and county governments with elected officials replaced an appointed provincial administration from 2013 onwards; this change is also reflected in the budget allocations.
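The column extraction and the consistency check described above can be sketched in a few lines. This is only an illustration in Python using the standard library, not the workflow used here, which relies on cut and RStudio. The column names (year, admin_level, approved) and the sample rows are hypothetical and are not taken from the Kenya Boost file.

```python
import csv
import io
from collections import Counter


def select_columns(src, dst, wanted):
    """Stream a tab-separated file and keep only the named columns.

    Rows are handled one at a time, so memory use stays small even
    for files with hundreds of thousands of records.
    """
    reader = csv.DictReader(src, delimiter="\t")
    writer = csv.DictWriter(dst, fieldnames=wanted, delimiter="\t",
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    for row in reader:
        writer.writerow(row)


def category_counts(src, column):
    """Count the distinct values in one column, ignoring surrounding spaces."""
    reader = csv.DictReader(src, delimiter="\t")
    return Counter(row[column].strip() for row in reader)


# Hypothetical sample standing in for an exported sheet (not real budget data).
sample = (
    "year\tadmin_level\tclass\tapproved\tnotes\n"
    "2006\tCentral\tRecurrent\t100\tx\n"
    "2006\tcentral \tDevelopment\t40\ty\n"
    "2007\tRegional\tRecurrent\t10\tz\n"
)

out = io.StringIO()
select_columns(io.StringIO(sample), out, ["year", "admin_level", "approved"])
trimmed = out.getvalue()

# Differently spelled names for the same category show up as separate keys.
counts = category_counts(io.StringIO(trimmed), "admin_level")
print(counts)
```

In this made-up sample, "Central" and "central" are counted separately, which is the kind of inconsistency that needs to be resolved before the data is grouped.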
Once the data is cleaned, it can be plotted. To answer the questions, the data is first grouped by year and category. The package ggplot2 is used to visualize the data by creating stacked bar charts showing absolute and relative expenditures. The R scripts that create the plots are KenyaBudgetAnalysis06to13.R and KenyaBudgetAnalysis13to18.R. The pre-processed data is also available. Since some data is missing from these datasets, the resulting figures contain some inaccuracies, but the missing data is not expected to significantly change the conclusions of the analysis. The scripts generate the figures below, which compare expenditure at national and subnational levels.
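The grouping step behind such stacked bar charts can be sketched as follows. This is a minimal Python illustration of summing amounts by year and category and then converting them to shares; it is not a translation of the R scripts named above, and the amounts are invented for the example rather than taken from the budget data.

```python
from collections import defaultdict


def totals_by_year_and_category(rows):
    """Sum amounts for each (year, category) pair."""
    totals = defaultdict(lambda: defaultdict(float))
    for year, category, amount in rows:
        totals[year][category] += amount
    return {year: dict(cats) for year, cats in totals.items()}


def shares(totals):
    """Convert absolute totals into each category's share of its year."""
    result = {}
    for year, cats in totals.items():
        year_total = sum(cats.values())
        result[year] = {
            cat: (value / year_total if year_total else 0.0)
            for cat, value in cats.items()
        }
    return result


# Invented amounts for illustration only (not real budget figures).
rows = [
    (2006, "Recurrent", 60.0),
    (2006, "Recurrent", 15.0),
    (2006, "Development", 25.0),
    (2007, "Recurrent", 90.0),
    (2007, "Development", 30.0),
]

totals = totals_by_year_and_category(rows)    # heights of the absolute bars
relative = shares(totals)                     # heights of the relative bars
print(totals)
print(relative)
```

The absolute totals correspond to a stacked bar chart of expenditure per year, and the shares correspond to the same chart normalised so that each year's bar sums to one.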
Part of the first period, 2006-2010, has both allocated and actual expenditure, whereas the second period, 2013-2017, has only allocated expenditure, so less information can be shown for the second period. The second period does, however, have expected revenue information. The figures show that expected subnational expenditure is a small portion of the total national budgetary expenditure, but that institutional reforms have increased the allocated proportion in the second period. They also show that budgetary expenditure increased substantially between 2006 and 2017. The current figures are not inflation adjusted, though the Kenyan inflation rate is such that adjusting for it would not affect this conclusion. Values for executed regional expenditure under the Constituency Development Fund between 2006 and 2010 seem to be missing.
The figures below compare recurrent expenditure and development expenditure.
Recurrent expenditure is typically composed of operating expenses such as wages, interest and rent. Development expenditure is typically composed of public investments that are not expected to be paid for repeatedly, such as roads and other public infrastructure. Salaries for workers in the public education sector are recurrent, yet they are also a public investment that can result in development. A more detailed analysis of the exact expenses and their effectiveness is therefore needed to separate expenses that lead to development from administrative overhead, some of which may be necessary for coordinating service delivery. Nevertheless, the figures above show that a large portion of the budget goes to recurrent rather than development expenditure, and that the proportion of development expenditure fluctuates around 25%.
To summarise, the skills taught in Data Carpentry can be used to help understand good ways to make data available and to analyse the resulting data. Data Carpentry social science materials are openly available and may be used both in workshops and for self study. An economics curriculum for Data Carpentry is also being developed. Such materials may allow people in national economic institutions and the interested public to obtain the skills required to analyse government spending. Computationally literate people can use these materials to educate others by holding workshops in collaboration with, among others, public libraries and local and national budget offices.
This article is published under a CC BY-SA 4.0 license.
BKM thanks Selorm Tamakaloe for helpful discussions.