Introducing the Redwall IRS SOI Tax Dashboard

[This article was first published on R on Redwall Analytics, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.


The IRS SOI Tax Statistics is a fabulous seemingly undiscovered data set, which should be frequently referenced in any discussion about income and taxes. We often see median income represented by Census data, but have always been skeptical that most people know their exact income in any given year or would report it according to a uniform definition if asked by a census taker. This data set aggregates elements of all 1040 filings by zip code across six income levels, with Adjusted Gross Income (AGI), sources of income, credits and deductions with both amounts and counts. It seems to reflect a much more robust measure of earnings, although a slightly less granular format than the census tracts. As is typical in almost every open public data we have worked with, formats, variables and groupings change from year to year, so accessing a clean long-term time series was a painful proposition.

We first introduced the IRS SOI Tax Statistics as a thematic data set for Redwall in Analysis of Connecticut Tax Load by Income Bracket. In that post, we discussed the data and methodology used to collect and clean it in. Then, we used it in more detail in IRS Data Shows Growth in Number not Income of Highest Earners since 2005. Finally, IRS Data shows Connecticut Taxpayers Also Pay Higher Federal Taxes compared the federal tax rates payed by CT residents to national averages. All three of these previous posts were before we had the disclosure for 2018, one of the more interesting tax years because of the Trump tax cut and its controversial SALT limitations.

When we worked on those posts, we also did not have the skill we have now have to make a Shiny dashboard, unlocking this valuable data for anyone wishing to explore it. In this post, our we discuss our new {irs.soi} package (shortly to be available on Github, not CRAN), describe the methodology we created to download, clean and prepare the data for exploration in a new Shiny app.


One of our themes in recent months has been re-factoring the messy code and work flows, that we built up while learning over these last few years. Like many, our IRS explorations had become a mess and certainly not accessible to anyone else, so moving all of our functions into a package structure was overdue. Hence, we built the {irs.soi} package. We found the NBER’s Public Use Data Archive cleaned up version easier to use than the daunting collection of .csv’s on the IRS website, so the package downloads specified years from the Individual Income Tax Statistics – ZIP Code Data (SOI) to a specified local folder. Please note that our app data includes 2017-2018, previously included on the NBER site, but now removed. Using {irs.soi}, specifying a path and running download_nber_data() will download the data to a local folder (or put it in a data/ file if none is specified). Please note that the full data set is over 3GB and should take at least 15 minutes to download.

Again using the {irs.soi}, the load_soi() function can be used to load and compile the data into a data.table, and clean_soi() cleans according to our process. One place we had to manually intervene was to aggregate income levels, because the IRS had changed size bands in 2007. We did this in a way which maintained the integrity of the bands (ie: <$10k and <$25k became one band and >$100k and >$250k another), but at the expense of interesting granularity. Over the last 10 years, >$250k has been generally been defined as “rich”, but our highest bands starts well below that. There were also some cases in 2006 where the data had bad classifications, but we were able to impute, and 2007-2008 strangely did not round by 1,000 as the other years did. Our prepare_app_data() filters sparse zip codes, and merges by zip code to add County and City filters, and can be used to ready for it to be run locally in Shiny using our irsApp() function.

Data Summary

To give a quick summary of the raw data and our preparations. There are almost 42,000 unique zip codes in the US, but many of these are P.O. Boxes with few or no people, and others are suppressed from disclosure if they don’t have many taxpayers in a given period. We also further screened in {irs.soi} for only zip codes where all income levels were included for every year of our specified data, which gave us a total of 24,143. The IRS data is only disclosed by zip code and state, but we added County and Post Office City fields by joining with the zip code database in {zipcodeR}. The Tax Foundation reported total US taxpayers of 143.3 million, earnings of $10.9 trillion and taxes of $1.6 trillion in 2017. The IRS removes zip codes below a threshold of filers for privacy reasons, and our package drops all zip codes which are not present in all years of the data set, which cost between $100-150 billion of annual income in most years, but our final data is still close the full amount of returns, income and tax paid as reported by the Tax Foundation.

Conditional Drop Down Filtering

This is not a code-based post, but just a word about the filtering of locations, which is now excellent. The app has a nested geographical structure. We wanted users to be able to find their chosen geography with as little struggle, waiting time and server usage as possible. We tried several ways of doing this efficiently, but it was difficult. In Hadley Wickham’s new book section, Mastering Shiny: 10.1.2 Hierarchical Select Boxes proposed a sequence of nested observeEvent() and updateSelectInput() operations to filter a reactive data.frame inside the server function. We also looked at doing this on the server side with updateSelectizeInput, but with the size of our data or inexperience, it was very slow and usually ran out of memory. For others looking to accomplish a similar task, please see our code using shinyWidgets and its selectizeGroupUI and selectizeGroupServer. This solution may be widely known to Shiny experts, but little has been written about it. The credit for helping us goes fully to David Solito for Filtering a data frame with dependent drop down lists in Shiny thanks to Magrittr (aka conditional drop down list).. This was an excellent summary of a complicated task, which seems to be by far the best solution for this seemingly common task.

Assessing the App

At about 40MB and almost 1.7 million rows, this app is larger, and may have a broader audience than some of our previous ones (Connecticut Property Selling Prices vs Assessment Values over Three Revaluation Cycles and Risk Score History of Selected CT Town, so we are not sure how long our free server time on the will last. We intend to add many more fields and visual analyses, and to expand the functionality with {golem}, so have to explore the options for cost effective hosting since this is a personal project. If the app runs out of server time, there will be the option to use {irs.soi} to download the data and run the app locally (as described above).


When the app launches, it shows the aggregate AGI and tax paid of all included zip codes. It is possible to filter by State, Counties, Cities or all the way down to individual or groups of zip codes. There is a second tab to see a plot of the change in AGI and tax rates over time. Exact amounts or counts can be viewed by hovering with the tooltips. Given the richness of this data set and the incredible functionality of Shiny, we hope to add many more avenues of exploration in the future.

We observe that income of lowest income group viewed on national level has increased average AGI by about 20% since 2005, and there are significantly fewer filers in that group over the period. Average income levels of the three middle bands are so flat in nominal terms over the period that it is almost hard to believe. There were increases in the number of tax payers in all of the middle groups, but not even close to increase in number of taxpayers in the highest income group, which more than doubled over the period. Unlike the other groups, the average income of the highest group declined considerably during the Great Recession, but has more recently recovered to be slightly higher than the previous peak. In real terms, there has been a significant loss of purchasing power on average in all groups over the period.

The effects of the recent tax cuts can be seen in all income bands having lower Federal tax rates in 2018, and even the highest bands of most blue states show lower tax rates despite SALT restrictions. The highest income bands did indeed see a reduction in percentage of total tax paid during 2018, to 79.5% from 81.7%, though in our bands, this includes incomes could be considered middle income in high cost areas.


In an upcoming version, we hope to increase the size of the data by 10x to include types of income, credits and deductions. We would also like to add other variables such as real estate prices, transportation or education data. The next version of the app will have the ability to compare the trajectory of selected 1040 items by selected geographical units, and to view in aggregate or per capita. In order to allow flexibility to grow and change over time, we will use {golem} and AWS to host our app in a Docker container. We already have a working {golem} app ready to go in our local environment using the full data set, but managing the AWS environment and hosting will take more work. Should we built a separate database to host the data outside the app? Should we use Shiny Server on an EC2 machine or can we host Docker containers in ECS? We are excited about the many avenues to add to and learn from this project.

To leave a comment for the author, please follow the link and comment on their blog: R on Redwall Analytics. offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)