Datasets for Building a Data Analysis Portfolio

September 18, 2017
(This article was first published on R – AriLamstein.com, and kindly contributed to R-bloggers)

I recently had the pleasure of attending the 2017 Association of Public Data Users (APDU) Conference.

My favorite part of the conference was talking to people who work with federal data on a daily basis. Overall, I found people to be passionate about their work and eager to share information about it.

I know many of my readers are looking for interesting datasets to use in their portfolios, so I decided to publish a list of some of the most interesting datasets I learned about.

IRS Statistics

One of the most enjoyable conversations I had was with Kevin Pierce, an economist with the IRS. When I first learned where Kevin worked, I wanted to run away. However, we wound up having a fascinating conversation about IRS data.

Kevin works in the IRS’s Statistics of Income (SOI) program. As far as I can tell, SOI data is the highest-quality data on US income that’s available. This is simply because taxpayers are legally required to file accurate returns every year.

I was surprised to learn that the SOI publishes a great deal of this data. As an example, here is their page dedicated to data from Form 1040. They also aggregate this data by State, County and ZIP Code, so it is possible to map the data.

Kevin works on the IRS’s migration reports. Because the IRS knows everyone’s address and income each year, they can analyze migration and the financial impact it has.
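
To make the idea concrete, here is a minimal sketch (in Python) of how net migration and its financial impact might be tallied from origin-to-destination flow records. The record layout, the county names and every number below are assumptions invented for illustration; they are not real IRS figures, and the actual SOI files may be organized differently.

```python
from collections import defaultdict

# Hypothetical flow records: (origin, destination, returns, adjusted gross income).
# Made-up placeholder values for illustration only, NOT real IRS figures.
FLOWS = [
    ("County A", "County B", 120, 6_000_000),
    ("County B", "County A", 80, 5_200_000),
    ("County A", "County C", 50, 2_500_000),
    ("County C", "County B", 30, 1_200_000),
]


def net_migration(flows):
    """Return {county: (net_returns, net_income)}, computed as inflow minus outflow."""
    net = defaultdict(lambda: [0, 0])
    for origin, dest, returns, income in flows:
        # People leaving a county subtract from its totals...
        net[origin][0] -= returns
        net[origin][1] -= income
        # ...and add to the totals of the county they move to.
        net[dest][0] += returns
        net[dest][1] += income
    return {county: tuple(totals) for county, totals in net.items()}


if __name__ == "__main__":
    for county, (returns, income) in sorted(net_migration(FLOWS).items()):
        print(f"{county}: net returns {returns:+d}, net income {income:+,d}")
```

Because every move is counted once as an outflow and once as an inflow, the net figures across all counties sum to zero, which is a handy sanity check before mapping the results.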

Any portfolio that focuses on income data from the IRS is sure to get a lot of attention!

Vital Statistics

Charles Rothwell, the Director of the National Center for Health Statistics, appeared on a panel titled “Federal Statistical Agency Leadership”. Charles is a gifted public speaker and I really enjoyed his presentation.

Charles works with “Vital Statistics”, which involves counting births and deaths. Normally I would shy away from a dataset like this. But as Charles pointed out, this data is necessary if you want to understand the opioid epidemic that the US is currently facing.

A portfolio that focused on using this dataset to explore the opioid epidemic would be fascinating to read.

Labor Statistics

Michael Dalton, a research economist at the Bureau of Labor Statistics (BLS), spoke on a panel about the Role of Commercial Firms in Public Data. I found his case studies to be very interesting, and after his talk we chatted for a bit. I asked him which BLS statistics he thought would be good for a data analysis student who is interested in employment data. His recommendations were the Current Employment Statistics (CES) and Occupational Employment Statistics (OES).

These statistics will tell you the types of jobs that people in the US have, as well as the amount that people in those occupations earn. If I had the time, I’d love to analyze the growth in the number of tech workers in the Bay Area over time.
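
As a rough sketch of what such a growth analysis could look like, the Python snippet below computes year-over-year percent change from a series of annual employment counts. The function name and the counts are hypothetical placeholders, not actual BLS figures; a real analysis would start from the published CES or OES tables.

```python
def growth_rates(counts_by_year):
    """Return {year: percent change from the previous year present in the series}."""
    years = sorted(counts_by_year)
    rates = {}
    for prev, curr in zip(years, years[1:]):
        change = counts_by_year[curr] - counts_by_year[prev]
        rates[curr] = change * 100 / counts_by_year[prev]
    return rates


# Hypothetical employment counts for one occupation in one metro area.
# Placeholder values only, NOT actual BLS figures.
EXAMPLE_COUNTS = {2013: 100_000, 2014: 110_000, 2015: 121_000}

if __name__ == "__main__":
    for year, rate in growth_rates(EXAMPLE_COUNTS).items():
        print(f"{year}: {rate:.1f}% change from the previous year")
```

The first year in the series has no rate because there is nothing earlier to compare it to.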

Of course, BLS also releases statistics on unemployment. (Note that I have already packaged up some of that data in the rUnemploymentData package (1, 2).)

Energy Data

I also had the pleasure of meeting Chip Berry of the Energy Information Administration (EIA). Chip manages the Residential Energy Consumption Survey (RECS). I was not previously aware of the EIA, and it turns out that they have a ton of interesting data. For example, they have real-time information about energy supply and demand nationwide. They also know the location of each and every energy production facility in the US.

As I write this, much of Florida is still without power due to Hurricane Irma. If you were interested in researching this (or any other energy-related topic), this data would be a great place to start.

Closing Thoughts

In my experience, the more specialized a portfolio is, the easier it is for the portfolio to get traction. Each of the datasets I link to above could easily form the cornerstone of a successful data-related portfolio.

The post Datasets for Building a Data Analysis Portfolio appeared first on AriLamstein.com.
