I have been working with a data set on causes of death in my adopted home state of Utah for a little while now, and I had been struggling with the best way to visualize it. This week, David Robinson released the gganimate package to create animated ggplot2 plots and I thought, “AH HA! This is what I have been needing.” The data on causes of death in Utah is available via Utah’s Open Data Catalog and can be accessed through the Socrata Open Data API.
I have been having a lot of fun exploring Utah’s Open Data Catalog but I’ve got to admit that this particular data set is a bit of a mess compared to the other ones I have used. Let’s make this more amenable to analysis. To start with, what are we dealing with?
There are some rows that contain sums of the other rows and are not actual observations of numbers of deaths in years, so let’s get rid of those. After that, let’s remake the cause of death factor because it had entries that were links to a website and other not-so-useful information.
The data set includes 46 different causes of death.
The population column contains commas (!) and is a factor so let’s get this fixed and transform it to numeric values.
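Here is a minimal sketch of that conversion. The values are made up for illustration and are not from the Utah data. The factor has to go through `as.character()` first, because calling `as.numeric()` directly on a factor returns the internal level codes instead of the numbers.

```r
# Made-up values standing in for the real population column
population <- factor(c("1,000,000", "2,345,678", NA, "987,654"))

# Strip the commas from the character representation, then convert to numeric
population_num <- as.numeric(gsub(",", "", as.character(population)))

print(population_num)
```

The `NA` entries pass through untouched, so they can be handled in a later step.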
There are a handful of NA values for some inexplicable reason, but we will deal with that in a little bit. In the columns that record the age adjusted mortality rate, 95% confidence interval, and standard error, there are double asterisks (!!!) instead of zeroes or NA values for years and causes of death where no one died from that cause in that year.
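One way to handle the asterisks, sketched here on a made-up character vector (the name `rate_raw` is hypothetical), is to turn them into explicit `NA` values before converting the column to numeric.

```r
# Made-up values standing in for an age adjusted mortality rate column
rate_raw <- c("192.3", "**", "0.8", "**")

# Mark the double asterisks as missing, then convert the rest to numbers
rate_raw[rate_raw == "**"] <- NA
rate <- as.numeric(rate_raw)

print(rate)
```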
There are some rows in this data set that do have a zero recorded (i.e. zero people died of a certain cause in a certain year), but then there are a whole bunch missing. This is going to make analysis and plotting difficult, so let’s complete this data frame. I just read a great explanation of how tidyr uses complete to fill in missing rows and turn implicit missing values into explicit missing values. In our case here, these aren’t “missing” values so much as zeroes; we’ll get to that a bit later.
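To show what `complete()` does, here is a small sketch on a toy data frame (column names and numbers are assumptions for illustration). The combination of 2001 and "Cancer" has no row at all, and `complete()` adds it with an `NA` for the number of deaths.

```r
library(tidyr)

# Toy data: there is no row for Cancer in 2001
toy <- data.frame(
  year = c(2000, 2000, 2001),
  cause = c("Heart disease", "Cancer", "Heart disease"),
  deaths = c(10, 8, 12),
  stringsAsFactors = FALSE
)

# Expand to every combination of year and cause; new rows get NA for deaths
completed <- complete(toy, year, cause)

print(completed)
```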
This data set had the total number of deaths and total age adjusted mortality rate on separate rows for each year, but it will be helpful to have these as columns for each observation. Let’s make a data frame of just the total numbers for each year and then join this data frame to the original one. This will also take care of those NA values in the population column.
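A rough sketch of this step with dplyr, assuming the totals live on rows labeled "Total" (the label, the column names, and the numbers are all made up for illustration):

```r
library(dplyr)

# Toy data: one "Total" row per year, plus rows for individual causes
raw <- data.frame(
  year = c(2000, 2000, 2000, 2001, 2001, 2001),
  cause = c("Total", "Heart disease", "Cancer", "Total", "Heart disease", "Cancer"),
  deaths = c(18, 10, 8, 21, 12, 9),
  population = c(1000, NA, 1000, 1100, 1100, NA),
  stringsAsFactors = FALSE
)

# Pull the yearly totals out into their own data frame
totals <- raw %>%
  filter(cause == "Total") %>%
  select(year, total_deaths = deaths, total_population = population)

# Join the totals back on as columns; population now comes from the totals rows
by_cause <- raw %>%
  filter(cause != "Total") %>%
  select(-population) %>%
  left_join(totals, by = "year")

print(by_cause)
```

Because the population is taken from the yearly totals, the scattered `NA` values in the per-cause rows no longer matter.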
Now let’s replace NA values with zeroes for the number of deaths and age adjusted mortality rate.
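With tidyr this can be done in one call to `replace_na()`. The sketch below uses a toy data frame with hypothetical column names `deaths` and `aamr`.

```r
library(tidyr)

# Toy data with explicit missing values, as left behind by complete()
toy <- data.frame(
  year = c(2000, 2001),
  cause = c("Cancer", "Cancer"),
  deaths = c(3, NA),
  aamr = c(0.2, NA),
  stringsAsFactors = FALSE
)

# Replace NA with zero in the named columns only
filled <- replace_na(toy, list(deaths = 0, aamr = 0))

print(filled)
```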
Are we done? I think we’re done. Let’s look at our cleaned, tidy data.
What are the most important causes of death in Utah? Let’s find the top 10 causes of death for the 15 years in this data set.
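One way to find the top causes is to sum the deaths for each cause over all years and sort. This is a sketch on toy data with made-up causes "A", "B", and "C", keeping the top 2 instead of the top 10.

```r
library(dplyr)

# Toy data: three causes over two years
toy <- data.frame(
  year = rep(c(2000, 2001), each = 3),
  cause = rep(c("A", "B", "C"), times = 2),
  deaths = c(10, 8, 1, 12, 9, 2),
  stringsAsFactors = FALSE
)

# Total deaths per cause across all years, largest first, keeping the top 2
top_causes <- toy %>%
  group_by(cause) %>%
  summarise(total_deaths = sum(deaths)) %>%
  arrange(desc(total_deaths)) %>%
  head(2)

print(top_causes)
```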
Heart disease and cancer are far and away the most important causes of death in Utah. Let’s take these top 10 causes of death and make a new data frame for some plotting, although this does mean we won’t get to talk about “Arthropod-borne viral encephalitis” and how people in Utah have died from that!
I made a shorter version of the cause of death name for plotting purposes.
Let’s take a look at Utah’s #1 killer, heart disease. First let’s plot the raw number of how many people have died each year.
Oh no! This is very bad, right? Heart disease deaths are going up UP UP. But of course, Utah’s population has been growing steadily during these years as well, so perhaps this is not a particularly meaningful graph. Let’s look at the per capita number of heart disease deaths. These things are typically measured per 100,000 population.
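The per capita calculation is just deaths divided by population, scaled to 100,000. Here is a small sketch with made-up numbers where the raw count goes up while the rate goes down.

```r
# Made-up numbers: deaths rise, but the population rises faster
heart <- data.frame(
  year = c(2000, 2001),
  deaths = c(50, 55),
  population = c(100000, 125000)
)

# Deaths per 100,000 population
heart$deaths_per_100k <- heart$deaths / heart$population * 1e5

print(heart)
```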
Very different, right? But actually, not only has the population in Utah been growing, its demographics have also been changing significantly. Utah is very young in population compared to the United States as a whole, but it is less young than it once was. The birth rate in Utah is dropping, so the population 10 years ago was younger than the population today. What we really want to look at is the age adjusted mortality rate.
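The data set already includes the age adjusted rate, and I am not showing how the state computed it. To get a feel for the idea, here is a sketch of direct standardization, a common way to age adjust: compute the death rate within each age group, then weight those rates by a fixed standard population. The two age groups, the weights, and all the counts are made up for illustration.

```r
# Made-up standard population shares for two age groups
std_weights <- c(young = 0.85, old = 0.15)

# Crude rate: all deaths over the whole population, per 100,000
crude_rate <- function(deaths, population) {
  sum(deaths) / sum(population) * 1e5
}

# Directly standardized rate: age-specific rates weighted by the standard shares
age_adjusted_rate <- function(deaths, population, weights) {
  sum(deaths / population * weights) * 1e5
}

# Same age-specific death rates in both years, but the later population is older
deaths_a <- c(young = 9, old = 10)
pop_a <- c(young = 900, old = 100)
deaths_b <- c(young = 8, old = 20)
pop_b <- c(young = 800, old = 200)

crude_a <- crude_rate(deaths_a, pop_a)
crude_b <- crude_rate(deaths_b, pop_b)
adjusted_a <- age_adjusted_rate(deaths_a, pop_a, std_weights)
adjusted_b <- age_adjusted_rate(deaths_b, pop_b, std_weights)

print(c(crude_a = crude_a, crude_b = crude_b))
print(c(adjusted_a = adjusted_a, adjusted_b = adjusted_b))
```

In this toy example the crude rate rises only because the population got older, while the age adjusted rate stays the same.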
By this measure, we can see that heart disease outcomes have improved in Utah during these years.
Let’s Animate Something
The gganimate package works by using some variable in one’s data as the frame with which to animate a plot. Let’s start with looking at how the causes of death change over the years in the data set and animate over the causes of death. This is so nice because the plot was way too crowded when I tried to plot them all together.
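As a rough sketch of animating over the causes of death, here is how it might look with the `transition_states()` interface in current versions of gganimate, which may differ from the interface the package had when this post was written. The toy data and the column name `aamr` are made up for illustration.

```r
library(ggplot2)
library(gganimate)

# Toy data: age adjusted mortality rate for two causes over three years
toy <- data.frame(
  year = rep(2000:2002, times = 2),
  cause = rep(c("Heart disease", "Cancer"), each = 3),
  aamr = c(200, 190, 180, 150, 148, 145),
  stringsAsFactors = FALSE
)

# One state per cause; the title shows which cause is currently on screen
p <- ggplot(toy, aes(x = year, y = aamr)) +
  geom_line() +
  transition_states(cause, transition_length = 1, state_length = 2) +
  labs(title = "{closest_state}", x = NULL, y = "Age adjusted mortality rate")

# Rendering needs a renderer such as the gifski package, so it is left commented out
# animate(p)
```

To animate over the years instead, one option would be to put the causes on an axis and swap in `transition_time(year)`.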
Now let’s look at the causes of death in each year and animate over the years in the data set.
This is perhaps a bit heavy and grim for the weekend, but you know, one of these 46 causes of death (or something very similar) will be written down on a death certificate for all of us one day. Carpe diem, and may you enjoy many more animated GIFs in your life.