[This article was first published on data science ish, and kindly contributed to R-bloggers.]
I saw this analysis at Flowing Data about the most common consumer products involved in hospital ER visits and was delighted, interested, etc. Nathan’s next related post is, um, also super interesting, if entirely horrifying. Apparently, I am not the only one who thought this data set was compelling, because this week Hadley Wickham took the NEISS data set that these beautiful analyses are based on and made an R package for it.
RIGHT HAND PAIN AND SWELLING AFTER PUNCHING A WOODEN DOOR
Since the data set is wrapped up nicely in an R package, getting it is very easy. This is a pretty big data set, though; it includes the entire NEISS sample of injuries from 2009 to 2014. I did this blog post on an older laptop and it took my aging computer a bit to chug along and do some things. First, let’s download the data.
Now, let’s open the main data set and look at the column names.
Each row is a case, i.e. injury. The consumer product(s) implicated in the injury are in prod1 and prod2 as numbers, which can be looked up in another data set, products.
What, for example, is the product associated with code 235?
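A lookup like this can be sketched in base R as follows. This is only an illustration on a made-up table: the column names `code` and `title` are assumptions about the `products` data set, and the titles are placeholders, not real NEISS product names.

```r
# Toy stand-in for the products lookup table.
# Column names (code, title) are assumed; the rows are invented.
products <- data.frame(
  code  = c(101L, 235L, 478L),
  title = c("hypothetical product A",
            "hypothetical product B",
            "hypothetical product C"),
  stringsAsFactors = FALSE
)

# Return the title for a product code, or NA if the code is not listed.
lookup_product <- function(code, lookup = products) {
  lookup$title[match(code, lookup$code)]
}

lookup_product(235L)
```

With the real package loaded, the same idea would be applied to its `products` table instead of the toy one.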
Some of the other observations made for each injury include the age, sex, and race of the injured person, the diagnosis and body part injured, where the injury took place (at home, school, etc.), whether the fire department was involved, and a narrative describing what happened. The narratives are just, WOW.
Just For Starters
So just as a very first glimpse, what consumer products cause the most injuries in this data set?
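One way to rank products is to add up the case weights for each product code and sort. The sketch below does that on invented rows; the product codes and weights are toy values, and summing `weight` (rather than counting rows) is an assumption about how the ranking is done.

```r
# Toy injuries table: one row per case, with a product code and a weight.
# All values are invented for illustration.
injuries <- data.frame(
  prod1  = c(1L, 2L, 1L, 3L, 2L, 1L),
  weight = c(10, 50, 10, 5, 50, 20)
)

# Sum the weights per product code and keep the n largest totals.
top_products <- function(df, n = 10) {
  totals <- aggregate(weight ~ prod1, data = df, FUN = sum)
  totals <- totals[order(-totals$weight), ]
  head(totals, n)
}

top_products(injuries, n = 2)
```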
If you live in a house without stairs, it looks like your floor is the most dangerous thing in your house. (Is it weird that floors and stairs are “consumer products” in this data set?) These results, as we would expect, agree with the more detailed plots at Flowing Data.
Don’t Leave Your House! But Don’t Stay Home!
Let’s look at where these injuries occur, first for the dataset as a whole.
Is this different for males and females? (I have left out the injuries where no sex is listed.)
The numbers are so low that you can’t see them on the graph, but injuries at farms, industrial places, and mobile homes are all higher for males than females.
Does this change for people of different ages? This is going to get a little more complicated, because there are fewer 70-year-olds in America than 20-year-olds, but more women than men at those older ages. Who is more likely to be injured? And where?
Counting Injuries, Counting People
Let’s start by looking at the number of injuries by age and by sex.
I have parented three children through their toddler years and oh, this graph makes me cringe. We only ended up at the ER once and that was an incident involving a collision with a child’s head and:
None of my kids have yet entered their teen years and now that second peak can cause me to be filled with DATA-DRIVEN DREAD. Anyway, you can see there are more males visiting the ER for an injury from a consumer product until about age 50, but women live longer than men, so there are more women alive at those wise, advanced years. What we want to do is divide by how many people there are at each age and sex to find the per capita rate. Hadley included a data set that contains just that information in the neiss package. Let’s load it.
I am just looking at the whole NEISS dataset in aggregate, not dividing up by individual years, and this also could complicate matters. During the years 2009 to 2014, individuals obviously age, but does the age and sex distribution of the population of the United States change enough to make a big difference here?
Those are close enough for my purposes that I am going to take a median and use that as the population distribution for the aggregated injury dataset.
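Taking a median across years for each age/sex bin could look something like this in base R. The column names (`year`, `age`, `sex`, `n`) are assumptions about the population data set, and the numbers are toy values.

```r
# Toy population table: one row per year, age, and sex.
# Column names are assumed; the counts are invented.
population <- data.frame(
  year = rep(2009:2011, each = 4),
  age  = rep(c(20L, 20L, 70L, 70L), times = 3),
  sex  = rep(c("female", "male"), times = 6),
  n    = c(100, 110, 60, 50,
           102, 112, 61, 51,
           104, 111, 63, 49)
)

# Collapse the years: one median population value per age/sex bin.
pop_median <- aggregate(n ~ age + sex, data = population, FUN = median)
pop_median
```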
Now let’s combine the population sex/age distribution with the total injuries by sex and age.
What do we have at this point?
Now let’s divide the injuries by the population in each age/sex bin to get a rate. Let’s multiply by 100,000 to get a number that is per 100,000 population. Then let’s melt for some plotting.
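As a rough sketch of the arithmetic, here is the join-and-divide step in base R on invented numbers. The data frame and column names (`total`, `population`, `rate`) are hypothetical, chosen just for this example.

```r
# Toy totals of injuries per age/sex bin (invented values).
injuries_by_bin <- data.frame(
  age   = c(20L, 20L, 70L, 70L),
  sex   = c("female", "male", "female", "male"),
  total = c(500, 800, 900, 600)
)

# Toy population per age/sex bin (invented values).
pop_by_bin <- data.frame(
  age        = c(20L, 20L, 70L, 70L),
  sex        = c("female", "male", "female", "male"),
  population = c(100000, 100000, 50000, 40000)
)

# Join on age and sex, then convert to a rate per 100,000 population.
combined <- merge(injuries_by_bin, pop_by_bin, by = c("age", "sex"))
combined$rate <- combined$total / combined$population * 1e5
combined
```

In this toy example the 70-year-old women have both more injuries and a higher rate than the 70-year-old men. Dividing by the population in each bin is what separates those two things.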
We can see the rate of injuries per capita increasing at the highest ages, and the rate is still higher for women than for men. I’ll be honest; this is not what I was expecting to see. Ahead of time, I thought that the crossing point, where injuries in women start to outnumber injuries in men, was probably due to there being more older women than men. Let’s look in detail at injuries caused by a few types of consumer products.
TOILET COVER FELL ON TOE
Hadley shared some plots on Twitter while he was working on this package.
This toilet-related one, naturally, caught everyone’s eye. Let’s reproduce that plot and then divide by the population at each age bin to see how the distribution changes.
The reason my plot doesn’t extend to ages as high as Hadley’s is that the population data doesn’t extend to ages as high as the NEISS data. Also, the y-axes are different because Hadley must have been just using the number of cases (i.e. rows) in the data set at that point, but it actually contains a weight (weight) for each case that can be used to get a national estimate. (The NEISS is a sample of many hospitals, but not every single hospital in the United States; the weights are assigned so that we can use these data to get a national estimate.) More substantively, dividing by the population in each age/sex bin shows that the toilet-related injury rate increases significantly with age. The difference between older women and men is not due to there being more women in the older population; older women are more likely to suffer toilet-related injuries than older men. SAD. And also somewhat sensible, I suppose, if I stop to think about it. Everyone do some squats.
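The difference between counting rows and summing weights is easy to see on a tiny made-up example. The ages and weights below are invented; only the `weight` column name comes from the text above.

```r
# Toy sample of cases; each row stands in for `weight` injuries nationally.
cases <- data.frame(
  age    = c(75L, 75L, 80L),
  weight = c(15.5, 80.25, 15.5)
)

# Counting rows gives the number of sampled cases.
n_cases <- nrow(cases)

# Summing weights gives the national estimate those cases represent.
national_estimate <- sum(cases$weight)

# The same comparison, broken out by age.
by_age <- aggregate(weight ~ age, data = cases, FUN = sum)

n_cases
national_estimate
by_age
```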
TRIED TO SKI JUMP, LEGS WENT APART IN AIR
Let’s do one more. I live in a ski-centric city, so let’s find all the skiing-related injuries.
The distribution of these injuries is quite different from the total injuries or the toilet-related injuries. There is not much difference between the number of injuries and the per capita rate of injuries; the people getting injured skiing (probably, just the people skiing) have ages in the range where the distribution of ages is fairly flat so the two plots look mostly the same. There are not many babies and toddlers out there skiing and getting hurt, and the adults seem to be practicing safe skiing behavior. The big peak is just past age 10. That is… exactly the age of my oldest who now likes to ski black runs.
Now that we have a handle on this business with the population at different sex/age bins, let’s go back to the distribution of injury location. Let’s look at the top four locations that are not “unknown”.
And now let’s plot the rates per 100,000 population in each sex/age bin for these locations.
Teenagers are getting injured at school and at sports/recreation locations, while babies, toddlers, and the elderly are getting injured at home and in public. Here again we see that elderly women are injured at a higher rate than elderly men.
I copy-and-pasted code a little bit in this post (FOR SHAME); I maybe should have defined some functions. There is SO MUCH MORE that can be explored with this data set. It is enormous. I didn’t touch most of the consumer products, or any of the date information, or any of the race information, or any of the information on type of injury or body part injured… You get the idea; go have a ball. Just be forewarned it might make you want to wrap everyone you love in bubblewrap. The R Markdown file used to make this blog post is available here. I am very happy to hear feedback or questions!