Data from the World Health Organization API

[This article was first published on Peter's stats stuff - R, and kindly contributed to R-bloggers.]

Eric Persson released yesterday a new WHO R package which allows easy access to the World Health Organization’s data API. He’s also done a nice vignette introducing its use.

I had a play and found it was easy access to some interesting data. Some time down the track I might do a comparison of this with other sources, the most obvious being the World Bank’s World Development Indicators, to identify relative advantages – there’s a lot of duplication of course. It’s a nice problem to have, too much data that’s too easy to get hold of. I wish we’d had that problem when I studied aid and development last century – I vividly remember re-keying numbers from almanac-like hard copy publications, and pleased we were to have them too!

Here’s a plot showing country-level relationships between the latest data of three indicators – access to contraception, adolescent fertility, and infant mortality – that help track the Millennium Development Goals.

Note that there’s some micro-interactivity here – you can move the labels around, zoom in on a part of the plot, or reposition the whole thing. It’s easier to explore in the full screen version. Thanks to Julien Barnier’s scatterD3 R package for making this JavaScript D3 functionality easily available from R.
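The "latest data" idea behind that plot can be sketched in a few lines of base R. This is only an illustration on made-up numbers (the data frames `fert` and `contra` and the helper `latest_rows()` are hypothetical names, not part of the WHO package), and it uses an inner join where the full listing at the bottom of the post uses left joins followed by a filter.

```r
# Made-up numbers for illustration only; these are not WHO figures.
fert <- data.frame(
  country = c("A", "A", "B", "C"),
  year    = c(2005, 2010, 2008, 2012),
  value   = c(80, 70, 30, 55)
)
contra <- data.frame(
  country = c("A", "B", "B"),
  year    = c(2009, 2004, 2011),
  value   = c(40, 60, 65)
)

# Keep only each country's most recent row.
# ave() returns, for every row, the maximum year within that row's country.
latest_rows <- function(df) {
  df[df$year == ave(df$year, df$country, FUN = max), ]
}

# Inner join on country, so a country missing either indicator drops out.
comb_demo <- merge(
  latest_rows(fert)[, c("country", "value")],
  latest_rows(contra)[, c("country", "value")],
  by = "country",
  suffixes = c("_fert", "_contra")
)
print(comb_demo)
```

Country C has no contraception figure in the toy data, so it does not appear in `comb_demo`; the same thing happens to countries with gaps in the real combined data set.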

Here are the exploratory plots I made on the way to that. All the code is reproduced at the bottom of the post.

Infant mortality has been decreasing steadily since 1990 when these data start, and in fact from decades before too. Sad blip in Haiti from the 2010 earthquake: MDG1

The data on adolescent fertility is surprisingly sparse, with only one data point per country and many of them quite old. While Africa is often in a class of its own on development indicators, this graphic might still surprise people by showing just how big an issue teen pregnancy is in Africa (note that the inclusion of Sudan and Somalia in the “Eastern Mediterranean” region wasn’t done by me): MDG3

Data on contraception use is also sparse; mostly one data point per country, but a few have two. There’s not much that stands out here other than the high usage in the wealthier countries of Europe (the blue ones) and low in the poor countries of Africa (red). MDG5
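A quick way to check how sparse an indicator is would be to count data points per country and look at the year of the most recent one. The sketch below does that in base R on a made-up data frame (`obs` is a hypothetical stand-in for something like the contraception data, not real WHO output).

```r
# Made-up observations for illustration only; these are not WHO figures.
obs <- data.frame(
  country = c("A", "B", "B", "C", "D"),
  year    = c(2006, 2003, 2011, 2009, 1998)
)

# Number of data points per country.
n_per_country <- table(obs$country)

# How many countries have exactly one, two, ... data points.
print(table(n_per_country))

# Most recent year per country, to see how old the latest figure is.
latest_year <- tapply(obs$year, obs$country, max)
print(sort(latest_year))
```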

Here’s the R code that did that. Hopefully other people will pick up on this – good work from the WHO for their database and API, and Eric Persson for making it easy to access from R.

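# basic functionality
library(WHO)          # get_codes(), get_data()
library(dplyr)        # %>%, filter(), mutate(), joins
library(ggplot2)
library(scales)       # percent
library(stringr)      # str_sub(), str_extract()
library(RColorBrewer) # brewer.pal()
library(showtext)     # Google fonts (assumed, see below)
library(scatterD3) # note - this is the GitHub version so we have ellipses

# ---------------default fonts, themes and colour scale-------------
# Assumption: the font registration call was cut from this listing; showtext's
# font_add_google() fits the surviving arguments "Poppins", "myfont".
font_add_google("Poppins", "myfont")
showtext_auto()

mdg_theme <- theme_light(base_family = "myfont") + theme(legend.position = "bottom")
theme_set(mdg_theme) # make it the default for the plots below

scale_colour_discrete <- function(...) {
   scale_colour_manual("", values = brewer.pal(6, "Spectral")[c(1,2,5,6)], ...)
}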
#------------------data prep-------------
# download all WHO codes
codes <- get_codes()
dim(codes) # 2144 codes

mdg_codes <- codes[grepl("^MDG", codes$label), ]
mdg_codes$number <- as.numeric(str_sub(mdg_codes$label, start = -2))

dim(mdg_codes) # 33 for MDGs

#----------------helper functions---------------
prep <- function(data){
   # remove regional groupings, and make income a factor with levels in correct order
   data %>%
      filter(!worldbankincomegroup %in% c("Global", "NA")) %>%
      # Assumption: the next two conditions were truncated in this listing;
      # dropping rows with no country or no region fits the comment above.
      filter(!is.na(country)) %>%
      filter(!is.na(region)) %>%
      mutate(worldbankincomegroup = factor(worldbankincomegroup,
                                           levels = c("Low-income", "Lower-middle-income",
                                                      "Upper-middle-income", "High-income")))
}

latest <- function(data, nudge = 1){
   # return a cut back data frame of just the latest value, useful for geom_text annotations
   data %>%
      group_by(country) %>%
      filter(year == max(year)) %>%
      mutate(year = year + nudge)
}

#-------------------Infant mortality--------------
mdg1 <- get_data("MDG_0000000001")
# values are a funny mixture of points and intervals, so we extract just the points:
mdg1$value_n <- as.numeric(str_extract(mdg1$value, "[0-9]*\\.[0-9]*"))

mdg1a <- mdg1 %>% prep()

p1 <- mdg1a %>%
   ggplot(aes(x = year, y = value_n, colour = worldbankincomegroup, group = country)) +
   geom_line() + 
   facet_wrap(~region) +
   geom_text(data = latest(mdg1a), aes(label = country), hjust = 0, size = 3) +
   xlim(1990, 2025) +
   labs(x = "Year",
        y = mdg_codes[1, "display"],
        title = "'Reduce child mortality' MDG Indicator 14, infant mortality rate")

svg("..", 10, 9)
print(p1)
dev.off()

#--------------Adolescent fertility----------------
mdg3 <- get_data("MDG_0000000003")
# This isn't listed at as an MDG indicator,
# but according to it's part of
# measuring Goal 5 "Improve Maternal Health"

svg("..", 10, 9)
mdg3 %>%
   prep() %>%
   ggplot(aes(x = year, y = value, label = country, colour = worldbankincomegroup)) +
   labs(x = "Year of latest data",
        y = mdg_codes[2, "display"],
        title = "'Improve maternal health' MDG un-numbered indicator, adolescent fertility") +
   geom_text() +
   facet_wrap(~region) +
   scale_colour_manual("", values = brewer.pal(6, "Spectral")[c(1,2,5,6)]) +
   theme(legend.position = "bottom")
dev.off()

#----------contraceptive prevalence----------
mdg5 <- get_data("MDG_0000000005")

# for some odd reason there's no income group so we make it up:
mdg5a <- prep(mdg5) %>%
   select(-worldbankincomegroup) %>%
   left_join(unique(mdg1a[ , c("country", "worldbankincomegroup")]), by = "country")

p5 <- mdg5a %>%
   prep() %>%
   ggplot(aes(x = year, y = value / 100, group = country, colour = worldbankincomegroup)) +
   facet_wrap(~region) +
   scale_y_continuous(mdg_codes[3, "display"], labels = percent) +
   labs(x = "Year of latest data",
        title = "'Combat HIV/AIDS, malaria and other diseases' MDG indicator 19c, contraception prevalence") +
   xlim(1990, 2020) +
   geom_point() +
   geom_line() +
   geom_text(data = latest(mdg5a, 0.3), aes(label = country), hjust = 0, size = 3)
svg("..", 10, 9)
print(p5)
dev.off()

#==================scatter plot==================
# combine the three datasets into just one and knock out countries with missing values
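comb <- latest(mdg3) %>%
   select(country, region, worldbankincomegroup, value) %>%
   rename(AdolFert = value) %>%
   left_join(latest(mdg5)[, c("country", "value")], by = "country") %>%
   rename(Contra = value) %>%
   left_join(latest(mdg1)[ , c("country", "value_n")], by = "country") %>%
   rename(InfantMortality = value_n) %>%
   # Assumption: the filter condition was truncated in this listing; dropping
   # rows with any of the three indicators missing fits the comment above.
   filter(!is.na(AdolFert), !is.na(Contra), !is.na(InfantMortality))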

# draw scatter plot
scatterD3(x = comb$Contra, y = comb$AdolFert, lab = comb$country,
          size_var = comb$InfantMortality,
          col_var=comb$worldbankincomegroup, symbol_var = comb$region,
          xlab = "Access to contraception (%)", 
          ylab = "Adolescent fertility rate (per 1000 girls aged 15-19)", 
          col_lab = "",
          symbol_lab = "",
          size_lab = "Infant Mortality per 1000",
          ellipses = TRUE, ellipses_level = 0.75)
# note - next step to get into a web page requires manually saving it.
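As a possible alternative to saving the plot by hand, the widget can probably be written to an HTML file from R, since scatterD3 is built on the htmlwidgets framework. The sketch below assumes the htmlwidgets package is installed, uses the built-in `mtcars` data rather than the WHO data, and writes to a hypothetical temporary path; it is not how the plot in this post was produced.

```r
library(scatterD3)
library(htmlwidgets)

# Built-in mtcars data stands in for the combined WHO data frame.
widget <- scatterD3(x = mtcars$wt, y = mtcars$mpg, lab = rownames(mtcars))

# Hypothetical output location; replace with a path in your own site.
out_file <- file.path(tempdir(), "scatter-demo.html")

# selfcontained = FALSE writes the JavaScript dependencies to a folder
# next to the HTML file, which avoids needing pandoc.
saveWidget(widget, file = out_file, selfcontained = FALSE)
```

The resulting HTML file and its companion folder can then be copied into a web page or embedded in an iframe.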
