Identifying regional differences in Perinatal Mental Health indicators in the UK with R
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Identifying regional differences of Perinatal Mental Health indicators in the UK with R
The fingertipsR package provides an easy interface to access the fingertips API. This repository contains a large variety of public health indicators managed by Public Health England.
I will be focusing on the data related to Perinatal Mental Health as our laboratory is interested in (among other things) the epigenetic embedding of early adversity. Perinatal mental health problems are those which occur during pregnancy or in the first year following the birth of a child. Perinatal mental illness affects up to 20% of women. Perinatal mental health problems can also have long-standing effects on the children’s emotional, social and cognitive development. If left untreated, it can have significant and long lasting effects on the woman and her family.
I thought it would be interesting to use Principle Component Analysis and Partitioning around medioids on monitoring indicators to identify regional variations that could provide improvements for control in those areas (like has been done with TB).
# load the needs package to load a bunch of libraries quickly (or install them if you don't already have them) library(needs) needs(purrr, dplyr, tibble, fingertipsR, tidyr, magrittr, FactoMineR, impute, DT, broom, stringr, skimr, cluster, ggplot2, ggfortify, viridis, hrbrthemes, ggthemes, cluster, maptools, gpclib )
Let’s find out what is within the fingertips API
The profiles() function is used to search profiles — indicators related to a specific disease or risk factor.
profs <- profiles() DT::datatable(profs, rownames= FALSE)
As mentioned before I’m interested in Perinatal Mental Health so I’ll select those profiles.
sel_profs <- profs[grepl("Perinatal Mental Health", profs$ProfileName),] DT::datatable(sel_profs, rownames = FALSE)
There are four Domains related to Perinatal Mental Health. Within these domains are a number of indicators presented at different time periods, geographies, sexes, age ranges etc.
peri_inds <- indicators(ProfileID = sel_profs$ProfileID) DT::datatable(peri_inds, rownames= FALSE)
This gives us 49 Perinatal mental health indicators, which we can now extract using the fingertips_data() function combined with a call to purrr::map().
peri_df <- peri_inds$IndicatorID %>% map(~fingertips_data(IndicatorID = .))
This results in 49 tibbles; however, many are empty and need to be removed.
I will use dimension reduction (PCA) and clustering (PAM) on these 49 indicators to generate clusters of counties with similar characteristics of Perinatal Mental Health indicators. This may help to identify regional variation and possibly a future framework for improvements in research and control efforts.
I will choose Depression: % recorded prevalence (aged 18+) from the Risk & related factors domain as the key indicator for which all others will be matched. The following code extracts this at the county-level, re-codes the value variable as recent incidence rates, and pulls the overall incidence of cases. I filtered out missing data and adjusted the time period to represent the final year for each rolling average.
peri_inc <- as_tibble(peri_df[[10]]) %>% dplyr::filter(AreaType %in% "County & UA") %>% dplyr::select(AreaName, Timeperiod, rec_inc_rate = Value) %>% mutate(Timeperiod = Timeperiod %>% str_split("/") %>% map_chr(first) %>% as.numeric() %>% {. + 2} %>% as.character()) peri_df_extraction <- function(peri_df, var_name, area_type = "County & UA") { df <- peri_df %>% filter(AreaType %in% area_type) %>% dplyr::select(AreaName, Value, Timeperiod) %>% rename_at(.vars = vars(Value), funs(paste0(var_name))) return(df) } var_names <- c("fertility_rate", "booking_under_20", "reproductive_age", "non_UK", "birth_under_20", "booking_over_40", "booking_BAME", "poverty", "IDACI", "homelessness", "contraceptions_under_18", "single_parent", "sever_mental_illness", "drug_treatment", "alcohol_treatment", "child_protection", "child_in_need", "infant_mortality", "child_looked_after", "sole_birth", "complex_social_factors", "multiparity", "caesarean", "preterm", "stillbirth", "abuse", "skin_contact", "distress_lb", "distress_ub", "mild_depressive_lb", "mild_depressive_ub", "chronic_SMI", "postpartum_psychosis", "PTSD", "severe_depressive", "booking_detection", "booking_substance", "booking_support", "booking_alcohol", "booking_complex_social_factors", "booking_early", "less_14_days", "review_8_weeks", "review_12_months", "antenatal_MH_detection", "postnatal_MH_detection", "postnatal_emotional_change", "postnatal_MH_support" ) extracted_tb <- map2(peri_df[-10], var_names, ~peri_df_extraction(.x, .y)) %>% reduce(full_join, by = c("AreaName", "Timeperiod")) com_peri_df <- peri_inc %>% left_join(extracted_tb, by = c("AreaName", "Timeperiod")) %>% mutate(year = Timeperiod %>% as.numeric) %>% mutate_if(is.character, as.factor) %>% dplyr::select(-Timeperiod) skimr::skim(com_peri_df)
We need to check completeness of the data:
get_frac_missing <- function(df) { df %>% nest() %>% mutate(missing = map(data,~map_dfr(. ,~sum(is.na(.))/length(.)))) %>% dplyr::select(-data) %>% unnest(missing) } ## Get the proportion missing per variableby year peri_miss_per_year <- com_peri_df %>% group_by(year) %>% get_frac_missing %>% mutate_all(~round(., 2)) %>% arrange(year) ## Drop full missing variables peri_partial_miss_year <- peri_miss_per_year %>% select_if(~!sum(.) == length(.)) ## Full missing variables com_miss_vars <- setdiff(names(peri_miss_per_year), names(peri_partial_miss_year)) ## Which year has the most complete data peri_complete_years_all_vars <- com_peri_df %>% group_by(year) %>% nest() %>% mutate(missing = map(data,~mean(colSums(is.na(.))/nrow(.)))) %>% dplyr::select(-data) %>% unnest(missing) %>% mutate(missing = round(missing, 2)) %>% arrange(year) DT::datatable(peri_complete_years_all_vars, rownames = FALSE)
Usually I would go with the most recent data available; however, in this case it’s quite sparse so let’s look at the data from 2015.
peri_df_2015 <- janitor::remove_empty(com_peri_df) %>% filter(year == 2015) %>% filter(rec_inc_rate > 7.39) %>% select(-year) DT::datatable(peri_df_2015, rownames = FALSE)
This leaves us with a tidy dataset on 8 maternal mental health indicators for 38 counties from 2015.
Dimension reduction
We are now ready to conduct some clustering analysis on the data. The first step is to reduce the dimensionality of the data using PCA. We use the estim_ncp function (which uses a method outlined in this paper from the FactoMineR package to estimate the number of principal components required. We then perform PCA and plot the variance explained by each component as a check on estim_ncp. All of the following analysis is done using nested tibbles and so can be easily generalized to higher dimensional use cases.
Let’s calculate the optimal number of principal components is using the estim_ncp function, perform PCA using the prcomp function and plot the variance explained by each component as a check on estim_ncp.
df <- as.data.frame(peri_df_2015) FactoMineR::estim_ncp(df[2:9], ncp.min = 2, ncp.max = 10)
The optimal number of components is 4.
peri_pca <- peri_df_2015 %>% nest() %>% mutate( numeric_data = map(data, ~select_if(., is.numeric) %>% as.data.frame()), optimal_pca_no = map(numeric_data, ~estim_ncp(., scale = TRUE, ncp.min = 2, ncp.max = 6)) %>% map_dbl(~.$ncp), pca = map(numeric_data, ~prcomp(.x, center = TRUE, scale = TRUE)), pca_data = map(pca, ~.$x), pca_aug = map2(pca, data, ~augment(.x, data = .y))) ## Variance explained var_exp <- peri_pca %>% dplyr::select(-optimal_pca_no) %>% unnest(pca_aug) %>% summarize_at(.vars = vars(contains("PC")), .funs = funs(var)) %>% gather(key = pc, value = variance) %>% mutate(var_exp = variance/sum(variance) * 100, cum_var_exp = cumsum(var_exp), pc = str_replace(pc, ".fitted", "") %>% str_replace("PC", "")) ## Plot variance explained var_exp %>% rename( `Variance Explained` = var_exp, `Cumulative Variance Explained` = cum_var_exp ) %>% gather(key = key, value = value, `Variance Explained`, `Cumulative Variance Explained`) %>% mutate(key = key %>% factor(levels = c("Variance Explained", "Cumulative Variance Explained"))) %>% mutate(value = value / 100) %>% mutate(pc = factor(pc, levels = as.character(1:max(var_exp$pc %>% as.numeric)))) %>% ggplot(aes(pc, value, group = key)) + geom_point(size = 2, alpha = 0.8) + geom_line(size = 1.1, alpha = 0.6) + facet_wrap(~key, scales = "free_y") + theme_bw() + labs( title = "Variance Explained by Principal Component", subtitle = paste0("The optimal number of principal components suggested by estim_ncp was ", peri_pca$optimal_pca_no, " which explains ", round(var_exp$cum_var_exp[[2]], 0), "% of the data."), x = "Principal Component", y = "Variance Explained (%)", caption = "@MattOldach Source: Public Health England (fingertipsR)" )
## Perform pam on pca data 1 to 6 groups peri_pca_pam <- peri_pca %>% mutate(centers = list(2:10)) %>% unnest(centers, .preserve = everything()) %>% dplyr::select(-centers, centers = centers1) %>% group_by(centers) %>% mutate( pam = map(pca_data, ~ pam(x = .x[, 1:optimal_pca_no], k = centers, stand = TRUE)), clusters = map(pam, ~.$clustering), avg_silhouette_width = map(pam, ~.$silinfo$avg.width), data_with_clusters = map2(.x = data, .y = clusters, ~mutate(.x, cluster = factor(.y, ordered = TRUE))) ) %>% ungroup ## Get max silhouette width max_silhouette_width <- peri_pca_pam %>% dplyr::select(centers, avg_silhouette_width) %>% unnest(avg_silhouette_width) %>% arrange(desc(avg_silhouette_width)) %>% slice(1) ## Plot average silhouette width peri_pca_pam %>% dplyr::select(centers, avg_silhouette_width) %>% unnest(avg_silhouette_width) %>% ggplot(aes(x = centers, y = avg_silhouette_width)) + geom_line(size = 2, alpha = 0.4) + geom_point(size = 3, alpha = 0.8) + theme_bw() + scale_x_continuous(breaks = seq(1, 10, 1), minor_breaks = NULL) + scale_y_continuous(limits = c(NA, NA), breaks = seq(0, 1, 0.01), minor_breaks = NULL) + labs(title = "Average Silhouette Width by Number of PAM Clusters", subtitle = paste0("The optimal number of clusters identifed by avg. silhouette width was ", max_silhouette_width$centers, " with an avg. silhouette width of ", round(max_silhouette_width$avg_silhouette_width, 2) ), x = "Clusters", y = "Avg. Silhouette Width", caption = "@MattOldach Source: Public Health England (fingertipsR)")
## Plot clusters pca_plot <- peri_pca_pam %>% filter(centers == max_silhouette_width$centers) %>% select(data_with_clusters, pca) %>% mutate(pca_graph = map2(.x = pca, .y = data_with_clusters, ~ autoplot(.x, x = 1, y = 2, loadings = TRUE, loadings.label = TRUE, loadings.label.repel = TRUE, loadings.label.size = 2, loadings.alpha = 0.8, loadings.label.vjust = -1, data = .y, label = TRUE, label.label = "AreaName", label.size = 1.5, label.vjust = -1, alpha = 0.3, frame = TRUE, frame.type = 'convex', frame.alpha= 0.05, colour = "cluster", size = "rec_inc_rate") + theme_bw() + labs(x = paste0("Principal Component 1 (Variance Explained: ", round(var_exp$var_exp[[1]], 1), "%)"), y = paste0("Principal Component 2 (Variance Explained: ", round(var_exp$var_exp[[2]], 1), "%)")) + guides(colour=guide_legend(title = "Cluster", ncol = 2), fill=guide_legend(title= "Cluster", ncol = 2), size = guide_legend(title = "Depression: % recorded prevalence", ncol = 2)) + scale_color_manual(name = "Cluster", values = c("#6b5b95", "#feb236", "#d64161", "#ff7b25", "#87bdd8")) + scale_fill_manual(name = "Cluster", values = c("#6b5b95", "#feb236", "#d64161", "#ff7b25", "#87bdd8")) + theme(legend.position = "bottom", legend.box = "horizontal") + labs( title = "Maternal Mental Health in England", subtitle = "The arrows are variable loadings and points are counties coloured by cluster membership", caption = "@MattOldach Source: Public Health England (fingertipsR)" ) )) %>% pull(pca_graph) %>% first pca_plot
sum_peri_df <- peri_pca_pam %>% filter(centers == max_silhouette_width$centers) %>% pull(data_with_clusters) %>% map(~ gather(., key = "Variable", value = "value", -AreaName, -cluster)) %>% first %>% rename(Cluster = cluster) plot_cluster_diff <- sum_peri_df %>% ggplot(aes(x = Variable, y = value, col = Cluster, fill = Cluster)) + geom_violin(draw_quantiles = c(0.025, 0.5, 0.975), alpha = 0.2, scale = "width") + geom_jitter(position = position_jitterdodge(), alpha = 0.3) + coord_flip() + theme_minimal() + scale_y_continuous(breaks = seq(0, 100, 6), minor_breaks = NULL) + scale_fill_manual(name = "Cluster", values = c("#6b5b95", "#feb236", "#d64161", "#ff7b25", "#87bdd8")) + scale_color_manual(name = "Cluster", values = c("#6b5b95", "#feb236", "#d64161", "#ff7b25", "#87bdd8")) + theme(legend.position = "bottom") + labs( title = "Maternal Mental Health in England; Summarised by Cluster", subtitle = "Violin plots are scaled by width, with the 2.5%, 50% and 97.5% quantiles shown.", x = "Variable", y = "Recorded % depression prevalance (aged 18+) for rec_int_rate, otherwise proportion (0-100%)", caption = "@MattOldach Source: Public Health England (fingertipsR)") plot_cluster_diff
From these plots we see that the thrid cluster showing a higher recent increase rate has a highger proportion of postpartum psychosis, mild depression and distress.
We can plot the cluster membership for each county on a map with the outlines of the counties in England from the UK data service (download the ShapeFiles here and here).
peri_cluster_map <- function(peri_pca_pam) { gpclibPermit() england_counties <- rgdal::readOGR("england_ct_2011.shp") %>% fortify(region = "code") %>% as_tibble england_urban_areas <- rgdal::readOGR("england_urb_2001.shp") %>% fortify(region = "name") %>% as_tibble %>% filter(id %in% c("Sheffield Urban Area", "Plymouth", "Skelton (Redcar and Cleveland)", "Brighton/Worthing/Littlehampton", "Leicester Urban Area" )) ## Make custom positions for urban area labels urban_area_labels <- england_urban_areas %>% group_by(id) %>% slice(100) %>% ungroup() %>% mutate(long = long - 200000, lat = lat + 20000) peri_cluster_results <- peri_pca_pam %>% filter(centers == max_silhouette_width$centers) %>% pull(data_with_clusters) %>% first peri_cluster_results <- peri_df[[14]] %>% dplyr::select(AreaName, AreaCode, AreaType) %>% filter(AreaType %in% "County & UA") %>% unique %>% left_join(peri_cluster_results, by = "AreaName") %>% left_join(england_counties, by = c("AreaCode" = "id")) peri_cluster_results %>% rename(Cluster = cluster) %>% drop_na(Cluster) %>% dplyr::select(long, lat, Cluster, group) %>% ggplot( aes(x = long, y = lat, fill = Cluster)) + geom_polygon(data = england_urban_areas, aes(group = group, fill = NULL), alpha = 0.4) + geom_path(data = peri_cluster_results, aes(group = group, fill = NULL), alpha = 0.4) + geom_polygon(data = peri_cluster_results, aes(group = group, fill = NULL), alpha = 0.1) + geom_polygon(aes(group = group), alpha = 0.6) + geom_line(data = urban_area_labels %>% bind_rows(urban_area_labels %>% mutate(long = long + 200000, lat = lat - 20000)), aes(fill = NA, group = id), alpha = 0.8) + geom_label(data = urban_area_labels, aes(label = id), fill = "grey") + scale_fill_manual(name = "Cluster", values = c("#6b5b95", "#feb236", "#d64161", "#ff7b25", "#87bdd8")) + coord_equal() + theme_map() + theme(legend.position = "bottom") + labs(title = "Maternal Mental Health Indicators; Map of County Level Clusters in England", subtitle = "Using data from 2015", caption = "Selected urban areas are shown (dark grey) and labelled. @MattOldach Source: Public Health England (fingertipsR) Contains National Statistics data © Crown copyright and database right 2019. Contains OS data © Crown copyright and database right 2019") } plot_peri_cluster_map <- peri_cluster_map(peri_pca_pam) plot_peri_cluster_map
Identifying regional differences in Perinatal Mental Health indicators in the UK with R was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.