
Number of PhDs by Field

[This article was first published on R on Harshvardhan, and kindly contributed to R-bloggers].

Yesterday I was talking to one of my friends about his plans after his PhD. “I want to go for pure sciences and abstract mathematics, but there are hardly any positions in academia on these topics,” he said. It got me thinking about how many PhD students graduate every year and whether the demand (in academia or in industry) falls short of that number. But I didn’t even know how many PhDs are awarded each year, let alone how many graduates find employment.

While searching for a dataset for my Text Mining class project, I discovered this dataset on the number of PhDs by field. So, let’s explore!

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.5          ✓ purrr   0.3.4     
## ✓ tibble  3.1.6          ✓ dplyr   1.0.8.9000
## ✓ tidyr   1.2.0          ✓ stringr 1.4.0     
## ✓ readr   2.1.2          ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(garlic)
library(DT)
theme_set(theme_linedraw())

# Loading dataset from their repository
phds = readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-02-19/phd_by_field.csv")

## Rows: 3370 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): broad_field, major_field, field
## dbl (2): year, n_phds
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
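The message above suggests declaring the column types. Here is a minimal sketch of what that could look like, assuming the five columns and types reported in the column specification. To keep it runnable offline, it reads a tiny inline CSV with made-up rows (`toy_csv` and `phds_toy` are hypothetical names) instead of the real URL.

```r
library(readr)

# Toy CSV standing in for phd_by_field.csv; the rows are made up for illustration.
toy_csv <- I("broad_field,major_field,field,year,n_phds
Life sciences,Major A,Field A,2008,111
Life sciences,Major A,Field B,2008,NA
")

# Declaring the types up front silences the column-specification message
# and documents what the rest of the analysis expects.
phds_toy <- read_csv(
  toy_csv,
  col_types = cols(
    broad_field = col_character(),
    major_field = col_character(),
    field       = col_character(),
    year        = col_double(),
    n_phds      = col_double()
  )
)

phds_toy
```

The same `col_types` argument should work unchanged when the first argument is the URL used above.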

phds

## # A tibble: 3,370 × 5
##    broad_field   major_field                                 field   year n_phds
##    <chr>         <chr>                                       <chr>  <dbl>  <dbl>
##  1 Life sciences Agricultural sciences and natural resources Agric…  2008    111
##  2 Life sciences Agricultural sciences and natural resources Agric…  2008     28
##  3 Life sciences Agricultural sciences and natural resources Agric…  2008      3
##  4 Life sciences Agricultural sciences and natural resources Agron…  2008     68
##  5 Life sciences Agricultural sciences and natural resources Anima…  2008     41
##  6 Life sciences Agricultural sciences and natural resources Anima…  2008     18
##  7 Life sciences Agricultural sciences and natural resources Anima…  2008     77
##  8 Life sciences Agricultural sciences and natural resources Envir…  2008    182
##  9 Life sciences Agricultural sciences and natural resources Fishi…  2008     52
## 10 Life sciences Agricultural sciences and natural resources Food …  2008     96
## # … with 3,360 more rows

There are many records by field, at three levels of granularity. There are 337 fields, and we have records for each of them from 2008 to 2017. Let’s see how many people are from which field.
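One way to check the three levels and the year range is to count distinct values in each column. This is only a sketch on a toy tibble with made-up values (`phds_toy` is a hypothetical stand-in with the same column names as `phds`); running the same `summarise()` on `phds` should give the real counts.

```r
library(dplyr)

# Toy stand-in for `phds` (made-up values, same column names).
phds_toy <- tibble(
  broad_field = c("Life sciences", "Life sciences", "Engineering", "Engineering"),
  major_field = c("Major A", "Major A", "Other engineering", "Other engineering"),
  field       = c("Field A", "Field B", "Field C", "Field C"),
  year        = c(2008, 2008, 2008, 2009),
  n_phds      = c(111, 28, 40, NA)
)

# Number of distinct values at each level of granularity, plus the year range.
granularity <- phds_toy %>%
  summarise(
    n_broad    = n_distinct(broad_field),
    n_major    = n_distinct(major_field),
    n_field    = n_distinct(field),
    first_year = min(year),
    last_year  = max(year)
  )

granularity
```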

phds %>%
   group_by(broad_field) %>%
   summarise(n_phds = sum(n_phds, na.rm = TRUE)) %>%
   arrange(desc(n_phds)) %>%
   datatable(colnames = c("Broad Field", "Number of PhDs"),
             rownames = FALSE,
             caption = "Number of PhDs by their broad fields. Life sciences lead the way.") %>%
   formatRound("n_phds", digits = 0)

Life sciences has the largest number of graduates. Engineering has the smallest number of graduates, even fewer than the mysterious Other. Surprisingly, social sciences, humanities and education are higher than mathematics and computer science, and they lead by a wide margin. The number of graduates in “humanities and social science” subjects is four times the number of PhDs in “hard sciences” like engineering and maths. No wonder there is such a shortage of people in the tech world.
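A ratio like this can be computed directly from the per-field totals. The sketch below uses made-up totals, and the split into two buckets (`soft` and `hard`) is my own hypothetical choice for illustration, so the resulting number is not the real ratio.

```r
library(dplyr)

# Toy totals per broad field (made-up numbers, not the real counts).
totals_toy <- tibble(
  broad_field = c("Humanities", "Social sciences", "Engineering", "Mathematics"),
  n_phds      = c(600, 300, 200, 100)
)

# Hypothetical grouping of broad fields into two buckets for the comparison.
soft <- c("Humanities", "Social sciences")
hard <- c("Engineering", "Mathematics")

# Sum each bucket, then divide one total by the other.
comparison <- totals_toy %>%
  summarise(
    soft_total = sum(n_phds[broad_field %in% soft]),
    hard_total = sum(n_phds[broad_field %in% hard]),
    ratio      = soft_total / hard_total
  )

comparison
```

With the real data, the two vectors would need to hold the exact `broad_field` labels from the table above.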

Life sciences is such a broad, encompassing field. Let’s explore what is covered in life sciences.

phds %>%
   filter(broad_field == "Life sciences") %>%
   group_by(major_field) %>%
   summarise(n_phds = sum(n_phds, na.rm = TRUE)) %>%
   arrange(desc(n_phds)) %>%
   datatable(colnames = c("Major Field", "Number of PhDs"),
             rownames = FALSE,
             caption = "Number of PhDs by their major fields. Biology, excluding health sciences, leads the way.") %>%
   formatRound("n_phds", digits = 0)

Biological and biomedical sciences has the largest number of graduates. There are so few PhDs in geosciences. With climate change becoming another major issue, I wonder why the field isn’t picking up faster.

Let me explore engineering too and see the fields within it.

phds %>% 
  filter(broad_field == "Engineering") %>% 
  group_by(major_field) %>% 
  summarise(n_phds = sum(n_phds, na.rm = TRUE)) %>%
  arrange(desc(n_phds))

## # A tibble: 1 × 2
##   major_field       n_phds
##   <chr>              <dbl>
## 1 Other engineering  18139

Oh, so no information here: every engineering record has the same major_field. The detail is nested in another column, I guess, so I’ll have to group by field.
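To see up front which level carries the detail, one option is to count distinct `major_field` and `field` values within each `broad_field`. This is a sketch on a toy tibble with made-up field names (`phds_toy` is hypothetical); a broad field with `n_major == 1` is one where grouping by `major_field` tells us nothing.

```r
library(dplyr)

# Toy stand-in for `phds` (made-up values, same column names).
phds_toy <- tibble(
  broad_field = c("Engineering", "Engineering", "Engineering",
                  "Life sciences", "Life sciences"),
  major_field = c("Other engineering", "Other engineering", "Other engineering",
                  "Major A", "Major B"),
  field       = c("Field C", "Field D", "Field E", "Field A", "Field B"),
  n_phds      = c(40, 25, 10, 111, 28)
)

# Distinct major fields and fields inside each broad field.
levels_per_broad <- phds_toy %>%
  group_by(broad_field) %>%
  summarise(
    n_major = n_distinct(major_field),
    n_field = n_distinct(field)
  ) %>%
  arrange(n_major)

levels_per_broad
```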

phds %>%
  filter(broad_field == "Engineering") %>%
  group_by(field) %>%
  summarise(n_phds = sum(n_phds, na.rm = TRUE)) %>%
  arrange(desc(n_phds)) %>%
  datatable(colnames = c("Field", "Number of PhDs")) %>%
  formatRound("n_phds", digits = 0)

Computer engineering PhDs are the most popular, with about twice as many as the next field in the list. Environmental engineering is the second most popular. That’s impressive. Let’s visualise the counts.

phds %>% 
  filter(broad_field == "Engineering") %>% 
  group_by(field) %>% 
  summarise(n_phds = sum(n_phds, na.rm = TRUE)) %>%
  ggplot(aes(reorder(field, n_phds), n_phds)) +
  geom_col() +
  coord_flip() +
  labs(y = "Number of PhDs", x = "Field (Engineering only)")

The data gives me an opportunity to see how these numbers changed with the rise in popularity of computer engineering. I’ve heard numerous times that its popularity has increased over the years.

# ggrepel for text labels
library(ggrepel)

phds %>%
   filter(broad_field == "Engineering") %>%
   mutate(label = if_else(year == max(year), field, NA_character_)) %>%
   ggplot(aes(x = year, y = n_phds, colour = field)) +
   geom_line() +
   scale_x_continuous(breaks = seq(from = 2008, to = 2017, by = 1)) +
   geom_label_repel(aes(label = label),
                    nudge_x = 1,
                    na.rm = TRUE) +
   labs(x = "Year", y = "Number of PhDs") +
   theme(legend.position = "none")

## Warning: Removed 20 row(s) containing missing values (geom_path).

## Warning: ggrepel: 10 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

phds_top_engineering = phds %>%
  filter(broad_field == "Engineering") %>% 
  group_by(field) %>% 
  summarise(n_phds = sum(n_phds)) %>% 
  filter(n_phds > 100) %>% 
  slice_max(order_by = n_phds, n = 6)

phds_top_engineering

## # A tibble: 6 × 2
##   field                                            n_phds
##   <chr>                                             <dbl>
## 1 Computer engineering                               4030
## 2 Environmental, environmental health engineeringl   2001
## 3 Engineering, other                                 1488
## 4 Nuclear engineering                                1166
## 5 Operations research (engineering)                   985
## 6 Systems engineering                                 924
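Unlike the earlier summaries, this one calls `sum()` without `na.rm = TRUE`. A field with even one missing year then gets an `NA` total, and `filter()` drops rows where the condition evaluates to `NA`, so such fields never reach the ranking. Here is a small sketch of the difference on made-up numbers (`phds_toy`, `strict` and `lenient` are hypothetical names).

```r
library(dplyr)

# Toy stand-in for `phds`: Field D has one missing year (made-up values).
phds_toy <- tibble(
  field  = c("Field C", "Field C", "Field D", "Field D"),
  year   = c(2016, 2017, 2016, 2017),
  n_phds = c(150, 160, 400, NA)
)

# Without na.rm, one missing year makes the whole total NA,
# and filter() silently drops rows where the condition is NA.
strict <- phds_toy %>%
  group_by(field) %>%
  summarise(n_phds = sum(n_phds)) %>%
  filter(n_phds > 100)

# With na.rm = TRUE, missing years are skipped and the field is kept.
lenient <- phds_toy %>%
  group_by(field) %>%
  summarise(n_phds = sum(n_phds, na.rm = TRUE)) %>%
  filter(n_phds > 100)

strict
lenient
```

Which behaviour is preferable depends on whether a field with incomplete years should be ranked at all.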

phds %>% 
  filter(field %in% phds_top_engineering$field) %>% 
  ggplot(aes(x = year, y = n_phds, fill = field)) +
  geom_bar(stat = "identity") + 
  scale_x_continuous(labels = scales::label_number(accuracy = 1)) +
  scale_fill_manual(values = MetBrewer::met.brewer("Hokusai1", 6)) +
  facet_wrap( ~ field) +
  labs(x = "Year", y = "Number of PhDs", fill = "Field")

Computer engineering has always been popular. I didn’t expect that.

But wait, wasn’t there a computer science in broad_field? What was that? It was called Mathematics and computer sciences.

phds %>%
   filter(broad_field == "Mathematics and computer sciences") %>%
   group_by(major_field) %>%
   summarise(n_phds = sum(n_phds, na.rm = TRUE)) %>%
   arrange(desc(n_phds)) %>%
   datatable(colnames = c("Major Field", "Number of PhDs"),
             rownames = FALSE,
             caption = "Mathematics and computer sciences has two fields.") %>%
   formatRound("n_phds", digits = 0)

phds %>%
   filter(broad_field == "Mathematics and computer sciences") %>%
   filter(n_phds >= 300) %>% 
   mutate(label = if_else(year == max(year), field, NA_character_)) %>%
   ggplot(aes(x = year, y = n_phds, colour = field)) +
   geom_line() +
   scale_x_continuous(breaks = seq(from = 2008, to = 2017, by = 1)) +
   geom_label_repel(aes(label = label),
                    nudge_x = 1,
                    na.rm = TRUE) +
   labs(x = "Year", y = "Number of PhDs") +
   theme(legend.position = "none")

Computer engineering averaged around 400; computer science averaged around 1500. I think this is the “computer science” of general parlance.
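Rather than reading the averages off the plot, they could be computed as a per-field mean across years. This is a sketch on a toy tibble with made-up counts and placeholder field names (`phds_toy` and `yearly_average` are hypothetical); the real figures would come from running the same pipeline on `phds`.

```r
library(dplyr)

# Toy stand-in for `phds`: two fields over three years (made-up values).
phds_toy <- tibble(
  field  = rep(c("Field X", "Field Y"), each = 3),
  year   = rep(2015:2017, times = 2),
  n_phds = c(390, 400, 410, 1400, 1500, 1600)
)

# Average number of PhDs per year for each field, largest first.
yearly_average <- phds_toy %>%
  group_by(field) %>%
  summarise(mean_phds = mean(n_phds, na.rm = TRUE)) %>%
  arrange(desc(mean_phds))

yearly_average
```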


This exploration is incomplete. I couldn’t finish it in time, but I’ll get back to it someday.

Today I found this wonderful visualisation on Twitter that I thought I’d replicate for the number of PhDs by field.

library(tweetrmd)
tweet_screenshot("https://twitter.com/jenjentro/status/1512997114896269312?t=nWQqyQa3tHQVNSHPakh2TA")

Her code was available on GitHub.

# Loading packages
library(tidytuesdayR)
library(tidylog)

## 
## Attaching package: 'tidylog'

## The following objects are masked from 'package:dplyr':
## 
##     add_count, add_tally, anti_join, count, distinct, distinct_all,
##     distinct_at, distinct_if, filter, filter_all, filter_at, filter_if,
##     full_join, group_by, group_by_all, group_by_at, group_by_if,
##     inner_join, left_join, mutate, mutate_all, mutate_at, mutate_if,
##     relocate, rename, rename_all, rename_at, rename_if, rename_with,
##     right_join, sample_frac, sample_n, select, select_all, select_at,
##     select_if, semi_join, slice, slice_head, slice_max, slice_min,
##     slice_sample, slice_tail, summarise, summarise_all, summarise_at,
##     summarise_if, summarize, summarize_all, summarize_at, summarize_if,
##     tally, top_frac, top_n, transmute, transmute_all, transmute_at,
##     transmute_if, ungroup

## The following objects are masked from 'package:tidyr':
## 
##     drop_na, fill, gather, pivot_longer, pivot_wider, replace_na,
##     spread, uncount

## The following object is masked from 'package:stats':
## 
##     filter

library(showtext)

## Loading required package: sysfonts

## Loading required package: showtextdb