Psychotherapies PubMed battle

[This article was first published on R Psychologist » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.


Todays mission have been to visualize publication frequencies over the years for different psychotherapies using PubMed and R. I choose to concentrate on Psychodynamic Therapy (PDT), Cognitive Behavior Therapy (CBT), Acceptance and Commitment Therapy (ACT) and Mindfulness. I know there’s is a R function to query PubMed, but I didn’t use it and instead searched PubMed manually and saved the result as four separate csv-files.

PubMed search filter
I used a really rudimentary search filter:

  • “cognitive behavior therapy” OR “cognitive therapy” (13245 hits)
  • “psychodynamic therapy” OR “psychoanalytic therapy” (13933 hits)
  • Mindfulness (1099 hits)
  • “Acceptance and commitment therapy” (374 hits)
  • Writing the R code

    So, after I’d saved the data files from PubMed to my computer, I switched over the R and loaded the necessary packages.

    library(stringr) # used for str_extract()
    library(ggplot2) # used for plotting
    library(plyr) # used for count()

    Then I wrote a short function to extract year and frequency from the csv-file. Since publication year wasn’t under it’s own header I used str_extract to extract that information.

    # create function named scrape.pubmed
    scrape.pubmed <- function() {
      # we only need column 5
      results <- pubmed_result[5]
      # the header is repeated regularly, so I remove those rows.
      results <- results[results$Details != "ShortDetails",]
      # The remaning field contain journal name in addition to pub. year
      # The general expression catches values 1900--1999 OR 2000--2019
      # Most rows had year information at the end of the string,
      # but not all rows had this. Therefore I used this regular expression to
      # minimize mismatches.
      year <- str_extract(results, "(19[0-9]{2}|20[0-1][0-9])")
      # Use the count()-function to group years together.
      # I didn't want values from 2012, hence the < 2012 addition.
      count <- count(year[year < 2012])
      # Return the value of count

    Now that I have a function to collect my data, I loaded the different csv-files I saved from PubMed and collect year data from them.

    # Get CBT data
    pubmed_result <- read.csv("~/Downloads/pubmed_cbt.csv", row.names = NULL)
    cbt.count <- scrape.pubmed()
    # Get PDT data
    pubmed_result <- read.csv("~/Downloads/pubmed_pdt.csv", row.names = NULL)
    pdt.count <- scrape.pubmed()
    # Get mindfulness data
    pubmed_result <- read.csv("~/Downloads/pubmed_mindfulness.csv", row.names = NULL)
    mindfulness.count <- scrape.pubmed()
    # Get ACT data
    pubmed_result <- read.csv("~/Downloads/pubmed_ACT.csv", row.names = NULL)
    act.count <- scrape.pubmed()

    Now I have 4 variables who look like this:

    > head(cbt.count, 10)
          x freq
    1  1976   19
    2  1977    2
    3  1978    1
    4  1979    1
    5  1980   12
    6  1981    9
    7  1982   12
    8  1983   15
    9  1984   14
    10 1985   16

    I could combine them into one data.frame but I choose not to. Instead I plot them individually on different layers with ggplot.

    # It doesn't matter in which order you enter the variables, I
    # started with cbt.count. For some reason ggplot wouldn't accept
    # the syntax without 'group = 1'.
    ggplot(cbt.count, aes(x, freq, group = 1, color = "CBT")) +
      # line layer for cbt.count
      geom_line() +
      # new layer for pdt.count
      geom_line(data = pdt.count, aes(x, freq, color = "PDT")) +
      # new layer for mindfulness.count
      geom_line(data = mindfulness.count, aes(x, freq, color = "Mindfulness")) +
      # new layer for act.clunt
      geom_line(data = act.count, aes(x, freq, color = "ACT"))+
      # This line both control the groups colors and labels on the legend.
                         values= c("CBT" = "blue",
                                   "PDT" = "red",
                                   "Mindfulness" = "green",
                                   "ACT" = "magenta")) +
      ylab("Publications") + # y-label
      xlab("Year") + # x-label
      # break up x-axis from 1900 to 2010 by ticks of 10
      scale_x_discrete(breaks = seq(1900,2010, 10)) +
      # add plot title
      opts(title = "Pubmed citations by year")


    We can see that PDT really had a decline in the early 90′s and the CBT started it’s rise and that it really sky rocketed in the 21th century. PDT also had an increase in the 21th century but it got a heck of a lot of work to do before it can match CBT’s publication output. The plot also show that mindfulness had a surge and that it’s output now is on par with PDT.

    I’ve made some updates to this code in a more recent post:
    Ggplot2, PubMed citation frequency and DSM-IV Axis I disorders by year

    To leave a comment for the author, please follow the link and comment on their blog: R Psychologist » R. offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
    Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

    Never miss an update!
    Subscribe to R-bloggers to receive
    e-mails with the latest R posts.
    (You will not see this message again.)

    Click here to close (This popup will not appear again)