Medicines under evaluation

[This article was first published on Wiekvoet, and kindly contributed to R-bloggers.]

While browsing the internet I ran into the medicines under evaluation page of the European Medicines Agency. There was a .pdf for each month, and I thought it would be interesting to see whether I could analyze those data. Each .pdf contains various sections of medicines; I have focused on the ‘Non-orphan medicinal products’. The information given is the international non-proprietary name and the area. In addition, an entry in bold is new. The .pdf I looked at contained a fairly short list, with about 40 or 50 entries and a few entries in bold. Hence I had the feeling that about a year's worth of data would see many medicines entering and leaving the list. I decided to take data starting January 2014, hence have 14 months of data.

Importing data

Reading from .pdf is difficult, and this proved no exception. Rather than typing all data I pulled tabula (‘Tabula is a tool for liberating data tables locked inside PDF files’) from the internet. I marked all tables and converted them to spreadsheets. Unfortunately tabula did not understand that two text lines within a cell of the .pdf form one string, so quite some post processing was needed to rectify this. Still, it is probably easier than typing all data. Finally, I ended up with 14 .csv files, which could be imported into R. As can be seen, some further post processing was needed there, since spacing was used inconsistently. The empty lines are a remainder of my merging texts into cells; I thought it more convenient to strip these in R than in each spreadsheet. The superscript 1 refers to a footnote which is indicated in some .pdf files and which got transformed into its own cell. Again, I decided to handle that part in R. Not seen is some inconsistent use of capitals, which was resolved by editing the spreadsheets.
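library(dplyr)
library(magrittr)   # for %<>%
library(survival)   # for Surv() and coxph() further on

csvs <- dir(pattern='\\.csv$')

step1 <- lapply(csvs,function(csv) {
          r1 <- readLines(csv) 
          r1 <- r1[r1!=',']  # empty lines
          r1 <- r1[r1!=',1'] # lines with superscript 1
          r1 %<>% read.csv(text=., skip=1,col.names=c('Name','Area'),header=FALSE) %>%
              mutate(.,
                  # trim leading and trailing white space, collapse repeated spaces
                  Name=gsub("([[:space:]]+$)|(^[[:space:]]+)", "", Name),
                  Name=gsub(' +',' ',Name),
                  # uniform spacing around '/' and '-'
                  Name=gsub("/",' / ',Name),
                  Name=gsub("-",' - ',Name),
                  Name=gsub(' +',' ',Name),
                  Area=gsub(' *- *','-',Area),
                  Area=gsub(' +',' ',Area),
                  Area=gsub("([[:space:]]+$)|(^[[:space:]]+)", "", Area),
                  csvs=csv)  # keep the name of the source file
        }) %>%
    do.call(rbind,.) 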
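
To make the cleaning steps easier to follow, here is a minimal base R sketch on a few made-up lines. The medicine names and the raw layout are hypothetical, chosen only to mimic the kind of debris described above (stray ',' and ',1' lines, irregular spacing); the helper `clean_name` is not part of the original code.

```r
# Hypothetical raw lines as a PDF-to-CSV export might leave them:
# a header line, stray "," and ",1" lines, and irregular spacing.
raw <- c('Name,Area',
         ' drug  a /drug b ,Antivirals  for systemic use',
         ',',
         'drug-c,Medicines used  in diabetes ',
         ',1')

raw <- raw[!raw %in% c(',', ',1')]      # drop empty and footnote lines
dat <- read.csv(text = raw, skip = 1, header = FALSE,
                col.names = c('Name', 'Area'),
                stringsAsFactors = FALSE)

clean_name <- function(x) {
    x <- gsub('/', ' / ', x)            # uniform spacing around slashes
    x <- gsub('-', ' - ', x)            # and around hyphens
    x <- gsub(' +', ' ', x)             # collapse repeated spaces
    trimws(x)                           # drop leading/trailing white space
}
dat$Name <- clean_name(dat$Name)
dat$Area <- trimws(gsub(' +', ' ', dat$Area))
print(dat)
```

After cleaning, names that differ only in spacing end up identical, which matters because the same medicine has to be matched across months.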
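Since the data now contain the name of the source file rather than the actual month, a month number is added.
# Reconstruction: part of this statement was lost from the post. It assumes that
# file names start with a three-letter month and contain a four-digit year.
csvsdf <- data.frame(csvs=csvs) %>%
    mutate(.,months =factor(tolower(substr(csvs,1,3)),levels=tolower(month.abb)),
        year=as.numeric(sub('^.*([0-9]{4}).*$','\\1',csvs)),
        monthno=as.numeric(months)+12*year,
        monthno=monthno-min(monthno)+1)
step2 <- merge(step1,csvsdf) %>%
    mutate(.,Name=factor(Name),
        Area=factor(Area))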
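
The month numbering can be illustrated in isolation. This is a sketch under an assumed naming scheme (three-letter month followed by a four-digit year); the file names below are hypothetical and the actual names used for the 14 files are not given in the post.

```r
# Hypothetical file names; the naming scheme is an assumption.
files <- c('feb2015.csv', 'jan2014.csv', 'dec2014.csv')

month <- match(tolower(substr(files, 1, 3)), tolower(month.abb))
year  <- as.numeric(sub('^.*([0-9]{4}).*$', '\\1', files))

# Months since the earliest file, counting that file as month 1.
monthno <- 12 * year + month
monthno <- monthno - min(monthno) + 1
print(data.frame(files, month, year, monthno))
```

With data running from January 2014 to February 2015 this numbering gives month 1 for the first file and month 14 for the last.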

Areas of medicine

There are 99 medicines, distributed over 34 areas. The most frequent areas are:
xtabs(~ Name + Area,step2) %>%
    as.data.frame %>%
    filter(.,Freq!=0) %>%
    xtabs(~ Area,.) %>%
    as.data.frame %>%
    arrange(.,-Freq) %>%
    head(.,7)  # show the most frequent areas
                                       Area Freq
1                  Antineoplastic medicines   11
2               Antivirals for systemic use   11
3                Medicines used in diabetes    7
4 Medicines for obstructive airway diseases    6
5                          Antihemorrhagics    5
6                  Antithrombotic medicines    5
7               Other therapeutic medicines    5
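
The reason for tabulating twice is that each medicine appears once per month it is listed, so a direct count per area would count months rather than medicines. A small base R sketch on hypothetical data shows the idea; the names and areas are invented.

```r
# Hypothetical long-format data: one row per medicine per month listed.
toy <- data.frame(
    Name = c('a', 'a', 'a', 'b', 'b', 'c'),
    Area = c('X', 'X', 'X', 'X', 'X', 'Y'),
    stringsAsFactors = FALSE)

# Keep one row per medicine, then count medicines per area.
per_medicine <- unique(toy[, c('Name', 'Area')])
counts <- sort(table(per_medicine$Area), decreasing = TRUE)
print(counts)
```

Area X has five rows in the toy data but only two medicines, and it is the latter number that the table of areas reports.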


The intention was to run a Cox proportional hazards model. Hence I added an extra row for each medicine for which I know both the beginning and the end month.
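# Reconstructed: lines lost from the post are filled in; the time definition is an assumption.
terminated <- group_by(step2,Name,Area) %>%
    summarize(.,time=max(monthno)-min(monthno)+2,  # month after the last listing
        event=!(min(monthno)==1 | max(monthno)==14) ) %>%
    filter(.,event) %>%
    as.data.frame
living <- mutate(step2,
        event=FALSE,
        time=monthno-ave(monthno,Name,FUN=min)+1) %>%
    select(.,Name,Area,time,event)
both <- rbind(terminated,living) %>%
    arrange(.,Name,time)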
Unfortunately coxph() gave warning messages, so I do not trust the results sufficiently.
#coxph(Surv(time=time,event=event) ~ Name ,data=both)
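
The censoring rule used here can be spelled out on a toy example. This is a sketch in base R, not the code of the post: the listings are hypothetical, and the duration is taken as last minus first month plus one, which assumes a medicine is listed without gaps.

```r
# Hypothetical listings: medicine 'a' is on the list in months 1-3,
# 'b' in months 4-6 and 'c' in months 12-14 of a 14-month window.
toy <- data.frame(Name    = rep(c('a', 'b', 'c'), each = 3),
                  monthno = c(1:3, 4:6, 12:14))
last_month <- 14

first <- tapply(toy$monthno, toy$Name, min)
last  <- tapply(toy$monthno, toy$Name, max)

durations <- data.frame(
    Name  = names(first),
    # months on the list (assumes no gaps between first and last listing)
    time  = as.vector(last - first + 1),
    # entry and exit are both observed only when the medicine is absent
    # from the first and from the last month of the window
    event = as.vector(!(first == 1 | last == last_month)))
print(durations)
```

Only 'b' counts as an event; 'a' was already listed when the window opened and 'c' was still listed when it closed, so both are censored. A table like this, with one row per medicine, is the shape that Surv(time, event) in the survival package expects.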
Looking at the data itself, it seems I have only very sparse data which are not censored:
step2 %>%
    group_by(.,Name,Area) %>%
    summarize(.,time=max(monthno)-min(monthno)+1,  # reconstructed line; time definition assumed
        event=!(min(monthno)==1 | max(monthno)==14) ) %>%
    mutate(.,time=time+as.numeric(event) ) %>%
    select(.,Name,Area,time,event) %>%
    xtabs(~ time + event,.)
    event
time FALSE TRUE
  1      8    0
  2     14    1
  3      8    3
  4      2    0
  5      9    1
  6      5    2
  7      7    4
  8      1    1
  9      7    0
  10     2    0
  11     9    1
  12     5    2
  13     3    1
  14     3    0
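
To put a number on how sparse the uncensored data are, the two columns of the table can be added up. The counts below are copied from the table; the first column is taken to be the censored medicines (event FALSE) and the second the medicines for which both entry and exit were observed.

```r
# Counts copied from the table above: rows are time 1..14,
# columns are event FALSE (censored) and TRUE (entry and exit observed).
censored <- c(8, 14, 8, 2, 9, 5, 7, 1, 7, 2, 9, 5, 3, 3)
observed <- c(0,  1, 3, 0, 1, 2, 4, 1, 0, 0, 1, 2, 1, 0)

total <- sum(censored) + sum(observed)
print(c(medicines = total,
        observed  = sum(observed),
        share     = round(sum(observed) / total, 2)))
```

The total of 99 agrees with the number of medicines mentioned earlier, and only 16 of them have an observed event.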
