Medicines under evaluation
While browsing the internet I ran into the medicines under evaluation page of the European Medicines Agency. There was a .pdf for each month, and I thought it would be interesting to see whether I could analyze those data. Each .pdf contains several sections of medicines; I have focused on the ‘Non-orphan medicinal products’. The information given is the international non-proprietary name and the area of medicine. In addition, an entry in bold is new. The .pdf I looked at contained a fairly short list, about 40 or 50 entries, a few of them in bold. Hence I had the feeling that about a year's worth of data would see many medicines entering and leaving the list. I decided to take data starting January 2014, which gives 14 months of data.
Importing data
Reading from .pdf is difficult, and this proved no exception. Rather than typing all the data, I pulled Tabula (‘Tabula is a tool for liberating data tables locked inside PDF files’) from the internet. I marked all tables and converted them to spreadsheets. Unfortunately, Tabula did not understand that two text lines within one cell of the .pdf form a single string, so quite some post processing was done to rectify this. Still, it is probably easier than typing all the data. Finally, I ended up with 14 .csv files, which could be imported into R. As the code below shows, some post processing was still needed in R, because spacing was used inconsistently. The empty lines are a remnant of my merging texts into cells; I thought it more convenient to strip these in R than in each spreadsheet. The superscript 1 refers to a footnote which is indicated in some of the .pdf files and which got transformed into its own cell. Again, I decided to handle that part in R. Not visible in the code is some inconsistent use of capitals, which was resolved by editing the spreadsheets.
library(dplyr)
library(magrittr) # provides the %<>% compound assignment pipe used below
library(survival)
# all .csv files in the working directory, one per month
csvs <- dir(pattern='\\.csv$')
step1 <- lapply(csvs,function(csv) {
      print(csv)
      r1 <- readLines(csv)
      r1 <- r1[r1!=','] # empty lines
      r1 <- r1[r1!=',1'] # lines with superscript 1
      # parse the remaining lines, then normalize spacing in both columns
      r1 %<>% read.csv(text=., skip=1,col.names=c('Name','Area'),header=FALSE) %>%
          mutate(.,
              Name=gsub("([[:space:]]+$)|(^[[:space:]]+)", "", Name),
              Name=gsub(' +',' ',Name),
              Name=gsub("/",' / ',Name),
              Name=gsub("-",' - ',Name),
              Name=gsub(' +',' ',Name),
              Area=gsub(' *- *','-',Area),
              Area=gsub(' +',' ',Area),
              Area=gsub("([[:space:]]+$)|(^[[:space:]]+)", "", Area),
              csvs=csv)
    }) %>%
    do.call(rbind,.)
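To make the effect of these cleaning steps concrete, here is a minimal sketch in base R on a few made-up lines. The medicine names and areas in `raw` are hypothetical and only mimic the kind of spacing problems described above; they are not taken from the actual files.

```r
# Hypothetical raw lines, mimicking a spreadsheet exported from Tabula
raw <- c('Name,Area',
         ' drug a /drug b ,Oncology',
         ',',                        # empty line left over from merging cells
         ',1',                       # footnote marker that became its own cell
         'drug  c,Anti - infectives')

# drop empty lines and lines holding only the superscript 1
raw <- raw[!raw %in% c(',', ',1')]

df <- read.csv(text = raw, skip = 1, header = FALSE,
               col.names = c('Name', 'Area'), stringsAsFactors = FALSE)

# put spaces around '/', collapse repeated spaces, trim both ends
df$Name <- trimws(gsub(' +', ' ', gsub('/', ' / ', df$Name)))
# remove spaces around hyphens in the area
df$Area <- trimws(gsub(' *- *', '-', df$Area))

print(df)
```

After this, the two remaining rows read 'drug a / drug b' in 'Oncology' and 'drug c' in 'Anti-infectives', so the same medicine is spelled identically in every month.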
Since step1 now contains the name of the source file rather than the actual month, the month is added next.
csvsdf <- data.frame(csvs=csvs) %>%
    mutate(.,
        # the levels must match the three-letter prefixes of the file names
        months=factor(tolower(substr(csvs,1,3)),
            levels=c('jan','feb','mrt','apr','may','jun',
                'jul','aug','sep','oct','nov','dec')),
        year=as.numeric(substr(csvs,5,8)),
        # running month number, starting at 1 for the first month
        monthno=12*year+c(1:12)[months],
        monthno=monthno-min(monthno)+1) %>%
    arrange(.,monthno)
step2 <- merge(step1,csvsdf) %>%
    mutate(.,Name=factor(Name),
        Area=factor(Area)) %>%
    arrange(.,Area,Name,monthno)
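The month numbering can be checked by hand on a small example. The sketch below assumes file names of the form 'jan_2014.csv'; that naming is a guess based on the substr() positions used above, and the three names are hypothetical.

```r
# Hypothetical file names: three-letter month, separator, four-digit year
files <- c('feb_2015.csv', 'jan_2014.csv', 'dec_2014.csv')

month_levels <- c('jan','feb','mrt','apr','may','jun',
                  'jul','aug','sep','oct','nov','dec')

month <- match(tolower(substr(files, 1, 3)), month_levels)  # 2, 1, 12
year  <- as.numeric(substr(files, 5, 8))                    # 2015, 2014, 2014

# months since year 0, then shifted so the earliest file is month 1
monthno <- 12 * year + month
monthno <- monthno - min(monthno) + 1

print(data.frame(files, monthno))
```

January 2014 becomes month 1, December 2014 month 12 and February 2015 month 14, which agrees with the 14 months of data.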
Areas of medicine
Duration
Looking at the data itself, it seems I have only very sparse data which are not censored.
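One possible way to see how few medicines are fully observed is sketched below. This is an illustration under assumptions, not necessarily the calculation used for this post: the `toy` data are made up, gaps within a medicine's stay on the list are ignored, and a medicine counts as having left only when its last appearance falls before the final month.

```r
library(survival)

# Hypothetical listing: one row per medicine per month in which it appears
toy <- data.frame(Name    = c('a','a','a', 'b','b', 'c','c','c'),
                  monthno = c( 1,  2,  3,   2,  3,   3,  4,  5))

first_month <- min(toy$monthno)
last_month  <- max(toy$monthno)

# first and last month on the list for each medicine
spans <- do.call(rbind, lapply(split(toy, toy$Name), function(d) {
  data.frame(Name = d$Name[1], first = min(d$monthno), last = max(d$monthno))
}))

spans$duration  <- spans$last - spans$first + 1
# already on the list in the first month, so the true start is unknown
spans$left_open <- spans$first == first_month
# 1 = seen leaving the list, 0 = still listed in the final month (right censored)
spans$event     <- as.integer(spans$last < last_month)

# survival object with right censoring; it could be passed on to survfit()
s <- Surv(spans$duration, spans$event)

# medicines with both entry and exit observed
print(spans[!spans$left_open & spans$event == 1, ])
```

In this toy example only medicine 'b' has both its entry and its exit inside the observation window; 'a' was already listed in the first month and 'c' is still listed in the last.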