After importing the data, I started in the classic way, with data visualisation. This database contains a lot of text data, and for that a wordcloud is always welcome. Wordclouds may not be very accurate, but from my point of view they are a very good illustration of a text-based dataset.
To my knowledge there are two main wordcloud packages in R: wordcloud and wordcloud2.
Let’s play with this.
Prepare the data
library(dplyr)  # provides %>%, arrange() and desc()
library(knitr)  # provides kable()

# Read the previously imported data
db <- readRDS("raw_rds/bdpm.rds")
For example, there is the pharmaceutical form column.
## "pommade"
## "capsule molle"
## "solution injectable"
## "solution injectable"
## "suspension à diluer pour perfusion"
## "poudre et pommade et comprimé et granules et solution(s)"
There are a lot of different forms:
uforme <- unique(db$forme)
length(uforme)
##  405
405 different forms. But sometimes there are multiple forms in one line, separated by “et” (French for “and”). Let’s try to find the truly distinct forms.
forms <- db$forme %>%
  strsplit(split = " et ") %>%
  unlist()
length(unique(forms))
##  393
## "pommade"
## "capsule molle"
## "solution injectable"
## "solution injectable"
## "suspension à diluer pour perfusion"
## "poudre"
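To see what the split does, here is a minimal sketch on a toy vector. The two strings are taken from the output above; forme_demo and parts are illustrative names, and fixed = TRUE is my own addition so that " et " is matched literally rather than as a regular expression.

```r
# Toy stand-in for db$forme: one single form and one combined entry
forme_demo <- c(
  "pommade",
  "poudre et pommade et comprimé et granules et solution(s)"
)

# Split every entry on the literal separator " et " and flatten the list
parts <- unlist(strsplit(forme_demo, split = " et ", fixed = TRUE))

print(parts)
# Number of distinct forms once combined entries are split
print(length(unique(parts)))
```

The combined entry contributes five pieces, and "pommade" is only counted once among the distinct forms even though it appears in both entries.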
Select the 100 most frequent
##                                  forms Freq
## 1                             comprimé   17
## 2                      comprimé enrobé   19
## 3                   comprimé pelliculé   36
## 4 comprimé pelliculé buvable pelliculé    1
## 5         comprimé pelliculé pelliculé    1
## 6                                crème   17
cent <- forms %>%
  table() %>%
  as.data.frame() %>%
  arrange(desc(Freq)) %>%
  head(100)
kable(head(cent))
comprimé pelliculé 2289
solution injectable 934
comprimé sécable 852
Make a wordcloud with wordcloud
library(wordcloud)
wordcloud(cent$., freq = cent$Freq)
Not bad. Try something funkier.
wordcloud(
  words = cent$.,
  freq = cent$Freq,
  random.color = TRUE,
  random.order = FALSE,
  colors = brewer.pal(8, "Dark2")
)
I find this very informative. Intuitively, it’s possible to see what the most frequent forms are. And it is far more attractive than a table or an unreadable barplot.
library(ggplot2)
ggplot(cent) +
  aes(x = ., y = Freq) +
  geom_bar(stat = "identity") +
  coord_flip()
OK, it would be possible to make a better plot, but I think you see the point.
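For what it is worth, here is a minimal sketch of a more readable barplot, assuming the bars are ordered by frequency and only the top forms are kept. The small data frame stands in for cent and reuses the three counts from the table above; top_forms and the column name form are illustrative, not part of the original analysis.

```r
library(ggplot2)

# Stand-in for cent, limited to the three counts shown in the table above
top_forms <- data.frame(
  form = c("comprimé pelliculé", "solution injectable", "comprimé sécable"),
  Freq = c(2289, 934, 852)
)

# reorder() sorts the bars by frequency instead of alphabetically
p <- ggplot(top_forms) +
  aes(x = reorder(form, Freq), y = Freq) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Frequency")

print(p)
```

With the real data, something like head(cent, 20) in place of top_forms would keep the axis labels legible.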
wordcloud2 produces an HTML widget
It’s easier and more fun! Try it, it’s interactive.
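A minimal sketch of a wordcloud2 call, assuming the package is installed. wordcloud2() expects a data frame with the words in the first column and the frequencies in the second; the small data frame below stands in for cent and reuses three counts from the frequency table. The names top_forms and widget are illustrative, and size = 0.5 is a guess to help long labels fit.

```r
# Stand-in for cent: words in the first column, frequencies in the second
top_forms <- data.frame(
  word = c("comprimé pelliculé", "solution injectable", "comprimé sécable"),
  freq = c(2289, 934, 852)
)

# Only build the widget if the wordcloud2 package is available
if (requireNamespace("wordcloud2", quietly = TRUE)) {
  widget <- wordcloud2::wordcloud2(data = top_forms, size = 0.5)
  print(widget)  # displays the interactive widget in the viewer or browser
}
```

With the real data, passing cent directly should work as well, since its two columns are already in that order.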
Enough with wordclouds. We have seen that the “comprimé” (tablet) pharmaceutical form is the most frequent, followed by the “gélule” (capsule), “poudre” (powder) and “granule” (small pill). We can also see that some text cleaning would be necessary to make a proper analysis.
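As a possible starting point for that cleaning, here is a minimal sketch in base R. It assumes the main problems are stray whitespace, inconsistent case and accidentally repeated words (such as "comprimé pelliculé pelliculé"); clean_form is a hypothetical helper, not something used in the analysis above.

```r
# Hypothetical helper: normalise whitespace and case, then drop a word
# when it immediately repeats the previous one
clean_form <- function(x) {
  x <- tolower(trimws(x))
  x <- gsub("\\s+", " ", x)  # collapse runs of whitespace into one space
  words <- strsplit(x, " ", fixed = TRUE)
  vapply(
    words,
    function(w) paste(rle(w)$values, collapse = " "),  # rle() merges consecutive duplicates
    character(1)
  )
}

cleaned <- clean_form(c("  solution   injectable injectable ", "Pommade"))
print(cleaned)
```

Applied to the real data, this would be called on forms before counting, for example clean_form(forms) followed by table().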