(This article was first published on **R – Tales of R**, and kindly contributed to R-bloggers)

Wordclouds are one of the most visually straightforward and compelling ways of displaying text information in a graph.

Of course, there are plenty of web pages (and even apps) that, given an input text, will plot you some nice tagclouds. However, when you need reproducible results, or have to get complex tasks done – like combining wordclouds from several files – a programming environment may be the best option.

In R, there are (as always), several alternatives to get this done, such as tagcloud and wordcloud.

For this script I used the following packages:

- "RCurl" to retrieve a **PMID list**, stored in my GitHub account as a **.csv** file.
- "RefManageR" and "plyr" to retrieve and arrange PubMed records. To fetch the info from the internet, we'll be using the PubMed API (free version, with some limitations).
- Finally, "tm" and "SnowballC" to prepare the data and "wordcloud" to plot the wordcloud. This part of the script is based on this post from Georeferenced.

One of the advantages of using RefManageR is that you can easily change the field you are importing from, and it usually works flawlessly with the PubMed API.

My biggest problem sources when running this script: download caps, busy hours, and firewalls!

At the beginning of the gist, there is also a handy function that automagically downloads all needed packages for you.
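For reference, such a helper typically looks like the sketch below (`instalload` is a made-up name here; the actual function in the gist may differ):

```r
# Install any packages that are missing, then load everything
instalload <- function(pkgs) {
  missing <- pkgs[!pkgs %in% rownames(installed.packages())]
  if (length(missing) > 0) install.packages(missing)
  invisible(lapply(pkgs, library, character.only = TRUE))
}
# e.g. (uncomment to install/load the packages used in this post):
# instalload(c("RCurl", "RefManageR", "plyr", "tm", "SnowballC", "wordcloud"))
```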

To source the script, simply type in the R console:

```
library(devtools)
source_url("https://gist.githubusercontent.com/aurora-mareviv/697cbb505189591648224ed640e70fb1/raw/b42ac2e361ede770e118f217494d70c332a64ef8/pmid.tagcloud.R")
```

And here is the code:

Enjoy!

To **leave a comment** for the author, please follow the link and comment on their blog: **R – Tales of R**.



(This article was first published on **R – Statistical Modeling, Causal Inference, and Social Science**, and kindly contributed to R-bloggers)

Wonderful, indeed, to have an RStan book in Japanese:

- Kentarou Matsuura. 2016. *Bayesian Statistical Modeling Using Stan and R*. Wonderful R Series, Volume 2. Kyoritsu Shuppan Co., Ltd.

Google Translate makes the following of the description posted on Amazon Japan (linked from the title above):

In recent years, understanding of the phenomenon by fitting a mathematical model using a probability distribution on data and prompts the prediction “statistical modeling” has attracted attention. Advantage when compared with the existing approach is both of the goodness of the interpretation of the ease and predictability. Since interpretation is likely to easily connect to the next action after estimating the values in the model. It is rated as very effective technique for data analysis Therefore reality.

In the background, the improvement of the calculation speed of the computer, that the large scale of data becomes readily available, there are advances in stochastic programming language to very simple trial and error of modeling. From among these languages, in this document to introduce Stan is a free software. Stan is a package which is advancing rapidly the development equipped with a superior algorithm, it can easily be used from R because the package for R RStan has been published in parallel. Descriptive power of Stan is high, the hierarchical model and state space model can be written in as little as 30 lines, estimated calculation is also carried out automatically. Further tailor-made extensions according to the analyst of the problem is the easily possible.

In general, dealing with the Bayesian statistics books or not to remain in rudimentary content, what is often difficult application to esoteric formulas many real problem. However, this book is a clear distinction between these books, and finished to a very practical content put the reality of the data analysis in mind. The concept of statistical modeling was wearing through the Stan and R in this document, even if the change is grammar of Stan, even when dealing with other statistical modeling tools, I’m sure a great help.

I’d be happy to replace this with a proper translation if there’s a Japanese speaker out there with some free time (Masanao Yajima translated the citation for us).

**Big in Japan?**

I’d like to say Stan’s big in Japan, but that idiom implies it’s not so big elsewhere. I can say there’s a very active Twitter community tweeting about Stan in Japanese, which we follow occasionally using Google Translate.

The post A book on RStan in Japanese: *Bayesian Statistical Modeling Using Stan and R* (Wonderful R, Volume 2) appeared first on Statistical Modeling, Causal Inference, and Social Science.

To **leave a comment** for the author, please follow the link and comment on their blog: **R – Statistical Modeling, Causal Inference, and Social Science**.


(This article was first published on **R – Modern Data**, and kindly contributed to R-bloggers)

**By Carson Sievert, lead Plotly R developer**

I'm excited to announce that plotly's R package just sent its first CRAN update in nearly four months. To install the update, run `install.packages("plotly")`.

This update has breaking changes, enables new features, fixes numerous bugs, and takes us from version 3.6.0 to 4.5.2. To see all the changes, I encourage you to read the NEWS file. In this post, I'll highlight the most important changes, explain why they needed to happen, and provide some tips for fixing errors brought about by this update. As you'll see, this update is mostly about improving the `plot_ly()` interface, so `ggplotly()` users won't notice much (if any) change. I've also started a plotly for R book which provides more narrative than the documentation on https://plot.ly/r (which is now updated to 4.0), more recent examples, and features exclusive to the R package. The first three chapters are nearly finished and replace the package vignettes. The later chapters are still in their beginning stages – they discuss features that are still under development, but I plan to add stability and more documentation in the coming months.

In the past, you could use an *expression* to reference variable(s) in a data frame, but this no longer works. Consequently, you might see an error like this when you update:

```
library(plotly)
plot_ly(mtcars, x = mpg, y = sqrt(wt))
## Error in plot_ly(mtcars, x = mpg, y = sqrt(wt)): object 'wt' not found
```

`plot_ly()` now requires a *formula* (which is basically an expression, but with a `~` prefixed) when referencing variables. You do not *have to* use a formula to reference objects that exist in the namespace, but I recommend it, since it helps populate sensible axis/guide title defaults (e.g., compare the output of `plot_ly(z = volcano)` with `plot_ly(z = ~volcano)`).

```
plot_ly(mtcars, x = ~mpg, y = ~sqrt(wt))
```

There are a number of technical reasons why imposing this change from expressions to formulas is a good idea. If you’re interested in the details, I recommend reading Hadley Wickham’s notes on non-standard evaluation, but here’s the gist of the situation:

- Since formulas capture the environment in which they are created, we can be confident that evaluation rules are always correct, no matter the context.
- Compared to expressions/symbols, formulas are easier to program with, which makes writing custom functions around `plot_ly()` easier.
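The first point deserves a tiny base-R illustration (a hypothetical helper, not from the plotly codebase): a formula remembers the environment where it was created, so the variables it references can still be found later:

```r
make_formula <- function() {
  x <- c(10, 20, 30)   # only exists inside this function call
  ~x                   # the formula captures this environment
}
f <- make_formula()
# Long after the call has returned, the formula's right-hand side can
# still be evaluated in its captured environment:
eval(f[[2]], environment(f))
# [1] 10 20 30
```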

```
myPlot <- function(x, y, ...) {
  plot_ly(mtcars, x = x, y = y, color = ~factor(cyl), ...)
}
myPlot(~mpg, ~disp, colors = "Dark2")
```

Also, it's fairly easy to convert a string to a formula (e.g., `as.formula("~sqrt(wt)")`). This trick can be quite useful when programming in shiny (and a variable mapping depends on an input value).
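A minimal base-R sketch of that trick (the `input$yvar` name is just a hypothetical shiny input):

```r
var <- "sqrt(wt)"                 # imagine this came from input$yvar
f <- as.formula(paste0("~", var)) # build the formula plot_ly() expects
f
# ~sqrt(wt)
```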

Instead of always defaulting to a "scatter" trace, `plot_ly()` now infers a sensible trace type (and other attribute defaults) based on the information provided. These defaults are determined by inspecting the vector type (e.g., numeric/character/factor/etc.) of positional attributes (e.g., x/y). For example, if we supply a discrete variable to x (or y), we get a vertical (or horizontal) bar chart:

```
subplot(
  plot_ly(diamonds, y = ~cut, color = ~clarity),
  plot_ly(diamonds, x = ~cut, color = ~clarity),
  margin = 0.07
) %>% hide_legend()
```

Or, if we supply two discrete variables to both x and y:

```
plot_ly(diamonds, x = ~cut, y = ~clarity)
```

Also, the order of categories on a discrete axis, by default, is now either alphabetical (for character strings) or matches the ordering of factor levels. This makes it easier to sort categories according to something meaningful, rather than the order in which the categories appear (the old default). If you prefer the old default, use `layout(categoryorder = "trace")`.

```
library(dplyr)
# order the clarity levels by their median price
d <- diamonds %>%
  group_by(clarity) %>%
  summarise(m = median(price)) %>%
  arrange(m)
diamonds$clarity <- factor(diamonds$clarity, levels = d[["clarity"]])
plot_ly(diamonds, x = ~price, y = ~clarity, type = "box")
```

Previously, `plot_ly()` *always* produced at least one trace, even when using `add_trace()` to add on more traces (if you're familiar with ggplot2 lingo, a trace is similar to a layer). From now on, you'll have to specify the `type` in `plot_ly()` if you want it to always produce a trace:

```
subplot(
  plot_ly(economics, x = ~date, y = ~psavert, type = "scatter") %>%
    add_trace(y = ~uempmed) %>%
    layout(yaxis = list(title = "Two Traces")),
  plot_ly(economics, x = ~date, y = ~psavert) %>%
    add_trace(y = ~uempmed) %>%
    layout(yaxis = list(title = "One Trace")),
  titleY = TRUE, shareX = TRUE, nrows = 2
) %>% hide_legend()
```

Why enforce this change? Oftentimes, when composing a plot with multiple traces, you have attributes that are shared across traces (i.e., global) and attributes that are not. By allowing `plot_ly()` to simply initialize the plot and define global attributes, it makes for a much more natural way to describe such a plot. Consider the next example, where we declare x/y (longitude/latitude) attributes and alpha transparency globally, but alter trace-specific attributes in `add_trace()`-like functions. This example also takes advantage of a few other new features:

- The `group_by()` function, which defines "groups" within a trace (described in more detail in the next section).
- New `add_*()` functions, which behave like `add_trace()`, but are higher-level since they assume a trace type, might set some attribute values (e.g., `add_markers()` sets the scatter trace mode to markers), and might trigger other data processing (e.g., `add_lines()` is essentially the same as `add_paths()`, but guarantees values are sorted along the x-axis).
- Scaling is avoided for "AsIs" values (i.e., values wrapped with `I()`), which makes it easier to directly specify a constant value for a visual attribute (as opposed to mapping data values to visuals).
- More support for R's graphical parameters, such as `pch` for symbols and `lty` for line types.

```
map_data("world", "canada") %>%
  group_by(group) %>%
  plot_ly(x = ~long, y = ~lat, alpha = 0.1) %>%
  add_polygons(color = I("black"), hoverinfo = "none") %>%
  add_markers(color = I("red"), symbol = I(17),
              text = ~paste(name, "<br />", pop),
              hoverinfo = "text", data = maps::canada.cities) %>%
  hide_legend()
```

The `group` argument in `plot_ly()` has been removed in favor of the `group_by()` function. In the past, the `group` argument incorrectly created multiple traces. If you want that same behavior, use the new `split` argument, but groups are now used to define "gaps" *within* a trace. This is more consistent with how ggplot2's `group` aesthetic is translated in `ggplotly()`, and is much more efficient than plotting a trace for each group.

```
txhousing %>%
  group_by(city) %>%
  plot_ly(x = ~date, y = ~median) %>%
  add_lines(alpha = 0.3)
```

The default hovermode (compare data on hover) isn't super useful here since we have only 1 trace to compare, so you may want to add `layout(hovermode = "closest")` when using `group_by()`. If your group sizes aren't that large, you may want to use `split` to generate one trace per group, then set a constant color (using the `I()` function to avoid scaling).

```
txhousing %>%
  plot_ly(x = ~date, y = ~median) %>%
  add_lines(split = ~city, color = I("steelblue"), alpha = 0.3)
```

In the coming months, we will have better ways to identify/highlight groups to help combat overplotting (see here for example). This same interface can be used to coordinate multiple linked plots, which is a powerful tool for exploring multivariate data and presenting multivariate results (see here and here for examples).

Prior to version 4.0, plotly functions returned a data frame with special attributes attached (needed to track the plot's attributes). At the time, I thought this was the right way to enable a "data-plot-pipeline", where a plot is described as a sequence of visual mappings and data manipulations. For a number of reasons, I've been convinced otherwise, and decided the central plotly object should inherit from an htmlwidget object instead. This change does not destroy our ability to implement a "data-plot-pipeline", but it does, in a sense, constrain the set of manipulations we can perform on a plotly object. Below is a simple example of transforming the data underlying a plotly object using **dplyr**'s `mutate()` and `filter()` verbs (the plotly book has a whole section on the data-plot-pipeline, if you'd like to learn more).

```
library(dplyr)
economics %>%
  plot_ly(x = ~date, y = ~unemploy / pop, showlegend = F) %>%
  add_lines(linetype = I(22)) %>%
  mutate(rate = unemploy / pop) %>%
  slice(which.max(rate)) %>%
  add_markers(symbol = I(10), size = I(50)) %>%
  add_annotations("peak")
```

In this context, I've often found it helpful to inspect the (most recent) data associated with a particular plot, which you can do via `plotly_data()`:

```
diamonds %>%
  group_by(cut) %>%
  plot_ly(x = ~price) %>%
  plotly_data()
```

```
## Source: local data frame [53,940 x 10]
## Groups: cut [5]
##
##    carat       cut color clarity depth table price     x     y     z
##    <dbl>     <ord> <ord>   <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1   0.23     Ideal     E     SI2  61.5    55   326  3.95  3.98  2.43
## 2   0.21   Premium     E     SI1  59.8    61   326  3.89  3.84  2.31
## 3   0.23      Good     E     VS1  56.9    65   327  4.05  4.07  2.31
## 4   0.29   Premium     I     VS2  62.4    58   334  4.20  4.23  2.63
## 5   0.31      Good     J     SI2  63.3    58   335  4.34  4.35  2.75
## 6   0.24 Very Good     J    VVS2  62.8    57   336  3.94  3.96  2.48
## 7   0.24 Very Good     I    VVS1  62.3    57   336  3.95  3.98  2.47
## 8   0.26 Very Good     H     SI1  61.9    55   337  4.07  4.11  2.53
## 9   0.22      Fair     E     VS2  65.1    61   337  3.87  3.78  2.49
## 10  0.23 Very Good     H     VS1  59.4    61   338  4.00  4.05  2.39
## # ... with 53,930 more rows
```

To keep up to date with currently supported data manipulation verbs, please consult the `help(reexports)` page, and for more examples, check out the examples section under `help(plotly_data)`.

This change in the representation of a plotly object also has important implications for folks using `plotly_build()` to "manually" access or modify a plot's underlying spec. Previously, this function returned the JSON spec as an R list, but it now returns more "meta" information about the htmlwidget, so in order to access that same list, you have to grab the "x" element. The new `as_widget()` function (different from the now-deprecated `as.widget()` function) is designed to turn a plotly spec into an htmlwidget object.

```
# the style() function provides a more elegant way to do this sort of thing,
# but I know some people like to work with the list object directly...
pl <- plotly_build(qplot(1:10))[["x"]]
pl$data[[1]]$hoverinfo <- "none"
as_widget(pl)
```

The latest CRAN release upgrades plotly's R package from version 3.6.0 to 4.5.2. This upgrade includes a number of breaking changes, as well as a ton of new features and bug fixes. The time spent upgrading your code will be worth it, as the update enables a ton of new features and provides a better foundation for advancing the `plot_ly()` interface (not to mention the linked highlighting stuff we have on tap). This post should provide the information necessary to fix these breaking changes, but if you have any trouble upgrading, please let us know on http://community.plot.ly. Happy plotting!

To **leave a comment** for the author, please follow the link and comment on their blog: **R – Modern Data**.


(This article was first published on **R-exercises**, and kindly contributed to R-bloggers)

R's `boxplot` function has a lot of useful parameters allowing us to change the behaviour and appearance of the boxplot graphs. In this exercise we will try to use those parameters in order to replicate the visual style of Matlab's boxplot. Before trying out this exercise please make sure that you are familiar with the following functions: `bxp`, `boxplot`, `axis`, `mtext`.

Here is the plot we will be replicating:

We will be using the same **iris** dataset, which is available in R by default in the variable of the same name, `iris`. The exercises will require you to make incremental changes to the default boxplot style.

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

**Exercise 1**

Make a default boxplot of Sepal.Width stratified by Species.

**Exercise 2**

Change the range of the y-axis so it starts at 2 and ends at 4.5.

**Exercise 3**

Modify the boxplot function so it draws neither ticks nor labels on the x and y axes.

**Exercise 4**

Add notches (triangular dents around the median representing confidence intervals) to the boxes in the plot.

**Exercise 5**

Increase the distance between boxes in the plot.

**Exercise 6**

Change the color of the box borders to blue.

**Exercise 7**

a. Change the color of the median lines to red.

b. Change the line width of the median line to 1.

**Exercise 8**

a. Change the color of the outlier points to red.

b. Change the symbol of the outlier points to “+”.

c. Change the size of the outlier points to 0.8.

**Exercise 9**

a. Add the title to the boxplot (try to replicate the style of Matlab's boxplot).

b. Add the y-axis label to the boxplot (try to replicate the style of Matlab's boxplot).

**Exercise 10**

a. Add the x-axis (try to make it resemble the x-axis in Matlab's boxplot).

b. Add the y-axis (try to make it resemble the y-axis in Matlab's boxplot).

c. Add the y-axis ticks on the other side.

NOTE: You can use `format(as.character(c(2, 4.5)), drop0trailing=TRUE, justify="right")` to obtain the text for the y-axis labels.
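For instance, the right-justification pads the shorter label so the two line up:

```r
format(as.character(c(2, 4.5)), drop0trailing = TRUE, justify = "right")
# [1] "  2" "4.5"
```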

To **leave a comment** for the author, please follow the link and comment on their blog: **R-exercises**.



(This article was first published on **DataScience+**, and kindly contributed to R-bloggers)

We can use unsupervised machine learning to identify which drugs are associated with which adverse events. Specifically, machine learning can help us to create clusters based on gender, age, outcome of adverse event, route the drug was administered, purpose the drug was used for, body mass index, etc. This can help us quickly discover hidden associations between drugs and adverse events.

Clustering is an unsupervised learning technique with wide applications. Some examples where clustering is commonly applied are market segmentation, social network analytics, and astronomical data analysis. Clustering is the grouping of data into sub-groups so that objects within a cluster have high similarity to one another, but are very dissimilar to objects in other clusters. For clustering, each pattern is represented as a vector in multidimensional space, and a distance measure is used to find the dissimilarity between instances. In this post, we will see how we can use hierarchical clustering to identify drug adverse events. You can read about hierarchical clustering on Wikipedia.
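As a tiny base-R illustration of that idea (synthetic points, not the adverse-event data): two well-separated groups in 2-D space are recovered cleanly by hierarchical clustering on Euclidean distances:

```r
set.seed(1)
# 10 points near (0, 0) and 10 points near (5, 5)
toy <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 5), ncol = 2))
d  <- dist(toy, method = "euclidean")  # pairwise dissimilarities
hc <- hclust(d, method = "ward.D")     # agglomerative clustering
table(cutree(hc, k = 2))               # two clusters of 10 points each
```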

Let’s create fake drug adverse event data where we can visually identify the clusters and see if our machine learning algorithm can identify the clusters. If we have millions of rows of adverse event data, clustering can help us to summarize the data and get insights quickly.

Let’s assume a drug AAA results in adverse events shown below. We will see in which group (cluster) the drug results in what kind of reactions (adverse events).

In the table shown below, I have created four clusters:

- Route=ORAL, Age=60s, Sex=M, Outcome code=OT, Indication=RHEUMATOID ARTHRITIS and Reaction=VASCULITIC RASH + some noise
- Route=TOPICAL, Age=early 20s, Sex=F, Outcome code=HO, Indication=URINARY TRACT INFECTION and Reaction=VOMITING + some noise
- Route=INTRAVENOUS, Age=about 5, Sex=F, Outcome code=LT, Indication=TONSILLITIS and Reaction=VOMITING + some noise
- Route=OPHTHALMIC, Age=early 50s, Sex=F, Outcome code=DE, Indication=Senile osteoporosis and Reaction=Sepsis + some noise

Below is a preview of my data. You can download the data here.

```
head(my_data)
  route age sex outc_cod              indi_pt              pt
1  ORAL  63   M       OT RHEUMATOID ARTHRITIS VASCULITIC RASH
2  ORAL  66   F       OT RHEUMATOID ARTHRITIS VASCULITIC RASH
3  ORAL  66   M       OT RHEUMATOID ARTHRITIS VASCULITIC RASH
4  ORAL  57   M       OT RHEUMATOID ARTHRITIS VASCULITIC RASH
5  ORAL  66   M       OT RHEUMATOID ARTHRITIS VASCULITIC RASH
6  ORAL  66   M       OT RHEUMATOID ARTHRITIS VASCULITIC RASH
```

To perform hierarchical clustering, we need to change the text to numeric values so that we can calculate distances. Since age is already numeric, we will set it aside from the rest of the variables and convert the character variables to a multidimensional numeric space.

```
library(dplyr) # for select()
age = my_data$age
my_data = select(my_data, -age)
```

```
my_matrix = as.data.frame(do.call(cbind, lapply(my_data, function(x) table(1:nrow(my_data), x))))
```
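The `table()` call above is what expands each character column into indicator (one-hot) columns; here is a tiny, self-contained illustration with made-up values:

```r
# Cross-tabulating the row number against a character vector yields one
# column per category, with a 1 marking each row's value
x <- c("ORAL", "TOPICAL", "ORAL")
as.data.frame.matrix(table(1:3, x))
#   ORAL TOPICAL
# 1    1       0
# 2    0       1
# 3    1       0
```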

Now, we can add the age column:

```
my_matrix$Age = age
head(my_matrix)
  INTRAVENOUS OPHTHALMIC ORAL TOPICAL F M DE HO LT OT RHEUMATOID ARTHRITIS Senile osteoporosis
1           0          0    1       0 0 1  0  0  0  1                    1                   0
2           0          0    1       0 1 0  0  0  0  1                    1                   0
3           0          0    1       0 0 1  0  0  0  1                    1                   0
4           0          0    1       0 0 1  0  0  0  1                    1                   0
5           0          0    1       0 0 1  0  0  0  1                    1                   0
6           0          0    1       0 0 1  0  0  0  1                    1                   0
  TONSILLITIS URINARY TRACT INFECTION Sepsis VASCULITIC RASH VOMITING Age
1           0                       0      0               1        0  63
2           0                       0      0               1        0  66
3           0                       0      0               1        0  66
4           0                       0      0               1        0  57
5           0                       0      0               1        0  66
6           0                       0      0               1        0  66
```

Let's normalize our variables using the *caret* package.

```
library(caret)
preproc = preProcess(my_matrix)
my_matrixNorm = as.matrix(predict(preproc, my_matrix))
```

Next, let’s calculate distance and apply hierarchical clustering and plot the dendrogram.

```
distances = dist(my_matrixNorm, method = "euclidean")
clusterdrug = hclust(distances, method = "ward.D")
plot(clusterdrug, cex = 0.5, labels = FALSE, xlab = "", sub = "")
```

From the dendrogram shown above, we see that four distinct clusters can be created from the fake data we created. Let’s use different colors to identify the four clusters.

```
# install.packages("dendextend") # if not yet installed
library(dendextend)
dend <- as.dendrogram(clusterdrug)
# Color the branches based on the clusters:
dend <- color_branches(dend, k = 4)
# We hang the dendrogram a bit:
dend <- hang.dendrogram(dend, hang_height = 0.1)
# reduce the size of the labels:
dend <- set(dend, "labels_cex", 0.5)
plot(dend)
```

Now, let’s create cluster groups with four clusters.

```
clusterGroups = cutree(clusterdrug, k = 4)
```

Now, let’s add the clusterGroups column to the original data.

```
my_data = cbind(data.frame(Cluster = clusterGroups), my_data, age)
head(my_data)
  Cluster route sex outc_cod              indi_pt              pt age
1       1  ORAL   M       OT RHEUMATOID ARTHRITIS VASCULITIC RASH  63
2       1  ORAL   F       OT RHEUMATOID ARTHRITIS VASCULITIC RASH  66
3       1  ORAL   M       OT RHEUMATOID ARTHRITIS VASCULITIC RASH  66
4       1  ORAL   M       OT RHEUMATOID ARTHRITIS VASCULITIC RASH  57
5       1  ORAL   M       OT RHEUMATOID ARTHRITIS VASCULITIC RASH  66
6       1  ORAL   M       OT RHEUMATOID ARTHRITIS VASCULITIC RASH  66
```

Let's count the number of observations in each cluster:

```
observationsH = c()
for (i in seq(1, 4)) {
  # number of observations assigned to cluster i
  observationsH = c(observationsH, sum(clusterGroups == i))
}
observationsH = as.data.frame(list(cluster = c(1:4), Number_of_observations = observationsH))
observationsH
  cluster Number_of_observations
1       1                     20
2       2                     13
3       3                     15
4       4                     24
```

Let’s calculate column average for each cluster.

```
z = do.call(cbind, lapply(1:4, function(i) round(colMeans(subset(my_matrix, clusterGroups == i)), 2)))
colnames(z) = paste0('cluster', seq(1, 4))
z
            cluster1 cluster2 cluster3 cluster4
INTRAVENOUS     0.00     0.00     1.00     0.00
OPHTHALMIC      0.00     0.00     0.00     0.92
ORAL            1.00     0.08     0.00     0.08
TOPICAL         0.00     0.92     0.00     0.00
F               0.10     0.85     0.80     1.00
M               0.90     0.15     0.20     0.00
DE              0.00     0.00     0.00     0.83
.....
```

Next, the most common observation in each cluster:

```
Age = z[nrow(z), ]
z = z[1:(nrow(z) - 1), ]
# the categorical variables (excluding the Cluster and age columns added above)
vars = setdiff(names(my_data), c("Cluster", "age"))
my_result = matrix(0, ncol = 4, nrow = length(vars))
for (i in seq(1, 4)) {
  for (j in seq_along(vars)) {
    q = as.vector(as.matrix(unique(my_data[vars[j]])))
    my_result[j, i] = names(sort(z[q, i], decreasing = TRUE)[1])
  }
}
colnames(my_result) = paste0('Cluster', seq(1, 4))
rownames(my_result) = vars
my_result = rbind(Age, my_result)
my_result <- cbind(Attribute = c("Age", "Route", "Sex", "Outcome Code", "Indication preferred term", "Adverse event"), my_result)
rownames(my_result) <- NULL
my_result
  Attribute                 Cluster1             Cluster2                Cluster3    Cluster4
  Age                       61.8                 17.54                   5.8         44.62
  Route                     ORAL                 TOPICAL                 INTRAVENOUS OPHTHALMIC
  Sex                       M                    F                       F           F
  Outcome Code              OT                   HO                      LT          DE
  Indication preferred term RHEUMATOID ARTHRITIS URINARY TRACT INFECTION TONSILLITIS Senile osteoporosis
  Adverse event             VASCULITIC RASH      VOMITING                VOMITING    Sepsis
```

We see that we have created the clusters using hierarchical clustering. From cluster 1, for males in their 60s, the drug results in vasculitic rash when taken for rheumatoid arthritis. We can interpret the other clusters similarly. Remember, this is not real data; it is fake data made up to show the application of clustering to drug adverse event studies. From this short post, we see that clustering can be used for knowledge discovery in drug adverse event reactions. Especially in cases where the data has millions of observations, and we cannot get any insight visually, clustering becomes handy for summarizing our data, getting statistical insights, and discovering new knowledge.


**Related Post**

- GoodReads: Exploratory data analysis and sentiment analysis (Part 2)
- GoodReads: Webscraping and Text Analysis with R (Part 1)
- Euro 2016 analytics: Who’s playing the toughest game?
- Integrating R with Apache Hadoop
- Plotting App for ggplot2

To **leave a comment** for the author, please follow the link and comment on their blog: **DataScience+**.


Talking to some fellow colleagues, we couldn't help but notice that **maybe** in another era this decision would have been good policy. The problem, some concluded, was the influence of social media today. In fact, the Trump debacle did cause outcry among leading political voices online.

I wanted to investigate this further, and thankfully for me, I’ve been using **R** to collect tweets from a catalog of leading political personalities in Mexico for a personal business project.

Here is a short descriptive look at what the 65 Twitter accounts I'm following tweeted between August 27th and September 5th (the Donald announced his visit on August 30th). I'm sorry I can't share the dataset, but you get the idea with the code…

```
library(dplyr)
library(stringr)
# 42 of the 65 accounts tweeted between those dates.
d %>%
summarise("n" = n_distinct(NOMBRE))
# n
# 42
```

We can see how mentions of trump spike just about the time it was announced…

```
library(lubridate) # for month(), day() and hour()
byhour <- d %>%
mutate("MONTH" = as.numeric(month(T_CREATED)),
"DAY" = as.numeric(day(T_CREATED)),
"HOUR" = as.numeric(hour(T_CREATED)),
"TRUMP_MENTION" = str_count(TXT, pattern = "Trump|TRUMP|trump")) %>%
group_by(MONTH, DAY, HOUR) %>%
summarise("N" = n(),
"TRUMP_MENTIONS" = sum(TRUMP_MENTION)) %>%
mutate("PCT_MENTIONS" = TRUMP_MENTIONS/N*100) %>%
arrange(desc(MONTH), desc(DAY), HOUR) %>%
mutate("CHART_DATE" = as.POSIXct(paste0("2016-",MONTH,"-",DAY," ", HOUR, ":00")))
library(ggplot2)
library(eem)
ggplot(byhour,
aes(x = CHART_DATE,
y = PCT_MENTIONS)) +
geom_line(colour=eem_colors[1]) +
theme_eem()+
labs(x = "Time",
y = "Trump mentions \n (% of Tweets)")
```
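As an aside, a case-insensitive regex matches the same mentions as the `"Trump|TRUMP|trump"` alternation used above (and would also catch mixed-case spellings):

```r
library(stringr)
txt <- c("Trump visits Mexico", "no mention here", "TRUMP! trump!")
# Count mentions regardless of case
str_count(txt, regex("trump", ignore_case = TRUE))
# [1] 1 0 2
```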

The peak of mentions (as a percentage of tweets) was September 1st at 6 am (75%). But in terms of the number of tweets, it is much more obvious that the outcry followed the announcement and later visit of the candidate:

```
ggplot(byhour,
aes(x = CHART_DATE,
y = TRUMP_MENTIONS)) +
geom_line(colour=eem_colors[1]) +
theme_eem()+
labs(x = "Time",
y = "Trump mentions \n (# of Tweets)")
```

We can also (sort of) identify the effect of these influencers tweeting. I'm going to add up the followers, which are potential viewers, of each tweet mentioning Trump, by hour.

```
byaudience <- d %>%
mutate("MONTH" = as.numeric(month(T_CREATED)),
"DAY" = as.numeric(day(T_CREATED)),
"HOUR" = as.numeric(hour(T_CREATED)),
"TRUMP_MENTION" = str_count(TXT, pattern = "Trump|TRUMP|trump")) %>%
filter(TRUMP_MENTION > 0) %>%
group_by(MONTH, DAY, HOUR) %>%
summarise("TWEETS" = n(),
"AUDIENCE" = sum(U_FOLLOWERS)) %>%
arrange(desc(MONTH), desc(DAY), HOUR) %>%
mutate("CHART_DATE" = as.POSIXct(paste0("2016-",MONTH,"-",DAY," ", HOUR, ":00")))
ggplot(byaudience,
aes(x = CHART_DATE,
y = AUDIENCE)) +
geom_line(colour=eem_colors[1]) +
theme_eem()+
labs(x = "Time",
y = "Potential audience \n (# of followers)")
```

So clearly, I’m stating the obvious. People were talking. But how was the conversation being developed? Let’s first see the type of tweets (RT’s vs drafted individually):

```
bytype <- d %>%
mutate("TRUMP_MENTION" = str_count(TXT, pattern = "Trump|TRUMP|trump")) %>%
# only the tweets that mention trump
filter(TRUMP_MENTION>0) %>%
group_by(T_ISRT) %>%
summarise("count" = n())
library(knitr) # for kable()
kable(bytype)
```

| T_ISRT | count |
|--------|-------|
| FALSE  | 313   |
| TRUE   | 164   |

About 1 in 3 was a RT. Compared to the overall tweets (1,389 RTs out of 3,833), this doesn't seem like much of a difference, so it wasn't necessarily an influencer pushing the discourse. In terms of the most mentioned by tweet, it was our President in the spotlight:

```
bymentionchain <- d %>%
mutate("TRUMP_MENTION" = str_count(TXT, pattern = "Trump|TRUMP|trump")) %>%
# only the tweets that mention trump
group_by(TRUMP_MENTION, MENTION_CHAIN) %>%
summarise("count" = n()) %>%
ungroup() %>%
mutate("GROUPED_CHAIN" = ifelse(grepl(pattern = "EPN",
x = MENTION_CHAIN),
"EPN", MENTION_CHAIN)) %>%
mutate("GROUPED_CHAIN" = ifelse(grepl(pattern = "realDonaldTrump",
x = MENTION_CHAIN),
"realDonaldTrump", GROUPED_CHAIN))
ggplot(order_axis(bymentionchain %>%
filter(count>10 & GROUPED_CHAIN!="ND"),
axis = GROUPED_CHAIN,
column = count),
aes(x = GROUPED_CHAIN_o,
y = count)) +
geom_bar(stat = "identity") +
theme_eem() +
labs(x = "Mention chain \n (separated by _|.|_ )", y = "Tweets")
```

How about the actual persons who tweeted? It seems news anchor Joaquin Lopez-Doriga and security analyst Alejandro Hope were the most vocal about the visit (out of the influencers I'm following).

```
bytweetstar <- d %>%
mutate("TRUMP_MENTION" = ifelse(str_count(TXT, pattern = "Trump|TRUMP|trump")<1,0,1)) %>%
group_by(TRUMP_MENTION, NOMBRE) %>%
summarise("count" = n_distinct(TXT))
## plot with ggplot2
```

I also grouped each person by their political affiliation, and found it confirms the notion that the conversation on the eve of the visit, at least among this **very small** subset of twitter accounts, was driven by those with no party affiliation or in the "PAN" (opposition party).

```
byafiliation <- d %>%
mutate("MONTH" = as.numeric(month(T_CREATED)),
"DAY" = as.numeric(day(T_CREATED)),
"HOUR" = as.numeric(hour(T_CREATED)),
"TRUMP_MENTION" = ifelse(str_count(TXT, pattern = "Trump|TRUMP|trump")>0,1,0)) %>%
group_by(MONTH, DAY, HOUR, TRUMP_MENTION, AFILIACION) %>%
summarise("TWEETS" = n()) %>%
arrange(desc(MONTH), desc(DAY), HOUR) %>%
mutate("CHART_DATE" = as.POSIXct(paste0("2016-",MONTH,"-",DAY," ", HOUR, ":00")))
ggplot(byafiliation,
aes(x = CHART_DATE,
y = TWEETS,
group = AFILIACION,
fill = AFILIACION)) +
geom_bar(stat = "identity") +
theme_eem() +
scale_fill_eem(20) +
facet_grid(TRUMP_MENTION ~.) +
labs(x = "Time", y = "Tweets \n (By mention of Trump)")
```

However, It’s interesting to note how there is a small spike of the accounts afiliated with the PRI (party in power) on the day after his visit (Sept. 1st). Maybe they were trying to drive the conversation to another place?

]]>
(This article was first published on ** En El Margen - R-English**, and kindly contributed to R-bloggers)

I’m sure many of my fellow Mexicans will remember the historically ill-advised (to say the least) decision by our President to invite Donald Trump for a meeting.

Talking to some fellow colleagues, we couldn’t help but notice that **maybe** in another era this decision would have been good policy. The problem, some concluded, was the influence of social media today. In fact, the Trump debacle did cause an outcry among leading political voices online.

I wanted to investigate this further, and thankfully for me, I’ve been using **R** to collect tweets from a catalog of leading political personalities in Mexico for a personal business project.

Here is a short descriptive look at what the 65 Twitter accounts I’m following tweeted between August 27th and September 5th (the Donald announced his visit on August 30th). I’m sorry I can’t share the dataset, but you get the idea with the code…
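Since the dataset isn’t shared, a minimal mock of `d` can stand in for experimentation. Everything here is invented: the column names (NOMBRE, TXT, T_CREATED, T_ISRT, U_FOLLOWERS, AFILIACION, MENTION_CHAIN) are inferred from the snippets that follow, and the values are random.

```r
# Entirely invented stand-in for the private dataset `d`; column names are
# guessed from the code below, values are random
set.seed(1)
n <- 20
d <- data.frame(
  NOMBRE        = sample(c("lopezdoriga", "ahope71", "EPN"), n, replace = TRUE),
  TXT           = sample(c("Trump llega", "nada que ver", "TRUMP y EPN"), n, replace = TRUE),
  T_CREATED     = as.POSIXct("2016-08-30 10:00", tz = "UTC") + sample(0:86400, n),
  T_ISRT        = sample(c(TRUE, FALSE), n, replace = TRUE),
  U_FOLLOWERS   = sample(1e4:1e6, n),
  AFILIACION    = sample(c("PRI", "PAN", "ND"), n, replace = TRUE),
  MENTION_CHAIN = sample(c("EPN", "realDonaldTrump", "ND"), n, replace = TRUE),
  stringsAsFactors = FALSE
)
str(d)
```

With a mock like this, every snippet below can be run end to end, even though the counts will obviously differ from the real data.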

```
library(dplyr)
library(stringr)
# 42 of the 65 accounts tweeted between those dates.
d %>%
summarise("n" = n_distinct(NOMBRE))
# n
# 42
```

We can see how mentions of Trump spike just about the time the visit was announced…

```
library(lubridate) # provides month(), day(), hour()
byhour <- d %>%
mutate("MONTH" = as.numeric(month(T_CREATED)),
"DAY" = as.numeric(day(T_CREATED)),
"HOUR" = as.numeric(hour(T_CREATED)),
"TRUMP_MENTION" = str_count(TXT, pattern = "Trump|TRUMP|trump")) %>%
group_by(MONTH, DAY, HOUR) %>%
summarise("N" = n(),
"TRUMP_MENTIONS" = sum(TRUMP_MENTION)) %>%
mutate("PCT_MENTIONS" = TRUMP_MENTIONS/N*100) %>%
arrange(desc(MONTH), desc(DAY), HOUR) %>%
mutate("CHART_DATE" = as.POSIXct(paste0("2016-",MONTH,"-",DAY," ", HOUR, ":00")))
library(ggplot2)
library(eem)
ggplot(byhour,
aes(x = CHART_DATE,
y = PCT_MENTIONS)) +
geom_line(colour=eem_colors[1]) +
theme_eem()+
labs(x = "Time",
y = "Trump mentions \n (% of Tweets)")
```
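One caveat with the pattern `Trump|TRUMP|trump`: it misses mixed-case variants such as `TrUmp`. A case-insensitive counter handles all spellings; below is a base-R sketch equivalent in spirit to `stringr::str_count` with an ignore-case regex.

```r
# Case-insensitive mention counter (base-R analogue of
# stringr::str_count(TXT, regex("trump", ignore_case = TRUE)))
count_mentions <- function(txt, pattern = "trump") {
  hits <- gregexpr(pattern, txt, ignore.case = TRUE)
  vapply(hits, function(m) sum(m > 0), integer(1))
}
count_mentions(c("Trump y TRUMP", "nada", "TrUmp"))
# 2 0 1
```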

The peak of mentions (as a percentage of tweets) was September 1st at 6 am (75%). But in terms of the number of tweets, it is much more obvious that the outcry followed the announcement and later the visit of the candidate:

```
ggplot(byhour,
aes(x = CHART_DATE,
y = TRUMP_MENTIONS)) +
geom_line(colour=eem_colors[1]) +
theme_eem()+
labs(x = "Time",
y = "Trump mentions \n (# of Tweets)")
```

We can also (sort of) gauge the effect of these influencers tweeting. I’m going to sum the followers, which are potential viewers, of each tweet mentioning Trump, by hour.

```
byaudience <- d %>%
mutate("MONTH" = as.numeric(month(T_CREATED)),
"DAY" = as.numeric(day(T_CREATED)),
"HOUR" = as.numeric(hour(T_CREATED)),
"TRUMP_MENTION" = str_count(TXT, pattern = "Trump|TRUMP|trump")) %>%
filter(TRUMP_MENTION > 0) %>%
group_by(MONTH, DAY, HOUR) %>%
summarise("TWEETS" = n(),
"AUDIENCE" = sum(U_FOLLOWERS)) %>%
arrange(desc(MONTH), desc(DAY), HOUR) %>%
mutate("CHART_DATE" = as.POSIXct(paste0("2016-",MONTH,"-",DAY," ", HOUR, ":00")))
ggplot(byaudience,
aes(x = CHART_DATE,
y = AUDIENCE)) +
geom_line(colour=eem_colors[1]) +
theme_eem()+
labs(x = "Time",
y = "Potential audience \n (# of followers)")
```
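For readers without dplyr, the same followers-per-hour sum can be sketched in base R with `aggregate` (the numbers below are toy values, not the real data). Note that, like the dplyr version, this counts an account’s followers once per tweet, so prolific accounts inflate the “audience”.

```r
# Toy tweets: creation hour and the author's follower count
tw <- data.frame(
  HOUR      = c(10, 10, 11, 11, 11),
  FOLLOWERS = c(100, 250, 50, 400, 300)
)
# Potential audience per hour = sum of followers over tweets in that hour
byaudience <- aggregate(FOLLOWERS ~ HOUR, data = tw, FUN = sum)
byaudience
#   HOUR FOLLOWERS
# 1   10       350
# 2   11       750
```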

So clearly, I’m stating the obvious. People were talking. But how was the conversation developing? Let’s first look at the types of tweets (RTs vs. individually drafted):

```
library(knitr) # provides kable()
bytype <- d %>%
mutate("TRUMP_MENTION" = str_count(TXT, pattern = "Trump|TRUMP|trump")) %>%
# only the tweets that mention trump
filter(TRUMP_MENTION>0) %>%
group_by(T_ISRT) %>%
summarise("count" = n())
kable(bytype)
```

T_ISRT | count |
---|---|
FALSE | 313 |
TRUE | 164 |

About 1 in 3 was an RT. Compared to the overall tweets (1,389 retweets out of 3,833), this is not much of a difference, so it wasn’t necessarily an influencer pushing the discourse. As for who was mentioned most per tweet, our President was in the spotlight:
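That eyeball comparison can be made concrete with a quick two-proportion test on the counts quoted above (164 RTs out of 477 Trump tweets vs. 1,389 out of 3,833 overall). This is my own check, not part of the original analysis:

```r
# RT share among Trump-mentioning tweets vs. all tweets in the period
rt  <- c(164, 1389)          # retweets: Trump-mentioning, overall
tot <- c(477, 3833)          # totals:   313 + 164 = 477 Trump tweets
round(rt / tot, 3)           # shares: about 0.34 vs 0.36
prop.test(rt, tot)$p.value   # large p-value: no evidence of a difference
```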

```
bymentionchain <- d %>%
mutate("TRUMP_MENTION" = str_count(TXT, pattern = "Trump|TRUMP|trump")) %>%
# keep the Trump-mention count as a grouping variable (no filter applied here)
group_by(TRUMP_MENTION, MENTION_CHAIN) %>%
summarise("count" = n()) %>%
ungroup() %>%
mutate("GROUPED_CHAIN" = ifelse(grepl(pattern = "EPN",
x = MENTION_CHAIN),
"EPN", MENTION_CHAIN)) %>%
mutate("GROUPED_CHAIN" = ifelse(grepl(pattern = "realDonaldTrump",
x = MENTION_CHAIN),
"realDonaldTrump", GROUPED_CHAIN))
ggplot(order_axis(bymentionchain %>%
filter(count>10 & GROUPED_CHAIN!="ND"),
axis = GROUPED_CHAIN,
column = count),
aes(x = GROUPED_CHAIN_o,
y = count)) +
geom_bar(stat = "identity") +
theme_eem() +
labs(x = "Mention chain \n (separated by _|.|_ )", y = "Tweets")
```
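The two nested `ifelse(grepl(...))` calls deserve a note: both test the original `MENTION_CHAIN`, and the second assignment wins, so a chain containing both “EPN” and “realDonaldTrump” ends up labeled “realDonaldTrump”. A tiny self-contained illustration (chain strings are made up):

```r
# The second reassignment tests the original chains, so it takes precedence
chains  <- c("EPN_|.|_lopezdoriga", "realDonaldTrump",
             "EPN_|.|_realDonaldTrump", "ND")
grouped <- ifelse(grepl("EPN", chains), "EPN", chains)
grouped <- ifelse(grepl("realDonaldTrump", chains), "realDonaldTrump", grouped)
grouped
# "EPN" "realDonaldTrump" "realDonaldTrump" "ND"
```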

How about the actual persons who tweeted? It seemed like news anchor Joaquin Lopez-Doriga and security analyst Alejandro Hope were the most vocal about the visit (out of the influencers I’m following).

```
bytweetstar <- d %>%
mutate("TRUMP_MENTION" = ifelse(str_count(TXT, pattern = "Trump|TRUMP|trump")<1,0,1)) %>%
group_by(TRUMP_MENTION, NOMBRE) %>%
summarise("count" = n_distinct(TXT))
## plot with ggplot2
```

I also grouped each person by their political affiliation, which confirms the notion that the conversation on the eve of the visit, at least among this **very small** subset of Twitter accounts, was driven by those with no party affiliation or in the “PAN” (opposition party).

```
byafiliation <- d %>%
mutate("MONTH" = as.numeric(month(T_CREATED)),
"DAY" = as.numeric(day(T_CREATED)),
"HOUR" = as.numeric(hour(T_CREATED)),
"TRUMP_MENTION" = ifelse(str_count(TXT, pattern = "Trump|TRUMP|trump")>0,1,0)) %>%
group_by(MONTH, DAY, HOUR, TRUMP_MENTION, AFILIACION) %>%
summarise("TWEETS" = n()) %>%
arrange(desc(MONTH), desc(DAY), HOUR) %>%
mutate("CHART_DATE" = as.POSIXct(paste0("2016-",MONTH,"-",DAY," ", HOUR, ":00")))
ggplot(byafiliation,
aes(x = CHART_DATE,
y = TWEETS,
group = AFILIACION,
fill = AFILIACION)) +
geom_bar(stat = "identity") +
theme_eem() +
scale_fill_eem(20) +
facet_grid(TRUMP_MENTION ~.) +
labs(x = "Time", y = "Tweets \n (By mention of Trump)")
```

However, it’s interesting to note a small spike from the accounts affiliated with the PRI (the party in power) on the day after the visit (September 1st). Maybe they were trying to steer the conversation elsewhere?

To **leave a comment** for the author, please follow the link and comment on their blog: ** En El Margen - R-English**.


(This article was first published on ** R – Quintuitive**, and kindly contributed to R-bloggers)

For quite some time now I have been using R’s caret package to choose the model for forecasting time series data. The approach is satisfactory as long as the model is not an evolving model (i.e. it is not re-trained), or if it evolves rarely. If the model is re-trained often, the approach has significant computational overhead. Interestingly enough, an alternative, more efficient approach also allows for more flexibility in the area of model selection.

Let’s first outline how caret chooses a single model. The high level algorithm is outlined here:

So let’s say we are training a random forest. For this model, a single parameter, *mtry*, is optimized:

```
require(caret)
getModelInfo('rf')$wsrf$parameters
#   parameter   class                         label
# 1      mtry numeric #Randomly Selected Predictors
```

Let’s assume we are using some form of cross validation. According to the algorithm outline, caret will create a few subsets. On each subset, it will train all models (as many models as there are different values for *mtry*) and finally it will choose the model that performs best over all cross-validation folds. So far so good.

When dealing with time series, regular cross validation has a future-snooping problem, and in my experience it doesn’t work well in practice for time series data. The results are good on the training set, but the performance on the test set, the hold-out, is bad. To address this issue, caret provides the *timeslice* cross-validation method:

```
require(caret)
history = 1000
initial.window = 800
train.control = trainControl(
    method = "timeslice",
    initialWindow = initial.window,
    horizon = history - initial.window,
    fixedWindow = T)
```

When the above *train.control* is used in training (via the *train* call), we will end up using 200 models for each set of parameters (each value of *mtry* in the random forest case). In other words, for a single value of *mtry*, we will compute:

Window | Training Points | Test Point |
---|---|---|
1 | 1..800 | 801 |
2 | 2..801 | 802 |
3 | 3..802 | 803 |
… | … | … |
200 | 200..999 | 1000 |

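The window table can also be generated programmatically. The base-R sketch below reproduces the same indices for a 1000-point series with an 800-point fixed window and a single-point test set (this corresponds to caret’s `createTimeSlices` with `initialWindow = 800`, `horizon = 1`, `fixedWindow = TRUE`):

```r
# Enumerate the 200 fixed sliding windows: train on 800 points,
# test on the single next point
n <- 1000
window <- 800
starts <- seq_len(n - window)        # window 1 .. window 200
slices <- data.frame(
  train_start = starts,
  train_end   = starts + window - 1,
  test_point  = starts + window
)
head(slices, 3)   # train 1..800 / 2..801 / 3..802, test 801 / 802 / 803
tail(slices, 1)   # window 200: train 200..999, test 1000
```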
The training set for each model is the previous 800 points. The test set for a single model is the single-point forecast. Now, for each value of *mtry* we end up with 200 forecasted points; using accuracy (or any other metric), we select the best-performing model over these 200 points. There is no future snooping here, because all history points are prior to the points being forecasted.

Granted, this approach (of doing things on a daily basis) may sound extreme, but it’s useful to illustrate the overhead imposed when the model evolves over time, so bear with me.

So far we have dealt with selecting a single model. Once the best model is selected, we can forecast the next data point. Then what? What I usually do is walk the time series forward and repeat these steps at certain intervals. This is equivalent to saying something like: “Let’s choose the best model each Friday, use the selected model to predict each day of the next week, then re-fit on Friday.” This forward-walking approach has been found useful in trading but, surprisingly, has hardly been discussed elsewhere. Abundant time series data is generated everywhere, so I feel this evolving-model approach deserves at least as much attention as the “fit once, live happily thereafter” approach.

Back to our discussion. To illustrate the inefficiency, consider an even more extreme case: we select the best model every day, using the above parameters, i.e. the best model for each day is selected by tuning the parameters over the previous 200 days. On day *n*, for a given value of the parameter (*mtry*), we train this model over a sequence of 200 sliding windows, each of size 800. Next we move to day *n+1* and compute, yet again, this model over a sequence of 200 sliding windows, each of size 800. Most of these operations are repeated (the last 800-point window on day *n* is the second-to-last 800-point window on day *n+1*). So, for a single parameter value, we are repeating most of the computation on each step.

At this point, I hope you get the idea. So what is my solution? Simple. For each set of model parameters (each value of *mtry*), walk the series separately, do the training (no cross validation, since we have a single parameter value), do the forecasting and store everything important into, let’s say, an SQLite database. Next, pull out all predictions and walk the combined series. On each step, look at the history and, based on it, decide which model’s prediction to use for the next step. Assuming we are selecting the model over 5 different values for *mtry*, here is how the combined data may look for a three-class (0, -1 and 1) classification:
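To make that last step concrete, here is a sketch of walking the combined series: given one column of stored forecasts per *mtry* value and the realized classes, each day we pick the model with the best trailing accuracy. The forecasts below are simulated stand-ins; in practice they would be pulled from the SQLite store.

```r
# Simulated stand-in for the stored forecasts: one column per mtry value.
set.seed(42)
n.days   <- 300
lookback <- 50                       # history used to judge each model
classes  <- c(0, -1, 1)
actual   <- sample(classes, n.days, replace = TRUE)
preds <- sapply(1:5, function(m)     # 5 hypothetical mtry values
  ifelse(runif(n.days) < 0.4, actual, sample(classes, n.days, replace = TRUE)))

selected <- rep(NA_real_, n.days)
for (i in (lookback + 1):n.days) {
  win <- (i - lookback):(i - 1)
  acc <- colMeans(preds[win, ] == actual[win])  # trailing accuracy per model
  selected[i] <- preds[i, which.max(acc)]       # today's pick: best recent model
}
mean(selected[-(1:lookback)] == actual[-(1:lookback)])  # combined accuracy
```

Swapping the selection metric (e.g. trailing cumulative return instead of accuracy) only changes the `acc <- ...` line; nothing upstream has to be recomputed.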

Obviously the described approach is going to be orders of magnitude faster, yet it will deliver very similar results (there are differences depending on the window sizes). It also has an added bonus: once the forecasts are generated, one can experiment with different metrics for model selection on each step, all without re-running the machine learning portion. For instance, instead of model accuracy (the default *caret* metric for classification), one can compare cumulative returns over the last *n* days.

Still cryptic, or curious about the details? My plan is to keep posting details and code as I progress with my Python implementation, so look for the next installments in this series.

The post Better Model Selection for Evolving Models appeared first on Quintuitive.

To **leave a comment** for the author, please follow the link and comment on their blog: ** R – Quintuitive**.


(This article was first published on ** Stat Of Mind**, and kindly contributed to R-bloggers)

Anyone who follows US politics will be aware of the tremendous changes and volatility that have struck the US political landscape in the past year. In this post, I leverage third-party data to surface who the most frequent liars are, and show how to build a containerized Shiny app to visualize direct comparisons between individuals.

http://tlfvincent.github.io/2016/06/11/biggest-political-liars/

To **leave a comment** for the author, please follow the link and comment on their blog: ** Stat Of Mind**.
