(This article was first published on **bnosac :: open analytical helpers**, and kindly contributed to R-bloggers)

Just before the summer holidays, BNOSAC presented a talk called Computer Vision and Image Recognition algorithms for R users at the UseR conference. In the talk, 6 packages on Computer Vision with R were introduced in front of an audience of about 250 people. The R packages we covered, all developed by BNOSAC, are:

- image.CornerDetectionF9: FAST-9 corner detection
- image.CannyEdges: Canny Edge Detector
- image.LineSegmentDetector: Line Segment Detector (LSD)
- image.ContourDetector: Unsupervised Smooth Contour Line Detection
- image.dlib: Speeded up robust features (SURF) and histogram of oriented gradients (FHOG) features
- image.darknet: Image classification using darknet with deep learning models AlexNet, Darknet, VGG-16, GoogleNet and Darknet19, as well as object detection using the state-of-the-art YOLO detection system

For those of you who missed this, you can still watch the video and view the pdf of the presentation below. The packages are open-sourced and made available at https://github.com/bnosac/image

If you have a computer vision endeavour in mind, feel free to get in touch for a quick chat. For those of you interested in training on how to do image analysis, you can always register for our course on Computer Vision with R and Python. For more details on the full training program and training dates provided by BNOSAC, visit http://bnosac.be/index.php/training

{aridoc engine="pdfjs" width="100%" height="450"}images/bnosac/blog/presentation-user2017.pdf{/aridoc}

To **leave a comment** for the author, please follow the link and comment on their blog: **bnosac :: open analytical helpers**.


(This article was first published on **Florian Teschner**, and kindly contributed to R-bloggers)

Over the last months, I have worked on brand logo detection in R with Keras: starting with a model from scratch, then adding more data, and finally using a pretrained model. The goal is to build a (deep) neural net that is able to identify brand logos in images.

Just to recall, the dataset is a combination of the Flickr27 dataset, with 270 images of 27 classes, and self-scraped images from Google image search. In case you want to reproduce the analysis, you can download the set here.

In the last post, I used the VGG-16 pretrained model and showed that it can be trained to achieve an accuracy of 55% on the training set and 35% on the validation set.

In this post, I will show how to further improve the model accuracy.

Keras (in R) provides a set of pretrained models:

- Xception
- VGG16
- VGG19
- ResNet50
- InceptionV3
- MobileNet

Naturally, this raises the question of which model is best suited for the task at hand.

The article 10 advanced deep learning architectures points out that Google's Xception model performs better than VGG in transfer-learning cases.
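To make the switch concrete, a transfer-learning sketch with the keras R package might look as follows (the input size, pooling head, and 27-class output layer are illustrative assumptions, not the post's exact code):

```r
library(keras)

# load Xception's convolutional base with ImageNet weights, no classifier head
base <- application_xception(weights = "imagenet", include_top = FALSE,
                             input_shape = c(224, 224, 3))
freeze_weights(base)  # keep the pretrained filters fixed

# put a small trainable classifier on top (27 brand-logo classes)
preds <- base$output %>%
  layer_global_average_pooling_2d() %>%
  layer_dense(units = 27, activation = "softmax")

model <- keras_model(inputs = base$input, outputs = preds)
model %>% compile(optimizer = "adam",
                  loss = "categorical_crossentropy",
                  metrics = "accuracy")
```

Only the dense head is trained at first; once it converges, one can unfreeze the top few base layers for fine-tuning.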

In addition to changing the pre-trained model, I wanted to see how data augmentation changes the results.

The function “image_data_generator” takes the input data and randomly alters the original training images.

Here is the code:
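As a sketch of what such an augmentation setup can look like (the generator parameters, the `x_train`/`y_train` arrays, and the epoch count are illustrative assumptions):

```r
library(keras)

# generator that randomly perturbs the original images on the fly
datagen <- image_data_generator(
  rotation_range = 20,        # random rotations up to 20 degrees
  width_shift_range = 0.2,    # random horizontal shifts
  height_shift_range = 0.2,   # random vertical shifts
  zoom_range = 0.2,
  horizontal_flip = TRUE
)

# stream augmented batches from in-memory arrays and train on them
train_flow <- flow_images_from_data(x_train, y_train, generator = datagen,
                                    batch_size = 32)
model %>% fit_generator(train_flow,
                        steps_per_epoch = ceiling(nrow(x_train) / 32),
                        epochs = 50)
```

Because the generator alters images anew each epoch, the network effectively never sees the exact same training image twice.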

I first trained the model for 100 epochs on the original training data and then added 50 epochs on the augmented (altered) dataset.

Plotting the training history shows that training on the original data results in a validation accuracy of ~57% after 50 epochs. After that, neither the training nor the validation accuracy increases any further. Training the model further on the augmented data (red lines) leads to another boost in the validation accuracy.

To sum up, just changing a couple of lines from the previous setup changes the network’s performance significantly. Using a different pre-trained network and adding data augmentation doubles the classification accuracy.

As a side note: it appears (to me) that the current DL landscape is very dynamic and fast-evolving. It is a safe bet that the content of this post will be outdated in 6 months. Just in the last month, the RStudio/Keras repository has changed significantly:

Excluding merges, 3 authors have pushed 178 commits to master and 178 commits to all branches. On master, 349 files have changed and there have been 5,468 additions and 1,719 deletions.

Kudos to the RStudio team for the great work on the package.


(This article was first published on **S+/R – Yet Another Blog in Statistical Computing**, and kindly contributed to R-bloggers)

The dropout approach developed by Hinton has been widely employed in deep learning to prevent deep neural networks from overfitting, as shown in https://statcompute.wordpress.com/2017/01/02/dropout-regularization-in-deep-neural-networks.

In the paper http://proceedings.mlr.press/v38/korlakaivinayak15.pdf, dropout is also used to address overfitting in boosting tree ensembles, e.g. MART, caused by so-called "over-specialization". In particular, while the first few trees added at the beginning of the ensemble dominate the model performance, trees added later can only improve the prediction for a small subset of observations, which increases the risk of overfitting. The idea of DART is to build an ensemble by randomly dropping boosting tree members. The percentage of dropouts determines the degree of regularization for the boosting tree ensemble.

Below is a demonstration showing the implementation of DART with the R xgboost package. First of all, after importing the data, we divided it into two pieces, one for training and the other for testing.

```
pkgs <- c('pROC', 'xgboost')
lapply(pkgs, require, character.only = TRUE)
df1 <- read.csv("Downloads/credit_count.txt")
df2 <- df1[df1$CARDHLDR == 1, ]
set.seed(2017)
n <- nrow(df2)
sample <- sample(seq(n), size = n / 2, replace = FALSE)
train <- df2[sample, -1]
test <- df2[-sample, -1]
```

For comparison purposes, we first developed a boosting tree ensemble without dropouts, as shown below. For simplicity, all parameters were chosen heuristically. The max_depth is set to 3 because boosting tends to work well with so-called "weak" learners, e.g. simple trees. While ROC for the training set can be as high as 0.95, ROC for the testing set is only 0.60 in our case, implying an overfitting issue.

```
mart.parm <- list(booster = "gbtree", nthread = 4, eta = 0.1, max_depth = 3,
                  subsample = 1, eval_metric = "auc")
mart <- xgboost(data = as.matrix(train[, -1]), label = train[, 1],
                params = mart.parm, nrounds = 500, verbose = 0, seed = 2017)
pred1 <- predict(mart, as.matrix(train[, -1]))
pred2 <- predict(mart, as.matrix(test[, -1]))
roc(as.factor(train$DEFAULT), pred1)
# Area under the curve: 0.9459
roc(as.factor(test$DEFAULT), pred2)
# Area under the curve: 0.6046
```

With the same set of parameters, we refitted the ensemble with dropouts, i.e. DART. As shown below, by dropping 10% of tree members, ROC for the testing set increases from 0.60 to 0.65. In addition, the performance disparity between training and testing sets decreases significantly with DART.

```
dart.parm <- list(booster = "dart", rate_drop = 0.1, nthread = 4, eta = 0.1,
                  max_depth = 3, subsample = 1, eval_metric = "auc")
dart <- xgboost(data = as.matrix(train[, -1]), label = train[, 1],
                params = dart.parm, nrounds = 500, verbose = 0, seed = 2017)
pred1 <- predict(dart, as.matrix(train[, -1]))
pred2 <- predict(dart, as.matrix(test[, -1]))
roc(as.factor(train$DEFAULT), pred1)
# Area under the curve: 0.7734
roc(as.factor(test$DEFAULT), pred2)
# Area under the curve: 0.6517
```

Besides rate_drop = 0.1, a wide range of dropout rates have also been tested. In most cases, DART outperforms its counterpart without the dropout regularization.
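Such a comparison can be sketched with a simple loop (a hedged example reusing the `train`/`test` split from above; the grid of rates is arbitrary):

```r
# try a small grid of dropout rates and compare out-of-sample AUC
for (r in c(0.05, 0.1, 0.2, 0.3)) {
  parm <- list(booster = "dart", rate_drop = r, nthread = 4, eta = 0.1,
               max_depth = 3, subsample = 1, eval_metric = "auc")
  fit <- xgboost(data = as.matrix(train[, -1]), label = train[, 1],
                 params = parm, nrounds = 500, verbose = 0)
  auc <- pROC::roc(as.factor(test$DEFAULT),
                   predict(fit, as.matrix(test[, -1])))$auc
  cat(sprintf("rate_drop = %.2f -> testing AUC = %.4f\n", r, as.numeric(auc)))
}
```

In practice, rate_drop behaves like any other regularization hyperparameter and is best tuned by cross-validation rather than eyeballing a single split.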


(This article was first published on **S+/R – Yet Another Blog in Statistical Computing**, and kindly contributed to R-bloggers)

In the previous post (https://statcompute.wordpress.com/2017/06/29/model-operational-loss-directly-with-tweedie-glm), it has been explained why we should consider modeling operational losses for non-material UoMs directly with Tweedie models. However, for material UoMs with significant losses, it is still beneficial to model the frequency and the severity separately.

In the prevailing modeling practice for operational losses, it is often convenient to assume functional independence between frequency and severity models, which might not be the case empirically. For instance, in an economic downturn, both the frequency and the severity of consumer frauds might tend to increase simultaneously. Under the independence assumption, while we can argue that the same variables could be included in both frequency and severity models and therefore induce a certain correlation, the frequency-severity dependence and its contribution to the loss distribution might be overlooked.

In the context of copulas, the distribution of operational losses can be considered a joint distribution determined by both marginal distributions and a parameter measuring the dependence between marginals, where the marginal distributions can be Poisson for the frequency and Gamma for the severity. Depending on the dependence structure in the data, various copula functions might be considered. For instance, a product copula can be used to describe independence. In the example shown below, a Gumbel copula is considered, given that it is often used to describe positive dependence in the right tail, e.g. high severity and high frequency. For details, the book "Copula Modeling" by Trivedi and Zimmer is a good reference to start with.
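For intuition, the Gumbel copula is C(u, v) = exp(-[(-log u)^θ + (-log v)^θ]^{1/θ}) with θ = 1/(1 - τ); a quick base-R check (an illustrative calculation, not part of the original analysis) shows how it lifts the joint upper tail relative to independence:

```r
# Gumbel copula CDF; theta = 1 corresponds to independence
gumbel_cop <- function(u, v, theta) {
  exp(-((-log(u))^theta + (-log(v))^theta)^(1 / theta))
}

theta <- 1 / (1 - 0.5)          # Kendall's tau = 0.5 implies theta = 2
gumbel_cop(0.9, 0.9, theta)     # ~0.86: joint probability of both marginals <= 0.9
0.9 * 0.9                       # 0.81 under independence
```

The higher joint probability near (1, 1) is exactly the right-tail clustering of high frequency with high severity described above.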

In the demonstration, we simulated both frequency and severity measures driven by the same set of covariates. Both are positively correlated, with Kendall's tau = 0.5, under the assumption of a Gumbel copula.

```
library(CopulaRegression)
# number of observations to simulate
n <- 100
# seed value for the simulation
set.seed(2017)
# design matrices with a constant column
X <- cbind(rep(1, n), runif(n), runif(n))
# define coefficients for both Poisson and Gamma regressions
p_beta <- g_beta <- c(3, -2, 1)
# define the Gamma dispersion
delta <- 1
# define the Kendall's tau
tau <- 0.5
# copula parameter based on tau
theta <- 1 / (1 - tau)
# define the Gumbel copula
family <- 4
# simulate outcomes
out <- simulate_regression_data(n, g_beta, p_beta, X, X, delta, tau, family, zt = FALSE)
G <- out[, 1]
P <- out[, 2]
```

After the simulation, a Copula regression is estimated with Poisson and Gamma marginals for the frequency and the severity respectively. As shown in the model estimation, estimated parameters with related inferences are different between independent and dependent assumptions.

```
m <- copreg(G, P, X, family = 4, sd.error = TRUE, joint = TRUE, zt = FALSE)
coef <- c("_CONST", "X1", "X2")
cols <- c("ESTIMATE", "STD. ERR", "Z-VALUE")
g_est <- cbind(m$alpha, m$sd.alpha, m$alpha / m$sd.alpha)
p_est <- cbind(m$beta, m$sd.beta, m$beta / m$sd.beta)
g_est0 <- cbind(m$alpha0, m$sd.alpha0, m$alpha0 / m$sd.alpha0)
p_est0 <- cbind(m$beta0, m$sd.beta0, m$beta0 / m$sd.beta0)
rownames(g_est) <- rownames(g_est0) <- rownames(p_est) <- rownames(p_est0) <- coef
colnames(g_est) <- colnames(g_est0) <- colnames(p_est) <- colnames(p_est0) <- cols
# estimated coefficients for the Gamma regression assuming dependence
print(g_est)
#          ESTIMATE  STD. ERR   Z-VALUE
# _CONST  2.9710512 0.2303651 12.897141
# X1     -1.8047627 0.2944627 -6.129003
# X2      0.9071093 0.2995218  3.028526
# estimated coefficients for the Poisson regression assuming dependence
print(p_est)
#         ESTIMATE   STD. ERR   Z-VALUE
# _CONST  2.954519 0.06023353  49.05107
# X1     -1.967023 0.09233056 -21.30414
# X2      1.025863 0.08254870  12.42736
# estimated coefficients for the Gamma regression assuming independence
# (should be identical to the glm() outcome)
print(g_est0)
#         ESTIMATE  STD. ERR   Z-VALUE
# _CONST  3.020771 0.2499246 12.086727
# X1     -1.777570 0.3480328 -5.107478
# X2      0.905527 0.3619011  2.502140
# estimated coefficients for the Poisson regression assuming independence
# (should be identical to the glm() outcome)
print(p_est0)
#         ESTIMATE   STD. ERR   Z-VALUE
# _CONST  2.939787 0.06507502  45.17536
# X1     -2.010535 0.10297887 -19.52376
# X2      1.088269 0.09334663  11.65837
```

If we compare conditional loss distributions under different dependence assumptions, it shows that the predicted loss with Copula regression tends to have a fatter right tail and therefore should be considered more conservative.

```
library(ggplot2)
df <- data.frame(g = G, p = P, x1 = X[, 2], x2 = X[, 3])
glm_p <- glm(p ~ x1 + x2, data = df, family = poisson(log))
glm_g <- glm(g ~ x1 + x2, data = df, family = Gamma(log))
loss_dep <- predict(m, X, X, independence = FALSE)[3][[1]][[1]]
loss_ind <- fitted(glm_p) * fitted(glm_g)
den <- data.frame(loss = c(loss_dep, loss_ind),
                  lines = rep(c("DEPENDENCE", "INDEPENDENCE"), each = n))
ggplot(den, aes(x = loss, fill = lines)) + geom_density(alpha = 0.5)
```


(This article was first published on **Mad (Data) Scientist**, and kindly contributed to R-bloggers)

I recently posted an update regarding our R package **revisit**, aimed at partially remedying the reproducibility crisis, both in the sense of (a) providing transparency to data analyses and (b) flagging possible statistical errors, including misuse of significance testing.

One person commented to me that it may not be important for the package to include warnings about significance testing. I replied that on the contrary, such problems are by far the most common in all of statistics. Today I found an especially egregious case in point, not only because of the errors themselves but even more so because of the shockingly high mathematical sophistication of the culprits.

This fiasco occurs in the article "Gravitational Waves and Their Mathematics" in the August 2017 issue of the *Notices of the AMS*, by mathematics and physics professors Lydia Bieri, David Garfinkle and Nicolás Yunes. In describing the results of a dramatic experiment claimed to show the existence of gravitational waves, the authors state,

…the aLIGO detectors recorded the interference pattern associated with a gravitational wave produced in the merger of two black holes 1.3 billion light years away. The signal was so loud (relative to the level of the noise) that the probability that the recorded event was a gravitational wave was much larger than 5𝜎, meaning that the probability of a false alarm was much smaller than 10^{-7}.

Of course, in that second sentence, the second half is (or at least reads as) the all-too-common error of interpreting a p-value as the probability that the null hypothesis is correct. But that first half (the probability of a gravitational wave was much larger than 5𝜎) is quite an "innovation" in the World of Statistical Errors. Actually, it may be a challenge to incorporate a warning for this kind of error in **revisit**.
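For reference, the correct reading of a 5𝜎 threshold is a statement about the tail probability of the test statistic under the null hypothesis, not about the probability that the null (or the alternative) is true. In R:

```r
# one-sided upper-tail probability of a standard normal beyond 5 sigma
pnorm(5, lower.tail = FALSE)
# [1] 2.866516e-07
```

That is the source of the 10^{-7} figure; turning it into "the probability the event was a gravitational wave" requires a prior and Bayes' rule, which the quoted sentence silently skips.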

I suppose the confirmation of the existence of gravitational waves is otherwise sound, but one does have to wonder if other parts of the experiment and analysis were similarly sloppy. This is reminiscent of some controversy over the confirmation of the existence of the Higgs Boson; I actually may disagree there, but it again shows that, at the least, physicists should stop treating statistics as not worth the effort needed for useful insight.


(This article was first published on **Thinking inside the box**, and kindly contributed to R-bloggers)

On the heels of the very recent bi-monthly RcppArmadillo release comes a quick bug-fix release 0.7.960.1.1 which just got onto CRAN (and I will ship a build to Debian in a moment).

There were three distinct issues I addressed in three quick pull requests:

- The excellent Google Summer of Code work by Binxiang Ni had only encountered direct use of sparse matrices as produced by the Matrix package. However, while we waited for 0.7.960.1.0 to make it onto CRAN, the quanteda package switched to derived classes, which we now account for via the `is()` method of our `S4` class. Thanks to Kevin Ushey for reminding me we had `is()`.
- We had somehow missed accounting for the R 3.4.* and Rcpp 0.12.{11,12} changes for package registration (with `.registration=TRUE`), so we ensured we only have one `fastLm` symbol.
- The build did not take too well to systems without OpenMP, so we now explicitly unset it via an Armadillo configuration variable. In general, *client* packages probably want to enable C++11 support when using OpenMP (explicitly), but we prefer not to upset too many (old) users. However, our `configure` check now also wants `g++ 4.7.2` or later, just like Armadillo.

Armadillo is a powerful and expressive C++ template library for linear algebra, aiming for a good balance between speed and ease of use, with a syntax deliberately close to Matlab. RcppArmadillo integrates this library with the R environment and language, and is widely used by (currently) 382 other packages on CRAN, an increase of 52 since the CRAN release in June!

Changes in this release relative to the previous CRAN release are as follows:

## Changes in RcppArmadillo version 0.7.960.1.1 (2017-08-20)

Courtesy of CRANberries, there is a diffstat report. More detailed information is on the RcppArmadillo page. Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


(This article was first published on **R-exercises**, and kindly contributed to R-bloggers)

Probability is at the heart of data science. Simulation is also commonly used in algorithms such as the bootstrap. After completing these exercises, you will have a slightly stronger intuition for probability and for writing your own simulation algorithms.

Most of the problems in this set have an exact analytical solution, which is not the case for all probability problems, but they are great for practice since we can check against the exact correct answer.

To get the most out of the exercises, it pays off to read the instructions carefully and think about what the solution should be before starting to write `R` code. Often this helps you weed out irrelevant information that can otherwise make your algorithm unnecessarily complicated.
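As a warm-up for the pattern most simulation answers follow, here is a minimal Monte Carlo template (the event, two dice summing to 7, is just an illustration and not one of the exercises):

```r
set.seed(1)
# generic recipe: repeat the random experiment many times and
# average an indicator of the event of interest
n_sim <- 1e5
hit <- replicate(n_sim, sum(sample(1:6, 2, replace = TRUE)) == 7)
mean(hit)   # converges to the exact answer 6/36 = 1/6
```

Every exercise below fits this shape: simulate, test the event, average.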

Answers are available here.

**Exercise 1**

In 100 coin tosses, what is the probability of having the same side come up 10 times in a row?

You might want to use some of the following functions to answer this question: `sample()`, `rbinom()`, `rle()`.

**Exercise 2**

Six kids are standing in line. What is the probability that they are in alphabetical order by name? Assume no two children have the same exact name.

**Exercise 3**

Remember the kids from the last question? There are three boys and three girls. How likely is it that all the girls come first?

**Exercise 4**

In six coin tosses, what is the probability of having a different side come up with each throw, that is, that you never get two tails or two heads in a row?

**Exercise 5**

A random five-card poker hand is dealt from a standard deck. What is the chance of a flush (all cards are the same suit)?

**Exercise 6**

In a random thirteen-card hand from a standard deck, what is the probability that none of the cards is an ace and none is a heart (♥)?

**Exercise 7**

At four parties each attended by 13, 23, 33, and 53 people respectively, how likely is it that at least two individuals share a birthday at each party? Assume there are no leap days, that all years are 365 days, and that births are uniformly distributed over the year.

**Exercise 8**

A famous coin tossing game has the following rules: The player tosses a coin repeatedly until a tail appears or tosses it a maximum of 1000 times if no tail appears. The initial stake starts at 2 dollars and is doubled every time heads appears. The first time tails appears, the game ends and the player wins whatever is in the pot. Thus the player wins 2 dollars if tails appears on the first toss, 4 dollars if heads appears on the first toss and tails on the second, 8 dollars if heads appears on the first two tosses and tails on the third, and so on. Mathematically, the player wins 2^{k} dollars, where k equals the number of tosses until the first tail. What is the probability of profit if it costs 15 dollars to participate?

**Exercise 9**

Back to coin tossing. What is the probability the pattern heads-heads-tails appears before tails-heads-heads?

**Exercise 10**

Suppose you’re on a game show, and you’re given the choice of three doors. Behind one door is a car; behind the others, goats. You pick a door, say #1, and the host, who knows what’s behind the doors, opens another door, say #3, which has a goat. He then says to you, “Do you want to pick door #2?” What is the probability of winning the car if you use the strategy of first picking a random door and then switching doors every time? Note that the host will always open a door you did not pick, and it always reveals a goat.


(This article was first published on **Thinking inside the box**, and kindly contributed to R-bloggers)

Welcome to the tenth post in the *rarely ranting R recommendations* series, or *R*^{4} for short. A few days ago we showed how to tell the linker to strip shared libraries. As discussed in the post, there are two options. One can either set up `~/.R/Makevars` by passing the `strip-debug` option to the linker. Alternatively, one can adjust `src/Makevars` in the package itself with a bit of Makefile magic.

Of course, there is a third way: just run `strip --strip-debug` over all the shared libraries *after the build*. As the path is standardized, and the shell does proper globbing, we can just do

`$ strip --strip-debug /usr/local/lib/R/site-library/*/libs/*.so`

using a double wildcard to get all packages (in that R package directory) and all their shared libraries. Users on macOS probably want `.dylib` on the end, users on Windows want another computer as usual (just kidding: use `.dll`). Either may have to adjust the path, which is left as an exercise to the reader.

The impact can be Yuge as illustrated in the following dotplot:

This illustration is in response to a mailing list post. Last week, someone claimed on r-help that tidyverse would not install on Ubuntu 17.04. This is of course patently false, as many of us build and test on Ubuntu and related Linux systems, Travis runs on it, CRAN tests on it, etc pp. That poor user had somehow messed up their default `gcc` version. Anyway: I fired up a Docker container, installed `r-base-core` plus three required `-dev` packages (for xml2, openssl, and curl) and ran a single `install.packages("tidyverse")`. In a nutshell, following the launch of Docker for an Ubuntu 17.04 container, it was just

```
$ apt-get update
$ apt-get install r-base libcurl4-openssl-dev libssl-dev libxml2-dev
$ apt-get install mg # a tiny editor
$ mg /etc/R/Rprofile.site # to add a default CRAN repo
$ R -e 'install.packages("tidyverse")'
```

which not only worked (as expected) but also installed a whopping fifty-one packages (!!), of which twenty-six contain a shared library. A useful little trick is to run `du` with proper options to total, summarize, and use human units, which reveals that these libraries occupy seventy-eight megabytes:

```
root@de443801b3fc:/# du -csh /usr/local/lib/R/site-library/*/libs/*so
4.3M /usr/local/lib/R/site-library/Rcpp/libs/Rcpp.so
2.3M /usr/local/lib/R/site-library/bindrcpp/libs/bindrcpp.so
144K /usr/local/lib/R/site-library/colorspace/libs/colorspace.so
204K /usr/local/lib/R/site-library/curl/libs/curl.so
328K /usr/local/lib/R/site-library/digest/libs/digest.so
33M /usr/local/lib/R/site-library/dplyr/libs/dplyr.so
36K /usr/local/lib/R/site-library/glue/libs/glue.so
3.2M /usr/local/lib/R/site-library/haven/libs/haven.so
272K /usr/local/lib/R/site-library/jsonlite/libs/jsonlite.so
52K /usr/local/lib/R/site-library/lazyeval/libs/lazyeval.so
64K /usr/local/lib/R/site-library/lubridate/libs/lubridate.so
16K /usr/local/lib/R/site-library/mime/libs/mime.so
124K /usr/local/lib/R/site-library/mnormt/libs/mnormt.so
372K /usr/local/lib/R/site-library/openssl/libs/openssl.so
772K /usr/local/lib/R/site-library/plyr/libs/plyr.so
92K /usr/local/lib/R/site-library/purrr/libs/purrr.so
13M /usr/local/lib/R/site-library/readr/libs/readr.so
4.7M /usr/local/lib/R/site-library/readxl/libs/readxl.so
1.2M /usr/local/lib/R/site-library/reshape2/libs/reshape2.so
160K /usr/local/lib/R/site-library/rlang/libs/rlang.so
928K /usr/local/lib/R/site-library/scales/libs/scales.so
4.9M /usr/local/lib/R/site-library/stringi/libs/stringi.so
1.3M /usr/local/lib/R/site-library/tibble/libs/tibble.so
2.0M /usr/local/lib/R/site-library/tidyr/libs/tidyr.so
1.2M /usr/local/lib/R/site-library/tidyselect/libs/tidyselect.so
4.7M /usr/local/lib/R/site-library/xml2/libs/xml2.so
78M total
root@de443801b3fc:/#
```

Looks like dplyr wins this one at thirty-three megabytes *just for its shared library*.

But with a single stroke of `strip` we can reduce all this down *a lot*:

```
root@de443801b3fc:/# strip --strip-debug /usr/local/lib/R/site-library/*/libs/*so
root@de443801b3fc:/# du -csh /usr/local/lib/R/site-library/*/libs/*so
440K /usr/local/lib/R/site-library/Rcpp/libs/Rcpp.so
220K /usr/local/lib/R/site-library/bindrcpp/libs/bindrcpp.so
52K /usr/local/lib/R/site-library/colorspace/libs/colorspace.so
56K /usr/local/lib/R/site-library/curl/libs/curl.so
120K /usr/local/lib/R/site-library/digest/libs/digest.so
2.5M /usr/local/lib/R/site-library/dplyr/libs/dplyr.so
16K /usr/local/lib/R/site-library/glue/libs/glue.so
404K /usr/local/lib/R/site-library/haven/libs/haven.so
76K /usr/local/lib/R/site-library/jsonlite/libs/jsonlite.so
20K /usr/local/lib/R/site-library/lazyeval/libs/lazyeval.so
24K /usr/local/lib/R/site-library/lubridate/libs/lubridate.so
8.0K /usr/local/lib/R/site-library/mime/libs/mime.so
52K /usr/local/lib/R/site-library/mnormt/libs/mnormt.so
84K /usr/local/lib/R/site-library/openssl/libs/openssl.so
76K /usr/local/lib/R/site-library/plyr/libs/plyr.so
32K /usr/local/lib/R/site-library/purrr/libs/purrr.so
648K /usr/local/lib/R/site-library/readr/libs/readr.so
400K /usr/local/lib/R/site-library/readxl/libs/readxl.so
128K /usr/local/lib/R/site-library/reshape2/libs/reshape2.so
56K /usr/local/lib/R/site-library/rlang/libs/rlang.so
100K /usr/local/lib/R/site-library/scales/libs/scales.so
496K /usr/local/lib/R/site-library/stringi/libs/stringi.so
124K /usr/local/lib/R/site-library/tibble/libs/tibble.so
164K /usr/local/lib/R/site-library/tidyr/libs/tidyr.so
104K /usr/local/lib/R/site-library/tidyselect/libs/tidyselect.so
344K /usr/local/lib/R/site-library/xml2/libs/xml2.so
6.6M total
root@de443801b3fc:/#
```

Down to six point six megabytes. Not bad for one command. The chart visualizes the respective reductions. Clearly, C++ packages (and their template use) lead to more debugging symbols than plain old C code. But once stripped, the size differences are not that large.

And just to be plain, what we showed previously in post #9 does the same, only already at installation stage. The effects are not cumulative.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.
