(This article was first published on **R – rud.is**, and kindly contributed to R-bloggers)

spiderbar, spiderbar Reads robots rules from afar. Crawls the web, any size; Fetches with respect, never lies. Look Out! Here comes the spiderbar. Is it fast? Listen bud, It's got C++ under the hood. Can you scrape, from a site? Test with can_fetch(), TRUE == alright Hey, there There goes the spiderbar.

(Check the end of the post if you don’t recognize the lyrical riff.)

I’ve used and blogged about Peter Meissner’s most excellent `robotstxt` package before. It’s an essential tool for any ethical web scraper.

But (there’s always a “*but*“, right?), it was a definite bottleneck for an unintended package use case earlier this year (yes, I still have not rounded out the corners on my forthcoming “crawl delay” post).

I needed something faster for my bulk `Crawl-Delay` analysis, which led me to this small, spiffy C++ library for parsing `robots.txt` files. After a tiny bit of wrangling, that C++ library has turned into a small R package, `spiderbar`, which is now hitting a CRAN mirror near you, soon. (CRAN — rightly so — did not like the unoriginal name `rep`.)

I’m glad you asked!

Let’s take a look at one benchmark: parsing `robots.txt` and extracting `Crawl-delay` entries. Just how much faster is `spiderbar`?

```
library(spiderbar)
library(robotstxt)
library(microbenchmark)
library(tidyverse)
library(hrbrthemes)

rob <- get_robotstxt("imdb.com")

microbenchmark(
  robotstxt = {
    x <- parse_robotstxt(rob)
    x$crawl_delay
  },
  spiderbar = {
    y <- robxp(rob)
    crawl_delays(y)
  }
) -> mb1

update_geom_defaults("violin", list(colour = "#4575b4", fill = "#abd9e9"))

autoplot(mb1) +
  scale_y_comma(name = "nanoseconds", trans = "log10") +
  labs(title = "Microbenchmark results for parsing 'robots.txt' and extracting 'Crawl-delay' entries",
       subtitle = "Compares performance between robotstxt & spiderbar packages. Lower values are better.") +
  theme_ipsum_rc(grid = "Xx")
```

As you can see, it’s just *a tad bit faster*.

Now, you won’t notice that temporal gain in an interactive context but you absolutely will if you are cranking through a few million of them across a few thousand WARC files from the Common Crawl.

That adds up fast when you’re extracting `Crawl-Delay` values in bulk!

OK, fine. Do you care about fetchability? We can speed that up, too!

```
rob_txt <- parse_robotstxt(rob)
rob_spi <- robxp(rob)

microbenchmark(
  robotstxt = {
    robotstxt:::path_allowed(rob_txt$permissions, "/Vote")
  },
  spiderbar = {
    can_fetch(rob_spi, "/Vote")
  }
) -> mb2

autoplot(mb2) +
  scale_y_comma(name = "nanoseconds", trans = "log10") +
  labs(title = "Microbenchmark results for testing resource 'fetchability'",
       subtitle = "Compares performance between robotstxt & spiderbar packages. Lower values are better.") +
  theme_ipsum_rc(grid = "Xx")
```

(*Gosh, even Spider-Man got more respect!*)

OK, this is a tough crowd, but we’ve got vectorization covered as well:

```
microbenchmark(
  robotstxt = {
    paths_allowed(c("/ShowAll/this/that", "/Soundtracks/this/that", "/Tsearch/this/that"), "imdb.com")
  },
  spiderbar = {
    can_fetch(rob_spi, c("/ShowAll/this/that", "/Soundtracks/this/that", "/Tsearch/this/that"))
  }
) -> mb3

autoplot(mb3) +
  scale_y_comma(name = "nanoseconds", trans = "log10") +
  labs(title = "Microbenchmark results for testing multiple resource 'fetchability'",
       subtitle = "Compares performance between robotstxt & spiderbar packages. Lower values are better.") +
  theme_ipsum_rc(grid = "Xx")
```

Peter’s package does more than this one, since it helps find the `robots.txt` files and provides helpful data frames for more robots exclusion protocol content. And, we’ve got some plans for package interoperability. So, stay tuned, true believer, for more spider-y goodness.

You can check out the code and leave package questions or comments on GitHub.

*(Hrm… Peter Parker was Spider-Man and Peter Meissner wrote `robotstxt`, which is all about spiders. Coincidence?! I think not!)*

To **leave a comment** for the author, please follow the link and comment on their blog: **R – rud.is**.

R-bloggers.com offers

(This article was first published on **R-exercises**, and kindly contributed to R-bloggers)

eXtreme Gradient Boosting is a machine learning model which became really popular a few years ago after winning several Kaggle competitions. It is a very powerful algorithm that uses an ensemble of weak learners to obtain a strong learner. Its R implementation is available in the `xgboost` package, and it is really worth including in anyone’s machine learning portfolio.

This is the first part of the eXtremely Boost your machine learning series. For the other parts, follow the tag xgboost.

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

**Exercise 1**

Load the `xgboost` library and download the German Credit dataset. Your goal in this tutorial will be to predict `Creditability` (the first column in the dataset).

**Exercise 2**

Convert columns `c(2,4,5,7,8,9,10,11,12,13,15,16,17,18,19,20)` to factors and then encode them as dummy variables. HINT: use `model.matrix()`.

**Exercise 3**

Split the data into training and test sets 700:300. Create an `xgb.DMatrix` for both sets with `Creditability` as the label.

**Exercise 4**

Train `xgboost` with a logistic objective, 30 rounds of training, and a maximal depth of 2.
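A sketch of one possible path through Exercises 2–4 (the data frame name `credit` and the split seed are my own assumptions; adapt to however you loaded the dataset):

```
# Assumes the German Credit data is already loaded in a data.frame `credit`,
# with Creditability (0/1) as the first column.
library(xgboost)

fact_cols <- c(2,4,5,7,8,9,10,11,12,13,15,16,17,18,19,20)
credit[fact_cols] <- lapply(credit[fact_cols], as.factor)

# model.matrix() expands the factors into dummy variables (drop the intercept)
X <- model.matrix(Creditability ~ . - 1, data = credit)
y <- credit$Creditability

set.seed(42)
idx    <- sample(nrow(credit), 700)      # 700:300 split
dtrain <- xgb.DMatrix(X[idx, ],  label = y[idx])
dtest  <- xgb.DMatrix(X[-idx, ], label = y[-idx])

# Exercise 4: logistic objective, 30 rounds, maximal depth 2
fit <- xgboost(data = dtrain, objective = "binary:logistic",
               nrounds = 30, max_depth = 2)
```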

**Exercise 5**

To check model performance, calculate the test set classification error.

**Exercise 6**

Plot the predictors’ importance.

**Exercise 7**

Use `xgb.train()` instead of `xgboost()` to add both train and test sets as a watchlist. Train the model with the same parameters, but for 100 rounds, to see how it performs during training.
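A hedged sketch of the watchlist mechanics (assumes `dtrain` and `dtest` built as in Exercise 3):

```
watchlist <- list(train = dtrain, test = dtest)

fit <- xgb.train(params = list(objective = "binary:logistic", max_depth = 2),
                 data = dtrain, nrounds = 100, watchlist = watchlist)
# xgb.train() prints the evaluation metric on both sets at each round
# and keeps the history in fit$evaluation_log
```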

**Exercise 8**

Train the model again, adding AUC and Log Loss as evaluation metrics.

**Exercise 9**

Plot how AUC and Log Loss for the train and test sets changed during the training process. Use the plotting function/library of your choice.

**Exercise 10**

Check how setting the parameter `eta` to 0.01 influences the AUC and Log Loss curves.


(This article was first published on **Thinking inside the box**, and kindly contributed to R-bloggers)

A maintenance update, RcppGSL 0.3.3, is now on CRAN. It switches the vignette to our new pinp package and its two-column pdf default.

The RcppGSL package provides an interface from R to the GNU GSL using the Rcpp package.

No user-facing new code or features were added. The NEWS file entries follow below:

## Changes in version 0.3.3 (2017-09-24)

- The vignette now uses the pinp package in two-column mode.
- We also check for `gsl-config` at package load.
- Minor other fixes to package and testing infrastructure.

Courtesy of CRANberries, a summary of changes to the most recent release is available.

More information is on the RcppGSL page. Questions, comments etc should go to the issue tickets at the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


(This article was first published on **Thinking inside the box**, and kindly contributed to R-bloggers)

A new version of the RcppCNPy package arrived on CRAN yesterday.

RcppCNPy provides R with read and write access to NumPy files thanks to the cnpy library by Carl Rogers.

This version updates internals for function registration, but otherwise mostly switches the vignette over to the shiny new pinp two-page template and package.

## Changes in version 0.2.7 (2017-09-22)

- Vignette updated to Rmd and use of `pinp` package
- File `src/init.c` added for dynamic registration

CRANberries also provides a diffstat report for the latest release. As always, feedback is welcome and the best place to start a discussion may be the GitHub issue tickets page.



(This article was first published on **Thinking inside the box**, and kindly contributed to R-bloggers)

RcppClassic 0.9.8 is a bug-fix release for the very recent 0.9.7 release; it fixes a build issue on macOS introduced in 0.9.7. There are no other changes.

Courtesy of CRANberries, there are changes relative to the previous release.

Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.



(This article was first published on **R – Win-Vector Blog**, and kindly contributed to R-bloggers)

I am pleased to announce that `vtreat` version 0.6.0 is now available to `R` users on CRAN.

`vtreat` is an *excellent* way to prepare data for machine learning, statistical inference, and predictive analytic projects. If you are an `R` user we *strongly* suggest you incorporate `vtreat` into your projects.

`vtreat` handles, in a statistically sound fashion:

- Missing values.
- Encoding of categorical values for regularized inference and machine learning techniques.
- Categorical variables with very many values.
- Novel categorical values (that is values not seen during training).
- Variable pruning.
- y-aware scaling.
- Structured cross-validation.
- Mitigating nested model bias.
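As a taste of the basic workflow, here is a minimal sketch on made-up data (the tiny data frame is my own invention; see the package vignettes for real usage):

```
library(vtreat)

# Tiny made-up frame with a missing value in a categorical variable
d <- data.frame(x = c("a", "a", "b", "b", NA),
                y = c(1, 1, 0, 0, 1))

# Design a treatment plan for a binary outcome, then apply it
treatments <- designTreatmentsC(d, varlist = "x",
                                outcomename = "y", outcometarget = 1)
d_treated  <- prepare(treatments, d)
head(d_treated)
```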

In our (biased) opinion, `vtreat` has *the best* methodology and documentation for these important data cleaning and preparation steps. `vtreat`‘s current public open-source implementation is for in-memory `R` analysis (we are considering ports and certified ports of the package some time in the future, possibly for: `data.table`, `Spark`, `Python`/`Pandas`, and `SQL`).

`vtreat` brings *a lot* of power, sophistication, and convenience to your analyses, without a lot of trouble.

A new feature of `vtreat` version 0.6.0 is called “custom coders.” Win-Vector LLC‘s Dr. Nina Zumel is going to start a short article series showing how this new interface can be used to extend `vtreat` methodology to include the very powerful method of partially pooled inference (a term she will spend some time clearly defining and explaining). Time permitting, we may continue with articles on other applications of custom coding, including: ordinal/faithful coders, monotone coders, unimodal coders, and set-valued coders.

*Please* help us share and promote this article series, which should start in a couple of days. This should be a fun chance to share very powerful methods with your colleagues.


(This article was first published on **R-exercises**, and kindly contributed to R-bloggers)

Statistics is often taught in school by and for people who like mathematics. As a consequence, in those classes the emphasis is put on learning equations, solving calculus problems and creating mathematical models instead of building an intuition for probabilistic problems. But if you are reading this, you know a bit of R programming and have access to a computer that is really good at computing stuff! So let’s learn how we can tackle useful statistical problems by writing simple R code and how to think in probabilistic terms.

In this series of articles I have tried to help you create an intuition for how probabilities work. To do so, we have been using simulations to see how concrete random situations can unfold, and to learn simple statistical and probabilistic concepts along the way. In today’s set, I would like to show you some deceptively difficult situations that will challenge the way you understand probability and statistics. Working through them, you will practice the simulation techniques we have seen in past sets, refine your intuition and, hopefully, learn to avoid some pitfalls when you do your own statistical analysis.

Answers to the exercises are available here.

For the other parts of this exercise set, follow the tag Hacking stats.

**Exercise 1**

Suppose that there are exactly 365 days in a year and that the distribution of birthdays in the population is uniform, meaning that the proportion of births on any given day is the same throughout the year. In a group of 25 people, what is the probability that at least two individuals share the same birthday? Use a simulation to answer that question, then repeat the process for groups of 0, 10, 20, …, 90 and 100 people and plot the results.

Of course, when the group size is 366 we know that the probability that two people share the same birthday is equal to 1, since there are more people than days in the year, and for a group of zero people this probability is equal to 0. What is counterintuitive here is the rate at which the probability grows in between. From the graph we can see that with just 23 people we have a probability of about 50% of observing two people with the same birthday, and that a group of about 70 people has an almost 100% chance of it happening.
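A minimal simulation along those lines (the function name is my own):

```
# Probability that at least two of n people share a birthday
birthday_prob <- function(n, reps = 10000) {
  mean(replicate(reps, any(duplicated(sample(365, n, replace = TRUE)))))
}

set.seed(1)
birthday_prob(25)    # close to the theoretical value of about 0.57

sizes <- seq(0, 100, by = 10)
probs <- sapply(sizes, birthday_prob)
plot(sizes, probs, type = "b",
     xlab = "Group size", ylab = "P(at least one shared birthday)")
```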

**Exercise 2**

Here’s a problem that could someday save your life. Imagine you are a prisoner of war and your jailers are getting bored. To pass the time, they set up a Russian roulette game where you and another inmate play against one another. A jailer takes a six-shooter revolver, puts two bullets in two consecutive chambers, spins the cylinder and gives the gun to your opponent, who places it to his temple and pulls the trigger. Luckily for him, the chamber was empty, and the gun is passed to you. Now you have a choice to make: you can leave the cylinder as it is and play, or you can spin it before playing. Use 10000 simulations of both choices to find which one gives you the highest probability of surviving.

The key detail in this problem is that the bullets are in consecutive chambers. This means that if your opponent pulled the trigger on an empty chamber and you don’t spin the cylinder, you cannot land on the second bullet: you can only land on an empty chamber or on the first bullet. That gives you a 1/4 = 25% chance of dying, versus a 2/6 ≈ 33% chance of dying if you spin the cylinder.
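One way to simulate both strategies (the chamber numbering is my own convention):

```
# Chambers 1:6 with bullets in consecutive chambers 1 and 2;
# the cylinder advances by one chamber per trigger pull.
set.seed(1)
n <- 10000

# No re-spin: given that your opponent survived, his chamber was one of
# the four empty ones (3:6) and you get the next chamber in sequence.
first <- sample(3:6, n, replace = TRUE)
die_no_spin <- (first %% 6 + 1) %in% c(1, 2)

# Re-spin: your chamber is uniform over all six
die_spin <- sample(1:6, n, replace = TRUE) %in% c(1, 2)

c(no_spin = mean(die_no_spin), spin = mean(die_spin))
# no spin is about 0.25, spinning about 0.33: don't spin
```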

**Exercise 3**

What is the probability that a mother who is pregnant with nonidentical twins gives birth to two boys, if we know that one of the unborn children is a boy but we cannot identify which one?

**Exercise 4**

Two friends play heads or tails to pass the time. To make the game more fun they decide to gamble pennies: for each coin flip, one friend calls heads or tails, and if he calls right he wins a penny, otherwise he loses one. Let’s say that they have 40 and 30 pennies respectively and that they will play until someone has all the pennies.

- Create a function that simulates a complete game and returns how many coin flips were made and who won.
- On average, how many coin flips are needed before someone has all the pennies?
- Plot the histogram of the number of coin flips in a simulation.
- What is the probability that someone wins a coin flip?
- What is the probability that each friend wins all the pennies? Why is it different from the probability of winning a single coin flip?

When the number of coin flips gets high enough, the probability that someone wins often enough to take all the pennies rises to 100%. Maybe they will have to play 24 hours a day for weeks, but someday, someone will lose often enough to be penniless. In this context, the player who starts with the most money has a huge advantage, since he can survive a much longer losing streak than his opponent.

In fact, when the probability of winning a single flip is equal for both opponents, each player’s probability of winning all the money is equal to the proportion of the money he starts with. That’s partly why the casino always wins: since it has more money than any gambler who plays against it, as long as it gets them to play long enough, it will win. The fact that casinos offer games where they have the greater chance of winning helps them quite a bit too.
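The whole game fits in a short function (names are my own):

```
# Play until one player is broke; return the number of flips and the winner
play_game <- function(a = 40, b = 30) {
  flips <- 0
  while (a > 0 && b > 0) {
    flips <- flips + 1
    if (runif(1) < 0.5) { a <- a + 1; b <- b - 1 }
    else                { a <- a - 1; b <- b + 1 }
  }
  c(flips = flips, a_wins = as.numeric(b == 0))
}

set.seed(1)
res <- replicate(1000, play_game())
mean(res["flips", ])   # average length, close to 40 * 30 = 1200 flips
mean(res["a_wins", ])  # close to 40/70, the richer player's share
hist(res["flips", ], main = "Coin flips per game")
```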

**Exercise 5**

A classic counterintuitive problem is the Monty Hall problem. Here’s the scenario, if you have never heard of it: you are on a game show where you can choose one of three doors, and if a prize is hidden behind that door, you win it. Here’s the twist: after you choose a door, the game show host opens one of the two other doors to show that there’s no prize behind it. At this point, you can look behind the door you chose in the first place to see if there’s a prize, or you can switch and look behind the door you left out.

- Simulate 10000 games where you look behind the door you chose in the first place, to estimate the probability of winning with that strategy.
- Repeat this process, but this time choose to switch doors.
- Why are the probabilities different?

When you pick the first door, you have a 1/3 chance of picking the right one. When the show host opens one of the doors you didn’t pick, he gives you a huge amount of information about where the prize is, because he opened a door with no prize behind it. So the remaining door is more likely to hide the prize than the door you picked in the first place. Our simulation tells us that this probability is about 2/3. So you should always switch doors, since this gives you a higher probability of winning the prize.

To better understand this, imagine that the Grand Canyon is filled with small capsules, each with a volume of one cubic centimeter. Of all those capsules only one contains a piece of paper, and if you pick that capsule, you win a 50% discount on a tie. You choose a capsule at random, and then all the other trillions of capsules are discarded except one, such that the winning capsule is still in play. Assuming you really want this discount, which capsule would you choose?
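A sketch of the two strategies (watch out for R’s `sample()` scalar gotcha when only one door can be opened):

```
monty <- function(switch, reps = 10000) {
  wins <- replicate(reps, {
    prize  <- sample(3, 1)
    choice <- sample(3, 1)
    # The host opens a door that is neither your choice nor the prize.
    # sample(x, 1) misbehaves when x is a single number, so guard for it.
    closed <- setdiff(1:3, c(choice, prize))
    opened <- if (length(closed) == 1) closed else sample(closed, 1)
    if (switch) choice <- setdiff(1:3, c(choice, opened))
    choice == prize
  })
  mean(wins)
}

set.seed(1)
c(stay = monty(FALSE), switch = monty(TRUE))  # about 1/3 versus about 2/3
```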

**Exercise 6**

This problem is a real-life example of a statistical pitfall that can easily be encountered in practice, published by Steven A. Julious and Mark A. Mullee. In this dataset, we can see whether a medical treatment for kidney stones has been effective. There are two treatments: treatment A, which includes all open surgical procedures, and treatment B, which includes small-puncture surgery. The kidney stones are classified into two categories depending on their size: small or large.

- Compute the success rate (number of successes/total number of cases) of both treatments.
- Which treatment seems more successful?
- Create a contingency table of the successes.
- Compute the success rate of both treatments when treating small kidney stones.
- Compute the success rate of both treatments when treating large kidney stones.
- Which treatment is more successful for small kidney stones? For large kidney stones?

This is an example of Simpson’s paradox, a situation where an effect appears to be present for the set of all observations but disappears, or even reverses, when the observations are split into groups and the analysis is done on each group. It is important to test for this phenomenon, since in practice most observations can be classified into subclasses and, as this example shows, doing so can drastically change the results of your analysis.

**Exercise 7**

- Download this dataset and do a linear regression with the variables X and Y. Then compute the slope of the regression’s trend line.
- Do a scatter plot of the variables X and Y and add the trend line to the graph.
- Repeat this process for each of the three categories.

We can see that the general trend of the data is different from the trends within each of the categories. In other words, Simpson’s paradox can also be observed in a regression context. The moral of the story is: make sure that all the relevant variables are included in your analysis, or you’re going to have a bad time!

**Exercise 8**

For this problem you must know what true positives, false positives, true negatives and false negatives are in a classification problem. You can look at this page for a quick review of those concepts.

A big data algorithm has been developed to detect potential terrorists by looking at their behavior on the internet, their consumption habits and their travels. To develop this classification algorithm, the computer scientists used data from a population with many known terrorists, since they needed data about the habits of real terrorists to validate their work. In this dataset, you will find observations from this high-risk population and observations taken from a low-risk population.

- Compute the true positive rate, the false positive rate, the true negative rate and the false negative rate of this algorithm for the population with a high risk of terrorism.
- Repeat this process for the remaining observations. Is there a difference between those rates?

It is a known fact that false positive rates are a lot higher in low-incidence populations, a phenomenon known as the base rate fallacy. Basically, when the incidence of a condition in the population is lower than the false positive rate of a test, using that test on this population will produce many more false positives than usual. This is partly because the drop in true positive cases makes the proportion of false positives that much higher. As a consequence: don’t trust your classification algorithm too much!
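The arithmetic behind this effect can be checked directly with Bayes’ rule (the rates below are illustrative numbers of my own choosing):

```
# A classifier with 99% sensitivity and a 1% false positive rate,
# applied to a population where only 1 person in 10000 is a true positive
prevalence <- 1 / 10000
sens <- 0.99
fpr  <- 0.01

p_flagged <- sens * prevalence + fpr * (1 - prevalence)
p_true_given_flagged <- sens * prevalence / p_flagged
p_true_given_flagged   # about 0.0098: over 99% of flagged cases are false alarms
```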

**Exercise 9**

- Generate a population of 10000 values from a normal distribution with mean 42 and standard deviation 10.
- Create a sample of 10 observations and estimate the mean of the population. Repeat this 200 times.
- Compute the variance of those estimations.
- Create a sample of 50 observations and estimate the mean of the population. Repeat this 200 times and compute the variance of the estimations.
- Create a sample of 100 observations and estimate the mean of the population. Repeat this 200 times and compute the variance of the estimations.
- Create a sample of 500 observations and estimate the mean of the population. Repeat this 200 times and compute the variance of the estimations.
- Plot the variance of the estimates of the mean for the different sample sizes.

As you can see, the variance of the estimate of the mean shrinks as the sample size grows, but the relationship is not linear. A small sample can produce an estimate that is much farther from the real value than a sample with more observations. Let’s see why this is relevant to this set.
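The whole experiment fits in a few lines (names are my own):

```
set.seed(1)
pop <- rnorm(10000, mean = 42, sd = 10)

# Variance of the sample mean, estimated from 200 repeated samples of size n
est_var <- function(n, reps = 200) {
  var(replicate(reps, mean(sample(pop, n))))
}

sizes <- c(10, 50, 100, 500)
vars  <- sapply(sizes, est_var)
plot(sizes, vars, type = "b",
     xlab = "Sample size", ylab = "Variance of the estimated mean")
# Compare against the theoretical value 10^2 / n
```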

**Exercise 10**

A private school advertises that its small size helps its students achieve better grades. In its advertisement, it claims that last year its students scored on average 5 points higher than the state average on the standardized state test, and that since no large school has such a high average, this is proof that small schools help students achieve better results.

Suppose that there are 200000 students in the state, that their results on the state test were normally distributed with a mean of 76% and a standard deviation of 15, that the school in question had 100 students, and that an average school counts 750 students. Can the school’s claim be explained statistically?

A school can be seen as a sample of the population of students. A large school, like a large sample, has a much better chance of being representative of the student population, and its average score will often be near the population average, while a small school can show a much more extreme average just because it has a smaller student body. I’m not saying that no school is better than another, but we must look at many results to be sure we are not simply in the presence of a statistical anomaly.


(This article was first published on **R – insightR**, and kindly contributed to R-bloggers)

**By Gabriel Vasconcelos**

In this post I am going to discuss some features of Regression Trees and Random Forests. Regression Trees are known to be very unstable; in other words, a small change in your data may drastically change your model. The Random Forest turns this instability into an advantage through bagging (you can see details about bagging here), resulting in a very stable model.

The first question is how a Regression Tree works. Suppose, for example, that we have the number of points scored by a set of basketball players and we want to relate it to the players’ weight and height. The Regression Tree will simply split the height-weight space and assign a number of points to each partition. The figure below shows two different representations of a small tree. On the left we have the tree itself, and on the right, how the space is partitioned (the blue line shows the first partition and the red lines the following partitions). The numbers at the leaves of the tree (and in the partitions) represent the value of the response variable. Therefore, if a basketball player is taller than 1.85 meters and weighs more than 100kg, he is expected to score 27 points (I invented this data =] ).

You might be asking how I chose the partitions. In general, at each node the partition is chosen through a simple optimization problem that finds the best variable-observation pair based on how much the new partition reduces the model error.

What I want to illustrate here is how unstable a Regression Tree can be. The package `tree` has some examples that I will follow here with small modifications. The example uses computer CPU data, and the objective is to build a model for CPU performance based on some of its characteristics. The data has 209 CPU observations that will be used to estimate two Regression Trees. Each tree will be estimated from a random re-sample with replacement. Since the data comes from the same place, it would be desirable to obtain similar results from both models.

```
library(ggplot2)
library(reshape2)
library(tree)
library(gridExtra)

data(cpus, package = "MASS")  # = Load Data

# = First Tree
set.seed(1)  # = Seed for Replication
tree1 = tree(log(perf) ~ syct + mmin + mmax + cach + chmin + chmax,
             data = cpus[sample(1:209, 209, replace = TRUE), ])
plot(tree1); text(tree1)
```

```
# = Second Tree
set.seed(10)
tree2 = tree(log(perf) ~ syct + mmin + mmax + cach + chmin + chmax,
             data = cpus[sample(1:209, 209, replace = TRUE), ])
plot(tree2); text(tree2)
```

As you can see, the two trees are different from the start. We can use some figures to verify this. First, let us calculate the predictions of each model on the real data (not the re-samples). The first figure is a scatterplot of the two sets of predictions and the second shows their boxplots. Although the scatterplot shows some relation between the two predictions, it is far from good.

```
# = Calculate predictions
pred = data.frame(p1 = predict(tree1, cpus),
                  p2 = predict(tree2, cpus))

# = Plots
g1 = ggplot(data = pred) + geom_point(aes(p1, p2))
g2 = ggplot(data = melt(pred)) + geom_boxplot(aes(variable, value))
grid.arrange(g1, g2, ncol = 2)
```

As mentioned before, the Random Forest solves the instability problem using bagging. We simply estimate the desired Regression Tree on many bootstrap samples (re-sample the data many times with replacement and re-estimate the model) and make the final prediction as the average of the predictions across the trees. There is one small (but important) detail to add. The Random Forest adds a new source of instability to the individual trees: every time we compute a new optimal variable-observation point to split on, we do not consider all the variables. Instead, we randomly select a subset of them (for regression, the default in `randomForest` is one third of the variables). This makes the individual trees even more unstable, but as I mentioned here, bagging benefits from instability.
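In the `randomForest` package this per-split subsampling is controlled by the `mtry` argument; a quick sketch of tweaking it on the same data:

```
library(randomForest)
data(cpus, package = "MASS")

# mtry = 2 means each split considers only 2 of the 6 candidate predictors
set.seed(1)
rf2 <- randomForest(log(perf) ~ syct + mmin + mmax + cach + chmin + chmax,
                    data = cpus[1:150, ], mtry = 2, nodesize = 10)
```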

The question now is: how much improvement do we get from the Random Forest? The following example is a good illustration. I broke the CPU data into a training sample (the first 150 observations) and a test sample (the remaining observations) and estimated a Regression Tree and a Random Forest. The performance is compared using the mean squared error.

```
library(randomForest)

# = Regression Tree
tree_fs = tree(log(perf) ~ syct + mmin + mmax + cach + chmin + chmax,
               data = cpus[1:150, ])

# = Random Forest
set.seed(1)  # = Seed for replication
rf = randomForest(log(perf) ~ syct + mmin + mmax + cach + chmin + chmax,
                  data = cpus[1:150, ], nodesize = 10, importance = TRUE)

# = Calculate MSE
mse_tree = mean((predict(tree_fs, cpus[-c(1:150), ]) - log(cpus$perf)[-c(1:150)])^2)
mse_rf   = mean((predict(rf, cpus[-c(1:150), ]) - log(cpus$perf[-c(1:150)]))^2)
c(rf = mse_rf, tree = mse_tree)
```

```
##        rf      tree
## 0.2884766 0.5660053
```

As you can see, the Regression Tree has an error twice as big as the Random Forest’s. The only problem is that, by using a combination of trees, any kind of interpretation becomes really hard. Fortunately, there are importance measures that allow us to at least know which variables are most relevant in the Random Forest. In our case, both importance measures pointed to the cache size as the most important variable.

```
importance(rf)

##         %IncMSE IncNodePurity
## syct   22.60512     22.373601
## mmin   19.46153     21.965340
## mmax   24.84038     27.239772
## cach   27.92483     33.536185
## chmin  13.77196     13.352793
## chmax  17.61297      8.379306
```

Finally, we can see how the model error decreases as we increase the number of trees in the Random Forest with the following code:

```
plot(rf)
```

If you liked this post, you can find more details on Regression Trees and Random Forests in the book Elements of Statistical Learning, which can be downloaded directly from the authors’ page here.
