## Sorting Numeric Vectors in C++ and R

January 31, 2013
By

Consider the problem of sorting all elements of a given vector in ascending order. We can simply use the function std::sort from the C++ STL.

```cpp
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector stl_sort(NumericVector x) {
    NumericVector y = clone(x);       // copy so the input vector is not modified
    std::sort(y.begin(), y.end());
    return y;
}
```

```r
library(rbenchmark)
set.seed(123)
z <- rnorm(100000)
x <- rnorm(100)

# check that stl_sort is the same as sort
stopifnot(all.equal(stl_sort(x), sort(x)))
#...
```
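For readers who want to try the copy-then-sort idea without an R session, here is a minimal sketch in plain C++. It assumes a std::vector<double> in place of Rcpp's NumericVector, and the helper name stl_sort_demo is hypothetical; none of this is taken from the original post. Inputs containing NaN are outside the scope of this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Stand-in for the excerpt's stl_sort, using std::vector<double>
// instead of Rcpp::NumericVector (an assumption made for illustration).
std::vector<double> stl_sort(const std::vector<double>& x) {
    std::vector<double> y(x);          // copy, so the caller's data is untouched
    std::sort(y.begin(), y.end());     // ascending order via operator<
    return y;
}

// Hypothetical self-check: the result is sorted and the input is unchanged.
void stl_sort_demo() {
    const std::vector<double> x = {3.0, -1.5, 2.0, 0.0};
    const std::vector<double> y = stl_sort(x);
    assert(std::is_sorted(y.begin(), y.end()));
    assert(x.front() == 3.0);
}
```

Returning a sorted copy rather than sorting in place mirrors the clone(x) step in the excerpt: the caller's vector keeps its original order.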

## Using Boost via the new BH package

January 31, 2013
By

Earlier today the new BH package arrived on CRAN. Over the years, Jay Emerson, Michael Kane and I had numerous discussions about a basic Boost infrastructure package providing Boost headers for other CRAN packages. JJ and Romain chipped in as well, and Jay finally took the lead by first creating a repo on...

## repmis: misc. tools for reproducible research in R

January 30, 2013
By

I've started to put together an R package called repmis. It has miscellaneous tools for reproducible research with R. The idea behind the package is to collate commands that simplify some of the common R code used within knitr-type reproducible research papers. It's still very much in the early stages of development and has two commands: LoadandCite:...

## R installation + screenshots

January 30, 2013
By

Feeling faint of heart without photos depicting what to do? No worries, here they are. Go to the R website and click “Download R” under “Getting Started”. Choose a place to download R. Even though we’re on the limitless and borderless interweb, choosing a location close to you helps speed things up. Choose which R package to download based

## R users: Be counted in Rexer’s 2013 Data Miner Survey

January 30, 2013
By

Since 2007, Rexer Analytics has been conducting periodic surveys to measure the analytic behaviors, views and preferences of data miners and analytic professionals. In the last survey, conducted in 2011, more than 1300 analysts shared information about the data analysis software packages they use. (The results of all Rexer surveys are available free to anyone who requests them.) In...

January 30, 2013
By

A new Armadillo version 3.6.2 came out yesterday, and the corresponding RcppArmadillo version is now on CRAN. Changes are mostly incremental: Changes in RcppArmadillo version 0.3.6.2 (2013-01-29) Upgraded to Armadillo release Version 3.6.2 ...

January 30, 2013
By

A Problem: A major problem in secondary data analysis is that you didn't get to decide what data was collected. Let's say you were interested in how many times a student has read the Twilight books. Specifically, you want to know how effective the ads for...

## F1Stats – Visually Comparing Qualifying and Grid Positions with Race Classification

January 30, 2013
By

Following the roundabout tour of F1Stats – A Prequel to Getting Started With Rank Correlations, here’s a walkthrough of my attempt to replicate the first part of A Tale of Two

## R finals

January 30, 2013
By

On the morning I returned from Varanasi and the ISBA meeting there, I had to give my R final exam (along with three of my colleagues in Paris-Dauphine). This year, the R course was completely in English, exam included, which means I can post it here as it may attract more interest than the French

## Modeling Residential Electricity Usage with R

January 30, 2013
By

Wow, I can’t believe it has been 11 months since my last blog posting! The next series of postings will be related to the retail energy field. Residential power usage is satisfying to model as it can be forecast fairly accurately with the right inputs. Partly as a consequence of deregulation, there is now more data available than...

## Regression on categorical variables

January 30, 2013
By
$N_{x,t}\sim\mathcal{P}(E_{x,t}\cdot \exp[\alpha_x+\beta_x \kappa_t + \gamma_x \delta_{t-x}])$

This morning, Stéphane asked me a tricky question about extracting coefficients from a regression with categorical explanatory variates. More precisely, he asked me if it was possible to store the coefficients in a nice table, with information on the variable and the modality (those two pieces of information being in two different columns). Here is some code I wrote to produce the...

## Approaching the Zero Bound – Bonds

January 30, 2013
By

As bonds approach the artificial zero bound, where do we go next, especially after the record-setting +30% in 2011? The rolling 250-day total return has rarely gone negative since the inception of the Vanguard Funds VBMFX and VUSTX. I am int...

## The magic empty bracket

January 30, 2013
By

I have been working with R for some time now, but once in a while, basic functions catch my eye that I was not aware of… For some project I wanted to transform a correlation matrix into a covariance matrix. Now, since cor2cov does not exist, I thought about “reversing” the cov2cor function (stats:::cov2cor). Inside
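As a hedged aside on the conversion mentioned in this excerpt: cov2cor rescales a covariance matrix by the inverse standard deviations, so going the other way requires the standard deviations to be supplied separately, since a correlation matrix alone does not carry them. Assuming a correlation matrix $R$ and known standard deviations $\sigma_1, \ldots, \sigma_p$ (symbols chosen here for illustration, not taken from the post), the relation is:

$\Sigma = D\,R\,D, \qquad D = \operatorname{diag}(\sigma_1, \ldots, \sigma_p), \qquad \Sigma_{ij} = \sigma_i\,\sigma_j\,R_{ij}$

For example, with $\sigma = (2, 3)$ and $R_{12} = 0.5$, this gives $\Sigma_{11} = 4$, $\Sigma_{22} = 9$ and $\Sigma_{12} = 2 \cdot 3 \cdot 0.5 = 3$.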

## Speed up for loops in R

January 30, 2013
By

Are your for loops too slow in R? Are loops that should take seconds actually taking hours? As I found out recently, how you structure your code can make a huge difference in execution times. Fortunately, making a few small changes to your code can speed up these loops by several orders of

## R’s range and loop behaviour: Zero, One, NULL

January 30, 2013
By

One of the most common patterns in programming languages is the ability to iterate over a given set (usually a vector) by using 'for' loops. In most modern scripting languages, a range is a built-in data structure and trivial to use with 'for' lo...

## Building a package in RStudio is actually very easy

January 30, 2013
By

So, you’ve written some code and you use it routinely. Now you’ve written some code and you’d like to use version control to ensure that development continues in a robust fashion. You do that and you use Github or something so that not only are changes tracked, but the general public receives the benefit of

## The three-dots construct in R

January 30, 2013
By

There is a mechanism that allows variability in the arguments given to R functions. Technically it is an ellipsis, but it is more commonly called “…”, dots, dot-dot-dot or three-dots. Basics: the three-dots construct allows (1) an arbitrary number and variety of arguments, and (2) passing arguments on to other functions. Arbitrary arguments: the two prime cases are the c and list...

## A shiny app to display the human body map dataset

January 30, 2013
By

There was quite a lot of buzz around when the guys from RStudio launched Shiny, a new web framework for R that promises to “make it super simple for R users like you to turn analyses into interactive web applications …”

## Using Boost’s foreach macro

January 30, 2013
By

Boost provides a macro, BOOST_FOREACH, that allows us to easily iterate over elements in a container, similar to what we might do in R with sapply. In particular, it frees us from having to deal with iterators as we do with std::for_each and std::transform. The macro is also compatible with the objects exposed by Rcpp. Side note: C++11 has introduced...
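The contrast drawn above between iterator-based algorithms and element-wise iteration can be sketched as follows. To keep the example compilable without Boost, it does not use BOOST_FOREACH itself; a standard C++11 range-based for loop stands in for that style of iteration. The function names and the squaring operation are hypothetical, chosen only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Iterator-based version: std::transform needs explicit begin/end
// iterators for the input and an output iterator for the result.
std::vector<double> square_with_transform(const std::vector<double>& x) {
    std::vector<double> out(x.size());
    std::transform(x.begin(), x.end(), out.begin(),
                   [](double v) { return v * v; });
    return out;
}

// Element-wise version: a range-based for loop visits each element
// directly, with no iterators in user code (a Boost-free stand-in
// for the BOOST_FOREACH style, not the macro itself).
std::vector<double> square_with_range_for(const std::vector<double>& x) {
    std::vector<double> out;
    out.reserve(x.size());
    for (double v : x) {
        out.push_back(v * v);
    }
    return out;
}

// Hypothetical self-check: both versions produce the same result.
void squares_agree_demo() {
    const std::vector<double> x = {1.0, 2.0, 3.0};
    assert(square_with_transform(x) == square_with_range_for(x));
}
```

Both functions compute the same thing; the second reads closer to an sapply call in R because the loop body only ever sees one element at a time.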

## Converting a list to a data frame

January 30, 2013
By

There are many situations in R where you have a list of vectors that you need to convert to a data.frame. This question has been addressed over at StackOverflow and it turns out there are many different approaches to completing this task. Since I encounter this situation relatively frequently, I wanted my own S3 method for as.data.frame that...

## Tracking down errors in R

January 29, 2013
By

It's that moment we all know and love: somewhere in our code, something has gone wrong. We think we have done everything right, but instead of expected glory we find only terse red text lain below our lintel. This can be very frustrating, and troubleshooting these issues can often be very time-consuming. All is not lost. There are a...

## Another Benchmark for Joining Two Data Frames

January 29, 2013
By

In my post yesterday comparing efficiency in joining two data frames, I overlooked the computing cost of converting data.frames to data.tables / ff data objects. Today, I ran the test again, this time taking library loading and data conversion into account. After 10 replications with the rbenchmark package, the joining method with data.table

## Hilary: the most poisoned baby name in US history

January 29, 2013
By

I’ve always had a special fondness for my name, which, according to Ryan Gosling in “Lars and the Real Girl”, is a scientific fact for most people (Ryan Gosling constitutes scientific proof in my book). Plus, the root …

## Strata’s Data Driven Business Day

January 29, 2013
By

The tagline for O'Reilly Strata conference series — Making Data Work — has meant that it's always been popular with practitioners, primarily data scientists working with Big Data in real-world environments. Recent Strata events have also attracted more business-oriented attendees, with events focused more on processes and outcomes than on the implementation details. On Tuesday February 26, Strata Santa...

## Disruptive Data Science – Transforming Your Company into a Data Science-Driven Enterprise

January 29, 2013
By

Big Data is the latest technology wave impacting C-level executives across all areas of business, but amid the hype, there remains confusion about what it all means. The name emphasizes the exponential growth of data volumes worldwide (collectively, 5 exabytes/day in the latest estimate I saw from IDC), but more nuanced definitions of Big Data incorporate the following...