R still the preferred tool of predictive modelers competing at Kaggle

November 29, 2011

As reported on the Kaggle blog No Free Hunch, R remains the preferred tool for data scientists seeking to win the prizes in the predictive modeling competitions: More than 30% of Kaggle competitors report using R for their analysis, up from 22% a year ago. R's flexibility and the breadth of packages for machine learning and predictive modeling make...

Read more »

Relation Between Fires and Distance to the Nearest Road (Recalculated)

November 29, 2011

As you may already know, I'm the proud owner of an AMD FX-8150 8-core CPU, and I bought it not for gaming but for science. My previous CPU was painfully slow at calculations such as determining the relation between fires and distance t...

Read more »

Permanently Setting the CRAN repository

November 29, 2011

Setting the CRAN repository so that R does not ask every time you try to install a package is something that I think few people bother to do, but it is simple and can save a fair bit of frustration when working. This is accomplished through a setting in one of the Rprofile files. There

Read more »
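A minimal sketch of the kind of setting the excerpt above describes, assuming it goes in a user-level ~/.Rprofile (the excerpt is cut off before it names the file). The mirror URL is an illustrative choice, not necessarily the one the post uses.

```r
# Sketch of lines for ~/.Rprofile (assumed location).
# The mirror URL below is an example; substitute your preferred CRAN mirror.
local({
  r <- getOption("repos")                    # current repository settings
  r["CRAN"] <- "https://cloud.r-project.org" # set a default CRAN mirror
  options(repos = r)                         # store the updated setting
})
```

With a default mirror stored in the repos option, install.packages() should no longer prompt for a mirror in new sessions.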

Review of "The Art of R Programming" by Norman Matloff

November 29, 2011

By Joseph Rickert. Anyone seeking to learn R faces two major challenges: (1) learning how to swim in the sea of information (R packages, books, websites, blog posts, message boards, etc.) that threatens to drown a newbie, and (2) coming to grips with the structure, syntax and features of the language itself. Having some idea of what one...

Read more »

Contributions to the R source

November 29, 2011

One of the nice things about tracking the R subversion repository using git instead of subversion is that you can run git shortlog -s -n, which gives you:

19855 ripley
6302 maechler
5299 hornik
2263 pd
1153 murdoch
813 iacus
716 luke
6...

Read more »

Example 9.16: Small multiples

November 29, 2011

Small multiples are one of the great ideas of graphics visionary Edward Tufte (e.g., in Envisioning Information). Briefly, the idea is that if many variations on a theme are presented, differences quickly become apparent. Today we offer general guida...

Read more »

Accessing and Visualising Sentencing Data for Local Courts

November 29, 2011

A recent provisional data release from the Ministry of Justice contains sentencing data from English(?) courts, at the offence level, for the period July 2010-June 2011: “Published for the first time every sentence handed down at each court in the country between July 2010 and June 2011, along with the age and ethnicity of each

Read more »

outersect(): The opposite of R’s intersect() function

November 29, 2011

The Objective: To find the non-duplicated elements between two or more vectors (i.e. the "yellow" sections of the diagram above). The Problem: I needed the opposite of R's intersect() function, an "outersect()". The closest I found was setdiff(), but the order of the input vectors produces different results, e.g. setdiff() produces all elements of the first

Read more »
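One way to sketch the idea in the excerpt above is to combine setdiff() in both directions, so that the argument order no longer matters. This is an illustrative two-vector version; the excerpt is truncated, so the post's actual implementation may differ.

```r
# Elements that appear in exactly one of the two vectors
# (the symmetric difference), sorted for a stable result.
outersect <- function(x, y) {
  sort(c(setdiff(x, y), setdiff(y, x)))
}

a <- 1:3
b <- 2:4

setdiff(a, b)    # only elements of a missing from b
setdiff(b, a)    # only elements of b missing from a
outersect(a, b)  # both sides, regardless of argument order
```

Because it takes the union of both set differences, outersect(a, b) and outersect(b, a) return the same values.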

A/B Testing in R – Part 1

November 29, 2011

A/B testing is a method for comparing the effectiveness of several different variations of a web page. For example, an online clothing retailer that specializes in men's streetwear may want to examine whether a black or pink background results in more purchases from visitors to the site. Let's say that our online store is just

Read more »
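As a rough illustration of the comparison described above, the sketch below tests whether two page variants convert at different rates using base R's prop.test(). The visitor and purchase counts are hypothetical, and the truncated excerpt does not say which test the post itself uses.

```r
# Hypothetical counts, invented purely for illustration
visitors  <- c(black = 1000, pink = 1000)
purchases <- c(black = 50,   pink = 70)

# Observed conversion rate for each background colour
conversion <- purchases / visitors

# Two-sample test of equal proportions (stats package, loaded by default)
result <- prop.test(x = purchases, n = visitors)
result$p.value
```

A small p-value would suggest that the difference in conversion rates is unlikely to be due to chance alone.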

Trading Strategy Sensitivity Analysis

November 28, 2011

When designing a trading strategy, I want to make sure that small changes in the strategy parameters will not turn a profitable strategy into a losing one. I will study the strategy's robustness and profitability under different parameter scenarios using a sample strategy presented by David Varadi in the Improving Trend-Following Strategies With Counter-Trend Entries

Read more »

Dealing with R and HANA

November 28, 2011

First things first: what's "R"? Simply put, it is a programming language and software environment for statistical computing and graphics. More information can be found here: R on Wikipedia. I have coded in many programming languages, some of them very commer...

Read more »

How to speed up loops in R

November 28, 2011

As with any language, there are often several ways to code up the solution to a programming problem in R. If performance of the code is important (i.e. it's something you plan to run many times, or with a lot of data), how you code the solution can often have a big impact on how fast it runs. For...

Read more »

R’s Distrotheque

November 28, 2011

(Update: The csound package is now available on CRAN.) Do your random variables need to groove more? Of course they do. That's why I've been working on the upcoming csound package for R, which connects to Csound computer synthesis software to make any sound imaginable. Your computer'll be the hippest sample space on the randomized

Read more »

Retrieve GBIF Species Occurrence Data with Function from dismo Package

November 28, 2011

The dismo package is awesome: with a few short lines of code you can easily read and map species distribution data from GBIF (the Global Biodiversity Information Facility).

Read more »

Course: Financial Data Modeling and Analysis in R

November 28, 2011

The University of Washington is holding a web-based course which will be of interest to anyone who wants to learn about financial modeling with R: Financial Data Modeling and Analysis in R (AMATH 542) is a comprehensive introduction to the R statistical programming language for computational finance offered by the University of Washington Computational Finance program and taught by...

Read more »

Where the Worlds of Dentistry and Cartography Collide

November 28, 2011

As I was getting a root canal last week, my dental X-rays reminded me anew of an optical illusion that stumped us for a short time recently when we were developing our heatmapping engine. (Image caption: My X-rays before, during and after a recent root canal.) The...

Read more »

Predicting Gender

November 28, 2011

If there are two classes (this can be generalized to n) and both follow the same distribution, but with different parameters, it is possible to predict which class an observation comes from. Here I'll try to predict a person's gender based on their height. The distribution of a person's height is more or less normal. There

Read more »
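The idea in the excerpt above can be sketched with Bayes' rule: evaluate the normal density of the observed height under each class and normalise. The means and standard deviations below are hypothetical values picked for illustration, and the function name p_male is likewise an assumption, not taken from the post.

```r
# Hypothetical class parameters in cm (not from the post)
mu_m <- 178; sd_m <- 7    # assumed male height distribution
mu_f <- 165; sd_f <- 6.5  # assumed female height distribution

# Posterior probability that a person of the given height is male,
# using normal likelihoods and a prior probability of being male.
p_male <- function(height, prior_m = 0.5) {
  lik_m <- dnorm(height, mean = mu_m, sd = sd_m) * prior_m
  lik_f <- dnorm(height, mean = mu_f, sd = sd_f) * (1 - prior_m)
  lik_m / (lik_m + lik_f)
}

p_male(c(155, 170, 190))
```

A height is then assigned to the class with the larger posterior probability, i.e. "male" when p_male() exceeds 0.5.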

Another aspect of speeding up loops in R

November 28, 2011

Any frequent reader of R-bloggers will have come across several posts concerning the optimization of code, in particular the avoidance of loops. Here's another aspect of the same issue. If you have experience programming in other languages besides R, this is probably a no-brainer, but for laymen, like myself, the following example was...

Read more »

A nice short article on memory in R

November 28, 2011

There is a nice short article on memory issues in R at http://www.matthewckeller.com/html/memory.html. If you use R to process large data, you might find it helpful. It introduces: - checking how much memory an object is taking; - the memory …

Read more »

Prime Number in R Language (CloudStat)

November 28, 2011

A prime number (or a prime) is a natural number greater than 1 that has no positive divisors other than 1 and itself. R Language Code: The Prime Function: prime = function(n){ n = as.integer(n); if(n > 1e8) stop("n too large"); primes = re...

Read more »

A Story of Life and Death. On CRAN. With Packages.

November 27, 2011

The Comprehensive R Archive Network, or CRAN for short, has been a major driver in the success and rapid proliferation of the R statistical language and environment. CRAN currently hosts around 3400 packages, and is growing at a rapid rate. Not too ...

Read more »

Regression via Gradient Descent in R

November 27, 2011

In a previous post I derived the least squares estimators using basic calculus, algebra, and arithmetic, and also showed how the same results can be achieved using the canned functions in SAS and R or via the matrix programming capabilities offered by ...

Read more »

Basic Econometrics in R and SAS

November 27, 2011

Regression Basics: y = b0 + b1*X ("regression line we want to fit"). The method of least squares minimizes the squared distance between the line y and the individual data observations y_i. That is, minimize: ∑ e_i^2 = ∑ (y_i - b0 - b1*X_i...

Read more »
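To make the least squares criterion above concrete, here is a small sketch that computes b0 and b1 from the standard closed-form solution and compares them with R's built-in lm(). The data are simulated for illustration and are not from the post.

```r
set.seed(1)                  # simulated data, for illustration only
x <- runif(50, 0, 10)
y <- 2 + 3 * x + rnorm(50)

# Closed-form least squares estimates for a single predictor
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# The same fit via R's built-in linear model function
fit <- lm(y ~ x)

c(b0 = b0, b1 = b1)
coef(fit)
```

Both approaches minimize the same sum of squared residuals, so the two sets of coefficients should agree up to floating-point error.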

Gradient Descent in R

November 27, 2011

In a previous post I discussed the concept of gradient descent. Given some recent work in the online machine learning course offered at Stanford, I'm going to extend that discussion with an actual example using R code (the actual code...

Read more »

Dealing with Non-Positive Definite Matrices in R

November 27, 2011

Last time we looked at the Matrix package and dug a little into chol(), the Cholesky decomposition function. I noted that often in finance we do not have a positive definite (PD) matrix. The chol() function in both the Base and Matrix...

Read more »

Cleaning time-series and other data streams


The need to analyze time-series or other forms of streaming data arises frequently in many different application areas.  Examples include economic time-series like stock prices, exchange rates, or unemployment figures, biomedical data sequences like electrocardiograms or electroencephalograms, or industrial process operating data sequences like temperatures, pressures or concentrations.  As a specific example, the figure below shows four data sequences:...

Read more »

GTA R Users Group – Using R for Data Mining Competitions

November 27, 2011

Here are the presentation slides I used for my talk on "Using R for Data Mining Competitions" at Ryerson University as part of the Greater Toronto Area (GTA) R User's Meetup Group. Links: Presentation (Prezi), Presentation (PDF), Meetup Event page. Special thanks to Anthony Goldbloom from Kaggle and various competition winners for sharing their code and

Read more »

Analytics using R: Most active in my Twitter list

November 27, 2011

I follow some 80-odd people and news sources on my Twitter account. For a while I wondered which of these sources are most active on Twitter. I picked a simple metric, the number of status messages posted to Twitter, as the measure of activity. Using R, I quickly wrote a program to generate my top 10 most active...

Read more »

Putting it all together: concise code to make dotplots with weighted bootstrapped standard errors

November 27, 2011

I analyze a lot of experiments and there are many times when I want to quickly look at means and standard errors for each cell (experimental condition), or the same for each cell and individual-level attribute level (e.g., Democrat, Independent, …

Read more »