Pairwise distances in R

May 26, 2013

For a recent project I needed to calculate the pairwise distances of a set of observations to a set of cluster centers. In MATLAB you can use the pdist function for this. As far as I know, there is no equivalent in the R standard packages. So I looked into writing a fast implementation for

Read more »

Exploratory Data Analysis: Variations of Box Plots in R for Ozone Concentrations in New York City and Ozonopolis


Introduction: Last week, I wrote the first post in a series on exploratory data analysis (EDA). I began by calculating summary statistics on a univariate data set of ozone concentration in New York City in the built-in data set “airquality” in R. In particular, I talked about how to calculate those statistics when the data

Read more »

Using R to visualize geo optimization algorithms

May 26, 2013

Site optimization is the process of finding an optimal location for a plant or a warehouse to minimize transportation costs and duration. A simple model consists of only one good and no restrictions regarding transportation capacities or delivery time. The optimizing algorithms are often hard to understand. Fortunately, R is a great tool to make them more comprehensible. The basic...

Read more »

Creating a typical textbook illustration of statistical power using either ggplot or base graphics

May 26, 2013

A common way of illustrating the idea behind statistical power in null hypothesis significance testing is by plotting the sampling distributions of the null hypothesis and the alternative hypothesis. Typically, these illustrations highlight the regions that correspond to making a type II error, type I error and correctly rejecting the null hypothesis (i.e., the test's power). In this post...

Read more »

Creating a typical textbook illustration of statistical power using either ggplot or base graphics

May 26, 2013

A common way of illustrating the idea behind statistical power in null hypothesis significance testing is by plotting the sampling distributions of the null hypothesis ($H_0$) and the alternative hypothesis ($H_A$). Typically, these illustrations highlight the regions that correspond to making a type II error ($\beta$), type I...

Read more »

More bubble sort tuning

May 26, 2013

After last week's post on bubble sort tuning, I got an email from Berend Hasselman noting that my 'best' function did not protect against cases where n<=2 and that a speed improvement was possible. That made me realize that I should have been profiling t...

Read more »

Test Drive of Parallel Computing with R

May 25, 2013

Today, I did a test run of parallel computing with the snow and multicore packages in R and compared the parallelism with the single-threaded lapply() function. In the test code below, a data.frame with 20M rows is simulated in an Ubuntu VM with an 8-core CPU and 10 GB of memory. As the baseline, the lapply() function is employed to

Read more »

Revisiting text processing with R and Python

May 25, 2013

Back in 2011, I covered the relative performance difference of the most popular libraries for text processing in R and Python. In case you can’t guess the answer, Python and NLTK won by a significant margin over R and…

Read more »

Speed trick: Assigning large object NULL is much faster than using rm()!

May 25, 2013

When processing large data sets in R you often also end up creating large temporary objects. In order to keep the memory footprint small, it is always good to remove those temporary objects as soon as possible. When done, removed objects will be deallocated from memory (RAM) the next time the garbage collection runs. Better: Use rm(list="x")...

Read more »

HOWTO: X11 Forwarding for Oracle R Enterprise

May 25, 2013


Read more »

Sentiment analysis finds trouble in the Enron emails

May 24, 2013

The Enron email dataset, collected during the FERC investigation of the Enron financial scandal, represents the largest publicly available set of emails. This makes them an ideal testbed for sentiment analysis algorithms. Ikanow's Andrew Strite used the open-source Infinit.e framework and a Hadoop cluster to generate sentiment scores for all of the Enron emails, and then used R to manipulate...

Read more »

Down and Dirty Forecasting: Part 2

May 24, 2013

This is the second part of the forecasting exercise, where I am looking at a multiple regression. To keep it simple, I chose the states that border WI and the US unemployment information for the regression. Again, this is a down and dirty analysis; I wo...

Read more »

What is probabilistic truth? Part 2 – Everything is conditional

May 24, 2013

Read Part 1. When making a statement of the form “1/2 is the correct probability that this coin will land tails”, there are a few things which are left unsaid, but which are typically implied. The statement is one about the probability of an unknown event occurring, and it would seem reasonable to write this

Read more »

Down and Dirty Forecasting: Part 1

May 24, 2013

I wanted to see what I could do in a hurry using the commands found at Forecasting: Principles and Practice. I chose a simple enough data set of Wisconsin unemployment from 1976 to the present (April 2013). I kept the last 12 months' worth of...

Read more »

Compiling R from Source with OpenMP, Accelerate and MKL in OS X

May 24, 2013

Compiling R from Source in OS X: I set out to find out whether I could speed up R by compiling it from source and using Apple's Accelerate Framework; enabling OpenMP (which is disabled under OS X and Windows by default, but enabled under Linux); and using Intel's Math Kernel Library. I also wanted to know how an implicit parallel library,...

Read more »

Shiny + Concerto = YES !!!

May 23, 2013

So I have finally gotten beta access to the two most powerful R-controlled web application makers in existence and produced some very exciting experimental products. A few posts ago I posted a Visual Reasoning Test that I had made by hand and powered wit...

Read more »

Robert Hijmans on Spatial Data Analysis

May 23, 2013

Last week at the Davis R Users’ Group, Robert Hijmans gave a talk about spatial data analysis in R. Robert is a professor of biogeography at UC Davis and the author of the raster (analysis of gridded data), dismo (species distribution modeling), and geosphere (spherical trigonometry) packages. Robert’s presentation spanned topics including basic...

Read more »

Working with shapefiles, projections and world maps in ggplot

May 23, 2013

In this post I will show some different examples of how to work with map projections and how to plot the maps using ggplot. Many maps that are shown using their default projection are in the longlat-format, which is far from optimal. For plotting world maps I prefer to use either Robinson or Winkel Tripel projection—but many more are available—and I will show how...

Read more »

Working with shapefiles, projections and world maps in ggplot

May 23, 2013

In this post I show some different examples of how to work with map projections and how to plot the maps using ggplot. Many maps that are using the default projection are shown in the longlat-format, which is far from optimal. Here I show how to use either the Robinson or Winkel Tripel projection.

Read more »

7th R/Rmetrics workshop in Switzerland, June 30-July 4

May 23, 2013

The 7th annual R/Rmetrics Workshop on Computational Finance and Financial Engineering will take place June 30-July 4 in the beautiful alpine setting of Lake Thun, Switzerland. This is an intimate workshop limited to around 50 participants, and features tutorials from leading practitioners in finance with R, with a special focus on the Rmetrics suite of R packages. This year's...

Read more »

Highlights of the Milwaukee Workshop on R and Bioinformatics

May 23, 2013

by Joseph Rickert. On May 10th and 11th, in honor of this being the International Year of Statistics, the Milwaukee Chapter of the American Statistical Association (MILWASA) held a workshop on cutting-edge uses of R in Bioinformatics. One objective of the workshop was to show the "nuts and bolts" details of how R with C++ integration and the...

Read more »

Package MatchIt: Balancing experimental data

May 23, 2013

A balanced experimental design is one in which the distribution of the covariates is the same in both the control and treatment groups. However, although achievable in an experimental scenario, for observational data this ideal is seldom attained. The MatchIt package provides a means of pre-processing data so that the treated and control groups are as similar

Read more »

Veterinary Epidemiologic Research: Modelling Survival Data – Non-Parametric Analyses

May 23, 2013

Next topic from Veterinary Epidemiologic Research: chapter 19, modelling survival data. We start with non-parametric analyses where we make no assumptions about either the distribution of survival times or the functional form of the relationship between a predictor and survival. There are 3 non-parametric methods to describe time-to-event data: actuarial life tables, Kaplan-Meier method, and

Read more »

Generating a Markov chain vs. computing the transition matrix

May 23, 2013

A couple of days ago, we had a quick chat on Karl Broman's blog about snakes and ladders (see http://kbroman.wordpress.com/…) with Karl and Corey (see http://bayesianbiologist.com/….), and the use of Markov chains. I do believe that this application is truly awesome: the example is understandable by anyone, and computations (almost any kind, from what we’ve tried) are easy to perform....

Read more »

The R-Podcast Episode 13: Interview with Yihui Xie

May 23, 2013

It’s an episode of firsts on the R-Podcast! In this episode, recorded on location, I had the honor and privilege of interviewing Yihui Xie, author of many innovative packages such as knitr and animation. Some of the topics we discussed include: Yihui’s motivation for creating knitr and some key new features; how markdown plays a

Read more »

xkcd Style Bubble Plot

May 23, 2013

A package was recently released to generate plots in the style of xkcd using R. Being a big fan of the cartoon, I could not resist trying it out. So I set out to produce something like one of Hans Rosling’s bubble plots. First I needed some data. Spoilt for choice. I scraped some population data broken

Read more »

The R-Podcast Episode 13: Interview with Yihui Xie

May 23, 2013

It's an episode of firsts on the R-Podcast! In this episode, recorded on location, I had the honor and privilege of interviewing Yihui Xie, author of many innovative packages such as knitr and animation. Some of the topics we discussed include: Yihui's motivation for creating knitr and some key new features; how markdown plays a key role in making reproducible research more ...

Read more »

Investment Portfolio Analysis with R Language

May 22, 2013

R is widely applied in financial analysis areas such as time series analysis, portfolio management, and risk management, through its basic functions and many specialized finance packages. In this article, we will demonstrate how to

Read more »

Vote in the KDnuggets poll on Analytics Software

May 22, 2013

The 14th annual KDnuggets poll measuring use of analytics software is open for voting. The poll asks, "What Predictive Analytics, Big Data, Data mining, Data Science software you used in the past 12 months for a real project?" and allows up to 20 choices from commercial software, open source software, and "big data" software. R was the leading choice...

Read more »
