# Monthly Archives: May 2013

## Import All Text Files in A Folder with Parallel Execution

May 26, 2013

Sometimes we need to import all files of one type (e.g. *.txt) that share the same data layout from a folder, without knowing each file name, and then combine the pieces. With the old method, we can use the lapply() and do.call() functions to accomplish the task. However, when there are a large number of such files and …
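The lapply() and do.call() approach mentioned in this teaser can be sketched roughly as follows. This is a minimal illustration, not the post's own code: the helper name `import_all_txt`, the tab separator, and the presence of a header row are all assumptions.

```r
# Hypothetical helper: read every *.txt file in `dir` and stack the rows.
# Assumes all files share the same column layout and have a header row.
import_all_txt <- function(dir, sep = "\t") {
  files <- list.files(dir, pattern = "\\.txt$", full.names = TRUE)
  pieces <- lapply(files, read.table, header = TRUE, sep = sep,
                   stringsAsFactors = FALSE)
  do.call(rbind, pieces)
}

# Usage with two small throwaway files in a temporary folder
tmp <- file.path(tempdir(), "txt_demo")
dir.create(tmp, showWarnings = FALSE)
write.table(data.frame(id = 1:2, x = c(0.5, 1.5)), file.path(tmp, "a.txt"),
            sep = "\t", row.names = FALSE, quote = FALSE)
write.table(data.frame(id = 3:4, x = c(2.5, 3.5)), file.path(tmp, "b.txt"),
            sep = "\t", row.names = FALSE, quote = FALSE)
combined <- import_all_txt(tmp)
```

Here `list.files()` discovers the file names, so none has to be typed by hand, and `do.call(rbind, ...)` binds the list of data frames in a single call.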

## Logging Data in R Loops: Applied to Twitter.

May 26, 2013

A problem that many users face in R is storing the output from loop operations. In the case of Twitter, we may be requesting the last specified number of Tweets from a number of Twitter users. Several methods exist for …

## Pairwise distances in R

May 26, 2013

For a recent project I needed to calculate the pairwise distances of a set of observations to a set of cluster centers. In MATLAB you can use the pdist function for this. As far as I know, there is no equivalent in the R standard packages. So I looked into writing a fast implementation for
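One way to compute the distances from a set of observations to a set of cluster centers in base R is sketched below. The function name `pairwise_dist` is hypothetical, and the vectorized identity used here is an assumption about what a fast approach could look like, not the implementation the post goes on to describe.

```r
# Euclidean distances from each row of X to each row of C, using
# ||x - c||^2 = ||x||^2 - 2 * x.c + ||c||^2 to avoid explicit loops.
pairwise_dist <- function(X, C) {
  X <- as.matrix(X)
  C <- as.matrix(C)
  sq <- outer(rowSums(X^2), rowSums(C^2), "+") - 2 * tcrossprod(X, C)
  sqrt(pmax(sq, 0))  # clamp tiny negative values caused by rounding
}

# Usage: two observations, two centers
X <- rbind(c(0, 0), c(3, 4))
C <- rbind(c(0, 0), c(6, 8))
D <- pairwise_dist(X, C)  # D[i, j] is the distance from X[i, ] to C[j, ]
```

The result has one row per observation and one column per center.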

## Exploratory Data Analysis: Variations of Box Plots in R for Ozone Concentrations in New York City and Ozonopolis

Last week, I wrote the first post in a series on exploratory data analysis (EDA). I began by calculating summary statistics on a univariate data set of ozone concentration in New York City in the built-in data set "airquality" in R. In particular, I talked about how to calculate those statistics when the data

## Using R to visualize geo optimization algorithms

May 26, 2013

Site optimization is the process of finding an optimal location for a plant or a warehouse to minimize transportation costs and duration. A simple model only consists of one good and no restrictions regarding transportation capacities or delivery time. The optimizing algorithms are often hard to understand. Fortunately, R is a great tool to make them more comprehensible. The basic...

## Creating a typical textbook illustration of statistical power using either ggplot or base graphics

May 26, 2013

A common way of illustrating the idea behind statistical power in null hypothesis significance testing is to plot the sampling distributions of the null hypothesis and the alternative hypothesis. Typically, these illustrations highlight the regions that correspond to making a type II error, making a type I error, and correctly rejecting the null hypothesis (i.e. the test's power). In this post...
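The kind of illustration described here can be sketched in base graphics as below. This is only a rough sketch under stated assumptions: a one-sided z-test, a null distribution centered at 0, an alternative centered at `delta` (in standard-error units), and hypothetical function names `z_power` and `plot_power`. It is not the post's code.

```r
# Power of a one-sided z-test: H0 mean 0, H1 mean `delta` (standard errors)
z_power <- function(delta, alpha = 0.05) {
  crit <- qnorm(1 - alpha)        # rejection threshold under H0
  1 - pnorm(crit, mean = delta)   # area of the H1 curve beyond the threshold
}

plot_power <- function(delta = 2.5, alpha = 0.05) {
  crit <- qnorm(1 - alpha)
  x <- seq(-4, delta + 4, length.out = 400)
  plot(x, dnorm(x), type = "l", xlab = "test statistic", ylab = "density")
  lines(x, dnorm(x, mean = delta), lty = 2)
  # Shade the power region: H1 density to the right of the critical value
  xs <- x[x >= crit]
  polygon(c(crit, xs, max(xs)), c(0, dnorm(xs, mean = delta), 0),
          col = "grey80", border = NA)
  abline(v = crit, lty = 3)
}
```

Calling `plot_power()` draws both sampling distributions and shades the region whose area `z_power()` returns.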

## More bubble sort tuning

May 26, 2013

After last week's post on bubble sort tuning, I got an email from Berend Hasselman noting that my 'best' function did not protect against cases n<=2 and that a speed improvement was possible. That made me realize that I should have been profiling t...
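For readers unfamiliar with the algorithm, a plain bubble sort with a guard for very short inputs might look like the sketch below. It is not the 'best' function from the post, which is not shown here; the name `bubble_sort` and the early-exit logic are assumptions made for illustration.

```r
bubble_sort <- function(x) {
  n <- length(x)
  if (n < 2) return(x)           # nothing to swap for empty or length-1 input
  repeat {
    swapped <- FALSE
    for (i in seq_len(n - 1)) {
      if (x[i] > x[i + 1]) {
        x[c(i, i + 1)] <- x[c(i + 1, i)]  # swap the adjacent pair
        swapped <- TRUE
      }
    }
    if (!swapped) break          # a pass without swaps means x is sorted
    n <- n - 1                   # the largest remaining value is now in place
    if (n < 2) break
  }
  x
}
```

Using `seq_len(n - 1)` rather than `1:(n - 1)` matters for short vectors, because `1:0` counts downwards instead of producing an empty loop.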

## Test Drive of Parallel Computing with R

May 25, 2013

Today, I did a test run of parallel computing with the snow and multicore packages in R and compared the parallelism with the single-thread lapply() function. In the test code below, a data.frame with 20M rows is simulated in an Ubuntu VM with an 8-core CPU and 10 GB of memory. As the baseline, the lapply() function is employed to
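A much smaller version of this kind of comparison is sketched below. It uses the `parallel` package that ships with R (which builds on the multicore and snow work) rather than the post's exact setup; the workload `slow_square`, the input size, and the core count are all assumptions chosen to keep the example quick.

```r
library(parallel)

# Hypothetical workload: a short pause stands in for real computation
slow_square <- function(i) {
  Sys.sleep(0.01)
  i^2
}
inputs <- 1:20

# Forking is unavailable on Windows, so fall back to a single core there
cores <- if (.Platform$OS.type == "windows") 1L else 2L

serial_res   <- lapply(inputs, slow_square)                      # baseline
parallel_res <- mclapply(inputs, slow_square, mc.cores = cores)  # forked workers
```

Wrapping each call in `system.time()` gives the elapsed times to compare; both calls return the same list, so only the run time differs.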

## Revisiting text processing with R and Python

May 25, 2013

Back in 2011, I covered the relative performance difference of the most popular libraries for text processing in R and Python. In case you can’t guess the answer, Python and NLTK won by a significant margin over R and…