2^15 = 32768 and the sum of its digits is 3 + 2 + 7 + 6 + 8 = 26. What is the sum of the digits of the number 2^1000? Handling large numbers, or rather very large numbers, can be a pain at times. But have no fear, for GMP is here. GMP makes the s...
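
The digit-sum question above can be checked with a few lines of code. The sketch below is not the post's GMP approach: it assumes Python, whose built-in integers are already arbitrary precision, and the helper name `digit_sum` is made up for illustration.

```python
def digit_sum(n: int) -> int:
    """Return the sum of the decimal digits of a non-negative integer."""
    # Python ints are arbitrary precision, so 2**1000 is computed exactly;
    # converting to a string exposes each decimal digit in turn.
    return sum(int(ch) for ch in str(n))


print(digit_sum(2**15))    # 26, matching the worked example
print(digit_sum(2**1000))  # 1366
```

The same idea carries over to any big-integer library: compute the power exactly, render it in base 10, and add up the digits.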

Data Mining Applications with R: a book to be published by Elsevier. http://www.RDataMining.com/books/book2 Proposal Submission Deadline: April 30, 2012. Introduction: R is one of the most widely used data mining tools in scientific and business applications, among dozens of commercial …

I was writing comments on the blog post A proposal for a really fast statistics journal, and I realized the comment box was too small to write down my ideas. I like the proposal a lot, and I feel really bad about the current model of submitting and rev...

At the recent Big Data Workshop held by the Boston Predictive Analytics group, airline analyst and R user Jeffrey Breen gave a step-by-step guide to setting up an R and Hadoop infrastructure, first as a local virtual instance of Hadoop with R, using VMware and Cloudera's Hadoop Demo VM. (This is a great way to get familiar with Hadoop.)...

I have finally got around to posting the R code for my p curve simulation. Those familiar with R will realize how crude it is (I've been caught up with other urgent stuff and had no time to explore further). You are welcome to play with (and improve!) t...

I have previously described a few examples of portfolio construction: Introduction to Asset Allocation; Maximum Loss and Mean-Absolute Deviation risk measures; 130/30 Portfolio Construction; Minimum Investment and Number of Assets Portfolio Cardinality Constraints; Multiple Factor Model – Building 130/30 Index (Update). I created a number of helper functions to simplify the process of making the constraints(

I get asked frequently how to convert from one gene identifier to another. This can be tricky, especially when relying on gene symbols, as Will pointed out in a previous post a few years ago. There are several tools that can do this, including DAVID an...

Why don’t X-Y plots of latitude and longitude data look “right” compared to traditional map views? For example, here’s an X-Y scatterplot of some of Jenson Button’s McLaren telemetry data from the 2010 Australian Formula One Grand Prix: The image was generated, from a data file hosted on Google Spreadsheets, using the following R script,

I wanted to write contingency tables in HTML with hwrite(). I realized that the method hwrite() does not exist for the table objects. I could use as.data.frame(), but the table produced is non-intuitive. I did a search on R-bloggers and I quickly found the solution to my problem: the as.data.frame.matrix() function. The contingency table A

As promised, this post is a bit more graphical, but I feel the need to stress the importance of the first few points in chapter 2 of the book (i.e. the difference between mean and average and why variance is meaningful). These are fundamental concepts for future work. The “pumpkin” example (2.1) gives us an

I’ve recently been working with methylation data; specifically, from the Illumina Infinium HumanMethylation450 bead chip. It’s a rather complex array which uses two types of probes to determine the methylation state of DNA at ~485,000 sites in the genome. The Bioconductor project has risen to the challenge with a (somewhat bewildering) variety of

Lately I've been using Rob J Hyndman's excellent forecast package. The package comes with some built in plotting functions but I found I wanted to customize and make my own plots in ggplot. In order to do that, I need a generalizable function that will...

I am currently in Redondo Beach, CA at the Sunbelt XXXII social networks conference. The program is packed with interesting talks, so the event promises to be very interesting. This morning I gave the workshop “Introduction to Social Network Analysis with R”. Over 50 people registered. I am grateful to all the

Stata has a large number of graphics capabilities (and I highly recommend Stata over other statistical packages for a variety of reasons), but in a few instances R is more useful. In particular, I find R useful for creating beautiful scatter plot ...

The slides and replay for Dr Sanjiv Das's webinar, Using R for Analyzing Loans, Portfolios and Risk: From Academic Theory to Financial Practice, are now available. I've embedded the slides below: they tell a great story of how Das, after being mistaken for the then-CEO of Citibank (with whom he shares a name), was then led to research (using...

Oracle provides the Oracle R Distribution, an Oracle-supported distribution of open source R. Support for Oracle R Distribution is provided to customers of the Oracle Advanced Analytics option and the Oracle Big Data Appliance. The Oracle R Distribu...