NSCB Sexy Stats Version 2

October 25, 2012

This is a revised version of my previous post about the NSCB article. Following a suggestion from Tal Galili, below are the new pie charts and the R code to produce these plots by directly scraping the data from the webpage using XML and RColorBrewer ...

Read more »

Using FAFSA Data to study Competitors – Part 2

October 25, 2012

I wanted to build upon my previous post and dive a little deeper into the sorts of questions we can answer using the FAFSA data supplied to us by applicants. As a quick overview, students completing the FAFSA for student aid can list up to ten institutions on the form. I consider this the student’s

Read more »

Modeling Couch Potato strategy

October 25, 2012

I first read about the Couch Potato strategy in the MoneySense magazine. I liked this simple strategy because it was easy to understand and easy to manage. The Couch Potato strategy is similar to the Permanent Portfolio strategy that I have analyzed previously. The Couch Potato strategy invests money in the given proportions among different

Read more »

Accelerating R code: Computing Implied Volatilities Orders of Magnitude Faster

October 25, 2012

This blog, together with Romain's, is one of the main homes of stories about how Rcpp can help with getting code to run faster in the context of the R system for statistical programming and analysis. By making it easier to get already existing C or C++ code to R, or equally to extend R with new C++...

Read more »

My Goodness. What a Fat Dataset!

October 25, 2012

Recently at work we got sent a data file containing information on donations to a specific charitable organization, ranging all the way back to the '80s. Usually, when we receive a dataset with a donation history in it, each row …

Read more »

Allstate compares SAS, Hadoop and R for Big-Data Insurance Models

October 25, 2012

At the Strata conference in New York today, Steve Yun (Principal Predictive Modeler at Allstate's Research and Planning Center) described the various ways he tackled the problem of fitting a generalized linear model to 150M records of insurance data. He evaluated several approaches: Proc GENMOD in SAS; installing a Hadoop cluster; using open-source R (both on the full data...

Read more »

Notes on a Scandal – When Jimmy beat Katy

October 25, 2012

No, the title doesn’t refer to how Katy Perry suffered at another of Jimmy Savile’s sexual predilections, although these are two of the participants. I’ll get to the details later. Just over a year ago, I reflected on the relative wiki searches of leading female singing celebrities, including Ms Perry. In the light of the

Read more »

Palettes in R

October 25, 2012

In its simplest form, a palette in R is simply a vector of colors. This vector can include hex triplets or R color names. The default palette can be seen through palette(): > palette("default") # you'll only need this line if you've previ...

Read more »

NSCB Sexy Statistics (Unemployment)

October 25, 2012

Recently, my friend posted on her Facebook account about the article published by the National Statistical Coordination Board (NSCB) about poverty and unemployment in the country.  Looking at the report I saw a lot of tables, so I thought why not ...

Read more »

How fat are your tails?

October 25, 2012

Lately I’ve been thinking about how to measure the fatness of the tails of a distribution. After some searching, I came across the Pareto Tail Index method. This seems to be used mostly in economics. It works by finding the decay rate of the tail. It’s complicated, both in formula and in its R implementation

Read more »

Congressional ideology by state

October 25, 2012

In a recent post, I illustrated how to add a background geom to your ggplot. While that code worked, and the plot looked fine, it was pointed out to me that I was missing an important aspect of plot layering with ggplot2. Namely, it is not, as I previ...

Read more »

R function: generate a panel data.table or data.frame to fill with data

October 25, 2012

I have started to work with R and Stata together. I like running regressions in Stata, but I do graphs and set up the dataset in R. R clearly has a strong comparative advantage here compared to Stata. I was writing a function that will give me a (balanced) panel structure in R. It then simply

Read more »

Rcpp modules more flexible

October 25, 2012

Rcpp modules just got more flexible (as of revision 3838 of Rcpp, to become 0.9.16 in the future). Modules have allowed exposing C++ classes for some time now, but developers had to declare custom wrap and as specializations if they wanted their classes to be used as the return type or argument type of a C++ function or method....

Read more »

Nonnegative Matrix Factorization and Recommender Systems

October 24, 2012

Albert Au Yeung provides a very nice tutorial on non-negative matrix factorization and an implementation in Python. This is based very loosely on his approach. Suppose we have the following matrix of users and ratings on movies: If we use the information above to form a matrix R, it can be decomposed into two matrices...

Read more »

Quick notes from Strata NYC 2012

October 24, 2012

The O'Reilly Strata conferences are always great fun to attend, and this latest installment in New York City is no exception. This one is super-busy though; the conference has been sold out for weeks -- and not just marketing-sold-out, it's fire-department-sold out. It's non-stop conversations and presentations, and it's tough to move through the hallways in between. Nonetheless, I...

Read more »

R for Ecologists: Permutation Analysis – t-tests

October 24, 2012

You’ve carefully designed your experiment, you’ve meticulously collected your data, and you have a hypothesis to test. Unfortunately, your data is typical of ecology data: small sample sizes, messy, and non-normal. Your ideal test, the t-test, won’t work because of the …

Read more »

Plotting the debate "Winner"

October 24, 2012

As a Political Scientist, it could not be more gauche to talk about the Presidential debate in terms of a winner and a loser, but the occasion provides the opportunity to show how to do (at least) three really useful things: Directly load price and v...

Read more »

Displaying Your Data in Google Earth Using R2G2

October 24, 2012

Have you ever wanted to easily visualize your ecology data in Google Earth? R2G2 is a new package for R, available via CRAN and formally described in this Molecular Ecology Resources article, which provides a user-friendly bridge between R and the Google Earth interface. Here, we will provide a brief introduction to...

Read more »

Stan for Bayesian Analysis

October 23, 2012

Bayesian analysis has been growing in popularity among ecologists recently, largely due to accessible books such as Models for Ecological Data: An Introduction, Introduction to WinBUGS for Ecologists, and Bayesian Methods for Ecology. Most ecologists with limited programming background have …

Read more »

RStudio training

October 23, 2012

At RStudio, we want you to be effective R users. As well as creating great software, we want to make it easier for you to master R. To this end, we’re very happy to announce our new training offerings. We’re kicking off with two public courses: Effective data visualisation and reports and reproducible research in

Read more »

Machine learning for hackers

October 23, 2012

Which way do you prefer to learn new material: deep theoretical background first and practice later, or do you like to break things in order to fix them? If the latter is your way of learning things, then most likely you will enjoy Machine Learning for Hackers. The book has chapters on machine learning

Read more »

Two Talks on Data Science, Big Data and R

October 23, 2012

On Thursday next week (November 1), I'll be giving a new webinar on the topic of Big Data, Data Science and R. Titled "The Rise of Data Science in the Age of Big Data Analytics: Why Data Distillation and Machine Learning Aren’t Enough", this is a provocative look at why data scientists cannot be replaced by technology, and why...

Read more »

Multiple levelplots with title and subtitle in R

October 23, 2012

I had quite a fight with R to put multiple levelplots with a shared title and subtitle on the same chart, so I thought I'd put some proof-of-concept code here:
library(lattice)
m col.l
plot.new()
par(mfrow=c(2,2), oma=c(2,0,2,0))
print(levelplot(m, col.regions=col.l, main="L1"), split=c(1, 1, 2, 2))
print(levelplot(m, col.regions=col.l, main="L2"), split=c(1, 2, 2, 2), newpage=FALSE)
print(levelplot(m, col.regions=col.l, main="L3"), split=c(2, 1, 2, 2), newpage=FALSE)
print(levelplot(m, col.regions=col.l, main="L4"), split=c(2,...

Read more »

Bayes for President!

October 23, 2012

I couldn't resist getting sucked into the hype associated with the US election and debates, and so I thought I'd have a little fun of my own and play around a bit with the numbers. [OK: you may disagree with the definition of "fun", but then again, i...

Read more »

analyze the general social survey (gss) with r

October 23, 2012

the general social survey (gss) has served as america's mood ring since 1972. data-driven social scientists can compare political beliefs by demography, look at attitude trends, make emile durkheim and max weber (pronounced durk-veber) proud. ...

Read more »

Benchmarking matrix creation

October 23, 2012

Sometimes it is useful to take a vector, or one column/row of a matrix, and build a new matrix of identical copies of that vector. There are lots of different ways to do this, but I just discovered a new, and very straightforward way to do this with m...

Read more »

The basics of Value at Risk and Expected Shortfall

October 23, 2012

Value at Risk and Expected Shortfall are common risk measures. Here is a quick explanation. Ingredients: the first two ingredients are each a number. The time horizon: how many days do we look ahead? The probability level: how far in the tail are we looking? Ingredient number 3 is a prediction distribution of …

Read more »

Presidential Debates 2012

October 23, 2012

I have been playing with the beta version of qdap, using the presidential debates as a data set. qdap is in a beta phase and lacks documentation, though I’m getting there. In previous blog posts (presidential debate 1 LINK and VP …

Read more »

It Takes 2 Lines of R Code to Discover Interesting Biology

October 23, 2012

The following biological phenomenon demonstrates just how elegant R code can be. In vertebrate genomes, a methyl group (-CH3) can be added to nucleotides. This process of methylation is commonly associated with gene suppression. Most of the cytosines in the …

Read more »
