Blog Archives

Some Comments on Donoho’s “50 Years of Data Science”

January 23, 2016

An old friend recently called my attention to a thoughtful essay by Stanford statistics professor David Donoho, titled “50 Years of Data Science.” Given the keen interest these days in data science, the essay is quite timely. The work clearly shows that Donoho is not only a grandmaster theoretician, but also a statistical philosopher. The … Continue reading...


The Generalized Method of Moments and the gmm package

December 20, 2015

An almost-as-famous alternative to the famous Maximum Likelihood Estimation is the Method of Moments (MM). MM has always been a favorite of mine because it often requires fewer distributional assumptions than MLE, and also because MM is much easier to explain than MLE to students and consulting clients. CRAN has a package, gmm, that does MM, … Continue reading...
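
To give a taste of what this looks like in practice, here is a minimal sketch, assuming the gmm package's interface for user-supplied moment conditions (check the package documentation for the exact signature): estimating the mean and standard deviation of a normal sample by matching the first two moments.

library(gmm)
set.seed(1)
x <- rnorm(500, mean = 4, sd = 2)
# moment conditions for theta = (mu, sigma):
# E[X - mu] = 0 and E[(X - mu)^2 - sigma^2] = 0
g <- function(theta, x)
  cbind(x - theta[1], (x - theta[1])^2 - theta[2]^2)
# with exactly as many moment conditions as parameters, GMM reduces to plain MM
fit <- gmm(g, x, t0 = c(mu = mean(x), sigma = sd(x)))
summary(fit)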


The Method of Boosting

December 8, 2015

One of the techniques that has caused the most excitement in the machine learning community is boosting, which in essence is a process of iteratively refining estimated regression and classification functions, e.g. by reweighting the data, in order to improve predictive ability (though it has primarily been applied to classification). Much has been made of … Continue reading...
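
To make the reweighting idea concrete, here is a minimal AdaBoost-style sketch in base R, with decision stumps on a single predictor as the weak learner; the function names are my own illustration, not any particular package's API.

# y must be coded -1/+1; x is a single numeric predictor
adaboost_stumps <- function(x, y, n_rounds = 25) {
  n <- length(y)
  w <- rep(1 / n, n)                       # start with uniform weights
  stumps <- vector("list", n_rounds)
  alphas <- numeric(n_rounds)
  for (m in 1:n_rounds) {
    best <- NULL; best_err <- Inf
    # find the stump (threshold, direction) with smallest weighted error
    for (thr in unique(x)) for (dir in c(1, -1)) {
      pred <- ifelse(dir * (x - thr) > 0, 1, -1)
      err <- sum(w * (pred != y))
      if (err < best_err) { best_err <- err; best <- list(thr = thr, dir = dir) }
    }
    best_err <- min(max(best_err, 1e-12), 1 - 1e-12)
    alphas[m] <- 0.5 * log((1 - best_err) / best_err)
    pred <- ifelse(best$dir * (x - best$thr) > 0, 1, -1)
    w <- w * exp(-alphas[m] * y * pred)    # upweight the misclassified points
    w <- w / sum(w)
    stumps[[m]] <- best
  }
  list(stumps = stumps, alphas = alphas)
}

# toy usage: the alpha-weighted vote of the stumps is the final classifier
set.seed(2)
x <- rnorm(200); y <- ifelse(x + rnorm(200, sd = 0.5) > 0, 1, -1)
fit <- adaboost_stumps(x, y)
votes <- rowSums(mapply(function(s, a)
  a * ifelse(s$dir * (x - s$thr) > 0, 1, -1), fit$stumps, fit$alphas))
mean(sign(votes) == y)                     # training accuracy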


OVA vs. AVA in Classification Problems, via regtools

December 2, 2015

OVA and AVA? Huh? These stand for One vs. All and All vs. All, in classification problems with more than 2 classes. To illustrate the idea, I’ll use the UCI Vertebral Column data and Letter Recognition Data, and analyze them using my regtools package. As some of you know, I’m developing the latter in conjunction with … Continue reading...
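
For readers new to the terms: OVA fits one binary model per class, "this class vs. everything else," and predicts via the largest estimated probability, while AVA fits one model per pair of classes and lets the pairwise winners vote. Here is a generic OVA sketch using glm; it illustrates the idea only, and is not the regtools interface.

# generic One vs. All with logistic regression (not the regtools API)
ova_train <- function(x, y) {
  # x: data frame of predictors; y: factor with more than 2 levels
  lapply(levels(y), function(cls)
    glm((y == cls) ~ ., data = x, family = binomial))
}
ova_predict <- function(fits, newx, classes) {
  probs <- sapply(fits, predict, newdata = newx, type = "response")
  classes[max.col(probs)]   # class with the highest "me vs. rest" probability
}

# toy usage on iris, which has 3 classes
fits <- ova_train(iris[ , 1:4], iris$Species)
table(predicted = ova_predict(fits, iris[ , 1:4], levels(iris$Species)),
      actual = iris$Species)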


Back to the BLAS Issue

November 21, 2015

A few days ago, I wrote here about how some researchers, such as Art Owen and Katelyn Gao at Stanford and Patrick Perry at NYU, have been using an old, old statistical technique — random effects models — for a new, new application — recommender systems. In addition to describing their approach to that problem, I … Continue reading...
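
For context, the random effects view of ratings data can be sketched as rating = overall mean + random user effect + random item effect + noise. Below is a minimal fit of that model with lme4 on simulated data; this generic fitter is my own illustration of the model, not the scalable method in the Gao-Owen work.

# a hedged sketch, assuming the lme4 package
library(lme4)
set.seed(3)
n <- 2000
user <- factor(sample(1:100, n, replace = TRUE))
item <- factor(sample(1:50, n, replace = TRUE))
u_eff <- rnorm(100, sd = 0.7)   # latent per-user effects
i_eff <- rnorm(50, sd = 0.4)    # latent per-item effects
rating <- 3 + u_eff[as.integer(user)] + i_eff[as.integer(item)] +
  rnorm(n, sd = 0.5)
d <- data.frame(user, item, rating)
fit <- lmer(rating ~ (1 | user) + (1 | item), data = d)
VarCorr(fit)   # should roughly recover the user/item variance components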


Partools, Recommender Systems and More

November 15, 2015

Recently I attended a talk by Stanford’s Art Owen, presenting work done with his student, Katelyn Gao. This talk touched on a number of my interests, both mathematical and computational. What particularly struck me was that Art and Katelyn are applying a very old — many would say very boring — method to a very … Continue reading...


A New Method for Statistical Disclosure Limitation, I

October 15, 2015

The Statistical Disclosure Limitation (SDL) problem involves modifying a data set in such a manner that statistical analyses on the modified data give results reasonably close to those on the original data, while preserving the privacy of individuals in the data set. For instance, we might have a medical data set on which we want … Continue reading...
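
As background, the oldest SDL technique (a classic baseline, not the new method this post introduces) is noise addition: perturb the numeric values enough to mask individual records while leaving aggregate analyses roughly intact. A minimal sketch:

# add mean-0 noise, scaled to each column's SD, to all numeric columns
add_noise <- function(dat, frac = 0.1) {
  num <- sapply(dat, is.numeric)
  dat[num] <- lapply(dat[num], function(col)
    col + rnorm(length(col), sd = frac * sd(col)))
  dat
}

# quick check: regression coefficients barely move on a toy data set
set.seed(4)
d <- data.frame(x = rnorm(1000))
d$y <- 2 * d$x + rnorm(1000)
coef(lm(y ~ x, data = d))
coef(lm(y ~ x, data = add_noise(d)))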


Unbalanced Data Is a Problem? No, BALANCED Data Is Worse

September 29, 2015

Say we are doing classification analysis with classes labeled 0 through m-1. Let Ni be the number of observations in class i. There is much handwringing in the machine learning literature over situations in which there is a wide variation among the Ni. I will argue here, though, that the problem is much worse in … Continue reading...
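
The core issue shows up in a small simulation (my illustration of the setup, not code from the post): if class 1 is naturally rare but we train on an artificially balanced sample, the model's estimated conditional class probabilities become badly inflated.

set.seed(5)
n <- 10000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-2 + x))   # class 1 is rare
full <- data.frame(x, y)
# balanced subsample: equal numbers of observations from each class
n1 <- sum(full$y == 1)
bal <- rbind(full[full$y == 1, ],
             full[sample(which(full$y == 0), n1), ])
newpt <- data.frame(x = 0)
# true P(Y = 1 | X = 0) is plogis(-2), about 0.12
predict(glm(y ~ x, binomial, data = full), newpt, type = "response")
predict(glm(y ~ x, binomial, data = bal), newpt, type = "response")  # near 0.5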


More on the Heteroscedasticity Issue

September 22, 2015

In my last post, I discussed R software, including mine, that handles heteroscedastic settings for linear and nonlinear regression models. Several readers had interesting comments and questions, which I will address here. To review: Though most books and software assume homoscedasticity, i.e. constancy of the variance of the response variable at all levels of the … Continue reading...
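
For readers who want something concrete, one widely used approach in R is heteroscedasticity-robust ("sandwich") standard errors via the sandwich and lmtest packages; I present it only as a common example, not necessarily the software discussed in the post.

library(sandwich)
library(lmtest)
set.seed(6)
x <- runif(200, 1, 10)
y <- 2 + 3 * x + rnorm(200, sd = x)   # error variance grows with x
fit <- lm(y ~ x)
coeftest(fit)                                     # classical SEs, which assume homoscedasticity
coeftest(fit, vcov = vcovHC(fit, type = "HC3"))   # robust sandwich SEs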


Can You Say “Heteroscedasticity” 3 Times Fast?

September 18, 2015

Most books on regression analysis assume homoscedasticity, the situation in which Var(Y | X = t), for a response variable Y and vector of predictor variables X, is the same for all t. Yet, needless to say, almost all data in real life is heteroscedastic. For Y = human weight and X = height, say, … Continue reading...
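
A tiny simulation of the weight/height example (the numbers are made up) shows what this means operationally: the conditional spread of Y grows with t.

set.seed(7)
height <- runif(500, 60, 75)   # inches
# conditional SD of weight grows with height: heteroscedasticity
weight <- -200 + 5 * height + rnorm(500, sd = 2 * (height - 55))
tapply(weight, cut(height, c(60, 65, 70, 75)), sd)   # spread rises across groups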

