Blog Archives

A pathological glm() problem that doesn’t issue a warning

May 1, 2013
By
A pathological glm() problem that doesn’t issue a warning

I know I have already written a lot about technicalities in logistic regression (see for example: How robust is logistic regression? and Newton-Raphson can compute an average). But I just ran into a simple case where R‘s glm() implementation of logistic regression seems to fail without issuing a warning message. Yes the data is a Related posts:

Read more »

Prefer = for assignment in R

April 23, 2013
By
Prefer = for assignment in R

We share our opinion that = should be preferred to the more standard <- for assignment in R. This is from a draft of the appendix of our upcoming book. This has the risk of becoming an R version of Javascript’s semicolon controversy, but here you have it. R has five common assignment operators: “=“, Related posts:

Read more »

Worry about correctness and repeatability, not p-values

April 5, 2013
By
Worry about correctness and repeatability, not p-values

In data science work you often run into cryptic sentences like the following: Age adjusted death rates per 10,000 person years across incremental thirds of muscular strength were 38.9, 25.9, and 26.6 for all causes; 12.1, 7.6, and 6.6 for cardiovascular disease; and 6.1, 4.9, and 4.2 for cancer (all P < 0.01 for linear Related posts:

Read more »

A bit more on sample size

March 8, 2013
By
A bit more on sample size

In our article What is a large enough random sample? we pointed out that if you wanted to measure a proportion to an accuracy “a” with chance of being wrong of “d” then a idea was to guarantee you had a sample size of at least: This is the central question in designing opinion polls Related posts:

Read more »

Don’t use correlation to track prediction performance

February 22, 2013
By
Don’t use correlation to track prediction performance

Using correlation to track model performance is “a mistake that nobody would ever make” combined with a vague “what would be wrong if I did do that” feeling. I hope after reading this feel a least a small urge to double check your work and presentations to make sure you have not reported correlation where Related posts:

Read more »

Please stop using Excel-like formats to exchange data

December 7, 2012
By
Please stop using Excel-like formats to exchange data

I know “officially” data scientists all always work in “big data” environments with data in a remote database, streaming store or key-value system. But in day to day work Excel files and Excel export files get used a lot and cause a disproportionate amount of pain. I would like to make a plea to my Related posts:

Read more »

Level fit summaries can be tricky in R

October 1, 2012
By
Level fit summaries can be tricky in R

Model level fit summaries can be tricky in R. A quick read of model fit summary data for factor levels can be misleading. We describe the issue and demonstrate techniques for dealing with them.When modeling you often encounter what are commonly called categorical variables, which are called factors in R. Possible values of categorical variables Related posts:

Read more »

Newton-Raphson can compute an average

August 28, 2012
By
Newton-Raphson can compute an average

In our article How robust is logistic regression? we pointed out some basic yet deep limitations of the traditional full-step Newton-Raphson or Iteratively Reweighted Least Squares methods of solving logistic regression problems (such as in R‘s standard glm() implementation). In fact in the comments we exhibit a well posed data fitting problem that can not Related posts:

Read more »

How robust is logistic regression?

August 23, 2012
By
How robust is logistic regression?

Logistic Regression is a popular and effective technique for modeling categorical outcomes as a function of both continuous and categorical variables. The question is: how robust is it? Or: how robust are the common implementations? (note: we are using robust in a more standard English sense of performs well for all inputs, not in the Related posts:

Read more »

What does a generalized linear model do?

August 15, 2012
By
What does a generalized linear model do?

What does a generalized linear model do? R supplies a modeling function called glm() that fits generalized linear models (abbreviated as GLMs). A natural question is what does it do and what problem is it solving for you? We work some examples and place generalized linear models in context with other techniques.For predicting a categorical Related posts:

Read more »