Statistics: Losing Ground to CS, Losing Image Among Students

[This article was first published on Mad (Data) Scientist, and kindly contributed to R-bloggers.]

The American Statistical Association (ASA) leadership, and many in Statistics academia, have been undergoing a period of angst the last few years. They worry that the field of Statistics is headed for a future of reduced national influence and importance, with the feeling that the field is losing ground to other disciplines, notably Computer Science (CS), and losing its image among students.

I had been aware of these issues for quite a while, and thus was pleasantly surprised last year to see then-ASA president Marie Davidian write a plaintive editorial titled, “Aren’t We Data Science?”

Good, the ASA is taking action, I thought.  But even then I was startled to learn during JSM 2014 (a conference tellingly titled “Statistics:  Global Impact, Past, Present and Future”) that the ASA leadership is so concerned about these problems that it has now retained a PR firm.

This is probably a wise move–most large institutions engage in extensive PR in one way or another–but it is a sad statement about how complacent the profession has become.  Indeed, it can be argued that the action is long overdue; as a friend of mine put it, “They [the statistical profession] lost the PR war because they never fought it.”

In this post, I’ll tell you the rest of the story, as I see it, viewing events as a statistician, computer scientist and R enthusiast.

CS vs. Statistics

Let’s consider the CS issue first.  Recently a number of new terms have arisen, such as data science, Big Data, and analytics, and the popularity of the term machine learning has grown rapidly.  To many of us, though, this is just  “old wine in new bottles,” with the “wine” being Statistics.  But the new “bottles” are disciplines outside of Statistics–especially CS.

I have a foot in both the Statistics and CS camps.  I’ve spent most of my career in the Computer Science Dept. at the University of California, Davis, but I began my career in Statistics at that institution.  My mathematics doctoral thesis at UCLA was in probability theory, and my first years on the faculty at Davis focused on statistical methodology.  I was one of the seven charter members of the Department of Statistics.   Though my departmental affiliation later changed to CS, I never left Statistics as a field, and most of my research in Computer Science has been statistical in nature.  With such “dual loyalties,” I’ll refer to people in both professions via third-person pronouns, not first, and I will be critical of both groups.  (A friend who read a draft of this post joked it should be titled “J’accuse”  but of course this is not my intention.)   However, in keeping with the theme of the ASA’s recent actions, my essay will be Stat-centric:  What is poor Statistics to do?

Well then, how did CS come to annex the Stat field?  The primary cause, I believe, came from the CS subfield of Artificial Intelligence (AI).  Though there always had been some probabilistic analysis in AI, in recent years the interest has been almost exclusively in predictive analysis–a core area of Statistics.

That switch in AI was due largely to the emergence of Big Data.  No one really knows what the term means, but people “know it when they see it,” and they see it quite often these days.  Typical data sets range from large to huge to astronomical (sometimes literally the latter, as cosmology is one of the application fields), necessitating that one pay close attention to the computational aspects.  Hence the term data science, combining quantitative methods with speedy computation, and hence another reason for CS to become involved.

Involvement is one thing, but usurpation is another.  Though not a deliberate action by any means, CS is eclipsing Stat in many of Stat’s central areas.  This is dramatically demonstrated by statements like “With machine learning methods, you don’t need statistics”–a punch in the gut for statisticians who realize that machine learning really IS statistics.  ML goes into great detail in certain aspects, e.g. text mining, but in essence it consists of parametric and nonparametric curve estimation methods from Statistics, such as logistic regression, LASSO, nearest-neighbor classification, random forests, the EM algorithm and so on.

Though the Stat leaders seem to regard all this as something of an existential threat to the well-being of their profession, I view it as much worse than that.  The problem is not that CS people are doing Statistics, but rather that they are doing it poorly:  Generally the quality of CS work in Stat is weak.  It is not a problem of quality of the researchers themselves; indeed, many of them are very highly talented.  Instead, there are a number of systemic reasons for this, structural problems with the CS research “business model”:

All this matters–a LOT.  In my opinion, the above factors result in highly lamentable opportunity costs.   Clearly, I’m not saying that people in CS should stay out of Stat research.  But the sad truth is that the usurpation process is causing precious resources–research funding, faculty slots, the best potential grad students, attention from government policymakers, even attention from the press–to go quite disproportionately to CS, even though Statistics is arguably better equipped to make use of them.   This is not a CS vs. Stat issue; Statistics is important to the nation and to the world, and if scarce resources aren’t being used well, it’s everyone’s loss.

Making Statistics Attractive to Students

This of course is an age-old problem in Stat.  Let’s face it–the very word statistics sounds hopelessly dull.  But I would argue that a more modern development is making the problem a lot worse–the Advanced Placement (AP) Statistics courses in high schools.

Professor Xiao-Li Meng has written extensively about the destructive nature of AP Stat.  He observed, “Among Harvard undergraduates I asked, the most frequent reason for not considering a statistical major was a ‘turn-off’ experience in an AP statistics course.”  That says it all, doesn’t it?  And though Meng’s views predictably sparked defensive replies in some quarters, I’ve had exactly the same experiences as Meng in my own interactions with students.  No wonder students would rather major in a field like CS and study machine learning–without realizing it is Statistics.  It is especially troubling that Statistics may be losing the “best and brightest” students.

One of the major problems is that AP Stat is usually taught by people who lack depth in the subject matter.  A typical example is that a student complained to me that even though he had attended a top-quality high school in the heart of Silicon Valley, his AP Stat teacher could not answer his question as to why it is customary to use n-1 rather than n in the denominator of s².  But even that lapse is really minor, compared to the lack among the AP teachers of the broad overview typically possessed by Stat professors teaching university courses, in terms of what can be done with Stat, what the philosophy is, what the concepts really mean and so on.  AP courses are ostensibly college level, but the students are not getting college-level instruction.  The “teach to the test” syndrome that pervades AP courses in general exacerbates this problem.
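
For readers curious about the student's question, here is a sketch of one standard answer. It is an illustration only, under assumptions stated here rather than in the post: X_1, ..., X_n are independent draws from a population with mean μ and variance σ², and X̄ denotes their sample mean.

```latex
\begin{align*}
\sum_{i=1}^{n} (X_i - \bar{X})^2
  &= \sum_{i=1}^{n} (X_i - \mu)^2 - n(\bar{X} - \mu)^2, \\
E\left[\sum_{i=1}^{n} (X_i - \bar{X})^2\right]
  &= n\sigma^2 - n \cdot \frac{\sigma^2}{n} = (n-1)\sigma^2, \\
E\left[\frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2\right]
  &= \sigma^2 .
\end{align*}
```

Under these assumptions, dividing by n would give an estimator whose expected value is (n-1)σ²/n, which falls short of σ² on average; dividing by n-1 removes that bias.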

The most exasperating part of all this is that AP Stat officially relies on TI-83 pocket calculators as its computational vehicle.  The machines are expensive, and after all we are living in an age in which R is free!  Moreover, the calculators cannot match the dazzling graphics and the analysis of nontrivial data sets that R provides–exactly the kinds of things that motivate young people.

So, unlike the “CS usurpation problem,” whose solution is unclear, here is something that actually  can be fixed reasonably simply.  If I had my druthers, I would simply ban AP Stat, and actually, I am one of those people who would do away with the entire AP program.   Obviously, there are too many deeply entrenched interests for this to happen, but one thing that can be done for AP Stat is to switch its computational vehicle to R.

As noted, R is free and multiplatform, with outstanding graphical capabilities.  There is no end to the number of data sets teenagers would find attractive for R use, say the Million Song Dataset.

As to a textbook, there are many introductions to Statistics that use R, such as Michael Crawley’s Statistics: An Introduction Using R, and Peter Dalgaard’s Introductory Statistics with R.  But to really do it right, I would suggest that a group of Stat professors collaboratively write an open-source text, as has been done for instance for Chemistry.  Examples of interest to high schoolers should be used, say this engaging analysis on OkCupid.

This is not a complete solution by any means.  There still is the issue of AP Stat being taught by people who lack depth in the field, and so on.  And even switching to R would meet with resistance from various interests, such as the College Board and especially the AP Stat teachers themselves.

But given all these weighty problems, it certainly would be nice to do something, right?  Switching to R would be doable–and should be done.

