(This article was first published on **Statistical Modeling, Causal Inference, and Social Science**, and kindly contributed to R-bloggers)

I was pleasantly surprised to have my recreational reading about baseball in the *New Yorker* interrupted by a digression on statistics. Sam Fuld of the Tampa Bay Rays was the subject of a Ben McGrath profile, titled Super Sam, in the 4 July 2011 issue. After quoting a minor-league trainer who described Fuld as “a bit of a geek” (who isn’t these days?), McGrath gets into that lovely *New Yorker* detail:

One could have pointed out the more persuasive and telling examples, such as the fact that in 2005, after his first pro season, with the Class-A Peoria Chiefs, Fuld applied for a fall internship with Stats, Inc., the research firm that supplies broadcasters with much of the data and analysis that you hear in sports telecasts.

After a description of what they had him doing (reviewing footage of games and cataloguing), Fuld is quoted as saying:

“I thought, They have a stat for everything, but they don’t have any stats regarding foul balls.”

**Fuld’s Conjecture**

Fuld went on to tell McGrath that “he’d explained that he’d conceived a study to test the received wisdom that good hitters are able to foul off difficult pitches deliberately. If this were true, he reasoned, there ought to be a measurable correlation between over-all batting success and the distribution of foul balls within counts. Skilled hitters, adept at protecting the plate, might tend to produce a greater proportion of foul balls late in counts than weaker hitters, whose fouls would skew earlier — evidence of poor contact.”

Turns out Fuld had taken Stats 50, the “math of sports” class at Stanford, taught by none other than the eminent information theorist Thomas Cover! Fuld is also on leave from the master’s program in statistics at Stanford (not that you’d know it from his enrollment in Stats 50). Alas, as McGrath relays:

Fuld never completed the degree (although he intends to), because the next spring his other dream seemed suddenly to be coming true, as he was promoted up the developmental ladder from AA to AAA and then, just as the fall academic calendar was beginning, to the Show [Major League Baseball].

**You Can Help**

While Stats, Inc. may not care about foul balls, the great baseball data site Retrosheet does (just follow the link).

In fact, Dan Fox of Baseball Analysts presented a zero-order analysis using the Retrosheet data several years ago, though not of the exact question Fuld was asking.

**Hierarchical Models Away**

Now this is going to be a great problem for hierarchical modeling because the data in each count cell is going to be sparse. With 500 at-bats in a year, how many instances do we get of all of the possible pitch counts (0-0, 0-1, 0-2, 1-0, 1-1, 1-2, 2-0, 2-1, 2-2, 3-0, 3-1, 3-2)?
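To make the sparsity point concrete, here is a minimal sketch of partial pooling in Python. It is not a full hierarchical model: it shrinks each batter's observed foul rate in a given count toward the pooled rate for that count, using a fixed pseudo-count. The function name `shrunken_foul_rates`, the shape of the `tallies` input, and the default `prior_strength` are all assumptions made for illustration; a real hierarchical model would estimate the amount of pooling from the data.

```python
from collections import defaultdict


def shrunken_foul_rates(tallies, prior_strength=20.0):
    """Shrink per-batter foul rates toward the pooled rate for each count.

    tallies maps (batter, count) -> (fouls, pitches), where count is a
    string such as "3-2". prior_strength is an assumed pseudo-count of
    pitches; larger values pull sparse cells harder toward the pooled rate.
    """
    # Pool fouls and pitches over all batters, separately for each count.
    pooled = defaultdict(lambda: [0, 0])
    for (_batter, count), (fouls, pitches) in tallies.items():
        pooled[count][0] += fouls
        pooled[count][1] += pitches

    estimates = {}
    for (batter, count), (fouls, pitches) in tallies.items():
        pool_fouls, pool_pitches = pooled[count]
        pool_rate = pool_fouls / pool_pitches if pool_pitches else 0.0
        # Beta-binomial style posterior mean with the prior mean fixed
        # at the pooled rate and prior_strength pseudo-pitches behind it.
        estimates[(batter, count)] = (
            (fouls + prior_strength * pool_rate) / (pitches + prior_strength)
        )
    return estimates
```

A batter with only a handful of pitches in a count ends up close to the pooled rate, while a batter with many pitches in that count stays close to his own observed rate, which is the qualitative behavior a hierarchical model would give.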

**But, Please, Start with a Scatterplot**

I’d be happy to see a set of scatter plots, one for each count, with a simple hitting stat on the *x* axis (like on-base percentage) and the observed percentage of foul balls on the *y* axis.

I’d do it myself, but Andrew has me chained to the C++ compiler working on Stan [just kidding, of course — we share priorities here]. But maybe if I need more time to procrastinate than this blog entry afforded, I might do it myself. The only tricky part would be writing the Python (or whatever) data munger to get the counts out of their character-sequence based pitch encoding.
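For anyone who wants to take a crack at it, the data munger might start out something like the sketch below. It walks a single plate appearance's pitch string and tallies, for each balls-strikes count, how many pitches were seen and how many were fouled off. The letter codes are my recollection of the Retrosheet pitch conventions and should be treated as assumptions to check against their documentation, as should the choice of which codes count as a "foul" for this purpose. The name `tally_pitches` is made up for this example.

```python
from collections import Counter

# Assumed pitch codes (verify against the Retrosheet documentation).
BALL_CODES = set("BIPV")       # balls, intentional balls, pitchouts
STRIKE_CODES = set("CSKMQTO")  # called/swinging strikes, foul tips, etc.
FOUL_CODES = set("FLR")        # fouls, foul bunts, fouls on pitchouts
ENDING_CODES = set("XYH")      # ball in play or hit batter


def tally_pitches(pitch_seq):
    """Return (seen, fouls): Counters keyed by "balls-strikes" count.

    seen[count] is the number of pitches thrown in that count and
    fouls[count] is how many of them were fouled off.
    """
    balls = strikes = 0
    seen, fouls = Counter(), Counter()
    for code in pitch_seq:
        known = (code in BALL_CODES or code in STRIKE_CODES
                 or code in FOUL_CODES or code in ENDING_CODES)
        if not known:
            # Pickoff throws, runner markers, and unknown pitches are
            # skipped; unknown pitches make later counts unreliable.
            continue
        count = f"{balls}-{strikes}"
        seen[count] += 1
        if code in FOUL_CODES:
            fouls[count] += 1
            # An ordinary foul cannot be strike three; a foul bunt can.
            if strikes < 2 or code == "L":
                strikes += 1
        elif code in STRIKE_CODES:
            strikes += 1
        elif code in BALL_CODES:
            balls += 1
        if balls >= 4 or strikes >= 3 or code in ENDING_CODES:
            break
    return seen, fouls
```

Summing the two counters over all of a batter's plate appearances gives the per-count foul percentage wanted for the *y* axis of the scatterplots.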
