Philadelphia loses a TIGER — Bad omen or statistical inevitability?

[This article was first published on Normally Skewed, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Lions, TIGERs, US DOT Funding, oh my.
For the first time in seven years, Philly has lost a TIGER. WHAT?! Not to worry, there are no escapees from the Philadelphia Zoo. No, TIGER has nothing to do with the animal, or the baseball team from Detroit, or the US Census Bureau’s geographic shapefiles. Today we’ll be talking about Transportation Investment Generating Economic Recovery (TIGER). As WHYY’s Newsworks reports, Philadelphia’s winning streak ended this year, coinciding with the recent change in mayoral administrations.


Quick Background on TIGER Grants
TIGER grants are distributed by the US Department of Transportation every year to US cities and municipalities to make — you’ve probably guessed it! — transportation and infrastructural improvements. As the name might imply, the grant was started in response to the Great Recession of ‘07-’09, as a way to generate jobs (hence, there is a stated preference for “shovel-ready” projects) as well as reduce US cities’ energy dependence. For much more detail and history on the grant, see the grant’s website and Wikipedia. The grants are quite competitive. As Newsworks.org reports, about $9.3 billion in applications are submitted, with half a billion dollars ultimately distributed.


Data!
Well, thanks to the open data revolution, the US DOT has kindly published a dataset of all winning projects since the grant’s inception. I’ve taken the dataset and run a quick analysis to see what these grants look like, and where Philadelphia stacks up against other localities. For those in the analytic weeds, I’m using the R programming language for this analysis and will post in some of the relevant code throughout the post. If you’re not interested, skip over the gray code boxes. My full code is in this github repo.

First, let’s look at where Philly stacks up in terms of its award amounts. The graph below shows the distribution of award amounts, and where Philadelphia specifically stands.

## Load in required packages
mods=c('data.table','raster','<span class="mceItemHidden" data-mce-bogus="1"><span class="hiddenSpellError" pre="" data-mce-bogus="1">rgeos</span></span>','<span class="hiddenSpellError" pre="" data-mce-bogus="1">ggplot2</span>','<span class="hiddenSpellError" pre="" data-mce-bogus="1">ggthemes</span>','<span class="hiddenSpellError" pre="" data-mce-bogus="1">ggmap</span>','sp','leaflet','<span class="hiddenSpellError" pre="" data-mce-bogus="1">maptools</span>')
invisible(<span class="hiddenSpellError" pre="" data-mce-bogus="1">lapply</span>(mods, function(x) require(x,character.only=TRUE,quietly = TRUE)))

## Load in TIGER grant dataset
tg=fread('https://www.transportation.gov/sites/dot.gov/files/docs/tiger_allUpdate.csv')

## note to self, tweet <span class="hiddenSpellError" pre="tweet " data-mce-bogus="1">USDOT</span> & tell them not to put spaces in column headings...
<span class="hiddenSpellError" pre="" data-mce-bogus="1">setnames</span>(tg, tg[,<span class="hiddenSpellError" pre="" data-mce-bogus="1">gsub</span>('[^[:<span class="hiddenSpellError" pre="" data-mce-bogus="1">alnum</span>:]]','',names(tg))])

## Create a column for year of award
tg[,Year := as.integer(gsub("[^0-9]","",Round))]

#<span style="line-height: 1.7;" data-mce-style="line-height: 1.7;"># change award amount to numeric
</span>tg[,Amount:=as.numeric(gsub("[\\$,]","",Amount))]

# And a column to show whether we're us or not
tg[,Phila:=<span class="hiddenSpellError" pre="" data-mce-bogus="1">ifelse</span>(<span class="hiddenSpellError" pre="" data-mce-bogus="1">grepl</span>("*[Pp]hila*|Southeastern Pennsylvania Transportation Authority|*SEPTA*|Center City District",Applicant)==TRUE,1,0)]

# Create indicator for Philadelphia
tg_p=tg[Phila==1,]

## Plot the data
<span class="hiddenSpellError" pre="" data-mce-bogus="1">ggplot</span>(data=tg, aes(x=as.character(Year),y=Amount/<span class="hiddenSpellError" pre="" data-mce-bogus="1">1e6</span>,fill=as.character(Year))) +
  geom_boxplot() + geom_point(data=tg_p, aes(x=as.character(Year), y=Amount/1e6), size=3,color="darkred")+
geom_text(data=tg_p,aes(x=as.character(Year),y=Amount/1e6,label=paste0('$',round(Amount/1e6,1),'M')),vjust=-1,<span class="hiddenSpellError" pre="" data-mce-bogus="1">fontface</span>=('bold.italic'))+
scale_y_continuous(labels = scales::dollar_format(prefix="$",suffix="M"))+
labs(x="Funding Year", y="Distribution of Amounts in Millions", title="Philly's award amounts vs. all other applicants")+
guides(fill=FALSE)

These boxplots show the range of amounts that applicants won each year. Philadelphia’s wins are labeled with a red dot and its award amount.

Since 2010, Philly has earned on average a cool $10.1M per funding round.

By the way, here are Philly’s project names and the project types they fell under:

amt=tg[Phila==1,list(Year,<span class="mceItemHidden" data-mce-bogus="1"><span class="hiddenSpellError" pre="Year " data-mce-bogus="1">ProjectName</span></span>,<span class="hiddenSpellError" pre="ProjectName " data-mce-bogus="1">ProjectType</span>,Amount=paste0('$',Amount/<span class="hiddenSpellError" pre="" data-mce-bogus="1">1e6</span>,'M'))]
amt[order(Year),]

Year | ProjectName | ProjectType | Amount
2009 | Philadelphia Area Pedestrian and Bicycle Network (PA and NJ)          
                   | Bicycle and Pedestrian | $23M
2010 | Dilworth Plaza and Concourse Improvements | Transit | $15M
2011 | IMPaCT Philadelphia | Transit | $10M
2012 | Wayne Junction Substation Replacement | Transit | $12.9M
2013 | SEPTA-CSX Separation Project | Rail | $10M
2014 | Roosevelt Boulevard Multimodal Study | Regional Planning | $2.5M
2015 | Closing the Gaps | Road | $10.3M

And the total number of awards distributed each year:

# count number of awards by year
yr = tg[,.N,Year]

# plot
<span class="mceItemHidden" data-mce-bogus="1"><span class="hiddenSpellError" pre="" data-mce-bogus="1">ggplot</span></span>(data=yr,aes(x=as.character(Year),y=N)) +
 geom_bar(stat=’identity’,fill=’gray40') +
 geom_text(aes(y=N,label=N),vjust=1.5,color=’white’,fontface=’bold’)+
 labs(x=”Funding Year”
 ,y=”Number of Applications Awarded”
 ,title=”Number of successful applications by funding year”)

Okay, Philadelphia has been consistent, but what does it mean that the new Office of Transport and Infrastructure (OTIS) didn’t win an award? It’s a let-down, but does it really reflect poorly on the new mayoral administration? As the article states, although the high-level leadership changed, most of the grant-writing staff probably stuck around. To answer this question, it’s worth looking at how (a)typical it is to win 7 times.

This was a little bit tricky from a data perspective. Applicants can be either municipalities (e.g. the City of Philadelphia), agencies (e.g. SEPTA, PENNDOT), or states (e.g. — yeah…). So identifying the “unique” winners, is kind of confusing. And as the Newsworks article mentions, Philadelphia won in a way, through SEPTA’s relationship with the Delaware Transit Corporation (DTC, which won an award, reimburses SEPTA for certain operating costs). To get a clean and consistent dataset, I overlayed the latitude/longitude points of each award site on a county-level map of the US, which brings the analysis to the county level. The next graph assumes that if a project site landed in a particular county, the award is attributed to or otherwise purposed for that county. The approximation should be close enough for, you know, government work…

## Download the Census county file
<span class="mceItemHidden" data-mce-bogus="1"><span></span>download.file('ftp://ftp2.census.gov/geo/tiger/<span class="mceItemHidden" data-mce-bogus="1"><span class="hiddenSpellError" pre="" data-mce-bogus="1">TIGER2016</span></span>/COUNTY/tl_2016_us_county.zip','cty_shp.zip')</span>

# Create a folder

shp <- readOGR('C:/Users/danieladdyson/Desktop/cty_shp',layer='tl_2016_us_county') # Census county shapefile

# turn TIGER lat/long data to spatial points
crs = shp@proj4string@projargs
pts=SpatialPointsDataFrame(coords=tg[,list(Longitude,Latitude)], data=tg, proj4string=CRS(crs) )

# overlay grant sites on map of US counties
sj = data.table(over(pts,shp))
sj=cbind(tg,sj)
sj[,NAME:=as.character(NAME)]
cty_wins = as.data.table(sj[,table(GEOID)])
wins_summary = cty_wins[,list(freq=.N),by=list(n_events=N)]
wins_summary[,rate:=n_events*freq] # marginal probabilities

# rate of wins per county
cty_wins = as.data.table(sj[,table(GEOID)]) # Number of wins by county -- the counties may be slightly off due to map overlay errors
# cty_wins[!duplicated(GEOID),.N] 269 unique counties
wins_summary = cty_wins[,list(freq=.N),by=list(n_events=N)] # Number of counties (269) X number of wins (381)
wins_summary[,rate:=n_events*freq] # marginal probabilities

# mean number of wins
mu_wins = wins_summary[,sum(rate)/sum(freq)]

# standard deviation of mean number of wins
sd_wins = sqrt(mu_wins)
sd_rng = c(mu_wins+sd_wins, # 1 standard deviation upper
mu_wins+sd_wins*2, # 2 standard deviation upper
mu_wins+sd_wins*3 # 3 standard deviation upper)
txt=data.frame(lab=c('Average\n# wins','1 Std Dev','2 Std Dev','3 Std Dev'),
  x=c(round(mu_wins,1),round(sd_rng,1)), y=rep(100) )

## Exact confidence intervals
ex_ci=poisson.test(sum(wins_summary$rate),sum(wins_summary$freq))$conf.int[c(1,2)]

## plot out the counts + SDs & exact CIs
ci_pal = brewer.pal(3,'Set1')
sd_pal = brewer.pal(6,'Reds')
hi_cty = data.frame(cty = c('King, WA','Philadelphia, PA','Los Angeles, CA','Cook, IL'),
                    X=c(6,7,7,8),Y=c(10,22,10,10))
ggplot(data=wins_summary,aes(x=n_events,y=freq))+
geom_bar(stat='identity',color='grey58')+
scale_x_continuous(breaks = c(1:8), labels = c(1:8))+
geom_vline(xintercept = mu_wins,color=ci_pal[2],size=1)+
geom_vline(xintercept = ex_ci,color=ci_pal[3],linetype='longdash',size=.7)+
geom_vline(xintercept = sd_rng,color=sd_pal[c(4:6)],size=1)+
geom_text(data=txt,aes(label=paste0(lab,':\n',x),x=x,y=y,hjust=-.001),size=3.5)+
geom_text(data=hi_cty,aes(x=X,y=Y,label=cty),size=3,angle=35,hjust=-.01)+
labs(x='Number of grants won',y='Number of counties',title= "Statistical deviation of grant wins")

This graph shows that Philadelphia’s wins are highly statistically significant (the green dashed lines are 95% confidence intervals; red lines are standard deviations from the mean). Data-related stuff: Chicago Transit Authority and Illinois DOT both had wins in 2011 and 2012, which is why Cook/Chicago has eight wins in seven years. Statistics-related stuff: I very subjectively determined that the data assumed a Poisson distribution here. But I performed no statistical tests for sensitivity, so feel free to debate that. I also calculated exact confidence intervals for the mean, rather than for the normal distribution.

So, Philadelphia has been really, really good at winning, and it looks like that’s a pretty hard feat, given that the over half the counties only won a single award. And this next graph demonstrates the probability of winning 1,2,…7 times:

## calculate probabilities
p=cbind(lapply(c(1:8), ppois, lambda=mu_wins, lower.tail=FALSE))
probs=data.frame(p=unlist(cbind(lapply(c(1:8), ppois, lambda=mu_wins, lower.tail=FALSE))),wins=c(1:8))

## Plot probability data
ggplot(data=probs,aes(x=wins,y=p))+geom_line()+
scale_y_continuous(labels = scales::percent,limits = c(0,.45))+
geom_text(aes(label=sprintf(“%1.2f%%”, 100*p)),hjust=-.1,vjust=-.5)+
labs(x=’Number of wins’,y=’Probability of achieving # wins’)+
scale_x_continuous(breaks=c(1:8))

There’s a .01% chance of winning seven or more times. That’s small.

To conclude,

My analysis shows that among the winners, Philadelphia’s winning streak was highly aberrant (in a good way, of course). From a federal perspective, this is probably a good thing, as it would look strange if just a few localities were winning over and over (I have more to show on that below!). And although the loss could have been from some major change in this year’s grant writing procedures or application quality that were consequences of Jim Kenney’s new administration, it would be hard to discern that, given that the city’s winning streak is so highly statistically significant.

I should strongly caution, that this analysis is based on only the grant winners. We don’t have any data on the losers. And because of that, we don’t know how rare any one success is. But considering that $500M in disbursements is only about 5.4% of $9.3B in applications, I’m going to say it ain’t easy. And that could potentially make Philly’s wins even more statistically significant (because you’d be adding a really tall bar at 0 for all the counties who never won). So don’t feel bad, Mr. Kenney! There’s always next year.


Geographic locations of projects

After mapping to the counties to look at probabilities of wins, I aggregated to a higher level, this time US Census Divisions and Regions, to see where projects were getting awarded by year.

## Download the US Census Division shapefile
download.file('<a class="markup--anchor markup--pre-anchor" href="http://www2.census.gov/geo/tiger/GENZ2015/shp/cb_2015_us_division_500k.zip%27,%27USCB_div.zip%27" target="_blank" rel="nofollow" data-href="http://www2.census.gov/geo/tiger/GENZ2015/shp/cb_2015_us_division_500k.zip%27,%27USCB_div.zip%27">http://www2.census.gov/geo/tiger/GENZ2015/shp/cb_2015_us_division_500k.zip','USCB_div.zip'</a>)
unzip('USCB_div.zip',exdir='div_shp') # unzip & create a folder
div <- readOGR('C:/Users/A770213/Desktop/div_shp',layer='cb_2015_us_division_500k') # Census county shapefile
crs=div@proj4string@projargs # set coordinate reference system
pts=SpatialPointsDataFrame(coords=tg[,list(Longitude,Latitude)], data=tg, proj4string=CRS(crs)) # match points & shapefile’s CRS
# div@proj4string@projargs == pts@proj4string@projargs TRUE
site_div = data.table(over(pts,div))
site_div=data.table(cbind(pts@data,site_div))
xtabs(~site_div$NAME+site_div$Year)

                    Year
Division Name      2009 2010 2011 2012 2013 2014 2015
East North Central    8    9    6    6    4    6    4
East South Central    4    1    2    3    4    4    3
Middle Atlantic       2    9    6    5    4    6    5
Mountain              7    7    3    5    7    9    5
New England           6    7    4    5    6    6    5
Pacific               9   11    7    6    7   10    8
South Atlantic        5   16    7    7    8   14    3
West North Central    5   10    5    5    5   10    4
West South Central    5    4    4    5    6    7    2

I aggregated divisions into Census regions and came out with the following:

tab=site_div[,list(cnt=.N,amt=sum(Amount)),by=list(div_name=NAME,year=Year)]
tab[,':='(yr_cnt=sum(cnt),yr_amt=sum(amt)),by=year]
tab[,':='(cnt_pct=round(cnt/yr_cnt,3),amt_pct=round(amt/yr_amt,3))]
tab_m=melt(tab,id.vars = c('year','div_name'),measure.vars=c('cnt_pct','amt_pct'))
tab_m[,variable := factor(variable,levels=c('amt_pct','cnt_pct'),labels=c('Percent of Dollars Awarded','Percent of Awards Given'))]
tab_m[,reg_name:=ifelse(div_name %in% c('New England','Middle Atlantic'),'Northwest', ifelse(div_name %in% c('East North Central','West North Central'),Midwest',ifelse(div_name %in% c('South Atlantic','East South Central','West South Central'),'South’,'West')))]

tab_m2 = tab_m[,list(value=sum(value)),by=list(year,reg_name,variable)]
ggplot(data=tab_m2,aes(x=year,color=reg_name,y=value))+
geom_line(stat=’identity’,size=1)+
scale_x_continuous(breaks=c(2009:2015),labels=c(2009:2015))+
scale_y_continuous(labels=scales::percent)+
labs(x='Funding Year',y='Percent of Total Awarded',title='Regional share of total awards & award amounts', color='US Region')+
theme(axis.text=element_text(size=10,color='black'),axis.text.x=element_text(angle = 90,vjust=.5),strip.text=element_text(size=10,color='black',face='bold'),legend.text=element_text(size=10))+
facet_grid(variable~.)

The South has really been killing it on wins over the past few years, although it looks like there was some kind of serious shift in 2015. Again, without knowing the total number of submissions (i.e. wins vs losses), it’s hard to judge what these numbers really represent. The South has the highest number of states of the Census Regions and thus more potential applicants; it could all just be a matter of scale. Southern states also may have generally lower economies and rely more heavily on federal funding, which could also drive up the number of applicants. There’s a blog post from wallethub that shows a state-level breakdown of federal funding, although we’d need to join states and regions together for an accurate view. For my next data adventure, I’ll have to pull that data and join it to the TIGER funding data.


To leave a comment for the author, please follow the link and comment on their blog: Normally Skewed.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)