Exploring Police Misconduct Data

[This article was first published on R on Sam Portnow's Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Introduction

On July 26, 2020, Propublica released a dataset on police discipline records.

I wanted to get a look at the data and do some exploratory analysis.

library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.0
## ✓ tidyr   1.1.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## Warning: package 'ggplot2' was built under R version 3.6.2
## Warning: package 'tibble' was built under R version 3.6.2
## Warning: package 'tidyr' was built under R version 3.6.2
## Warning: package 'purrr' was built under R version 3.6.2
## Warning: package 'dplyr' was built under R version 3.6.2
## ── Conflicts ─────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(gt)
## Warning: package 'gt' was built under R version 3.6.2
library(here)
## here() starts at /Users/samportnow/sam-portnow-website
library(fs)
## Warning: package 'fs' was built under R version 3.6.2
library(skimr)
## Warning: package 'skimr' was built under R version 3.6.2
library(janitor)
## Warning: package 'janitor' was built under R version 3.6.2
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

Read in Data

data = read_csv(here::here('content', 'post_data', 'allegations_20200726939.csv'))
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   unique_mos_id = col_double(),
##   complaint_id = col_double(),
##   month_received = col_double(),
##   year_received = col_double(),
##   month_closed = col_double(),
##   year_closed = col_double(),
##   mos_age_incident = col_double(),
##   complainant_age_incident = col_double(),
##   precinct = col_double()
## )
## See spec(...) for full column specifications.

Explore Data

skim(data)
Table 1: Data summary
Name data
Number of rows 33358
Number of columns 26
_______________________
Column type frequency:
character 17
numeric 9
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
first_name 0 1.00 2 10 0 1217 0
last_name 0 1.00 2 18 0 2835 0
command_now 0 1.00 2 7 0 415 0
command_at_incident 1544 0.95 2 7 0 361 0
rank_abbrev_incident 0 1.00 2 3 0 18 0
rank_abbrev_now 0 1.00 2 3 0 20 0
rank_now 0 1.00 7 22 0 8 0
rank_incident 0 1.00 7 22 0 8 0
mos_ethnicity 0 1.00 5 15 0 5 0
mos_gender 0 1.00 1 1 0 2 0
complainant_ethnicity 4464 0.87 5 15 0 8 0
complainant_gender 4195 0.87 4 21 0 6 0
fado_type 0 1.00 5 18 0 4 0
allegation 1 1.00 4 40 0 115 0
contact_reason 199 0.99 5 58 0 53 0
outcome_description 56 1.00 12 36 0 23 0
board_disposition 0 1.00 10 40 0 11 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
unique_mos_id 0 1.00 18169.91 9566.32 2 9671.00 19215 25412 36374 ▆▆▇▇▃
complaint_id 0 1.00 23905.06 11954.43 517 13684.75 25132 34252 43703 ▅▆▆▇▇
month_received 0 1.00 6.32 3.36 1 3.00 6 9 12 ▇▆▅▆▇
year_received 0 1.00 2010.73 6.03 1985 2007.00 2012 2016 2020 ▁▁▃▇▇
month_closed 0 1.00 6.47 3.34 1 4.00 6 9 12 ▇▆▆▆▇
year_closed 0 1.00 2011.53 6.09 1985 2008.00 2013 2016 2020 ▁▁▂▆▇
mos_age_incident 0 1.00 32.35 6.04 20 28.00 31 36 60 ▅▇▃▁▁
complainant_age_incident 4812 0.86 32.48 28.41 -4301 23.00 30 41 101 ▁▁▁▁▇
precinct 24 1.00 64.37 31.45 0 43.00 67 81 1000 ▇▁▁▁▁
# Officer Ethnicity
data %>% tabyl(mos_ethnicity) %>% adorn_pct_formatting() %>% gt()
mos_ethnicity n percent
American Indian 32 0.1%
Asian 1178 3.5%
Black 4924 14.8%
Hispanic 9150 27.4%
White 18074 54.2%

It looks like the officer ethnicity matches the overall NYPD demogrphaic breakdown

Complainant Ethnicity

data %>% tabyl(complainant_ethnicity) %>% adorn_pct_formatting() %>% gt()
complainant_ethnicity n percent valid_percent
American Indian 64 0.2% 0.2%
Asian 532 1.6% 1.8%
Black 17114 51.3% 59.2%
Hispanic 6424 19.3% 22.2%
Other Race 677 2.0% 2.3%
Refused 259 0.8% 0.9%
Unknown 1041 3.1% 3.6%
White 2783 8.3% 9.6%
NA 4464 13.4%

As for as the complaintant data is concerned, there isn’t an obvious population to compare to, but I think the closest is the arrest data because arrests can give some approximation of contact with the police.

Here, we see that Black people are overrepresented in their complaints compared to the percentage of arrestees that are Black. Also, Hispanic people are markedly underrepresented. I wonder if this is because of a large undocumented population in New York who may try to avoid contact with the government, or perhaps an issue with the missing ethnicity category.

Cross Tab Complaintant Ethnicity and Officer Ethnicity

data %>% tabyl(complainant_ethnicity, mos_ethnicity) %>%
  adorn_percentages("row") %>%
  adorn_pct_formatting(digits = 2) %>%
  adorn_ns() %>% gt()
complainant_ethnicity American Indian Asian Black Hispanic White
American Indian 0.00% (0) 3.12% (2) 6.25% (4) 12.50% (8) 78.12% (50)
Asian 0.00% (0) 11.47% (61) 10.90% (58) 18.05% (96) 59.59% (317)
Black 0.11% (18) 3.26% (558) 16.63% (2846) 27.59% (4722) 52.41% (8970)
Hispanic 0.03% (2) 3.67% (236) 11.04% (709) 34.48% (2215) 50.78% (3262)
Other Race 0.89% (6) 2.81% (19) 14.48% (98) 27.62% (187) 54.21% (367)
Refused 0.00% (0) 6.95% (18) 14.67% (38) 22.39% (58) 55.98% (145)
Unknown 0.19% (2) 2.59% (27) 17.39% (181) 29.01% (302) 50.82% (529)
White 0.07% (2) 4.64% (129) 13.65% (380) 22.46% (625) 59.18% (1647)
NA 0.04% (2) 2.87% (128) 13.66% (610) 20.99% (937) 62.43% (2787)

The first thing I know is that there doesn’t seem to be much over or underrepresentation in the officer ethnicity relative to the demographic breakdown of the NYPD, with the exception of White, Native American, and Missing Complaitants and White officers. For these groups, which White officers are overrepresnted relative to the demographic breakdown of the NYPD. Actually, it appears that all ethnicity groups overrepresent their officers in complaints when the offending officer is of their same ethnicity, with the exception of Black complaitants. There is a large psychological literature that suggests people on on average display intergroup bias in favor of their own race. Perhaps when someone is mistreated by someone of their own race, it feels like a larger transgression.

Top Level Category of Complaint

data %>% tabyl(fado_type) %>% adorn_pct_formatting() %>% gt()

fado_type n percent
Abuse of Authority 20292 60.8%
Discourtesy 4677 14.0%
Force 7636 22.9%
Offensive Language 753 2.3%
# Specific Allegation

data %>% tabyl(allegation) %>% adorn_pct_formatting() %>% arrange(desc(n)) %>% gt()
allegation n percent valid_percent
Physical force 4849 14.5% 14.5%
Word 3942 11.8% 11.8%
Stop 2300 6.9% 6.9%
Search (of person) 2047 6.1% 6.1%
Frisk 1926 5.8% 5.8%
Premises entered and/or searched 1555 4.7% 4.7%
Refusal to provide name/shield number 1483 4.4% 4.4%
Vehicle search 1405 4.2% 4.2%
Threat of arrest 1370 4.1% 4.1%
Vehicle stop 1094 3.3% 3.3%
Threat of force (verbal or physical) 912 2.7% 2.7%
Question and/or stop 696 2.1% 2.1%
Other 641 1.9% 1.9%
Gun Pointed 628 1.9% 1.9%
Strip-searched 556 1.7% 1.7%
Question 527 1.6% 1.6%
Property damaged 359 1.1% 1.1%
Pepper spray 324 1.0% 1.0%
Action 316 0.9% 0.9%
Race 307 0.9% 0.9%
Retaliatory summons 291 0.9% 0.9%
Entry of Premises 289 0.9% 0.9%
Refusal to obtain medical treatment 276 0.8% 0.8%
Nightstick as club (incl asp & baton) 260 0.8% 0.8%
Chokehold 244 0.7% 0.7%
Frisk and/or search 241 0.7% 0.7%
Curse 202 0.6% 0.6%
Refusal to process civilian complaint 185 0.6% 0.6%
Gun Drawn 182 0.5% 0.5%
Push/Shove 178 0.5% 0.5%
Seizure of property 178 0.5% 0.5%
Failure to provide RTKA card 176 0.5% 0.5%
Hit against inanimate object 176 0.5% 0.5%
Threat to damage/seize property 171 0.5% 0.5%
Search of Premises 159 0.5% 0.5%
Retaliatory arrest 147 0.4% 0.4%
Other – Force 134 0.4% 0.4%
Refusal to show search warrant 125 0.4% 0.4%
Forcible Removal to Hospital 120 0.4% 0.4%
Gender 115 0.3% 0.3%
Punch/Kick 114 0.3% 0.3%
Interference with recording 111 0.3% 0.3%
Threat of summons 111 0.3% 0.3%
Refusal to provide shield number 97 0.3% 0.3%
Threat to notify ACS 97 0.3% 0.3%
Refusal to provide name 96 0.3% 0.3%
Person Searched 94 0.3% 0.3%
Nasty Words 88 0.3% 0.3%
Nonlethal restraining device 87 0.3% 0.3%
Ethnicity 84 0.3% 0.3%
Sexual orientation 78 0.2% 0.2%
Handcuffs too tight 76 0.2% 0.2%
Threat of force 74 0.2% 0.2%
Other – Abuse 70 0.2% 0.2%
Dragged/Pulled 63 0.2% 0.2%
Beat 57 0.2% 0.2%
Black 56 0.2% 0.2%
Gesture 51 0.2% 0.2%
Other blunt instrument as a club 49 0.1% 0.1%
Threat of Arrest 39 0.1% 0.1%
Demeanor/tone 38 0.1% 0.1%
Gun fired 38 0.1% 0.1%
Vehicle 35 0.1% 0.1%
Premise Searched 31 0.1% 0.1%
Gun as club 28 0.1% 0.1%
Property Damaged 27 0.1% 0.1%
Nightstick/Billy/Club 25 0.1% 0.1%
Search of recording device 23 0.1% 0.1%
Radio as club 22 0.1% 0.1%
Vehicle Searched 22 0.1% 0.1%
Restricted Breathing 21 0.1% 0.1%
Threat re: removal to hospital 21 0.1% 0.1%
Mace 20 0.1% 0.1%
Detention 19 0.1% 0.1%
Flashlight as club 18 0.1% 0.1%
Other- Discourtesy 18 0.1% 0.1%
Photography/Videography 15 0.0% 0.0%
Physical disability 15 0.0% 0.0%
Religion 14 0.0% 0.0%
Slap 14 0.0% 0.0%
Radio As Club 13 0.0% 0.0%
Sexual Misconduct (Sexual Humiliation) 13 0.0% 0.0%
Threat to Property 13 0.0% 0.0%
Gun As Club 12 0.0% 0.0%
Electronic device information deletion 11 0.0% 0.0%
Other – Ethnic Slur 11 0.0% 0.0%
Police shield 11 0.0% 0.0%
Sex Miscon (Sexual Harassment, Verbal) 11 0.0% 0.0%
Flashlight As Club 10 0.0% 0.0%
Refusal to show arrest warrant 10 0.0% 0.0%
Gun pointed/gun drawn 9 0.0% 0.0%
Hispanic 9 0.0% 0.0%
Property Seized 8 0.0% 0.0%
White 8 0.0% 0.0%
Rude Gesture 7 0.0% 0.0%
Arrest/D. A. T. 6 0.0% 0.0%
Sh Refuse Cmp 5 0.0% 0.0%
Threat of Summons 5 0.0% 0.0%
Arrest/Onlooker 4 0.0% 0.0%
Gender Identity 4 0.0% 0.0%
Improper dissemination of medical info 4 0.0% 0.0%
Sex Miscon (Sexual/Romantic Proposition) 4 0.0% 0.0%
Animal 3 0.0% 0.0%
Body Cavity Searches 3 0.0% 0.0%
Gay/Lesbian Slur 3 0.0% 0.0%
Gun pointed 3 0.0% 0.0%
Profane Gesture 3 0.0% 0.0%
Failed to Obtain Language Interpretation 2 0.0% 0.0%
Gun Fired 2 0.0% 0.0%
Jewish 2 0.0% 0.0%
Sex Miscon (Sexual Harassment, Gesture) 2 0.0% 0.0%
Oriental 1 0.0% 0.0%
Other Asian 1 0.0% 0.0%
Questioned immigration status 1 0.0% 0.0%
Sexist Remark 1 0.0% 0.0%
NA 1 0.0%

Contact Reason

data %>% tabyl(contact_reason) %>% arrange(desc(n)) %>% adorn_pct_formatting() %>% gt()

contact_reason n percent valid_percent
PD suspected C/V of violation/crime – street 10078 30.2% 30.4%
Other 4104 12.3% 12.4%
PD suspected C/V of violation/crime – auto 2981 8.9% 9.0%
PD suspected C/V of violation/crime – bldg 2542 7.6% 7.7%
Moving violation 1983 5.9% 6.0%
Other violation of VTL 1140 3.4% 3.4%
Report-dispute 1085 3.3% 3.3%
Execution of search warrant 913 2.7% 2.8%
Report of other crime 906 2.7% 2.7%
Execution of arrest/bench warrant 683 2.0% 2.1%
Parking violation 635 1.9% 1.9%
Report-domestic dispute 563 1.7% 1.7%
C/V intervened on behalf of/observed encounter w/3rd party 540 1.6% 1.6%
Report-gun possession/shots fired 492 1.5% 1.5%
PD suspected C/V of violation/crime – subway 380 1.1% 1.1%
Patrol Encounter 362 1.1% 1.1%
Report-noise/disturbance 356 1.1% 1.1%
Report-possession/sale of narcotics 318 1.0% 1.0%
C/V requested investigation of crime 316 0.9% 1.0%
EDP aided case 313 0.9% 0.9%
Traffic Incidents/Accident/Prk Violation 219 0.7% 0.7%
Stop/Question/Frisk 206 0.6% 0.6%
Traffic accident 200 0.6% 0.6%
NA 199 0.6%
Others 189 0.6% 0.6%
C/V telephoned PCT 175 0.5% 0.5%
C/V at PCT to obtain information 167 0.5% 0.5%
Report of Crime Past/Present 165 0.5% 0.5%
Aided case 154 0.5% 0.5%
C/V requested info from officer 112 0.3% 0.3%
Dispute 104 0.3% 0.3%
C/V at PCT to file complaint of crime 102 0.3% 0.3%
CV already in custody 100 0.3% 0.3%
Arrest/Complainant 73 0.2% 0.2%
Vehicle Stop and Check 51 0.2% 0.2%
Arrest/Not Complainant 46 0.1% 0.1%
Parade/special event 46 0.1% 0.1%
C/V at PCT to retrieve property 40 0.1% 0.1%
Assist ACS or other agency 39 0.1% 0.1%
PD auto checkpoint 39 0.1% 0.1%
Summons/Complainant 39 0.1% 0.1%
Regulatory inspection 36 0.1% 0.1%
Report of Disturbance/Noise Complaint 36 0.1% 0.1%
Demonstration/protest 35 0.1% 0.1%
PD telephones CV 18 0.1% 0.1%
Complainant at Pct. to make a Cmpl/info 16 0.0% 0.0%
EDP Aided Cases 12 0.0% 0.0%
Telephone Call to Precinct/Command 11 0.0% 0.0%
Transit checkpoint 11 0.0% 0.0%
Aided Cases 10 0.0% 0.0%
Demonstrations 8 0.0% 0.0%
Complainant Witnessing Incident 5 0.0% 0.0%
No contact 3 0.0% 0.0%
Victim Subject of Sex Crime 2 0.0% 0.0%
# Outcome Description

data %>% tabyl(outcome_description) %>% arrange(desc(n)) %>% adorn_pct_formatting() %>% gt()
outcome_description n percent valid_percent
No arrest made or summons issued 12822 38.4% 38.5%
Arrest – other violation/crime 10196 30.6% 30.6%
Summons – disorderly conduct 2118 6.3% 6.4%
Summons – other violation/crime 1940 5.8% 5.8%
Arrest – resisting arrest 1593 4.8% 4.8%
Arrest – disorderly conduct 1013 3.0% 3.0%
Arrest – assault (against a PO) 852 2.6% 2.6%
Moving violation summons issued 839 2.5% 2.5%
Arrest – OGA 649 1.9% 1.9%
Other VTL violation summons issued 531 1.6% 1.6%
Parking summons issued 279 0.8% 0.8%
Disorderly-Conduct/Arr/Summons 137 0.4% 0.4%
Arrest on Other Charge 81 0.2% 0.2%
Traffic Summons Claimed or Issued 59 0.2% 0.2%
Juvenile Report 57 0.2% 0.2%
NA 56 0.2%
Other Summons Claimed or Issued 38 0.1% 0.1%
Assault/Arrested 34 0.1% 0.1%
Resisting Arrest/Arrested 25 0.1% 0.1%
Arrest – harrassment (against a PO) 15 0.0% 0.0%
Obstruct-Govt-Admin/Arrested 10 0.0% 0.0%
Harrassment/Arrested/Summons 8 0.0% 0.0%
Summons – harrassment (against a PO) 5 0.0% 0.0%
Summons – OGA 1 0.0% 0.0%

Disposition

data %>% tabyl(board_disposition) %>% adorn_pct_formatting() %>% gt()
board_disposition n percent
Exonerated 9609 28.8%
Substantiated (Charges) 3796 11.4%
Substantiated (Command Discipline A) 964 2.9%
Substantiated (Command Discipline B) 789 2.4%
Substantiated (Command Discipline) 851 2.6%
Substantiated (Command Lvl Instructions) 454 1.4%
Substantiated (Formalized Training) 1033 3.1%
Substantiated (Instructions) 248 0.7%
Substantiated (MOS Unidentified) 1 0.0%
Substantiated (No Recommendations) 165 0.5%
Unsubstantiated 15448 46.3%

So, now I want to know what is the best predictor of having a substantiated claim. I am going to use tidymodels to evaluate a lasso regression to determine the most important predictors. The largely majority of this code is borrowed from Julia Silge

First, I’ll code my outcome

library(stringr)
data = data %>%
  mutate(
    outcome = if_else(str_detect(board_disposition, 'Substantiated'), 1, 0)
  )

Next, I’ll select the columns I want.

library(tidymodels)
## ── Attaching packages ─────────────────────────────────────────────────────────────────────── tidymodels 0.1.1 ──
## ✓ broom     0.7.0      ✓ recipes   0.1.13
## ✓ dials     0.0.8      ✓ rsample   0.0.7 
## ✓ infer     0.5.3      ✓ tune      0.1.1 
## ✓ modeldata 0.0.2      ✓ workflows 0.1.2 
## ✓ parsnip   0.1.2      ✓ yardstick 0.0.7
## Warning: package 'broom' was built under R version 3.6.2
## Warning: package 'dials' was built under R version 3.6.2
## Warning: package 'scales' was built under R version 3.6.2
## Warning: package 'infer' was built under R version 3.6.2
## Warning: package 'modeldata' was built under R version 3.6.2
## Warning: package 'recipes' was built under R version 3.6.2
## Warning: package 'rsample' was built under R version 3.6.2
## Warning: package 'tune' was built under R version 3.6.2
## Warning: package 'yardstick' was built under R version 3.6.2
## ── Conflicts ────────────────────────────────────────────────────────────────────────── tidymodels_conflicts() ──
## x scales::discard() masks purrr::discard()
## x dplyr::filter()   masks stats::filter()
## x recipes::fixed()  masks stringr::fixed()
## x dplyr::lag()      masks stats::lag()
## x yardstick::spec() masks readr::spec()
## x recipes::step()   masks stats::step()
data = data %>% select(command_now, command_at_incident, year_received, year_closed, rank_abbrev_incident, rank_abbrev_now,
                       mos_ethnicity:contact_reason, outcome)

I want to see what predicts a founded complaint, and the allegation of the complaint seems too proximally related to the complaints, so I am going to drop those variables.

data = data %>%
  select(- c(fado_type, allegation))

Create a length of investigation variable

data$length_of_investigation = data$year_closed - data$year_received

Change years to factors

data$year_closed = factor(data$year_closed)
data$year_received = factor(data$year_received)

Recode character variables to factor

data = data %>% mutate_if(is.character, as.factor)

Recoding missing in factor variables as missing

data = data %>% mutate_if(is.factor, ~ fct_explicit_na(., 'Missing'))

Change outcome to a factor

data$outcome = as.factor(data$outcome)

Look at data again

skim(data)
Table 2: Data summary
Name data
Number of rows 33358
Number of columns 16
_______________________
Column type frequency:
factor 12
numeric 4
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
command_now 0 1 FALSE 415 INT: 1412, WAR: 1165, NAR: 692, DB : 683
command_at_incident 0 1 FALSE 362 Mis: 1544, 075: 1364, 046: 825, 044: 763
year_received 0 1 FALSE 36 201: 2345, 201: 2312, 201: 2281, 201: 2220
year_closed 0 1 FALSE 36 201: 3251, 201: 2408, 201: 2326, 201: 2322
rank_abbrev_incident 0 1 FALSE 18 POM: 19807, SGT: 5701, DT3: 2712, POF: 1398
rank_abbrev_now 0 1 FALSE 20 POM: 9454, DT3: 6838, SGT: 6067, LT: 2815
mos_ethnicity 0 1 FALSE 5 Whi: 18074, His: 9150, Bla: 4924, Asi: 1178
mos_gender 0 1 FALSE 2 M: 31598, F: 1760
complainant_ethnicity 0 1 FALSE 9 Bla: 17114, His: 6424, Mis: 4464, Whi: 2783
complainant_gender 0 1 FALSE 7 Mal: 24058, Fem: 5021, Mis: 4195, Not: 57
contact_reason 0 1 FALSE 54 PD : 10078, Oth: 4104, PD : 2981, PD : 2542
outcome 0 1 FALSE 2 0: 25057, 1: 8301

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
mos_age_incident 0 1.00 32.35 6.04 20 28 31 36 60 ▅▇▃▁▁
complainant_age_incident 4812 0.86 32.48 28.41 -4301 23 30 41 101 ▁▁▁▁▇
precinct 24 1.00 64.37 31.45 0 43 67 81 1000 ▇▁▁▁▁
length_of_investigation 0 1.00 0.80 0.59 0 0 1 1 9 ▇▁▁▁▁
Dropping age at incident be cause it’s a lmost always bla nk. Also recodin g precin ct to factor
data = data %>% select(-complainant_age_incident)
data$precinct = factor(data$precinct)
data$precinct = fct_explicit_na(data$precinct, 'Missing')

skim(data)
Table 3: Data summary
Name data
Number of rows 33358
Number of columns 15
_______________________
Column type frequency:
factor 13
numeric 2
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
command_now 0 1 FALSE 415 INT: 1412, WAR: 1165, NAR: 692, DB : 683
command_at_incident 0 1 FALSE 362 Mis: 1544, 075: 1364, 046: 825, 044: 763
year_received 0 1 FALSE 36 201: 2345, 201: 2312, 201: 2281, 201: 2220
year_closed 0 1 FALSE 36 201: 3251, 201: 2408, 201: 2326, 201: 2322
rank_abbrev_incident 0 1 FALSE 18 POM: 19807, SGT: 5701, DT3: 2712, POF: 1398
rank_abbrev_now 0 1 FALSE 20 POM: 9454, DT3: 6838, SGT: 6067, LT: 2815
mos_ethnicity 0 1 FALSE 5 Whi: 18074, His: 9150, Bla: 4924, Asi: 1178
mos_gender 0 1 FALSE 2 M: 31598, F: 1760
complainant_ethnicity 0 1 FALSE 9 Bla: 17114, His: 6424, Mis: 4464, Whi: 2783
complainant_gender 0 1 FALSE 7 Mal: 24058, Fem: 5021, Mis: 4195, Not: 57
precinct 0 1 FALSE 80 75: 2172, 73: 1163, 44: 1139, 46: 1120
contact_reason 0 1 FALSE 54 PD : 10078, Oth: 4104, PD : 2981, PD : 2542
outcome 0 1 FALSE 2 0: 25057, 1: 8301

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
mos_age_incident 0 1 32.35 6.04 20 28 31 36 60 ▅▇▃▁▁
length_of_investigation 0 1 0.80 0.59 0 0 1 1 9 ▇▁▁▁▁

Finally, if any factor has an n less than 10, I went to recode that to ther

recode_to_other = function(.x){
  
  keeps = which(table(.x) > 10) %>% names()
  fct_other(.x, keep = keeps, other_level = 'other')
}

data = data %>% mutate_if(is.factor, recode_to_other)
## Warning: Unknown levels in `f`: other

## Warning: Unknown levels in `f`: other

## Warning: Unknown levels in `f`: other

## Warning: Unknown levels in `f`: other

## Warning: Unknown levels in `f`: other

Before fitting the model, I want to know the distribution of outcomes.

prop.table(table(data$outcome))
## 
##         0         1 
## 0.7511541 0.2488459

75% of complains are unfounded. So, if we fit a model without any predictors, we’ll already be at 75% accuracy. I confirm this below with a logistic regression with no predictors. The predict probability of an sustantiated claim is ~25%.

glm.mod = glm(outcome ~ 1, data = data, family = binomial(link = 'logit'))
predictions = predict(glm.mod, type = 'response')
table(predictions)
## predictions
## 0.248845854068537 
##             33358
data_split <- initial_split(data)
data_train <- training(data_split)
data_test <- testing(data_split)
data_rec <- recipe(outcome~ ., data = data_train) %>%
  step_zv(all_numeric(), -all_outcomes()) %>%
  step_dummy(all_predictors(), -all_numeric()) %>%
  step_normalize(all_numeric(), -all_outcomes())


data_prep <- data_rec %>%
  prep(strings_as_factors = FALSE)

summary(data_prep)
## # A tibble: 805 x 4
##    variable                type    role      source  
##    <chr>                   <chr>   <chr>     <chr>   
##  1 mos_age_incident        numeric predictor original
##  2 length_of_investigation numeric predictor original
##  3 outcome                 nominal outcome   original
##  4 command_now_X001.PCT    numeric predictor derived 
##  5 command_now_X005.DET    numeric predictor derived 
##  6 command_now_X005.PCT    numeric predictor derived 
##  7 command_now_X006.PCT    numeric predictor derived 
##  8 command_now_X007.PCT    numeric predictor derived 
##  9 command_now_X009.DET    numeric predictor derived 
## 10 command_now_X009.PCT    numeric predictor derived 
## # … with 795 more rows
lasso_spec <- logistic_reg(mode = 'classification', penalty = 0.1, mixture = 1) %>%
  set_engine("glmnet")

wf <- workflow() %>%
  add_recipe(data_rec)

lasso_fit <- wf %>%
  add_model(lasso_spec) %>%
  fit(data = data_train)

lasso_fit %>%
  pull_workflow_fit() %>%
  tidy()
## # A tibble: 30,881 x 5
##    term         step estimate lambda dev.ratio
##    <chr>       <dbl>    <dbl>  <dbl>     <dbl>
##  1 (Intercept)     1    -1.11 0.0295  9.99e-14
##  2 (Intercept)     2    -1.11 0.0269  1.68e- 3
##  3 (Intercept)     3    -1.11 0.0245  4.34e- 3
##  4 (Intercept)     4    -1.11 0.0224  6.95e- 3
##  5 (Intercept)     5    -1.11 0.0204  9.13e- 3
##  6 (Intercept)     6    -1.11 0.0186  1.15e- 2
##  7 (Intercept)     7    -1.11 0.0169  1.38e- 2
##  8 (Intercept)     8    -1.12 0.0154  1.60e- 2
##  9 (Intercept)     9    -1.12 0.0140  1.81e- 2
## 10 (Intercept)    10    -1.12 0.0128  2.01e- 2
## # … with 30,871 more rows
set.seed(1234)
data_boot <- bootstraps(data_train)

tune_spec <- logistic_reg(mode = 'classification', penalty = tune(), mixture = 1) %>%
  set_engine("glmnet")

lambda_grid <- grid_regular(penalty(), levels = 50)
doParallel::registerDoParallel()

set.seed(2020)
lasso_grid <- tune_grid(
  wf %>% add_model(tune_spec),
  resamples = data_boot,
  grid = lambda_grid
)

What results did we get?

lasso_grid %>%
  collect_metrics()
## # A tibble: 100 x 7
##     penalty .metric  .estimator  mean     n std_err .config
##       <dbl> <chr>    <chr>      <dbl> <int>   <dbl> <fct>  
##  1 1.00e-10 accuracy binary     0.739    14 0.00104 Model01
##  2 1.00e-10 roc_auc  binary     0.622    14 0.00102 Model01
##  3 1.60e-10 accuracy binary     0.739    14 0.00104 Model02
##  4 1.60e-10 roc_auc  binary     0.622    14 0.00102 Model02
##  5 2.56e-10 accuracy binary     0.739    14 0.00104 Model03
##  6 2.56e-10 roc_auc  binary     0.622    14 0.00102 Model03
##  7 4.09e-10 accuracy binary     0.739    14 0.00104 Model04
##  8 4.09e-10 roc_auc  binary     0.622    14 0.00102 Model04
##  9 6.55e-10 accuracy binary     0.739    14 0.00104 Model05
## 10 6.55e-10 roc_auc  binary     0.622    14 0.00102 Model05
## # … with 90 more rows
lasso_grid %>%
  collect_metrics() %>%
  ggplot(aes(penalty, mean, color = .metric)) +
  geom_errorbar(aes(
    ymin = mean - std_err,
    ymax = mean + std_err
  ),
  alpha = 0.5
  ) +
  geom_line(size = 1.5) +
  facet_wrap(~.metric, scales = "free", nrow = 2) +
  scale_x_log10() +
  theme(legend.position = "none")

highest_auc <- lasso_grid %>%
  select_best("roc_auc", maximize = FALSE)
## Warning: The `maximize` argument is no longer needed. This value was ignored.
final_lasso <- finalize_workflow(
  wf %>% add_model(tune_spec),
  highest_auc
)
library(vip)
## Warning: package 'vip' was built under R version 3.6.2
## 
## Attaching package: 'vip'
## The following object is masked from 'package:utils':
## 
##     vi
lasso_table = final_lasso %>%
  fit(data_train) %>%
  pull_workflow_fit() %>%
  vi(lambda = highest_auc$penalty) %>%
  mutate(
    Importance = abs(Importance),
    Variable = fct_reorder(Variable, Importance)
  )
lasso_table %>% mutate(Importance = round(Importance, 2)) %>% filter(Importance > .1) %>% arrange(desc(Importance)) %>% gt()
Variable Importance Sign
year_closed_X2007 0.26 NEG
year_closed_X2008 0.23 NEG
year_closed_X2011 0.22 NEG
complainant_gender_Missing 0.20 POS
contact_reason_Execution.of.search.warrant 0.19 NEG
year_closed_X2009 0.18 NEG
year_closed_X2015 0.16 POS
complainant_gender_Male 0.14 POS
command_at_incident_ND.SEQI 0.13 POS
year_closed_X2004 0.13 POS
year_received_X2006 0.12 POS
year_received_X2003 0.12 NEG
year_closed_X2012 0.12 NEG
year_closed_X1998 0.11 POS
year_closed_X2013 0.11 NEG

Probably the most interesting finding here is that the excecution of a search warrant predicts a lower likelihood fo having a substantiated claim, which makes sense.

Next, let’s look at how the model actually performed.

last_fit(
  final_lasso,
  data_split
) %>%
  collect_metrics()
## # A tibble: 2 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.748
## 2 roc_auc  binary         0.643

Unfortunately, the model did not perform well at all. We would just as accurate if we had went without any predictors at all.

Conclusion

This looks like a situation where a predictive model doesn’t tell us much of anything, which makes sense. I don’t really know much of anything with regards to substantiated claims of police misconduct, and I am probably missing lots of important variables that can describe the process. A predictive model that is applied to data...

To leave a comment for the author, please follow the link and comment on their blog: R on Sam Portnow's Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)