Introduction

On July 26, 2020, Propublica released a dataset on police discipline records.

I wanted to get a look at the data and do some exploratory analysis.

library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.0
## ✓ tidyr   1.1.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## Warning: package 'ggplot2' was built under R version 3.6.2
## Warning: package 'tibble' was built under R version 3.6.2
## Warning: package 'tidyr' was built under R version 3.6.2
## Warning: package 'purrr' was built under R version 3.6.2
## Warning: package 'dplyr' was built under R version 3.6.2
## ── Conflicts ─────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(gt)
## Warning: package 'gt' was built under R version 3.6.2
library(here)
## here() starts at /Users/samportnow/sam-portnow-website
library(fs)
## Warning: package 'fs' was built under R version 3.6.2
library(skimr)
## Warning: package 'skimr' was built under R version 3.6.2
library(janitor)
## Warning: package 'janitor' was built under R version 3.6.2
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

Read in Data

data = read_csv(here::here('content', 'post_data', 'allegations_20200726939.csv'))
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   unique_mos_id = col_double(),
##   complaint_id = col_double(),
##   month_received = col_double(),
##   year_received = col_double(),
##   month_closed = col_double(),
##   year_closed = col_double(),
##   mos_age_incident = col_double(),
##   complainant_age_incident = col_double(),
##   precinct = col_double()
## )
## See spec(...) for full column specifications.

Explore Data

skim(data)

Table 1: Data summary
Name	data
Number of rows	33358
Number of columns	26
_______________________
Column type frequency:
character	17
numeric	9
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
first_name	0	1.00	2	10	1217
last_name	0	1.00	2	18	2835
command_now	0	1.00	2	7	415
command_at_incident	1544	0.95	2	7	361
rank_abbrev_incident	0	1.00	2	3	18
rank_abbrev_now	0	1.00	2	3	20
rank_now	0	1.00	7	22	8
rank_incident	0	1.00	7	22	8
mos_ethnicity	0	1.00	5	15	5
mos_gender	0	1.00	1	1	2
complainant_ethnicity	4464	0.87	5	15	8
complainant_gender	4195	0.87	4	21	6
fado_type	0	1.00	5	18	4
allegation	1	1.00	4	40	115
contact_reason	199	0.99	5	58	53
outcome_description	56	1.00	12	36	23
board_disposition	0	1.00	10	40	11

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
unique_mos_id	0	1.00	18169.91	9566.32	2	9671.00	19215	25412	36374	▆▆▇▇▃
complaint_id	0	1.00	23905.06	11954.43	517	13684.75	25132	34252	43703	▅▆▆▇▇
month_received	0	1.00	6.32	3.36	1	3.00	6	9	12	▇▆▅▆▇
year_received	0	1.00	2010.73	6.03	1985	2007.00	2012	2016	2020	▁▁▃▇▇
month_closed	0	1.00	6.47	3.34	1	4.00	6	9	12	▇▆▆▆▇
year_closed	0	1.00	2011.53	6.09	1985	2008.00	2013	2016	2020	▁▁▂▆▇
mos_age_incident	0	1.00	32.35	6.04	20	28.00	31	36	60	▅▇▃▁▁
complainant_age_incident	4812	0.86	32.48	28.41	-4301	23.00	30	41	101	▁▁▁▁▇
precinct	24	1.00	64.37	31.45	0	43.00	67	81	1000	▇▁▁▁▁
# Officer Ethnicity

data %>% tabyl(mos_ethnicity) %>% adorn_pct_formatting() %>% gt()

mos_ethnicity	n	percent
American Indian	32	0.1%
Asian	1178	3.5%
Black	4924	14.8%
Hispanic	9150	27.4%
White	18074	54.2%

It looks like the officer ethnicity matches the overall NYPD demogrphaic breakdown

Complainant Ethnicity

data %>% tabyl(complainant_ethnicity) %>% adorn_pct_formatting() %>% gt()

complainant_ethnicity	n	percent	valid_percent
American Indian	64	0.2%	0.2%
Asian	532	1.6%	1.8%
Black	17114	51.3%	59.2%
Hispanic	6424	19.3%	22.2%
Other Race	677	2.0%	2.3%
Refused	259	0.8%	0.9%
Unknown	1041	3.1%	3.6%
White	2783	8.3%	9.6%
NA	4464	13.4%	–

As for as the complaintant data is concerned, there isn’t an obvious population to compare to, but I think the closest is the arrest data because arrests can give some approximation of contact with the police.

Here, we see that Black people are overrepresented in their complaints compared to the percentage of arrestees that are Black. Also, Hispanic people are markedly underrepresented. I wonder if this is because of a large undocumented population in New York who may try to avoid contact with the government, or perhaps an issue with the missing ethnicity category.

Cross Tab Complaintant Ethnicity and Officer Ethnicity

data %>% tabyl(complainant_ethnicity, mos_ethnicity) %>%
  adorn_percentages("row") %>%
  adorn_pct_formatting(digits = 2) %>%
  adorn_ns() %>% gt()

complainant_ethnicity	American Indian	Asian	Black	Hispanic	White
American Indian	0.00% (0)	3.12% (2)	6.25% (4)	12.50% (8)	78.12% (50)
Asian	0.00% (0)	11.47% (61)	10.90% (58)	18.05% (96)	59.59% (317)
Black	0.11% (18)	3.26% (558)	16.63% (2846)	27.59% (4722)	52.41% (8970)
Hispanic	0.03% (2)	3.67% (236)	11.04% (709)	34.48% (2215)	50.78% (3262)
Other Race	0.89% (6)	2.81% (19)	14.48% (98)	27.62% (187)	54.21% (367)
Refused	0.00% (0)	6.95% (18)	14.67% (38)	22.39% (58)	55.98% (145)
Unknown	0.19% (2)	2.59% (27)	17.39% (181)	29.01% (302)	50.82% (529)
White	0.07% (2)	4.64% (129)	13.65% (380)	22.46% (625)	59.18% (1647)
NA	0.04% (2)	2.87% (128)	13.66% (610)	20.99% (937)	62.43% (2787)

The first thing I know is that there doesn’t seem to be much over or underrepresentation in the officer ethnicity relative to the demographic breakdown of the NYPD, with the exception of White, Native American, and Missing Complaitants and White officers. For these groups, which White officers are overrepresnted relative to the demographic breakdown of the NYPD. Actually, it appears that all ethnicity groups overrepresent their officers in complaints when the offending officer is of their same ethnicity, with the exception of Black complaitants. There is a large psychological literature that suggests people on on average display intergroup bias in favor of their own race. Perhaps when someone is mistreated by someone of their own race, it feels like a larger transgression.

Top Level Category of Complaint

data %>% tabyl(fado_type) %>% adorn_pct_formatting() %>% gt()

fado_type	n	percent
Abuse of Authority	20292	60.8%
Discourtesy	4677	14.0%
Force	7636	22.9%
Offensive Language	753	2.3%

# Specific Allegation

data %>% tabyl(allegation) %>% adorn_pct_formatting() %>% arrange(desc(n)) %>% gt()

allegation	n	percent	valid_percent
Physical force	4849	14.5%	14.5%
Word	3942	11.8%	11.8%
Stop	2300	6.9%	6.9%
Search (of person)	2047	6.1%	6.1%
Frisk	1926	5.8%	5.8%
Premises entered and/or searched	1555	4.7%	4.7%
Refusal to provide name/shield number	1483	4.4%	4.4%
Vehicle search	1405	4.2%	4.2%
Threat of arrest	1370	4.1%	4.1%
Vehicle stop	1094	3.3%	3.3%
Threat of force (verbal or physical)	912	2.7%	2.7%
Question and/or stop	696	2.1%	2.1%
Other	641	1.9%	1.9%
Gun Pointed	628	1.9%	1.9%
Strip-searched	556	1.7%	1.7%
Question	527	1.6%	1.6%
Property damaged	359	1.1%	1.1%
Pepper spray	324	1.0%	1.0%
Action	316	0.9%	0.9%
Race	307	0.9%	0.9%
Retaliatory summons	291	0.9%	0.9%
Entry of Premises	289	0.9%	0.9%
Refusal to obtain medical treatment	276	0.8%	0.8%
Nightstick as club (incl asp & baton)	260	0.8%	0.8%
Chokehold	244	0.7%	0.7%
Frisk and/or search	241	0.7%	0.7%
Curse	202	0.6%	0.6%
Refusal to process civilian complaint	185	0.6%	0.6%
Gun Drawn	182	0.5%	0.5%
Push/Shove	178	0.5%	0.5%
Seizure of property	178	0.5%	0.5%
Failure to provide RTKA card	176	0.5%	0.5%
Hit against inanimate object	176	0.5%	0.5%
Threat to damage/seize property	171	0.5%	0.5%
Search of Premises	159	0.5%	0.5%
Retaliatory arrest	147	0.4%	0.4%
Other – Force	134	0.4%	0.4%
Refusal to show search warrant	125	0.4%	0.4%
Forcible Removal to Hospital	120	0.4%	0.4%
Gender	115	0.3%	0.3%
Punch/Kick	114	0.3%	0.3%
Interference with recording	111	0.3%	0.3%
Threat of summons	111	0.3%	0.3%
Refusal to provide shield number	97	0.3%	0.3%
Threat to notify ACS	97	0.3%	0.3%
Refusal to provide name	96	0.3%	0.3%
Person Searched	94	0.3%	0.3%
Nasty Words	88	0.3%	0.3%
Nonlethal restraining device	87	0.3%	0.3%
Ethnicity	84	0.3%	0.3%
Sexual orientation	78	0.2%	0.2%
Handcuffs too tight	76	0.2%	0.2%
Threat of force	74	0.2%	0.2%
Other – Abuse	70	0.2%	0.2%
Dragged/Pulled	63	0.2%	0.2%
Beat	57	0.2%	0.2%
Black	56	0.2%	0.2%
Gesture	51	0.2%	0.2%
Other blunt instrument as a club	49	0.1%	0.1%
Threat of Arrest	39	0.1%	0.1%
Demeanor/tone	38	0.1%	0.1%
Gun fired	38	0.1%	0.1%
Vehicle	35	0.1%	0.1%
Premise Searched	31	0.1%	0.1%
Gun as club	28	0.1%	0.1%
Property Damaged	27	0.1%	0.1%
Nightstick/Billy/Club	25	0.1%	0.1%
Search of recording device	23	0.1%	0.1%
Radio as club	22	0.1%	0.1%
Vehicle Searched	22	0.1%	0.1%
Restricted Breathing	21	0.1%	0.1%
Threat re: removal to hospital	21	0.1%	0.1%
Mace	20	0.1%	0.1%
Detention	19	0.1%	0.1%
Flashlight as club	18	0.1%	0.1%
Other- Discourtesy	18	0.1%	0.1%
Photography/Videography	15	0.0%	0.0%
Physical disability	15	0.0%	0.0%
Religion	14	0.0%	0.0%
Slap	14	0.0%	0.0%
Radio As Club	13	0.0%	0.0%
Sexual Misconduct (Sexual Humiliation)	13	0.0%	0.0%
Threat to Property	13	0.0%	0.0%
Gun As Club	12	0.0%	0.0%
Electronic device information deletion	11	0.0%	0.0%
Other – Ethnic Slur	11	0.0%	0.0%
Police shield	11	0.0%	0.0%
Sex Miscon (Sexual Harassment, Verbal)	11	0.0%	0.0%
Flashlight As Club	10	0.0%	0.0%
Refusal to show arrest warrant	10	0.0%	0.0%
Gun pointed/gun drawn	9	0.0%	0.0%
Hispanic	9	0.0%	0.0%
Property Seized	8	0.0%	0.0%
White	8	0.0%	0.0%
Rude Gesture	7	0.0%	0.0%
Arrest/D. A. T.	6	0.0%	0.0%
Sh Refuse Cmp	5	0.0%	0.0%
Threat of Summons	5	0.0%	0.0%
Arrest/Onlooker	4	0.0%	0.0%
Gender Identity	4	0.0%	0.0%
Improper dissemination of medical info	4	0.0%	0.0%
Sex Miscon (Sexual/Romantic Proposition)	4	0.0%	0.0%
Animal	3	0.0%	0.0%
Body Cavity Searches	3	0.0%	0.0%
Gay/Lesbian Slur	3	0.0%	0.0%
Gun pointed	3	0.0%	0.0%
Profane Gesture	3	0.0%	0.0%
Failed to Obtain Language Interpretation	2	0.0%	0.0%
Gun Fired	2	0.0%	0.0%
Jewish	2	0.0%	0.0%
Sex Miscon (Sexual Harassment, Gesture)	2	0.0%	0.0%
Oriental	1	0.0%	0.0%
Other Asian	1	0.0%	0.0%
Questioned immigration status	1	0.0%	0.0%
Sexist Remark	1	0.0%	0.0%
NA	1	0.0%	–

Contact Reason

data %>% tabyl(contact_reason) %>% arrange(desc(n)) %>% adorn_pct_formatting() %>% gt()

contact_reason	n	percent	valid_percent
PD suspected C/V of violation/crime – street	10078	30.2%	30.4%
Other	4104	12.3%	12.4%
PD suspected C/V of violation/crime – auto	2981	8.9%	9.0%
PD suspected C/V of violation/crime – bldg	2542	7.6%	7.7%
Moving violation	1983	5.9%	6.0%
Other violation of VTL	1140	3.4%	3.4%
Report-dispute	1085	3.3%	3.3%
Execution of search warrant	913	2.7%	2.8%
Report of other crime	906	2.7%	2.7%
Execution of arrest/bench warrant	683	2.0%	2.1%
Parking violation	635	1.9%	1.9%
Report-domestic dispute	563	1.7%	1.7%
C/V intervened on behalf of/observed encounter w/3rd party	540	1.6%	1.6%
Report-gun possession/shots fired	492	1.5%	1.5%
PD suspected C/V of violation/crime – subway	380	1.1%	1.1%
Patrol Encounter	362	1.1%	1.1%
Report-noise/disturbance	356	1.1%	1.1%
Report-possession/sale of narcotics	318	1.0%	1.0%
C/V requested investigation of crime	316	0.9%	1.0%
EDP aided case	313	0.9%	0.9%
Traffic Incidents/Accident/Prk Violation	219	0.7%	0.7%
Stop/Question/Frisk	206	0.6%	0.6%
Traffic accident	200	0.6%	0.6%
NA	199	0.6%	–
Others	189	0.6%	0.6%
C/V telephoned PCT	175	0.5%	0.5%
C/V at PCT to obtain information	167	0.5%	0.5%
Report of Crime Past/Present	165	0.5%	0.5%
Aided case	154	0.5%	0.5%
C/V requested info from officer	112	0.3%	0.3%
Dispute	104	0.3%	0.3%
C/V at PCT to file complaint of crime	102	0.3%	0.3%
CV already in custody	100	0.3%	0.3%
Arrest/Complainant	73	0.2%	0.2%
Vehicle Stop and Check	51	0.2%	0.2%
Arrest/Not Complainant	46	0.1%	0.1%
Parade/special event	46	0.1%	0.1%
C/V at PCT to retrieve property	40	0.1%	0.1%
Assist ACS or other agency	39	0.1%	0.1%
PD auto checkpoint	39	0.1%	0.1%
Summons/Complainant	39	0.1%	0.1%
Regulatory inspection	36	0.1%	0.1%
Report of Disturbance/Noise Complaint	36	0.1%	0.1%
Demonstration/protest	35	0.1%	0.1%
PD telephones CV	18	0.1%	0.1%
Complainant at Pct. to make a Cmpl/info	16	0.0%	0.0%
EDP Aided Cases	12	0.0%	0.0%
Telephone Call to Precinct/Command	11	0.0%	0.0%
Transit checkpoint	11	0.0%	0.0%
Aided Cases	10	0.0%	0.0%
Demonstrations	8	0.0%	0.0%
Complainant Witnessing Incident	5	0.0%	0.0%
No contact	3	0.0%	0.0%
Victim Subject of Sex Crime	2	0.0%	0.0%

# Outcome Description

data %>% tabyl(outcome_description) %>% arrange(desc(n)) %>% adorn_pct_formatting() %>% gt()

outcome_description	n	percent	valid_percent
No arrest made or summons issued	12822	38.4%	38.5%
Arrest – other violation/crime	10196	30.6%	30.6%
Summons – disorderly conduct	2118	6.3%	6.4%
Summons – other violation/crime	1940	5.8%	5.8%
Arrest – resisting arrest	1593	4.8%	4.8%
Arrest – disorderly conduct	1013	3.0%	3.0%
Arrest – assault (against a PO)	852	2.6%	2.6%
Moving violation summons issued	839	2.5%	2.5%
Arrest – OGA	649	1.9%	1.9%
Other VTL violation summons issued	531	1.6%	1.6%
Parking summons issued	279	0.8%	0.8%
Disorderly-Conduct/Arr/Summons	137	0.4%	0.4%
Arrest on Other Charge	81	0.2%	0.2%
Traffic Summons Claimed or Issued	59	0.2%	0.2%
Juvenile Report	57	0.2%	0.2%
NA	56	0.2%	–
Other Summons Claimed or Issued	38	0.1%	0.1%
Assault/Arrested	34	0.1%	0.1%
Resisting Arrest/Arrested	25	0.1%	0.1%
Arrest – harrassment (against a PO)	15	0.0%	0.0%
Obstruct-Govt-Admin/Arrested	10	0.0%	0.0%
Harrassment/Arrested/Summons	8	0.0%	0.0%
Summons – harrassment (against a PO)	5	0.0%	0.0%
Summons – OGA	1	0.0%	0.0%

Disposition

data %>% tabyl(board_disposition) %>% adorn_pct_formatting() %>% gt()

board_disposition	n	percent
Exonerated	9609	28.8%
Substantiated (Charges)	3796	11.4%
Substantiated (Command Discipline A)	964	2.9%
Substantiated (Command Discipline B)	789	2.4%
Substantiated (Command Discipline)	851	2.6%
Substantiated (Command Lvl Instructions)	454	1.4%
Substantiated (Formalized Training)	1033	3.1%
Substantiated (Instructions)	248	0.7%
Substantiated (MOS Unidentified)	1	0.0%
Substantiated (No Recommendations)	165	0.5%
Unsubstantiated	15448	46.3%

So, now I want to know what is the best predictor of having a substantiated claim. I am going to use tidymodels to evaluate a lasso regression to determine the most important predictors. The largely majority of this code is borrowed from Julia Silge

First, I’ll code my outcome

library(stringr)
data = data %>%
  mutate(
    outcome = if_else(str_detect(board_disposition, 'Substantiated'), 1, 0)
  )

Next, I’ll select the columns I want.

library(tidymodels)
## ── Attaching packages ─────────────────────────────────────────────────────────────────────── tidymodels 0.1.1 ──
## ✓ broom     0.7.0      ✓ recipes   0.1.13
## ✓ dials     0.0.8      ✓ rsample   0.0.7 
## ✓ infer     0.5.3      ✓ tune      0.1.1 
## ✓ modeldata 0.0.2      ✓ workflows 0.1.2 
## ✓ parsnip   0.1.2      ✓ yardstick 0.0.7
## Warning: package 'broom' was built under R version 3.6.2
## Warning: package 'dials' was built under R version 3.6.2
## Warning: package 'scales' was built under R version 3.6.2
## Warning: package 'infer' was built under R version 3.6.2
## Warning: package 'modeldata' was built under R version 3.6.2
## Warning: package 'recipes' was built under R version 3.6.2
## Warning: package 'rsample' was built under R version 3.6.2
## Warning: package 'tune' was built under R version 3.6.2
## Warning: package 'yardstick' was built under R version 3.6.2
## ── Conflicts ────────────────────────────────────────────────────────────────────────── tidymodels_conflicts() ──
## x scales::discard() masks purrr::discard()
## x dplyr::filter()   masks stats::filter()
## x recipes::fixed()  masks stringr::fixed()
## x dplyr::lag()      masks stats::lag()
## x yardstick::spec() masks readr::spec()
## x recipes::step()   masks stats::step()
data = data %>% select(command_now, command_at_incident, year_received, year_closed, rank_abbrev_incident, rank_abbrev_now,
                       mos_ethnicity:contact_reason, outcome)

I want to see what predicts a founded complaint, and the allegation of the complaint seems too proximally related to the complaints, so I am going to drop those variables.

data = data %>%
  select(- c(fado_type, allegation))

Create a length of investigation variable

data$length_of_investigation = data$year_closed - data$year_received

Change years to factors

data$year_closed = factor(data$year_closed)
data$year_received = factor(data$year_received)

Recode character variables to factor

data = data %>% mutate_if(is.character, as.factor)

Recoding missing in factor variables as missing

data = data %>% mutate_if(is.factor, ~ fct_explicit_na(., 'Missing'))

Change outcome to a factor

data$outcome = as.factor(data$outcome)

Look at data again

skim(data)

Table 2: Data summary
Name	data
Number of rows	33358
Number of columns	16
_______________________
Column type frequency:
factor	12
numeric	4
________________________
Group variables	None

Variable type: factor

skim_variable	complete_rate	ordered	n_unique	top_counts
command_now	1	FALSE	415	INT: 1412, WAR: 1165, NAR: 692, DB : 683
command_at_incident	1	FALSE	362	Mis: 1544, 075: 1364, 046: 825, 044: 763
year_received	1	FALSE	36	201: 2345, 201: 2312, 201: 2281, 201: 2220
year_closed	1	FALSE	36	201: 3251, 201: 2408, 201: 2326, 201: 2322
rank_abbrev_incident	1	FALSE	18	POM: 19807, SGT: 5701, DT3: 2712, POF: 1398
rank_abbrev_now	1	FALSE	20	POM: 9454, DT3: 6838, SGT: 6067, LT: 2815
mos_ethnicity	1	FALSE	5	Whi: 18074, His: 9150, Bla: 4924, Asi: 1178
mos_gender	1	FALSE	2	M: 31598, F: 1760
complainant_ethnicity	1	FALSE	9	Bla: 17114, His: 6424, Mis: 4464, Whi: 2783
complainant_gender	1	FALSE	7	Mal: 24058, Fem: 5021, Mis: 4195, Not: 57
contact_reason	1	FALSE	54	PD : 10078, Oth: 4104, PD : 2981, PD : 2542
outcome	1	FALSE	2	0: 25057, 1: 8301

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
mos_age_incident	0	1.00	32.35	6.04	20	28	31	36	60	▅▇▃▁▁
complainant_age_incident	4812	0.86	32.48	28.41	-4301	23	30	41	101	▁▁▁▁▇
precinct	24	1.00	64.37	31.45	0	43	67	81	1000	▇▁▁▁▁
length_of_investigation	0	1.00	0.80	0.59	0	0	1	1	9	▇▁▁▁▁
Dropping age at incident be	cause it’s a	lmost always bla	nk. Also	recodin	g precin	ct to	factor

data = data %>% select(-complainant_age_incident)
data$precinct = factor(data$precinct)
data$precinct = fct_explicit_na(data$precinct, 'Missing')

skim(data)

Table 3: Data summary
Name	data
Number of rows	33358
Number of columns	15
_______________________
Column type frequency:
factor	13
numeric	2
________________________
Group variables	None

Variable type: factor

skim_variable	complete_rate	ordered	n_unique	top_counts
command_now	1	FALSE	415	INT: 1412, WAR: 1165, NAR: 692, DB : 683
command_at_incident	1	FALSE	362	Mis: 1544, 075: 1364, 046: 825, 044: 763
year_received	1	FALSE	36	201: 2345, 201: 2312, 201: 2281, 201: 2220
year_closed	1	FALSE	36	201: 3251, 201: 2408, 201: 2326, 201: 2322
rank_abbrev_incident	1	FALSE	18	POM: 19807, SGT: 5701, DT3: 2712, POF: 1398
rank_abbrev_now	1	FALSE	20	POM: 9454, DT3: 6838, SGT: 6067, LT: 2815
mos_ethnicity	1	FALSE	5	Whi: 18074, His: 9150, Bla: 4924, Asi: 1178
mos_gender	1	FALSE	2	M: 31598, F: 1760
complainant_ethnicity	1	FALSE	9	Bla: 17114, His: 6424, Mis: 4464, Whi: 2783
complainant_gender	1	FALSE	7	Mal: 24058, Fem: 5021, Mis: 4195, Not: 57
precinct	1	FALSE	80	75: 2172, 73: 1163, 44: 1139, 46: 1120
contact_reason	1	FALSE	54	PD : 10078, Oth: 4104, PD : 2981, PD : 2542
outcome	1	FALSE	2	0: 25057, 1: 8301

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
mos_age_incident	0	1	32.35	6.04	20	28	31	36	60	▅▇▃▁▁
length_of_investigation	0	1	0.80	0.59	0	0	1	1	9	▇▁▁▁▁

Finally, if any factor has an n less than 10, I went to recode that to ther

recode_to_other = function(.x){
  
  keeps = which(table(.x) > 10) %>% names()
  fct_other(.x, keep = keeps, other_level = 'other')
}

data = data %>% mutate_if(is.factor, recode_to_other)
## Warning: Unknown levels in `f`: other

## Warning: Unknown levels in `f`: other

## Warning: Unknown levels in `f`: other

## Warning: Unknown levels in `f`: other

## Warning: Unknown levels in `f`: other

Before fitting the model, I want to know the distribution of outcomes.

prop.table(table(data$outcome))
## 
##         0         1 
## 0.7511541 0.2488459

75% of complains are unfounded. So, if we fit a model without any predictors, we’ll already be at 75% accuracy. I confirm this below with a logistic regression with no predictors. The predict probability of an sustantiated claim is ~25%.

glm.mod = glm(outcome ~ 1, data = data, family = binomial(link = 'logit'))
predictions = predict(glm.mod, type = 'response')
table(predictions)
## predictions
## 0.248845854068537 
##             33358
data_split <- initial_split(data)
data_train <- training(data_split)
data_test <- testing(data_split)
data_rec <- recipe(outcome~ ., data = data_train) %>%
  step_zv(all_numeric(), -all_outcomes()) %>%
  step_dummy(all_predictors(), -all_numeric()) %>%
  step_normalize(all_numeric(), -all_outcomes())


data_prep <- data_rec %>%
  prep(strings_as_factors = FALSE)

summary(data_prep)
## # A tibble: 805 x 4
##    variable                type    role      source  
##    <chr>                   <chr>   <chr>     <chr>   
##  1 mos_age_incident        numeric predictor original
##  2 length_of_investigation numeric predictor original
##  3 outcome                 nominal outcome   original
##  4 command_now_X001.PCT    numeric predictor derived 
##  5 command_now_X005.DET    numeric predictor derived 
##  6 command_now_X005.PCT    numeric predictor derived 
##  7 command_now_X006.PCT    numeric predictor derived 
##  8 command_now_X007.PCT    numeric predictor derived 
##  9 command_now_X009.DET    numeric predictor derived 
## 10 command_now_X009.PCT    numeric predictor derived 
## # … with 795 more rows
lasso_spec <- logistic_reg(mode = 'classification', penalty = 0.1, mixture = 1) %>%
  set_engine("glmnet")

wf <- workflow() %>%
  add_recipe(data_rec)

lasso_fit <- wf %>%
  add_model(lasso_spec) %>%
  fit(data = data_train)

lasso_fit %>%
  pull_workflow_fit() %>%
  tidy()
## # A tibble: 30,881 x 5
##    term         step estimate lambda dev.ratio
##    <chr>       <dbl>    <dbl>  <dbl>     <dbl>
##  1 (Intercept)     1    -1.11 0.0295  9.99e-14
##  2 (Intercept)     2    -1.11 0.0269  1.68e- 3
##  3 (Intercept)     3    -1.11 0.0245  4.34e- 3
##  4 (Intercept)     4    -1.11 0.0224  6.95e- 3
##  5 (Intercept)     5    -1.11 0.0204  9.13e- 3
##  6 (Intercept)     6    -1.11 0.0186  1.15e- 2
##  7 (Intercept)     7    -1.11 0.0169  1.38e- 2
##  8 (Intercept)     8    -1.12 0.0154  1.60e- 2
##  9 (Intercept)     9    -1.12 0.0140  1.81e- 2
## 10 (Intercept)    10    -1.12 0.0128  2.01e- 2
## # … with 30,871 more rows
set.seed(1234)
data_boot <- bootstraps(data_train)

tune_spec <- logistic_reg(mode = 'classification', penalty = tune(), mixture = 1) %>%
  set_engine("glmnet")

lambda_grid <- grid_regular(penalty(), levels = 50)
doParallel::registerDoParallel()

set.seed(2020)
lasso_grid <- tune_grid(
  wf %>% add_model(tune_spec),
  resamples = data_boot,
  grid = lambda_grid
)

What results did we get?

lasso_grid %>%
  collect_metrics()
## # A tibble: 100 x 7
##     penalty .metric  .estimator  mean     n std_err .config
##       <dbl> <chr>    <chr>      <dbl> <int>   <dbl> <fct>  
##  1 1.00e-10 accuracy binary     0.739    14 0.00104 Model01
##  2 1.00e-10 roc_auc  binary     0.622    14 0.00102 Model01
##  3 1.60e-10 accuracy binary     0.739    14 0.00104 Model02
##  4 1.60e-10 roc_auc  binary     0.622    14 0.00102 Model02
##  5 2.56e-10 accuracy binary     0.739    14 0.00104 Model03
##  6 2.56e-10 roc_auc  binary     0.622    14 0.00102 Model03
##  7 4.09e-10 accuracy binary     0.739    14 0.00104 Model04
##  8 4.09e-10 roc_auc  binary     0.622    14 0.00102 Model04
##  9 6.55e-10 accuracy binary     0.739    14 0.00104 Model05
## 10 6.55e-10 roc_auc  binary     0.622    14 0.00102 Model05
## # … with 90 more rows
lasso_grid %>%
  collect_metrics() %>%
  ggplot(aes(penalty, mean, color = .metric)) +
  geom_errorbar(aes(
    ymin = mean - std_err,
    ymax = mean + std_err
  ),
  alpha = 0.5
  ) +
  geom_line(size = 1.5) +
  facet_wrap(~.metric, scales = "free", nrow = 2) +
  scale_x_log10() +
  theme(legend.position = "none")

highest_auc <- lasso_grid %>%
  select_best("roc_auc", maximize = FALSE)
## Warning: The `maximize` argument is no longer needed. This value was ignored.
final_lasso <- finalize_workflow(
  wf %>% add_model(tune_spec),
  highest_auc
)
library(vip)
## Warning: package 'vip' was built under R version 3.6.2
## 
## Attaching package: 'vip'
## The following object is masked from 'package:utils':
## 
##     vi
lasso_table = final_lasso %>%
  fit(data_train) %>%
  pull_workflow_fit() %>%
  vi(lambda = highest_auc$penalty) %>%
  mutate(
    Importance = abs(Importance),
    Variable = fct_reorder(Variable, Importance)
  )
lasso_table %>% mutate(Importance = round(Importance, 2)) %>% filter(Importance > .1) %>% arrange(desc(Importance)) %>% gt()

Variable	Importance	Sign
year_closed_X2007	0.26	NEG
year_closed_X2008	0.23	NEG
year_closed_X2011	0.22	NEG
complainant_gender_Missing	0.20	POS
contact_reason_Execution.of.search.warrant	0.19	NEG
year_closed_X2009	0.18	NEG
year_closed_X2015	0.16	POS
complainant_gender_Male	0.14	POS
command_at_incident_ND.SEQI	0.13	POS
year_closed_X2004	0.13	POS
year_received_X2006	0.12	POS
year_received_X2003	0.12	NEG
year_closed_X2012	0.12	NEG
year_closed_X1998	0.11	POS
year_closed_X2013	0.11	NEG

Probably the most interesting finding here is that the excecution of a search warrant predicts a lower likelihood fo having a substantiated claim, which makes sense.

Next, let’s look at how the model actually performed.

last_fit(
  final_lasso,
  data_split
) %>%
  collect_metrics()
## # A tibble: 2 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.748
## 2 roc_auc  binary         0.643

Unfortunately, the model did not perform well at all. We would just as accurate if we had went without any predictors at all.

Conclusion

This looks like a situation where a predictive model doesn’t tell us much of anything, which makes sense. I don’t really know much of anything with regards to substantiated claims of police misconduct, and I am probably missing lots of important variables that can describe the process. A predictive model that is applied to data...

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

Exploring Police Misconduct Data

Introduction

Read in Data

Explore Data

Complainant Ethnicity

Cross Tab Complaintant Ethnicity and Officer Ethnicity

Top Level Category of Complaint

Contact Reason

Disposition

Conclusion

Introduction

Read in Data

Explore Data

Complainant Ethnicity

Cross Tab Complaintant Ethnicity and Officer Ethnicity

Top Level Category of Complaint

Contact Reason

Disposition

Conclusion

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)