(This article was first published on **My R Nightmares**, and kindly contributed to R-bloggers)

While writing a paper on sparse principal component analysis (SPCA), I came across an old dataset containing 1990s socio-economic data and the rate of violent crime for 1,994 communities in the US. I am not a sociologist, so my analysis may be superficial, but I found the results interesting with respect to Mr Trump’s political views. It turns out that the traditional approach of considering only the largest loadings of the principal components (PCs) seems to support the view that immigrants are a major cause of violent crime. Applying SPCA instead gives an entirely different view of the problem, identifying mainly socio-economic characteristics, rather than being an immigrant or speaking poor English, as drivers of crime. Naturally, the two things are correlated, but the causal inference may be different.

The dataset is called Communities and Crime and can be downloaded from the UCI Machine Learning Repository. I deleted 26 variables with missing values, ending up with 99 explanatory variables, and ran principal component analysis (PCA) on their correlation matrix. The first and second PCs explain 25.3% and 17% of the variance of the data, respectively. Their ordered contributions (loadings scaled to unit sum of the absolute values) are plotted below.
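
The scaling of loadings into contributions can be illustrated with a minimal Python sketch. The function name `contributions` and the example loadings are hypothetical and are not taken from the analysis; the only thing assumed from the text is the definition (each loading divided by the sum of the absolute loadings, so that signs are kept).

```python
def contributions(loadings):
    """Scale a loading vector so that its absolute values sum to one."""
    total = sum(abs(a) for a in loadings)
    if total == 0:
        raise ValueError("loadings must not all be zero")
    # Dividing by a positive total keeps the sign of every loading.
    return [a / total for a in loadings]


# Hypothetical loadings for a component with four variables.
example_loadings = [-3.0, -2.0, 1.0, 4.0]
for value in contributions(example_loadings):
    print(f"{value:+.1%}")
```

Expressed this way, contributions from components with different numbers of non-zero loadings can be read on the same percentage scale.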

I then computed sparse components with SPCA, requiring that each sparse component explain at least 95% of the variance explained by the corresponding PC. Both sets of loadings are shown in Table 1 below. Rj-sq is the R-squared obtained by regressing a variable on all the others in the set (the quantity used to compute the variance inflation factor in regression); the closer the value is to one, the more the variable is (multiply) correlated with the others.
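
As a rough illustration of how Rj-sq relates to the variance inflation factor (VIF), the sketch below uses the standard identities Rj-sq = 1 - 1/(R⁻¹)jj, where R is the correlation matrix of the set, and VIF = 1/(1 - Rj-sq). The helper names and the small correlation matrix are hypothetical; this is not the code used to produce the tables.

```python
def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment each row with the matching row of the identity matrix.
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def rj_squared(corr):
    """R-squared of each variable regressed on all the others in the set."""
    inverse = invert(corr)
    return [1.0 - 1.0 / inverse[j][j] for j in range(len(corr))]


def vif(r2):
    """Variance inflation factor for a given Rj-sq."""
    return 1.0 / (1.0 - r2)


# Hypothetical correlation matrix: the first two variables are strongly related.
corr = [
    [1.0, 0.9, 0.2],
    [0.9, 1.0, 0.3],
    [0.2, 0.3, 1.0],
]
for r2 in rj_squared(corr):
    print(f"Rj-sq = {r2:.2f}, VIF = {vif(r2):.1f}")
```

For scale, an Rj-sq of 0.98 corresponds to a VIF of about 50, so a variable with that value is almost entirely predictable from the others in the set.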

**TABLE 1**

Contribution | Variable | Rj-sq |
---|---|---|
PCA | | |
-2.2% | median family income (differs from household income) | 0.98 |
-2.2% | median household income | 0.98 |
-2.1% | percentage of kids in family housing with two parents | 0.98 |
-2.1% | percentage of households with investment / rent income in 1989 | 0.83 |
2.1% | percentage of people under the poverty level | 0.84 |
-2.1% | percentage of families (with kids) that are headed by two parents | 0.98 |
-2.0% | percent of kids 4 and under in two parent households | 0.90 |
-2.0% | per capita income | 0.92 |
2.0% | percentage of households with public assistance income in 1989 | 0.75 |
2.0% | percent of occupied housing units without phone (in 1990, this was rare!) | 0.76 |
SPCA | | |
-51% | median family income (differs from household income) | 0.51 |
-37% | percentage of kids in family housing with two parents | 0.52 |
12% | percent of family households that are large (6 or more) | 0.09 |

**TABLE 2**

Contribution | Variable | Rj-sq |
---|---|---|
PCA | | |
2.7% | percent of population who have immigrated within the last 10 years | 1.00 |
2.7% | percent of population who have immigrated within the last 8 years | 1.00 |
2.6% | percent of population who have immigrated within the last 5 years | 1.00 |
2.6% | percent of population who have immigrated within the last 3 years | 0.98 |
2.6% | percent of people foreign born | 0.96 |
-2.3% | percent of people who speak only English | 0.95 |
2.3% | percent of people who do not speak English well | 0.94 |
2.1% | percent of persons in dense housing (more than 1 person per room) | 0.86 |
2.0% | percentage of population that is of asian heritage | 0.63 |
2.0% | percentage of population that is of hispanic heritage | 0.90 |
SPCA | | |
42% | percent of population who have immigrated within the last 10 years | 0.11 |
-15% | percentage of population that is 65 and over in age | 0.06 |
15% | owner occupied housing – upper quartile value | 0.55 |
14% | percent of family households that are large (6 or more) | 0.47 |
13% | number of people living in areas classified as urban | 0.32 |

**TABLE 3**

Measure | Comp1 | Comp2 | Comp3 | Comp4 | Comp5 |
---|---|---|---|---|---|
Cumulative variance explained | 24.4% | 40.8% | 49.8% | 57.2% | 62.7% |
Relative cumulative variance explained (sPC/PC) | 96.5% | 96.5% | 96.5% | 96.6% | 96.6% |
Cardinality | 3 | 5 | 7 | 9 | 8 |
Correlation with PC | 0.97 | 0.98 | 0.942 | 0.833 | 0.92 |
R-sq log(crime rate) on PCs | 44.0% | 49.0% | 50.9% | 51.3% | 53.3% |
R-sq log(crime rate) on sparse comp. | 45.7% | 51.9% | 56.4% | 57.2% | 57.2% |
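
The first two rows of Table 3 can be cross-checked against the PCA figures quoted earlier. The sketch below assumes that the first row reports the cumulative variance explained by the sparse components and that the second row is that value divided by the cumulative variance explained by the same number of PCs; under that reading, dividing one row by the other recovers the implied PC values. The variable names are illustrative only.

```python
# Values copied from the first two rows of Table 3 (percentages).
sparse_cum_vexp = [24.4, 40.8, 49.8, 57.2, 62.7]
relative_cum_vexp = [96.5, 96.5, 96.5, 96.6, 96.6]

# Assumption: relative = 100 * sparse / PC, hence PC = 100 * sparse / relative.
implied_pc_cum_vexp = [
    100.0 * s / r for s, r in zip(sparse_cum_vexp, relative_cum_vexp)
]

for k, value in enumerate(implied_pc_cum_vexp, start=1):
    print(f"Comp{k}: implied cumulative PC variance = {value:.1f}%")
```

The first two implied values come out close to 25.3% and 42.3% (that is, 25.3% + 17%), which agrees with the variance quoted for the first two PCs, up to rounding in the table.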

*A similar algorithm can be found in Merola, G., 2015. Least squares sparse principal component analysis: a backward elimination approach to attain large loadings. Australian & New Zealand Journal of Statistics 57, 391–429. My R package for SPCA is available on GitHub.*
