Comparing Quality of Life and Demographics of Major Cities

[This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers.]

Introduction:

As an avid traveler, I have always been interested in discovering what makes a city unique. Tools informing travelers of unique landmarks and activities in the places to which they venture have been ubiquitous for ages. While I appreciate the different elements that make a city unique, I also have grown to understand that there are certain types of activities I would like to engage in when I visit a new place.

For my project, I set out to visualize demographic and quality of life data for two cities and compare them at the neighborhood level. If you have ever traveled to a new city looking for an area similar to one you knew well, you may have had some of the questions that drove this project. Say you love curry, and in your hometown of New York, Flushing has all of your favorite curry houses. Which neighborhoods in Los Angeles have similar restaurants that are also highly rated by their patrons? Or say I was moving from Sao Paulo to Toronto while still enrolled in English courses. I might look for a neighborhood with a large native Portuguese-speaking population to help me settle into my new environment. How does that neighborhood compare to my favorites in Sao Paulo in terms of available green space and public transportation quality? My goal was to build a tool that could quickly answer these questions for a user by finding congruence between any two data points at the neighborhood level. I chose to test this functionality using data from New York and Los Angeles: a city that I know very well and one where I can't find the airport without a map, respectively.

Los Angeles, California, USA

 

New York, New York, USA

 

Methodology:

My application sources US Census data for demographic and quality of life information. I augmented this data set with data from Walkscore, a website that rates the quality of a neighborhood's transportation options. The rating is derived from a weighted analysis of the requested area's features and is not a relative index: one neighborhood's Walkscore does not depend on another's, although two scores may share inputs when services in one area are also reachable from the other.
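
Retrieving a score programmatically can be sketched with httr (one of the packages used in this project). The endpoint and parameters follow Walk Score's public API; the address, coordinates, and key below are placeholders:

```r
library(httr)

# Sketch of a Walkscore lookup; replace the placeholder key with a real
# API key from walkscore.com before running
res <- GET("https://api.walkscore.com/score",
           query = list(format   = "json",
                        address  = "350 5th Ave New York NY",
                        lat      = 40.748,
                        lon      = -73.985,
                        wsapikey = "YOUR_API_KEY"))

scores <- content(res, as = "parsed")
scores$walkscore   # numeric 0-100 rating for the requested location
```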

I acquired my map polygons from public GeoJSON files, which posed a bit of a challenge in data manipulation. As I reached the conclusion of my project, I found a more effective strategy for manipulating JSON data in R; I discuss below how I would incorporate it if the project were to continue.

The specific categories in my questions would be tough to answer without web scraping skills, which I will acquire in a later module of the bootcamp. To work around this gap, I chose a smaller set of data and ran the project as a proof of concept. My data fields for this project are:

  • Neighborhood Name
  • City
  • Neighborhood Population
  • Median Household Income
  • Average Household Size
  • Violent Crime Rate (per 1,000)
  • Property Crime Rate (per 1,000)
  • Median Educational Attainment
  • Median Age
  • US Census Racial Categories
    • White
    • Black
    • Asian
    • Hispanic (non-race, ethnic)
  • Percent Foreign Born
  • Data from WalkScore.com
    • WalkScore
    • TransitScore
    • BikeScore
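
As a sketch, one row of the resulting data set might look like the following. Every value here is made up for illustration and is not a real measurement:

```r
# One illustrative row of the project's data set; all values below are
# hypothetical, not actual figures for Astoria
neighborhoods <- data.frame(
  neighborhood      = "Astoria",
  city              = "New York",
  population        = 95000,
  median_income     = 62000,
  household_size    = 2.4,
  violent_crime     = 4.1,    # per 1,000
  property_crime    = 11.3,   # per 1,000
  median_attainment = "Bachelor's degree",
  median_age        = 34,
  pct_white         = 0.45,
  pct_black         = 0.05,
  pct_asian         = 0.18,
  pct_hispanic      = 0.28,
  pct_foreign_born  = 0.46,
  walkscore         = 95,
  transitscore      = 90,
  bikescore         = 70
)
ncol(neighborhoods)   # 17 fields
```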

The data for Los Angeles was simple to acquire, as the Los Angeles Times began mapping Los Angeles at neighborhood granularity in 2009. New York posed a greater, more time-consuming challenge: the census collects data at the census tract level, which is independent of neighborhood boundaries and generally contains a portion of one neighborhood or portions of several. As an alternative, I found a data set prepared by the Furman Center, which aggregates census data into the combined neighborhoods defined by the City of New York, matching the granularity of the LA Times data. At this stage, I decided I would be better served testing functionality on demographic information and WalkScore's ratings, which use data types similar to my original questions. I won't find out about restaurants in this proof of concept, but perhaps I can find out which neighborhoods in each city have young immigrant populations, then match them.

To find a representative sample of neighborhoods, I chose 20 per city: 5 that I anecdotally knew to be affluent, 5 that I anecdotally knew to be underserved, and 10 entirely at random. The neighborhoods in this demo are:

New York:
  • Flushing/Whitestone
  • Central Harlem
  • Upper East Side
  • Upper West Side
  • Greenpoint/Williamsburg
  • Fort Greene/Brooklyn Heights
  • Coney Island
  • St. George/Stapleton
  • Elmhurst/Corona
  • Morrisania/Crotona
  • Hillcrest/Fresh Meadows
  • Astoria
  • Brownsville
  • Crown Heights/Prospect Heights
  • Clinton/Chelsea
  • Lower East Side/Chinatown
  • Greenwich Village/SoHo
  • Washington Heights/Inwood
  • Riverdale/Fieldston
  • Bushwick

Los Angeles:
  • Hollywood
  • Studio City
  • Beverly Hills
  • Brentwood
  • Culver City
  • Baldwin Hills/Crenshaw
  • Silver Lake
  • Echo Park
  • Van Nuys
  • Santa Monica
  • Venice
  • Chatsworth
  • Florence
  • Vermont-Slauson
  • Watts
  • East Los Angeles
  • Historic South Central
  • Downtown
  • Eagle Rock
  • Koreatown

The next steps after acquiring the data are to manipulate the information to provide answers to my new burning questions!

R Shiny Application:

I created a dashboard application in R Shiny to visualize neighborhood comparisons and to allow for user input. The dashboard consists of three sections:

  • Map: a map built from GeoJSON boundary files using the leaflet for R package. I created a choropleth map that groups neighborhood rankings by percentile into five bins (0-20%, 21-40%, 41-60%, 61-80%, 81-100%), with each bin's color defined in the map legend. The interface lets a user pick a statistic and see each neighborhood's ranking within its city. Hovering over a neighborhood polygon pops up its name and the selected statistic.
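
The percentile grouping behind the choropleth can be sketched in base R; leaflet's colorFactor (or colorBin) can then map the resulting groups to the legend colors. The ranks below are hypothetical:

```r
# Hypothetical percent ranks for seven neighborhoods; in the app these
# come from ranking a chosen statistic within one city
ranks <- c(0.05, 0.18, 0.37, 0.52, 0.64, 0.81, 0.99)

# Bin the ranks into the five legend groups used on the map
groups <- cut(ranks,
              breaks = c(0, 0.2, 0.4, 0.6, 0.8, 1),
              labels = c("0-20%", "21-40%", "41-60%", "61-80%", "81-100%"),
              include.lowest = TRUE)
table(groups)   # count of neighborhoods per legend group
```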

  • Matchmaker: the key interface in my planning for answering the problems stated earlier. I wrote a function that finds the three nearest comparisons in percentile by taking the minimum absolute differences between a neighborhood's rank and the corresponding ranks in another city. It also uses a Shiny observeEvent to ensure that the neighborhoods searched match the city specified in the prompt.
    • The function takes a target percentile value "z" and subtracts from it the corresponding city's vector of neighborhood percentile rankings "a", then keeps the three closest percentile points as "matches." For example, a user who wants to know the neighborhoods most similar to Brentwood, Los Angeles could search against either New York or the rest of Los Angeles, as in the example below:

matchmaker = function(a, z) {
  # Absolute distance between the target percentile rank z and each
  # candidate neighborhood's rank in a
  b = abs(z - a)
  # The indices of the three smallest distances are the three matches
  mymatch = order(b)[1:3]
  mymatch
}
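
A quick check of the matching logic with hypothetical percentile ranks, using a compact version of the function so the snippet runs on its own:

```r
# Three nearest neighborhoods by percentile rank (restated here so the
# snippet is self-contained)
matchmaker <- function(a, z) {
  order(abs(z - a))[1:3]
}

# Hypothetical percentile ranks for five candidate neighborhoods
la_ranks <- c(10, 35, 60, 80, 95)

# Which three candidates sit closest to a target rank of 62?
matchmaker(la_ranks, 62)   # indices 3, 4, 2
```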

  • Insights: I had additional questions upon seeing the data which I address below.

Insights:

As a child, I was always fond of rebuffing norms and typical structure. My mother always insisted that if I wanted to achieve my (vain) childhood goal of making a lot of money, I would need to stay in school. I preferred curating my own learning experiences. Over time, I relented and fulfilled my mother’s wishes of scholastic pursuit. I realized that across the various neighborhoods in this data set, I could make a simple visualization showing the median household income for neighborhoods with particular levels of degrees attained by the typical person. As expected, my mother’s hypothesis was backed up by the data, as income level tended to rise across neighborhoods with higher educational attainment.
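
That visualization boils down to a simple aggregation. A sketch with made-up numbers:

```r
# Hypothetical neighborhood-level data, for illustration only
df <- data.frame(
  attainment = c("High school", "High school", "Bachelor's",
                 "Bachelor's", "Graduate"),
  income     = c(42000, 48000, 76000, 81000, 98000)
)

# Median household income across neighborhoods, grouped by the typical
# level of educational attainment
tapply(df$income, df$attainment, median)
```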

I noticed that Los Angeles neighborhoods consistently had higher levels of property and violent crime. Digging deeper, I discovered that this comparison was imperfect for per capita comparisons between cities: the LA Times reported misdemeanor and felony crimes, while the Furman Center only reported felonies in New York. However, I felt the comparison would still be valid at the percentile level (relative to other neighborhoods in the same city), so I kept these statistics in the data set. When I plotted the data, I saw a positive relationship between property and violent crime rates, but what shocked me was an exaggerated outlier. That point represented the crime rate in Downtown Los Angeles. With no familiarity with Downtown Los Angeles, I wondered: why does this single part of the city appear to have a crime rate that would essentially guarantee any resident or visitor becomes a crime victim? Was Downtown LA really that much more dangerous than the other neighborhoods in my data set? Maybe I was biased in my neighborhood choices and neglected the more dangerous areas. I then remembered that Midtown Manhattan shows a similar phenomenon, where the reported crime rate is much higher than expected because of its daytime population. Downtown Los Angeles has a residential population of 34,811, but the Los Angeles Times reported that the daytime population of the area exceeds 280,000!

We were measuring the crime frequency of a neighborhood whose daytime population rivals a mid-sized American city against its small residential base, as if the crimes committed in that area were perpetrated only against its own residents. This accounted for the bizarre numbers. Using the daytime population, more accurate figures for the neighborhood would be 7.82 violent crimes and 20.1 property crimes per 1,000 people, which is much more indicative of a typical, even somewhat safe, Los Angeles neighborhood.
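
Working backward from those figures shows how large the distortion was; the 280,000 daytime figure is the LA Times estimate quoted above:

```r
resident_pop <- 34811    # Downtown LA residential population
daytime_pop  <- 280000   # LA Times daytime estimate ("exceeded 280,000")

# Dividing crimes by residents alone overstates the rate by this factor
inflation <- daytime_pop / resident_pop
round(inflation, 1)                 # roughly 8x

# The corrected violent crime rate of 7.82 per 1,000 would therefore have
# appeared as roughly this many per 1,000 residents
round(7.82 * inflation, 1)
```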

Having established that my crime data was flawed and would need to be revisited, I moved on to median age. The youngest neighborhood was Watts, Los Angeles (median age of 21) and the oldest was the Upper East Side in New York (median age of 47). Typically, the older neighborhoods experienced lower crime rates. Across the selected neighborhoods in both cities, the median age was 35, slightly below the US median of 38. With regard to this data set, New York and Los Angeles appear to be slightly younger than typical American cities.

Future Use Cases and Functionality:

While I did not answer my initial question and wanted to build much more, I am happy with how my first R project came together. In 10 days, I was able to read my information from a SQLite database into a dynamic user interface and write a function to answer similar questions. This shows me that my intended use case would be possible with a bit more development time and data wrangling.

I could see the core audience of an application such as this one coming from a wide array of perspectives. This can be useful to businesses looking for the best neighborhoods to open a location in an area heavily populated with their core customer base. Vacationers could use this to decide which neighborhood would be a great place to look for a hotel or bed and breakfast. A family moving from one city to another could use this to find the best school district and most green space. Because of this potential, I would endeavor to add more fields in the future such as:

  • Traffic data
    • Delays
    • Collisions
  • Cultural data
    • Languages spoken
    • Restaurants
    • Places of Worship
  • Home value
  • Rent value
  • 311 information

I found the JSON files difficult to manipulate in bulk with my additional data fields, so I resorted to importing the data manually, which still allowed me to demonstrate the functionality across different variables. Toward the end of the project, a colleague showed me the merge command in rgdal, which allows direct manipulation of the JSON file and would make it much easier to add more features, neighborhoods, and cities to the data set. I would also like to add Atlanta, Toronto, and Montreal to the project and plan to continue investigating that after the bootcamp.
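
The workflow my colleague showed me looks roughly like this. File names and the join column are placeholders, and the merge method for spatial objects is provided by the sp package that rgdal loads its data into:

```r
library(rgdal)   # readOGR() reads GeoJSON into a SpatialPolygonsDataFrame
library(sp)      # provides merge() for Spatial*DataFrame objects

# Placeholder file names; neighborhood polygons plus a table of statistics
hoods <- readOGR("neighborhoods.geojson")
stats <- read.csv("neighborhood_stats.csv")

# Attach all the tabular statistics to the polygons in one step, instead
# of importing each field manually
hoods <- sp::merge(hoods, stats, by = "neighborhood")
```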

I would also like to offer options to compare on both the percentile level, relative to the rest of the city, and also on a pure per capita or raw count basis. Optimally, I could expand the functionality to create a score between two chosen neighborhoods that quantifies the match.
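
A first cut at such a score, entirely hypothetical, could average the closeness of two neighborhoods' percentile ranks across the selected statistics on a 0-100 scale:

```r
# Hypothetical match score: 100 minus the mean absolute gap between two
# neighborhoods' percentile ranks across the chosen statistics
match_score <- function(ranks_a, ranks_b) {
  100 - mean(abs(ranks_a - ranks_b))
}

# e.g. ranks for income, walkability, and median age in two neighborhoods
match_score(c(80, 55, 90), c(70, 60, 85))   # 100 - mean(10, 5, 5), about 93.3
```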

Tools used:

R Packages:

  • shinyjs
  • shinydashboard
  • DT
  • data.table
  • googleVis
  • dplyr
  • leaflet
  • sp
  • ggmap
  • maptools
  • broom
  • httr
  • rgdal
  • V8
  • geojsonio
  • RColorBrewer

 

Thank you for reading, and feel free to access my project from my GitHub repository. I have included all of the files needed for anyone to reproduce the application. Take it for a whirl by testing out the deployed prototype here!
