Where are the best places to live? How do you answer this question?
If you turn to Google, you will find many “top 10” lists, all generated by someone who does not know your personal needs. If you have retired, you might not care much about salary; if you do not have kids, education cost might mean nothing to you; if you have lung issues, air quality might be more important than anything else.
How about choosing among a variety of available input data, assigning your priority, and ranking the candidates based on your specific needs? Sounds good?
Here is a prototype of the described solution, using R and Shiny. I invite you to play with this interactive map generating app, and make your own judgement.
An interactive spatial data digestion framework has been implemented in R with Shiny to help answer one type of question: “where is the best place to …?” The web app ranks US counties based on user input and visualizes the results on a Leaflet map, along with other quantitative plots that support the “where’s best” decision-making process. Assuming a matrix-like data structure, the computational core of the framework combines a row-and-column filtering system with a weighted-average score generator. The delivered prototype is hosted at shinyapps.io, with its source code shared on GitHub. For people interested in the thought process behind the scenes, here is a short version of this blog, which is the project wrap-up presentation given at the end of the 2-week period. This blog post completes the planned documentation package, with special effort devoted to the motivation, background and discussion sections. If you like the app and want to read more, I will return the favor by offering the logic behind the logic.
Motivation & Background
When I was a little boy… Wait, wrong channel.
We are in a 12-week data science bootcamp. As the first project, we were asked to brainstorm for a suitable project based on what we had learned about R and Shiny, and go through the whole project cycle in 2 weeks. My vision for this project came from these considerations:
1) The strengths of R and Shiny
R is an open source programming language and environment known for its statistical analysis capability, vectorized programming style and abundant plotting packages. Shiny is a fast-growing extension to the R family that provides a web app development and hosting environment without requiring HTML or JavaScript knowledge. While browsing the R Shiny gallery, I was immediately impressed by the featured map generation functionality utilizing Leaflet or googleVis.
2) Personal appreciation of the power of maps
I come from a geoscience background. Geoscientists love making maps, and deeply appreciate the power of maps. Decisions are made on maps, from everyday-life decisions (with a GPS in your hand) to billion dollar decisions (think about offshore drilling).
3) Interest in decision making process
Since we talked about decision making, I’d like to share my comment on this process. When facing a large amount of input data and a variety of choices, it is initially easy to reduce the size of the decision process by filtering down the data (shortlist generation). This filtering step is often well described by clear quantitative rules (keyword count, acceptable range of measurements, and so on), but the next step is much more labor intensive and less clear. When a committee needs to pick the top 1 from the 3 short-listed candidates, it might take much longer than shortening the list from 300 to 3, with consistency and recordability as additional challenges.
A few slides were developed to demonstrate one common decision making workflow, which could be quantified and automated. The specific version shown is implemented in my project: it starts with a general column-and-row filtering system during the “300 candidates to 3” stage, and ends with a weighted-average score generation and ranking step during the “short-listed to winner” stage. The weighted-average linear system was chosen partly for its simplicity, but also because it scales: the weighted-average routine naturally handles data expansion as I keep adding new columns to the data set. With advantages come limitations, and a paragraph in the discussion section covers a few high level limitations of this choice.
4) Trying to solve a type of problem, record logical thinking, and develop reusable code
I always hope that, when possible, my work has reuse value. Starting by thinking about a general type of problem is a big first step, and I like the path so far. Now, as this is my first R + Shiny project, once the vision is set, the additional reusability measures are the design and documentation of the code itself. I know my future self will reuse the structure of this code on other things, so please feel free to share your thoughts, especially on GitHub.
5) Last but certainly not least — I would like to work on projects with business potential
This point would be illustrated further near the end.
Project Scope and Deliverables
After the brainstorming and design stage, the project moves into the detailed scoping and execution stage.
For any project, an on-time delivery is usually the top priority, yet passionate developers always want to do more and keep adding sophisticated functionalities. A general good practice is to have checkpoints and fast-track deliverables planned at the beginning, then at any given point past the proof-of-concept fast-track effort, make sure to always have a deliverable working version (and yes, use version control like Git).
At the halfway point, I created the proof-of-concept version. This happens to be a good place to talk about 1 key data preparation step for this project.
The Normalization and “Good” Direction Definition
The 3 data sets are carefully chosen for the minimum deliverable version, as they demonstrate a few key challenges for the weighted-average scoring system:
0) All numerical data?
To begin with, all data sets here are numerical. Another level of complexity strikes when trying to include categorical data in this scoring and ranking framework. We will save this till the discussion section.
1) The normalization step
Let’s look at the 1st and 3rd choices in the picture, which are Air Quality and Income. The air quality data set measures particle concentration in the air, with its numbers sitting between 4.44 and 15.96, while the income data has numbers ranging from 22894 to 125900. (The units are not relevant here, as we are blending unrelated quantities like air quality and income together; the result is a number that lacks physical meaning, and its sole purpose is ranking.) The issue is that, without normalization, the air quality variation is barely reflected in the final score: its impact is nearly four orders of magnitude weaker than that of income. This is why we need to normalize each quantity to the same range (in this case I chose [0,100]) before pushing them through the weighted-average step.
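The rescaling described above can be sketched as a plain min-max normalization. This is a minimal illustration in Python, not the app's R source: the function name `normalize_0_100` and the middle sample values are hypothetical, and only the range endpoints come from the text.

```python
def normalize_0_100(values):
    """Min-max rescale a list of numbers onto [0, 100]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no ranking information;
        # mapping it to the midpoint is an assumption of this sketch.
        return [50.0 for _ in values]
    return [100.0 * (v - lo) / (hi - lo) for v in values]


# Hypothetical sample columns; only the endpoints are taken from the post.
air_quality = [4.44, 10.20, 15.96]
income = [22894, 74397, 125900]

print(normalize_0_100(air_quality))
print(normalize_0_100(income))
```

After this step both columns span the same range, so a given weight has a comparable effect on either of them.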
2) The “good” direction
Again looking at air quality and income: after both are normalized to [0,100], another issue remains, which is that their “good” directions are opposite. For air quality, a smaller number is good, while for income a bigger number is good (people are different; I take the general case). We can simply “flip” the numbers behind the scenes for the air quality data without bothering the user. Problem solved? Not completely. The political data set was chosen to demonstrate another layer of complexity. The political data is the 2016 election voting difference for each county, and its “good” direction is completely subjective. In the final version, population density is another column that needs user input to define the positive direction, since some prefer to live close to the crowd, while others prefer more space.
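To make the “flip” and the weighted average concrete, here is a small Python sketch under stated assumptions. The helper names (`oriented`, `weighted_score`), the column names and the sample numbers are all hypothetical, and the app's own R code may organize this differently.

```python
def oriented(value_0_100, higher_is_better):
    """Flip a normalized value when needed so that 100 always means 'good'."""
    return value_0_100 if higher_is_better else 100.0 - value_0_100


def weighted_score(normalized_row, weights, higher_is_better):
    """Weighted average of one county's normalized values.

    All three arguments are dicts keyed by column name.
    """
    total_weight = sum(weights.values())
    if total_weight == 0:
        raise ValueError("at least one weight must be positive")
    weighted_sum = sum(
        weights[col] * oriented(normalized_row[col], higher_is_better[col])
        for col in weights
    )
    return weighted_sum / total_weight


# Hypothetical county, already normalized to [0, 100].
county = {"air_quality": 20.0, "income": 70.0, "pop_density": 90.0}
weights = {"air_quality": 2, "income": 1, "pop_density": 1}
# Fixed behind the scenes for air quality and income;
# chosen by the user for subjective columns such as population density.
direction = {"air_quality": False, "income": True, "pop_density": False}

print(weighted_score(county, weights, direction))
```

Keeping the direction as a separate per-column flag means the subjective columns can expose it to the user while the obvious ones stay hidden.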
After sorting out these key issues, I began to enrich the data by adding more variables that people might care about when making a “where to live” decision. The data sources are listed at the end. Without time constraints, this effort can go on forever.
Most of the effort past the midpoint was spent on automatically visualizing analytical insights in a user-friendly manner, building UI features such as linking the filtering sliders to the tab 2 data table, and special calculations based on the user’s mouse click input. In the end, the delivered product looks like this: it is the hosted web app that you have hopefully just played with.
Key UI Features and Design Philosophy
1) The “distance to” calculation
While all other data sets are preloaded into memory and are in that sense ‘static’, this “distance to” calculation is ‘dynamic’: it reads the (lat, long) of the user’s mouse click on the map and computes the spherical distance from each county to that point. This incorporates the “I want to live close to / away from” decision component into the weighted-average scoring scheme. The resulting distances have to be normalized too.
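One common way to compute such a spherical distance is the haversine formula. The Python sketch below assumes that formula, a mean Earth radius of 6371 km and made-up county coordinates; the post does not say which formula or R package the app actually uses.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed constant


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, long) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against rounding just above the asin domain.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


# Hypothetical mouse click and county centroids, as (lat, long) in degrees.
click = (40.7128, -74.0060)
counties = {"County A": (34.0522, -118.2437), "County B": (41.8781, -87.6298)}

distances = {
    name: haversine_km(click[0], click[1], lat, lon)
    for name, (lat, lon) in counties.items()
}
print(distances)
```

The resulting distance column can then be normalized and given a “close to” or “away from” direction like any other column.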
2) the pie chart
As I offer more and more optional data sets to be included in the calculation, it is nice to have a visual reminder of data considered and their assigned weights.
3) The radar chart
The radar chart (or spider web plot) is designed to help in understanding the strengths and weaknesses of the top ranked locations. I decided to plot many measurements for the chosen location, even if the user decides not to include certain input data by leaving some boxes in the control panel unchecked.
I prefer to look at the full set of strengths and weaknesses near the end, when a small number of final candidates have survived the filtering and digestion funnel and we are very close to a final conclusion (winner), in case certain neglected aspects are so extreme that they should trigger a second thought. In other words, this is a “what we’ve missed before I sign the contract” visualization.
The 25%, 50% and 75% quantiles are also plotted, so there is a visual reminder of what the benchmark crowd might look like. When trying to choose the top 5-10 from 3000+ counties, we are really often looking for outliers. I think it is helpful to have some information reminding the user what an “average America” looks like.
The chartJSRadar is flexible enough that one can click on the name tags of plotted polygons to toggle them on and off. This is a very nice design by the author of the package, as radar charts are very hard to read when too many polygons are plotted.
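As an illustration of the benchmark quantiles, the following Python sketch computes the three quartile cut points per column with the standard library. The function name `benchmark_quartiles`, the choice of the "inclusive" method and the sample values are assumptions of this sketch, not details taken from the app.

```python
import statistics


def benchmark_quartiles(columns):
    """Return {column: (q25, q50, q75)} for a dict of numeric columns."""
    return {
        name: tuple(statistics.quantiles(values, n=4, method="inclusive"))
        for name, values in columns.items()
    }


# Hypothetical normalized values for five counties.
columns = {
    "air_quality": [0.0, 25.0, 50.0, 75.0, 100.0],
    "income": [10.0, 20.0, 40.0, 80.0, 100.0],
}

quartiles = benchmark_quartiles(columns)
print(quartiles)
```

Each of the three values per column would become one vertex of a benchmark polygon drawn next to the selected county.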
4) A data table tab linked to the interactive map, with “highlight-to-drop-marker” function
The data table on tab 2 is the filtered result linked to all the sliders on tab 1. The newly calculated score after each “Update Plots” button click is added to the left side of the filtered static data table, so the user can easily sort by the fresh score and drop markers on the map by clicking on rows of data. It is worth noting that interactive features like this are being developed by RStudio in the “crosstalk” package. Eventually these features are going to be much easier to build than with my solution in the source code.
I might come back and add a few things demonstrating how to use this app, and show a few interesting insights when you ask the question in a certain way (every time you click the update plots button, you are asking a question. I learned a lot about interesting counties I have never heard of before this project).
Many of us might ask this question, especially after playing with the filtering sliders for a while — how are some of the input data correlated to each other? The tab 3 “Correlation analysis” is dedicated to answering this question.
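For readers who want to see what sits behind such a plot, here is a minimal Python sketch of a pairwise Pearson correlation matrix. The function names and the three tiny sample columns are hypothetical; the app itself builds on the ‘shinyCorrplot’ example, whose internals are not reproduced here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Nested dict {a: {b: r_ab}} over all pairs of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical values for four counties.
columns = {
    "house_value": [100.0, 200.0, 300.0, 400.0],
    "living_cost": [30.0, 45.0, 70.0, 85.0],
    "unemployment": [9.0, 7.0, 4.0, 3.0],
}

for name, row in correlation_matrix(columns).items():
    print(name, {other: round(r, 2) for other, r in row.items()})
```

The off-diagonal entries are the ones worth reading; the diagonal is 1 by construction.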
A few comments on the off-diagonal strong correlations:
- Median house values for 2015-2017 correlate with each other very well, since housing value in general changes slowly.
- Political voting stats correlate well with each other, because in the US, most people do not vote for the 3rd party candidates.
- The income to cost ratio has a strong negative correlation with unemployment rate, while income has a much larger variation compared to cost. This seems to suggest we should go chase high salary, without worrying too much about the living cost, in a statistical sense.
- There is a strong correlation between living cost / housing value and political voting results. Housing value and living cost are naturally correlated to each other, since house value is an input used to calculate living cost. In the end it reveals that high living cost regions tend to vote for DEM, while low cost regions for GOP.
- However, there is no strong correlation between income and politics! The 2016 vote is much more cost sensitive than income sensitive.
- Population density is correlated to housing value, therefore it plugs into the same observation involving politics.
- Population density has an intuitive correlation with air quality, so statistically speaking, if you choose the large cities, forget about clean air (but everything is relative).
- Longitude has a few interesting correlations with other parameters. This is due to American geography: think about where the mountains are.
Most of these observations are intuitive. The question now is (if you accept the statement that, when looking for the best places to live, we are really looking for outliers): does statistics matter here? If you find that golden place that scores high on everything, do you care about whether it makes statistical sense?
There is a discussion section recording my random thoughts during this project. I appreciate your attention so far if you are still with me. Before I bore you completely I’d like to mention that there are acknowledgement and data source sections near the end.
The Delivered Product
This discussion section is a bit scattered, as I am merely documenting some of my random thoughts during this project. The more important goal for this section is to trigger some additional thoughts on your side, which I hope you will share with me.
Extended thoughts from the correlation matrix plot
Let’s look at these 3 variables: income, living cost, and income/cost ratio. The ratio is derived from the first 2 base quantities, since it has a nice feature — ratio=1 means making ends meet, which is a convenient parameter for life-related decisions. However, it is worth noting that, when this ratio is offered as a 3rd variable, it does not add any additional information to this income-cost system — the degrees of freedom remain 2.
People with statistical knowledge might choose only up to 2 of the 3 parameters in this group, in any given calculation. The consideration here is: when all 3 are checked, we are double dipping into something.
But is this terrible for what we are doing here? Since many things in life are somewhat correlated, don’t we always face this problem no matter what?
What does this ranking app do?
I initially wanted the correlation matrix because I thought I would digest the result, decide which variables to include or drop, and then run a PCA. In the end I found those follow-up steps inappropriate here. I am not trying to come up with an “engineered feature”, a linear combination of the available data columns that could predict the likelihood of users achieving what they want in life.
Instead, I simply ask users to bring their own “Happiness = f(x)” function to the app (with the assumption that the input variables are among the ones I offer here). The “happiness” or “achievement” is subjectively defined, and very hard to measure. My current understanding is that, for what this “where to live” app is designed to do, I do not have to worry about correlation at all — I just dig for any quantity that the users might care about during the decision making process, and simply offer them all as options.
Am I right?
The limitations of a linear framework
As mentioned earlier, the weighted-average engine is chosen because of its simplicity and scalability. At this point I would like to share my thoughts on its limitations, in the context of the “happiness” discussion. By choosing the linear framework, I am essentially saying this for example: given other variables staying constant, your happiness grows linearly with how much money you make; at any given point, if you need to take derivatives with respect to a certain variable, you’ll find the first order derivative is a constant, and any higher order derivatives are 0 — because I defined a model space with only first degree polynomials.
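The statement about derivatives can be written out explicitly. The notation below is assumed for illustration and does not come from the app's source: each normalized, direction-adjusted variable is written with a tilde and its user-assigned weight is w.

```latex
% Assumed notation: \tilde{x}_i is the i-th normalized, direction-adjusted
% variable and w_i \ge 0 is its user-assigned weight.
S(\tilde{x}_1,\dots,\tilde{x}_n)
  = \frac{\sum_{i=1}^{n} w_i \tilde{x}_i}{\sum_{i=1}^{n} w_i},
\qquad
\frac{\partial S}{\partial \tilde{x}_i}
  = \frac{w_i}{\sum_{k=1}^{n} w_k},
\qquad
\frac{\partial^2 S}{\partial \tilde{x}_i \, \partial \tilde{x}_j} = 0 .
```

The first derivative depends only on the weights, so one extra unit of any variable always moves the score by the same amount, regardless of the starting point.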
It is always good practice, when choosing basis functions, to first think about all the possible mathematical behavior your governing equation might require. For this project I assume, for example, that I would never look at how much happier people become when they make another $10,000/year. The app itself is meant to use explicit linear functions to go through a forward problem, rather than an inverse problem that tries to analyze the quantitative impact of money, air quality or safety on our happiness.
Another train of thought is: I really wonder what happens when I plug in some nonlinear functions into this system, like this
Thoughts on filtering
While playing with the filters of this app, very often you find there are only a handful of counties (out of 3000+) surviving the filtering system, especially when you combine multiple sliders. When the problem size is small enough for the memory to handle the whole data set, the suggested approach might be: after choosing which columns of data should be included (by checking boxes), keep all rows (by not touching the sliders at all), then generate the score. Gain an understanding of where the bull’s eyes are, then play with filters.
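The suggested order of operations (pick columns, score everything, then tighten the sliders) can be sketched in Python as follows. The helpers `filter_rows` and `rank`, the column names and the three sample counties are hypothetical, and the values are assumed to be already normalized and oriented so that 100 is best.

```python
def filter_rows(rows, ranges):
    """Keep rows whose value lies inside [lo, hi] for every slider in `ranges`."""
    return [
        row for row in rows
        if all(lo <= row[col] <= hi for col, (lo, hi) in ranges.items())
    ]


def rank(rows, weights):
    """Score rows by weighted average over the checked columns, best first."""
    total = sum(weights.values())
    if total == 0:
        raise ValueError("at least one weight must be positive")
    scored = [
        (sum(w * row[col] for col, w in weights.items()) / total, row["county"])
        for row in rows
    ]
    return sorted(scored, reverse=True)


# Hypothetical counties; values already normalized so that 100 is best.
rows = [
    {"county": "A", "air": 90.0, "income": 40.0, "safety": 70.0},
    {"county": "B", "air": 60.0, "income": 80.0, "safety": 50.0},
    {"county": "C", "air": 30.0, "income": 95.0, "safety": 20.0},
]
weights = {"air": 1, "income": 1}  # the checked boxes and their weights

# Step 1: leave the sliders untouched and score every county.
full = rank(rows, weights)
# Step 2: only then narrow down with a slider, here on the safety column.
shortlist = rank(filter_rows(rows, {"safety": (40.0, 100.0)}), weights)

print(full)
print(shortlist)
```

Comparing the two printed lists shows which high-scoring candidates a filter removed, which is the “where are the bull’s eyes” check described above.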
It would be nice to have a real-time visualization, as I play with the slider of 1 parameter, showing how the possible ranges of other parameters are changing. This would be very easily achievable in the future with the ‘crosstalk’ package, if not already.
The challenge of including categorical data
All data sets used are numerical. Categorical data sets are avoided since it’s hard to define the “distance” among different categories. When people try to decide where to eat, and tell me their favorite cuisine is Italian, it takes lots of assumptions and framework building to define the score of American, Chinese, Thai…
For this project, 1 categorical data set I found that could potentially be included with little effort is climate zone data, since it is often derived from numerical measurements such as how many days in a year people turn on heating vs. cooling. Most other categorical data are hard to incorporate.
Data with gaps — interpolation is rarely just a math problem
Zillow.com offers lots of good real estate data, and since real estate value is often a key consideration when people compare places, I really wanted to include that information in the calculation. However, the Zillow county data set only has statistics for slightly more than 1800 counties, far fewer than the total (3000+). I did not want to do a quick but wrong interpolation of the data, and therefore gave up on using the set. Proper real estate value interpolation for the other 1200+ counties is a complex project by itself.
For other data sets with only a few missing data items, I often just googled for the needed data points and manually fixed them. But for the 150+ counties reporting 0 crime rate, I suspect quite a few of them are wrong — I did not have time to validate the data, so they were used as is.
The data cleaning steps are also included in the GitHub code; check it out if you are interested in the details.
How to potentially make a profit?
There are 2 general business directions I can think of to potentially make money with a much improved version of this app:
- When the dollar value at stake in the decision itself is huge, large corporations might want to automate parts of the decision making process and generate easy-to-digest visualizations.
- The app is naturally suited to being a web service for individual consumers: we learn more and more about the users as they use the app, and the least we can do is pick relevant ads and show them beside the map.
Scale up consideration
This is a placeholder for me to revisit in a few months, when I have a more complete view of the big data solutions.
The need for efficient data digestion as a general opportunity for data scientists
When facing a fast growing ocean of data, user-friendly digestion solutions will be in high demand, from corporations to individual consumers. Eventually all intellectually active human beings need to do some ‘data science’, while a large amount (if not the majority) of data would be quickly thrown away without being stored. This leads to a prediction that real time data stream digestion will soon become dominantly mainstream. I think the future is very interesting, but should I be scared also?
Acknowledgement
- NYC Data Science: Shu Yan, Zeyu Zhang
- Inspiration from the ‘SuperZip’ example by Joe Cheng
- Leaflet mapping examples on datascienceriot.com
- Correlation Matrix app ‘shinyCorrplot’ by saurfang
- Developers of all other packages I used for this project