Best city for data scientists today according to two variables harvested with rvest
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Some cities are more appealing for a data scientist to live than others. Several websites list best cities for data scientists, but their lists don’t agree and their methods are not explained, so the quality of the analysis and ability to determine which individuals their results can be inferred to are limited.
So here I set up to develop a reproducible, if not quite complete, measure of data scientist city attractiveness. Number of jobs for data scientists and cost of living may be two important variables. Using R’s rvest package, we can scrape from the web necessary information to get an idea how cities look in terms of these two. Got to DSCA index v0.1 shown above with top 21 cities labeled.
Bottom Line Up Front: Eight large hiring metros stand out, and less expensive cities within their communing distance look best by these variables. In particular, Newark, NJ, as cheap and close to lots of data scientist jobs, has the highest value in DSCA index v0.1. These two variables alone and the sources I chose show some interesting initial results, but they don’t seem to capture a complete picture, so more work is needed before getting a reliable reproducible index.
1. How many Data Scientist jobs are there in a given city?
One approach to see how many jobs a city has is search monster.com. “Data” and “scientist” can be keywords in other jobs, but I really want to know how many jobs with the words “Data Scientist” in the title are posted in each city. In this case for Portland, OR, I want a function that will return number 24.
This approach has limitations:
- Job postings do not represent people currently employed as a data scientist
- Some cities may fill and close these postings faster than others, so would appear to have fewer postings using one slice in time than say unique postings in a month
- Biases in favor of companies that post on Monster.com
- Over-counting from duplicates
- Under-counting for posts hiring for multiple positions
In any case, let’s use this as a proxy and see what shakes out. First step is to get a function that works to return 24 on just one city. In my browser, this search’s URL has an extra “&jobid=…” section to point at the first listed job and can be removed, so that for Portland, OR its minimal moster.com Data Scientist URL is:
<span class="n">url</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s1">'https://www.monster.com/jobs/search/?q=__22data-scientist__22&where=Portland__2C-OR'</span><span class="w">
</span>
We know we want this number 24. To get to it, http://selectorgadget.com/ offers a nice utility. You “install” it by adding a bookmark available there. Then when navigated to a page you want to scrape, click the bookmark to find a best tag to identify your website’s component you need. Clicking on 24 here, slectorgadget shows it’s in '.title'
and so are 31 other elements.
I only need the first one so click on one of the yellow ones to turn it red. Now selectorgadget has narrowed down to my best extractable tag, '.navigation-content .title'
As a recap, what we’ve figured out is the html tags surrounding text we need to extract from this page. Using rvest
library, we can use this information to harvest this page’s “most nutritious” content,
<span class="n">webpage</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">read_html</span><span class="p">(</span><span class="n">url</span><span class="p">)</span><span class="w">
</span><span class="n">titles_html</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">html_nodes</span><span class="p">(</span><span class="n">webpage</span><span class="p">,</span><span class="w"> </span><span class="s1">'.navigation-content .title'</span><span class="p">)</span><span class="w">
</span><span class="n">titles_html</span><span class="w">
</span><span class="c1">#{xml_nodeset (3)}</span><span class="w">
</span><span class="c1">#[1] <h2 class="title">Filter your search</h2></span><span class="w">
</span><span class="c1">#[2] <p class="title">\r\n <b>\r\n "data scientist" J ...</span><span class="w">
</span><span class="c1">#[3] <h2 class="title">Get "data scientist" jobs in Portland as soon as they're ...</span><span class="w">
</span><span class="c1"># looks like it's the second element</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">tidyverse</span><span class="p">)</span><span class="w"> </span><span class="c1"># for dplyr, ggplot2, and purrr</span><span class="w">
</span><span class="n">titles_html</span><span class="p">[</span><span class="m">2</span><span class="p">]</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">html_text</span><span class="p">()</span><span class="w">
</span><span class="c1">#[1] "\r\n \r\n \"data scientist\" Jobs in Portland, Oregon \r\n \r\n(24 Jobs Found) "</span><span class="w">
</span>
Can see my goal number comes after a parenthesis “(“. So using ever-handy regex cheat sheet at http://www.rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf we can extract only the number 24
<span class="n">library</span><span class="p">(</span><span class="n">stringr</span><span class="p">)</span><span class="w">
</span><span class="n">titles_html</span><span class="p">[</span><span class="m">2</span><span class="p">]</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">html_text</span><span class="p">()</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">str_extract</span><span class="p">(</span><span class="s1">'\\([0-9]*'</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">str_replace</span><span class="p">(</span><span class="s1">'\\('</span><span class="p">,</span><span class="w"> </span><span class="s1">''</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="nf">as.numeric</span><span class="p">()</span><span class="w">
</span><span class="c1">#[1] 24 </span><span class="w">
</span>
Cool, that worked. Let’s try a second city.
2. Expanding your scrape to additional cities
The obvious next city to try is Columbus, OH. Running a monster.com search on it shows the format of the url string https://www.monster.com/jobs/search/?q=__22Data-Scientist__22&where=Columbus__2C-OH
simply replaces the city and state. Also, if we run the above code changing only its city and state we expect number 44 returned.
Indeed that is what happens. Here it has been piped for easier understanding
<span class="s1">'https://www.monster.com/jobs/search/?q=__22data-scientist__22&where=Columbus__2C-OH'</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">read_html</span><span class="p">()</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">html_nodes</span><span class="p">(</span><span class="s1">'.navigation-content .title'</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">nth</span><span class="p">(</span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">html_text</span><span class="p">()</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">str_extract</span><span class="p">(</span><span class="s1">'\\([0-9]*'</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">str_replace</span><span class="p">(</span><span class="s1">'\\('</span><span class="p">,</span><span class="w"> </span><span class="s1">''</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="nf">as.numeric</span><span class="p">()</span><span class="w">
</span><span class="c1">#[1] 44</span><span class="w">
</span>
Next step, create a function that takes a city and state and returns a job count. Now two issues arise when trying to expand to other cities. We have multi name cities, such as Los Angeles, CA and Salt Lake City, UT. We need to know how monster represents those. We also have cities with no data scientist jobs available, such as Anchorage, AK. Let’s take a look.
Los Angeles’s url is https://www.monster.com/jobs/search/?q=__22Data-Scientist__22&where=Los-Angeles__2C-CA&jobid=195858182
. Now we can remove the &jobid=...
part and note the space is a dash. Similarly with Salt Lake City, https://www.monster.com/jobs/search/?q=__22Data-Scientist__22&where=Salt-Lake-City__2C-UT&jobid=814cce32-0bd4-41b3-91cc-6789f09ffdea
it adds a dash in both spaces. These can be handled with stringr
’s str_replace()
.
As for Data Science job postings in Alaska’s largest city, what you see in parenthesis isn’t a number, so it will return as NA.
We can handle that with an if(is.na())
. And a function that works looks like this:
<span class="n">get_job_count</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">city</span><span class="p">,</span><span class="w"> </span><span class="n">state</span><span class="p">){</span><span class="w">
</span><span class="n">city_dashed</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">str_replace_all</span><span class="p">(</span><span class="n">city</span><span class="p">,</span><span class="w"> </span><span class="s1">'\\ '</span><span class="p">,</span><span class="w"> </span><span class="s1">'\\-'</span><span class="p">)</span><span class="w">
</span><span class="n">job_count</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">paste0</span><span class="p">(</span><span class="s1">'https://www.monster.com/jobs/search/'</span><span class="p">,</span><span class="w">
</span><span class="s1">'?q=__22data-scientist__22'</span><span class="p">,</span><span class="w">
</span><span class="s1">'&where='</span><span class="p">,</span><span class="w">
</span><span class="n">city_dashed</span><span class="p">,</span><span class="w">
</span><span class="s1">'__2C-'</span><span class="p">,</span><span class="w">
</span><span class="n">state</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">read_html</span><span class="p">()</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">html_nodes</span><span class="p">(</span><span class="s1">'.navigation-content .title'</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">nth</span><span class="p">(</span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">html_text</span><span class="p">()</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">str_extract</span><span class="p">(</span><span class="s1">'\\([0-9]*'</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">str_replace</span><span class="p">(</span><span class="s1">'\\('</span><span class="p">,</span><span class="w"> </span><span class="s1">''</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="nf">as.numeric</span><span class="p">()</span><span class="w">
</span><span class="k">if</span><span class="p">(</span><span class="nf">is.na</span><span class="p">(</span><span class="n">job_count</span><span class="p">)){</span><span class="w">
</span><span class="n">job_count</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">0</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">data_frame</span><span class="p">(</span><span class="n">job_count</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">get_job_count</span><span class="p">(</span><span class="s2">"Columbus"</span><span class="p">,</span><span class="w"> </span><span class="s2">"OH"</span><span class="p">)</span><span class="w">
</span><span class="c1"># A tibble: 1 x 1</span><span class="w">
</span><span class="c1"># job_count</span><span class="w">
</span><span class="c1"># <dbl></span><span class="w">
</span><span class="c1">#1 44.0</span><span class="w">
</span><span class="n">get_job_count</span><span class="p">(</span><span class="s2">"Anchorage"</span><span class="p">,</span><span class="w"> </span><span class="s2">"AK"</span><span class="p">)</span><span class="w">
</span><span class="c1"># A tibble: 1 x 1</span><span class="w">
</span><span class="c1"># job_count</span><span class="w">
</span><span class="c1"># <dbl></span><span class="w">
</span><span class="c1">#1 0</span><span class="w">
</span><span class="n">get_job_count</span><span class="p">(</span><span class="s2">"Salt Lake City"</span><span class="p">,</span><span class="w"> </span><span class="s2">"UT"</span><span class="p">)</span><span class="w">
</span><span class="c1"># A tibble: 1 x 1</span><span class="w">
</span><span class="c1"># job_count</span><span class="w">
</span><span class="c1"># <dbl></span><span class="w">
</span><span class="c1">#1 12.0</span><span class="w">
</span>
It returns a tibble data frame with the expected number. Can see this is close to being iterable on every city we want to look up. We’re going to be doing a right join essentially on city cost of living data, so let’s get that now, and use that list to feed this function.
3. Scrape cost of living data
Cost of living per city indexed so that New York City is 1.00 is available at https://www.numbeo.com/cost-of-living/. This website appears to “croudsource” their cost of living data, so it’s going to be biased. Again we just need a rough idea, so let’s start with this.
Now monster only takes cities in the US, so let’s filter numbeo to Northern America at: https://www.numbeo.com/cost-of-living/region_rankings.jsp?title=2017®ion=021. We need two fields, the city, and the cost of living index. Here I pulled in columns individually. rvest
does have an html_table()
function, but it doesn’t work on some types of tables.
Using selectorgadget we can get the name of the city column.
The cost of living index is a bit more complicated. Because this table is sorted by that column, clicking on it says it’s .sorting_1
but this won’t run in rvest
. Instead one needs to sort by a different column to figure out the tag is td:nth-child(3)
.
That brings in an extra row, so need to remove it.
<span class="n">cities</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s1">'https://www.numbeo.com/cost-of-living/region_rankings.jsp?title=2017®ion=021'</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">read_html</span><span class="p">()</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">html_nodes</span><span class="p">(</span><span class="s1">'.cityOrCountryInIndicesTable'</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">html_text</span><span class="p">()</span><span class="w">
</span><span class="n">cost_of_living_index</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s1">'https://www.numbeo.com/cost-of-living/region_rankings.jsp?title=2017®ion=021'</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">read_html</span><span class="p">()</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">html_nodes</span><span class="p">(</span><span class="s1">'td:nth-child(3)'</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">html_text</span><span class="p">()</span><span class="w">
</span><span class="c1"># remove the top row</span><span class="w">
</span><span class="n">cost_of_living_index</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cost_of_living_index</span><span class="p">[</span><span class="m">-1</span><span class="p">]</span><span class="w">
</span><span class="c1"># combine them</span><span class="w">
</span><span class="n">cities_to_check</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="n">cities</span><span class="p">,</span><span class="w"> </span><span class="n">cost_of_living_index</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">as.data.frame</span><span class="w">
</span><span class="n">head</span><span class="p">(</span><span class="n">cities_to_check</span><span class="p">)</span><span class="w">
</span><span class="c1"># cities cost_of_living_index</span><span class="w">
</span><span class="c1">#1 Hamilton, Bermuda 141.84</span><span class="w">
</span><span class="c1">#2 New York, NY, United States 100.00</span><span class="w">
</span><span class="c1">#3 Honolulu, HI, United States 99.54</span><span class="w">
</span><span class="c1">#4 San Francisco, CA, United States 99.14</span><span class="w">
</span><span class="c1">#5 Anchorage, AK, United States 98.69</span><span class="w">
</span><span class="c1">#6 Washington, DC, United States 95.34</span><span class="w">
</span>
Now again, unfortunately monster can’t find any jobs in Bermuda, and it also can’t
take Canada, so let’s filter to US only and do some cleaning to prepare to run in the get_job_count function.
<span class="n">cities_to_check_cleaned</span><span class="w">
</span><span class="o"><-</span><span class="w"> </span><span class="n">cities_to_check</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">filter</span><span class="p">(</span><span class="n">str_detect</span><span class="p">(</span><span class="n">cities</span><span class="p">,</span><span class="w"> </span><span class="s2">"United States"</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">separate</span><span class="p">(</span><span class="n">cities</span><span class="p">,</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'city'</span><span class="p">,</span><span class="w"> </span><span class="s1">'state'</span><span class="p">,</span><span class="w"> </span><span class="s1">'country'</span><span class="p">),</span><span class="w"> </span><span class="n">sep</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">","</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="c1"># need sep or else it splits on spaces</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">state</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">str_trim</span><span class="p">(</span><span class="n">state</span><span class="p">),</span><span class="w">
</span><span class="n">cost_of_living_index</span><span class="w"> </span><span class="o">=</span><span class="w">
</span><span class="nf">as.numeric</span><span class="p">(</span><span class="nf">as.character</span><span class="p">(</span><span class="n">cost_of_living_index</span><span class="p">)))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">select</span><span class="p">(</span><span class="n">city</span><span class="p">,</span><span class="w"> </span><span class="n">state</span><span class="p">,</span><span class="w"> </span><span class="n">cost_of_living_index</span><span class="p">)</span><span class="w">
</span><span class="n">head</span><span class="p">(</span><span class="n">cities_to_check_cleaned</span><span class="p">)</span><span class="w">
</span><span class="c1"># city state cost_of_living_index</span><span class="w">
</span><span class="c1">#1 New York NY 100.00</span><span class="w">
</span><span class="c1">#2 Honolulu HI 99.54</span><span class="w">
</span><span class="c1">#3 San Francisco CA 99.14</span><span class="w">
</span><span class="c1">#4 Anchorage AK 98.69</span><span class="w">
</span><span class="c1">#5 Washington DC 95.34</span><span class="w">
</span><span class="c1">#6 New Haven CT 93.75</span><span class="w">
</span>
4. Combine both sources
There are two inputs to the get_job_count()
function, city
and state
, so a good choice is purrr
’s map2
function. Let’s do a quick check on a subset of these cities first
<span class="n">cities_to_check_cleaned_short</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cities_to_check_cleaned</span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w">
</span><span class="n">map2_df</span><span class="p">(</span><span class="n">.x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cities_to_check_cleaned_short</span><span class="o">$</span><span class="n">city</span><span class="p">,</span><span class="w">
</span><span class="n">.y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cities_to_check_cleaned_short</span><span class="o">$</span><span class="n">state</span><span class="p">,</span><span class="w">
</span><span class="n">.f</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">get_job_count</span><span class="p">)</span><span class="w">
</span><span class="c1"># A tibble: 4 x 1</span><span class="w">
</span><span class="c1"># job_count</span><span class="w">
</span><span class="c1"># <dbl></span><span class="w">
</span><span class="c1">#1 402 </span><span class="w">
</span><span class="c1">#2 4.00</span><span class="w">
</span><span class="c1">#3 227 </span><span class="w">
</span><span class="c1">#4 0 </span><span class="w">
</span>
It seems to work and return a data frame with numbers as expected, including a 0 for Anchorage, AK. Let’s run it on all listed cities and recombine into one data frame,
<span class="n">job_count</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">map2_df</span><span class="p">(</span><span class="n">.x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cities_to_check_cleaned</span><span class="o">$</span><span class="n">city</span><span class="p">,</span><span class="w">
</span><span class="n">.y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cities_to_check_cleaned</span><span class="o">$</span><span class="n">state</span><span class="p">,</span><span class="w">
</span><span class="n">.f</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">get_job_count</span><span class="p">)</span><span class="w">
</span><span class="nf">dim</span><span class="p">(</span><span class="n">job_count</span><span class="p">)</span><span class="w">
</span><span class="c1">#[1] 112 1</span><span class="w">
</span><span class="nf">dim</span><span class="p">(</span><span class="n">cities_to_check_cleaned</span><span class="p">)</span><span class="w">
</span><span class="c1">#[1] 112 3</span><span class="w">
</span><span class="c1"># same number of rows</span><span class="w">
</span><span class="n">count_by_city</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">bind_cols</span><span class="p">(</span><span class="n">cities_to_check_cleaned</span><span class="p">,</span><span class="w"> </span><span class="n">job_count</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="s2">"City Name"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">paste0</span><span class="p">(</span><span class="n">city</span><span class="p">,</span><span class="w"> </span><span class="s2">", "</span><span class="p">,</span><span class="w"> </span><span class="n">state</span><span class="p">),</span><span class="w">
</span><span class="s2">"City Name"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">str_replace</span><span class="p">(</span><span class="n">`City Name`</span><span class="p">,</span><span class="w"> </span><span class="s1">'\\-'</span><span class="p">,</span><span class="w"> </span><span class="s1">'\\ '</span><span class="p">))</span><span class="w">
</span><span class="n">head</span><span class="p">(</span><span class="n">count_by_city</span><span class="p">)</span><span class="w">
</span><span class="c1"># city state cost_of_living_index job_count City Name</span><span class="w">
</span><span class="c1">#1 New York NY 100.00 402 New York, NY</span><span class="w">
</span><span class="c1">#2 Honolulu HI 99.54 4 Honolulu, HI</span><span class="w">
</span><span class="c1">#3 San Francisco CA 99.14 227 San Francisco, CA</span><span class="w">
</span><span class="c1">#4 Anchorage AK 98.69 0 Anchorage, AK</span><span class="w">
</span><span class="c1">#5 Washington DC 95.34 368 Washington, DC</span><span class="w">
</span><span class="c1">#6 New Haven CT 93.75 3 New Haven, CT</span><span class="w">
</span>
These are arranged by cost of living index. How about by job count?
<span class="n">count_by_city</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">arrange</span><span class="p">(</span><span class="n">desc</span><span class="p">(</span><span class="n">job_count</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">top_n</span><span class="p">(</span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="n">job_count</span><span class="p">)</span><span class="w">
</span><span class="c1"># city state cost_of_living_index job_count City Name</span><span class="w">
</span><span class="c1">#1 Newark NJ 72.84 419 Newark, NJ</span><span class="w">
</span><span class="c1">#2 Jersey City NJ 78.53 412 Jersey City, NJ</span><span class="w">
</span><span class="c1">#3 New York NY 100.00 402 New York, NY</span><span class="w">
</span><span class="c1">#4 Brooklyn NY 91.25 399 Brooklyn, NY</span><span class="w">
</span><span class="c1">#5 Seattle WA 90.54 388 Seattle, WA</span><span class="w">
</span><span class="c1">#6 Arlington VA 89.52 383 Arlington, VA</span><span class="w">
</span><span class="c1">#7 Fairfax VA 84.26 372 Fairfax, VA</span><span class="w">
</span><span class="c1">#8 Washington DC 95.34 368 Washington, DC</span><span class="w">
</span><span class="c1">#9 Fremont CA 71.42 298 Fremont, CA</span><span class="w">
</span><span class="c1">#10 San Jose CA 82.75 252 San Jose, CA</span><span class="w">
</span>
5. Analysis of best city for data scientists
That’s how it looks. How can we combine cost of living and number of job postings to find one best city? A good place to start is a scatterplot.
If filtering to cities with more than 20 postings,
<span class="n">library</span><span class="p">(</span><span class="n">ggthemes</span><span class="p">)</span><span class="w">
</span><span class="n">count_by_city</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">arrange</span><span class="p">(</span><span class="n">desc</span><span class="p">(</span><span class="n">job_count</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">filter</span><span class="p">(</span><span class="n">job_count</span><span class="w"> </span><span class="o">>=</span><span class="w"> </span><span class="m">20</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cost_of_living_index</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">job_count</span><span class="p">,</span><span class="w">
</span><span class="n">label</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">`City Name`</span><span class="p">,</span><span class="w"> </span><span class="n">color</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">state</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_text</span><span class="p">(</span><span class="n">vjust</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">theme_gdocs</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">ylab</span><span class="p">(</span><span class="s1">'Count of Job Postings'</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">xlab</span><span class="p">(</span><span class="s1">'Cost of living index'</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">ggtitle</span><span class="p">(</span><span class="s1">'Count of Data Scientist Job Postings in 20 mile radius on Monster.com
vs. Cost of living index per numbeo.com
for US Cities with 20+ Job Postings'</span><span class="p">)</span><span class="w">
</span>
Colors by state are not the most useful but do show relation between these cities – for example all the Bay Area brown ones between 200 and 300.
Finally, to try to arrive at some intuitive ranking of cities, I’ve reduced these two variables into a one-dimensional Data Scientist City Attractiveness index.
<span class="n">`DSCA_index`</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">log</span><span class="p">(</span><span class="n">job_count</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">.1</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">cost_of_living_index</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">100</span><span class="w">
</span>
It may be simplistic, but a starting point, a null hypothesis, if you will, of what matters when choosing a city to live in as a data scientist. Final results…
<span class="n">count_by_city</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">`DSCA_index`</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">log</span><span class="p">(</span><span class="n">job_count</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">.1</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">cost_of_living_index</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">100</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">arrange</span><span class="p">(</span><span class="n">desc</span><span class="p">(</span><span class="n">`DSCA_index`</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">top_n</span><span class="p">(</span><span class="m">20</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">select</span><span class="p">(</span><span class="n">cost_of_living_index</span><span class="o">:</span><span class="n">DSCA_index</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">`City Name`</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">reorder</span><span class="p">(</span><span class="n">`City Name`</span><span class="p">,</span><span class="w"> </span><span class="n">DSCA_index</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">`City Name`</span><span class="p">,</span><span class="w"> </span><span class="n">DSCA_index</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_bar</span><span class="p">(</span><span class="n">stat</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"identity"</span><span class="p">,</span><span class="w"> </span><span class="n">show.legend</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">coord_flip</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"City"</span><span class="p">,</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"DSCA index = exp(Job Posting Count) / Cost of Living Index"</span><span class="p">,</span><span class="w">
</span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Hypothesis Generation of top 20 cities to live as a Data Scientist"</span><span class="p">)</span><span class="w">
</span>
Newark, NJ takes first place!
The scatterplot at top of this post shows all 112 US cities with a cost of living index on numbeo. Cities above 5 on the index are labeled, cities lower are not.
6. Conclusion
NYC, Seattle, DC, Bay Area, Boston, Chicago, St. Louis, and Dallas have the most data scientist monster.com job postings, standing out as the eight largest hiring metros as of the time of this scrape. Some questions about data quality of both data sources remain, such as duplicate postings or San Jose being cheaper to live in than Oakland. So would take this index with a grain of salt.
Also, even if data reliability questions are addressed, these are obviously not the only two numbers that matter. A quality-of-data-scientist-life index might consider more relevant variables than jobs and cost of living alone. So obvious areas of improvement include:
- crime stats
- livability indices
- numbeo’s cost of living plus rent index
- average commute time to the jobs in the 20 mile radius
- smaller radii
- total population of metro area
- college educated population
- quantitative masters educated population
- natural beauty and low pollution
- small cities with many openings
- distribution of data scientist salaries
- sentiment of reviews of job experience as a data scientist
- monster.ca for Canadian postings
- interpretability of index
DSCA v0.1 code can be found at:
https://github.com/dgarmat/data_scientist_city_attractiveness_index
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.