Best city for data scientists today according to two variables harvested with rvest

[This article was first published on Dan Garmat's Blog -- R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Jobs vs. Cost of Living by City

Some cities are more appealing for a data scientist to live than others. Several websites list best cities for data scientists, but their lists don’t agree and their methods are not explained, so the quality of the analysis and ability to determine which individuals their results can be inferred to are limited.

So here I set up to develop a reproducible, if not quite complete, measure of data scientist city attractiveness. Number of jobs for data scientists and cost of living may be two important variables. Using R’s rvest package, we can scrape from the web necessary information to get an idea how cities look in terms of these two. Got to DSCA index v0.1 shown above with top 21 cities labeled.

Bottom Line Up Front: Eight large hiring metros stand out, and less expensive cities within their communing distance look best by these variables. In particular, Newark, NJ, as cheap and close to lots of data scientist jobs, has the highest value in DSCA index v0.1. These two variables alone and the sources I chose show some interesting initial results, but they don’t seem to capture a complete picture, so more work is needed before getting a reliable reproducible index.

1. How many Data Scientist jobs are there in a given city?

One approach to see how many jobs a city has is search monster.com. “Data” and “scientist” can be keywords in other jobs, but I really want to know how many jobs with the words “Data Scientist” in the title are posted in each city. In this case for Portland, OR, I want a function that will return number 24.

scrape_03

This approach has limitations:

  1. Job postings do not represent people currently employed as a data scientist
  2. Some cities may fill and close these postings faster than others, so would appear to have fewer postings using one slice in time than say unique postings in a month
  3. Biases in favor of companies that post on Monster.com
  4. Over-counting from duplicates
  5. Under-counting for posts hiring for multiple positions

In any case, let’s use this as a proxy and see what shakes out. First step is to get a function that works to return 24 on just one city. In my browser, this search’s URL has an extra “&jobid=…” section to point at the first listed job and can be removed, so that for Portland, OR its minimal moster.com Data Scientist URL is:

<span class="n">url</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s1">'https://www.monster.com/jobs/search/?q=__22data-scientist__22&where=Portland__2C-OR'</span><span class="w">
</span>

We know we want this number 24. To get to it, http://selectorgadget.com/ offers a nice utility. You “install” it by adding a bookmark available there. Then when navigated to a page you want to scrape, click the bookmark to find a best tag to identify your website’s component you need. Clicking on 24 here, slectorgadget shows it’s in '.title' and so are 31 other elements.

scrape_04

I only need the first one so click on one of the yellow ones to turn it red. Now selectorgadget has narrowed down to my best extractable tag, '.navigation-content .title'

scrape_07

As a recap, what we’ve figured out is the html tags surrounding text we need to extract from this page. Using rvest library, we can use this information to harvest this page’s “most nutritious” content,

<span class="n">webpage</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">read_html</span><span class="p">(</span><span class="n">url</span><span class="p">)</span><span class="w">
</span><span class="n">titles_html</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">html_nodes</span><span class="p">(</span><span class="n">webpage</span><span class="p">,</span><span class="w"> </span><span class="s1">'.navigation-content .title'</span><span class="p">)</span><span class="w">
</span><span class="n">titles_html</span><span class="w">

</span><span class="c1">#{xml_nodeset (3)}</span><span class="w">
</span><span class="c1">#[1] <h2 class="title">Filter your search</h2></span><span class="w">
</span><span class="c1">#[2] <p class="title">\r\n            <b>\r\n                 "data scientist" J ...</span><span class="w">
</span><span class="c1">#[3] <h2 class="title">Get "data scientist" jobs in Portland as soon as they're  ...</span><span class="w">

</span><span class="c1"># looks like it's the second element</span><span class="w">

</span><span class="n">library</span><span class="p">(</span><span class="n">tidyverse</span><span class="p">)</span><span class="w"> </span><span class="c1"># for dplyr, ggplot2, and purrr</span><span class="w">
</span><span class="n">titles_html</span><span class="p">[</span><span class="m">2</span><span class="p">]</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">html_text</span><span class="p">()</span><span class="w">
</span><span class="c1">#[1] "\r\n            \r\n                 \"data scientist\" Jobs in Portland, Oregon \r\n            \r\n(24 Jobs Found)        "</span><span class="w">
</span>

Can see my goal number comes after a parenthesis “(“. So using ever-handy regex cheat sheet at http://www.rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf we can extract only the number 24

<span class="n">library</span><span class="p">(</span><span class="n">stringr</span><span class="p">)</span><span class="w">

</span><span class="n">titles_html</span><span class="p">[</span><span class="m">2</span><span class="p">]</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">html_text</span><span class="p">()</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">str_extract</span><span class="p">(</span><span class="s1">'\\([0-9]*'</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">str_replace</span><span class="p">(</span><span class="s1">'\\('</span><span class="p">,</span><span class="w"> </span><span class="s1">''</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="nf">as.numeric</span><span class="p">()</span><span class="w">
</span><span class="c1">#[1] 24  </span><span class="w">
</span>

Cool, that worked. Let’s try a second city.

2. Expanding your scrape to additional cities

The obvious next city to try is Columbus, OH. Running a monster.com search on it shows the format of the url string https://www.monster.com/jobs/search/?q=__22Data-Scientist__22&where=Columbus__2C-OH simply replaces the city and state. Also, if we run the above code changing only its city and state we expect number 44 returned.

scrape_08

Indeed that is what happens. Here it has been piped for easier understanding

<span class="s1">'https://www.monster.com/jobs/search/?q=__22data-scientist__22&where=Columbus__2C-OH'</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">read_html</span><span class="p">()</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">html_nodes</span><span class="p">(</span><span class="s1">'.navigation-content .title'</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
  </span><span class="n">nth</span><span class="p">(</span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">html_text</span><span class="p">()</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">str_extract</span><span class="p">(</span><span class="s1">'\\([0-9]*'</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">str_replace</span><span class="p">(</span><span class="s1">'\\('</span><span class="p">,</span><span class="w"> </span><span class="s1">''</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="nf">as.numeric</span><span class="p">()</span><span class="w"> 
</span><span class="c1">#[1] 44</span><span class="w">
</span>

Next step, create a function that takes a city and state and returns a job count. Now two issues arise when trying to expand to other cities. We have multi name cities, such as Los Angeles, CA and Salt Lake City, UT. We need to know how monster represents those. We also have cities with no data scientist jobs available, such as Anchorage, AK. Let’s take a look.

Los Angeles’s url is https://www.monster.com/jobs/search/?q=__22Data-Scientist__22&where=Los-Angeles__2C-CA&jobid=195858182. Now we can remove the &jobid=... part and note the space is a dash. Similarly with Salt Lake City, https://www.monster.com/jobs/search/?q=__22Data-Scientist__22&where=Salt-Lake-City__2C-UT&jobid=814cce32-0bd4-41b3-91cc-6789f09ffdea it adds a dash in both spaces. These can be handled with stringr’s str_replace().

As for Data Science job postings in Alaska’s largest city, what you see in parenthesis isn’t a number, so it will return as NA.

scrape_09

We can handle that with an if(is.na()). And a function that works looks like this:

<span class="n">get_job_count</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">city</span><span class="p">,</span><span class="w"> </span><span class="n">state</span><span class="p">){</span><span class="w">
  </span><span class="n">city_dashed</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">str_replace_all</span><span class="p">(</span><span class="n">city</span><span class="p">,</span><span class="w"> </span><span class="s1">'\\ '</span><span class="p">,</span><span class="w"> </span><span class="s1">'\\-'</span><span class="p">)</span><span class="w">
  </span><span class="n">job_count</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">paste0</span><span class="p">(</span><span class="s1">'https://www.monster.com/jobs/search/'</span><span class="p">,</span><span class="w">
         </span><span class="s1">'?q=__22data-scientist__22'</span><span class="p">,</span><span class="w">
         </span><span class="s1">'&where='</span><span class="p">,</span><span class="w">
         </span><span class="n">city_dashed</span><span class="p">,</span><span class="w"> 
         </span><span class="s1">'__2C-'</span><span class="p">,</span><span class="w">
         </span><span class="n">state</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
    </span><span class="n">read_html</span><span class="p">()</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
    </span><span class="n">html_nodes</span><span class="p">(</span><span class="s1">'.navigation-content .title'</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
    </span><span class="n">nth</span><span class="p">(</span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
    </span><span class="n">html_text</span><span class="p">()</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
    </span><span class="n">str_extract</span><span class="p">(</span><span class="s1">'\\([0-9]*'</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
    </span><span class="n">str_replace</span><span class="p">(</span><span class="s1">'\\('</span><span class="p">,</span><span class="w"> </span><span class="s1">''</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
    </span><span class="nf">as.numeric</span><span class="p">()</span><span class="w">
  </span><span class="k">if</span><span class="p">(</span><span class="nf">is.na</span><span class="p">(</span><span class="n">job_count</span><span class="p">)){</span><span class="w">
    </span><span class="n">job_count</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">0</span><span class="w">
  </span><span class="p">}</span><span class="w">
  </span><span class="n">data_frame</span><span class="p">(</span><span class="n">job_count</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="n">get_job_count</span><span class="p">(</span><span class="s2">"Columbus"</span><span class="p">,</span><span class="w"> </span><span class="s2">"OH"</span><span class="p">)</span><span class="w">
</span><span class="c1"># A tibble: 1 x 1</span><span class="w">
</span><span class="c1">#  job_count</span><span class="w">
</span><span class="c1">#      <dbl></span><span class="w">
</span><span class="c1">#1      44.0</span><span class="w">

</span><span class="n">get_job_count</span><span class="p">(</span><span class="s2">"Anchorage"</span><span class="p">,</span><span class="w"> </span><span class="s2">"AK"</span><span class="p">)</span><span class="w">
</span><span class="c1"># A tibble: 1 x 1</span><span class="w">
</span><span class="c1">#  job_count</span><span class="w">
</span><span class="c1">#      <dbl></span><span class="w">
</span><span class="c1">#1         0</span><span class="w">

</span><span class="n">get_job_count</span><span class="p">(</span><span class="s2">"Salt Lake City"</span><span class="p">,</span><span class="w"> </span><span class="s2">"UT"</span><span class="p">)</span><span class="w">
</span><span class="c1"># A tibble: 1 x 1</span><span class="w">
</span><span class="c1">#  job_count</span><span class="w">
</span><span class="c1">#      <dbl></span><span class="w">
</span><span class="c1">#1      12.0</span><span class="w">
</span>

It returns a tibble data frame with the expected number. Can see this is close to being iterable on every city we want to look up. We’re going to be doing a right join essentially on city cost of living data, so let’s get that now, and use that list to feed this function.

3. Scrape cost of living data

Cost of living per city indexed so that New York City is 1.00 is available at https://www.numbeo.com/cost-of-living/. This website appears to “croudsource” their cost of living data, so it’s going to be biased. Again we just need a rough idea, so let’s start with this.

Now monster only takes cities in the US, so let’s filter numbeo to Northern America at: https://www.numbeo.com/cost-of-living/region_rankings.jsp?title=2017&region=021. We need two fields, the city, and the cost of living index. Here I pulled in columns individually. rvest does have an html_table() function, but it doesn’t work on some types of tables.

Using selectorgadget we can get the name of the city column.

scrape_10

The cost of living index is a bit more complicated. Because this table is sorted by that column, clicking on it says it’s .sorting_1 but this won’t run in rvest. Instead one needs to sort by a different column to figure out the tag is td:nth-child(3).

scrape_11

That brings in an extra row, so need to remove it.

<span class="n">cities</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s1">'https://www.numbeo.com/cost-of-living/region_rankings.jsp?title=2017&region=021'</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">read_html</span><span class="p">()</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">html_nodes</span><span class="p">(</span><span class="s1">'.cityOrCountryInIndicesTable'</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">html_text</span><span class="p">()</span><span class="w">

</span><span class="n">cost_of_living_index</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s1">'https://www.numbeo.com/cost-of-living/region_rankings.jsp?title=2017&region=021'</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">read_html</span><span class="p">()</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">html_nodes</span><span class="p">(</span><span class="s1">'td:nth-child(3)'</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">html_text</span><span class="p">()</span><span class="w">

</span><span class="c1"># remove the top row</span><span class="w">
</span><span class="n">cost_of_living_index</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cost_of_living_index</span><span class="p">[</span><span class="m">-1</span><span class="p">]</span><span class="w">


</span><span class="c1"># combine them</span><span class="w">
</span><span class="n">cities_to_check</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="n">cities</span><span class="p">,</span><span class="w"> </span><span class="n">cost_of_living_index</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">as.data.frame</span><span class="w">

</span><span class="n">head</span><span class="p">(</span><span class="n">cities_to_check</span><span class="p">)</span><span class="w">
</span><span class="c1">#                            cities cost_of_living_index</span><span class="w">
</span><span class="c1">#1                Hamilton, Bermuda               141.84</span><span class="w">
</span><span class="c1">#2      New York, NY, United States               100.00</span><span class="w">
</span><span class="c1">#3      Honolulu, HI, United States                99.54</span><span class="w">
</span><span class="c1">#4 San Francisco, CA, United States                99.14</span><span class="w">
</span><span class="c1">#5     Anchorage, AK, United States                98.69</span><span class="w">
</span><span class="c1">#6    Washington, DC, United States                95.34</span><span class="w">
</span>

Now again, unfortunately monster can’t find any jobs in Bermuda, and it also can’t
take Canada, so let’s filter to US only and do some cleaning to prepare to run in the get_job_count function.

<span class="n">cities_to_check_cleaned</span><span class="w">
 </span><span class="o"><-</span><span class="w"> </span><span class="n">cities_to_check</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">filter</span><span class="p">(</span><span class="n">str_detect</span><span class="p">(</span><span class="n">cities</span><span class="p">,</span><span class="w"> </span><span class="s2">"United States"</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">separate</span><span class="p">(</span><span class="n">cities</span><span class="p">,</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'city'</span><span class="p">,</span><span class="w"> </span><span class="s1">'state'</span><span class="p">,</span><span class="w"> </span><span class="s1">'country'</span><span class="p">),</span><span class="w"> </span><span class="n">sep</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">","</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="c1"># need sep or else it splits on spaces</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">state</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">str_trim</span><span class="p">(</span><span class="n">state</span><span class="p">),</span><span class="w">
         </span><span class="n">cost_of_living_index</span><span class="w"> </span><span class="o">=</span><span class="w"> 
           </span><span class="nf">as.numeric</span><span class="p">(</span><span class="nf">as.character</span><span class="p">(</span><span class="n">cost_of_living_index</span><span class="p">)))</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">select</span><span class="p">(</span><span class="n">city</span><span class="p">,</span><span class="w"> </span><span class="n">state</span><span class="p">,</span><span class="w"> </span><span class="n">cost_of_living_index</span><span class="p">)</span><span class="w">

</span><span class="n">head</span><span class="p">(</span><span class="n">cities_to_check_cleaned</span><span class="p">)</span><span class="w">
</span><span class="c1">#           city state cost_of_living_index</span><span class="w">
</span><span class="c1">#1      New York    NY               100.00</span><span class="w">
</span><span class="c1">#2      Honolulu    HI                99.54</span><span class="w">
</span><span class="c1">#3 San Francisco    CA                99.14</span><span class="w">
</span><span class="c1">#4     Anchorage    AK                98.69</span><span class="w">
</span><span class="c1">#5    Washington    DC                95.34</span><span class="w">
</span><span class="c1">#6     New Haven    CT                93.75</span><span class="w">
</span>

4. Combine both sources

There are two inputs to the get_job_count() function, city and state, so a good choice is purrr’s map2 function. Let’s do a quick check on a subset of these cities first

<span class="n">cities_to_check_cleaned_short</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cities_to_check_cleaned</span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w">

</span><span class="n">map2_df</span><span class="p">(</span><span class="n">.x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cities_to_check_cleaned_short</span><span class="o">$</span><span class="n">city</span><span class="p">,</span><span class="w"> 
        </span><span class="n">.y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cities_to_check_cleaned_short</span><span class="o">$</span><span class="n">state</span><span class="p">,</span><span class="w">
        </span><span class="n">.f</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">get_job_count</span><span class="p">)</span><span class="w">
</span><span class="c1"># A tibble: 4 x 1</span><span class="w">
</span><span class="c1">#  job_count</span><span class="w">
</span><span class="c1">#      <dbl></span><span class="w">
</span><span class="c1">#1    402   </span><span class="w">
</span><span class="c1">#2      4.00</span><span class="w">
</span><span class="c1">#3    227   </span><span class="w">
</span><span class="c1">#4      0   </span><span class="w">
</span>

It seems to work and return a data frame with numbers as expected, including a 0 for Anchorage, AK. Let’s run it on all listed cities and recombine into one data frame,

<span class="n">job_count</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">map2_df</span><span class="p">(</span><span class="n">.x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cities_to_check_cleaned</span><span class="o">$</span><span class="n">city</span><span class="p">,</span><span class="w"> 
                     </span><span class="n">.y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cities_to_check_cleaned</span><span class="o">$</span><span class="n">state</span><span class="p">,</span><span class="w">
                     </span><span class="n">.f</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">get_job_count</span><span class="p">)</span><span class="w">

</span><span class="nf">dim</span><span class="p">(</span><span class="n">job_count</span><span class="p">)</span><span class="w">
</span><span class="c1">#[1] 112   1</span><span class="w">

</span><span class="nf">dim</span><span class="p">(</span><span class="n">cities_to_check_cleaned</span><span class="p">)</span><span class="w">
</span><span class="c1">#[1] 112   3</span><span class="w">

</span><span class="c1"># same number of rows</span><span class="w">

</span><span class="n">count_by_city</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">bind_cols</span><span class="p">(</span><span class="n">cities_to_check_cleaned</span><span class="p">,</span><span class="w"> </span><span class="n">job_count</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">mutate</span><span class="p">(</span><span class="s2">"City Name"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">paste0</span><span class="p">(</span><span class="n">city</span><span class="p">,</span><span class="w"> </span><span class="s2">", "</span><span class="p">,</span><span class="w"> </span><span class="n">state</span><span class="p">),</span><span class="w">
         </span><span class="s2">"City Name"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">str_replace</span><span class="p">(</span><span class="n">`City Name`</span><span class="p">,</span><span class="w"> </span><span class="s1">'\\-'</span><span class="p">,</span><span class="w"> </span><span class="s1">'\\ '</span><span class="p">))</span><span class="w">
</span><span class="n">head</span><span class="p">(</span><span class="n">count_by_city</span><span class="p">)</span><span class="w">
</span><span class="c1">#           city state cost_of_living_index job_count         City Name</span><span class="w">
</span><span class="c1">#1      New York    NY               100.00       402      New York, NY</span><span class="w">
</span><span class="c1">#2      Honolulu    HI                99.54         4      Honolulu, HI</span><span class="w">
</span><span class="c1">#3 San Francisco    CA                99.14       227 San Francisco, CA</span><span class="w">
</span><span class="c1">#4     Anchorage    AK                98.69         0     Anchorage, AK</span><span class="w">
</span><span class="c1">#5    Washington    DC                95.34       368    Washington, DC</span><span class="w">
</span><span class="c1">#6     New Haven    CT                93.75         3     New Haven, CT</span><span class="w">
</span>

These are arranged by cost of living index. How about by job count?

<span class="n">count_by_city</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
   </span><span class="n">arrange</span><span class="p">(</span><span class="n">desc</span><span class="p">(</span><span class="n">job_count</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
   </span><span class="n">top_n</span><span class="p">(</span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="n">job_count</span><span class="p">)</span><span class="w">
</span><span class="c1">#          city state cost_of_living_index job_count       City Name</span><span class="w">
</span><span class="c1">#1       Newark    NJ                72.84       419      Newark, NJ</span><span class="w">
</span><span class="c1">#2  Jersey City    NJ                78.53       412 Jersey City, NJ</span><span class="w">
</span><span class="c1">#3     New York    NY               100.00       402    New York, NY</span><span class="w">
</span><span class="c1">#4     Brooklyn    NY                91.25       399    Brooklyn, NY</span><span class="w">
</span><span class="c1">#5      Seattle    WA                90.54       388     Seattle, WA</span><span class="w">
</span><span class="c1">#6    Arlington    VA                89.52       383   Arlington, VA</span><span class="w">
</span><span class="c1">#7      Fairfax    VA                84.26       372     Fairfax, VA</span><span class="w">
</span><span class="c1">#8   Washington    DC                95.34       368  Washington, DC</span><span class="w">
</span><span class="c1">#9      Fremont    CA                71.42       298     Fremont, CA</span><span class="w">
</span><span class="c1">#10    San Jose    CA                82.75       252    San Jose, CA</span><span class="w">
</span>

5. Analysis of best city for data scientists

That’s how it looks. How can we combine cost of living and number of job postings to find one best city? A good place to start is a scatterplot.
If filtering to cities with more than 20 postings,

<span class="n">library</span><span class="p">(</span><span class="n">ggthemes</span><span class="p">)</span><span class="w">
</span><span class="n">count_by_city</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">arrange</span><span class="p">(</span><span class="n">desc</span><span class="p">(</span><span class="n">job_count</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">filter</span><span class="p">(</span><span class="n">job_count</span><span class="w"> </span><span class="o">>=</span><span class="w"> </span><span class="m">20</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cost_of_living_index</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">job_count</span><span class="p">,</span><span class="w"> 
             </span><span class="n">label</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">`City Name`</span><span class="p">,</span><span class="w"> </span><span class="n">color</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">state</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_point</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w"> 
  </span><span class="n">geom_text</span><span class="p">(</span><span class="n">vjust</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">theme_gdocs</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w"> 
  </span><span class="n">ylab</span><span class="p">(</span><span class="s1">'Count of Job Postings'</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">xlab</span><span class="p">(</span><span class="s1">'Cost of living index'</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> 
  </span><span class="n">ggtitle</span><span class="p">(</span><span class="s1">'Count of Data Scientist Job Postings in 20 mile radius on Monster.com 
          vs. Cost of living index per numbeo.com
          for US Cities with 20+ Job Postings'</span><span class="p">)</span><span class="w">
</span>

Jobs vs. Cost of Living by City 2

Colors by state are not the most useful but do show relation between these cities – for example all the Bay Area brown ones between 200 and 300.

Finally, to try to arrive at some intuitive ranking of cities, I’ve reduced these two variables into a one-dimensional Data Scientist City Attractiveness index.

<span class="n">`DSCA_index`</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">log</span><span class="p">(</span><span class="n">job_count</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">.1</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">cost_of_living_index</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">100</span><span class="w">
</span>

It may be simplistic, but a starting point, a null hypothesis, if you will, of what matters when choosing a city to live in as a data scientist. Final results…

<span class="n">count_by_city</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">`DSCA_index`</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">log</span><span class="p">(</span><span class="n">job_count</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">.1</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">cost_of_living_index</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">100</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">arrange</span><span class="p">(</span><span class="n">desc</span><span class="p">(</span><span class="n">`DSCA_index`</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">top_n</span><span class="p">(</span><span class="m">20</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">select</span><span class="p">(</span><span class="n">cost_of_living_index</span><span class="o">:</span><span class="n">DSCA_index</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">`City Name`</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">reorder</span><span class="p">(</span><span class="n">`City Name`</span><span class="p">,</span><span class="w"> </span><span class="n">DSCA_index</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">`City Name`</span><span class="p">,</span><span class="w"> </span><span class="n">DSCA_index</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_bar</span><span class="p">(</span><span class="n">stat</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"identity"</span><span class="p">,</span><span class="w"> </span><span class="n">show.legend</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> 
  </span><span class="n">coord_flip</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">labs</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"City"</span><span class="p">,</span><span class="w">
       </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"DSCA index = exp(Job Posting Count) / Cost of Living Index"</span><span class="p">,</span><span class="w">
       </span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Hypothesis Generation of top 20 cities to live as a Data Scientist"</span><span class="p">)</span><span class="w">
</span>

Jobs vs. Cost of Living by City 3

Newark, NJ takes first place!

newark skyline

The scatterplot at top of this post shows all 112 US cities with a cost of living index on numbeo. Cities above 5 on the index are labeled, cities lower are not.

6. Conclusion

NYC, Seattle, DC, Bay Area, Boston, Chicago, St. Louis, and Dallas have the most data scientist monster.com job postings, standing out as the eight largest hiring metros as of the time of this scrape. Some questions about data quality of both data sources remain, such as duplicate postings or San Jose being cheaper to live in than Oakland. So would take this index with a grain of salt.

Also, even if data reliability questions are addressed, these are obviously not the only two numbers that matter. A quality-of-data-scientist-life index might consider more relevant variables than jobs and cost of living alone. So obvious areas of improvement include:

  • crime stats
  • livability indices
  • numbeo’s cost of living plus rent index
  • average commute time to the jobs in the 20 mile radius
  • smaller radii
  • total population of metro area
  • college educated population
  • quantitative masters educated population
  • natural beauty and low pollution
  • small cities with many openings
  • distribution of data scientist salaries
  • sentiment of reviews of job experience as a data scientist
  • monster.ca for Canadian postings
  • interpretability of index

DSCA v0.1 code can be found at:
https://github.com/dgarmat/data_scientist_city_attractiveness_index

To leave a comment for the author, please follow the link and comment on their blog: Dan Garmat's Blog -- R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)