Salaries by alma mater – an interactive visualization with R and plotly

[This article was first published on Alexej's blog, and kindly contributed to R-bloggers].

Visualization of starting salaries by college

Based on an interesting dataset from the Wall Street Journal, I made the above visualization of the median starting salary for US college graduates from different undergraduate institutions (I have also looked at the mid-career salaries and the salary increase, but more on that later). However, I thought that it would be a lot more informative if it were interactive. At the very least, I wanted to be able to see the school names when hovering over or clicking on the points with the mouse.

Luckily, this kind of interactivity can be easily achieved in R with the plotly package, especially due to its excellent integration with ggplot2, which I used to produce the above figure. In the following, I describe exactly how this can be done.

Before I show you the interactive visualizations, a few words on the data preprocessing, and on how the map and the points are plotted with ggplot2:

  • I generally use functions from the tidyverse R packages.
  • I save the data in the data frame salaries, and transform the given amounts to proper floating-point numbers, stripping the dollar signs and extra whitespace.
  • The data provide school names. However, I need to find out the exact geographical coordinates of each school to put it on the map. This can be done in a very convenient way, by using the geocode function from the ggmap R package:
    school_longlat <- geocode(salaries$school)
    school_longlat$school <- salaries$school
    salaries <- left_join(salaries, school_longlat)

  • For the visualization I want to disregard the colleges in Alaska and Hawaii to avoid shrinking the rest of the map. The respective rows of salaries can be easily determined with a grep search:
    grep("alaska", salaries$school, ignore.case = TRUE)
    # [1] 206
    grep("hawaii", salaries$school, ignore.case = TRUE)
    # [1] 226

  • A data frame containing geographical data that can be used to plot the outline of all US states can be loaded using the function map_data from the ggplot2 package:
    states <- map_data("state")

  • And I load a yellow-orange-red palette with the function brewer.pal from the RColorBrewer package, to use as a scale for the salary amounts:
    yor_col <- brewer.pal(6, "YlOrRd")

  • Finally, the (as yet non-interactive) visualization is created with ggplot2:
    p <- ggplot(salaries[-c(206, 226), ]) +
        geom_polygon(aes(x = long, y = lat, group = group),
                     data = states, fill = "black",
                     color = "white") +
        geom_point(aes(x = lon, y = lat,
                       color = starting, text = school)) +
        coord_fixed(1.3) +
        scale_color_gradientn(name = "Starting\nSalary",
                              colors = rev(yor_col),
                              labels = comma) +
        guides(size = FALSE) +
        theme_bw() +
        theme(axis.text = element_blank(),
              axis.line = element_blank(),
              axis.ticks = element_blank(),
              panel.border = element_blank(),
              panel.grid = element_blank(),
              axis.title = element_blank())
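
The salary-cleaning step from the second bullet is described above but not shown as code. Here is a minimal sketch in base R, assuming the raw amounts arrive as strings such as "$75,500.00 ". The helper name parse_salary and the sample values are hypothetical and not taken from my original script:

```r
# Hypothetical helper: turn strings like " $75,500.00 " into numeric values.
parse_salary <- function(x) {
  # Drop everything except digits and the decimal point, then convert.
  as.numeric(gsub("[^0-9.]", "", x))
}

# Made-up sample values, for illustration only.
raw <- c("$75,500.00 ", " $49,200.00", "$134,000.00")
parse_salary(raw)
```

The function parse_number from the readr package (part of the tidyverse) does a similar job, and may be the more convenient choice if readr is already loaded.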

Now, entering p into the R console will generate the figure shown at the top of this post.

However, we want to…

…make it interactive

The function ggplotly immediately generates a plotly interactive visualization from a ggplot object. It’s that simple! 😃 (Though I must admit that, more often than I would like, some elements of the ggplot visualization disappear or don’t look as expected. 😨)

The function argument tooltip can be used to specify which aesthetic mappings from the ggplot call should be shown in the tooltip. So, the code

ggplotly(p, tooltip = c("text", "starting"),
         width = 800, height = 500)

generates the following interactive visualization.
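
Since the text aesthetic is just a character column, the tooltip content can also be assembled by hand before plotting. The following is only a sketch of that idea, assuming the school and starting columns used above. The data frame demo, its values, and the column name label are made up for illustration:

```r
# Made-up example rows; `school` and `starting` mirror the columns used above.
demo <- data.frame(
  school   = c("Example College", "Sample University"),
  starting = c(75500, 49200)
)

# Build one label per school; "<br>" starts a new line in a plotly tooltip.
demo$label <- paste0(
  demo$school, "<br>Starting salary: $",
  formatC(demo$starting, format = "d", big.mark = ",")
)
demo$label
```

Mapping text = label in the aes call and passing tooltip = "text" to ggplotly should then show only this hand-built label when hovering over a point.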

Now, if you want to publish a plotly visualization to https://plot.ly/, you first need to communicate your account info to the plotly R package:

Sys.setenv("plotly_username" = "??????")
Sys.setenv("plotly_api_key" = "????????????")

and after that, posting the visualization to your account at https://plot.ly/ is as simple as:

plotly_POST(filename = "Starting", sharing = "public")

More visualizations

Finally, based on the same dataset I have generated an interactive visualization of the median mid-career salaries by undergraduate alma mater (the R script is almost identical to the one described above).
The resulting interactive visualization is embedded below.

Additionally, it is quite informative to look at a visualization of the salary increase from starting to mid-career.
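
The salary increase can be derived from the two salary columns. This is a small sketch under assumptions: the column name mid_career and all values are hypothetical, and the post does not say whether the increase is shown in absolute or relative terms, so both are computed here:

```r
# Made-up values; `mid_career` is a hypothetical column name.
demo <- data.frame(
  school     = c("Example College", "Sample University"),
  starting   = c(50000, 40000),
  mid_career = c(100000, 70000)
)

# Absolute and relative increase from starting to mid-career median salary.
demo$increase     <- demo$mid_career - demo$starting
demo$increase_pct <- 100 * demo$increase / demo$starting
demo[, c("school", "increase", "increase_pct")]
```

Either derived column could then be mapped to the color aesthetic in place of starting.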
