Salaries by alma mater – an interactive visualization with R and plotly
Based on an interesting dataset from the Wall Street Journal, I made the above visualization of the median starting salary for US college graduates from different undergraduate institutions (I have also looked at the mid-career salaries and the salary increase, but more on that later). However, I thought that it would be a lot more informative if it were interactive. At the very least, I wanted to be able to see the school names when hovering over or clicking on the points with the mouse.
Luckily, this kind of interactivity can be easily achieved in R with the library plotly, especially due to its excellent integration with ggplot2, which I used to produce the above figure. In the following I describe how exactly this can be done.
Before I show you the interactive visualizations, a few words on the data preprocessing, and on how the map and the points are plotted with ggplot2:
- I generally use functions from the tidyverse R packages.
- I save the data in the data frame salaries, and transform the given amounts to proper floating point numbers, stripping the dollar signs and extra whitespace.
- The data provide school names. However, I need to find out the exact geographical coordinates of each school to put it on the map. This can be done very conveniently with the geocode function from the ggmap R package:

      # geocode() comes from ggmap; recent ggmap versions need a Google API key (see register_google())
      school_longlat <- geocode(salaries$school)
      school_longlat$school <- salaries$school  # keep the name as the join key
      salaries <- left_join(salaries, school_longlat, by = "school")

- For the visualization I want to disregard the colleges in Alaska and Hawaii to avoid shrinking the rest of the map. The respective rows of salaries can be easily determined with a grep search:

      grep("alaska", salaries$school, ignore.case = TRUE)
      # [1] 206
      grep("hawaii", salaries$school, ignore.case = TRUE)
      # [1] 226

- A data frame containing geographical data that can be used to plot the outline of all US states can be loaded using the function map_data from the ggplot2 package:

      states <- map_data("state")

- And I load a yellow-orange-red palette with the function brewer.pal from the RColorBrewer library, to use as a scale for the salary amounts:

      yor_col <- brewer.pal(6, "YlOrRd")

- Finally, the (not yet interactive) visualization is created with ggplot2:

      p <- ggplot(salaries[-c(206, 226), ]) +
        # state outlines from map_data("state")
        geom_polygon(aes(x = long, y = lat, group = group),
                     data = states, fill = "black", color = "white") +
        # one point per school; the extra "text" aesthetic is used later for the tooltip
        geom_point(aes(x = lon, y = lat, color = starting, text = school)) +
        coord_fixed(1.3) +
        scale_color_gradientn(name = "Starting\nSalary",
                              colors = rev(yor_col),
                              labels = scales::comma) +
        # current ggplot2 versions deprecate guides(size = FALSE) in favor of "none"
        guides(size = "none") +
        theme_bw() +
        theme(axis.text = element_blank(),
              axis.line = element_blank(),
              axis.ticks = element_blank(),
              panel.border = element_blank(),
              panel.grid = element_blank(),
              axis.title = element_blank())

Now, entering p into the R console will generate the figure shown at the top of this post.
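The cleaning and filtering steps from the list above can be sketched in base R as follows. This is only a minimal sketch, not my actual script: the helper parse_salary, the column starting_raw and the toy rows are made up for illustration, and the raw data may be formatted differently. It also shows how the Alaska and Hawaii rows could be dropped by pattern instead of by hard-coded row numbers.

```r
# Hypothetical helper: turn strings such as " $45,100.00 " into numbers
# by removing dollar signs, thousands separators and whitespace
parse_salary <- function(x) {
  as.numeric(gsub("[$,[:space:]]", "", x))
}

# Toy stand-in for the real data (school names and amounts are made up)
salaries_toy <- data.frame(
  school = c("Example State University",
             "University of Alaska Example",
             "Example University of Hawaii",
             "Example College"),
  starting_raw = c(" $45,100.00", "$40,000.00 ", "$39,500.00", "$52,300.00"),
  stringsAsFactors = FALSE
)
salaries_toy$starting <- parse_salary(salaries_toy$starting_raw)

# Drop Alaska and Hawaii by name pattern rather than by row index
outside <- grepl("alaska|hawaii", salaries_toy$school, ignore.case = TRUE)
salaries_contiguous <- salaries_toy[!outside, ]
```

Filtering by pattern keeps working if the row order of the data changes, which fixed indices such as 206 and 226 would not.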
However, we want to…
…make it interactive
The function ggplotly immediately generates a plotly interactive visualization from a ggplot object. It’s that simple! :smiley: (Though I must admit that, more often than I would be okay with, some elements of the ggplot visualization disappear or don’t look as expected. :fearful:)
The function argument tooltip can be used to specify which aesthetic mappings from the ggplot call should be shown in the tooltip. So, the code
    ggplotly(p, tooltip = c("text", "starting"),
             width = 800, height = 500)

generates the following interactive visualization.
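If the tooltip should show a formatted label rather than the raw values, one option is to build the label as a column and map it to the text aesthetic. The following is a minimal sketch under assumptions: the data frame toy, its values and the column label are made up, and only ggplot2 and plotly need to be installed.

```r
library(ggplot2)
library(plotly)

# Toy data (made-up schools, coordinates and salaries)
toy <- data.frame(
  school   = c("School A", "School B"),
  lon      = c(-71.1, -122.2),
  lat      = c(42.4, 37.4),
  starting = c(60000, 70000)
)

# Build the tooltip label once, with a thousands separator
toy$label <- paste0(
  toy$school, "\nStarting salary: $",
  formatC(toy$starting, format = "d", big.mark = ",")
)

# ggplot2 warns about the unknown "text" aesthetic; ggplotly() still uses it
p_toy <- ggplot(toy) +
  geom_point(aes(x = lon, y = lat, color = starting, text = label))

# Show only the custom label in the tooltip
fig <- ggplotly(p_toy, tooltip = "text")
```

Printing fig in an interactive session opens the plot; hovering over a point should then show the label built above.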
Now, if you want to publish a plotly visualization to https://plot.ly/, you first need to communicate your account info to the plotly R package:
    Sys.setenv("plotly_username" = "??????")
    Sys.setenv("plotly_api_key" = "????????????")

and after that, posting the visualization to your account at https://plot.ly/ is as simple as:
    # note: newer plotly releases deprecate plotly_POST() in favor of api_create()
    plotly_POST(filename = "Starting", sharing = "public")

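To avoid hard-coding credentials in a script, the two variables can instead be placed in ~/.Renviron, which R reads at startup. The sketch below only checks that they are present before an upload is attempted; has_plotly_credentials is a hypothetical helper written for this post, not a function of the plotly package.

```r
# Hypothetical helper (not part of plotly): TRUE only if both environment
# variables used above are set to non-empty values
has_plotly_credentials <- function() {
  creds <- Sys.getenv(c("plotly_username", "plotly_api_key"))
  all(nzchar(creds))
}

if (!has_plotly_credentials()) {
  message("Set plotly_username and plotly_api_key before uploading.")
}
```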
More visualizations
Finally, based on the same dataset I have generated an interactive visualization of the median mid-career salaries by undergraduate alma mater (the R script is almost identical to the one described above).
The resulting interactive visualization is embedded below.
Additionally, it is quite informative to look at a visualization of the salary increase from starting to mid-career.
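How exactly the increase is defined is a matter of choice. As a sketch with made-up numbers, and assuming the mid-career medians are stored in a column called mid_career (a hypothetical name), both the absolute and the relative increase can be computed like this:

```r
# Toy values; the column name mid_career is an assumption
salaries_toy <- data.frame(
  school     = c("School A", "School B"),
  starting   = c(50000, 40000),
  mid_career = c(90000, 80000)
)

# Absolute increase in dollars and relative increase in percent
salaries_toy$increase     <- salaries_toy$mid_career - salaries_toy$starting
salaries_toy$increase_pct <- 100 * salaries_toy$increase / salaries_toy$starting
```

Either column could then be mapped to the color aesthetic in place of starting.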