[This article was first published on r-bloggers – SHARP SIGHT LABS, and kindly contributed to R-bloggers].

A great thing about data science – and data visualization in particular – is that you can use data science tools to get information about a subject very quickly.

This is particularly true when you know how to present the information. If you know the right technique for visualizing a dataset, you can present it in a way that improves your understanding very quickly.

Here’s a case in point: I was thinking about energy, solar energy, and battery technology, and I got curious about sources of lithium (which is used in batteries and related tech).

Using a few tools from R, I created a map of lithium production within about 15 minutes. The maps that I created certainly don’t tell the full story, but they at least provide a baseline of knowledge.

If you’re fluent with the tools and techniques of data science, this becomes possible. Whether you’re just doing some personal research, or working for a multinational corporation, you can use the tools of data science to quickly identify and present insights about the world.

Let me walk you through this example and show you how I did it.

## Tutorial: how to map lithium production data using R

First, we’ll load the packages that we’re going to use.

We’ll load tidyverse mostly for access to ggplot2 and dplyr, but we’ll also load rvest (for data scraping), stringr (to help us manipulate our variables and put the data into shape), and viridis (to help us modify the color of the final plot).

```r
#--------------
# LOAD PACKAGES
#--------------

library(tidyverse)
library(rvest)
library(stringr)
library(viridis)

```

Ok. Now we’re going to scrape the lithium production data from the Wikipedia page on lithium.

Notice that we’re chaining several rvest functions together using the “pipe” operator, %>%, so that the output of each step becomes the input of the next.

If you’re not using it yet, you should definitely familiarize yourself with this operator. It’s beyond the scope of this blog post to talk extensively about the pipe operator, but I will say that it’s one of the most useful tools in the R data scientist’s toolkit. If you learn to use it properly, it will make your code easier to read and easier to write. It will even train your mind to think about analysis in a more step-by-step way.
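
As a quick illustration of what the pipe does, here is a minimal sketch that uses a made-up numeric vector (the values and the names x, nested, and piped are illustrative and not part of the lithium data). It computes the same result twice: once with nested calls, and once with %>%.

```r
library(magrittr)  # provides %>%; the operator is also available after library(tidyverse)

# Hypothetical example data
x <- c(4, 9, 16, 25)

# Nested calls: read from the inside out
nested <- round(mean(sqrt(x)), 1)

# Piped calls: read from top to bottom, one step per line
piped <- x %>%
  sqrt() %>%
  mean() %>%
  round(1)

# Both approaches give the same value
identical(nested, piped)
```

The piped version lists the steps in the order they happen, which is the main reason it tends to be easier to read.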

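Concerning the scraping code below: first we designate the URL from which we’re going to scrape the data. Then html_nodes("table") selects every table on the page, .[[9]] picks out the 9th of those tables, and html_table() parses it into a data frame. Finally, we coerce the data into a tibble (instead of keeping it as a traditional data.frame). Keep in mind that the position of a table can change whenever the Wikipedia page is edited, so check that the 9th table is still the production table when you run this.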

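```r
#---------------------------
# SCRAPE DATA FROM WIKIPEDIA
#---------------------------

df.lithium <- read_html("https://en.wikipedia.org/wiki/Lithium") %>%
  html_nodes("table") %>%   # select every <table> element on the page
  .[[9]] %>%                # keep the 9th table (its position may change as the page is edited)
  html_table() %>%          # parse the HTML table into a data frame
  as_tibble()               # as.tibble() is deprecated in favor of as_tibble()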

# INSPECT
df.lithium

```
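
If you want to see how the table selection works without hitting Wikipedia, here is a small offline sketch. The HTML snippet, the country labels, and the object names page, tables, and toy.table are all made up for illustration; only the rvest functions are the same ones used above.

```r
library(rvest)

# Hypothetical two-table page, inlined so that no network access is needed
page <- read_html("
  <html><body>
    <table><tr><th>Note</th></tr><tr><td>not the table we want</td></tr></table>
    <table>
      <tr><th>Country</th><th>Production</th></tr>
      <tr><td>Country A</td><td>1,200</td></tr>
      <tr><td>Country B</td><td>W[note 1]</td></tr>
    </table>
  </body></html>")

tables <- html_nodes(page, "table")   # every <table> on the page
length(tables)                        # how many tables were found

toy.table <- html_table(tables[[2]])  # parse only the 2nd table
toy.table
```

Checking length(tables) and printing a candidate table like this is a cheap way to confirm which index you actually need before hard-coding it.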

The resultant dataset, df.lithium, is relatively small (which makes it easy to inspect and easy to work with), but in its raw form, it’s a little messy. We’ll need to do a few things to clean up the data, like change the variable names, parse some data into numerics, etc.

So first, let’s change the column names.

There are a few ways we could do this, but the most straightforward is to simply assign a vector of manually defined column names using the colnames() function.

```r
#--------------------------------------------
# CHANGE COLUMN NAMES
# - the raw column names are capitalized and
#   have some extra information
# - we will just clean them up
#--------------------------------------------

colnames(df.lithium) <- c('country', 'production', 'reserves', 'resources')

colnames(df.lithium)

```
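
Because a scraped table can change shape when the source page is edited, it may be worth checking the number of columns before overwriting the names. The sketch below uses a made-up stand-in data frame (toy) and a hypothetical new.names vector rather than the real scraped data.

```r
# Hypothetical stand-in for the scraped table, with messy raw column names
toy <- data.frame(
  "Country"             = c("Country A", "Country B"),
  "Production (tonnes)" = c("1,200", "300"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

new.names <- c("country", "production")

# Fail early if the table does not have the number of columns we expect
stopifnot(ncol(toy) == length(new.names))

colnames(toy) <- new.names
colnames(toy)
```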

Now, we’ll remove an extraneous row of data. The original data table on Wikipedia contained not only the individual records of lithium production for particular countries, but it also contained a “total” row at the bottom of the table. More often than not, these sorts of “total” rows are not appropriate for a data.frame in R; we’ll typically remove them, just as we will do here.

```r
#-----------------------------------------------
# REMOVE "World total"
# - this is a total amount that was
#   in the original data table
# - we need to remove, because it's not a
#   proper data record for a particular country
#-----------------------------------------------

df.lithium <- df.lithium %>% filter(country != 'World total')

df.lithium

```

Next, we need to parse the numbers into actual numeric data. The reason is that when we scraped the data, it actually read in the numbers as character data, along with commas and some extra characters. We need to transform this character data into proper numeric data in R.

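To do this, we’ll need to do a few things. First, we need to deal with a few “notes” that were in the original data. This is why we’re using the code str_replace(production, "W\\[.*\\]", "-"). Essentially, we’re replacing any entry that consists of a “W” followed by a bracketed note with a plain "-" placeholder.

After that, we’re using parse_number() with na = '-', as in parse_number(production, na = '-'), to transform three variables – production, reserves, resources – into numerics. The na = '-' argument tells parse_number() to treat the "-" placeholder as a missing value.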

Note once again how we’re structuring this code. We’re using a combination of functions from dplyr and stringr to achieve our objectives.

To a beginner, this might look complicated, but it’s really not that bad once you understand the individual pieces. If you don’t understand this code (or couldn’t write it yourself), I recommend that you learn the individual functions from dplyr and stringr first, and then come back to this once you’ve learned those pieces.

```r
#---------------------------------------------------------
# PARSE NUMBERS
# - the original numeric quantities in the table
#   were read-in as character data
# - we need to "parse" this information ....
#   & transform it from character into proper numeric data
#---------------------------------------------------------

# Strip out the 'notes' from the numeric data
#str_replace(df.lithium$production, "W\\[.*\\]", "") #test

df.lithium <- df.lithium %>%
  mutate(production = str_replace(production, "W\\[.*\\]", "-"))

# inspect
df.lithium

# Parse character data into numbers
df.lithium <- df.lithium %>%
  mutate(production = parse_number(production, na = '-'),
         reserves   = parse_number(reserves, na = '-'),
         resources  = parse_number(resources, na = '-'))

# Inspect
df.lithium

```
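
To see the two parsing steps in isolation, here is a minimal sketch on a made-up character vector. The values in raw are hypothetical; they only mimic the kinds of entries described above (numbers with commas, a “W” note, and a "-" placeholder).

```r
library(stringr)
library(readr)

# Hypothetical raw values, mimicking what a scraped column can look like
raw <- c("14,300", "W[note 1]", "-", "2,000")

# Step 1: replace 'W' plus a bracketed note with the '-' placeholder
cleaned <- str_replace(raw, "W\\[.*\\]", "-")
cleaned

# Step 2: parse to numeric, treating '-' as a missing value
parsed <- parse_number(cleaned, na = "-")
parsed
```

Running the steps on a small vector like this is also a convenient way to test a regular expression before applying it to the whole data frame.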

Now we’ll get data for a map of the world. To do this, we’ll just use map_data() from ggplot2 (it relies on the maps package, so make sure that package is installed).

```r
#--------------
# GET WORLD MAP
#--------------

map.world <- map_data('world')

```

We’ll also get the names of the countries in this dataset.

The reason is that we’ll need to join this map data to the data from Wikipedia, and we’ll need the country names to be exactly the same. To make this work, we’ll need to examine the names in both datasets and modify any names that aren’t exactly the same.

Notice that once again, we’re using a combination of functions from dplyr, wiring them together using the pipe operator.

```r
#----------------------------------------------------
# Get country names
# - we can use this list and cross-reference
#   with the country names in the scraped data
# - when we find names that are not the same between
#   this map data and the scraped data, we can recode
#   the values
#----------------------------------------------------

map_data('world') %>% group_by(region) %>% summarise() %>% print(n = Inf)

```
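
Instead of comparing the two lists of names by eye, one option is to let dplyr::anti_join() report the names that have no match. This is a sketch on made-up stand-in tibbles (toy.lithium and toy.regions are hypothetical), not the exact check used in this tutorial.

```r
library(dplyr)

# Hypothetical stand-ins for the scraped data and the map's region names
toy.lithium <- tibble(country = c("Chile", "United States", "DR Congo"))
toy.regions <- tibble(region  = c("Chile", "USA", "Democratic Republic of the Congo"))

# Rows of toy.lithium whose country has no matching region in the map data
unmatched <- anti_join(toy.lithium, toy.regions, by = c("country" = "region"))
unmatched
```

With the real objects, the equivalent call would presumably be anti_join(df.lithium, distinct(map.world, region), by = c("country" = "region")), and every row it returns is a name that needs recoding.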

Ok. Now we’re going to recode some country names. Again, we’re doing this so that the country names in df.lithium are the same as the corresponding country names in map.world.

```r
#--------------------------------------------
# RECODE COUNTRY NAMES
# - some of the country names do not match
#   the names we will use later in our map
# - we will re-code so that the data matches
#   the names in the world map
#--------------------------------------------

df.lithium <- df.lithium %>%
  mutate(country = if_else(country == "Canada (2010)", "Canada",
                   if_else(country == "People's Republic of China", "China",
                   if_else(country == "United States", "USA",
                   if_else(country == "DR Congo", "Democratic Republic of the Congo", country)))))

# Inspect
df.lithium

```
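
If the list of names to recode grows, nested if_else() calls get hard to read. One alternative, sketched here on a made-up stand-in tibble, is a named lookup vector combined with coalesce(). The names toy.lithium and name.lookup are hypothetical, and this is a different technique from the nested if_else() used above, although it should produce the same recoding.

```r
library(dplyr)

# Hypothetical stand-in for df.lithium
toy.lithium <- tibble(country = c("United States", "People's Republic of China", "Chile"))

# Named vector: names are the scraped spellings, values are the map spellings
name.lookup <- c(
  "Canada (2010)"              = "Canada",
  "People's Republic of China" = "China",
  "United States"              = "USA",
  "DR Congo"                   = "Democratic Republic of the Congo"
)

# Look each country up; names without an entry come back as NA,
# and coalesce() then falls back to the original name
toy.lithium <- toy.lithium %>%
  mutate(country = coalesce(unname(name.lookup[country]), country))

toy.lithium
```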

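Ok, now we’ll join the data using dplyr::left_join(). Because map.world is the left-hand table, every map polygon is kept, and countries with no lithium data simply get NA in the production, reserves, and resources columns.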

```r
#-----------------------------------------
# JOIN DATA
# - join the map data and the scraped-data
#-----------------------------------------

df <- left_join(map.world, df.lithium, by = c('region' = 'country'))

```
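
Here is a small sketch of what the left join does, using made-up miniature versions of the two tables. The objects toy.map, toy.lithium, and toy.joined, along with the coordinates and the production figure, are hypothetical.

```r
library(dplyr)

# Hypothetical miniature versions of map.world and df.lithium
toy.map <- tibble(
  region = c("Chile", "Chile", "France", "France"),
  long   = c(-70, -71, 2, 3),
  lat    = c(-30, -31, 47, 48)
)
toy.lithium <- tibble(country = "Chile", production = 14300)

toy.joined <- left_join(toy.map, toy.lithium, by = c("region" = "country"))
toy.joined

# The join keeps every row of the map data
nrow(toy.joined) == nrow(toy.map)
```

Rows for regions with no match in the right-hand table stay in the result with NA in the joined columns.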

Now we’ll plot.

We’ll start with just a basic plot (to make sure that the map plots correctly), and then we’ll proceed to plot separate maps where the fill color corresponds to reserves, production, and resources.

```r
#-----------
# PLOT DATA
#-----------

# BASIC MAP
ggplot(data = df, aes(x = long, y = lat, group = group)) +
  geom_polygon()

# LITHIUM RESERVES
ggplot(data = df, aes(x = long, y = lat, group = group)) +
  geom_polygon(aes(fill = reserves))

ggplot(data = df, aes(x = long, y = lat, group = group)) +
  geom_polygon(aes(fill = reserves)) +
  scale_fill_viridis(option = 'plasma')

# LITHIUM PRODUCTION
ggplot(data = df, aes(x = long, y = lat, group = group)) +
  geom_polygon(aes(fill = production)) +
  scale_fill_viridis(option = 'plasma')

# LITHIUM RESOURCES
ggplot(data = df, aes(x = long, y = lat, group = group)) +
  geom_polygon(aes(fill = resources)) +
  scale_fill_viridis(option = 'plasma')

```

In the final three versions, notice as well that we’re modifying the color scales by using scale_fill_viridis().

There’s actually quite a bit more formatting that we could do on these, but as a first pass, these are pretty good.

I’ll leave it as an exercise for you to format these with titles, background colors, etc. If you choose to do this, leave your finalized code in the comments section below.
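
As one possible starting point for that exercise, here is a rough sketch of a formatted map. To keep it self-contained it uses a made-up data frame (toy.df) with two square “countries” instead of the real joined data, and it swaps in ggplot2’s built-in scale_fill_viridis_c() for the scale_fill_viridis() function from the viridis package. The title, colors, and theme choices are just assumptions about what a finished version might look like.

```r
library(ggplot2)

# Hypothetical stand-in for the joined data frame: two square "countries"
toy.df <- data.frame(
  long     = c(0, 1, 1, 0,  2, 3, 3, 2),
  lat      = c(0, 0, 1, 1,  0, 0, 1, 1),
  group    = rep(c(1, 2), each = 4),
  reserves = rep(c(5000, NA), each = 4)
)

p <- ggplot(data = toy.df, aes(x = long, y = lat, group = group)) +
  geom_polygon(aes(fill = reserves)) +
  # built-in viridis scale; polygons without data are drawn in grey
  scale_fill_viridis_c(option = "plasma", na.value = "grey80") +
  labs(title = "Lithium reserves by country", fill = "Reserves") +
  theme_minimal() +
  theme(
    panel.background = element_rect(fill = "aliceblue", colour = NA),
    panel.grid       = element_blank(),
    axis.title       = element_blank(),
    axis.text        = element_blank()
  )

p
```

To try this on the real data, you would replace toy.df with the joined df from above and change the fill variable and title for the production and resources maps.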

## To master data science, you need a plan

At several points in this tutorial, I’ve mentioned a high level plan for mastering data science: master individual pieces of a programming language, and then learn to put them together into more complicated structures.

If you can do this, you will accelerate your progress … although, the devil is in the details.

That’s actually not the only learning hack that you can use to rapidly master data science. There are lots of other tricks and learning hacks that you can use to dramatically accelerate your progress.

Want to know them?

Here at Sharp Sight, we teach data science. But we also teach you how to learn and how to study data science, so you master the tools as quickly as possible.

By signing up for our email list, you’ll get weekly tutorials about data science, delivered directly to your inbox.

You’ll also get our Data Science Crash Course, for free.