Mapping lithium production using R
A great thing about data science – and data visualization in particular – is that you can use data science tools to get information about a subject very quickly.
This is particularly true when you know how to present the information: with the right visualization technique, you can present data in a way that increases your understanding very quickly.
Here’s a case in point: I was thinking about energy, solar energy, and battery technology, and I got curious about sources of lithium (which is used in batteries and related tech).
Using a few tools from
If you’re fluent with the tools and techniques of data science, this becomes possible. Whether you’re just doing some personal research, or working for a multinational corporation, you can use the tools of data science to quickly identify and present insights about the world.
Let me walk you through this example and show you how I did it.
Tutorial: how to map lithium production data using R
First, we’ll load the packages that we’re going to use.
We’ll load tidyverse, rvest, stringr, and viridis.
#--------------
# LOAD PACKAGES
#--------------
library(tidyverse)
library(rvest)
library(stringr)
library(viridis)
Ok. Now, we’re going to scrape this mineral production data from the Wikipedia page.
Notice that we’re essentially using several rvest functions in series, chained together with the “pipe” operator, %>%.
If you’re not using it yet, you should definitely familiarize yourself with this operator, and start using it. It’s beyond the scope of this blog post to talk extensively about the pipe operator, but I will say that it’s one of the most useful tools you can learn.
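As a minimal sketch of what the pipe does (the numbers are made up for illustration): x %>% f() is another way of writing f(x), so a chain of pipes reads left to right instead of inside out.

```r
library(magrittr)  # provides %>%; dplyr and rvest re-export it as well

values <- c(1, 4, 9)

# Nested calls read inside out
nested <- sum(sqrt(values))

# The same computation with the pipe reads left to right
piped <- values %>% sqrt() %>% sum()

identical(nested, piped)
```

Both versions compute the same thing; the piped version just lists the steps in the order they happen.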
Concerning the code: we designate the URL from which we’re going to scrape the data, select the table elements on the page, pick out the 9th table, and finally coerce the data into a tibble.
#---------------------------
# SCRAPE DATA FROM WIKIPEDIA
#---------------------------
df.lithium <- read_html("https://en.wikipedia.org/wiki/Lithium") %>%
  html_nodes("table") %>%
  .[[9]] %>%
  html_table() %>%
  as_tibble()  # as.tibble() is deprecated in current versions of tibble

# INSPECT
df.lithium
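The hard-coded index .[[9]] depends on how the page was laid out when the code was written, so it may point to a different table if the page is edited. One hedged alternative is to pick the table by its column header instead. The HTML snippet and its values below are invented stand-ins, not the real Wikipedia markup, and the "Country" header is an assumption.

```r
library(rvest)

# Hypothetical stand-in for a page with several tables (made-up values)
page <- read_html('
  <table><tr><th>Property</th><th>Value</th></tr>
         <tr><td>Symbol</td><td>Li</td></tr></table>
  <table><tr><th>Country</th><th>Production</th></tr>
         <tr><td>Chile</td><td>10</td></tr>
         <tr><td>World total</td><td>30</td></tr></table>')

# Parse every table on the page into a list of data frames
tables <- page %>% html_nodes("table") %>% html_table()

# Keep the first table that has a "Country" column
has_country <- vapply(tables, function(tbl) "Country" %in% names(tbl), logical(1))
df.example <- tables[[which(has_country)[1]]]

df.example
```

This fails loudly (with a subscript error) if no table has the expected header, which is usually easier to debug than silently scraping the wrong table.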
The resultant dataset, df.lithium, still needs a bit of cleanup before we can use it.
So first, let’s change the column names.
There are a few ways we could do this, but the most straightforward is to simply pass a vector of manually-defined column names into the colnames() function.
#--------------------------------------------
# CHANGE COLUMN NAMES
# - the raw column names are capitalized and
#   have some extra information
# - we will just clean them up
#--------------------------------------------
colnames(df.lithium) <- c('country', 'production', 'reserves', 'resources')

colnames(df.lithium)
Now, we’ll remove an extraneous row of data. The original data table on Wikipedia contained not only the individual records of lithium production for particular countries, but also a “total” row at the bottom of the table. More often than not, these sorts of “total” rows are not appropriate for a data frame, because a total is not a proper record for a particular country, so we’ll filter it out.
#-----------------------------------------------
# REMOVE "World total"
# - this is a total amount that was
#   in the original data table
# - we need to remove, because it's not a
#   proper data record for a particular country
#-----------------------------------------------
df.lithium <- df.lithium %>% filter(country != 'World total')

df.lithium
Next, we need to parse the numbers into actual numeric data. When we scraped the data, the numbers were read in as character data, along with commas and some extra characters. We need to transform this character data into proper numeric data in R.
To do this, we’ll need to do a few things. First, we need to remove a few “notes” that were in the original data. This is why we’re using the code str_replace(production, "W\\[.*\\]", "-"), which replaces those notes with a dash.
After that, we’re using parse_number() to transform the character data into numeric data, with na = '-' so that the dashes are read as missing values.
Note once again how we’re structuring this code. We’re using a combination of functions from dplyr, stringr, and readr, chained together with the pipe operator.
To a beginner, this might look complicated, but it’s really not that bad once you understand the individual pieces. If you don’t understand this code (or couldn’t write it yourself), I recommend that you learn the individual functions from these packages first, and then practice putting them together.
#---------------------------------------------------------
# PARSE NUMBERS
# - the original numeric quantities in the table
#   were read-in as character data
# - we need to "parse" this information ....
#   & transform it from character into proper numeric data
#---------------------------------------------------------

# Strip out the 'notes' from the numeric data
#str_replace(df.lithium$production,"W\\[.*\\]", "") #test

df.lithium <- df.lithium %>%
  mutate(production = str_replace(production, "W\\[.*\\]", "-"))

# inspect
df.lithium

# Parse character data into numbers
df.lithium <- df.lithium %>%
  mutate(production = parse_number(production, na = '-')
         ,reserves = parse_number(reserves, na = '-')
         ,resources = parse_number(resources, na = '-')
         )

# Inspect
df.lithium
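To see what the two parsing steps do in isolation, here is a small sketch on a made-up character vector. The strings are invented examples in the style of the scraped columns, not values from the actual table.

```r
library(stringr)
library(readr)

# Made-up strings in the style of the scraped columns
raw <- c("14,100", "W[note 1]", "2,000,000", "-")

# Replace notes of the form "W[...]" with a dash
cleaned <- str_replace(raw, "W\\[.*\\]", "-")

# Parse to numeric; commas are treated as grouping marks and "-" becomes NA
parsed <- parse_number(cleaned, na = "-")

parsed
```

Running each step on a plain vector like this is a quick way to check a regular expression before applying it inside mutate().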
Now we’ll get data for a map of the world. To do this, we’ll just use map_data().
#--------------
# GET WORLD MAP
#--------------
map.world <- map_data('world')
We’ll also get the names of the countries in this dataset.
The reason is that we’ll need to join this map data to the data from Wikipedia, and we’ll need the country names to be exactly the same. To make this work, we’ll need to examine the names in both datasets and modify any names that aren’t exactly the same.
Notice that once again, we’re using a combination of functions from dplyr, wired together using the pipe operator.
#----------------------------------------------------
# Get country names
# - we can use this list and cross-reference
#   with the country names in the scraped data
# - when we find names that are not the same between
#   this map data and the scraped data, we can recode
#   the values
#----------------------------------------------------
map_data('world') %>%
  group_by(region) %>%
  summarise() %>%
  print(n = Inf)
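Instead of cross-referencing the two lists of names by eye, one possible shortcut is an anti-join, which returns only the rows that have no match in the other table. This is a sketch on toy stand-ins; in the tutorial the inputs would be df.lithium and the regions from map_data('world').

```r
library(dplyr)

# Toy stand-ins for the scraped data and the map's region names
scraped   <- tibble(country = c("Chile", "United States", "People's Republic of China"))
map.names <- tibble(region = c("Chile", "USA", "China"))

# Rows of `scraped` whose country has no matching region in the map data
unmatched <- anti_join(scraped, map.names, by = c("country" = "region"))

unmatched
```

Whatever shows up in the result is the list of names that still need to be recoded.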
Ok. Now we’re going to recode some country names. Again, we’re doing this so that the country names in df.lithium are exactly the same as the corresponding names in map.world.
#--------------------------------------------
# RECODE COUNTRY NAMES
# - some of the country names do not match
#   the names we will use later in our map
# - we will re-code so that the data matches
#   the names in the world map
#--------------------------------------------
df.lithium <- df.lithium %>%
  mutate(country = if_else(country == "Canada (2010)", 'Canada'
                   ,if_else(country == "People's Republic of China", "China"
                   ,if_else(country == "United States", "USA"
                   ,if_else(country == "DR Congo", "Democratic Republic of the Congo", country))))
         )

# Inspect
df.lithium
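Nested if_else() calls work, but they get hard to read as the list of names grows. As an alternative sketch (not the approach used above), dplyr’s case_when() expresses the same kind of recoding as a flat list of conditions. The toy tibble below is a stand-in for df.lithium.

```r
library(dplyr)

# Toy stand-in for the scraped data
toy <- tibble(country = c("Canada (2010)", "United States", "Chile"))

toy_recoded <- toy %>%
  mutate(country = case_when(
    country == "Canada (2010)" ~ "Canada",
    country == "United States" ~ "USA",
    TRUE                       ~ country  # leave every other name unchanged
  ))

toy_recoded
```

Conditions are checked in order, and the final TRUE line acts as the fallback for names that don’t need recoding.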
Ok, now we’ll join the data using left_join().
#-----------------------------------------
# JOIN DATA
# - join the map data and the scraped-data
#-----------------------------------------
df <- left_join(map.world, df.lithium, by = c('region' = 'country'))
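A left join keeps every row of the map data and fills in missing values wherever a region has no lithium record. Here is a small sketch of that behavior on toy tables (all coordinates and values are made up):

```r
library(dplyr)

# Toy stand-ins with made-up values
map.toy <- tibble(
  region = c("Chile", "Chile", "USA", "France"),
  long   = c(-70, -71, -100, 2),
  lat    = c(-30, -31, 40, 47)
)
lithium.toy <- tibble(country = c("Chile", "USA"), production = c(10, 5))

joined <- left_join(map.toy, lithium.toy, by = c("region" = "country"))

# Every map row is kept; regions without lithium data get NA
joined

# Number of map rows that have no lithium data attached
n_missing <- sum(is.na(joined$production))
```

Because the map has many rows per country, the lithium values are repeated on each of that country’s rows, which is what lets us use them as a fill color.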
Now we’ll plot.
We’ll start with just a basic plot (to make sure that the map plots correctly), and then we’ll proceed to plot separate maps where the fill color corresponds to reserves, production, and resources.
#-----------
# PLOT DATA
#-----------

# BASIC MAP
ggplot(data = df, aes(x = long, y = lat, group = group)) +
  geom_polygon()

# LITHIUM RESERVES
ggplot(data = df, aes(x = long, y = lat, group = group)) +
  geom_polygon(aes(fill = reserves))

ggplot(data = df, aes(x = long, y = lat, group = group)) +
  geom_polygon(aes(fill = reserves)) +
  scale_fill_viridis(option = 'plasma')

# LITHIUM PRODUCTION
ggplot(data = df, aes(x = long, y = lat, group = group)) +
  geom_polygon(aes(fill = production)) +
  scale_fill_viridis(option = 'plasma')

# LITHIUM RESOURCES
ggplot(data = df, aes(x = long, y = lat, group = group)) +
  geom_polygon(aes(fill = resources)) +
  scale_fill_viridis(option = 'plasma')
In the final three versions, notice as well that we’re modifying the color scales by using scale_fill_viridis().
There’s actually quite a bit more formatting that we could do on these, but as a first pass, these are pretty good.
I’ll leave it as an exercise for you to format these with titles, background colors, etc. If you choose to do this, leave your finalized code in the comments section below.
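If you’d like a starting point for that exercise, here is one possible sketch. The toy polygons, the values, and the title text are placeholders; with the real data you would use the joined df from above instead.

```r
library(ggplot2)

# Two toy square "countries" with made-up values,
# standing in for the joined map data
df.toy <- data.frame(
  long     = c(0, 1, 1, 0,  2, 3, 3, 2),
  lat      = c(0, 0, 1, 1,  0, 0, 1, 1),
  group    = rep(c(1, 2), each = 4),
  reserves = rep(c(100, 500), each = 4)
)

p <- ggplot(df.toy, aes(x = long, y = lat, group = group)) +
  geom_polygon(aes(fill = reserves)) +
  scale_fill_viridis_c(option = "plasma") +  # ggplot2's built-in viridis scale
  labs(title = "Lithium reserves by country", fill = "Reserves") +
  theme_void() +                             # drop axes and grid lines
  theme(plot.title = element_text(face = "bold"))

p
```

labs() sets the title and legend label, and theme_void() removes the axes and grid, which rarely add much to a world map.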
To master data science, you need a plan
At several points in this tutorial, I’ve mentioned a high level plan for mastering data science: master individual pieces of a programming language, and then learn to put them together into more complicated structures.
If you can do this, you will accelerate your progress … although, the devil is in the details.
That’s actually not the only learning hack that you can use to rapidly master data science. There are lots of other tricks and learning hacks that you can use to dramatically accelerate your progress.
Want to know them?
Sign up for our email list.
Here at Sharp Sight, we teach data science. But we also teach you how to learn and how to study data science, so you master the tools as quickly as possible.
By signing up for our email list, you’ll get weekly tutorials about data science, delivered directly to your inbox.
You’ll also get our Data Science Crash Course, for free.
The post Mapping lithium production using R appeared first on SHARP SIGHT LABS.