Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

A few days ago, I was inspired by a set of photographs of Earth from space, at night.

The images are amazing, so I decided to try to replicate them using R.

To be a little more specific, I found a single dataset – a data set with French population data – and I’m using it to replicate the images for a single country: France.

## A few notes and warnings before you get started

As usual, I’ve included the code for how to create the final data visualization.

Having said that, in this case I decided to use some new tools. Therefore, before trying to run the code yourself, there are a few notes and caveats to keep in mind.

### This was built with the new sf package

I decided to build this with the new sf package.

sf is a new package for geospatial analysis that’s currently under development. You can read more about it here.

### Warning: setting up sf and dependencies may be problematic

Before getting started, you should know that installing sf may require you to install some dependencies. You can find out more here.

If you decide to install sf and the dependencies, be warned: getting the dependencies set up (specifically, gdal) may be tricky and problematic, so attempt it at your own risk.

# Code: mapping French population density

Ok, here’s the code.

First, we’ll load the packages that we’re going to need. (Of course, you need to have them installed as well.)

Notice also that we’re using the development version of ggplot2. We need this because we’ll be using geom_sf().

#=======================================
# INSTALL DEVELOPMENT VERSION OF ggplot2
#=======================================

library(devtools)
dev_mode(on = T)

#==============
#==============
library(sf)
library(tidyverse)
library(stringr)
library(scales)



### Get the data

Next, we’ll get the data from a URL at Sharp Sight and load it into R:

#=========
# GET DATA
#=========
url.france_pop <- url("http://sharpsightlabs.com/wp-content/datasets/france_population_data_2016.RData")

# INSPECT
glimpse(df.france)



### Change column names to lower case

We won’t have to make too many adjustments to the data, but we will transform the variable names to lower case.

#======================================
# CHANGE COLUMN NAMES: lowercase
# - in the original dataset the column
#   names were capitalized
# - we'll transform the names to
#   lower case
#======================================
colnames(df.france) <- colnames(df.france) %>% str_to_lower()

# INSPECT
colnames(df.france)



### Create density variable

In the visualization, we’re going to plot population density.

The dataset does not have a density variable, so we need to calculate it based on population and superficie (i.e. area). Keep in mind that superficie is in hectares, so we’ll be converting from “people/hectare” to “people/square kilometer.”

#=========================
# CREATE DENSITY VARIABLE
#=========================
df.france$population %>% summary() #------------------------- # CREATE VARIABLE: density #------------------------- df.france <- df.france %>% mutate(density = population/superficie*100) # INSPECT colnames(df.france) head(df.france)  ### Plot a test chart Next, we’ll plot a very simple test chart to see if things are roughly OK. As we do this, note how simple it is to use geom_sf(). This is one of the advantages of using the new sf package. #========================================== # PLOT TEST CHART # - here, we'll just plot a test chart # to see if the data are roughly correct # and to make sure that the data are # roughly in working order #========================================== #ggplot(df.france) + geom_sf() ggplot(df.france) + geom_sf(aes(fill = density))  This appears to be OK, but it’s tough to tell because the borders are too thick. Let’s do one more test chart to try to remove the lines and get a better look: #========================================== # PLOT SECOND TEST CHART # - here, we'll plot another test with # smaller borders # - we'll set size = .1 #========================================== ggplot(df.france) + geom_sf(aes(fill = density), color = NA)  Ok, we can see the fill color now (although the borders are still there). Looking at this, we’ll have to solve at least 2 problems: • Adjust the color scale • Remove the borders Most of the remaining work will be modifying the fill color to get the colors right, and adjust the scale so that differences between densities are more visible. ### Calculate quantiles to identify color break points Ultimately, we need to adjust the fill-colors associated with the fill aesthetic. We need to find “break points” to which we’ll associate specific fill values. To do this, we’ll calculate quantiles. #======================================================= # IDENTIFY QUANTILES # - the hardest part of creating the finalized # data visualization is getting the color scale # just right # - on final analysis, the best way to adjust the color # scale was by capping the population density # - we'll use these quantiles to find a good cap # via trial and error #======================================================= #--------- # VENTILES #--------- quantile(df.france$density, seq(from = 0, to = 1, by = .05))

# 0%           5%          10%          15%          20%          25%          30%
# 0.000000     6.815208    10.032525    12.824891    15.614076    18.485419    21.719516

# 35%          40%          45%          50%          55%          60%          65%
# 25.510974    29.538791    34.568568    40.088124    46.762867    54.489134    64.356473

# 70%          75%          80%          85%          90%          95%         100%
# 77.376987    94.117647   118.935653   160.578630   242.086849   510.070301 41929.234973

#----------------------
# PERCENTILES: 90 - 100
#----------------------

quantile(df.france$density, seq(from = .9, to = 1, by = .01)) # 90% 91% 92% 93% 94% 95% 96% 97% 98% # 242.0868 268.1167 300.2817 347.6989 412.2042 510.0703 637.9211 876.0090 1300.6435 # 99% 100% # 2649.3653 41929.2350 # MEDIAN median(df.france$density) #40.08812



### Create finalized version of chart

Let’s plot the finalized version. I’ll give a little explanation afterwards.

#==================
# DETAILED VERSION
#==================

df.france %>%
mutate(fill_var = density) %>%
ggplot() +
geom_sf(aes(fill = fill_var, color = fill_var)) +
,values = rescale(c(0,500,3000,40000))
,breaks = c(0,500,1500)
,guide = FALSE) +
,values = rescale(c(0,500,3000,40000))
,guide = FALSE) +
labs(title = "Population density in France\nvisualized to imitate satellite images of France at night"
,subtitle = "(people/sq km)" ) +
theme(text = element_text(family = "Gill Sans", color = "#E1E1E1")
,plot.title = element_text(size = 18, color = "#E1E1E1")
,plot.subtitle = element_text(size = 10)
,plot.background = element_rect(fill = "#000223")
,panel.background = element_rect(fill = "#000223")
,panel.grid.major = element_line(color = "#000223")
,axis.text = element_blank()
,axis.ticks = element_blank()
,legend.background = element_blank()
)

#==================
# TURN OFF DEV MODE
#==================
dev_mode(on = F)



OK, the truth is that to create this finalized version, you actually need to do a lot of trial and error.

Essentially, you need to find the right cutoffs in the data and associate those values to specific fill values.

The point where we do that is in the following code:

    scale_fill_gradientn(colours = c("#000233","#f9cf86","#fceccf","white")
,values = rescale(c(0,500,3000,40000))


Here, we’re taking values of our fill variable, density, and we’re associating those values to specific colors. How do we know which values to use and which colors to use?

Trial and error.

You need to take the quantile values that we calculated earlier, and test them out. Try different cutoff values for values = . Try different fill-values for colours = .

## To create visualizations like this, master ggplot2

I’ll admit that this data visualization is a little challenging to create, simple because of the extensive trial and error that’s required.

It’s also a little out of the ordinary, because we’re using the some packages that are under development.

Having said that, this visualization is still essentially just a fancy application of ggplot2. If you really know what you’re doing with ggplot2 – if you’ve already mastered the basics – then this plot is fairly straightforward.

So, if you want to learn to create compelling data visualizations, master ggplot2.

To get a job as a highly-paid data scientist, you need to know how to create compelling data visualizations.

To create compelling visualizations in R, you need to master ggplot2.

By signing up, you’ll get our free Data Science Crash Course, where you’ll learn:

• the foundations of ggplot2
• 3 essential data visualizations
• a step-by-step data science learning plan
• the 1 programming language you need to learn
• how to do data manipulation in R
• how to get started with machine learning
• and more …

In addition, you’ll also get weekly data science tutorials, delivered right to your inbox.