Visualization of Vancouver Cherry Trees with ggplot2 and R

October 22, 2018
By

(This article was first published on R on YIHAN WU, and kindly contributed to R-bloggers)

Motivation: I created this image as part of a graduate course I audited in 2018. In short, I learned the very simple basics of working with a shapefile (.kml) in R. Otherwise, everything was done using ggplot2 and base R. It also was originally knit as an html file with tabs.

Data

City of Vancouver Trees dataset with locations (direct link download) City of Vancouver Neighbourhood Boundaries (direct link download) ## Code

The first step is to download the two data files, one a text file containing tree locations, and the other a shapefile file containing polygons of Vancouver’s neighbourhoods, which outline the shape and geographical coordinates of each neighbourhood.

if (file.exists("StreetTrees_CityWide.csv")==F ) {
  download.file("ftp://webftp.vancouver.ca/OpenData/csv/csv_street_trees.zip", "csv_street_trees.zip")
  unzip(zipfile = "csv_street_trees.zip", files = "StreetTrees_CityWide.csv")
}

# download shapefile
if (file.exists("cov_localareas.kml")==F){
  download.file("http://data.vancouver.ca/download/kml/cov_localareas.kml", "cov_localareas.kml")
}

As the ZIP file of tree locations contain multiple csv files for each neighbourhood as well as the entire city, we extracted only the file with all the trees.

To examine the contents of a zip file in R, use the list argument to list files:

unzip(zipfile = "path_to_zip_file", list=T)

This will list the files in the zip archive rather than extract all. unzip("path_to_zip_file") will extract the entire zip archive.

To read in the data files, I used rgdal to read in the KML file.

# install.packages("rgdal")
library(rgdal)
## Loading required package: sp
## rgdal: version: 1.3-2, (SVN revision 755)
##  Geospatial Data Abstraction Library extensions to R successfully loaded
##  Loaded GDAL runtime: GDAL 2.2.3, released 2017/11/20
##  Path to GDAL shared files: C:/Users/YIHANWU/Documents/R/win-library/3.5/rgdal/gdal
##  GDAL binary built with GEOS: TRUE 
##  Loaded PROJ.4 runtime: Rel. 4.9.3, 15 August 2016, [PJ_VERSION: 493]
##  Path to PROJ.4 shared files: C:/Users/YIHANWU/Documents/R/win-library/3.5/rgdal/proj
##  Linking to sp version: 1.3-1
locmap<-readOGR("cov_localareas.kml", "local_areas_region", stringsAsFactors = F)
## OGR data source with driver: KML 
## Source: "C:\Users\YIHANWU\Documents\Stuff\blogdown-ywu\content\post\cov_localareas.kml", layer: "local_areas_region"
## with 22 features
## It has 2 fields

Reading in the trees file is simple with read.csv() or you can use readr::read._csv() if you have the readr installed.

treedat<-read.csv("StreetTrees_CityWide.csv")
summary(treedat)
##     TREE_ID        CIVIC_NUMBER              STD_STREET    
##  Min.   :    12   Min.   :    0   CAMBIE ST       :  1651  
##  1st Qu.: 63434   1st Qu.: 1306   W KING EDWARD AV:  1616  
##  Median :130873   Median : 2607   W 13TH AV       :  1251  
##  Mean   :128158   Mean   : 2942   W 16TH AV       :  1155  
##  3rd Qu.:190392   3rd Qu.: 4012   W 14TH AV       :  1116  
##  Max.   :259643   Max.   :17888   W 15TH AV       :  1087  
##                                   (Other)         :137190  
##                 NEIGHBOURHOOD_NAME            ON_STREET     
##  RENFREW-COLLINGWOOD     :11313    CAMBIE ST       :  1590  
##  KENSINGTON-CEDAR COTTAGE:11264    W KING EDWARD AV:  1507  
##  HASTINGS-SUNRISE        :10254    W 16TH AV       :  1111  
##  DUNBAR-SOUTHLANDS       : 9334    W 13TH AV       :  1081  
##  SUNSET                  : 8355    W 37TH AV       :   967  
##  KITSILANO               : 8010    W 15TH AV       :   949  
##  (Other)                 :86536    (Other)         :137861  
##  ON_STREET_BLOCK STREET_SIDE_NAME ASSIGNED   HEIGHT_RANGE_ID 
##  Min.   :   0    BIKE MED:   38   N:129569   Min.   : 0.000  
##  1st Qu.:1300    EVEN    :70915   Y: 15497   1st Qu.: 1.000  
##  Median :2600    MED     : 3293              Median : 2.000  
##  Mean   :2912    ODD     :70685              Mean   : 2.621  
##  3rd Qu.:4000    PARK    :   68              3rd Qu.: 3.000  
##  Max.   :9900    TBD     :   67              Max.   :10.000  
##                                                              
##     DIAMETER        DATE_PLANTED        PLANT_AREA    ROOT_BARRIER
##  Min.   :   0.00   Min.   :19891027   10     :20438   N:135803    
##  1st Qu.:   3.00   1st Qu.:20040225   8      :15651   Y:  9263    
##  Median :   8.50   Median :20180128   N      :13944               
##  Mean   :  11.25   Mean   :20112434   6      :13813               
##  3rd Qu.:  16.00   3rd Qu.:20180128   7      :11857               
##  Max.   :1819.00   Max.   :20180128   12     : 9210               
##                                       (Other):60153               
##  CURB             CULTIVAR_NAME      GENUS_NAME         SPECIES_NAME  
##  N: 12736                :67157   ACER    :35551   SERRULATA  :13513  
##  Y:132330   KWANZAN      :10658   PRUNUS  :31134   CERASIFERA :12455  
##             ATROPURPUREUM: 9340   FRAXINUS: 7332   PLATANOIDES:12008  
##             FASTIGIATA   : 4119   TILIA   : 6884   RUBRUM     : 8292  
##             NIGRA        : 3261   QUERCUS : 5981   AMERICANA  : 5562  
##             BOWHALL      : 2455   CARPINUS: 5679   BETULUS    : 5116  
##             (Other)      :48076   (Other) :52505   (Other)    :88120  
##                       COMMON_NAME        LATITUDE       LONGITUDE     
##  KWANZAN FLOWERING CHERRY   : 10658   Min.   : 0.00   Min.   :-123.2  
##  PISSARD PLUM               :  9103   1st Qu.:49.22   1st Qu.:-123.1  
##  NORWAY MAPLE               :  5683   Median :49.24   Median :-123.1  
##  CRIMEAN LINDEN             :  4489   Mean   :42.21   Mean   :-105.5  
##  PYRAMIDAL EUROPEAN HORNBEAM:  3446   3rd Qu.:49.26   3rd Qu.:-123.0  
##  NIGHT PURPLE LEAF PLUM     :  3261   Max.   :49.29   Max.   :   0.0  
##  (Other)                    :108426

The tree dataframe contains 20 column which thankfully come with self-explanatory column names. Weirdly enough, everything is in capitals.

head(treedat)
##   TREE_ID CIVIC_NUMBER       STD_STREET NEIGHBOURHOOD_NAME
## 1     589         4694         W 8TH AV    WEST POINT GREY
## 2     592         1799 W KING EDWARD AV        SHAUGHNESSY
## 3     594         2401         W 1ST AV          KITSILANO
## 4     595         2405         W 1ST AV          KITSILANO
## 5     596         2428         W 1ST AV          KITSILANO
## 6     598         2428         W 1ST AV          KITSILANO
##          ON_STREET ON_STREET_BLOCK STREET_SIDE_NAME ASSIGNED
## 1        BLANCA ST            2400             EVEN        N
## 2 W KING EDWARD AV            1700              MED        N
## 3         W 1ST AV            2400              ODD        N
## 4         W 1ST AV            2400              ODD        N
## 5         W 1ST AV            2400             EVEN        N
## 6         W 1ST AV            2400             EVEN        N
##   HEIGHT_RANGE_ID DIAMETER DATE_PLANTED PLANT_AREA ROOT_BARRIER CURB
## 1               5       32     20180128          B            N    Y
## 2               1        3     20100104         30            N    Y
## 3               4       14     20180128          B            N    Y
## 4               2        3     19970109          B            N    Y
## 5               2       11     20180128          B            N    Y
## 6               2        3     20180128          B            N    Y
##   CULTIVAR_NAME GENUS_NAME SPECIES_NAME              COMMON_NAME LATITUDE
## 1                    ALNUS       RUGOSA           SPECKLED ALDER 49.26532
## 2                    ABIES      GRANDIS                GRAND FIR 49.24952
## 3                    PINUS        NIGRA            AUSTRIAN PINE 49.27096
## 4     CALOCARPA      MALUS         ZUMI         REDBUD CRABAPPLE  0.00000
## 5                   PRUNUS    SARGENTII SARGENT FLOWERING CHERRY 49.27083
## 6                   PRUNUS    SARGENTII SARGENT FLOWERING CHERRY 49.27084
##   LONGITUDE
## 1 -123.2150
## 2 -123.1457
## 3 -123.1599
## 4    0.0000
## 5 -123.1602
## 6 -123.1605

We can drop those columns which we don’t need.

treedat <- treedat[, c("COMMON_NAME", "LATITUDE", "LONGITUDE")]

To map cherry trees across Vancouver, we first have to filter out the cherry trees. As treedat has a COMMON_NAME column, a string match using grep on that column will tell us which trees have “cherry” in their name and we can subset the entire dataframe for only cherry trees.

cherry_trees<-treedat[grep("cherry", treedat$COMMON_NAME, ignore.case = T),]

By using ignore.case = T, we do case-insensitive pattern matching without the need to specify “CHERRY” in the pattern argument.

dim(cherry_trees)
## [1] 18245     3

There are nrow(cherry_trees) in Vancouver. We could just plot all of them and be done, but since we have cultivar information, we can examine individual cultivar distribution across the city. First, we use a table to get the different cherries by cultivars, and how many there are.

cherry_names<-as.data.frame(table(cherry_trees$COMMON_NAME))

Then we sort the table by the most frequent cultivars.

top_cherry_names<-cherry_names[order(-cherry_names$Freq),]
head(top_cherry_names, n = 10)
##                           Var1  Freq
## 303   KWANZAN FLOWERING CHERRY 10658
## 2     AKEBONO FLOWERING CHERRY  2295
## 554       UKON JAPANESE CHERRY   856
## 346             MAZZARD CHERRY   603
## 402     PINK PERFECTION CHERRY   529
## 435      RANCHO SARGENT CHERRY   513
## 280 JAPANESE FLOWERING CHERRY    484
## 482          SHIROFUGEN CHERRY   340
## 483   SHIROTAE(MT FUJI) CHERRY   229
## 472      SCHUBERT CHOKECHERRY    214

Kwanzan Flowering Cherry appears to be the most common. I will take only the top nine cultivars, and then filter my cherry tree dataset for only the trees with common names matching one of the top nine.

top_cultivars<-top_cherry_names[1:9,]
cherry_filtered<-cherry_trees[cherry_trees$COMMON_NAME %in% top_cultivars$Var1,]
summary(cherry_filtered)
##                    COMMON_NAME       LATITUDE       LONGITUDE     
##  KWANZAN FLOWERING CHERRY:10658   Min.   : 0.00   Min.   :-123.2  
##  AKEBONO FLOWERING CHERRY: 2295   1st Qu.:49.22   1st Qu.:-123.1  
##  UKON JAPANESE CHERRY    :  856   Median :49.24   Median :-123.1  
##  MAZZARD CHERRY          :  603   Mean   :43.65   Mean   :-109.1  
##  PINK PERFECTION CHERRY  :  529   3rd Qu.:49.26   3rd Qu.:-123.1  
##  RANCHO SARGENT CHERRY   :  513   Max.   :49.29   Max.   :   0.0  
##  (Other)                 : 1053

But some trees seem to have wonky latitude or longitude coordinates. If we look at the distribution, these values stand part from the group

hist(cherry_filtered$LATITUDE)
hist(cherry_filtered$LONGITUDE)

We can get rid of them by roughly excluding them through subsetting based on value. All trees must have a latitude value greater than 40 and a longitude smaller than -100.

cherry_filtered<-cherry_filtered[cherry_filtered$LATITUDE > 40 & cherry_filtered$LONGITUDE < -100,]

I don’t like the fact everything is in capitals but R (and its help pages) are always there to save the day. Making everything lowercase is through tolower(), and its opposite is the toupper() function. But for sentence case, the simpleCap function is from the examples is the easiest solution. There is also a more complicated capwords function but simpleCap works well enough.

simpleCap <- function(x) {
  s <- strsplit(x, " ")[[1]]
  paste(toupper(substring(s, 1,1)), substring(s, 2),
        sep="", collapse=" ")
}

What simpleCap does is takes in a string such as “the quick red fox jumps over the lazy brown dog”, splits each word into letters, and then capitalizes the first letter before pasting all the letters in the word back together.

For example,

simpleCap("the quick red fox jumps over the lazy brown dog")
## [1] "The Quick Red Fox Jumps Over The Lazy Brown Dog"

We can use the sapply() function to use this simpleCap function on every row of the COMMON_NAME column.

cherry_filtered$COMMON_NAME<-sapply(as.character(tolower(cherry_filtered$COMMON_NAME)), simpleCap)
head(cherry_filtered)
##                 COMMON_NAME LATITUDE LONGITUDE
## 8  Kwanzan Flowering Cherry 49.27096 -123.1601
## 12 Kwanzan Flowering Cherry 49.27097 -123.1605
## 24 Kwanzan Flowering Cherry 49.27086 -123.1615
## 25 Kwanzan Flowering Cherry 49.27086 -123.1615
## 71 Shirotae(mt Fuji) Cherry 49.27062 -123.1466
## 72 Shirotae(mt Fuji) Cherry 49.27062 -123.1467

Notice that the capitalization for Shirotae did not work inside the parantheses due to the lack of a space making it a separate word.

simpleCap("Shirotae(mt Fuji)")
## [1] "Shirotae(mt Fuji)"

But if we manually add a space, it still does not work.

simpleCap("Shirotae (mt Fuji)")
## [1] "Shirotae (mt Fuji)"

But rather, this is because the ( character is the first letter rather than m.

simpleCap("Shirotae ( mt Fuji)")
## [1] "Shirotae ( Mt Fuji)"

In this case, we can just manually correct this error by editing every “Shirotae(mt Fuji)” string.

We first make the cultivar names into factors, and then edit the level name with the corrected capitalization.

str(cherry_filtered)
## 'data.frame':    14631 obs. of  3 variables:
##  $ COMMON_NAME: chr  "Kwanzan Flowering Cherry" "Kwanzan Flowering Cherry" "Kwanzan Flowering Cherry" "Kwanzan Flowering Cherry" ...
##  $ LATITUDE   : num  49.3 49.3 49.3 49.3 49.3 ...
##  $ LONGITUDE  : num  -123 -123 -123 -123 -123 ...
cherry_filtered$COMMON_NAME<-as.factor(cherry_filtered$COMMON_NAME)
str(cherry_filtered)
## 'data.frame':    14631 obs. of  3 variables:
##  $ COMMON_NAME: Factor w/ 9 levels "Akebono Flowering Cherry",..: 3 3 3 3 8 8 8 8 3 3 ...
##  $ LATITUDE   : num  49.3 49.3 49.3 49.3 49.3 ...
##  $ LONGITUDE  : num  -123 -123 -123 -123 -123 ...
levels(cherry_filtered$COMMON_NAME)[levels(cherry_filtered$COMMON_NAME)=="Shirotae(mt Fuji) Cherry"] <- "Shirotae (Mt Fuji) Cherry"
head(cherry_filtered)
##                  COMMON_NAME LATITUDE LONGITUDE
## 8   Kwanzan Flowering Cherry 49.27096 -123.1601
## 12  Kwanzan Flowering Cherry 49.27097 -123.1605
## 24  Kwanzan Flowering Cherry 49.27086 -123.1615
## 25  Kwanzan Flowering Cherry 49.27086 -123.1615
## 71 Shirotae (Mt Fuji) Cherry 49.27062 -123.1466
## 72 Shirotae (Mt Fuji) Cherry 49.27062 -123.1467

Onto the plot!

ggplot2 does not accept spatial objects such as a SpatialPolygonsDataFrame directly so we have to modify it into a dataframe. Spatial packages in R have their own plotting methods to plot spatial objects.

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.5.1
vancouver_map<-fortify(locmap, region="Name")
head(vancouver_map)
##        long      lat order  hole piece            id           group
## 1 -123.1641 49.25748     1 FALSE     1 Arbutus-Ridge Arbutus-Ridge.1
## 2 -123.1639 49.25746     2 FALSE     1 Arbutus-Ridge Arbutus-Ridge.1
## 3 -123.1636 49.25745     3 FALSE     1 Arbutus-Ridge Arbutus-Ridge.1
## 4 -123.1626 49.25743     4 FALSE     1 Arbutus-Ridge Arbutus-Ridge.1
## 5 -123.1603 49.25740     5 FALSE     1 Arbutus-Ridge Arbutus-Ridge.1
## 6 -123.1579 49.25736     6 FALSE     1 Arbutus-Ridge Arbutus-Ridge.1

Here, you can see the latitudes and longitude of each point, the order each point is plotted in with lines that will connect the points, and also columns designating which groups the points belong to.

The first thing to plot is our neighbourhoods. Don’t forget by convention longitude is usually plotted on the x-axis and latitude on the y.

p <- ggplot(data=vancouver_map, aes(x=long, y=lat)) + 
  geom_path(aes(group=group))
p

We can use a different theme to get rid of the grey background.

p <- p + theme_void()
p

We can plot each tree as a single point.

p + geom_point(data = cherry_filtered, aes(x=LONGITUDE, y=LATITUDE))

You can see the points outlining the streets and overplotting in dense areas. We can use a density geom instead by dividing the plot up into smaller areas and counting up the number of trees in them.

p + stat_bin2d(data = cherry_filtered, aes(x=LONGITUDE, y=LATITUDE))

Using more bins … …

q <- p + stat_bin2d(data = cherry_filtered, aes(x=LONGITUDE, y=LATITUDE), bins=40)

q

We can facet the plot for the nine different cultivars in a 3 by 3 grid. And add a caption for the data source as well as a figure title.

figure_title<-expression(paste("Vancouver's nine most common cherry cultivars (", italic("Prunus sp."), ")"))

q <- q + facet_wrap(~COMMON_NAME, nrow=3) + labs(x=NULL, y=NULL, title=figure_title, caption="Data: City of Vancouver 2015, 2016")
q

To fit with the cherries, we can modify the colors used. It’s also possible to move the legend to the bottom and make it a horizontal scale.

q + 
  scale_fill_gradient(low="pink1", high="purple4", name="Density") + 
  theme(legend.direction = "horizontal",
        legend.position = "bottom",
        plot.title = element_text(hjust=0.5)
        ) 

Lastly, rather than use square, it is possible to use hexagonal bins.

cherrymap<-ggplot(data=vancouver_map, aes(x=long, y=lat)) + 
  geom_path(aes(group=group))+ 
  geom_hex(data=cherry_filtered, aes(x=LONGITUDE, y=LATITUDE), alpha=0.9, bins=30) +
  scale_fill_gradient(low="pink1", high="purple4", name="Density") +
  theme_void() + facet_wrap(~COMMON_NAME, nrow=3) + 
  labs(x=NULL, y=NULL, title=figure_title, caption="Data: City of Vancouver 2015, 2016") +
  theme(legend.direction = "horizontal",
        legend.position = "bottom",
        plot.title = element_text(hjust=0.5)
        )

Final Product

cherrymap

And all the cherry trees plotted together.

all_cherry <- ggplot(data=vancouver_map, aes(x=long, y=lat)) + 
  geom_path(aes(group=group))+ 
  geom_hex(data=cherry_filtered, aes(x=LONGITUDE, y=LATITUDE), alpha=0.9, bins=60) +
  scale_fill_gradient(low="pink1", high="purple4", name="Density") +
  theme_void() + 
  labs(x=NULL, y=NULL, title="Vancouver's Cherry Trees", caption="Data: City of Vancouver 2015, 2016") +
  theme(legend.direction = "horizontal",
        legend.position = "bottom",
        plot.title = element_text(hjust=0.5)
        )
all_cherry

And the names of the neighbourhoods.

In conclusion, go to Mount Pleasant for the most cherry trees!

To leave a comment for the author, please follow the link and comment on their blog: R on YIHAN WU.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)