[This article was first published on JourneyR Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

When learning to code, I sometimes find it helpful to look at code other people have written for a similar task. How do they format their code? What tips and tricks can I take away? What was the logic they used to structure the code? I hope that this case study might help you understand a problem you’ve been working through or highlight some new tricks for data processing.

Goal

What trees tend to occur together? The goal of this case study is to find the species of trees that occur in the same area as quaking aspen.

Data

For this case study we will use the U.S. forest inventory and analysis (FIA) data. This is one of the coolest and most complete datasets for exploring forests across the U.S. Over decades, forestry teams visited a system of forest patches (plots) and recorded the trees there. Not only does this database contain the tree species in each plot but it also describes the size, condition, and location of each tree. Unfortunately, a dataset with such detail is complex to record so the FIA data is stored in a database with numerous spreadsheets that connect to one another with common keys.

The database has a spreadsheet describing each tree, another sheet to describe each plot in which those trees are found, and another to give additional species information (and many more that we will not touch in this case study). For now, we will access two spreadsheets within the database: tree.csv and REF_SPECIES.csv.

How do you access the FIA data? The simplest way (in my opinion) is through the “datamart” csv files. We can download files for each state individually, or scroll all the way to the bottom, and download files for the entire U.S.

Logic

Before writing a script I think about the logic behind what we want to do. We will:

1. find the plots that have quaking aspen
2. get the other species in those plots

Of course, it will take quite a bit more code to achieve our goal because we need to load, format, and wrangle the data. But thinking through this logic gives our script some direction.

Full script

If you like to jump to the punchline, here is the full script. Below, I’ll step through it in more detail.

# Load packages
require(tidyr)
require(dplyr)
require(bit64)
require(data.table)

# Set number type for long FIA ID numbers
options(scipen=6)

# Set working directory for FIA dta
setwd("/Users/.../")

# Check column names in tree file
tree_check <- read.csv(file = "data/raw/tree.csv", nrows = 2)
tree_check #CN, PLT_CN, SPCD

# Load species data
speciesdata <- read.csv("data/raw/REF_SPECIES.csv")
speciesdata <- speciesdata %>% dplyr::select(SPCD, COMMON_NAME, GENUS, SPECIES)

# Load tree file (only relevant columns)
tree <- fread("data/raw/tree.csv", select = c('CN',
'PLT_CN',
'SPCD'))

# Quaking aspen = Populus tremuloides = SPCD 746
# get plots with aspen in them
plt_aspen <- tree[tree$SPCD == 746, ] plt_aspen <- unique(plt_aspen$PLT_CN)
length(plt_aspen) #73232 plots have quaking aspens

# Get all trees for plots with aspens
tree_withaspen <- tree %>% filter(PLT_CN %in% plt_aspen)
tree_withaspen <- merge(tree_withaspen, speciesdata, by="SPCD", all.x=TRUE, all.y=FALSE)
tree_withaspen$GENUSSPECIES <- paste(tree_withaspen$GENUS, tree_withaspen$SPECIES) # Get list of species tree_list <- unique(tree_withaspen$GENUSSPECIES)
length(tree_list) #189 species co-occur

# Get counts of species
tree_cnts <- as.data.frame(xtabs(~GENUSSPECIES, data = tree_withaspen))
tree_cnts <- tree_cnts[rev(order(tree_cnts$Freq)), ] head(tree_cnts, n=15) # Save list to csv file write.csv(tree_cnts, "tree_cnts.csv")  Step by step Start by loading the necessary packages. When you start writing the script you may not know which packages you need. But as you use them throughout the code, come back to this section to add any packages you loaded. We will use dplyr to filter(), bit64 to deal with the long ID numbers used in the FIA, and data.table for fread() which allows us to load specific columns from the csv file rather than loading the entire file (which is huge!). # Load packages require(tidyr) require(dplyr) require(bit64) require(data.table) Next use scipen=6 to keep the long ID numbers formatted in their original system. And set your working directory so that you can access files and save outputs in a specific place for this project. # Set number type for long FIA ID numbers options(scipen=6) # Set working directory for FIA dta setwd("/Users/.../") Now we are ready to load the data. Since the tree.csv file is so big, we don’t really want to load the entire file into memory. Let’s figure out which columns we might actually use. To do this, use read.csv with nrows = 2 to load only the first 2 rows. This lets us check the column names and see a glimpse of how the data is formatted. From here, we can see that we only need a few columns: CN which gives each tree a unique ID number, PLT_CN which gives the plot ID number each tree is in, and SPCD which gives the species code for that tree. After finding the columns that we’re interested in, we can use fread to select only those three columns. REF_SPECIES.csv, on the other hand, is a small file that we load entirely into memory using read.csv() and then select the relevant columns using dplyr::select(). # Check column names in tree file tree_check <- read.csv(file = "data/raw/tree.csv", nrows = 2) tree_check #CN, PLT_CN, SPCD # Load species data speciesdata <- read.csv("data/raw/REF_SPECIES.csv") speciesdata <- speciesdata %>% dplyr::select(SPCD, COMMON_NAME, GENUS, SPECIES) # Load tree file (only relevant columns) tree <- fread("data/raw/tree.csv", select = c('CN', 'PLT_CN', 'SPCD')) After loading the data we can find the plots that have quaking aspens in them (hooray — we made it to step 1 from our logic session). Based on the REF_SPECIES file quaking aspen is SPCD 746 so we filter the data for all rows that have SPCD == 746. I did this using base R but it could also be done using dplyr where plt_aspen <- tree %>% filter(SPCD == 746). From the dataframe with only aspen trees, use unique() to find the plot numbers of those trees. In this dataset we have 73,232 plots that contain quaking aspen! # Quaking aspen = Populus tremuloides = SPCD 746 # get plots with aspen in them plt_aspen <- tree[tree$SPCD == 746, ]
plt_aspen <- unique(plt_aspen$PLT_CN) length(plt_aspen) #73232 plots have quaking aspens Based on the list of plots that contain aspen we can now look at all tree species contained in those plots. From the original tree dataframe filter for only the plots with a plot ID in the list we generated in the last step. Then we merge that dataframe with the species data so tha we can get species names (not just the SPCD species ID number that comes in the tree file). We use paste() to make a new column that combines genus and species (in the speciesdata these are separate columns). And we use unique() to get the unique list of genus + species that occur. # Get all trees for plots with aspens tree_withaspen <- tree %>% filter(PLT_CN %in% plt_aspen) tree_withaspen <- merge(tree_withaspen, speciesdata, by="SPCD", all.x=TRUE, all.y=FALSE) tree_withaspen$GENUSSPECIES <- paste(tree_withaspen$GENUS, tree_withaspen$SPECIES)

# Get list of species
tree_list <- unique(tree_withaspen$GENUSSPECIES) length(tree_list) #189 species co-occur All that remains is to organize and save this list. We will organize the list by the frequency of each species so that species occuring more often near aspen will be at the top of the list. Then save using write.csv(). # Get counts of species tree_cnts <- as.data.frame(xtabs(~GENUSSPECIES, data = tree_withaspen)) tree_cnts <- tree_cnts[rev(order(tree_cnts$Freq)), ]
head(tree_cnts, n=15)

# Save list to csv file
write.csv(tree_cnts, "tree_cnts.csv")

I hope that you found this post helpful or at least interesting. Please let me know if you have an R question that you would like explained on here. And thanks for following along with my R journey.

To leave a comment for the author, please follow the link and comment on their blog: JourneyR Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts.(You will not see this message again.)

Click here to close (This popup will not appear again)