Introduction To R Programming: How I Learn Best

October 22, 2016

(This article was first published on R – Saturn Science, and kindly contributed to R-bloggers)

How Best To Learn R Programming

First six months of AQI

So what’s the best or most efficient way to learn R programming? I have read numerous books and blogs on R in an attempt to learn more about it. While this has helped me understand the basics, these resources tend to get off track.

Usually, when I find a good resource, it gives the same basic commands to form a vector or import a data frame. These authors then proceed to give examples with data sets that ship with R or its packages, such as the iris data set, the mpg gas mileage data or the diamonds data.

While this is good in the beginning, and the data are interesting for a while, it gets difficult to follow when you are not really interested in the data or the question being asked.

What I Have Found That Works

I have found that, for me, the best way to learn R programming is to read some of these resources for the basics, then find something you are interested in and data related to that interest. Ask a good question and work with your data to see if it can give you some insights. What story can your data tell? What things do you need to learn to solve the problem? Keep those resources nearby and Google open in another tab.

This is the way to go. When you have a good question to answer and authentic data, the challenge is a lot of fun. If you then publish your findings, it takes on more importance. That is what I will do here as I learn more R programming and some of the way cool packages like dplyr and ggplot2.

My Story

Since I live in Chengdu, China, I am concerned with the air quality here. PM2.5, the pollutant behind the AQI readings in this post, refers to particulate matter 2.5 microns or smaller, so small that it can enter the blood via the lungs and cause damage to the human body.

Analysis Of Air Quality (AQI) from 2015 in Chengdu, China

I will be modeling the reproducible research that I am teaching my students. I am writing this in an R Markdown file that contains my narrative, code chunks, output statistics and plots. I’m trying to keep the formatting to a minimum so I can focus on my thoughts and writing and not spend much time editing. My time is limited on this project. Even so, I was up past midnight last night, and I wanted to keep going but my eyelids kept drooping.

I am working on my analysis skills using R and R Markdown to try to understand whether the air quality changes during the year. The AQI values represent PM2.5, the level of particulate pollution at 2.5 microns or smaller. I found the AQI data online at the US Department of State. I think this data is important because I have not read any published research on it.

This data includes an air quality reading every hour for every day of the year. The most recent completed year is 2015. The data file has 8760 rows and ten column vectors, imported as a data frame from an Excel file from the State Department. According to the website, these data are not fully verified or validated.

I cleaned it up a bit by deleting the top two rows, which contain narrative. I kept the row of variable names and all the rest. I deleted one column labeled “microgram/meter^3” because it wouldn’t import on my Mac, but all readings are in micrograms per cubic meter.
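The same cleanup could probably be done at import time instead of by hand. The following is only a sketch on a small made-up sample, assuming the real file has two narrative lines above the header row and uses -999 for missing readings; the inline text, the `sample_air` name and the values are invented for illustration.

```r
# Hypothetical sample mimicking the layout described above:
# two narrative lines, then a header row, then hourly readings.
raw <- "Narrative line one
Narrative line two
Site,Parameter,Year,Month,Day,Hour,Value,QC.Name
Chengdu,PM2.5,2015,1,1,0,152,Valid
Chengdu,PM2.5,2015,1,1,1,-999,Missing
Chengdu,PM2.5,2015,1,1,2,125,Valid"

# skip = 2 drops the narrative lines; na.strings turns -999 into NA
sample_air <- read.csv(text = raw, skip = 2, na.strings = "-999")

str(sample_air)
sum(is.na(sample_air$Value))  # number of missing readings
```

With missing readings stored as NA, functions such as `mean()` can skip them with `na.rm = TRUE` rather than relying on a filter for values above 0.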

Following Are The Code Chunks

library(ggplot2)
library(dplyr)

ChAir2015 <- read.csv("~/Desktop/LearnR/Chengdu Air Data/ChAir2015.csv") # Import the data

air21015=ChAir2015 # copy the data to a shorter name

str(air21015) # check that the data is a data frame

## 'data.frame':    8760 obs. of  10 variables:
##  $ Site      : Factor w/ 1 level "Chengdu": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Parameter : Factor w/ 1 level "PM2.5": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Date..LST.: Factor w/ 8759 levels "1/1/15 0:00",..: 1 2 13 18 19 20 21 22 23 24 ...
##  $ Year      : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ Month     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Day       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Hour      : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ Value     : int  152 130 125 131 133 133 131 142 154 153 ...
##  $ Duration  : Factor w/ 1 level "1 Hr": 1 1 1 1 1 1 1 1 1 1 ...
##  $ QC.Name   : Factor w/ 2 levels "Missing","Valid": 2 2 2 2 2 2 2 2 2 2 ...

# take a look at the first seven rows of the data
print.data.frame(head(air21015, 7))

##      Site Parameter  Date..LST. Year Month Day Hour Value Duration QC.Name
## 1 Chengdu     PM2.5 1/1/15 0:00 2015     1   1    0   152     1 Hr   Valid
## 2 Chengdu     PM2.5 1/1/15 1:00 2015     1   1    1   130     1 Hr   Valid
## 3 Chengdu     PM2.5 1/1/15 2:00 2015     1   1    2   125     1 Hr   Valid
## 4 Chengdu     PM2.5 1/1/15 3:00 2015     1   1    3   131     1 Hr   Valid
## 5 Chengdu     PM2.5 1/1/15 4:00 2015     1   1    4   133     1 Hr   Valid
## 6 Chengdu     PM2.5 1/1/15 5:00 2015     1   1    5   133     1 Hr   Valid
## 7 Chengdu     PM2.5 1/1/15 6:00 2015     1   1    6   131     1 Hr   Valid

In the next part I will use the dplyr package to filter the data needed to make a histogram and then a box plot.

Jan.air21015= filter(air21015,Month==1, Value>0) # keep only January rows with values above 0
## If there was no reading for an hour, a value of -999 was assigned

dim(Jan.air21015) # number of rows and columns for January 2015 with Value > 0

## [1] 741  10
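To see what the filter call is doing, here is a small sketch on invented data (the `toy` data frame and the `jan_valid` name are hypothetical). In dplyr's `filter()`, conditions separated by commas are combined with AND, so a row is kept only if it is in January and has a positive value. The -999 placeholder is dropped because it is not greater than 0.

```r
library(dplyr)

# Invented readings: two months, with -999 standing in for a missing value
toy <- data.frame(
  Month = c(1, 1, 1, 2, 2),
  Value = c(152, -999, 130, 88, -999)
)

# Comma-separated conditions are combined with AND
jan_valid <- filter(toy, Month == 1, Value > 0)
jan_valid

# Count how many January readings were dropped as missing
nrow(filter(toy, Month == 1)) - nrow(jan_valid)
```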

ggplot(Jan.air21015,aes(x=Value))+
  geom_histogram(binwidth = 10, color="black", fill="yellow") +
  labs(x="AQI Value") +
  ggtitle("January Days With AQI Values")


mean(Jan.air21015$Value)

## [1] 142.2497

fivenum(Jan.air21015$Value)

## [1]  15  91 149 187 349
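For readers new to `fivenum()`: it returns Tukey's five-number summary, that is the minimum, lower hinge, median, upper hinge and maximum, which are roughly the quantities a box plot draws. A small sketch that labels them, reusing the five numbers printed above as a tiny sample (the labels and the name `five` are added only for illustration):

```r
# Tiny sample, just to show which number is which
x <- c(15, 91, 149, 187, 349)

# fivenum() returns: minimum, lower hinge, median, upper hinge, maximum
five <- setNames(fivenum(x),
                 c("min", "lower_hinge", "median", "upper_hinge", "max"))
five
```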

Next, I’ll take a look at February.

air21015=ChAir2015

Feb.air2015=filter(air21015,Month==2, Value>0)
# keep only February rows with values above 0
## If there was no reading for an hour, a value of -999 was assigned

dim(Feb.air2015)

## [1] 670  10

Feb=Feb.air2015

ggplot(Feb,aes(x=Value))+
  geom_histogram(binwidth = 10, color="black", fill="yellow") +
  labs(x="AQI Value") +
  ggtitle("February Days With AQI Values")


Now I will concentrate on using the dplyr package to subset the AQI Value data month by month, so I can compare all the months and see if there is some pattern. March is next.

air21015=ChAir2015

Mar.air2015=filter(air21015,Month==3, Value >0)

April is next.

air21015=ChAir2015

Apr.air2015=filter(air21015,Month==4, Value >0)

May is next.

air21015=ChAir2015

May.air2015=filter(air21015,Month==5, Value >0)

June is next.

air21015=ChAir2015

Jun.air2015=filter(air21015,Month==6, Value >0)
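The month-by-month filtering above can also be written as a single grouped summary. This is a sketch on invented data rather than the real file; the `toy` data frame, the `monthly` result and the summary column names are hypothetical.

```r
library(dplyr)

# Invented readings for three months, with -999 marking missing values
toy <- data.frame(
  Month = c(1, 1, 1, 2, 2, 3, 3),
  Value = c(150, 130, -999, 90, 110, 60, -999)
)

monthly <- toy %>%
  filter(Value > 0) %>%          # drop the -999 placeholders
  group_by(Month) %>%            # one group per month
  summarise(
    readings = n(),              # valid readings in the month
    mean_value = mean(Value),
    median_value = median(Value)
  )

monthly
```

Each month becomes one row of the result, so adding the remaining months would not need any extra code.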

Now I will combine the months into one data frame so I can draw their box plots on one graph. The second box plot shows how I rotated the month labels so they fit.

a = data.frame(Month = "Jan", value = (Jan.air21015$Value))

b = data.frame(Month = "Feb", value = (Feb$Value))

c = data.frame(Month = "Mar", value = (Mar.air2015$Value))

d = data.frame(Month = "Apr", value = (Apr.air2015$Value))

e = data.frame(Month = "May", value = (May.air2015$Value))

f = data.frame(Month = "Jun", value = (Jun.air2015$Value))

plot.data <- rbind(a,b,c,d,e,f)

AQI.plot=ggplot(plot.data, aes(x=Month, y=value, fill=Month)) + geom_boxplot()

AQI.plot


AQI.plot + theme(axis.text.x = element_text(angle = 60, hjust = 1)) +   
  ggtitle("2015 Chengdu AQI")


Conclusion

This is as far as I got last night. Looking at the box plots, the values seem to trend downward over the first six months. I still need to finish the rest of the months. This week I’ll work on plotting all 12 months together and drawing some conclusions about whether there is a pattern. This is my first attempt at using this data and I’m sure there are shorter and better ways of doing it. I read somewhere: “Make it work, then make it fast”.
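One possible way to plot all 12 months without building a data frame per month is sketched below. It runs on simulated values, not the real readings, and the names `toy`, `MonthName` and `all_months_plot` are hypothetical. Turning the month number into a factor whose levels follow R's built-in `month.abb` keeps the boxes in calendar order rather than alphabetical order.

```r
library(ggplot2)

set.seed(1)  # simulated data only, so the sketch is reproducible
toy <- data.frame(
  Month = rep(1:12, each = 20),
  Value = round(runif(240, min = 10, max = 300))
)

# month.abb is built into R: "Jan", "Feb", ..., "Dec"
toy$MonthName <- factor(month.abb[toy$Month], levels = month.abb)

all_months_plot <- ggplot(toy, aes(x = MonthName, y = Value, fill = MonthName)) +
  geom_boxplot() +
  labs(x = "Month", y = "Value") +
  ggtitle("Simulated monthly values")

all_months_plot
```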

 

 
