Introduction to R for Data Science :: Session 1

Posted on April 30, 2016 by The Exactness of Mind in R bloggers

[This article was first published on The Exactness of Mind, and kindly contributed to R-bloggers.]

Welcome to Introduction to R for Data Science Session 1! The course is co-organized by Data Science Serbia and Startit. You will find all course material (R scripts, data sets, SlideShare presentations, readings) on these pages.

Lecturers

dipl. ing Branko Kovač, Data Analyst at CUBE, Data Science Mentor at Springboard, Institut savremenih nauka, Data Science Serbia
Goran S. Milovanović, PhD, DataScientist@DiploFoundation, Data Science Serbia

Summary of Session 1, 28 April 2016 :: Introduction to R

Elementary data structures and data.frames, plus an illustrative example of a simple linear regression model. An introduction to basic R data types and objects (vectors, lists, data.frame objects). Examples: subsetting and coercion. Getting to know RStudio. What can R do, and how do you make it perform the most elementary tricks needed in Data Science? What is CRAN, and how do you install R packages? R graphics: simple linear regression with plot() and abline(), then a fancier version with ggplot().
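
The object types named in this summary can be tried out without the course data file. What follows is a minimal sketch, not part of the original course material: the objects `v`, `l`, `df` and `mixed` are made up for illustration, and ggplot2 is checked only because the session script loads it later.

```r
# Hypothetical toy objects, not taken from the course data set
v <- c(1.5, 2, 3)                            # atomic vector of doubles
l <- list(id = 1L, tag = "a")                # list: members may differ in type
df <- data.frame(x = 1:3, y = c("a", "b", "c"),
                 stringsAsFactors = FALSE)   # data.frame: a list of equal-length columns

typeof(df)   # "list"
class(df)    # "data.frame"

# Subsetting: [ ] keeps the container, [[ ]] extracts the member
df[1]        # a one-column data.frame
df[[1]]      # the integer vector 1 2 3

# Coercion: combining numbers and a string yields a character vector
mixed <- c(v, "four")
typeof(mixed)   # "character"

# Check whether a CRAN package is installed before trying to load it
hasGgplot2 <- requireNamespace("ggplot2", quietly = TRUE)
if (!hasGgplot2) {
  message('Run install.packages("ggplot2") once to fetch it from CRAN.')
}
```

The package check only reports what is missing; the download itself is left to an explicit `install.packages()` call, since it needs a network connection and can take a while.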

Intro to R for Data Science SlideShare :: Session 1

Introduction to R for Data Science :: Session 1 from Goran Milovanović

R script + Data Set :: Session 1

########################################################
# Introduction to R for Data Science
# SESSION 1 :: 28 April, 2016
# Data Science Community Serbia + Startit
# :: Branko Kovač and Goran S. Milovanović ::
########################################################
 
# This is an R comment: it begins with "#" and runs to the end of the line
# data source: http://www.stat.ufl.edu/~winner/datasets.html (modified, from .dat to .csv)
# from the website of Mr. Larry Winner, Department of Statistics, University of Florida
 
# Data set: RKO Films Costs and Revenues 1930-1941
# More on RKO Films: https://en.wikipedia.org/wiki/RKO_Pictures
 
# First question: where are we?
getwd(); # this will tell you the path to the R working directory
 
# Where are my files?
# NOTE: Here you need to change filesDir to match your local path
filesDir <- "/home/goran/Desktop/__IntroR_Session1/";
class(filesDir); # filesDir is now of class "character"; R distinguishes classes from types
typeof(filesDir);
# By the way, you do not need to use the semicolon to separate lines of code:
class(filesDir)
typeof(filesDir)
# point R to where your files are stored
setwd(filesDir); # set working directory
getwd(); # check
 
# Read some data in csv format (comma separated values)
# - it might turn out that you will be using these very often
fileName <- "rko_film_1930-1941.csv";
dataSet <- read.csv(fileName,
                    header=T,
                    check.names=F,
                    stringsAsFactors=F,
                    row.names=NULL);
 
# read.csv is for reading comma separated values
# type ? in front of any R function for help
?read.csv
# to find out that read.csv is a member of a wider read* family of functions
# of which read.table is the most generic one
 
# now, dataSet is of type...
typeof(dataSet); # in type semantics, dataSet is a list. In R we use lists a lot.
class(dataSet); # in object semantics, dataSet is a data.frame!
 
# what is the first member of the dataSet list?
dataSet[[1]];
# what are the first two members?
dataSet[1:2];
# mind the difference between subsetting a list with [[]] and []
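# --- added illustration, not part of the original script ---
# toyList is a made-up object that shows the same difference without the data file
toyList <- list(a = 1:3, b = "text");
toyList["a"];          # single brackets return a list that contains the member
toyList[["a"]];        # double brackets return the member itself
class(toyList["a"]);   # "list"
class(toyList[["a"]]); # "integer"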
# does a single member of dataSet have a name?
names(dataSet[[1]]);
# of what type is it?
typeof(dataSet[[1]]);
class(dataSet[[1]]);
# do first two elements have names?
names(dataSet[1:2]); # wow
typeof(dataSet[1:2]);
# the first element of dataSet, understood as a character vector, does not have a name
# however, elements OF A list do have names
# can we subset a data.frame object by names?
dataSet$movie;
dataSet$movie[1:10];
dataSet$movie[[1]];
class(dataSet$movie[[1]]);
typeof(dataSet$movie[[1]]);
# thus, a character vector is the first member = the first column of the dataSet data.frame
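# NOTE: the original assignment was lost; "Ana" as the first element follows
# from the comments below, the remaining words are an assumption
testWord <- c("Ana", "voli", "Milovana");
testWord[[1]];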
testWord[[1:2]]; # error
testWord[1:2];
# similar
dataSet[1:2]; # first two columns of a dataSet
# back to characters
tW <- testWord[1];
tW[1]
tW[2] # NA
# from a viewpoint of a statistical spreadsheet user, NA is used for missing data in R
# what is the second letter in tW == 'Ana'
substring(tW,2,2); # there are functions in R to deal with characters as strings!
# finding elements of vectors
w <- which(testWord == "Ana"); # which() returns the positions where the condition is TRUE
testWord[w];
# how many elements in testWord?
length(testWord);
# subsetting testWord, again
testWord[2:length(testWord)]; # length is another important function, like which() or substring()
tail(testWord,2); # vectors have tails, yay!
head(testWord,3); # and heads as well
# a data.frame has a head too, and that knowledge often comes handy...
head(dataSet,5); # ... especially when dealing with large data sets
# of course...
tail(dataSet,10);
# another two functions: tail() and head()
# further subsetting of a data.frame object
dataSet$reRelease # columns can have names; reRelease is the name of the 2nd column of dataSet
typeof(dataSet$reRelease);
class(dataSet$reRelease);
# automatic type conversion in R: from numeric to logical
is.numeric(dataSet$reRelease);
reRelease <- as.logical(dataSet$reRelease);
is.logical(reRelease);
# vectors, sequences...
# automatic type conversion (coercing) in R: from real to integer
x <- 2:10;
# produces the same values as...
x <- seq(2,10,by=1);
# multiples of 3.1415927...
multipliPi <- x*pi;
multipliPi
# NOTE multiplication * in R operates element-wise
# This is one of the reasons we call it a vector programming language...
is.double(multipliPi);
# type conversion in R: from double to integer
as.integer(multipliPi)
is.integer(multipliPi)
is.integer(as.integer(multipliPi))
# rounding
round(multipliPi,1)
round(multipliPi,2)
# carefully!
as.integer(multipliPi) == round(multipliPi,0) # check documentation
?as.integer # enjoy...
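# --- added illustration, not part of the original script ---
# as.integer() truncates toward zero, while round() goes to the nearest value,
# so the two disagree whenever the fractional part is larger than one half
truncated <- as.integer(c(2.3, 2.7, -2.7));
rounded <- round(c(2.3, 2.7, -2.7), 0);
truncated # 2 2 -2
rounded   # 2 3 -3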
# more coercion...
num <- as.numeric("123");
is.numeric(num)
ch <- as.character(num)
is.character(ch)
 
# What do we all love in Data Science and Statistics? Random numbers..!
runif(100,0,1) # one hundred uniformly distributed random numbers on a range 0 .. 1
rnorm(100, mean=0, sd=1) # one hundred random deviates from the standard Gaussian
# all probability density and mass functions in R have similar r* functions to generate random deviates
 
# Enough! Let's do something for real...
# Q: Is it possible to predict the total revenue from movie production cost?
# Are these two related at all?
# What is the size of the data set?
n <- nrow(dataSet); # number of rows (observations)
n
# any missing data?
sum(!(is.na(dataSet$productionCost)));
sum(!(is.na(dataSet$totalRevenue)));
# plot dataSet$productionCost on x-axis and dataSet$totalRevenue on y-axis
plot(dataSet$productionCost, dataSet$totalRevenue);
# are these two correlated?
cPearson <- cor(dataSet$productionCost, dataSet$totalRevenue,method="pearson");
cPearson

# hm, maybe I should use non-parametric correlation instead
cSpearman <- cor(dataSet$productionCost, dataSet$totalRevenue,method="spearman");
cSpearman
# log-transform will not help much in this case...
hist(log(dataSet$productionCost),20); # the default base of log in R is e (natural)
hist(log(dataSet$totalRevenue),20);

# However, who in the World tests the assumptions of the linear model... Kick it!
reg <- lm(dataSet$totalRevenue ~ dataSet$productionCost);
summary(reg);
# get residuals
reg$residuals
# get coefficients
reg$coefficients 
# some functions to inspect the simple linear model
coefficients(reg) # model coefficients
confint(reg, level=0.95) # CIs for model parameters 
fitted(reg) # predicted values
residuals(reg) # residuals
anova(reg) # anova table 
vcov(reg) # covariance matrix for model parameters 
 
# plot model
intercept <- reg$coefficients[1];
slope <- reg$coefficients[2];
plot(dataSet$productionCost, dataSet$totalRevenue);
abline(reg$coefficients); # as simple as that; abline() is a generic function, check it out ?abline

# and now for a nice plot
library(ggplot2); # first run install.packages("ggplot2") once; not now, it can take a while
# library() is a call to use any R package
# of which the powerful ggplot2 is among the most popular
g <- ggplot(data=dataSet,
            aes(x = productionCost,
                y = totalRevenue)) +
  geom_point() +
  geom_smooth(method=lm,
              se=TRUE) +
  xlab("\nProduction Cost") +
  ylab("Total Revenue\n") +
  ggtitle("Linear Regression\n"); 
print(g);

# Q1: Is this model any good?
# Q2: Are there any truly dangerous outliers present in the data set?
 
# print is also a generic function in R: for example,
print("Goodbye, and enjoy the holidays with a pile of material for reading and practice!")
 
# P.S. Play with:
reg <- lm(dataSet$totalRevenue ~ dataSet$productionCost + dataSet$domesticRevenue);
summary(reg) # etc.
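
Questions Q1 and Q2 from the script can be approached with tools that ship with base R. The sketch below is an illustration under stated assumptions rather than the course's own answer: it uses the built-in `cars` data set in place of the RKO file so that it runs anywhere, it passes the data through the `data` argument of `lm()`, and it applies the common but arbitrary 4/n rule of thumb to Cook's distance. The names `fit`, `rSquared`, `cooksD`, `threshold` and `influential` are made up for this example.

```r
# Built-in data set standing in for the RKO data (assumption: any pair of
# numeric columns serves for the illustration)
fit <- lm(dist ~ speed, data = cars)

# Q1: share of the variance in the response explained by the model
rSquared <- summary(fit)$r.squared
rSquared

# Q2: Cook's distance measures how much each observation influences the fit
cooksD <- cooks.distance(fit)
threshold <- 4 / nrow(cars)          # rule of thumb, not a formal test
influential <- which(cooksD > threshold)

# Rows flagged as potentially influential
cars[influential, ]
```

Calling `plot(fit)` afterwards draws the standard diagnostic plots for an `lm` object, which is a quick visual check of the same two questions. The same steps should carry over to the session's model once `reg` has been fitted.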