Introduction to R for Data Science :: Session 1

April 30, 2016

(This article was first published on The Exactness of Mind, and kindly contributed to R-bloggers)

Welcome to Introduction to R for Data Science Session 1! The course is co-organized by Data Science Serbia and Startit. You will find all course material (R scripts, data sets, SlideShare presentations, readings) on these pages.


Lecturers: Branko Kovač and Goran S. Milovanović

Summary of Session 1, 28 April 2016 :: Introduction to R

Elementary data structures, data.frames, and an illustrative example of a simple linear regression model. An introduction to basic R data types and objects (vectors, lists, data.frame objects). Examples: subsetting and coercion. Getting to know RStudio. What can R do, and how do you make it perform the most elementary tricks needed in Data Science? What is CRAN, and how do you install R packages? R graphics: simple linear regression with plot() and abline(), and a fancier version with ggplot().
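
Before the full script, the data structures named above can be previewed with a minimal sketch. The object names (v, l, df, mixed) and all values are made up for illustration and are not part of the course material.

```r
# Illustrative objects only; names and values are assumptions, not course data
v <- c(1.5, 2, 3)                          # a numeric (double) vector
l <- list(name = "RKO", year = 1930L)      # a list can mix types
df <- data.frame(id = 1:3,
                 title = c("a", "b", "c"),
                 stringsAsFactors = FALSE) # a data.frame is a list of equal-length columns

typeof(v)            # "double"
typeof(df)           # "list"
class(df)            # "data.frame"

# Subsetting: [ ] keeps the container, [[ ]] extracts a single element
class(l["name"])     # "list"
class(l[["name"]])   # "character"
df$title[2]          # "b"

# Coercion: mixing numbers and strings in one vector yields character
mixed <- c(1, "two", 3)
typeof(mixed)        # "character"
as.integer("42") + 1L   # 43
```

The session script below walks through the same ideas on a real data set.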

Intro to R for Data Science SlideShare :: Session 1

Introduction to R for Data Science :: Session 1 from Goran Milovanović

R script + Data Set :: Session 1

########################################################
# Introduction to R for Data Science
# SESSION 1 :: 28 April, 2016
# Data Science Community Serbia + Startit
# :: Branko Kovač and Goran S. Milovanović ::
########################################################
 
# This is an R comment: it begins with "#" and ends with nothing 🙂
# data source: http://www.stat.ufl.edu/~winner/datasets.html (modified, from .dat to .csv)
# from the website of Mr. Larry Winner, Department of Statistics, University of Florida
 
# Data set: RKO Films Costs and Revenues 1930-1941
# More on RKO Films: https://en.wikipedia.org/wiki/RKO_Pictures
 
# First question: where are we?
getwd(); # this will tell you the path to the R working directory
 
# Where are my files?
# NOTE: Here you need to change filesDir to match your local path
filesDir <- "/home/goran/Desktop/__IntroR_Session1/";
class(filesDir); # filesDir is of class "character"; there are both classes and types in R
typeof(filesDir);
# By the way, you do not need to use the semicolon to separate lines of code:
class(filesDir)
typeof(filesDir)
# point R to where your files are stored
setwd(filesDir); # set working directory
getwd(); # check
 
# Read some data in csv (comma separated values)
# - it might turn out that you will be using these very often
fileName <- "rko_film_1930-1941.csv";
dataSet <- read.csv(fileName,
                    header=TRUE,
                    check.names=FALSE,
                    stringsAsFactors=FALSE,
                    row.names=NULL);
 
# read.csv is for reading comma separated values
# type ? in front of any R function for help
?read.csv
# to find out that read.csv is a member of a wider read* family of functions
# of which read.table is the most generic one
 
# now, dataSet is of type...
typeof(dataSet); # in type semantics, dataSet is a list. In R we use lists a lot.
class(dataSet); # in object semantics, dataSet is a data.frame!
 
# what is the first member of the dataSet list?
dataSet[[1]];
# what are the first two members?
dataSet[1:2];
# mind the difference between subsetting a list with [[]] and []
# does a single member of dataSet have a name?
names(dataSet[[1]]);
# of what type is it?
typeof(dataSet[[1]]);
class(dataSet[[1]]);
# do first two elements have names?
names(dataSet[1:2]); # wow
typeof(dataSet[1:2]);
# the first element of dataSet, understood as a character vector, does not have a name
# however, elements OF A list do have names
# can we subset a data.frame object by names?
dataSet$movie;
dataSet$movie[1:10];
dataSet$movie[[1]];
class(dataSet$movie[[1]]);
typeof(dataSet$movie[[1]]);
# thus, a character vector is the first member = the first column of the dataSet data.frame
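testWord <- c("Ana", "Maja", "Petar", "Marko"); # a hypothetical example vector (any short character vector will do)
testWord
testWord[[1]];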
testWord[[1:2]]; # error
testWord[1:2];
# similar
dataSet[1:2]; # first two columns of a dataSet
# back to characters
tW <- testWord[1];
tW[1]
tW[2] # NA
# from a viewpoint of a statistical spreadsheet user, NA is used for missing data in R
# what is the second letter in tW == 'Ana'
substring(tW,2,2); # there are functions in R to deal with characters as strings!
# finding elements of vectors
w <- which(testWord == tW); # e.g., which() returns the position(s) where testWord equals tW
testWord[w];
# how many elements in testWord?
length(testWord);
# subsetting testWord, again
testWord[2:length(testWord)]; # length is another important function, like which() or substring()
tail(testWord,2); # vectors have tails, yay!
head(testWord,3); # and heads as well
# a data.frame has a head too, and that knowledge often comes handy...
head(dataSet,5); # ... especially when dealing with large data sets
# of course...
tail(dataSet,10);
# another two functions: tail() and head()
# further subsetting of a data.frame object
dataSet$reRelease # columns can have names; reRelease is the name of the 2nd column of dataSet
typeof(dataSet$reRelease);
class(dataSet$reRelease);
# type conversion in R: from numeric to logical
is.numeric(dataSet$reRelease);
reRelease <- as.logical(dataSet$reRelease); # 0 becomes FALSE, any non-zero number becomes TRUE
reRelease
is.logical(reRelease);
# vectors, sequences...
# type conversion (coercion) in R: integer and double
x <- 2:10; # an integer sequence
# gives the same values as...
x <- seq(2,10,by=1); # ...although seq() with by= returns doubles
# multiples of 3.1415927...
multipliPi <- x*pi;
multipliPi
# NOTE multiplication * in R operates element-wise
# This is one of the reasons we call it a vector programming language...
is.double(multipliPi);
# type conversion in R: from double to integer
as.integer(multipliPi)
is.integer(multipliPi)
is.integer(as.integer(multipliPi))
# rounding
round(multipliPi,1)
round(multipliPi,2)
# carefully! as.integer() truncates toward zero, while round() rounds to the nearest value
as.integer(multipliPi) == round(multipliPi,0) # not all TRUE; check documentation
?as.integer # enjoy...
# more coercion...
num <- as.numeric("123");
is.numeric(num)
ch <- as.character(num)
is.character(ch)
 
# What do we all love in Data Science and Statistics? Random numbers..!
runif(100,0,1) # one hundred uniformly distributed random numbers on a range 0 .. 1
rnorm(100, mean=0, sd=1) # one hundred random deviates from the standard Gaussian
# all probability density and mass functions in R have similar r* functions to generate random deviates
 
# Enough! Let's do something for real...
# Q: Is it possible to predict the total revenue from movie production cost?
# Are these two related at all?
# What is the size of the data set?
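n <- nrow(dataSet); # number of rows in the data set
n
# any missing data? compare the counts of non-missing values below with n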
sum(!(is.na(dataSet$productionCost)));
sum(!(is.na(dataSet$totalRevenue)));
# plot dataSet$productionCost on x-axis and dataSet$totalRevenue on y-axis
plot(dataSet$productionCost, dataSet$totalRevenue);
# are these two correlated?
cPearson <- cor(dataSet$productionCost, dataSet$totalRevenue,method="pearson");
cPearson

Session1-Fig1, Session1-Fig2
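
The missing-data check and the correlation above interact: with its default settings, cor() returns NA as soon as either vector contains a missing value. A minimal sketch on simulated data follows; the vectors cost and revenue are invented stand-ins, not the RKO columns.

```r
# Simulated stand-in for cost/revenue columns; all values are invented
set.seed(1)
cost <- runif(50, 100, 1000)
revenue <- 1.5 * cost + rnorm(50, sd = 100)
revenue[c(3, 17)] <- NA                  # introduce two missing values

sum(is.na(revenue))                      # number of missing values
sum(!is.na(revenue))                     # number of observed values

rAll      <- cor(cost, revenue)                         # NA, because of the missing values
rComplete <- cor(cost, revenue, use = "complete.obs")   # uses complete pairs only
rAll
rComplete
```

If the two counts of non-missing values equal the number of rows, the use= argument is not needed.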

# hm, maybe I should use non-parametric correlation instead
cSpearman <- cor(dataSet$productionCost, dataSet$totalRevenue,method="spearman");
cSpearman
# log-transform will not help much in this case...
hist(log(dataSet$productionCost),20); # the default base of log in R is e (natural)
hist(log(dataSet$totalRevenue),20);

Session1-Fig3, Session1-Fig4
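
The choice between Pearson and Spearman, and the effect of a log-transform, can be illustrated with a small sketch. The data here are simulated (x and y are invented, with y growing exponentially in x), so the numbers say nothing about the RKO data set.

```r
# Simulated data: y is monotone but nonlinear in x (an assumption for illustration)
set.seed(42)
x <- runif(200, 1, 10)
y <- exp(x / 2) * exp(rnorm(200, sd = 0.2))

pearsonRaw  <- cor(x, y, method = "pearson")
spearmanRaw <- cor(x, y, method = "spearman")

# log() is a monotone transformation: ranks do not change,
# so Spearman stays the same while Pearson changes
pearsonLog  <- cor(x, log(y), method = "pearson")
spearmanLog <- cor(x, log(y), method = "spearman")

c(pearsonRaw, pearsonLog)
c(spearmanRaw, spearmanLog)   # the two Spearman values are equal
```

Spearman's coefficient works on ranks, which is why it is a reasonable fallback when the relationship is monotone but not linear.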

# However, who in the World tests the assumptions of the linear model... Kick it!
reg <- lm(dataSet$totalRevenue ~ dataSet$productionCost);
summary(reg);
# get residuals
reg$residuals
# get coefficients
reg$coefficients 
# some functions to inspect the simple linear model
coefficients(reg) # model coefficients
confint(reg, level=0.95) # CIs for model parameters 
fitted(reg) # predicted values
residuals(reg) # residuals
anova(reg) # anova table 
vcov(reg) # covariance matrix for model parameters 
 
# plot model
intercept <- reg$coefficients[1];
slope <- reg$coefficients[2];
plot(dataSet$productionCost, dataSet$totalRevenue);
abline(reg$coefficients); # as simple as that; abline() also accepts the fitted model itself, abline(reg); check it out: ?abline

Session1-Fig5
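
The script fits the model with dataSet$... terms inside the formula, which works for inspecting coefficients. A common alternative, sketched here on simulated data, is to pass the data frame through the data= argument; this keeps coefficient names short and lets predict() work with new data. The films and newFilms data frames and their values are hypothetical.

```r
# Invented data with the same column names as in the session script
set.seed(7)
films <- data.frame(productionCost = runif(60, 100, 1000))
films$totalRevenue <- 50 + 1.8 * films$productionCost + rnorm(60, sd = 80)

# data= lets the formula refer to columns by name
fit <- lm(totalRevenue ~ productionCost, data = films)
coef(fit)                     # named vector: (Intercept), productionCost

# Predicted revenue for two hypothetical production costs
newFilms <- data.frame(productionCost = c(200, 500))
pred <- predict(fit, newdata = newFilms)
pred
```

With this form, abline(fit) draws the same line as abline() called on the coefficient vector.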

# and now for a nice plot
library(ggplot2); # first do: install.packages("ggplot2"); not now - it can take a while
# library() is a call to use any R package
# of which the powerful ggplot2 is among the most popular
g <- ggplot(data=dataSet,
            aes(x = productionCost,
                y = totalRevenue)) +
  geom_point() +
  geom_smooth(method=lm,
              se=TRUE) +
  xlab("\nProduction Cost") +
  ylab("Total Revenue\n") +
  ggtitle("Linear Regression\n"); 
print(g);

Session1-Fig6

# Q1: Is this model any good?
# Q2: Are there any truly dangerous outliers present in the data set?
 
# print is also a generic function in R: for example,
print("Goodbye, and enjoy the holidays with a pile of material for reading and practice!")
 
# P.S. Play with:
reg <- lm(dataSet$totalRevenue ~ dataSet$productionCost + dataSet$domesticRevenue);
summary(reg) # etc.
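
One possible starting point for Q1 and Q2 above is sketched below, using base R only. The data are simulated with one planted outlier, so the data frame d, its columns, and the cutoff 4/n (a rule of thumb, not a formal test) are all assumptions for illustration.

```r
# Simulated data with one artificial outlier in row 40; values are invented
set.seed(3)
d <- data.frame(cost = runif(40, 100, 1000))
d$revenue <- 100 + 2 * d$cost + rnorm(40, sd = 60)
d$revenue[40] <- d$revenue[40] + 3000

m <- lm(revenue ~ cost, data = d)

# Q1: share of variance explained by the model
rSquared <- summary(m)$r.squared
rSquared

# Q2: Cook's distance measures how much each observation influences the fit
cd <- cooks.distance(m)
which.max(cd)                  # the planted outlier, observation 40
which(cd > 4 / nrow(d))        # observations above a rule-of-thumb cutoff
```

Running plot(m) on a fitted model shows related diagnostic plots, including residuals against leverage.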

Readings :: Session 2 [5 May 2016, @Startit.rs, 19h CET]

Chapters 1 – 5, The Art of R Programming, Norman Matloff

  • Intro to R
  • Vectors and Matrices
  • Lists
