
Welcome to Introduction to R for Data Science Session 1! The course is co-organized by Data Science Serbia and Startit. You will find all course material (R scripts, data sets, SlideShare presentations, readings) on these pages.


## Summary of Session 1, 28 April 2016 :: Introduction to R

Elementary data structures and data.frames, plus an illustrative example of a simple linear regression model. An introduction to basic R data types and objects (vectors, lists, data.frame objects). Examples: subsetting and coercion. Getting to know RStudio. What can R do, and how do you make it perform the most elementary tricks needed in Data Science? What is CRAN, and how do you install R packages? R graphics: simple linear regression with plot() and abline(), and a fancier version with ggplot().
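As a minimal sketch of the basic objects and the coercion rules mentioned above (the names `v`, `l`, `mixed`, and `total` are illustrative and not part of the course script), the following shows how R picks a single type for a vector and what happens when a conversion cannot succeed:

```r
# Illustrative objects; the names v and l are hypothetical, not from the course script
v <- c(1.5, 2, 3)                       # a numeric (double) vector
l <- list(a = 1, b = "two", c = TRUE)   # a list can mix types
typeof(v)                               # "double"
typeof(l)                               # "list"

# Coercion: a vector holds one type, so mixing a number and a string gives character
mixed <- c(1, "a")
typeof(mixed)                           # "character"

# Logical values are coerced to numbers in arithmetic
total <- sum(c(TRUE, FALSE, TRUE))
total                                   # 2

# Explicit coercion that cannot succeed yields NA (with a warning)
suppressWarnings(as.numeric("abc"))     # NA
```

The script below walks through the same ideas on a real data set.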

## R script + Data Set :: Session 1

########################################################
# Introduction to R for Data Science
# SESSION 1 :: 28 April, 2016
# Data Science Community Serbia + Startit
# :: Branko Kovač and Goran S. Milovanović ::
########################################################

# This is an R comment: it begins with "#" and runs to the end of the line
# data source: http://www.stat.ufl.edu/~winner/datasets.html (modified, from .dat to .csv)
# from the website of Mr. Larry Winner, Department of Statistics, University of Florida

# Data set: RKO Films Costs and Revenues 1930-1941
# More on RKO Films: https://en.wikipedia.org/wiki/RKO_Pictures

# First question: where are we?
getwd(); # this will tell you the path to the R working directory

# Where are my files?
# NOTE: Here you need to change filesDir to match your local path
filesDir <- "/home/goran/Desktop/__IntroR_Session1/";
class(filesDir); # now filesDir is of a character type; there are classes and types in R
typeof(filesDir);
# By the way, you do not need to use the semicolon to separate lines of code:
class(filesDir)
typeof(filesDir)
# point R to where your files are stored
setwd(filesDir); # set working directory
getwd(); # check
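If you would rather not hard-code path separators, a sketch along these lines builds the path with file.path() and checks that the folder exists before calling setwd(). Here tempdir() stands in for your own course folder, and the names `exampleDir`, `oldDir`, and `currentName` are assumptions made only so the example runs anywhere:

```r
# tempdir() stands in for your own course folder (assumption for illustration)
exampleDir <- file.path(tempdir(), "IntroR_Session1")
dir.create(exampleDir, showWarnings = FALSE)

oldDir <- getwd()                  # remember where we were
if (dir.exists(exampleDir)) {
  setwd(exampleDir)                # point R to the folder
}
currentName <- basename(getwd())   # last component of the working directory
currentName                        # "IntroR_Session1"
setwd(oldDir)                      # go back to the original working directory
```

file.path() joins the pieces with the separator that suits the platform, so the same line works on Linux, macOS, and Windows.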

# Read some data in csv (comma separated values)
# - it might turn out that you will be using these very often
fileName <- "rko_film_1930-1941.csv";
dataSet <- read.csv(fileName,
                    check.names = FALSE,
                    stringsAsFactors = FALSE,
                    row.names = NULL);

# type ? in front of any R function for help, e.g.
?read.csv
# to find out that read.csv is a member of a wider read* family of functions
# of which read.table is the most generic one
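To see how read.csv relates to read.table, here is a small sketch on a temporary file. The toy rows and the names `tmpFile`, `viaCsv`, and `viaTable` are invented for illustration; only the column names mimic the course data set. For a simple file like this one, read.csv behaves like read.table with header = TRUE and sep = "," preset:

```r
# Toy data written to a temporary file; the values are invented for illustration
tmpFile <- tempfile(fileext = ".csv")
writeLines(c("movie,productionCost,totalRevenue",
             "Film A,100,250",
             "Film B,200,180"), tmpFile)

viaCsv <- read.csv(tmpFile, stringsAsFactors = FALSE)
viaTable <- read.table(tmpFile, header = TRUE, sep = ",",
                       stringsAsFactors = FALSE)

all.equal(viaCsv, viaTable)   # the two calls agree on this file
str(viaCsv)                   # one character column, two integer columns
```

The two functions differ in a few other defaults (quoting, comment characters, filling of short rows), so on messier files it is worth checking ?read.table.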

# now, dataSet is of type...
typeof(dataSet); # in type semantics, dataSet is a list. In R we use lists a lot.
class(dataSet); # in object semantics, dataSet is a data.frame!
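If you do not have the csv file at hand, the same point can be checked on a toy data.frame (the name `toy` and its values are invented for illustration). The sketch shows that the members of the underlying list are simply the columns:

```r
# A toy data.frame; the values are invented for illustration
toy <- data.frame(movie = c("Film A", "Film B"),
                  productionCost = c(100, 200),
                  stringsAsFactors = FALSE)

typeof(toy)                 # "list": how R stores it
class(toy)                  # "data.frame": how R treats it
is.list(toy)                # TRUE
length(toy) == ncol(toy)    # TRUE: the list members are the columns
sapply(toy, typeof)         # each column keeps its own type
```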

# what is the first member of the dataSet list?
dataSet[[1]];
# what are the first two members?
dataSet[1:2];
# mind the difference between subsetting a list with [[]] and []
# does a single member of dataSet have a name?
names(dataSet[[1]]);
# of what type is it?
typeof(dataSet[[1]]);
class(dataSet[[1]]);
# do first two elements have names?
names(dataSet[1:2]); # wow
typeof(dataSet[1:2]);
# the first element of dataSet, understood as a character vector, does not have a name
# however, elements OF A list do have names
# can we subset a data.frame object by names?
dataSet$movie; dataSet$movie[1:10];
dataSet$movie[[1]]; class(dataSet$movie[[1]]);
typeof(dataSet$movie[[1]]);
# thus, a character vector is the first member = the first column of the dataSet data.frame
# assumption: testWord is an illustrative character vector;
# any character vector whose first element is 'Ana' fits the steps below
testWord <- c("Ana", "Bob", "Carl", "Dana", "Eve");
testWord[[1]];
testWord[[1:2]]; # error
testWord[1:2];
# similar
dataSet[1:2]; # first two columns of a dataSet
# back to characters
tW <- testWord[1];
tW[1]
tW[2] # NA
# from a viewpoint of a statistical spreadsheet user, NA is used for missing data in R
# what is the second letter in tW == 'Ana'
substring(tW,2,2); # there are functions in R to deal with characters as strings!
# finding elements of vectors
w <- which(testWord == "Ana"); # positions of matching elements (which() call assumed)
testWord[w];
# how many elements in testWord?
length(testWord);
# subsetting testWord, again
testWord[2:length(testWord)];
# length is another important function, like which() or substring()
tail(testWord,2); # vectors have tails, yay!
head(testWord,3); # and heads as well
# a data.frame has a head too, and that knowledge often comes in handy...
head(dataSet,5);
# ... especially when dealing with large data sets
# of course...
tail(dataSet,10);
# another two functions: tail() and head()
# further subsetting of a data.frame object
dataSet$reRelease # columns can have names; reRelease is the name of the 2nd column of dataSet
typeof(dataSet$reRelease); class(dataSet$reRelease);
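The difference between [ ] and [[ ]] is easy to verify on a small object. The sketch below uses a toy list (the name `toyList` and its values are invented for illustration) and repeats, in a self-contained form, the subsetting tricks used so far:

```r
# A toy list; the names and values are invented for illustration
toyList <- list(movie = c("Film A", "Film B", "Film C"),
                reRelease = c(0, 1, 0))

class(toyList[1])        # "list": [ ] keeps the container
class(toyList[[1]])      # "character": [[ ]] extracts the member itself
names(toyList[1])        # "movie"
names(toyList[[1]])      # NULL: the extracted vector has no names

# Indexing past the end of a vector gives NA rather than an error
toyList$movie[5]         # NA

# which() returns the positions where a condition holds
which(toyList$reRelease == 1)        # 2
substring(toyList$movie[1], 6, 6)    # "A"
```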
# automatic type conversion in R: from numeric to logical
is.numeric(dataSet$reRelease);
reRelease <- as.logical(dataSet$reRelease); # conversion to logical (as.logical() call assumed)
reRelease
is.logical(reRelease);
# vectors, sequences...
# automatic type conversion (coercing) in R: from real to integer
x <- 2:10;
# gives the same values as...
x <- seq(2,10,by=1); # note: 2:10 is stored as integer, seq(2,10,by=1) as double
# multiples of 3.1415927...
multipliPi <- x*pi;
multipliPi
# NOTE multiplication * in R operates element-wise
# This is one of the reasons we call it a vector programming language...
is.double(multipliPi);
# type conversion in R: from double to integer
as.integer(multipliPi)
is.integer(multipliPi)
is.integer(as.integer(multipliPi))
# rounding
round(multipliPi,1)
round(multipliPi,2)
# carefully! as.integer() truncates, round() rounds to the nearest value
as.integer(multipliPi) == round(multipliPi,0)
# check documentation
?as.integer
# enjoy...
# more coercion...
num <- as.numeric("123");
is.numeric(num)
ch <- as.character(num)
is.character(ch)
# What do we all love in Data Science and Statistics? Random numbers..!
runif(100,0,1) # one hundred uniformly distributed random numbers on a range 0 .. 1
rnorm(100, mean=0, sd=1) # one hundred random deviates from the standard Gaussian
# all probability density and mass functions in R have similar r* functions to generate random deviates
# Enough! Let's do something for real...
# Q: Is it possible to predict the total revenue from movie production cost?
# Are these two related at all?
# What is the size of the data set?
n <- nrow(dataSet); # number of rows (assumed definition of n)
n
# any missing data? (the sums count the non-missing values; compare them with n)
sum(!(is.na(dataSet$productionCost)));
sum(!(is.na(dataSet$totalRevenue)));
# plot dataSet$productionCost on x-axis and dataSet$totalRevenue on y-axis
plot(dataSet$productionCost, dataSet$totalRevenue);
# are these two correlated?
cPearson <- cor(dataSet$productionCost, dataSet$totalRevenue,method="pearson");
cPearson
# hm, maybe I should use non-parametric correlation instead
cSpearman <- cor(dataSet$productionCost, dataSet$totalRevenue,method="spearman");
cSpearman
# log-transform will not help much in this case...
hist(log(dataSet$productionCost),20); # the default base of log in R is e (natural)
hist(log(dataSet$totalRevenue),20);
# However, who in the World tests the assumptions of the linear model... Kick it!
reg <- lm(dataSet$totalRevenue ~ dataSet$productionCost);
summary(reg);
# get residuals
reg$residuals
# get coefficients
reg$coefficients
# some functions to inspect the simple linear model
coefficients(reg) # model coefficients
confint(reg, level=0.95) # CIs for model parameters
fitted(reg) # predicted values
residuals(reg) # residuals
anova(reg) # anova table
vcov(reg) # covariance matrix for model parameters
# plot model
intercept <- reg$coefficients[1];
slope <- reg$coefficients[2];
plot(dataSet$productionCost, dataSet$totalRevenue);
abline(reg$coefficients); # as simple as that; abline() accepts several forms of input, check it out:
?abline
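A variation worth trying: writing the formula with column names and a data argument makes predict() usable on new data. The sketch below runs on simulated numbers, not the RKO data, so `sim`, `fit`, `newFilms`, and all values are assumptions for illustration only:

```r
set.seed(1)                                   # reproducible simulated data
sim <- data.frame(productionCost = runif(50, 100, 500))
sim$totalRevenue <- 50 + 1.5 * sim$productionCost + rnorm(50, sd = 20)

# Formula with column names + data argument, instead of dataSet$... terms
fit <- lm(totalRevenue ~ productionCost, data = sim)
coef(fit)                                     # estimates should be close to 50 and 1.5

# Predict revenue for two hypothetical production costs
newFilms <- data.frame(productionCost = c(200, 400))
predicted <- predict(fit, newdata = newFilms)
predicted
```

With the dataSet$... style of formula, predict() has no column name to match against the new data, which is one reason the data argument is usually preferred.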

# and now for a nice plot
library(ggplot2); # first do: install.packages("ggplot2"); not now - it can take a while
# library() is a call to use any R package
# of which the powerful ggplot2 is among the most popular
g <- ggplot(data=dataSet,
            aes(x = productionCost,
                y = totalRevenue)) +
  geom_point() +
  geom_smooth(method=lm,
              se=TRUE) +
  xlab("\nProduction Cost") +
  ylab("Total Revenue\n") +
  ggtitle("Linear Regression\n");
print(g);

# Q1: Is this model any good?
# Q2: Are there any truly dangerous outliers present in the data set?

# print is also a generic function in R: for example,
print("Goodbye, and enjoy the holidays with plenty of material for reading and practice!")

# P.S. Play with:
reg <- lm(dataSet$totalRevenue ~ dataSet$productionCost + dataSet$domesticRevenue);
summary(reg) # etc.

## Readings :: Session 2 [5 May 2016, @Startit.rs, 19h CET]

Chapters 1 - 5, The Art of R Programming, Norman Matloff

• Intro to R
• Vectors and Matrices
• Lists