
Welcome to Introduction to R for Data Science Session 1! The course is co-organized by Data Science Serbia and Startit. You will find all course material (R scripts, data sets, SlideShare presentations, readings) on these pages.

## Summary of Session 1, 28 April 2016 :: Introduction to R

Elementary data structures and data.frames, plus an illustrative example of a simple linear regression model. An introduction to basic R data types and objects (vectors, lists, data.frame objects). Examples: subsetting and coercion. Getting to know RStudio. What can R do, and how do you make it perform the most elementary tricks needed in Data Science? What is CRAN, and how do you install R packages? R graphics: simple linear regression with plot() and abline(), and a fancier version with ggplot().
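As a small taste of the data structures listed above, here is a minimal, self-contained sketch. The object names (`v`, `mixed`, `l`, `df`) and their values are made up for illustration and are not part of the session script.

```r
# A vector holds elements of a single type
v <- c(1.5, 2, 3)
typeof(v)               # "double"

# Mixing types in c() coerces everything to the most general type
mixed <- c(1, "a", TRUE)
typeof(mixed)           # "character"

# A list can hold elements of different types
l <- list(name = "Ana", scores = c(9, 10))
class(l["scores"])      # "list": [ ] returns a sub-list
class(l[["scores"]])    # "numeric": [[ ]] extracts the element itself

# A data.frame is a list of equal-length columns
df <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)
typeof(df)              # "list"
class(df)               # "data.frame"
df$y[2]                 # "b"
```

The contrast between `[ ]` and `[[ ]]` shown here for a plain list is the same one the session script explores on the `dataSet` data.frame.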

## R script + Data Set :: Session 1

```
########################################################
# Introduction to R for Data Science
# SESSION 1 :: 28 April, 2016
# Data Science Community Serbia + Startit
# :: Branko Kovač and Goran S. Milovanović ::
########################################################

# This is an R comment: it begins with "#" and runs to the end of the line
# data source: http://www.stat.ufl.edu/~winner/datasets.html (modified, from .dat to .csv)
# from the website of Mr. Larry Winner, Department of Statistics, University of Florida

# Data set: RKO Films Costs and Revenues 1930-1941
# More on RKO Films: https://en.wikipedia.org/wiki/RKO_Pictures

# First question: where are we?
getwd(); # this will tell you the path to the R working directory

# Where are my files?
# NOTE: Here you need to change filesDir to match your local path
filesDir <- "/home/goran/Desktop/__IntroR_Session1/";
class(filesDir); # now filesDir is of a character type; there are classes and types in R
typeof(filesDir);
# By the way, you do not need to use the semicolon to separate lines of code:
class(filesDir)
typeof(filesDir)
# point R to where your files are stored
setwd(filesDir); # set working directory
getwd(); # check

# Read some data in csv (comma separated values)
# - it might turn out that you will be using these very often
fileName <- "rko_film_1930-1941.csv";
dataSet <- read.csv(fileName,
                    header=TRUE,
                    check.names=FALSE,
                    stringsAsFactors=FALSE,
                    row.names=NULL);

# type ? in front of any R function for help
?read.csv
# to find out that read.csv is a member of a wider read* family of functions
# of which read.table is the most generic one

# now, dataSet is of type...
typeof(dataSet); # in type semantics, dataSet is a list. In R we use lists a lot.
class(dataSet); # in object semantics, dataSet is a data.frame!

# what is the first member of the dataSet list?
dataSet[[1]];
# what are the first two members?
dataSet[1:2];
# mind the difference between subsetting a list with [[]] and []
# does a single member of dataSet have a name?
names(dataSet[[1]]);
# of what type is it?
typeof(dataSet[[1]]);
class(dataSet[[1]]);
# do first two elements have names?
names(dataSet[1:2]); # wow
typeof(dataSet[1:2]);
# the first element of dataSet, extracted with [[]] as a character vector, does not have names
# however, the elements OF A list do have names
# can we subset a data.frame object by names?
dataSet$movie;
dataSet$movie[1:10];
dataSet$movie[[1]];
class(dataSet$movie[[1]]);
typeof(dataSet$movie[[1]]);
# thus, a character vector is the first member = the first column of the dataSet data.frame
# a small character vector to experiment with (values other than 'Ana' are illustrative)
testWord <- c("Ana", "voli", "Milovana");
testWord
testWord[[1]];
try(testWord[[1:2]]); # error: [[]] extracts exactly one element
testWord[1:2];
# similar
dataSet[1:2]; # first two columns of a dataSet
# back to characters
tW <- testWord[1];
tW[1]
tW[2] # NA
# from a viewpoint of a statistical spreadsheet user, NA is used for missing data in R
# what is the second letter in tW == 'Ana'
substring(tW,2,2); # there are functions in R to deal with characters as strings!
# finding elements of vectors
w <- which(testWord == "voli");
testWord[w];
# how many elements in testWord?
length(testWord);
# subsetting testWord, again
testWord[2:length(testWord)]; # length is another important function, like which() or substring()
tail(testWord,2); # vectors have tails, yay!
# a data.frame has a head too, and that knowledge often comes in handy...
head(dataSet,5); # ... especially when dealing with large data sets
# of course...
tail(dataSet,10);
# another two functions: tail() and head()
# further subsetting of a data.frame object
dataSet$reRelease # columns can have names; reRelease is the name of the 2nd column of dataSet
typeof(dataSet$reRelease);
class(dataSet$reRelease);
# type conversion in R: from numeric to logical
is.numeric(dataSet$reRelease);
reRelease <- as.logical(dataSet$reRelease);
reRelease
is.logical(reRelease);
# vectors, sequences...
# type conversion (coercion) in R: between integer and real (double)
x <- 2:10; # an integer vector
# seq() gives the same values, here stored as double because by=1 is a double
x <- seq(2,10,by=1);
# multiples of 3.1415927...
multipliPi <- x*pi;
multipliPi
# NOTE multiplication * in R operates element-wise
# This is one of the reasons we call it a vector programming language...
is.double(multipliPi);
# type conversion in R: from double to integer
as.integer(multipliPi)
is.integer(multipliPi)
is.integer(as.integer(multipliPi))
# rounding
round(multipliPi,1)
round(multipliPi,2)
# carefully!
as.integer(multipliPi) == round(multipliPi,0) # not always TRUE: as.integer() truncates toward zero, round() rounds to the nearest value; check documentation
?as.integer # enjoy...
# more coercion...
num <- as.numeric("123");
is.numeric(num)
ch <- as.character(num)
is.character(ch)

# What do we all love in Data Science and Statistics? Random numbers..!
runif(100,0,1) # one hundred uniformly distributed random numbers on a range 0 .. 1
rnorm(100, mean=0, sd=1) # one hundred random deviates from the standard Gaussian
# all probability density and mass functions in R have similar r* functions to generate random deviates

# Enough! Let's do something for real...
# Q: Is it possible to predict the total revenue from movie production cost?
# Are these two related at all?
# What is the size of the data set?
n <- dim(dataSet)[1]; # number of rows
n
# any missing data? (compare the counts of non-missing values with n)
sum(!(is.na(dataSet$productionCost)));
sum(!(is.na(dataSet$totalRevenue)));
# plot dataSet$productionCost on x-axis and dataSet$totalRevenue on y-axis
plot(dataSet$productionCost, dataSet$totalRevenue);
# are these two correlated?
cPearson <- cor(dataSet$productionCost, dataSet$totalRevenue,method="pearson");
cPearson
```

```
# hm, maybe I should use non-parametric correlation instead
cSpearman <- cor(dataSet$productionCost, dataSet$totalRevenue,method="spearman");
cSpearman
# log-transform will not help much in this case...
hist(log(dataSet$productionCost),20); # the default base of log in R is e (natural)
hist(log(dataSet$totalRevenue),20);
```

```
# However, who in the World tests the assumptions of the linear model... Kick it!
reg <- lm(dataSet$totalRevenue ~ dataSet$productionCost);
summary(reg);
# get residuals
reg$residuals
# get coefficients
reg$coefficients
# some functions to inspect the simple linear model
coefficients(reg) # model coefficients
confint(reg, level=0.95) # CIs for model parameters
fitted(reg) # predicted values
residuals(reg) # residuals
anova(reg) # anova table
vcov(reg) # covariance matrix for model parameters

# plot model
intercept <- reg$coefficients[1];
slope <- reg$coefficients[2];
plot(dataSet$productionCost, dataSet$totalRevenue);
abline(reg$coefficients); # as simple as that; abline() accepts several kinds of input, check it out: ?abline
```

```
# and now for a nice plot
library(ggplot2); # first do: install.packages("ggplot2"); not now - it can take a while
# library() is a call to use any R package
# of which the powerful ggplot2 is among the most popular
g <- ggplot(data=dataSet,
            aes(x = productionCost,
                y = totalRevenue)) +
  geom_point() +
  geom_smooth(method=lm,
              se=TRUE) +
  xlab("\nProduction Cost") +
  ylab("Total Revenue\n") +
  ggtitle("Linear Regression\n");
print(g);
```

```
# Q1: Is this model any good?
# Q2: Are there any truly dangerous outliers present in the data set?

# print is also a generic function in R: for example,
print("Goodbye, and enjoy the holidays with a pile of material for reading and practice!")

# P.S. Play with:
reg <- lm(dataSet$totalRevenue ~ dataSet$productionCost + dataSet$domesticRevenue);
summary(reg) # etc.
```
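If the RKO file is not at hand, the same workflow can be tried on a data set that ships with R. The sketch below uses the built-in `cars` data (speed and stopping distance); this choice of data and the name `fit` are assumptions of the example and not part of the session material.

```r
# cars is a built-in data.frame with columns speed and dist
data(cars)
nrow(cars)                          # number of observations
sum(is.na(cars$speed))              # count of missing values in speed

# correlation, parametric and rank-based
cor(cars$speed, cars$dist, method = "pearson")
cor(cars$speed, cars$dist, method = "spearman")

# simple linear regression using the data= argument
fit <- lm(dist ~ speed, data = cars)
coef(fit)                           # intercept and slope
confint(fit, level = 0.95)          # CIs for model parameters

# base graphics: scatter plot plus the fitted line
plot(cars$speed, cars$dist)
abline(fit)                         # abline() also accepts a fitted lm object

# predictions for new speeds (possible because the model was fitted with data=)
predict(fit, newdata = data.frame(speed = c(10, 20)))
```

Writing the formula as `dist ~ speed` with `data = cars`, rather than `cars$dist ~ cars$speed`, keeps the coefficient names short and lets `predict()` work with new data.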

## Readings :: Session 2 [5 May 2016, @Startit.rs, 19h CET]

Chapters 1 - 5, The Art of R Programming, Norman Matloff

• Intro to R
• Vectors and Matrices
• Lists
