Introduction to R for Data Science :: Session 2

May 9, 2016

(This article was first published on The Exactness of Mind, and kindly contributed to R-bloggers)

Welcome to Introduction to R for Data Science Session 2! The course is co-organized by Data Science Serbia and Startit. You will find all course material (R scripts, data sets, SlideShare presentations, readings) on these pages.


Summary of Session 2, 5 May 2016 :: Introduction to R: vectors, matrices, and data frames

Introduction to vectors, matrices, and data frames in R. R is a vector-oriented programming language, so you will be working with vectors, matrices, and n-dimensional arrays a lot. Vectorizing your code, that is, replacing explicit loops with operations on whole vectors, usually makes it run faster. Data frames are the basic containers for most of the data you will handle in R; unlike vectors and matrices, which hold a single data type, a data frame can hold a different type in each column.
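
As a minimal sketch of what vectorization means in practice, the snippet below squares a numeric vector twice: once with an explicit loop and once with a single vectorized expression. The second half shows a data frame holding columns of different types. All names here (`x`, `squares_loop`, `squares_vec`, `people`) and the values in `people` are made up for illustration and are not part of the course script.

```r
# Illustrative names only; not part of the course script
x <- 1:10

# Loop version: fill a preallocated vector element by element
squares_loop <- numeric(length(x))
for (i in seq_along(x)) {
  squares_loop[i] <- x[i]^2
}

# Vectorized version: one expression operates on the whole vector
squares_vec <- x^2

identical(squares_loop, squares_vec) # both approaches give the same result

# A data frame can mix types across columns (each column is still one type)
people <- data.frame(name = c("Ana", "Marko"),
                     age = c(31, 45),
                     member = c(TRUE, FALSE),
                     stringsAsFactors = FALSE)
sapply(people, class) # character, numeric, logical
```

On a vector this small the speed difference is negligible; the vectorized form pays off as vectors grow, and it is also shorter to read.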

R script :: Session 2

########################################################
# Introduction to R for Data Science
# SESSION 2 :: 5 May, 2016
# Data Science Community Serbia + Startit
# :: Goran S. Milovanović and Branko Kovač ::
########################################################

# clear all
rm(list = ls())

char_list <- character(length = 0) #empty character vector
num_list <- numeric(length = 10) #length can be != 0 (0 is the default); elements default to 0
log_list <- logical(length = 3) #elements default to FALSE

# But you can always use good ol' c() for the same purpose
log_list_2 <- c(TRUE, FALSE, FALSE, TRUE, TRUE, TRUE) #some Ts and Fs
num_list_2 <- c(1, 4, 12, NA, 101, 999) #numbers, including one NA
char_list_2 <- c("abc", "qwerty", "test", "data", "science")

# Factor vectors are also part of R
fac_list <- gl(n = 4, k = 1, length = 8, ordered = TRUE,
               labels = c("low", "med", "high", "very high")) #only mentioning now 🙂

# Subsetting is a regular thing to do when using R
char_list_2[2] #single element can be selected
log_list_2[2:4] #or some interval
num_list_2[3:length(num_list_2)] #or even length() function

# New objects can be created when subsetting
test <- num_list_2[-c(2,4)] #or something like this: keeps all but the 2nd and 4th element
test_2 <- num_list_2 %in% test #operator %in% can be very useful
not_na <- num_list_2[!is.na(num_list_2)] #removing NAs using operator ! and is.na() function
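
# A small sketch (na_demo is an illustrative name, not from the course script):
# NA propagates through most computations unless you ask for it to be dropped
na_demo <- c(1, 4, 12, NA, 101, 999)
mean(na_demo) #NA, because one element is missing
mean(na_demo, na.rm = TRUE) #mean of the non-missing elements only
sum(is.na(na_demo)) #how many NAs are there?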

# Vector ordering
sort(test, decreasing = TRUE) #using sort() function
test[order(test, decreasing = TRUE)] #or with order() function

# Vector sequences
seq(1,22,by = 2) #we already mentioned seq()
rep(1, 4) #but rep() is something new 🙂
rep(num_list_2, 2) #replicate num_list_2, 2 times

# Concatenation
new_num_vect <- c(num_list, num_list_2) #using 2 vectors to create new one
new_num_vect
new_combo_vect <- c(num_list_2, log_list) #combination of num and log vector
new_combo_vect #all numbers? false to zero? coercion in action

new_combo_vect_2 <- c(char_list_2, num_list_2) #works as well
new_combo_vect_2 #where are the numbers?
class(new_combo_vect_2) #all characters
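
# A sketch of the coercion order (illustrative names, not from the course script):
# logical -> numeric -> character; c() converts everything to the most general type present
coerce_num <- c(TRUE, FALSE, 3.5) #logicals become 1 and 0
coerce_chr <- c(TRUE, 3.5, "a") #everything becomes character
class(coerce_num) #numeric
class(coerce_chr) #character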

# Matrices are available in R
matr <- matrix(data = c(1,3,5,7,NA,11), nrow = 2, ncol = 3) #2x3 matrix
class(matr) #yes, it's matrix
typeof(matr) #double as expected

matr[,2] #2nd column
matr[3,] #oops, this throws a "subscript out of bounds" error: there's no 3rd row
matr[2,3] #element in 2nd row and 3rd column

matr_2 <- matrix(data = c(1,3,5,"7",NA,11), nrow = 2, ncol = 3) #another 2x3 matrix
class(matr_2) #matrix again
typeof(matr_2) #but not double anymore, type conversion in action!
t(matr_2) #transposed matr_2
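
# A sketch with an illustrative matrix m (not from the course script):
# selecting a single row or column drops the result to a plain vector by default
m <- matrix(1:6, nrow = 2, ncol = 3)
m[, 2] #a vector of length 2
m[, 2, drop = FALSE] #still a matrix, with 2 rows and 1 column
dim(m[, 2, drop = FALSE]) #2 1
dim(t(m)) #3 2, transposing swaps the dimensions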

# What can we do if a matrix needs to encompass different types of data?
# Introducing data frame!

library(datasets) #there are some datasets in base R like mtcars
cars_data <- mtcars

# Some useful information about data frames
str(cars_data) #let's see what we have here
names(cars_data) #column names
?mtcars #dataset documentation is *very* important

# Think of data frame columns as vectors! Because they are!
mean(cars_data$mpg) #mean of cars_data mpg (miles per gallon) column
median(cars_data$cyl) #median of cars_data cyl (cylinders) column

is.list(cars_data[1, ]) #TRUE: a single row is still a data frame, which is a list

# Let's do some data frame subsetting

cars_data[-1, ] # first row out
cars_data[ ,-1] # first column out

cars_data[c(1,3)] #keeping 1st and 3rd column only
cars_data[-c(1,3)] #removing 1st and 3rd column
cars_data[ ,-c(1,3)] #same as the previous line of code

cars_data[!duplicated(cars_data$mpg), ] #maybe we want to remove all cars with same mpg?
#remember it keeps only the first occurrence!

subset(cars_data, mpg < 19) #this is one way (and it can be slow!)
cars_data[cars_data$mpg < 19, ] #this is another one (usually faster)
cars_data[which(cars_data$mpg < 19), ] #and another one (often faster still)
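
# A sketch of how these styles differ when the condition contains NA
# (na_df is an illustrative data frame, not from the course script)
na_df <- data.frame(id = 1:4, mpg = c(15, NA, 22, 18))
na_df[na_df$mpg < 19, ] #logical indexing returns an extra all-NA row for the NA condition
na_df[which(na_df$mpg < 19), ] #which() silently drops that row
subset(na_df, mpg < 19) #subset() drops it as well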

cars_data[cars_data$mpg > 20 & cars_data$am == 1, ] #multiple conditions

cars_data[grep("Merc", row.names(cars_data), value = TRUE), ] #filtering by pattern match

# Data frame transformations
cars_data$trans <- ifelse(cars_data$am == 0, "automatic", "manual") #we can add new columns
cars_data$trans <- NULL #or we can remove them

cars_data[c(1:3,11,4,7,5:6,8:10)] #this way we change column order

# Separation and joining of data frames
low_mpg <- cars_data[cars_data$mpg < 15, ] #new data frame with mpg < 15
high_mpg <- cars_data[cars_data$mpg >= 15, ] #new data frame with mpg >= 15

mpg_join <- rbind(low_mpg, high_mpg) # we can combine 2 data frames like this

car_condition <- data.frame(sample(c("old","new"), replace = TRUE, size = 32)) #creating random data frame
#with "old" and "new" values
names(car_condition) <- "condition" #names() works for all kinds of objects
colnames(car_condition) <- "condition" #colnames() is for "matrix-like" objects, but has the same effect here
rownames(car_condition) <- rownames(cars_data) #use row names of one data frame as row names of other

mpg_join <- cbind(mpg_join, car_condition) #or combine data frames like this
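
One caveat about the last step: cbind() pairs rows by position, not by row name. Because rbind() put the low-mpg cars first, the rows of mpg_join are no longer in the same order as the row names copied onto car_condition. That is harmless for random labels, but with real data you would probably want to align the rows by name first. A minimal sketch, using small made-up data frames (`left` and `right` are illustrative names, not from the course script):

```r
# Illustrative data frames; names and values are made up for this sketch
left <- data.frame(x = 1:3, row.names = c("a", "b", "c"))
right <- data.frame(y = c(30, 10, 20), row.names = c("c", "a", "b"))

# cbind() glues rows together by position; row names of 'right' are ignored
by_position <- cbind(left, right)

# Reordering 'right' by the row names of 'left' aligns the rows first
by_name <- cbind(left, right[rownames(left), , drop = FALSE])

by_position$y # 30 10 20
by_name$y # 10 20 30
```

Indexing with `rownames(left)` is only one way to do this; base R's merge() with `by = "row.names"` is another option when the two data frames do not cover exactly the same rows.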

Readings :: Session 3 [12 May 2016, @Startit.rs, 19h CET]

Chapters 1 – 5, The Art of R Programming, Norman Matloff

• Intro to R
• Vectors and Matrices
• Lists
