R for Quants, Part I.A

Posted on February 12, 2012 by Brian Lee Yung Rowe in R bloggers

[This article was first published on Cartesian Faith » R, and kindly contributed to R-bloggers].

I’m teaching an R workshop for the Baruch MFE program. This is the first installment of the workshop and focuses on some basics, although we assume you already know how to program.

PART I: PRELIMINARIES

PART II: STATISTICS

A. Distributions
B. Optimization and Linear Programming
C. Regression Analysis

PART III: STRUCTURING CODE

A. Dispatching Systems
B. Real World Development

Getting Started

To get the most out of the workshop, you need to have basic programming knowledge. At a minimum you should understand what control structures are and how variable scopes work.

You will also need a working installation of R. As R is open source and popular, it’s available on all major operating systems. To write R code you can use a standard text editor, like vim, or obtain an IDE (e.g. Eclipse or RStudio) if you prefer a visual editor. For the workshop, we will stick with vim.

While R is a language that comes with “batteries included”, there are additional packages that you will need for the workshop. These include:

Note that installing tawny will get you all its dependencies including futile.paradigm and PerformanceAnalytics.

Getting Help

There are a number of ways to get help. The most direct way is to use the R shell. Most functions provide a documentation page that is retrieved by prefixing the function name with a question mark; for example, ?lm opens the help page for the function lm. If you don’t know the specific function, try the help.search() command. At this point, you should know how to get help for this function!
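As a rough sketch of the two lookups described above (the search phrase and the restriction to the stats package are just illustrative choices), the same help system can also be driven through ordinary function calls:

```r
# ?lm is shorthand for help("lm"); both locate the same documentation page.
page <- help("lm")
class(page)

# help.search() (shorthand: ??) matches keywords against installed help pages.
# Limiting the search to one package keeps it quick.
hits <- help.search("linear model", package = "stats")
nrow(hits$matches)   # number of matching help entries
```

Typing page or hits on its own at an interactive prompt displays the help page or the list of matches.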

Search engines are always an option, but with R they can be problematic because the single letter "R" is such a generic search term. Many people have developed solutions to this problem, with the most popular being rseek or a filtered Google search.

The R community has a number of mailing lists for getting help in addition to special interest groups (e.g. R-SIG-Finance), while the younger generation seems to have opted for online Q&A sites as a primary resource.

The R Shell

There are a number of useful functions for interacting with the R shell. In many ways you can think of it as being a lightweight version of bash. Common operations like ls() and rm() exist to list the objects you’ve created as well as remove them. You can also view your search() path and see which packages are loaded.
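A minimal sketch of these shell functions, assuming a fresh session (the object names prices and ticker are made up for illustration):

```r
prices <- c(101.2, 100.8, 102.5)
ticker <- "AAA"

ls()         # names of the objects defined so far
rm(prices)   # remove one of them
ls()         # prices is gone, ticker remains

search()     # attached packages and environments, in lookup order
```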

To install a new package from the shell, use the install.packages() command. R will download and build the package while you wait. Once the package is installed, don’t forget to load it with the library() function.

install.packages('tawny')
library(tawny)

Examining Objects

Any object can be examined directly in the shell by typing its name. Note that some objects, like matrices, will flood your shell, so be careful.

Since R is open source, functions written in pure R can also be viewed directly in the shell. This is useful for learning R as well as debugging code.
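When an object is too large to print comfortably, a few summary functions help. The sketch below uses a made-up 100 x 6 matrix m, and sd is simply one convenient example of a function written in pure R:

```r
m <- matrix(rnorm(600), ncol = 6)   # 100 x 6: printing m would flood the shell

dim(m)       # shape of the object
head(m, 3)   # only the first three rows
str(m)       # compact description of type and size

sd           # no parentheses: prints the source of the function
```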

Vector Primitives

In R, the basic data types are all vectors. There is no separate scalar type, so even a single number has a length():

> length(4)
[1] 1

This seemingly strange idea makes translating mathematical notation into code very easy since vector notation is built-in. That means no loops just to add two vectors together.

> 1:5 + c(1,2,3,4,5)
[1]  2  4  6  8 10

It also means that arithmetic between a vector and a scalar behaves as you would expect mathematically: the scalar is applied to every element. As we’ll discuss later, this behavior also extends to matrices.

> c(2,3,4) + 2
[1] 4 5 6
> c(2,3,4) * 2
[1] 4 6 8

The examples above show two ways to create vectors: the colon operator and c(). Other methods include seq(), which creates a sequence of numbers based on a variety of rules.
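A few of the rules seq() supports, as a small sketch (the variable names are only for illustration; rep() is included as a related constructor):

```r
odds <- seq(1, 9, by = 2)           # fixed step: 1 3 5 7 9
grid <- seq(0, 1, length.out = 5)   # fixed number of points: 0.00 0.25 0.50 0.75 1.00
idx  <- seq_len(4)                  # 1 2 3 4
alt  <- rep(c(1, -1), times = 3)    # repetition: 1 -1 1 -1 1 -1
```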

Subsetting Notation

To access elements within a vector, R provides many handy built-in constructs. The simplest is an indexing notation. More complicated expressions can be applied as well.

> a <- 1:10
> a[4]
[1] 4
> a[a>6]
[1]  7  8  9 10
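To see what the expression inside the brackets produces on its own, here is a small sketch (mask is just an illustrative name):

```r
a <- 1:10
mask <- a > 6            # logical vector with one entry per element of a
mask

a[mask]                  # same result as a[a > 6]
which(mask)              # integer positions where mask is TRUE
a[a > 6 & a %% 2 == 0]   # conditions can be combined: 8 10
```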

This works because R evaluates the expression a>6 across the whole vector, producing a logical vector of the same length; indexing with that logical vector keeps only the elements where it is TRUE. Indexing with a vector of integer positions works the same way, which is how sorting can be applied to a vector.

> b <- sample(1:10)
> b[order(b)]
 [1]  1  2  3  4  5  6  7  8  9 10

What do you suppose is the output of the order() function?

Elements in a vector can also be named. Once defined, these names can be used to access elements.

> names(a) <- strsplit('abcdefghij','', fixed=TRUE)[[1]]
> a
 a  b  c  d  e  f  g  h  i  j
 1  2  3  4  5  6  7  8  9 10
> a['c']
c
3

Operators and functions

By default, R treats a vector as a column vector, which matches the column-oriented layout of its internal data structures. To create a row vector, use the transpose function t():

> t(1:4)
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4

Other common operations, like the inner and outer product, are defined as operators:

> c(1,2,3) %*% c(3,2,1)
     [,1]
[1,]   10
> c(1,2,3) %o% c(3,2,1)
     [,1] [,2] [,3]
[1,]    3    2    1
[2,]    6    4    2
[3,]    9    6    3
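Note that %*% returns a 1 x 1 matrix rather than a plain number. A short sketch of some equivalent spellings (u and v are illustrative names):

```r
u <- c(1, 2, 3)
v <- c(3, 2, 1)

sum(u * v)     # inner product as a plain number: 10
drop(u %*% v)  # drop() strips the 1 x 1 matrix down to a number
outer(u, v)    # the function behind the %o% operator
```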

When working with vectors, R tries its best to protect you from obvious mistakes, like incompatible lengths between the operands. In general, R attempts to do the right thing while issuing warnings or errors for any glaring problems.
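A rough sketch of how length mismatches play out in practice. The tryCatch() wrappers are only there to capture the messages as strings so the script keeps running; the variable names are illustrative:

```r
short <- c(1, 2)
long  <- 1:6

long + short   # 6 is a multiple of 2, so short is recycled silently: 2 4 4 6 6 8

# 5 is not a multiple of 2: R still computes a result but issues a warning.
warn_msg <- tryCatch(1:5 + short, warning = function(w) conditionMessage(w))
warn_msg

# Matrix multiplication of vectors with different lengths is an error.
err_msg <- tryCatch(1:3 %*% 1:4, error = function(e) conditionMessage(e))
err_msg
```

The silent recycling in the first case is convenient, but it is also the one situation where R will not flag a length mismatch.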

Arrays and Matrices

Arrays are vectors that have a dim(ension) attribute. Matrices are simply two-dimensional arrays. Each of these types has a constructor: array() and matrix(), respectively. When creating a matrix, note that it is filled column by column. You can override this behavior with byrow=TRUE, but be aware that performance may degrade since the internal representation is based on columns.

> matrix(1:6, nrow=2)
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
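For comparison, a small sketch of the same data filled by row, a three-dimensional array, and a plain vector turned into a matrix by setting dim directly (all names here are illustrative):

```r
by_col <- matrix(1:6, nrow = 2)                 # default: filled down the columns
by_row <- matrix(1:6, nrow = 2, byrow = TRUE)   # filled across the rows
by_row

arr <- array(1:24, dim = c(2, 3, 4))            # three-dimensional array
dim(arr)

v <- 1:6
dim(v) <- c(2, 3)   # adding a dim attribute turns the vector into a matrix
is.matrix(v)
```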

Since matrices have two dimensions, the names() function will not work to access a matrix’s column or row names. Instead there are colnames() and rownames(). Similarly, length() returns the total number of elements rather than the shape of a matrix; use dim().
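A quick sketch of these accessors on a small matrix (the row and column labels are made up):

```r
m <- matrix(1:6, nrow = 2)
colnames(m) <- c("x", "y", "z")
rownames(m) <- c("r1", "r2")
m

names(m)    # NULL: names() does not see the row or column names
dim(m)      # 2 3
length(m)   # 6: counts every element
nrow(m)     # 2
ncol(m)     # 3
```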

Subsetting Notation

Accessing specific elements of a matrix is accomplished using similar subsetting notation. Since there are two dimensions, an index can be applied to either dimension, or a full column or row can be accessed. Notice that the printed output of the matrix actually shows you the notation.
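A minimal sketch of the notation on a small matrix. Leaving an index empty selects everything along that dimension:

```r
m <- matrix(1:6, nrow = 2)

m[2, 3]   # single element: 6
m[1, ]    # entire first row: 1 3 5
m[, 2]    # entire second column, returned as a plain vector: 3 4

m[, 2, drop = FALSE]   # keep the result as a 2 x 1 matrix
```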

Exercise: Use rnorm() to generate a 20 x 6 matrix. Add column names to the matrix: C, F, T, A, D, K. Extract column A. How do you extract more than one column?

Another common technique for creating matrices is to use either rbind() or cbind() with existing vectors or arrays.
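As a sketch, binding two illustrative vectors x and y by row and by column:

```r
x <- 1:3
y <- 4:6

by_rows <- rbind(x, y)   # 2 x 3 matrix, rows named after the vectors
by_cols <- cbind(x, y)   # 3 x 2 matrix, columns named after the vectors

by_rows
by_cols
```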

Important Types

Lists

A list is a general purpose data structure or object that stores named elements. Objects can be stored within a list at multiple levels. Once a list is created, elements can be accessed by name or index. When using an index, typically the special double bracket notation is used unless you want another list back [1].

> li <- list(a=rnorm(5), b=1:8, c='label')
> li$b
[1] 1 2 3 4 5 6 7 8
> li[[2]]
[1] 1 2 3 4 5 6 7 8
> li[2]
$b
[1] 1 2 3 4 5 6 7 8
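Lists can also be nested. A small sketch with a hypothetical book of positions (all names and values are invented):

```r
book <- list(name = "demo", positions = list(AAA = 100, BBB = -50))

book$positions$AAA             # drill down by name: 100
book[["positions"]][["BBB"]]   # same idea with double brackets: -50

class(book["name"])     # "list": single brackets return a sub-list
class(book[["name"]])   # "character": double brackets return the element
names(book)
```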

Data.frames

While matrices require data to be of a consistent type, a data.frame allows arbitrary types for each column.
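A minimal sketch with three columns of different types (the tickers and prices are invented; stringsAsFactors is set explicitly so the result does not depend on the R version):

```r
quotes <- data.frame(
  ticker = c("AAA", "BBB", "CCC"),
  price  = c(10.5, 20.25, 7),
  active = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

sapply(quotes, class)     # one type per column
quotes$price              # a column is an ordinary vector
quotes[quotes$active, ]   # rows can be filtered like matrix rows
```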

Factors

A factor is essentially an enum. Performance benefits exist when using factors for grouping or filtering since the comparison is faster than with a string. Be careful, though: before R 4.0.0, functions such as data.frame() and read.csv() converted string data to factors by default, which can result in unexpected behavior.
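A short sketch of what a factor stores, plus one common surprise when numbers arrive as factor labels (the example values are made up):

```r
side <- factor(c("buy", "sell", "buy", "hold"))
levels(side)       # the distinct labels: "buy" "hold" "sell"
as.integer(side)   # the integer codes behind them: 1 3 1 2
table(side)        # grouping by level

# Pitfall: numbers stored as factor labels convert to codes, not values.
qty <- factor(c("10", "5", "10"))
as.numeric(qty)                 # 1 2 1 (the codes)
as.numeric(as.character(qty))   # 10 5 10 (the values)
```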

Coercion

Sometimes the data you get needs to be converted to a different format. Most type constructors have corresponding as.* functions to coerce data into the given type. A typical usage is converting a string to a date via as.Date().
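A few representative coercions, as a sketch (the input strings are arbitrary examples):

```r
num <- as.numeric("3.14")
int <- as.integer("42")
chr <- as.character(1:3)

day  <- as.Date("2012-02-12")                        # ISO dates need no format hint
day2 <- as.Date("02/12/2012", format = "%m/%d/%Y")   # other layouts need a format
day == day2

# Values that cannot be converted become NA (normally with a warning).
bad <- suppressWarnings(as.numeric("n/a"))
is.na(bad)
```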

Exercise: Given the following data.frame, get the average of the values for label b.

> l <- sample(strsplit('abcdefg','',fixed=TRUE)[[1]],10,replace=TRUE)
> d <- data.frame(cbind(value=rnorm(10), label=l))

Hint: Use anytypes() to see the type for each column in the data.frame.

Reading Data

read.csv

The most common method for getting data into R is by reading a file. Typically the family of read.* and write.* functions is used for general purpose reading and writing of data.frames, while scan() is sometimes used directly, for example when reading all-numeric matrices.

> df <- read.csv(textConnection("a,b,c
+ 1,2,3
+ 4,5,6
+ 7,8,9"))
> write.csv(df, file='dummy.csv')
> dd <- read.csv('dummy.csv')

Notice anything strange when reading this back in?
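For the scan() case mentioned earlier in this section, a minimal sketch that reads whitespace-separated numbers from an inline string instead of a file (the values are arbitrary):

```r
# scan() returns a flat numeric vector; the shape is added afterwards.
vals <- scan(text = "1 2 3
4 5 6", quiet = TRUE)

m <- matrix(vals, nrow = 2, byrow = TRUE)   # byrow matches the layout of the text
m
```

With a real file, the text argument would be replaced by the file name.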

Other Methods

With the help of additional packages, you can load data from a database (DBI, RODBC, RMySQL), or read financial data from publicly available sites (quantmod, Rmetrics).