Classes and Objects in R
[This article was first published on bRogramming, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Welcome back! In this blog post I’m going to try to tackle the concept of objects in R. R is said to be an “object oriented” language. I touched on this in my last post when we discussed the concatenate function c() and I’ll go a bit beyond that this time. Speaking of the c() function, I’ll begin this post by divulging the answer to the Challenge from last time.
Solution:
The solution to last post’s challenge required you to compute the Kronecker product of < 1, 2, 3 > and < 1, 2, 3, 4, 5 >. The solution I wanted you to come up with used a combination of the c() function and the multiplication operator, all packed inside another call to c().
c(c(1, 2, 3, 4, 5) * 1, c(1, 2, 3, 4, 5) * 2, c(1, 2, 3, 4, 5) * 3)
[1] 1 2 3 4 5 2 4 6 8 10 3 6 9 12 15
Hopefully you arrived at this after some trial and error. Programming in R requires a lot of trial and error – don’t get discouraged.
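Alternatively, you could have used a function to do this:
kronecker(c(1, 2, 3, 4, 5), c(1, 2, 3))
[1] 1 2 3 2 4 6 3 6 9 4 8 12 5 10 15
(The values come out in a different order than in my c() solution because I swapped the two arguments here; kronecker(c(1, 2, 3), c(1, 2, 3, 4, 5)) reproduces the c() result exactly.)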
This is slightly more elegant for this problem and as you can imagine, if you want to compute a Kronecker product of two larger objects, this quickly becomes much more practical. This is a lesson about R. If there is something tedious that you want to compute, there is usually a function to make it drastically less tedious. Of course I wanted you to solve the problem with c() and * because that’s what the post was about, but Bill Gates is often quoted as saying “I’d hire a lazy person over a hardworking one because the lazy person would find the simplest way to do something”. My interpretation of this: code hard, but code smart.
The Kronecker product example is a good lead-in to this post also because I gave an example with a pair of matrices. How did I tell R that my two groups of 4 numbers were matrices and not vectors or something else? The short answer is to reveal the code:
a <- matrix(c(1, 1, 1, 2), nrow = 2)
b <- matrix(c(1, 3, 2, 4), ncol = 2)
kronecker(a, b)
     [,1] [,2] [,3] [,4]
[1,]    1    2    1    2
[2,]    3    4    3    4
[3,]    1    2    2    4
[4,]    3    4    6    8
This code requires a few explanations though. The first is the assignment operator. I talked about operators last time, but I saved the best for last. The assignment operator exists in 3 forms: <-, ->, and =. These create an object in R's memory that can be called back into the command window at any time. Once I've defined a and b as I have above, I can simply call them by name like I did in my call to the function kronecker(). This is great for many obvious reasons. Like I said earlier, programming in R requires lots of trial and error and you can certainly save some time and keystrokes by naming something once and calling it by a few letters afterwards. Some facts of life pertaining to the assignment operator:
- The arrow assignment operators always point towards the object name. (Try reversing the arrow in the statement above that defines a. You get an error because you can't assign letters to a matrix; that doesn't make sense.)
- Always use either <- or ->; the equal sign can get confusing. It's not always clear what is the “name” and what is the “meat” of your object. An arrow does away with this confusion.
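To make the three forms concrete, here is a minimal sketch. The names x, y and z are just placeholders made up for illustration.

```r
# Three ways to assign; each arrow points at the name being defined.
x <- 5    # leftward arrow: name on the left
10 -> y   # rightward arrow: name on the right
z = 15    # equals sign: works here, but reads less clearly

x + y + z # all three objects can now be called back by name
```

All three lines create an object in exactly the same way; the only difference is which side the name sits on.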
Now is a good time to point out a feature of R Studio you'll want to use often. Now that you've assigned 2 objects, look in the top right panel of your R Studio environment and click the “Work space” tab. (Also try clicking the “History” tab up there. See what that does?) You'll see the 2 objects you've created sitting in there with a description of what kind of object they are.
This is a good segue into the main portion of this post. Our two objects a and b are each described as a “2x2 double matrix”. What does this mean? It means 3 things.
1. This object is a matrix. This is a consequence of how we defined it - we used the matrix() function to create a and b, hence they are matrices.
2. The size of the matrix is 2 by 2. A matrix with 3 rows and 2 columns would be 3x2, etc.
3. Our matrix is populated by numbers of the class “double”. To help explain this, I'll steal a quote from the R help page about double objects:
All R platforms are required to work with values conforming to the IEC 60559 (also known as IEEE 754) standard. This basically works with a precision of 53 bits, and represents to that precision a range of absolute values from about 2e-308 to 2e+308. It also has special values NaN (many of them), plus and minus infinity and plus and minus zero (although R acts as if these are the same).
In other words this is a pretty standard way of representing some number in such a way that most computers and programs can universally recognize them as what they are.
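A few of the special values mentioned in that quote can be poked at directly. This is only a quick sketch using base R, and nothing in it is specific to our matrices.

```r
# Special double values described in the help page quote
1 / 0                    # Inf (plus infinity)
-1 / 0                   # -Inf (minus infinity)
0 / 0                    # NaN (not a number)
0 == -0                  # TRUE: R treats plus and minus zero as the same
.Machine$double.digits   # 53, the number of bits of precision
is.double(1.5)           # TRUE
```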
We can ask R for each of these descriptions:
class(a)
[1] "matrix"
This tells us that a, as a whole, is a matrix. (In R 4.0.0 and later the output is "matrix" "array", because matrices are now also treated as arrays.) The class() function is extremely useful.
dim(a)
[1] 2 2
This tells us that a is 2x2 (“dim” is short for “dimensions”). I also frequently use the dim() function.
class(a[1])
[1] "numeric"
This tells us that the first element in a is of the class “numeric”. (Note that “numeric” and “double” are effectively synonymous in R: “double” is how a numeric value is stored.) You're thinking “Wait, what were the brackets in there? What do those do?”. Excellent question! Brackets are how you index into objects to pull out individual components. If I want to know what the fourth element in a is, I would type:
a[4]
[1] 2
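Since a is a matrix, you can index it either by position in the underlying vector or by row and column. A small sketch, re-creating a so the snippet runs on its own:

```r
a <- matrix(c(1, 1, 1, 2), nrow = 2)

a[4]       # fourth element, counting down the columns
a[2, 2]    # the same element, addressed as row 2, column 2
a[, 2]     # the whole second column
typeof(a)  # "double": how the numbers inside a are stored
```

R fills a matrix column by column, which is why the fourth element is the one in the bottom-right corner.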
More on this shortly. For now, appreciate what R Studio does for you - it collects all this information about a for you and displays it neatly in the corner of your screen to help you keep track of the properties of different objects as you accumulate many objects in your work space. Click on one of the objects in the “Work space”. Neat huh? These are all little perks of R Studio that make life in R a little more organized.
Back to the issue of classes
There are many classes in R and each has different rules. It is possible to build your own class in R subject to your own specific set of rules (much like in Java or C++ or some other language), but this is not necessary most of the time. It's also something I don't mess with because I don't really have a computer science background. What I will do is briefly explain the most important classes that you will encounter and use in everyday R programming.
Numeric
We've already discussed numeric objects briefly but they belong at the top of this list. The numeric class is appropriate for almost anything, well, numeric. It is almost always the default class when you create an object with exclusively numbers. Remember the c() function we used in the last post to play with operators? Try this:
class(c(1, 2, 3, 4, 5))
[1] "numeric"
This is given all the attributes of a numeric object by default. You can also define a numeric object without explicitly inputting numbers. For example:
c <- pi
d <- sqrt(2)
(R recognizes pi as a value - try just typing pi.) Above, I have not explicitly defined c and d, but rather defined them as the result of some mathematical operation. After all, I could not explicitly define either of these numbers, they are both irrational!
If you want to try to coerce some object into a numeric value, a function exists for this: as.numeric(). Try:
as.numeric(a)
[1] 1 1 1 2
What was previously a matrix is returned as just one row of numbers, as if you had entered them with the c() function instead of the matrix() function.
If you want to check if an object is numeric already, there is a function for that too:
is.numeric(a)
[1] TRUE
R returns a logical value of either TRUE or FALSE. But wait a minute, I thought a was a matrix, but now it's numeric?? That's right, because every element of a is numeric, is.numeric() returns TRUE. If one element of a were a letter for example, the whole matrix would be stored as characters and R would return FALSE. The is.“something”() and as.“something”() functions are sort of universal for any class.
Integer
The integer class is kind of a sub-class of the numeric class. Observe:
e <- as.integer(3)
is.integer(e)
[1] TRUE
is.numeric(e)
[1] TRUE
e belongs to two classes - integer and numeric.
Note that the converse is NOT true:
f <- 3
is.integer(f)
[1] FALSE
is.numeric(f)
[1] TRUE
Although I assigned an integer value to f, by default it is committed to R's memory as a numeric value. f <- 3 is the same as f <- as.numeric(3).
Logical
Logical values are TRUE/FALSE values. For an example, I'll return to an example you might recall from last post, but this time I'll save it as an object so I can examine its properties more easily.
g <- c(1, 2, 3, 4, 5) <= 3
class(g)
[1] "logical"
g is a logical vector. There exist functions is.logical() and as.logical(), just like for the other classes we've discussed. The as.logical() function classifies 0s as FALSE and anything other than 0 as TRUE. Observe:
as.logical(c(0, 1, 2))
[1] FALSE TRUE TRUE
This conversion from numbers to logical values goes the other way too. I'll demonstrate this on our logical vector g.
as.numeric(g)
[1] 1 1 1 0 0
This converts the TRUEs to 1s and the FALSEs to 0s. as.integer() would do the same thing, but the result would be of the subclass “integer” that is a subclass of “numeric”. This conversion of logical values to numeric values can be quite useful. For example, suppose I want to know how many students in a class are of legal drinking age and I have a list of their ages:
ages <- c(20, 21, 19, 22, 19, 20, 22, 21, 20, 19, 21)
sum(ages >= 21)
[1] 5
I've applied the sum() function to a logical vector and it returns to me a numeric answer. If I have a long vector of ages, this method is much easier than counting by eye the number of students who are 21 or older.
I can also use a logical vector to pick out elements of a vector that satisfy some condition (or many conditions). Recall the brackets used to identify certain elements within an object.
ages[ages < 21]
[1] 20 19 19 20 20 19
This pulls out of the ages vector all values that satisfy my condition (are less than 21). Once again, very useful.
Character
This is a class that we haven't touched yet and there's no way to completely cover everything you could do with/to characters in R. There exist all kinds of fancy algorithms that make sense of character data and do various things with it. Take for example a spell checker. This is a program written to deal with characters. While you could do something like this in R, there are other programming languages that are better suited for this. I'll cover some basic things involving characters that are useful for doing statistics, but keep in mind I'm only scratching the surface.
Characters are also known as strings. This is a less confusing term if you ask me. “Character” makes me think of one letter while “string” makes me think of a few letters strung together. Character objects in R can be letters, words, sentences, whatever. To create a character object in R, you must put it inside either single or double quotes. Try:
h <- "string"
h
[1] "string"
There is an as.character() function and an is.character() function as there were for other classes, but note that many operators no longer work with strings. One might think that 'ab' + 'c' would yield 'abc', but this is not the case. R returns an error. Similarly the other mathematical operators return an error when applied to strings. Some logical operators still work though. Try:
"string" %in% "character string"
[1] FALSE
This does not return an error, but you're probably thinking “But 'string' IS in 'character string'!” Not to R it's not. Try:
"string" %in% c("character", "string")
[1] TRUE
Everything within one set of quotes is a single object. The %in% operator only asks whether that whole object appears among the elements on the right; it never looks inside a string.
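If + doesn't glue strings together, what does? Base R's paste() and paste0() do. A minimal sketch follows; nchar() and substr() are included only to show that R can still get at the letters inside a string when you ask for them explicitly.

```r
paste0("ab", "c")              # "abc": concatenation with no separator
paste("character", "string")   # "character string": joined with a space by default
nchar("string")                # 6: number of characters in the string
substr("string", 1, 3)         # "str": characters 1 through 3
```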
As an aside, I'd like to point out that some other operators also work on strings, in a way that can look odd at first:
"a" < "b"
[1] TRUE
This is TRUE, as one might expect, but
"b" < "abc"
[1] FALSE
Confused? I was too. R compares strings the way a dictionary orders words: character by character from the left, not by length. Since “a” comes before “b”, “abc” sorts before “b”, so “b” is not less than “abc”. The exact ordering of upper case, lower case and accented letters depends on your locale, so using comparison operators with strings in R can still give you some unexpected, unintuitive results. So be careful!
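One way to see what < is doing with strings is to sort a few of them, since sort() uses the same ordering. A small sketch with made-up values:

```r
words <- c("b", "abc", "a", "ab")

sort(words)     # "a" "ab" "abc" "b": dictionary-style order
"abc" < "b"     # TRUE: "a" comes before "b", so length doesn't matter
"b" < "abc"     # FALSE, for the same reason
```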
Back to comparing strings. If 'string' is not %in% 'character string', how do we search for certain patterns regardless of whether they constitute a whole character object or just part of one? Excellent question. This comes up sometimes in statistics when you deal with categorical data. Not everything you measure is a number. Some data is more “multiple choice”. The things you're observing belong to categories (e.g. blue, green, purple, blue-green, or black). What if I simply want to know how many observations contained 'green'? I could of course search for 'green' and 'blue-green' and add them, but I could also do something more elegant. Meet the “g-something” family of functions:
i <- c("blue", "green", "purple", "blue-green", "black")
grep("green", i)
[1] 2 4
This returns an integer vector with the position of each element that matched the pattern you were searching for. In this example, the 2nd and 4th elements of i contained the pattern 'green'.
Note the arguments of this function are: grep('pattern', x), where the pattern is what you're searching for and x is what you're searching through. In our case, x is i and it is a character object with 5 elements. (I often forget what comes first, the pattern or the x.) There is an additional optional argument - ignore.case which is by default FALSE, but can be set to TRUE. For example:
grep("Green", i, ignore.case = FALSE)
integer(0)
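For contrast, here is the same search with ignore.case switched on. A quick sketch, re-creating i so it runs by itself:

```r
i <- c("blue", "green", "purple", "blue-green", "black")

grep("Green", i)                      # integer(0): case-sensitive by default
grep("Green", i, ignore.case = TRUE)  # 2 4: now "Green" matches "green"
```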
This function is very useful for subsetting. Recall our use of brackets earlier. If i[1] returns the first element of i, i[grep('green',i)] returns all the elements in i that contain the pattern 'green'. Handy!
This answers the question “Which elements contain my pattern?” one way, but there's another way to answer the same question.
grepl("string", "character string")
[1] TRUE
The additional l in grepl() stands for “logical”. This function returns a logical vector of the same length as your initial vector.
Challenge:
You weren't expecting it yet were you? Stay on your toes because the Challenge pops up when the Challenge feels like it.
Use grepl() to pull out only those elements of i that contain the pattern 'bl'.
Hint: Set it up like I did with grep(), but throw in a logical operator too.
There are a few more functions in the “g-something” family, but there's only one more I use on even a semi-regular basis:
gsub("bl", "X", i)
[1] "Xue" "green" "purple" "Xue-green" "Xack"
The “sub” in gsub() stands for… You guessed it, “substitute”. It searches for a pattern and when it finds that pattern, substitutes it with some replacement that you specify.
Factor
Factors in R look like a special type of character object (under the hood, R stores them as integer codes with a label for each category). Remember earlier I mentioned categorical data? Well, factors are designed to make categorical data easy in R. Factor objects have set categories (called levels) that all members must fall into. Imagine a psychology experiment in which you are trying to compare the effects of two different medicines. You have a third of your subjects take medicine A, another third take medicine B and the last third take a placebo. You, being a good experimenter, record which medicine each patient is given:
subject.names <- c("Jane", "Jill", "Bob", "Bill", "Grace", "Patrick")
treatment <- c("A", "A", "B", "B", "Placebo", "Placebo")
treatment.f <- as.factor(treatment)
(I'm hoping you've already figured out that there is an as.factor() and an is.factor() and you can guess what they do.)
treatment and treatment.f are now totally different objects. This is especially useful for statistical analysis which I'll talk a lot about later on, but for now I just want you to know that factors exist and that they are similar to strings because they deal with non-numeric information, but they are also very different from strings. There are a couple functions that you can call on factors that are very useful. The first is levels():
levels(treatment.f)
[1] "A" "B" "Placebo"
R recognizes treatment.f as categorical data and automatically identifies all of the categories for you. These are returned by using the levels() function.
summary(treatment.f)
      A       B Placebo
      2       2       2
summary() is extremely useful. It shows you the categories and how many members each has. Try calling summary on plain old treatment. This still returns some information about treatment, but it is much less informative if we are treating this as categorical data instead of just a character object. Try calling summary() on some other objects we've created as well. This is a very useful function in general.
You can also use the “g-something” functions on factors.
subject.names[grep("placebo", treatment.f, ignore.case = TRUE)]
[1] "Grace" "Patrick"
We'll go a lot more in depth on factors later when we get to basic statistical analysis, but for now, know that it is a class built for categorical data and it makes life really easy for dealing with such data.
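To see how treatment and treatment.f differ, you can peek at what R stores for each. A short sketch, re-creating the two objects from above; as.integer(), nlevels() and table() are base R functions not otherwise covered in this post.

```r
treatment <- c("A", "A", "B", "B", "Placebo", "Placebo")
treatment.f <- as.factor(treatment)

class(treatment)          # "character"
class(treatment.f)        # "factor"
as.integer(treatment.f)   # 1 1 2 2 3 3: the integer code behind each entry
nlevels(treatment.f)      # 3 categories
table(treatment.f)        # counts per category, much like summary()
```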
Date
Dates in R are the bane of my existence right now. They come in a variety of flavors, only some of which are compatible with some functions. But before I start ranting about date values, let's cover the basics.
Dates are crucial. Almost every experiment takes place over time and a good experimenter accounts for this. If you ever do research, you will at some point encounter dates in your data. The passage of time is the only thing more certain than gravity and taxes. Date values in a computer program are tricky. They can't be alphabetized, but they obviously have a natural order. For the computer to recognize and take advantage of this, you must first tell the computer that it's dealing with dates and not funny division problems (10/21/2012) or subtraction problems (10-21-2012). Here's an example:
as.Date("10/21/2012", format = "%m/%d/%Y")
[1] "2012-10-21"
There are 4 important things going on here.
1. I used an as.something() function. You saw this coming. This one capitalizes Date though - all the others were lower case (as.numeric(), as.integer(), etc). Curve ball. Whoa.
2. I entered my date value as a string. If I hadn't, it would have tried to convert 10 divided by 21 divided by 2012 into a date.
3. The computer returns it in a different format (Year-Month-Day). This is the computer's preferred format and what it will always convert dates to, regardless of how you enter it.
4. The format = argument. This is crucial.
The percent sign followed by a letter causes R to expect a specific type of entry. For example where you specify %m, R now expects a number 1-12 that it assumes corresponds to a month. (Also note the delimiters in between my %something's. In this case I have separated my days/months/years with a slash, but I could have also used a dash or a space.) If you tried to put a 13 in the %m (month) spot, R would be confused and angry, and it would return an NA instead of a Date object. R returns an NA (a missing value essentially) for other impossible inputs as well. Take for example Feb. 29th, 1900 - it looks like a leap day, except that century years are only leap years when they are divisible by 400, and 1900 is not:
as.Date("29-2-1900", format = "%d-%m-%Y")
[1] NA
Again, R returns an NA because this is a day that does not exist. Fun fact: Excel's default calendar system still treats this day as if it existed, a quirk originally kept for compatibility with Lotus 1-2-3.
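The leap-year rule is easy to check for yourself: a year divisible by 100 is a leap year only if it is also divisible by 400. A minimal sketch, where fmt is just a name I picked for the format string:

```r
fmt <- "%d-%m-%Y"

as.Date("29-2-1900", format = fmt)   # NA: 1900 is divisible by 100 but not by 400
as.Date("29-2-2000", format = fmt)   # "2000-02-29": 2000 is divisible by 400
as.Date("29-2-2012", format = fmt)   # "2012-02-29": an ordinary leap year
as.Date("21/13/2012", format = "%d/%m/%Y")  # NA: there is no month 13
```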
The capital Y in %Y indicates that this is where you are going to put a year, and the fact that it's capital means that it is going to be 4 digits instead of 2. There are a bunch of these %something formattings that you can use. For a decent overview, type ?strptime into your command line, look at the panel to the right of your command panel in R Studio and scroll down some.
BTW, you've just discovered one use for this panel in R Studio. Type ? and the name of any function you have a question about. The help documentation on that function pops up in the lower right-hand panel of your R Studio window. This is infinitely useful.
Challenge, Part II:
Yeah there's a part II. And it's way better than a part 2.
Convert 'Feb 28 1900' to a date.
Hint: Use the help page.
You can do useful things with dates in R once you've gotten them formatted as date objects. Let's take a look at some examples:
thanksgiving <- as.Date("11/22/2012", format = "%m/%d/%Y")
christmas <- as.Date("12/25/2012", format = "%m/%d/%Y")
christmas - thanksgiving
Time difference of 33 days
Cool! If we subtract 2 dates, R tells us the time difference. The result is actually a member of a class we haven't talked about - the difftime class. This isn't super important, but worth pointing out. Try adding 2 dates and R tells you that the + operator is not defined for Date objects. Makes sense. I can't think of a situation where it would be useful to add dates. Try logical operators on dates, such as a “less than”:
christmas < thanksgiving
[1] FALSE
That's right, Thanksgiving is actually before Christmas, despite what Hallmark and Hersheys would have you think.
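A few more things you can do once R knows these are dates. This is only a sketch; gap is a name I made up, and seq() and format() are base R functions that this post doesn't otherwise cover.

```r
thanksgiving <- as.Date("11/22/2012", format = "%m/%d/%Y")
christmas <- as.Date("12/25/2012", format = "%m/%d/%Y")

gap <- christmas - thanksgiving
class(gap)               # "difftime"
as.numeric(gap)          # 33: the plain number of days
thanksgiving + 7         # "2012-11-29": adding a number of days does work
format(christmas, "%A")  # day of the week, spelled out in your locale
seq(thanksgiving, by = "week", length.out = 3)  # three dates, one week apart
```

Adding two dates fails, but adding a plain number to a date is well defined: R reads the number as a count of days.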
To wrap up dates, they are different from a factor because while they are sort of categories or bins, some dates are more similar than others (today's data should be more similar to tomorrow's than data from 2 weeks ago would be, right?), while there is no inherent order to a factor. It would be excruciating to try and quantify this yourself by manually finding the time difference between all of your observations (although some people work around this by simply numbering the days of their study). R saves you the trouble of doing either. Just give R a date and a format argument so it can interpret that date, and R can do all the underlying math for you.
To wrap up this whole post, I'd like to dearly thank those of you who read the entire last post and even attempted the Challenge. It meant a lot to know that someone read it. I'd also like to acknowledge that perhaps the Challenge was a bit too challenging. I tried to tone it down along with the puns. Thanks for reading and I hope you enjoyed it/learned something. Cheers!