Annotations for “R For Dummies”

Posted on October 15, 2012 by Pat

[This article was first published on Portfolio Probe » R language, and kindly contributed to R-bloggers.]

Here are detailed comments on the book. Elsewhere there is a review of the book.

How to read R For Dummies

In order to learn R you need to do something with it. After you have read a little of the book, find something to do. Mix reading and doing your project.

You cannot win if you do not play.

Two complementary documents

They are also complimentary.

Some hints for the R beginner

“Some hints for the R beginner” is a set of pages that give you the basics of the R language. It is a completely different approach to the one R For Dummies takes — you may want to investigate it.

The R Inferno

If you are just at the beginning of learning R, you should ignore The R Inferno (except perhaps Circle 1).

When you start using R for real and run into problems, that is the time to pick it up and see if it helps.

Missing piece

There is one thing that I think is missing in R For Dummies. Actually it isn’t missing: it comes at the very end, while I think it should be at the start.

That piece is the search function or, more specifically, the way that R operates, which the results of the search function highlight.

The start of “Some hints for the R beginner” talks about search and how R finds objects.
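As a rough illustration of that idea (a sketch, not something taken from the book), the search function returns the places R looks through, in order, when you use a name. The entries depend on which packages you have attached, so no output is shown here:

```r
# The search list: where R looks for objects, in order.
path <- search()
print(path)

# The global environment always comes first ...
path[1]

# ... and the base package always comes last.
path[length(path)]

# find() reports which entries on the search list hold a given name.
find("mean")
```

When you type a name, R walks down this list and uses the first object it finds with that name.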

How to use these annotations

first learning

If you are new to R and first reading the book, then you should probably mostly ignore my comments. However, when you are confused by something in the book, you can look to see if there is a comment on that page that pertains to what you are confused about.

revising

On further reading, these comments are more likely to be of use. Some are clarifications, some are extensions.

Page by page comments

These comments are based on the first printing.

Page 10

There is more history in the Inferno-ish R presentation.

Page 11

distribution

I’m not a lawyer, but I think the phrasing about redistribution is not right. I think it should say “change and redistribute” rather than “change or redistribute”.

If what you do never leaves your entity, then you can do absolutely whatever you want. That is the free as in speech part. Legalities only come into play if what you do is made available to others. It is a common misunderstanding that you are restricted in what you do within your own world.

runs anywhere

The book highlights that R runs on many operating systems. It fails to make clear that the objects that it creates on the operating systems are all the same. You can start a project on a Linux machine at work, continue it while you commute with your Mac laptop, and then finish it on your Windows machine at home. No problem.

Page 12

The book should tell you not to be afraid of new words. New words like “vector”. You don’t need to make friends with them right away, but don’t be scared off.

(technical) Unhappily the word “vector” in R has several meanings — so it is unfortunate that it is the first new word. The meaning used throughout the book is the most common meaning. See The R Inferno (Circle 5.1) for the gory details.

Page 13

statistics

Pretty much everywhere in the book where it says “statistics” I would prefer “data analysis” instead. Statistics in many people’s mind is formal and academic, not like what they do. More people can feel comfortable doing data analysis than statistics.

In addition to the fear factor, there really is a (slight) difference between data analysis and statistics. I think data analysis is more important even though I’m trained as a statistician.

fields of study

There are additional fields of study where R is used that are not considered to be data hotbeds, such as music and literature. The flexibility of R becomes very important for data in non-traditional forms.

Page 23

vectors

If you are new to R, you shouldn’t expect yourself to understand this discussion. Just let it sink in over time.

Page 24

assignment operator

Always put spaces around the assignment operator. That makes the code much more readable.

The book tells you on page 63 that you can use = as well. You will see both used. They are mostly the same (differences are explained in The R Inferno, Circle 8.2.26). I agree with the book’s approach to use <- but really you can use either.
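A small sketch of one place where the two operators do differ, namely inside a function call. The object names here are made up for the illustration:

```r
# At the top level the two operators do the same thing.
a <- 1:10
b = 1:10
identical(a, b)   # TRUE

# Inside a function call they are different.
# '=' just says which argument gets the value; it does not change the
# 'x' in your workspace.
x <- "untouched"
median(x = 1:5)
x                 # still "untouched"

# '<-' does an assignment first and then passes the value along,
# so 'x' is overwritten as a side effect.
median(x <- 1:5)
x                 # now 1 2 3 4 5
```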

Page 28

RStudio

A nice feature of the RStudio workspace view is that it categorizes the objects.

Page 29

Windows pathnames (technical)

The book implies that you cannot write Windows pathnames with backslashes. Actually you can; you just need to put a double backslash wherever you want a backslash. Hence it is easier and (often) less confusing to use slashes rather than backslashes.
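To see the double backslash at work, here is a short sketch. The path is hypothetical and nothing is read from disk:

```r
# Hypothetical path, used only for illustration.
# In R source, "\\" inside quotes stands for ONE backslash.
p_back  <- "C:\\data\\myfile.csv"
p_slash <- "C:/data/myfile.csv"

# cat() shows the actual characters: single backslashes.
cat(p_back, "\n")

# Both strings have the same number of characters.
nchar(p_back) == nchar(p_slash)

# Turning backslashes into slashes gives the same string.
identical(chartr("\\", "/", p_back), p_slash)
```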

Page 30

loading objects (technical)

It is possible to use attach instead of load. If you load a file, then the objects in it are put into your global environment. If you attach a file, its objects are put separately on the search list. If you modify an object that has been attached, then the modified version goes into your global environment.
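The following sketch tries this out with a temporary file. The object name hypo_obj is made up, and the code assumes you run it at the top level of a fresh session:

```r
# Save an object to a temporary file, then remove it.
tmp <- tempfile(fileext = ".RData")
hypo_obj <- 1:3
save(hypo_obj, file = tmp)
rm(hypo_obj)

# attach() puts the contents of the file on the search list
# (at position 2 by default) rather than in the global environment.
attach(tmp)
attached_name <- search()[2]
attached_name

# The object is visible, but it lives in the attached entry.
hypo_obj

# Modifying it creates a modified copy where you are working;
# the attached copy is left alone.
hypo_obj[2] <- 99L
hypo_obj
attached_copy <- get("hypo_obj", pos = 2)
attached_copy

# Tidy up.
detach(pos = 2)
unlink(tmp)
```

Having two objects with the same name in different places is a common source of confusion, which is a good reason to be sparing with attach.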

Page 32

vectorization

There are different forms of vectorization, and the book doesn’t make that explicit. Vectorization can be put into three categories:

vectorization along vectors
summary
vectorization across arguments

Functions like sum and mean are vectorized in the sense that they take a vector and summarize it. This is done in pretty much all languages; it is not special.

Vectorization as it is commonly spoken of in R is vectorization along vectors. For example the addition operator as seen on page 24. This is the form of vectorization that is so useful and powerful in R.
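A minimal sketch of vectorization along vectors, using made-up values:

```r
# Vectorization along vectors: the operation is done element by element.
x <- c(1, 2, 3)
y <- c(10, 20, 30)
x + y              # 11 22 33

# Many functions work the same way.
sqrt(c(1, 4, 9))   # 1 2 3

# A shorter vector is recycled along the longer one.
x * 2              # 2 4 6
```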

You should not expect the third form of vectorization in R. However, it does exist in a few functions. The sum and mean functions do summary-type vectorization:

> sum(1:3)
[1] 6
> mean(1:3)
[1] 2

The sum function also does vectorization across arguments:

> sum(1, 2, 3)
[1] 6

That is basically anomalous. The mean function is more typical by not doing this form of vectorization:

> mean(1, 2, 3) # WRONG
[1] 1

Unfortunately you don’t get an error or a warning in this case. Do not expect this form of vectorization.

Page 33

error message

Getting error messages can be frightening for a while. But it’s not the end of the world. Relax.

Page 36

names (technical)

In fact it is possible to get any name that you want, but you probably don’t want to.
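For the curious, a sketch of how odd names can be created. The names are invented for the illustration, and this is not a recommendation:

```r
# Backquotes let you create and use names that break the usual rules.
`my variable` <- 1
`2fast` <- 2
`my variable` + `2fast`     # 3

# assign() and get() work with any name given as a string.
assign("another odd name!", 10)
get("another odd name!")    # 10
```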

return (technical)

Actually return is not a reserved word, but you should treat it as if it were.

> break <- 1
Error in break <- 1 : invalid (NULL) left side of assignment
> while <- 1
Error: unexpected assignment in "while <-"
> return <- 1 #do NOT do this
>

Page 37

F and T

I wish to emphasize the advice in the book:

never abbreviate TRUE and FALSE to T and F
avoid using T and F as object names
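A short sketch of why this advice matters: T and F are ordinary objects that can be overwritten, while TRUE and FALSE cannot.

```r
# TRUE and FALSE are reserved words; T and F are ordinary objects.
T <- 0          # perfectly legal, and a source of confusing bugs
t_is_true <- isTRUE(T)
t_is_true       # FALSE

# Removing our T makes the one from base visible again.
rm(T)
isTRUE(T)       # TRUE

# Trying the same with TRUE is an error, so it is commented out.
# TRUE <- 0
```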

Page 42

library

The book suggests (with a slight revision on page 361) to load packages with the library function. Some of us prefer require instead of library for this use. The best use of library is without arguments — this gives you a list of available packages.

> library(fortunes) # load package
> require(fortunes) # same thing
> library() # get list of packages
> require() # don't do this
Loading required package: 
Failed with error:  ‘invalid package name’
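One practical difference between the two shows up when the package is not installed. This sketch uses a package name that is assumed not to exist on your machine:

```r
# A package name that is assumed not to exist on your machine.
pkg <- "notARealPackage12345"

# require() returns FALSE (with a warning) when the package is missing.
ok <- suppressWarnings(require(pkg, character.only = TRUE, quietly = TRUE))
ok     # FALSE

# library() throws an error instead.
res <- tryCatch(
  { library(pkg, character.only = TRUE); "loaded" },
  error = function(e) "error"
)
res    # "error"
```

So with require you can test the return value and carry on, while with library the code stops at that point.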

contributed packages

I think the authors might be being a little too polite in their description of the quality of contributed packages.

I find base R to be phenomenally clean code; it is hard to find commercial code that is less buggy. The quality of contributed packages varies widely. A few are up to the standards of base R, some are quite good, and I’m sure there are a few dreadful ones.

With contributed packages you need to be more cautious than when only using base R functionality. Or perhaps I should say that you always need to be vigilant, but if you are using contributed packages, there is a larger chance that a problem is due to a package rather than your own fault.

Without inspecting the code, I know of two clues to suggest a package is of good quality:

widely used
good documentation

A widely used package — such as those highlighted in the book — is an indication that a lot of problems with the code have been fixed or didn’t exist in the first place.

Many people use the test of the cleanliness of restaurant restrooms to infer the cleanliness of the kitchen. Likewise, carefully written documentation is likely to be a sign of clean code.

Page 46

exponentiation (technical)

It is not a good idea to use ** to mean exponentiation — it is not out of the question for that to go away. Stick to using the ^ operator.

Page 49

log and exp

The sentence a little below mid-page about creating the vector inside exp should say inside the log function.

Page 52

infinity

The last sentence on the page should say 10^309 and 10^310 rather than 10^308 and 10^309.

Page 54

table 4-3

You are unlikely to use any of these except for is.na, which you may use quite a lot.

Page 55

types of vectors

All of the types of vectors listed may have missing values (NA).

Page 56

integer versus double

One of the nice things about R is that you hardly ever need to worry about whether something is stored as an integer or a double.
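A quick sketch of the few places where the difference is visible:

```r
typeof(1:3)         # "integer"
typeof(c(1, 2, 3))  # "double"
typeof(1L)          # "integer": the L suffix asks for an integer

# Arithmetic and comparison treat them alike ...
1L == 1             # TRUE
1:3 + 0.5           # integers are quietly converted to doubles

# ... but identical() does notice the storage type.
identical(1L, 1)    # FALSE
```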

largest integer (technical)

We can see how big the biggest integer is in a couple of different ways:

> format(2^31 - 1, big.mark=",")
[1] "2,147,483,647"
> .Machine$integer.max
[1] 2147483647

Page 59

indexing

What is called “indexing” in the book is more commonly called “subscripting”.

Page 64

missing value testing

It is a common mistake to try testing missing values with a command like:

> x == NA

That doesn’t work — you need to use is.na.
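Here is a sketch of what actually happens, using a made-up vector:

```r
x <- c(3, NA, 7)

# Comparing with NA gives NA for every element, not TRUE or FALSE.
x == NA        # NA NA NA

# is.na does what was intended.
is.na(x)       # FALSE TRUE FALSE

# Typical uses: count or drop the missing values.
sum(is.na(x))  # 1
x[!is.na(x)]   # 3 7
```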

Page 65

any and all

The last sentence on the page is a false statement. The any and all functions are smart enough to know when they can know the answer and when they can’t:

> all(c(NA, FALSE))
[1] FALSE
> all(c(NA, TRUE))
[1] NA
> any(c(NA, FALSE))
[1] NA
> any(c(NA, TRUE))
[1] TRUE

Page 72

assigning to character (technical)

It is more correct to think of the mode being character than the class being character.

Page 82

grep

Alternatively, you can use the value argument of grep:

> grep("New", state.name, value=TRUE)
[1] "New Hampshire" "New Jersey"    "New Mexico"   
[4] "New York"

Page 83

sub versus gsub

Here is an example that should make clear the difference between sub and gsub:

> gsub("e", "a", c("sheep", "cheap", "cheep"))
[1] "shaap" "chaap" "chaap"
> sub("e", "a", c("sheep", "cheap", "cheep"))
[1] "shaep" "chaap" "chaep"

Page 86

factor attributes (technical)

The book says:

[factors are] neither character vectors nor numeric vectors, although they have some attributes of both.

This sentence is using “attribute” in the non-technical sense. But attributes in the technical sense do come into play: factors have “class” and “levels” attributes.
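You can look at the technical attributes of a factor directly. A small sketch with a made-up factor:

```r
f <- factor(c("low", "high", "low"))

# The technical attributes of a factor.
attributes(f)
names(attributes(f))   # "levels" "class"

# Underneath, the data are integer codes that index the levels.
levels(f)              # "high" "low"
unclass(f)
typeof(f)              # "integer"
```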

Page 87

factor versus character

Notice how the factor is printed differently than the character vector.

Page 91

American regions (off topic)

There is a brilliant analysis of North American regions called The Nine Nations of North America.

Page 94

date sequences

You might wonder what happens if you start on the thirty-first of the month rather than the first. If you wonder something, try it out to see what happens:

> myStart <- as.Date("2012-12-31")
> seq(myStart, by="1 month", length=6)
[1] "2012-12-31" "2013-01-31" "2013-03-03" "2013-03-31"
[5] "2013-05-01" "2013-05-31"

The result is overly literal, and not to everyone’s taste. But perhaps we can do better:

> seq(myStart + 1, by="1 month", length=6) - 1
[1] "2012-12-31" "2013-01-31" "2013-02-28" "2013-03-31"
[5] "2013-04-30" "2013-05-31"

Wondering is great, experimenting is even greater.

Page 104

one-dimensional arrays (technical)

Regular vectors are not dimensional at all in the technical sense, but we think of them as being one-dimensional. But there really are one-dimensional arrays. They are almost like plain vectors but not quite.
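A sketch of the difference between a plain vector and a one-dimensional array:

```r
v <- 1:3
a <- array(1:3, dim = 3)

# A plain vector has no dim attribute; a one-dimensional array does.
dim(v)        # NULL
dim(a)        # 3
is.array(v)   # FALSE
is.array(a)   # TRUE

# The values are the same, so most computations do not care.
all(v == a)   # TRUE
```

One place you may meet a one-dimensional array without asking for it is the result of table on a single vector.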

Page 106

playing with attributes

For large objects you often won’t like the response you get when you do:

> attributes(x)

Often better is to just look at what attributes the object has:

> names(attributes(x))

Page 109

extracting values from matrices

The flexibility of subscripting matrices (and data frames) as vectors is a curse as well as a blessing.

If you want to do:

> x[-2,]

and you do:

> x[-2]

then you will get an entirely different result. This can be a hard mistake to find — a few pixels difference on your screen can have a big impact.
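To make the difference concrete, here is a sketch with a small made-up matrix:

```r
x <- matrix(1:6, nrow = 3)   # a small matrix made up for illustration

# With the comma: drop the second row, keep a matrix.
x[-2, ]

# Without the comma: treat x as a plain vector and drop its second element.
x[-2]           # 1 3 4 5 6

dim(x[-2, ])    # 2 2
dim(x[-2])      # NULL
```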

Page 113

first.matrix

The example on this page assumes that first.matrix is as it was first created, not as it has been modified in the intervening exercises.

Page 114

matrix operations

So adding numbers by row is easy. How to add them by column? One way is:

> fmat <- matrix(1:12, ncol=4)
> fmat + rep((1:4)*10, each=nrow(fmat))
     [,1] [,2] [,3] [,4]
[1,]   11   24   37   50
[2,]   12   25   38   51
[3,]   13   26   39   52

This uses the rep function to create a vector with as many elements as the matrix has (assuming the vector being replicated has length equal to the number of columns), and the replicated values are in the desired positions.

Page 116

inverting a matrix

The command to invert a matrix is not intuitive because it is seldom a good idea to (explicitly) invert a matrix.
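The usual reason for wanting an inverse is to solve a system of linear equations, and solve can do that directly. A sketch with a small made-up system:

```r
# A small system A %*% z = b, made up for illustration.
A <- matrix(c(2, 1, 1, 3), nrow = 2)
b <- c(1, 2)

# Explicit inverse, then multiply.
z_inv <- solve(A) %*% b

# Usually preferable: let solve() handle the system directly.
z_direct <- solve(A, b)

all.equal(as.vector(z_inv), z_direct)   # TRUE
```

The two agree here, but the direct form does less work and tends to be more accurate numerically for larger or badly conditioned matrices.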

Page 117

vectors as arrays (technical)

Actually vectors, in general, are not arrays at all. The difference is of little consequence, however.

third array dimension (technical)

I call the items in the third dimension of an array “slices” rather than “tables”. I’m not aware of any standardized nomenclature. I don’t think “tables” is such a good choice because there are other meanings of “table” in R.

array filling (technical)

I’m not able to follow the sentence in the book describing how arrays are filled. How I think of it is that the first subscripts vary fastest (no matter how many dimensions are in the array).
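A sketch of "the first subscripts vary fastest", using an array filled with 1 to 24 so that each value shows its position in the fill order:

```r
a <- array(1:24, dim = c(2, 3, 4))

# Stepping the FIRST subscript moves to the next value ...
a[1, 1, 1]   # 1
a[2, 1, 1]   # 2

# ... the second subscript moves in steps of 2 (the first extent) ...
a[1, 2, 1]   # 3

# ... and the third in steps of 2 * 3 = 6.
a[1, 1, 2]   # 7
```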

Page 119

rows and columns (technical)

Maybe my brain went on strike, but I think that “rows” and “columns” are reversed in the first paragraph on the page.

Page 120

data frame structure

Note that all the vectors that make up the columns need to be the same length.

data frame structure (technical)

It is possible for a “column” of a data frame to be a matrix, in which case the number of rows needs to match.

data frame length

Note that the length of a data frame is different from the length of the equivalent matrix. The length of the data frame is the number of columns, while the length of the matrix is the number of columns times the number of rows.
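A sketch with a small made-up data frame:

```r
df <- data.frame(a = 1:3, b = 4:6)
m <- as.matrix(df)

length(df)   # 2: the number of columns
length(m)    # 6: rows times columns

# For the number of rows and columns, these work on both.
nrow(df); ncol(df)
dim(m)
```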

Page 122

character versus factor

The book suggests always making sure that data frames hold character vectors instead of factors in order to reduce problems. The other main route to avoid frustration is to always assume that there are factors.

The thing you don’t want to do is assume that what is really a factor is a character vector.

naming variables

If in the middle of the page where it says “In the previous section” you don’t know what they are talking about, not to worry — you’re not alone.

as with matrices

I’m not clear on the reference to matrices at the very bottom of the page.

Page 124

data frame subscripting

You can get a column of a data frame using either the $ or [ form of subscripting. But there is a difference:

> baskets.df$Granny
[1] 12  4  5  6  9  3
> baskets.df[,Granny]
Error in `[.data.frame`(baskets.df, , Granny) : 
  object 'Granny' not found
> baskets.df[,"Granny"]
[1] 12  4  5  6  9  3

Note the quotes or lack thereof.

Page 130

pieces of a list

I prefer calling the pieces of a list "components" rather than "elements". One reason is that a component of a list can be another list, and hence not very elementary.

Page 139

The functions that you write are essentially the same as the inbuilt functions. They are first-class citizens.

Page 152

functional programming

You can very effectively use R without having a clue what "functional programming" means. The important idea behind functional programming is safety -- the data that you want to use is almost surely the data that really is being used.
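A sketch of the safety in question: a function works on its own copy of the data, so calling it does not change your object. The function name is made up for the illustration:

```r
# A made-up function that changes its argument.
add_one_to_first <- function(v) {
  v[1] <- v[1] + 1   # changes the function's own copy only
  v
}

original <- c(10, 20, 30)
changed <- add_one_to_first(original)

changed    # 11 20 30
original   # 10 20 30: untouched by the call
```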

Page 153

calculation example

The object names were obviously changed midstream. fifty should be half and hundred should be full.

Page 157

generic functions (technical)

A detail that only occasionally really matters is that the argument names in methods should match the argument names in the generic. You don't want to have the argument called x in the generic but object in a method.
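A sketch of a generic with methods whose argument names match. The generic describe and its methods are invented for the illustration:

```r
# A made-up generic and methods, for illustration only.
describe <- function(x, ...) UseMethod("describe")

# Good: the method's first argument has the same name as in the generic.
describe.numeric <- function(x, ...) {
  paste("numeric of length", length(x))
}
describe.default <- function(x, ...) {
  paste("object of class", class(x)[1])
}

# Calling with the argument named works as expected.
describe(x = c(1.5, 2.5))   # "numeric of length 2"
describe(x = "a")           # "object of class character"
```

If a method had called its first argument object instead, then a call that names the argument, like describe(x = ...), would dispatch to the method but leave object unfilled.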

Page 171

looping without loops

Using apply functions is really hiding loops rather than eliminating them.
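A sketch showing the same computation done both ways, with a made-up list:

```r
x <- list(a = 1:3, b = 4:6, c = 7:9)

# With an explicit loop.
sums_loop <- numeric(length(x))
names(sums_loop) <- names(x)
for (i in seq_along(x)) {
  sums_loop[i] <- sum(x[[i]])
}

# With sapply: the loop is still there, just inside the function.
sums_apply <- sapply(x, sum)

sums_loop
sums_apply
all(sums_loop == sums_apply)   # TRUE
```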

Page 172

number of apply functions

Not that it matters, but I count 8 apply functions in the base package in version 2.15.0. There are also a reasonably large number of apply functions in contributed packages.
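If you want to do the count yourself, something like the following sketch should work. The result depends on your version of R, so no output is shown:

```r
# Names in the base package that end in "apply".
base_apply <- grep("apply$", ls("package:base"), value = TRUE)
base_apply
length(base_apply)
```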

Page 188

error checking (technical)

Another way to write the check for out of bounds values is:

stopifnot(all(x >= 0 & x <= 1))

This will create an appropriate error message if there is a violation.

This will take multiple conditions separated by commas. So you can have checks like:

stopifnot(is.matrix(x), is.data.frame(y))

to make sure that x is a matrix and y is a data frame.

Page 190

technical tip (technical)

The first sentence starts:

In fact, functions are generic ...

It should read:

In fact, some functions are generic ...

Page 192

factor to numeric

The book gives the efficient method of converting a factor to numeric:

as.numeric(levels(x))[x]

The slightly less efficient but easier to remember method is:

as.numeric(as.character(x))

Don't forget the as.character -- it matters.
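A sketch of what goes wrong without it, using a made-up factor:

```r
f <- factor(c("10", "20", "10", "5"))

# Without as.character you get the internal codes, not the numbers.
as.numeric(f)                 # 1 2 1 3

# Both methods from the text recover the numbers.
as.numeric(as.character(f))   # 10 20 10 5
as.numeric(levels(f))[f]      # 10 20 10 5
```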

problems with factors (technical)

Circle 8.2 of The R Inferno starts with a number of items about factors.

Page 193

documentation quality

Unfortunately, I think the authors are painting too rosy a picture of the quality of R documentation. There probably is some great documentation for any task or issue that you have, but you may have a significant search on your hands to find that great document.

Page 194

help files

It takes practice to learn how to use help files well. It doesn't help that sections of the help files are in the wrong order (in my opinion). The "See also" and "Examples" should be near the top, "Details" should be at the bottom.

The examples often are the most important part. The book implies that all examples are reproducible. Not all are, but many are.

You don't need to understand the whole of a help file the first time around. The goal should be to improve your understanding of the function.

Page 199

Stack Overflow

It is possible to subscribe via RSS to R tags.

Page 200

cards

With the cards I'm used to, the command to create cards should include 2:10 rather than 1:9.

Page 202

session info

The book says that it is sometimes helpful to include the results of sessionInfo() in questions. I would change that from "sometimes" to "often".

Page 210

reading in data

The start of Circle 8.3 in The R Inferno has a number of items about problems reading data in.

Page 216

changing directories

If you are using the RGui, there is a "change dir" item in the File menu.

Page 221

three subset operators

The [[ operator always gets one component. The result is often not a list.

In contrast the [ operator can get any number of items and (except for dropping) gives you back the same type of object.
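A sketch of the two operators on a made-up list, plus what "dropping" means for a matrix:

```r
lst <- list(num = 1:3, chr = "a", fun = mean)

# [[ pulls out one component; here the result is an integer vector.
lst[["num"]]
class(lst[["num"]])          # "integer"

# [ gives back a list, however many components are asked for.
class(lst["num"])            # "list"
length(lst[c("num", "chr")]) # 2

# "Dropping" with [ on a matrix: one column comes back as a plain vector
# unless drop = FALSE is given.
m <- matrix(1:6, nrow = 2)
dim(m[, 1])                  # NULL
dim(m[, 1, drop = FALSE])    # 2 1
```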

Page 226

removing duplicates

The book shows the removal of duplicates using both logical subscripts and negative numeric subscripts. Be careful with the latter of these:

> vec <- 1:5
> dups <- duplicated(vec)
> vec[!dups]
[1] 1 2 3 4 5
> vec[-which(dups)]
integer(0)

If you create a vector of negative subscripts, you need to make sure it has at least one element. Otherwise you get nothing when you want everything.

Page 240

apply output

The book is in error when it says that the result of apply is always a vector. Other possible results include a matrix and a list.
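A sketch showing all three kinds of result from apply on the same made-up matrix:

```r
m <- matrix(1:6, nrow = 2)   # rows are 1 3 5 and 2 4 6

# One value per row: the result is a vector.
apply(m, 1, sum)             # 9 12

# Several values per row, always the same number: a matrix.
r_range <- apply(m, 1, range)
r_range

# Different numbers of values per row: a list.
r_big <- apply(m, 1, function(r) r[r > 3])
r_big
```

Note that in the matrix case each row of the input ends up as a column of the result.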

Page 243

sapply example (technical)

The example at the very top of the page that uses ifelse would be more in the spirit of R if it instead used:

if(is.numeric(x)) mean(x) else NA
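A sketch of that expression in use. The data frame here is made up and is not the one from the book:

```r
# A small made-up data frame with one non-numeric column.
df <- data.frame(
  height = c(1.5, 1.75, 2),
  weight = c(60, 70, 80),
  name   = c("a", "b", "c")
)

# x is a whole column here, so a single TRUE/FALSE test with
# if ... else is enough; ifelse is meant for element-by-element choices.
col_means <- sapply(df, function(x) if (is.numeric(x)) mean(x) else NA)
col_means
```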

Page 245

aggregate (technical)

Alternatives to aggregate include the by function (if you have a data frame) and the data.table package.

Page 253

third paragraph

Something seems to have gone wrong. That the phrase "doesn't make sense at all" appears in the paragraph seems apropos.

Page 254

checking data

Often checking data with graphics is best. Do plots look as expected?

Page 260

mode

There is a mode function in R, but it does not have the same meaning as in the discussion of location.
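A sketch of the two meanings. The second half is just one possible way to get the statistical mode, not a built-in function:

```r
x <- c(2, 3, 3, 5, 3, 2)

# The R function mode() describes how the object is stored.
mode(x)        # "numeric"
mode("a")      # "character"

# One way to get the statistical mode (the most frequent value).
counts <- table(x)
stat_mode <- as.numeric(names(counts)[which.max(counts)])
stat_mode      # 3
```

If there is a tie for most frequent, which.max picks the first of the tied values.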

Page 270

missing values (technical)

You might think that "pairwise" should be the default choice since it uses the most data. The problem with it is that the resulting correlation matrix is not guaranteed to be positive semi-definite.

Page 274

prop.table (technical)

I wondered if prop.table recognized a table that had added margins. The answer is no, it thinks the margins are part of the data.
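The experiment can be sketched like this, with a small made-up table:

```r
tab <- as.table(matrix(1:4, nrow = 2))   # entries sum to 10
with_margins <- addmargins(tab)          # adds row, column and grand totals

# On the plain table the first cell is 1 out of 10.
prop.table(tab)[1, 1]            # 0.1

# With margins included, every total is counted as data,
# so the same cell is 1 out of 40.
prop.table(with_margins)[1, 1]   # 0.025
```

So take proportions first and add margins afterwards.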

Page 312

multiple plots (technical)

If you want to put the graphics device back into a single plot state without using the old.par trick, then say:

par(mfcol=c(1,1))

or

par(mfrow=c(1,1))

It doesn't matter which you say.

Page 314

hardcopy graphics

If you are putting your graphics into a word processor, then often pdf is a good choice.

If you are putting your graphics onto a webpage or into a presentation, then png can be a good choice.

Page 326

boxplots (technical)

To be clear, each whisker extends at most 1.5 times the length of the box (the interquartile range) beyond the edge of the box.
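You can check the whisker positions without drawing anything. A sketch using boxplot.stats on made-up data:

```r
x <- c(1:10, 50)   # made-up data with one extreme value

s <- boxplot.stats(x)   # the numbers behind boxplot(), with the default coef = 1.5
box_length <- s$stats[4] - s$stats[2]

# The upper whisker ends at the largest data point that is within
# 1.5 box lengths of the top of the box.
s$stats[5]                       # 10
s$stats[4] + 1.5 * box_length    # 16: the furthest the whisker could reach

# Points beyond that are drawn individually.
s$out                            # 50
```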

Page 332

changing directory (technical)

To change the working directory and then change it back to the original, you would do something like:

> origwd <- getwd()
> setwd("blah/blah")
> # do stuff
> setwd(origwd)

Page 359

CRAN mirrors (technical)

While all mirrors are conceptually the same as the primary CRAN site, it takes time for changes to propagate. This is unlikely to be an issue unless you are trying to get a brand new release.

Page 360

CRAN packages

As of 2012 October 14 CRAN has 4087 contributed packages.

Page 362

unloading packages

I've used R pretty much every day for over a decade and never unloaded a package. I doubt this will be a big issue for you.

Page 363

R-Forge

R-Forge also provides mailing lists. The immediate significance of this for you is that some of your favorite contributed packages might have a dedicated mailing list.

Page 364

own repository (technical)

You can even set up your own repository and fill it with packages that you write.

Page 1

Do you appreciate the meaning of:

knowledge <- apply(theory, 1, sum)

as promised?

Epilogue

I saw a little teddy bear.
Well, I said to myself,
"I know what I want. I gotta get a bear some way."

from "You cannot win if you do not play" by Steve Forbert
