From spreadsheet thinking to R thinking

January 7, 2014
By

(This article was first published on Burns Statistics » R language, and kindly contributed to R-bloggers)

Towards the basic R mindset.

Previously

The post “A first step towards R from spreadsheets” provides an introduction to switching from spreadsheets to R.  It also includes a list of additional posts (like this one) on the transition.

Add two columns

Figure 1 shows some numbers in two columns and the start of adding those two columns to each other in a third column.

Figure 1: Adding two columns in a spreadsheet. spread_ABadd

The next step is to fill the addition formula down the column.

It is not so different to do the same thing in R.  First create two objects that are equivalent to the two columns in the spreadsheet:

A <- c(32.5, -3.8, 15.9, 22.5)
B <- c(48.1, 19.4, 46.8, 14.7)

In those commands you used the c function which combines objects.  You have created two vectors.  The rules for a vector are:

  • it can be any length (up to a very large value)
  • all the elements are of the same type — all numbers, all character strings or all logicals
  • the order matters (just like it matters which row a number is in within a spreadsheet)

To summarize: they’re in little boxes and they all look just the same.

You have two R vectors holding your numbers.  Now just add them together (and assign that value into a third name):

C <- A + B

This addition is precisely what is done in the spreadsheet: the first value in C is the first value in A plus the first value in B, the second value in C is the second value in A plus the second value in B, and so on.

See the values in an object by typing its name:

> C
[1] 80.6 15.6 62.7 37.2

The “> ” is the R prompt, you type only what is after that: ‘C‘ (and the return or enter key).

Also note that R is case-sensitive – C and c are different things:

> c
function (..., recursive = FALSE)  .Primitive("c")

(Don’t try to make sense of what this means other than that c is a function.)

Multiply by a constant

One way of multiplying a column by a constant is to multiply the values in the column by the value in a single cell.  This is illustrated in Figure 2.

Figure 2: Multiply a column times the value in a single cell, shown before filling down column E. spread_CDmult

Another way of doing the same thing is to fill the value in D1 down column D and then multiply the two columns.

Do this operation in R with:

> C * 33
[1] 2659.8  514.8 2069.1 1227.6

In this command you didn’t create a new object to hold the answer.

You can think of R as doing either of the spreadsheet methods, but the fill-down image might be slightly preferable.

Recycling in R

The R recycling rule generalizes the idea of a single value expanding to the length of the vector.  It is possible to do operations with vectors of different lengths where both have more than one element:

> 1:6 + c(100, 200)
[1] 101 202 103 204 105 206

Figure 3 illustrates how R got to its answer.

Figure 3: Equivalent of the example of R’s recycling rule. spread_Rrecycle

Column F shows how column G was created: use the ROW function and fill it down the column.  That sequence of numbers was created in R with the `:` operator.

Note how the shorter vector is replicated to the length of the longer one.  Each value is used in order, and when it reaches the end it goes back to the beginning again.

You are free to think this is weird.  However, it is often useful.

Functions

Table 1 translates between spreadsheet and R functions. The spreadsheets consulted were Excel, Works and OpenOffice. Note there is some variation between spreadsheets.

Table 1: Equivalent functions between spreadsheets and R.

spreadsheetRcomment
ABSabs
ADDRESSperhaps assign but there is probably a better way
ANDallmore literally would be the & and && R operators
AVERAGEmeandanger: mean accepts only one data argument
AVGmeanthis danger of mean is discussed in Circle 3 of The R Inferno
AVERAGEIFsubscript before using mean
BESSELIbesselI
BESSELJbesselJ
BESSELKbesselK
BESSELYbesselY
BETADISTpbeta
BETAINVqbeta
BINOMDISTpbinom or dbinompbinom when cumulative, dbinom when not
CEILINGceiling
CELLstr is sort of the same idea
CHIDISTpchisqCHIDIST(x, df) is pchisq(x, df, lower.tail=FALSE)
CHIINVqchisqCHIINV(p, df) is qchisq(1-p, df)
CHISQDISTpchisq or dchisqpchisq when cumulative, dchisq when not
CHISQINVqchisq
CHITESTchisq.test
CHOOSEswitch
CLEANgsub
COLSncol(Works)
COLUMNSncol(Excel, OpenOffice)
COLUMNcolor probably more likely : or seq
COMBINchoose
CONCATENATEpaste
CONFIDENCECONFIDENCE(alpha, std, n) is -qnorm(alpha/2) * std / sqrt(n)
CORRELcor
COUNTlength
COUNTIFget length of a subscripted object
COVARcov
CRITBINOMqbinomCRITBINOM(n, p, a) is qbinom(a, n, p)
DELTAall.equal or identicalall.equal allows for slight differences, and note that it does not return a logical if there’s a pertinent difference — you can wrap it in isTRUE if you want
DGETuse subscripting in R
ERFsee the example in ?"Normal"
ERFCsee the example in ?"Normal"
EXACT==EXACT is specific to text, == is not
EXPexp
EXPONDISTpexp or dexppexp when cumulative, dexp when not
FACTfactorial
FACTDOUBLEdfactorialdfactorial is in the phangorn package
FDISTpfFDIST(x, df1, df2) is pf(x, df1, df2, lower.tail=FALSE)
FINDregexpr
FINVqfFINV(p, df1, df2) is qf(1-p, df1, df2)
FISHERatanh
FISHERINVtanh
FIXEDformat or sprintf or formatC
FLOORfloor
FORECASTpredict on an lm object
FREQUENCYyou probably want to use cut and/or table
FTESTvar.test
GAMMADISTpgamma or dgammaGAMMADIST(x, a, b, TRUE) is pgamma(x, a, scale=b) GAMMADIST(x, a, b, FALSE) is dgamma(x, a, scale=b)
GAMMAINVqgammaGAMMAINV(p, a, b) is qgamma(p, a, scale=b)
GAMMALNlgamma
GAUSSGAUSS(x) is pnorm(x) - 0.5
GCDgcdgcd is in the schoolmath package (and others). For more than two numbers you can do: Reduce(gcd, numVector)
GEOMEANexp(mean(log(x)))
GESTEP>=GESTEP(x, y) is as.numeric(x >= y) but R often coerces automatically if needed
HARMEANharmonic.meanharmonic.mean is in the psych package
HLOOKUPuse subscripting in R
HYPGEOMDISTdhyperHYPGEOMDIST(x, a, b, n) is dhyper(x, b, n-b, a)
IFif or ifelsesee Circle 3.2 of The R Inferno on if versus ifelse
IFERRORtry or tryCatch
INDEX[use subscripting in R
INDIRECTgetor possibly the eval-parse-text idiom, or (better) make changes that simplify the situation
INTfloordanger: not the same as as.integer for negative numbers
INTERCEPT(usually) the first element of coef of an lm object
ISLOGICALis.logical
ISNUMBERis.numeric
ISTEXTis.character
KURTkurtosiskurtosis is in the moments package
LARGEyou can use subscripting after sort
LCMscmscm is in the schoolmath package. For more than two numbers you can do: Reduce(scm, numVector)
LEFTsubstr
LENnchar(Excel, OpenOffice)
LENGTHnchar(Works)
LINESTuse lm
LNlogdanger: the default base in R for log is e
LOGlogdanger: the default base in spreadsheets for log is 10
LOG10log10
LOGINVqlnorm
LOGNORMDISTplnorm
LOWERtolower
MATCHmatch or whichmatch only does exact matches. Given that MATCH demands a sorted set of values for type 1 or -1, then MATCH(x, vec, 1) is sum(x <= vec) and MATCH(x, vec, -1) is sum(x >= vec) when vec is sorted as MATCH assumes.
MAXmax or pmaxmax returns one value, pmax returns a vector
MDETERMdet
MEDIANmedian
MIDsubstr
MINmin or pminmin returns one value, pmin returns a vector
MINVERSEsolve
MMULT%*%
MOD%%
MODEthe table function does the hard part. A crude approximation to MODE(x) is as.numeric(names(which.max(table(x))))
MUNITdiagdiag is much more general
Nas.numericthe correspondence is for logicals, as.numeric is more general
NEGBINOMDISTdnbinom
NORMDIST, NORMSDISTpnorm or dnormpnorm when cumulative is true, dnorm when false
NORMINV, NORMSINVqnorm
NOT!
NOWdate or Sys.time
ORanythe or operators in R are | and ||
PEARSONcor
PERCENTILEquantile
PERCENTRANKsimilar to ecdf but the argument is removed from the distribution in PERCENTRANK
PERMUTfunction(n,k) {choose(n,k) * factorial(k)}
PERMUTATIONAPERMUTATIONA(n, k) is n^k
PHIdnorm
POISSONppois or dpoisppois if cumulative, dpois if not
POWER^
PROByou can use the Ecdf function in the Hmisc package (the probabilities in the spreadsheet are the weights in Ecdf), then you can get the difference of that on the two limits
PRODUCTprod
PROPERsee example in ?toupper
QUARTILEuse quantile
QUOTIENT%/%
RANDrunifsee an introduction to random generation in R
RANDBETWEENuse sample
RANKrankRANK has the "min" tie.method and defaults to biggest first.
rank only has smallest first. To get biggest first in R you can do: length(x) + 1 - rank(x)
REPLACEsub or gsub
REPTuse rep and paste or paste0
RIGHTsubstringyou'll also need nchar to count the characters. Alternatively you can use str_sub in the stringr package with negative limits
ROUNDroundnote: round rounds exact halves to even (which avoids bias)
ROUNDDOWNtrunctrunc only goes to integers
ROWrowor probably more likely : or seq
ROWSnrow
RSQin summary of an lm object
SEARCHregexpralso see grep
SIGNsign
SKEWskewnessskewness is in the moments package
SLOPEin coef of an lm object
SMALLyou can use subscripting after sort
SQRTsqrt
STANDARDIZEscale
STDsd(Works)
STDEVsd(Excel, OpenOffice)
STEYXpredict on an lm object
STRINGformat or sprintf or formatC or prettyNum(Works)
SUBSTITUTEsub or gsubor possibly paste
SUMsumsum is one of the few R functions that allow multiple data arguments
SUMIFsubscript before using sum
SUMPRODUCTcrossprod
TDISTptTDIST(abs(x), df, tails) is pt(-abs(x), df) * tails
TEXTformat or sprintf or formatC or prettyNum
TINVTINV(x, df) is abs(qt(x/2, df))
TODAYSys.Date
TRANSPOSEt
TRENDfitted of an lm object
TRIMsub
TRIMMEANmeanTRIMMEAN(x, tr) is mean(x, trim=tr/2)
TRUNCtrunc
TTESTt.test
TYPEsimilar concepts in R are typeof, mode, class. Use str to understand the structure of objects
UPPERtoupper
VALUEas.numeric
VARvar
VLOOKUPuse subscripting in R
WEEKDAYweekdays
WEIBULLpweibull or dweibullpweibull when cumulative, dweibull when not
ZTESTuse pnorm on the calculated statistic

The trigonometric functions, like cos, acos, acosh are the same, except the R functions are all in lowercase.

Arguments

Spreadsheets show you the arguments of a function.  The args function in R provides similar information.  For example:

> args(sample)
function (x, size, replace = FALSE, prob = NULL) 
NULL

This shows that replace and prob both have default values, and so are not required.  Actually size is not required either -- x is the only mandatory argument.

You will learn to not even see the NULL on the final line of the result of args.

Help

You can get help for a function with the question mark operator:

?sample

This will show you the help file for the object -- sample in this case.  It is best not to let yourself be overwhelmed by a help file.

R vectorization

Most of the R functions are vectorized.

This is like creating a new spreadsheet column where an argument of the function is a value from the same row but a different column.  Think of putting =EXP(A1) in cell B1 and then filling it down.

Figure 4: EXP example of the vectorization idea, shown before column K is filled down. spread_exp

Giving a vector to exp returns the exponential of each of the values in the input vector:

> exp(0:5)
[1]   1.000000   2.718282   7.389056  20.085537
[5]  54.598150 148.413159

The result is a vector of length 6 -- the same length as the input. The number in square brackets at the start of each line of output is the index number of the first item on the line.

Some R resources

“Impatient R” provides a grounding in how to use R.

“Some hints for the R beginner” suggests additional ways to learn R.

Epilogue

And they're all made out of ticky tacky
And they all look just the same

from "Little Boxes" by Malvina Reynolds (1900 - 1978)

The post From spreadsheet thinking to R thinking appeared first on Burns Statistics.

To leave a comment for the author, please follow the link and comment on his blog: Burns Statistics » R language.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.