(line.)profiling mit Rprof + R.3.x – Performance von R-Skripten optimieren

This post was kindly contributed by eoda, R und Datenanalyse » R - go there to comment and to read the full post.

Mit der Version R.3.0.0 wurde R um wichtige Funktionen für den professionellen Einsatz erweitert. Dazu zählt beispielsweise die Erweiterung der Funktion Rprof(), mit der sich die  Performance von R-Skripten analysieren und optimieren lässt.

Schon zuvor war die Funktion Rprof() aus dem utils Paket ein nützliches Werkzeug um eine differenzierte Laufzeitanalyse von R-Skripten und Funktionen durchzuführen. Durch das sogenannte Profiling wird zu einer Funktion ein Profil generiert, welches Auskunft darüber gibt, wie viel Zeit für die darin enthaltenen Funktionsaufrufe verwendet werden. So lassen sich mögliche Flaschenhälse innerhalb der Funktion identifizieren und Optimierungen bezüglich der Laufzeit, an den entsprechenden Stellen durchführen.

Das erstellte Profil lässt sich mittels der Funktion summaryRprof() aufrufen. Vor allem bei sehr umfangreichen Funktionen oder beim Testen kompletter Skripte, kann der Output jedoch schnell sehr unübersichtlich werden. Ein Profiling aller Beispiele aus dem boot- Paket zeigt, dass die Ausgabe nicht sehr intuitiv ist, was eine genaue Analyse unter Umständen sehr erschweren kann.
(Das Skript boot.R welches die Beispiele enthält, finden Sie am Ende des Beitrags)

Rprof("profile_boot.out")
source("boot.R")
Rprof(NULL)
 
summaryRprof("profile_boot.out")

Output:

$by.self
                          self.time self.pct total.time total.pct
".deparseOpts"                 0.36     6.04       0.56      9.40
"FUN"                          0.28     4.70       5.86     98.32
"match"                        0.26     4.36       1.04     17.45
"deparse"                      0.26     4.36       0.70     11.74
"$"                            0.26     4.36       0.26      4.36
"as.list"                      0.22     3.69       0.28      4.70
"pmatch"                       0.22     3.69       0.22      3.69
"glm.fit"                      0.20     3.36       0.36      6.04
"model.frame.default"          0.18     3.02       1.62     27.18
"[.data.frame"                 0.16     2.68       0.38      6.38
"structure"                    0.14     2.35       0.18      3.02
"length"                       0.14     2.35       0.14      2.35
"statistic"                    0.12     2.01       5.84     97.99
"model.matrix.default"         0.12     2.01       1.26     21.14
"%in%"                         0.12     2.01       0.60     10.07
"[[.data.frame"                0.10     1.68       0.36      6.04
"unique"                       0.10     1.68       0.24      4.03
"Anonymous"                  0.10     1.68       0.16      2.68
"glm"                          0.08     1.34       3.08     51.68
"sapply"                       0.08     1.34       1.56     26.17
"terms"                        0.08     1.34       0.24      4.03
"delete.response"              0.08     1.34       0.12      2.01
"makepredictcall"              0.08     1.34       0.12      2.01
"sys.call"                     0.08     1.34       0.10      1.68
"terms.formula"                0.08     1.34       0.08      1.34
"unique.default"               0.08     1.34       0.08      1.34
"eval"                         0.06     1.01       5.96    100.00
"lapply"                       0.06     1.01       5.88     98.66
"predict.lm"                   0.06     1.01       1.72     28.86
"["                            0.06     1.01       0.44      7.38
"mode"                         0.06     1.01       0.24      4.03
"unlist"                       0.06     1.01       0.12      2.01
".checkMFClasses"              0.06     1.01       0.10      1.68
"f"                            0.06     1.01       0.06      1.01
"formals"                      0.06     1.01       0.06      1.01
"is.factor"                    0.06     1.01       0.06      1.01
"mean"                         0.06     1.01       0.06      1.01
"names"                        0.06     1.01       0.06      1.01
"sum"                          0.06     1.01       0.06      1.01
"predict.glm"                  0.04     0.67       1.86     31.21
"model.frame"                  0.04     0.67       1.66     27.85
"model.matrix"                 0.04     0.67       1.30     21.81
"match.arg"                    0.04     0.67       0.30      5.03
"na.omit"                      0.04     0.67       0.28      4.70
"na.omit.data.frame"           0.04     0.67       0.24      4.03
"factor"                       0.04     0.67       0.14      2.35
"match.call"                   0.04     0.67       0.10      1.68
"terms.default"                0.04     0.67       0.08      1.34
"getOption"                    0.04     0.67       0.06      1.01
"is.data.frame"                0.04     0.67       0.06      1.01
"$<-"                          0.04     0.67       0.04      0.67
"[[<-"                         0.04     0.67       0.04      0.67
"as.list.data.frame"           0.04     0.67       0.04      0.67
"as.matrix"                    0.04     0.67       0.04      0.67
"c"                            0.04     0.67       0.04      0.67
"makepredictcall.default"      0.04     0.67       0.04      0.67
"NextMethod"                   0.04     0.67       0.04      0.67
"runif"                        0.04     0.67       0.04      0.67
"predict"                      0.02     0.34       1.88     31.54
"[["                           0.02     0.34       0.38      6.38
"paste"                        0.02     0.34       0.36      6.04
"vapply"                       0.02     0.34       0.22      3.69
"nlm"                          0.02     0.34       0.16      2.68
"make.link"                    0.02     0.34       0.12      2.01
"var"                          0.02     0.34       0.08      1.34
"model.response"               0.02     0.34       0.06      1.01
"as.vector"                    0.02     0.34       0.04      0.67
"coef"                         0.02     0.34       0.04      0.67
"stopifnot"                    0.02     0.34       0.04      0.67
"-"                            0.02     0.34       0.02      0.34
"=="                           0.02     0.34       0.02      0.34
"as.list.default"              0.02     0.34       0.02      0.34
"as.numeric"                   0.02     0.34       0.02      0.34
"do.call"                      0.02     0.34       0.02      0.34
"is.array"                     0.02     0.34       0.02      0.34
"lazyLoadDBfetch"              0.02     0.34       0.02      0.34
"linkinv"                      0.02     0.34       0.02      0.34
"list.names"                   0.02     0.34       0.02      0.34
"options"                      0.02     0.34       0.02      0.34
"sys.parent"                   0.02     0.34       0.02      0.34

$by.total
                          total.time total.pct self.time self.pct
"eval"                          5.96    100.00      0.06     1.01
"source"                        5.96    100.00      0.00     0.00
"withVisible"                   5.96    100.00      0.00     0.00
"boot"                          5.92     99.33      0.00     0.00
"lapply"                        5.88     98.66      0.06     1.01
"FUN"                           5.86     98.32      0.28     4.70
"statistic"                     5.84     97.99      0.12     2.01
"glm"                           3.08     51.68      0.08     1.34
"predict"                       1.88     31.54      0.02     0.34
"predict.glm"                   1.86     31.21      0.04     0.67
"predict.lm"                    1.72     28.86      0.06     1.01
"model.frame"                   1.66     27.85      0.04     0.67
"model.frame.default"           1.62     27.18      0.18     3.02
"sapply"                        1.56     26.17      0.08     1.34
"model.matrix"                  1.30     21.81      0.04     0.67
"model.matrix.default"          1.26     21.14      0.12     2.01
"match"                         1.04     17.45      0.26     4.36
"deparse"                       0.70     11.74      0.26     4.36
"%in%"                          0.60     10.07      0.12     2.01
".deparseOpts"                  0.56      9.40      0.36     6.04
".getXlevels"                   0.46      7.72      0.00     0.00
"["                             0.44      7.38      0.06     1.01
"[.data.frame"                  0.38      6.38      0.16     2.68
"[["                            0.38      6.38      0.02     0.34
"glm.fit"                       0.36      6.04      0.20     3.36
"[[.data.frame"                 0.36      6.04      0.10     1.68
"paste"                         0.36      6.04      0.02     0.34
"match.arg"                     0.30      5.03      0.04     0.67
"as.list"                       0.28      4.70      0.22     3.69
"na.omit"                       0.28      4.70      0.04     0.67
"$"                             0.26      4.36      0.26     4.36
"unique"                        0.24      4.03      0.10     1.68
"terms"                         0.24      4.03      0.08     1.34
"mode"                          0.24      4.03      0.06     1.01
"na.omit.data.frame"            0.24      4.03      0.04     0.67
"table"                         0.24      4.03      0.00     0.00
"pmatch"                        0.22      3.69      0.22     3.69
"vapply"                        0.22      3.69      0.02     0.34
"simplify2array"                0.22      3.69      0.00     0.00
"structure"                     0.18      3.02      0.14     2.35
"Anonymous"                   0.16      2.68      0.10     1.68
"nlm"                           0.16      2.68      0.02     0.34
"family"                        0.16      2.68      0.00     0.00
"length"                        0.14      2.35      0.14     2.35
"factor"                        0.14      2.35      0.04     0.67
"delete.response"               0.12      2.01      0.08     1.34
"makepredictcall"               0.12      2.01      0.08     1.34
"unlist"                        0.12      2.01      0.06     1.01
"make.link"                     0.12      2.01      0.02     0.34
"sys.call"                      0.10      1.68      0.08     1.34
".checkMFClasses"               0.10      1.68      0.06     1.01
"match.call"                    0.10      1.68      0.04     0.67
"terms.formula"                 0.08      1.34      0.08     1.34
"unique.default"                0.08      1.34      0.08     1.34
"terms.default"                 0.08      1.34      0.04     0.67
"var"                           0.08      1.34      0.02     0.34
"f"                             0.06      1.01      0.06     1.01
"formals"                       0.06      1.01      0.06     1.01
"is.factor"                     0.06      1.01      0.06     1.01
"mean"                          0.06      1.01      0.06     1.01
"names"                         0.06      1.01      0.06     1.01
"sum"                           0.06      1.01      0.06     1.01
"getOption"                     0.06      1.01      0.04     0.67
"is.data.frame"                 0.06      1.01      0.04     0.67
"model.response"                0.06      1.01      0.02     0.34
"$<-"                           0.04      0.67      0.04     0.67
"[[<-"                          0.04      0.67      0.04     0.67
"as.list.data.frame"            0.04      0.67      0.04     0.67
"as.matrix"                     0.04      0.67      0.04     0.67
"c"                             0.04      0.67      0.04     0.67
"makepredictcall.default"       0.04      0.67      0.04     0.67
"NextMethod"                    0.04      0.67      0.04     0.67
"runif"                         0.04      0.67      0.04     0.67
"as.vector"                     0.04      0.67      0.02     0.34
"coef"                          0.04      0.67      0.02     0.34
"stopifnot"                     0.04      0.67      0.02     0.34
"dim"                           0.04      0.67      0.00     0.00
"dim.data.frame"                0.04      0.67      0.00     0.00
"-"                             0.02      0.34      0.02     0.34
"=="                            0.02      0.34      0.02     0.34
"as.list.default"               0.02      0.34      0.02     0.34
"as.numeric"                    0.02      0.34      0.02     0.34
"do.call"                       0.02      0.34      0.02     0.34
"is.array"                      0.02      0.34      0.02     0.34
"lazyLoadDBfetch"               0.02      0.34      0.02     0.34
"linkinv"                       0.02      0.34      0.02     0.34
"list.names"                    0.02      0.34      0.02     0.34
"options"                       0.02      0.34      0.02     0.34
"sys.parent"                    0.02      0.34      0.02     0.34
"coef.default"                  0.02      0.34      0.00     0.00
"glm.diag"                      0.02      0.34      0.00     0.00
"is.empty.model"                0.02      0.34      0.00     0.00
"model.offset"                  0.02      0.34      0.00     0.00
"ncol"                          0.02      0.34      0.00     0.00
"nrow"                          0.02      0.34      0.00     0.00
"ran.gen"                       0.02      0.34      0.00     0.00
"residuals"                     0.02      0.34      0.00     0.00
"residuals.glm"                 0.02      0.34      0.00     0.00
"rm"                            0.02      0.34      0.00     0.00

$sample.interval
[1] 0.02

$sampling.time
[1] 5.96

 

Es existieren bereits seit einiger Zeit weitere Pakete und Funktionen, die die Auswertung des erstellten Profils erleichtern sollen, wie zum Beispiel Hadely Wickam’s profr.
Eine wesentliche Verbesserung von Rprof() wurde jedoch erst kürzlich mit der neusten R Version 3.0.0 eingeführt:

  • Profiling via Rprof() now optionally records information at the statement level, not just the function level.

Durch die Argumente Rprof(… ,line.profiling = TRUE) und summaryRprof( … ,lines= „show“) ist es nun möglich die Laufzeitanalyse nach Zeilen des source- Skripts auszuwerten. So lässt sich auf den ersten Blick erkennen, an welchen Stellen im Code die meiste Zeit benötigt wird.

Rprof("profile_boot.out", line.profiling=TRUE)
source("boot.R")
Rprof(NULL)

summaryRprof("profile_boot.out", lines = "show")

Output:

$by.self
          self.time self.pct total.time total.pct
boot.R#42      3.08    51.68       3.08     51.68
boot.R#44      1.88    31.54       1.88     31.54
boot.R#11      0.28     4.70       0.28      4.70
boot.R#12      0.14     2.35       0.14      2.35
boot.R#45      0.12     2.01       0.12      2.01
boot.R#68      0.10     1.68       0.16      2.68
boot.R#3       0.06     1.01       0.08      1.34
boot.R#64      0.06     1.01       0.06      1.01
boot.R#66      0.06     1.01       0.06      1.01
boot.R#19      0.02     0.34       0.52      8.72
boot.R#10      0.02     0.34       0.02      0.34
boot.R#13      0.02     0.34       0.02      0.34
boot.R#14      0.02     0.34       0.02      0.34
boot.R#15      0.02     0.34       0.02      0.34
boot.R#2       0.02     0.34       0.02      0.34
boot.R#27      0.02     0.34       0.02      0.34
boot.R#63      0.02     0.34       0.02      0.34
boot.R#76      0.02     0.34       0.02      0.34

$by.total
          total.time total.pct self.time self.pct
boot.R#48       5.08     85.23      0.00     0.00
boot.R#42       3.08     51.68      3.08    51.68
boot.R#44       1.88     31.54      1.88    31.54
boot.R#19       0.52      8.72      0.02     0.34
boot.R#11       0.28      4.70      0.28     4.70
boot.R#62       0.26      4.36      0.00     0.00
boot.R#80       0.26      4.36      0.00     0.00
boot.R#72       0.24      4.03      0.00     0.00
boot.R#68       0.16      2.68      0.10     1.68
boot.R#12       0.14      2.35      0.14     2.35
boot.R#45       0.12      2.01      0.12     2.01
boot.R#3        0.08      1.34      0.06     1.01
boot.R#64       0.06      1.01      0.06     1.01
boot.R#66       0.06      1.01      0.06     1.01
boot.R#10       0.02      0.34      0.02     0.34
boot.R#13       0.02      0.34      0.02     0.34
boot.R#14       0.02      0.34      0.02     0.34
boot.R#15       0.02      0.34      0.02     0.34
boot.R#2        0.02      0.34      0.02     0.34
boot.R#27       0.02      0.34      0.02     0.34
boot.R#63       0.02      0.34      0.02     0.34
boot.R#76       0.02      0.34      0.02     0.34

$by.line
          self.time self.pct total.time total.pct
boot.R#2       0.02     0.34       0.02      0.34
boot.R#3       0.06     1.01       0.08      1.34
boot.R#10      0.02     0.34       0.02      0.34
boot.R#11      0.28     4.70       0.28      4.70
boot.R#12      0.14     2.35       0.14      2.35
boot.R#13      0.02     0.34       0.02      0.34
boot.R#14      0.02     0.34       0.02      0.34
boot.R#15      0.02     0.34       0.02      0.34
boot.R#19      0.02     0.34       0.52      8.72
boot.R#27      0.02     0.34       0.02      0.34
boot.R#42      3.08    51.68       3.08     51.68
boot.R#44      1.88    31.54       1.88     31.54
boot.R#45      0.12     2.01       0.12      2.01
boot.R#48      0.00     0.00       5.08     85.23
boot.R#62      0.00     0.00       0.26      4.36
boot.R#63      0.02     0.34       0.02      0.34
boot.R#64      0.06     1.01       0.06      1.01
boot.R#66      0.06     1.01       0.06      1.01
boot.R#68      0.10     1.68       0.16      2.68
boot.R#72      0.00     0.00       0.24      4.03
boot.R#76      0.02     0.34       0.02      0.34
boot.R#80      0.00     0.00       0.26      4.36

$sample.interval
[1] 0.02

$sampling.time
[1] 5.96

Eine intuitive Darstellung des Profils, welche die Hierarchie der Funktionsaufrufe mit dem neuen Feature verknüpft, liefert die Funktion proftable() von Noam Ross.

Beispiel-Skript

# Script: boot.R
#############################################################

# Usual bootstrap of the ratio of means using the city data
ratio <- function(d, w) sum(d$x * w)/sum(d$u * w)
boot(city, ratio, R = 999, stype = "w")


# Stratified resampling for the difference of means.  In this
# example we will look at the difference of means between the final
# two series in the gravity data.
diff.means <- function(d, f)
{    n <- nrow(d)
     gp1 <- 1:table(as.numeric(d$series))[1]
     m1 <- sum(d[gp1,1] * f[gp1])/sum(f[gp1])
     m2 <- sum(d[-gp1,1] * f[-gp1])/sum(f[-gp1])
     ss1 <- sum(d[gp1,1]^2 * f[gp1]) - (m1 *  m1 * sum(f[gp1]))
     ss2 <- sum(d[-gp1,1]^2 * f[-gp1]) - (m2 *  m2 * sum(f[-gp1]))
     c(m1 - m2, (ss1 + ss2)/(sum(f) - 2))
}
grav1 <- gravity[as.numeric(gravity[,2]) >= 7,]
boot(grav1, diff.means, R = 999, stype = "f", strata = grav1[,2])

# In this example we show the use of boot in a prediction from
# regression based on the nuclear data.  This example is taken
# from Example 6.8 of Davison and Hinkley (1997).  Notice also
# that two extra arguments to 'statistic' are passed through boot.
nuke <- nuclear[, c(1, 2, 5, 7, 8, 10, 11)]
nuke.lm <- glm(log(cost) ~ date+log(cap)+ne+ct+log(cum.n)+pt, data = nuke)
nuke.diag <- glm.diag(nuke.lm)
nuke.res <- nuke.diag$res * nuke.diag$sd
nuke.res <- nuke.res - mean(nuke.res)

# We set up a new data frame with the data, the standardized 
# residuals and the fitted values for use in the bootstrap.
nuke.data <- data.frame(nuke, resid = nuke.res, fit = fitted(nuke.lm))

# Now we want a prediction of plant number 32 but at date 73.00
new.data <- data.frame(cost = 1, date = 73.00, cap = 886, ne = 0,
                       ct = 0, cum.n = 11, pt = 1)
new.fit <- predict(nuke.lm, new.data)

nuke.fun <- function(dat, inds, i.pred, fit.pred, x.pred)
{
    lm.b <- glm(fit+resid[inds] ~ date+log(cap)+ne+ct+log(cum.n)+pt,
                data = dat)
    pred.b <- predict(lm.b, x.pred)
    c(coef(lm.b), pred.b - (fit.pred + dat$resid[i.pred]))
}

nuke.boot <- boot(nuke.data, nuke.fun, R = 999, m = 1, 
                  fit.pred = new.fit, x.pred = new.data)
# The bootstrap prediction squared error would then be found by
mean(nuke.boot$t[, 8]^2)
# Basic bootstrap prediction limits would be
new.fit - sort(nuke.boot$t[, 8])[c 1="25)" language="(975,"][/c]


# Finally a parametric bootstrap.  For this example we shall look 
# at the air-conditioning data.  In this example our aim is to test 
# the hypothesis that the true value of the index is 1 (i.e. that 
# the data come from an exponential distribution) against the 
# alternative that the data come from a gamma distribution with
# index not equal to 1.
air.fun <- function(data) {
    ybar <- mean(data$hours)
    para <- c(log(ybar), mean(log(data$hours)))
    ll <- function(k) {
        if (k <= 0) 1e200 else lgamma(k)-k*(log(k)-1-para[1]+para[2])
    }
    khat <- nlm(ll, ybar^2/var(data$hours))$estimate
    c(ybar, khat)
}

air.rg <- function(data, mle) {
    # Function to generate random exponential variates.
    # mle will contain the mean of the original data
    out <- data
    out$hours <- rexp(nrow(out), 1/mle)
    out
}

air.boot <- boot(aircondit, air.fun, R = 999, sim = "parametric",
                 ran.gen = air.rg, mle = mean(aircondit$hours))

# The bootstrap p-value can then be approximated by
sum(abs(air.boot$t[,2]-1) > abs(air.boot$t0[2]-1))/(1+air.boot$R)