
This post explains how to combine the split(), lapply() and do.call() R functions, the so-called Split-Apply-Combine approach. The combination is frequently used for group-based calculations such as weighted averages or aggregation. It is especially useful when a complicated user-defined function has to be applied separately to each group, for example each currency.

### lapply, split, and do.call

The lapply() R function takes a list as its input and applies a built-in or user-defined function to each member of the list.

```r
lapply(list, function)
```
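As a minimal sketch of the call shape above, the toy list `lt` below (hypothetical values, not the post's data) is passed to lapply() once with a built-in function and once with a user-defined one.

```r
# Toy list for illustration only (hypothetical values)
lt <- list(a = c(1, 2, 3), b = c(10, 20))

# built-in function applied to each list member
lt.sum <- lapply(lt, sum)

# user-defined function applied to each list member
lt.rng <- lapply(lt, function(x) max(x) - min(x))

str(lt.sum)  # a list that keeps the names of the input list
str(lt.rng)
```

The result is always a list of the same length as the input, which is why a separate combine step is needed afterwards.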

Raw data typically comes as a data.frame, not a list. To run lapply() on a data.frame, we therefore first need to convert it to a corresponding list. For this purpose we use the split() R function, which takes a data.frame and a key column as input and returns a list of data.frames, one per value of the key column.

```r
split(data.frame, key column of data.frame)
```
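To make the shape of the split() result concrete, here is a small sketch. The data.frame `df` and its values are made up for illustration.

```r
# Small illustrative data.frame (hypothetical values)
df <- data.frame(currency = c("USD", "EUR", "USD", "EUR", "CNY"),
                 ws       = c(10, 20, 30, 40, 50),
                 stringsAsFactors = FALSE)

# split rows by the key column; the result is a named list of data.frames
lt <- split(df, df$currency)

names(lt)   # one list member per distinct currency
lt$USD      # the rows of df whose currency is "USD"
```

Each list member is itself a data.frame with all original columns, so the function passed to lapply() can use any of them.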

Although lapply() is very useful, the list it returns is somewhat awkward to work with. We convert this list back to a data.frame with the do.call() R function in the following way.

```r
do.call(rbind, list)
```
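The sketch below shows what do.call(rbind, ...) does with a list of one-row data.frames, the typical shape returned by lapply() in this post. The list `lt` and its values are hypothetical.

```r
# A list of one-row data.frames, as lapply() would typically return
lt <- list(AUD = data.frame(sum_ws = 1),
           CNY = data.frame(sum_ws = 2),
           EUR = data.frame(sum_ws = 3))

# do.call(rbind, lt) is equivalent to rbind(AUD = lt$AUD, CNY = lt$CNY, EUR = lt$EUR)
df <- do.call(rbind, lt)
print(df)   # the list names become the row names
```

Using do.call() avoids writing out every list member by hand, so the same line works for any number of groups.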

We want to aggregate weighted sensitivities (ws) within each currency (currency) of the input data.frame by using the split() and lapply() R functions. Finally, the output data.frame is constructed with the do.call() R function. This overall process is illustrated by the following figure.

### Case 1) Single output

The following R code sums ws by currency group.

```r
#=========================================================================#
# Financial Econometrics & Derivatives, ML/DL using R, Python, Tensorflow
# by Sang-Heon Lee
#
# https://kiandlee.blogspot.com
#-------------------------------------------------------------------------#
# A versatile usage of lapply
#=========================================================================#

graphics.off()  # clear all graphs
rm(list = ls()) # remove all objects from your workspace

#--------------------------------------------------------------------------
# Input data.frame
#--------------------------------------------------------------------------
str.data <-
    "currency    maturity    ws
     USD        3m            285000000
     USD        3m            456000000
     USD        1y            112000000
     USD        2y            56000000
     EUR        3m            1785000000
     EUR        6m            200000000
     EUR        1y            250000000
     EUR        1y            1855000000
     CNY        6m            84000000
     CNY        6m            42000000
     CNY        6m            144000000
     AUD        6m            213000000
     AUD        2y            106000000
     AUD        2y            214000000"

df.data <- read.table(text = str.data, header = TRUE)
print(df.data)

#--------------------------------------------------
# 1) Single output
#--------------------------------------------------

# split input data by key column
lt.data <- split(df.data, df.data$currency)

# single output
lt.out <- lapply(
    lt.data, function(x){
        data.frame(sum_ws = sum(x$ws))})

# concatenate rows
df.out <- do.call(rbind, lt.out)
rownames(df.out) <- NULL; print(df.out)
```

```
> print(df.out)
    sum_ws
1 5.33e+08
2 2.70e+08
3 4.09e+09
4 9.09e+08
```
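Note that the Case 1 output no longer shows which row belongs to which currency, because the row names are reset. One possible way to keep the key, sketched here on a reduced, hypothetical data set, is to copy the list names into a column before dropping the row names. This is an illustrative variation, not part of the original code.

```r
# Reduced illustrative input (hypothetical values)
df.data <- data.frame(
    currency = c("USD", "USD", "EUR", "EUR", "CNY"),
    ws       = c(285, 456, 1785, 200, 84),
    stringsAsFactors = FALSE)

lt.out <- lapply(split(df.data, df.data$currency),
                 function(x) data.frame(sum_ws = sum(x$ws)))
df.out <- do.call(rbind, lt.out)

# carry the group key over from the list names before dropping row names
df.out$currency <- names(lt.out)
rownames(df.out) <- NULL
print(df.out)
```

This works because split() names each list member after its key value and lapply() preserves those names.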

### Case 2) Multiple output

The following R code calculates the sum and maximum of ws by currency group and returns them together with the corresponding currency.

```r
#--------------------------------------------------
# 2) Multiple output
#--------------------------------------------------

# split input data by key column and do lapply
lt.out <- lapply(
    # split by currency for lapply
    split(df.data, df.data$currency),
    function(x){
        data.frame(curr = max(x$currency),
                   sum_ws = sum(x$ws), max_ws = max(x$ws))
    })

# concatenate rows
df.out <- do.call(rbind, lt.out)
rownames(df.out) <- NULL; print(df.out)
```

```
> print(df.out)
  curr   sum_ws     max_ws
1  AUD 5.33e+08  214000000
2  CNY 2.70e+08  144000000
3  EUR 4.09e+09 1855000000
4  USD 9.09e+08  456000000
```

### Case 3) Multiple output while original row is preserved

The following R code calculates the sum of ws by currency group together with the corresponding currency, and also calculates each row's weight within its currency. This case arises when we need to compute variables from both ungrouped and grouped (aggregated) values, such as weights within a group. For this purpose we need to preserve the rows of the original input data. This can be done by returning the key or other columns without any group operation, such as curr = x$currency and ws = x$ws, alongside grouped variables such as sum_ws = sum(x$ws).

```r
#--------------------------------------------------
# 3) Multiple output while original row is preserved
#--------------------------------------------------

# split input data by key column and do lapply
lt.out <- lapply(
    # split by currency for lapply
    split(df.data, df.data$currency),
    function(x){
        data.frame(curr   = x$currency, ws = x$ws,
                   sum_ws = sum(x$ws))
    })

# concatenate rows
df.out <- do.call(rbind, lt.out)
rownames(df.out) <- NULL; print(df.out)

# add another group based calculation
df.out$group_wgt <- df.out$ws/df.out$sum_ws
print(df.out)
```

```
> print(df.out)
   curr         ws   sum_ws  group_wgt
1   AUD  213000000 5.33e+08 0.39962477
2   AUD  106000000 5.33e+08 0.19887430
3   AUD  214000000 5.33e+08 0.40150094
4   CNY   84000000 2.70e+08 0.31111111
5   CNY   42000000 2.70e+08 0.15555556
6   CNY  144000000 2.70e+08 0.53333333
7   EUR 1785000000 4.09e+09 0.43643032
8   EUR  200000000 4.09e+09 0.04889976
9   EUR  250000000 4.09e+09 0.06112469
10  EUR 1855000000 4.09e+09 0.45354523
11  USD  285000000 9.09e+08 0.31353135
12  USD  456000000 9.09e+08 0.50165017
13  USD  112000000 9.09e+08 0.12321232
14  USD   56000000 9.09e+08 0.06160616
```
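Row preservation in Case 3 relies on data.frame() recycling a length-one value across the rows of a group. The sketch below isolates that step for a single group. The data.frame `x` stands in for one member of the split list, and its values (in millions) mirror the AUD rows.

```r
# One group as it would arrive inside the lapply() function
x <- data.frame(currency = c("AUD", "AUD", "AUD"),
                ws       = c(213, 106, 214),
                stringsAsFactors = FALSE)

# x$currency and x$ws have 3 elements; sum(x$ws) has 1 and is recycled
out <- data.frame(curr = x$currency, ws = x$ws, sum_ws = sum(x$ws))

# within-group weight of each original row
out$group_wgt <- out$ws / out$sum_ws
print(out)
```

Because every row carries its group total, the weights can be computed with a plain column division after the groups have been combined.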

### Case 4) Multiple output with multiple key columns

The following R code calculates the sum and maximum of ws by currency and maturity group, together with the corresponding currency and maturity. In this case, nested lapply() calls are used.

```r
#--------------------------------------------------
# 4) Multiple output with multiple key columns
#--------------------------------------------------

# outer lapply
lt.out <- lapply(
    # split by currency for outer lapply
    split(df.data, df.data$currency),
    function(x){

        # inner lapply
        y <- lapply(
            # split by maturity for inner lapply
            split(x, x$maturity),
            function(x) {
                data.frame(curr = max(x$currency), mat = max(x$maturity),
                           sum_ws = sum(x$ws), max_ws = max(x$ws))})

        # concatenate inner rows
        do.call(rbind, y)
    })

# concatenate outer rows
df.out <- do.call(rbind, lt.out)
rownames(df.out) <- NULL; print(df.out)
```

```
> print(df.out)
  curr mat     sum_ws     max_ws
1  AUD  2y  320000000  214000000
2  AUD  6m  213000000  213000000
3  CNY  6m  270000000  144000000
4  EUR  1y 2105000000 1855000000
5  EUR  3m 1785000000 1785000000
6  EUR  6m  200000000  200000000
7  USD  1y  112000000  112000000
8  USD  2y   56000000   56000000
9  USD  3m  741000000  456000000
```
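As a possible alternative to nesting, split() also accepts a list of key columns and then groups by their combinations. The sketch below uses a reduced, hypothetical data set. It is a variation for comparison, not the approach taken in the post, and the row order differs from the nested version, so the result is sorted explicitly.

```r
# Reduced illustrative input (hypothetical values)
df.data <- data.frame(
    currency = c("USD", "USD", "USD", "EUR", "EUR"),
    maturity = c("3m",  "3m",  "1y",  "3m",  "6m"),
    ws       = c(285, 456, 112, 1785, 200),
    stringsAsFactors = FALSE)

# split on both keys at once; drop = TRUE removes empty combinations
lt.data <- split(df.data, list(df.data$currency, df.data$maturity),
                 drop = TRUE)

lt.out <- lapply(lt.data, function(x) {
    data.frame(curr = x$currency[1], mat = x$maturity[1],
               sum_ws = sum(x$ws), max_ws = max(x$ws),
               stringsAsFactors = FALSE)
})
df.out <- do.call(rbind, lt.out)

# sort by currency, then maturity, and reset the row names
df.out <- df.out[order(df.out$curr, df.out$mat), ]
rownames(df.out) <- NULL
print(df.out)
```

The nested version makes the hierarchy of the keys explicit, while the single split() call is shorter when the keys are treated symmetrically.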

### Case 5) Multiple output with multiple key columns while original row is preserved

The following R code is the multiple-key counterpart of Case 3: it groups by both currency and maturity while preserving the original rows.

```r
#--------------------------------------------------
# 5) Multiple output with multiple key columns
#    while original row is preserved
#--------------------------------------------------

# outer lapply
lt.out <- lapply(
    # split by currency for outer lapply
    split(df.data, df.data$currency),
    function(x){

        # inner lapply
        y <- lapply(
            # split by maturity for inner lapply
            split(x, x$maturity),
            function(x) {
                data.frame(curr = x$currency, mat = x$maturity,
                           ws   = x$ws, sum_ws = sum(x$ws))})

        # concatenate inner rows
        do.call(rbind, y)
    })

# concatenate outer rows
df.out <- do.call(rbind, lt.out)
rownames(df.out) <- NULL

# add another group based calculation
df.out$group_wgt <- df.out$ws/df.out$sum_ws
print(df.out)
```

```
> print(df.out)
   curr mat         ws     sum_ws group_wgt
1   AUD  2y  106000000  320000000 0.3312500
2   AUD  2y  214000000  320000000 0.6687500
3   AUD  6m  213000000  213000000 1.0000000
4   CNY  6m   84000000  270000000 0.3111111
5   CNY  6m   42000000  270000000 0.1555556
6   CNY  6m  144000000  270000000 0.5333333
7   EUR  1y  250000000 2105000000 0.1187648
8   EUR  1y 1855000000 2105000000 0.8812352
9   EUR  3m 1785000000 1785000000 1.0000000
10  EUR  6m  200000000  200000000 1.0000000
11  USD  1y  112000000  112000000 1.0000000
12  USD  2y   56000000   56000000 1.0000000
13  USD  3m  285000000  741000000 0.3846154
14  USD  3m  456000000  741000000 0.6153846
```
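A quick sanity check on this kind of output is that the weights add up to 1 within every currency and maturity group. The sketch below rebuilds a few rows in the Case 5 layout (the AUD and USD 3m rows) and checks them with tapply(). The name `wgt.chk` is hypothetical.

```r
# Excerpt in the Case 5 layout (AUD rows and USD 3m rows)
df.out <- data.frame(
    curr   = c("AUD", "AUD", "AUD", "USD", "USD"),
    mat    = c("2y",  "2y",  "6m",  "3m",  "3m"),
    ws     = c(106, 214, 213, 285, 456) * 1e6,
    sum_ws = c(320, 320, 213, 741, 741) * 1e6,
    stringsAsFactors = FALSE)
df.out$group_wgt <- df.out$ws / df.out$sum_ws

# total weight per (currency, maturity) group; each should equal 1
wgt.chk <- tapply(df.out$group_wgt, paste(df.out$curr, df.out$mat), sum)
print(wgt.chk)
```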

From this post, we can see that the combination of the split(), lapply() and do.call() R functions delivers the output we want for group operations. $$\blacksquare$$
