Fast and Easy Aggregation of Multi-Type and Survey Data in R

Posted on January 8, 2021 by R, Econometrics, High Performance in R bloggers | 0 Comments

[This article was first published on R, Econometrics, High Performance, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

collapse is a C/C++ based package to facilitate and speed up advanced statistical computations in R. One of the key objectives for creating it was to introduce in R a fast, consistent, and easy to use toolset for aggregating complex datasets. This post showcases this functionality by aggregating 3 quite different survey datasets I happened to have used recently for a project:

A births dataset from the 2016 Demographic and Health Survey for Uganda (used for child mortality estimates, available here).
A dataset of poverty estimates from the Uganda National Household Survey 2016/17 (used to compute district level poverty indicators, not available for direct download, documented here).
The Uganda National Population and Housing Census 2014 (for district level population estimates and other data, available here under UBOS).

First, the STATA files are imported using the haven library. Columns with only missing values are removed from the DHS dataset, encoded columns are converted to factor variables.

library(haven)
library(magrittr)
library(collapse)

# Uganda Demographic and Health Survey 2016: Birth Recode
DHSBR <- paste0(DHS_path, "/Data/UGBR7BDT - Births Recode/UGBR7BFL.dta") %>%  
         read_dta %>% get_vars(fNobs(.) > 0L) %>% as_factor

# Uganda National Household Survey 2016/17: Poverty Estimates
UNHSPOV <- paste0(UNHS_path, "/Household/pov16_rev1.dta") %>% 
           read_dta %>% as_factor

# Uganda National Population and Housing Census 2014
CENS <- paste0(CENS_path, "/UBOS 2014 Census.dta") %>% read_dta

We start with aggregating the DHS dataset. This data has 786 variables, most of which are categorical:

fdim(DHSBR)
## [1] 57906   786

table(vclasses(DHSBR))
## 
## character    factor   numeric 
##         2       696        88

We can obtain a detailed statistical summary of the data using descr. The output prints nicely to the console, but can also be converted to a data.frame.

descr(DHSBR, table = FALSE) %>% as.data.frame %>% head(10)
##    Variable     Class                                         Label     N Ndist         Mean
## 1    caseid character                           case identification 57906 13745           NA
## 2      bidx   numeric                           birth column number 57906    18 3.486720e+00
## 3      v000 character                        country code and phase 57906     1           NA
## 4      v001   numeric                                cluster number 57906   696 3.557185e+02
## 5      v002   numeric                              household number 57906   221 2.558897e+01
## 6      v003   numeric                      respondent's line number 57906    20 1.960799e+00
## 7      v004   numeric                            ultimate area unit 57906   696 3.557185e+02
## 8      v005   numeric women's individual sample weight (6 decimals) 57906   686 9.848528e+05
## 9      v006   numeric                            month of interview 57906     7 8.630176e+00
## 10     v007   numeric                             year of interview 57906     1 2.016000e+03
##              SD   Min     Max        Skew      Kurt     1%     5%    25%    50%     75%     95%
## 1            NA    NA      NA          NA        NA     NA     NA     NA     NA      NA      NA
## 2  2.367381e+00     1      18  1.05848241  3.806124      1      1      2      3       5       8
## 3            NA    NA      NA          NA        NA     NA     NA     NA     NA      NA      NA
## 4  1.915351e+02     1     697  0.01173270  1.881827     13     53    195    356     519     664
## 5  2.926832e+01     1     545  3.89808066 31.759599      1      2     10     19      28      86
## 6  1.201193e+00     1      21  5.53129314 49.135251      1      1      1      2       2       3
## 7  1.915351e+02     1     697  0.01173270  1.881827     13     53    195    356     519     664
## 8  5.543562e+05 45069 5145429  1.78199379  9.540138 102618 227215 702216 896184 1186668 1973187
## 9  1.496144e+00     6      12 -0.01157971  2.034968      6      6      7      9      10      11
## 10 0.000000e+00  2016    2016         NaN       NaN   2016   2016   2016   2016    2016    2016
##        99%
## 1       NA
## 2       10
## 3       NA
## 4      691
## 5      140
## 6        8
## 7      691
## 8  3142092
## 9       11
## 10    2016

The DHS sample comprises 20,880 selected households and 18,506 women being interviewed. Of these women 13,745 had given birth and are recorded in this dataset. As the descriptive statistics above show, the first column gives the women-id (caseid), and the second column an integer id (bidx) for each of the born children.

The aggregation task for this dataset shall simply be to aggregate over the children for each women. A first step to decide how this aggregation is to be done is to examine which variables vary by women i.e. contain child characteristics.

# Tabulate child-variant variables
table(varying(DHSBR, ~ caseid))
## 
## FALSE  TRUE 
##   521   264

# Examine the numeric child-variant variables
DHSBR %>% fgroup_by(caseid) %>% num_vars %>% 
  get_vars(varying(.)) %>% namlab
##    Variable                                                                 Label
## 1      bidx                                                   birth column number
## 2      bord                                                    birth order number
## 3        b1                                                        month of birth
## 4        b2                                                         year of birth
## 5        b3                                                   date of birth (cmc)
## 6        b7                                        age at death (months, imputed)
## 7        b8                                                  current age of child
## 8       b11                                     preceding birth interval (months)
## 9       b12                                    succeeding birth interval (months)
## 10      b17                                                          day of birth
## 11      b18                                       century day code of birth (cdc)
## 12      b19 current age of child in months (months since birth for dead children)
## 13      b20                                                 duration of pregnancy
## 14     midx                                                index to birth history
## 15     hidx                                                index to birth history
## 16    hidxa                                                index to birth history
## 17    hwidx                                                index to birth history
## 18      hw1                                                 child's age in months
## 19    idxml                                                index to birth history
## 20    idx94                                                index to birth history

These are all variables that we would prefer to aggregate using the average, not the sum or extreme values. It is also noteworthy that the weights don’t vary by child, but only by women, so weighted aggregation is actually not necessary in this case.

# Renaming weights variable
setrename(DHSBR, v005 = weights)
# Confirm that it does not vary by child
varying(DHSBR, weights ~ caseid)
## weights 
##   FALSE

Thus aggregation in this case is very simple using the collap() function, which by default aggregates numeric columns using the mean, and categorical columns using the statistical mode (i.e. the most frequent value):

# Aggregating, same as collap(DHSBR, ~ caseid, fmean, fmode), or collapv(1)
DHSBR_agg <- collap(DHSBR, ~ caseid) %>% fdroplevels

head(DHSBR_agg)
## # A tibble: 6 x 786
##   caseid   bidx v000   v001  v002  v003  v004 weights  v006  v007  v008 v008a  v009  v010  v011  v012
##   <chr>   <dbl> <chr> <dbl> <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 "    0~   1.5 UG7       1     3     2     1 1099225     8  2016  1400 42613     7  1991  1099    25
## 2 "    0~   1.5 UG7       1     4     1     1 1099225     8  2016  1400 42609    12  1975   912    40
## 3 "    0~   1   UG7       1     4     2     1 1099225     8  2016  1400 42609     7  1995  1147    21
## 4 "    0~   1.5 UG7       1     4     6     1 1099225     8  2016  1400 42611     1  1993  1117    23
## 5 "    0~   1.5 UG7       1     4     7     1 1099225     8  2016  1400 42609     2  1986  1034    30
## 6 "    0~   1   UG7       1     4     8     1 1099225     8  2016  1400 42609     5  1989  1073    27
## # ... with 770 more variables: v013 <fct>, v014 <fct>, v015 <fct>, v016 <dbl>, v017 <dbl>,
## #   v018 <fct>, v019 <fct>, v019a <fct>, v020 <fct>, v021 <dbl>, v022 <fct>, v023 <fct>, v024 <fct>,
## #   v025 <fct>, v027 <dbl>, v028 <dbl>, v030 <dbl>, v034 <fct>, v040 <dbl>, v042 <fct>, v044 <fct>,
## #   v045a <fct>, v045b <fct>, v045c <fct>, v046 <fct>, v101 <fct>, v102 <fct>, v104 <fct>,
## #   v105 <fct>, v106 <fct>, v107 <fct>, v113 <fct>, v115 <fct>, v116 <fct>, v119 <fct>, v120 <fct>,
## #   v121 <fct>, v122 <fct>, v123 <fct>, v124 <fct>, v125 <fct>, v127 <fct>, v128 <fct>, v129 <fct>,
## #   v130 <fct>, v131 <fct>, v133 <fct>, v135 <fct>, v136 <dbl>, v137 <dbl>, v138 <dbl>, v139 <fct>,
## #   v140 <fct>, v149 <fct>, v150 <fct>, v151 <fct>, v152 <fct>, v153 <fct>, awfactt <dbl>,
## #   awfactu <dbl>, awfactr <dbl>, awfacte <dbl>, awfactw <dbl>, v155 <fct>, v157 <fct>, v158 <fct>,
## #   v159 <fct>, v160 <fct>, v161 <fct>, v167 <fct>, v168 <fct>, v169a <fct>, v169b <fct>,
## #   v170 <fct>, v171a <fct>, v171b <fct>, v190 <fct>, v191 <dbl>, v190a <fct>, v191a <dbl>,
## #   ml101 <fct>, v201 <dbl>, v202 <dbl>, v203 <dbl>, v204 <dbl>, v205 <dbl>, v206 <dbl>, v207 <dbl>,
## #   v208 <fct>, v209 <fct>, v210 <dbl>, v211 <dbl>, v212 <dbl>, v213 <fct>, v214 <dbl>, v215 <fct>,
## #   v216 <fct>, v217 <fct>, v218 <dbl>, v219 <dbl>, ...

# Aggregating preserves column order and data types / classes + attributes
identical(namlab(DHSBR_agg, class = TRUE), 
          namlab(DHSBR, class = TRUE))
## [1] TRUE

Apart from the simplicity and speed of this solution, collap() by default preserves the original column order (argument keep.col.order = TRUE) and all attributes of columns and the data frame itself. So we can truly speak of an aggregated / collapsed version of this dataset. Calling fdroplevels on the result is a likewise highly optimized and non-destructive solution to dropping any redundant factor levels from any of the 696 aggregated factor variables.

Let us now consider the poverty estimates dataset:

fdim(UNHSPOV)
## [1] 15636    44

table(vclasses(UNHSPOV))
## 
##  factor numeric 
##      17      27

descr(UNHSPOV, table = FALSE) %>% as.data.frame %>% head(10)
##    Variable   Class                          Label     N Ndist         Mean           SD       Min
## 1      hhid numeric  Unique identifier  in 2016/17 15636 15636 89610.296943 50753.531112 201.00000
## 2  finalwgt numeric                           <NA> 15636  1731   540.811778   519.368731  10.65561
## 3  district  factor                  District Code 15636   112           NA           NA        NA
## 4        ea numeric               Enumeration area 15636    67     9.157265    10.810512   1.00000
## 5     urban  factor         Urban/Rural Identifier 15636     2           NA           NA        NA
## 6    subreg  factor         15      sub     region 15636    15           NA           NA        NA
## 7    region  factor Region of Residence in 2016/17 15636     4           NA           NA        NA
## 8    regurb  factor             RegionxRural/Urban 15636     8           NA           NA        NA
## 9     equiv numeric                    (sum) equiv 15636  9448     3.438747     1.897926   0.71000
## 10    hsize numeric                    (sum) hsize 15636    20     4.515285     2.548680   1.00000
##             Max        Skew      Kurt         1%           5%          25%          50%          75%
## 1  178010.00000 0.002337925  1.833309 2102.35000 9907.7500000 46178.250000 89401.500000 1.327083e+05
## 2    5156.81494 3.097397657 18.780390   34.65487   76.0465393   207.895950   399.305145 6.978978e+02
## 3            NA          NA        NA         NA           NA           NA           NA           NA
## 4      90.00000 3.683418249 21.263899    1.00000    1.0000000     3.000000     6.000000 1.100000e+01
## 5            NA          NA        NA         NA           NA           NA           NA           NA
## 6            NA          NA        NA         NA           NA           NA           NA           NA
## 7            NA          NA        NA         NA           NA           NA           NA           NA
## 8            NA          NA        NA         NA           NA           NA           NA           NA
## 9      17.28507 0.904448197  4.183096    0.77380    0.8743333     2.009667     3.146083 4.559833e+00
## 10     23.00000 0.734721072  3.761180    1.00000    1.0000000     3.000000     4.000000 6.000000e+00
##             95%          99%
## 1  1.695023e+05 176403.65000
## 2  1.444975e+03   2700.59717
## 3            NA           NA
## 4  2.800000e+01     60.00000
## 5            NA           NA
## 6            NA           NA
## 7            NA           NA
## 8            NA           NA
## 9  6.972708e+00      8.84461
## 10 9.000000e+00     12.00000

Using the qsu() function, we can also summarize the variation in two of the key variables between district averages and within districts, separated for rural and urban areas. This can give us an idea of the variation in poverty levels we are erasing by aggregating this data to the district level.

qsu(UNHSPOV, fexp30 + welfare ~ urban, ~ district, ~ finalwgt, 
    vlabels = TRUE)[,"SD",,] # Showing only the standard deviation (SD)
## , , fexp30: Monthly food expenses
## 
##           Overall     Between      Within
## Rural  168101.761  47831.6226  161254.386
## Urban   243424.17  56966.9794  240210.089
## 
## , , welfare: Welfare based on usual members present
## 
##           Overall     Between      Within
## Rural  99872.8917  35288.1075  95355.6836
## Urban  202069.239   64221.637  195061.104

The variance breakdown shows that apart from rural welfare, most of the variation in food expenditure and welfare levels is between district averages rather than within districts. We can again examine the numeric variables:

UNHSPOV %>% num_vars %>% namlab
##    Variable                                                                            Label
## 1      hhid                                                    Unique identifier  in 2016/17
## 2  finalwgt                                                                             <NA>
## 3        ea                                                                 Enumeration area
## 4     equiv                                                                      (sum) equiv
## 5     hsize                                                                      (sum) hsize
## 6    fexp30                                                            Monthly food expenses
## 7    rexp30                     Monthly household expenditures after adjusting for inflation
## 8   rrfxp30                                                                             <NA>
## 9   rrexp30 Monthly household expenditures in real prices after adjusting for regional price
## 10 nrrexp30 Monthly nominal household expenditures in market prices & after regional price a
## 11  cpexp30 Monthly household expenditures in constant prices after adjusting for regional p
## 12 fcpexp30 Monthly household food expenditures in constant prices after adjusting for regio
## 13     mult                                                                             <NA>
## 14    rmult                                                                             <NA>
## 15  welfare                                           Welfare based on usual members present
## 16 fwelfare                                                                             <NA>
## 17    hmult                                                                             <NA>
## 18   plinen                                                    Poverty line in nominal terms
## 19  ctpline                                                  Poverty line in constant prices
## 20   hpline                                     Food poverty line in 2009/10 constant prices
## 21   spline                                          Poverty line in 2009/10 constant prices
## 22 fpoor_16                                      food Poor in 2016 based on welfare variable
## 23   decile                                                                   Quantile group
## 24      pid                                                           Individual indentifier
## 25    hhage                                                           Age in completed years
## 26  hhedyrs                                                 Number of school years completed
## 27  hhelder                                                                             <NA>

These are also all variables that we would aggregate using a measure of central tendency. The categorical variables are mostly identifiers and also some categorical versions of welfare variables (welfare quintiles), which can all sensibly be aggregated using the statistical mode:

UNHSPOV %>% cat_vars %>% namlab
##        Variable                                                           Label
## 1      district                                                   District Code
## 2         urban                                          Urban/Rural Identifier
## 3        subreg                                          15      sub     region
## 4        region                                  Region of Residence in 2016/17
## 5        regurb                                              RegionxRural/Urban
## 6       poor_16                                                  Poverty status
## 7        quints                      Quintiles based on the national population
## 8        qurban                       Quintiles based on rural/urban population
## 9       qregion                          Quintiles based on regional population
## 10        hhrel  Relationship of household member  to the head of the household
## 11        mstat                            Marital  status of  household member
## 12        hhsex                     RECODE of R02 (Sex of the household member)
## 13      hhedlev                                                            <NA>
## 14 hhstatus_emp                          Activity status(employed, subsistence)
## 15     hhstatus Activity status(employed, subsistence, unemployed, not working)
## 16       hhindu                                                   RECODE of B4b
## 17     hhmrtsex                                             Marital by headship

Below we aggregate this dataset, applying the weighted median to numeric data and the weighted mode (default) to categorical data, this time using collapg which is a wrapper around collap operating on grouped data frames / tibbles.

# Weighted aggregation by district, after removing household id and enumeration area
UNHSPOV %>% 
  fselect(-hhid, -ea) %>% 
  fgroup_by(district) %>% 
  collapg(fmedian, w = finalwgt) %>%
  fdroplevels %>% 
  head
## # A tibble: 6 x 42
##   district finalwgt urban subreg region regurb equiv hsize fexp30 rexp30 rrfxp30 rrexp30 nrrexp30
##   <fct>       <dbl> <fct> <fct>  <fct>  <fct>  <dbl> <dbl>  <dbl>  <dbl>   <dbl>   <dbl>    <dbl>
## 1 "KALANG~   12994. Rural Centr~ Centr~ Centr~  1.87     2 2.46e5 1.83e5 240877. 180432.  324962.
## 2 "KAMPAL~  460128. Urban Kampa~ Centr~ Centr~  2.30     3 2.89e5 4.17e5 267612. 402942.  662020.
## 3 "KIBOGA"   20524. Rural Centr~ Centr~ Centr~  3.16     4 1.81e5 2.45e5 171290. 233323.  418979.
## 4 "     L~  118868. Rural Centr~ Centr~ Centr~  2.77     4 2.11e5 2.26e5 199803. 220439.  386698.
## 5 "MASAKA"   92389. Urban Centr~ Centr~ Centr~  2.64     3 2.42e5 2.74e5 224339. 269409.  473376.
## 6 "MPIGI"    65521. Rural Centr~ Centr~ Centr~  2.81     4 2.29e5 2.49e5 222739. 240048.  428228.
## # ... with 29 more variables: cpexp30 <dbl>, fcpexp30 <dbl>, mult <dbl>, rmult <dbl>, welfare <dbl>,
## #   fwelfare <dbl>, hmult <dbl>, plinen <dbl>, ctpline <dbl>, hpline <dbl>, spline <dbl>,
## #   poor_16 <fct>, fpoor_16 <dbl>, quints <fct>, decile <dbl>, qurban <fct>, qregion <fct>,
## #   pid <dbl>, hhrel <fct>, hhage <dbl>, mstat <fct>, hhsex <fct>, hhedyrs <dbl>, hhedlev <fct>,
## #   hhstatus_emp <fct>, hhstatus <fct>, hhindu <fct>, hhelder <dbl>, hhmrtsex <fct>

Note in the result above that the weighting variable is also aggregated. The default is wFUN = fsum so the weights in each group are summed.

At last let’s consider the census dataset. On first sight it is a bit simpler than the other two, consisting of 5 character identifiers from the macro-region to the parish level, followed by 270 numeric variables.

fdim(CENS)
## [1] 7653  275

table(vclasses(CENS))
## 
## character   numeric 
##         5       270

The specialty of this data is however that some variables are recorded in population totals, and some in percentage terms.

descr(CENS, table = FALSE) %>% as.data.frame %>% head(15)
##         Variable     Class                                                     Label    N Ndist
## 1         Region character                                                      <NA> 7653     4
## 2       District character                                                      <NA> 7653   122
## 3         County character                                                      <NA> 7653   199
## 4      Subcounty character                                                      <NA> 7653  1382
## 5         Parish character                                                      <NA> 7653  6438
## 6          POP_M   numeric                                     Population Size: Male 7557  3548
## 7          POP_F   numeric                                   Population Size: Female 7557  3664
## 8         POP_SR   numeric                                Population Size: Sex Ratio 7557   609
## 9            POP   numeric                                    Population Size: Total 7557  4923
## 10       HHEAD_M   numeric        Headship of Households by Sex: Male Headed: Number 7557  1736
## 11     HHEAD_M_P   numeric       Headship of Households by Sex: Male Headed: Percent 7557   359
## 12       HHEAD_F   numeric      Headship of Households by Sex: Female Headed: Number 7557   846
## 13     HHEAD_F_P   numeric     Headship of Households by Sex: Female Headed: Percent 7557   359
## 14   HHEAD_10_17   numeric  Household Headship by specific age groups: 10-17: Number 7557    70
## 15 HHEAD_10_17_P   numeric Household Headship by specific age groups: 10-17: Percent 7556    40
##            Mean           SD  Min     Max       Skew       Kurt      1%      5%    25%    50%    75%
## 1            NA           NA   NA      NA         NA         NA      NA      NA     NA     NA     NA
## 2            NA           NA   NA      NA         NA         NA      NA      NA     NA     NA     NA
## 3            NA           NA   NA      NA         NA         NA      NA      NA     NA     NA     NA
## 4            NA           NA   NA      NA         NA         NA      NA      NA     NA     NA     NA
## 5            NA           NA   NA      NA         NA         NA      NA      NA     NA     NA     NA
## 6  2236.0525341 2060.3798193 39.0 45834.0  5.8878678  68.350438 335.000  549.00 1155.0 1782.0 2686.0
## 7  2347.0690750 2285.1063696 26.0 52061.0  6.3804915  77.223950 324.000  550.60 1193.0 1852.0 2831.0
## 8    97.1208813   10.7985572 35.0   365.2  5.2374120  86.423031  78.300   85.20   91.9   95.8  100.5
## 9  4583.1216091 4338.2687374 65.0 97895.0  6.1578818  73.263475 668.680 1101.60 2350.0 3634.0 5520.0
## 10  733.6140003  795.4130787  3.0 19855.0  7.5065928 101.724761 106.000  175.00  362.0  565.0  861.0
## 11   77.0265979    6.0370928 21.3    95.5 -0.5516445   5.158277  61.956   67.28   73.1   77.3   81.2
## 12  232.9163689  300.3926888  1.0  7018.0  7.3292989  91.895443  20.000   38.00  100.0  167.0  267.0
## 13   22.9735477    6.0371554  4.5    78.7  0.5516337   5.158115  10.600   13.70   18.8   22.7   26.9
## 14    4.7338891    7.3239515  0.0   148.0  5.0812704  49.771747   0.000    0.00    1.0    3.0    6.0
## 15    0.4547512    0.4600549  0.0     9.2  4.0165231  49.845440   0.000    0.00    0.2    0.4    0.6
##         95%       99%
## 1        NA        NA
## 2        NA        NA
## 3        NA        NA
## 4        NA        NA
## 5        NA        NA
## 6   5102.40 10264.160
## 7   5331.80 11562.360
## 8    112.40   133.732
## 9  10449.40 22273.800
## 10  1677.00  3929.520
## 11    86.30    89.400
## 12   568.00  1590.760
## 13    32.72    38.044
## 14    16.00    37.000
## 15     1.20     1.900

The population counts are easily aggregated by simply computing a sum, but variables providing percentages of the population need to be aggregated using a weighted mean, where the total population serves as the weighting variable. This shows the percentage change variables:

# gvr is a shorthand for get_vars(..., regex = TRUE)
gvr(CENS, "_P$") %>% namlab %>% head(10)
##         Variable                                                                    Label
## 1      HHEAD_M_P                      Headship of Households by Sex: Male Headed: Percent
## 2      HHEAD_F_P                    Headship of Households by Sex: Female Headed: Percent
## 3  HHEAD_10_17_P                Household Headship by specific age groups: 10-17: Percent
## 4  HHEAD_18_30_P                Household Headship by specific age groups: 18-30: Percent
## 5  HHEAD_M_A60_P                  Household Headship by specific age groups: 60+: Percent
## 6    HPOP_0_17_P                        Household Population by Broad Ages: 0-17: Percent
## 7   HPOP_18_30_P                       Household Population by Broad Ages: 18-30: Percent
## 8   HPOP_31_59_P                       Household Population by Broad Ages: 31-59: Percent
## 9     HPOP_A60_P                         Household Population by Broad Ages: 60+: Percent
## 10      POP_L1_P Population Distribution by Special Age groups: Less than 1 year: Percent

# Making sure all of these variables are indeed on a percentage scale
range(fmax(gvr(CENS, "_P$")))
## [1]   8.9 100.0

To aggregate this data with collap, we need to supply the names or indices of both percentage and non-percentage variables together with the corresponding aggregator functions in a list passed to the custom argument. Weights are passed to the w argument. A specialty here is that we are using fsum_uw instead of fsum. The postfix _uw prevents the weights from being passed to fsum, which would otherwise calculate a survey total (i.e. a weighted sum) instead of a simple summation.

perc_vars <- gvr(CENS, "_P$", return = "indices")
pop_vars <- setdiff(num_vars(CENS, "indices"), perc_vars)

collap(CENS, ~ Region + District, w = ~ POP,
       custom = list(fmean = perc_vars, fsum_uw = pop_vars), 
       keep.w = FALSE) %>% head
## # A tibble: 6 x 272
##   Region District  POP_M  POP_F POP_SR    POP HHEAD_M HHEAD_M_P HHEAD_F HHEAD_F_P HHEAD_10_17
##   <chr>  <chr>     <dbl>  <dbl>  <dbl>  <dbl>   <dbl>     <dbl>   <dbl>     <dbl>       <dbl>
## 1 Centr~ Buikwe   207324 215447  6807. 422771   71148      72.8   26685      27.2         691
## 2 Centr~ Bukoman~  75109  76304  2442. 151413   23426      68.3   10902      31.7         177
## 3 Centr~ Butamba~  50082  50758  2495. 100840   15128      69.8    6550      30.2         139
## 4 Centr~ Buvuma    48414  41476  4703.  89890   20289      81.3    4830      18.7         211
## 5 Centr~ Gomba     82167  77755  3923. 159922   25794      73.3    9446      26.7         207
## 6 Centr~ Kalanga~  31349  22944  2353   54293   15493      77.1    4548      22.9         123
## # ... with 261 more variables: HHEAD_10_17_P <dbl>, HHEAD_18_30 <dbl>, HHEAD_18_30_P <dbl>,
## #   HHEAD_M_A60 <dbl>, HHEAD_M_A60_P <dbl>, HHEAD <dbl>, HPOP_0_17 <dbl>, HPOP_0_17_P <dbl>,
## #   HPOP_18_30 <dbl>, HPOP_18_30_P <dbl>, HPOP_31_59 <dbl>, HPOP_31_59_P <dbl>, HPOP_A60 <dbl>,
## #   HPOP_A60_P <dbl>, HPOP <dbl>, POP_L1 <dbl>, POP_L1_P <dbl>, POP_0_4 <dbl>, POP_0_4_P <dbl>,
## #   POP_0_8 <dbl>, POP_0_8_P <dbl>, POP_2_8 <dbl>, POP_2_8_P <dbl>, POP_2_17 <dbl>,
## #   POP_2_17_P <dbl>, POP_6_12 <dbl>, POP_6_12_P <dbl>, POP_6_15 <dbl>, POP_6_15_P <dbl>,
## #   POP_10_15 <dbl>, POP_10_15_P <dbl>, POP_10_17 <dbl>, POP_10_17_P <dbl>, POP_15_24 <dbl>,
## #   POP_15_24_P <dbl>, POP_16_24 <dbl>, POP_16_24_P <dbl>, POP_15_29 <dbl>, POP_15_29_P <dbl>,
## #   POP_A2 <dbl>, POP_A2_P <dbl>, POP_A10 <dbl>, POP_A10_P <dbl>, POP_A15 <dbl>, POP_A15_P <dbl>,
## #   POP_A18 <dbl>, POP_A18_P <dbl>, POP_A20 <dbl>, POP_A20_P <dbl>, POP_A65 <dbl>, POP_A65_P <dbl>,
## #   EDU_6_12_NAS_M <dbl>, EDU_6_12_NAS_M_P <dbl>, EDU_6_12_NAS_F <dbl>, EDU_6_12_NAS_F_P <dbl>,
## #   EDU_6_12_NAS <dbl>, EDU_6_12_NAS_P <dbl>, EDU_6_12_PRI_M <dbl>, EDU_6_12_PRI_M_P <dbl>,
## #   EDU_6_12_PRI_F <dbl>, EDU_6_12_PRI_F_P <dbl>, EDU_6_12_PRI <dbl>, EDU_6_12_PRI_P <dbl>,
## #   EDU_13_18_SEC_M <dbl>, EDU_13_18_SEC_M_P <dbl>, EDU_13_18_SEC_F <dbl>, EDU_13_18_SEC_F_P <dbl>,
## #   EDU_13_18_SEC <dbl>, EDU_13_18_SEC_P <dbl>, EDU_A15_BS4_M <dbl>, EDU_A15_BS4_M_P <dbl>,
## #   EDU_A15_BS4_F <dbl>, EDU_A15_BS4_F_P <dbl>, EDU_A15_BS4 <dbl>, EDU_A15_BS4_P <dbl>,
## #   EDU_A18_HO_M <dbl>, EDU_A18_HO_M_P <dbl>, EDU_A18_HO_F <dbl>, EDU_A18_HO_F_P <dbl>,
## #   EDU_A18_HO <dbl>, EDU_A18_HO_P <dbl>, EDU_A20_HA_M <dbl>, EDU_A20_HA_M_P <dbl>,
## #   EDU_A20_HA_F <dbl>, EDU_A20_HA_F_P <dbl>, EDU_A20_HA <dbl>, EDU_A20_HA_P <dbl>, IL_A18_M <dbl>,
## #   IL_A18_M_P <dbl>, IL_A18_F <dbl>, IL_A18_F_P <dbl>, IL_A18 <dbl>, IL_A18_P <dbl>,
## #   IL_10_17 <dbl>, IL_10_17_P <dbl>, IL_18_30 <dbl>, IL_18_30_P <dbl>, IL_A60 <dbl>,
## #   IL_A60_P <dbl>, BR_L1 <dbl>, ...

Also with the custom argument, the columns are by default (keep.col.order = TRUE) rearranged into the order in which they occur. Here we additionally use keep.w = FALSE, because the variable POP is both used as the weighting variable but also contained in pop_vars, and we don’t want to have it twice in the output.

Since we are only aggregating numeric data, we may compare the computation speed with a matching data.table expression¹:

library(microbenchmark)
library(data.table)
setDT(CENS)

microbenchmark(
  data.table = cbind(CENS[, lapply(.SD, weighted.mean, POP), by = .(Region, District), .SDcols = perc_vars], 
                     CENS[, lapply(.SD, sum), by = .(Region, District), .SDcols = pop_vars][, -(1:2)]), 
  collapse = collap(CENS, ~ Region + District, w = ~ POP,
                     custom = list(fmean = perc_vars, fsum_uw = pop_vars), 
                     keep.w = FALSE)
)
## Unit: milliseconds
##        expr        min         lq       mean     median        uq       max neval cld
##  data.table 153.317076 169.257733 181.740319 175.767603 191.12234 346.31187   100   b
##    collapse   8.997704   9.260768   9.990888   9.837097  10.24251  14.12287   100  a

Which does however not maintain the original column order.↩︎

To leave a comment for the author, please follow the link and comment on their blog: R, Econometrics, High Performance.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

Fast and Easy Aggregation of Multi-Type and Survey Data in R

Related

Related

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)