Aggregation

February 26, 2015
By

(This article was first published on Paradigm4 » R, and kindly contributed to R-bloggers)

Aggregation splits data into subsets, computes summary statistics on each subset, and reports the results in a conveniently summarized form.

The aggregate function is one of the most capable functions in the scidb package. The package overloads R’s standard aggregate function for SciDB arrays, using reasonably standard R syntax to cover most SciDB aggregation operators including aggregate, window, and variable_window. (The regrid and cumulate functions separately implement additional SciDB aggregation operators.)

The aggregate function extends the default capabilities of many SciDB aggregation operators to allow grouping by SciDB array dimensions, aggregates, other SciDB arrays, and combinations of all three.

Basic Aggregation

The SciDB aggregate operator computes summary statistics for one or more attributes of a SciDB array, grouped by zero or more dimensions of the array. SciDB computes aggregates grouped along array dimensions very quickly.

The aggregate function in the scidb package extends the basic SciDB aggregation capability to implement aggregation grouped by SciDB array dimensions, array attributes, auxiliary SciDB arrays, or combinations of all three.

The aggregate function presents its output summarized in a data frame-like unpacked SciDB 1-d array with the unpack=TRUE option set. This format is the most convenient format for display and manipulation by R. Grouped aggregates may optionally be returned in multi-dimensional SciDB arrays by setting the unpack=FALSE function argument (the default). This form is useful if the aggregate result is to be subsequently joined with other SciDB arrays. See the examples below.

The scidb package tries to use standard R syntax as much as possible. The package interprets the FUN function argument for several standard R summary statistic functions like mean, sd, min, max, var, length, prod, as SciDB aggregate expressions for convenience. The aggregate function also accepts any valid SciDB aggregate expression as a character string. See below for examples.

Dimensionality of Aggregation Results

The SciDB aggregate function can return results of variable array dimensionality depending on the number of grouping level combinations involved. The precise dimensions of the output may not be known until run time. For this reason, SciDB may place aggregation results into a huge sparse array, avoiding the need to compute precise coordinate upper bounds. Consequentially, the results of aggregations are huge sparse arrays whose data reside near the coordinate system origin at zero.

When users materialize results to R, only the portion of the sparse output array containing data is returned. The examples show a simple way to explicitly bound the coordinate system to the output size of the aggregation if you need to do that.

Examples

Many examples below use the iris data set. The following code listing uploads iris to SciDB, assigning the result to the R variable x, a data frame-like scidbdf object. The underlying SciDB array has a single coordinate axis named “row” with 150 rows, and five SciDB array attributes corresponding to the variables in the iris data. Note that the variable names change to conform to SciDB array attribute naming convention.

# Upload Fisher's iris data to SciDB:
> x  head(x)
  Sepal_Length Sepal_Width Petal_Length Petal_Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

Example 1: “Grand” aggregates

Omit the by function argument to compute a grand aggregate that averages all values, grouped across array attributes. Note that we use the standard R mean function instead of supplying an explicit SciDB aggregation expression. The package interprets this for us as the SciDB avg aggregate function. Also note that we limit the aggregation to the first four attributes with x[,1:4], because the fifth attribute (“Species”) is a string value, and the average of strings is not defined. The result of this example is returned as a huge sparse 1-d SciDB data frame-like array (see the discussion above). We explicitly materialize the result to R with [ ].

> aggregate(x[,1:4], FUN=mean)
SciDB 1-D array: 1 obs. of 4 variables.
> .Last.value[]
  Sepal_Length_avg Sepal_Width_avg Petal_Length_avg Petal_Width_avg
0         5.843333        3.057333            3.758        1.199333

We can equivalently use an explicit SciDB aggregation expression instead of mean:

> aggregate(x,FUN="avg(Sepal_Length), avg(Sepal_Width), avg(Petal_Length), avg(Petal_Width)")[]
  Sepal_Length_avg Sepal_Width_avg Petal_Length_avg Petal_Width_avg
0         5.843333        3.057333            3.758        1.199333

Note that aggregation expressions can rename their results:

> aggregate(x,FUN="avg(Sepal_Length) as sla, avg(Petal_Width) as pwa")[]
       sla      pwa
0 5.843333 1.199333

Example 2: Grouped by dimension coordinates

We use a 3-d array for this example. Define the 5 x 4 x 3 array A with a single double-precision-valued array attribute val and dimension names i, j, and k as follows:

> A  str(A)
SciDB expression:  build([i=0:4,100,0,j=0:...
SciDB schema:   [i=0:4,100,0,j=0:3,100,0,k=0:2,100,0]

Attributes:
  attribute   type nullable
1       val string    FALSE
Dimensions: 
  dimension start end chunk
1         i     0   4   100
2         j     0   3   100
3         k     0   2   100

Compute sums grouped by dimensions j and k, returned as a data frame:

> aggregate(A, by=list("j","k"), FUN=sum)[]
     [,1] [,2] [,3]
[1,]   10   15   20
[2,]   15   20   25
[3,]   20   25   30
[4,]   25   30   35

Alternatively, we can unpack the 2-d output into a data.frame with unpack=TRUE:

> aggregate(A, by=list("j","k"), FUN=sum, unpack=TRUE)[]
   j k val_sum
0  0 0      10
1  0 1      15
2  0 2      20
3  1 0      15
4  1 1      20
5  1 2      25
6  2 0      20
7  2 1      25
8  2 2      30
9  3 0      25
10 3 1      30
11 3 2      35

Example 3: Group by array attributes

The aggregate function for SciDB arrays supports groups defined by distinct array attribute values, similarly to the standard R aggregate function syntax. Aggregation by array attributes proceeds by converting the specified array attributes to categorical variables that enumerate distinct attribute values, and then grouping along the resulting categories. Consider the iris data used in Example 1. Compute the average Petal_Length, grouped by the distinct values of the “Species” attribute:

> aggregate(x, by="Species", FUN="avg(Petal_Length)")[]
  Petal_Length_avg    Species
0           1.462     setosa
1           4.260 versicolor
2           5.552  virginica

The returned result includes the new “Species_index” array attribute that enumerates the distinct values of “Species.”

Example 4: Group by an auxiliary SciDB array

This example most closely matches standard R aggregation behavior. Summary statistics are computed across groups defined by an auxiliary, usually 1-d, SciDB array. The following example creates an auxiliary grouping array that divides flowers into two groups of different petal widths and then computes the average petal length for each group. Note that supplying the by argument as a list is optional in the SciDB case.

> aggregate(x$Petal_Length, by=list(x$Petal_Width<1), FUN=mean)[]
  Petal_Length_avg condition
0            4.906     false
1            1.462      true

This is identical to the standard R syntax (!):

aggregate(iris$Petal.Length, by=list(iris$Petal.Width<1), FUN=mean)
  Group.1     x
1   FALSE 4.906
2    TRUE 1.462

Moving window aggregation

The aggregate function supports moving-window aggregates along coordinate systems or successive (sparse) data values.

Example 5: Moving window aggregation along the coordinate system

Consider the first six rows and only the numeric attributes of the iris data. The following example computes the rolling sum of pairs of rows across the data:

> aggregate(x[1:6,1:4], FUN=sum, window=c(0,1))[]
  Sepal_Length_sum Sepal_Width_sum Petal_Length_sum Petal_Width_sum
1             10.0             6.5              2.8             0.4
2              9.6             6.2              2.7             0.4
3              9.3             6.3              2.8             0.4
4              9.6             6.7              2.9             0.4
5             10.4             7.5              3.1             0.6
6              5.4             3.9              1.7             0.4

Note that SciDB’s moving window aggregates assume a zero boundary condition for (non-existing) values past the end of the array when applying sums. (See discussion below.)

Compare with an alternate convolution-based approach on the original R data using R’s filter function. The output of this approach differs in a few ways. The filter function returns a time series object, and the boundary condition is different than SciDB–assuming NA values outside the bounds of the array, resulting in a different last row.

> filter(iris[1:6,1:4],filter=c(1,1),method="convolution")
Time Series:
Start = 1
End = 6
Frequency = 1
  [,1] [,2] [,3] [,4]
1 10.0  6.5  2.8  0.4
2  9.6  6.2  2.7  0.4
3  9.3  6.3  2.8  0.4
4  9.6  6.7  2.9  0.4
5 10.4  7.5  3.1  0.6
6   NA   NA   NA   NA

Window boundary conditions

Example 6: Windows along successive data values

Use ‘variable_window’ to perform moving window aggregates over data values in a single dimension specified by the ‘by’ argument. Moving window aggregates along data values are restricted to a single array dimension. The following example considers a sparse version of the iris data, filtered to include numeric values greater than 2.3 across all attributes.

# The filtered data. Note that the rows meeting the condition are widely separated.
> (x[,1:4]>2.3)[]
    Sepal_Length Sepal_Width Petal_Length Petal_Width
101          6.3         3.3          6.0         2.5
110          7.2         3.6          6.1         2.5
115          5.8         2.8          5.1         2.4
137          6.3         3.4          5.6         2.4
141          6.7         3.1          5.6         2.4
145          6.7         3.3          5.7         2.5

# A moving window aggregate along two rows at a time does not do anything, because
# there are no rows in the data adjacent to each other.
> aggregate(x[,1:4] > 2.3, by="row", FUN=sum, window=c(0,1))[]
    Sepal_Length_sum Sepal_Width_sum Petal_Length_sum Petal_Width_sum
101              6.3             3.3              6.0             2.5
110              7.2             3.6              6.1             2.5
115              5.8             2.8              5.1             2.4
137              6.3             3.4              5.6             2.4
141              6.7             3.1              5.6             2.4
145              6.7             3.3              5.7             2.5

# The variable_window aggregate adds successive values.
> aggregate(x[,1:4] > 2.3, by="row", FUN=sum, variable_window=c(0,1))[]
    Sepal_Length_sum Sepal_Width_sum Petal_Length_sum Petal_Width_sum
101             13.5             6.9             12.1             5.0
110             13.0             6.4             11.2             4.9
115             12.1             6.2             10.7             4.8
137             13.0             6.5             11.2             4.8
141             13.4             6.4             11.3             4.9
145              6.7             3.3              5.7             2.5

The post Aggregation appeared first on Paradigm4.

To leave a comment for the author, please follow the link and comment on their blog: Paradigm4 » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)