**Paradigm4 » R**, and kindly contributed to R-bloggers)

Aggregation splits data into subsets, computes summary statistics on each subset, and reports the results in a conveniently summarized form.

The `aggregate` function is one of the most capable functions in the scidb package. The package overloads R’s standard `aggregate` function for SciDB arrays, using reasonably standard R syntax to cover most SciDB aggregation operators including `aggregate`, `window`, and `variable_window`. (The `regrid` and `cumulate` functions separately implement additional SciDB aggregation operators.)

The `aggregate` function extends the default capabilities of many SciDB aggregation operators to allow grouping by SciDB array dimensions, aggregates, other SciDB arrays, and combinations of all three.

### Basic Aggregation

The SciDB `aggregate` operator computes summary statistics for one or more attributes of a SciDB array, grouped by zero or more dimensions of the array. SciDB computes aggregates grouped along array dimensions very quickly.

The `aggregate` function in the scidb package extends the basic SciDB aggregation capability to implement aggregation grouped by SciDB array dimensions, array attributes, auxiliary SciDB arrays, or combinations of all three.

The aggregate function presents its output summarized in a data frame-like unpacked SciDB 1-d array with the `unpack=TRUE` option set. This format is the most convenient format for display and manipulation by R. Grouped aggregates may optionally be returned in multi-dimensional SciDB arrays by setting the `unpack=FALSE` function argument (the default). This form is useful if the aggregate result is to be subsequently joined with other SciDB arrays. See the examples below.

The scidb package tries to use standard R syntax as much as possible. The package interprets the `FUN` function argument for several standard R summary statistic functions like `mean`, `sd`, `min`, `max`, `var`, `length`, `prod`, as SciDB aggregate expressions for convenience. The `aggregate` function also accepts any valid SciDB aggregate expression as a character string. See below for examples.

### Dimensionality of Aggregation Results

The SciDB `aggregate` function can return results of variable array dimensionality depending on the number of grouping level combinations involved. The precise dimensions of the output may not be known until run time. For this reason, SciDB may place aggregation results into a huge sparse array, avoiding the need to compute precise coordinate upper bounds. Consequentially, the results of aggregations are huge sparse arrays whose data reside near the coordinate system origin at zero.

When users materialize results to R, only the portion of the sparse output array containing data is returned. The examples show a simple way to explicitly bound the coordinate system to the output size of the aggregation if you need to do that.

### Examples

Many examples below use the iris data set. The following code listing uploads iris to SciDB, assigning the result to the R variable `x`, a data frame-like `scidbdf` object. The underlying SciDB array has a single coordinate axis named “row” with 150 rows, and five SciDB array attributes corresponding to the variables in the iris data. Note that the variable names change to conform to SciDB array attribute naming convention.

# Upload Fisher's iris data to SciDB: > x head(x) Sepal_Length Sepal_Width Petal_Length Petal_Width Species 1 5.1 3.5 1.4 0.2 setosa 2 4.9 3.0 1.4 0.2 setosa 3 4.7 3.2 1.3 0.2 setosa 4 4.6 3.1 1.5 0.2 setosa 5 5.0 3.6 1.4 0.2 setosa 6 5.4 3.9 1.7 0.4 setosa

#### Example 1: “Grand” aggregates

Omit the `by` function argument to compute a grand aggregate that averages all values, grouped across array attributes. Note that we use the standard R `mean` function instead of supplying an explicit SciDB aggregation expression. The package interprets this for us as the SciDB `avg` aggregate function. Also note that we limit the aggregation to the first four attributes with `x[,1:4]`, because the fifth attribute (“Species”) is a string value, and the average of strings is not defined. The result of this example is returned as a huge sparse 1-d SciDB data frame-like array (see the discussion above). We explicitly materialize the result to R with `[ ]`.

> aggregate(x[,1:4], FUN=mean) SciDB 1-D array: 1 obs. of 4 variables. > .Last.value[] Sepal_Length_avg Sepal_Width_avg Petal_Length_avg Petal_Width_avg 0 5.843333 3.057333 3.758 1.199333

We can equivalently use an explicit SciDB aggregation expression instead of `mean`:

> aggregate(x,FUN="avg(Sepal_Length), avg(Sepal_Width), avg(Petal_Length), avg(Petal_Width)")[] Sepal_Length_avg Sepal_Width_avg Petal_Length_avg Petal_Width_avg 0 5.843333 3.057333 3.758 1.199333

Note that aggregation expressions can rename their results:

> aggregate(x,FUN="avg(Sepal_Length) as sla, avg(Petal_Width) as pwa")[] sla pwa 0 5.843333 1.199333

#### Example 2: Grouped by dimension coordinates

We use a 3-d array for this example. Define the 5 x 4 x 3 array `A` with a single double-precision-valued array attribute `val` and dimension names `i`, `j`, and `k` as follows:

> A str(A) SciDB expression: build([i=0:4,100,0,j=0:... SciDB schema: [i=0:4,100,0,j=0:3,100,0,k=0:2,100,0] Attributes: attribute type nullable 1 val string FALSE Dimensions: dimension start end chunk 1 i 0 4 100 2 j 0 3 100 3 k 0 2 100

Compute sums grouped by dimensions `j` and `k`, returned as a data frame:

> aggregate(A, by=list("j","k"), FUN=sum)[] [,1] [,2] [,3] [1,] 10 15 20 [2,] 15 20 25 [3,] 20 25 30 [4,] 25 30 35

Alternatively, we can unpack the 2-d output into a data.frame with `unpack=TRUE`:

> aggregate(A, by=list("j","k"), FUN=sum, unpack=TRUE)[] j k val_sum 0 0 0 10 1 0 1 15 2 0 2 20 3 1 0 15 4 1 1 20 5 1 2 25 6 2 0 20 7 2 1 25 8 2 2 30 9 3 0 25 10 3 1 30 11 3 2 35

#### Example 3: Group by array attributes

The `aggregate` function for SciDB arrays supports groups defined by distinct array attribute values, similarly to the standard R `aggregate` function syntax. Aggregation by array attributes proceeds by converting the specified array attributes to categorical variables that enumerate distinct attribute values, and then grouping along the resulting categories. Consider the iris data used in Example 1. Compute the average Petal_Length, grouped by the distinct values of the “Species” attribute:

> aggregate(x, by="Species", FUN="avg(Petal_Length)")[] Petal_Length_avg Species 0 1.462 setosa 1 4.260 versicolor 2 5.552 virginica

The returned result includes the new “Species_index” array attribute that enumerates the distinct values of “Species.”

#### Example 4: Group by an auxiliary SciDB array

This example most closely matches standard R aggregation behavior. Summary statistics are computed across groups defined by an auxiliary, usually 1-d, SciDB array. The following example creates an auxiliary grouping array that divides flowers into two groups of different petal widths and then computes the average petal length for each group. Note that supplying the `by` argument as a list is optional in the SciDB case.

> aggregate(x$Petal_Length, by=list(x$Petal_Width<1), FUN=mean)[] Petal_Length_avg condition 0 4.906 false 1 1.462 true

This is identical to the standard R syntax (!):

aggregate(iris$Petal.Length, by=list(iris$Petal.Width<1), FUN=mean) Group.1 x 1 FALSE 4.906 2 TRUE 1.462

### Moving window aggregation

The `aggregate` function supports moving-window aggregates along coordinate systems or successive (sparse) data values.

#### Example 5: Moving window aggregation along the coordinate system

Consider the first six rows and only the numeric attributes of the iris data. The following example computes the rolling sum of pairs of rows across the data:

> aggregate(x[1:6,1:4], FUN=sum, window=c(0,1))[] Sepal_Length_sum Sepal_Width_sum Petal_Length_sum Petal_Width_sum 1 10.0 6.5 2.8 0.4 2 9.6 6.2 2.7 0.4 3 9.3 6.3 2.8 0.4 4 9.6 6.7 2.9 0.4 5 10.4 7.5 3.1 0.6 6 5.4 3.9 1.7 0.4

Note that SciDB’s moving window aggregates assume a zero boundary condition for (non-existing) values past the end of the array when applying sums. (See discussion below.)

Compare with an alternate convolution-based approach on the original R data using R’s `filter` function. The output of this approach differs in a few ways. The `filter` function returns a time series object, and the boundary condition is different than SciDB–assuming `NA` values outside the bounds of the array, resulting in a different last row.

> filter(iris[1:6,1:4],filter=c(1,1),method="convolution") Time Series: Start = 1 End = 6 Frequency = 1 [,1] [,2] [,3] [,4] 1 10.0 6.5 2.8 0.4 2 9.6 6.2 2.7 0.4 3 9.3 6.3 2.8 0.4 4 9.6 6.7 2.9 0.4 5 10.4 7.5 3.1 0.6 6 NA NA NA NA

### Window boundary conditions

#### Example 6: Windows along successive data values

Use ‘variable_window’ to perform moving window aggregates over data values in a single dimension specified by the ‘by’ argument. Moving window aggregates along data values are restricted to a single array dimension. The following example considers a sparse version of the iris data, filtered to include numeric values greater than 2.3 across all attributes.

# The filtered data. Note that the rows meeting the condition are widely separated. > (x[,1:4]>2.3)[] Sepal_Length Sepal_Width Petal_Length Petal_Width 101 6.3 3.3 6.0 2.5 110 7.2 3.6 6.1 2.5 115 5.8 2.8 5.1 2.4 137 6.3 3.4 5.6 2.4 141 6.7 3.1 5.6 2.4 145 6.7 3.3 5.7 2.5 # A moving window aggregate along two rows at a time does not do anything, because # there are no rows in the data adjacent to each other. > aggregate(x[,1:4] > 2.3, by="row", FUN=sum, window=c(0,1))[] Sepal_Length_sum Sepal_Width_sum Petal_Length_sum Petal_Width_sum 101 6.3 3.3 6.0 2.5 110 7.2 3.6 6.1 2.5 115 5.8 2.8 5.1 2.4 137 6.3 3.4 5.6 2.4 141 6.7 3.1 5.6 2.4 145 6.7 3.3 5.7 2.5 # The variable_window aggregate adds successive values. > aggregate(x[,1:4] > 2.3, by="row", FUN=sum, variable_window=c(0,1))[] Sepal_Length_sum Sepal_Width_sum Petal_Length_sum Petal_Width_sum 101 13.5 6.9 12.1 5.0 110 13.0 6.4 11.2 4.9 115 12.1 6.2 10.7 4.8 137 13.0 6.5 11.2 4.8 141 13.4 6.4 11.3 4.9 145 6.7 3.3 5.7 2.5

The post Aggregation appeared first on Paradigm4.

**leave a comment**for the author, please follow the link and comment on their blog:

**Paradigm4 » R**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...