Skip to main content

Aggregate functions

Aggregate functions operate on values across rows to perform mathematical calculations such as sum, average, counting, minimum/maximum values, standard deviation, and estimation, as well as some non-mathematical operations.

An aggregate function takes multiple rows (actually, zero, one, or more rows) as input and produces a single output. In contrast, scalar functions take one row as input and produce one row (one value) as output.

An aggregate function always returns exactly one row, even when the input contains zero rows. Typically, if the input contains zero rows, the output is NULL. However, an aggregate function could return 0, an empty string, or some other value when passed zero rows.

APPROX_QUANTILE

Returns the approximate boundaries for <expr>. If all records inside a group are NULL, the function returns NULL.

Syntax

APPROX_QUANTILE( <expr>, <quantile> )

Arguments

<expr>

A column that contains float values.

<quantile> (named)

A float literal that represents the quantile to calculate.

Returns

A float value that identifies the approximate quantile.

Examples

VALUES (1.0), (1.0), (1.0), (4.0), (5.0), (6.0), (7.0), (8.0), (9.0), (10.0)
AGGREGATE
APPROX_QUANTILE($0, quantile => 0.1) as p10,
APPROX_QUANTILE($0, quantile => 0.5) as p50,
APPROX_QUANTILE($0, quantile => 0.9) as p90,
APPROX_QUANTILE($0, quantile => 0.95) as p90,
APPROX_QUANTILE($0, quantile => 0.99) as p99;
+-----+-----+-----+------+------+
| p10 | p50 | p90 | p90 | p99 |
+-----+-----+-----+------+------+
| 1.0 | 5.5 | 9.1 | 9.55 | 9.91 |
+-----+-----+-----+------+------+

AVG

Returns the average of non-NULL records. If all records inside a group are NULL, the function returns NULL.

Syntax

AVG( <expr> )

Arguments

<expr>

This is an expression that evaluates to a numeric data type.

COUNT

Returns either the number of non-NULL records for the specified columns.

Syntax

COUNT( <expr> )

Arguments

<expr>

A column name, which can be a qualified name (for example, database.schema.table.column_name).

Returns

Returns a value of type UINT.

MAX

Returns the maximum value for the records within <expr>. NULL values are ignored unless all the records are NULL, in which case a NULL value is returned.

If <n> is specified, MAX returns the N maximum values from a column, which is the same results as the regular query FROM ... ORDER BY ... DESC LIMIT n but as a variant list.

Syntax

MAX( <expr> )
MAX( <expr>, <n> )

Returns

The data type of the returned value is the same as the data type of the input values.

Arguments

Required:

<expr>

A column name, which can be a qualified name (for example, database.schema.table.column_name).

Optional:

<n> (named)

The number of maximum values to return.

Returns

  • If <n> is not specified, MIN returns the same as the data type of the input values.
  • If <n> is specified, MIN returns an ARRAY that contains all the <n> minimum values.

Examples

Get the maximum value:

VALUES (1), (1), (1), (4), (5), (6), (7), (8), (9), (10)
AGGREGATE MAX($0);
+---------+
| MAX($0) |
+---------+
| 10 |
+---------+

Get the maximum N value:

VALUES (1), (1), (1), (4), (5), (6), (7), (8), (9), (10)
AGGREGATE MAX($0, 5);
+--------------+
| MAX($0, 5) |
+--------------+
| [10,9,8,7,6] |
+--------------+

Alternatively, use the named argument syntax:

VALUES (1), (1), (1), (4), (5), (6), (7), (8), (9), (10)
AGGREGATE MAX($0, n => 5);
+-----------------+
| MAX($0, n => 5) |
+-----------------+
| [10,9,8,7,6] |
+-----------------+

MAX_BY

Finds the row(s) containing the maximum value for a column and returns the value of another column in that row.

For example, if a table contains the columns employee_id and salary, MAX_BY(employee_id, salary) returns the value of the employee_id column for the row that has the highest value in the salary column.

If multiple rows contain the specified maximum value, the function is non-deterministic.

Syntax

MAX_BY( <col_to_return>, <col_containing_maximum> )

Arguments

<col_to_return>

Column containing the value to return.

<col_containing_maximum>

Column containing the maximum value.

Returns

The function returns a value of the same type as col_to_return.

Example

The following examples demonstrate how to use the MAX_BY function.

To run these examples, execute the following statements to set up the table and data for the examples:

CREATE TABLE employees(employee_id int, department_id int, salary int);
VALUES
(1001, 10, 10000),
(1020, 10, 9000),
(1030, 10, 8000),
(900, 20, 15000),
(2000, 20, NULL),
(2010, 20, 15000),
(2020, 20, 8000),
INSERT INTO employees;

The following example returns the ID of the employee with the highest salary:

FROM employees aggregate MAX_BY(employee_id, salary);
+-----------------------------+
| MAX_BY(employee_id, salary) |
+-----------------------------+
| 2010 |
+-----------------------------+

Note the following:

  • Because more than one row contains the maximum value for the salary column, the function is non-deterministic and might return the employee ID for a different row in subsequent executions.
  • The function ignores the NULL value in the salary column when determining the rows with the maximum values.

FILTER_NULLS_MAX_BY

Filters out the row(s) where a column is NULL, finds the row(s) containing the maximum value for another column, and returns the value of the former column in that row.

For example, if a table contains the columns error and ts, FILTER_NULLS_MAX_BY(error, ts) first filters out all the rows where error is NULL, and then returns the value of the error column for the row that has the highest value in the ts column.

If multiple rows contain the specified maximum value, the function is non-deterministic.

Syntax

FILTER_NULLS_MAX_BY( <col_to_return>, <col_containing_maximum> )

Arguments

<col_to_return>

Column containing the value to return.

<col_containing_maximum>

Column containing the maximum value.

Returns

The function returns a value of the same type as col_to_return.

Example

The following examples demonstrate how to use the FILTER_NULLS_MAX_BY function.

To run these examples, execute the following statements to set up the table and data for the examples:

CREATE TABLE errors(ts timestamp, error string);
VALUES
('2024-10-20T07:38:59.6519Z'::timestamp, 'divided by zero'),
('2024-10-20T08:21:23.7652Z'::timestamp, 'connection reset by peer'),
('2024-10-21T09:00:00.0000Z'::timestamp, NULL),
INSERT INTO errors;

Find the latest error that is not NULL:

FROM errors aggregate FILTER_NULLS_MAX_BY(error, ts);
+--------------------------------+
| FILTER_NULLS_MAX_BY(error, ts) |
+--------------------------------+
| connection reset by peer |
+--------------------------------+

Note that MAX_BY would return the NULL value:

FROM errors aggregate MAX_BY(error, ts);
+-------------------+
| MAX_BY(error, ts) |
+-------------------+
| NULL |
+-------------------+

MIN

Returns the minimum value for the records within <expr>. NULL values are ignored unless all the records are NULL, in which case a NULL value is returned.

If <n> is specified, MIN returns the N minimum values from a column, which is the same results as the regular query FROM ... ORDER BY ... LIMIT n but as a variant list.

Syntax

MIN( <expr> )
MIN( <expr>, <n> )

Arguments

Required:

<expr>

A column name, which can be a qualified name (for example, database.schema.table.column_name).

Optional:

<n> (named)

The number of minimum values to return.

Returns

  • If <n> is not specified, MIN returns the same as the data type of the input values.
  • If <n> is specified, MIN returns an ARRAY that contains all the <n> minimum values.

Examples

Get the minimum value:

VALUES (1), (1), (1), (4), (5), (6), (7), (8), (9), (10)
AGGREGATE MIN($0);
+---------+
| MIN($0) |
+---------+
| 1 |
+---------+

Get the minimum N value:

VALUES (1), (1), (1), (4), (5), (6), (7), (8), (9), (10)
AGGREGATE MIN($0, 5);
+-------------+
| MIN($0, 5) |
+-------------+
| [1,1,1,4,5] |
+-------------+

Alternatively, use the named argument syntax:

VALUES (1), (1), (1), (4), (5), (6), (7), (8), (9), (10)
AGGREGATE MIN($0, n => 5);
+-----------------+
| MIN($0, n => 5) |
+-----------------+
| [1,1,1,4,5] |
+-----------------+

MIN_BY

Finds the row(s) containing the minimum value for a column and returns the value of another column in that row.

For example, if a table contains the columns employee_id and salary, MIN_BY(employee_id, salary) returns the value of the employee_id column for the row that has the lowest value in the salary column.

If multiple rows contain the specified minimum value, the function is non-deterministic.

Syntax

MIN_BY( <col_to_return>, <col_containing_mininum> )

Arguments

<col_to_return>

Column containing the value to return.

<col_containing_mininum>

Column containing the minimum value.

Returns

The function returns a value of the same type as col_to_return.

Example

The following examples demonstrate how to use the MIN_BY function.

To run these examples, execute the following statements to set up the table and data for the examples:

CREATE TABLE employees(employee_id int, department_id int, salary int);
VALUES
(1001, 10, 10000),
(1020, 10, 9000),
(1030, 10, 8000),
(900, 20, 15000),
(2000, 20, NULL),
(2010, 20, 15000),
(2020, 20, 8000),
INSERT INTO employees;

The following example returns the ID of the employee with the lowest salary:

FROM employees aggregate MIN_BY(employee_id, salary);
+-----------------------------+
| MIN_BY(employee_id, salary) |
+-----------------------------+
| 2020 |
+-----------------------------+

Note the following:

  • Because more than one row contains the minimum value for the salary column, the function is non-deterministic and might return the employee ID for a different row in subsequent executions.
  • The function ignores the NULL value in the salary column when determining the rows with the minimum values.

FILTER_NULLS_MIN_BY

Filters out the row(s) where a column is NULL, finds the row(s) containing the minimum value for another column, and returns the value of the former column in that row.

For example, if a table contains the columns error and ts, FILTER_NULLS_MIN_BY(error, ts) first filters out all the rows where error is NULL, and then returns the value of the error column for the row that has the lowest value in the ts column.

If multiple rows contain the specified minimum value, the function is non-deterministic.

Syntax

FILTER_NULLS_MIN_BY( <col_to_return>, <col_containing_minimum> )

Arguments

<col_to_return>

Column containing the value to return.

<col_containing_minimum>

Column containing the minimum value.

Returns

The function returns a value of the same type as col_to_return.

Example

The following examples demonstrate how to use the FILTER_NULLS_MIN_BY function.

To run these examples, execute the following statements to set up the table and data for the examples:

CREATE TABLE errors(ts timestamp, error string);
VALUES
('2024-10-20T08:21:23.7652Z'::timestamp, 'connection reset by peer'),
('2024-10-20T07:38:59.6519Z'::timestamp, 'divided by zero'),
('2024-10-20T07:37:52.2391Z'::timestamp, NULL::string),
INSERT INTO errors;

Find the first error that is not NULL:

FROM errors aggregate FILTER_NULLS_MIN_BY(error, ts);
+--------------------------------+
| FILTER_NULLS_MIN_BY(error, ts) |
+--------------------------------+
| divided by zero |
+--------------------------------+

Note that MIN_BY would return the NULL value:

FROM errors aggregate MIN_BY(error, ts);
+-------------------+
| MIN_BY(error, ts) |
+-------------------+
| NULL |
+-------------------+

MODE

Returns the most frequent value for the values within <expr>. NULL values are ignored. If all the values are NULL, or there are 0 rows, then the function returns NULL.

Syntax

MODE( <expr1> )

Arguments

<expr>

This expression produces the values that are searched to find the most frequent value. The expression can be of the following data types:

  • INT
  • UINT
  • FLOAT
  • STRING

Returns

The data type of the returned value is identical to the data type of the input expression.

STDDEV_POP

Returns the population standard deviation (square root of variance) of non-NULL values.

See also STDDEV_SAMP, which returns the sample standard deviation (square root of variance).

Syntax

STDDEV_POP( <expr>)

Arguments

<expr>

An expression that evaluates to a numeric value. This is the expression on which the standard deviation is calculated.

Returns

The data type of the returned value is FLOAT.

If all records inside a group are NULL, this function returns NULL.

STDDEV_SAMP

Returns the sample standard deviation (square root of sample variance) of non-NULL values.

See also STDDEV_POP, which returns the population standard deviation (square root of variance).

Syntax

STDDEV_SAMP ( <expr> )

Arguments

<expr>

An expression that evaluates to a numeric value. This is the expression on which the standard deviation is calculated.

Returns

The data type of the returned value is FLOAT.

If all records inside a group are NULL, this function returns NULL.

SUM

Returns the sum of non-NULL records for <expr>. If all records inside a group are NULL, the function returns NULL.

Syntax

SUM( [ DISTINCT ] <expr1> )

Arguments

<expr>

This is an expression that evaluates to a numeric data type.

VARIANCE_POP

Returns the population variance of non-NULL records in a group. If all records inside a group are NULL, a NULL is returned.

Syntax

VARIANCE_POP( <expr>)

Arguments

<expr>

An expression that evaluates to a numeric value. This is the expression on which the variance is calculated.

Returns

The data type of the returned value is FLOAT.

If all records inside a group are NULL, this function returns NULL.

VARIANCE_SAMP

Returns the sample variance of non-NULL records in a group. If all records inside a group are NULL, a NULL is returned.

Syntax

VARIANCE_SAMP ( <expr> )

Arguments

<expr>

An expression that evaluates to a numeric value. This is the expression on which the variance is calculated.

Returns

The data type of the returned value is FLOAT.

If all records inside a group are NULL, this function returns NULL.

OBJECT_SCHEMA

Returns the aggregated schema of values of the input object column as a VARIANT. The schema is a VARIANT that contains the keys and their types.

Syntax

OBJECT_SCHEMA( <object> )

Arguments

<object>

The VARIANT column for which you want the aggregated schema. The input rows that is not an object will be ignored.

Returns

An OBJECT that contains the outermost keys and their types.

Examples

VALUES
(1, {}),
(2, {"a": 1, "b": ['hello', 'world']}),
(3, {"a": {"b": 1}}),
AGGREGATE object_schema($1);
+------------------------------------------------------------------------------------------------------+
| object_schema($1) |
+------------------------------------------------------------------------------------------------------+
| [{"key":'a',"type":'int'},{"key":'a',"type":'object'},{"items":['string'],"key":'b',"type":'array'}] |
+------------------------------------------------------------------------------------------------------+