---
description: Learn how to query Pinot using PQL
---
PQL is a derivative of SQL that supports selection, projection, aggregation, and grouping aggregation.
Because PQL is only a derivative of SQL, it does not support joins or subqueries. If you need them, we suggest relying on PrestoDB (https://prestodb.io/), although subqueries are not completely supported by PrestoDB at the time of writing.
The Pinot Query Language (PQL) is very similar to standard SQL:
```sql
SELECT COUNT(*) FROM myTable
```

```sql
SELECT COUNT(*), MAX(foo), SUM(bar) FROM myTable
```

```sql
SELECT MIN(foo), MAX(foo), SUM(foo), AVG(foo) FROM myTable
GROUP BY bar, baz LIMIT 50
```

```sql
SELECT MIN(foo), MAX(foo), SUM(foo), AVG(foo) FROM myTable
GROUP BY bar, baz
ORDER BY bar, MAX(foo) DESC LIMIT 50
```

```sql
SELECT COUNT(*) FROM myTable
WHERE foo = 'foo'
AND bar BETWEEN 1 AND 20
OR (baz < 42 AND quux IN ('hello', 'goodbye') AND quuux NOT IN (42, 69))
```

```sql
SELECT * FROM myTable
WHERE quux < 5
LIMIT 50
```

```sql
SELECT foo, bar FROM myTable
WHERE baz > 20
ORDER BY bar DESC
LIMIT 100
```
Note: results might not be consistent if the column being ordered by has the same value in multiple rows.
```sql
SELECT foo, bar FROM myTable
WHERE baz > 20
ORDER BY bar DESC
LIMIT 50, 100
```
To count rows where the column airlineName starts with 'U':

```sql
SELECT count(*) FROM SomeTable
WHERE regexp_like(airlineName, '^U.*')
GROUP BY airlineName TOP 10
```
As of now, functions have to be implemented within Pinot; injecting custom functions is not yet allowed. The example below demonstrates the use of UDFs. More examples are available in Transform Function in Aggregation Grouping.
```sql
SELECT count(*) FROM myTable
GROUP BY dateTimeConvert(timeColumnName, '1:MILLISECONDS:EPOCH', '1:HOURS:EPOCH', '1:HOURS')
```
Pinot supports queries on BYTES columns using HEX strings. The query response also uses HEX strings to represent bytes values.

For example, the query below fetches all the rows for a given UID:

```sql
SELECT * FROM myTable
WHERE UID = "c8b3bce0b378fc5ce8067fc271a34892"
```
The select statement is as follows:

```sql
SELECT <outputColumn> (, outputColumn, outputColumn,...)
FROM <tableName>
(WHERE ... | GROUP BY ... | ORDER BY ... | TOP ... | LIMIT ...)
```
`outputColumn` can be `*` to project all columns, a list of columns (`foo`, `bar`, `baz`), or aggregation functions like `MIN(foo)`, `MAX(bar)`, `AVG(baz)`.
- `EQUALS`
- `IN`
- `NOT IN`
- `GT`
- `LT`
- `BETWEEN`
- `REGEXP_LIKE`
For multi-valued columns, `EQUALS` is similar to `CONTAINS`.
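For instance, assuming a hypothetical multi-valued column `tags`, the following sketch would count every row whose `tags` array contains the value 'sports' (the table and column names are placeholders, not from this page):

```sql
SELECT COUNT(*) FROM myTable
WHERE tags = 'sports'
```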
- `COUNT`
- `MIN`
- `MAX`
- `SUM`
- `AVG`
- `MINMAXRANGE`
- `DISTINCT`
- `DISTINCTCOUNT`
- `DISTINCTCOUNTHLL`
- `DISTINCTCOUNTRAWHLL`: Returns the HLL response serialized as a string. The serialized HLL can be converted back into an HLL (see `pinot-core/**/HllUtil.java` as an example) and then aggregated with other HLLs. A common use case may be to merge HLL responses from different Pinot tables, or to allow aggregation after client-side batching.
- `FASTHLL` (WARN: will be deprecated soon. `FASTHLL` stores serialized HyperLogLog in String format, which performs worse than `DISTINCTCOUNTHLL`, which supports serialized HyperLogLog in BYTES (byte array) format)
- `PERCENTILE[0-100]`: e.g. `PERCENTILE5`, `PERCENTILE50`, `PERCENTILE99`, etc.
- `PERCENTILEEST[0-100]`: e.g. `PERCENTILEEST5`, `PERCENTILEEST50`, `PERCENTILEEST99`, etc.
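A sketch combining several of these aggregations in one query (the columns `foo` and `bar` are placeholders):

```sql
SELECT DISTINCTCOUNT(foo), PERCENTILE95(bar), MINMAXRANGE(bar) FROM myTable
```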
- `COUNTMV`
- `MINMV`
- `MAXMV`
- `SUMMV`
- `AVGMV`
- `MINMAXRANGEMV`
- `DISTINCTCOUNTMV`
- `DISTINCTCOUNTHLLMV`
- `DISTINCTCOUNTRAWHLLMV`: Returns the HLL response serialized as a string. The serialized HLL can be converted back into an HLL (see `pinot-core/**/HllUtil.java` as an example) and then aggregated with other HLLs. A common use case may be to merge HLL responses from different Pinot tables, or to allow aggregation after client-side batching.
- `FASTHLLMV` (WARN: will be deprecated soon. It does not make much sense to configure a serialized HyperLogLog column as a dimension)
- `PERCENTILE[0-100]MV`: e.g. `PERCENTILE5MV`, `PERCENTILE50MV`, `PERCENTILE99MV`, etc.
- `PERCENTILEEST[0-100]MV`: e.g. `PERCENTILEEST5MV`, `PERCENTILEEST50MV`, `PERCENTILEEST99MV`, etc.
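A sketch of the multi-valued variants (the column `mvColumn` is a placeholder for any multi-valued column):

```sql
SELECT COUNTMV(mvColumn), DISTINCTCOUNTMV(mvColumn), MAXMV(mvColumn) FROM myTable
```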
Supported predicates are comparisons with a constant using the standard SQL operators (`=`, `<`, `<=`, `>`, `>=`, `<>`, `!=`), range comparisons using `BETWEEN` (`foo BETWEEN 42 AND 69`), set membership (`foo IN (1, 2, 4, 8)`) and exclusion (`foo NOT IN (1, 2, 4, 8)`). For `BETWEEN`, the range is inclusive.
Comparison with a regular expression is supported using the `regexp_like` function, as in `WHERE regexp_like(columnName, 'regular expression')`.
The `GROUP BY` clause groups aggregation results by a list of columns, or transform functions on columns (see below).

The `ORDER BY` clause orders selection results or group-by results by a list of columns. PQL supports ordering `DESC` or `ASC`.

The `TOP n` clause causes the 'n' largest group results to be returned. If not specified, the top 10 groups are returned.

The `LIMIT n` clause causes the selection results to contain at most 'n' results. The `LIMIT a, b` clause paginates the selection results, skipping the first 'a' results and returning at most 'b' results. By default, 10 records are returned in the result.
In aggregation and grouping, each column can be transformed from one or multiple columns. For example, the following query will calculate the maximum value of column `foo` divided by column `bar`, grouping on the column `time` converted from time unit `MILLISECONDS` to `SECONDS`:

```sql
SELECT MAX(DIV(foo, bar)) FROM myTable
GROUP BY DATETIMECONVERT(time, '1:MILLISECONDS:EPOCH', '1:SECONDS:EPOCH', '1:SECONDS')
```
| Function | Description |
| --- | --- |
| ADD | Sum of at least two values |
| SUB | Difference between two values |
| MULT | Product of at least two values |
| DIV | Quotient of two values |
| MOD | Modulo of two values |
| ABS | Absolute value of a value |
| CEIL | Rounded up to the nearest integer |
| FLOOR | Rounded down to the nearest integer |
| EXP | Euler's number e raised to the power of the value |
| LN | Natural logarithm of the value |
| SQRT | Square root of a value |
| TIMECONVERT | Takes 3 arguments; converts the value into another time unit |
| DATETIMECONVERT | Takes 4 arguments; converts the value into another date time format, and buckets time based on the given time granularity. Format is expressed as `size:timeUnit:timeFormat`; granularity is expressed as `size:timeUnit`. Examples: 1) convert column "Date" from hoursSinceEpoch to daysSinceEpoch and bucket it to 1 day granularity; 2) simply bucket millis "Date" to 15 minutes granularity; 3) convert column "Date" from hoursSinceEpoch to format yyyyMMdd and bucket it to 1 day granularity; 4) convert column "Date" from format yyyy/MM/dd to weeksSinceEpoch and bucket it to 1 week granularity; 5) convert column "Date" from millis to format yyyyMMdd in timezone PST |
| DATETRUNC | (Presto) SQL-compatible date truncation, equivalent to the Presto function `date_trunc`. Takes at least 3 and up to 5 arguments; converts the value into a specified output granularity in seconds since UTC epoch, bucketed on a unit in a specified timezone |
| ARRAYLENGTH | Returns the length of a multi-value column |
| VALUEIN | Takes at least 2 arguments, where the first argument is a multi-valued column and the following arguments are constant values. The transform function filters the values from the multi-valued column against the given constant values. `VALUEIN` is especially useful when the same multi-valued column is both a filtering column and a grouping column. Example: `VALUEIN(mvColumn, 3, 5, 15)` |
| JSONEXTRACTSCALAR | Evaluates a JSONPath expression on a JSON-formatted column and returns the result as a scalar value |
| JSONEXTRACTKEY | Extracts the keys matching a JSONPath expression from a JSON-formatted column |
| GROOVY | `GROOVY('result value metadata json', 'groovy script', arg0, arg1, arg2...)` executes the given Groovy script using the provided arguments and returns the result that matches the provided result value metadata. 1st argument: a JSON string representing the result value metadata (must contain non-null keys). 2nd argument: the Groovy script string, which uses the remaining arguments (Pinot columns or other transform functions) as its inputs. These are just illustrations; you can write practically any Groovy script |
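To make the table above concrete, here are hedged sketches of a few of the transform functions. The column names (`Date`, `mvColumn`, `colA`, `colB`) and the exact argument shapes are assumptions based on the format strings shown elsewhere on this page, not verbatim examples from it:

```sql
-- DATETIMECONVERT: convert "Date" from hoursSinceEpoch to daysSinceEpoch, bucketed to 1 day
SELECT COUNT(*) FROM myTable
GROUP BY DATETIMECONVERT(Date, '1:HOURS:EPOCH', '1:DAYS:EPOCH', '1:DAYS')

-- VALUEIN: filter and group on the same multi-valued column
SELECT COUNT(*) FROM myTable
WHERE mvColumn IN (3, 5, 15)
GROUP BY VALUEIN(mvColumn, 3, 5, 15) TOP 20

-- GROOVY: add two columns via a Groovy script (the metadata keys shown are assumptions)
SELECT GROOVY('{"returnType":"LONG","isSingleValue":true}', 'arg0 + arg1', colA, colB) FROM myTable
```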
{% hint style="info" %} These differences only apply to the PQL endpoint. They do not hold true for the standard-SQL endpoint, which is the recommended endpoint. More information about the two types of endpoints can be found in Querying Pinot. {% endhint %}
- `TOP` works like `LIMIT` for truncation in group by queries
- No need to select the columns to group with. The following two queries are both supported in PQL, where the non-aggregation columns are ignored.
```sql
SELECT MIN(foo), MAX(foo), SUM(foo), AVG(foo) FROM mytable
GROUP BY bar, baz
TOP 50
```

```sql
SELECT bar, baz, MIN(foo), MAX(foo), SUM(foo), AVG(foo) FROM mytable
GROUP BY bar, baz
TOP 50
```
- The results will always be ordered by the aggregated value (descending). The results for the query

```sql
SELECT MIN(foo), MAX(foo) FROM myTable
GROUP BY bar
TOP 50
```

will be the same as combining the results from the following queries

```sql
SELECT MIN(foo) FROM myTable
GROUP BY bar
TOP 50
```

```sql
SELECT MAX(foo) FROM myTable
GROUP BY bar
TOP 50
```

where we don't put the results for the same group together.
- No support for ORDER BY in aggregation group by. However, ORDER BY support was added recently and is available in the standard-SQL endpoint. It can be used in the PQL endpoint by passing `queryOptions` into the payload as follows:

```json
{
  "pql" : "SELECT SUM(foo), SUM(bar) from myTable GROUP BY moo ORDER BY SUM(bar) ASC, moo DESC TOP 10",
  "queryOptions" : "groupByMode=sql;responseFormat=sql"
}
```

where:

- `groupByMode=sql`: standard SQL way of executing group by, hence accepting ORDER BY
- `responseFormat=sql`: standard SQL way of displaying results, in a tabular manner
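For comparison, the recommended standard-SQL endpoint expresses the same query natively with ORDER BY and LIMIT. The payload shape below (an `sql` key instead of `pql`) is an assumption based on recent Pinot versions, shown only as a sketch:

```json
{
  "sql" : "SELECT moo, SUM(foo), SUM(bar) FROM myTable GROUP BY moo ORDER BY SUM(bar) ASC, moo DESC LIMIT 10"
}
```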