Written byVladimir Ozerov
June 1, 2021
When a user submits a query to a database, the optimizer translates the query string to an intermediate representation (IR) and applies various transformations to find the optimal execution plan.
Apache Calcite uses relational operators as the intermediate representation. In this blog post, we discuss the design of common relational operators in Apache Calcite.
Query optimization starts with parsing when a query string is translated into a syntax tree, which defines the syntactic structure of the query.
Since every database has a parser, the syntax tree might look like a good candidate for the intermediate representation because it is readily available to the database.
There are two significant problems with syntax tree as query's IR:
Combined, this makes query optimization over syntax trees challenging and not flexible.
An alternative IR is a relational operator tree. We may define common relational operators, such as `Project`, `Filter`, `Join`, `Aggregate`. The query represented in such a way is much simpler to optimize because relational operators have a well-defined scope and usually have only one input (except for joins and set operators). This dramatically simplifies common relational optimizations, such as operator transposition. Also, it gives implementors flexibility to model operators independently of the database syntax rules.
The main disadvantage is the need to translate the syntax tree into a relational tree, which is often non-trivial, especially with complex syntax constructs like subqueries or common table expressions. However, the simplicity and flexibility of relational operators usually outweigh by a high margin the additional efforts on translation.
Apache Calcite parses the query into a syntax tree. Then it performs the semantic validation of the syntax tree using the SqlValidatorImpl class, resolving involved data types along the way. Finally, the validated syntax tree is converted into a tree or relational operators using the SqlToRelConverter class. The subsequent optimizations are performed on the relational tree.
In this section, we discuss the design of Apache Calcite relational operators.
We start with several simplified definitions, which are not precise but sufficient for this blog post.
An attribute is a pair of a name and a data type. An attribute value is defined by an attribute name and value from the attribute type domain. A tuple is an unordered set of attribute values. No two attribute values in the tuple may have the same attribute name. A relation is a set of tuples. Every tuple within the relation has the same set of attributes. Relational operators take zero, one, or more input relations and produce an output relation.
To construct a tree of relational operators, we need the ability to define operator inputs. Many operators need access to attributes of the input relations. Therefore we also need the ability to reference input attributes. These are two key requirements for the relational operator interface.
In Apache Calcite, the relational operator is represented by the RelNode interface. The operator may have zero, one, or more input operators. For example, `TableScan` is an 0-ary operator, `Filter` is a unary operator, and `Union` is an N-ary operator. Every operator exposes the `RelDataType`, which is an ordered list of operator attributes. This is sufficient to construct arbitrarily complex relational trees.
Operators describe various transformations to tuples. A RexNode interface defines an operation that applies to some attribute values of a tuple and produces another value. Common `RexNode` types:
For example, the expression `name = "John"` would be represented as follows.
Notice that `RexInputRef` references the input's attribute by index, which means that attribute order is important in Apache Calcite. On the bright side, it simplifies the design, as you do not need to care about attribute names and potential naming conflicts (think of a join of two tables, which have an attribute with the same name). On the other hand, it has a detrimental effect on join order planning, as we shall see below.
Now, as we understand the basics, let's discuss the most common Apache Calcite operators: `TableScan`, `Project`, `Filter`, `Calc`, `Aggregate`, and `Join`.
Other important operators are `Window` and `Union`. We omit them in this blog post because they follow the same design principles as the previously mentioned operators.
`TableScan` is a leaf 0-ary operator that defines a scan of some data source.
The operator contains the `org.apache.calcite.schema.Table` instance, which describes a data source that produces tuples. It could represent a relational table, an index, a view, a CSV file, a network connection, or anything else. As an implementor, you provide the schema of your database that contains some `Table` instances. Apache Calcite will create a `TableScan` operator with the referenced `Table` inside when you refer to that table in the query. The `Table` must expose the row type so that the parent operators know which attributes are available from the `TableScan`.
The `Project` operator defines row expressions that should be applied to input tuples to produce new tuples. The operator produces one output tuple for every input tuple. Expressions are organized in a list.
Because Apache Calcite uses local indexes to reference input attributes, the `Project` operator is also injected whenever we need to change the attribute's order. For example, if there is a table with attributes `[a, b]` in that order and we execute `SELECT b, a FROM t`, the `Project` operator will be added on top of the `TableScan` to reorder attributes as required by the query. This complicates query planning because the optimizer spends time applying transformation rules to otherwise useless operators that do a trivial reorder.
Physical implementations of the `Project` operator must adjust the input traits. E.g., if the `TableScan` produces tuples ordered by `[b]` but the `Project` operator doesn't project that column, the order will be lost.
The relational tree of the query `SELECT a, a+b FROM t` might look like this:
The `Filter` operator returns tuples that satisfy a predicate. A predicate is a row expression. The `Filter` output row type is similar to the input's row type. Physical implementations of the `Filter` operator usually don't change input traits.
The query `SELECT a, a+b FROM t WHERE a+b>5` could be represented as:
The `Calc` is a special operator that combines the functionality of `Project` and `Filter` operators and performs the common sub-expression elimination. Internally, it splits all composite row expressions into primitive expressions. Expressions are organized in a list. The special `RexLocalRef` node is used to link siblings. `Project` becomes a list of expression indexes that should be exposed from the operator. `Filter` becomes an optional expression index that filters input tuples.
Apache Calcite provides a lot of optimization rules for `Project` and `Filter` operators. These same optimizations are generally not implemented for the `Calc` operator because it would essentially require duplication of rules logic. Instead, you may do the cost-based optimization with `Project` and `Filter` operations only and then convert `Project` and `Filter` operators into `Calc` in a separate heuristic phase. Apache Calcite provides dedicated rules for that. We touched on the multi-phase optimization in our previous blog post.
The `Aggregate` operator models the application of aggregate functions to the input. The operator consists of two parts - the group keys and aggregate functions.
The group keys define which input attributes to use to construct the groups. The statement `GROUP BY a, b` yields the grouping key `[0, 1]` if `a` and `b` are located on input positions 0 and 1, respectively. If there is no `GROUP BY` clause, the group key would be empty.
There will be several group keys if there is a `ROLLUP` or `CUBE` clause. For example, `GROUP BY ROLLUP a, b` would yield the grouping keys `[0,1], , `, which means that we would like to output groups for `[a, b]`, groups for `[a]`, and global aggregates without any grouping.
If there is an expression in the `GROUP BY` statement, it would be moved to a separate `Project` operator below `Aggregate`. This is why it is sufficient to define input attribute indexes for the group keys instead of defining row expressions. Separation of projections and aggregations is essential to keep the complexity of optimization rules under control. Otherwise, we would have to repeat logic from the `Project` optimization rules in the `Aggregate` optimization rules.
The aggregate functions are the list of aggregates that should be computed for the groups. The aggregate functions do not use the `RexNode` interface because they operate on multiple tuples as opposed to row expressions that are applied to a single tuple. Similar to group keys, aggregate functions refer to input columns by indexes. For example, the function `SUM(a)` is converted to `SUM(0)` if the input attribute `a` is located at position 0. Likewise, complex expressions are moved to a `Project` operator below the `Aggregate`. Aggregate functions may also have advanced properties, such as the `DISTINCT` flag or an optional filter. We will discuss these features in detail in future blog posts.
The `Aggregate` operator outputs group keys followed by aggregate functions. For the query `SELECT SUM(a), b GROUP BY b`, the relevant `Aggregate` operator would output `[0:b, 1:SUM(a)]`.
Consider the plan for the query `SELECT SUM(a+b), c FROM t GROUP BY c` below. Notice two `Project` operators: one to calculate `a+b` and another to output `SUM` before the attribute `c`.
The `Join` operator joins two inputs. The operator defines the join type (inner, left/right/full outer, semi, etc.) and the optional predicate.
The `Join` operator outputs all columns from the left input followed by all columns from the right input. There is the convention: given the left input with `L` attributes and the right input with `R` attributes:
In our previous blog post, we discussed that cost-based optimizers rely on the equivalence property of operators to encode alternative plans efficiently in the MEMO data structure. In Apache Calcite, `Join(AxB)` and `Join(BxA)` are not semantically equivalent because Apache Calcite relies on attribute indexes in the `RexInputRef` class. Parent operators of `Join(AxB)` and `Join(BxA)` will have to use different indexes when referring to the same join attribute. Internal join predicates will also reference attributes at different indexes.
Consider the `JoinCommute` rule that changes the order of inputs. To apply this rule, we need to (a) rewrite the internal predicate and (b) add the `Project` on top of the new `Join` to restore the original order of attributes.
This additional `Project` prevents the execution of other rules. For example, the `JoinAssociate` rule tries to reorder `(A join B) join C` to `A join (B join C)`. The rule looks for a pattern "Join on top of the Join". But with the additional `Project`, we have only "Join on top of the Project". To mitigate this, we may use the `JoinProjectTransposeRule` that transposes `Join` and `Project`, but this dramatically decreases planner's performance to the extent that Apache Calcite cannot do the exhaustive cost-based join planning on more than 5-6 tables in a reasonable time.
The alternative solution would be to operate on unique column names rather than indexes. Spark Catalyst and CockroachDB follow this approach. But this would require introducing some unique identifier to every equivalence group, which is also a challenge on its own.
Apache Calcite parses the query string into a syntax tree. The syntax tree is then translated into a tree of relational operators, which have a simpler internal structure and are more suitable for the subsequent optimizations.
We discussed several common relational operators in Apache Calcite. `Project` transforms every tuple from the input into another tuple. `Filter` operator returns input tuples that pass the predicate. `Calc` combines `Project` and `Filter` functionality and eliminates the common sub-expressions. `Aggregate` operator performs the grouping and applies aggregate functions. `Join` operator combines tuples two inputs and applies the predicate.
Designing relational operators is challenging. Every decision may open opportunities for new optimizations but block others. The index-based input attribute references in Apache Calcite are a good example of such a trade-off when a simplification useful for many optimization rules leads to severe problems with one of the most critical optimizer tasks - join order planning.
In future blog posts, we will dive into concrete optimizations that Apache Calcite applies to individual operators. Stay tuned!
We are always ready to help you with your query optimizer design. Just let us know.