Avoiding unnecessary computations is essential for high-performance query engines. This blog post discusses search arguments, or SARGs - a technique to derive data restrictions from query predicates that enables index selection, data pruning, and query plan simplification optimizations.

Consider a simplified part of the TPC-DS query 28:

We may infer several restrictions from the query: `ss_quantity ∈ [0:5]`, and `ss_list_price ∈ [8:18]`. These restrictions might be helpful for the query engine, for example:

- A planner may decide to use secondary indexes on these columns to minimize the amount of data to scan.
- A storage engine may skip some data blocks by comparing column restrictions with the block-level column statistics, a technique known as data pruning.
- Advanced planners may deduce that `avg(ss_list_price) > 20` would never hold for the range `ss_list_price ∈ [8:18]`, avoiding query execution altogether.

Now, let's see how we can generalize these restrictions into a common framework that both query planner and query engine may utilize.

Assume that our system converts every query into a tree of relational operators. Each operator has one or more attributes. Each attribute has the data domain determined by its type. We may describe the domain as a set of ranges and concrete values.

By carefully analyzing the query predicates, we may impose additional restrictions on operator attributes. We call these restrictions searchable arguments or **SARGs**. We call the expression **sargable** if it allows us to derive one or more SARGs.

The most common example of sargable expressions are comparison operators:

- Operators `=`, `>`, `>=`, `<`, `<=` add ranges and restrict the `NULL` values.
- The inequality operator `<>` excludes a point and the `NULL` value.
- The `IS NULL`, `IS NOT NULL`, `IS DISTINCT FROM`, and `IS NOT DISTINCT FROM` operators provide equality and inequality comparisons with special handling of `NULL` values.
- The `LIKE` operator with a concrete prefix might be represented as a range.

SARGs may also be extracted from logical expressions:

- `AND` expression always narrows the result set, and therefore we can combine SARGs from the child predicates through the intersection.
- `OR` expression extends the possible result set. If all sides of the `OR` contain restrictions on the same attribute, we merge the restrictions. For example, `a = 10 OR a = 20` limits the attribute's domain to two points [10] and [20]. If at least one part of the `OR` doesn't restrict the given attribute, then we ignore the relevant attribute restrictions from other sides. For example, `a = 10 OR b = 20` doesn't impose any restrictions to `a` or `b` domains.
- `NOT` expression inverts the existing SARG. Note that `a <> 10` and `NOT (a = 10)` are different expressions thanks to the three-valued boolean logic in SQL.

We can derive SARGs from other predicates. For example, we can potentially derive `x ∈ [4]` from the following expressions:

- Arithmetic functions: `x + 1 = 5`.
- Math functions: `sqrt(x) = 2`.
- Casts: `CAST(x AS BIGINT) = 4`.
- Transitive predicates: `x = y AND y = 4`.

Note, however, that some production-grade optimizers might not be able to extract SARGs from anything other than range predicates and logical expressions. To increase the chances that SARG-related optimizations kick in, it is often better to write your conditions in a straightforward form when possible:

- `<column> <condition> <value>`.

Sometimes we know that the attribute's domain is restricted to one or more points, but the concrete values are not known until execution. Consider the following expression with a parameter: `a = ? AND b = 0.9 * a`. Both `a` and `b` would always have only one concrete value, but we cannot resolve these values upfront:

Surprisingly, even unknown restrictions might be valuable for the engine. If the predicate returns only one distinct value for the attribute `a`, we can better estimate some cardinalities. For example, the aggregation `GROUP BY a` would never produce more than one row. Therefore, the SARG infrastructure may also track unknown points or ranges.

Now, as we understand the core concerns, we can propose the high-level design of the SARG data structure:

We now discuss how query engines may use SARGs to improve query execution performance.

Query optimizers rely on statistics to determine the optimal operator placement, such as join order, aggregate pushdown, etc. DBMS backends usually provide statistics in the form of inherently imprecise sketches. Derivation of stats for intermediate operators increases estimation errors further. The tighter the stats, the smaller the estimation errors.

We may consider SARGs as an additional source of statistics. We may reduce the error margins by merging SARGs with existing engine stats. This dramatically improves cardinality estimations, especially for the join order selection.

One example is the already mentioned cardinality estimation for the `Aggregate` operator when its input is known to produce no more than one distinct value for the grouping attribute:

If the existing statistics and the newly derived SARG have non-intersecting ranges, we may estimate that the operator is likely to produce no rows. If the existing statistics are already derived from other SARGs, we can return the empty result set without even executing some parts of the query.

For distributed engines, SARGs may provide hints that the relevant data set is located on a single node, which may simplify the parent operators. For example, the distributed `Aggregate` operation usually requires a shuffle (aka exchange) to calculate the final aggregate. However, if its input contains only one row, we do not need the shuffle.

The textbook example of SARG usage is the access path selection. For example, the range predicate `a > 10` may instruct the optimizer to use the sorted index.

Another important application of SARGs is data pruning. Some databases may partition data using hash or range sharding. We can use SARGs to infer that a particular partition doesn't hold the required data. This technique is known as **partition pruning** and often could be applied by the query optimizer before the query execution.

In other cases, the SARG-induced pruning is applied in runtime. For example, columnar databases often organize the data set into blocks. Blocks may have pre-calculated per-column statistics, such as `min` and `max` values. Comparing the SARG range with the range from block stats, we may deduce that the block doesn't have matching rows without reading the actual data. We call this technique **block pruning**.

SARGs, or searchable arguments, are a powerful technique that allows us to extract the attribute restrictions from predicates and propagate them across operators. Query engines might utilize SARGs to improve query execution performance. Typical optimizations are operator tree simplification, better cardinality estimations, and access path selection.

As an end-user, you may increase the probability of SARG-related optimizations by writing the predicates in the form of `<attribute> <condition> <value>`, e.g., `x = 10`. Although, advanced engines may derive SARGs from more complex expressions.

In future posts, we will discuss the data access optimizations in more details. Stay tuned!

We are always ready to help you with your query engine design. Just let us know.