Blog posts by Vladimir Ozerov

Composable Data Systems: Lessons from Apache Calcite Success

Apr 1, 2024

Apache Calcite has achieved tremendous success, powering query optimization in many popular systems, such as Apache Hive and Apache Flink. But even though such a great library has existed for more than ten years, query optimizer development is still remarkably complicated and hardly "commoditized." Why is that? We will discuss which technical decisions contributed to Apache Calcite's success, what role the community plays in such projects, why it is still so difficult to integrate "composable" libraries into real products, and why I personally do not believe that the composable data systems trend will fundamentally change the competitive dynamics in the market.

Dynamic Filtering: a Critical Performance Optimization in Analytical Engines

Jun 6, 2023

We discuss how the dynamic filtering optimization dramatically improves scan performance in analytical engines and how it is implemented in Trino.

Distinct aggregation optimization in Apache Calcite and Trino

Feb 22, 2023

Aggregation is one of the most frequently encountered operations in analytics. In SQL, aggregations are performed using aggregate functions (e.g., `SUM`, `COUNT`) with an optional `GROUP BY` clause. An aggregate function may contain the `DISTINCT` keyword, which can be non-trivial to implement in the query engine. This blog post explains how the Apache Calcite and Trino optimizers rewrite distinct aggregates so that the underlying query engine can process them.
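
To give a flavor of such a rewrite, here is a minimal sketch that uses Python's built-in `sqlite3` module. The `orders` table and its data are made up for illustration, and the rewrite shown (deduplicate with an inner `GROUP BY`, then apply a plain aggregate) is only the generic idea, not the exact rules that Apache Calcite or Trino apply.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table, used only for this illustration.
conn.execute("CREATE TABLE orders (customer TEXT, product TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("a", "x"), ("a", "x"), ("a", "y"), ("b", "x"), ("b", "x")],
)

# Original query: the engine must support DISTINCT inside the aggregate.
original = """
    SELECT customer, COUNT(DISTINCT product)
    FROM orders
    GROUP BY customer
    ORDER BY customer
"""

# Rewritten query: the inner GROUP BY removes duplicate (customer, product)
# pairs, so the outer aggregate no longer needs DISTINCT.
rewritten = """
    SELECT customer, COUNT(product)
    FROM (SELECT customer, product FROM orders GROUP BY customer, product)
    GROUP BY customer
    ORDER BY customer
"""

original_rows = conn.execute(original).fetchall()
rewritten_rows = conn.execute(rewritten).fetchall()
print(original_rows == rewritten_rows)
```

Both queries return the same rows, but the second one only needs ordinary grouping and a plain `COUNT` from the engine.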

Enhancing query execution performance with searchable arguments

May 5, 2022

Avoiding unnecessary computations is essential for high-performance query engines. This blog post discusses search arguments, or SARGs: a technique for deriving data restrictions from query predicates, which enables index selection, data pruning, and query plan simplification.
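
As a rough illustration of the idea, the sketch below folds a conjunction of simple predicates on one column into a single inclusive range and uses that range to skip data blocks by their min/max statistics. The `Range` class, the `(operator, value)` predicate encoding, and the block statistics are assumptions made for this example; they are not the representation used by any particular engine.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    """Inclusive range of values that a column is allowed to take."""
    low: float = -math.inf
    high: float = math.inf

    def intersect(self, other):
        return Range(max(self.low, other.low), min(self.high, other.high))

    def is_empty(self):
        return self.low > self.high

    def overlaps(self, block_min, block_max):
        return (not self.is_empty()
                and self.low <= block_max and block_min <= self.high)


def to_range(op, value):
    """Translate one simple predicate on the column into a range."""
    if op == "=":
        return Range(value, value)
    if op == ">=":
        return Range(low=value)
    if op == "<=":
        return Range(high=value)
    raise ValueError(f"unsupported operator: {op}")


def derive_sarg(predicates):
    """Fold predicates joined by AND into a single range."""
    result = Range()
    for op, value in predicates:
        result = result.intersect(to_range(op, value))
    return result


# WHERE a >= 10 AND a <= 20 AND a >= 15
sarg = derive_sarg([(">=", 10), ("<=", 20), (">=", 15)])

# Hypothetical per-block (min, max) statistics for column a.
blocks = [(0, 9), (10, 14), (15, 30), (40, 50)]
blocks_to_scan = [b for b in blocks if sarg.overlaps(*b)]

# WHERE a >= 10 AND a <= 5 can never match, so the scan can be dropped.
contradiction = derive_sarg([(">=", 10), ("<=", 5)])
```

Here only the block `(15, 30)` survives pruning, and the contradictory predicate yields an empty range, which lets a planner replace the scan with an empty result.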

Introduction to Data Shuffling in Distributed SQL Engines

Jan 31, 2022

Distributed SQL engines process queries on several nodes. Nodes may need to exchange tuples during query execution to ensure correctness and maintain a high degree of parallelism. This blog post discusses the concept of data shuffling in distributed query engines.
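
A toy sketch of why tuples need to move: for a distributed equi-join to be correct, rows with the same join key must end up on the same node. The function name, the two-node setup, and the sample rows below are assumptions made for illustration, and the "nodes" are just Python lists.

```python
def shuffle_by_key(rows, key_index, num_nodes):
    """Assign each row to a node based on its (integer) join key."""
    partitions = [[] for _ in range(num_nodes)]
    for row in rows:
        partitions[row[key_index] % num_nodes].append(row)
    return partitions


NUM_NODES = 2
customers = [(1, "Ann"), (2, "Bob"), (3, "Cid")]
orders = [(1, "o1"), (2, "o2"), (3, "o3"), (1, "o4")]

# After the shuffle, matching customer and order rows share a node.
customer_parts = shuffle_by_key(customers, 0, NUM_NODES)
order_parts = shuffle_by_key(orders, 0, NUM_NODES)

# Each node joins only its local partitions, independently of the others.
joined = [
    (customer_id, name, order)
    for node in range(NUM_NODES)
    for customer_id, name in customer_parts[node]
    for order_customer_id, order in order_parts[node]
    if customer_id == order_customer_id
]
```

Because both inputs are partitioned with the same function on the join key, the union of the per-node results equals the result of joining the full inputs.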

Cross-Product Suppression in Join Order Planning

Nov 15, 2021

In this blog post, we discuss cross-product suppression, an important heuristic that powers join order planning in modern query optimizers.

Relational Operators in Apache Calcite

Jun 1, 2021

When a user submits a query to a database, the optimizer translates the query string to an intermediate representation (IR) and applies various transformations to find the optimal execution plan. Apache Calcite uses relational operators as the intermediate representation. In this blog post, we discuss the design of the relational operators in Apache Calcite.

Memoization in Cost-based Optimizers

Mar 25, 2021

Query optimization is an expensive process that needs to explore multiple alternative ways to execute the query. The query optimization problem is NP-hard, with the number of possible plans growing exponentially with the query's complexity. This blog post will discuss memoization - an important technique that allows rule-based optimizers to consider billions of alternative plans in a reasonable time.
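
The core idea can be sketched with a toy join-order search in which every distinct set of relations is optimized only once and the result is reused. The cardinalities, the cost model (sum of intermediate result sizes, treating every join as a cross product), and the function names are assumptions made for this example. A real optimizer's memo stores groups of equivalent expressions rather than a single number per subset.

```python
from functools import lru_cache
from itertools import combinations

# Hypothetical table cardinalities.
ROWS = {"A": 1000, "B": 100, "C": 10, "D": 1}


def splits(rels):
    """Yield each way to split a set of relations into two non-empty parts."""
    items = sorted(rels)
    first, rest = items[0], items[1:]
    # Pin the first relation to the left side to skip mirrored splits.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            left = frozenset((first,) + extra)
            yield left, rels - left


@lru_cache(maxsize=None)  # the "memo": one entry per distinct subset
def best_cost(rels):
    """Cheapest cost of joining `rels` under the toy cost model."""
    if len(rels) == 1:
        return 0
    output_rows = 1
    for rel in rels:
        output_rows *= ROWS[rel]
    return output_rows + min(
        best_cost(left) + best_cost(right) for left, right in splits(rels)
    )


plan_cost = best_cost(frozenset(ROWS))
solved_subproblems = best_cost.cache_info().misses
```

With four relations there are only 15 non-empty subsets, so `best_cost` does real work 15 times even though the same subsets appear inside many different join trees. That reuse is what keeps the search tractable as the plan space grows.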

Rule-based Query Optimization

Jan 28, 2021

In this blog post, we discuss rule-based optimization, a common pattern that modern optimizers use to explore equivalent plans. We then analyze how rule-based optimization is implemented in Apache Calcite, Presto, and CockroachDB.

Inside Presto Optimizer

Jan 4, 2021

Presto is an open-source distributed SQL query engine for big data. In this blog post series, we explore the internals of the Presto query optimizer. In the first part, we discuss the relational tree organization, the optimizer interface, and the design of the rule-based planner.

Querify labs BLOG