We create high-performance data processing engines with Apache Arrow and Apache Calcite and advanced analytical solutions with Trino / Presto. Learn more about our team.
Building a query engine is a challenging task that requires careful design of the executor, scheduler, memory manager, and other key components. We have built multiple distributed query engines for both transactional and analytical workloads using modern approaches such as vectorization and compiled execution.
We routinely use Apache Arrow for high-performance columnar processing.
The query optimizer is one of the most important components of a modern data management system and has a critical impact on performance. We create powerful cost-based optimizers for distributed, federated, and analytical engines.
We frequently use Apache Calcite as the backbone of our optimizers.
Data is at the heart of any modern business. The ability to analyze large volumes of data quickly is essential to staying ahead of your competitors.
We build custom analytical solutions using the modern open-source stack, including Apache Spark, Apache Flink, Apache Kafka, Trino, and open data formats such as Parquet, ORC, and Avro, managed by Apache Hive or Apache Iceberg.
Designing a new data management system is a challenging task. We create prototypes and conduct design reviews to ensure that you consider all trade-offs as early as possible.
Data processing is an active area of research. We bridge academic knowledge and practice to help you make better design decisions.
In-house expertise is essential for long-term product success. We conduct training to help your team accumulate solid knowledge of distributed systems and query processing.
Aggregation is one of the most frequently encountered operations in analytics. In SQL, aggregations are performed using aggregate functions (e.g., `SUM`, `COUNT`) with an optional `GROUP BY` clause. An aggregate function may contain the `DISTINCT` keyword, which can be non-trivial to implement in a query engine. This blog post explains how the Apache Calcite and Trino optimizers rewrite distinct aggregates so that the underlying query engine can process them.
Avoiding unnecessary computations is essential for high-performance query engines. This blog post discusses search arguments, or SARGs, a technique for deriving data restrictions from query predicates that enables index selection, data pruning, and query plan simplification.
Distributed SQL engines process queries on several nodes. Nodes may need to exchange tuples during query execution to ensure correctness and maintain a high degree of parallelism. This blog post discusses the concept of data shuffling in distributed query engines.