Substrait — the Lingua Franca for Databases

Introduction

Often, required data resides in different storages: for example, sales data is in Parquet/S3, while catalog data is in Postgres. A business question requires one query across all of it (e.g., joining sales with products).

To achieve this, one can use a federated engine (CedrusData / Trino / Dremio / Presto), which parses SQL written by an analyst, builds a unified execution plan, and pushes down fragments of the plan to data sources: filters and projections to Parquet readers, subqueries to Postgres, etc.

For Postgres, the federated engine uses a JDBC connector. This usually means that after optimization, the federated engine must regenerate SQL in the target database’s dialect, send it there, fetch intermediate results, and compute the rest of the plan (joins across sources, final aggregations) locally.

But this approach has a number of specific problems:

SQL dialects. Different systems “speak” different dialects, with distinct behavior of functions and NULL. The query producer (where the query is written) must know every consumer (where the query or part of it is executed).
Redundant work. Every engine must re-parse and re-optimize SQL generated by connectors of the federated engine. Hints and plan details are lost on the way..
Inconsistent semantics. Integer overflow, division by zero, NULL comparison - all can behave differently in different engines.
NoSQL and multiple DSLs. MongoDB, Cassandra, and other APIs don’t speak SQL. Analysts, meanwhile, use Python (Ibis), R (dplyr), and more. We need a common denominator for all of them.

The Idea of Substrait - Instead of sending a SQL string, Substrait proposes to transmit a serialized query plan - an intermediate representation (IR) in protobuf, with well-defined operators, types, and functions.

This enables:

Pushing down plan fragments into suitable backends
Avoiding repeated parsing/optimization
Decomposing a large plan into sub-plans executed where it’s most efficient
Decoupling frontends (SQL/Ibis/dplyr/…) from backends (Velox/DuckDB/DataFusion/…)

Substrait's place in the world of DBMS is well illustrated by the following illustration:

When Substrait Is Not Needed:

You live entirely within one DB/DI stack, without federated sources.
You have no need to share logic across engines.
Your internal IR fully satisfies you, and you don’t plan to swap the execution backend.

Substrait Architecture

Substrait is an open, language- and engine-agnostic specification of a portable IR for relational computations over structured data.

Plans are defined in Protocol Buffers (with a human-readable text form), and YAML catalogs describe extensions - functions, expressions, relational operators, types, and their variants.

The basic building blocks of Substrait are shown in this figure:

One system that generates plans is called a producer, and the one that consumes and executes them - a consumer.

Before executing anything, systems can exchange their capabilities (capabilities.proto), so both sides understand what the other supports. This includes the Substrait version and lists of supported functions and types - described via SimpleExtensions.

A typical YAML extension (e.g., functions_arithmetic.yaml) looks like this:

urn: extension:io.substrait:extension_types
types:
  - name: point
    structure:
      latitude: i32
      longitude: i32
  - name: line
    structure:
      start: point
      end: point

This capabilities exchange step is optional - one can start sending plans right away, assuming both sides understand the same extensions.

When a user requests data in any format (SQL, Python, dplyr, etc.), the producer converts it into a Protobuf message in accordance with the Substrait specification.

To make this clearer, let's look at a typical use case for pushing a subquery to another system. For example, we need to execute this query:

-- Sum of sales where price > 100
SELECT SUM(amount) AS total_amount
FROM sales
WHERE price > 100;

The sales table is in Parquet.

Our Parquet reader can perform simple filtering (the WHERE clause), but not aggregation (SUM).

The producer generates a Substrait plan and sends it to the consumer:

{
 // List of extensions (functions and types) used
 "extensions": [
    // The namespase for extensions has anchor=1, this is the link by which the function is bound to the namespace.
    { "extensionUri": { "uriAnchor": 1, "uri": "urn:substrait:functions_comparison" } },
    // The definition of the function used is "gt," short for "greater than." It has an anchor that links to it below.
    { "extensionFunction": { "functionAnchor": 1, "name": "gt", "uriReference": 1 } }
  ],

  "relations": [
    {
      // Relational expression tree
      "root": {
        "input": {
          "project": {  // The projection operator SELECT(amount) which discards extra columns
            "input": {
              "filter": {  // Filter operator WHERE price > 100
                "input": {
                  "read": {  // Table Read Operator and Table Schema       
                    "namedTable": { "names": ["sales"] },
                    "baseSchema": {
                      "names": ["id", "price", "amount"],
                      "struct": {
                        "types": [
                          { "i64": {} },
                          { "i32": {} },
                          { "i32": {} }
                        ]
                      }
                    }
                  }
                },
                "condition": {  // The argument of the filter operator - predicate price > 100
                  "scalarFunction": {
                    "functionReference": 1, // gt(x, y) - the “Greater than” operation with anchor=1 (see the definition of extensionFunction at the beginning)
                    "arguments": [
                      {
                        "value": {
                          "selection": {
                            "directReference": { "structField": { "field": 1 }} //price field #1
                            "rootReference": {}
                          }
                        }
                      },
                      { "value": { "literal": { "i32": 100 } } } // Literal “100” of integer 32
                    ]
                  }
                }
              }
            },
            "expressions": [ // The argument of the project operator: select the ‘amount’ column.
              {                                          
                "selection": {
                  "directReference": { "structField": { "field": 2 } }, // Amount - field #2
                  "rootReference": {}
                }
              }
            ]
          }
        },
        "names": ["amount"] // The final projection of the “root” node
      }
    }
  ]
}

The consumer executes the plan and returns the filtered amount column (price > 100).

The producer aggregates it locally (SUM(amount)) and returns the final result.

Thus, we executed a federated query without parsing SQL, without dialect headaches, and without regenerating SQL for subqueries.

Ways to use Substrait

The example above shows just one scenario where Substrait would be useful: pushing a subquery to another system. But there are other use cases for Substrait.

As shown above - one system sends a subquery plan to another (or several).

Database 1 must know the others’ capabilities (e.g., don’t send joins to a Parquet reader).

Here, Database 1 need not know the capabilities of Database 2.

It can send the entire plan, and Database 2 executes what it can, returning the residual part (still in Substrait) to be executed by Database 1.

Substrait can also replace an internal IR for communication between the query planner and execution engine. This makes swapping engines easier.

For example, Spark + Gluten uses Substrait for physical-plan translation into native execution (Velox and others).

Conclusion

Substrait brings us closer to a world where frontends (SQL / Ibis / dplyr / …) and backends (Velox / DuckDB / DataFusion / …) can be freely combined.

It’s not a silver bullet — semantics alignment, validation, and observability are still required — but the gains in integration speed and logic portability are well worth it.

Useful links

Substrait site https://substrait.io/‍
Substrait tutorial https://substrait.io/tutorial/sql_to_substrait/‍
Substrait repo https://github.com/substrait-io/substrait/blob/main/extensions/extension_types.yaml‍
Useful exmaples https://medium.com/@omri-levy/one-query-plan-three-different-engines-e5dc74aeb52f

Substrait — the Lingua Franca for Databases

Introduction

Substrait Architecture

Ways to use Substrait

Conclusion

Useful links

about querify Labs

Navigation

contact us