Cross-source queries
Join data across Postgres, Snowflake, MongoDB, and more in a single SQL statement, no ETL pipelines required.
What is DataFusion?
Apache DataFusion is a fast, extensible query engine written in Rust. It executes SQL against in-memory Apache Arrow columnar data, with a vectorized, streaming execution model and a pluggable table-provider interface that lets a host application feed it rows from any source. It is a top-level Apache project, used as the engine behind many analytical databases and data tools.
Arris embeds DataFusion to run cross-source queries: a single SQL statement that reads from several of your database connections at once and joins the results locally. DataFusion runs entirely in-process, so there is no cluster to deploy and no external service to call.
How it works
When you run a cross-source query, Arris hands it to the embedded DataFusion engine, which:
- Parses — the SQL is scanned for dotted table references to work out which connection each table belongs to.
- Registers — each referenced table is registered with a DataFusion session as a custom table provider backed by that connection's native driver.
- Pushes down — column projections,
WHEREfilters, and row limits are translated back into SQL and pushed to each source, so only the needed rows and columns are pulled. - Executes — DataFusion streams the source rows as Arrow batches and runs the full query, performing the joins, aggregations, and sorting locally with its vectorized operators.
- Returns — the result is shown in the same results grid as any other query.
Writing a cross-source query
In an editor tab, flip the DataFusion toggle in the run bar. The toggle needs
at least two connections configured; once it is on, the connection selector switches to
All Connections and the tab can reference tables from any of them. Reference a
table with its connection name in front: connection.schema.table.
Table references
A reference is <connection>.<schema>.<table>, or
<connection>.<table> for sources without schemas (e.g., for MongoDB, the
database name takes the schema slot). The connection name is the name you gave the connection in
Arris, and the editor tints each connection's segment in its own color. Autocomplete spans every
connection, so typing a connection name and a dot suggests that connection's schemas, then its
tables and columns.
Execution plan
A cross-source query runs through DataFusion's physical plan, which Arris surfaces as a live execution graph. Use the Show execution plan toggle () in the results toolbar to swap the grid for a node graph of the plan: each scan, join, aggregate, sort, and the final result, with each node's status updating as the query runs.
Limitations
Federation is designed for analytical and exploratory workloads. Keep these limits in mind:
- Read-only sources — a cross-source query only reads from its sources; it never writes back to them.
- Data volume — source rows are streamed into memory (DataFusion can spill to disk under pressure). Queries that scan very large tables may be slow or memory-heavy, so filter at the source with
WHEREclauses. - Transactions — a cross-source query does not run in a distributed transaction. Each source is read at a point in time and may not be perfectly consistent with the others.
- Engine-specific features — DataFusion executes the final SQL, so engine-specific functions (for example Postgres
array_agg) only work if DataFusion has an equivalent.