The Atreides Family of Worst-case Optimal Join Algorithms

The query engine reasons about data by solving a set of constraints over variables. Instead of constructing a traditional left-deep or bushy join plan, it performs a guided depth-first search that binds one variable at a time. The approach draws on the broader theory of worst-case optimal joins and lets us navigate the search space directly rather than materialising intermediate results.

Constraints as the search frontier

Every constraint implements the Constraint trait, which exposes four abilities that shape the search (sketched in code after the list):

  1. estimate – predicts how many results remain for a variable under the current partial binding.
  2. propose – enumerates candidate values for a variable.
  3. confirm – filters a set of candidates without re-enumerating them.
  4. influence – reports which other variables need their estimates refreshed when this variable changes.
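
In Rust terms the trait might look roughly like the sketch below. VariableId, Value, and Binding are illustrative stand-ins for the engine's real types, and the actual signatures in the codebase differ in detail:

// Illustrative stand-ins for the engine's real types.
type VariableId = usize;
type Value = u64;

/// The active partial assignment of variables to values.
pub struct Binding {
    pub bound: Vec<Option<Value>>,
}

/// A sketch of the four abilities every constraint exposes.
pub trait Constraint {
    /// Upper bound on how many results remain for `variable`
    /// under the current partial binding.
    fn estimate(&self, variable: VariableId, binding: &Binding) -> usize;

    /// Enumerate candidate values for `variable` into `proposals`.
    fn propose(&self, variable: VariableId, binding: &Binding, proposals: &mut Vec<Value>);

    /// Filter `proposals` in place without re-enumerating, keeping
    /// only candidates this constraint can still satisfy.
    fn confirm(&self, variable: VariableId, binding: &Binding, proposals: &mut Vec<Value>);

    /// Which other variables need fresh estimates once `variable` changes.
    fn influence(&self, variable: VariableId) -> Vec<VariableId>;
}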

Traditional databases rely on a query planner to combine statistics into a join plan. Atreides instead consults the constraints directly while it searches. Each constraint can base its estimates on whatever structure it maintains—hash maps, precomputed counts, or even constant values for predicates that admit at most one match—so long as it can provide a quick upper bound. Whenever a binding changes, the engine asks the influenced constraints for fresh estimates. Those estimates are cached per variable and reused until another binding invalidates them, keeping the guidance loop responsive as the search progresses.
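
The refresh loop itself is small. A sketch using the illustrative trait above, where refresh_estimates and the cached vector are hypothetical names:

/// Recompute only the estimates invalidated by the latest binding.
fn refresh_estimates(
    constraints: &[Box<dyn Constraint>],
    binding: &Binding,
    changed: VariableId,
    cached: &mut Vec<usize>,
) {
    for constraint in constraints {
        // Each constraint reports which variables it influences.
        for v in constraint.influence(changed) {
            // Cache the tightest bound any constraint offers for `v`.
            cached[v] = constraints
                .iter()
                .map(|c| c.estimate(v, binding))
                .min()
                .unwrap_or(usize::MAX);
        }
    }
}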

Because the heuristics are derived entirely from the constraints themselves, we do not need a separate query planner or multiple join implementations. Any custom constraint can participate in the same search by providing sensible estimates, proposal generation, confirmation, and influence tracking.

A spectrum of Atreides variants

The Atreides "family" refers to the spectrum of heuristics a constraint can use when implementing Constraint::estimate. Each variant exposes the same guided depth-first search, but with progressively tighter cardinality guidance. In practice they all revisit their estimates when bindings change; what differs is what quantity they approximate:

  • Row-count Join (Jessica) estimates the remaining search volume for the entire constraint. If one variable is bound but two others, b and c, are not, Jessica multiplies the candidate counts for the unbound pair (|b| × |c|) and reports that product. The number can wildly overshoot the next variable's frontier, yet it often tracks the overall work the constraint will perform.
  • Distinct-value Join (Paul) narrows the focus to a single variable at a time. It returns the size of the smallest proposal buffer the constraint could produce for any still-unbound variable, ignoring later confirmation filters. This is the behaviour exercised by Query::new today, which keeps the tightest candidate list on hand while the search walks forward.
  • Partial-binding Join (Ghanima) goes further by measuring the size of the actual proposal the composite constraint can deliver for the current binding and chosen variable. For an and constraint this corresponds to the intersection of its children after they have applied their own filtering, revealing how many candidates truly survive the local checks.
  • Exact-result Join (Leto) is an idealised limit where a constraint predicts how many of those proposed values extend all the way to full results once the remaining variables are also bound. Although no constraint currently achieves this omniscience, the interface supports it conceptually.
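
To make the ladder concrete, here is a sketch of the three implemented rungs for a single binary relation over variables b and c. The struct, its fields, and the method names are illustrative, not the engine's API:

use std::collections::{HashMap, HashSet};

/// Illustrative binary relation constraint between variables b and c.
struct BinaryRelation {
    by_b: HashMap<u64, HashSet<u64>>, // b -> set of matching c values
    distinct_b: usize,                // number of distinct b values
    distinct_c: usize,                // number of distinct c values
}

impl BinaryRelation {
    /// Jessica: remaining search volume for the whole constraint.
    /// With both variables unbound this is |b| × |c|, which can
    /// wildly overshoot either variable's frontier.
    fn estimate_jessica(&self) -> usize {
        self.distinct_b * self.distinct_c
    }

    /// Paul: the size of the smallest proposal buffer the constraint
    /// could produce for any still-unbound variable, ignoring later
    /// confirmation filters.
    fn estimate_paul(&self) -> usize {
        self.distinct_b.min(self.distinct_c)
    }

    /// Ghanima: the size of the actual proposal for c under the
    /// current binding of b, i.e. after this constraint's own
    /// filtering. Leto would go further and count only the values
    /// that survive to full results; no constraint implements that.
    fn estimate_ghanima(&self, bound_b: u64) -> usize {
        self.by_b.get(&bound_b).map_or(0, HashSet::len)
    }
}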

All four share the same implementation machinery; the difference lies in how aggressively estimate compresses the constraint's knowledge. Even when only partial information is available the search still functions, but better estimates steer the traversal directly toward the surviving tuples.

Every constraint can decide which rung of this ladder it occupies. Simple wrappers that only track total counts behave like Jessica, those that surface their tightest per-variable proposals behave like Paul, and structures capable of intersecting their children on the fly approach Ghanima's accuracy. The engine does not need to know which variant it is running—estimate supplies whatever fidelity the data structure can provide, and influence ensures that higher quality estimates refresh when relevant bindings change.

When a query starts, Query::new collects the initial estimates and influence sets, sorts the unbound variables so the tightest constraints are considered first, and caches per-variable proposal buffers that can be reused across backtracking steps. The engine then walks the search space as follows (a sketch follows the steps):

  1. Inspect the unbound variables.
  2. Refresh the cached estimates for any variables whose constraints were influenced by the latest binding.
  3. Pick the next variable to bind by sorting the unbound set on two criteria:
    • the base‑2 logarithm of the estimate (smaller estimates are tried first),
    • the number of other variables the constraints could influence (ties favour the most connected variable, which tends to prune the search faster).
  4. Ask the relevant constraints to propose candidates for that variable. Composite constraints enumerate the tightest member and call confirm on the rest so that each candidate is checked without materialising cross products.
  5. Push the candidates onto a stack and recurse until every variable is bound or the stack runs empty, in which case the engine backtracks.
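
Put together, the walk might look like the sketch below, built on the illustrative Constraint trait from earlier. A real engine consults only the constraints that mention the chosen variable and reads cached estimates instead of recomputing them; both shortcuts are taken here for brevity:

/// Guided depth-first search: bind one variable at a time and
/// stream a result whenever every variable is bound.
fn search(
    constraints: &[Box<dyn Constraint>],
    binding: &mut Binding,
    unbound: &mut Vec<VariableId>,
    on_result: &mut dyn FnMut(&Binding),
) {
    // Every variable bound: stream a result, then backtrack.
    if unbound.is_empty() {
        on_result(binding);
        return;
    }

    // Steps 1-3: pick the unbound variable with the tightest bound.
    // (The engine sorts on the base-2 logarithm of the cached
    // estimates and breaks ties by influence; omitted here.)
    let pos = unbound
        .iter()
        .enumerate()
        .min_by_key(|&(_, &v)| {
            constraints
                .iter()
                .map(|c| c.estimate(v, binding))
                .min()
                .unwrap_or(usize::MAX)
        })
        .map(|(i, _)| i)
        .expect("unbound is non-empty");
    let variable = unbound.swap_remove(pos);

    // Step 4: the tightest constraint enumerates, the rest confirm,
    // so no cross product is ever materialised.
    let mut by_tightness: Vec<&dyn Constraint> =
        constraints.iter().map(|c| c.as_ref()).collect();
    by_tightness.sort_by_key(|c| c.estimate(variable, binding));
    let mut proposals = Vec::new();
    if let Some((tightest, rest)) = by_tightness.split_first() {
        tightest.propose(variable, binding, &mut proposals);
        for c in rest {
            c.confirm(variable, binding, &mut proposals);
        }
    }

    // Step 5: recurse over surviving candidates; unwinding the
    // recursion is the backtracking.
    for value in proposals {
        binding.bound[variable] = Some(value);
        search(constraints, binding, unbound, on_result);
        binding.bound[variable] = None;
    }
    unbound.push(variable);
}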

Traditional databases rely on sorted indexes to make the above iteration tractable. Atreides still performs random lookups when confirming each candidate, but the cardinality hints let it enumerate the most selective constraint sequentially and probe only a handful of values in the wider ones. Because the search is depth-first, the memory footprint stays small and the engine can stream results as soon as they are found.

Consider a query that relates ?person to ?parent and ?city. The search begins with all three variables unbound. If ?city only has a handful of possibilities, its estimate will be the smallest, so the engine binds ?city first. Each city candidate is checked against the parent and person constraints before the search continues, quickly rejecting infeasible branches before the higher-cardinality relationships are explored.

Per-variable estimates in practice

Suppose we want to answer the following query:

(find [?person ?parent ?city]
  [?person :lives-in ?city]
  [?person :parent ?parent]
  [?parent :lives-in ?city])

There are three variables and three constraints. Every constraint can provide a cardinality hint for each variable it touches, and the combined query records the tightest estimate for each variable:

Variable | Contributing constraints (individual estimates)             | Stored estimate
---------|--------------------------------------------------------------|----------------
?person  | ?person :lives-in ?city (12), ?person :parent ?parent (40)  | 12
?parent  | ?person :parent ?parent (40), ?parent :lives-in ?city (6)   | 6
?city    | ?person :lives-in ?city (12), ?parent :lives-in ?city (6)   | 6

The estimates are scoped to individual variables even when no single constraint covers the whole tuple. The engine chooses the variable with the tightest bound, ?parent, and asks the constraints that mention it for proposals. Each candidate parent immediately passes through the ?parent :lives-in ?city constraint, which usually narrows the possible cities to a handful. Those cities, in turn, constrain the possible ?person bindings. If a branch fails — for example because no child of the selected parent lives in the same city — the engine backtracks and tries the next parent. The constraints with the smallest estimates therefore guide the search towards promising combinations and keep the depth-first traversal from thrashing through unrelated values.
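
The bookkeeping behind the table is just a per-variable minimum over the constraints' hints. A self-contained sketch with hypothetical names, reproducing the stored estimates above:

use std::collections::HashMap;

/// Fold per-constraint hints into the tightest bound per variable.
fn tightest_estimates(hints: &[(&str, &[(&str, usize)])]) -> HashMap<String, usize> {
    let mut stored: HashMap<String, usize> = HashMap::new();
    for (_constraint, pairs) in hints {
        for (variable, estimate) in *pairs {
            stored
                .entry((*variable).to_string())
                .and_modify(|e| *e = (*e).min(*estimate))
                .or_insert(*estimate);
        }
    }
    stored
}

fn main() {
    // The hints mirror the table: constraint -> (variable, estimate).
    let hints: &[(&str, &[(&str, usize)])] = &[
        ("?person :lives-in ?city", &[("?person", 12), ("?city", 12)]),
        ("?person :parent ?parent", &[("?person", 40), ("?parent", 40)]),
        ("?parent :lives-in ?city", &[("?parent", 6), ("?city", 6)]),
    ];
    let stored = tightest_estimates(hints);
    assert_eq!(stored["?person"], 12);
    assert_eq!(stored["?parent"], 6);
    assert_eq!(stored["?city"], 6);
}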

Implementation notes

  • During iteration the engine keeps a stack of bound variables, a Binding structure holding the active assignments, and a touched_variables set that marks which estimates need refreshing before the next decision point; these pieces are gathered into a sketch after these notes.
  • Highly skewed data still behaves predictably: even if one attribute dominates the dataset, the other constraints continue to bound the search space tightly and prevent runaway exploration.
  • The per-variable proposal buffers allow repeated proposals without reallocating, which is especially helpful when backtracking over large domains.
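
Gathered into one place, the iteration state described in these notes might look as follows; the struct and field names are illustrative, reusing the stand-in types from the trait sketch:

use std::collections::HashSet;

/// Illustrative iteration state for the depth-first walk.
struct SearchState {
    /// Variables bound so far, in binding order: the backtracking stack.
    stack: Vec<VariableId>,
    /// The active variable-to-value assignments.
    binding: Binding,
    /// Variables whose cached estimates must be refreshed before the
    /// next decision point, as reported by influence.
    touched_variables: HashSet<VariableId>,
    /// Reusable proposal buffers, indexed per variable, so repeated
    /// proposals during backtracking do not reallocate.
    proposal_buffers: Vec<Vec<Value>>,
}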

Why worst-case optimal?

Worst-case optimal join algorithms guarantee that their running time never exceeds, up to logarithmic factors, the largest output the query could produce on any database with the same relation sizes (the AGM bound). For the classic triangle query over three relations of size N that bound is N^(3/2), a budget pairwise join plans can miss by a factor of √N in the worst case. The Atreides family retains this property because the search always explores bindings in an order that honours the cardinality bounds supplied by the constraints. As a result the engine remains robust under heavy skew, sparse joins, and high-dimensional queries without requiring a bespoke join plan for each case.
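
For reference, the bound in question can be stated precisely. For a query Q over relations R_1, …, R_m and any fractional edge cover (x_1, …, x_m) of the query's hypergraph, the output on any database D and the running time of a worst-case optimal algorithm obey:

\[
  |Q(D)| \;\le\; \prod_{i=1}^{m} |R_i|^{x_i},
  \qquad
  T_{\text{join}} \;=\; \tilde{O}\!\Big(\prod_{i=1}^{m} |R_i|^{x_i}\Big).
\]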

This combination of simple heuristics, incremental estimates, and a disciplined search strategy keeps the implementation straightforward while delivering the performance characteristics we need for real-world workloads.