Engineering

Eliminating Redundant Computations in Query Plans with Automatic CTE Detection

A Common Table Expression (CTE) eliminates redundant computation by computing a result once and reusing it across multiple downstream operations

In analytical workloads, it is common to see repeated or near-repeated computations, such as scans, large joins, or aggregates that appear more than once in the plan. These redundancies can inflate query execution time significantly. Common Table Expressions (CTEs) provide a way to compute such results once and reference them multiple times, saving CPU cycles and I/O.

In this post, we will walk through how we implemented automatic detection of similar subgraphs within a query plan in our query optimizer. We will:

  1. Show our approach to detecting equivalent sub-plans in a query graph.
  2. Demonstrate pruning and cost-based decisions.
  3. Generate and insert ephemeral CTEs in the final query plan.
  4. Handle approximate matches by merging sub-plans that differ only slightly.

We take inspiration from a classic paper, "Cost-based optimization of decision support queries using transient-views" (Subramanian and Venkataraman, SIGMOD 1998). Let us dig in.

Understanding Redundant Computations

Analytical workloads often involve complex queries that include repeated or near-repeated computations. These might involve multiple scans of large tables or big joins and aggregates appearing more than once in the plan. For instance, consider a market-basket or pricing analysis query that groups results into buckets based on certain price, coupon, or cost ranges. Instead of grouping all data in a single SELECT statement with CASE or grouping keys, the query might use multiple independent sub-selects. Each sub-select focuses on a distinct quantity band and numeric filter, combining them into a single row for side-by-side comparison of average prices, total counts, and distinct counts per bucket.

Without optimization, each sub-select re-applies filters to the same fact table in slightly different ranges. This redundancy is exactly what a cost-based optimizer or an automatic CTE detection mechanism can eliminate by computing a superset of filters once, storing results in a temporary table, and deriving each sub-bucket from that shared intermediate result.

Let us take a market-basket/pricing-analysis TPC-DS query as an example. This query places the results for buckets defined by certain price, coupon, and cost ranges side by side in one row.

Instead of grouping everything in a single SELECT with a CASE expression or grouping keys, the query uses six independent sub-selects. Each sub-select focuses on a distinct quantity band and numeric filter. The results are then combined into a single row (or a small set of rows) for a side-by-side comparison of average prices, total counts, and distinct counts per bucket.

select 
  * 
from 
  (
    select 
      avg(ss_list_price) B1_LP, 
      count(ss_list_price) B1_CNT, 
      count(distinct ss_list_price) B1_CNTD 
    from 
      store_sales 
    where 
      ss_quantity between 0 
      and 5 
      and (
        ss_list_price between 11 
        and 11 + 10 
        or ss_coupon_amt between 460 
        and 460 + 1000 
        or ss_wholesale_cost between 14 
        and 14 + 20
      )
  ) B1, 
  (
    select 
      avg(ss_list_price) B2_LP, 
      count(ss_list_price) B2_CNT, 
      count(distinct ss_list_price) B2_CNTD 
    from 
      store_sales 
    where 
      ss_quantity between 6 
      and 10 
      and (
        ss_list_price between 91 
        and 91 + 10 
        or ss_coupon_amt between 1430 
        and 1430 + 1000 
        or ss_wholesale_cost between 32 
        and 32 + 20
      )
  ) B2, 
  (
    select 
      avg(ss_list_price) B3_LP, 
      count(ss_list_price) B3_CNT, 
      count(distinct ss_list_price) B3_CNTD 
    from 
      store_sales 
    where 
      ss_quantity between 11 
      and 15 
      and (
        ss_list_price between 66 
        and 66 + 10 
        or ss_coupon_amt between 920 
        and 920 + 1000 
        or ss_wholesale_cost between 4 
        and 4 + 20
      )
  ) B3, 
  (
    select 
      avg(ss_list_price) B4_LP, 
      count(ss_list_price) B4_CNT, 
      count(distinct ss_list_price) B4_CNTD 
    from 
      store_sales 
    where 
      ss_quantity between 16 
      and 20 
      and (
        ss_list_price between 142 
        and 142 + 10 
        or ss_coupon_amt between 3054 
        and 3054 + 1000 
        or ss_wholesale_cost between 80 
        and 80 + 20
      )
  ) B4, 
  (
    select 
      avg(ss_list_price) B5_LP, 
      count(ss_list_price) B5_CNT, 
      count(distinct ss_list_price) B5_CNTD 
    from 
      store_sales 
    where 
      ss_quantity between 21 
      and 25 
      and (
        ss_list_price between 135 
        and 135 + 10 
        or ss_coupon_amt between 14180 
        and 14180 + 1000 
        or ss_wholesale_cost between 38 
        and 38 + 20
      )
  ) B5, 
  (
    select 
      avg(ss_list_price) B6_LP, 
      count(ss_list_price) B6_CNT, 
      count(distinct ss_list_price) B6_CNTD 
    from 
      store_sales 
    where 
      ss_quantity between 26 
      and 30 
      and (
        ss_list_price between 28 
        and 28 + 10 
        or ss_coupon_amt between 2513 
        and 2513 + 1000 
        or ss_wholesale_cost between 42 
        and 42 + 20
      )
  ) B6 
limit 
  100

Query Plan without Automatic CTE detection

As seen in the plan below, the large store_sales table is scanned and filtered repeatedly: each sub-select re-applies filters to the same fact table over a slightly different range. That is exactly the kind of redundancy a cost-based optimizer with automatic CTE detection can factor out, e.g., by computing a superset of the filters once, storing the result in a temporary table, and deriving each sub-bucket from that shared intermediate result.

Computing without Automatic CTE Detection

LogicalSort(fetch=[100])
  LogicalProject(B1_LP=[$0], B1_CNT=[$1], B1_CNTD=[$2], B2_LP=[$15], B2_CNT=[$16], B2_CNTD=[$17], B3_LP=[$12], B3_CNT=[$13], B3_CNTD=[$14], B4_LP=[$9], B4_CNT=[$10], B4_CNTD=[$11], B5_LP=[$6], B5_CNT=[$7], B5_CNTD=[$8], B6_LP=[$3], B6_CNT=[$4], B6_CNTD=[$5])
    LogicalJoin(condition=[true], joinType=[inner])
      LogicalJoin(condition=[true], joinType=[inner])
        LogicalJoin(condition=[true], joinType=[inner])
          LogicalJoin(condition=[true], joinType=[inner])
            LogicalJoin(condition=[true], joinType=[inner])
              LogicalProject(B1_LP=[/($0, $1)], B1_CNT=[$1], B1_CNTD=[$2])
                LogicalAggregate(group=[{}], agg#0=[SUM($0)], agg#1=[COUNT($0)], B1_CNTD=[COUNT(DISTINCT $0)])
                  LogicalProject(ss_list_price=[$11])
                    LogicalFilter(condition=[AND(SEARCH($9, Sarg[[0..5]]), OR(AND(>=($11, 11), <=($11, +(11, 10))), AND(>=($18, 460), <=($18, +(460, 1000))), AND(>=($10, 14), <=($10, +(14, 20)))))])
                      LogicalTableScan(table=[[hive, tpcds_1000, store_sales]])
              LogicalProject(B6_LP=[/($0, $1)], B6_CNT=[$1], B6_CNTD=[$2])
                LogicalAggregate(group=[{}], agg#0=[SUM($0)], agg#1=[COUNT($0)], B6_CNTD=[COUNT(DISTINCT $0)])
                  LogicalProject(ss_list_price=[$11])
                    LogicalFilter(condition=[AND(SEARCH($9, Sarg[[26..30]]), OR(AND(>=($11, 28), <=($11, +(28, 10))), AND(>=($18, 2513), <=($18, +(2513, 1000))), AND(>=($10, 42), <=($10, +(42, 20)))))])
                      LogicalTableScan(table=[[hive, tpcds_1000, store_sales]])
            LogicalProject(B5_LP=[/($0, $1)], B5_CNT=[$1], B5_CNTD=[$2])
              LogicalAggregate(group=[{}], agg#0=[SUM($0)], agg#1=[COUNT($0)], B5_CNTD=[COUNT(DISTINCT $0)])
                LogicalProject(ss_list_price=[$11])
                  LogicalFilter(condition=[AND(SEARCH($9, Sarg[[21..25]]), OR(AND(>=($11, 135), <=($11, +(135, 10))), AND(>=($18, 14180), <=($18, +(14180, 1000))), AND(>=($10, 38), <=($10, +(38, 20)))))])
                    LogicalTableScan(table=[[hive, tpcds_1000, store_sales]])
          LogicalProject(B4_LP=[/($0, $1)], B4_CNT=[$1], B4_CNTD=[$2])
            LogicalAggregate(group=[{}], agg#0=[SUM($0)], agg#1=[COUNT($0)], B4_CNTD=[COUNT(DISTINCT $0)])
              LogicalProject(ss_list_price=[$11])
                LogicalFilter(condition=[AND(SEARCH($9, Sarg[[16..20]]), OR(AND(>=($11, 142), <=($11, +(142, 10))), AND(>=($18, 3054), <=($18, +(3054, 1000))), AND(>=($10, 80), <=($10, +(80, 20)))))])
                  LogicalTableScan(table=[[hive, tpcds_1000, store_sales]])
        LogicalProject(B3_LP=[/($0, $1)], B3_CNT=[$1], B3_CNTD=[$2])
          LogicalAggregate(group=[{}], agg#0=[SUM($0)], agg#1=[COUNT($0)], B3_CNTD=[COUNT(DISTINCT $0)])
            LogicalProject(ss_list_price=[$11])
              LogicalFilter(condition=[AND(SEARCH($9, Sarg[[11..15]]), OR(AND(>=($11, 66), <=($11, +(66, 10))), AND(>=($18, 920), <=($18, +(920, 1000))), AND(>=($10, 4), <=($10, +(4, 20)))))])
                LogicalTableScan(table=[[hive, tpcds_1000, store_sales]])
      LogicalProject(B2_LP=[/($0, $1)], B2_CNT=[$1], B2_CNTD=[$2])
        LogicalAggregate(group=[{}], agg#0=[SUM($0)], agg#1=[COUNT($0)], B2_CNTD=[COUNT(DISTINCT $0)])
          LogicalProject(ss_list_price=[$11])
            LogicalFilter(condition=[AND(SEARCH($9, Sarg[[6..10]]), OR(AND(>=($11, 91), <=($11, +(91, 10))), AND(>=($18, 1430), <=($18, +(1430, 1000))), AND(>=($10, 32), <=($10, +(32, 20)))))])
              LogicalTableScan(table=[[hive, tpcds_1000, store_sales]])

High-level Implementation

Step 1: Building a Directed Graph of the Query Plan

The first step in optimizing query plans involves constructing the logical plan as a directed acyclic graph (DAG). In this graph, each logical operator is represented as a vertex, and edges denote parent-child relationships between operators. This structure allows us to systematically analyze and optimize the query.
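
To make this concrete, below is a minimal, self-contained Java sketch of the same idea. It is not our actual implementation; the PlanNode type and its fields are illustrative stand-ins for the optimizer's RelNode vertices, but it shows how a plan becomes a graph of operator vertices connected by parent-to-child edges. The generated graph for our example query is shown after the sketch.

import java.util.ArrayList;
import java.util.List;

// Illustrative stand-in for a logical operator vertex (e.g., LogicalFilter, LogicalTableScan).
final class PlanNode {
    final String operator;              // operator type
    final String digest;                // normalized description, used later for equivalence checks
    final List<PlanNode> children = new ArrayList<>();

    PlanNode(String operator, String digest, PlanNode... inputs) {
        this.operator = operator;
        this.digest = digest;
        for (PlanNode input : inputs) {
            children.add(input);        // edge: this vertex (parent) -> input (child)
        }
    }
}

public class PlanGraphDemo {
    public static void main(String[] args) {
        // Two buckets of the example query, each scanning and filtering store_sales independently.
        PlanNode scanB1   = new PlanNode("LogicalTableScan", "store_sales");
        PlanNode filterB1 = new PlanNode("LogicalFilter", "ss_quantity BETWEEN 0 AND 5 AND (...)", scanB1);
        PlanNode aggB1    = new PlanNode("LogicalAggregate", "AVG, COUNT, COUNT DISTINCT", filterB1);

        PlanNode scanB2   = new PlanNode("LogicalTableScan", "store_sales");
        PlanNode filterB2 = new PlanNode("LogicalFilter", "ss_quantity BETWEEN 6 AND 10 AND (...)", scanB2);
        PlanNode aggB2    = new PlanNode("LogicalAggregate", "AVG, COUNT, COUNT DISTINCT", filterB2);

        PlanNode root = new PlanNode("LogicalJoin", "condition=true", aggB1, aggB2);
        print(root, 0);
    }

    static void print(PlanNode node, int depth) {
        System.out.println("  ".repeat(depth) + node.operator + "(" + node.digest + ")");
        for (PlanNode child : node.children) {
            print(child, depth + 1);
        }
    }
}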

[CTE COVER] GENERATED GRAPH: 
└─rel#1570:LogicalSort.NONE.[](input=LogicalProject#1568,fetch=100)
  └─rel#1568:LogicalProject.NONE.[](input=LogicalJoin#1566,inputs=0..2,exprs=[$15, $16, $17, $12, $13, $14, $9, $10, $11, $6, $7, $8, $3, $4, $5])
    └─rel#1566:LogicalJoin.NONE.[](left=LogicalJoin#1557,right=LogicalAggregate#1564,condition=true,joinType=inner)
      ├─rel#1557:LogicalJoin.NONE.[](left=LogicalJoin#1548,right=LogicalAggregate#1555,condition=true,joinType=inner)
      │ ├─rel#1548:LogicalJoin.NONE.[](left=LogicalJoin#1539,right=LogicalAggregate#1546,condition=true,joinType=inner)
      │ │ ├─rel#1539:LogicalJoin.NONE.[](left=LogicalJoin#1530,right=LogicalAggregate#1537,condition=true,joinType=inner)
      │ │ │ ├─rel#1530:LogicalJoin.NONE.[](left=LogicalAggregate#1521,right=LogicalAggregate#1528,condition=true,joinType=inner)
      │ │ │ │ ├─rel#1521:LogicalAggregate.NONE.[](input=LogicalProject#1519,group={},B1_LP=AVG($0),B1_CNT=COUNT($0),B1_CNTD=COUNT(DISTINCT $0))
      │ │ │ │ │ └─rel#1519:LogicalProject.NONE.[](input=LogicalFilter#1517,exprs=[$11])
      │ │ │ │ │   └─rel#1517:LogicalFilter.NONE.[](input=LogicalTableScan#1,condition=AND(SEARCH($9, Sarg[[0..5]]), OR(AND(>=($11, 11), <=($11, +(11, 10))), AND(>=($18, 460), <=($18, +(460, 1000))), AND(>=($10, 14), <=($10, +(14, 20))))))
      │ │ │ │ │     └─rel#1:LogicalTableScan.NONE.[](table=[hive, tpcds_1000, store_sales])
      │ │ │ │ └─rel#1528:LogicalAggregate.NONE.[](input=LogicalProject#1526,group={},B6_LP=AVG($0),B6_CNT=COUNT($0),B6_CNTD=COUNT(DISTINCT $0))
      │ │ │ │   └─rel#1526:LogicalProject.NONE.[](input=LogicalFilter#1524,exprs=[$11])
      │ │ │ │     └─rel#1524:LogicalFilter.NONE.[](input=LogicalTableScan#1,condition=AND(SEARCH($9, Sarg[[26..30]]), OR(AND(>=($11, 28), <=($11, +(28, 10))), AND(>=($18, 2513), <=($18, +(2513, 1000))), AND(>=($10, 42), <=($10, +(42, 20))))))
      │ │ │ │       └─rel#1572:LogicalTableScan.NONE.[](table=[hive, tpcds_1000, store_sales])
      │ │ │ └─rel#1537:LogicalAggregate.NONE.[](input=LogicalProject#1535,group={},B5_LP=AVG($0),B5_CNT=COUNT($0),B5_CNTD=COUNT(DISTINCT $0))
      │ │ │   └─rel#1535:LogicalProject.NONE.[](input=LogicalFilter#1533,exprs=[$11])
      │ │ │     └─rel#1533:LogicalFilter.NONE.[](input=LogicalTableScan#1,condition=AND(SEARCH($9, Sarg[[21..25]]), OR(AND(>=($11, 135), <=($11, +(135, 10))), AND(>=($18, 14180), <=($18, +(14180, 1000))), AND(>=($10, 38), <=($10, +(38, 20))))))
      │ │ │       └─rel#1573:LogicalTableScan.NONE.[](table=[hive, tpcds_1000, store_sales])
      │ │ └─rel#1546:LogicalAggregate.NONE.[](input=LogicalProject#1544,group={},B4_LP=AVG($0),B4_CNT=COUNT($0),B4_CNTD=COUNT(DISTINCT $0))
      │ │   └─rel#1544:LogicalProject.NONE.[](input=LogicalFilter#1542,exprs=[$11])
      │ │     └─rel#1542:LogicalFilter.NONE.[](input=LogicalTableScan#1,condition=AND(SEARCH($9, Sarg[[16..20]]), OR(AND(>=($11, 142), <=($11, +(142, 10))), AND(>=($18, 3054), <=($18, +(3054, 1000))), AND(>=($10, 80), <=($10, +(80, 20))))))
      │ │       └─rel#1574:LogicalTableScan.NONE.[](table=[hive, tpcds_1000, store_sales])
      │ └─rel#1555:LogicalAggregate.NONE.[](input=LogicalProject#1553,group={},B3_LP=AVG($0),B3_CNT=COUNT($0),B3_CNTD=COUNT(DISTINCT $0))
      │   └─rel#1553:LogicalProject.NONE.[](input=LogicalFilter#1551,exprs=[$11])
      │     └─rel#1551:LogicalFilter.NONE.[](input=LogicalTableScan#1,condition=AND(SEARCH($9, Sarg[[11..15]]), OR(AND(>=($11, 66), <=($11, +(66, 10))), AND(>=($18, 920), <=($18, +(920, 1000))), AND(>=($10, 4), <=($10, +(4, 20))))))
      │       └─rel#1575:LogicalTableScan.NONE.[](table=[hive, tpcds_1000, store_sales])
      └─rel#1564:LogicalAggregate.NONE.[](input=LogicalProject#1562,group={},B2_LP=AVG($0),B2_CNT=COUNT($0),B2_CNTD=COUNT(DISTINCT $0))
        └─rel#1562:LogicalProject.NONE.[](input=LogicalFilter#1560,exprs=[$11])
          └─rel#1560:LogicalFilter.NONE.[](input=LogicalTableScan#1,condition=AND(SEARCH($9, Sarg[[6..10]]), OR(AND(>=($11, 91), <=($11, +(91, 10))), AND(>=($18, 1430), <=($18, +(1430, 1000))), AND(>=($10, 32), <=($10, +(32, 20))))))
            └─rel#1576:LogicalTableScan.NONE.[](table=[hive, tpcds_1000, store_sales])

Step 2: Level Assignment

To identify potential subgraphs that perform similar work, we assign levels to each vertex in the DAG. The level indicates how “far” a node is from the leaf nodes (typically table scans), which are at level 0. Their immediate parents are at level 1, and so on, until reaching the root of the graph. This level assignment helps group nodes that operate at the same stage of the pipeline, facilitating the identification of redundant computations.

INPUT:  DAG, a directed acyclic graph
OUTPUT: LEVEL_MAP, mapping (node -> level),
        MAX_LEVEL, the highest level found
procedure ASSIGN_LEVELS(node):
  if node has no children (leaf):
    LEVEL_MAP[node] <- 0
    return 0
  maxChildLvl <- -1
  for each child in children(node):
    childLvl <- ASSIGN_LEVELS(child)
    if childLvl > maxChildLvl:
      maxChildLvl <- childLvl
  thisLvl <- maxChildLvl + 1
  LEVEL_MAP[node] <- thisLvl
  return thisLvl
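
For readers who prefer runnable code, here is a minimal Java sketch of the same recursion, using a simplified Node type rather than the optimizer's real plan operators (the names below are illustrative, not our actual classes). Leaves get level 0 and every parent sits one level above its deepest child. The log that follows shows the levels registered for our example plan.

import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

public class LevelAssignmentDemo {
    // Minimal vertex type for the sketch; the real optimizer walks RelNode vertices.
    static final class Node {
        final String name;
        final List<Node> children = new ArrayList<>();
        Node(String name, Node... inputs) { this.name = name; children.addAll(List.of(inputs)); }
    }

    // Identity map: two distinct vertices may print the same digest but must keep separate levels.
    static final Map<Node, Integer> LEVEL_MAP = new IdentityHashMap<>();

    static int assignLevels(Node node) {
        if (node.children.isEmpty()) {          // leaf (typically a table scan)
            LEVEL_MAP.put(node, 0);
            return 0;
        }
        int maxChildLevel = -1;
        for (Node child : node.children) {
            maxChildLevel = Math.max(maxChildLevel, assignLevels(child));
        }
        int level = maxChildLevel + 1;          // one above the deepest child
        LEVEL_MAP.put(node, level);
        return level;
    }

    public static void main(String[] args) {
        Node scan    = new Node("LogicalTableScan");
        Node filter  = new Node("LogicalFilter", scan);
        Node project = new Node("LogicalProject", filter);
        Node agg     = new Node("LogicalAggregate", project);
        int maxLevel = assignLevels(agg);
        System.out.println("MAX_LEVEL = " + maxLevel);   // prints 3 for this small chain
        LEVEL_MAP.forEach((n, lvl) -> System.out.println(n.name + " -> level " + lvl));
    }
}
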
_______________________________[CTE COVER]_______________________________________ 
LEVELS REGISTERED
{
	LEVEL: <10>
		rel#1570:LogicalSort.NONE.[](input=LogicalProject#1568,fetch=100)
	LEVEL: <9>
		rel#1568:LogicalProject.NONE.[](input=LogicalJoin#1566,inputs=0..2,exprs=[$15, $16, $17, $12, $13, $14, $9, $10, $11, $6, $7, $8, $3, $4, $5])
	LEVEL: <8>
		rel#1566:LogicalJoin.NONE.[](left=LogicalJoin#1557,right=LogicalAggregate#1564,condition=true,joinType=inner)
	LEVEL: <7>
		rel#1557:LogicalJoin.NONE.[](left=LogicalJoin#1548,right=LogicalAggregate#1555,condition=true,joinType=inner)
	LEVEL: <6>
		rel#1548:LogicalJoin.NONE.[](left=LogicalJoin#1539,right=LogicalAggregate#1546,condition=true,joinType=inner)
	LEVEL: <5>
		rel#1539:LogicalJoin.NONE.[](left=LogicalJoin#1530,right=LogicalAggregate#1537,condition=true,joinType=inner)
	LEVEL: <4>
		rel#1530:LogicalJoin.NONE.[](left=LogicalAggregate#1521,right=LogicalAggregate#1528,condition=true,joinType=inner)
	LEVEL: <3>
		rel#1564:LogicalAggregate.NONE.[](input=LogicalProject#1562,group={},B2_LP=AVG($0),B2_CNT=COUNT($0),B2_CNTD=COUNT(DISTINCT $0))
		rel#1555:LogicalAggregate.NONE.[](input=LogicalProject#1553,group={},B3_LP=AVG($0),B3_CNT=COUNT($0),B3_CNTD=COUNT(DISTINCT $0))
		rel#1521:LogicalAggregate.NONE.[](input=LogicalProject#1519,group={},B1_LP=AVG($0),B1_CNT=COUNT($0),B1_CNTD=COUNT(DISTINCT $0))
		rel#1528:LogicalAggregate.NONE.[](input=LogicalProject#1526,group={},B6_LP=AVG($0),B6_CNT=COUNT($0),B6_CNTD=COUNT(DISTINCT $0))
		rel#1537:LogicalAggregate.NONE.[](input=LogicalProject#1535,group={},B5_LP=AVG($0),B5_CNT=COUNT($0),B5_CNTD=COUNT(DISTINCT $0))
		rel#1546:LogicalAggregate.NONE.[](input=LogicalProject#1544,group={},B4_LP=AVG($0),B4_CNT=COUNT($0),B4_CNTD=COUNT(DISTINCT $0))
	LEVEL: <2>
		rel#1544:LogicalProject.NONE.[](input=LogicalFilter#1542,exprs=[$11])
		rel#1519:LogicalProject.NONE.[](input=LogicalFilter#1517,exprs=[$11])
		rel#1526:LogicalProject.NONE.[](input=LogicalFilter#1524,exprs=[$11])
		rel#1553:LogicalProject.NONE.[](input=LogicalFilter#1551,exprs=[$11])
		rel#1535:LogicalProject.NONE.[](input=LogicalFilter#1533,exprs=[$11])
		rel#1562:LogicalProject.NONE.[](input=LogicalFilter#1560,exprs=[$11])
	LEVEL: <1>
		rel#1517:LogicalFilter.NONE.[](input=LogicalTableScan#1,condition=AND(SEARCH($9, Sarg[[0..5]]), OR(AND(>=($11, 11), <=($11, +(11, 10))), AND(>=($18, 460), <=($18, +(460, 1000))), AND(>=($10, 14), <=($10, +(14, 20))))))
		rel#1542:LogicalFilter.NONE.[](input=LogicalTableScan#1,condition=AND(SEARCH($9, Sarg[[16..20]]), OR(AND(>=($11, 142), <=($11, +(142, 10))), AND(>=($18, 3054), <=($18, +(3054, 1000))), AND(>=($10, 80), <=($10, +(80, 20))))))
		rel#1560:LogicalFilter.NONE.[](input=LogicalTableScan#1,condition=AND(SEARCH($9, Sarg[[6..10]]), OR(AND(>=($11, 91), <=($11, +(91, 10))), AND(>=($18, 1430), <=($18, +(1430, 1000))), AND(>=($10, 32), <=($10, +(32, 20))))))
		rel#1533:LogicalFilter.NONE.[](input=LogicalTableScan#1,condition=AND(SEARCH($9, Sarg[[21..25]]), OR(AND(>=($11, 135), <=($11, +(135, 10))), AND(>=($18, 14180), <=($18, +(14180, 1000))), AND(>=($10, 38), <=($10, +(38, 20))))))
		rel#1524:LogicalFilter.NONE.[](input=LogicalTableScan#1,condition=AND(SEARCH($9, Sarg[[26..30]]), OR(AND(>=($11, 28), <=($11, +(28, 10))), AND(>=($18, 2513), <=($18, +(2513, 1000))), AND(>=($10, 42), <=($10, +(42, 20))))))
		rel#1551:LogicalFilter.NONE.[](input=LogicalTableScan#1,condition=AND(SEARCH($9, Sarg[[11..15]]), OR(AND(>=($11, 66), <=($11, +(66, 10))), AND(>=($18, 920), <=($18, +(920, 1000))), AND(>=($10, 4), <=($10, +(4, 20))))))
	LEVEL: <0>
		rel#1575:LogicalTableScan.NONE.[](table=[hive, tpcds_1000, store_sales])
		rel#1572:LogicalTableScan.NONE.[](table=[hive, tpcds_1000, store_sales])
		rel#1574:LogicalTableScan.NONE.[](table=[hive, tpcds_1000, store_sales])
		rel#1573:LogicalTableScan.NONE.[](table=[hive, tpcds_1000, store_sales])
		rel#1:LogicalTableScan.NONE.[](table=[hive, tpcds_1000, store_sales])
		rel#1576:LogicalTableScan.NONE.[](table=[hive, tpcds_1000, store_sales])
}

Step 3: Detecting Equivalent (and Approximate) Subgraphs

This is the heart of the approach: for each level in the plan graph, we look for sets of nodes that are “equivalent” or “approximately equivalent.” Exact equivalence is straightforward (both sub-trees have the same operators, same inputs, same filters, same columns, etc.). Approximate equivalence means they match enough to be combined—maybe the filters or projections differ only slightly. 

At each level, we search for sub-plans that are:

  • Exactly equivalent: The same operator type, same filters/expressions, same schema, etc.
  • Approximately equivalent: Operators differ only in minor ways (e.g., a filter threshold). Under certain conditions, we can unify them if the cost model justifies it (e.g., merging filters with OR logic).
INPUT:  LEVEL_MAP, DAG
OUTPUT: EQUIV_SETS[level], each element is a set of "equivalent or approx" nodes
procedure GATHER_EQUIVALENT_SETS():
  for level in 0 to MAX_LEVEL:
    let NODES = all nodes where LEVEL_MAP[node] = level
    for i in 0 to NODES.size - 1:
      source <- NODES[i]
      if not isSupportedType(source):
        continue
      for j in i+1 to NODES.size - 1:
        target <- NODES[j]
        if not isSupportedType(target):
          continue
        exactlySimilar <- isExactlySimilar(source, target)
        approxSimilar <- isApproxSimilar(source, target)
        if exactlySimilar OR (approxSimilar AND approxMatchingEnabled()):
          unifyInSet(source, target, level) 
          // unifyInSet() merges them into a single set for that level
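
Below is a simplified, illustrative Java sketch of the grouping step. It approximates equivalence by comparing a normalized digest string per node, whereas the real isExactlySimilar check compares operator types, inputs, expressions, and schemas, and isApproxSimilar additionally tolerates small differences such as filter ranges. The output that follows shows the single equivalence set found for our example: the six store_sales scans.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class EquivalentSetsDemo {
    // (level, digest) pair for the sketch; the real optimizer compares whole sub-plans.
    record Node(int level, String digest) {}

    // Group nodes per level by digest: each resulting list is one "equivalent set".
    static Map<Integer, List<List<Node>>> gatherEquivalentSets(List<Node> nodes) {
        Map<Integer, Map<String, List<Node>>> byLevelAndDigest = new HashMap<>();
        for (Node node : nodes) {
            byLevelAndDigest
                .computeIfAbsent(node.level(), lvl -> new HashMap<>())
                .computeIfAbsent(node.digest(), d -> new ArrayList<>())
                .add(node);
        }
        Map<Integer, List<List<Node>>> equivSets = new HashMap<>();
        byLevelAndDigest.forEach((level, byDigest) -> byDigest.values().forEach(set -> {
            if (set.size() > 1) {                      // only sets with real reuse are interesting
                equivSets.computeIfAbsent(level, lvl -> new ArrayList<>()).add(set);
            }
        }));
        return equivSets;
    }

    public static void main(String[] args) {
        List<Node> nodes = List.of(
            new Node(0, "LogicalTableScan(store_sales)"),
            new Node(0, "LogicalTableScan(store_sales)"),
            new Node(0, "LogicalTableScan(store_sales)"),
            new Node(1, "LogicalFilter(ss_quantity BETWEEN 0 AND 5)"),
            new Node(1, "LogicalFilter(ss_quantity BETWEEN 6 AND 10)"));
        // Prints one set at level 0 containing the three identical scans; the filters differ, so none at level 1.
        System.out.println(gatherEquivalentSets(nodes));
    }
}
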
_______________________________[CTE COVER]_______________________________________
 EQUIVALENT SETS [Total: 6, Distinct: 1]
{
	LEVEL: <0>
		Set#0
			rel#1574:LogicalTableScan.NONE.[](table=[hive, tpcds_1000, store_sales])
			rel#1575:LogicalTableScan.NONE.[](table=[hive, tpcds_1000, store_sales])
			rel#1572:LogicalTableScan.NONE.[](table=[hive, tpcds_1000, store_sales])
			rel#1573:LogicalTableScan.NONE.[](table=[hive, tpcds_1000, store_sales])
			rel#1:LogicalTableScan.NONE.[](table=[hive, tpcds_1000, store_sales])
			rel#1576:LogicalTableScan.NONE.[](table=[hive, tpcds_1000, store_sales])
}

Step 4: Pruning

Once equivalent sub-plans are identified, the optimizer must decide whether to prune redundant computations and replace them with CTEs. This decision is typically cost-based, meaning the optimizer evaluates whether the benefits of using a CTE (e.g., reduced CPU and I/O usage) outweigh the costs (e.g., additional memory for storing intermediate results).

After we find these candidate sets, we apply additional checks; a simplified version of the cost check is sketched after this list:

  • Cost Model: If computing a single shared sub-plan plus writing it to a temporary table is more expensive than letting each sub-plan run independently, we skip the CTE approach.
  • Row Count Threshold: If the plan is extremely selective (or not selective enough), it might not be worth materializing into a CTE.
  • Nested child sets: If a sub-tree is already captured at a higher level, we don’t want to generate redundant CTEs.
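
At its core, the cost decision is a simple comparison: computing the shared sub-plan once, plus materializing and re-reading it, must be cheaper than computing every duplicate independently. Here is a hedged Java sketch of that check; the cost fields, thresholds, and constants are invented for illustration, while the real model uses the optimizer's own row-count and cost estimates.

public class CteCostCheckDemo {
    // Illustrative cost summary for one sub-plan; real numbers come from the optimizer's estimates.
    record SubPlanCost(double computeCost, double estimatedRows) {}

    // Invented knobs for the sketch.
    static final double WRITE_COST_PER_ROW = 1.0;    // materializing the transient table
    static final double READ_COST_PER_ROW  = 0.2;    // each consumer re-reading it
    static final long   MIN_ROWS = 1;                // skip if almost nothing survives the filter
    static final long   MAX_ROWS = 100_000_000;      // skip if materialization would be huge

    static boolean isBeneficial(SubPlanCost shared, int occurrences) {
        if (shared.estimatedRows() < MIN_ROWS || shared.estimatedRows() > MAX_ROWS) {
            return false;                            // row-count threshold check
        }
        double withoutCte = occurrences * shared.computeCost();
        double withCte = shared.computeCost()
                + WRITE_COST_PER_ROW * shared.estimatedRows()
                + occurrences * READ_COST_PER_ROW * shared.estimatedRows();
        return withCte < withoutCte;
    }

    public static void main(String[] args) {
        // Six buckets scanning store_sales: sharing the scan and filter pays off easily.
        SubPlanCost sharedScan = new SubPlanCost(1_000_000, 50_000);
        System.out.println(isBeneficial(sharedScan, 6));   // true
    }
}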

Step 5: Generating CTEs and Rewriting the Plan

If the cost model says we’re better off factoring out the sub-plans, we create CTEs and rewrite all references to the repeated sub-plan to read from these newly created tables. This final step yields a single unified plan with minimal repeated work.

INPUT:  EQUIV_SETS, DAG, cost model, root node
OUTPUT: updatedRoot (RelNode) with references to ephemeral CTE tables
procedure REPLACE_WITH_CTEs(root):
  // 1. Generate ephemeral tables for each remaining equivalence set
  setToCTE <- {}           // map: set of nodes -> representative ephemeral table 
  nodeToCTE <- {}          // map: node -> ephemeral table
  idx <- 0
  for level in 0..MAX_LEVEL:
    for each set S in EQUIV_SETS[level]:
      if costModel.saysBeneficial(S):
        cteName <- "SYS_CTE_" + idx
        idx <- idx + 1
        sampleNode <- pickOneNode(S)
        // (a) Possibly unify approximate filters or projects across S
        unifiedSubPlan <- unifySubPlan(S, sampleNode)
        // (b) Create ephemeral table referencing unifiedSubPlan
        ephemeralTable <- createE6TempTable(cteName, unifiedSubPlan)
        // (c) Register in maps
        setToCTE[S] <- ephemeralTable
        for node in S:
          nodeToCTE[node] <- ephemeralTable
  // 2. Traverse original root plan, replace each node in nodeToCTE with ephemeralTableRef
  return doCTEReplacementShuttle(root, nodeToCTE)
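
A minimal Java sketch of the final rewrite step: walk the plan and, whenever a node belongs to a chosen equivalence set, splice in a scan of its transient table. The Op and CteScan types below are illustrative stand-ins for the real RelNode shuttle and E6TempTableScan. The log that follows shows the generated SYS_CTE_0 and the six scan replacements for our example.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class CteReplacementDemo {
    // Minimal operator tree for the sketch.
    static class Op {
        final String description;
        final List<Op> inputs = new ArrayList<>();
        Op(String description, Op... children) { this.description = description; inputs.addAll(List.of(children)); }
        public String toString() { return description + inputs; }
    }

    // Stand-in for a scan over an ephemeral CTE table.
    static class CteScan extends Op {
        CteScan(String cteName) { super("TempTableScan[" + cteName + "]"); }
    }

    // Rebuild the tree, replacing any node registered in nodeToCte with a scan of its CTE.
    static Op replaceWithCtes(Op node, Map<Op, String> nodeToCte) {
        if (nodeToCte.containsKey(node)) {
            return new CteScan(nodeToCte.get(node));
        }
        Op copy = new Op(node.description);
        for (Op input : node.inputs) {
            copy.inputs.add(replaceWithCtes(input, nodeToCte));
        }
        return copy;
    }

    public static void main(String[] args) {
        Op scan1 = new Op("Scan(store_sales)");
        Op scan2 = new Op("Scan(store_sales)");
        Op root  = new Op("Join",
                new Op("Filter(q in 0..5)", scan1),
                new Op("Filter(q in 6..10)", scan2));
        // Both scans were placed in the same equivalence set and mapped to SYS_CTE_0;
        // the per-bucket filters stay in place and now read from the shared table.
        Op rewritten = replaceWithCtes(root, Map.of(scan1, "SYS_CTE_0", scan2, "SYS_CTE_0"));
        System.out.println(rewritten);
    }
}
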
_______________________________[CTE COVER]_______________________________________
SYS_CTE_0: 
LogicalProject($f0=[0], $f1=[1], $f2=[2], $f3=[3], $f4=[4], $f5=[5], $f6=[6], $f7=[7], $f8=[8], ss_quantity=[$9], ss_wholesale_cost=[$10], ss_list_price=[$11], $f12=[12], $f13=[13], $f14=[14], $f15=[15], $f16=[16], $f17=[17], ss_coupon_amt=[$18], $f19=[19], $f20=[20], $f21=[21], $f22=[22])
  LogicalFilter(condition=[OR(AND(SEARCH($9, Sarg[[11..15]]), OR(AND(>=($11, 66), <=($11, +(66, 10))), AND(>=($18, 920), <=($18, +(920, 1000))), AND(>=($10, 4), <=($10, +(4, 20))))), AND(SEARCH($9, Sarg[[26..30]]), OR(AND(>=($11, 28), <=($11, +(28, 10))), AND(>=($18, 2513), <=($18, +(2513, 1000))), AND(>=($10, 42), <=($10, +(42, 20))))), AND(SEARCH($9, Sarg[[16..20]]), OR(AND(>=($11, 142), <=($11, +(142, 10))), AND(>=($18, 3054), <=($18, +(3054, 1000))), AND(>=($10, 80), <=($10, +(80, 20))))), AND(SEARCH($9, Sarg[[21..25]]), OR(AND(>=($11, 135), <=($11, +(135, 10))), AND(>=($18, 14180), <=($18, +(14180, 1000))), AND(>=($10, 38), <=($10, +(38, 20))))), AND(SEARCH($9, Sarg[[0..5]]), OR(AND(>=($11, 11), <=($11, +(11, 10))), AND(>=($18, 460), <=($18, +(460, 1000))), AND(>=($10, 14), <=($10, +(14, 20))))), AND(SEARCH($9, Sarg[[6..10]]), OR(AND(>=($11, 91), <=($11, +(91, 10))), AND(>=($18, 1430), <=($18, +(1430, 1000))), AND(>=($10, 32), <=($10, +(32, 20))))))])
    LogicalTableScan(table=[[hive, tpcds_1000, store_sales]])
_______________________________[CTE COVER]_______________________________________
STARTING CTE REPLACEMENT
Replacing: {rel#1:LogicalTableScan.NONE.[](table=[hive, tpcds_1000, store_sales])}
Replacing: {rel#1:LogicalTableScan.NONE.[](table=[hive, tpcds_1000, store_sales])}
Replacing: {rel#1:LogicalTableScan.NONE.[](table=[hive, tpcds_1000, store_sales])}
Replacing: {rel#1:LogicalTableScan.NONE.[](table=[hive, tpcds_1000, store_sales])}
Replacing: {rel#1:LogicalTableScan.NONE.[](table=[hive, tpcds_1000, store_sales])}
Replacing: {rel#1:LogicalTableScan.NONE.[](table=[hive, tpcds_1000, store_sales])}
_______________________________DONE [Total: 6, Distinct: 1]_______________________

Plan with Automatic CTE detection

Now let’s examine the query plan with automatic CTE detection enabled:

Computing with Automatic CTE Detection

LogicalSort(fetch=[100])
  LogicalProject(B1_LP=[$0], B1_CNT=[$1], B1_CNTD=[$2], B2_LP=[$15], B2_CNT=[$16], B2_CNTD=[$17], B3_LP=[$12], B3_CNT=[$13], B3_CNTD=[$14], B4_LP=[$9], B4_CNT=[$10], B4_CNTD=[$11], B5_LP=[$6], B5_CNT=[$7], B5_CNTD=[$8], B6_LP=[$3], B6_CNT=[$4], B6_CNTD=[$5])
    LogicalJoin(condition=[true], joinType=[inner])
      LogicalJoin(condition=[true], joinType=[inner])
        LogicalJoin(condition=[true], joinType=[inner])
          LogicalJoin(condition=[true], joinType=[inner])
            LogicalJoin(condition=[true], joinType=[inner])
              LogicalProject(B1_LP=[/($0, $1)], B1_CNT=[$1], B1_CNTD=[$2])
                LogicalAggregate(group=[{}], agg#0=[SUM($0)], agg#1=[COUNT($0)], B1_CNTD=[COUNT(DISTINCT $0)])
                  LogicalProject(ss_list_price=[$11])
                    LogicalFilter(condition=[AND(SEARCH($9, Sarg[[0..5]]), OR(AND(>=($11, 11), <=($11, +(11, 10))), AND(>=($18, 460), <=($18, +(460, 1000))), AND(>=($10, 14), <=($10, +(14, 20)))))])
                      E6TempTableScan(table=[[e6, hive, SESSION, SYS_CTE_0]])
              LogicalProject(B6_LP=[/($0, $1)], B6_CNT=[$1], B6_CNTD=[$2])
                LogicalAggregate(group=[{}], agg#0=[SUM($0)], agg#1=[COUNT($0)], B6_CNTD=[COUNT(DISTINCT $0)])
                  LogicalProject(ss_list_price=[$11])
                    LogicalFilter(condition=[AND(SEARCH($9, Sarg[[26..30]]), OR(AND(>=($11, 28), <=($11, +(28, 10))), AND(>=($18, 2513), <=($18, +(2513, 1000))), AND(>=($10, 42), <=($10, +(42, 20)))))])
                      E6TempTableScan(table=[[e6, hive, SESSION, SYS_CTE_0]])
            LogicalProject(B5_LP=[/($0, $1)], B5_CNT=[$1], B5_CNTD=[$2])
              LogicalAggregate(group=[{}], agg#0=[SUM($0)], agg#1=[COUNT($0)], B5_CNTD=[COUNT(DISTINCT $0)])
                LogicalProject(ss_list_price=[$11])
                  LogicalFilter(condition=[AND(SEARCH($9, Sarg[[21..25]]), OR(AND(>=($11, 135), <=($11, +(135, 10))), AND(>=($18, 14180), <=($18, +(14180, 1000))), AND(>=($10, 38), <=($10, +(38, 20)))))])
                    E6TempTableScan(table=[[e6, hive, SESSION, SYS_CTE_0]])
          LogicalProject(B4_LP=[/($0, $1)], B4_CNT=[$1], B4_CNTD=[$2])
            LogicalAggregate(group=[{}], agg#0=[SUM($0)], agg#1=[COUNT($0)], B4_CNTD=[COUNT(DISTINCT $0)])
              LogicalProject(ss_list_price=[$11])
                LogicalFilter(condition=[AND(SEARCH($9, Sarg[[16..20]]), OR(AND(>=($11, 142), <=($11, +(142, 10))), AND(>=($18, 3054), <=($18, +(3054, 1000))), AND(>=($10, 80), <=($10, +(80, 20)))))])
                  E6TempTableScan(table=[[e6, hive, SESSION, SYS_CTE_0]])
        LogicalProject(B3_LP=[/($0, $1)], B3_CNT=[$1], B3_CNTD=[$2])
          LogicalAggregate(group=[{}], agg#0=[SUM($0)], agg#1=[COUNT($0)], B3_CNTD=[COUNT(DISTINCT $0)])
            LogicalProject(ss_list_price=[$11])
              LogicalFilter(condition=[AND(SEARCH($9, Sarg[[11..15]]), OR(AND(>=($11, 66), <=($11, +(66, 10))), AND(>=($18, 920), <=($18, +(920, 1000))), AND(>=($10, 4), <=($10, +(4, 20)))))])
                E6TempTableScan(table=[[e6, hive, SESSION, SYS_CTE_0]])
      LogicalProject(B2_LP=[/($0, $1)], B2_CNT=[$1], B2_CNTD=[$2])
        LogicalAggregate(group=[{}], agg#0=[SUM($0)], agg#1=[COUNT($0)], B2_CNTD=[COUNT(DISTINCT $0)])
          LogicalProject(ss_list_price=[$11])
            LogicalFilter(condition=[AND(SEARCH($9, Sarg[[6..10]]), OR(AND(>=($11, 91), <=($11, +(91, 10))), AND(>=($18, 1430), <=($18, +(1430, 1000))), AND(>=($10, 32), <=($10, +(32, 20)))))])
              E6TempTableScan(table=[[e6, hive, SESSION, SYS_CTE_0]])

With automatic CTE detection, the query’s execution time dropped from 83 seconds to 21 seconds, a roughly 75% reduction (about 4x faster)!

Benefits of Automatic CTE

By automatically detecting repeated (or nearly repeated) sub-plan fragments and turning them into transient tables, we can drastically reduce redundant work in complex query plans. This approach:

  • Improves Performance: Especially for expensive joins or aggregates, computing once can be much cheaper than computing multiple times.
  • Is Transparent: The user doesn’t need to write WITH cte AS ( ... ) clauses by hand; the optimizer does it automatically.
  • Handles Approximate Equivalence: We can unify sub-plan variants when the cost model justifies it, which is especially relevant in large queries with slightly different dimensional filters or projections.

Conclusion

Automatic CTE detection is a powerful tool for optimizing query performance by eliminating redundant computations. By constructing a directed graph of the query plan, assigning levels to identify similar subgraphs, detecting equivalent sub-plans, making cost-based decisions, generating ephemeral CTEs, and handling approximate matches, we can significantly improve query efficiency. This approach not only speeds up query execution but also reduces resource usage, making it an essential technique for database administrators and analysts working with complex analytical workloads.

Reference

Subramanian, Subbu N., and Shivakumar Venkataraman. "Cost-based optimization of decision support queries using transient-views." Proceedings of the 1998 ACM SIGMOD international conference on Management of data, 1998.
