Eliminating redundant computations with automatic CTE detection
In analytical workloads, it is common to see repeated or near-repeated computations, such as large scans, joins, or aggregates that appear more than once in the plan. These redundancies can significantly inflate query execution time. Common Table Expressions (CTEs) provide a way to compute such results once and reference them multiple times, saving CPU cycles and I/O.
In this post, we will walk through how we implemented automatic detection of similar subgraphs within a query plan in our query optimizer. We take inspiration from a classic paper, "Cost-based optimization of decision support queries using transient-views" by Subramanian and Venkataraman (SIGMOD 1998). Let us dig in.
Analytical workloads often involve complex queries with repeated or near-repeated computations: multiple scans of a large fact table, or the same joins and aggregates appearing more than once in the plan. A typical case is a market-basket or pricing analysis query that groups results into buckets based on price, coupon, or cost ranges.
Let us take such a TPC-DS query (query 28) as an example. It lays out the results for six buckets side by side in one row. Instead of grouping everything in a single SELECT with a CASE expression or grouping keys, the query uses six independent sub-selects. Each sub-select focuses on a distinct quantity band and numeric filter; the results are then combined into a single row (or a small set of rows) for a side-by-side comparison of average price, total count, and distinct count per bucket.
select
  *
from
  (
    select
      avg(ss_list_price) B1_LP,
      count(ss_list_price) B1_CNT,
      count(distinct ss_list_price) B1_CNTD
    from
      store_sales
    where
      ss_quantity between 0 and 5
      and (
        ss_list_price between 11 and 11 + 10
        or ss_coupon_amt between 460 and 460 + 1000
        or ss_wholesale_cost between 14 and 14 + 20
      )
  ) B1,
  (
    select
      avg(ss_list_price) B2_LP,
      count(ss_list_price) B2_CNT,
      count(distinct ss_list_price) B2_CNTD
    from
      store_sales
    where
      ss_quantity between 6 and 10
      and (
        ss_list_price between 91 and 91 + 10
        or ss_coupon_amt between 1430 and 1430 + 1000
        or ss_wholesale_cost between 32 and 32 + 20
      )
  ) B2,
  (
    select
      avg(ss_list_price) B3_LP,
      count(ss_list_price) B3_CNT,
      count(distinct ss_list_price) B3_CNTD
    from
      store_sales
    where
      ss_quantity between 11 and 15
      and (
        ss_list_price between 66 and 66 + 10
        or ss_coupon_amt between 920 and 920 + 1000
        or ss_wholesale_cost between 4 and 4 + 20
      )
  ) B3,
  (
    select
      avg(ss_list_price) B4_LP,
      count(ss_list_price) B4_CNT,
      count(distinct ss_list_price) B4_CNTD
    from
      store_sales
    where
      ss_quantity between 16 and 20
      and (
        ss_list_price between 142 and 142 + 10
        or ss_coupon_amt between 3054 and 3054 + 1000
        or ss_wholesale_cost between 80 and 80 + 20
      )
  ) B4,
  (
    select
      avg(ss_list_price) B5_LP,
      count(ss_list_price) B5_CNT,
      count(distinct ss_list_price) B5_CNTD
    from
      store_sales
    where
      ss_quantity between 21 and 25
      and (
        ss_list_price between 135 and 135 + 10
        or ss_coupon_amt between 14180 and 14180 + 1000
        or ss_wholesale_cost between 38 and 38 + 20
      )
  ) B5,
  (
    select
      avg(ss_list_price) B6_LP,
      count(ss_list_price) B6_CNT,
      count(distinct ss_list_price) B6_CNTD
    from
      store_sales
    where
      ss_quantity between 26 and 30
      and (
        ss_list_price between 28 and 28 + 10
        or ss_coupon_amt between 2513 and 2513 + 1000
        or ss_wholesale_cost between 42 and 42 + 20
      )
  ) B6
limit
  100
As is seen in the plan below, there is repeated scanning or filtering of the large store_sales table—each sub-select re-applies filters to the same fact table in a slightly different range. That’s exactly the kind of redundancy that a cost-based optimizer or an automatic CTE detection mechanism can factor out, e.g., computing a superset of the filters once, storing results in a temporary table, and deriving each sub-bucket from that shared intermediate result.
LogicalSort(fetch=[100])
  LogicalProject(B1_LP=[$0], B1_CNT=[$1], B1_CNTD=[$2], B2_LP=[$15], B2_CNT=[$16], B2_CNTD=[$17], B3_LP=[$12], B3_CNT=[$13], B3_CNTD=[$14], B4_LP=[$9], B4_CNT=[$10], B4_CNTD=[$11], B5_LP=[$6], B5_CNT=[$7], B5_CNTD=[$8], B6_LP=[$3], B6_CNT=[$4], B6_CNTD=[$5])
    LogicalJoin(condition=[true], joinType=[inner])
      LogicalJoin(condition=[true], joinType=[inner])
        LogicalJoin(condition=[true], joinType=[inner])
          LogicalJoin(condition=[true], joinType=[inner])
            LogicalJoin(condition=[true], joinType=[inner])
              LogicalProject(B1_LP=[/($0, $1)], B1_CNT=[$1], B1_CNTD=[$2])
                LogicalAggregate(group=[{}], agg#0=[SUM($0)], agg#1=[COUNT($0)], B1_CNTD=[COUNT(DISTINCT $0)])
                  LogicalProject(ss_list_price=[$11])
                    LogicalFilter(condition=[AND(SEARCH($9, Sarg[[0..5]]), OR(AND(>=($11, 11), <=($11, +(11, 10))), AND(>=($18, 460), <=($18, +(460, 1000))), AND(>=($10, 14), <=($10, +(14, 20)))))])
                      LogicalTableScan(table=[[hive, tpcds_1000, store_sales]])
              LogicalProject(B6_LP=[/($0, $1)], B6_CNT=[$1], B6_CNTD=[$2])
                LogicalAggregate(group=[{}], agg#0=[SUM($0)], agg#1=[COUNT($0)], B6_CNTD=[COUNT(DISTINCT $0)])
                  LogicalProject(ss_list_price=[$11])
                    LogicalFilter(condition=[AND(SEARCH($9, Sarg[[26..30]]), OR(AND(>=($11, 28), <=($11, +(28, 10))), AND(>=($18, 2513), <=($18, +(2513, 1000))), AND(>=($10, 42), <=($10, +(42, 20)))))])
                      LogicalTableScan(table=[[hive, tpcds_1000, store_sales]])
            LogicalProject(B5_LP=[/($0, $1)], B5_CNT=[$1], B5_CNTD=[$2])
              LogicalAggregate(group=[{}], agg#0=[SUM($0)], agg#1=[COUNT($0)], B5_CNTD=[COUNT(DISTINCT $0)])
                LogicalProject(ss_list_price=[$11])
                  LogicalFilter(condition=[AND(SEARCH($9, Sarg[[21..25]]), OR(AND(>=($11, 135), <=($11, +(135, 10))), AND(>=($18, 14180), <=($18, +(14180, 1000))), AND(>=($10, 38), <=($10, +(38, 20)))))])
                    LogicalTableScan(table=[[hive, tpcds_1000, store_sales]])
          LogicalProject(B4_LP=[/($0, $1)], B4_CNT=[$1], B4_CNTD=[$2])
            LogicalAggregate(group=[{}], agg#0=[SUM($0)], agg#1=[COUNT($0)], B4_CNTD=[COUNT(DISTINCT $0)])
              LogicalProject(ss_list_price=[$11])
                LogicalFilter(condition=[AND(SEARCH($9, Sarg[[16..20]]), OR(AND(>=($11, 142), <=($11, +(142, 10))), AND(>=($18, 3054), <=($18, +(3054, 1000))), AND(>=($10, 80), <=($10, +(80, 20)))))])
                  LogicalTableScan(table=[[hive, tpcds_1000, store_sales]])
        LogicalProject(B3_LP=[/($0, $1)], B3_CNT=[$1], B3_CNTD=[$2])
          LogicalAggregate(group=[{}], agg#0=[SUM($0)], agg#1=[COUNT($0)], B3_CNTD=[COUNT(DISTINCT $0)])
            LogicalProject(ss_list_price=[$11])
              LogicalFilter(condition=[AND(SEARCH($9, Sarg[[11..15]]), OR(AND(>=($11, 66), <=($11, +(66, 10))), AND(>=($18, 920), <=($18, +(920, 1000))), AND(>=($10, 4), <=($10, +(4, 20)))))])
                LogicalTableScan(table=[[hive, tpcds_1000, store_sales]])
      LogicalProject(B2_LP=[/($0, $1)], B2_CNT=[$1], B2_CNTD=[$2])
        LogicalAggregate(group=[{}], agg#0=[SUM($0)], agg#1=[COUNT($0)], B2_CNTD=[COUNT(DISTINCT $0)])
          LogicalProject(ss_list_price=[$11])
            LogicalFilter(condition=[AND(SEARCH($9, Sarg[[6..10]]), OR(AND(>=($11, 91), <=($11, +(91, 10))), AND(>=($18, 1430), <=($18, +(1430, 1000))), AND(>=($10, 32), <=($10, +(32, 20)))))])
              LogicalTableScan(table=[[hive, tpcds_1000, store_sales]])
The first step in optimizing query plans involves constructing the logical plan as a directed acyclic graph (DAG). In this graph, each logical operator is represented as a vertex, and edges denote parent-child relationships between operators. This structure allows us to systematically analyze and optimize the query.
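As a minimal sketch (the class and field names here are illustrative, not our actual implementation), the plan DAG can be modeled with a small immutable vertex type holding the operator kind, a canonical digest, and edges to its inputs:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PlanNode:
    """A vertex in the plan DAG: one logical operator plus edges to its inputs."""
    op: str            # operator kind, e.g. "LogicalFilter", "LogicalTableScan"
    digest: str        # canonical description, later used for equivalence checks
    children: tuple = ()

# A tiny two-operator plan: a filter over a table scan.
scan = PlanNode("LogicalTableScan", "scan(store_sales)")
filt = PlanNode("LogicalFilter", "ss_quantity BETWEEN 0 AND 5", (scan,))
```

Because the vertices are immutable and hashable, the very same scan node can be shared by several parents, which is what makes the structure a DAG rather than a tree.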
[CTE COVER] GENERATED GRAPH:
└─rel#1570:LogicalSort.NONE.[](input=LogicalProject#1568,fetch=100)
└─rel#1568:LogicalProject.NONE.[](input=LogicalJoin#1566,inputs=0..2,exprs=[$15, $16, $17, $12, $13, $14, $9, $10, $11, $6, $7, $8, $3, $4, $5])
└─rel#1566:LogicalJoin.NONE.[](left=LogicalJoin#1557,right=LogicalAggregate#1564,condition=true,joinType=inner)
├─rel#1557:LogicalJoin.NONE.[](left=LogicalJoin#1548,right=LogicalAggregate#1555,condition=true,joinType=inner)
│ ├─rel#1548:LogicalJoin.NONE.[](left=LogicalJoin#1539,right=LogicalAggregate#1546,condition=true,joinType=inner)
│ │ ├─rel#1539:LogicalJoin.NONE.[](left=LogicalJoin#1530,right=LogicalAggregate#1537,condition=true,joinType=inner)
│ │ │ ├─rel#1530:LogicalJoin.NONE.[](left=LogicalAggregate#1521,right=LogicalAggregate#1528,condition=true,joinType=inner)
│ │ │ │ ├─rel#1521:LogicalAggregate.NONE.[](input=LogicalProject#1519,group={},B1_LP=AVG($0),B1_CNT=COUNT($0),B1_CNTD=COUNT(DISTINCT $0))
│ │ │ │ │ └─rel#1519:LogicalProject.NONE.[](input=LogicalFilter#1517,exprs=[$11])
│ │ │ │ │ └─rel#1517:LogicalFilter.NONE.[](input=LogicalTableScan#1,condition=AND(SEARCH($9, Sarg[[0..5]]), OR(AND(>=($11, 11), <=($11, +(11, 10))), AND(>=($18, 460), <=($18, +(460, 1000))), AND(>=($10, 14), <=($10, +(14, 20))))))
│ │ │ │ │ └─rel#1:LogicalTableScan.NONE.[](table=[hive, tpcds_1000, store_sales])
│ │ │ │ └─rel#1528:LogicalAggregate.NONE.[](input=LogicalProject#1526,group={},B6_LP=AVG($0),B6_CNT=COUNT($0),B6_CNTD=COUNT(DISTINCT $0))
│ │ │ │ └─rel#1526:LogicalProject.NONE.[](input=LogicalFilter#1524,exprs=[$11])
│ │ │ │ └─rel#1524:LogicalFilter.NONE.[](input=LogicalTableScan#1,condition=AND(SEARCH($9, Sarg[[26..30]]), OR(AND(>=($11, 28), <=($11, +(28, 10))), AND(>=($18, 2513), <=($18, +(2513, 1000))), AND(>=($10, 42), <=($10, +(42, 20))))))
│ │ │ │ └─rel#1572:LogicalTableScan.NONE.[](table=[hive, tpcds_1000, store_sales])
│ │ │ └─rel#1537:LogicalAggregate.NONE.[](input=LogicalProject#1535,group={},B5_LP=AVG($0),B5_CNT=COUNT($0),B5_CNTD=COUNT(DISTINCT $0))
│ │ │ └─rel#1535:LogicalProject.NONE.[](input=LogicalFilter#1533,exprs=[$11])
│ │ │ └─rel#1533:LogicalFilter.NONE.[](input=LogicalTableScan#1,condition=AND(SEARCH($9, Sarg[[21..25]]), OR(AND(>=($11, 135), <=($11, +(135, 10))), AND(>=($18, 14180), <=($18, +(14180, 1000))), AND(>=($10, 38), <=($10, +(38, 20))))))
│ │ │ └─rel#1573:LogicalTableScan.NONE.[](table=[hive, tpcds_1000, store_sales])
│ │ └─rel#1546:LogicalAggregate.NONE.[](input=LogicalProject#1544,group={},B4_LP=AVG($0),B4_CNT=COUNT($0),B4_CNTD=COUNT(DISTINCT $0))
│ │ └─rel#1544:LogicalProject.NONE.[](input=LogicalFilter#1542,exprs=[$11])
│ │ └─rel#1542:LogicalFilter.NONE.[](input=LogicalTableScan#1,condition=AND(SEARCH($9, Sarg[[16..20]]), OR(AND(>=($11, 142), <=($11, +(142, 10))), AND(>=($18, 3054), <=($18, +(3054, 1000))), AND(>=($10, 80), <=($10, +(80, 20))))))
│ │ └─rel#1574:LogicalTableScan.NONE.[](table=[hive, tpcds_1000, store_sales])
│ └─rel#1555:LogicalAggregate.NONE.[](input=LogicalProject#1553,group={},B3_LP=AVG($0),B3_CNT=COUNT($0),B3_CNTD=COUNT(DISTINCT $0))
│ └─rel#1553:LogicalProject.NONE.[](input=LogicalFilter#1551,exprs=[$11])
│ └─rel#1551:LogicalFilter.NONE.[](input=LogicalTableScan#1,condition=AND(SEARCH($9, Sarg[[11..15]]), OR(AND(>=($11, 66), <=($11, +(66, 10))), AND(>=($18, 920), <=($18, +(920, 1000))), AND(>=($10, 4), <=($10, +(4, 20))))))
│ └─rel#1575:LogicalTableScan.NONE.[](table=[hive, tpcds_1000, store_sales])
└─rel#1564:LogicalAggregate.NONE.[](input=LogicalProject#1562,group={},B2_LP=AVG($0),B2_CNT=COUNT($0),B2_CNTD=COUNT(DISTINCT $0))
└─rel#1562:LogicalProject.NONE.[](input=LogicalFilter#1560,exprs=[$11])
└─rel#1560:LogicalFilter.NONE.[](input=LogicalTableScan#1,condition=AND(SEARCH($9, Sarg[[6..10]]), OR(AND(>=($11, 91), <=($11, +(91, 10))), AND(>=($18, 1430), <=($18, +(1430, 1000))), AND(>=($10, 32), <=($10, +(32, 20))))))
└─rel#1576:LogicalTableScan.NONE.[](table=[hive, tpcds_1000, store_sales])
To identify potential subgraphs that perform similar work, we assign levels to each vertex in the DAG. The level indicates how "far" a node is from the leaf nodes (typically table scans), which are at level 0. Their immediate parents are at level 1, and so on, up to the root of the graph. This level assignment groups nodes that operate at the same stage of the pipeline, making it easier to identify redundant computations.
INPUT: DAG, a directed acyclic graph
OUTPUT: LEVEL_MAP, mapping (node -> level),
MAX_LEVEL, the highest level found
procedure ASSIGN_LEVELS(node):
if node has no children (leaf):
LEVEL_MAP[node] <- 0
return 0
maxChildLvl <- -1
for each child in children(node):
childLvl <- ASSIGN_LEVELS(child)
if childLvl > maxChildLvl:
maxChildLvl <- childLvl
thisLvl <- maxChildLvl + 1
LEVEL_MAP[node] <- thisLvl
return thisLvl
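The same procedure in runnable Python (a sketch; unlike the recursive pseudocode above, it memoizes results so a node shared by several parents is visited only once):

```python
def assign_levels(root, children):
    """Return {node: level}, where leaves are level 0 and each parent is
    one more than its deepest child. `children` maps node -> list of inputs."""
    level_map = {}

    def visit(node):
        if node in level_map:          # already computed (DAG, not a tree)
            return level_map[node]
        kids = children.get(node, [])
        lvl = 0 if not kids else 1 + max(visit(c) for c in kids)
        level_map[node] = lvl
        return lvl

    visit(root)
    return level_map

# Example: project -> filter -> scan
children = {"project": ["filter"], "filter": ["scan"], "scan": []}
levels = assign_levels("project", children)
```

Grouping nodes by these levels is what produces the "LEVELS REGISTERED" dump shown below.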
_______________________________[CTE COVER]_______________________________________
LEVELS REGISTERED
{
LEVEL: <10>
rel#1570:LogicalSort.NONE.[](input=LogicalProject#1568,fetch=100)
LEVEL: <9>
rel#1568:LogicalProject.NONE.[](input=LogicalJoin#1566,inputs=0..2,exprs=[$15, $16, $17, $12, $13, $14, $9, $10, $11, $6, $7, $8, $3, $4, $5])
LEVEL: <8>
rel#1566:LogicalJoin.NONE.[](left=LogicalJoin#1557,right=LogicalAggregate#1564,condition=true,joinType=inner)
LEVEL: <7>
rel#1557:LogicalJoin.NONE.[](left=LogicalJoin#1548,right=LogicalAggregate#1555,condition=true,joinType=inner)
LEVEL: <6>
rel#1548:LogicalJoin.NONE.[](left=LogicalJoin#1539,right=LogicalAggregate#1546,condition=true,joinType=inner)
LEVEL: <5>
rel#1539:LogicalJoin.NONE.[](left=LogicalJoin#1530,right=LogicalAggregate#1537,condition=true,joinType=inner)
LEVEL: <4>
rel#1530:LogicalJoin.NONE.[](left=LogicalAggregate#1521,right=LogicalAggregate#1528,condition=true,joinType=inner)
LEVEL: <3>
rel#1564:LogicalAggregate.NONE.[](input=LogicalProject#1562,group={},B2_LP=AVG($0),B2_CNT=COUNT($0),B2_CNTD=COUNT(DISTINCT $0))
rel#1555:LogicalAggregate.NONE.[](input=LogicalProject#1553,group={},B3_LP=AVG($0),B3_CNT=COUNT($0),B3_CNTD=COUNT(DISTINCT $0))
rel#1521:LogicalAggregate.NONE.[](input=LogicalProject#1519,group={},B1_LP=AVG($0),B1_CNT=COUNT($0),B1_CNTD=COUNT(DISTINCT $0))
rel#1528:LogicalAggregate.NONE.[](input=LogicalProject#1526,group={},B6_LP=AVG($0),B6_CNT=COUNT($0),B6_CNTD=COUNT(DISTINCT $0))
rel#1537:LogicalAggregate.NONE.[](input=LogicalProject#1535,group={},B5_LP=AVG($0),B5_CNT=COUNT($0),B5_CNTD=COUNT(DISTINCT $0))
rel#1546:LogicalAggregate.NONE.[](input=LogicalProject#1544,group={},B4_LP=AVG($0),B4_CNT=COUNT($0),B4_CNTD=COUNT(DISTINCT $0))
LEVEL: <2>
rel#1544:LogicalProject.NONE.[](input=LogicalFilter#1542,exprs=[$11])
rel#1519:LogicalProject.NONE.[](input=LogicalFilter#1517,exprs=[$11])
rel#1526:LogicalProject.NONE.[](input=LogicalFilter#1524,exprs=[$11])
rel#1553:LogicalProject.NONE.[](input=LogicalFilter#1551,exprs=[$11])
rel#1535:LogicalProject.NONE.[](input=LogicalFilter#1533,exprs=[$11])
rel#1562:LogicalProject.NONE.[](input=LogicalFilter#1560,exprs=[$11])
LEVEL: <1>
rel#1517:LogicalFilter.NONE.[](input=LogicalTableScan#1,condition=AND(SEARCH($9, Sarg[[0..5]]), OR(AND(>=($11, 11), <=($11, +(11, 10))), AND(>=($18, 460), <=($18, +(460, 1000))), AND(>=($10, 14), <=($10, +(14, 20))))))
rel#1542:LogicalFilter.NONE.[](input=LogicalTableScan#1,condition=AND(SEARCH($9, Sarg[[16..20]]), OR(AND(>=($11, 142), <=($11, +(142, 10))), AND(>=($18, 3054), <=($18, +(3054, 1000))), AND(>=($10, 80), <=($10, +(80, 20))))))
rel#1560:LogicalFilter.NONE.[](input=LogicalTableScan#1,condition=AND(SEARCH($9, Sarg[[6..10]]), OR(AND(>=($11, 91), <=($11, +(91, 10))), AND(>=($18, 1430), <=($18, +(1430, 1000))), AND(>=($10, 32), <=($10, +(32, 20))))))
rel#1533:LogicalFilter.NONE.[](input=LogicalTableScan#1,condition=AND(SEARCH($9, Sarg[[21..25]]), OR(AND(>=($11, 135), <=($11, +(135, 10))), AND(>=($18, 14180), <=($18, +(14180, 1000))), AND(>=($10, 38), <=($10, +(38, 20))))))
rel#1524:LogicalFilter.NONE.[](input=LogicalTableScan#1,condition=AND(SEARCH($9, Sarg[[26..30]]), OR(AND(>=($11, 28), <=($11, +(28, 10))), AND(>=($18, 2513), <=($18, +(2513, 1000))), AND(>=($10, 42), <=($10, +(42, 20))))))
rel#1551:LogicalFilter.NONE.[](input=LogicalTableScan#1,condition=AND(SEARCH($9, Sarg[[11..15]]), OR(AND(>=($11, 66), <=($11, +(66, 10))), AND(>=($18, 920), <=($18, +(920, 1000))), AND(>=($10, 4), <=($10, +(4, 20))))))
LEVEL: <0>
rel#1575:LogicalTableScan.NONE.[](table=[hive, tpcds_1000, store_sales])
rel#1572:LogicalTableScan.NONE.[](table=[hive, tpcds_1000, store_sales])
rel#1574:LogicalTableScan.NONE.[](table=[hive, tpcds_1000, store_sales])
rel#1573:LogicalTableScan.NONE.[](table=[hive, tpcds_1000, store_sales])
rel#1:LogicalTableScan.NONE.[](table=[hive, tpcds_1000, store_sales])
rel#1576:LogicalTableScan.NONE.[](table=[hive, tpcds_1000, store_sales])
}
This is the heart of the approach: for each level in the plan graph, we look for sets of nodes that are “equivalent” or “approximately equivalent.” Exact equivalence is straightforward (both sub-trees have the same operators, same inputs, same filters, same columns, etc.). Approximate equivalence means they match enough to be combined—maybe the filters or projections differ only slightly.
At each level, we search for sub-plans that are exactly equivalent, or approximately equivalent when approximate matching is enabled.
INPUT: LEVEL_MAP, DAG
OUTPUT: EQUIV_SETS[level], each element is a set of "equivalent or approx" nodes
procedure GATHER_EQUIVALENT_SETS():
for level in 0 to MAX_LEVEL:
let NODES = all nodes where LEVEL_MAP[node] = level
for i in 0 to NODES.size - 1:
source <- NODES[i]
if not isSupportedType(source):
continue
for j in i+1 to NODES.size - 1:
target <- NODES[j]
if not isSupportedType(target):
continue
exactlySimilar <- isExactlySimilar(source, target)
approxSimilar <- isApproxSimilar(source, target)
if exactlySimilar OR (approxSimilar AND approxMatchingEnabled()):
unifyInSet(source, target, level)
// unifyInSet() merges them into a single set for that level
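A simplified, runnable version of this step groups the nodes of each level by a canonical digest: nodes sharing a digest are exactly equivalent, and an approximate-matching digest could, for instance, drop filter constants (the digest function below is an illustrative stand-in):

```python
from collections import defaultdict

def gather_equivalent_sets(level_map, digest):
    """Group same-level nodes whose digests match.
    `digest` maps node -> canonical string (equal digest == equivalent)."""
    by_level = defaultdict(lambda: defaultdict(list))
    for node, lvl in level_map.items():
        by_level[lvl][digest(node)].append(node)
    # Keep only sets with more than one member: those are CTE candidates.
    return {
        lvl: [nodes for nodes in groups.values() if len(nodes) > 1]
        for lvl, groups in by_level.items()
    }

# Six scans of the same table collapse into one candidate set at level 0,
# mirroring the "Total: 6, Distinct: 1" log below.
level_map = {f"scan{i}": 0 for i in range(6)}
sets = gather_equivalent_sets(level_map, digest=lambda n: "scan(store_sales)")
```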
_______________________________[CTE COVER]_______________________________________
EQUIVALENT SETS [Total: 6, Distinct: 1]
{
LEVEL: <0>
Set#0
rel#1574:LogicalTableScan.NONE.[](table=[hive, tpcds_1000, store_sales])
rel#1575:LogicalTableScan.NONE.[](table=[hive, tpcds_1000, store_sales])
rel#1572:LogicalTableScan.NONE.[](table=[hive, tpcds_1000, store_sales])
rel#1573:LogicalTableScan.NONE.[](table=[hive, tpcds_1000, store_sales])
rel#1:LogicalTableScan.NONE.[](table=[hive, tpcds_1000, store_sales])
rel#1576:LogicalTableScan.NONE.[](table=[hive, tpcds_1000, store_sales])
}
Once equivalent sub-plans are identified, the optimizer must decide whether to prune redundant computations and replace them with CTEs. This decision is typically cost-based, meaning the optimizer evaluates whether the benefits of using a CTE (e.g., reduced CPU and I/O usage) outweigh the costs (e.g., additional memory for storing intermediate results).
After we find these candidate sets, we apply additional cost-based checks before materializing anything.
If the cost model says we’re better off factoring out the sub-plans, we create CTEs and rewrite all references to the repeated sub-plan to read from these newly created tables. This final step yields a single unified plan with minimal repeated work.
INPUT: EQUIV_SETS, DAG, cost model, root node
OUTPUT: updatedRoot (RelNode) with references to ephemeral CTE tables
procedure REPLACE_WITH_CTEs(root):
// 1. Generate ephemeral tables for each remaining equivalence set
setToCTE <- {} // map: set of nodes -> representative ephemeral table
nodeToCTE <- {} // map: node -> ephemeral table
idx <- 0
for level in 0..MAX_LEVEL:
for each set S in EQUIV_SETS[level]:
if costModel.saysBeneficial(S):
cteName <- "SYS_CTE_" + idx
idx <- idx + 1
sampleNode <- pickOneNode(S)
// (a) Possibly unify approximate filters or projects across S
unifiedSubPlan <- unifySubPlan(S, sampleNode)
// (b) Create ephemeral table referencing unifiedSubPlan
ephemeralTable <- createE6TempTable(cteName, unifiedSubPlan)
// (c) Register in maps
setToCTE[S] <- ephemeralTable
for node in S:
nodeToCTE[node] <- ephemeralTable
// 2. Traverse original root plan, replace each node in nodeToCTE with ephemeralTableRef
return doCTEReplacementShuttle(root, nodeToCTE)
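The replacement shuttle can be sketched as a bottom-up rewrite over the plan: any node registered in the node-to-CTE map is swapped for a reference to its ephemeral table, and everything else is rebuilt around it (the tuple-based node encoding here is illustrative):

```python
def replace_with_ctes(node, node_to_cte, children):
    """Rewrite the plan bottom-up: a node found in node_to_cte becomes a
    CTE table reference; otherwise its children are rewritten recursively."""
    if node in node_to_cte:
        return ("CteScan", node_to_cte[node])
    kids = [replace_with_ctes(c, node_to_cte, children)
            for c in children.get(node, [])]
    return (node, tuple(kids))

# The scan feeding the filter is replaced by a reference to SYS_CTE_0.
children = {"filter": ["scan"], "scan": []}
rewritten = replace_with_ctes("filter", {"scan": "SYS_CTE_0"}, children)
```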
_______________________________[CTE COVER]_______________________________________
SYS_CTE_0:
LogicalProject($f0=[0], $f1=[1], $f2=[2], $f3=[3], $f4=[4], $f5=[5], $f6=[6], $f7=[7], $f8=[8], ss_quantity=[$9], ss_wholesale_cost=[$10], ss_list_price=[$11], $f12=[12], $f13=[13], $f14=[14], $f15=[15], $f16=[16], $f17=[17], ss_coupon_amt=[$18], $f19=[19], $f20=[20], $f21=[21], $f22=[22])
LogicalFilter(condition=[OR(AND(SEARCH($9, Sarg[[11..15]]), OR(AND(>=($11, 66), <=($11, +(66, 10))), AND(>=($18, 920), <=($18, +(920, 1000))), AND(>=($10, 4), <=($10, +(4, 20))))), AND(SEARCH($9, Sarg[[26..30]]), OR(AND(>=($11, 28), <=($11, +(28, 10))), AND(>=($18, 2513), <=($18, +(2513, 1000))), AND(>=($10, 42), <=($10, +(42, 20))))), AND(SEARCH($9, Sarg[[16..20]]), OR(AND(>=($11, 142), <=($11, +(142, 10))), AND(>=($18, 3054), <=($18, +(3054, 1000))), AND(>=($10, 80), <=($10, +(80, 20))))), AND(SEARCH($9, Sarg[[21..25]]), OR(AND(>=($11, 135), <=($11, +(135, 10))), AND(>=($18, 14180), <=($18, +(14180, 1000))), AND(>=($10, 38), <=($10, +(38, 20))))), AND(SEARCH($9, Sarg[[0..5]]), OR(AND(>=($11, 11), <=($11, +(11, 10))), AND(>=($18, 460), <=($18, +(460, 1000))), AND(>=($10, 14), <=($10, +(14, 20))))), AND(SEARCH($9, Sarg[[6..10]]), OR(AND(>=($11, 91), <=($11, +(91, 10))), AND(>=($18, 1430), <=($18, +(1430, 1000))), AND(>=($10, 32), <=($10, +(32, 20))))))])
LogicalTableScan(table=[[hive, tpcds_1000, store_sales]])
_______________________________[CTE COVER]_______________________________________
STARTING CTE REPLACEMENT
Replacing: {rel#1:LogicalTableScan.NONE.[](table=[hive, tpcds_1000, store_sales])}
Replacing: {rel#1:LogicalTableScan.NONE.[](table=[hive, tpcds_1000, store_sales])}
Replacing: {rel#1:LogicalTableScan.NONE.[](table=[hive, tpcds_1000, store_sales])}
Replacing: {rel#1:LogicalTableScan.NONE.[](table=[hive, tpcds_1000, store_sales])}
Replacing: {rel#1:LogicalTableScan.NONE.[](table=[hive, tpcds_1000, store_sales])}
Replacing: {rel#1:LogicalTableScan.NONE.[](table=[hive, tpcds_1000, store_sales])}
_______________________________DONE [Total: 6, Distinct: 1]_______________________
Now let’s examine the query plan with automatic CTE detection enabled:
LogicalSort(fetch=[100])
  LogicalProject(B1_LP=[$0], B1_CNT=[$1], B1_CNTD=[$2], B2_LP=[$15], B2_CNT=[$16], B2_CNTD=[$17], B3_LP=[$12], B3_CNT=[$13], B3_CNTD=[$14], B4_LP=[$9], B4_CNT=[$10], B4_CNTD=[$11], B5_LP=[$6], B5_CNT=[$7], B5_CNTD=[$8], B6_LP=[$3], B6_CNT=[$4], B6_CNTD=[$5])
    LogicalJoin(condition=[true], joinType=[inner])
      LogicalJoin(condition=[true], joinType=[inner])
        LogicalJoin(condition=[true], joinType=[inner])
          LogicalJoin(condition=[true], joinType=[inner])
            LogicalJoin(condition=[true], joinType=[inner])
              LogicalProject(B1_LP=[/($0, $1)], B1_CNT=[$1], B1_CNTD=[$2])
                LogicalAggregate(group=[{}], agg#0=[SUM($0)], agg#1=[COUNT($0)], B1_CNTD=[COUNT(DISTINCT $0)])
                  LogicalProject(ss_list_price=[$11])
                    LogicalFilter(condition=[AND(SEARCH($9, Sarg[[0..5]]), OR(AND(>=($11, 11), <=($11, +(11, 10))), AND(>=($18, 460), <=($18, +(460, 1000))), AND(>=($10, 14), <=($10, +(14, 20)))))])
                      E6TempTableScan(table=[[e6, hive, SESSION, SYS_CTE_0]])
              LogicalProject(B6_LP=[/($0, $1)], B6_CNT=[$1], B6_CNTD=[$2])
                LogicalAggregate(group=[{}], agg#0=[SUM($0)], agg#1=[COUNT($0)], B6_CNTD=[COUNT(DISTINCT $0)])
                  LogicalProject(ss_list_price=[$11])
                    LogicalFilter(condition=[AND(SEARCH($9, Sarg[[26..30]]), OR(AND(>=($11, 28), <=($11, +(28, 10))), AND(>=($18, 2513), <=($18, +(2513, 1000))), AND(>=($10, 42), <=($10, +(42, 20)))))])
                      E6TempTableScan(table=[[e6, hive, SESSION, SYS_CTE_0]])
            LogicalProject(B5_LP=[/($0, $1)], B5_CNT=[$1], B5_CNTD=[$2])
              LogicalAggregate(group=[{}], agg#0=[SUM($0)], agg#1=[COUNT($0)], B5_CNTD=[COUNT(DISTINCT $0)])
                LogicalProject(ss_list_price=[$11])
                  LogicalFilter(condition=[AND(SEARCH($9, Sarg[[21..25]]), OR(AND(>=($11, 135), <=($11, +(135, 10))), AND(>=($18, 14180), <=($18, +(14180, 1000))), AND(>=($10, 38), <=($10, +(38, 20)))))])
                    E6TempTableScan(table=[[e6, hive, SESSION, SYS_CTE_0]])
          LogicalProject(B4_LP=[/($0, $1)], B4_CNT=[$1], B4_CNTD=[$2])
            LogicalAggregate(group=[{}], agg#0=[SUM($0)], agg#1=[COUNT($0)], B4_CNTD=[COUNT(DISTINCT $0)])
              LogicalProject(ss_list_price=[$11])
                LogicalFilter(condition=[AND(SEARCH($9, Sarg[[16..20]]), OR(AND(>=($11, 142), <=($11, +(142, 10))), AND(>=($18, 3054), <=($18, +(3054, 1000))), AND(>=($10, 80), <=($10, +(80, 20)))))])
                  E6TempTableScan(table=[[e6, hive, SESSION, SYS_CTE_0]])
        LogicalProject(B3_LP=[/($0, $1)], B3_CNT=[$1], B3_CNTD=[$2])
          LogicalAggregate(group=[{}], agg#0=[SUM($0)], agg#1=[COUNT($0)], B3_CNTD=[COUNT(DISTINCT $0)])
            LogicalProject(ss_list_price=[$11])
              LogicalFilter(condition=[AND(SEARCH($9, Sarg[[11..15]]), OR(AND(>=($11, 66), <=($11, +(66, 10))), AND(>=($18, 920), <=($18, +(920, 1000))), AND(>=($10, 4), <=($10, +(4, 20)))))])
                E6TempTableScan(table=[[e6, hive, SESSION, SYS_CTE_0]])
      LogicalProject(B2_LP=[/($0, $1)], B2_CNT=[$1], B2_CNTD=[$2])
        LogicalAggregate(group=[{}], agg#0=[SUM($0)], agg#1=[COUNT($0)], B2_CNTD=[COUNT(DISTINCT $0)])
          LogicalProject(ss_list_price=[$11])
            LogicalFilter(condition=[AND(SEARCH($9, Sarg[[6..10]]), OR(AND(>=($11, 91), <=($11, +(91, 10))), AND(>=($18, 1430), <=($18, +(1430, 1000))), AND(>=($10, 32), <=($10, +(32, 20)))))])
              E6TempTableScan(table=[[e6, hive, SESSION, SYS_CTE_0]])
With automatic CTE detection, the query’s execution time dropped from 83 seconds to 21 seconds, a ~75% improvement!
By automatically detecting repeated (or nearly repeated) sub-plan fragments and turning them into transient tables, we can drastically reduce redundant work in complex query plans.
Automatic CTE detection is a powerful tool for optimizing query performance by eliminating redundant computations. By constructing a directed graph of the query plan, assigning levels to identify similar subgraphs, detecting equivalent sub-plans, making cost-based decisions, generating ephemeral CTEs, and handling approximate matches, we can significantly improve query efficiency. This approach not only speeds up query execution but also reduces resource usage, making it an essential technique for database administrators and analysts working with complex analytical workloads.
Subramanian, Subbu N., and Shivakumar Venkataraman. "Cost-based optimization of decision support queries using transient-views." Proceedings of the 1998 ACM SIGMOD international conference on Management of data, 1998.