Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a new rule PinotSeminJoinDistinctProjectRule to apply a distinct to a semi join right side project #14758

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

xiangfu0
Copy link
Contributor

@xiangfu0 xiangfu0 commented Jan 6, 2025

The goal is to apply a distinct on the Semi Join Right side Project Node.
E.g. in ColocatedJoinEngineQuickStart :
the query:

SET useMultistageEngine = true;
SET maxRowsInJoin = 100000000;
SET timeoutMs = 300000;
explain plan for
SELECT
  distinctCount(a.userUUID),
  a.deviceOS
from userAttributes a
  JOIN (
    SELECT DISTINCT userUUID
    FROM userGroups
    WHERE groupUUID = 'group-1'
  ) g ON a.userUUID = g.userUUID
GROUP BY a.deviceOS

The old query plan is:

Execution Plan
LogicalProject(EXPR$0=[$1], deviceOS=[$0])
  PinotLogicalAggregate(group=[{0}], agg#0=[DISTINCTCOUNT($1)], aggType=[FINAL])
    PinotLogicalExchange(distribution=[hash[0]])
      PinotLogicalAggregate(group=[{0}], agg#0=[DISTINCTCOUNT($1)], aggType=[LEAF])
        LogicalJoin(condition=[=($1, $2)], joinType=[semi])
          LogicalProject(deviceOS=[$4], userUUID=[$6])
            LogicalTableScan(table=[[default, userAttributes]])
          PinotLogicalExchange(distribution=[hash[0]], relExchangeType=[PIPELINE_BREAKER])
            LogicalProject(userUUID=[$4])
              LogicalFilter(condition=[=($3, _UTF-8'group-1')])
                LogicalTableScan(table=[[default, userGroups]])

The new query plan is:

Execution Plan
LogicalProject(EXPR$0=[$1], deviceOS=[$0])
  PinotLogicalAggregate(group=[{0}], agg#0=[DISTINCTCOUNT($1)], aggType=[FINAL])
    PinotLogicalExchange(distribution=[hash[0]])
      PinotLogicalAggregate(group=[{0}], agg#0=[DISTINCTCOUNT($1)], aggType=[LEAF])
        LogicalJoin(condition=[=($1, $2)], joinType=[semi])
          LogicalProject(deviceOS=[$4], userUUID=[$6])
            LogicalTableScan(table=[[default, userAttributes]])
          PinotLogicalExchange(distribution=[hash[0]], relExchangeType=[PIPELINE_BREAKER])
            PinotLogicalAggregate(group=[{0}], aggType=[FINAL])
              PinotLogicalExchange(distribution=[hash[0]])
                PinotLogicalAggregate(group=[{4}], aggType=[LEAF])
                  LogicalFilter(condition=[=($3, _UTF-8'group-1')])
                    LogicalTableScan(table=[[default, userGroups]])

@xiangfu0 xiangfu0 force-pushed the semi-join-distinct-project-rule branch from 6ff9572 to df34109 Compare January 6, 2025 04:25
@xiangfu0 xiangfu0 requested a review from Jackie-Jiang January 6, 2025 04:41
@xiangfu0 xiangfu0 added query multi-stage Related to the multi-stage query engine labels Jan 6, 2025
@xiangfu0 xiangfu0 requested a review from kishoreg January 6, 2025 04:46
@codecov-commenter
Copy link

codecov-commenter commented Jan 6, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 63.83%. Comparing base (59551e4) to head (18efb4f).
Report is 1537 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #14758      +/-   ##
============================================
+ Coverage     61.75%   63.83%   +2.08%     
- Complexity      207     1607    +1400     
============================================
  Files          2436     2704     +268     
  Lines        133233   150755   +17522     
  Branches      20636    23291    +2655     
============================================
+ Hits          82274    96238   +13964     
- Misses        44911    47315    +2404     
- Partials       6048     7202    +1154     
Flag Coverage Δ
custom-integration1 100.00% <ø> (+99.99%) ⬆️
integration 100.00% <ø> (+99.99%) ⬆️
integration1 100.00% <ø> (+99.99%) ⬆️
integration2 0.00% <ø> (ø)
java-11 63.81% <100.00%> (+2.11%) ⬆️
java-21 63.73% <100.00%> (+2.10%) ⬆️
skip-bytebuffers-false 63.83% <100.00%> (+2.08%) ⬆️
skip-bytebuffers-true 63.71% <100.00%> (+35.99%) ⬆️
temurin 63.83% <100.00%> (+2.08%) ⬆️
unittests 63.83% <100.00%> (+2.08%) ⬆️
unittests1 56.27% <100.00%> (+9.38%) ⬆️
unittests2 34.15% <17.64%> (+6.42%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@xiangfu0 xiangfu0 force-pushed the semi-join-distinct-project-rule branch from df34109 to 18efb4f Compare January 6, 2025 17:56
@Jackie-Jiang
Copy link
Contributor

Will this rule prevent us from doing:

SELECT ... FROM t1 WHERE t1.col IN (SELECT t2.col FROM t2)

We don't always want to pay overhead of distinct when we already know the value is unique, or there are very few duplicates.

Even without distinct, the result is still correct, just not necessary the most efficient way to execute. For the example query, in order to achieve the desired query plan, we can write it as:

SELECT
  distinctCount(a.userUUID),
  a.deviceOS
FROM userAttributes a WHERE a.userUUID IN
  (
    SELECT DISTINCT userUUID
    FROM userGroups
    WHERE groupUUID = 'group-1'
  )
GROUP BY a.deviceOS

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
multi-stage Related to the multi-stage query engine query
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants