-
Notifications
You must be signed in to change notification settings - Fork 1.9k
fix: null in array_agg with DISTINCT and IGNORE #19736
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR fixes a bug where the IGNORE NULLS clause was being lost when optimizing ARRAY_AGG(DISTINCT x IGNORE NULLS) queries. The SingleDistinctToGroupBy optimizer was incorrectly discarding the null_treatment, filter, and order_by parameters when rewriting DISTINCT aggregates into GROUP BY operations.
Changes:
- Modified the optimizer to preserve aggregate function parameters (
null_treatment,filter,order_by) during the DISTINCT-to-GROUP-BY transformation - Added a regression test to verify
ARRAY_AGG(DISTINCT x IGNORE NULLS)correctly filters NULL values
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| datafusion/optimizer/src/single_distinct_to_groupby.rs | Extracts and preserves filter, order_by, and null_treatment parameters when rewriting DISTINCT aggregates |
| datafusion/sqllogictest/test_files/aggregate.slt | Adds regression test for ARRAY_AGG(DISTINCT ... IGNORE NULLS) functionality |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Jefffrey
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice spot. I notice in the branch below it also doesn't carry over the properties, is this something we should also fix?
datafusion/datafusion/optimizer/src/single_distinct_to_groupby.rs
Lines 212 to 234 in 458b491
| } else { | |
| index += 1; | |
| let alias_str = format!("alias{index}"); | |
| inner_aggr_exprs.push( | |
| Expr::AggregateFunction(AggregateFunction::new_udf( | |
| Arc::clone(&func), | |
| args, | |
| false, | |
| None, | |
| vec![], | |
| None, | |
| )) | |
| .alias(&alias_str), | |
| ); | |
| Ok(Expr::AggregateFunction(AggregateFunction::new_udf( | |
| func, | |
| vec![col(&alias_str)], | |
| false, | |
| None, | |
| vec![], | |
| None, | |
| ))) | |
| } |
Jefffrey
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to find a test case for the 2-phase rewrite cases? 🤔
| func, | ||
| vec![col(&alias_str)], | ||
| false, | ||
| None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's another case here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’ll look into whether we can add a minimal test case covering the 2-phase rewrite scenario.
If I understood correctly, I’ve added a couple of tests covering those cases. |
|
Looking into it some more, it looks like the two-phase aggregation related changes aren't really needed as they don't affect anything 🤔 The introduced tests don't fail on main, and it seems its because the only supported aggregates for the two-phase aggregation branch are min/max/sum: datafusion/datafusion/optimizer/src/single_distinct_to_groupby.rs Lines 85 to 94 in f9697c1
I guess it doesn't hurt to keep the fix but they won't actually affect anything (since datafusion/datafusion/optimizer/src/single_distinct_to_groupby.rs Lines 81 to 83 in f9697c1
|
Which issue does this PR close?
DISTINCTandIGNORE NULLS#19735.Rationale for this change
The
SingleDistinctToGroupByoptimizer rewrites aggregate functions withDISTINCTinto aGROUP BYoperation for better performance. However, during this rewrite, it was discarding important aggregate function parameters:null_treatment,filter, andorder_by.This caused queries like
ARRAY_AGG(DISTINCT x IGNORE NULLS)to include NULL values in the result because theIGNORE NULLSclause (stored as null_treatment) was being lost during optimization.What changes are included in this PR?
Preserve aggregate parameters in optimizer: Modified
SingleDistinctToGroupByto extract and preservenull_treatment,filter, andorder_byfrom the original aggregate function when creating the rewritten version.Add regression test: Added SQL logic test to verify that
ARRAY_AGG(DISTINCT x IGNORE NULLS)correctly filters out NULL values.Files changed:
datafusion/optimizer/src/single_distinct_to_groupby.rs: Extract and pass through filter, order_by, and null_treatment parameters
datafusion/sqllogictest/test_files/aggregate.slt: Add test case for ARRAY_AGG(DISTINCT ... IGNORE NULLS)
Are these changes tested?
Yes:
New SQL logic test in aggregate.slt verifies the fix works end-to-end
Existing optimizer tests continue to pass (19 tests in
single_distinct_to_groupby)Existing aggregate tests continue to pass (20 tests in
array_agg)Are there any user-facing changes?
Bug fix - Users can now correctly use
IGNORE NULLS(andFILTER/ORDER BY) withDISTINCTaggregates:Before (broken):
After (fixed):