Conversation

@davidlghellin
Contributor

Which issue does this PR close?

Rationale for this change

The SingleDistinctToGroupBy optimizer rewrites aggregate functions with DISTINCT into a GROUP BY operation for better performance. However, during this rewrite, it was discarding important aggregate function parameters: null_treatment, filter, and order_by.

This caused queries like ARRAY_AGG(DISTINCT x IGNORE NULLS) to include NULL values in the result because the IGNORE NULLS clause (stored as null_treatment) was being lost during optimization.

What changes are included in this PR?

Preserve aggregate parameters in optimizer: Modified SingleDistinctToGroupBy to extract and preserve null_treatment, filter, and order_by from the original aggregate function when creating the rewritten version.

Add regression test: Added SQL logic test to verify that ARRAY_AGG(DISTINCT x IGNORE NULLS) correctly filters out NULL values.
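To make the difference between the old and new rewrite concrete, here is a minimal, self-contained sketch. It uses a toy `AggCall` struct and a toy `NullTreatment` enum, both hypothetical stand-ins rather than DataFusion's real types, and the function names `rewrite_lossy` and `rewrite_preserving` are illustrative only. The only point is that rebuilding the call with default options silently drops `null_treatment`, while copying the original fields keeps it.

```rust
/// Toy stand-in for an aggregate call. This is NOT DataFusion's real
/// `AggregateFunction`; it only mirrors the fields discussed in this PR.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
struct AggCall {
    func: String,
    args: Vec<String>,
    distinct: bool,
    filter: Option<String>,
    order_by: Vec<String>,
    null_treatment: Option<NullTreatment>,
}

/// Hypothetical mirror of the IGNORE NULLS / RESPECT NULLS clause.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum NullTreatment {
    IgnoreNulls,
    RespectNulls,
}

/// Lossy rewrite: rebuilds the call with default options, which is the
/// shape of the bug described above.
fn rewrite_lossy(original: &AggCall) -> AggCall {
    AggCall {
        func: original.func.clone(),
        args: original.args.clone(),
        distinct: false,
        filter: None,
        order_by: vec![],
        null_treatment: None,
    }
}

/// Preserving rewrite: only `distinct` changes; every other field is
/// carried over from the original call.
fn rewrite_preserving(original: &AggCall) -> AggCall {
    AggCall {
        distinct: false,
        ..original.clone()
    }
}

fn main() {
    let call = AggCall {
        func: "array_agg".to_string(),
        args: vec!["x".to_string()],
        distinct: true,
        filter: None,
        order_by: vec![],
        null_treatment: Some(NullTreatment::IgnoreNulls),
    };
    println!("lossy:      {:?}", rewrite_lossy(&call).null_treatment);
    println!("preserving: {:?}", rewrite_preserving(&call).null_treatment);
}
```

In the real rule the rewritten call also takes a different argument (the group-by alias column); the sketch leaves `args` untouched to keep the focus on the dropped options.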

Files changed:

  • datafusion/optimizer/src/single_distinct_to_groupby.rs: Extract and pass through filter, order_by, and null_treatment parameters
  • datafusion/sqllogictest/test_files/aggregate.slt: Add test case for ARRAY_AGG(DISTINCT ... IGNORE NULLS)

Are these changes tested?

Yes:
  • New SQL logic test in aggregate.slt verifies the fix works end-to-end
  • Existing optimizer tests continue to pass (19 tests in single_distinct_to_groupby)
  • Existing aggregate tests continue to pass (20 tests in array_agg)

Are there any user-facing changes?

Bug fix - Users can now correctly use IGNORE NULLS (and FILTER / ORDER BY) with DISTINCT aggregates:

Before (broken):

SELECT ARRAY_AGG(DISTINCT x IGNORE NULLS) 
FROM (VALUES (1), (2), (NULL), (2), (1)) AS t(x);
-- Result: [2, NULL, 1]  ❌ NULL incorrectly included

After (fixed):

SELECT ARRAY_AGG(DISTINCT x IGNORE NULLS) 
FROM (VALUES (1), (2), (NULL), (2), (1)) AS t(x);
-- Result: [1, 2]  ✅ NULLs correctly filtered
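The intended semantics can be modelled outside the engine. The sketch below is a toy model, not DataFusion code: `array_agg_distinct` is a hypothetical helper, `None` stands in for SQL NULL, and the output is sorted via a `BTreeSet` purely to make the demo deterministic (the real aggregate makes no such ordering promise, as the "before" result above suggests).

```rust
use std::collections::BTreeSet;

/// Toy model of ARRAY_AGG(DISTINCT x [IGNORE NULLS]) over a nullable
/// integer column. `None` stands for SQL NULL. The sorted output is an
/// assumption of this sketch, chosen only for determinism.
fn array_agg_distinct(values: &[Option<i64>], ignore_nulls: bool) -> Vec<Option<i64>> {
    let distinct: BTreeSet<Option<i64>> = values
        .iter()
        .copied()
        // Drop NULLs only when IGNORE NULLS was requested.
        .filter(|v| !(ignore_nulls && v.is_none()))
        .collect();
    distinct.into_iter().collect()
}

fn main() {
    // Same input as the VALUES clause in the example above.
    let x = [Some(1), Some(2), None, Some(2), Some(1)];
    println!("without IGNORE NULLS: {:?}", array_agg_distinct(&x, false));
    println!("with IGNORE NULLS:    {:?}", array_agg_distinct(&x, true));
}
```

With `ignore_nulls` set to `false` the NULL survives deduplication, which matches the buggy output; with `true` it is filtered before aggregation, which is what the fix restores.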

Copilot AI review requested due to automatic review settings January 10, 2026 22:23
@github-actions github-actions bot added optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) labels Jan 10, 2026
Contributor

Copilot AI left a comment


Pull request overview

This PR fixes a bug where the IGNORE NULLS clause was being lost when optimizing ARRAY_AGG(DISTINCT x IGNORE NULLS) queries. The SingleDistinctToGroupBy optimizer was incorrectly discarding the null_treatment, filter, and order_by parameters when rewriting DISTINCT aggregates into GROUP BY operations.

Changes:

  • Modified the optimizer to preserve aggregate function parameters (null_treatment, filter, order_by) during the DISTINCT-to-GROUP-BY transformation
  • Added a regression test to verify ARRAY_AGG(DISTINCT x IGNORE NULLS) correctly filters NULL values

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| datafusion/optimizer/src/single_distinct_to_groupby.rs | Extracts and preserves filter, order_by, and null_treatment parameters when rewriting DISTINCT aggregates |
| datafusion/sqllogictest/test_files/aggregate.slt | Adds regression test for ARRAY_AGG(DISTINCT ... IGNORE NULLS) functionality |

Contributor

@Jefffrey Jefffrey left a comment


Nice spot. I notice the branch below also doesn't carry over these properties; is this something we should fix as well?

} else {
    index += 1;
    let alias_str = format!("alias{index}");
    inner_aggr_exprs.push(
        Expr::AggregateFunction(AggregateFunction::new_udf(
            Arc::clone(&func),
            args,
            false,
            None,
            vec![],
            None,
        ))
        .alias(&alias_str),
    );
    Ok(Expr::AggregateFunction(AggregateFunction::new_udf(
        func,
        vec![col(&alias_str)],
        false,
        None,
        vec![],
        None,
    )))
}

Contributor

@Jefffrey Jefffrey left a comment


Is it possible to find a test case for the 2-phase rewrite cases? 🤔

    func,
    vec![col(&alias_str)],
    false,
    None,
Contributor


There's another case here

Contributor Author


I’ll look into whether we can add a minimal test case covering the 2-phase rewrite scenario.

@davidlghellin
Contributor Author

> Is it possible to find a test case for the 2-phase rewrite cases? 🤔

If I understood correctly, I’ve added a couple of tests covering those cases.

@Jefffrey
Contributor

Looking into it some more, it looks like the two-phase aggregation related changes aren't really needed as they don't affect anything 🤔

The introduced tests don't fail on main, and it seems that's because the only non-distinct aggregates supported in the two-phase aggregation branch are min/max/sum:

if *distinct {
    for e in args {
        fields_set.insert(e);
    }
} else if func.name() != "sum"
    && func.name().to_lowercase() != "min"
    && func.name().to_lowercase() != "max"
{
    return Ok(false);
}

  • See how we bail if we find a non-distinct function that isn't sum/min/max

I guess it doesn't hurt to keep the fix, but it won't actually change anything: IGNORE NULLS has no effect on sum/min/max, and we bail out of the rule if any aggregate has a filter or order_by:

if filter.is_some() || !order_by.is_empty() {
    return Ok(false);
}
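The two bail-out conditions quoted in this comment can be put together in a small standalone sketch. Everything here is hypothetical: `AggCall`, `agg`, and `rule_applies` are toy names, not DataFusion APIs, and the final "at least one DISTINCT aggregate" condition is a simplification of whatever the real rule checks about the distinct arguments. Function names are also compared case-insensitively for simplicity, which differs slightly from the quoted snippet.

```rust
/// Toy stand-in for the fields the eligibility check looks at; this is
/// not DataFusion's real aggregate type.
struct AggCall {
    func: &'static str,
    distinct: bool,
    has_filter: bool,
    has_order_by: bool,
}

/// Small constructor to keep the examples short (hypothetical helper).
fn agg(func: &'static str, distinct: bool, has_filter: bool, has_order_by: bool) -> AggCall {
    AggCall { func, distinct, has_filter, has_order_by }
}

/// Simplified eligibility check modelled on the snippets quoted above:
/// bail on any FILTER or ORDER BY, and bail on any non-distinct
/// aggregate other than sum/min/max.
fn rule_applies(aggs: &[AggCall]) -> bool {
    let mut saw_distinct = false;
    for a in aggs {
        if a.has_filter || a.has_order_by {
            return false;
        }
        if a.distinct {
            saw_distinct = true;
        } else if !matches!(a.func.to_lowercase().as_str(), "sum" | "min" | "max") {
            return false;
        }
    }
    // Assumption of this sketch: the rule needs at least one DISTINCT aggregate.
    saw_distinct
}

fn main() {
    // A lone DISTINCT aggregate is eligible.
    println!("{}", rule_applies(&[agg("array_agg", true, false, false)]));
    // DISTINCT plus a non-distinct sum takes the two-phase branch.
    println!("{}", rule_applies(&[agg("count", true, false, false), agg("sum", false, false, false)]));
    // A non-distinct array_agg makes the rule bail.
    println!("{}", rule_applies(&[agg("count", true, false, false), agg("array_agg", false, false, false)]));
    // Any FILTER makes the rule bail.
    println!("{}", rule_applies(&[agg("count", true, true, false)]));
}
```

Under this model, the only non-distinct aggregates that can reach the two-phase branch are sum/min/max, and none of them can carry a filter or order_by, which is the reasoning behind the observation that the extra changes have no observable effect.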

Labels

optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt)

Development

Successfully merging this pull request may close these issues.

[BUG] not ignore null in ARRAY_AGG with DISTINCT and IGNORE NULLS

2 participants