This field, or some variation, is often present in a physical hash join relation. It is missing in Substrait and has been proposed in the past[1][2]. The past discussion led to a lot of confusion (and I think a fair amount of frustration). It is being proposed again in [3] and I would like to avoid the confusion of the past and move quickly on getting this resolved. I would like to discuss this during the community meeting on Jan. 3rd and reach a resolution but feel free to chime in on the ML (it is the holiday season and some cannot make it to Jan 3rd).
# Why does this flag exist?
The anti-join operator is often used as an optimization for certain queries (this is sometimes called "flattening" the subquery). These queries take the following form:
SELECT ... FROM ... WHERE <left_key> NOT IN (SELECT <right_key> FROM ...)
SELECT ... FROM ... WHERE NOT EXISTS (SELECT ... FROM ... WHERE <left_key> = <right_key>)
Because WHERE NOT IN and WHERE NOT EXISTS have slightly different semantics they require slightly different behavior in the anti-join operator.
# What does the flag control?
The flag only controls the behavior of anti-join. A (left) anti-join returns rows from the left when there is no row in the right that satisfies the join expression for the left row. The join expression can return three possible values for a left/right combination. It can return true (satisfies the expression), false (does not satisfy the expression), or null (unknown). This flag controls what happens when the join expression returns null. In a logical sense it answers the question "if the join expression evaluates to null for some row on the right side then should an anti-join return this row?" In a physical (equi-join) sense it answers the question "if the key on the left or the key on the right is null then does the anti join return the row?"
# Problems that have come up in discussion
## A. Either key null or both keys null?
Some query engines (acero/datafusion) have a slightly different flag. This flag is a "null_equals_null" flag which yields the following:
* If set and both keys are null then the candidate is a match
* If one key is null and the other is not then the candidate is not a match
On the other hand, the flag in Velox (to the best of my knowledge) has the following behavior:
* If set and one key is null (on either side) then the candidate is a match
To support these engines do we need both flags?
## B. Physical or logical anti-join?
Another question that has come up in the past is whether this flag belongs in the logical join as well as the physical join.
The argument against the logical join is that you can always fix a logical join by massaging the join expression (e.g. instead of left.x == right.x, if you want nulls, you can say `left.x == right.x AND left.x IS NOT NULL AND right.x IS NOT NULL`. Then we can state that, for a logical anti-join, that we only return rows where the expression evaluates to false.
The argument for the logical join is basically that some tools use the logical plan today (e.g. gluten) to communicate with engines that are thinking physical plans (e.g. velox) and that they can't do that if the property is not in the logical plan.
## C. Some engines don't need this flag?
Other engines, such as duckdb, use mark join which (to my understanding) means they do not need this flag. They can handle null-awareness in other ways. In these engines, the anti join operator only returns rows where the expression evaluates to false. If we introduce a null_aware flag then these engines would need to reject these plans.
## D. Ability to set the flag per-key
It was brought up in today's community meeting that from a mostly theoretical (I believe) standpoint you could actually consider setting this flag on a per-key basis for equijoins that are matching multiple keys (e.g. two keys from the left table and two keys from the right table). On the other hand, all existing engines that I am aware of today do not allow a per-key setting.
## E. Do we need a similar flag for semi join?
There appears to be consensus that semi join never returns rows because a candidate evaluates to NULL. However, there are certain queries where something very much like a semi join is used. For example, "SELECT ... FROM ... WHERE <left_key> IN (SELECT <right_key> FROM ...) OR <some-other-condition>"
In this type of query it can be useful to use a hash table to calculate the WHERE <left_key> IN (...) part in a manner that is very similar to a join. I believe (very fuzzy on the details here) that this is where things like "mark join" (duckdb) and "semi project" (velox) are used. These operators may need to care about null awareness semantics.
Did I miss any discussion points?
Here are my opinions:
A: Can both behaviors solve the NOT IN / NOT EXISTS difference? I don't know but I will try and think through these scenarios before Jan. 3rd. If both behaviors are valid (but different) solutions to the NOT IN / NOT EXISTS problem then I would prefer a single flag (which explains if this is NOT IN or NOT EXISTS) that engines interpret as is appropriate to them. If there is a significant justification for both behaviors then I think we need both flags.
B: I do not think this flag belongs in the logical join operator. Just because some tools are choosing to use logical operators that others are considering physical is not enough justification.
C: If engines don't need this flag then we can still add it. If those engines choose to accept physical plans (DuckDb does not, to my knowledge) then they will need to reject plans that set the flag incorrectly. This is just another symptom of the "no one physical operator will fit all engines" problem. Better to err on the side of a physical operator that is overly expressive than too restrictive in my opinion.
D: Since no engines support a per-key setting I do not believe we should add one. If engines choose to support this in the future then a per-key setting can be an override for the per-operator setting.
E: These "semi project" and "mark join" operators do not exist in Substrait. We should not delay a PR adding null awareness to physical joins while we wait for them.