I was reading the documentation (https://substrait.io/relations/physical_relations/) and noticed that we comment on MergeSortJoin that it can be done in a streaming fashion. If this happens within a single system, it is a local implementation detail. But if the overall query plan is split between systems (e.g., Spark delegating part of the computation down to a HW accelerator), whether the subtree executes in a blocking or streaming fashion may affect the latency, communication, and resources that upstream/downstream systems can expect or are required to provide.
I was wondering whether we should add metadata to the physical operators to capture this, for example a boolean indicating whether execution is expected to be streaming or blocking. There are likely other properties (batch sizes, etc.) that would be useful to include for similar reasons. This might already be covered via extensions.
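To make the idea a bit more concrete, here is a minimal sketch in Python of the kind of decoration I have in mind. Every name here (`ExecutionMode`, `ExecutionHints`, `PhysicalOp`, `max_batch_rows`, `subtree_is_streaming`) is hypothetical and made up for illustration; none of it is part of Substrait or any existing API.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class ExecutionMode(Enum):
    """Hypothetical flag: does the operator emit rows incrementally or only at the end?"""
    STREAMING = "streaming"
    BLOCKING = "blocking"


@dataclass(frozen=True)
class ExecutionHints:
    """Hypothetical metadata attached to a physical operator."""
    mode: ExecutionMode
    max_batch_rows: Optional[int] = None  # illustrative extra property


@dataclass
class PhysicalOp:
    """Toy stand-in for a physical relation in a plan tree."""
    name: str
    hints: Optional[ExecutionHints] = None
    inputs: List["PhysicalOp"] = field(default_factory=list)


def subtree_is_streaming(op: PhysicalOp) -> bool:
    """Return True only if every operator in the subtree declares streaming.

    An operator without hints is treated conservatively as blocking.
    """
    if op.hints is None or op.hints.mode is not ExecutionMode.STREAMING:
        return False
    return all(subtree_is_streaming(child) for child in op.inputs)


streaming = ExecutionHints(ExecutionMode.STREAMING, max_batch_rows=4096)
blocking = ExecutionHints(ExecutionMode.BLOCKING)

# Merge join over two inputs that are already sorted: the whole subtree streams.
presorted_plan = PhysicalOp(
    "merge_join",
    streaming,
    [PhysicalOp("scan_left", streaming), PhysicalOp("scan_right", streaming)],
)

# Same join, but one input needs an explicit sort, which is blocking.
sorted_plan = PhysicalOp(
    "merge_join",
    streaming,
    [
        PhysicalOp("sort", blocking, [PhysicalOp("scan_left", streaming)]),
        PhysicalOp("scan_right", streaming),
    ],
)
```

A system receiving a delegated subtree could call something like `subtree_is_streaming(plan)` to decide how much buffering to provision, which is the kind of cross-system expectation I am asking about.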
I can see arguments both for and against including this type of information. To be precise, I am more curious about the community's philosophy here than about a specific need for the streaming/blocking idea itself (we are still exploring, so things are a bit hypothetical; we will hopefully have something more concrete soon).
Thanks,
Carlo