More metadata about operator execution.


Carlo Aldo Curino

May 19, 2022, 8:24:04 PM
to Substrait
I was reading the documentation (https://substrait.io/relations/physical_relations/) and noticed that we describe MergeSortJoin as something that can be done in a streaming fashion. If this happens within a single system, it is a local implementation detail. But if the overall query plan is split between systems (e.g., Spark delegating part of the computation down to a hardware accelerator), whether the subtree executes in a blocking or streaming fashion may affect the latency, communication, and resources that upstream/downstream systems can expect or are required to provide.
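To make the streaming/blocking distinction concrete, here is a minimal Python sketch of a merge join that emits results incrementally. It is not taken from Substrait or any engine; the function name `streaming_merge_join` and the `(key, value)` row shape are assumptions for illustration. Both inputs are assumed to be already sorted by key, which is what lets the operator avoid buffering a whole input.

```python
from itertools import groupby
from operator import itemgetter


def streaming_merge_join(left, right):
    """Inner-join two iterables of (key, value) rows, each sorted by key.

    Hypothetical illustration: rows are yielded as soon as a matching key
    group is found, so downstream consumers can start before either input
    is exhausted. Only the right-side rows of the current key are buffered.
    """
    left_groups = groupby(left, key=itemgetter(0))
    right_groups = groupby(right, key=itemgetter(0))
    l = next(left_groups, None)
    r = next(right_groups, None)
    while l is not None and r is not None:
        (lkey, lrows), (rkey, rrows) = l, r
        if lkey < rkey:
            l = next(left_groups, None)   # left is behind: advance it
        elif lkey > rkey:
            r = next(right_groups, None)  # right is behind: advance it
        else:
            right_vals = [v for _, v in rrows]  # buffer one key group only
            for _, lv in lrows:
                for rv in right_vals:
                    yield (lkey, lv, rv)
            l = next(left_groups, None)
            r = next(right_groups, None)
```

A blocking operator (for example, one that must sort its input first) could not yield its first row until it had consumed all of its input, which is the difference a split plan would need to know about.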

I was wondering whether we should add metadata decorations to the physical operators to capture, for example, an extra boolean indicating whether execution is expected to be streaming or blocking. There are likely other properties, such as batch sizes, that might also be useful to include for similar reasons. This might already be covered via extensions.

I can see arguments both for and against including this type of consideration. To be precise, I am more curious about the community philosophy around this than about a specific need for the streaming/blocking idea (we are still exploring, so things are a bit hypothetical; we will hopefully get more concrete soon).

Thanks,
Carlo

Jacques Nadeau

May 20, 2022, 12:55:56 PM
to Substrait
For this kind of thing, it sounds like you're talking about an optimization, not a requirement. I suggest that initially you express this as an advanced extension [1] optimization [2]. This is available on all the relational operators. Optimization means that the consumer can ignore the field if it doesn't know how to handle it. Would that be a reasonable way to solve what you are describing?
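As a rough illustration of the "consumer can ignore it" semantics, the sketch below models a relation that carries optional hints keyed by a type identifier. Everything here is hypothetical: `Rel`, `ExecutionHint`, `HINT_TYPE`, and `plan_mode` are plain Python stand-ins, not the Substrait protobuf messages or any library API. The point is only that a consumer that does not recognize the hint falls back to its own default and still produces a correct plan.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set

# Hypothetical type identifier for the hint payload (an assumption).
HINT_TYPE = "example.com/ExecutionHint"


@dataclass
class ExecutionHint:
    """Hypothetical hint payload; None means 'no preference stated'."""
    streaming: Optional[bool] = None
    batch_size: Optional[int] = None


@dataclass
class Rel:
    """Stand-in for a relational operator with an optional hint slot."""
    op: str
    optimization: Dict[str, ExecutionHint] = field(default_factory=dict)


def plan_mode(rel: Rel, known_types: Set[str]) -> str:
    """Choose an execution mode, ignoring hints this consumer does not know."""
    hint = rel.optimization.get(HINT_TYPE) if HINT_TYPE in known_types else None
    if hint is None or hint.streaming is None:
        return "engine-default"  # the hint is optional, so this stays correct
    return "streaming" if hint.streaming else "blocking"


join = Rel("MergeSortJoin", {HINT_TYPE: ExecutionHint(streaming=True)})
print(plan_mode(join, known_types={HINT_TYPE}))  # consumer understands the hint
print(plan_mode(join, known_types=set()))        # consumer ignores the hint
```

Because the hint never changes the result of the query, dropping it affects only performance characteristics, which is what makes it an optimization rather than a requirement.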



Substrait

May 20, 2022, 1:42:20 PM
to Substrait
I agree this is going to be a very important question. I was also recently looking at the idea of a split query plan, and it does seem like there will need to be a fairly complex negotiation between scheduler and worker. Another example would be sequencing/ordering guarantees on the output batches.

I think an advanced extension optimization is an OK place for this, but we'd need to make sure that whatever we come up with for communicating capabilities between engines (if anything) is detailed enough to communicate something like this.

Weston Pace

May 20, 2022, 1:42:59 PM
to Substrait
Sorry, that was me... I am still figuring out Google Groups.

Jacques Nadeau

May 20, 2022, 4:08:49 PM
to Substrait
Hey Weston, sorry about that.

I was trying to fix the settings in Google Groups to behave like Apache lists and failed initially. I think I have it now: messages should come from the person, and reply-to should be the list.
