Reminder: Substrait Community Sync Meeting Wednesday


Weston Pace

Jul 19, 2022, 9:19:37 PM
to Substrait

Carlo Aldo Curino

Aug 3, 2022, 2:28:02 PM
to subs...@googlegroups.com
Folks, 

The Teams-based system we use has recordings and transcripts. Do we want to post the transcripts somewhere? The discussion summary can be more succinct and reflect just the outcomes (thanks Weston for driving this BTW), and we can have a searchable version of the entire word-by-word transcript somewhere.

Thanks,
Carlo

--
You received this message because you are subscribed to the Google Groups "substrait" group.
To unsubscribe from this group and stop receiving emails from it, send an email to substrait+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/substrait/CAE4AYb2mhDGmpb7g1EYwRTJKga0jmR84nOdUcOFXz3Nudnag1A%40mail.gmail.com.

Jacques Nadeau

Aug 3, 2022, 2:33:49 PM
to Substrait
How about we add a sub-page for sync that lists when the sync is, how to get access, as well as a list of past sessions and links to each transcript?

Carlo Aldo Curino

Aug 3, 2022, 4:16:47 PM
to subs...@googlegroups.com, Ashvin
We could do that. If we get spam-y folks joining in to troll, we can revert to a more private way to get the link, I guess. @Ashvin, do you think you could do the linking of the transcripts?

Thanks,
Carlo

Jacques Nadeau

Aug 3, 2022, 5:19:43 PM
to Substrait, Ashvin
I'm fine if the actual key requires a request. Just figured everything else could be public.

Weston Pace

Aug 3, 2022, 7:17:03 PM
to Substrait, Ashvin
I believe Ashvin has been adding a link to the transcripts on the
agenda page (at least, he did for the July 6th meeting). This page
here: https://docs.google.com/document/d/1Afg93ojsVWdwo3rBO2Dtng8__RYqAy_zJMrnqhTQapc/edit#heading=h.vhmt9wvczmu8

Speaking of notes, here are my notes from today's meeting:

[Nic]: Gave brief intro of dplyr / R integration and goals on request

Re URIs

[Weston]: Originally concerned about integrating Acero with various
tools that don’t agree on what the URI should be at the moment. Ended
up deciding to treat a URI of / or <EMPTY> as a wildcard and falling
back to name-only matching.
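
A minimal sketch of the wildcard fallback described above, not Acero's actual code. The registry layout, the function name `resolve_function`, the example URIs, and the choice to raise an error on an ambiguous name are all assumptions made for illustration.

```python
WILDCARD_URIS = {"", "/"}  # assumption: "<EMPTY>" means the empty string


def resolve_function(registry, uri, name):
    """Resolve a function by (uri, name), treating an empty or "/" URI as a wildcard.

    `registry` is a hypothetical dict mapping (uri, name) -> implementation.
    """
    if uri is None or uri in WILDCARD_URIS:
        # Name-only matching: ignore the URI part of every registry key.
        matches = [impl for (_, fn_name), impl in registry.items() if fn_name == name]
        if not matches:
            raise KeyError(name)
        if len(matches) > 1:
            # Assumption: ambiguity is an error; the notes do not say how it is handled.
            raise LookupError(f"function name {name!r} is ambiguous without a URI")
        return matches[0]
    # A concrete URI requires an exact (uri, name) match.
    return registry[(uri, name)]


# Hypothetical registry contents, for illustration only.
registry = {
    ("https://example.com/functions_arithmetic.yaml", "add"): "add_impl",
    ("https://example.com/functions_string.yaml", "concat"): "concat_impl",
}
```

With this registry, `resolve_function(registry, "/", "add")` and `resolve_function(registry, "", "add")` both resolve by name alone, while a concrete URI must match exactly.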

[Jacques]: Some concern about using github URIs as they are not
guaranteed to be stable, or at least, we don’t “fully own” the URI.

[Carlo]: We could maybe just check with Github and see if they think
that URI might ever change.

[Jacques]: I don’t think this will be super controversial. Perhaps
let’s just pick something. Weston, do you want to just create a PR?

[Jacques]: Formalizing the versioning scheme might also be a good idea.

[Weston]: See links section

Relations with multiple / shared outputs

[Jacques]: Problem statement: there is currently no formal way of
defining a “tee” or a node with multiple outputs.

[Jacques]: My plan was originally to use multiple top-level plans.
The shared part would be a top-level plan. The dependent parts would
then reference the other plan. However, this “reference” capability
hasn’t been defined yet. Does this work?

[Carlo]: So to represent an arbitrary DAG we encode it as multiple
trees that reference each other.

[Jacques]: Yes. A single “Plan” message can have multiple “PlanRel”
objects. Then we would create a new operator that can refer to
another relation. This would be a reference to a PlanRel.

[Jacques]: Currently everyone just has one root rel but this
implementation would rely on people having one or more non-root rels.
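
A rough sketch of the "multiple trees that reference each other" idea, using plain Python dataclasses rather than the Substrait protobuf. The `Reference` operator, its `rel_index` field, and the `roots` list are hypothetical names; the notes say the reference capability had not been defined yet.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Read:
    table: str


@dataclass
class Filter:
    input: "Rel"
    condition: str


@dataclass
class Reference:
    """Hypothetical operator: refers to another relation in the same plan by index."""
    rel_index: int


Rel = Union[Read, Filter, Reference]


@dataclass
class Plan:
    relations: List[Rel]  # stands in for the list of PlanRel objects
    roots: List[int]      # assumption: indices of the relations that are outputs


def count_references(plan: Plan) -> List[int]:
    """Count how many Reference nodes point at each relation in the plan."""
    counts = [0] * len(plan.relations)

    def visit(rel: Rel) -> None:
        if isinstance(rel, Reference):
            counts[rel.rel_index] += 1
        elif isinstance(rel, Filter):
            visit(rel.input)

    for rel in plan.relations:
        visit(rel)
    return counts


# A "tee": relation 0 is the shared non-root part, relations 1 and 2 both consume it.
shared = Filter(Read("orders"), "amount > 100")
plan = Plan(
    relations=[
        shared,
        Filter(Reference(0), "region = 'EU'"),
        Filter(Reference(0), "region = 'US'"),
    ],
    roots=[1, 2],
)
```

Here relation 0 is referenced twice and is not itself a root, which is the "one or more non-root rels" shape mentioned above.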

[Carlo]: We have both (DAG and collection of merged trees) at
Microsoft. The referenced-trees approach seems like it might be more
flexible / powerful since this idea of references might be similar to
a mechanism used to refer to a named view. I can imagine people coming
from a Spark DAG world might struggle more.

[Jacques]: I don’t think it will be all that burdensome. This also
fits patterns where you can create a reference even if you don’t have
to. For example, if you have a CTE, even if it isn’t shared, this
approach might be clearer.

[Carlo]: No concern, I’ve found that Substrait always ends up being
correct, but often the opposite of my intuition. My only worry is
people thinking DAGs are not supported because they don’t intuitively
understand. Maybe there is some syntactic sugar we can do in a
viewer.

[Jacques]: One related question is “in two years, how much are people
going to be working with the protobuf, versus working with the tools?”
For example, when doing Arrow, we found that the Java implementation
could paper over spec oddities to make things more natural for the
Java language. Now people’s first interaction tends to be with the
libraries and not with the format. So I’m not as worried about
novices encountering Substrait in the future, but this is hard to
predict.

[Jacques]: As another example, when we worked on Isthmus, we realized
it would be very useful to have a representation of Substrait that was
tree based, with a proper tree structure.

[Various]: Some discussion on different projects and how they use an
intermediate representation as well.

[Jacques]: I think we need some more documentation / definition, at
least on the reference node.

DDL PR needs love

[Jacques]: Still don’t understand why the write rel has an output.

[Carlo]: The write rel prepares the data and there could be multiple
physical points where it is written. The write rel could then be
shared across multiple outputs.

[Jacques]: So if you want to write to two different places you would
have a single write rel plan and then an arbitrary number of other
rels which take this write and persist the output.

[Carlo]: The input rel tells me what tuples to operate on. E.g. for
delete, the input is telling us what 20 tuples we want to delete. The
output in this case could be the number of tuples affected by this
operation. In other cases, people want to have the deleted tuples
returned so that they can do some kind of assert, filter, etc. In SQL
Server you can even do something like “in a query that updates
salaries, compute the total difference between the ‘before salaries’
and the ‘after salaries’.”

[Jacques]: My inclination would be to decompose this. So the simplest
way to do this would be to say that the write rel has the ability to
pass through tuples and you can stack whatever operations you want to
work on this.

[Carlo]: This works for certain deletes. I don’t think it works for
the updates. Some systems want to be able to support both the before
and after image of the tuples updated.

[Jacques]: Maybe we have passthrough and before/after as two different
modes. The before/after view could, for example, return a struct with
before and after keys. This seems preferred to having subtrees within
the operator because then you would need to come up with a way to
reference the input.

[Carlo]: Let’s pretend DELETE FROM foo WHERE X > 10. This would have
a subplan SELECT FROM foo WHERE X > 10. An afterimage mode seems odd.

[Jacques]: I had not understood that your decision was for this to be
an arbitrary tree with no references.

[Carlo/Jacques]: Some discussion on this PR

[Jacques]: Would it be sufficient to have the output be # of tuples
modified, before data, and after data?

[Carlo]: Perhaps a simpler version is that the write rel always returns
the after-image of all affected tuples. If you don’t want to see them
then don’t look at them. If you want the count then count them. If
you want the before image…

[Jacques]: I think 99% of the time people want the count of changed
items so maybe just add a special output for this.

[Carlo]: True, but it would be really simple to just COUNT(*) the output.

[Jacques]: And use an emit clause to exclude all the columns. Still,
it seems useful to have some kind of flag to just truncate the
results.

[Carlo]: And, if you want the before image, you can create a shared
rel and merge with the after image.
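
To illustrate the "always return the after-image" idea, here is a small in-memory sketch. It is not Substrait and not the design in the PR; `write_update`, the row layout, and the `id` join key are assumptions. The count is derived by counting the output, and the before image comes from a shared input computed ahead of the write and merged with the after image.

```python
def write_update(table, predicate, update):
    """Update matching rows in place and return the after-image of affected rows."""
    after_image = []
    for i, row in enumerate(table):
        if predicate(row):
            table[i] = update(row)
            after_image.append(table[i])
    return after_image


def is_target(row):
    return row["salary"] >= 100


def give_raise(row):
    return {**row, "salary": row["salary"] + 10}


employees = [
    {"id": 1, "salary": 100},
    {"id": 2, "salary": 200},
    {"id": 3, "salary": 50},
]

# Stands in for the shared rel: the rows the write will touch, captured before the write.
before_image = [dict(row) for row in employees if is_target(row)]

after_image = write_update(employees, is_target, give_raise)

# "If you want the count then count them."
affected_count = len(after_image)

# Merge before and after images on the (assumed) key to compare salaries.
before_by_id = {row["id"]: row for row in before_image}
total_difference = sum(
    row["salary"] - before_by_id[row["id"]]["salary"] for row in after_image
)
```

In this sketch the write operator needs no special count or before-image output; both are computed by ordinary operations stacked on top of it.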

Prepared statements

[Jacques]: There is an outstanding patch about variables that are
bound and then used later. I’d like more input on this. Right now it
feels like it might be solving too many problems / be too generic and
it would be helpful to constrain.

[Jacques]: However, one of the things that seems very worthwhile is an
“expression annotation node” (and maybe the same thing for relation).
It’s a node we define which allows you to add arbitrary description
information about the node. So the output of this node is identical
to the input (it is a passthrough) and it is just meant for
annotations.

[Jacques]: Could be used for labeling errors in a parsed plan, for
informing people looking at the plan in a viewer. Could label names
to help understand what parts of the plan correspond to what “labels”
in a prepared statement.
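
A minimal sketch of what a passthrough annotation node could look like in a toy expression tree. The `Annotated` class and its `notes` field are hypothetical names, not part of the Substrait spec; the point is only that evaluation ignores the annotation entirely.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class Literal:
    value: int


@dataclass
class Add:
    left: "Expr"
    right: "Expr"


@dataclass
class Annotated:
    """Hypothetical passthrough node: output is identical to its input."""
    input: "Expr"
    notes: Dict[str, str]


Expr = Union[Literal, Add, Annotated]


def evaluate(expr: Expr) -> int:
    """Evaluate the tree; annotations never affect the result."""
    if isinstance(expr, Annotated):
        return evaluate(expr.input)
    if isinstance(expr, Literal):
        return expr.value
    if isinstance(expr, Add):
        return evaluate(expr.left) + evaluate(expr.right)
    raise TypeError(f"unknown expression node: {expr!r}")


def strip_annotations(expr: Expr) -> Expr:
    """Return an equivalent tree with every Annotated node removed."""
    if isinstance(expr, Annotated):
        return strip_annotations(expr.input)
    if isinstance(expr, Add):
        return Add(strip_annotations(expr.left), strip_annotations(expr.right))
    return expr


# Label one operand with the binding name it came from in a prepared statement.
labeled = Add(Literal(1), Annotated(Literal(2), {"label": "foo"}))
```

Evaluating `labeled` and `strip_annotations(labeled)` gives the same result, which is the property a consumer would rely on when skipping annotations.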

[Jacques]: So, this could be useful, but we could also just add a
notes field to arbitrary expression nodes. My concern with prepared
statement was that this was going to end up specific to prepared
statements and it should be more general.

[Carlo]: So this is a way to label the “?”

[Jacques]: In this case named bindings was very popular so it was
“foo” and not “?” but yes.

[Carlo]: An arbitrary annotation operator seems useful. Are we saying
this annotation needs to be processed to understand the prepared
statement?

[Jacques]: No, this is supplementary.

[Carlo]: Then, I’m fine with it. We just want to make sure these
annotations are not used in the computation of the result at all.
Agree it is fine and should probably be generic.

[Jacques]: We could maybe just put it in advanced extension (this is
available in more places than rel common).

[Carlo]: I don’t know how I feel about advanced extension, but it
very well might be the right answer. By the way, I’d also like to be
able to attach these annotations to the plan itself.

[Jacques]: Some further explanation of advanced extension and how it works.