Reminder (NEW DIAL-IN): Substrait Sync Meeting Wednesday

Weston Pace

Jun 22, 2022, 3:59:27 AM

to Substrait

The bi-weekly Substrait sync meeting will be Wednesday, June 22nd, at 8AM PDT.

The meeting will now be hosted on Microsoft Teams instead of Google
Meet (thanks Ashvin).

Agenda: https://docs.google.com/document/d/1Afg93ojsVWdwo3rBO2Dtng8__RYqAy_zJMrnqhTQapc/edit?usp=sharing
________________________________________________________________________________

Microsoft Teams meeting

Join on your computer or mobile app

Click here to join the meeting

Or join by entering a meeting ID
Meeting ID: 214 363 280 62
Passcode: nhJBSU

Or call in (audio only)

+1 323-849-4874,,542953096# United States, Los Angeles

(866) 679-9995,,542953096# (Toll-free)

Phone Conference ID: 542 953 096#

Find a local number | Reset PIN

Learn More | Meeting options

________________________________________________________________________________

Aldrin

Jun 22, 2022, 1:06:31 PM

to subs...@googlegroups.com

FYI, the meeting links didn't come through. I didn't attempt to join using only the meeting IDs and passcode, so not sure if they are sufficient or not.

Aldrin Montana

Computer Science PhD Student

UC Santa Cruz

--
You received this message because you are subscribed to the Google Groups "substrait" group.
To unsubscribe from this group and stop receiving emails from it, send an email to substrait+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/substrait/CAE4AYb0rh98s6kffh7st5GO2jE-R9qc0MEpjLkZRXtL2yH-a1w%40mail.gmail.com.

Weston Pace

Jun 22, 2022, 1:30:11 PM

to Substrait

Thanks. That was my fault. I didn't copy over the links when generating the reminder email. I will do so in the future.

The ID/password did work and we met today and talked mostly about function definitions / YAML files and a bit about integration testing. I'll be drafting more detailed minutes to send out, hopefully this evening.

Weston Pace

Jun 24, 2022, 4:28:40 PM

to Substrait

Notes from the meeting:

Regarding URIs
###############

Weston: When do we want to start narrowing down what URIs we use? Do
we want to use GitHub URIs or something simpler? Those should point
to the YAML file, right? Do we want to start working on a way to solve
the YAML file composition problem (e.g. how do files refer to
extension types in other YAML files)?

Weston: First, the URI question. People have been using the GitHub
link (to the YAML) for URIs. Is that what we want to keep using going
forwards? Is there a significant cost to change this later?

Jacques: Lately people have been using URIs that point to the current
master which seems brittle (YAML could change without URI change). We
could use tags but that has a different problem (URI could change
without YAML change).

Jacques: Do we have a concrete situation that we are struggling with
right now? Best to solve this problem one real need at a time. You
(Weston) asked if we’re ready to talk about this and I suppose the
answer is, “is there a real need?”

Weston: I can think of two needs, one more concrete than the other.
First, Sanjiban recently created a PR adding a bunch of new functions.
What happens if we get one of those wrong? Do we just create a new
function (e.g. add2, add3, …) or do we try and maintain
forward/backward compatibility? However, this use case is not
terribly concrete.

Weston: Second, and a bit more concrete. Arrow has a function
registry internally. Some of these are not in the Substrait spec. A
YAML file for these functions will be generated dynamically. My
current thinking is the consumer will look at the URI, and if it
matches the “dynamic Arrow function” URI, then instead of mapping
manually to a specific registry we will automatically interpret the
call. This doesn’t really map to a YAML file so I wasn't sure what
the URI would be in this case. We could publish a YAML file for this
on a semi-regular basis manually and use that as the URI but what URI?
How do you express you want to use a function defined since the last
release?
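
(A minimal sketch of the consumer-side dispatch described above. The
placeholder URI, the mapping table, and the function names are all
hypothetical; the notes leave the actual URI undecided, and none of
these names come from Substrait or Arrow.)

```python
# Hypothetical placeholder; the real "dynamic Arrow function" URI is undecided.
DYNAMIC_ARROW_URI = "urn:example:arrow-dynamic-functions"

# Hypothetical hand-maintained mapping:
# (extension URI, function name) -> engine-side function name.
MANUAL_MAPPING = {
    ("urn:example:substrait-arithmetic", "add"): "add_checked",
}


def resolve_function(uri: str, name: str) -> str:
    """Return the engine-side function name for a (URI, name) reference."""
    if uri == DYNAMIC_ARROW_URI:
        # Dynamic case: the name is taken as the engine function name
        # directly, with no manual mapping step.
        return name
    try:
        return MANUAL_MAPPING[(uri, name)]
    except KeyError:
        raise LookupError(f"no mapping for {name!r} in {uri!r}") from None
```

With this shape, the consumer only needs to recognize one well-known
URI string for the dynamic case; whether that string should also
resolve to a published YAML file is the open question above.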

Jacques: I think what we want, at an abstract level, is versioned
functions, and not versioned YAMLs. Adding a new function shouldn’t
force a version change because the other functions didn’t change. I’m
not sure what the solution for this would be. Any ideas? We could
have “open” and “sealed” YAMLs.

Ashvin: What happens if functions depend on other functions?

Jacques: That shouldn’t be possible though functions could depend on types.

Jesus: We should probably identify which changes are “breaking”
changes and see if we can automate the versioning somehow. It’s too
much of a burden on the user to figure that out themselves.

Weston: Agree. In protobuf it’s possible to know what all the
possible changes are and which changes should be breaking changes.
For a recent example, we noticed that “divide” was missing overflow
behavior, and adding this optional argument should be backwards
compatible.

Weston: I’ll take, as a homework assignment, walking through which
changes would be breaking changes.

Jacques: I agree with Jesus in theory but I think most changes are
breaking. Perhaps we should be putting a version number in every
function. Also, that “divide” change wasn’t actually backwards
compatible because optional arguments come first so this inserted a
new argument at the beginning which would break backwards
compatibility. This is because optional arguments must always be
specified (they can just be set to “I don’t care”). Although, the
current Java isn’t necessarily doing this correctly (it sometimes
omits optional arguments).

Jacques: We could put optional arguments at the end. Or maybe we
could even have two lists, one for optional and one for normal
arguments. We probably need to think this through a bit more if we
want backwards compatibility.

Weston: True, we could also make named/numbered arguments which is how
protobuf handles this for fields.
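
(A small sketch of the positional problem discussed above, and of how
binding by name would sidestep it. The argument names `overflow`, `x`,
and `y`, and the UNSPECIFIED marker standing in for "I don't care", are
made up for illustration and are not taken from the spec.)

```python
UNSPECIFIED = "UNSPECIFIED"  # hypothetical stand-in for "I don't care"

# Hypothetical old definition of divide: two value arguments.
OLD_PARAMS = ["x", "y"]
# Hypothetical new definition: an optional argument inserted at the front.
NEW_PARAMS = ["overflow", "x", "y"]


def bind_positional(params, args):
    """Bind call arguments to parameter names purely by position."""
    if len(args) != len(params):
        raise ValueError(f"expected {len(params)} arguments, got {len(args)}")
    return dict(zip(params, args))


def bind_keyed(params, args, optional=()):
    """Bind arguments by name; omitted optional ones become UNSPECIFIED."""
    bound = {}
    for p in params:
        if p in args:
            bound[p] = args[p]
        elif p in optional:
            bound[p] = UNSPECIFIED
        else:
            raise ValueError(f"missing required argument {p!r}")
    return bound


old_call_positional = [10, 2]        # written against OLD_PARAMS
old_call_keyed = {"x": 10, "y": 2}   # the same call, keyed by name
```

Here `bind_positional(NEW_PARAMS, old_call_positional)` raises, because
the old call has one argument too few, while
`bind_keyed(NEW_PARAMS, old_call_keyed, optional={"overflow"})` still
binds `x` and `y` and fills in the new argument as unspecified.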

Jacques: Circling back, if there aren’t that many backwards compatible
changes possible, function versioning will probably work. Not sure
what that means for URIs though.

Jeroen: What do we use the YAML files for again? Most people either
ignore the URI completely or treat it as a namespace identifier and
never resolve the URI. In what context are these files actually
interesting?

Jacques: I think right now most consumers and producers are
collaborators and not strangers. In the future that might change and
the YAML may be more key. Also, interactive cases like an IDE will
want to read the YAML. Also, maybe with consumer-registered UDFs
where the consumer knows about functions that a producer has no idea
of.

Jeroen: But most of that should be accomplished with just a namespace
and name matching. Why is the YAML important?

Jeroen: Ignoring YAML for the moment, for versioning, another way is
to just list all versions of all functions in the YAML file so that
the latest YAML file is always the correct YAML file (e.g. this is how
package managers work).
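
(A rough sketch of the "list every version" idea, using a plain Python
dictionary in place of a parsed YAML file. The layout, the `version`
field, and the function entries are all hypothetical; nothing here
reflects the actual extension YAML format.)

```python
# Hypothetical content of the "latest" file: every version of every
# function is kept, so older plans can still be resolved against it.
LATEST = {
    "divide": [
        {"version": 1, "args": ["x", "y"]},
        {"version": 2, "args": ["overflow", "x", "y"]},
    ],
    "add": [
        {"version": 1, "args": ["x", "y"]},
    ],
}


def lookup(functions, name, version=None):
    """Return the requested version of a function, or the newest if version is None."""
    entries = functions[name]
    if version is None:
        return max(entries, key=lambda e: e["version"])
    for entry in entries:
        if entry["version"] == version:
            return entry
    raise LookupError(f"{name} has no version {version}")
```

A plan that pins `divide` at version 1 keeps resolving to the
two-argument form even after version 2 is added, so the file's own
location would not need to change when a function is added or revised.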

Jeroen: Coming back to the YAML files, I worry they are too complex to
parse and understand correctly.

Jacques: The YAML files are probably underdocumented. The goal was
simplicity so something is probably missing. The expression language
is hard but a necessary evil because we need to be able to express
type derivations.

Jeroen: It’s not just a lack of documentation. The spec is incomplete
and ambiguous in places.

Jacques: So let’s fix that problem.

Jeroen: Ok, I’ll work on it. Could the producer just explicitly
declare the output type?

Jacques: What about a UDF? How would the producer know the UDF type?

Jeroen: That sounds like a producer-specific problem.

Jacques: Then you just have a side-channel. We’re getting off topic,
let’s go back to the versioning topic. It doesn’t sound like there is
a strong consensus. Weston is the most in-the-thick-of-it so why
don’t you try tackling this?

Weston: I’ll document a list of fictional yet concrete cases for
versioning and post an issue.

Integration testing
###############

Weston: I’ve maybe got some resources spinning up to look at
integration testing. Specifically, creating tests to ensure that
different consumers interpret the same plan in the same way. Should I
just create a repo and add in test data, test queries, and test
results?

Jacques: Rather than checking in data it would probably be better to
check in tools that synthesize the data. SCM for data is difficult.
For example, TPC-H.

Jacques: For results you might also consider using an oracle such as
Postgres. If we can avoid checking in results that would be a big
win.

Jacques: For results, the challenge is that a lot of properties can be
optional (e.g. ordering) and so there could be many different valid
results. Also, tolerance is another concern. I’ll try and dig up
some of our past knowledge to see if I can share.
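
(A minimal sketch of a result comparison that allows for the two
concerns above: unspecified row ordering and floating-point tolerance.
The function names and default tolerances are assumptions made for
illustration, and the greedy matching is a simple approach that could
be fooled when several rows lie within tolerance of each other.)

```python
import math


def values_match(a, b, rel_tol, abs_tol):
    """Floats compare within a tolerance; everything else compares exactly."""
    if isinstance(a, float) and isinstance(b, float):
        return math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)
    return a == b


def row_matches(r1, r2, rel_tol, abs_tol):
    """Two rows match if they have the same width and every column matches."""
    return len(r1) == len(r2) and all(
        values_match(a, b, rel_tol, abs_tol) for a, b in zip(r1, r2)
    )


def results_match(expected, actual, ordered=False, rel_tol=1e-9, abs_tol=0.0):
    """Compare two result sets given as lists of tuples."""
    if len(expected) != len(actual):
        return False
    if ordered:
        return all(
            row_matches(e, a, rel_tol, abs_tol) for e, a in zip(expected, actual)
        )
    # Unordered: greedily pair each expected row with an unused actual row.
    remaining = list(actual)
    for e in expected:
        for i, a in enumerate(remaining):
            if row_matches(e, a, rel_tol, abs_tol):
                del remaining[i]
                break
        else:
            return False
    return True
```

A test case would pass `ordered=True` only when the plan itself
guarantees an ordering, and would otherwise accept any permutation of
the expected rows.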

Jacques: Making a repository still makes sense.

Weston: Thanks, probably still a few months out, but if someone else
is interested it would also be nice to have a second system to be the
integration test partner.

Jacques: Thinking about the producer side, I think that will be
challenging as well because there are many right answers. For
example, different producers could generate different plans for TPC-H.
We can maybe share some “valid” plans for a query (as in, here is a
reference query that is valid) but not sure how useful.

Weston: I think it also comes down to capabilities discovery. For
example, given a query, and a set of capabilities, does the producer
come up with a query that is valid and respects the capabilities of
the consumer.

Jacques: Calcite does have an inefficient reference implementation of
an execution engine. You could possibly use that as a second
implementation for integration testing. It is a fairly rich engine
even if it isn’t very performant.

Ashvin: Just wanted to add that having some integration tests would be
very valuable for attracting new people.

Optional arguments
#################

Jacques: Circled back to the idea that optional arguments are often
just plain omitted by producers and that will probably be a problem
for consumers which are presumably expecting them to always be
specified.

Jeroen: Isn’t that the way the spec is defined? Shouldn’t the
consumers be more flexible?

Jacques: I think the spec is probably wrong here. We should keep the
plans as simple to understand as possible. I think the plan was
always that all arguments are always specified. Otherwise it is too
hard to understand which argument is which. Jeroen, can you take the
action to clarify the spec?

Jeroen: Sure, I’ll make a PR.
