Reminder: Substrait community sync meeting Wednesday

Weston Pace

Sep 14, 2022, 2:46:02 AM
to Substrait

Weston Pace

Sep 14, 2022, 4:04:34 PM
to Substrait
# Notes

## Discussion of write rel PR questions

[Carlo] I have to leave early and the Github discussion is getting long here so I was hoping to talk this through.
[Jeroen] I’m not sure I’m prepared for the discussion
[Jacques] Me either
[Carlo] Ok, let’s schedule a dedicated meeting for the topic
[Jacques] I will schedule via a thread on the google groups

## Areas of concern

[Weston] We might be wrapping up some things in the next few months and I’m wondering what people are concerned about?

### Governance

[Carlo] Governance, not for me particularly, but people I was talking with at VLDB were having concerns.
[Jacques] Yes, this is on me.  I was going to start a ML topic about this but didn’t so I need to.  I’ve noticed that, other than the Apache foundation, the other governance models don’t have much structure.
[Jacques] First, let’s do the ML and get the formal structure established, then find a foundation.
[Jacques] Biggest risk to investors should really just be the trademark issue.  Investors need confidence that the brand won’t be “owned” by some company.  Do you know if investors are worried more about trademark or the lack of structure?
[Carlo] I think the primary concern is that people spend a bunch of time and energy on the project and then have no say or control in the project.  I agree that nothing can really prevent community shift but if you loudly declare a foundation, gather people, etc. then the odds of turning evil are lower.  For example, the chances that Iceberg will shift are pretty slim.  The odds that Hadoop will shift are even lower.
[Jacques] That makes sense and I think step one is getting the formal governance model agreed to. Meanwhile we can work on step two which is finding a foundation.

### Functions backwards/forwards compatibility & SQL balance

[Jacques] The first thing I’m worried about is that our compatibility model for functions sucks.  We just made a bunch of (valid) changes to functions and they were all backwards incompatible.  It’s also really hard for people interacting with the project to know if they are interacting with the right version.  For example, Java is probably producing functions that aren’t compatible with any other projects.
[Jacques] It probably won’t matter a lot in a few years when things stabilize.  But, the churn rate at the moment could be intimidating to new contributors.
[Jacques] The third thing is finding the right balance orienting with SQL.  SQL is the inspiration because it has wide use and adoption.  E.g. the recent discussion on nullability.  The discussion was fine but the structure of “why is this here if it isn’t in SQL?” is something I worry about.
[Jeroen] I agree with a lot of these things.  I agree we need to fix the function compatibility issue.  For example, options could be resolved by name instead of index.
[Carlo] I noticed a similar debate on the write rel as I was coming from a SQL mindset.  This is an inherent challenge for this project and what makes this project useful.  Some of what we are doing is connecting the dataframes and the SQL worlds.
[Carlo] SQL is successful.  We should be able to go back and forth with SQL on most things.  This should not prevent us from having a natural model for other systems as well.
[Jeroen] Agree that taking representation with SQL is good and aiming for most SQL functionality is good but we just need to be careful to avoid the weird stuff from SQL.  For example, how aggregate functions deal with nulls is very odd to me.  I struggle with the balance as well.
[Carlo] If we have more functionality than SQL (and maybe a SQL mode flag) that seems fine. But if a SQL user has to jump through all kinds of hoops to get the expected behavior that is not ok.
[Jeroen] Yes, for example, we could solve one of these problems with lambdas and that would be clean but a lot of work for a SQL person so maybe not best.  That being said, I do think we should consider support for lambdas.  This would very much simplify things like custom sort functions.
[Weston] Are lambdas just another form of UDF serialization?  Probably off topic
[Jacques] This is also related to the idea of a transformations library.  A lambda function could be a way to express these transformations.  The main point being that there are other places I think would be improved with lambdas.  Whether this is a unique thing or something in the UDF bucket doesn’t really matter.  It isn’t clear to me how much this will solve problems we have.
[Carlo had to leave early but Teams froze his expression in a creepy smile for the rest of the meeting]
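
Jeroen's suggestion above, resolving function options by name instead of by index, can be illustrated with a small sketch. The option names and values below are made up for the example and are not taken from the Substrait spec; the point is only that positional resolution silently changes meaning when a new option is added, while name-based resolution does not.

```python
from typing import Dict, List

# Hypothetical option lists for one function; not from the Substrait spec.
# Version 1 of the declaration has two options.
OPTIONS_V1: List[str] = ["overflow", "rounding"]
# Version 2 adds a new option at the front, shifting every index.
OPTIONS_V2: List[str] = ["null_handling", "overflow", "rounding"]


def resolve_by_index(declared: List[str], values: List[str]) -> Dict[str, str]:
    """Pair each value with the option declared at the same position."""
    return dict(zip(declared, values))


def resolve_by_name(declared: List[str], values: Dict[str, str]) -> Dict[str, str]:
    """Look up each declared option by name; omitted options are left out."""
    unknown = set(values) - set(declared)
    if unknown:
        raise KeyError(f"unknown options: {sorted(unknown)}")
    return {name: values[name] for name in declared if name in values}


# A producer written against version 1 sends its choices positionally...
positional = ["ERROR", "TIE_TO_EVEN"]
# ...or by name.
named = {"overflow": "ERROR", "rounding": "TIE_TO_EVEN"}

old_by_index = resolve_by_index(OPTIONS_V1, positional)
new_by_index = resolve_by_index(OPTIONS_V2, positional)  # misread after the change
new_by_name = resolve_by_name(OPTIONS_V2, named)         # unaffected by the change
```

Against version 2, the positional values end up attached to `null_handling` and `overflow`, which is not what the producer meant, whereas the named values still resolve to `overflow` and `rounding`.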

### Consumer support for corner cases

[Weston] I have some slight concern that we have a lot of options, which is good for specifying everything, but consumers are not likely to support all of them.  In the rare case where a producer does care about a particular corner case behavior, they will find that their plan is quickly rejected by most/all consumers.
[Jeroen] True, but that’s just how it works isn’t it?  If someone wants something and no one provides it what can you do?
[Jeroen] I do hope that consumers will do their best to fulfill the options and not just choose to ignore all options.
[Jacques] One thing we don’t have is a model where a user can specify precedence.  For example, “I really want behavior X, but if that can’t be delivered, I will live with Y”.
[Jeroen] That also helps because our “option precedence” is currently based on spec order and that makes it hard to add new options (e.g. if we moved to name resolution then order doesn’t make as much sense)
[Cheng Xu] We’ve talked with some customers and they care deeply about function semantics.  They want to keep existing behavior even if things change.  For example, if a Spark user swaps out their engine, they want this to be transparent.
[Jacques] There will always be some challenges there.  There will be some mismatches between those two.  For example, some Spark capabilities may not be available on the new engine and you would have to still run those parts in Spark.
[Cheng Xu] So the consumer side maybe needs to generate some description of what it can offer.
[Jacques] I think that the API for a while is going to be “here is a plan, can you do it?”  Maybe in the future we can specify further (for example, allow the consumer to explain exactly why it can’t fulfill a plan).
[Cheng Xu] So in that way the producer will have to handle the fact that a consumer might just fail and be able to fall back gracefully.
[Jacques] I think so.  For example, Spark has a ton of functionality.  I think newer engines (e.g. Velox) will start off with smaller sets of capability (focusing on the most popular).
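
The two ideas in this section, a producer stating "I really want X, but will live with Y" and a consumer answering only "here is a plan, can you do it?", can be combined in a rough sketch. Everything here is hypothetical: `Plan`, `Consumer`, `can_run`, `choose` and the behavior strings are invented for illustration and are not part of any Substrait API. The sketch assumes the producer cares more about getting its preferred behavior than about which engine runs the plan.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Tuple


@dataclass(frozen=True)
class Plan:
    """Stand-in for a serialized plan: just the behaviors it requires."""
    label: str
    required: FrozenSet[str]


@dataclass(frozen=True)
class Consumer:
    """Stand-in for an engine: just the behaviors it supports."""
    name: str
    supported: FrozenSet[str]

    def can_run(self, plan: Plan) -> bool:
        """Answer 'here is a plan, can you do it?' with a plain yes or no."""
        return plan.required <= self.supported


def choose(
    plans_by_preference: List[Plan],
    consumers_by_preference: List[Consumer],
) -> Optional[Tuple[Plan, Consumer]]:
    """Return the first workable (plan, consumer) pair, most-preferred plan first."""
    for plan in plans_by_preference:
        for consumer in consumers_by_preference:
            if consumer.can_run(plan):
                return plan, consumer
    return None  # Nothing accepted any variant; the producer must handle this.


# Hypothetical behaviors and engines.
strict = Plan("strict", frozenset({"overflow:ERROR"}))
relaxed = Plan("relaxed", frozenset({"overflow:SATURATE"}))
new_engine = Consumer("new_engine", frozenset({"overflow:SATURATE"}))
legacy_engine = Consumer("legacy_engine", frozenset({"overflow:ERROR", "overflow:SATURATE"}))

# Preferred behavior wins even though it means falling back to the legacy engine.
with_fallback = choose([strict, relaxed], [new_engine, legacy_engine])
# With only the new engine available, the producer settles for the relaxed plan.
new_only = choose([strict, relaxed], [new_engine])
# If only the strict plan is acceptable and no engine supports it, nothing runs.
no_match = choose([strict], [new_engine])
```

Swapping the two loops would instead prefer the first engine and relax behavior before changing engines; which ordering is right depends on what the user values, which is exactly the precedence question raised above.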

[Jacques] Point of order, we haven’t done introductions today and I notice we have a lot of new people.

[Jacques] I noticed that someone put something on Github for Spark <-> Substrait.  I was wondering if anyone had any knowledge of that.

# substrait-cpp

[Chaojun] We have a proposal for a substrait-cpp project (see agenda).
[Jeroen] That first bullet is something I have been working on in Rust for half a year now.  I’ve gotten to the point where I have an ANTLR grammar which is fairly complete.  It may require some changes (I will send a proposal).  Since this is so complicated it might be nice to leave the YAML parsing to a single tool and so, for the validator, I’ve been creating protobuf files that represent the YAML content.  Maybe it would make sense that we automate YAML -> protobuf and that would simplify the first bullet quite a bit.  Otherwise, this sounds good, using the validator for these things would be maybe a bit heavyweight so having a C++ utility could be nice.
[Chaojun] I have some questions about the validator.  One of the parts of my proposal has some validation but it is just to validate a part of the plan.  YAML parsing sounds good if we could do just a part.
[Jeroen] You can’t just do YAML or partial validation now but I had planned on making improvements on these areas in the future and adding more export capability.
[Jacques] I agree we probably don’t want to build the same thing multiple times.  Should the validator be broken into a set of tools / utilities that other projects could use?  It might make these things more consumable.
[Jeroen] One challenge in the validator is that each of these utilities has to go above and beyond because, for example, a validator can’t just throw an error, it needs to explain why.  So this would mean that generic utilities would have extra functionality that doesn’t really belong in the general utility.  I think I’d prefer to have the validator support more inputs / outputs / behaviors.  That’s why I agree that this generic cpp library would be useful for these utilities.
[Jacques] I do agree that there are many common utilities and a common format that are useful.  I know the Acero project has its own internal representation of IR but it would be nice for them to keep an eye on this too.
[Weston] Agreed.  Even though Acero has its own internal representation we still have users and they will want help manipulating Substrait plans and so it makes a lot of sense for us to help out here.
[Jacques] I know that protobuf is complicated when it comes to object sharing.  Do we think this is going to create problems for this kind of library?
[Jeroen] I think it will be possible, or more accurately, I think it will be equally problematic regardless of language.  Mainly I would just advise against using Google’s protobuf.
[Jacques] To summarize, this will be useful, but we need to take care to make sure this is usable (e.g. usable in Arrow).
[Jeroen] Yes, though it may be very difficult to detect when these problems occur.  I don’t think it’s worth even trying to get google’s protobuf lib to work because you might think you are successful and get quite far before you run into these problems.