[DISCUSS] Workaround Protobuf deserialization depth limit in Substrait

14 views
Skip to first unread message

YongChul Kwon

unread,
Dec 9, 2024, 7:30:44 PM12/9/24
to substrait
Hello,

We have accidentally encountered a deserialization limit of official protobuf library with a deep nesting. The limit is 100 (https://protobuf.dev/programming-guides/proto-limits/#depth), that is, the library will fail to parse a message if it contains more than 100 nestings. So I would like to discuss extending Substrait to reference fragment of scalar expression as it already does for the RelOp tree.

Please see below for the detail and other rationales. Curious to hear how others handling this issue.

Best,
YongChul

* Problem

The official protobuf library has a limit in depth during deserialization (https://protobuf.dev/programming-guides/proto-limits/#depth).  The artificial limit is 100. Although the 100 limit seems reasonable for some applications, a reasonably complex production query plan can often go far beyond 100 nesting.

Substrait currently employs protobuf to describe the spec as well as de facto serialization/deserialization method. This means that the Substrait applications are likely to use the "official" protobuf library and will hit a dead-end on receiving a plan with more than 100 depth nesting.

* Proposal

We propose to extend Substrait spec to support fragmenting expression tree like RelOp tree.

This involves updating both serialization and deserialization. The serialization should be done in this "depth limit" conscious way – that is, the serialization should be done in fragments so that each fragment does not exceed the imposed depth limit. Note that another irk is that the nesting is in terms of protobuf nesting and we will need to validate whether oneof, for instance, is considered a nesting. There is also overhead of outer message as well (e.g., Plan itself takes up "1" in the limit).

Substrait Plan already has a way to do this by substrait.Plan.relations and ReferenceRel (i.e., fragment a large relational graph and re-build based on referencing fragments) but there is no such mechanism for an Expression.

Thus, we will need a similar way to fragment a large expression tree and referencing the fragment. Below is the proposed change to support fragmenting expressions.

message ReferenceExpression {

  int32 subtree_ordinal = 1;

message Expression {

  oneof rex_type {

    ReferenceExpression reference = 14;

  }

 }

message Plan {

  repeated Expression expressions = 6;

In addition to proto changes, we will need to update reference Substrait IR implementations to ease serialization and deserialization for applications.

* Alternatives

* Option a. Do nothing

The consuming application simply fails on receiving such plans.

* Option b. Use other protobuf implementations without limit

If there is a good one... Anyone?

* Option c. Recompile official protobuf library with a higher limit

The current limit is hardcoded in the library and the only way to lift the limit is recompilation. Unless the official library provides a mechanism to override without recompilation, the only way to work around is to update the constant and recompile the library.

* Option d. Support beyond protobuf

This practically means that we translate into other IDLs/schema description language (e.g., Apache ThriftCap'n Proto, FlatBuffers) and supported serialization/deserialization do not impose artificial limit. That is, Substrait is a "meta" spec, and supports multiple frameworks. This is maybe something to consider in the long run.


YongChul Kwon

unread,
Dec 9, 2024, 11:42:41 PM12/9/24
to substrait
Just discovered that there is a programatic way to override the limit while surveying the API. Leaving it here for future reference.

To override the recursion depth limit, you will need to wrap the message in google::protobuf::io::CodedInputStream like following. It turns out the CodedInputStream is the final data structure consumed by deserialization thus if you don't offer it, default limit will be enforced.

google::protobuf::io::CodedInputStream in(reinterpret_cast<const uint8_t *>(s.data()), s.length());
in.SetRecursionLimit(1000);
message.ParseFromCodedStream(&in);

I checked other languages (at least C++, Java, and C#) and they have the same way.

That said, I'm unblocked. :-)

Best,
YongChul

Reply all
Reply to author
Forward
0 new messages