Groups keyboard shortcuts have been updated
Dismiss
See shortcuts

[DISCUSS] Expressions in the Fetch operation

26 views
Skip to first unread message

Jesus Camacho Rodriguez

unread,
Nov 18, 2024, 12:36:41 PM11/18/24
to subs...@googlegroups.com
Hello everyone,

I would like to propose an extension for the Fetch operation in Substrait [1].

The fetch operator eliminates records outside a specified window, typically corresponding to an OFFSET/FETCH or LIMIT/OFFSET clause in SQL. The fetch operator defines two primary properties:
• Offset (optional): Specifies the number of records to skip before retrieval begins.
• Count: Specifies the number of records to retrieve.

Currently, both offset and count in Substrait are restricted to integers. This proposal seeks to generalize these fields to support expressions that evaluate to a constant integer, thus providing more flexibility for engines.

Background
In the SQL standard, engines supporting features F860 (dynamic row count) and F865 (dynamic offset) enable such capabilities. Some examples of engines supporting expressions for offset and/or count include:
1. SQL Server: Supports variables, parameters, and constant scalar expressions, including subqueries [2].
2. PostgreSQL: Supports literal constants, parameters, variables, and other expressions [3].
3. DB2: Supports expressions that must not contain column references, scalar full-selects, non-deterministic functions, functions with external actions, or sequence references [4].
4. Oracle: Supports a literal or an expression that evaluates to a numeric value [5].
5. Apache Spark: Allows foldable function expressions for limit and offset [6].

Proposal
Extending FetchRel in Substrait to support expressions for both offset and count. Here is a sketch of how it would look:

// The relational operator representing LIMIT/OFFSET or TOP type semantics.
message FetchRel {
  RelCommon common = 1;
  Rel input = 2;
  // the offset expressed in number of records
  // Deprecated: use `offset_rel` instead
  int64 offset = 3 [deprecated = true];
  // Expression evaluated into an integer specifying the number of records to
  // skip.
  Expression offset_rel = 5;
  // the amount of records to return
  // use -1 to signal that ALL records should be returned
  // Deprecated: use `count_rel` instead
  int64 count = 4 [deprecated = true];
  // Expression evaluated into an integer specifying the number of records to
  // return. -1 signals that all records should be returned.
  Expression count_rel = 6;
  substrait.extensions.AdvancedExtension advanced_extension = 10;
}

If the community agrees the proposal is reasonable, I will proceed by creating a GH issue and starting to work on a PR. Please let me know if you have any feedback.

Thanks,

Jacques Nadeau

unread,
Nov 18, 2024, 1:02:04 PM11/18/24
to subs...@googlegroups.com
This generally makes sense. Few thoughts:

- I'd make the old and new representation for offset and count be inside a oneof. (My recollection is that this would be backwards compatible.)
- I'm struggling with the right data type to require for the expression. For example, we might require a bigint for some systems but others might be decimal(38,0).
- I'd probably call them _expr or something instead of _rel.
- Have you confirmed that our subquery representation is sufficient for this pattern?

Thanks!


--
You received this message because you are subscribed to the Google Groups "substrait" group.
To unsubscribe from this group and stop receiving emails from it, send an email to substrait+...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/substrait/CADbtCCKkBFK_kw9eMercnC9hjLEwG-tH0Seb%3DxV8vTi6ofGjrA%40mail.gmail.com.

Jesus Camacho Rodriguez

unread,
Nov 18, 2024, 3:28:48 PM11/18/24
to subs...@googlegroups.com
Thanks, Jacques! My responses are inlined below.

On Mon, Nov 18, 2024 at 10:02 AM Jacques Nadeau <jac...@sundeck.io> wrote:
This generally makes sense. Few thoughts:

- I'd make the old and new representation for offset and count be inside a oneof. (My recollection is that this would be backwards compatible.)
We can certainly go with that approach.
- I'm struggling with the right data type to require for the expression. For example, we might require a bigint for some systems but others might be decimal(38,0).
Is this something we need to include in the spec? Alternatively, we could leave it implementation-dependent. Since the expression will have an output type built in, it might be up to the implementer to decide whether it’s something they can handle.
- I'd probably call them _expr or something instead of _rel.
Good catch, I went through a couple of iterations and forgot to change that--will do. 
- Have you confirmed that our subquery representation is sufficient for this pattern?
I have not gotten that far yet, but I don't see why it wouldn't work, since this would be similar to other subqueries. Do you have any intuition about potential issues?

Jacques Nadeau

unread,
Nov 18, 2024, 3:58:17 PM11/18/24
to subs...@googlegroups.com
- I'm struggling with the right data type to require for the expression. For example, we might require a bigint for some systems but others might be decimal(38,0).
Is this something we need to include in the spec? Alternatively, we could leave it implementation-dependent. Since the expression will have an output type built in, it might be up to the implementer to decide whether it’s something they can handle.

I think we probably want to specify a small set of what is allowed. Otherwise, we're likely to have everybody do it differently. Maybe only bigint/int64 and decimal?
 
- Have you confirmed that our subquery representation is sufficient for this pattern?
I have not gotten that far yet, but I don't see why it wouldn't work, since this would be similar to other subqueries. Do you have any intuition about potential issues?

Nothing top of mind. Just confirming that this will work well with what we have to close the gap for your use cases. 

Jacques Nadeau

unread,
Nov 20, 2024, 7:49:13 PM11/20/24
to subs...@googlegroups.com
Opened the following ticket [1] to better discuss "required types" for situations like this

Jesus Camacho Rodriguez

unread,
Nov 22, 2024, 6:38:21 PM11/22/24
to subs...@googlegroups.com
Thanks, Jacques!

I have just created the PR for the Fetch operator changes: https://github.com/substrait-io/substrait/pull/748

-Jesús

--
You received this message because you are subscribed to the Google Groups "substrait" group.
To unsubscribe from this group and stop receiving emails from it, send an email to substrait+...@googlegroups.com.
Reply all
Reply to author
Forward
0 new messages