Proposal: Move scan-query extension into core

Gian Merlino

Jul 28, 2017, 6:18:24 PM
to druid-de...@googlegroups.com
I was thinking about doing a patch to bring the scan-query contrib extension into core. Not a core extension, but actually into core itself (druid-processing). The main motivation to have it built in is so Druid SQL's default rules can use it instead of the Select query.

The motivation for this, in turn, is to avoid the memory use and performance issues with the Select query. Select works well for returning small numbers of rows, but it falls down for larger result sets. The Scan code was added in https://github.com/druid-io/druid/pull/3307, and that PR explains what is wrong with Select:

> select query cost lots of memory because it has to buffer a huge list of events in
> memory, and flushes until the list is ready. scan query flushes when a small batch
> is ready, the client can get the batch while the server is preparing the next batch

Even with a limit, Select can still use surprisingly large amounts of memory, because its parallel execution can generate far more rows than are actually needed. Scan is single-threaded, which works better for this kind of thing.
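
To make the difference concrete, here is a minimal Python sketch of the two strategies. It is not Druid code: `select_style`, `scan_style`, and the toy segments are hypothetical names, and it assumes each segment is just a list of rows and that the Select-like path collects up to `limit` rows per segment before merging.

```python
from itertools import islice

def select_style(segments, limit):
    """Buffer-then-return: every segment contributes up to `limit` rows
    (as if processed in parallel), and all of them are held in memory
    before the final limit is applied."""
    buffered = []
    for seg in segments:
        buffered.extend(seg[:limit])
    # Return the rows the client sees and how many rows were held in memory.
    return buffered[:limit], len(buffered)

def scan_style(segments, limit, batch_size):
    """Stream small batches one segment at a time and stop at the limit."""
    remaining = limit
    for seg in segments:
        it = iter(seg)
        while remaining > 0:
            batch = list(islice(it, min(batch_size, remaining)))
            if not batch:
                break  # this segment is exhausted, move to the next one
            remaining -= len(batch)
            yield batch
        if remaining == 0:
            return

# Four toy segments of 1000 rows each.
segments = [list(range(i * 1000, (i + 1) * 1000)) for i in range(4)]
rows, held = select_style(segments, limit=100)
batches = list(scan_style(segments, limit=100, batch_size=25))
print(held, max(len(b) for b in batches))  # prints "400 25"
```

In this toy setup the buffered path holds 400 rows to return 100, while the streaming path never holds more than one 25-row batch at a time.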

Along with porting over the Scan query, I'd like to change it to return the __time column as "__time" rather than "timestamp". It's more authentic, and will play better with dataSources that actually have a column named "timestamp". We could add a flag to opt in to the legacy behavior.
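
A rough Python illustration of the naming issue, not Druid code: `format_row` and its `legacy` parameter are hypothetical, and the overwrite shown in legacy mode is only an assumption for illustration, since this thread does not say how the contrib extension handles a real "timestamp" column.

```python
def format_row(row, legacy=False):
    """Return a copy of `row`; in legacy mode the time column is exposed
    as "timestamp" instead of "__time"."""
    out = dict(row)
    if legacy:
        # Assumed behavior: the renamed time column overwrites any real
        # column that is also called "timestamp".
        out["timestamp"] = out.pop("__time")
    return out

# A toy row from a dataSource that has its own "timestamp" column.
row = {"__time": 1500000000000, "timestamp": "user-supplied", "page": "Druid"}

print(format_row(row))               # both columns survive
print(format_row(row, legacy=True))  # the user's "timestamp" value is lost
```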

I'd also like to keep the name "scan", which means that people who were formerly using the contrib extension would have to do the following to migrate:

1. Change their queries to set the legacy-behavior flag
2. Modify Druid configs to unload the scan-query extension and do a rolling update to the new version
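
The two steps above could look roughly like this, sketched in Python. The flag name `legacy`, the dataSource `wikipedia`, and the helper names are assumptions for illustration; the flag had not been named at this point in the thread, and the load list stands in for whatever your `druid.extensions.loadList` currently contains.

```python
import json

def with_legacy_flag(query):
    """Step 1: return a copy of a scan query with the legacy flag set."""
    migrated = dict(query)
    migrated["legacy"] = True  # flag name is an assumption
    return migrated

def unload_extension(load_list, name="scan-query"):
    """Step 2: drop the contrib extension from an extensions load list."""
    return [ext for ext in load_list if ext != name]

query = {
    "queryType": "scan",
    "dataSource": "wikipedia",  # hypothetical dataSource
    "intervals": ["2017-07-01/2017-08-01"],
    "columns": ["timestamp", "page"],
    "limit": 1000,
}

print(json.dumps(with_legacy_flag(query), indent=2))
print(unload_extension(["druid-histogram", "scan-query"]))
```

Setting the flag first means the queries keep returning "timestamp" once the rolling update swaps the contrib extension for the built-in query.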

If I see no objections then I'll raise a PR.

Gian

Himanshu

Jul 29, 2017, 10:55:41 AM
to druid-de...@googlegroups.com
+1

--
You received this message because you are subscribed to the Google Groups "Druid Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to druid-developm...@googlegroups.com.
To post to this group, send email to druid-de...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/druid-development/CACZNdYAw-%3DBeqeObkrUZqMTUD59Xu9vGfxgZM1SDjTuSJYqGWA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Slim Bouguerra

Jul 31, 2017, 7:51:07 PM
to Druid Development
Great idea, though!

Niketh Sabbineni

Sep 11, 2017, 4:40:24 PM
to Druid Development
+1

We had our fair share of troubles using the Select query.

Nishant Bangarwa

Sep 11, 2017, 7:08:56 PM
to druid-de...@googlegroups.com
+1 


Gian Merlino

Sep 11, 2017, 7:41:25 PM
to druid-de...@googlegroups.com
There is a PR here; some comments are still outstanding, but I should be able to get back to it soon: https://github.com/druid-io/druid/pull/4751

Gian
