questions about pagination and replication

Reza

Apr 15, 2014, 5:28:32 PM
to druid-de...@googlegroups.com
Hello Druid team and Druid fans,

I have two questions:

1- It looks like pagination is not supported in Druid. If I break down by a dimension, I am not getting millions of rows, but even 500 rows is more than one would like to return to the client without paginating. There is always the option of caching results, but I was wondering whether the Druid team sees any benefit in adding pagination. Looking at the broker, it is at least possible to limit the returned result size. I would like to look into adding pagination, and I'd appreciate any hints, or a note if you think it's not a good idea.

2- A question about replication: of course the main reason for replicating data is reliability, but if I replicate data (assuming I have enough memory to load everything), in theory it should have some performance benefits as well. Is this true? I'm running benchmarks of my own, but is there any way to support it, or are there any guidelines for maximizing the performance benefits when using replicas?

Thank you,

Fangjin Yang

Apr 15, 2014, 8:30:08 PM
to druid-de...@googlegroups.com
Hi Reza, see inline.

On Tuesday, April 15, 2014 2:28:32 PM UTC-7, Reza wrote:
Hello Druid team and Druid fans,

I have two questions:

1- It looks like pagination is not supported in Druid.

It depends on the type of query you want to do.
 
If I break down by a dimension, I am not getting millions of rows, but even 500 rows is more than one would like to return to the client without paginating.

Lexicographic TopNs (http://druid.io/docs/latest/TopNMetricSpec.html) for example can paginate through all results of a dimension. All TopNs have a notion of a limit/threshold. GroupBy queries do as well.
 
There is always the option of caching results, but I was wondering whether the Druid team sees any benefit in adding pagination. Looking at the broker, it is at least possible to limit the returned result size. I would like to look into adding pagination, and I'd appreciate any hints, or a note if you think it's not a good idea.

2- A question about replication: of course the main reason for replicating data is reliability, but if I replicate data (assuming I have enough memory to load everything), in theory it should have some performance benefits as well. Is this true? I'm running benchmarks of my own, but is there any way to support it, or are there any guidelines for maximizing the performance benefits when using replicas?

The main reason for replication is redundancy. I'm not sure how much performance benefit you will get out of replicating.

Thank you,

Reza

Apr 16, 2014, 8:10:06 PM
to druid-de...@googlegroups.com
Hi FJ,


On Tuesday, 15 April 2014 17:30:08 UTC-7, Fangjin Yang wrote:
Hi Reza, see inline.

Lexicographic TopNs (http://druid.io/docs/latest/TopNMetricSpec.html) for example can paginate through all results of a dimension. All TopNs have a notion of a limit/threshold. GroupBy queries do as well.

Thank you, this can be pretty useful for TopN metrics. I wish it was supported for GroupBy as well since GroupBy has more query power and also gives exact answers (as opposed to TopN) if I'm not wrong.

I'm curious how TopN with lexicographic pagination works. Does it go over all the data in the first call and cache all the results, so that when I ask for the second page it can use the cached data?

Fangjin Yang

Apr 17, 2014, 12:01:57 AM
to druid-de...@googlegroups.com
Hi Reza, see inline.


On Wednesday, April 16, 2014 5:10:06 PM UTC-7, Reza wrote:
Hi FJ,

Thank you, this can be pretty useful for TopN metrics. I wish it was supported for GroupBy as well since GroupBy has more query power and also gives exact answers (as opposed to TopN) if I'm not wrong.

GroupBys have limit specs but there's no pagination support with them right now. Perhaps the limit spec can be extended for this functionality. 

I'm curious how TopN with lexicographic pagination works. Does it go over all the data in the first call and cache all the results, so that when I ask for the second page it can use the cached data?

Not quite. Lexicographic topNs rely on scanning only the rows pertaining to dimension values within a certain range. So the first pass may only scan values from 'aaaa' to 'aabb' and the second pass may scan values from 'aabbc' to 'bbcd'. As a general principle, Druid is intelligent enough to scan only what it needs.
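To make the range-by-range idea concrete, here is a minimal client-side sketch in Python. It builds topN query bodies using the lexicographic metric spec and its `previousStop` field, and walks the dimension one page at a time. The data source, dimension name, count aggregator, and the `run_query` callable are assumptions for illustration: `run_query` stands in for whatever code posts the query to the broker and returns the list of result rows. The sketch also assumes each page resumes after the `previousStop` value.

```python
def lexicographic_topn(data_source, dimension, interval, threshold, previous_stop=None):
    """Build a topN query body that orders dimension values lexicographically."""
    metric = {"type": "lexicographic"}
    if previous_stop is not None:
        # Assumed to resume after the last value seen on the previous page.
        metric["previousStop"] = previous_stop
    return {
        "queryType": "topN",
        "dataSource": data_source,
        "dimension": dimension,
        "threshold": threshold,
        "metric": metric,
        "granularity": "all",
        "aggregations": [{"type": "count", "name": "rows"}],
        "intervals": [interval],
    }


def paginate(run_query, data_source, dimension, interval, page_size):
    """Yield pages of rows until an empty or short page signals the end.

    run_query is a hypothetical callable: it takes a query dict and returns
    a list of row dicts, each containing the dimension value.
    """
    previous_stop = None
    while True:
        query = lexicographic_topn(
            data_source, dimension, interval, page_size, previous_stop
        )
        rows = run_query(query)
        if not rows:
            return
        yield rows
        if len(rows) < page_size:
            return
        # The last value on this page becomes the starting point of the next.
        previous_stop = rows[-1][dimension]
```

No state is kept between pages on the server in this sketch; the only thing carried forward is the last dimension value, which the client holds.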

Reza

Apr 17, 2014, 2:54:06 PM
to druid-de...@googlegroups.com
Hi FJ,

So, playing with TopN and lexicographic sorting, it looks like with lexicographic sorting I'd lose the main benefit of TopN, which is sorting by an actual aggregate value.

e.g. I would like to paginate through the list of all client_ids sorted by number of purchases made.

If TopNMetricSpec could support more than one metric, then we could always put a lexicographic metric at the end to emulate pagination, I guess.

Fangjin Yang

Apr 18, 2014, 1:16:13 AM
to druid-de...@googlegroups.com
Hi Reza, yeah, lexicographic topN is useful if you want to get metrics for every value in a dimension. Pagination with groupBy and numerical topN will be a bit trickier, as we need some way of providing an offset and limiting results, which will probably require some more design and thought. FWIW, you can just issue these queries with a larger threshold, although this may require using more memory to hold all these results in heap before they can be returned.
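The larger-threshold workaround can be sketched on the client side as follows. This is only an illustration, not something Druid provides: `run_query` is a hypothetical callable that sends the query to the broker and returns the ordered list of result rows, and `base_query` is any topN query dict you already have. The threshold is raised to `offset + limit` and the unwanted leading rows are dropped locally.

```python
def topn_page(run_query, base_query, offset, limit):
    """Emulate offset/limit over a numeric topN by over-fetching.

    run_query is a hypothetical callable (query dict -> ordered list of rows).
    """
    if offset < 0 or limit <= 0:
        raise ValueError("offset must be >= 0 and limit must be > 0")
    # Ask for everything up to the end of the requested page...
    query = dict(base_query, threshold=offset + limit)
    rows = run_query(query)
    # ...then keep only the rows belonging to that page.
    return rows[offset:offset + limit]
```

Every page re-fetches all preceding rows, so cost and heap usage grow with the page number, which is the memory trade-off mentioned above.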

Reza Iranmanesh

Apr 18, 2014, 1:39:14 AM
to druid-de...@googlegroups.com
Thank you FJ,

Druid is a great piece of work; thank you for sharing it.






Xavier Léauté

Apr 18, 2014, 11:56:00 AM
to druid-de...@googlegroups.com
Hi Reza,

To paginate you could simply re-issue the same topN but exclude the results from the previous queries. That would not be much more expensive and should be simple to implement on the client side. 
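A rough client-side sketch of this idea in Python, under some assumptions: the exclusion is expressed with Druid's `not`, `or`, `selector`, and `and` filter types, `run_query` is a hypothetical callable that sends a query dict to the broker and returns the list of result rows, and `max_pages` is an illustrative safety bound.

```python
def exclude_seen(base_query, dimension, seen_values):
    """Return a copy of base_query that filters out already-returned values."""
    query = dict(base_query)
    if not seen_values:
        return query
    exclusion = {
        "type": "not",
        "field": {
            "type": "or",
            "fields": [
                {"type": "selector", "dimension": dimension, "value": v}
                for v in seen_values
            ],
        },
    }
    existing = base_query.get("filter")
    if existing is None:
        query["filter"] = exclusion
    else:
        # Keep the caller's filter and add the exclusion on top of it.
        query["filter"] = {"type": "and", "fields": [existing, exclusion]}
    return query


def paginate_by_exclusion(run_query, base_query, dimension, max_pages):
    """Re-issue the same topN, excluding values from earlier pages."""
    seen = []
    for _ in range(max_pages):
        rows = run_query(exclude_seen(base_query, dimension, seen))
        if not rows:
            return
        yield rows
        seen.extend(row[dimension] for row in rows)
```

The filter grows by one selector per returned value, so this fits best when only the first few pages are needed.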