Maha lookup extension for Druid


Anand Natarajan

May 13, 2020, 2:28:33 AM
to maha-users
Case #1 depends on the size of the tables, but it should be doable. 

If you have case #2, with 400 million rows, then that's the same number of rows in RocksDB. Pranav gave a good example. So you would have one lookup defined, and you can look up any of the 1500 columns given a key and a column name at query time. The key would come from a datasource and the column would come from the actual query submitted to Druid.
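For concreteness, here is a rough sketch (in Python, just building the native query JSON) of what such a query could look like. The extraction-fn type and field names ("mahaRegisteredLookup", "lookup", "valueColumn") and the datasource/lookup names are illustrative assumptions, not the exact API; the druid-lookups README has the real registration and query details.

```python
import json

# Hypothetical groupBy query: the key comes from the fact datasource column "id2",
# and the lookup column to expand ("status") is chosen at query time.
query = {
    "queryType": "groupBy",
    "dataSource": "fact_datasource",              # assumed fact datasource name
    "granularity": "all",
    "intervals": ["2020-05-01/2020-05-13"],
    "dimensions": [
        {
            "type": "extraction",
            "dimension": "id2",                   # key column from the datasource
            "outputName": "status",
            "extractionFn": {
                "type": "mahaRegisteredLookup",   # assumed extraction-fn type name
                "lookup": "curated_dim_lookup",   # the single registered RocksDB lookup
                "valueColumn": "status",          # which of the 1500 columns to expand
                "replaceMissingValueWith": "unknown"
            }
        }
    ],
    "aggregations": [{"type": "longSum", "name": "m1", "fieldName": "m1"}]
}

print(json.dumps(query, indent=2))  # POST this to the broker's /druid/v2 endpoint
```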

On Tue, May 12, 2020, 5:36 PM pb <prana...@gmail.com> wrote:
Maha lookups support referencing keys to one or more dimensions. We are currently running lookups for more than 10 dimension tables on the historicals.

I just updated the docs with the steps and assumptions (written by Hiral).
I thought it might be helpful for you to get the full picture.

Thank you


On Tue, May 12, 2020 at 5:28 PM Anand Natarajan <nataraj...@gmail.com> wrote:
Thanks Pranav and Hiral. Just out of curiosity, how would you handle case #1?

On Tue, May 12, 2020 at 4:50 PM pb <prana...@gmail.com> wrote:
Please let me know if my assumption is incorrect. Also, the main problem is that the selection of the rows in Druid will depend on the values in the lookups. Is this even feasible? Ans:
Let's take an example:
Fact: ID1, ID2, M1, M2.
RocksDB DIM: D2 -> (a, b, c)

After looking up the D2 id from the DIM, you can use a nested group-by query and apply filtering or sorting on the looked-up values (a, b, c).
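A minimal sketch of that nested group-by pattern, assuming the same illustrative names as above (the extraction-fn field names are assumptions; see the README for the real ones):

```python
import json

# Inner query: expand D2 through the lookup while grouping the fact rows.
inner = {
    "queryType": "groupBy",
    "dataSource": "fact_datasource",
    "granularity": "all",
    "intervals": ["2020-05-01/2020-05-13"],
    "dimensions": [
        "id1",
        {
            "type": "extraction",
            "dimension": "id2",
            "outputName": "d2_value",
            "extractionFn": {
                "type": "mahaRegisteredLookup",   # assumed type name
                "lookup": "curated_dim_lookup",
                "valueColumn": "d2"
            }
        }
    ],
    "aggregations": [{"type": "longSum", "name": "m1", "fieldName": "m1"}]
}

# Outer query: filter and sort on the looked-up value produced by the inner query.
outer = {
    "queryType": "groupBy",
    "dataSource": {"type": "query", "query": inner},   # nested (query) datasource
    "granularity": "all",
    "intervals": ["2020-05-01/2020-05-13"],
    "dimensions": ["d2_value"],
    "filter": {"type": "in", "dimension": "d2_value", "values": ["a", "b"]},
    "aggregations": [{"type": "longSum", "name": "m1", "fieldName": "m1"}],
    "limitSpec": {
        "type": "default",
        "limit": 100,
        "columns": [{"dimension": "m1", "direction": "descending"}]
    }
}

print(json.dumps(outer, indent=2))
```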

As you said, most of the values are boolean, so it should be fine. You can do a POC and get the performance numbers. I will keep you posted on the FlatBuffers progress.

Thank you
 

On Tue, May 12, 2020 at 4:36 PM Anand Natarajan <nataraj...@gmail.com> wrote:
Thanks Hiral. I'm trying to describe what you are referring to as #2 in your email. 

On Tue, May 12, 2020 at 2:38 PM tazan007 <taza...@gmail.com> wrote:
@Anand Natarajan
Let's clear up a few things here so we are all on the same page before we go any further. In our conversation, when we refer to a fact or a dimension, it is with respect to STAR or SNOWFLAKE schema implementations. When you say you have 1500 dimensions, I assume you mean one of the following; tell us which one:

1. 1500 dimension tables with their own unique keys which are referenced by a fact table.  E.g. a single fact table with 1500 dimensions like in a STAR schema.
2. 1500 dimension columns in one dimension table with each row having a key and 1500 columns.  E.g. a single fact table with 1 dimension like in a STAR schema.

Thanks,
Hiral Patel


On Tue, May 12, 2020 at 12:39 PM Anand Natarajan <nataraj...@gmail.com> wrote:
Thank you Pranav and Tazan007 for a quick response and for great clarity. 
We are exploring Druid as an option to serve as a search datastore to drive a self-serve portal for data exploration. We did explore Elasticsearch as an option, but the mapping explosion is a dead end that we want to avoid. Because of the column-oriented nature of Druid and its inverted indexing capability, this option seems to top the list. But Druid has a fairly serious limitation in terms of dimension updates on the historicals.
All 1500 dimensions are from one curated data source which contains about 400 million rows. Some normalization is possible but is quite difficult for all dimensions. Most of the columns are boolean values, but with lookups (since you need to update values frequently), I presume we have to go from low cardinality (of 2) to high cardinality (of, say, 300 million, because the dim values in Druid will be the ids of the keys in RocksDB). Please let me know if my assumption is incorrect. Also, the main problem is that the selection of the rows in Druid will depend on the values in the lookups. Is this even feasible?

I was thinking more along the lines of creating a unique id for each internal_id-to-column combination, storing those unique ids in RocksDB, and assigning values to them (which can be updated frequently). So we are looking at about 400 million rows x 1500 dimensions = 600 billion key-value pairs in RocksDB. How would this setup change the query response times (all else equal)?
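To make that key layout concrete, a quick back-of-the-envelope sketch (the names here are made up, purely to show the arithmetic):

```python
# One RocksDB entry per (internal_id, column) combination.
NUM_ROWS = 400_000_000         # rows in the curated source
NUM_DIM_COLUMNS = 1_500        # dimension columns per row

def composite_key(internal_id: int, column: str) -> bytes:
    # e.g. b"12345:col_0042" -> value (updated independently of Druid segments)
    return f"{internal_id}:{column}".encode("utf-8")

total_entries = NUM_ROWS * NUM_DIM_COLUMNS
print(f"{total_entries:,} key-value pairs")  # 600,000,000,000 -> 600 billion
```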

Thanks,

Anand. 


On Mon, May 11, 2020 at 11:49 PM tazan007 <taza...@gmail.com> wrote:
Yes, we did some performance tests back in 2016 or 2017; Pavan would know for sure. The numbers were as good as our query performance with Oracle RAC at the time, and performance has only improved since those initial tests, so I imagine it is better now. I think Pranav already answered most of your questions. 400 million is big but not impossible; it all depends on your hardware. I think we were in the 300s a couple of years ago; Pavan would know if it is as high as the 400s now.

Your biggest limitation is going to be the 1500 dimension columns. A single row that is 1500 columns wide is going to be hard on the Java garbage collector if the cardinality of the query is high. E.g. if the fact table in Druid is queried and a million rows are selected, which look up a million unique values in the 400 million row lookup, then you basically have a million 1500-column-wide values being deserialized. As Pranav noted, FlatBuffers would solve this problem with their lazy deserialization.

But I guess you should understand the kind of queries your users are going to be making first. If, for example, a million rows in Druid are selected but only 100,000 unique values are looked up in the 400 million row lookup (due to repeated values), then you might be fine. You should just try it out, see what kind of performance you get, and tune it from there. Maybe you will find that most of your customers are only using 10 of those dimension columns.


On Mon, May 11, 2020 at 7:43 PM pb <prana...@gmail.com> wrote:
Correcting the question: if you *normalized* your dataset, then how many *columns* and rows would be present in each dimension table?

On Mon, May 11, 2020 at 7:40 PM pb <prana...@gmail.com> wrote:
Hello Anand, 
I am good, nice to meet you. I have some questions for you: if you denormalized your dataset, how many rows would be present in the dimension table? What is the average size of the values in one row? In the Maha lookup extension, we do a query-side lookup of the keys present in the datasets and expand the values from RocksDB.
  • Is there a max limitation on the number of dimensions that I can store in, say, RocksDB (I see support for this KV store)?
  • Ans: There is no limitation as such on the number of dimensions that we can store; we use protobuf to serialize the dimensions and store them in RocksDB against the lookup key id. If you have repeated dim keys to look up in the query, we also have a query-level cache that can be enabled with a flag in the Druid query; it will work well in that case. If you have a large number of dimensions, each dimension row has large values (some big strings), and you are looking up too many keys (let's say more than 1 million), then you might see GC problems on the historical, because it looks up the value from RocksDB, deserializes the protobuf, fills the requested dimensions, and discards the protobuf. The good news is we are actually trying to fix this problem with Google's FlatBuffers; I will have a PR ready by next week. If you use FlatBuffers then you do not need to deserialize the full buffer every time; it can jump to the requested columns. This will be a huge performance gain compared to protobuf.

  • How is the query performance if we choose to go with such a large number of dimensions to be stored outside of Druid? 
  • Ans: I do not have numbers for remote dimensions (maybe other contributors can answer this question better): @tazan007 @pavan ab. We have a MongoDB lookup extension; @tazan007 can give performance numbers. To get good query performance, you need to load the remote dims into RocksDB using the maha extension. For dimension updates, we have a Kafka listener which can update RocksDB on the fly; a rough registration sketch follows these answers. (You can take a look at the docs: https://github.com/yahoo/maha/blob/master/druid-lookups/README.md)

  • Does your team have any performance benchmarking that you can share to understand the framework's limitation?
  • Ans: I do not have a detailed benchmark table right now; I need to dig it up. On average, with around 10k dimension lookups in a given query, we get a response back in around 200-300 ms. Again, this is just one case; it depends on the hardware and other factors. Let me find the detailed benchmarks.
Adding other contributors to cc as well.  
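For reference, a rough sketch of what registering such a lookup through Druid's lookup-coordinator API could look like. The extractionNamespace type and its fields below are placeholders, not the real property names; the README linked above has the actual configuration for the RocksDB-backed lookup and the Kafka-based updates.

```python
import json
import urllib.request

# Hypothetical lookup spec: a RocksDB-backed Maha lookup whose entries are
# refreshed on the fly by a Kafka listener. Field names are illustrative only.
lookup_spec = {
    "version": "v1",
    "lookupExtractorFactory": {
        "type": "cachedNamespace",
        "extractionNamespace": {
            "type": "maharocksdb",                       # placeholder type name
            "lookupName": "curated_dim_lookup",
            "rocksDbInstanceHDFSPath": "/lookups/curated_dim",  # placeholder path
            "kafkaTopic": "curated_dim_updates",         # topic the listener tails
            "pollPeriod": "PT30M"
        }
    }
}

# Per-lookup registration endpoint on the coordinator (tier "__default").
req = urllib.request.Request(
    "http://coordinator-host:8081/druid/coordinator/v1/lookups/config/__default/curated_dim_lookup",
    data=json.dumps(lookup_spec).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
urllib.request.urlopen(req)
```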

Thank you

    On Mon, May 11, 2020 at 5:40 PM Anand Natarajan <nataraj...@gmail.com> wrote:
    Hi Pranav, 

    How are you? I work for Intuit, and we are working on a project that deals with operational analytics of a data source that has slowly changing dimensions. Since Druid does not support updates and requires that the segment files be reindexed or the datasource re-ingested (say in a batch fashion), we are currently exploring the maha-lookup extension to see if the framework can offset this Druid limitation.

    The dataset we are dealing with has about 400 million rows with about 1500 dimensions, of which about 2 million rows are updated daily. I have a few questions that can help me decide if the maha-druid-lookup extension is the path to take:
    1. Is there a max limitation on the number of dimensions that I can store in, say, RocksDB (I see support for this KV store)?
    2. How is the query performance if we choose to go with such a large number of dimensions stored outside of Druid?
    3. Does your team have any performance benchmarks that you can share, to understand the framework's limitations?
    Looking forward to hearing from you. 

    Thanks,

    Anand. 



    --
    Pranav Bhole 
    Software Development Engineer at Yahoo!
    Email: prana...@gmail.com 
    Cell Phone No: 408-507-4773.




Anand Natarajan

May 13, 2020, 3:09:11 PM
to maha-users
Thanks Pranav & Hiral for your responses! Thinking about the problem, you can model it either way (#1 or #2). Let's say I go with #2; I fail to understand how I can leverage the inverted index capability of Druid. This model (and the solution that follows from it) is applicable to querying based on specific ids.

The problem I'm trying to solve is: given these query conditions (or dimension values), what is the total count of ids? It's more like a search system. I really wish Druid had ACID compliance... :-)

Thanks,

Anand.


    --
    You received this message because you are subscribed to the Google Groups "maha-users" group.
    To unsubscribe from this group and stop receiving emails from it, send an email to maha-users+...@googlegroups.com.
    To view this discussion on the web visit https://groups.google.com/d/msgid/maha-users/8052bd07-a393-4c87-a1bf-9a68a0e2591b%40googlegroups.com.

pb

May 13, 2020, 8:40:39 PM
to Anand Natarajan, maha-users
The inverted index comes into the picture only for the segment scan in the initial fact query; using the inverted index and the request filters, Druid decides which ids to look up from RocksDB. Once the values are filled into the intermediate result, the inverted index is no longer used. Currently we don't optimize the storage of these looked-up values; an inverted index would help only if you have repeated strings in the dim values. All values are kept as-is in the intermediate result. If you want to filter or sort on these looked-up values, it can be done using a nested outer group-by Druid query (as in the nested group-by sketch earlier in this thread).

The problem I'm trying to solve is: given these query conditions (or dimension values), what is the total count of ids - more like a search system.
Ans => If you want a distinct count of ids from the fact table, then you can use sketches; maha supports them. For running a search query on looked-up dimension values, maha supports like, in, not in, or, and and filters. If you want to design something like regex search on the dimension values, then it looks more like an Elasticsearch use case.
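As an illustration of the sketch-based distinct count, here is a minimal native query using Druid's thetaSketch aggregator over the id column. The datasource and column names are hypothetical, and a filter on looked-up dimension values would use the nested group-by pattern from earlier in the thread:

```python
import json

# Approximate distinct count of id1 under a filter, via a theta sketch.
query = {
    "queryType": "timeseries",
    "dataSource": "fact_datasource",
    "granularity": "all",
    "intervals": ["2020-05-01/2020-05-13"],
    "filter": {"type": "selector", "dimension": "id2", "value": "some_key"},
    "aggregations": [
        {"type": "thetaSketch", "name": "distinct_ids", "fieldName": "id1"}
    ]
}

print(json.dumps(query, indent=2))
```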

