Maha-lookups supports referencing keys to one or more dimensions. We are currently running more than 10 dimension-table lookups on the historicals. I just updated the docs with the steps and assumptions (written by Hiral); I thought it might be helpful for you to get the full picture.
Thank you

On Tue, May 12, 2020 at 5:28 PM Anand Natarajan <nataraj...@gmail.com> wrote:
Thanks Pranav and Hiral. Just out of curiosity, how would you handle case #1?

On Tue, May 12, 2020 at 4:50 PM pb <prana...@gmail.com> wrote:
"Please let me know if my assumption is incorrect? Also, the main problem is that the selection of rows in Druid will depend on the values in the lookups. Is this even feasible?"
Ans: Let's take an example:
Fact: ID1, ID2, M1, M2
RocksDB DIM: D2 -> (a, b, c)
After looking up the D2 id from the dimension, you can use a nested group-by query and apply filtering or sorting on the looked-up values (a, b, c). As you said, most of the values are boolean, so it should be fine. You can do a POC and get the performance numbers. I will keep you posted on the FlatBuffers progress.
Thank you

On Tue, May 12, 2020 at 4:36 PM Anand Natarajan <nataraj...@gmail.com> wrote:
Thanks Hiral. I'm trying to describe what you are referring to as #2 in your email.

On Tue, May 12, 2020 at 2:38 PM tazan007 <taza...@gmail.com> wrote:
@Anand Natarajan, let's clear up a few things here so we are all on the same page before we go any further. In our conversation, when we refer to a fact or a dimension, it is with respect to STAR or SNOWFLAKE schema implementations. When you say you have 1500 dimensions, I assume you mean one of the following; tell us which one:
1. 1500 dimension tables with their own unique keys which are referenced by a fact table, e.g. a single fact table with 1500 dimensions as in a STAR schema.
2. 1500 dimension columns in one dimension table, with each row having a key and 1500 columns, e.g. a single fact table with 1 dimension as in a STAR schema.
Thanks,
Hiral Patel

On Tue, May 12, 2020 at 12:39 PM Anand Natarajan <nataraj...@gmail.com> wrote:
Thank you Pranav and Tazan007 for the quick response and for the great clarity.
We are exploring Druid as an option to serve as a search datastore to drive a self-serve portal for data exploration. We did explore Elasticsearch as an option, but mapping explosion is a dead end that we want to avoid. Because of the column-oriented nature of Druid and its inverted indexing capability, this option seems to top the list. But Druid has a far more serious limitation in terms of dimension updates on historicals.
All 1500 dimensions are from one curated data source which contains about 400 million rows. Some normalization is possible but is quite difficult for all dimensions. Most of the columns are boolean values, but with lookups (since you need to update values frequently), I presume we have to go from low cardinality (of 2) to high cardinality (of, say, 300 million, because the dim values in Druid will be ids of the key in RocksDB). Please let me know if my assumption is incorrect. Also, the main problem is that the selection of rows in Druid will depend on the values in the lookups. Is this even feasible?
I was thinking more along the lines of creating a unique id for each internal_id-to-column combination, storing those unique ids in RocksDB, and assigning values to them (which can be updated frequently). So we are looking at about 400 million x 1500 dimensions = 600 billion key-values in RocksDB.
How would this set-up change the query response times (all else being equal)?
Thanks,
Anand.

On Mon, May 11, 2020 at 11:49 PM tazan007 <taza...@gmail.com> wrote:
Yeah, we did some performance tests back in 2016 or 2017; Pavan would know for sure. The numbers were as good as our query performance with Oracle RAC at the time. The performance has only improved since those initial tests, so I imagine it is better now. I think Pranav already answered most of your questions. 400 million is big but not impossible; it all depends on your hardware. I think we were in the 300s back two years ago; Pavan would know if it is as high as the 400s now.
Your biggest limitation is going to be the 1500 dimension columns. A single row that is 1500 columns wide is going to be hard on Java garbage collection if the cardinality of the query is high. E.g. if the fact table in Druid is queried and a million rows are selected which look up a million unique values in the 400-million-row lookup, then you basically have a million 1500-column-wide values being deserialized. As Pranav noted, FlatBuffers would solve this problem with their lazy deserialization. But I guess you should understand the kind of queries your users are going to be making first. If, for example, a million rows in Druid are selected but only 100,000 unique values are looked up in the 400-million-row lookup (due to repeated values), then you might be fine. You should just try it out, see what kind of performance you get, and tune it from there. Maybe you will find that most of your customers are only using 10 of those dimension columns.

On Mon, May 11, 2020 at 7:43 PM pb <prana...@gmail.com> wrote:
Correcting Q: if you *normalized* your dataset, then how many *columns* and rows will be present in each dimension table?

On Mon, May 11, 2020 at 7:40 PM pb <prana...@gmail.com> wrote:
Hello Anand,
I am good, nice to meet you. I have some questions for you: if you denormalized your dataset, then how many rows will be present in the dimension table? What is the average size of the values in one row? In the Maha-lookup extension, we do query-side lookup of the keys present in the dataset and expand the values from RocksDB.
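To make that query-side "look up the key, expand the values" step concrete, here is a minimal sketch assuming a local RocksDB instance and a made-up comma-separated encoding for the dimension row (the real extension serializes rows with Protobuf); the class name, key, and column names are illustrative only, not the maha-lookups code.

import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

import java.nio.charset.StandardCharsets;

// Sketch of the query-side lookup step: the Druid result row carries the dim key,
// and the historical expands the requested column by reading the serialized
// dimension row from RocksDB.
public class QuerySideLookupSketch {

    public static void main(String[] args) throws RocksDBException {
        RocksDB.loadLibrary();
        try (Options options = new Options().setCreateIfMissing(true);
             RocksDB db = RocksDB.open(options, "/tmp/dim-lookup")) {

            // Seed one dimension row: key D2 -> columns (a, b, c), as in the example above.
            db.put(bytes("D2"), bytes("a=true,b=false,c=true"));

            // Query side: expand column "b" for the key D2 seen in the fact rows.
            byte[] serializedRow = db.get(bytes("D2"));
            System.out.println("D2.b = " + decodeColumn(serializedRow, "b"));
        }
    }

    // Stand-in for the real deserialization (Protobuf today, FlatBuffers in the planned PR).
    private static String decodeColumn(byte[] serializedRow, String column) {
        for (String pair : new String(serializedRow, StandardCharsets.UTF_8).split(",")) {
            String[] kv = pair.split("=");
            if (kv[0].equals(column)) {
                return kv[1];
            }
        }
        return null;
    }

    private static byte[] bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }
}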
- Is there a max limitation on the number of dimensions that I can store in, say, RocksDB (I see support for this KV store)?
- Ans: No limitation as such on the number of dimensions that we can store. We use Protobuf to serialize each dimension row and store it in RocksDB against the lookup key id. If you have repeated dim keys to look up in the query, we also have a query-level cache that can be enabled with a flag in the Druid query; it works well in that case. However, if you have a large number of dimensions, each dimension row has large values (some big strings), and you look them up many times (let's say more than 1 million), then you might see GC pressure on the historical, because each lookup reads the value from RocksDB, deserializes the Protobuf, fills the requested dimensions, and discards the Protobuf object. The good news is that we are trying to fix this problem with Google's FlatBuffers; I will have a PR ready by next week. With FlatBuffers you do not need to deserialize the full buffer every time; it can jump straight to the requested columns, which should be a huge performance gain compared to Protobuf (there is a rough sketch of the difference after these answers).
- How is the query performance if we choose to go with such a large number of dimensions to be stored outside of Druid?
- Ans: I do not have numbers for remote dimensions (maybe the other contributors can answer this question better) @tazan007 @pavan ab. We have a MongoDB lookup extension; @tazan007 can give performance numbers for it. To get good query performance, you need to load the remote dims into RocksDB using the maha extension. For dimension updates, we have a Kafka listener which can update RocksDB on the fly (also sketched after these answers). You can take a look at the docs: https://github.com/yahoo/maha/blob/master/druid-lookups/README.md
- Does your team have any performance benchmarking that you can share to understand the framework's limitation?
- Ans: I do not have a detailed benchmark table right now; I need to dig it up. On average, for around 10k dimension lookups in a given query, we get the response back in around 200-300 ms. Again, this is just one case; it depends on the hardware and other factors. Let me find the detailed benchmarks.
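To illustrate the GC point in the first answer, here is a toy model of the difference between decoding a whole 1500-column row on every lookup (the Protobuf-style path) and jumping straight to the one requested column in the buffer (the access pattern that FlatBuffers' lazy reads enable). It uses a made-up fixed-width encoding, not the real Protobuf or FlatBuffers APIs.

import java.nio.ByteBuffer;

// Toy model only: contrasts "decode the whole row" with "read just the requested
// column from the buffer". It is not the maha serialization code.
public class LazyVsFullDecodeSketch {

    static final int COLUMNS = 1500;

    // Encode a row as 1500 fixed-width ints so column offsets are trivial.
    static ByteBuffer encodeRow() {
        ByteBuffer buf = ByteBuffer.allocate(COLUMNS * Integer.BYTES);
        for (int col = 0; col < COLUMNS; col++) {
            buf.putInt(col * Integer.BYTES, col % 2); // boolean-ish dimension values
        }
        return buf;
    }

    // "Protobuf-style": materialize every column as Java objects,
    // even if the query only asked for one of them.
    static int[] fullDecode(ByteBuffer row) {
        int[] all = new int[COLUMNS];
        for (int col = 0; col < COLUMNS; col++) {
            all[col] = row.getInt(col * Integer.BYTES);
        }
        return all; // 1500 values allocated per looked-up key
    }

    // "FlatBuffers-style": jump straight to the requested column, no row object.
    static int lazyDecode(ByteBuffer row, int column) {
        return row.getInt(column * Integer.BYTES);
    }

    public static void main(String[] args) {
        ByteBuffer row = encodeRow();
        // A query asking for a single dimension column, e.g. column 42:
        System.out.println("full decode -> " + fullDecode(row)[42]);
        System.out.println("lazy decode -> " + lazyDecode(row, 42));
    }
}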
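And here is a minimal sketch of the "Kafka listener updates RocksDB on the fly" idea: a consumer that applies dimension updates directly to the lookup store, so the next query sees the new values without re-ingesting the Druid datasource. The topic name, group id, and serialization are assumptions for illustration, not the maha-lookups implementation.

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

// Sketch: consume dimension updates from Kafka and write them into the RocksDB
// lookup store, keeping it fresh without re-indexing Druid segments.
public class DimUpdateListenerSketch {

    public static void main(String[] args) throws RocksDBException {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "dim-lookup-updater");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());

        RocksDB.loadLibrary();
        try (Options options = new Options().setCreateIfMissing(true);
             RocksDB db = RocksDB.open(options, "/tmp/dim-lookup");
             KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {

            consumer.subscribe(Collections.singletonList("dim-updates")); // assumed topic name

            while (true) {
                ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<byte[], byte[]> record : records) {
                    // Key = dimension lookup id, value = serialized dimension row.
                    db.put(record.key(), record.value());
                }
            }
        }
    }
}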
Adding the other contributors to cc as well.
Thank you

On Mon, May 11, 2020 at 5:40 PM Anand Natarajan <nataraj...@gmail.com> wrote:
Hi Pranav,
How are you? I work for Intuit and we are working on a project that deals with operational analytics of a data source that has slowly changing dimensions. Since Druid does not support updates and requires that the segment files be re-indexed or the datasource re-ingested (say, in a batch fashion), we are currently exploring the maha-lookup extension to see if the framework can offset this Druid limitation.
The dataset we are dealing with has about 400 million rows with about 1500 dimensions, of which about 2 million rows are updated daily. I have a few questions that will help me decide if the maha-druid-lookup extension is the path to take:
- Is there a max limitation on the number of dimensions that I can store in, say, RocksDB (I see support for this KV store)?
- How is the query performance if we choose to go with such a large number of dimensions to be stored outside of Druid?
- Does your team have any performance benchmarking that you can share to understand the framework's limitation?
Looking forward to hearing from you.
Thanks,
Anand.

--
Cell Phone No: 408-507-4773.