Hive and Druid integration

Jesus Camacho Rodriguez

Sep 9, 2016, 5:28:26 AM
to Druid Development
Hi everyone,

I have been working over the last few months on the integration of Druid and Apache Hive. Yesterday we pushed the first part of this work to Hive master (HIVE-14217), which allows Druid data sources to be queried from Hive.

Internally, the Hive optimizer is powered by Apache Calcite, so we leveraged the work that had already been done in Calcite and extended it to recognize new types of queries (Timeseries, TopN, GroupBy, and Select, so far), push time intervals down to Druid, recognize different granularities, etc. Further, we created a Hive storage handler specific to Druid, which is responsible for submitting JSON queries to the broker and retrieving the results. More details about the design, setup, and current status can be found here.
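
To give a quick sense of the usage, here is a minimal sketch; the broker address, the data source name, and the column used are illustrative assumptions, not taken from a real deployment:

SET hive.druid.broker.address.default=localhost:8082;  -- illustrative address

-- Register an external Hive table on top of an existing Druid data source;
-- the schema is discovered automatically from Druid's segment metadata.
CREATE EXTERNAL TABLE druid_wikipedia
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.datasource" = "wikipedia");

-- An aggregation grouped by a time granularity can be recognized as a
-- Druid Timeseries query (floor_year is the granularity function used in
-- the integration's documentation; adjust to your version).
SELECT floor_year(`__time`), SUM(added)
FROM druid_wikipedia
GROUP BY floor_year(`__time`);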

If you are interested in this work, please go ahead and use it! It would be great to gather feedback from the Druid community. If you find any issues, please post them to HIVE-14473, which we are using to track follow-up work.

Thanks,
Jesús

Gian Merlino

Sep 9, 2016, 1:23:34 PM
to druid-de...@googlegroups.com
Hey Jesús,

Looking great!! Thanks for posting!

Do you have any plans to have Hive talk directly to data nodes rather than the broker? I think there are good opportunities to parallelize query push down if the broker is removed as a possible bottleneck.

Also, do you have any plans to contribute your additional optimizer work to Calcite? It seems like the enhancements of https://issues.apache.org/jira/browse/HIVE-14466 would be useful in base Calcite but are currently only available in Hive. But I'm not too familiar with either the Hive or Calcite sources, and might be missing a technical reason it has to be this way.

Gian

Jesus Camacho Rodriguez

Sep 9, 2016, 2:44:17 PM
to druid-de...@googlegroups.com
Hi Gian,

Talking to the data nodes rather than the broker is on our roadmap. Indeed, we think it is needed to scale for queries that return a large result set from Druid, on which you might want to execute additional operations in Hive.
Note that for Select queries we already parallelize (although trivially). In particular, we get the total row count of the result set and split the query across the time dimension, using a per-split row threshold; this assumes we get a roughly constant number of records per split, i.e. that records are uniformly distributed across the time dimension. All these queries still hit the broker, so scalability is limited by the number of concurrent queries the broker can serve. However, it is my understanding that multiple Druid brokers can coexist within the same deployment? I think initially that would help.
For Timeseries, TopN, and GroupBy queries, we do not do anything special to parallelize query pushdown.
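
For reference, the row threshold used to size Select splits is exposed as a configuration property; a minimal sketch (property name and default are taken from the Hive code, so treat them as assumptions for other versions):

-- Approximate number of rows each parallel Select split should return; the
-- query's time interval is divided so that each split stays near this size
-- (10000 is the assumed default).
SET hive.druid.select.threshold=10000;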

Although the development has been done in Hive, the code using Calcite is being contributed back to the project (see CALCITE-1357, CALCITE-1358). The idea is that once there is a new Calcite release and Hive is upgraded to use it, we can get rid of that code on the Hive side; moving code back and forth like this is something we have frequently done to develop faster.

Thanks,
Jesús




Akul Narang

Feb 20, 2018, 10:08:28 AM
to Druid Development
Hey Jesus,

I'm trying to integrate Hive with Druid as a data source, but I'm unable to do so due to the error below. I have already added hive-druid-handler-2.2.0.jar.

QUERY:

hive> CREATE TABLE device_cap_druid
    > STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
    > TBLPROPERTIES ("druid.datasource" = "device_cap_druid")
    > AS
    > select cast(unix_timestamp(`EVENT_DATE`,'yyyy-MM-dd') as timestamp) as `__time`, `TAC`, `DEVICE_MODEL`, `CAPABILITY_CLASS`, `CAPABILITY_NAME`, `AGG_CAPABILITY_VALUE` from device.device_capability;

Error: java.lang.ClassNotFoundException: org.roaringbitmap.IntIterator
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:338)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.hive.druid.com.metamx.collections.bitmap.ConciseBitmapFactory.<clinit>(ConciseBitmapFactory.java:31)
at org.apache.hive.druid.io.druid.segment.data.ConciseBitmapSerdeFactory.<clinit>(ConciseBitmapSerdeFactory.java:36)
at org.apache.hive.druid.io.druid.segment.IndexSpec.<init>(IndexSpec.java:107)
at org.apache.hive.druid.io.druid.segment.IndexSpec.<init>(IndexSpec.java:70)
at org.apache.hive.druid.io.druid.segment.indexing.RealtimeTuningConfig.<clinit>(RealtimeTuningConfig.java:49)
at org.apache.hadoop.hive.druid.io.DruidOutputFormat.getHiveRecordWriter(DruidOutputFormat.java:182)
at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getRecordWriter(HiveFileFormatUtils.java:284)
at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:269)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.createBucketForFileIdx(FileSinkOperator.java:608)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.createBucketFiles(FileSinkOperator.java:552)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.closeOp(FileSinkOperator.java:1015)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:697)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:711)
at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.close(ExecReducer.java:279)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:453)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1692)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

Previously, I was getting an error because Hive was looking for hive-druid-handler-2.2.0.jar on HDFS. I was not able to solve this, so I simply copied the jar to that HDFS path.
I'm not sure, but is the above error a side effect of that workaround?

Also, I'd like to know whether the integration supports simple UPDATE and DELETE queries as well.

Thanks,
Akul

Slim Bouguerra

Feb 21, 2018, 12:04:32 AM
to Druid Development
It seems like some jars are missing from the uber connector jar.
I am not sure how you are building the connector, but make sure the shaded/uber jar build includes the bitmap libraries (RoaringBitmap, per the missing class above) as well.
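
If rebuilding the connector is not an option, a session-level workaround is to register the missing dependency explicitly, which also ships it to the MapReduce tasks. A sketch; the path and jar version are assumptions based on the default tarball layout:

-- Path and version are illustrative; run from the Hive CLI/Beeline
-- before the CTAS statement:
ADD JAR /opt/hive-2.2.0/lib/RoaringBitmap-0.5.18.jar;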

Akul Narang

Feb 21, 2018, 12:29:43 AM
to druid-de...@googlegroups.com
Hey Slim,

I'm using the Hive 2.2.0 tarball directly. It already contains $HIVE_HOME/lib/hive-druid-handler-2.2.0.jar, but the MapReduce job is trying to find the jar at hdfs://localhost:9000/opt/hive-2.2.0/lib/hive-druid-handler-2.2.0.jar, which is the $HIVE_HOME path on HDFS rather than on the local file system.

Any help on this? Why is the CREATE TABLE device_cap_druid STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler' command looking for $HIVE_HOME on HDFS?


Cheers,
Akul Narang


Pawan Goyal

Apr 10, 2018, 7:17:25 AM
to Druid Development
Use the command below to add the required jars to the classpath:


hive --auxpath /usr/local/Cellar/hive/2.3.1/libexec/lib/hive-druid-handler-2.3.1.jar,/usr/local/Cellar/hive/2.3.1/libexec/lib/RoaringBitmap-0.5.18.jar
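
Jars passed via --auxpath are added to the Hive client classpath and shipped with submitted jobs, so this should also address the ClassNotFoundException above without rebuilding the connector.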


