Hello.
I'm using an Azure HDInsight (Spark) cluster to batch process a large amount of log files, and I'm storing the results in Hive tables. For visualization I want to use Druid as the backend. To transfer the data, I currently export the Hive data to JSON files in HDFS, download the files to the machine running Druid, and then run indexer tasks on the local JSON files. I'm sure there is a more efficient way to move the data between Spark and Druid; does anyone have a better suggestion?
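For reference, this is roughly what my current local-file step looks like: building an "index" task spec with a "local" firehose and submitting it to the overlord. The datasource name, directory, dimensions, and timestamp column below are just placeholders for my actual schema:

```python
import json

def build_local_index_task(data_dir, datasource, dimensions, timestamp_col):
    """Build a Druid 'index' task spec that ingests JSON files from a
    local directory via the 'local' firehose. All names are placeholders."""
    return {
        "type": "index",
        "spec": {
            "dataSchema": {
                "dataSource": datasource,
                "parser": {
                    "type": "string",
                    "parseSpec": {
                        "format": "json",
                        "timestampSpec": {"column": timestamp_col, "format": "auto"},
                        "dimensionsSpec": {"dimensions": dimensions},
                    },
                },
                "granularitySpec": {
                    "type": "uniform",
                    "segmentGranularity": "DAY",
                    "queryGranularity": "NONE",
                },
            },
            "ioConfig": {
                "type": "index",
                # Ingest every *.json file under data_dir
                "firehose": {"type": "local", "baseDir": data_dir, "filter": "*.json"},
            },
        },
    }

task = build_local_index_task("/data/druid/json", "logs", ["host", "level"], "ts")
print(json.dumps(task, indent=2))

# The spec is then POSTed to the overlord's task endpoint, e.g.:
# requests.post("http://overlord:8090/druid/indexer/v1/task", json=task)
```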
I don't want to keep the HDInsight cluster running after the processing, so the data needs to reside on the Druid cluster in the end.
I think there's also the option of having Druid's index task fetch the JSON files directly from HDFS, but that would require connecting Druid to the HDInsight Hadoop cluster, which I haven't done yet. At least it would save me the hassle of transferring the large files.
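If I understand correctly, the HDFS variant would be a Hadoop batch ("index_hadoop") task whose ioConfig points at the HDFS paths; something like the sketch below, where the namenode host and paths are placeholders, and Druid would also need the cluster's Hadoop config files (core-site.xml, hdfs-site.xml) on its classpath:

```json
{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": { "...": "same dataSchema as in a local index task" },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "paths": "hdfs://namenode:8020/exports/logs/*.json"
      }
    }
  }
}
```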
What do you guys think?
Best regards,
André