Hello.
I'm using an Azure HDInsight (Spark) cluster to batch process a large amount of log files, and I'm storing the results in Hive tables. For visualization I want to use Druid as the backend. To transfer the data, I currently export the Hive data to JSON files in HDFS, download the files to the machine running Druid, and then run indexer tasks on the local JSON files. I'm sure there is a more efficient way to move the data between Spark and Druid; does anyone have a better suggestion?
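For reference, this is roughly what my current local-file step looks like: building an "index" task spec with a "local" firehose and submitting it to the overlord. The datasource name, directory, dimensions, and timestamp column below are just placeholders for my actual schema:

```python
import json

def build_local_index_task(data_dir, datasource, dimensions, timestamp_col):
    """Build a Druid 'index' task spec that ingests JSON files from a
    local directory via the 'local' firehose. All names are placeholders."""
    return {
        "type": "index",
        "spec": {
            "dataSchema": {
                "dataSource": datasource,
                "parser": {
                    "type": "string",
                    "parseSpec": {
                        "format": "json",
                        "timestampSpec": {"column": timestamp_col, "format": "auto"},
                        "dimensionsSpec": {"dimensions": dimensions},
                    },
                },
                "granularitySpec": {
                    "type": "uniform",
                    "segmentGranularity": "DAY",
                    "queryGranularity": "NONE",
                },
            },
            "ioConfig": {
                "type": "index",
                # Ingest every *.json file under data_dir
                "firehose": {"type": "local", "baseDir": data_dir, "filter": "*.json"},
            },
        },
    }

task = build_local_index_task("/data/druid/json", "logs", ["host", "level"], "ts")
print(json.dumps(task, indent=2))

# The spec is then POSTed to the overlord's task endpoint, e.g.:
# requests.post("http://overlord:8090/druid/indexer/v1/task", json=task)
```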
I don't want to keep the HDInsight cluster running after the processing, so the data needs to reside on the Druid cluster in the end.
I think there's also the option of having Druid's index task fetch the JSON files directly from HDFS, but that would require connecting Druid to the HDInsight Hadoop cluster, which I haven't done yet. At least it would save me the hassle of transferring the large files.
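If I understand correctly, the HDFS variant would be a Hadoop batch ("index_hadoop") task whose ioConfig points at the HDFS paths; something like the sketch below, where the namenode host and paths are placeholders, and Druid would also need the cluster's Hadoop config files (core-site.xml, hdfs-site.xml) on its classpath:

```json
{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": { "...": "same dataSchema as in a local index task" },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "paths": "hdfs://namenode:8020/exports/logs/*.json"
      }
    }
  }
}
```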
What do you guys think?
Best regards,
André