Druid and Apache Parquet: a key to interoperability?

ste...@activitystream.com

Jul 11, 2015, 4:24:54 PM
to druid-de...@googlegroups.com
Hi,

I have been looking at Drill and Parquet, and even though I found Drill very interesting, since it solves many use cases Druid is not meant to solve (variable datasources/datatypes etc.), I found it lacking in many areas (histograms/HyperLogLog etc.) that are important for us when running our analytic queries.

The only question that remains is whether you have considered supporting Apache Parquet as the segment file format.

There are several reasons I ask:
  • It seems to offer superb encoding/compression/efficiency
  • It supports nested structures (complex JSON structures)
I know it's missing HyperLogLog/histograms now, but they seem to be on the roadmap (in their Jira issues), and if Druid were to support them, that would open up a lot of interoperability (I think).

I'm certainly not qualified for any in-depth discussion about Druid architecture, but I wanted to ask whether you have considered it as a possibility going forward.

Best regards,
 -Stefan

Fangjin Yang

Jul 12, 2015, 7:08:54 PM
to druid-de...@googlegroups.com, ste...@activitystream.com
Hi Stefan, see inline.


On Saturday, July 11, 2015 at 1:24:54 PM UTC-7, ste...@activitystream.com wrote:
Hi,

I have been looking at Drill and Parquet, and even though I found Drill very interesting, since it solves many use cases Druid is not meant to solve (variable datasources/datatypes etc.), I found it lacking in many areas (histograms/HyperLogLog etc.) that are important for us when running our analytic queries.

The only question that remains is whether you have considered supporting Apache Parquet as the segment file format.

We have considered it before for storing dimension values, and that is still a possibility. I recall other committers also looked at it for storing metric values, but the format seemed poor for fast scans and would have hurt performance.
 
There are several reasons I ask:
  • It seems to offer superb encoding/compression/efficiency
I'd be curious to understand more about what compression/encoding algorithms Parquet is using that would give it a significant reduction in size over the current compression algorithms that exist in Druid.

According to https://github.com/Parquet/parquet-format/blob/master/Encodings.md, there's some overlap with how Druid columns are stored in terms of dictionary encoding and RLE. It may be interesting to see how delta compression reduces the sizes of the metric columns.
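
To make that overlap concrete, here is a minimal sketch in plain Java (illustrative only, not actual Druid or Parquet internals) of dictionary-encoding a low-cardinality string column and then run-length-encoding the resulting ID stream, which is the shared core of how both formats store dimension-like columns:

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch (not Druid or Parquet code) of the two encodings the formats
// share for low-cardinality string columns: dictionary-encode values into
// integer IDs, then run-length-encode the ID stream.
public class DictionaryRleSketch {
    public static void main(String[] args) {
        List<String> column = List.of("US", "US", "US", "DE", "DE", "US", "US");

        // Dictionary encoding: each distinct value gets a small integer ID.
        Map<String, Integer> dictionary = new LinkedHashMap<>();
        List<Integer> ids = new ArrayList<>();
        for (String value : column) {
            ids.add(dictionary.computeIfAbsent(value, v -> dictionary.size()));
        }

        // Run-length encoding over the ID stream: store (id, runLength) pairs.
        List<int[]> runs = new ArrayList<>();
        for (int id : ids) {
            if (!runs.isEmpty() && runs.get(runs.size() - 1)[0] == id) {
                runs.get(runs.size() - 1)[1]++;
            } else {
                runs.add(new int[]{id, 1});
            }
        }

        System.out.println("dictionary = " + dictionary);   // {US=0, DE=1}
        runs.forEach(r -> System.out.println("id=" + r[0] + " x" + r[1]));
        // id=0 x3, id=1 x2, id=0 x2
    }
}

Delta encoding for metric columns follows the same spirit: store a base value and then the (usually small) differences between consecutive values, which compress much better than the raw numbers.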
 
  • It supports nested structures (complex JSON structures)
I think this is the main use case that folks would like in Druid. It may make more sense to add support for nested structures according to the original Dremel paper that Parquet is based on, rather than integrating Parquet columns into Druid. This will require a longer discussion and a proposal, though.
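
For context on what Dremel-style nesting support would mean, below is a toy, hand-shredded illustration (not Parquet's writer API; the class names are made up) of how a repeated field is striped into a flat column by pairing each value with a repetition level (r) and a definition level (d):

import java.util.List;

// Toy illustration of Dremel-style record shredding for a schema like:
//   message Doc { required string name; repeated string tag; }
// with records:
//   {name: "a", tag: ["x", "y"]}
//   {name: "b", tag: []}
//   {name: "c", tag: ["z"]}
// r=0 starts a new record, r=1 continues the current record's tag list;
// d=1 means a tag is present, d=0 means the list was empty (null entry).
public class DremelShreddingSketch {
    record StripedValue(String value, int r, int d) {}

    public static void main(String[] args) {
        List<StripedValue> tagColumn = List.of(
            new StripedValue("x", 0, 1),
            new StripedValue("y", 1, 1),
            new StripedValue(null, 0, 0),   // record "b" has no tags
            new StripedValue("z", 0, 1)
        );

        for (StripedValue v : tagColumn) {
            System.out.printf("value=%s r=%d d=%d%n", v.value(), v.r(), v.d());
        }
    }
}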

ste...@activitystream.com

Jul 13, 2015, 6:39:22 AM
to druid-de...@googlegroups.com, ste...@activitystream.com

Thank you Fangjin.

I do understand and appreciate what Druid's main focus is and what it does exceptionally well.
At the higher end of that scale the "special powers" of Druid shine, and I like that.

The ability to access and link/join Druid data with SQL, via Drill, is something that appeals to me, but I realize that this is far from simple or even feasible.

Very best,
  -Stefán

Fangjin Yang

Jul 13, 2015, 11:40:38 PM
to druid-de...@googlegroups.com, ste...@activitystream.com
The problem we've always had with joins is that they are fundamentally expensive and slow, and supporting them may impact Druid's main value add of being able to power user-facing analytic dashboards. We've always just recommended doing joins at ETL time. Query-time lookups (https://github.com/druid-io/druid/pull/1259), though, are adding support for joining a large table with a small table.

Charles Allen

Jul 14, 2015, 8:27:47 PM
to druid-de...@googlegroups.com, ste...@activitystream.com
Very limited joins. Essentially, if you have a denormalized star-ish schema in Druid, you can store dimension values in a lookup and just keep IDs in the datasource itself.
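
For what it's worth, here is a conceptual sketch in plain Java (not Druid's lookup API; the names are made up for illustration) of the pattern described above: the datasource keeps only compact IDs, and a small in-memory lookup maps each ID to its dimension value at query time, effectively a hash join of a large table against a small one:

import java.util.List;
import java.util.Map;

// Conceptual sketch of the lookup pattern: only IDs live in the (large) fact
// datasource, and a small dimension map resolves them at query time.
public class QueryTimeLookupSketch {
    public static void main(String[] args) {
        // Small dimension table, cheap to hold in memory on every node.
        Map<String, String> countryLookup = Map.of(
            "1", "United States",
            "2", "Germany",
            "3", "Iceland"
        );

        // Rows scanned from the fact datasource: only the ID is stored.
        List<String> scannedCountryIds = List.of("1", "1", "3", "2", "1");

        // At query time each ID is replaced by its dimension value; in this
        // sketch, unknown IDs fall back to the raw ID rather than failing.
        scannedCountryIds.stream()
            .map(id -> countryLookup.getOrDefault(id, id))
            .forEach(System.out::println);
    }
}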