[Proposal] More flexible dimension types and indexing


jon...@imply.io

Dec 3, 2015, 4:24:54 PM
to Druid Development

[Proposal] More flexible dimension types and indexing

=============

Currently, Druid assumes that all dimensions have string values and are associated with a bitmap index.

There has been interest in loosening these constraints to support use cases that blur the existing separation between dimensions and metrics, e.g., filtering on numeric columns or aggregating dimensions at query time.

A recent discussion on these topics can be found here:

This proposal calls for two major changes/features:

-----

1.) Remove the assumption that dimensions always have string values. 

This change is a path towards reducing the distinction between dimensions and metrics.

This would involve changes to:
- IncrementalIndex, IndexMerger, etc. (ingestion)
- StorageAdapters (querying)
- On-disk format of segments (storage)
- Ingestion specs (allow the user to specify dimension types, e.g., String, Long, Double)
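
To make the last item concrete, here is a minimal Python sketch of what a typed dimension list could look like and how rows might be coerced against it. This is not Druid code: the spec layout (`name`/`type` entries), the `CASTS` table, and the `coerce_row` helper are all assumptions invented for illustration.

```python
# Hypothetical typed-dimension spec; the field names are illustrative only.
DIMENSIONS = [
    {"name": "page", "type": "string"},
    {"name": "user_id", "type": "long"},
    {"name": "latency", "type": "double"},
]

# Map each declared type to a Python conversion function.
CASTS = {"string": str, "long": int, "double": float}


def coerce_row(row, dimensions=DIMENSIONS):
    """Convert raw input values to the type declared for each dimension.

    Dimensions absent from the row are kept as None (treated as missing).
    """
    typed = {}
    for dim in dimensions:
        raw = row.get(dim["name"])
        typed[dim["name"]] = None if raw is None else CASTS[dim["type"]](raw)
    return typed
```

For example, `coerce_row({"page": "Main", "user_id": "42", "latency": "1.5"})` yields a row whose `user_id` is the integer 42 and whose `latency` is the float 1.5.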

(Perhaps this is also a good time to redesign spatial dimensions to be less of a "special case"?)

-----

2.) Allow user to choose per-column index strategies

Druid could support a wider range of index types beyond bitmaps. Giving users control over what indexes are used on a per-column basis could make Druid more powerful and efficient.

For example, if a dimension is expected to have high cardinality and range filters applied to it, the user may want to choose a tree-based index instead of bitmaps. 
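
A rough sketch of the idea, in Python rather than Druid's Java: a sorted array searched with `bisect` stands in for a tree-based index (a real implementation would more likely use a B-tree or similar). The class and method names are hypothetical. The point is that a range filter costs two binary searches plus the matching rows, regardless of how many distinct values fall inside the range.

```python
import bisect


class SortedValueIndex:
    """Stand-in for a tree-based index: (value, row) pairs kept in sorted order."""

    def __init__(self, column):
        # column[i] is the numeric value stored in row i.
        self._entries = sorted((value, row) for row, value in enumerate(column))
        self._values = [value for value, _ in self._entries]

    def rows_in_range(self, low, high):
        """Return the row ids whose value v satisfies low <= v <= high."""
        start = bisect.bisect_left(self._values, low)
        end = bisect.bisect_right(self._values, high)
        return sorted(row for _, row in self._entries[start:end])
```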

As another example, trie indexes could be used to better support text search on dimension values.
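
As a small illustration (Python, with hypothetical names, and covering prefix matching only rather than general text search), a trie can map each prefix of a dimension value to the rows that contain it:

```python
def _new_node():
    return {"children": {}, "rows": set()}


class TrieIndex:
    """Maps dimension values to row ids and supports prefix lookups."""

    def __init__(self):
        self._root = _new_node()

    def add(self, value, row):
        node = self._root
        node["rows"].add(row)
        for ch in value:
            node = node["children"].setdefault(ch, _new_node())
            # Record the row on every node along the path so that a prefix
            # lookup is a single walk with no subtree traversal.
            node["rows"].add(row)

    def rows_with_prefix(self, prefix):
        """Return the set of row ids whose value starts with prefix."""
        node = self._root
        for ch in prefix:
            node = node["children"].get(ch)
            if node is None:
                return set()
        return set(node["rows"])
```

Storing row sets on every node trades space for lookup speed; a more compact variant would keep rows only at terminal nodes and collect them by walking the subtree.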

The existing ColumnCapabilities class could be used to describe what indexes are supported for a column.

This would involve substantial changes to:
- query-related components
- on-disk storage format
- ingestion specs

-----

These two changes are conceptually separate and could probably be implemented separately (with some later merging/adjustment), although either would be more beneficial with the other change present.

However, it would probably be cleaner/less churn to implement the non-String dimensions first.

A roadmap for this proposal could be:

1. Replace IncrementalIndex with a new Index that accepts non-String dimensions
2. Adjust disk storage format and ingestion specs 
3. Update querying logic to be aware of non-String dimensions
4. Implement per-column index type selection
5. Add a new index type that better supports range queries (appears to be the most common use case for numerical dimensions?)
6. Add other index types as needed

If this plan sounds sensible, shall we go ahead and create GitHub issues for the roadmap items? I am thinking of working on the first item.

Thanks,
Jon

Fangjin Yang

Dec 4, 2015, 2:19:22 AM
to Druid Development
I'm on board with this idea but I recently had an offline discussion with @cheddar about this so it would be interesting to hear his comments in this thread.

Eric Tschetter

Dec 8, 2015, 1:50:06 PM
to Druid Development
You mention changes to the storage format.  I can understand why additions might happen to the storage format, but I'm not sure why the storage format would have to change whole-hog.  Can you elaborate on what changes you expect will be needed?

For range queries on numerical columns, I think that need could be served without abandoning bitmap indexes.  Then again, if we benchmark other indexing options and they turn out to be faster, that's also great.
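
One possible reading of the bitmap-based approach, sketched in Python with hypothetical function names (this is an assumption about how it could work, not a description of existing Druid code): keep the distinct values in a sorted dictionary with one bitmap per value, binary-search the dictionary for the range bounds, and OR together the bitmaps in between. The cost grows with the number of distinct values inside the range, which is what a benchmark against other index types would measure.

```python
import bisect


def build_bitmap_index(column):
    """Return (sorted distinct values, one bitmap per value).

    Bitmaps are Python ints: bit i is set when row i holds the value.
    """
    bitmaps = {}
    for row, value in enumerate(column):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << row)
    dictionary = sorted(bitmaps)
    return dictionary, [bitmaps[v] for v in dictionary]


def range_filter(dictionary, bitmaps, low, high):
    """OR together the bitmaps of every distinct value in [low, high]."""
    start = bisect.bisect_left(dictionary, low)
    end = bisect.bisect_right(dictionary, high)
    result = 0
    for bitmap in bitmaps[start:end]:
        result |= bitmap
    return result


def rows_of(bitmap):
    """List the row ids whose bits are set."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]
```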

In general, I think the high-level idea here is great.  I look forward to seeing and hearing about some of the particulars of the implementation as it makes forward progress.

--Eric

Jonathan Wei

Dec 9, 2015, 8:35:21 PM
to druid-de...@googlegroups.com
Agreed, the storage changes would be incremental, not a whole-hog redesign. I probably could've expressed that more precisely.

The changes I foresee for the storage format would be:
- Additional metadata to track the dimension types
- Omitting the value dictionary for numeric dims
- Support for storing non-bitmap indexes
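
To illustrate the first two items, here is a minimal Python sketch contrasting a dictionary-encoded string column with a numeric column stored directly. The dict layout and function names are made up for this example and do not reflect the actual segment format.

```python
from array import array


def encode_string_column(values):
    """Dictionary-encode strings: sorted dictionary plus one int id per row."""
    dictionary = sorted(set(values))
    ids = {value: i for i, value in enumerate(dictionary)}
    return {
        "type": "string",
        "dictionary": dictionary,
        "rows": array("i", (ids[v] for v in values)),
    }


def encode_long_column(values):
    """Store longs directly: no dictionary, only a type tag in the metadata."""
    return {"type": "long", "rows": array("q", values)}


def decode_column(column):
    """Rebuild the original row values from either layout."""
    if column["type"] == "string":
        return [column["dictionary"][i] for i in column["rows"]]
    return list(column["rows"])
```

The type tag is what lets a reader decide whether a dictionary lookup is needed at all, which is why the extra metadata and the omitted dictionary go together.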

I'm currently taking a stab at removing the string-valued dimension assumptions from the IncrementalIndex and plan on getting a PR out for discussion soon. I think that'll be a good base for fleshing out further areas.

- Jon


--
You received this message because you are subscribed to the Google Groups "Druid Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to druid-developm...@googlegroups.com.
To post to this group, send email to druid-de...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/druid-development/fd8ab588-86b9-4832-a844-3715c8ff51b6%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Charles Allen

Dec 14, 2015, 1:08:53 PM
to Druid Development
Would it (at some point) make sense to move Druid off of the smoosh file and onto a more standard format such as Parquet?

Fangjin Yang

Dec 17, 2015, 7:10:36 PM
to Druid Development
Charles, smoosh is a way of packaging the Druid binary columns, not entirely unlike a tarball. It might make sense for there to be a Parquet complex Druid column, but what would you want to do with it?
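
The tarball analogy can be sketched generically in Python: several named binary blobs are concatenated into one payload, and a small index records where each one starts and ends. This is only an illustration of the packaging idea with hypothetical function names; it is not the actual smoosh layout.

```python
import io


def pack(columns):
    """Concatenate named byte blobs; return (index, payload).

    index maps each name to its (start, end) byte offsets in payload.
    """
    payload = io.BytesIO()
    index = {}
    for name, blob in columns.items():
        start = payload.tell()
        payload.write(blob)
        index[name] = (start, payload.tell())
    return index, payload.getvalue()


def read(index, payload, name):
    """Slice one column's bytes back out of the packed payload."""
    start, end = index[name]
    return payload[start:end]
```

A container like this is agnostic to what each blob holds, which is why the packaging question is separate from the question of how individual columns are encoded.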

Charles Allen

Dec 17, 2015, 8:20:50 PM
to Druid Development
To have a more standard way of storing column metadata. After looking into it some more, it is not clear that Parquet is a good fit for the way Druid handles column store data.


Fangjin

Dec 17, 2015, 8:21:23 PM
to druid-de...@googlegroups.com
Benchmarks on scan speed would be interesting.

jon...@imply.io

Jan 19, 2016, 6:55:17 PM
to Druid Development
Hi all,

As discussed in this morning's sync-up, I've moved this proposal to GitHub issues:


- Jon


On Thursday, December 3, 2015 at 1:24:54 PM UTC-8, jon...@imply.io wrote: