Hi, please see inline.
On Thursday, April 23, 2015 at 5:51:15 PM UTC-7, Laurent Quérel wrote:
Druid guys asked me to share my questions with the community, so here are my questions:
- Size of the biggest cluster currently in production? #servers, #events, ...
50-100 PB of raw data. 20 trillion or so raw events. After roll-up and compression, it becomes 500TB of Druid segments across 400ish nodes.
- Bulk insert API (eg: for initial feeding) ?
- Support of idempotent insert? I mean, what happens if I insert the same data point several times (i.e., same timestamp, same dimensions): only one entry or several entries in the DB?
Right now we require deduping to be done outside of Druid. However, if you batch insert (via Hadoop indexing) a static set of data and regenerate a set of segments, Druid will atomically replace the old set of segments with the new set.
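Since deduping has to happen upstream of Druid, a minimal client-side sketch might look like the following. The event shape (a dict with `timestamp` and `dimensions` fields) is an assumption for illustration, not a Druid API:

```python
def dedupe_events(events):
    """Keep only the first event for each (timestamp, dimensions) key.

    `events` are dicts with a 'timestamp' string and a 'dimensions' dict;
    this shape is hypothetical, chosen only to illustrate the idea.
    """
    seen = set()
    unique = []
    for event in events:
        # Sort dimension items so key equality ignores dict ordering.
        key = (event["timestamp"], tuple(sorted(event["dimensions"].items())))
        if key not in seen:
            seen.add(key)
            unique.append(event)
    return unique

events = [
    {"timestamp": "2015-04-23T00:00:00Z", "dimensions": {"page": "home"}, "count": 1},
    {"timestamp": "2015-04-23T00:00:00Z", "dimensions": {"page": "home"}, "count": 1},
    {"timestamp": "2015-04-23T00:00:00Z", "dimensions": {"page": "about"}, "count": 1},
]
print(len(dedupe_events(events)))  # 2
```

You would run something like this in the ETL step that feeds the indexing job, so each (timestamp, dimensions) combination reaches Druid exactly once.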
- Support of upsert operation? I mean, is it possible to update an existing data point?
This is supported, but it is an expensive operation. Appending data is cheap (Druid is designed for append-heavy data), but updating appended data requires reindexing the affected segments.
- Interpolation support? I mean, handling aggregation of multiple timeseries whose timestamps are not aligned.
This currently needs to be done at the client level.
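As a sketch of what "at the client level" can mean here: resample each series onto a common timestamp grid with linear interpolation, then aggregate. This is a generic technique, not anything Druid-specific:

```python
import bisect

def interpolate(series, t):
    """Linearly interpolate a sorted [(timestamp, value), ...] series at time t.

    Values outside the series' range are clamped to the nearest endpoint.
    """
    times = [p[0] for p in series]
    if t <= times[0]:
        return series[0][1]
    if t >= times[-1]:
        return series[-1][1]
    i = bisect.bisect_left(times, t)
    (t0, v0), (t1, v1) = series[i - 1], series[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)

def aligned_sum(a, b, grid):
    """Sum two series after resampling both onto a common grid of timestamps."""
    return [(t, interpolate(a, t) + interpolate(b, t)) for t in grid]

a = [(0, 10.0), (10, 20.0)]
b = [(5, 1.0), (15, 3.0)]
print(aligned_sum(a, b, [5, 10]))  # [(5, 16.0), (10, 22.0)]
```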
- Availability of an AWS CFT or set of Docker images to deploy a test environment?
We have a docker script to set up a cluster locally.
Our integration tests also use docker to spin up a cluster.
- Support of stddev, derivative/rate, difference aggregate functions, ...?
Not out of the box. There will be some coding you'll have to do on your side to support some of these aggregations.
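One common workaround (a sketch, not a built-in Druid feature): ingest an extra column holding the squared metric, have Druid return sum, count, and sum-of-squares per time bucket, and derive stddev client-side from Var(x) = E[x²] − (E[x])². Derivatives/rates can likewise be computed from consecutive buckets:

```python
import math

def stddev_from_aggregates(count, total, sum_squares):
    """Population standard deviation from count, sum, and sum of squares.

    Assumes you ingested a squared copy of the metric as an extra column;
    that extra column is a workaround, not a built-in Druid aggregator.
    """
    mean = total / count
    variance = sum_squares / count - mean * mean
    return math.sqrt(max(variance, 0.0))  # clamp tiny negative float error

def rate(points):
    """Per-unit-time derivative between consecutive (timestamp, value) points."""
    return [(v1 - v0) / (t1 - t0) for (t0, v0), (t1, v1) in zip(points, points[1:])]

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
agg = (len(values), sum(values), sum(v * v for v in values))
print(stddev_from_aggregates(*agg))          # 2.0
print(rate([(0, 0.0), (10, 50.0), (20, 80.0)]))  # [5.0, 3.0]
```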
- Support for custom batch processing? I mean, execute a kind of map/reduce job (or similar) over a large subset of data managed by Druid independently of the deep storage used.
Druid bundles Hadoop-based batch indexing out of the box.
- Efficient Spark connector (if hdfs is used as deep storage) to compute complex analytics? I mean, use directly segments stored in hdfs to batch process these data by a solution like Spark, Impala...
Some folks in the community are working on this. FWIW, if you choose to use Druid, you may not need to use Impala.
- Full text search on text columns? If no, is there a plugin for Elastic Search?
This is not currently supported but something we've been thinking about. Some folks out there run Druid and ES together.
- Support for metric discovery? I mean, access to a dictionary of metric/event types existing in the system (with regex support).
There are endpoints for schema discovery, but regex is not currently supported.
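Since the discovery endpoints return names but don't accept regexes, the filtering can be done client-side over whatever list the endpoint returns. The metric names below are hypothetical, for illustration only:

```python
import re

def filter_metrics(metric_names, pattern):
    """Client-side regex filtering over names returned by a
    schema-discovery endpoint (e.g. a datasource's list of metrics)."""
    rx = re.compile(pattern)
    return [m for m in metric_names if rx.search(m)]

# hypothetical metric names for illustration
discovered = ["edits_count", "edits_added", "bytes_deleted", "views"]
print(filter_metrics(discovered, r"^edits_"))  # ['edits_count', 'edits_added']
```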
None that are open source, although some will be open sourced soon.
There's a basic admin UI bundled with the Druid coordinator.
- Access control, and granularity of this access control? I mean, controlling access to the database or to a subset of it for a set of users. By granularity, I mean access control on the whole DB or on a specific time series.
There is per-datasource control over data retention, but no current support for per-user access control (usually we see this done on the client side).
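For what the client-side pattern can look like: an application-level proxy in front of Druid can check a user-to-datasource allowlist before forwarding queries. This is purely an application pattern, not anything Druid provides:

```python
def authorize(user, datasource, acl):
    """Return True if `user` may query `datasource` under a simple
    allowlist ACL; Druid itself has no notion of users, so this check
    lives in the application or proxy layer."""
    return datasource in acl.get(user, set())

# hypothetical users and datasources
acl = {"alice": {"pageviews", "edits"}, "bob": {"pageviews"}}
print(authorize("alice", "edits", acl))  # True
print(authorize("bob", "edits", acl))    # False
```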
- Commercial support and pricing?
Support is currently provided by the community.
There are numerous engineers at Metamarkets, Yahoo, and other places who work on Druid full time. The developers are here: