Hey Igor,
Yes, that's correct - the events ingested by the Kafka
indexing task are available for queries shortly after being read from
Kafka and you don't need to wait for them to be handed off to the
historical nodes.
Regarding the relationship between
taskDuration and segmentGranularity, 4 segments is the theoretical
minimum for your example, but in practice you'll always generate more
segments than this. The main reason is that taskDuration and
segmentGranularity are not aligned on the same time boundaries, i.e.
segmentGranularity of 1H will generate segments from say
12:00:00-1:00:00, 1:00:00-2:00:00, but taskDuration 1H will run for 1H
starting from when the task is created, so for example 12:01:00-1:01:00.
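To make the misalignment concrete, here's a small Python sketch (not Druid code, just an illustration of the interval math) showing that a 1H task starting at 12:01 overlaps two hour-aligned segment intervals:

```python
from datetime import datetime, timedelta

def hour_buckets(start, end):
    """Return the hour-aligned intervals that [start, end) overlaps."""
    t = start.replace(minute=0, second=0, microsecond=0)
    buckets = []
    while t < end:
        buckets.append((t, t + timedelta(hours=1)))
        t += timedelta(hours=1)
    return buckets

# A task with taskDuration PT1H that starts at 12:01 runs until 13:01 ...
start = datetime(2017, 6, 1, 12, 1)
end = start + timedelta(hours=1)

# ... so it writes into two HOUR-granularity segment intervals, not one:
print(len(hour_buckets(start, end)))  # 2
```

The same reasoning is why the practical segment count always ends up above the theoretical minimum.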
There's some discussion on this here:
http://druid.io/docs/0.10.0/development/extensions-core/kafka-ingestion.html#on-the-subject-of-segments
Typically, the Kafka indexing service generates a large number of segments, so I'd
recommend setting taskDuration to something higher than 15 minutes
(somewhere between 1 and 4 hours is probably a good starting point). As
mentioned in the above link, having a daily re-indexing job is also a
good idea, which leads to your third question.
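For reference, these two settings live in different parts of the supervisor spec - segmentGranularity in the dataSchema and taskDuration in the ioConfig. A heavily abbreviated sketch (field names from the 0.10.0 Kafka supervisor spec; the dataSource, topic, and server values are placeholders, and a real spec also needs a parser and metricsSpec):

```json
{
  "type": "kafka",
  "dataSchema": {
    "dataSource": "my-datasource",
    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "HOUR",
      "queryGranularity": "NONE"
    }
  },
  "ioConfig": {
    "topic": "my-topic",
    "taskDuration": "PT1H",
    "consumerProperties": { "bootstrap.servers": "kafka:9092" }
  }
}
```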
The segments
generated by the batch indexing job run later on will have a higher
version number than those generated by the Kafka indexing tasks and will
overshadow them, meaning that historicals will load this newer version
and will use it to serve queries. The previous segments generated by the
Kafka indexing tasks will still remain in deep storage until they are
deleted using a coordinator kill command, or alternatively you can
enable automatic killing of unused segments. See:
druid.coordinator.kill.on and related properties here:
http://druid.io/docs/0.10.0/configuration/coordinator.html
One thing to keep in mind - to keep things moving as smoothly as possible,
it's best to have batch jobs and realtime jobs working on different
sections of the timeline; in other words, the daily batch job should
operate on intervals that the Kafka indexing tasks are no longer seeing
events for. Otherwise, the batch jobs and the realtime jobs will be
trying to acquire locks for the same time ranges and will be blocking
one another.
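On the auto-kill option I mentioned above, here's a sketch of what those coordinator runtime properties look like (property names are from the 0.10.0 configuration docs; the period and retention values are just example choices, so tune them for your setup):

```properties
# Coordinator runtime.properties (example values)
druid.coordinator.kill.on=true
druid.coordinator.kill.period=P1D
druid.coordinator.kill.durationToRetain=P90D
druid.coordinator.kill.maxSegments=100
```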
Hope this helps,
David