I believe the `taskDuration` field for the Kafka Indexing Service does more harm than good. Here is why:
Let's assume that I am creating hourly segments ("segmentGranularity": "HOUR").
If I am ingesting via Kafka, I typically want to persist a segment:
1. When segmentGranularity is reached. That is, every hour.
Thus I want my segments to look like:
01:00-02:00
02:00-03:00
2. If a segment is getting too large, then I want to split it even if the segment granularity hasn't been reached. "maxRowsPerSegment" achieves this.
Now there is the `taskDuration` field. I set it to "taskDuration": "PT1H", assuming this would result in the segments I want above. However, it turns out that is not the case!
Unless I submit my supervisor spec at EXACTLY 01:00, the segments will now be created every hour from when I SUBMITTED my supervisor spec.
So instead of segments like:
01:00-02:00
02:00-03:00
I now get segments like:
01:15-02:00
02:00-02:15
02:15-03:00
The segments are now broken not just by segment granularity but also by `taskDuration`.
This is not what I wanted! Now the data for every hour is split into at least 2 suboptimally sized segments.
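
To make the splitting concrete, here is a minimal Python sketch that reproduces the intervals listed above. It is only a simplified model of the behavior I am describing, not Druid code: the function name `segment_intervals` and its parameters are made up for illustration, the date is arbitrary, and it assumes events arrive roughly in timestamp order so that each task rollover cuts the current hour in two.

```python
from datetime import datetime, timedelta


def segment_intervals(submitted, until, task_duration, granularity=timedelta(hours=1)):
    """Cut [submitted, until) at every granularity boundary and every task rollover.

    Hypothetical helper: a simplified model of the observed behavior, not Druid code.
    """
    cuts = {submitted, until}

    # Segment-granularity boundaries are aligned to the clock (counted from midnight).
    t = submitted.replace(hour=0, minute=0, second=0, microsecond=0)
    while t < until:
        if t > submitted:
            cuts.add(t)
        t += granularity

    # Task rollovers are aligned to the moment the supervisor spec was submitted.
    t = submitted + task_duration
    while t < until:
        cuts.add(t)
        t += task_duration

    points = sorted(cuts)
    return [f"{a:%H:%M}-{b:%H:%M}" for a, b in zip(points, points[1:])]


hour = timedelta(hours=1)

# Spec submitted at 01:15: every hour is split by the task rollover at :15.
print(segment_intervals(datetime(2024, 1, 1, 1, 15), datetime(2024, 1, 1, 3, 0), hour))
# ['01:15-02:00', '02:00-02:15', '02:15-03:00']

# Spec submitted at exactly 01:00: rollovers coincide with hour boundaries.
print(segment_intervals(datetime(2024, 1, 1, 1, 0), datetime(2024, 1, 1, 3, 0), hour))
# ['01:00-02:00', '02:00-03:00']
```

The two calls differ only in the submission time, which is exactly the dependency I am objecting to.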
The segments created shouldn't be dependent on when the supervisor spec was submitted.
The `taskDuration` field is thus not only unnecessary for the Kafka Indexing Service, it actually results in unwanted behavior.