Kafka indexing tasks stuck in publishing while handing off segments

Daniel Nash

Mar 20, 2025, 6:42:03 PM
to Druid User
I'm experiencing an intermittent issue in my cluster where some of my Kafka indexing tasks get stuck on handing off segments while publishing, eventually exceed their completion timeout, and are killed.  They appear to have fully pushed everything to deep storage (S3) and the metadata store (correct me if I'm wrong, but indexing tasks publish directly to the metadata store, right?), and they have even handed off most of their segments, but they seem to believe a handful of segments (1 to 6 or so from what I've seen) are not loaded onto historicals and keep waiting for those handoffs.

Observations I can share:
  • We are running 1-hour indexing tasks, and they usually finish publishing and exit within 3-17 minutes depending on current load.  For the problematic ones, I upped my task completion timeout to 45 minutes and they still aren't finishing; they are often just waiting for the final handoffs for over 30 minutes.

  • This datasource deals with quite fragmented data and can create quite a few segments in the hourly intervals we are running, several hundred or so.  That has not been an issue to date, and compaction is keeping up with it.

  • We are running Druid 32.0, but an older version of ZooKeeper.  I don't know why the team I'm taking over from hasn't kept ZooKeeper up to date; I guess an "ain't broke, don't fix it" mentality.  I only mention it in case it could be a factor.  I see in the Druid 32.0 incompatibility notes something about ZooKeeper-based segment loading no longer being supported.  How can I check whether we are using that in our cluster?  I assume we aren't, because Druid is running fine for the most part other than this.  We were on a 28.x version for a long while before this update a few weeks ago.

  • The historical nodes are not being overtaxed.  We have 8, and several of them are usually idle (no segments to load or drop) according to the Services page of the web console.  They also have plenty of disk space left.

  • I don't see any errors in the Coordinator or Historical logs.  I did see some warning messages on the historicals along the lines of "asked to create an adapter for a segment that already exists" (paraphrasing because I don't have the exact message in front of me).  I don't know whether this could be related to the segments in question or not.
Any thoughts appreciated.

Sincerely,
Dan

Ben Krug

Mar 20, 2025, 7:42:57 PM
to druid...@googlegroups.com
My first suspicion, from the publishing times you describe, is that the metadata DB is possibly a bottleneck.  Do you use a MySQL-compatible or Postgres-compatible DB?
(I'd send more queries to run there, but they depend on the DB.)

If you run some queries in the metadata DB like 
SELECT used, COUNT(*) FROM druid_segments GROUP BY 1
or 
SELECT active, COUNT(*) FROM druid_segments GROUP BY 1

how long do they take to return, and what do they return?

Daniel Nash

Mar 20, 2025, 7:50:51 PM
to druid...@googlegroups.com
We use a PostgreSQL metadata store.  I upped the defaults on that server to allow for more index caching and per-query work memory, and it's been pretty responsive.  I can't access the system at this moment to give you exact times, but those queries take only a few seconds at most to return their answers.

As I said, it seems like the segments in question have already been pushed to deep storage and published to the metadata store.  It just seems like the Coordinator never asks one of the idle historicals to load them.

I just found this Druid GitHub issue a few moments ago, and what it describes is exactly what we are seeing: the task stalling out on "Coordinator handoff scheduled - Still waiting for handoff for X segments".

It sounds like I might need to make some adjustments to our Coordinator to get those segments onto historicals faster instead of trying to balance the other segments out.  It's just odd to me that the historicals show as idle on the Services page of the console if that were the issue.

~Dan


Daniel Nash

Mar 21, 2025, 8:18:20 AM
to druid...@googlegroups.com
So, looking at the system this morning, this issue is happening for every indexing task, but only for one datasource.  The publishing task gets stalled with 1-6 segments "Still waiting for handoff" and the handoff never happens.  Unfortunately, when those publishing tasks hit their completion time limit, they gracefully exit with "SUCCESS" but are marked as failed, and the Supervisor also kills the latest indexing tasks that were reading because "No task in the corresponding pending completion taskGroup[0] succeeded before completion timeout elapsed".

Also interesting: the druid_segments table in our metadata store doesn't have the "active" column that you asked me to query against, Ben.  Should it?  There is only a "used" column.

I need to get the Coordinator to assign these already-published segments to the historicals so these publishing tasks can exit successfully; I'm just not sure what switches to use to do that.  We do have "Smart Segment Loading" set to true, so it should be using round-robin assignment, if what I've read is correct.  It seems like the historicals should be picking these up really quickly, but they aren't.
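
As a sanity check, the value the Coordinator is actually running with can be read back from its dynamic config API.  A minimal Python sketch (the Coordinator host below is just a placeholder, not our real one):

import json
import urllib.request

COORDINATOR = "http://coordinator.example.com:8081"  # placeholder host

# Fetch the current Coordinator dynamic config and print the relevant knobs.
with urllib.request.urlopen(f"{COORDINATOR}/druid/coordinator/v1/config") as resp:
    cfg = json.loads(resp.read())

print("smartSegmentLoading:", cfg.get("smartSegmentLoading"))
print("maxSegmentsToMove:", cfg.get("maxSegmentsToMove"))
print("replicationThrottleLimit:", cfg.get("replicationThrottleLimit"))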

~Dan

Daniel Nash

Mar 21, 2025, 2:44:38 PM
to Druid User
OK, I found a solution to my problem.  I reduced my indexing task count for this one datasource to 1 and reviewed the logs of the repeated failures.  I realized that, when the tasks were asked to terminate due to exceeding their completion time, they unannounced the 6 segments that were refusing to be handed off.

Somehow, the same 6 hourly segments (non-contiguous) for one particular day refused to be handed off every time the indexing tasks ran.  I suspended the supervisor, dropped all the segments for those 6 particular hourly chunks, ran kill tasks for those hourly chunks, and resumed the supervisor (roughly the sequence of API calls sketched below), and all is now well.  I'll need to re-feed all the data for those 6 hourly chunks soonish, but I'm not going to do it before the weekend in case it triggers this behavior again somehow.
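
In case it helps anyone hitting the same thing, here is that cleanup sequence as a minimal Python sketch, repeated once per stuck interval.  The hosts, datasource, supervisor ID, and interval below are placeholders for illustration, not our real values:

import json
import urllib.request

OVERLORD = "http://overlord.example.com:8090"        # placeholder
COORDINATOR = "http://coordinator.example.com:8081"  # placeholder
DATASOURCE = "my_datasource"                         # placeholder
SUPERVISOR_ID = DATASOURCE                           # Kafka supervisor ID usually matches the datasource
STUCK_INTERVAL = "2025-03-05T07:00:00.000Z/2025-03-05T08:00:00.000Z"  # one stuck hourly chunk (example)

def post(url, payload=None):
    """POST a JSON payload (or an empty body) and return the response text."""
    data = json.dumps(payload).encode() if payload is not None else b""
    req = urllib.request.Request(url, data=data, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode()

# 1. Suspend the supervisor so no new tasks spawn while cleaning up.
post(f"{OVERLORD}/druid/indexer/v1/supervisor/{SUPERVISOR_ID}/suspend")

# 2. Mark the segments in the stuck interval as unused (drop them).
post(f"{COORDINATOR}/druid/coordinator/v1/datasources/{DATASOURCE}/markUnused",
     {"interval": STUCK_INTERVAL})

# 3. Submit a kill task to remove the now-unused segments from metadata and deep storage.
post(f"{OVERLORD}/druid/indexer/v1/task",
     {"type": "kill", "dataSource": DATASOURCE, "interval": STUCK_INTERVAL})

# 4. Resume the supervisor.
post(f"{OVERLORD}/druid/indexer/v1/supervisor/{SUPERVISOR_ID}/resume")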

I don't know how the system got in this state.  The coordinator and/or historicals seemed to refuse to take the handoff on these segments for these 6 particular hourly chunks and I have no idea why.

~Dan

Daniel Nash

Mar 25, 2025, 3:45:32 PM
to Druid User
I wanted to bring this up again, as I still see random task failures caused by publishing tasks failing to get all their segments handed off before their completion timeout.  Fortunately, we do seem to recover in subsequent indexing tasks, but the random failures are aggravating if nothing else.

I think the nature of our data is causing us a problem here.  Even today, we are receiving chunks of data for days at the beginning of March and throughout the month, and we are making hourly segments for them.  So one 1-hour indexing task might be trying to publish 100-300 segments at the end of its read interval.  I typically see that everything is published within 15-20 minutes, but then, sometimes, the handoffs just seem to stall out with the "Coordinator handoff scheduled - Still waiting for handoff for X segments" messages for 20-30 minutes before the tasks are interrupted (45-minute completion timeout expired).

It seems like the Coordinator is not prioritizing handoff of the newly published segments so the MiddleManager tasks can shut down.  Is there any way to force that?  I do have "Smart Segment Loading" enabled in the Coordinator dynamic config, which I thought quickly uses round-robin to assign historicals to load these segments.  While these tasks are waiting for the handoffs, I see the Coordinator moving other segments around, balancing the historicals.

~Dan


Ben Krug

Mar 25, 2025, 4:05:53 PM
to druid...@googlegroups.com
Wow, quite a journey so far...

I wrote my response earlier off the top of my head, so, yes, I meant "used", not "active" for the metadata field.
If things still typically take 15 minutes to publish, that sounds problematic to me.  I wonder whether you have compaction running, or other tasks that could be locking time intervals and making the publishing wait.  That number of (hourly) segments spread out over that many days could lead to some locking contention with something like compaction (the usual culprit).

I suspect that if it's really an issue with smart segment loading (which is fairly new, IMO), then you could turn it off and, e.g., reduce the max segments to move (to slow down balancing) and see what happens.

Kashif Faraz

Mar 26, 2025, 6:19:50 AM
to druid...@googlegroups.com
Hey Daniel

If smart segment loading is enabled, then the Coordinator will always prioritize handoff of newly published segments.
Segment balancing is always lower in priority than assignment of under-replicated segments or segments that need to be handed off.

When a segment takes too long to hand off, do you see anything being reported for:
- Metric segment/assignSkipped/count
- Metric segment/loadQueue/failed
- Do you see any errors in the Coordinator or historical logs?

If there are too many segments to hand off, it will take time.  The bottleneck in that case is the loading rate on the historicals, not the assignments done by the Coordinator.
Since your data is so fragmented, I would advise considering a higher segment granularity, perhaps DAY.

Thanks
Kashif


Daniel Nash

Mar 26, 2025, 9:13:06 AM
to druid...@googlegroups.com
Hey Kashif,

Thank you for the response and thoughts.  Unfortunately, metrics do not appear to be enabled on this cluster; I don't know why.  I've just taken over management of this cluster from the person who left and am learning as I go.  I will look into enabling them to help diagnose this.

As for errors in the Coordinator or Historical logs, I see now that I'm getting quite a few messages about stuck segments in the Coordinator logs around the time periods this is occurring, similar to the following:
<time> ERROR [Coordinator-Exec-HistoricalManagementDuties-0] org.apache.druid.server.coordinator.ServerHolder - Load queue for server [druid-historical-6.xxx:8083], tier [_default_tier] has [36] segments stuck: <list of segments follows>

I traced a segment from an index task that had one segment not handed off in time and I can see:
- It was promptly published by the IndexerSQLMetadataStorageCoordinator to the metadata store DB at the end of the read interval
- 6 minutes later, I see that segment in one of the above "Load queue ... stuck" messages
- It remains in that queue (seen in subsequent "Load queue..." messages) until about 30 seconds before the publishing task is killed due to exceeding its completion interval.

What I don't understand is that the segment appears as the first segment in the list in the first "Load queue... stuck" message, but in later messages it appears to slide down the list.  If that list is supposed to be in priority order, then the segment is moving down the priority list even though it is a LOAD segment.  A bunch of MOVE_TO and REPLICATE segments get moved in front of the LOAD segment over time.

I also have other Historicals sitting idle while this is occurring, so I'm wondering if there is a way to get the Coordinator to issue the LOAD for the segment to another Historical when this happens.

Thanks,
Dan



Kashif Faraz

Mar 26, 2025, 9:44:20 AM
to druid...@googlegroups.com
Thanks for the details, Daniel.


"A bunch of MOVE_TO and REPLICATE segments get moved in front of the LOAD segment over time."

Could you please share how you identified this? This shouldn't happen under normal operating conditions.
Just to clarify, move/replication may happen on other servers but for any given server, LOAD operations are always top priority.

The load queue being stuck could simply mean that your historical is busy downloading the segments.
Once you have metrics available, you can compare the values of `segment/loading/rateKbps` to check if a historical is indeed slower than the others.

You can try tweaking the following configs and see if it makes loading on that historical faster:
- Increase `druid.coordinator.loadqueuepeon.http.batchSize`
- Increase `druid.segmentCache.numLoadingThreads`

If the above doesn't work, try increasing `completionTimeout` or tweak the size/number of segments generated by the indexing tasks.
You can do this by tuning `maxRowsPerSegment` and/or fixing the segment granularity in the streaming supervisor.
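
For reference, these settings live in different parts of the supervisor spec.  A rough Python sketch of the relevant fragment (the values shown are examples only, not recommendations):

import json

# Fragment of a Kafka supervisor spec showing where each knob lives.
supervisor_fragment = {
    "spec": {
        "dataSchema": {
            "granularitySpec": {
                "segmentGranularity": "DAY"   # coarser granularity => fewer, larger segments
            }
        },
        "ioConfig": {
            "taskDuration": "PT1H",
            "completionTimeout": "PT45M"      # how long a publishing task may run after taskDuration
        },
        "tuningConfig": {
            "type": "kafka",
            "maxRowsPerSegment": 5000000      # controls segment size/count at publish time
        }
    }
}

print(json.dumps(supervisor_fragment, indent=2))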

I wouldn't advise disabling smart segment loading, as it is almost never the reason for slow segment loading.
But if you want to explore that option, you may set the following parameters in your coordinator dynamic config:

smartSegmentLoading: false
maxSegmentsInNodeLoadingQueue: 0 (unlimited)
maxSegmentsToMove: 0 (disable balancing)
replicationThrottleLimit: 100 (reduce replication)
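
If you do go that route, one way to apply these is to read the current dynamic config from the Coordinator API, tweak it, and write it back.  A minimal Python sketch (the Coordinator host is a placeholder):

import json
import urllib.request

COORDINATOR = "http://coordinator.example.com:8081"  # placeholder host
CONFIG_URL = f"{COORDINATOR}/druid/coordinator/v1/config"

# Read the current dynamic config, tweak the fields above, and write it back.
with urllib.request.urlopen(CONFIG_URL) as resp:
    cfg = json.loads(resp.read())

cfg.update({
    "smartSegmentLoading": False,
    "maxSegmentsInNodeLoadingQueue": 0,   # 0 = unlimited
    "maxSegmentsToMove": 0,               # disable balancing
    "replicationThrottleLimit": 100,      # reduce replication per cycle
})

req = urllib.request.Request(CONFIG_URL, data=json.dumps(cfg).encode(),
                             headers={"Content-Type": "application/json"})
urllib.request.urlopen(req).close()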

Let me know if any of these solutions work for you.

Thanks
Kashif

Daniel Nash

Mar 26, 2025, 10:03:48 AM
to druid...@googlegroups.com
Kashif, thank you for the additional thoughts and things to try.  I will definitely report back what helps and what doesn't and hopefully have some more information from the metrics.

To answer your question about the load queue: my observation was based purely on the order of the segments listed in the "Load queue for server" messages in the Coordinator logs for the Historical that was supposed to be loading my segment.  I did not look into the source to see how those segments are ordered in that list.  My segment that never got handed off started out as the first segment in that list and slowly had other LOAD and non-LOAD segments put in front of it.  I don't know whether the Coordinator views that queue in priority-sorted order, insertion order, or something else, nor whether that view of the list differs from how the Historicals actually process it.

I forgot to mention in my last message that I did not see any issues in the Historical logs during the same time periods that I found these stuck messages in the Coordinator logs.

Sincerely,
Dan


Kashif Faraz

Mar 26, 2025, 10:12:45 AM
to druid...@googlegroups.com
"To answer your question about the load queue, I based my observations purely on the order of the segments that were listed in the "Load queue for server" messages in the Coordinator logs for the Historical that was supposed to be loading my segment.  I did not look into the source to see how those segments are ordered in that list."

I see. Unfortunately, that printed message is based on a set. It doesn't represent the actual order in the priority queue.
I suppose we could use a list instead to show the order of segments currently in queue. It might help debugging in such situations.

"I forgot to mention in my last message that I did not see any issues in the Historical logs during the same time periods that I found these stuck messages in the Coordinator logs."

That makes sense.  My guess is the historical is just busy loading the segments already assigned to it.  Increasing the number of loading threads on the historical should definitely help.


Daniel Nash

Apr 2, 2025, 8:44:58 AM
to Druid User
Well, it turns out our pipe to S3 is just being saturated.  Our historicals were already running with numLoadingThreads set to 8.  I reduced the number of compaction tasks I was running and found that the indexing tasks were able to keep up without issue.  I also discovered we had an incoming data issue that was feeding us data from much earlier in the month than it should have; resolving that reduced the number of segments we were generating, which also helped with the S3 load.

I also discovered that someone had the good idea to do as I suggested and not fail tasks that can't hand off all their segments within some interval, since the segments are already persisted.  The handoffConditionTimeout setting allows a publishing task to exit gracefully after some time if handoffs are still pending:
Change default handoffConditionTimeout to 15 minutes. by gianm · Pull Request #14539 · apache/druid · GitHub 
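
For anyone else who finds this thread: handoffConditionTimeout goes in the supervisor's tuningConfig.  A minimal sketch of that fragment (the value shown is just the 15-minute default from the PR, in milliseconds):

import json

# Fragment of a Kafka supervisor tuningConfig; only handoffConditionTimeout is the point here.
tuning_fragment = {
    "tuningConfig": {
        "type": "kafka",
        "handoffConditionTimeout": 900000  # 15 minutes, in milliseconds
    }
}

print(json.dumps(tuning_fragment, indent=2))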

So, between these things, I'm not getting my failed indexing tasks anymore.

~Dan

Kashif Faraz

Apr 5, 2025, 1:04:18 PM
to druid...@googlegroups.com
That's good to hear!

Thanks for the update, Daniel.
