1. Too many ServiceCache-0 threads issue
a. This is a load-independent issue; there is a clear per-hour pattern in the thread-count growth
b. Observation: when a new hour starts, the downstream listener connects to the peons for the new hour, so the number of threads increases (these new threads are ServiceCache-0 threads), but the threads connected to the peons of past hours never go away, so the thread count grows continuously
c. Even under load there is not much extra increase in the number of threads; they grow at almost the same rate (~300 per hour)
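The growth described above looks like a lifecycle problem: a new per-hour cache (and its ServiceCache-0 thread) is opened each hour, but the caches for past hours are never closed. A minimal JDK-only sketch of that pattern and of the obvious fix (shut the previous hour's cache down) — this is illustrative code with made-up names, not Tranquility's actual implementation:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class HourlyCacheLifecycle {
    private final Deque<ExecutorService> openCaches = new ArrayDeque<>();

    // Called at the top of each hour: open a cache thread for the new hour's
    // peons and, crucially, shut down the previous hour's cache. Skipping the
    // shutdown is the leak pattern described above (~300 extra threads/hour).
    void onNewHour() {
        ExecutorService cache = Executors.newSingleThreadExecutor(
                r -> new Thread(r, "ServiceCache-0"));
        cache.submit(() -> { /* would watch this hour's peons */ });
        ExecutorService previous = openCaches.pollLast();
        if (previous != null) {
            previous.shutdownNow(); // lets its ServiceCache-0 thread exit
        }
        openCaches.addLast(cache);
    }

    int openCacheCount() {
        return openCaches.size();
    }

    public static void main(String[] args) {
        HourlyCacheLifecycle listener = new HourlyCacheLifecycle();
        for (int hour = 0; hour < 3; hour++) {
            listener.onNewHour();
        }
        System.out.println("open caches: " + listener.openCacheCount()); // 1
    }
}
```

With the shutdown in place the open-cache count stays constant across hours; without it, the count (and the thread count) grows by one per hour, which matches the observed pattern.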
2. Downstream listener choking
a. This is a load-dependent issue
b. Values: Xmx30g; at the choking point, the RES memory of the process is 31g => the downstream is overwhelmed by the upstream
c. To simulate this issue: stop the downstream listeners for a particular topic for 30 minutes (while messages are still being pushed into Kafka for the same topic), then start a downstream listener so that it comes up under very heavy load (much more than the peak-hour load for the same number of channels)
d. At this point, the listener does not get messages from Kafka (or gets them at a very slow rate)
Attached thread dumps:
a. fresh.txt = thread dump taken 20 minutes after listener start
b. 1672.txt = thread dump taken when the listener choked (and had also spawned too many threads)
c. Output_1672.log & output_fresh.log = files containing more readable summaries of the running threads from these dumps
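For digging through dumps like these, a small summarizer of thread states helps. Below is a minimal sketch of the kind of tool that could produce the "more readable thread running info" mentioned above, assuming jstack-style dumps; the class name and sample input are mine, not taken from the attached files:

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DumpSummary {
    // Tally the "java.lang.Thread.State: X" lines in a jstack-style dump,
    // giving a quick picture of how many threads are RUNNABLE vs WAITING etc.
    static Map<String, Integer> stateCounts(String dump) {
        Map<String, Integer> counts = new TreeMap<>();
        Matcher m = Pattern
                .compile("java\\.lang\\.Thread\\.State: (\\w+)")
                .matcher(dump);
        while (m.find()) {
            counts.merge(m.group(1), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String sample =
            "\"ServiceCache-0\" #101 daemon\n" +
            "   java.lang.Thread.State: WAITING\n" +
            "\"ServiceCache-0\" #102 daemon\n" +
            "   java.lang.Thread.State: WAITING\n" +
            "\"main\" #1\n" +
            "   java.lang.Thread.State: RUNNABLE\n";
        System.out.println(stateCounts(sample)); // {RUNNABLE=1, WAITING=2}
    }
}
```

Comparing the state histograms of fresh.txt and 1672.txt side by side makes the growth of idle ServiceCache-0 threads stand out immediately.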
Hello Gian,
Did you have a chance to look at this issue?
Thanks,
Saurabh
--
You received this message because you are subscribed to a topic in the Google Groups "Druid Development" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/druid-development/M13CYW7oWks/unsubscribe.
To unsubscribe from this group and all its topics, send an email to druid-developm...@googlegroups.com.
To post to this group, send email to druid-de...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/druid-development/1004d3fe-8a27-4134-8c32-de3623e05946%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Hi Gian,
Yes, the issue is consistently reproducible with 0.3.3 even when the tasks are running fine. As soon as I deploy 0.3.2, it starts working. So currently we are using 0.3.2 in prod, with a wrapper script that does a rolling restart of the Kafka listeners cum Druid pushers so that we don't hit the too-many-threads issue.
The flow observed in 0.3.3 is as below:
1. Listeners push data to the peons successfully
2. A new hour starts and new peons are created
3. The listener is unable to discover the new task names to connect to
4. After restarting the listener, it is able to connect
It seems the service discovery on the listener cannot find new peons spawned within the same ZK session, though it can find them after a restart
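One generic mitigation when a watch-based cache stops seeing new instances within a session is to fall back to periodically polling the announced task names and diffing them against what is already known. A hedged sketch of that idea only; this is not Tranquility's or Curator's actual code, and all names here are made up:

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Supplier;

public class TaskPoller {
    private final Set<String> known = new HashSet<>();

    // Diff the currently announced task names against what we have already
    // seen, returning only the newly appeared ones (e.g. the new hour's
    // peons). Calling this on a timer sidesteps a stale watch.
    Set<String> newlyAppeared(Supplier<Set<String>> announced) {
        Set<String> fresh = new HashSet<>(announced.get());
        fresh.removeAll(known);
        known.addAll(fresh);
        return fresh;
    }

    public static void main(String[] args) {
        TaskPoller poller = new TaskPoller();
        // Hour 1: two peons announced; both are new.
        System.out.println(poller.newlyAppeared(
            () -> Set.of("index_task_h1_a", "index_task_h1_b")));
        // Hour 2: one more peon appears; only it is reported.
        System.out.println(poller.newlyAppeared(
            () -> Set.of("index_task_h1_a", "index_task_h1_b",
                         "index_task_h2_a"))); // [index_task_h2_a]
    }
}
```

Polling is a workaround, not a fix: it adds latency and load on ZK, but it would let the listener pick up the new hour's tasks without a restart while the session-watch behavior is being root-caused.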
That said, the issue is still there and we have not yet found its root cause
Thanks
Saurabh