Re: LMAX Disruptor 3.4.4 Halts Randomly


Sam Barker

May 31, 2023, 1:18:42 AM
to lmax-di...@googlegroups.com


On Tue, 30 May 2023 at 21:51, Hooman H Doost <h.hass...@gmail.com> wrote:
Hi, everyone
We are experiencing random halts during long runs - or sometimes even short runs when the machine is under heavy load due to other services.
We are using LMAX Disruptor 3.4.4. Each event on the ring is an array of size N, each element of which is itself a (two-dimensional) array; we do this instead of assigning ids to events on the ring. When there are parallel handlers, each one reads and modifies only its own, mutually exclusive set of array indices.

Facts :
- No exceptions are thrown by any handlers
- The publisher halts on the next publish event
The publisher will block when the ring buffer is full, which would imply that one or more of the handlers are not keeping up (or are at least perceived to be falling behind), as their sequences are not incrementing fast enough.
 
- This may be unrelated, but we are using it through Clojure's Java interop features. This is not the first project in which we have used Clojure with the Disruptor, but it is the first time we have experienced this problem.
- We are using :
```
LMAX Disruptor      : 3.4.4
Java Versions       : 11, 17, 20 (both OpenJDK and OracleJDK have been tested)
Clojure Version     : 1.11
```
- This happens with various ring sizes and producer types
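
The point about handler sequences can be made concrete with a small sketch. It is a simplified model of the wrap-around check, not the library's actual sequencer code: the class and method names (GatingCheck, mustWait) are made up for illustration, and sequences are assumed to start at -1 as they do in the Disruptor. It shows that a single stalled handler is enough to gate the publisher, even when the other parallel handlers are up to date.

```java
// Simplified model of the "is the ring full?" check. GatingCheck and mustWait
// are hypothetical names for illustration, not part of the Disruptor API.
final class GatingCheck {

    /**
     * @param nextSequence     the sequence the publisher wants to claim
     * @param bufferSize       number of slots in the ring
     * @param handlerSequences last sequence each handler has finished (-1 = none yet)
     * @return true if claiming nextSequence would overwrite a slot that the
     *         slowest handler has not finished with yet
     */
    static boolean mustWait(long nextSequence, int bufferSize, long... handlerSequences) {
        long slowest = Long.MAX_VALUE;
        for (long s : handlerSequences) {
            slowest = Math.min(slowest, s);
        }
        // The slot for nextSequence was last used by sequence (nextSequence - bufferSize).
        long wrapPoint = nextSequence - bufferSize;
        return wrapPoint > slowest;
    }

    public static void main(String[] args) {
        // Ring of 8 slots, two parallel handlers.
        System.out.println(mustWait(7, 8, -1, -1)); // last free slot can still be claimed
        System.out.println(mustWait(8, 8, 5, -1));  // one stalled handler gates the publisher
        System.out.println(mustWait(8, 8, 5, 0));   // slowest handler moved on, slot 0 is reusable
    }
}
```

So if one of the parallel handlers stops advancing for any reason, the publisher stops one full ring later, regardless of how fast the others are.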


Daniel Marques

May 31, 2023, 11:25:38 AM
to lmax-di...@googlegroups.com
Are these "halts" temporary (i.e. processing eventually resumes) or permanent?

If temporary, it is likely that the buffer is full. This occurs when you publish at a faster rate than you consume, and the ring buffer starts to fill. Once it is completely full, the next publication attempt blocks until the consumer frees an item from the ring. If the consumer pauses significantly (e.g. for a GC), this can happen pretty quickly.
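
As a rough illustration of the temporary case, here is a minimal sketch that uses java.util.concurrent.ArrayBlockingQueue as a stand-in for the ring. It is not the Disruptor API; it only mimics "no free slot until the consumer catches up". The class name TemporaryStallDemo is made up, and offer() is used instead of a blocking put() so the demo cannot hang.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Stand-in model only: a bounded queue plays the role of the ring buffer.
final class TemporaryStallDemo {

    /** Returns {publishedWhileConsumerPaused, publishedAfterConsumerResumed}. */
    static boolean[] run() {
        BlockingQueue<Integer> ring = new ArrayBlockingQueue<>(4);

        // The producer outruns the consumer and fills all four slots.
        for (int i = 0; i < 4; i++) {
            ring.offer(i);
        }

        // The consumer is paused (e.g. by a GC), so no slot frees up.
        // offer() reports failure here; a blocking put() would stall instead.
        boolean whilePaused = ring.offer(4);

        // The consumer resumes and finishes the oldest item, freeing one slot.
        ring.poll();
        boolean afterResume = ring.offer(4);

        return new boolean[] {whilePaused, afterResume};
    }

    public static void main(String[] args) {
        boolean[] result = run();
        System.out.println("published while consumer paused:  " + result[0]);
        System.out.println("published after consumer resumed: " + result[1]);
    }
}
```

The key property of this case is that the stall clears by itself as soon as the consumer makes progress.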

If the halt is permanent, it is possible that you've deadlocked the producer and consumer. This can occur when the consumer thread directly (or indirectly) publishes events onto the ring itself while the buffer is full.

Imagine a ring buffer that is completely full. The consumer starts processing the oldest item, A. As part of processing A, it decides to publish a different event, B, onto the buffer. However, as the buffer is full, the publication of B will block, meaning the processing of A is also blocked: the consumer is now deadlocked. The usual publication thread halts because the buffer is completely full, and it will never resume.

Of course, this can happen in ways which aren't as obvious, e.g. if the consumer thread passes something to another thread, which then publishes B back to the ring.
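
The direct version of this deadlock can be reproduced with a small stand-in model. This is a sketch under assumptions, not Disruptor code: a Semaphore counts free slots, the class name ConsumerDeadlockDemo is made up, and a slot is assumed to be released only after the handler returns. The consumer thread is interrupted at the end purely so that the demo terminates.

```java
import java.util.concurrent.Semaphore;

// Stand-in model only: permits of the semaphore represent free ring slots.
final class ConsumerDeadlockDemo {

    /** Returns true if the consumer is still stuck inside its handler after 200 ms. */
    static boolean consumerGetsStuck() {
        Semaphore freeSlots = new Semaphore(4);
        freeSlots.acquireUninterruptibly(4); // the producer has filled the ring

        Thread consumer = new Thread(() -> {
            try {
                // Handler for A publishes B with a blocking call. No slot is free,
                // and the only thread that could free one is this thread.
                freeSlots.acquire();
                // A's slot would be released only after the handler returns.
                freeSlots.release();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // demo-only escape hatch
            }
        });
        consumer.setDaemon(true);
        consumer.start();

        try {
            consumer.join(200); // give the handler ample time to finish, if it could
        } catch (InterruptedException e) {
            throw new IllegalStateException("demo was interrupted", e);
        }
        boolean stuck = consumer.isAlive();
        consumer.interrupt(); // clean up; a real deadlock has no such rescue
        return stuck;
    }

    public static void main(String[] args) {
        System.out.println("consumer stuck in handler: " + consumerGetsStuck());
    }
}
```

In a thread dump, this pattern would typically show up as the handler thread parked inside a publish call and the publisher thread parked waiting for capacity.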

If your architecture does require such a "cycle", it is very important that those "cycle publications" are non-blocking (using the try-publish methods, e.g. RingBuffer.tryPublishEvent or tryNext). Of course, you need to figure out what to do if publication fails.
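
One possible shape for such a non-blocking cycle publication is sketched below, again as a stand-in model and not Disruptor code. The class TryPublishDemo and the "park the event in a local overflow queue and retry later" policy are assumptions for illustration; dropping the event or signalling back-pressure are equally valid choices. In Disruptor 3.x, the role of tryPublish here would be played by RingBuffer.tryPublishEvent (which returns false when the ring is full) or tryNext (which throws InsufficientCapacityException).

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.Semaphore;

// Stand-in model only: permits of the semaphore represent free ring slots.
final class TryPublishDemo {
    private final Semaphore freeSlots;
    private final Deque<String> overflow = new ArrayDeque<>(); // hypothetical fallback store
    private int published = 0;

    TryPublishDemo(int ringSize) {
        freeSlots = new Semaphore(ringSize);
    }

    /** Simulates the normal producer filling every remaining slot. */
    void fillRing() {
        freeSlots.acquireUninterruptibly(freeSlots.availablePermits());
    }

    /** Non-blocking publish: returns false instead of waiting when the ring is full. */
    boolean tryPublish(String event) {
        if (freeSlots.tryAcquire()) {
            published++;
            return true;
        }
        return false;
    }

    /** Consumer processes one event; its slot is freed only after the handler returns. */
    void consumeOne(String event) {
        String followUp = event + "-followup";
        if (!tryPublish(followUp)) {
            overflow.addLast(followUp); // park it rather than block the consumer
        }
        freeSlots.release();
    }

    /** Retries parked events, e.g. at the end of a batch. */
    void retryOverflow() {
        while (!overflow.isEmpty() && tryPublish(overflow.peekFirst())) {
            overflow.removeFirst();
        }
    }

    int parked() { return overflow.size(); }

    int published() { return published; }
}
```

With a full ring of 4 slots, consumeOne("A") parks the follow-up event instead of blocking, the handler returns and frees A's slot, and a later retryOverflow() publishes the parked event.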




