permit count exceeded


Jeff Rose

Apr 23, 2011, 11:18:05 AM
to alep...@googlegroups.com
I'm occasionally getting this error related to the on-close handler
for a channel. It's non-deterministic and fairly rare; otherwise I'd
supply a test case that reproduces it. I tried searching for the
error and the only thing I came up with was this diff for JDK 7, which
is the version I'm running:

http://hg.openjdk.java.net/jdk7/hotspot/jdk/rev/025f9e57566a

I don't know if this is a bug in the JDK or if lamina might be doing
something incorrect with the semaphore used to guard an observer.

-Jeff


java.lang.Error: Maximum permit count exceeded
at java.util.concurrent.Semaphore$Sync.tryReleaseShared(Semaphore.java:197)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.releaseShared(AbstractQueuedSynchronizer.java:1340)
at java.util.concurrent.Semaphore.release(Semaphore.java:615)
at lamina.core.observable.Observable.close(observable.clj:71)
at lamina.core.observable$siphon$fn__17034$fn__17038.invoke(observable.clj:234)
at lamina.core.observable$observer$reify__16758.on_close(observable.clj:43)
at lamina.core.observable.Observable.close(observable.clj:125)
at lamina.core.channel$close.invoke(channel.clj:127)
at aleph.tcp$basic_server_pipeline$fn__22437.invoke(tcp.clj:94)
at aleph.netty$channel_close_stage$fn__21499.invoke(netty.clj:156)
at aleph.netty$upstream_stage$reify__21482.handleUpstream(netty.clj:122)
at aleph.netty$upstream_stage$reify__21482.handleUpstream(netty.clj:124)
at aleph.netty$upstream_stage$reify__21482.handleUpstream(netty.clj:123)
at org.jboss.netty.handler.codec.frame.FrameDecoder.cleanup(FrameDecoder.java:344)
at org.jboss.netty.handler.codec.frame.FrameDecoder.channelDisconnected(FrameDecoder.java:226)
at aleph.netty$upstream_stage$reify__21482.handleUpstream(netty.clj:124)
at org.jboss.netty.channel.Channels.fireChannelDisconnected(Channels.java:360)
at org.jboss.netty.channel.socket.nio.NioWorker.close(NioWorker.java:587)
at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:356)
at org.jboss.netty.channel.socket.nio.NioWorker.processSelectedKeys(NioWorker.java:281)
at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:201)
at org.jboss.netty.util.internal.IoWorkerRunnable.run(IoWorkerRunnable.java:46)
at java.util.concurrent.ThreadPoolExecutor.runWork

Zach Tellman

Apr 23, 2011, 4:02:18 PM
to alep...@googlegroups.com
There's a test in lamina.test.channel called 'test-on-close'.
Currently it runs 100 times through a simple close/on-close cycle.
Can you ramp that up to a million and run it on your particular setup?

Zach

Jeff Rose

Apr 23, 2011, 4:58:06 PM
to alep...@googlegroups.com
Ok, it's running now. At roughly 5 ms per iteration, it's going to
take on the order of:

user=> (/ (* 1e6 5) 1000 60.0)
83.33333333333333

minutes, i.e. about an hour and a half.

Why does calling an on-close handler result in something so non-deterministic?

Zach Tellman

Apr 23, 2011, 5:09:21 PM
to alep...@googlegroups.com
Obviously, it shouldn't be non-deterministic. However, observables
use a somewhat complicated locking mechanism to maximize concurrent
throughput, so I created a few stress tests to satisfy myself that
there wasn't a race condition lurking somewhere. Running it a million
times works perfectly on my machine (on JDK 6), so if the stress test
reproduces your issue, it may well be a problem with your JVM, though
I'm hesitant to point the finger at the platform.

But really, a million iterations is probably overkill. Feel free to
stop it sooner if you're not seeing the issue.

Zach

Jeff Rose

Apr 23, 2011, 5:47:55 PM
to alep...@googlegroups.com
Ok, I'll let it run through the million just to be sure.

I'd be interested to understand your thinking about the locking,
though. For example, in the implementation of close in the Observable
defrecord, it marks the observable closed, locks, calls the on-close
handlers, and then unsubscribes them. I would have thought you could
read the value of @observers once, call the associated handlers, and
then unsubscribe them, without locking. New subscribers will fail any
time after the compare-and-set! of closed?, so they don't need to be
locked out, right? And if @observers is dereferenced once, after the
compare-and-set! at the top of the function, it will be consistent
for both calling the on-close handlers and unsubscribing them. Also,
I noticed that unsubscribe passes false for the closed? parameter to
modify-observers, which isn't accurate when it's called from inside
close.

I am still just feeling my way around so I'm sure I don't grok all of
what's happening here, as I've only used semaphores to guard pools of
objects and I've never used thread locals. Can you explain how this
works? Clearly it is working well, but I don't understand the need
for both the thread-local counter and the semaphore, which itself
maintains a counter.

Thanks,
Jeff

Zach Tellman

Apr 23, 2011, 6:16:11 PM
to alep...@googlegroups.com
There are two atoms in the observable: closed? and subscribers. When
closing, I need to first set closed? to true, and then sample the
observers. When subscribing, I need to sample closed? and then update
the subscribers. If one thread is closing and another is subscribing,
it's possible for the following to occur:

Thread 1 checks closed?, which is false
Thread 2 sets closed? to true
Thread 2 triggers the on-closed event for all subscribers
Thread 1 adds a new subscriber which will never get the on-closed event

If you take out the locks, the test-on-close function you're running
will generally fail about 1 time in 100. Similar issues can occur
when sending a message after the on-closed callbacks have been
triggered.
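
To make that window concrete, here's a stripped-down, lock-free sketch
of the idea (this isn't the actual observable code; the names and the
map-of-callbacks representation are just for illustration):

;; Hypothetical lock-free observable: closed? and subscribers are
;; plain atoms, and each subscriber value is a zero-arg on-close callback.
(defn observable* []
  {:closed? (atom false) :subscribers (atom {})})

;; Subscribe: check closed?, then add the callback.
(defn subscribe* [{:keys [closed? subscribers]} k on-close-fn]
  (when-not @closed?
    ;; <-- a concurrent close* can run entirely inside this window,
    ;;     so the new subscriber never gets its on-close callback
    (swap! subscribers assoc k on-close-fn)
    true))

;; Close: flip closed?, then fire and clear the callbacks.
(defn close* [{:keys [closed? subscribers]}]
  (when (compare-and-set! closed? false true)
    (doseq [[_ on-close-fn] @subscribers]
      (on-close-fn))
    (reset! subscribers {})))

With this version, a subscriber added in the marked window is silently
dropped; that's exactly the failure the locking prevents.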

What we want, then, is to be sure that nothing else is going on when
we close an observable. However, we don't want to serialize
everything; we're perfectly happy having two threads sending a message
through the observable at the same time, or one thread adding a
subscriber while another sends a message.

The solution I use for this is to create a semaphore with
Integer/MAX_VALUE permits, have each subscribe and message operation
take a single permit, and have the close operation take all the
permits. This means that an arbitrary number of messages and
subscriptions can happen concurrently, but a close operation will wait
for all pending messages to clear, and then all further messages will
have to wait until it's complete.
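
In rough sketch form (the names here are made up, and this is just the
shape of the idea rather than the actual implementation):

(import '[java.util.concurrent Semaphore])

(def max-permits Integer/MAX_VALUE)

;; Shared path: message and subscribe each hold a single permit,
;; so any number of them can run concurrently.
(defn with-one-permit [^Semaphore lock f]
  (.acquire lock)
  (try (f)
       (finally (.release lock))))

;; Exclusive path: close drains every permit, waiting for pending
;; shared operations to finish and blocking new ones until it's done.
(defn with-all-permits [^Semaphore lock f]
  (.acquire lock max-permits)
  (try (f)
       (finally (.release lock max-permits))))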

But there's an additional problem: what if the close operation occurs
inside a message callback? The message operation has already consumed
a single permit (and who knows how deeply we're nested), so if we
simply have (close ...) consume Integer/MAX_VALUE permits, it will
hang forever. My solution for this is to have a ThreadLocal variable
keeping track of how many permits have been consumed in the context of
the current thread, and only consume the difference between that value
and Integer/MAX_VALUE.
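
Continuing the sketch above (again, illustrative names, not the real
code):

;; Per-thread count of how many permits this thread currently holds.
(def ^ThreadLocal held-permits
  (proxy [ThreadLocal] []
    (initialValue [] 0)))

(defn acquire-one [^Semaphore lock]
  (.acquire lock)
  (.set held-permits (inc (.get held-permits))))

(defn release-one [^Semaphore lock]
  (.set held-permits (dec (.get held-permits)))
  (.release lock))

;; Close path: only acquire the permits this thread doesn't already
;; hold, so a close nested inside a message callback can't block on
;; permits held by its own thread.
(defn acquire-remaining [^Semaphore lock]
  (let [needed (- max-permits (.get held-permits))]
    (.acquire lock needed)
    (.set held-permits max-permits)
    needed))

(defn release-remaining [^Semaphore lock acquired]
  (.release lock acquired)
  (.set held-permits (- max-permits acquired)))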

This re-entrant locking mechanism also allows for stranger behavior,
like trying to send a message into a channel inside an on-close
callback, or trying to close the channel a second time within an
on-close callback, and so on.

The exception that's being thrown means the permit count is
overflowing past Integer/MAX_VALUE to a negative value. That check
only exists in JDK 7, so it was probably overflowing silently before;
either way, it's a flaw in my code. I've identified one scenario where
that could happen, but as far as I can tell it would only occur when a
thread is interrupted. Regardless, I've pushed a fix. If the stress
test doesn't produce any errors, please pull the latest and let me
know whether you're still seeing the issue.
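
For what it's worth, the overflow itself is easy to see at a REPL: on
JDK 7, releasing a permit into a semaphore that's already at
Integer/MAX_VALUE trips the new check and throws the same Error as in
your stack trace, while JDK 6 just lets the count wrap negative:

user=> (def s (java.util.concurrent.Semaphore. Integer/MAX_VALUE))
#'user/s
user=> (.release s)
;; JDK 7: java.lang.Error: Maximum permit count exceeded
;; JDK 6: silently wraps to a negative permit count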

Thanks,
Zach

Jeff Rose

Apr 24, 2011, 4:19:54 PM
to alep...@googlegroups.com
Thanks for taking the time to describe how this works. I've never
seen this kind of asymmetric locking before, but it makes sense, and
your thread-local mechanism to support re-entrancy is very cool. You
hit the nail on the head about the cause too, and now that I
understand it I don't think I'd really even call it a bug. I had a
rarely occurring problem where a thread would occasionally go into
infinite recursion. I was inspecting the threads to find the stack
frame and isolate the problem, which eventually led me to the bug, but
in the meantime I had called the stop method on the thread because it
was pegging my CPU. Everything seemed to work fine, so I kept
developing and running code, and when this overflow error showed up
later I hadn't connected the two.

Anyways, great stuff, thanks for Lamina.

-Jeff
