NullPointerException sending to a channel in a thread

Chris Pennello

Apr 18, 2016, 8:04:31 PM4/18/16
to quasar-pulsar-user
Hello!

I've been experimenting with Quasar channels for our Java application, and I'm really loving the CSP-style of concurrency management.

Unfortunately, with a fairly simple setup, I'm running into the following NPE.

java.lang.NullPointerException: null
at co.paralleluniverse.strands.channels.TransferChannel$Node.lease(TransferChannel.java:387)
at co.paralleluniverse.strands.channels.TransferChannel.tryMatch(TransferChannel.java:611)
at co.paralleluniverse.strands.channels.TransferChannel.xfer1(TransferChannel.java:501)
at co.paralleluniverse.strands.channels.TransferChannel.send(TransferChannel.java:87)
... my application code that calls send ...

The setup is as follows.
Quasar 0.7.4 with jdk8.  Instrumentation enabled at runtime with -javaagent.  -Dco.paralleluniverse.fibers.verifyInstrumentation=true (although no "QUASAR WARNING" output).  Everything is running on threads, not fibers (is instrumentation even necessary if I'm not running any fibers?).  Multiple producer threads, each with its own unbuffered/transfer channel, configured with overflow policy block and single producer and single consumer true.  One consumer thread that periodically selects across a subset of the producers' channels based on various application-specific properties of the items being consumed.
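For concreteness, the shape of the code is roughly the following (a trimmed-down sketch rather than our actual code; Item, pickSubset and handle are placeholder names):

import co.paralleluniverse.fibers.SuspendExecution;
import co.paralleluniverse.strands.channels.Channel;
import co.paralleluniverse.strands.channels.Channels;
import co.paralleluniverse.strands.channels.SelectAction;
import co.paralleluniverse.strands.channels.Selector;

import java.util.ArrayList;
import java.util.List;

public class TransferChannelSketch {
    static class Item { final int producerId; Item(int id) { this.producerId = id; } }

    public static void main(String[] args) {
        final List<Channel<Item>> channels = new ArrayList<>();
        for (int i = 0; i < 8; i++) {
            // buffer size 0 => unbuffered transfer channel; BLOCK overflow policy;
            // single producer, single consumer
            final Channel<Item> ch = Channels.newChannel(0, Channels.OverflowPolicy.BLOCK, true, true);
            channels.add(ch);
            final int id = i;
            new Thread(() -> produce(ch, id), "producer-" + i).start();
        }
        new Thread(() -> consume(channels), "consumer").start();
    }

    static void produce(Channel<Item> ch, int id) {
        try {
            while (!Thread.currentThread().isInterrupted())
                ch.send(new Item(id));          // rendezvous: blocks until the consumer takes the item
        } catch (SuspendExecution | InterruptedException e) {
            // catching SuspendExecution is tolerable only because this runs on plain threads, never fibers
        } finally {
            ch.close();                         // each producer closes its channel when it shuts down
        }
    }

    static void consume(List<Channel<Item>> channels) {
        try {
            while (true) {
                // select across an application-specific subset of the producers' channels
                final List<SelectAction<Item>> actions = new ArrayList<>();
                for (Channel<Item> ch : pickSubset(channels))
                    actions.add(Selector.receive(ch));
                final SelectAction<Item> sa = Selector.select(actions);
                final Item item = sa.message(); // null would indicate the selected channel was closed
                if (item != null)
                    handle(item);
            }
        } catch (SuspendExecution | InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    static List<Channel<Item>> pickSubset(List<Channel<Item>> all) { return all; } // stand-in for our selection logic
    static void handle(Item item) { /* application-specific processing */ }
}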

Anything obvious that I'm doing wrong?  Any other information I can provide?

I love that all my locks have gone away due to Quasar's channels, but this NPE is troubling; the code has to work in addition to being beautiful!  : ]

Chris

Fabio Tudone

Apr 19, 2016, 6:19:58 AM4/19/16
to quasar-pulsar-user
Hi,


Unfortunately, with a fairly simple setup, I'm running into the following NPE.

Oy.

The setup is as follows.
Quasar 0.7.4 with jdk8.  Instrumentation enabled at runtime with -javaagent.  -Dco.paralleluniverse.fibers.verifyInstrumentation=true (although no "QUASAR WARNING" output). Everything is running on threads, not fibers (is instrumentation even necessary if I'm not running any fibers?).

At present, in specific cases, possibly not; in general, though, you want future-proof deployments, so you shouldn't assume that instrumentation (with the agent or otherwise) is needed only if you're actually using fibers. That's for at least the following reasons (that I can think of right now):

- Even if you're not using fibers yourself, Quasar could still use them behind the scenes, e.g. http://docs.paralleluniverse.co/quasar/javadoc/co/paralleluniverse/strands/channels/transfer/Pipeline.html (that is, unless you explicitly pass the constructor a strand factory that creates threads rather than fibers).
- Some Quasar code assumes (or might assume in the future) that your code is ready to inter-operate with fibers, which means (or will possibly mean in the future) that it has passed through Quasar's instrumentor, even if in a specific runtime situation it's not actually inter-operating with fibers at all.
- At some point you might want to hot-upgrade your running code (e.g. actor hot-deployment or class redefinitions) and the new code might work with fibers.

If what bothers you is single-artifact deployment in the presence of agents (Java or otherwise), look no further than Capsule: http://www.capsule.io/.

A different matter is the runtime (classpath) dependency on Quasar: if your code uses, say, only Quasar annotations (e.g. a library made of a set of computation methods that could block fibers only in some projects), then you can package it without declaring a Quasar dependency: annotations not found on the classpath should be ignored by the JVM.
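For example, something like the following hypothetical library method (the class, method and parameter names are made up) compiles against quasar-core but could be shipped without it, as long as the projects calling it on plain threads don't instrument it:

import co.paralleluniverse.fibers.Suspendable;
import java.util.function.Supplier;

public final class Retry {
    // Marked @Suspendable because the supplied action may itself be fiber-blocking
    // in some projects. Thread-only consumers can run this jar without quasar-core
    // on the classpath: the JVM ignores annotation types it cannot resolve.
    @Suspendable
    public static <T> T withRetries(Supplier<T> action, int attempts) {
        if (attempts < 1)
            throw new IllegalArgumentException("attempts must be >= 1");
        RuntimeException last = null;
        for (int i = 0; i < attempts; i++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                last = e;   // remember the failure and retry
            }
        }
        throw last;
    }
}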
 
Multiple producer threads, each with its own unbuffered/transfer channel, configured with overflow policy block and single producer and single consumer true.  One consumer thread that periodically selects across a subset of the producers' channels based on various application-specific properties of the items being consumed.

Anything obvious that I'm doing wrong?  Any other information I can provide?

Here's a small project with a similar program (if I understood correctly your case): https://github.com/circlespainter/quasar-chris-pennello-issue (you can edit some program constants and JVM flags according to the README). It seems to work fine for me, does it work for you too? What are the differences with your case, if any?
 
I love that all my locks have gone away due to Quasar's channels

Great to hear!
 
but this NPE is troubling; the code has to work in addition to being beautiful!  : ]

Oh but you really want it all... Aren't you asking for too much? :) 

Let me know.

Cheers,
-- Fabio

pron

Apr 19, 2016, 6:51:38 AM4/19/16
to quasar-pulsar-user
Hi Chris.
I’ve just uploaded a snapshot version to sonatype (0.7.5-SNAPSHOT) that hopefully contains a fix. This bug is tracked here.

Ron

Fabio Tudone

Apr 19, 2016, 8:18:18 AM4/19/16
to quasar-pulsar-user
I've just been able to reproduce the same exception and stacktrace with the above program, Quasar 0.7.4 and a higher number of producers and exchanges (new values pushed) but not with 0.7.5-SNAPSHOT (new version pushed too).

Let us know if the latest 0.7.5-SNAPSHOT solves the issue for you too.

-- Fabio

Chris Pennello

Apr 19, 2016, 12:34:18 PM4/19/16
to quasar-pulsar-user
First off, thank you for the prompt and thoughtful reply.

On Tuesday, April 19, 2016 at 3:19:58 AM UTC-7, Fabio Tudone wrote:
Everything is running on threads, not fibers (is instrumentation even necessary if I'm not running any fibers?).

At present, in specific cases, possibly not; in general, though, you want future-proof deployments, so you shouldn't assume that instrumentation (with the agent or otherwise) is needed only if you're actually using fibers. That's for at least the following reasons (that I can think of right now):

- Even if you're not using fibers yourself, Quasar could still use them behind the scenes, e.g. http://docs.paralleluniverse.co/quasar/javadoc/co/paralleluniverse/strands/channels/transfer/Pipeline.html (that is, unless you explicitly pass the constructor a strand factory that creates threads rather than fibers).
- Some Quasar code assumes (or might assume in the future) that your code is ready to inter-operate with fibers, which means (or will possibly mean in the future) that it has passed through Quasar's instrumentor, even if in a specific runtime situation it's not actually inter-operating with fibers at all.
- At some point you might want to hot-upgrade your running code (e.g. actor hot-deployment or class redefinitions) and the new code might work with fibers.
Ah, very interesting.  I hadn't dug into Pipelines yet, but it is of course reasonable that various Quasar libraries themselves make internal use of fibers.  For that and the other reasons you mention, it sounds like it makes sense to always instrument.

If what bothers you is single-artifact deployment in the presence of agents (Java or otherwise), look no further than Capsule: http://www.capsule.io/.
Ah, very cool!  We'll take a look at this.  On first blush, we actually do something fairly similar with our internal tooling.

One thing we are interested in is doing the instrumentation at compile time instead of at runtime.  One reason is to have a less stateful runtime.  We're using Maven, and, unless you have compelling ideas to the contrary, were planning on building a Maven plugin to do compile-time Quasar instrumentation and contributing that back.  Thoughts?
 
A different matter is the runtime (classpath) dependency on Quasar: if your code uses, say, only Quasar annotations (e.g. a library made of a set of computation methods that could block fibers only in some projects), then you can package it without declaring a Quasar dependency: annotations not found on the classpath should be ignored by the JVM.
Thank you; I'm glad you pointed this out; it's a good distinction to be aware of.
 
Here's a small project with a similar program ...
I'm impressed at how quickly you whipped this up.  Nice! 

Chris Pennello

Apr 19, 2016, 12:34:31 PM4/19/16
to quasar-pulsar-user
On Tuesday, April 19, 2016 at 5:18:18 AM UTC-7, Fabio Tudone wrote:
I've just been able to reproduce the same exception and stacktrace with the above program, Quasar 0.7.4 and a higher number of producers and exchanges (new values pushed)
Interestingly, we don't have a large number of producers in our use case.  We have smallish numbers, ranging from 4-16 depending on the environment.  In the environment in which the NPE arose, the item rate is pretty low too--maybe about 2-20 per second.  The real stress test is to come in the next few days; I'll be pumping ~35,000 items per second through these channels, with the subset of channels being selected from changing quite a lot as it goes.  : D
 
but not with 0.7.5-SNAPSHOT (new version pushed too).

Let us know if the latest 0.7.5-SNAPSHOT solves the issue for you too.
Been running my tests for about 40 minutes now, no hiccups so far!  Comparatively, with 0.7.4, I was running into the NPE within minutes.  Will holler if anything else goes wrong.
 
On Tuesday, April 19, 2016 at 1:51:38 PM UTC+3, pron wrote:
I’ve just uploaded a snapshot version to sonatype (0.7.5-SNAPSHOT) that hopefully contains a fix. This bug is tracked here.
Nice; thanks Ron.  I must admit, however, I'm a bit surprised.  How has nobody else run into this before?  What is the customer base like for Quasar?

pron

Apr 19, 2016, 12:50:23 PM4/19/16
to quasar-pulsar-user


On Tuesday, April 19, 2016 at 7:34:31 PM UTC+3, Chris Pennello wrote:

Nice; thanks Ron.  I must admit, however, I'm a bit surprised.  How has nobody else run into this before?  What is the customer base like for Quasar?

Quasar is used heavily in production by a number of companies, in critical services. Transfer-channels, OTOH — not so much. People generally use bounded or unbounded channels, and for actors, transfer channels aren’t even an option.

Ron 

pron

Apr 19, 2016, 1:00:17 PM4/19/16
to quasar-pulsar-user


On Tuesday, April 19, 2016 at 7:34:18 PM UTC+3, Chris Pennello wrote:
One thing we are interested in is doing the instrumentation at compile time instead of at runtime.  One reason is to have a less stateful runtime.  We're using Maven, and, unless you have compelling ideas to the contrary, were planning on building a Maven plugin to do compile-time Quasar instrumentation and contributing that back.  Thoughts?

There's an Ant task that you can directly invoke from Maven, but any contribution would be welcome. However, I don't think your reasons for doing AOT instrumentation are strong enough (a good reason to do AOT instrumentation is, for example, no control over the JVM's command line, which can arise in various hosted scenarios). Runtime instrumentation is not any more stateful than any other Java code (it certainly doesn't write anything to disk). Generating code on the fly happens all the time on the JVM, and not just in third-party libraries, but even in the JDK itself.
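For example, wiring the Ant task into a Maven build via maven-antrun-plugin could look roughly like the following sketch (the InstrumentationTask class name is from the AOT docs; the plugin version, phase and fileset are assumptions you'd adapt):

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-antrun-plugin</artifactId>
  <version>1.8</version>
  <dependencies>
    <dependency>
      <groupId>co.paralleluniverse</groupId>
      <artifactId>quasar-core</artifactId>
      <version>0.7.5</version>
      <classifier>jdk8</classifier>
    </dependency>
  </dependencies>
  <executions>
    <execution>
      <id>quasar-aot-instrument</id>
      <phase>process-classes</phase>
      <goals><goal>run</goal></goals>
      <configuration>
        <target>
          <!-- InstrumentationTask ships with quasar-core; maven.plugin.classpath
               resolves to this plugin's <dependencies> above -->
          <taskdef name="instrumentationTask"
                   classname="co.paralleluniverse.fibers.instrument.InstrumentationTask"
                   classpathref="maven.plugin.classpath"/>
          <instrumentationTask>
            <fileset dir="${project.build.outputDirectory}"/>
          </instrumentationTask>
        </target>
      </configuration>
    </execution>
  </executions>
</plugin>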

Ron

Chris Pennello

Apr 19, 2016, 1:01:00 PM4/19/16
to quasar-pulsar-user
Cool; that's actually exactly what I expected.  A coworker of mine was also unsurprised; he commented that it's probably similar to how SynchronousQueue is the least-used of the BlockingQueues.

I wonder why this is, though.  For example, in Go, unbuffered channels are the default.

Chris Pennello

Apr 19, 2016, 1:11:14 PM4/19/16
to quasar-pulsar-user
Point taken.  The command-line control is another reason in our case.  It's a bit of a pain to adjust the command line and elegantly package the JAR up with our builds.

pron

Apr 19, 2016, 1:23:04 PM4/19/16
to quasar-pulsar-user


On Tuesday, April 19, 2016 at 8:11:14 PM UTC+3, Chris Pennello wrote:

Point taken.  The command-line control is another reason in our case.  It's a bit of a pain to adjust the command line and elegantly package the JAR up with our builds.

That is not a problem at all with Capsule. Don't get me wrong -- AOT instrumentation works well (in fact, Quasar itself is AOT-instrumented) -- but runtime instrumentation is just simpler and cleaner, and come Java 9, it will let us get rid of all manual suspendability annotations (although Quasar will still support AOT instrumentation). So my suggestion is: try to use the agent and move to AOT only if it proves to be a real problem in practice (in which case we'd love to know why). Whatever you choose to do, a Maven plugin would be welcome.

BTW, you should know that Quasar's instrumentation is quite minimal: no new classes, fields/methods or method arguments are added. 

Ron

Chris Pennello

Apr 19, 2016, 1:31:53 PM4/19/16
to quasar-pulsar-user
On Tuesday, April 19, 2016 at 10:23:04 AM UTC-7, pron wrote:
On Tuesday, April 19, 2016 at 8:11:14 PM UTC+3, Chris Pennello wrote:
Point taken.  The command-line control is another reason in our case.  It's a bit of a pain to adjust the command line and elegantly package the JAR up with our builds.

That is not a problem at all with Capsule. Don't get me wrong -- AOT instrumentation works well (in fact, Quasar itself is AOT-instrumented) -- but runtime instrumentation is just simpler and cleaner, and come Java 9, it will let us get rid of all manual suspendability annotations (although Quasar will still support AOT instrumentation).
Ah, that's interesting.  Do you have a quick rundown you can point to on why that's the case with Java 9?  If not, would you mind briefly elaborating?
 
So my suggestion is: try to use the agent and move to AOT only if it proves to be a real problem in practice (in which case we'd love to know why).
Ok, I appreciate the suggestion; we'll definitely take it into consideration.
 
Whatever you choose to do, a Maven plugin would be welcome.
Right on.
 
BTW, you should know that Quasar's instrumentation is quite minimal: no new classes, fields/methods or method arguments are added. 
Yup; I looked through the code when reading up on the Ant task.  : ]


Switching back to the more immediate, I have a fun new exception that we ran into.  : [

java.lang.RuntimeException: Unable to obtain selector lease: LEASED
at co.paralleluniverse.strands.channels.Selector.lease(Selector.java:463)
at co.paralleluniverse.strands.channels.SelectActionImpl.lease(SelectActionImpl.java:77)
at co.paralleluniverse.strands.channels.TransferChannel$Node.lease(TransferChannel.java:388)
at co.paralleluniverse.strands.channels.TransferChannel.tryMatch(TransferChannel.java:614)
at co.paralleluniverse.strands.channels.TransferChannel.xfer1(TransferChannel.java:504)
at co.paralleluniverse.strands.channels.TransferChannel.send(TransferChannel.java:87)
        ... my application code that calls send ...

Same setup as before, running the 0.7.5-SNAPSHOT you shared.  FWIW, it happens about as quickly as the NPE was happening before (a few minutes after startup, after plowing through ~50,000 items).  This one, however, doesn't happen every time.  What other information, if any, can I provide to help diagnose?

Fabio Tudone

Apr 20, 2016, 2:59:45 AM4/20/16
to quasar-pulsar-user
come Java 9, it will let us get rid of all manual suspendability annotations (although Quasar will still support AOT instrumentation).
Ah, that's interesting.  Do you have a quick rundown you can point to on why that's the case with Java 9?  If not, would you mind briefly elaborating?

In short, the new "StackWalker" API will allow us to access the JVM operand stack and locals for every method frame in the stack, so we'll be able to fix up lazily (i.e. just before suspension) the fiber stack, as well as the code (through redefinition) of all method frames that lack instrumentation.

It's WIP and the API (and VM support for it) is not finalized yet (we're in touch with Oracle guys) but if you're curious you can have a look here.

Switching back to the more immediate, I have a fun new exception that we ran into.  : [

java.lang.RuntimeException: Unable to obtain selector lease: LEASED
at co.paralleluniverse.strands.channels.Selector.lease(Selector.java:463)
at co.paralleluniverse.strands.channels.SelectActionImpl.lease(SelectActionImpl.java:77)
at co.paralleluniverse.strands.channels.TransferChannel$Node.lease(TransferChannel.java:388)
at co.paralleluniverse.strands.channels.TransferChannel.tryMatch(TransferChannel.java:614)
at co.paralleluniverse.strands.channels.TransferChannel.xfer1(TransferChannel.java:504)
at co.paralleluniverse.strands.channels.TransferChannel.send(TransferChannel.java:87)
        ... my application code that calls send ...

Same setup as before, running the 0.7.5-SNAPSHOT you shared.  FWIW, it happens about as quickly as the NPE was happening before (a few minutes after startup, after plowing through ~50,000 items).  This one, however, doesn't happen every time.  What other information, if any, can I provide to help diagnose?

With which JVM and settings is this happening?

Thanks,
-- Fabio

pron

Apr 20, 2016, 7:04:28 AM4/20/16
to quasar-pulsar-user
Can you try with the new snapshot?

Regardless of whether this fixes the issue or not, I'd like to say that TransferChannel is by far the most complicated class in all of Quasar, and also happens to be among the least used ones, so it is less mature than the rest of the codebase. Is there a reason why you'd want to use a transfer channel over a bounded channel?

Ron

Chris Pennello

Apr 20, 2016, 5:15:58 PM4/20/16
to quasar-pulsar-user
On Tuesday, April 19, 2016 at 11:59:45 PM UTC-7, Fabio Tudone wrote:
In short, the new "StackWalker" API will allow us to access the JVM operand stack and locals for every method frame in the stack, so we'll be able to fix up lazily (i.e. just before suspension) the fiber stack, as well as the code (through redefinition) of all method frames that lack instrumentation.

It's WIP and the API (and VM support for it) is not finalized yet (we're in touch with Oracle guys) but if you're curious you can have a look here.
Extremely cool; thanks for sharing.
 
With which JVM and settings is this happening?
$ java -version
openjdk version "1.8.0_72-internal"
OpenJDK Runtime Environment (build 1.8.0_72-internal-b15)
OpenJDK 64-Bit Server VM (build 25.72-b15, mixed mode)

Relevant JVM arguments:
-Djava.net.preferIPv4Stack=true
-XX:+AlwaysPreTouch
-XX:MaxMetaspaceSize=64m
-XX:CompressedClassSpaceSize=32m
-XX:ReservedCodeCacheSize=64m
-XX:HeapDumpPath=/mnt/mesos/sandbox
-XX:-OmitStackTraceInFastThrow
-Duser.timezone=UTC
-Dfile.encoding=UTF-8
-XX:ParallelGCThreads=4
-XX:+UseConcMarkSweepGC
-XX:+DisableExplicitGC
-XX:-PrintGC
-XX:+PrintGCDateStamps
-XX:+PrintGCDetails
-XX:+PrintGCApplicationStoppedTime
-Xmx2200m
-Xms2200m

Chris Pennello

Apr 20, 2016, 5:16:14 PM4/20/16
to quasar-pulsar-user
On Wednesday, April 20, 2016 at 4:04:28 AM UTC-7, pron wrote:
Can you try with the new snapshot?
 So far so good!  Been testing for about an hour with no issues.  Very much appreciated.  Although I should note that we saw the "Unable to obtain selector lease: LEASED" only once.

Regardless of whether this fixes the issue or not, I'd like to say that TransferChannel is by far the most complicated class in all of Quasar, and also happens to be among the least used ones, so it is less mature than the rest of the codebase. Is there a reason why you'd want to use a transfer channel over a bounded channel?
 Well first I will say that we aren't put off by the status of this poor neglected class.  : ]  In fact, we're quite excited to help you shake out the bugs through our usage.

Superficially, though, given any buffered channel, the behavior will be the same once the buffer is filled, right?  Both the producing and consuming strand will be required to synchronously rendezvous to move forward.  Preventing our producers from doing more work until their latest item is consumed simplifies our system a bit, so I have a slight preference for that.  In addition, we implement some pretty non-trivial buffering downstream from the item consumption, so the less state we keep, the better, if only for clearer and simpler debugging when there are problems.

Do you have insight into why most of your customers are using buffered channels instead of "pure" rendezvous channels?  I again note the default in Go: channels are unbuffered unless you explicitly specify a size.  I wonder why the discrepancy?  Perhaps lack of familiarity with CSP-style programming?

pron

Apr 21, 2016, 7:37:23 AM4/21/16
to quasar-pulsar-user


On Thursday, April 21, 2016 at 12:16:14 AM UTC+3, Chris Pennello wrote:

Superficially, though, given any buffered channel, the behavior will be the same once the buffer is filled, right?  Both the producing and consuming strand will be required to synchronously rendezvous to move forward.  

Correct, IF the buffer is full and IF the channel has been created with the BLOCK overflow policy (there are other policies to determine what the producer should do if the buffer is full).
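For illustration, the relevant factory calls side by side (Item is a placeholder message type):

// unbuffered transfer channel: every send must rendezvous with a receive
Channel<Item> transfer = Channels.newChannel(0, Channels.OverflowPolicy.BLOCK, true, true);

// bounded channel of capacity 1 with BLOCK: send blocks only once the single slot is occupied
Channel<Item> bounded = Channels.newChannel(1, Channels.OverflowPolicy.BLOCK, true, true);

// other policies change what a send does on a full buffer instead of blocking,
// e.g. DROP silently discards the new message and THROW raises an exception
Channel<Item> lossy = Channels.newChannel(1, Channels.OverflowPolicy.DROP, true, true);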
 

Do you have insight into why most of your customers are using buffered channels instead of "pure" rendezvous channels?

Because you don't want to introduce synchronization where it isn't required, and some leeway to absorb temporarily uneven production/consumption rates may help.
 
 I again note the default in Go: channels are unbuffered unless you explicitly specify a size.  I wonder why the discrepancy?  Perhaps lack of familiarity with CSP-style programming?

I can only hypothesize: when Go was created, its default was no parallelism (a single kernel thread), and in that case buffering doesn't help and may even confuse things for the scheduler. Rendezvous in that case is simple and efficient: the scheduler suspends the producer and wakes the consumer. Also, rendezvous channels are more powerful in the sense that they can easily be made to mimic other synchronization primitives like critical sections, and perhaps Go's designers wanted users to use channels for all kinds of synchronization without needing to learn a few different primitives (even if they exist in the libraries).

Ron

Chris Pennello

Apr 21, 2016, 5:10:25 PM4/21/16
to quasar-pulsar-user
Thanks for the insights Ron.

The synchronization in our case is quite useful; our producers really shouldn't do any further work until their item has been consumed.

Any idea when we'll see a 0.7.5 release including the fixes you've done?

Chris Pennello

Apr 21, 2016, 7:43:47 PM4/21/16
to quasar-pulsar-user
I may have spoken too soon.

We've seen a couple of instances in which we got a non-null select action back from a select call, but calling message on it returned null.  All of the select actions passed into the select call were receives.  (This is on our consumer thread.)  I know that channels can't transmit null values.  My understanding is that a null value from calling message on a select action indicates that it was not completed.  But getting back a non-null select action from a select call should only happen if the action completed, right?

Unfortunately, unlike the other two oddities we ran into thus far, this one doesn't come with a clear stack trace.  How can I gather information to help debug this?

The scale here is that we're running maybe 35,000 events per second through 16 channels.  Otherwise, same setup as described earlier in the thread.

Fabio Tudone

Apr 22, 2016, 2:02:39 AM4/22/16
to quasar-pulsar-user
Hi Chris, "select" also reports "close" events from receive actions in addition to actual messages: if the producer closes the channel explicitly then "select" will return the corresponding receive action and the message will be "null". This happens in my test program too.

Is the producer indeed closing the selected receive channel in your case?

-- Fabio

Chris Pennello

Apr 22, 2016, 11:46:38 AM4/22/16
to quasar-pulsar-user
On Thursday, April 21, 2016 at 11:02:39 PM UTC-7, Fabio Tudone wrote:
Hi Chris, "select" also reports "close" events from receive actions in addition to actual messages: if the producer closes the channel explicitly then "select" will return the corresponding receive action and the message will be "null". This happens in my test program too.

Is the producer indeed closing the selected receive channel in your case?
It is not.

(I suppose I should have mentioned that!)

Coming from Go, I was ready to check whether the channel was closed on read.  Consequently, I have been logging this when I blow up after getting back a null message and in these cases, nope, it wasn't closed.  For the previous two errors, the NPE and the lease problem of course threw a runtime exception, which appropriately killed the producer threads--when they shut down, I close the channel.  But here, there were no problems on the producer threads--no exceptions thrown, etc.

Chris Pennello

Apr 22, 2016, 11:57:25 AM4/22/16
to quasar-pulsar-user
On Friday, April 22, 2016 at 8:46:38 AM UTC-7, Chris Pennello wrote:
On Thursday, April 21, 2016 at 11:02:39 PM UTC-7, Fabio Tudone wrote:
Hi Chris, "select" also reports "close" events from receive actions in addition to actual messages: if the producer closes the channel explicitly then "select" will return the corresponding receive action and the message will be "null". This happens in my test program too.

Is the producer indeed closing the selected receive channel in your case?
Oops, I meant to say that it is.  But in the failure scenario I'm seeing, there were no problems on the producer threads, so the channels were not closed.  The consumer just "randomly" gets a null message from the select action from the select call!

To add a little more context: in the last 16 hours, with an average item rate across all 16 channels / producer threads of 1,400 per second, we have hit this condition 23 times.

Chris Pennello

Apr 22, 2016, 2:59:22 PM4/22/16
to quasar-pulsar-user
On Friday, April 22, 2016 at 8:57:25 AM UTC-7, Chris Pennello wrote:
To add a little more context: in the last 16 hours, with an average item rate across all 16 channels / producer threads of 1,400 per second, we have hit this condition 23 times.
I added a little more logging, and the select actions also believe they're done when they return a null message.

pron

Apr 25, 2016, 5:45:26 AM4/25/16
to quasar-pulsar-user
Hi. Can you try again with the latest snapshot? The idea is very simple: it now gives the kernel a full 10 seconds to schedule the thread holding the selector’s lease instead of a few tens of milliseconds. 

Chris Pennello

Apr 25, 2016, 1:15:09 PM4/25/16
to quasar-pulsar-user
On Monday, April 25, 2016 at 2:45:26 AM UTC-7, pron wrote:
Hi. Can you try again with the latest snapshot? The idea is very simple: it now gives the kernel a full 10 seconds to schedule the thread holding the selector’s lease instead of a few tens of milliseconds.
Hey Ron, thanks for the new build.  Unfortunately, it looks like the issue is still here.

In ~40 minutes of running the latest snapshot, we saw the issue occur three times.  In all instances, the channel associated with the select action was not closed.  In one instance, the action believed it was done, and in the other two, it believed it was not done.

Over the weekend (1.5~2 days process time, running the snapshot prior to this one), we saw the problem 40 times.  In all cases, the channel associated with the message was not closed.  In 9 of the cases, the action believed it was done, and in the other 8, it believed it was not done.

pron

Apr 25, 2016, 3:38:34 PM4/25/16
to quasar-pulsar-user


On Monday, April 25, 2016 at 8:15:09 PM UTC+3, Chris Pennello wrote:
Hey Ron, thanks for the new build.  Unfortunately, it looks like the issue is still here.

In ~40 minutes of running the latest snapshot, we saw the issue occur three times.  In all instances, the channel associated with the select action was not closed.  In one instance, the action believed it was done, and in the other two, it believed it was not done.

Are you sure you’re running the latest build (from today)?

Chris Pennello

Apr 25, 2016, 3:43:47 PM4/25/16
to quasar-pulsar-user
I made double-sure by blowing away co/paralleluniverse from my local Maven repo, re-installing, and re-building:

% ls -l ~/.m2/repository/co/paralleluniverse/quasar-core/0.7.5-SNAPSHOT
total 4928
-rw-r--r--  1 cpennello  513      242 Apr 25 09:36 _remote.repositories
-rw-r--r--  1 cpennello  513      367 Apr 25 09:36 maven-metadata-opentable.xml
-rw-r--r--  1 cpennello  513       40 Apr 25 09:36 maven-metadata-opentable.xml.sha1
-rw-r--r--  1 cpennello  513  1237309 Apr 25 09:36 quasar-core-0.7.5-20160425.093323-14-jdk8.jar
-rw-r--r--  1 cpennello  513       40 Apr 25 09:36 quasar-core-0.7.5-20160425.093323-14-jdk8.jar.sha1
-rw-r--r--  1 cpennello  513     4490 Apr 25 09:36 quasar-core-0.7.5-20160425.093323-14.pom
-rw-r--r--  1 cpennello  513       40 Apr 25 09:36 quasar-core-0.7.5-20160425.093323-14.pom.sha1
-rw-r--r--  1 cpennello  513  1237309 Apr 25 09:36 quasar-core-0.7.5-SNAPSHOT-jdk8.jar
-rw-r--r--  1 cpennello  513     4490 Apr 25 09:36 quasar-core-0.7.5-SNAPSHOT.pom
-rw-r--r--  1 cpennello  513      188 Apr 25 09:36 resolver-status.properties

Those are today's files, right?


:/

Chris Pennello

Apr 27, 2016, 11:37:15 AM4/27/16
to quasar-pulsar-user
In another run, we continue to see the null message from a non-null select action returned by a select call.  We saw 57 instances over the course of 1.5 days with an average item rate through all channels of approximately 1,400 per second.  In all cases, the channel was not closed.  In 53 of the cases, the select action believed it wasn't done, and in 4, it did.

What other diagnostic information can I gather to help?  Perhaps a dump of all of the stacks of all threads when we get one of these errant select actions?

pron

Apr 27, 2016, 11:52:36 AM4/27/16
to quasar-pulsar-user
On Wednesday, April 27, 2016 at 6:37:15 PM UTC+3, Chris Pennello wrote:
What other diagnostic information can I gather to help?  Perhaps a dump of all of the stacks of all threads when we get one of these errant select actions?

We may need to create a special debug build to gather more information. Let me think about it some more. In the meantime, are you absolutely sure that bounded channels can’t work for you? If so, I’m interested in the use case. After all, once the message has been consumed, the producer is free to work in parallel anyway, so I can’t imagine a scenario where this would be OK but a bounded channel (even of size 1) isn’t.

Ron

Chris Pennello

Apr 27, 2016, 12:01:23 PM4/27/16
to quasar-pulsar-user

Maybe I'm being too idealistic here, but buffering just isn't necessary, and introducing more asynchronicity into the system will increase the complexity.  That said, I'm happy to give a buffered channel of size 1 a try and report back if there are any problems.  But I really am enthusiastic about helping shake out the bugs in TransferChannel.  I want Quasar to have bullet-proof channel offerings for all the different use cases.

Chris Pennello

Apr 27, 2016, 2:48:15 PM4/27/16
to quasar-pulsar-user
So I tried a buffered channel of size 1 (and later size 2; see below).  In both cases, all I did was alter bufferSize in my call to Channels.newChannel, and in both cases my blocking/timeout behavior went away on my SendPort.send(Message, long, TimeUnit) calls.  I see over 400,000 send "timeouts" per second now, whereas with a TransferChannel it behaves as expected (no more than 1 per second per channel, since my timeout is 1 second).  I was a little incredulous, so I ran the flight recorder.  Indeed, it shows gazillions of send calls, one after the other with no discernible delay.  (See an excerpt below.)  The select calls still block and time out correctly.

I thought it might have to do with the BoxQueue, so I also tried a bufferSize of 2, but the same behavior resulted.

Am I just missing something basic here?  My Channel creation call is Channels.newChannel(1, Channels.OverflowPolicy.BLOCK, true, true), and my send call is channel.send(item, 1000, TimeUnit.MILLISECONDS).
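For context, the producer side is essentially a timed-send loop like this (a sketch; produceNext and sendTimeouts are placeholders):

while (running) {
    Item item = produceNext();
    // send with a 1-second timeout; returns false if the timeout elapses first
    while (!channel.send(item, 1000, TimeUnit.MILLISECONDS))
        sendTimeouts.increment();   // with a transfer channel this fires at most ~once per second per channel
}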

Here is an excerpt from the flight recorder dump when I was using a buffered channel of size 1.
[18:32:21.007 aggregator-reader-4 (18,988)]     QueueChannel.send0 Channel{sync: co.paralleluniverse.strands.OwnedSynchronizer@7d96fbc7, queue: co.paralleluniverse.strands.queues.BoxQueue@4c031999, capacity: 1} enqueing message com.opentable.abtest.aggregation.KafkaABEvent@4cf70749
[18:32:21.007 aggregator-reader-4 (18,989)]     QueueChannel.send0 Channel{sync: co.paralleluniverse.strands.OwnedSynchronizer@7d96fbc7, queue: co.paralleluniverse.strands.queues.BoxQueue@4c031999, capacity: 1} channel queue is full. policy: BLOCK
[18:32:21.007 aggregator-reader-4 (18,990)]     ConditionSynchronizer.unregister co.paralleluniverse.strands.SimpleConditionSynchronizer@5fbe6acd unregister Thread[aggregator-reader-4,5,main]
[18:32:21.007 aggregator-reader-4 (18,991)]     ConditionSynchronizer.register co.paralleluniverse.strands.SimpleConditionSynchronizer@5fbe6acd register Thread[aggregator-reader-4,5,main]
[18:32:21.007 aggregator-reader-4 (18,992)]     QueueChannel.send0 Channel{sync: co.paralleluniverse.strands.OwnedSynchronizer@7d96fbc7, queue: co.paralleluniverse.strands.queues.BoxQueue@4c031999, capacity: 1} enqueing message com.opentable.abtest.aggregation.KafkaABEvent@4cf70749
[18:32:21.007 aggregator-reader-4 (18,993)]     QueueChannel.send0 Channel{sync: co.paralleluniverse.strands.OwnedSynchronizer@7d96fbc7, queue: co.paralleluniverse.strands.queues.BoxQueue@4c031999, capacity: 1} channel queue is full. policy: BLOCK
[18:32:21.007 aggregator-reader-4 (18,994)]     ConditionSynchronizer.unregister co.paralleluniverse.strands.SimpleConditionSynchronizer@5fbe6acd unregister Thread[aggregator-reader-4,5,main]
[18:32:21.007 aggregator-reader-4 (18,995)]     ConditionSynchronizer.register co.paralleluniverse.strands.SimpleConditionSynchronizer@5fbe6acd register Thread[aggregator-reader-4,5,main]
[18:32:21.007 aggregator-reader-4 (18,996)]     QueueChannel.send0 Channel{sync: co.paralleluniverse.strands.OwnedSynchronizer@7d96fbc7, queue: co.paralleluniverse.strands.queues.BoxQueue@4c031999, capacity: 1} enqueing message com.opentable.abtest.aggregation.KafkaABEvent@4cf70749
[18:32:21.007 aggregator-reader-4 (18,997)]     QueueChannel.send0 Channel{sync: co.paralleluniverse.strands.OwnedSynchronizer@7d96fbc7, queue: co.paralleluniverse.strands.queues.BoxQueue@4c031999, capacity: 1} channel queue is full. policy: BLOCK
[18:32:21.007 aggregator-reader-4 (18,998)]     ConditionSynchronizer.unregister co.paralleluniverse.strands.SimpleConditionSynchronizer@5fbe6acd unregister Thread[aggregator-reader-4,5,main]
[18:32:21.007 aggregator-reader-4 (18,999)]     ConditionSynchronizer.register co.paralleluniverse.strands.SimpleConditionSynchronizer@5fbe6acd register Thread[aggregator-reader-4,5,main]
[18:32:21.007 aggregator-reader-4 (19,000)]     QueueChannel.send0 Channel{sync: co.paralleluniverse.strands.OwnedSynchronizer@7d96fbc7, queue: co.paralleluniverse.strands.queues.BoxQueue@4c031999, capacity: 1} enqueing message com.opentable.abtest.aggregation.KafkaABEvent@4cf70749
[18:32:21.007 aggregator-reader-4 (19,001)]     QueueChannel.send0 Channel{sync: co.paralleluniverse.strands.OwnedSynchronizer@7d96fbc7, queue: co.paralleluniverse.strands.queues.BoxQueue@4c031999, capacity: 1} channel queue is full. policy: BLOCK
[18:32:21.007 aggregator-reader-4 (19,002)]     ConditionSynchronizer.unregister co.paralleluniverse.strands.SimpleConditionSynchronizer@5fbe6acd unregister Thread[aggregator-reader-4,5,main]
[18:32:21.007 aggregator-reader-4 (19,003)]     ConditionSynchronizer.register co.paralleluniverse.strands.SimpleConditionSynchronizer@5fbe6acd register Thread[aggregator-reader-4,5,main]
[18:32:21.007 aggregator-reader-4 (19,004)]     QueueChannel.send0 Channel{sync: co.paralleluniverse.strands.OwnedSynchronizer@7d96fbc7, queue: co.paralleluniverse.strands.queues.BoxQueue@4c031999, capacity: 1} enqueing message com.opentable.abtest.aggregation.KafkaABEvent@4cf70749
[18:32:21.007 aggregator-reader-4 (19,005)]     QueueChannel.send0 Channel{sync: co.paralleluniverse.strands.OwnedSynchronizer@7d96fbc7, queue: co.paralleluniverse.strands.queues.BoxQueue@4c031999, capacity: 1} channel queue is full. policy: BLOCK
[18:32:21.007 aggregator-reader-4 (19,006)]     ConditionSynchronizer.unregister co.paralleluniverse.strands.SimpleConditionSynchronizer@5fbe6acd unregister Thread[aggregator-reader-4,5,main]
[18:32:21.007 aggregator-reader-4 (19,007)]     ConditionSynchronizer.register co.paralleluniverse.strands.SimpleConditionSynchronizer@5fbe6acd register Thread[aggregator-reader-4,5,main]
[18:32:21.007 aggregator-reader-4 (19,008)]     QueueChannel.send0 Channel{sync: co.paralleluniverse.strands.OwnedSynchronizer@7d96fbc7, queue: co.paralleluniverse.strands.queues.BoxQueue@4c031999, capacity: 1} enqueing message com.opentable.abtest.aggregation.KafkaABEvent@4cf70749
[18:32:21.007 aggregator-reader-4 (19,009)]     QueueChannel.send0 Channel{sync: co.paralleluniverse.strands.OwnedSynchronizer@7d96fbc7, queue: co.paralleluniverse.strands.queues.BoxQueue@4c031999, capacity: 1} channel queue is full. policy: BLOCK

pron

May 2, 2016, 10:53:33 AM5/2/16
to quasar-pulsar-user
Hi.
We've just released a new version (0.7.5) with a fix to a serious bug in SimpleConditionSynchronizer that manifested under some kinds of load. I'm not sure it's related to your issue, but it may very well be. Can you give it a try?

Ron

Chris Pennello

May 2, 2016, 12:24:07 PM5/2/16
to quasar-pulsar-user
On Monday, May 2, 2016 at 7:53:33 AM UTC-7, pron wrote:
We've just released a new version (0.7.5) with a fix to a serious bug in SimpleConditionSynchronizer that manifested under some kinds of load. I'm not sure it's related to your issue, but it may very well be. Can you give it a try?
Hey Ron, thanks for the update.  We gave it a try and almost immediately ran into the null-message bug.  I am happy about the new release, since it includes the other critical bug fixes.  We tried adding some logging around the null-message problem, but it wasn't terribly helpful.  For example, we dumped the state of all threads, but there wasn't anything suspicious.  Presumably, things are somehow getting into a bad state a little while prior to the select call.  In the meantime, I think we'll just ignore the null messages.  Can you think of any other helpful diagnostic information we can collect for you?

Fabio Tudone

May 5, 2016, 2:23:36 AM5/5/16
to quasar-pulsar-user
Hi Chris, I'm looking into it.  I have also updated the previous test project with Quasar 0.7.5, more (and real) logging about error conditions, a few fixes, as well as support for unlimited runs and message exchanges.

I tried running it locally with 128 producers for more than 12 hours with no wait times (so, maximum speed), but no luck: no matter how hard I try, I don't seem to be able to reproduce the "null" message issue.
It could be that I'm just being unlucky and missing it, though I wonder if there's something different in your code (that I'm not thinking of) that is triggering it instead. Could you re-check if my test mimics closely enough the way you use Quasar? If you could send (even privately) some code that triggers the issue (even a minimal program, actually all the better), that would help a lot.

BTW I think we've reached the point where a GitHub issue is more comfortable to work on this, could you open one?

Thanks,
-- Fabio

Fabio Tudone

May 16, 2016, 9:01:00 AM5/16/16
to quasar-pulsar-user
Hi Chris, have you had a chance to look further into this?

-- Fabio

Chris Pennello

May 16, 2016, 6:30:51 PM5/16/16
to quasar-pulsar-user
On Monday, May 16, 2016 at 6:01:00 AM UTC-7, Fabio Tudone wrote:
Hi Chris, have you had a chance to look further into this?
Hi Fabio!  I was out last week, so I'm just getting to look at this now.
 
On Thursday, May 5, 2016 at 9:23:36 AM UTC+3, Fabio Tudone wrote:
Could you re-check if my test mimics closely enough the way you use Quasar?
 Happily!  Included details in the GitHub issue.

If you could send (even privately) some code that triggers the issue (even a minimal program, actually all the better), that would help a lot.
For sure; let's look into that after I vet the test program you wrote.

BTW I think we've reached the point where a GitHub issue is more comfortable to work on this, could you open one?

Fabio Tudone

May 17, 2016, 5:02:45 AM5/17/16
to quasar-pulsar-user
Great, thanks, I'll have a look.