Unfortunately, with a fairly simple setup, I'm running into the following NPE.
The setup is as follows. Quasar 0.7.4 with JDK 8. Instrumentation enabled at runtime with -javaagent. -Dco.paralleluniverse.fibers.verifyInstrumentation=true (although no "QUASAR WARNING" output). Everything is running on threads, not fibers (is instrumentation even necessary if I'm not running any fibers?).
Multiple producer threads, each with its own unbuffered/transfer channel, configured with overflow policy block and single producer and single consumer true. One consumer thread that periodically selects across a subset of the producers' channels based on various application-specific properties of the items being consumed.
Anything obvious that I'm doing wrong? Any other information I can provide?
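For readers who want to picture the shape of this setup, here is a rough sketch that uses only JDK classes. It is an assumption-laden stand-in, not the Quasar API and not the actual application code: java.util.concurrent.SynchronousQueue plays the role of an unbuffered channel, and a short timed poll over the queues crudely approximates a select (a real select waits on all channels at once). The class and method names are hypothetical, and the application-specific subset rule is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.TimeUnit;

// JDK-only stand-in: SynchronousQueue acts as an unbuffered (rendezvous)
// channel. This is NOT Quasar's TransferChannel or Selector.
final class RendezvousSetupSketch {

    /** Starts the producers and returns how many items the single consumer received. */
    static int run(int producers, int itemsEach) {
        List<SynchronousQueue<Integer>> channels = new ArrayList<>();
        for (int p = 0; p < producers; p++) {
            SynchronousQueue<Integer> ch = new SynchronousQueue<>();
            channels.add(ch);
            Thread producer = new Thread(() -> {
                try {
                    for (int i = 0; i < itemsEach; i++) {
                        ch.put(i); // blocks until the consumer takes the item
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            producer.setDaemon(true);
            producer.start();
        }

        int expected = producers * itemsEach;
        int received = 0;
        try {
            while (received < expected) {
                // Crude "select": poll each channel briefly in turn. A real
                // consumer would restrict this loop to the subset it cares about.
                for (SynchronousQueue<Integer> ch : channels) {
                    Integer item = ch.poll(1, TimeUnit.MILLISECONDS);
                    if (item != null) {
                        received++;
                    }
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return received;
    }

    public static void main(String[] args) {
        System.out.println("received " + run(4, 100) + " items");
    }
}
```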
I love that all my locks have gone away thanks to Quasar's channels, but this NPE is troubling; the code has to work in addition to being beautiful! : ]
> Everything is running on threads, not fibers (is instrumentation even necessary if I'm not running any fibers?).
At present, in specific cases, possibly not. In general, though, you want deployments that stay robust as things change, so you shouldn't assume that instrumentation (with the agent or otherwise) is necessary only if you're actually using fibers. That's for at least the following reasons (that I can think of now):
- Even if you're not using fibers yourself, Quasar could still use them behind the scenes, e.g. http://docs.paralleluniverse.co/quasar/javadoc/co/paralleluniverse/strands/channels/transfer/Pipeline.html (that is, if you don't explicitly pass the constructor a strand factory that creates threads rather than fibers).
- Some Quasar code assumes (or might assume in the future) that your code is ready for interoperability with fibers (which means, or may mean in the future, that it has passed through Quasar's instrumentor) even if in a specific runtime situation it's not actually interoperating with fibers at all.
- At some point you might want to hot-upgrade your running code (e.g. actors hot-deployment or class redefines) and the new one works with fibers.
If what bothers you is single-artifact deployment in presence of agents (Java or otherwise), look no further than Capsule: http://www.capsule.io/.
A different matter is instead the runtime (classpath) dependency on Quasar: if your code is using e.g. only Quasar annotations (say, a library made of a set of computation methods that could block fibers only in some projects) then you can package it without declaring a Quasar dependency: annotations not found on the classpath should be ignored by the JVM.
Here's a small project with a similar program ...
I've just been able to reproduce the same exception and stack trace with the above program, Quasar 0.7.4 and a higher number of producers and exchanges (new values pushed), but not with 0.7.5-SNAPSHOT (new version pushed too).
Let us know if the latest 0.7.5-SNAPSHOT solves the issue for you too.
On Tuesday, April 19, 2016 at 1:51:38 PM UTC+3, pron wrote:
> I've just uploaded a snapshot version to sonatype (0.7.5-SNAPSHOT) that hopefully contains a fix. This bug is tracked here.
Nice; thanks Ron. I must admit, however, I'm a bit surprised. How has nobody else run into this before? What is the customer base like for Quasar?
One thing we are interested in is doing the instrumentation at compile time instead of at runtime. One reason is to have a less stateful runtime. We're using Maven and, unless you have compelling ideas to the contrary, are planning on building a Maven plugin to do compile-time Quasar instrumentation and contributing that back. Thoughts?
Point taken. The command-line control is another reason in our case. It's a bit of a pain to adjust the command line and elegantly package the JAR up with our builds.
On Tuesday, April 19, 2016 at 8:11:14 PM UTC+3, Chris Pennello wrote:
> Point taken. The command-line control is another reason in our case. It's a bit of a pain to adjust the command line and elegantly package the JAR up with our builds.
That is not a problem at all with Capsule. Don't get me wrong -- AOT instrumentation works well (in fact, Quasar itself is AOT-instrumented) -- but runtime instrumentation is just simpler and cleaner, and come Java 9, it will let us get rid of all manual suspendability annotations (although Quasar will still support AOT instrumentation).
So my suggestion is: try to use the agent and move to AOT only if it proves to be a real problem in practice (in which case we'd love to know why).
Whatever you choose to do, a Maven plugin would be welcome.
BTW, you should know that Quasar's instrumentation is quite minimal: no new classes, fields/methods or method arguments are added.
> come Java 9, it will let us get rid of all manual suspendability annotations (although Quasar will still support AOT instrumentation).
Ah, that's interesting. Do you have a quick rundown you can point to on why that's the case with Java 9? If not, would you mind briefly elaborating?
Switching back to the more immediate, I have a fun new exception that we ran into. : [
java.lang.RuntimeException: Unable to obtain selector lease: LEASED
    at co.paralleluniverse.strands.channels.Selector.lease(Selector.java:463)
    at co.paralleluniverse.strands.channels.SelectActionImpl.lease(SelectActionImpl.java:77)
    at co.paralleluniverse.strands.channels.TransferChannel$Node.lease(TransferChannel.java:388)
    at co.paralleluniverse.strands.channels.TransferChannel.tryMatch(TransferChannel.java:614)
    at co.paralleluniverse.strands.channels.TransferChannel.xfer1(TransferChannel.java:504)
    at co.paralleluniverse.strands.channels.TransferChannel.send(TransferChannel.java:87)
    ... my application code that calls send ...
Same setup as before, running the 0.7.5-SNAPSHOT you shared. FWIW, it happens about as quickly as the NPE was happening before (a few minutes after startup and plowing through ~50,000 items). This one, however, doesn't happen every time. What other information, if any, can I provide to help diagnose?
In short, the new "StackWalker" API will allow us to access the JVM operand stack and locals for every method frame in the stack, so we'll lazily (i.e. just before suspension) fix up the fiber stack as well as the code (through redefinition) for all method frames that lack instrumentation.
It's WIP and the API (and VM support for it) is not finalized yet (we're in touch with the Oracle guys), but if you're curious you can have a look here.
With which JVM and settings is this happening?
openjdk version "1.8.0_72-internal"
OpenJDK Runtime Environment (build 1.8.0_72-internal-b15)
OpenJDK 64-Bit Server VM (build 25.72-b15, mixed mode)
Relevant JVM arguments:
-Djava.net.preferIPv4Stack=true
-XX:+AlwaysPreTouch
-XX:MaxMetaspaceSize=64m
-XX:CompressedClassSpaceSize=32m
-XX:ReservedCodeCacheSize=64m
-XX:HeapDumpPath=/mnt/mesos/sandbox
-XX:-OmitStackTraceInFastThrow
-Duser.timezone=UTC
-Dfile.encoding=UTF-8
-XX:ParallelGCThreads=4
-XX:+UseConcMarkSweepGC
-XX:+DisableExplicitGC
-XX:-PrintGC
-XX:+PrintGCDateStamps
-XX:+PrintGCDetails
-XX:+PrintGCApplicationStoppedTime
-Xmx2200m
-Xms2200m
Can you try with the new snapshot?
Regardless of whether this fixes the issue or not, I'd like to say that TransferChannel is by far the most complicated class in all of Quasar, and also happens to be among the least used ones, so it is less mature than the rest of the codebase. Is there a reason why you'd want to use a transfer channel over a bounded channel?
Superficially, though, given any buffered channel, the behavior will be the same once the buffer is filled, right? Both the producing and consuming strand will be required to synchronously rendezvous to move forward.
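To make the producer-side half of that comparison concrete, here is a small JDK-only sketch. It is an illustration under assumptions, not Quasar code: ArrayBlockingQueue stands in for a bounded channel and SynchronousQueue for a rendezvous channel, and the helper name is hypothetical. With no consumer present, a bounded queue accepts items until its buffer is full and then refuses them, while a rendezvous queue refuses from the first item.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.SynchronousQueue;

// JDK stand-ins only: ArrayBlockingQueue ~ bounded channel,
// SynchronousQueue ~ unbuffered/transfer channel.
final class BufferVsRendezvousSketch {

    /** Counts how many non-blocking offers succeed while no consumer is waiting. */
    static int acceptedWithoutConsumer(BlockingQueue<Integer> queue, int attempts) {
        int accepted = 0;
        for (int i = 0; i < attempts; i++) {
            // offer() returns false where a blocking put() would have had to wait
            if (queue.offer(i)) {
                accepted++;
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        // Buffer of 2: the first two offers succeed, the rest would block.
        System.out.println(acceptedWithoutConsumer(new ArrayBlockingQueue<>(2), 5)); // prints 2
        // Rendezvous: every offer needs a consumer that is already waiting.
        System.out.println(acceptedWithoutConsumer(new SynchronousQueue<>(), 5));    // prints 0
    }
}
```

Note that this only shows the producer's view. On the consumer side the two differ: a consumer of a full bounded queue can take an item immediately without any producer being present.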
Do you have insight into why most of your customers are using buffered channels instead of "pure" rendezvous channels?
I again note the default in Go: channels are unbuffered unless you explicitly specify a size. I wonder why the discrepancy? Perhaps lack of familiarity with CSP-style programming?
Hi Chris, "select" also reports "close" events from receive actions in addition to actual messages: if the producer closes the channel explicitly, then "select" will return the corresponding receive action and the message will be "null". This happens in my test program too.
Is the producer indeed closing the selected receive channel in your case?
On Thursday, April 21, 2016 at 11:02:39 PM UTC-7, Fabio Tudone wrote:
> Hi Chris, "select" also reports "close" events from receive actions in addition to actual messages: if the producer closes the channel explicitly then "select" will return the corresponding receive action and the message will be "null". This happens in my test program too. Is the producer indeed closing the selected receive channel in your case?
To add a little more context: in the last 16 hours, with an average item rate across all 16 channels / producer threads of 1,400 per second, we have suffered this circumstance 23 times.
Hi. Can you try again with the latest snapshot? The idea is very simple: it now gives the kernel a full 10 seconds to schedule the thread holding the selector’s lease instead of a few tens of milliseconds.
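For anyone following along, the general idea of "wait up to a deadline for whoever holds the lease to release it" can be sketched as below. This is a hypothetical illustration only: it is not Quasar's Selector code, and the class, field and method names are made up. It just shows a CAS-based lease whose acquisition spins (yielding the CPU) until a configurable timeout expires, so the timeout controls how long a descheduled lease holder is tolerated.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of a lease with a bounded wait; NOT Quasar's implementation.
final class LeaseSketch {
    // null means the lease is free; otherwise it holds the owning thread
    private final AtomicReference<Thread> owner = new AtomicReference<>();

    /** Tries to take the lease, spinning until it is free or the timeout elapses. */
    boolean tryLease(long timeout, TimeUnit unit) {
        final long deadline = System.nanoTime() + unit.toNanos(timeout);
        final Thread me = Thread.currentThread();
        while (!owner.compareAndSet(null, me)) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // the holder did not release in time
            }
            Thread.yield(); // give the holder a chance to be scheduled
        }
        return true;
    }

    /** Releases the lease if the calling thread holds it. */
    void release() {
        owner.compareAndSet(Thread.currentThread(), null);
    }

    public static void main(String[] args) {
        LeaseSketch lease = new LeaseSketch();
        System.out.println(lease.tryLease(10, TimeUnit.MILLISECONDS)); // true: lease was free
        System.out.println(lease.tryLease(10, TimeUnit.MILLISECONDS)); // false: already held, not reentrant
        lease.release();
        System.out.println(lease.tryLease(10, TimeUnit.MILLISECONDS)); // true: free again
    }
}
```

A longer timeout only helps if the holder is merely slow to be scheduled; if the lease is never released, the caller still fails, just later.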
Hey Ron, thanks for the new build. Unfortunately, it looks like the issue is still here.
In ~40 minutes of running the latest snapshot, we saw the issue occur three times. In all instances, the channel associated with the select action was not closed. In one instance, the action believed it was done, and in the other two, it believed it was not done.
% ls -l ~/.m2/repository/co/paralleluniverse/quasar-core/0.7.5-SNAPSHOT
total 4928
-rw-r--r-- 1 cpennello 513 242 Apr 25 09:36 _remote.repositories
-rw-r--r-- 1 cpennello 513 367 Apr 25 09:36 maven-metadata-opentable.xml
-rw-r--r-- 1 cpennello 513 40 Apr 25 09:36 maven-metadata-opentable.xml.sha1
-rw-r--r-- 1 cpennello 513 1237309 Apr 25 09:36 quasar-core-0.7.5-20160425.093323-14-jdk8.jar
-rw-r--r-- 1 cpennello 513 40 Apr 25 09:36 quasar-core-0.7.5-20160425.093323-14-jdk8.jar.sha1
-rw-r--r-- 1 cpennello 513 4490 Apr 25 09:36 quasar-core-0.7.5-20160425.093323-14.pom
-rw-r--r-- 1 cpennello 513 40 Apr 25 09:36 quasar-core-0.7.5-20160425.093323-14.pom.sha1
-rw-r--r-- 1 cpennello 513 1237309 Apr 25 09:36 quasar-core-0.7.5-SNAPSHOT-jdk8.jar
-rw-r--r-- 1 cpennello 513 4490 Apr 25 09:36 quasar-core-0.7.5-SNAPSHOT.pom
-rw-r--r-- 1 cpennello 513 188 Apr 25 09:36 resolver-status.properties
Those are today's files, right?
:/
What other diagnostic information can I gather to help? Perhaps a dump of all of the stacks of all threads when we get one of these errant select actions?
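As a possible way to capture such a dump from inside the process, here is a minimal sketch using the standard Thread.getAllStackTraces(). The class and method names are hypothetical, and where to call it (for example, at the point where an errant select action is detected) is an assumption about the application.

```java
import java.util.Map;

// Minimal in-process thread dump using only the JDK.
final class ThreadDumpSketch {

    /** Returns the name, state and stack frames of every live thread as text. */
    static String dumpAllStacks() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
            Thread thread = entry.getKey();
            sb.append('"').append(thread.getName()).append('"')
              .append(" state=").append(thread.getState()).append('\n');
            for (StackTraceElement frame : entry.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.err.print(dumpAllStacks());
    }
}
```

The stacks are sampled one thread at a time rather than at a single global instant, so the result is a best-effort picture. An external jstack <pid> is an alternative when the process can be reached from outside.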
We've just released a new version (0.7.5) with a fix to a serious bug in SimpleConditionSynchronizer that manifested under some kinds of load. I'm not sure it's related to your issue, but it may very well be. Can you give it a try?
Hi Chris, have you had a chance to look further into this?
On Thursday, May 5, 2016 at 9:23:36 AM UTC+3, Fabio Tudone wrote:
> Could you re-check if my test mimics closely enough the way you use Quasar?
If you could send (even privately) some code that triggers the issue (a minimal program would be even better), it would help a lot.
BTW, I think we've reached the point where a GitHub issue would be a more comfortable place to work on this. Could you open one?