Intermittent build failures in tools.nrepl

Alex Miller

Nov 27, 2013, 8:04:34 AM
to clojur...@googlegroups.com
While watching the Clojure contrib builds, I've been seeing intermittent build failures for tools.nrepl. It feels like a timing issue: some percentage of the build matrix fails, but never consistently on any particular JDK or Clojure version.


Is this a known thing? Anyone know what's up?

Example failure:


Testing clojure.tools.nrepl-test
Exception in thread "nREPL-worker-0" java.lang.Error: java.net.SocketException: Socket closed
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1116)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
	at java.lang.Thread.run(Thread.java:636)
Caused by: java.net.SocketException: Socket closed
	at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:116)
	at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
	at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
	at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
	at clojure.tools.nrepl.transport$bencode$fn__498.invoke(transport.clj:103)
	at clojure.tools.nrepl.transport.FnTransport.send(transport.clj:28)
	at clojure.tools.nrepl.middleware.pr_values$pr_values$fn$reify__810.send(pr_values.clj:23)
	at clojure.tools.nrepl.middleware.interruptible_eval$evaluate$fn__826$fn__839.invoke(interruptible_eval.clj:75)
	at clojure.main$repl$fn__6410.invoke(main.clj:268)
	at clojure.main$repl.doInvoke(main.clj:266)
	at clojure.lang.RestFn.invoke(RestFn.java:1096)
	at clojure.tools.nrepl.middleware.interruptible_eval$evaluate$fn__826.invoke(interruptible_eval.clj:56)
	at clojure.lang.AFn.applyToHelper(AFn.java:159)
	at clojure.lang.AFn.applyTo(AFn.java:151)
	at clojure.core$apply.invoke(core.clj:601)
	at clojure.core$with_bindings_STAR_.doInvoke(core.clj:1771)
	at clojure.lang.RestFn.invoke(RestFn.java:425)
	at clojure.tools.nrepl.middleware.interruptible_eval$evaluate.invoke(interruptible_eval.clj:41)
	at clojure.tools.nrepl.middleware.interruptible_eval$interruptible_eval$fn__867$fn__869.invoke(interruptible_eval.clj:172)
	at clojure.core$comp$fn__4034.invoke(core.clj:2278)
	at clojure.tools.nrepl.middleware.interruptible_eval$run_next$fn__860.invoke(interruptible_eval.clj:139)
	at clojure.lang.AFn.run(AFn.java:24)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
	... 2 more

Chas Emerick

Nov 27, 2013, 12:46:15 PM
to clojur...@googlegroups.com
The SocketException isn't the failure; that's a ~spurious exception,
one that has been sitting around for a while, that occurs upon a sudden
client disconnect. I fixed that at some point, but it crops up again as
new tests are added that produce responses that the client side isn't
necessarily testing for:

http://dev.clojure.org/jira/browse/NREPL-10

The real failure is definitely a timing issue that has been a thorn in
our sides for a _long_ time, and something that no one has been able to
nail down conclusively:

[INFO] FAIL in (test-url-connect) (nrepl_test.clj:387)
[INFO] expected: (= [2] (response-values (response-seq conn 100)))
[INFO] actual: (not (= [2] nil))

AFAIK, three people have looked into it significantly (myself, Colin,
and another fellow whose name escapes me at the moment) with no
definitive result. There have never been any reports of this in the
wild (though timeouts in the wild are also much wider, and sometimes
MAX_VALUE).

It's definitely irritating, but not something I'm super-motivated to
sink more time into at the moment. Maybe it's just time to bump the
timeout past 100ms, though I hate to do something like that just to
accommodate e.g. build box traffic...

- Chas
> --
> You received this message because you are subscribed to the Google
> Groups "clojure-tools" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to clojure-tool...@googlegroups.com.
> For more options, visit https://groups.google.com/groups/opt_out.
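
The 100 ms window in the failing assertion can be illustrated with a small sketch. This is plain Java rather than nREPL's own Clojure code, and it is only an analogy under stated assumptions: the class name `TimeoutSketch`, the method `awaitResponse`, and the delays are hypothetical, and a `BlockingQueue` stands in for the transport. It shows how a timed wait yields `null` when the response arrives after the deadline, which is the shape of `(not (= [2] nil))` above.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

class TimeoutSketch {
    // Hypothetical helper: waits up to timeoutMs for a value that a worker
    // thread delivers after delayMs. Returns the value, or null if the wait
    // timed out before the worker delivered anything.
    static Integer awaitResponse(long delayMs, long timeoutMs) {
        BlockingQueue<Integer> responses = new LinkedBlockingQueue<>();
        Thread worker = new Thread(() -> {
            try {
                Thread.sleep(delayMs); // stands in for eval plus transport latency
                responses.put(2);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.setDaemon(true); // do not keep the JVM alive for a late worker
        worker.start();
        try {
            return responses.poll(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }

    public static void main(String[] args) {
        // A prompt worker comfortably beats a generous timeout.
        System.out.println("prompt worker: " + awaitResponse(0, 2000));
        // A worker delayed well past a 100 ms timeout produces no value in time.
        System.out.println("slow worker:   " + awaitResponse(2000, 100));
    }
}
```

On a loaded build box the "slow worker" case does not need a real bug: any scheduling or I/O delay longer than the timeout is enough to turn an otherwise correct response into `nil` on the client side.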

Alex Miller

Nov 27, 2013, 1:02:55 PM
to clojur...@googlegroups.com
Maybe it would be worth trapping for this condition and dumping lots of state when it happens if you think that might aid in finding the problem. 

Or maybe it would be worth changing the timeout just to see if that stabilizes things?
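
One way to "dump lots of state" on the JVM is to print every live thread's name, state, and stack when the timed wait comes back empty. The sketch below is plain Java, not part of the nREPL test suite; the class name `StateDump` and the `response == null` check are hypothetical placeholders for wherever the test detects the missing response. `Thread.getAllStackTraces()` is the standard JDK call.

```java
import java.util.Map;

class StateDump {
    // Renders the name, state, and stack frames of every live thread as text.
    static String dumpThreads() {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> e
                : Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            out.append('"').append(t.getName()).append('"')
               .append(" state=").append(t.getState()).append('\n');
            for (StackTraceElement frame : e.getValue()) {
                out.append("    at ").append(frame).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Object response = null; // hypothetical: pretend the timed wait returned nothing
        if (response == null) {
            System.err.println("No response before timeout; thread dump follows:");
            System.err.print(dumpThreads());
        }
    }
}
```

A dump like this would at least show whether the worker thread was still evaluating, blocked on a socket write, or already finished at the moment the client gave up.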


David Greenberg

Nov 27, 2013, 4:21:47 PM
to clojur...@googlegroups.com
We use a Leiningen plugin at work that breaks the parent JVM/child JVM
process relationship by inserting a couple of shells between them. If I
close the REPL-driving JVM, then the project JVM hangs around forever.
If I System/exit the child JVM, I see this exact error.

Now you've got a report from the wild!

I'd be happy to elaborate on how I reliably reproduce this error, and
help triage it through the tooling ecosystem.


Chas Emerick

Nov 28, 2013, 9:54:22 PM
to clojur...@googlegroups.com
Hi David,

Just to clarify: are you talking about the SocketException? The test
failure is something that is fairly specific to the nREPL test suite...

- Chas

David Greenberg

Nov 29, 2013, 12:03:28 AM
to clojur...@googlegroups.com
Yes, that SocketException is one that I see. I'm not triggering it
via the test suite; I actually see it in daily use with my modified
Leiningen/nREPL setup. If it's not easy for you to reproduce now, I
can probably help make it reproducible.


Chas Emerick

Nov 29, 2013, 12:07:58 AM
to clojur...@googlegroups.com
No, it's quite easy to reproduce; it's an exception that is thrown when
the remote side of a Java socket disconnects unexpectedly, and nREPL
allows it to percolate up. It would be nice if it could be translated
into something more immediately understandable (e.g. "The other side of
your nREPL connection went away"), but I think that's the limit of
what's possible.

- Chas
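
The translation idea in the message above could look roughly like the following. This is a plain-Java sketch, not nREPL's transport code: `FriendlyTransport`, `send`, and `closedStream` are hypothetical names, and the stub stream simply throws the same `SocketException("Socket closed")` seen in the stack trace so that no real network connection is needed.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.SocketException;

class FriendlyTransport {
    // Hypothetical wrapper: writes the payload and, if the socket has gone
    // away, rethrows with a more understandable message. The original
    // SocketException is kept as the cause.
    static void send(OutputStream out, byte[] payload) {
        try {
            out.write(payload);
            out.flush();
        } catch (SocketException e) {
            throw new UncheckedIOException(
                "The other side of your nREPL connection went away", e);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Stub standing in for the output stream of a socket that is already closed.
    static OutputStream closedStream() {
        return new OutputStream() {
            @Override
            public void write(int b) throws IOException {
                throw new SocketException("Socket closed");
            }
        };
    }

    public static void main(String[] args) {
        try {
            send(closedStream(), new byte[] {1});
        } catch (UncheckedIOException e) {
            System.err.println(e.getMessage()
                + " (cause: " + e.getCause().getMessage() + ")");
        }
    }
}
```

Keeping the original exception as the cause means the low-level detail is still available to anyone debugging, while the top-level message says what happened in user terms.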