Bug in locals-clearing in Clojure 1.5?

600 views
Skip to first unread message

Jason Wolfe

unread,
Mar 5, 2013, 12:50:17 AM3/5/13
to Clojure Dev
We just started redeploying our services at Prismatic with 1.5, and
one of them promptly ran out of memory. I isolated the cause to this
reduced test case:

https://gist.github.com/w01fe/5088279

As you can see, 's' was cleared in 1.4 and is not cleared in 1.5. The
issue appears to be quite subtle, since removing any part of the test
case (namely the with-open, when, or try) eliminates the issue.

Is this a bug or intended behavior? If the latter, an explanation of
the cause would be very helpful, so we know what to look out for in
similar situations.

(In this case, a workaround seems to be to pull out the (try ...) into
a separate function).

Thanks for your help,
Jason

Alan Malloy

unread,
Mar 5, 2013, 3:18:41 AM3/5/13
to cloju...@googlegroups.com
I looked into this a bit, and it seems to check out. I don't really care for using the behavior of code depending on System/gc to claim breakages, since "do nothing" is a valid implementation of that method. So instead, I AOT-compiled this code once in clojure 1.4 and once in clojure 1.5, and inspected the generated bytecode, to see if locals were really being cleared differently.

My summary (details to follow) is that for whatever reason clojure 1.5 structures the body of the try differently, creating an extra lambda that doesn't exist in 1.4, and holds onto the s local longer. It's as if you had written this:

(defn leaker [n]
  (with-open [_ (reify java.io.Closeable (close [this]))]
    (let [s (map #(FinalizeReporter. %) (take 10 (iterate inc 0)))]
      (when (> n -1)
        (try
          ((fn body []
             (->> s
                  (map #(do (Thread/sleep 100) (System/gc) (println @%)))
                  dorun))))))))

And the 'body function saves a reference to s. I'm not qualified to say whether this is a problem with locals-clearing (shouldn't body release its local before calling map?) or something else (why is this lambda generated at all?), but it does look like a serious, but rare, regression for 1.5.

As to the details, here are the relevant snippets I found in the disassembler, annotated for explanation:

// clojure 1.4
   115: new     #137; // core$leaker$fn__25, the lambda #(do (Thread/sleep 100) (System/gc) (println @%))
   118: dup
   119: invokespecial   #138; // call its no-arg constructor
   122: aload_3 // this is the local s
   123: aconst_null 
   124: astore_3 // clear it
   125: invokeinterface #129,  3; // call constructed leaker$fn with s

No problem there, of course.

// clojure 1.5
   97:  new     #140; // core$leaker$fn__27, an "implicit?" lambda containing the whole 'try body, (fn [] (dorun (map #(do (Thread/sleep 100) (System/gc) (println @%)) s)))
   100: dup
   101: aload_3 // we pass s to that lambda's constructor, since it needs our s
   102: aconst_null
   103: astore_3 // still clearing our local, though
   104: invokespecial   #143; // aforesaid one-arg constructor, called with s
   107: checkcast       #126; //class clojure/lang/IFn
   110: invokeinterface #145,  1; // call it with no arguments

If the body of leaker$fn__27 cleared s, as ^:once functions do when called, this would all be fine, but the relevant part is:

   18: new #45; // core$leaker$fn__27$fn__28, the lambda #(do (Thread/sleep 100) (System/gc) (println @%)) as in 1.4
   21: dup
   22: invokespecial #46; // a no-arg constructor
   25: aload_0
   26: getfield #36; // field 's' from the "body" lambda-object
   29: invokeinterface #49,  3; // (map f s)
   34: invokeinterface #52,  2; // (dorun coll)

Note that there is no `aload_null; setfield #36` pair to match the clearing that happens for locals.

I wonder if this problem leaked in as part of the fix for handling of recur inside of try/catch/finally clauses, but the only recent commit in that area of the compiler is 9b80a552, which seems harmless and designed to fix exactly this problem. So I definitely don't have a proposed fix: I'm just reporting this here to confirm that this looks like a real problem, and so that hopefully someone whose time is more valuable than mine won't have to do as much mucking about with assembly internals to find the cause.

~Alan

Bronsa

unread,
Mar 5, 2013, 4:50:42 AM3/5/13
to cloju...@googlegroups.com

I'm not sure whether this matters at all, but I seem to recall that a lambda that gets immediately called is created whenever the context is EVAL, so probably the commit that caused this involves some changes in the contexts passed around.

--
You received this message because you are subscribed to the Google Groups "Clojure Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to clojure-dev...@googlegroups.com.
To post to this group, send email to cloju...@googlegroups.com.
Visit this group at http://groups.google.com/group/clojure-dev?hl=en.
For more options, visit https://groups.google.com/groups/opt_out.
 
 

Ghadi Shayban

unread,
Mar 6, 2013, 12:24:39 AM3/6/13
to cloju...@googlegroups.com
git bisect shows ddc65a96fdb11  "Allow loop/recur nested in catch and finally" as the culprit. Revert it and the problem disappears.
I'll try to look into a real fix later today.

... products of being jet-lagged in Beirut. 

Rich Hickey

unread,
Mar 6, 2013, 10:39:34 AM3/6/13
to cloju...@googlegroups.com

Sean Corfield

unread,
Mar 6, 2013, 4:04:31 PM3/6/13
to cloju...@googlegroups.com
Given the (potentially) critical nature of this issue, is the plan to
release 1.5.1 with (just) this fix in it?

I ask because we're considering upgrading to 1.5 in production and I'd
like to know whether to wait for a release with this fix or see how
1.5.0 actually does in production - since it's not clear to me how
many people would be affected by this leak...

Thanx,
Sean
Sean A Corfield -- (904) 302-SEAN
An Architect's View -- http://corfield.org/
World Singles, LLC. -- http://worldsingles.com/

"Perfection is the enemy of the good."
-- Gustave Flaubert, French realist novelist (1821-1880)

Softaddicts

unread,
Mar 6, 2013, 6:17:50 PM3/6/13
to cloju...@googlegroups.com
Count me in as a collateral here, I feel it's gonna be hard to run 24/7 with 1.5
if it leaks out :)

However I can't test it yet, we need to digest 1.4 literal changes along the way.
Hope to achieve the port by end of April.

Luc P.
Softaddicts<lprefo...@softaddicts.ca> sent by ibisMail from my ipad!

Jason Wolfe

unread,
Mar 6, 2013, 10:11:25 PM3/6/13
to cloju...@googlegroups.com
It looks like this eliminates our problem -- thanks all for the quick fix!

-Jason

Ghadi Shayban

unread,
Mar 6, 2013, 10:17:17 PM3/6/13
to cloju...@googlegroups.com
Fixes the issue for me too. I have to echo the sentiment that it is pretty finicky to intentionally reproduce, but Jason's motivating example is meat and potatoes I/O processing.

NB: Replacing the sequence binding in Jason's example with: (take 10 (map #(FinalizeReporter. %) (range))) will make the bug appear to present in 1.4 -- but it's not. Chunked seqs is muddying the waters, you have to take > 32.

Ghadi

Alex Baranosky

unread,
Mar 10, 2013, 6:52:54 AM3/10/13
to cloju...@googlegroups.com
Is this fix going to be officially released?  We're waiting for the fix at Runa, before we upgrade.

Stuart Halloway

unread,
Mar 10, 2013, 2:23:25 PM3/10/13
to cloju...@googlegroups.com
1.5.1 is flowing through the release pipes now...
Reply all
Reply to author
Forward
0 new messages