reify field order may cause intermittent problems

67 views
Skip to first unread message

Chouser

unread,
Apr 18, 2012, 7:54:35 PM4/18/12
to cloju...@googlegroups.com
On very rare occasions we are seeing an exception in code like this:

(defn make-thing []
 (let [x (CountDownLatch. 1)
       y (atom 2)
       z (fn something-else [] …)]
   (reify clojure.lang.IDeref
     (deref [_] (.await x))
     …also closes over y and z…)))

The exception indicates that when deref is called, x is an atom not a CountDownLatch:

   clojure.lang.Atom cannot be cast to java.util.concurrent.CountDownLatch

Using reflection to examine the reify object returned by make-thing, we found that the reify’s x field indeed contained the atom, and its y field contained the CountDownLatch.  That is, the field values were swapped compared to what they were supposed to be.

When we’re seeing this problem, every call to make-thing returns an object with swapped field. Only by reloading the source code will the problem go away, after which we are again unlikely to see the problem for days or weeks. This seems to indicate that the code generated by the compiler (or perhaps as optimized by the JVM) is incorrect.

BTW, our real make-thing is mostly a copy of clojure.core/future, with z and some methods added.

We noted some details of the Clojure compiler that may or may not be relevant:
  • ObjExpr initializes closes as:

IPersistentMap closes = PersistentHashMap.EMPTY

  • clojure Compiler.java  NewInstanceExpr()->build() has a comment about field order for deftype and defrecord (it’s called closesvec but is actually from the declared fields, not closed-over locals):

//use array map to preserve ctor order

ret.closes = new PersistentArrayMap(closesvec);

  • Because reify doesn’t use the above, closed-over bindings are assoc’ed into a hash map, not an array map, and so their order depends on the hash of the keys, not the order in which the assoc’s happened.
  • The keys of closes are LocalBinding objects which do not provide a hashCode method, and so do not have a stable order from one compilation to the next.

We have not yet thought of a specific scenario that could cause the behavior we’re seeing, which is why we don’t know if the compiler facts above are relevant. We do use a mixture of uberjars, AOT compiling, and live reloading in swank. We think all the times we’ve seen this was shortly after a clojure-jack-in (that is, a new JVM and REPL) with no modifications to the file where the Exception originates.

Is it possible that the emission of the reify ctor invocation and that of the reify ctor itself were generated from different runs of the compiler? What if a load of the .clj happened to use the same reify class name as a reify .class file on disk with different field orderings -- could a make-thing from one call the reify ctor of the other? Does anyone have any suggestions on how to pursue this further?

--The LonoCloud Team

Kevin Downey

unread,
Apr 18, 2012, 10:36:29 PM4/18/12
to cloju...@googlegroups.com
> --
> You received this message because you are subscribed to the Google Groups
> "Clojure Dev" group.
> To post to this group, send email to cloju...@googlegroups.com.
> To unsubscribe from this group, send email to
> clojure-dev...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/clojure-dev?hl=en.

what version of clojure?

--
And what is good, Phaedrus,
And what is not good—
Need we ask anyone to tell us these things?

Chouser

unread,
Apr 19, 2012, 12:33:54 AM4/19/12
to cloju...@googlegroups.com
On Wed, Apr 18, 2012 at 8:19 PM, Alex Miller <al...@puredanger.com> wrote:
>
> Any possibility of a mix of JDKs such that JDK hash codes are different?  I can't imagine where that would actually matter but since this has bitten me a dozen or so times over the years I thought I would mention it.

Nope, always Java 1.7.0_147-icedtea, and we've been seeing it on Clojure 1.4

> Misc debugging ideas:
>
> You could wrap a try-catch around the error point to trap the ClassCast and dump the System/identityHashCode of every parent ClassLoader.

Interesting. I'm not sure how I'd use that information.

> You could also hack the compiler or the DynamicClassLoader to dump the bytes of this class when it's generated so you could go back and disassemble it.  Attached is an old diff (prob Clojure 1.2) that I wrote to do this.

Ah, now this I can understand.  We disassembled the .class files on
disk, but they looked ok.  If we also always dumped the dynamically
loaded classes somewhere, we might be able to spot the problem in the
bytecode when it occurs.  Thanks for the tip!

--Chouser

Chouser

unread,
May 11, 2012, 4:03:10 PM5/11/12
to cloju...@googlegroups.com
Updating .class files in the classpath of a running Clojure JVM is always dangerous, either because of nondeterminism in the Clojure compiler, or because changes to a .clj can require changes in dependant pairs of .class files where one member of the pair has already been loaded.

There may be more than this, but we found 3 areas of non-determinism:
  • The order of fields for closures (fns and reify) can change.
  • The order of methods generated by the proxy macro.
  • The order of entries in constant map where the keys are hashed by identity rather than value (for example using regex literals as keys).

Of these, we only observed negative consequences for the closure fields. This is how to reproduce the original problem we were seeing:
  1. Create a .clj file like:

(defn foo []

 (let [a 1, b 2]

   ((fn bar []

      [a b]))))

  1. AOT compile it, start a repl, and load the namespace via require or load.
    • Clojure checks the timestamps of the .clj and namespace .class, and loads the class files for the namespace and foo.
  2. Without leaving that repl, recompile the very same .clj file without changing its content or timestamp. You may have to manually delete the .class files first to actually generate new .class files.
    • The new class files for foo and bar may have a different order for the closed-over fields a and b, but since the class for foo has already been loaded, its field order on disk doesn’t matter.
  3. Now in the original repl, run (foo). Some percentage of the time the repl will print [2 1] instead of [1 2]
    • Only now is the class for bar loaded, so if step 3 used a different field order than step 2, the repl will print [2 1]

    In our original message, we were getting an exception because the types of the field values change, but in this example you can see that some scenarios could cause data corruption that might be extremely difficult to track down. The combination of code causing the problem doesn’t exist on disk anywhere, as .clj or .class files, but only in the memory of the running JVM that’s demonstrating the problem.

    I think the only sure way to avoid this problem is to never update .class files in the classpath of a running JVM.

    I’m sure there are improvements that could be made to the compiler or the way class files are loaded that would help, but many of the trivial improvements we initially thought were promising turned out to still have problems. For example, although having a deterministic compiler may have other benefits, it wouldn’t help if in step 3 above you were to change the names of locals a and b in a way that would cause their compiled order to be swapped. In that case the old foo would have been loaded already and would try to call the new bar, resulting in the same breakage. If *only* the .clj file were updated (not the .class files), then new foo and bar classes would be dynamically compiled and loaded, and so would work properly. So again, updating the .class files under a running JVM is problematic.

    --The LonoCloud Team

    Stuart Halloway

    unread,
    May 11, 2012, 4:18:49 PM5/11/12
    to cloju...@googlegroups.com
    Zooming in on a single point:

    I think the only sure way to avoid this problem is to never update .class files in the classpath of a running JVM.

    Is anything about this observation specific to Clojure?

    Stu

    Chouser

    unread,
    May 11, 2012, 4:23:31 PM5/11/12
    to cloju...@googlegroups.com
    Clojure supports code reloading and AOT compilation to .class files, but trying to use both features together can run afoul of my statement.

    --Chouser

    Stuart Halloway

    unread,
    May 11, 2012, 6:06:29 PM5/11/12
    to cloju...@googlegroups.com
    And in my experience tools pull them together on your behalf. I would like to have a way to insist on one or the other. 

    Stu
    Reply all
    Reply to author
    Forward
    0 new messages