Smells like multiple bugs to me.
1. A too-large string literal should produce a specific error
message, rather than generating a misleading one that suggests a
different type of problem.
2. The limit should not be different from that on String objects in
general, namely 2147483647 characters, which nobody is likely to hit
unless they mistakenly call read-string on that 1080p Avatar blu-ray
rip .mkv they aren't legally supposed to possess.
3. Though both of the above bugs are in Oracle's Java implementation,
it would seem to be a bug in Clojure's compiler if it is trying to
make the entire source code of a namespace into a string *literal* in
dynamically-generated bytecode somewhere rather than a string
*object*. Sensible alternatives are a) get the string to whatever
consumes it by some other means than embedding it as a single
monolithic constant in bytecode, b) convert long strings into shorter
chunks and emit a static initializer into the bytecode to reassemble
them with concatenation into a single runtime-computed string constant
stored in another static field, and c) restructure whatever consumes
the string to consume a seq, java.util.List, or whatever of strings
instead and feed it digestible chunks (e.g. a separate string for each
defn or other top-level form, in order of appearance in the input file
-- surely nobody has *individual defns* exceeding 64KB).
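For concreteness, alternative (b) amounts to emitting something like
the following Java (the chunk contents here are tiny placeholders, and
the class shape is my sketch, not anything Clojure currently emits):

```java
public class ChunkedConstant {
    // Each chunk must stay below the 65535-byte CONSTANT_Utf8 limit;
    // the contents here are tiny placeholders for the real pieces.
    private static final String CHUNK_0 = "Hello, ";
    private static final String CHUNK_1 = "world!";

    // Reassembled once at class-initialization time into a single
    // runtime-computed constant stored in a static field, as (b)
    // describes. (A StringBuilder is used so javac cannot fold the
    // chunks back into one compile-time literal.)
    public static final String VALUE;
    static {
        StringBuilder sb = new StringBuilder();
        sb.append(CHUNK_0);
        sb.append(CHUNK_1);
        VALUE = sb.toString();
    }
}
```

The class file then contains only sub-limit literals, while every
runtime consumer still sees one ordinary String.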
--
Protege: What is this seething mass of parentheses?!
Master: Your father's Lisp REPL. This is the language of a true
hacker. Not as clumsy or random as C++; a language for a more
civilized age.
That's not what Patrick just said.
>> 2. The limit should not be different from that on String objects in
>> general, namely 2147483647 characters which nobody is likely to hit
>> unless they mistakenly call read-string on that 1080p Avatar blu-ray
>> rip .mkv they aren't legally supposed to possess.
>
> That's a limitation imposed by the Java class file format.
And therefore a bug in the Java class file format, which should allow
any size String that the runtime allows. Using 2 bytes instead of 4
bytes for the length field, as you claim they did, seems to be the
specific error. One would have thought that Java of all languages
would have learned from the Y2K debacle and near-miss with
cyber-armageddon, but limiting a field to 2 of something instead of 4
out of a misguided perception that space was at a premium was exactly
what caused that, too!
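Incidentally, the same two-byte length cap surfaces wherever the JVM's
modified-UTF-8 encoding is used; java.io.DataOutputStream.writeUTF,
for instance, refuses any string whose encoding won't fit in a u2. A
small demonstration (my own sketch):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.io.UncheckedIOException;
import java.util.Arrays;

public class U2Limit {
    // Returns true if writeUTF rejects a string of the given length,
    // i.e. its modified-UTF-8 encoding exceeds the 65535-byte u2 field.
    static boolean exceedsU2(int chars) {
        char[] buf = new char[chars];
        Arrays.fill(buf, 'x');   // ASCII: one encoded byte per char
        try {
            new DataOutputStream(new ByteArrayOutputStream())
                    .writeUTF(new String(buf));
            return false;
        } catch (UTFDataFormatException e) {
            return true;         // the length did not fit in two bytes
        } catch (IOException e) {
            throw new UncheckedIOException(e); // byte-array sink: can't happen
        }
    }
}
```

The class file's CONSTANT_Utf8 entries hit the exact same wall, which
is where the string-literal limit comes from.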
>> 3. Though both of the above bugs are in Oracle's Java implementation,
>
> By the above, 1. is a Clojure bug and 2. is not a bug at all.
Oh, 2 is a bug alright. By your definition, Y2K bugs in a piece of
software would also not be bugs. The users of such software would beg
to differ.
>> it would seem to be a bug in Clojure's compiler if it is trying to
>> make the entire source code of a namespace into a string *literal* in
>> dynamically-generated bytecode somewhere rather than a string
>> *object*.
>
> Actually it seems it's the IDE, rather than Clojure, that is
> evaluating a form containing such a big literal. Since Clojure has no
> interpreter, it needs to compile that form.
The same problem has been reported from multiple IDEs, so it seems to
be a problem with eval and/or load-file. The question is not why they
might be using String *objects* that exceed 64K, since they'll need to
use Strings as large as the file gets*. It's why they'd *generate
bytecode* containing String *literals* that large.
And it's not the IDEs that generate bytecode in this scenario; it's
clojure.lang.Compiler.java that does.
* There is a way to reduce the size requirements; crudely, line-seq
could be used to implement a lazy seq of top-level forms built by
consuming lines until delimiters are balanced and then emitting a new
form string, then evaluating these forms one by one. This works with
typical source files that have short individual top-level forms and
have at least one line break between any two such forms, and would allow
consuming multi-gig source files if anyone ever had need for such a
thing (I'd hope never to see it unless it was machine-generated for
some purpose). Less crudely, a reader for files could be implemented
that didn't just slurp the file and call read-string on it but instead
read from an IO stream and emitted a seq of top-level forms converted
already into reader-output data structures (but unevaluated). In fact,
read-string could then be implemented in terms of this and a
StringInputStream whose implementation is left as an exercise for the
reader but which ought to be nearly trivial.
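The crude delimiter-balancing splitter can itself be sketched in Java;
this toy version tracks only paren depth and ignores strings, comments,
and reader macros, so it illustrates the shape of the approach rather
than a real reader:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class FormSplitter {
    // Splits a character stream into top-level forms by tracking paren
    // depth only -- parens inside strings or comments would confuse it,
    // so this is an illustration, not a reader.
    static List<String> topLevelForms(Reader r) {
        List<String> forms = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        int depth = 0, c;
        try {
            while ((c = r.read()) != -1) {
                cur.append((char) c);
                if (c == '(') depth++;
                else if (c == ')' && --depth == 0) {
                    forms.add(cur.toString().trim()); // one complete form
                    cur.setLength(0);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return forms;
    }
}
```

Each emitted form stays small enough to read and evaluate on its own,
no matter how large the file is.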
>> Sensible alternatives are a) get the string to whatever
>> consumes it by some other means than embedding it as a single
>> monolithic constant in bytecode,
>
> This is what we currently do in ABCL (by storing literal objects in a
> thread-local variable and retrieving them later when the compiled code
> is loaded), but it only works for the runtime compiler, not the file
> compiler (in Clojure terms, it won't work with AOT compilation).
Yes, this is the same issue raised in connection with allowing
arbitrary objects in code in eval.
>> b) convert long strings into shorter
>> chunks and emit a static initializer into the bytecode to reassemble
>> them with concatenation into a single runtime-computed string constant
>> stored in another static field,
>
> This is what I'd like to have :)
Frankly it seems like a bit of a hack to me, though since it would be
used to work around a Y2K-style bug in Java it might be poetic justice
of a sort.
>> and c) restructure whatever consumes
>> the string to consume a seq, java.util.List, or whatever of strings
>> instead and feed it digestible chunks (e.g. a separate string for each
>> defn or other top-level form, in order of appearance in the input file
>> -- surely nobody has *individual defns* exceeding 64KB).
>
> The problem is not in the consumer, but in the form containing the
> string; to do what you're proposing, the reader, upon encountering a
> big enough string, would have to produce a seq/List/whatever instead,
> the compiler would need to be able to dump such an object to a class,
> and all Clojure code handling strings would have to be prepared to
> handle such an object, too. I think it's a little impractical.
I don't think so. The problem isn't with normal strings but only with
strings that get embedded as literals in code; and moreover, the
problem isn't even with those strings exceeding 64k but with whole
.clj files exceeding 64k. The implication is that load-file generates
a class that contains the entire contents of the sourcefile as a
string constant for some reason; so:
a) What does this class do with this string constant? What code consumes it?
b) Can that particular bit of code be rewritten to digest the same
information provided in smaller chunks?
> Regarding the size of individual defns, that's an orthogonal problem;
> anyway, the size of the _bytecode_ for methods is limited to 64KB (see
> <http://java.sun.com/docs/books/jvms/second_edition/html/
> ClassFile.doc.html#88659>) and, while pretty big, it's not impossible
> to reach it, especially when using complex macros to produce a lot of
> generated code.
Another problem for which we will probably need an eventual fix or
workaround. If bytecode can contain a JMP-like instruction it should
be possible to have the compiler split long generated methods and
chain the pieces together without much loss of runtime efficiency,
particularly if it does so at "natural" places -- existing
conditional branches and (loop ...) borders. (defn foo [x] (if x
(lotta-code-1) (lotta-code-2))), for example, can be trivially
converted to (defn foo [x] (if x (lotta-code-1) (jmp bar))) (defn bar
[] (lotta-code-2)) -- though if you had such a jump instruction I'd
have thought implementing real TCO would have been fairly easy, and
apparently it was not.
Failing such a jmp capability you'd have to just use (bar) in that
last example and suffer an additional method call overhead at the
break-point. Again, the obvious way to do it would be to recognize
common branching construct forms such as (if ...) and (cond ...) that
are larger than the threshold but have individual branches that are
not and turn some or all of the branches into their own under-the-hood
methods and calls to those methods.
> We used to generate such big methods in ABCL because
> at one point we tried to spell out in the bytecode all the class names
> corresponding to functions in a compiled file, in order to avoid
> reflection when loading the compiled functions. For files with many
> functions (> 1000 iirc) the generated code became too big. It turned
> out that this optimization had a negligible impact on performance, so
> we reverted it.
I wonder if Clojure is using a similar optimization and would benefit
from its reversion.
Well, this is odd!
> 1. I'm more or less satisfied: if I know I can always work around the
> problem by using shorter namespaces, I'm happy.
It creates a tension, though:
1. In another recent thread people have been arguing for relatively
large namespaces.
2. It isn't nice to have code modularization boundaries dictated as
much by bug workaround considerations as by architectural ones!
> 2. While the namespace size seems to be a factor, I'm not convinced
> that the problem is as linear as "big namespace = problem". I have
> other namespaces that are larger (in line count, character count, and
> definitions) that have been evaluating fine without a problem. This
> problem feels more random/subtle to me.
The fact, mentioned recently in that other thread, that clojure.core
is 200K and around 6Kloc and compiles fine also points in this
direction.
My guess would be that it's some function of namespace size in bytes,
number of vars, and possibly as you say number of referred vars -- so
maybe also imports and anything else that grows the namespace's symbol
table too.
> 4. There seems to be a discrepancy in behaviour depending on how the
> compilation is requested: project-wide command-line compilation seems
> to keep working even when Slime/Swank evaluation fails.
It looks like it affects load-file but not AOT compilation. Both
presumably use eval, and eventually Compiler.java, to get the job
done, so the problem is probably in load-file's implementation
specifically. The simplest hypothesis -- that it's granularity, i.e.
the compiler chokes on a single huge wodge of forms crammed down its
throat in one go but works fine if given the same forms one by one --
fails to fully explain the data: it predicts that AOT should fail on
the namespaces where load-file fails, yet AOT keeps working, while
evaluating the top-level forms one by one works as expected. This
suggests again that load-file is the problem rather than
Compiler.java, or possibly that AOT feeds forms to the compiler
differently from load-file.
> 5. Personally I don't have any problem with hard limits (e.g. keep
> your namespaces/whatever below X lines/definitions/whatever) even if
> they're aggressive- but I think it'd be preferable to have an error
> message to point out the limit when it gets hit (if that's indeed
> what's happening).
I *do* have a problem with such limits; including limits on function
size. Architecture should be up to the programmer, and even if it is a
*bad idea* for programmers to write huge namespaces or huge individual
functions, IMO that isn't a judgment call appropriate for the
*language compiler* to make. And let's not forget that we're a Lisp,
so we do meta-programming, and so we probably want to be able to
digest files, functions, and individual s-expressions that no sane
human would ever write but that may very well occur in
machine-generated inputs to the compiler.
Your extremely narrow definition precludes the notion that a
specification can itself be in error, and therefore excludes, among
others, the famous category of Y2K bugs from meeting your definition
of "bug". It also precludes *anything* being considered a bug in any
software that lacks a formal specification distinct from its
implementation (i.e., almost ALL software!), or in the reference
implementation of any software whose formal specification is a
reference implementation rather than a design document of some sort.
> Now, you might argue that the spec is badly designed, and I might
> agree, but it's not a "bug in Oracle's Java implementation" - any
> conforming Java implementation must have that (mis)feature.
Does the JLS specify this limitation, or does it just say a string
literal is a " followed by any number of non-" characters and
\-escaped "s followed by a " or words to that effect? Because if the
latter, a conforming implementation of the Java language (as in not
directly contradicting the JLS anywhere) could permit longer string
literals, even if Oracle's does not.
>> Oh, 2 is a bug alright. By your definition, Y2K bugs in a piece of
>> software would also not be bugs. The users of such software would beg
>> to differ.
>
> User perception and bugs are different things.
Not past a certain point they aren't. One user disagreeing with the
developers can be written off as wrong. A large percentage of users
(that know about an issue) disagreeing with the developers means it's
the developers that are wrong, unless you take the extreme step of
rejecting the premise that the developers' ultimate job is serving the
needs of the software's user base. And when the developers are wrong
about something and that wrongness is expressed in the code, that
constitutes a bug, though it may not be a deviation from the
"specification" (assuming in that instance there even is a formal
specification distinct from the implementation).
>> Frankly it seems like a bit of a hack to me, though since it would be
>> used to work around a Y2K-style bug in Java it might be poetic justice
>> of a sort.
>
> It is a sort of hack, yes. An alternative might be to store constants
> in a resource external to the class file, unencumbered with silly size
> limits.
Given the deployment architecture around Java, that doesn't even add
any deployment complexity and, depending on how it is implemented, may
even make i18n easier to accomplish in some cases by freeing the
developers from having to set up a bunch of ResourceBundle
infrastructure in each codebase that needs i18n. +1.
>> I don't think so. The problem isn't with normal strings but only with
>> strings that get embedded as literals in code; and moreover, the
>> problem isn't even with those strings exceeding 64k but with whole
>> .clj files exceeding 64k. The implication is that load-file generates
>> a class that contains the entire contents of the sourcefile as a
>> string constant for some reason; so:
>>
>> a) What does this class do with this string constant? What code consumes it?
>
> Hmm, I don't think it's like you say.
Patrick said:
Does the file you are evaluating have more than 65535
characters? As far as I can tell, that is the maximum
length of a String literal in Java
This clearly implies that the problem is the .clj file exceeding the
maximum length of a string literal, and thus implies more indirectly
that the entire source file is being embedded in some class file as a
string constant. Now you seem to be denying that this is occurring. If
so, your disagreement here is with Patrick, not me.
> Without knowing anything about
> Clojure's internals, it seems to me that the problem is more likely to
> be in a form like the one Patrick posted, (clojure.lang.Compiler/load
> (java.io.StringReader. "the-whole-file-as-a-string")), which is
> compiled in order to be evaluated in order to compile and load the
> file... it is that form, and not the file to be compiled, that
> generates the incorrect class file.
That's not disagreeing, that's agreeing. You're saying that as an
intermediate stage it's compiling a class with the source file as a
literal in order to compile the file, and suggesting a particular
answer to question a, which is that the class just invokes
Compiler.load() on the string constant.
>> b) Can that particular bit of code be rewritten to digest the same
>> information provided in smaller chunks?
And the answer to this then becomes a resounding Yes: don't have
load-file, any of its cousins, or any IDE load-file function cram the
whole thing into a string and build a form like your example above to
eval. Instead, just point Compiler.load() at a reader open on the
source file on disk if it's unchanged or, if the focused editor buffer
is dirty, pass it a StringReader or similar open on the editor's
buffer in memory. (Compiler.load(new
StringReader(editorJTextArea.getText())); or whatever.)
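The Reader-based loading path can be sketched in plain Java; the
consumer below is a stand-in that merely counts characters, not the
real Compiler.load, but the API shape is the point -- the source is
consumed incrementally, so nothing ever needs to become one giant
literal in generated bytecode:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class ReaderLoad {
    // Stand-in for a Compiler.load(Reader)-style entry point: it pulls
    // characters from the stream one at a time and never materializes
    // the whole source as a single String. Here it just counts them.
    static long consume(Reader src) {
        long n = 0;
        try {
            while (src.read() != -1) n++;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return n;
    }

    public static void main(String[] args) {
        // Saved, unmodified file: hand the loader a FileReader on disk.
        // Dirty editor buffer: wrap the in-memory text instead.
        Reader fromBuffer = new StringReader("(defn f [] 1)");
        System.out.println(consume(fromBuffer));
    }
}
```

Either a FileReader or a StringReader plugs into the same consumer,
which is exactly why the choice of source costs nothing.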
> If Patrick is right, and I think he is, then the compiler has to
> compile (java.io.StringReader. "the-whole-file-as-a-string") in a way
> that "the-whole-file-as-a-string" does not appear literally in the
> class file. It has either to somehow split the string, or load it from
> somewhere else.
Or, fixing the current crop of problems like this, such forms
shouldn't be generated as intermediate steps in loading by the
editor/IDE/load-file/whatever to begin with. See above.
On the other hand, it suggests that (foo "a very long ... string")
would still break, which is not desirable, so ultimately a mechanism
for breaking up large string constants under the hood in Compiler
seems indicated. We just may not need it to solve this specific
instance.
> In fact, no JMP-like instruction exists that can jump to a different
> method.
Seems like a troubling oversight. As I said it would come in very
handy for enabling TCO. It could even be implemented securely -- the
bytecode verifier could require the target to be the start of a method
visible to the method containing the JMP instruction, for instance,
i.e. the same requirements it currently places on the target of
invokevirtual. It would just be non-stack-consuming and otherwise
equivalent to "return otherMethod(args);" (after some suitable storage
of the arguments, where there are any). If return otherMethod(args);
is secure, then this limited, tail-optimization-enabling JMP could be
secured as well.
>> Failing such a jmp capability you'd have to just use (bar) in that
>> last example and suffer an additional method call overhead at the
>> break-point. Again, the obvious way to do it would be to recognize
>> common branching construct forms such as (if ...) and (cond ...) that
>> are larger than the threshold but have individual branches that are
>> not and turn some or all of the branches into their own under-the-hood
>> methods and calls to those methods.
>
> The positive thing is that the method call overhead disappears thanks
> to Hotspot for frequently called methods. The negative thing is that
> splicing bytecode is harder than it seems because of jumps and
> exception handlers that might be present.
Hence my suggestion that the compiler work at a higher level. The
post-macroexpansion sexp seems ideal, since it need only look for the
special forms (if ...), (let* ...), and (loop* ...) to find obvious
division points and the nature of functional code is such that very
long functions usually contain these. So it can follow an algorithm
like
1. Try compiling the whole function into a single .invoke method (the
current behavior). If that succeeds, we're done.
2. Catch the exception and try to break the function at an obvious
boundary near the midpoint, particularly on either side of a let* or
loop* roughly corresponding to the middle third, or at the start of
the "else" of an if where that "else" is about a third to half the
code.
3. If all else fails, just split at arbitrary points in e.g. a long
chain of expressions.
4. Generate a .invoke method and some auxiliary private methods that
are called by it and/or each other.
There's some complication; the sub-methods will need to receive
arguments corresponding to the locals in existence at the start of the
split-off piece of code that are used within that piece, for example.
But it doesn't strike me as infeasible.
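Put concretely, the transformation in step 4 amounts to something like
this at the Java level (names and bodies are hypothetical placeholders
for the oversized branch arms):

```java
public class SplitExample {
    // Before the split, invoke would be one method whose bytecode
    // exceeds the 64KB Code-attribute limit. After, the "else" arm
    // lives in a private auxiliary method and a call replaces the
    // inlined bytecode; the locals that arm uses are passed along as
    // arguments, as noted above.
    public static Object invoke(Object x) {
        if (x != null) {
            return "lotta-code-1"; // stands in for the oversized then-arm
        }
        return elseArm(x);         // break-point: one extra call's overhead
    }

    private static Object elseArm(Object x) {
        return "lotta-code-2";     // stands in for the oversized else-arm
    }
}
```

Behavior is unchanged; only the bytecode has been redistributed across
method boundaries.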