Smells like multiple bugs to me.
1. A too-large string literal should produce a specific error
message, rather than generating a misleading one that suggests a
different type of problem.
2. The limit should not be different from that on String objects in
general, namely 2147483647 characters, which nobody is likely to hit
unless they mistakenly call read-string on that 1080p Avatar blu-ray
rip .mkv they aren't legally supposed to possess.
3. Though both of the above bugs are in Oracle's Java implementation,
it would seem to be a bug in Clojure's compiler if it is trying to
make the entire source code of a namespace into a string *literal* in
dynamically-generated bytecode somewhere rather than a string
*object*. Sensible alternatives are a) get the string to whatever
consumes it by some other means than embedding it as a single
monolithic constant in bytecode, b) convert long strings into shorter
chunks and emit a static initializer into the bytecode to reassemble
them with concatenation into a single runtime-computed string constant
stored in another static field, and c) restructure whatever consumes
the string to consume a seq, java.util.List, or whatever of strings
instead and feed it digestible chunks (e.g. a separate string for each
defn or other top-level form, in order of appearance in the input file
-- surely nobody has *individual defns* exceeding 64KB).
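For concreteness, alternative (b) amounts to emitting something like
the following Java (the chunk contents here are tiny placeholders, and
the class shape is my sketch, not anything Clojure currently emits):

```java
public class ChunkedConstant {
    // Each chunk must stay below the 65535-byte CONSTANT_Utf8 limit;
    // the contents here are tiny placeholders for the real pieces.
    private static final String CHUNK_0 = "Hello, ";
    private static final String CHUNK_1 = "world!";

    // Reassembled once at class-initialization time into a single
    // runtime-computed constant stored in a static field, as (b)
    // describes. (A StringBuilder is used so javac cannot fold the
    // chunks back into one compile-time literal.)
    public static final String VALUE;
    static {
        StringBuilder sb = new StringBuilder();
        sb.append(CHUNK_0);
        sb.append(CHUNK_1);
        VALUE = sb.toString();
    }
}
```

The class file then contains only sub-limit literals, while every
runtime consumer still sees one ordinary String.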
--
Protege: What is this seething mass of parentheses?!
Master: Your father's Lisp REPL. This is the language of a true
hacker. Not as clumsy or random as C++; a language for a more
civilized age.
That's not what Patrick just said.
>> 2. The limit should not be different from that on String objects in
>> general, namely 2147483647 characters which nobody is likely to hit
>> unless they mistakenly call read-string on that 1080p Avatar blu-ray
>> rip .mkv they aren't legally supposed to possess.
>
> That's a limitation imposed by the Java class file format.
And therefore a bug in the Java class file format, which should allow
any size String that the runtime allows. Using 2 bytes instead of 4
bytes for the length field, as you claim they did, seems to be the
specific error. One would have thought that Java of all languages
would have learned from the Y2K debacle and near-miss with
cyber-armageddon, but limiting a field to 2 of something instead of 4
out of a misguided perception that space was at a premium was exactly
what caused that, too!
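Incidentally, the same two-byte length cap surfaces wherever the JVM's
modified-UTF-8 encoding is used; java.io.DataOutputStream.writeUTF,
for instance, refuses any string whose encoding won't fit in a u2. A
small demonstration (my own sketch):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.io.UncheckedIOException;
import java.util.Arrays;

public class U2Limit {
    // Returns true if writeUTF rejects a string of the given length,
    // i.e. its modified-UTF-8 encoding exceeds the 65535-byte u2 field.
    static boolean exceedsU2(int chars) {
        char[] buf = new char[chars];
        Arrays.fill(buf, 'x');   // ASCII: one encoded byte per char
        try {
            new DataOutputStream(new ByteArrayOutputStream())
                    .writeUTF(new String(buf));
            return false;
        } catch (UTFDataFormatException e) {
            return true;         // the length did not fit in two bytes
        } catch (IOException e) {
            throw new UncheckedIOException(e); // byte-array sink: can't happen
        }
    }
}
```

The class file's CONSTANT_Utf8 entries hit the exact same wall, which
is where the string-literal limit comes from.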
>> 3. Though both of the above bugs are in Oracle's Java implementation,
>
> By the above, 1. is a Clojure bug and 2. is not a bug at all.
Oh, 2 is a bug alright. By your definition, Y2K bugs in a piece of
software would also not be bugs. The users of such software would beg
to differ.
>> it would seem to be a bug in Clojure's compiler if it is trying to
>> make the entire source code of a namespace into a string *literal* in
>> dynamically-generated bytecode somewhere rather than a string
>> *object*.
>
> Actually it seems it's the IDE, rather than Clojure, that is
> evaluating a form containing such a big literal. Since Clojure has no
> interpreter, it needs to compile that form.
The same problem has been reported from multiple IDEs, so it seems to
be a problem with eval and/or load-file. The question is not why they
might be using String *objects* that exceed 64K, since they'll need to
use Strings as large as the file gets*. It's why they'd *generate
bytecode* containing String *literals* that large.
And it's not the IDEs that generate bytecode in this scenario; it's
clojure.lang.Compiler.java that does.
* There is a way to reduce the size requirements; crudely, line-seq
could be used to implement a lazy seq of top-level forms built by
consuming lines until delimiters are balanced and then emitting a new
form string, then evaluating these forms one by one. This works with
typical source files that have short individual top-level forms and
have at least one line break between any two such forms, and would allow
consuming multi-gig source files if anyone ever had need for such a
thing (I'd hope never to see it unless it was machine-generated for
some purpose). Less crudely, a reader for files could be implemented
that didn't just slurp the file and call read-string on it but instead
read from an IO stream and emitted a seq of top-level forms converted
already into reader-output data structures (but unevaluated). In fact,
read-string could then be implemented in terms of this and a
StringInputStream whose implementation is left as an exercise for the
reader but which ought to be nearly trivial.
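The crude delimiter-balancing splitter can itself be sketched in Java;
this toy version tracks only paren depth and ignores strings, comments,
and reader macros, so it illustrates the shape of the approach rather
than a real reader:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class FormSplitter {
    // Splits a character stream into top-level forms by tracking paren
    // depth only -- parens inside strings or comments would confuse it,
    // so this is an illustration, not a reader.
    static List<String> topLevelForms(Reader r) {
        List<String> forms = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        int depth = 0, c;
        try {
            while ((c = r.read()) != -1) {
                cur.append((char) c);
                if (c == '(') depth++;
                else if (c == ')' && --depth == 0) {
                    forms.add(cur.toString().trim()); // one complete form
                    cur.setLength(0);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return forms;
    }
}
```

Each emitted form stays small enough to read and evaluate on its own,
no matter how large the file is.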
>> Sensible alternatives are a) get the string to whatever
>> consumes it by some other means than embedding it as a single
>> monolithic constant in bytecode,
>
> This is what we currently do in ABCL (by storing literal objects in a
> thread-local variable and retrieving them later when the compiled code
> is loaded), but it only works for the runtime compiler, not the file
> compiler (in Clojure terms, it won't work with AOT compilation).
Yes, this is the same issue raised in connection with allowing
arbitrary objects in code in eval.
>> b) convert long strings into shorter
>> chunks and emit a static initializer into the bytecode to reassemble
>> them with concatenation into a single runtime-computed string constant
>> stored in another static field,
>
> This is what I'd like to have :)
Frankly it seems like a bit of a hack to me, though since it would be
used to work around a Y2K-style bug in Java it might be poetic justice
of a sort.
>> and c) restructure whatever consumes
>> the string to consume a seq, java.util.List, or whatever of strings
>> instead and feed it digestible chunks (e.g. a separate string for each
>> defn or other top-level form, in order of appearance in the input file
>> -- surely nobody has *individual defns* exceeding 64KB).
>
> The problem is not in the consumer, but in the form containing the
> string; to do what you're proposing, the reader, upon encountering a
> big enough string, would have to produce a seq/List/whatever instead,
> the compiler would need to be able to dump such an object to a class,
> and all Clojure code handling strings would have to be prepared to
> handle such an object, too. I think it's a little impractical.
I don't think so. The problem isn't with normal strings but only with
strings that get embedded as literals in code; and moreover, the
problem isn't even with those strings exceeding 64k but with whole
.clj files exceeding 64k. The implication is that load-file generates
a class that contains the entire contents of the sourcefile as a
string constant for some reason; so:
a) What does this class do with this string constant? What code consumes it?
b) Can that particular bit of code be rewritten to digest the same
information provided in smaller chunks?
> Regarding the size of individual defns, that's an orthogonal problem;
> anyway, the size of the _bytecode_ for methods is limited to 64KB (see
> <http://java.sun.com/docs/books/jvms/second_edition/html/
> ClassFile.doc.html#88659>) and, while pretty big, it's not impossible
> to reach it, especially when using complex macros to produce a lot of
> generated code.
Another problem for which we will probably need an eventual fix or
workaround. If bytecode can contain a JMP-like instruction it should
be possible to have the compiler split long generated methods and
chain the pieces together without much loss of runtime efficiency,
particularly if it does so at "natural" places -- existing
conditional branches and (loop ...) borders. (defn foo [x] (if x
(lotta-code-1) (lotta-code-2))), for example, can be trivially
converted to (defn foo [x] (if x (lotta-code-1) (jmp bar))) (defn bar
[] (lotta-code-2)) -- though if you had such a jump instruction I'd
have thought implementing real TCO would have been fairly easy, and
apparently it was not.
Failing such a jmp capability you'd have to just use (bar) in that
last example and suffer an additional method call overhead at the
break-point. Again, the obvious way to do it would be to recognize
common branching construct forms such as (if ...) and (cond ...) that
are larger than the threshold but have individual branches that are
not and turn some or all of the branches into their own under-the-hood
methods and calls to those methods.
> We used to generate such big methods in ABCL because
> at one point we tried to spell out in the bytecode all the class names
> corresponding to functions in a compiled file, in order to avoid
> reflection when loading the compiled functions. For files with many
> functions (> 1000 iirc) the generated code became too big. It turned
> out that this optimization had a negligible impact on performance, so
> we reverted it.
I wonder if Clojure is using a similar optimization and would benefit
from its reversion.
Well, this is odd!
> 1. I'm more or less satisfied: if I know I can always work around the
> problem by using shorter namespaces, I'm happy.
It creates a tension, though:
1. In another recent thread people have been arguing for relatively
large namespaces.
2. It isn't nice to have code modularization boundaries dictated as
much by bug workaround considerations as by architectural ones!
> 2. While the namespace size seems to be a factor, I'm not convinced
> that the problem is as linear as "big namespace = problem". I have
> other namespaces that are larger (in line count, character count, and
> definitions) that have been evaluating fine without a problem. This
> problem feels more random/subtle to me.
The fact, mentioned recently in that other thread, that clojure.core
is 200K and around 6Kloc and compiles fine also points in this
direction.
My guess would be that it's some function of namespace size in bytes,
number of vars, and possibly as you say number of referred vars -- so
maybe also imports and anything else that grows the namespace's symbol
table too.
> 4. There seems to be a discrepancy in behaviour depending on how the
> compilation is requested: project-wide command-line compilation seems
> to keep working even when Slime/Swank evaluation fails.
It looks like it affects load-file but not AOT compilation. Both
presumably use eval, and eventually Compiler.java, to get the job
done, so the problem is probably in load-file's implementation
specifically. The simplest hypothesis -- that it's granularity, i.e.
the compiler chokes on a single huge wodge of forms crammed down its
throat in one go but works fine if given the same forms one by one --
fails to fully explain the data: it predicts that AOT should fail on
the namespaces where load-file fails, yet AOT keeps working, while
evaluating the top-level forms one by one works as expected. This
suggests again that load-file is the problem rather than
Compiler.java, or possibly that AOT feeds forms to the compiler
differently from load-file.
> 5. Personally I don't have any problem with hard limits (e.g. keep
> your namespaces/whatever below X lines/definitions/whatever) even if
> they're aggressive- but I think it'd be preferable to have an error
> message to point out the limit when it gets hit (if that's indeed
> what's happening).
I *do* have a problem with such limits; including limits on function
size. Architecture should be up to the programmer, and even if it is a
*bad idea* for programmers to write huge namespaces or huge individual
functions, IMO that isn't a judgment call appropriate for the
*language compiler* to make. And let's not forget that we're a Lisp,
so we do meta-programming, and so we probably want to be able to
digest files, functions, and individual s-expressions that no sane
human would ever write but that may very well occur in
machine-generated inputs to the compiler.
Your extremely narrow definition precludes the notion that a
specification can itself be in error, and therefore excludes, among
others, the famous category of Y2K bugs from meeting your definition
of "bug". It also precludes *anything* being considered a bug in any
software that lacks a formal specification distinct from its
implementation (i.e., almost ALL software!), or in the reference
implementation of any software whose formal specification is a
reference implementation rather than a design document of some sort.
> Now, you might argue that the spec is badly designed, and I might
> agree, but it's not a "bug in Oracle's Java implementation" - any
> conforming Java implementation must have that (mis)feature.
Does the JLS specify this limitation, or does it just say a string
literal is a " followed by any number of non-" characters and
\-escaped "s followed by a " or words to that effect? Because if the
latter, a conforming implementation of the Java language (as in not
directly contradicting the JLS anywhere) could permit longer string
literals, even if Oracle's does not.
>> Oh, 2 is a bug alright. By your definition, Y2K bugs in a piece of
>> software would also not be bugs. The users of such software would beg
>> to differ.
>
> User perception and bugs are different things.
Not past a certain point they aren't. One user disagreeing with the
developers can be written off as wrong. A large percentage of users
(that know about an issue) disagreeing with the developers means it's
the developers that are wrong, unless you take the extreme step of
rejecting the premise that the developers' ultimate job is serving the
needs of the software's user base. And when the developers are wrong
about something and that wrongness is expressed in the code, that
constitutes a bug, though it may not be a deviation from the
"specification" (assuming in that instance there even is a formal
specification distinct from the implementation).
>> Frankly it seems like a bit of a hack to me, though since it would be
>> used to work around a Y2K-style bug in Java it might be poetic justice
>> of a sort.
>
> It is a sort of hack, yes. An alternative might be to store constants
> in a resource external to the class file, unencumbered with silly size
> limits.
Given the deployment architecture around Java, that doesn't even add
any deployment complexity and, depending on how it is implemented, may
even make i18n easier to accomplish in some cases by freeing the
developers from having to set up a bunch of ResourceBundle
infrastructure in each codebase that needs i18n. +1.
>> I don't think so. The problem isn't with normal strings but only with
>> strings that get embedded as literals in code; and moreover, the
>> problem isn't even with those strings exceeding 64k but with whole
>> .clj files exceeding 64k. The implication is that load-file generates
>> a class that contains the entire contents of the sourcefile as a
>> string constant for some reason; so:
>>
>> a) What does this class do with this string constant? What code consumes it?
>
> Hmm, I don't think it's like you say.
Patrick said:
Does the file you are evaluating have more than 65535
characters? As far as I can tell, that is the maximum
length of a String literal in Java
This clearly implies that the problem is the .clj file exceeding the
maximum length of a string literal, and thus implies more indirectly
that the entire source file is being embedded in some class file as a
string constant. Now you seem to be denying that this is occurring. If
so, your disagreement here is with Patrick, not me.
> Without knowing anything about
> Clojure's internals, it seems to me that the problem is more likely to
> be in a form like the one Patrick posted, (clojure.lang.Compiler/load
> (java.io.StringReader. "the-whole-file-as-a-string")), which is
> compiled in order to be evaluated in order to compile and load the
> file... it is that form, and not the file to be compiled, that
> generates the incorrect class file.
That's not disagreeing, that's agreeing. You're saying that as an
intermediate stage it's compiling a class with the source file as a
literal in order to compile the file, and suggesting a particular
answer to question a, which is that the class just invokes
Compiler.load() on the string constant.
>> b) Can that particular bit of code be rewritten to digest the same
>> information provided in smaller chunks?
And the answer to this then becomes a resounding Yes: don't have
load-file, any of its cousins, or any IDE load-file function cram the
whole thing into a string and build a form like your example above to
eval. Instead, just point Compiler.load() at a reader open on the
source file on disk if it's unchanged or, if the focused editor buffer
is dirty, pass it a StringReader or similar open on the editor's
buffer in memory. (Compiler.load(new
StringReader(editorJTextArea.getText())); or whatever.)
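The Reader-based loading path can be sketched in plain Java; the
consumer below is a stand-in that merely counts characters, not the
real Compiler.load, but the API shape is the point -- the source is
consumed incrementally, so nothing ever needs to become one giant
literal in generated bytecode:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class ReaderLoad {
    // Stand-in for a Compiler.load(Reader)-style entry point: it pulls
    // characters from the stream one at a time and never materializes
    // the whole source as a single String. Here it just counts them.
    static long consume(Reader src) {
        long n = 0;
        try {
            while (src.read() != -1) n++;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return n;
    }

    public static void main(String[] args) {
        // Saved, unmodified file: hand the loader a FileReader on disk.
        // Dirty editor buffer: wrap the in-memory text instead.
        Reader fromBuffer = new StringReader("(defn f [] 1)");
        System.out.println(consume(fromBuffer));
    }
}
```

Either a FileReader or a StringReader plugs into the same consumer,
which is exactly why the choice of source costs nothing.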
> If Patrick is right, and I think he is, then the compiler has to
> compile (java.io.StringReader. "the-whole-file-as-a-string") in a way
> that "the-whole-file-as-a-string" does not appear literally in the
> class file. It has either to somehow split the string, or load it from
> somewhere else.
Or, fixing the current crop of problems like this, such forms
shouldn't be generated as intermediate steps in loading by the
editor/IDE/load-file/whatever to begin with. See above.
On the other hand, it suggests that (foo "a very long ... string")
would still break, which is not desirable, so ultimately a mechanism
for breaking up large string constants under the hood in Compiler
seems indicated. We just may not need it to solve this specific
instance.
> In fact, no JMP-like instruction exists that can jump to a different
> method.
Seems like a troubling oversight. As I said it would come in very
handy for enabling TCO. It could even be implemented securely -- the
bytecode verifier could require the target to be the start of a method
visible to the method containing the JMP instruction, for instance,
i.e. the same requirements it currently places on the target of
invokevirtual. It would just be non-stack-consuming and otherwise
equivalent to "return otherMethod(args);" (after some suitable storage
of the arguments, where there are any). If return otherMethod(args);
is secure, then this limited, tail-optimization-enabling JMP could be
secured as well.
>> Failing such a jmp capability you'd have to just use (bar) in that
>> last example and suffer an additional method call overhead at the
>> break-point. Again, the obvious way to do it would be to recognize
>> common branching construct forms such as (if ...) and (cond ...) that
>> are larger than the threshold but have individual branches that are
>> not and turn some or all of the branches into their own under-the-hood
>> methods and calls to those methods.
>
> The positive thing is that the method call overhead disappears thanks
> to Hotspot for frequently called methods. The negative thing is that
> splicing bytecode is harder than it seems because of jumps and
> exception handlers that might be present.
Hence my suggestion that the compiler work at a higher level. The
post-macroexpansion sexp seems ideal, since it need only look for the
special forms (if ...), (let* ...), and (loop* ...) to find obvious
division points and the nature of functional code is such that very
long functions usually contain these. So it can follow an algorithm
like
1. Try compiling the whole function into a single .invoke method (the
current behavior). If that succeeds, we're done.
2. Catch the exception and try to break the function at an obvious
boundary near the midpoint, particularly on either side of a let* or
loop* roughly corresponding to the middle third, or at the start of
the "else" of an if where that "else" is about a third to half the
code.
3. If all else fails, just split at arbitrary points in e.g. a long
chain of expressions.
4. Generate a .invoke method and some auxiliary private methods that
are called by it and/or each other.
There's some complication; the sub-methods will need to receive
arguments corresponding to the locals in existence at the start of the
split-off piece of code that are used within that piece, for example.
But it doesn't strike me as infeasible.
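Put concretely, the transformation in step 4 amounts to something like
this at the Java level (names and bodies are hypothetical placeholders
for the oversized branch arms):

```java
public class SplitExample {
    // Before the split, invoke would be one method whose bytecode
    // exceeds the 64KB Code-attribute limit. After, the "else" arm
    // lives in a private auxiliary method and a call replaces the
    // inlined bytecode; the locals that arm uses are passed along as
    // arguments, as noted above.
    public static Object invoke(Object x) {
        if (x != null) {
            return "lotta-code-1"; // stands in for the oversized then-arm
        }
        return elseArm(x);         // break-point: one extra call's overhead
    }

    private static Object elseArm(Object x) {
        return "lotta-code-2";     // stands in for the oversized else-arm
    }
}
```

Behavior is unchanged; only the bytecode has been redistributed across
method boundaries.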