Re: Interner: optional String cleanup

160 views
Skip to first unread message

arn

unread,
Oct 21, 2012, 11:13:05 AM10/21/12
to guava-...@googlegroups.com
Just found that JRE >=1.7u6 doesn't share String internal arrays anymore...

Anyway, it can be useful for older JRE or other types of values. :)

Osvaldo Doederlein

unread,
Oct 21, 2012, 2:10:48 PM10/21/12
to arn, guava-discuss
Right, this was an old pitfall of the platform, causing leaks in careless code. Your proposed optimization is one way to defeat those bugs(*), except that you cannot be sure that the String passed as argument is not retained somewhere else, so there's some risk that this optimization would increase memory usage for some particularly evil users... although that's a very minor issue, would only affect code that already contains a bug.

(*) Strings that retain more than their length() of chars, are only produced by code that uses substring operations. Higher-level idioms like StringBuilder or formatter APIs always give you a "packed" String; likewise for major String-heavy libraries such as XML parsers. So I think the problem should be very small these days; I might be wrong, but that's one reason why Oracle finally got rid of the "optimization" of char[] sharing.

A+
Osvaldo
 

Louis Wasserman

unread,
Oct 21, 2012, 2:47:03 PM10/21/12
to Osvaldo Doederlein, arn, guava-discuss
Are we really going to depend on that?  I'm honestly horrified that Oracle made that change, given how much code makes assumptions about the asymptotics of String.substring.

Osvaldo Doederlein

unread,
Oct 21, 2012, 3:44:36 PM10/21/12
to Louis Wasserman, arn, guava-discuss
On Sun, Oct 21, 2012 at 2:47 PM, Louis Wasserman <wasserm...@gmail.com> wrote:
Are we really going to depend on that?  I'm honestly horrified that Oracle made that change, given how much code makes assumptions about the asymptotics of String.substring.

Well they're clearly betting that this won't create significant problems.  In the last few years Oracle has been working a lot in this area. They have new HotSpot optimizations that recognize common patterns of String operations (e.g. to coalesce concatenations and reduce the number of temporary allocations and data copying). They also tried -XX:+UseCompressedStrings, but this idea was apparently dropped, it's not default in JDK 6 and not available at all in JDK 7.

Both designs create some risk of performance bugs, but sharing is clearly the most risky, as you can have a 10-char substring pinning a megabyte-size string in the heap; and this is not by any means unrealistic, for example people can slurp a big text file into a single string and then parse and keep small pieces of it.  OTOH, the opposite perf bug (to have an important impact) requires lots of substring operations, each cutting a large substring off a larger source string. I can imagine some code that would be affected, like a full-text database / indexer such as Lucene.

A+
Osvaldo

Nathan Bryant

unread,
Oct 21, 2012, 4:00:32 PM10/21/12
to Osvaldo Doederlein, Louis Wasserman, arn, guava-discuss
On 10/21/2012 3:44 PM, Osvaldo Doederlein wrote:
> Both designs create some risk of performance bugs, but sharing is
> clearly the most risky, as you can have a 10-char substring pinning a
> megabyte-size string in the heap; and this is not by any means
> unrealistic, for example people can slurp a big text file into a
> single string and then parse and keep small pieces of it. OTOH, the
> opposite perf bug (to have an important impact) requires lots of
> substring operations, each cutting a large substring off a larger
> source string. I can imagine some code that would be affected, like a
> full-text database /

10-char substrings pinning a meg could happen, yes. But it doesn't seem
to me to be a frequent occurrence and it also doesn't fit the definition
of a "leak"; rather, it is "unexpected memory retention." And there are
workarounds.

If Oracle is going to make changes in the standard library that are
targeted to eliminate common programmer pitfalls, they might have
started by moving a bunch of fields in SimpleDateFormat to be allocated
on the stack rather than the heap.

Osvaldo Doederlein

unread,
Oct 21, 2012, 4:43:48 PM10/21/12
to Nathan Bryant, Louis Wasserman, arn, guava-discuss
On Sun, Oct 21, 2012 at 4:00 PM, Nathan Bryant <nbr...@optonline.net> wrote:
On 10/21/2012 3:44 PM, Osvaldo Doederlein wrote:
Both designs create some risk of performance bugs, but sharing is clearly the most risky, as you can have a 10-char substring pinning a megabyte-size string in the heap; and this is not by any means unrealistic, for example people can slurp a big text file into a single string and then parse and keep small pieces of it.  OTOH, the opposite perf bug (to have an important impact) requires lots of substring operations, each cutting a large substring off a larger source string. I can imagine some code that would be affected, like a full-text database /

10-char substrings pinning a meg could happen, yes. But it doesn't seem to me to be a frequent occurrence and it also doesn't fit the definition of a "leak"; rather, it is "unexpected memory retention." And there are workarounds.

"Unexpected memory retention" is the any kind of leak you can have in a garbage-collected language ;-) there are of course several ways to do that, and substrings are particularly evil because the unexpected retention is in the guts of strings, which behavior is not documented anywhere. (Not even in the String(String) constructor, which single purpose in life is avoiding that sharing.)
 
If Oracle is going to make changes in the standard library that are targeted to eliminate common programmer pitfalls, they might have started by moving a bunch of fields in SimpleDateFormat to be allocated on the stack rather than the heap.

This is a different case because SDF is clearly documented to not be thread-safe. I don't know every API by heart and I often write code with assistance of autocompletion, without checking the javadocs; but I never assume that some API is thread-safe unless I remember it clearly, or check the javadocs. I don't think the platform should go out of its way to protect that kind of sloppy coding.

Agreed that making the change you sugest would be convenient, but the problem is that the "bunch of fields" seems to be 10 fields (counting the non-final fields in JDK 7's impl); it's too much to pass around lots of internal methods via extra parameters. Maybe refactor all mutable fields and all internal methods that use them to a private worker class. This would resemble the alternative design, e.g. Pattern+Matcher, a two-layer system with a thread-safe parent object that does some preprocessing and a non-thread-safe, child worker. Maybe SimpleDateFormat should have adopted that design. Anyway, criticizing Java's Date APIs is a waste of time, it's a dead horse so maybe not the best example for your point ;-)

A+
Osvaldo

Nathan Bryant

unread,
Oct 21, 2012, 4:57:44 PM10/21/12
to Osvaldo Doederlein, Louis Wasserman, arn, guava-discuss
On 10/21/2012 4:43 PM, Osvaldo Doederlein wrote:
>
> "Unexpected memory retention" is the any kind of leak you can have in
> a garbage-collected language ;-)

Yes and no. If you have a static final field that points to a
collection, and you have a piece of code that adds an item to that
collection and never removes it, in my personal lexicon you don't have
unexpected retention -- you have a leak, and instead of having a
solution that is some (possibly huge) constant factor away from optimal
you have a solution that is simply untenable.

>
> This is a different case because SDF is clearly documented to not be
> thread-safe.

Who reads the manual? I'm pretty sure I didn't until FindBugs told me
how wrong I'd been, and I've been shouting "don't store a DateFormat in
a field" from the rooftops ever since, without a lot of people listening..

> I don't know every API by heart and I often write code with assistance
> of autocompletion, without checking the javadocs; but I never assume
> that some API is thread-safe unless I remember it clearly, or check
> the javadocs. I don't think the platform should go out of its way to
> protect that kind of sloppy coding.
Nobody gets thread safety right. And, lately, people have this idea that
they should be able to dependency-inject everything, either via Spring
or via a static constant... and the immediate response to "don't store a
DateFormat in a field", IF people are listening, is "wow, that sucks."
>
> Agreed that making the change you sugest would be convenient, but the
> problem is that the "bunch of fields" seems to be 10 fields (counting
> the non-final fields in JDK 7's impl); it's too much to pass around
> lots of internal methods via extra parameters. Maybe refactor all
> mutable fields and all internal methods that use them to a private
> worker class. This would resemble the alternative design, e.g.
> Pattern+Matcher, a two-layer system with a thread-safe parent object
> that does some preprocessing and a non-thread-safe, child worker.
> Maybe SimpleDateFormat should have adopted that design. Anyway,
> criticizing Java's Date APIs is a waste of time, it's a dead horse so
> maybe not the best example for your point ;-)
True, there's many other problems in j.u.Date and friends, starting with
all those deprecated methods in j.u.Date itself which are never going to
go away.

David Beaumont

unread,
Oct 21, 2012, 5:51:41 PM10/21/12
to Nathan Bryant, Osvaldo Doederlein, Louis Wasserman, arn, guava-discuss
On 21 October 2012 13:57, Nathan Bryant <nbr...@optonline.net> wrote:
> On 10/21/2012 4:43 PM, Osvaldo Doederlein wrote:
>> This is a different case because SDF is clearly documented to not be
>> thread-safe.
>
> Who reads the manual? I'm pretty sure I didn't until FindBugs told me how
> wrong I'd been, and I've been shouting "don't store a DateFormat in a field"
> from the rooftops ever since, without a lot of people listening..

Somewhat off topic:

Interesting you should say that. I was actually thinking how much I
would really like a thread safe SDF about 5 minutes ago.

The problem there is that unlike java.util.Formatter (also not thread
safe), a SDF encapsulates the compiled form of the format string (and
compilation is non-trivial). So to do this there would probably need
to be a thread-local containing a map of commonly used formats (we can
'grep' to figure out what these are) and an LRU cache of dynamic SDFs.
We might even be able to exploit the fact that SDF is *gasp* Cloneable
!!

Or we could just do what Sun never managed and write a thread-safe
replacement (are we ok to read the source or would it need to be clean
room ??)

Cheers,
David

--
David Beaumont :: Îñţérñåţîöñåļîžåţîờñ Libraries :: Google
Google Switzerland GmbH., Brandschenkestrasse 110, CH-8002, Zürich - Switzerland
Tel +41 44 668 1800 :: Fax +41 44 668 1818

nik...@gmail.com

unread,
Oct 21, 2012, 6:57:54 PM10/21/12
to David Beaumont, Nathan Bryant, Osvaldo Doederlein, Louis Wasserman, arn, guava-discuss
Sun never bothered because of jodatime. They'll provide better stuff with jsr310 eventually.

I do like how checkstyle basically browbeats you into jodatime.

Sent from my iPhone

Osvaldo Doederlein

unread,
Oct 21, 2012, 9:07:24 PM10/21/12
to Nathan Bryant, Louis Wasserman, arn, guava-discuss
On Sun, Oct 21, 2012 at 4:57 PM, Nathan Bryant <nbr...@optonline.net> wrote:
On 10/21/2012 4:43 PM, Osvaldo Doederlein wrote:
"Unexpected memory retention" is the any kind of leak you can have in a garbage-collected language ;-)

Yes and no. If you have a static final field that points to a collection, and you have a piece of code that adds an item to that collection and never removes it, in my personal lexicon you don't have unexpected retention -- you have a leak, and instead of having a solution that is some (possibly huge) constant factor away from optimal you have a solution that is simply untenable.

Notice that String.substring()'s javadocs don't inform you that the implementation will pin to the heap all characters from the receiver. This failure of documentation, in my lexicon is severe enough to consider the sharing technique leaky.  It's exactly like if ArrayBlockingQueue.remove() wouldn't null out the object reference, just bump the "first" index to the backing circular array, and not document that fact: most people would call that a leak, even if it was purposeful because it's a bit faster. String.substring()'s missing doc is certainly intended, to not reveal and not restrain implementation details, but they could have used a careful language e.g. "implementations are free to do this...").
 
Nobody gets thread safety right. And, lately, people have this idea that they should be able to dependency-inject everything, either via Spring or via a static constant... and the immediate response to "don't store a DateFormat in a field", IF people are listening, is "wow, that sucks."

Now I agree, but the real lesson is that static fields should be avoided like the plague. Multithreading is not that hard if one is disciplined with global state and sharing; have as little as possible of it, or if you have a lot (e.g. a big cache) then have very simple interfaces protecting that evil shared, mutable stuff.  It's tempting to use simple and efficient sharing (raw static fields) for immutable objects; but just like thread-safety, one cannot just assume that something is immutable just because it seems to makes sense conceptually.

And if people do dependency injection of mutable objects, into objects that are not themselves unshared, they're doing DI wrong. Frameworks that sell DI as a silver bullet and tell you to use it everywhere are part of the problem not the solution. :)

A+
Osvaldo

Reply all
Reply to author
Forward
0 new messages