vector of chars vs string

170 views
Skip to first unread message

Camilo Roca

unread,
May 27, 2016, 7:30:45 AM5/27/16
to Clojure
Hey guys,

First of all I would like to ask some thing. In clojure the following statements results in:
(identical? "foo" (str "f" "oo"))
;;=> false
> (= "foo" (str "f" "oo"))
;;=> true

Everything is ok with that. The next one on the other hand is what puzzles me:
(identical? \f (first (str "f" "oo")))
;;=> true

If what I guess is right, the amount of chars that exist are finite, thus Clojure treats them like a "pool of charts". The question is then why are not strings implemented as vectors of charts instead of using the underlying Java String class? As by using the Java String new allocations would have to be performed every time that a new string needs to be created, even if it contains exactly the same information of an existing string.

This might not be a big deal if the amount of strings is small of lazy produced, but (at least in my case) when I needed to load a relatively small text file (120 MB) fully into memory* then I started having memory problems as (from what I saw with YourKit) I had lots of repeated strings.

I would like to know your thoughts on this idea and implications/problems with it.


---------------------------------------------------------------------- notes
* yes I know that I could lazily analyze the whole file and thus avoid having memory problems, but in some cases such as using sort-by or group-by, there is no other alternative than holding the whole thing and then process it.

Alex Miller

unread,
May 27, 2016, 8:30:44 AM5/27/16
to Clojure

On Friday, May 27, 2016 at 6:30:45 AM UTC-5, Camilo Roca wrote:
If what I guess is right, the amount of chars that exist are finite,

Well, kind of - see Unicode.
 
thus Clojure treats them like a "pool of charts".

Java has a small number of primitive value types - byte, short, int, long, double, float, boolean, and char. In the JVM there is no "pool of primitives" - all of these types (plus Objects) describe how values are actually stored. If you had a "pool" then something would have to refer to it and the pointer would be larger than the value in many cases.

Several of the boxed versions of these primitives do pool the boxed forms in a range. I think Character is pooled for 0-127. Integers use a larger extended pool. This is all defined in the Java spec. This is a Java feature, not a JVM one.
 
The question is then why are not strings implemented as vectors of charts instead of using the underlying Java String class?

Java strings are stored as arrays of chars  (hand-waving over the actual tricky details of multi-byte character encoding) which is about as efficient as you can get in the JVM (once I implemented a hack to String itself to compress and decompress big char arrays - that was fun! totally not worth it). Clojure vectors are certainly not as memory efficient as a simple array - they are a tree structure of nodes containing arrays (hash array mapped tries).
 
As by using the Java String new allocations would have to be performed every time that a new string needs to be created, even if it contains exactly the same information of an existing string.

True, although these operations (being so common) are highly optimized in the JVM and even in the garbage collector. G1 will now autodetect the reuse of literal strings and de-dupe them in the latest versions. In short, the JVM is optimizing strings way more than we ever would and the best thing to do is to leverage the common path there and assume it will continue to be as fast as possible. Java strings can also be interned which forces a literal string to be effectively cached in the JVM and reused - you can explicitly do that via String.intern() but literal strings in your source code are always interned. There are some interesting effects with interning though so it's good to tread carefully with it - we used to use it more inside Clojure but have backed off of it as it makes allocation slower.
 
This might not be a big deal if the amount of strings is small of lazy produced, but (at least in my case) when I needed to load a relatively small text file (120 MB) fully into memory* then I started having memory problems as (from what I saw with YourKit) I had lots of repeated strings.

Interning might be worth looking at it if you're seeing this. In particular for cases like reading a file into maps where the map keys are common - those keys can be interned.

You might also try the G1 string deduplication with something like -XX:+UseG1GC -XX:+UseStringDeduplication -XX:+PrintStringDeduplicationStatistics - the last to see the effects. Google around for more info on this and make sure you're on the latest JDK as it's evolved.

Tassilo Horn

unread,
May 27, 2016, 12:18:06 PM5/27/16
to Camilo Roca, Clojure
Camilo Roca <car...@gmail.com> writes:

Hi Camilo,

> Everything is ok with that. The next one on the other hand is what
> puzzles me:
> (identical? \f (first (str "f" "oo")))
> ;;=> true
>
> If what I guess is right, the amount of chars that exist are finite,
> thus Clojure treats them like a "pool of charts". The question is then
> why are not strings implemented as vectors of charts instead of using
> the underlying Java String class?

That wouldn't be better. Two vectors containing the identical chars in
the same order are still different vectors.

(map identical? [\f \o \o] [\f \o \o])
;=> (true true true)

(identical? [\f \o \o] [\f \o \o])
;=> false

So that's actually exactly the same as with strings. And as Alex
already explained, strings are more light-weight than vectors, and they
are already highly optimized, e.g., every equal literal string in your
code will usually be represented by the very same string in memory.

Bye,
Tassilo

Camilo Roca

unread,
May 28, 2016, 3:38:38 PM5/28/16
to Clojure
@alex-miller
thanks, that explained a lot. I did check the UseStringDeduplication feature and that's exactly what I was looking for.

Funny thing is that it can only be implemented because strings are immutable in Java :P thus it is cheaper to store a reference to it since it will never change.

@Tassilo
that's actually a rather interesting example. I always wondered about it but didn't bother to ask it until now. If you see your example, you can see that the JVM is actually recognizing the characters inside the vectors are the same and only allocating one for each occurrence. On the other hand, the vectors although being exactly the same are not being recognized by the JVM as so. Thus, two different allocations are done even though only one would suffice (given their immutability).

That was precisely my point: If you could identify that something is an exact copy of something else, why not just point to the first thing? From Alex miller, answer I know now that this is true for strings and for primitive values present in the source code.

But what about keywords? vectors, hash-maps, sets?

Of course, complaining about allocating those in the source code is non-sense as it would need a huge amount of them to even start to matter. But if you create lots of them at runtime then it is probably worth to look at it.

Say for example a web-server that receives json objects with 70% of its content being (predefined) configuration values. If you allocate new strings/vector/hash-map for every request that you get, then your memory rises quite quickly. On the other hand if you keep a sort of "pool of values" and replace the new values with references to the ones in the pool then the total memory consumed is less than before.

From my understanding of the description of "UseStringDeduplication" this is what's been done for Strings, but I guess NOT for any other object.

Alex Miller

unread,
May 28, 2016, 8:56:58 PM5/28/16
to Clojure

On Saturday, May 28, 2016 at 2:38:38 PM UTC-5, Camilo Roca wrote:
@alex-miller
thanks, that explained a lot. I did check the UseStringDeduplication feature and that's exactly what I was looking for.

Funny thing is that it can only be implemented because strings are immutable in Java :P thus it is cheaper to store a reference to it since it will never change.

@Tassilo
that's actually a rather interesting example. I always wondered about it but didn't bother to ask it until now. If you see your example, you can see that the JVM is actually recognizing the characters inside the vectors are the same and only allocating one for each occurrence. On the other hand, the vectors although being exactly the same are not being recognized by the JVM as so. Thus, two different allocations are done even though only one would suffice (given their immutability).

That was precisely my point: If you could identify that something is an exact copy of something else, why not just point to the first thing? From Alex miller, answer I know now that this is true for strings and for primitive values present in the source code.

But what about keywords? vectors, hash-maps, sets?

Keywords are interned in Clojure - when you "create" one it will return you the identical instance from the cache if it already exists. There are a few other special cases like this too - things like literal {}, [], (), #{}. But in general this has all the same problems with any cache - synchronization/concurrency, knowing when to clear, updating use stats/refcounting, etc.

 
Of course, complaining about allocating those in the source code is non-sense as it would need a huge amount of them to even start to matter. But if you create lots of them at runtime then it is probably worth to look at it.

Say for example a web-server that receives json objects with 70% of its content being (predefined) configuration values. If you allocate new strings/vector/hash-map for every request that you get, then your memory rises quite quickly. On the other hand if you keep a sort of "pool of values" and replace the new values with references to the ones in the pool then the total memory consumed is less than before.

I believe there are some json parsers that do exactly this optimization, not too sure about the Clojure ones though. 
Reply all
Reply to author
Forward
0 new messages