How to do an efficient update of an immutable map

Alan Burlison

unread,

Dec 20, 2011, 6:10:11 PM12/20/11

to scala...@googlegroups.com

I have an immutable map which occasionally needs updating and replacing
with a new version. The update will add some new keys, change the
contents of some others and occasionally delete keys as well. I could
do this with ++ for the addition/updates and -- for the removals but
that would mean creating an intermediate immutable hash. Is there a way
of doing a set of additions, updates and removals all in one go?

--
Alan Burlison
--

Ken McDonald

unread,

Dec 20, 2011, 6:31:44 PM12/20/11

to scala...@googlegroups.com

I believe there should be operators for adding one map to another (in a rush right now so can't check, sorry). So you could create a new map, add your new elements to it which would be fast since it's small (in fact, you could even make it mutable), and then add the new map to the old map to get the final map. I'm assuming that the underlying implementations are efficient (which probably means they use transient mutable structures).

Cheers,

Ken

Alan Burlison

unread,

Dec 20, 2011, 6:42:43 PM12/20/11

to scala-user

That's more or less what I was intending to to for the additions and
updates, but it won't cater for removals - that's the problem. Also
you don't have to use a map for the add/update you can use a list of
key/value pairs.

Ken McDonald

unread,

Dec 20, 2011, 10:44:18 PM12/20/11

to scala...@googlegroups.com

Oh, I see what you mean; you'll get an intermediate map between doing the additions and the removals. I've never seen a method for doing additions and deletions at the same time, and I imagine you've looked as well. I doubt there is a way to improve on that. But I don't see it being a problem--if the creation of the intermediate map causes a performance problem, then even the creation of a single new map per update is suspect, and you should probably be using mutable maps.

Generally speaking, I've found that premature optimization is at the heart of a lot of problems--I've done it many times myself, so I know :-)

Ken

Bernd Johannes

unread,

Dec 21, 2011, 2:00:23 AM12/21/11

to Alan Burlison, scala...@googlegroups.com

What I did in a similar situation was:
- gather the changes qualified in a mutable collection (addition/update versus deletions)
- optimize the changes (e.g. throw out additions/updates when a deletion follows, keep only the "latest" update)
- look at the size of the resulting changes
- when "small" - perform changes on the immutable collection directly
- when "big" - process everything in a mutable collection and transform into an immutable result

That was fast enough (for me) but it requires that you are able to gather the changes during your processing. If they have to be applied on the spot then this model isn't an option.
(In my situation it was a simulation over time so each immutable collection represented a point in time and the mutable buffers were calculation intermediates which got integrated into the t0 collection to yield the t0+1 collection).

Greetings
Bernd

Dennis Haupt

unread,

Dec 21, 2011, 3:45:40 AM12/21/11

to Alan Burlison, scala...@googlegroups.com

get your hands on the newbuilder.

-------- Original-Nachricht --------
> Datum: Tue, 20 Dec 2011 23:10:11 +0000
> Von: Alan Burlison <alan.b...@gmail.com>
> An: scala...@googlegroups.com
> Betreff: [scala-user] How to do an efficient update of an immutable map

Rüdiger Klaehn

unread,

Dec 21, 2011, 4:01:55 AM12/21/11

to Dennis Haupt, Alan Burlison, scala...@googlegroups.com

As far as I know a map builder just contains a map. Bulk operations on maps are not implemented. I guess since map uses a (very flat ) tree internally, you could do mapa ++ mapb by merging the trees instead of just adding all elements of mapb to mapa sequentially.

A data structure that supports bulk union is IntMap/LongMap in scala.collection.immutable. But of course you can only have ints or longs as keys, and it is not as memory efficient as a normal map since it is basically a binary trie instead of a ~32way trie like the new maps.

That said, Map is pretty efficient as it is. To the OP: Did you check if this is an actual performance problem?

Alan Burlison

unread,

Dec 21, 2011, 6:41:27 AM12/21/11

to scala...@googlegroups.com, Ken McDonald

On 21/12/2011 03:44, Ken McDonald wrote:

> Oh, I see what you mean; you'll get an intermediate map between doing the
> additions and the removals. I've never seen a method for doing additions
> and deletions at the same time, and I imagine you've looked as well.

Yes, I've looked and I thought I might be missing something obvious,
which is why I asked :-) Thanks for the confirmation. Deletions will
probably be rare, so I can just gather them up in a list and avoid
creating the intermediate hash if there aren't any deletions to do.

--
Alan Burlison
--

Alan Burlison

unread,

Dec 21, 2011, 6:46:27 AM12/21/11

to Dennis Haupt, scala...@googlegroups.com

On 21/12/2011 08:45, Dennis Haupt wrote:

> get your hands on the newbuilder.

What is that?

--
Alan Burlison
--

Alan Burlison

unread,

Dec 21, 2011, 6:53:19 AM12/21/11

to Rüdiger Klaehn, Dennis Haupt, scala...@googlegroups.com

On 21/12/2011 09:01, R�diger Klaehn wrote:

> As far as I know a map builder just contains a map. Bulk operations on maps
> are not implemented. I guess since map uses a (very flat ) tree internally,
> you could do mapa ++ mapb by merging the trees instead of just adding all
> elements of mapb to mapa sequentially.

I've been blundering around trying to find the code that actually
implements ++ and -- and got pretty lost. From what I can tell it's in
effect a copy & append operation, is that correct? For large maps with
a low number of changes being added that seems rather inefficient.

> A data structure that supports bulk union is IntMap/LongMap in
> scala.collection.immutable. But of course you can only have ints or longs
> as keys, and it is not as memory efficient as a normal map since it is
> basically a binary trie instead of a ~32way trie like the new maps.
>
> That said, Map is pretty efficient as it is. To the OP: Did you check if
> this is an actual performance problem?

Not yet, and in practice I can probably get away with a slightly
inefficient implementation. However I'm trying to understand the
tradeoffs between using a mutable map + locking or using an immutable
map that gets replaced on update - the use case is multithreaded access,
mostly read-only.

--
Alan Burlison
--

Daniel Sobral

unread,

Dec 21, 2011, 7:17:41 AM12/21/11

to Alan Burlison, Rüdiger Klaehn, Dennis Haupt, scala...@googlegroups.com

On Wed, Dec 21, 2011 at 09:53, Alan Burlison <alan.b...@gmail.com> wrote:

An immutable map doesn't get replaced wholesale. Most of the new map
uses the same data as the old map, with just a sprinkling of new data
structures adjusted for the key that was inserted or removed.

--
Daniel C. Sobral

I travel to the future all the time.

Alan Burlison

unread,

Dec 21, 2011, 7:32:44 AM12/21/11

to Daniel Sobral, Rüdiger Klaehn, Dennis Haupt, scala...@googlegroups.com

On 21/12/2011 12:17, Daniel Sobral wrote:

> An immutable map doesn't get replaced wholesale. Most of the new map
> uses the same data as the old map, with just a sprinkling of new data
> structures adjusted for the key that was inserted or removed.

Neat, thanks for the info. If I wanted to go get my head around the
implementation, where would be the best place to start looking?

--
Alan Burlison
--

Rüdiger Klaehn

unread,

Dec 21, 2011, 10:11:02 AM12/21/11

to Tony Morris, Alan Burlison, Dennis Haupt, scala...@googlegroups.com, Daniel Sobral

Forgot reply to all.

I recommend benchmarking the relevant code with google caliper to see if it is actually a problem.

The scala hash map uses a very flat tree structure (maximum 32 children per node), which is much more efficient than a binary tree. Also, a trie is faster than a balanced tree for randomly distributed data like you would expect hash codes to be.

On Wed, Dec 21, 2011 at 2:43 PM, Tony Morris <tmo...@tmorris.net> wrote:

I recommend writing a (immutable) self-balancing binary tree map, with accompanying zipper to achieve your operations efficiently.

Daniel Sobral

unread,

Dec 21, 2011, 10:19:30 AM12/21/11

to Alan Burlison, Rüdiger Klaehn, Dennis Haupt, scala...@googlegroups.com

From what I saw, it uses a TrieMap based on the hash (HashTrieMap),
plus some optimizations to provide good parallelism. This is a rather
complex structure, and I advise you to get a general grounding on
functional tries before tackling the implementation. Daniel Spiewak's
Extreme Cleverness presentation
(http://www.infoq.com/presentations/Functional-Data-Structures-in-Scala)
might get you started. It's very interesting anyway, and explains very
well the basic concepts of data structure reuse (persistent data
structures).

The whole implementation seems to be inside HashMap.scala, so you can
just look up the immutable HashMap class on Scaladoc, and then click
on the link to the source code.

Tony Morris

unread,

Dec 20, 2011, 6:52:26 PM12/20/11

to scala...@googlegroups.com

You want a zipper. Unfortunately there isn't one in the standard Map
implementations. I hear there is one on the way in a third party
library, or you could write one yourself pretty easily. You want to
avoid mutability for efficiency reasons.

--
Tony Morris
http://tmorris.net/

Tony Morris

unread,

Dec 21, 2011, 8:43:44 AM12/21/11

to Alan Burlison, Dennis Haupt, Rüdiger Klaehn, scala...@googlegroups.com, Daniel Sobral

I recommend writing a (immutable) self-balancing binary tree map, with accompanying zipper to achieve your operations efficiently.

On Dec 21, 2011 10:34 PM, "Alan Burlison" <alan.b...@gmail.com> wrote:

Alan Burlison

unread,

Dec 21, 2011, 2:43:40 PM12/21/11

to Daniel Sobral, Rüdiger Klaehn, Dennis Haupt, scala...@googlegroups.com

That's exactly what I was looking for, thanks very much Daniel :-)

--
Alan Burlison
--

Ken McDonald

unread,

Dec 21, 2011, 5:07:22 PM12/21/11

to scala...@googlegroups.com

Also, what is your case for using an immutable data structure? If it is simply to prevent unintended mutations, then you might do the following:

1) Use a mutable map, but that map is never passed out into the world; it is used only for the batch updates.

2) Once a batch update has been done, create an immutable map from the mutable map, and use that for all the data access.

You could actually wrap a mutable map inside a simple class that would force you to use it in a fairly functional manner. Call the class IMap, it has 3 operations; add, delete, and return immutable map. The first two return a new IMap containing the old mutable map, and set a flag marking the original IMap as invalid; after that, any operation on the original IMap throws an error. Of course, this isn't functional, what it really is is an ad hoc way of enforcing a uniqueness constraint such as provided by Concurrent Clean (I have to blow my own horn here--that was my idea, the CC folks were nice enough to like it and implement it), so in this case you're doing at runtime what CC can do at compile time. This isn't ideal, but it's simple, and should sharply reduce imperative-type errors.

Of course, if you need different versions of the map active at the same time, this won't help.

Cheers,

Ken

Rüdiger Klaehn

unread,

Dec 21, 2011, 7:14:30 PM12/21/11

to Dennis Haupt, Alan Burlison, scala...@googlegroups.com

Sometimes the scala collections seem a bit unfinished.

I looked at the code, and it seems that map does in fact support bulk operations. There is a function called merge in scala.collection.immutable.HashMap. But it is not being used when merging two maps with ++.

Somebody went through all the trouble of implementing a bulk merge, but since 99.5% of all users of scala.collection.immutable.Map will use ++ to merge two maps, they will never benefit from it.

Instead ++ just adds all elements of the second map sequentially to the first map. Now, that makes sense when the second argument of ++ is not actually a map, but if both arguments of ++ are maps the bulk merge function should be used!

Here is a microbenchmark showing the difference of using merge vs. using ++:

code:

package mapmerge

import scala.collection.immutable.HashMap

import com.google.caliper.{Runner, Param, SimpleBenchmark}

class MapMergeTest extends SimpleBenchmark {

@Param var size = 0

var a: HashMap[Int, Int] = _

var b: HashMap[Int, Int] = _

override def setUp() {

val odd = (0 until size by 2).map(x => x -> x)

val even = (1 until size by 2).map(x => x -> x)

a = HashMap(even: _*)

b = HashMap(odd: _*)

}

def timeMerge(reps: Int) = {

var result = 0

var i = 0

while (i < reps) {

val c = a.merge(b)

result += c.size

i += 1

}

result

}

def timePlusPlus(reps: Int) = {

var result = 0

var i = 0

while (i < reps) {

val c = a ++ b

result += c.size

i += 1

}

result

}

object MapMergeTest {

def main(args:Array[String]) {

Runner.main(classOf[MapMergeTest], args :_*)

}

results:

java mapmerge.MapMergeTest -Dsize=10,100,1000,10000

0% Scenario{vm=java, trial=0, benchmark=Merge, size=10} 156,73 ns; σ=0,35 ns @ 3 trials

13% Scenario{vm=java, trial=0, benchmark=PlusPlus, size=10} 2107,29 ns; σ=19,64 ns @ 5 trials

25% Scenario{vm=java, trial=0, benchmark=Merge, size=100} 3625,39 ns; σ=8,85 ns @ 3 trials

38% Scenario{vm=java, trial=0, benchmark=PlusPlus, size=100} 14030,30 ns; σ=100,49 ns @ 3 trials

50% Scenario{vm=java, trial=0, benchmark=Merge, size=1000} 29422,37 ns; σ=175,16 ns @ 3 trials

63% Scenario{vm=java, trial=0, benchmark=PlusPlus, size=1000} 156201,21 ns; σ=426,59 ns @ 3 trials

75% Scenario{vm=java, trial=0, benchmark=Merge, size=10000} 423233,09 ns; σ=9805,64 ns @ 10 trials

88% Scenario{vm=java, trial=0, benchmark=PlusPlus, size=10000} 1974952,39 ns; σ=14975,68 ns @ 3 trials

size benchmark ns linear runtime

10 Merge 157 =

10 PlusPlus 2107 =

100 Merge 3625 =

100 PlusPlus 14030 =

1000 Merge 29422 =

1000 PlusPlus 156201 ==

10000 Merge 423233 ======

10000 PlusPlus 1974952 ==============================

vm: java

trial: 0

Process finished with exit code 0

Tony Morris

unread,

Dec 21, 2011, 7:02:38 PM12/21/11

to Alan Burlison, Dennis Haupt, Rüdiger Klaehn, scala...@googlegroups.com, Daniel Sobral

To be clear, I recommend this exercise for gaining an introduction to functional programming and associated data structures -- to "be the best place to start looking to get your head around the implementation."

Happy to start it off if you like.

Alan Burlison

unread,

Dec 22, 2011, 5:15:42 AM12/22/11

to Rüdiger Klaehn, Dennis Haupt, scala...@googlegroups.com

On 22/12/2011 00:14, Rüdiger Klaehn wrote:

> Here is a microbenchmark showing the difference of using merge vs. using ++:

Wow, that's a huge difference even with low numbers of additions.
Thanks for the info, and for a useful example of the SimpleBenchmark
stuff as well :-)

Perhaps a bug should be logged against this?

--
Alan Burlison
--

Alan Burlison

unread,

Dec 22, 2011, 5:21:32 AM12/22/11

to scala...@googlegroups.com, Ken McDonald

On 21/12/2011 22:07, Ken McDonald wrote:

> Also, what is your case for using an immutable data structure? If it is
> simply to prevent unintended mutations, then you might do the following:

It's for multithreaded access. It's OK for threads to have a slightly
out-of-date version, but updates should be atomic. From reading the
Scala literature that would seem to be the classic case where you'd use
an immutable map, hence the question.

> 1) Use a mutable map, but that map is never passed out into the world; it
> is used only for the batch updates.
> 2) Once a batch update has been done, create an immutable map from the
> mutable map, and use that for all the data access.

The map will only have a low percentage of entries modified when it is
modified, so converting between immutable and mutable maps each time
seemed a little clumsy.

> You could actually wrap a mutable map inside a simple class that would
> force you to use it in a fairly functional manner. Call the class IMap, it
> has 3 operations; add, delete, and return immutable map. The first two
> return a new IMap containing the old mutable map, and set a flag marking
> the original IMap as invalid; after that, any operation on the original
> IMap throws an error. Of course, this isn't functional, what it really is
> is an ad hoc way of enforcing a uniqueness constraint such as provided by
> Concurrent Clean (I have to blow my own horn here--that was my idea, the CC
> folks were nice enough to like it and implement it), so in this case you're
> doing at runtime what CC can do at compile time. This isn't ideal, but it's
> simple, and should sharply reduce imperative-type errors.
>
> Of course, if you need different versions of the map active at the same
> time, this won't help.

Threads that have grabbed the old version can carry on using it, so yes
potentially there may be multiple versions active.

--
Alan Burlison
--

Rüdiger Klaehn

unread,

Dec 22, 2011, 5:58:29 AM12/22/11

to Alan Burlison, scala...@googlegroups.com, Ken McDonald

On Thu, Dec 22, 2011 at 11:21 AM, Alan Burlison <alan.b...@gmail.com> wrote:

On 21/12/2011 22:07, Ken McDonald wrote:

Also, what is your case for using an immutable data structure? If it is
simply to prevent unintended mutations, then you might do the following:

It's for multithreaded access. It's OK for threads to have a slightly out-of-date version, but updates should be atomic. From reading the Scala literature that would seem to be the classic case where you'd use an immutable map, hence the question.

Yes, this is the classic case where using an immutable map makes sense, and dressing up a mutable data structure as immutable won't help.

1) Use a mutable map, but that map is never passed out into the world; it
is used only for the batch updates.
2) Once a batch update has been done, create an immutable map from the
mutable map, and use that for all the data access.

The map will only have a low percentage of entries modified when it is modified, so converting between immutable and mutable maps each time seemed a little clumsy.

If the updates are a small number of key->value tuples, just use ++. It will be pretty fast. If the updates are already a s.c.i.HashMap, use merge. That will be even faster. It is probably not worth it to convert a Seq of Pairs into a map just so you can use merge. If in doubt, just benchmark it.

The downside of using merge is that you have to explicitly declare your map as scala.collection.immutable.HashMap, because for whatever reason ++ does not use the fast bulk merge when both sides are HashMaps.

Nils Kilden-Pedersen

unread,

Dec 22, 2011, 9:01:47 AM12/22/11

to Alan Burlison, scala...@googlegroups.com, Ken McDonald

On Thu, Dec 22, 2011 at 4:21 AM, Alan Burlison <alan.b...@gmail.com> wrote:

On 21/12/2011 22:07, Ken McDonald wrote:

Also, what is your case for using an immutable data structure? If it is
simply to prevent unintended mutations, then you might do the following:

It's for multithreaded access. It's OK for threads to have a slightly out-of-date version, but updates should be atomic. From reading the Scala literature that would seem to be the classic case where you'd use an immutable map, hence the question.

If you have multithreaded mutation, remember to use an AtomicReference to hold the map, and do the necessary steps (CAS looping etc.).

Alan Burlison

unread,

Dec 22, 2011, 9:38:17 AM12/22/11

to Nils Kilden-Pedersen, scala...@googlegroups.com, Ken McDonald

On 22/12/2011 14:01, Nils Kilden-Pedersen wrote:

> If you have multithreaded mutation, remember to use an AtomicReference to
> hold the map, and do the necessary steps (CAS looping etc.).

The mutation will only be done by one thread, but I'll still probably
need to lock around the reads/writes of the hash reference.

--
Alan Burlison
--

Nils Kilden-Pedersen

unread,

Dec 22, 2011, 9:56:09 AM12/22/11

to Alan Burlison, scala...@googlegroups.com, Ken McDonald

Why?

√iktor Ҡlang

unread,

Dec 22, 2011, 9:57:11 AM12/22/11

to Alan Burlison, Nils Kilden-Pedersen, scala-user, Ken McDonald

Then you only need to use @volatile

Alan Burlison

unread,

Dec 22, 2011, 12:15:48 PM12/22/11

to Nils Kilden-Pedersen, scala...@googlegroups.com, Ken McDonald

On 22/12/2011 14:56, Nils Kilden-Pedersen wrote:

>> The mutation will only be done by one thread, but I'll still probably need
>> to lock around the reads/writes of the hash reference.

Caching.

--
Alan Burlison
--

Alan Burlison

unread,

Dec 22, 2011, 12:22:03 PM12/22/11

to √iktor Ҡlang, Nils Kilden-Pedersen, scala-user, Ken McDonald

On 22/12/2011 14:57, √iktor Ҡlang wrote:

> Then you only need to use @volatile

Yeah, volatile would do. I keep forgetting that they fixed that in Java5.

--
Alan Burlison
--

Alan Burlison

unread,

Dec 24, 2011, 6:22:24 AM12/24/11

to Rüdiger Klaehn, Dennis Haupt, scala...@googlegroups.com

On 22/12/2011 00:14, Rüdiger Klaehn wrote:

> Sometimes the scala collections seem a bit unfinished.
>
> I looked at the code, and it seems that map does in fact support bulk
> operations. There is a function called merge in
> scala.collection.immutable.HashMap. But it is not being used when merging
> two maps with ++.
>
> Somebody went through all the trouble of implementing a bulk merge, but
> since 99.5% of all users of scala.collection.immutable.Map will use ++ to
> merge two maps, they will never benefit from it.
>
> Instead ++ just adds all elements of the second map sequentially to the
> first map. Now, that makes sense when the second argument of ++ is not
> actually a map, but if both arguments of ++ are maps the bulk merge
> function should be used!

In practice merge doesn't seem to be all that useful. It's only defined
on immutable.HashMap. If you build up the changes you want to merge in
a mutable.HashHap there's no obvious way to convert that to a
immutable.HashMap so you can pass it to merge - the closest method is
toMap but that gives you back a Map rather than a HashMap.

I have to echo the comment above about the collections feeling
unfinished - what I'm trying to do seems like an obvious use case, but
apparently there's no way to do it.

--
Alan Burlison
--

Rüdiger Klaehn

unread,

Dec 24, 2011, 7:10:52 AM12/24/11

to Alan Burlison, Dennis Haupt, scala...@googlegroups.com

On Sat, Dec 24, 2011 at 12:22 PM, Alan Burlison <alan.b...@gmail.com> wrote:

On 22/12/2011 00:14, Rüdiger Klaehn wrote:

Sometimes the scala collections seem a bit unfinished.

I looked at the code, and it seems that map does in fact support bulk
operations. There is a function called merge in
scala.collection.immutable.HashMap. But it is not being used when merging
two maps with ++.

Somebody went through all the trouble of implementing a bulk merge, but
since 99.5% of all users of scala.collection.immutable.Map will use ++ to
merge two maps, they will never benefit from it.

Instead ++ just adds all elements of the second map sequentially to the
first map. Now, that makes sense when the second argument of ++ is not
actually a map, but if both arguments of ++ are maps the bulk merge
function should be used!

In practice merge doesn't seem to be all that useful. It's only defined on immutable.HashMap. If you build up the changes you want to merge in a mutable.HashHap there's no obvious way to convert that to a immutable.HashMap so you can pass it to merge - the closest method is toMap but that gives you back a Map rather than a HashMap.

This is just a hunch, but I think that building an immutable hash map containing changes just so you can use merge instead of adding them one by one is probably not worth it. The time you save using merge instead of ++ will instead be spent building the changes map. The only situation where this would make sense is if you want the time spent in the actual merge to be as short as possible because you have to hold a lock during the merge, but not during the building of the changes map.

I have to echo the comment above about the collections feeling unfinished - what I'm trying to do seems like an obvious use case, but apparently there's no way to do it.

Well, calling them unfinished is probably a bit harsh. They are still vastly better (more consistent, faster when using them naively) than any other collection library I am aware of. Especially the immutable data structures like Vector and Map are amazingly clever. Even if you just use them in a naive way they are very fast for immutable data structures.

They just need some more polish. In particular: you should always get the best possible algorithm if you have the same type on both sides of an operator. So e.g. x union y, when both x and y are a Set, should use a bulk merge or at least check check the sizes of both arguments and add the smaller to the larger. a ++ b should use merge if both a and b are s.c.i.HashSet. The problem is that there is such a vast number methods and traits being mixed in that it can be quite confusing for an outsider to try to add such special cases.

Alan Burlison

unread,

Dec 24, 2011, 12:16:47 PM12/24/11

to Rüdiger Klaehn, Dennis Haupt, scala...@googlegroups.com

On 24 December 2011 12:10, Rüdiger Klaehn <rkl...@googlemail.com> wrote:

> This is just a hunch, but I think that building an immutable hash map
> containing changes just so you can use merge instead of adding them one by
> one is probably not worth it. The time you save using merge instead of ++
> will instead be spent building the changes map. The only situation where
> this would make sense is if you want the time spent in the actual merge to
> be as short as possible because you have to hold a lock during the merge,
> but not during the building of the changes map.

I think you are almost certainly right. And I can build the new map
without locking and then just switch in the reference whilst holding
the lock.

--
Alan Burlison
--

Reply all

Reply to author

Forward