HashMap and/or Symbol problems when under memory stress

128 views
Skip to first unread message

Paul Butcher

unread,
Nov 23, 2012, 5:37:18 AM11/23/12
to scala-user
I spent much of last night battling against a really strange issue with HashMap and Symbol when the VM is under memory stress. I'd appreciate some help characterising it further.

I'm parsing the Moby Free Thesaurus and building a hash map from it. I've reduced the code down to the smallest version that still reproduces the problem here:


It traverses the file twice. In the first pass, it adds all the words (as Symbol instances) that it finds in the thesaurus into a hash map, and on the second it simply checks that each word is in the hash map. I would expect the second loop to never print anything, which is in fact what happens.

However, if line 24 is uncommented, now it finds that between 100 and 200 words are missing (a different set is missing each time it's run, so something non-deterministic is going on here). All line 24 does is allocate a large 2-dimensional array.

The problem goes away again if I use the alternative implementation of next (line 17), which allocates far fewer Symbols.

I see the same behaviour with 2.10.0-RC3 and 2.9.2.

I'd be very grateful to know if anyone else sees the same behaviour if they try running it. And any ideas about what might be going on?

--
paul.butcher->msgCount++

Snetterton, Castle Combe, Cadwell Park...
Who says I have a one track mind?

http://www.paulbutcher.com/
LinkedIn: http://www.linkedin.com/in/paulbutcher
MSN: pa...@paulbutcher.com
AIM: paulrabutcher
Skype: paulrabutcher

Daniel Sobral

unread,
Nov 23, 2012, 6:40:05 AM11/23/12
to Paul Butcher, scala-user
Does the problem happen if you just print nodes.size? Or, perhaps, print nodes.size twice?
--
Daniel C. Sobral

I travel to the future all the time.

Alec Zorab

unread,
Nov 23, 2012, 7:01:57 AM11/23/12
to Daniel Sobral, Paul Butcher, scala-user
running it with -Xms4096m -Xmx 4096m it works fine for me (both with line 24 commented and not). 2gb heap died with out of memory errors. I'm using a 25mb txt file called mthesaur.txt that I grabbed from project gutenberg. Scala 2.9.2, jvm 1.7.05

Paul Butcher

unread,
Nov 23, 2012, 7:31:11 AM11/23/12
to Daniel Sobral, scala-user
On 23 Nov 2012, at 11:40, Daniel Sobral <dcso...@gmail.com> wrote:

Does the problem happen if you just print nodes.size? Or, perhaps, print nodes.size twice?

Sorry - I'm not sure what you're suggesting I try, Daniel. Do you mean just print nodes.size instead of printing the missing word?

Paul Butcher

unread,
Nov 23, 2012, 7:32:23 AM11/23/12
to Alec Zorab, Daniel Sobral, scala-user
On 23 Nov 2012, at 12:01, Alec Zorab <alec...@googlemail.com> wrote:

running it with -Xms4096m -Xmx 4096m it works fine for me (both with line 24 commented and not). 2gb heap died with out of memory errors. I'm using a 25mb txt file called mthesaur.txt that I grabbed from project gutenberg. Scala 2.9.2, jvm 1.7.05

Thanks for the feedback, Alec.

Any chance you could try the exact same build.sbt and checked-in version of mthesaur from the git repo? Just in case it's something weird about how I'm running the VM?

Alec Zorab

unread,
Nov 23, 2012, 7:42:25 AM11/23/12
to Paul Butcher, Daniel Sobral, scala-user
Will try to make time this afternoon.

Paul Butcher

unread,
Nov 23, 2012, 7:42:32 AM11/23/12
to Alec Zorab, Daniel Sobral, scala-user
Yes - there's definitely sensitivity to the exact VM memory settings used here. If I try the settings that Alec mentioned, then I do still get errors - but far fewer. If I try:

-Xms8G -Xmx8G

Then I get no errors. Strangely, however, I get many errors with:

-Xmx8G

Yours, thoroughly mystified.

--
paul.butcher->msgCount++

Snetterton, Castle Combe, Cadwell Park...
Who says I have a one track mind?

http://www.paulbutcher.com/
LinkedIn: http://www.linkedin.com/in/paulbutcher
MSN: pa...@paulbutcher.com
AIM: paulrabutcher
Skype: paulrabutcher

Paul Butcher

unread,
Nov 23, 2012, 7:43:36 AM11/23/12
to Alec Zorab, Daniel Sobral, scala-user
On 23 Nov 2012, at 12:42, Alec Zorab <alec...@googlemail.com> wrote:

Will try to make time this afternoon.

Thanks, Alec - I really appreciate it.

Daniel Sobral

unread,
Nov 23, 2012, 8:33:54 AM11/23/12
to Paul Butcher, scala-user
Instead of doing this: val graph = Array.ofDim[Int](nodes.size, nodes.size)

This memory sensitivity makes me wonder if there's a weak map somewhere.

√iktor Ҡlang

unread,
Nov 23, 2012, 8:38:33 AM11/23/12
to Daniel Sobral, Paul Butcher, scala-user
./src/library/scala/Symbol.scala:  import java.util.WeakHashMap
./src/library/scala/Symbol.scala:  private val map = new WeakHashMap[K, WeakReference[V]]
--
Viktor Klang

Akka Tech Lead
Typesafe - The software stack for applications that scale

Twitter: @viktorklang

Alec Zorab

unread,
Nov 23, 2012, 8:39:17 AM11/23/12
to Paul Butcher, Daniel Sobral, scala-user
Are you just running it from sbt? I'm getting all sorts of weird stuff if I try 

line 24 commented:
javaOptions ++= Seq("-Xmx4G")//works
javaOptions ++= Seq("-Xmx4G", "-Xms4G") //works

with line 24 uncommented:
javaOptions ++= Seq("-Xmx4G")//this gives dies with an OOMError
javaOptions ++= Seq("-Xmx4G", "-Xms4G") //works

Paul Butcher

unread,
Nov 23, 2012, 8:44:30 AM11/23/12
to Alec Zorab, Daniel Sobral, scala-user
Yeah - I'm just running "sbt run" at the command line.

I'm on OSX and Java 6, so there's a considerable difference between your setup and mine. I'm halfway through installing an Unbuntu VM so that I can try various combinations there.

--
paul.butcher->msgCount++

Snetterton, Castle Combe, Cadwell Park...
Who says I have a one track mind?

http://www.paulbutcher.com/
LinkedIn: http://www.linkedin.com/in/paulbutcher
MSN: pa...@paulbutcher.com
AIM: paulrabutcher
Skype: paulrabutcher

Paul Butcher

unread,
Nov 23, 2012, 8:46:24 AM11/23/12
to Daniel Sobral, scala-user
On 23 Nov 2012, at 13:33, Daniel Sobral <dcso...@gmail.com> wrote:

Instead of doing this: val graph = Array.ofDim[Int](nodes.size, nodes.size)

Yeah - that works fine (i.e. no missing map entries).

Paul Butcher

unread,
Nov 23, 2012, 8:48:32 AM11/23/12
to √iktor Ҡlang, Daniel Sobral, scala-user
On 23 Nov 2012, at 13:38, √iktor Ҡlang <viktor...@gmail.com> wrote:

./src/library/scala/Symbol.scala:  import java.util.WeakHashMap
./src/library/scala/Symbol.scala:  private val map = new WeakHashMap[K, WeakReference[V]]

Ouch! So there's the smoking gun.

I'd love to hear opinions on whether it's reasonable to expect that Symbol behaves this way.

Alec Zorab

unread,
Nov 23, 2012, 8:49:09 AM11/23/12
to Paul Butcher, Daniel Sobral, scala-user
Right, so the weakhashmap in symbol explains the stuff falling off the end and my oom errors are caused by needing to allocate a large enough block for that array.

Or put another way: your symbols are being dropped from the Symbol cache to make room for that array and then recreated when you go through again. Therefore they're no longer the same symbol.

I think.

Paul Butcher

unread,
Nov 23, 2012, 8:54:00 AM11/23/12
to Alec Zorab, Daniel Sobral, scala-user
Many thanks for the help categorising it.

Now that I know what's up, I will see about creating a smaller reproduction (which doesn't rely on a huge thesaurus file! :-)

And then I guess we need to have a debate about whether Symbols should be unique for all time, or whether it's reasonable for them to stop being so. I know which side of that debate I will be on...

--
paul.butcher->msgCount++

Snetterton, Castle Combe, Cadwell Park...
Who says I have a one track mind?

http://www.paulbutcher.com/
LinkedIn: http://www.linkedin.com/in/paulbutcher
MSN: pa...@paulbutcher.com
AIM: paulrabutcher
Skype: paulrabutcher

Alec Zorab

unread,
Nov 23, 2012, 8:56:20 AM11/23/12
to Paul Butcher, Daniel Sobral, scala-user
Hopefully not the one that encourages enormous memory leaks?

    val nodes = new collection.mutable.HashMap[Symbol, Integer]
    var syms = List.empty[Symbol]
...
    chunks foreach { node =>
      nodes += (node -> nodes.size)
      syms ::= node
    }

probably fixes it

Paul Butcher

unread,
Nov 23, 2012, 9:06:05 AM11/23/12
to Alec Zorab, Daniel Sobral, scala-user
On 23 Nov 2012, at 13:56, Alec Zorab <alec...@googlemail.com> wrote:

Hopefully not the one that encourages enormous memory leaks?

Why would ensuring that any symbol that's actively in use remains unique for all the time that it's in use encourage enormous memory leaks?

    val nodes = new collection.mutable.HashMap[Symbol, Integer]
    var syms = List.empty[Symbol]
...
    chunks foreach { node =>
      nodes += (node -> nodes.size)
      syms ::= node
    }

probably fixes it

Unfortunately not. And I wouldn't expect it to either - the GC already knows that the symbols are still in use (because they're keys in "nodes" map). Adding them to another list doesn't make them any more in use.

Alec Zorab

unread,
Nov 23, 2012, 9:41:26 AM11/23/12
to Paul Butcher, Daniel Sobral, scala-user
You are entirely right, for some reason I'd decided that the hashmap wasn't keeping the key, despite the fact that that's stupid :P

I now don't see why the symbols get collected. You've kept a reference to them in your map, so they should be strongly reachable. I'm clearly missing an important piece of this puzzle somewhere...

Paul Butcher

unread,
Nov 23, 2012, 9:46:06 AM11/23/12
to Alec Zorab, Daniel Sobral, scala-user
On 23 Nov 2012, at 14:41, Alec Zorab <alec...@googlemail.com> wrote:

You are entirely right, for some reason I'd decided that the hashmap wasn't keeping the key, despite the fact that that's stupid :P

I now don't see why the symbols get collected. You've kept a reference to them in your map, so they should be strongly reachable. I'm clearly missing an important piece of this puzzle somewhere...

Yeah - me too. I've been trying to construct a smaller reproduction, and failing miserably. So clearly there's still something more subtle going on here than we thought...

Daniel Sobral

unread,
Nov 23, 2012, 10:03:01 AM11/23/12
to Paul Butcher, Alec Zorab, scala-user
Does replacing Symbol with String remove the problem? That would change the memory footprint, so it could _mask_ the problem, but my best bet lies that way.

Paul Butcher

unread,
Nov 23, 2012, 10:13:30 AM11/23/12
to Daniel Sobral, Alec Zorab, scala-user
On 23 Nov 2012, at 15:03, Daniel Sobral <dcso...@gmail.com> wrote:

Does replacing Symbol with String remove the problem? That would change the memory footprint, so it could _mask_ the problem, but my best bet lies that way.

Yup. And that's what I've already done in the project from which this example was extracted.

But I'm quite keen to work out what's really going on here. It consumed a lot of small-hours-of-the-morning time last night, and I'd like to save anyone else from stumbling into the same tar-pit if I can.

Paul Butcher

unread,
Nov 23, 2012, 10:22:35 AM11/23/12
to Daniel Sobral, Alec Zorab, scala-user
Another datapoint. I've finally finished installing an Ubuntu VM, and I can't reproduce it there at all (with either Java 6 or 7). So, so far, I've only reproduced this on OS X with Java 6.

Is there anyone else out there with OS X who can try this to see if it's something weird about my Mac?

--
paul.butcher->msgCount++

Snetterton, Castle Combe, Cadwell Park...
Who says I have a one track mind?

http://www.paulbutcher.com/
LinkedIn: http://www.linkedin.com/in/paulbutcher
MSN: pa...@paulbutcher.com
AIM: paulrabutcher
Skype: paulrabutcher

Alec Zorab

unread,
Nov 23, 2012, 10:32:23 AM11/23/12
to Paul Butcher, Daniel Sobral, scala-user
If not, I'll take a look when I get home tonight.

Paul Butcher

unread,
Nov 23, 2012, 2:48:23 PM11/23/12
to Alec Zorab, Daniel Sobral, scala-user
I've just pushed a simpler version which still reproduces the problem (and means that the project is now completely mis-named, as it no longer uses HashMap :-)

Looking at the code in Symbol.scala, I think that I can see how the problem might occur:


Symbol maintains a WeakHashMap[String, WeakReference[Symbol]]. The problem, I think, is that it's a map from the string that was used to create the symbol. If that string is garbage collected, then the map entry disappears, and the next time a symbol is created with that name, there will be a different instance.

But I haven't yet found a way to reproduce this which works on anything other than my Mac, reading a large file. I suspect that it's because I don't fully understand how strings are garbage collected (I believe that they're handled differently from other objects?). I'll keep trying.

--
paul.butcher->msgCount++

Snetterton, Castle Combe, Cadwell Park...
Who says I have a one track mind?

http://www.paulbutcher.com/
LinkedIn: http://www.linkedin.com/in/paulbutcher
MSN: pa...@paulbutcher.com
AIM: paulrabutcher
Skype: paulrabutcher

√iktor Ҡlang

unread,
Nov 23, 2012, 4:33:24 PM11/23/12
to Paul Butcher, Alec Zorab, Daniel Sobral, scala-user
Isn't the problem here that the name is a String and that may or may not be interned, and might as such disappear quite quickly? (It is weakly-reachable through the weak reference to the Symbol.)

Alec Zorab

unread,
Nov 23, 2012, 4:51:38 PM11/23/12
to √iktor Ҡlang, Paul Butcher, Daniel Sobral, scala-user
Except that the symbol itself has a strong reference to the string, and we have a strong reference to that symbol inside the hashmap...

Paul Butcher

unread,
Nov 23, 2012, 5:04:13 PM11/23/12
to Alec Zorab, √iktor Ҡlang, Daniel Sobral, scala-user
On 23 Nov 2012, at 21:51, Alec Zorab <alec...@googlemail.com> wrote:

Except that the symbol itself has a strong reference to the string, and we have a strong reference to that symbol inside the hashmap...

Ah. Yes - of course.

Which would explain why I've been unable to create a simple reproduction.

Nevertheless, I can still reliably reproduce this when the strings arise from reading a file. The code is very short indeed now:

import io.Source

object HashMapBug {

  def main(args: Array[String]) {
    load("mobythes.aur")
  }

  def load(fileName: String) {

    val nodes = new collection.mutable.ListBuffer[Symbol]
    
    Source.fromFile(fileName).getLines foreach { line =>
      nodes += (line.split(",") map { s => Symbol(s) }).head
    }
    
    val graph = Array.ofDim[Int](nodes.size, nodes.size)
    nodes foreach { node =>
      if (!(Symbol(node.name) eq node))
        println(node)
    }
  }
}

Here's an example of the output on my machine (the list of symbols is different each time):

pauls-macbook:hashmap-bug paul$ sbt run
Detected sbt version 0.12.0
Using /Users/paul/.sbt/0.12.0 as sbt dir, -sbt-dir to override.
[info] Set current project to hashmap-bug (in build file:/Users/paul/personal/hashmap-bug/)
[info] Running HashMapBug 
'bookdealer
'buccaneer
'chap
'clitoris
'concatenate
'conserve
'cooking
'counterproductive
'crease
'crystal ball
'culminating
'cushion
'damsel
'deafening
'defrock
'disenthrone
'dishwater
'disqualify
'drink up
'drip
'dropout
'echelon
'embossed
'equerry
'Eucharist
'excerpt
'expositor
'farm out
'fifth
'frowzy
'frustrated
'gabby
'good taste
'headdress
'hoodwink
'incertitude
'incommutable
'intensive care
'intentionally
'interdict
'jenny
'lea
'lodger
'look back
'lordliness
'marine animal
'mendicancy
'missilery
'monaural system
'moneylender
'morality
'mud flat
'muggy
'muster out
'nares
'nascent
'nerve center
'nightmare
'noncombatant
'out of character
'out of reach
'outreach
'overhanging
'palm oil
'parolee
'persons
'piazza
'plain English
'plasticity
'plebeian
'plebiscite
'posted
'pour forth
'prat
'precinct
'prevaricate
'puce
'punish
'purifier
'purposely
'putrefy
'rabbinate
'raccoon
'restrictive
'rotor
'rub down
'sapphire
'scathe
'scram
'scrap
'shamrock
'shoran
'shotgun
'sizzle
'sledgehammer
'smitten
'smooth sailing
'sniff
'social outcast
'sooner
'sparse
'specious reasoning
'streetcar line
'subduer
'surgical
'Tartuffe
'tear apart
'throwaway
'topknot
'treasure trove
'tyrant
'ugliness
'uncork
'unobtainable
'unsexed
'untalkative
'urbanite
'variety
'view
'vine
'wash down
'wavering
'well-planned
'whoosh
'womanish
'womanize
'wooded
'worldly
'yip
[success] Total time: 6 s, completed Nov 23, 2012 10:00:47 PM

I'd love to know if anyone else can reproduce this at all?

--
paul.butcher->msgCount++

Snetterton, Castle Combe, Cadwell Park...
Who says I have a one track mind?

http://www.paulbutcher.com/
LinkedIn: http://www.linkedin.com/in/paulbutcher
MSN: pa...@paulbutcher.com
AIM: paulrabutcher
Skype: paulrabutcher

Paul Butcher

unread,
Nov 23, 2012, 7:18:54 PM11/23/12
to Alec Zorab, √iktor Ҡlang, Daniel Sobral, scala-user
Cracked it!

Here's a REPL session which illustrates the problem (now that I know what's going on, I can reproduce it in the REPL in both OSX and Linux, and with Java6 and 7):

Welcome to Scala version 2.10.0-RC3 (OpenJDK 64-Bit Server VM, Java 1.7.0_09).
Type in expressions to have them evaluated.
Type :help for more information.

scala> var name = "foo" + 1
name: String = foo1

scala> var s1 = Symbol(name)
s1: Symbol = 'foo1

scala> s1 = null
s1: Symbol = null

scala> System.gc

scala> val s2 = Symbol("foo1")
s2: Symbol = 'foo1

scala> name = null
name: String = null

scala> System.gc

scala> val s3 = Symbol("foo1")
s3: Symbol = 'foo1

scala> s2 eq s3
res2: Boolean = false

The problem arises if the symbol object for a particular name is garbage collected (so the weak reference stored in the weak hash map is null), but the string for that name has not been garbage collected (so the key still exists in the weak hash map). If a new symbol object with the same name is added, the weak hash map key will be a different object from the name field of the symbol object. If that string is subsequently garbage collected, the weak hash map entry disappears, so if the symbol is then created for a third time, it will end up being a different object from the second one.

Phew!

I'll create a bug :-)

Paul Butcher

unread,
Nov 23, 2012, 7:31:02 PM11/23/12
to Alec Zorab, √iktor Ҡlang, Daniel Sobral, scala-user
On 24 Nov 2012, at 00:18, Paul Butcher <pa...@paulbutcher.com> wrote:

I'll create a bug :-)

Daniel Sobral

unread,
Nov 23, 2012, 8:51:40 PM11/23/12
to Paul Butcher, Alec Zorab, √iktor Ҡlang, scala-user
Brilliant detective work! It always bother me how obvious some of the toughest bugs look after you figure them out (or had them explained to you :).
Reply all
Reply to author
Forward
0 new messages