Why does this Iterator cause a stack overflow?

1,087 views
Skip to first unread message

Paul Butcher

unread,
Nov 19, 2012, 6:54:45 AM11/19/12
to scala-user
I'm writing code to parse a very simple text file. The problem is that it's large - too large to want to keep the entire file in RAM (several GB). The file is a series of lines, arranged into sections divided by blank lines:

first line of first section
second line of second section

first line of second section

The code that I've come up with to parse this is:

object LinkGraph {

  

  def sections(lines: Iterator[String]) = new Iterator[Iterator[String]] {
    var remaining = lines
    def hasNext = remaining.hasNext
    def next() = {
      val (section, rest) = remaining.span(!_.isEmpty)
      remaining = rest.dropWhile(_.isEmpty)
      section
    }
  }

  def load(lines: Iterator[String]) {
    sections(lines) foreach { section =>
      // HANDLE SECTION
    }
  }
}

It's being called like this:

LinkGraph.load(Source.fromFile(args(0)).getLines)

There are two problems - first, it's *dreadfully* slow. And second, it fails with the following stack overflow after around 8000 sections:

java.lang.StackOverflowError
at sun.nio.cs.SingleByteDecoder.decodeArrayLoop(SingleByteDecoder.java:56)
at sun.nio.cs.SingleByteDecoder.decodeLoop(SingleByteDecoder.java:83)
at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:544)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:298)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:158)
at java.io.InputStreamReader.read(InputStreamReader.java:167)
at java.io.BufferedReader.fill(BufferedReader.java:136)
at java.io.BufferedReader.readLine(BufferedReader.java:299)
at java.io.BufferedReader.readLine(BufferedReader.java:362)
at scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:67)
at scala.collection.Iterator$$anon$2.hasNext(Iterator.scala:892)
at scala.collection.Iterator$$anon$26.hasNext(Iterator.scala:642)
at scala.collection.Iterator$$anon$2.hasNext(Iterator.scala:892)
at scala.collection.Iterator$$anon$27.hasNext(Iterator.scala:666)

(the last four lines repeat many times).

I'm not sure that I understand why this code is going recursive at all? Clearly I'm missing something - I'd appreciate guidance as to what it is.

Thanks in advance,

--
paul.butcher->msgCount++

Snetterton, Castle Combe, Cadwell Park...
Who says I have a one track mind?

http://www.paulbutcher.com/
LinkedIn: http://www.linkedin.com/in/paulbutcher
MSN: pa...@paulbutcher.com
AIM: paulrabutcher
Skype: paulrabutcher

Daniel Sobral

unread,
Nov 19, 2012, 6:58:28 AM11/19/12
to Paul Butcher, scala-user
Is there a reason why you can't use grouped instead of the "sections" method you implemented?
--
Daniel C. Sobral

I travel to the future all the time.

Paul Butcher

unread,
Nov 19, 2012, 7:08:24 AM11/19/12
to Daniel Sobral, scala-user
On 19 Nov 2012, at 11:58, Daniel Sobral <dcso...@gmail.com> wrote:

Is there a reason why you can't use grouped instead of the "sections" method you implemented?

grouped creates fixed-size blocks, I think? Whereas I need to find the blank lines and divide where they are.

Daniel Sobral

unread,
Nov 19, 2012, 7:26:08 AM11/19/12
to Paul Butcher, scala-user
On Mon, Nov 19, 2012 at 9:54 AM, Paul Butcher <pa...@paulbutcher.com> wrote:
I'm writing code to parse a very simple text file. The problem is that it's large - too large to want to keep the entire file in RAM (several GB). The file is a series of lines, arranged into sections divided by blank lines:

first line of first section
second line of second section

first line of second section

The code that I've come up with to parse this is:

object LinkGraph {

  

  def sections(lines: Iterator[String]) = new Iterator[Iterator[String]] {
    var remaining = lines
    def hasNext = remaining.hasNext
    def next() = {
      val (section, rest) = remaining.span(!_.isEmpty)

I don't know exactly what the problem is, but this is most likely the cause. The method "span" will cache all lines that comprise "section", so you might as well have read them all into memory. I didn't see any obvious non-tail recursion in it, but maybe there's one.

As a rule, I'd avoid any iterator method that splits the iterator in two.

Dennis Haupt

unread,
Nov 19, 2012, 7:30:13 AM11/19/12
to Daniel Sobral, pa...@paulbutcher.com, scala...@googlegroups.com
try

while (iterator.nonEmpty) {
val group = iterator.takeWhile(_.nonEmpty)
iterator.takeWhile(_.empty)//eat empty lines
}

-------- Original-Nachricht --------
> Datum: Mon, 19 Nov 2012 10:26:08 -0200
> Von: Daniel Sobral <dcso...@gmail.com>
> An: Paul Butcher <pa...@paulbutcher.com>
> CC: scala-user <scala...@googlegroups.com>
> Betreff: Re: [scala-user] Why does this Iterator cause a stack overflow?

> On Mon, Nov 19, 2012 at 9:54 AM, Paul Butcher <pa...@paulbutcher.com>
> wrote:
>
> > I'm writing code to parse a very simple text file. The problem is that
> > it's large - too large to want to keep the entire file in RAM (several
> GB).
> > The file is a series of lines, arranged into sections divided by blank
> > lines:
> >
> > first line of first section
> > second line of second section
> >
> > first line of second section
> > …
> >
> >
> > The code that I've come up with to parse this is:
> >
> > *object* LinkGraph {
> >
> >
> > *def* sections(lines: Iterator[String]) =
> *new*Iterator[Iterator[String]] {
> > *var* remaining = lines
> > *def* hasNext = remaining.hasNext
> > *def* next() = {
> > *val* (section, rest) = remaining.span(!_.isEmpty)
> >
> >
> I don't know exactly what the problem is, but this is most likely the
> cause. The method "span" will cache all lines that comprise "section", so
> you might as well have read them all into memory. I didn't see any obvious
> non-tail recursion in it, but maybe there's one.
>
> As a rule, I'd avoid any iterator method that splits the iterator in two.
>
>
> > remaining = rest.dropWhile(_.isEmpty)
> > section
> > }
> > }
> >
> > *def* load(lines: Iterator[String]) {

Daniel Sobral

unread,
Nov 19, 2012, 7:51:28 AM11/19/12
to Dennis Haupt, Paul Butcher, scala-user
On Mon, Nov 19, 2012 at 10:30 AM, Dennis Haupt <h-s...@gmx.de> wrote:
try

while (iterator.nonEmpty) {
val group = iterator.takeWhile(_.nonEmpty)
iterator.takeWhile(_.empty)//eat empty lines

That code won't work. You cannot use "iterator" again after "takeWhile".

Dennis Haupt

unread,
Nov 19, 2012, 7:56:31 AM11/19/12
to Daniel Sobral, scala...@googlegroups.com, pa...@paulbutcher.com
i just tested - you *can*. orgIt.takewhile(..).toList advances the original iterator (the next call to next will return the element that "takewhile" didn't take), it doesn't break it. the original can still be used. at least it works like that in 2.9.1

-------- Original-Nachricht --------
> Datum: Mon, 19 Nov 2012 10:51:28 -0200
> Von: Daniel Sobral <dcso...@gmail.com>
> An: Dennis Haupt <h-s...@gmx.de>
> CC: Paul Butcher <pa...@paulbutcher.com>, scala-user <scala...@googlegroups.com>

Daniel Sobral

unread,
Nov 19, 2012, 8:41:29 AM11/19/12
to Dennis Haupt, scala-user, Paul Butcher
On Mon, Nov 19, 2012 at 10:56 AM, Dennis Haupt <h-s...@gmx.de> wrote:
i just tested - you *can*. orgIt.takewhile(..).toList advances the original iterator (the next call to next will return the element that "takewhile" didn't take), it doesn't break it. the original can still be used. at least it works like that in 2.9.1

That's called a coincidence. An Iterator has two methods: next and hasNext. Calling next will change the iterator, so it points to the next element. All methods on Iterator are based on these two, so any time an iterator method returns another iterator, the original iterator will be *used* by that method, changing the original iterator.

The implementation of the method certainly has a predictable behavior, but it's not a guaranteed behavior. It may change at will, and it might differ from Iterator to Iterator -- the Iterator returned by Source getLines() is different than the iterator returned by Source.fromFile.

I'll end by quoting myself from Iterator's takeWhile Scaladoc:

Note

Reuse: After calling this method, one should discard the iterator it was called on, and use only the iterator that was returned. Using the old iterator is undefined, subject to change, and may result in changes to the new iterator as well.

Paul Butcher

unread,
Nov 19, 2012, 9:07:34 AM11/19/12
to Daniel Sobral, Dennis Haupt, scala-user
So does anyone have any idea how I *can* write this? 

I'm tempted to report this as a bug - if span is holding on to lines that are no longer in use, that has to be a bug, surely?

--
paul.butcher->msgCount++

Snetterton, Castle Combe, Cadwell Park...
Who says I have a one track mind?

http://www.paulbutcher.com/
LinkedIn: http://www.linkedin.com/in/paulbutcher
MSN: pa...@paulbutcher.com
AIM: paulrabutcher
Skype: paulrabutcher

Paul Butcher

unread,
Nov 19, 2012, 9:24:51 AM11/19/12
to Daniel Sobral, Dennis Haupt, scala-user
To answer my own question - this works (and is blazing fast - 20 seconds to parse a 2.2G file):

  def sections(lines: Iterator[String]) = new Iterator[Seq[String]] {
    def hasNext = lines.hasNext
    def next() = {
      val section = new ArrayBuffer[String]
      breakable {
        while (lines.hasNext) {
          val line = lines.next
          if (line.isEmpty)
            break
          else
            section += line
        }
      }
      section
    }
  }

But eugh! It's like I've died and gone to Java hell.

Given that the whole *point* of Iterator is that it provides useful tools like span, one would imagine that they should work without causing the kind of problem I'm seeing?

--
paul.butcher->msgCount++

Snetterton, Castle Combe, Cadwell Park...
Who says I have a one track mind?

http://www.paulbutcher.com/
LinkedIn: http://www.linkedin.com/in/paulbutcher
MSN: pa...@paulbutcher.com
AIM: paulrabutcher
Skype: paulrabutcher

Dennis Haupt

unread,
Nov 19, 2012, 9:37:52 AM11/19/12
to Paul Butcher, dcso...@gmail.com, scala...@googlegroups.com
extract the algorithm and pass in the method body as a parameter (foreach-style)

the you have all the advantages of ugly code inside (fast & hidden) and the advantages of a nice api from the outside. you can also create a traversable on top of it and go "purely" functional. or make it look like that.

-------- Original-Nachricht --------
> Datum: Mon, 19 Nov 2012 14:24:51 +0000
> Von: Paul Butcher <pa...@paulbutcher.com>
> An: Daniel Sobral <dcso...@gmail.com>
> CC: Dennis Haupt <h-s...@gmx.de>, scala-user <scala...@googlegroups.com>

Daniel Sobral

unread,
Nov 19, 2012, 10:28:10 AM11/19/12
to Paul Butcher, Dennis Haupt, scala-user
On Mon, Nov 19, 2012 at 12:07 PM, Paul Butcher <pa...@paulbutcher.com> wrote:
So does anyone have any idea how I *can* write this? 

I'm tempted to report this as a bug - if span is holding on to lines that are no longer in use, that has to be a bug, surely?

It really keep the lines past their use, as far as I can see. However, the trailing iterator does keep a pointer to the leading iterator, which can lead to a leak in one particular situation -- one I think your code might be doing.

Since you have many sections, I presume you call "span" on the trailing iterator last returned? That is:

val (leader, trailer) = iterator.span(_.nonEmpty)
...
val (leader2, trailer2) = trailer.span(_.nonEmpty)
...
val (leader3, trailer3) = trailer2.span(_.nonEmpty)
...
etc

Naturally, you use a loop or recursion instead of the above. The problem is that since "trailer" points to "leader", and "trailer2" points to "leader2", and, of course, "leader2" points to "trailer" and so on, then all of them will be kept in memory. And if you do not *empty* the leaders, then all of their lines *will* be kept in memory, just in case you one day call next it.

But, given that you say your file is very big, it might just be a matter of each of these trailers calling "next" on the previous trailer. Say you have 1000 sections, then you'll have a call stack of 1000 "next" calls (or hasNext), which fits the stack trace.

Maybe there's a way to implement span without that problem, but I don't see one.

Paul Butcher

unread,
Nov 19, 2012, 11:39:43 AM11/19/12
to Daniel Sobral, Dennis Haupt, scala-user
I'm sorry - I'm clearly still missing something. Why would trailer need to keep a pointer to leader? Or vice-versa? The two iterators returned from span are logically distinct, aren't they?

--
paul.butcher->msgCount++

Snetterton, Castle Combe, Cadwell Park...
Who says I have a one track mind?

http://www.paulbutcher.com/
LinkedIn: http://www.linkedin.com/in/paulbutcher
MSN: pa...@paulbutcher.com
AIM: paulrabutcher
Skype: paulrabutcher

Daniel Sobral

unread,
Nov 19, 2012, 11:56:58 AM11/19/12
to Paul Butcher, Dennis Haupt, scala-user
On Mon, Nov 19, 2012 at 2:39 PM, Paul Butcher <pa...@paulbutcher.com> wrote:
I'm sorry - I'm clearly still missing something. Why would trailer need to keep a pointer to leader? Or vice-versa? The two iterators returned from span are logically distinct, aren't they?

Well, consider how you implemented it: you read *all* of the pre-span part into memory, and kept the original iterator pointer.

So, here's how span is implemented (though you can just click on source link the scaladoc and see for yourself): nothing is read before required. If you iterate first through leader then through trailer, no line will ever be buffered in memory. However, if you iterate through trailer at any point in time before finishing the leader iteration, it buffers all the lines remaining on leader, so that it can use the original iterator to get the lines for trailer.

In practice, when you get a line from trailer, it first checks that leader is empty, and, if not, calls a method on it to make it buffer its elements. But to do this, it has to keep a pointer to it.

However, as I said, that is not the problem you are having. It's a *heap* memory leak, not a stack problem. The problem you are having is that the trailing iterator is essential this:

val trailer = new Iterator {
  def hasNext = { specialStuff; self.hasNext } // self is the "this" for the span method
  def next() = { specialStuff; self.next() }
}

So calling "hasNext" on that trailer will cause it to call "hasNext" on the original object. As you apply "span" over the results of previous "span" calls, it accumulates.

I bet something like -Xss:2M will avoid the stack overflow -- jvm's default stack is rather small for functional languages. On the other hand, it will not help with the performance issue.

Rex Kerr

unread,
Nov 19, 2012, 11:59:12 AM11/19/12
to Daniel Sobral, Paul Butcher, scala-user
On Mon, Nov 19, 2012 at 7:26 AM, Daniel Sobral <dcso...@gmail.com> wrote:
On Mon, Nov 19, 2012 at 9:54 AM, Paul Butcher <pa...@paulbutcher.com> wrote:
    def next() = {
      val (section, rest) = remaining.span(!_.isEmpty)

I don't know exactly what the problem is, but this is most likely the cause. The method "span" will cache all lines that comprise "section", so you might as well have read them all into memory. I didn't see any obvious non-tail recursion in it, but maybe there's one.

There are two problems - first, it's *dreadfully* slow. And second, it fails with the following stack overflow after around 8000 sections:

Both symptoms are caused by the same thing, and it's a natural extension of how span is defined.

When you span, you get an two new iterators which each point to the previous iterator.  The second of the two, in particular, defers to the original iterators' hasNext and next with a little extra logic to make sure it's already advanced to the appropriate point.

But now if you have many blocks, you get one level of indirection per block.  Eventually, you stack overflow (but not until you waste enormous amounts of time with linearly increasingly deep stack traversals to get to that original iterator).

To fix, you must come up with your own span that, when it is spanned itself, passes the original forwards instead of the nearest one.

  --Rex

Rex Kerr

unread,
Nov 19, 2012, 12:01:44 PM11/19/12
to Daniel Sobral, Paul Butcher, Dennis Haupt, scala-user
On Mon, Nov 19, 2012 at 11:56 AM, Daniel Sobral <dcso...@gmail.com> wrote:
However, as I said, that is not the problem you are having. It's a *heap* memory leak, not a stack problem.

Nope, stack gets you first, as the exception says.  See the other message I just sent.

  --Rex

Paul Butcher

unread,
Nov 19, 2012, 12:09:42 PM11/19/12
to Rex Kerr, Daniel Sobral, scala-user
On 19 Nov 2012, at 16:59, Rex Kerr <ich...@gmail.com> wrote:

Both symptoms are caused by the same thing, and it's a natural extension of how span is defined.

Ah - OK - got it now. Thanks.

So the "Java-style" implementation I sent earlier probably is the right solution. I guess that, given that iterators are fundamentally side-effecting, it's no surprise that imperative code is the best fit.

Daniel Sobral

unread,
Nov 19, 2012, 12:09:35 PM11/19/12
to Rex Kerr, Paul Butcher, Dennis Haupt, scala-user
I meant that the memory leak is a heap problem, not a stack problem, and, therefore, it is not the problem he is having.
 

  --Rex

Daniel Sobral

unread,
Nov 19, 2012, 12:11:06 PM11/19/12
to Paul Butcher, Rex Kerr, scala-user
On Mon, Nov 19, 2012 at 3:09 PM, Paul Butcher <pa...@paulbutcher.com> wrote:
On 19 Nov 2012, at 16:59, Rex Kerr <ich...@gmail.com> wrote:

Both symptoms are caused by the same thing, and it's a natural extension of how span is defined.

Ah - OK - got it now. Thanks.

So the "Java-style" implementation I sent earlier probably is the right solution. I guess that, given that iterators are fundamentally side-effecting, it's no surprise that imperative code is the best fit.

I have found that iterators are most powerful when you extend Iterator class yourself.

 

--
paul.butcher->msgCount++

Snetterton, Castle Combe, Cadwell Park...
Who says I have a one track mind?

http://www.paulbutcher.com/
LinkedIn: http://www.linkedin.com/in/paulbutcher
MSN: pa...@paulbutcher.com
AIM: paulrabutcher
Skype: paulrabutcher

Rex Kerr

unread,
Nov 19, 2012, 12:19:22 PM11/19/12
to Daniel Sobral, Paul Butcher, Dennis Haupt, scala-user
Ah, okay--sorry, didn't read carefully enough.
  --Rex

HamsterofDeath

unread,
Nov 19, 2012, 12:52:48 PM11/19/12
to scala...@googlegroups.com
i know, but it won't chance if you don't chance your scala version^^


Am 19.11.2012 14:41, schrieb Daniel Sobral:
On Mon, Nov 19, 2012 at 10:56 AM, Dennis Haupt <h-s...@gmx.de> wrote:
i just tested - you *can*. orgIt.takewhile(..).toList advances the original iterator (the next call to next will return the element that "takewhile" didn't take), it doesn't break it. the original can still be used. at least it works like that in 2.9.1

That's called a coincidence. An Iterator has two methods: next and hasNext. Calling next will change the iterator, so it points to the next element. All methods on Iterator are based on these two, so any time an iterator method returns another iterator, the original iterator will be *used* by that method, changing the original iterator.

The implementation of the method certainly has a predictable behavior, but it's not a guaranteed behavior. It may change at will, and it might differ from Iterator to Iterator -- the Iterator returned by Source getLines() is different than the iterator returned by Source.fromFile.

I'll end by quoting myself from Iterator's takeWhile Scaladoc:

Note

Reuse: After calling this method, one should discard the iterator it was called on, and use only the iterator that was returned. Using the old iterator is undefined, subject to change, and may result in changes to the new iterator as well.


�
> > > > �

> > > >
> > > >
> > > > The code that I've come up with to parse this is:
> > > >
> > > > *object* LinkGraph {
> > > >
> > > >
> > > > � *def* sections(lines: Iterator[String]) =
> > > *new*Iterator[Iterator[String]] {
> > > > � � *var* remaining = lines
> > > > � � *def* hasNext = remaining.hasNext
> > > > � � *def* next() = {
> > > > � � � *val* (section, rest) = remaining.span(!_.isEmpty)

> > > >
> > > >
> > > I don't know exactly what the problem is, but this is most likely the
> > > cause. The method "span" will cache all lines that comprise "section",
> so
> > > you might as well have read them all into memory. I didn't see any
> > obvious
> > > non-tail recursion in it, but maybe there's one.
> > >
> > > As a rule, I'd avoid any iterator method that splits the iterator in
> two.
> > >
> > >
> > > > � � � remaining = rest.dropWhile(_.isEmpty)
> > > > � � � section
> > > > � � }
> > > > � }
> > > >
> > > > � *def* load(lines: Iterator[String]) {
> > > > � � sections(lines) foreach { section =>
> > > > � � � // HANDLE SECTION
> > > > � � }
> > > > � }

> > > > }
> > > >
> > > >
> > > > It's being called like this:
> > > >
> > > > LinkGraph.load(Source.fromFile(args(0)).getLines)
> > > >
> > > >
> > > > There are two problems - first, it's *dreadfully* slow. And second,
> it
> > > > fails with the following stack overflow after around 8000 sections:
> > > >
> > > > java.lang.StackOverflowError
> > > > at
> > >
> sun.nio.cs.SingleByteDecoder.decodeArrayLoop(SingleByteDecoder.java:56)
> > > > �at

> sun.nio.cs.SingleByteDecoder.decodeLoop(SingleByteDecoder.java:83)
> > > > at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:544)
> > > > �at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:298)
> > > > at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:158)
> > > > �at java.io.InputStreamReader.read(InputStreamReader.java:167)
> > > > at java.io.BufferedReader.fill(BufferedReader.java:136)
> > > > �at java.io.BufferedReader.readLine(BufferedReader.java:299)
> > > > at java.io.BufferedReader.readLine(BufferedReader.java:362)
> > > > �at

> > > >
> > >
> >
> scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:67)
> > > > at scala.collection.Iterator$$anon$2.hasNext(Iterator.scala:892)
> > > > �at scala.collection.Iterator$$anon$26.hasNext(Iterator.scala:642)
> > > > at scala.collection.Iterator$$anon$2.hasNext(Iterator.scala:892)
> > > > �at scala.collection.Iterator$$anon$27.hasNext(Iterator.scala:666)
Reply all
Reply to author
Forward
0 new messages