Why does this Iterator cause a stack overflow?

Paul Butcher

unread,

Nov 19, 2012, 6:54:45 AM11/19/12

to scala-user

I'm writing code to parse a very simple text file. The problem is that it's large - too large to want to keep the entire file in RAM (several GB). The file is a series of lines, arranged into sections divided by blank lines:

first line of first section
second line of second section

first line of second section
…

The code that I've come up with to parse this is:

object LinkGraph {

def sections(lines: Iterator[String]) = new Iterator[Iterator[String]] {
var remaining = lines
def hasNext = remaining.hasNext
def next() = {
val (section, rest) = remaining.span(!_.isEmpty)
remaining = rest.dropWhile(_.isEmpty)
section
}
}

def load(lines: Iterator[String]) {
  sections(lines) foreach { section =>
  // HANDLE SECTION
  }
}
}

It's being called like this:

LinkGraph.load(Source.fromFile(args(0)).getLines)

There are two problems - first, it's *dreadfully* slow. And second, it fails with the following stack overflow after around 8000 sections:

java.lang.StackOverflowError
at sun.nio.cs.SingleByteDecoder.decodeArrayLoop(SingleByteDecoder.java:56)
at sun.nio.cs.SingleByteDecoder.decodeLoop(SingleByteDecoder.java:83)
at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:544)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:298)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:158)
at java.io.InputStreamReader.read(InputStreamReader.java:167)
at java.io.BufferedReader.fill(BufferedReader.java:136)
at java.io.BufferedReader.readLine(BufferedReader.java:299)
at java.io.BufferedReader.readLine(BufferedReader.java:362)
at scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:67)
at scala.collection.Iterator$$anon$2.hasNext(Iterator.scala:892)
at scala.collection.Iterator$$anon$26.hasNext(Iterator.scala:642)
at scala.collection.Iterator$$anon$2.hasNext(Iterator.scala:892)
at scala.collection.Iterator$$anon$27.hasNext(Iterator.scala:666)

(the last four lines repeat many times).

I'm not sure that I understand why this code is going recursive at all? Clearly I'm missing something - I'd appreciate guidance as to what it is.

Thanks in advance,

--
paul.butcher->msgCount++

Snetterton, Castle Combe, Cadwell Park...
Who says I have a one track mind?

http://www.paulbutcher.com/
LinkedIn: http://www.linkedin.com/in/paulbutcher
MSN: pa...@paulbutcher.com
AIM: paulrabutcher
Skype: paulrabutcher

Daniel Sobral

unread,

Nov 19, 2012, 6:58:28 AM11/19/12

to Paul Butcher, scala-user

Is there a reason why you can't use grouped instead of the "sections" method you implemented?

--
Daniel C. Sobral

I travel to the future all the time.

Paul Butcher

unread,

Nov 19, 2012, 7:08:24 AM11/19/12

to Daniel Sobral, scala-user

On 19 Nov 2012, at 11:58, Daniel Sobral <dcso...@gmail.com> wrote:

Is there a reason why you can't use grouped instead of the "sections" method you implemented?

grouped creates fixed-size blocks, I think? Whereas I need to find the blank lines and divide where they are.

Daniel Sobral

unread,

Nov 19, 2012, 7:26:08 AM11/19/12

to Paul Butcher, scala-user

On Mon, Nov 19, 2012 at 9:54 AM, Paul Butcher <pa...@paulbutcher.com> wrote:

I'm writing code to parse a very simple text file. The problem is that it's large - too large to want to keep the entire file in RAM (several GB). The file is a series of lines, arranged into sections divided by blank lines:

first line of first section
second line of second section

first line of second section
…

The code that I've come up with to parse this is:

object LinkGraph {

def sections(lines: Iterator[String]) = new Iterator[Iterator[String]] {
var remaining = lines
def hasNext = remaining.hasNext

def next() = {
val (section, rest) = remaining.span(!_.isEmpty)

I don't know exactly what the problem is, but this is most likely the cause. The method "span" will cache all lines that comprise "section", so you might as well have read them all into memory. I didn't see any obvious non-tail recursion in it, but maybe there's one.

As a rule, I'd avoid any iterator method that splits the iterator in two.

Dennis Haupt

unread,

Nov 19, 2012, 7:30:13 AM11/19/12

to Daniel Sobral, pa...@paulbutcher.com, scala...@googlegroups.com

try

while (iterator.nonEmpty) {
val group = iterator.takeWhile(_.nonEmpty)
iterator.takeWhile(_.empty)//eat empty lines
}

-------- Original-Nachricht --------
> Datum: Mon, 19 Nov 2012 10:26:08 -0200
> Von: Daniel Sobral <dcso...@gmail.com>
> An: Paul Butcher <pa...@paulbutcher.com>
> CC: scala-user <scala...@googlegroups.com>
> Betreff: Re: [scala-user] Why does this Iterator cause a stack overflow?

> On Mon, Nov 19, 2012 at 9:54 AM, Paul Butcher <pa...@paulbutcher.com>
> wrote:
>
> > I'm writing code to parse a very simple text file. The problem is that
> > it's large - too large to want to keep the entire file in RAM (several
> GB).
> > The file is a series of lines, arranged into sections divided by blank
> > lines:
> >
> > first line of first section
> > second line of second section
> >
> > first line of second section
> > …
> >
> >
> > The code that I've come up with to parse this is:
> >

> > *object* LinkGraph {
> >
> >
> > *def* sections(lines: Iterator[String]) =
> *new*Iterator[Iterator[String]] {
> > *var* remaining = lines
> > *def* hasNext = remaining.hasNext
> > *def* next() = {
> > *val* (section, rest) = remaining.span(!_.isEmpty)

> >
> >
> I don't know exactly what the problem is, but this is most likely the
> cause. The method "span" will cache all lines that comprise "section", so
> you might as well have read them all into memory. I didn't see any obvious
> non-tail recursion in it, but maybe there's one.
>
> As a rule, I'd avoid any iterator method that splits the iterator in two.
>
>
> > remaining = rest.dropWhile(_.isEmpty)
> > section
> > }
> > }
> >

> > *def* load(lines: Iterator[String]) {

Daniel Sobral

unread,

Nov 19, 2012, 7:51:28 AM11/19/12

to Dennis Haupt, Paul Butcher, scala-user

On Mon, Nov 19, 2012 at 10:30 AM, Dennis Haupt <h-s...@gmx.de> wrote:

try

while (iterator.nonEmpty) {
val group = iterator.takeWhile(_.nonEmpty)
iterator.takeWhile(_.empty)//eat empty lines

That code won't work. You cannot use "iterator" again after "takeWhile".

Dennis Haupt

unread,

Nov 19, 2012, 7:56:31 AM11/19/12

to Daniel Sobral, scala...@googlegroups.com, pa...@paulbutcher.com

i just tested - you *can*. orgIt.takewhile(..).toList advances the original iterator (the next call to next will return the element that "takewhile" didn't take), it doesn't break it. the original can still be used. at least it works like that in 2.9.1

-------- Original-Nachricht --------
> Datum: Mon, 19 Nov 2012 10:51:28 -0200
> Von: Daniel Sobral <dcso...@gmail.com>
> An: Dennis Haupt <h-s...@gmx.de>
> CC: Paul Butcher <pa...@paulbutcher.com>, scala-user <scala...@googlegroups.com>

Daniel Sobral

unread,

Nov 19, 2012, 8:41:29 AM11/19/12

to Dennis Haupt, scala-user, Paul Butcher

On Mon, Nov 19, 2012 at 10:56 AM, Dennis Haupt <h-s...@gmx.de> wrote:

i just tested - you *can*. orgIt.takewhile(..).toList advances the original iterator (the next call to next will return the element that "takewhile" didn't take), it doesn't break it. the original can still be used. at least it works like that in 2.9.1

That's called a coincidence. An Iterator has two methods: next and hasNext. Calling next will change the iterator, so it points to the next element. All methods on Iterator are based on these two, so any time an iterator method returns another iterator, the original iterator will be *used* by that method, changing the original iterator.

The implementation of the method certainly has a predictable behavior, but it's not a guaranteed behavior. It may change at will, and it might differ from Iterator to Iterator -- the Iterator returned by Source getLines() is different than the iterator returned by Source.fromFile.

I'll end by quoting myself from Iterator's takeWhile Scaladoc:

Note: Reuse: After calling this method, one should discard the iterator it was called on, and use only the iterator that was returned. Using the old iterator is undefined, subject to change, and may result in changes to the new iterator as well.

Paul Butcher

unread,

Nov 19, 2012, 9:07:34 AM11/19/12

to Daniel Sobral, Dennis Haupt, scala-user

So does anyone have any idea how I *can* write this?

I'm tempted to report this as a bug - if span is holding on to lines that are no longer in use, that has to be a bug, surely?

--
paul.butcher->msgCount++

Snetterton, Castle Combe, Cadwell Park...
Who says I have a one track mind?

http://www.paulbutcher.com/
LinkedIn: http://www.linkedin.com/in/paulbutcher
MSN: pa...@paulbutcher.com
AIM: paulrabutcher
Skype: paulrabutcher

Paul Butcher

unread,

Nov 19, 2012, 9:24:51 AM11/19/12

to Daniel Sobral, Dennis Haupt, scala-user

To answer my own question - this works (and is blazing fast - 20 seconds to parse a 2.2G file):

def sections(lines: Iterator[String]) = new Iterator[Seq[String]] {
def hasNext = lines.hasNext
def next() = {
val section = new ArrayBuffer[String]
breakable {
while (lines.hasNext) {
val line = lines.next
if (line.isEmpty)
break
else
section += line
}
}
section
}
}

But eugh! It's like I've died and gone to Java hell.

Given that the whole *point* of Iterator is that it provides useful tools like span, one would imagine that they should work without causing the kind of problem I'm seeing?

--
paul.butcher->msgCount++

Snetterton, Castle Combe, Cadwell Park...
Who says I have a one track mind?

http://www.paulbutcher.com/
LinkedIn: http://www.linkedin.com/in/paulbutcher
MSN: pa...@paulbutcher.com
AIM: paulrabutcher
Skype: paulrabutcher

Dennis Haupt

unread,

Nov 19, 2012, 9:37:52 AM11/19/12

to Paul Butcher, dcso...@gmail.com, scala...@googlegroups.com

extract the algorithm and pass in the method body as a parameter (foreach-style)

the you have all the advantages of ugly code inside (fast & hidden) and the advantages of a nice api from the outside. you can also create a traversable on top of it and go "purely" functional. or make it look like that.

-------- Original-Nachricht --------
> Datum: Mon, 19 Nov 2012 14:24:51 +0000
> Von: Paul Butcher <pa...@paulbutcher.com>
> An: Daniel Sobral <dcso...@gmail.com>
> CC: Dennis Haupt <h-s...@gmx.de>, scala-user <scala...@googlegroups.com>

Daniel Sobral

unread,

Nov 19, 2012, 10:28:10 AM11/19/12

to Paul Butcher, Dennis Haupt, scala-user

On Mon, Nov 19, 2012 at 12:07 PM, Paul Butcher <pa...@paulbutcher.com> wrote:

So does anyone have any idea how I *can* write this?

I'm tempted to report this as a bug - if span is holding on to lines that are no longer in use, that has to be a bug, surely?

It really keep the lines past their use, as far as I can see. However, the trailing iterator does keep a pointer to the leading iterator, which can lead to a leak in one particular situation -- one I think your code might be doing.

Since you have many sections, I presume you call "span" on the trailing iterator last returned? That is:

val (leader, trailer) = iterator.span(_.nonEmpty)

...

val (leader2, trailer2) = trailer.span(_.nonEmpty)

...

val (leader3, trailer3) = trailer2.span(_.nonEmpty)

...

etc

Naturally, you use a loop or recursion instead of the above. The problem is that since "trailer" points to "leader", and "trailer2" points to "leader2", and, of course, "leader2" points to "trailer" and so on, then all of them will be kept in memory. And if you do not *empty* the leaders, then all of their lines *will* be kept in memory, just in case you one day call next it.

But, given that you say your file is very big, it might just be a matter of each of these trailers calling "next" on the previous trailer. Say you have 1000 sections, then you'll have a call stack of 1000 "next" calls (or hasNext), which fits the stack trace.

Maybe there's a way to implement span without that problem, but I don't see one.

Paul Butcher

unread,

Nov 19, 2012, 11:39:43 AM11/19/12

to Daniel Sobral, Dennis Haupt, scala-user

I'm sorry - I'm clearly still missing something. Why would trailer need to keep a pointer to leader? Or vice-versa? The two iterators returned from span are logically distinct, aren't they?

--
paul.butcher->msgCount++

Snetterton, Castle Combe, Cadwell Park...
Who says I have a one track mind?

http://www.paulbutcher.com/
LinkedIn: http://www.linkedin.com/in/paulbutcher
MSN: pa...@paulbutcher.com
AIM: paulrabutcher
Skype: paulrabutcher

Daniel Sobral

unread,

Nov 19, 2012, 11:56:58 AM11/19/12

to Paul Butcher, Dennis Haupt, scala-user

On Mon, Nov 19, 2012 at 2:39 PM, Paul Butcher <pa...@paulbutcher.com> wrote:

I'm sorry - I'm clearly still missing something. Why would trailer need to keep a pointer to leader? Or vice-versa? The two iterators returned from span are logically distinct, aren't they?

Well, consider how you implemented it: you read *all* of the pre-span part into memory, and kept the original iterator pointer.

So, here's how span is implemented (though you can just click on source link the scaladoc and see for yourself): nothing is read before required. If you iterate first through leader then through trailer, no line will ever be buffered in memory. However, if you iterate through trailer at any point in time before finishing the leader iteration, it buffers all the lines remaining on leader, so that it can use the original iterator to get the lines for trailer.

In practice, when you get a line from trailer, it first checks that leader is empty, and, if not, calls a method on it to make it buffer its elements. But to do this, it has to keep a pointer to it.

However, as I said, that is not the problem you are having. It's a *heap* memory leak, not a stack problem. The problem you are having is that the trailing iterator is essential this:

val trailer = new Iterator {

def hasNext = { specialStuff; self.hasNext } // self is the "this" for the span method

def next() = { specialStuff; self.next() }

}

So calling "hasNext" on that trailer will cause it to call "hasNext" on the original object. As you apply "span" over the results of previous "span" calls, it accumulates.

I bet something like -Xss:2M will avoid the stack overflow -- jvm's default stack is rather small for functional languages. On the other hand, it will not help with the performance issue.

Rex Kerr

unread,

Nov 19, 2012, 11:59:12 AM11/19/12

to Daniel Sobral, Paul Butcher, scala-user

On Mon, Nov 19, 2012 at 7:26 AM, Daniel Sobral <dcso...@gmail.com> wrote:

On Mon, Nov 19, 2012 at 9:54 AM, Paul Butcher <pa...@paulbutcher.com> wrote:

def next() = {
val (section, rest) = remaining.span(!_.isEmpty)

I don't know exactly what the problem is, but this is most likely the cause. The method "span" will cache all lines that comprise "section", so you might as well have read them all into memory. I didn't see any obvious non-tail recursion in it, but maybe there's one.

There are two problems - first, it's *dreadfully* slow. And second, it fails with the following stack overflow after around 8000 sections:

Both symptoms are caused by the same thing, and it's a natural extension of how span is defined.

When you span, you get an two new iterators which each point to the previous iterator. The second of the two, in particular, defers to the original iterators' hasNext and next with a little extra logic to make sure it's already advanced to the appropriate point.

But now if you have many blocks, you get one level of indirection per block. Eventually, you stack overflow (but not until you waste enormous amounts of time with linearly increasingly deep stack traversals to get to that original iterator).

To fix, you must come up with your own span that, when it is spanned itself, passes the original forwards instead of the nearest one.

--Rex

Rex Kerr

unread,

Nov 19, 2012, 12:01:44 PM11/19/12

to Daniel Sobral, Paul Butcher, Dennis Haupt, scala-user

On Mon, Nov 19, 2012 at 11:56 AM, Daniel Sobral <dcso...@gmail.com> wrote:

However, as I said, that is not the problem you are having. It's a *heap* memory leak, not a stack problem.

Nope, stack gets you first, as the exception says. See the other message I just sent.

--Rex

Paul Butcher

unread,

Nov 19, 2012, 12:09:42 PM11/19/12

to Rex Kerr, Daniel Sobral, scala-user

On 19 Nov 2012, at 16:59, Rex Kerr <ich...@gmail.com> wrote:

Both symptoms are caused by the same thing, and it's a natural extension of how span is defined.

Ah - OK - got it now. Thanks.

So the "Java-style" implementation I sent earlier probably is the right solution. I guess that, given that iterators are fundamentally side-effecting, it's no surprise that imperative code is the best fit.

Daniel Sobral

unread,

Nov 19, 2012, 12:09:35 PM11/19/12

to Rex Kerr, Paul Butcher, Dennis Haupt, scala-user

I meant that the memory leak is a heap problem, not a stack problem, and, therefore, it is not the problem he is having.

--Rex

Daniel Sobral

unread,

Nov 19, 2012, 12:11:06 PM11/19/12

to Paul Butcher, Rex Kerr, scala-user

On Mon, Nov 19, 2012 at 3:09 PM, Paul Butcher <pa...@paulbutcher.com> wrote:

On 19 Nov 2012, at 16:59, Rex Kerr <ich...@gmail.com> wrote:

Both symptoms are caused by the same thing, and it's a natural extension of how span is defined.

Ah - OK - got it now. Thanks.

So the "Java-style" implementation I sent earlier probably is the right solution. I guess that, given that iterators are fundamentally side-effecting, it's no surprise that imperative code is the best fit.

I have found that iterators are most powerful when you extend Iterator class yourself.

--
paul.butcher->msgCount++

Snetterton, Castle Combe, Cadwell Park...
Who says I have a one track mind?

http://www.paulbutcher.com/
LinkedIn: http://www.linkedin.com/in/paulbutcher
MSN: pa...@paulbutcher.com
AIM: paulrabutcher
Skype: paulrabutcher

Rex Kerr

unread,

Nov 19, 2012, 12:19:22 PM11/19/12

to Daniel Sobral, Paul Butcher, Dennis Haupt, scala-user

Ah, okay--sorry, didn't read carefully enough.
--Rex

HamsterofDeath

unread,

Nov 19, 2012, 12:52:48 PM11/19/12

to scala...@googlegroups.com

i know, but it won't chance if you don't chance your scala version^^

Am 19.11.2012 14:41, schrieb Daniel Sobral:

On Mon, Nov 19, 2012 at 10:56 AM, Dennis Haupt <h-s...@gmx.de> wrote:

i just tested - you *can*. orgIt.takewhile(..).toList advances the original iterator (the next call to next will return the element that "takewhile" didn't take), it doesn't break it. the original can still be used. at least it works like that in 2.9.1

That's called a coincidence. An Iterator has two methods: next and hasNext. Calling next will change the iterator, so it points to the next element. All methods on Iterator are based on these two, so any time an iterator method returns another iterator, the original iterator will be *used* by that method, changing the original iterator.

The implementation of the method certainly has a predictable behavior, but it's not a guaranteed behavior. It may change at will, and it might differ from Iterator to Iterator -- the Iterator returned by Source getLines() is different than the iterator returned by Source.fromFile.

I'll end by quoting myself from Iterator's takeWhile Scaladoc:

Note

Reuse: After calling this method, one should discard the iterator it was called on, and use only the iterator that was returned. Using the old iterator is undefined, subject to change, and may result in changes to the new iterator as well.

ï¿½

> > > > ï¿½

> > > >
> > > >
> > > > The code that I've come up with to parse this is:
> > > >
> > > > *object* LinkGraph {
> > > >
> > > >

> > > > ï¿½ *def* sections(lines: Iterator[String]) =
> > > *new*Iterator[Iterator[String]] {
> > > > ï¿½ ï¿½ *var* remaining = lines
> > > > ï¿½ ï¿½ *def* hasNext = remaining.hasNext
> > > > ï¿½ ï¿½ *def* next() = {
> > > > ï¿½ ï¿½ ï¿½ *val* (section, rest) = remaining.span(!_.isEmpty)

> > > >
> > > >
> > > I don't know exactly what the problem is, but this is most likely the
> > > cause. The method "span" will cache all lines that comprise "section",
> so
> > > you might as well have read them all into memory. I didn't see any
> > obvious
> > > non-tail recursion in it, but maybe there's one.
> > >
> > > As a rule, I'd avoid any iterator method that splits the iterator in
> two.
> > >
> > >

> > > > ï¿½ ï¿½ ï¿½ remaining = rest.dropWhile(_.isEmpty)
> > > > ï¿½ ï¿½ ï¿½ section
> > > > ï¿½ ï¿½ }
> > > > ï¿½ }
> > > >
> > > > ï¿½ *def* load(lines: Iterator[String]) {
> > > > ï¿½ ï¿½ sections(lines) foreach { section =>
> > > > ï¿½ ï¿½ ï¿½ // HANDLE SECTION
> > > > ï¿½ ï¿½ }
> > > > ï¿½ }

> > > > }
> > > >
> > > >
> > > > It's being called like this:
> > > >
> > > > LinkGraph.load(Source.fromFile(args(0)).getLines)
> > > >
> > > >
> > > > There are two problems - first, it's *dreadfully* slow. And second,
> it
> > > > fails with the following stack overflow after around 8000 sections:
> > > >
> > > > java.lang.StackOverflowError
> > > > at
> > >
> sun.nio.cs.SingleByteDecoder.decodeArrayLoop(SingleByteDecoder.java:56)

> > > > ï¿½at

> sun.nio.cs.SingleByteDecoder.decodeLoop(SingleByteDecoder.java:83)
> > > > at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:544)

> > > > ï¿½at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:298)
> > > > at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:158)
> > > > ï¿½at java.io.InputStreamReader.read(InputStreamReader.java:167)
> > > > at java.io.BufferedReader.fill(BufferedReader.java:136)
> > > > ï¿½at java.io.BufferedReader.readLine(BufferedReader.java:299)
> > > > at java.io.BufferedReader.readLine(BufferedReader.java:362)
> > > > ï¿½at

> > > >
> > >
> >
> scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:67)
> > > > at scala.collection.Iterator$$anon$2.hasNext(Iterator.scala:892)

> > > > ï¿½at scala.collection.Iterator$$anon$26.hasNext(Iterator.scala:642)
> > > > at scala.collection.Iterator$$anon$2.hasNext(Iterator.scala:892)
> > > > ï¿½at scala.collection.Iterator$$anon$27.hasNext(Iterator.scala:666)

Reply all

Reply to author

Forward