first line of first section
second line of first section

first line of second section
…
object LinkGraph {
  def sections(lines: Iterator[String]) = new Iterator[Iterator[String]] {
    var remaining = lines
    def hasNext = remaining.hasNext
    def next() = {
      val (section, rest) = remaining.span(!_.isEmpty)
      remaining = rest.dropWhile(_.isEmpty)
      section
    }
  }

  def load(lines: Iterator[String]) {
    sections(lines) foreach { section =>
      // HANDLE SECTION
    }
  }
}
LinkGraph.load(Source.fromFile(args(0)).getLines)
java.lang.StackOverflowError
    at sun.nio.cs.SingleByteDecoder.decodeArrayLoop(SingleByteDecoder.java:56)
    at sun.nio.cs.SingleByteDecoder.decodeLoop(SingleByteDecoder.java:83)
    at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:544)
    at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:298)
    at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:158)
    at java.io.InputStreamReader.read(InputStreamReader.java:167)
    at java.io.BufferedReader.fill(BufferedReader.java:136)
    at java.io.BufferedReader.readLine(BufferedReader.java:299)
    at java.io.BufferedReader.readLine(BufferedReader.java:362)
    at scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:67)
    at scala.collection.Iterator$$anon$2.hasNext(Iterator.scala:892)
    at scala.collection.Iterator$$anon$26.hasNext(Iterator.scala:642)
    at scala.collection.Iterator$$anon$2.hasNext(Iterator.scala:892)
    at scala.collection.Iterator$$anon$27.hasNext(Iterator.scala:666)
Is there a reason why you can't use grouped instead of the "sections" method you implemented?
I'm writing code to parse a very simple text file. The problem is that it's large - too large to want to keep the entire file in RAM (several GB). The file is a series of lines, arranged into sections divided by blank lines:

first line of first section
second line of first section

first line of second section
…

The code that I've come up with to parse this is:

object LinkGraph {
  def sections(lines: Iterator[String]) = new Iterator[Iterator[String]] {
    var remaining = lines
    def hasNext = remaining.hasNext
    def next() = {
      val (section, rest) = remaining.span(!_.isEmpty)
try
while (iterator.nonEmpty) {
  val group = iterator.takeWhile(_.nonEmpty)
  iterator.takeWhile(_.isEmpty) // eat empty lines
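import scala.collection.mutable.ArrayBuffer
import scala.util.control.Breaks._

def sections(lines: Iterator[String]) = new Iterator[Seq[String]] {
  def hasNext = lines.hasNext
  def next() = {
    // Collect lines into a buffer until a blank line ends the section.
    val section = new ArrayBuffer[String]
    breakable {
      while (lines.hasNext) {
        val line = lines.next()
        if (line.isEmpty)
          break()
        else
          section += line
      }
    }
    // toSeq keeps the declared Seq[String] element type on newer Scala versions.
    section.toSeq
  }
}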
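A possible variant of the imperative version above, offered as a sketch rather than a drop-in replacement. It assumes that a run of several blank lines should count as a single separator and that each individual section fits in memory. The names SectionReader and skipBlank are made up for illustration. It peeks at the next line through BufferedIterator.head instead of using breakable:

```scala
object SectionReader {
  // Hypothetical helper (not from the thread): groups lines into sections
  // separated by one or more blank lines, holding one section at a time.
  def sections(lines: Iterator[String]): Iterator[Vector[String]] =
    new Iterator[Vector[String]] {
      // buffered lets us look at the next line without consuming it.
      private val in = lines.buffered

      // Discard blank lines so hasNext is true only if a section follows.
      private def skipBlank(): Unit =
        while (in.hasNext && in.head.isEmpty) in.next()

      def hasNext: Boolean = { skipBlank(); in.hasNext }

      def next(): Vector[String] = {
        skipBlank()
        if (!in.hasNext) throw new NoSuchElementException("no more sections")
        val section = Vector.newBuilder[String]
        while (in.hasNext && in.head.nonEmpty) section += in.next()
        section.result()
      }
    }
}
```

Because the wrapper is created once and always reads from the same underlying iterator, the depth of the call chain does not grow with the number of sections.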
So does anyone have any idea how I *can* write this?

I'm tempted to report this as a bug - if span is holding on to lines that are no longer in use, that has to be a bug, surely?
I'm sorry - I'm clearly still missing something. Why would trailer need to keep a pointer to leader? Or vice-versa? The two iterators returned from span are logically distinct, aren't they?
On Mon, Nov 19, 2012 at 9:54 AM, Paul Butcher <pa...@paulbutcher.com> wrote:

> def next() = {
>   val (section, rest) = remaining.span(!_.isEmpty)

I don't know exactly what the problem is, but this is most likely the cause. The method "span" will cache all lines that comprise "section", so you might as well have read them all into memory. I didn't see any obvious non-tail recursion in it, but maybe there's one.
There are two problems - first, it's *dreadfully* slow. And second, it fails with the following stack overflow after around 8000 sections:
However, as I said, that is not the problem you are having. It's a *heap* memory leak, not a stack problem.
Both symptoms are caused by the same thing, and it's a natural extension of how span is defined.
--Rex
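A rough illustration of how one cause could produce both symptoms. It assumes, as the alternating Iterator$$anon frames in the stack trace suggest, that every span and dropWhile call returns a wrapper around the iterator it was called on, so that `remaining` gains extra layers with each section. Layer, wrap and delegations are hypothetical names used only for this sketch; they are not part of the Scala library:

```scala
object WrapperDepth {
  // Counts how many delegating calls a single hasNext triggers.
  var delegations = 0

  // Hypothetical stand-in for the anonymous iterators returned by span and
  // dropWhile: it forwards every call to the iterator it wraps.
  final class Layer[A](underlying: Iterator[A]) extends Iterator[A] {
    def hasNext: Boolean = { delegations += 1; underlying.hasNext }
    def next(): A = underlying.next()
  }

  // Wrap `it` in `n` layers, roughly what repeated
  // `remaining = rest.dropWhile(...)` would do after n sections.
  def wrap[A](it: Iterator[A], n: Int): Iterator[A] =
    (1 to n).foldLeft(it)((acc, _) => new Layer(acc))

  def main(args: Array[String]): Unit = {
    val deep = wrap(Iterator("x"), 1000)
    deep.hasNext
    println(delegations) // prints 1000: one nested call per layer
  }
}
```

Under that assumption, each hasNext costs time proportional to the number of sections already read (the slowness), and the nested calls eventually exhaust the stack (the StackOverflowError).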
Ah - OK - got it now. Thanks.

On 19 Nov 2012, at 16:59, Rex Kerr <ich...@gmail.com> wrote:

> Both symptoms are caused by the same thing, and it's a natural extension of how span is defined.

So the "Java-style" implementation I sent earlier probably is the right solution. I guess that, given that iterators are fundamentally side-effecting, it's no surprise that imperative code is the best fit.
On Mon, Nov 19, 2012 at 10:56 AM, Dennis Haupt <h-s...@gmx.de> wrote:
> I just tested - you *can*. orgIt.takeWhile(..).toList advances the original iterator (the next call to next will return the element that "takeWhile" didn't take); it doesn't break it. The original can still be used. At least it works like that in 2.9.1.
That's called a coincidence. An Iterator has two methods: next and hasNext. Calling next will change the iterator, so it points to the next element. All methods on Iterator are based on these two, so any time an iterator method returns another iterator, the original iterator will be *used* by that method, changing the original iterator.
The implementation of the method certainly has a predictable behavior, but it's not a guaranteed behavior. It may change at will, and it might differ from Iterator to Iterator -- the Iterator returned by Source's getLines() is different from the Iterator returned by Source.fromFile.
I'll end by quoting myself from Iterator's takeWhile Scaladoc:
- Note
Reuse: After calling this method, one should discard the iterator it was called on, and use only the iterator that was returned. Using the old iterator is undefined, subject to change, and may result in changes to the new iterator as well.
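One way to stay within that note, sketched here under the assumption that both the matching prefix and the remainder are needed: call span once and use only the two iterators it returns, never the original. SplitOnce and firstSection are hypothetical names for this example:

```scala
object SplitOnce {
  // Hypothetical helper: splits off the leading non-blank lines.
  // After this call the argument `lines` must not be touched again;
  // only the two returned iterators are safe to use.
  def firstSection(lines: Iterator[String]): (Iterator[String], Iterator[String]) =
    lines.span(_.nonEmpty)
}
```

For example, `SplitOnce.firstSection(Iterator("a", "b", "", "c"))` returns a pair whose first iterator yields "a" and "b" and whose second yields "" and "c". This is fine for a one-off split; the trouble discussed in this thread comes from repeating the split once per section.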
> > > > The code that I've come up with to parse this is:
> > > >
> > > > object LinkGraph {
> > > >
> > > >   def sections(lines: Iterator[String]) = new Iterator[Iterator[String]] {
> > > >     var remaining = lines
> > > >     def hasNext = remaining.hasNext
> > > >     def next() = {
> > > >       val (section, rest) = remaining.span(!_.isEmpty)
> > > >
> > > I don't know exactly what the problem is, but this is most likely the cause. The method "span" will cache all lines that comprise "section", so you might as well have read them all into memory. I didn't see any obvious non-tail recursion in it, but maybe there's one.
> > >
> > > As a rule, I'd avoid any iterator method that splits the iterator in two.
> > >
> > > >       remaining = rest.dropWhile(_.isEmpty)
> > > >       section
> > > >     }
> > > >   }
> > > >
> > > >   def load(lines: Iterator[String]) {
> > > >     sections(lines) foreach { section =>
> > > >       // HANDLE SECTION
> > > >     }
> > > >   }
> > > > }
> > > >
> > > > It's being called like this:
> > > >
> > > > LinkGraph.load(Source.fromFile(args(0)).getLines)
> > > >
> > > > There are two problems - first, it's *dreadfully* slow. And second, it fails with the following stack overflow after around 8000 sections:
> > > >
> > > > java.lang.StackOverflowError
> > > >     at sun.nio.cs.SingleByteDecoder.decodeArrayLoop(SingleByteDecoder.java:56)
> > > >     at sun.nio.cs.SingleByteDecoder.decodeLoop(SingleByteDecoder.java:83)
> > > >     at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:544)
> > > >     at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:298)
> > > >     at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:158)
> > > >     at java.io.InputStreamReader.read(InputStreamReader.java:167)
> > > >     at java.io.BufferedReader.fill(BufferedReader.java:136)
> > > >     at java.io.BufferedReader.readLine(BufferedReader.java:299)
> > > >     at java.io.BufferedReader.readLine(BufferedReader.java:362)
> > > >     at scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:67)
> > > >     at scala.collection.Iterator$$anon$2.hasNext(Iterator.scala:892)
> > > >     at scala.collection.Iterator$$anon$26.hasNext(Iterator.scala:642)
> > > >     at scala.collection.Iterator$$anon$2.hasNext(Iterator.scala:892)
> > > >     at scala.collection.Iterator$$anon$27.hasNext(Iterator.scala:666)