Non-terminating parser combinator when putting rep() or opt() inside rep()

274 views
Skip to first unread message

Eugene Yokota

unread,
Jan 6, 2014, 12:11:48 AM1/6/14
to scala...@googlegroups.com
Hi,

I'm hitting a strange parser combinator behavior, and I'd like some clarification.
Here's a repro session using Scala 2.10.3:

scala> :paste
// Entering paste mode (ctrl-D to finish)

import scala.util.parsing.combinator._

trait RNAParser extends JavaTokenParsers {
  def UUU: Parser[String] = "UUU" ^^ { case _ => "P" }
  def UAG: Parser[String] = "UAG" ^^ { case _ => "*" }
}

object ExampleParser extends RNAParser {
  def parser: Parser[Seq[String]] =
    rep(
      rep(UUU) ~ opt(UAG) ^^ { case p1 ~ p2 => p1.mkString + p2.getOrElse("") }
    )
  def parse(s: String): Unit = {
    println(parseAll(parser, s))
  }
}

// Exiting paste mode, now interpreting.

import scala.util.parsing.combinator._
defined trait RNAParser
defined module ExampleParser

scala> ExampleParser.parse("UUU")
// this doesn't return

The code will keep consuming CPU% and RAM till JVM goes belly up.
In this particular case, changing inner rep(UUU) to rep1(UUU) produces terminating parser, but shouldn't the parser terminate when there are no more tokens?

Regular expression on the other hand is able to nest *:

scala> """((UUU)*(UAG)?)*""".r.findFirstIn("UUU")
res0: Option[String] = Some(UUU)

This originally came up as a scalaxb [#230]. Since it's generating parser code of XML elements from user-provided schema,
I may not be able to work around this by using rep1(...) in some cases.

Thanks,
-eugene


Tony Sloane

unread,
Jan 6, 2014, 5:09:04 AM1/6/14
to Eugene Yokota, scala-user@googlegroups.com User
Hi Eugene,
Ah, but your parser:

   rep (UUU) ~ opt (UAG) 

can succeed without consuming any input which is almost always a bad idea.

To see why you get a loop, consider what happens when the input is “UUU”: the first rep(UUU) succeeds as does the following opt(UAG), leaving the input position at the end of the input. Then, you go around again and look for another rep(UUU) which succeeds again since zero repetitions is fine, then the opt(UAG) succeeds again, and you’re in a loop because you’re still at the end of the input.

A key point is that these sorts of parsers are greedy so they will keep trying to parse as long as things keep working.

In my experience the log parser combinator is very useful for diagnosing this sort of problem. I’ve included modified parser and part of a trace of this parse below.

cheers,
Tony

To get a useful trace, modify the parser definitions as follows:

import scala.util.parsing.combinator._

trait RNAParser extends JavaTokenParsers {
  def UUU: Parser[String] = log ("UUU" ^^ { case _ => "P" }) ("UUU")
  def UAG: Parser[String] = log ("UAG" ^^ { case _ => "*" }) ("UAG")
}

object ExampleParser extends RNAParser {
  def parser: Parser[Seq[String]] =
    rep(
      log (rep(UUU)) ("repUUU") ~ log (opt(UAG)) ("optUAG") ^^ { case p1 ~ p2 => p1.mkString + p2.getOrElse("") }
    )
  def parse(s: String): Unit = {
    println(parseAll(parser, s))
  }
}

Here is the beginning of the trace:

scala> ExampleParser.parse("UUU")
trying repUUU at scala.util.parsing.input.CharSequenceReader@638844a6
trying UUU at scala.util.parsing.input.CharSequenceReader@638844a6
UUU --> [1.4] parsed: P
trying UUU at scala.util.parsing.input.CharSequenceReader@76877f0f
UUU --> [1.4] failure: `UUU' expected but end of source found

UUU
   ^
repUUU --> [1.4] parsed: List(P)
trying optUAG at scala.util.parsing.input.CharSequenceReader@76877f0f
trying UAG at scala.util.parsing.input.CharSequenceReader@76877f0f
UAG --> [1.4] failure: `UAG' expected but end of source found

UUU
   ^
optUAG --> [1.4] parsed: None
trying repUUU at scala.util.parsing.input.CharSequenceReader@76877f0f
trying UUU at scala.util.parsing.input.CharSequenceReader@76877f0f
UUU --> [1.4] failure: `UUU' expected but end of source found

UUU
   ^
repUUU --> [1.4] parsed: List()
trying optUAG at scala.util.parsing.input.CharSequenceReader@76877f0f
trying UAG at scala.util.parsing.input.CharSequenceReader@76877f0f
UAG --> [1.4] failure: `UAG' expected but end of source found

UUU
   ^
optUAG --> [1.4] parsed: None
trying repUUU at scala.util.parsing.input.CharSequenceReader@76877f0f
trying UUU at scala.util.parsing.input.CharSequenceReader@76877f0f
UUU --> [1.4] failure: `UUU' expected but end of source found

UUU
   ^
repUUU --> [1.4] parsed: List()
trying optUAG at scala.util.parsing.input.CharSequenceReader@76877f0f
trying UAG at scala.util.parsing.input.CharSequenceReader@76877f0f
UAG --> [1.4] failure: `UAG' expected but end of source found

UUU
   ^
optUAG --> [1.4] parsed: None
trying repUUU at scala.util.parsing.input.CharSequenceReader@76877f0f
trying UUU at scala.util.parsing.input.CharSequenceReader@76877f0f

<repeat forever>

Eugene Yokota

unread,
Jan 6, 2014, 12:27:45 PM1/6/14
to scala...@googlegroups.com, Eugene Yokota
Hi Tony,


On Monday, January 6, 2014 5:09:04 AM UTC-5, Tony Sloane wrote:
Ah, but your parser:

   rep (UUU) ~ opt (UAG) 

can succeed without consuming any input which is almost always a bad idea.

To see why you get a loop, consider what happens when the input is “UUU”: the first rep(UUU) succeeds as does the following opt(UAG), leaving the input position at the end of the input. Then, you go around again and look for another rep(UUU) which succeeds again since zero repetitions is fine, then the opt(UAG) succeeds again, and you’re in a loop because you’re still at the end of the input.

A key point is that these sorts of parsers are greedy so they will keep trying to parse as long as things keep working.

I speculated that it's the potentially empty subexpression that's causing the problem, but I was hoping it would terminate the loop when it sees the end of input.
That's why I put the equivalent regex:

scala> """((UUU)*(UAG)?)*""".r.findFirstIn("UUU")
res0: Option[String] = Some(UUU)

In my experience the log parser combinator is very useful for diagnosing this sort of problem. I’ve included modified parser and part of a trace of this parse below.
This is useful.

One way to emulate the regex engine may be to force a failure when none of the subexpressions match.

object ExampleParser extends RNAParser {
  def parser: Parser[Seq[String]] =
    rep(
      rep(UUU) ~ opt(UAG) ^^ { case p1 ~ p2 => p1.mkString + p2.getOrElse("") } filter { _ != "" }
    )
  def parse(s: String): Unit = {
    println(parseAll(parser, s))
  }
}

This is a bit annoying since I'd have to know what the mapped value would be for the empty sequence,
but the upside is that I can keep the current zero-or-more occurrence.

-eugene

 

Tony Sloane

unread,
Jan 6, 2014, 5:05:14 PM1/6/14
to scala-user@googlegroups.com User
Hi Eugene,

On 7 Jan 2014, at 4:27 am, Eugene Yokota <eed3...@gmail.com> wrote:

I speculated that it's the potentially empty subexpression that's causing the problem, but I was hoping it would terminate the loop when it sees the end of input.
That's why I put the equivalent regex:

scala> """((UUU)*(UAG)?)*""".r.findFirstIn("UUU")
res0: Option[String] = Some(UUU)

True to the greedy nature of this kind of parser, rep(p) will only stop if p fails. Since your p matches the empty input, it will always succeed. Regex matching is usually not as greedy so the correspondence between repetition in that context and in the parser context is not exact.


In my experience the log parser combinator is very useful for diagnosing this sort of problem. I’ve included modified parser and part of a trace of this parse below.
This is useful.

One way to emulate the regex engine may be to force a failure when none of the subexpressions match.

object ExampleParser extends RNAParser {
  def parser: Parser[Seq[String]] =
    rep(
      rep(UUU) ~ opt(UAG) ^^ { case p1 ~ p2 => p1.mkString + p2.getOrElse("") } filter { _ != "" }
    )
  def parse(s: String): Unit = {
    println(parseAll(parser, s))
  }
}

This is a bit annoying since I'd have to know what the mapped value would be for the empty sequence,
but the upside is that I can keep the current zero-or-more occurrence.

It might be possible to write a custom version of rep that stops when the end of input is reached. It will be hard to avoid the “what to return if the input is empty” problem as long as your sub-expression can match an empty sequence.

cheers,
Tony

Reply all
Reply to author
Forward
0 new messages