[Haskell-cafe] regex-applicative library needs your help! (algorithmic challenge)


Roman Cheplyaka

Sep 13, 2011, 1:40:19 AM
to haskel...@haskell.org
Please help make the regex-based parsing library efficient!

https://github.com/feuerbach/regex-applicative/wiki/Call-For-Help

--
Roman I. Cheplyaka :: http://ro-che.info/

_______________________________________________
Haskell-Cafe mailing list
Haskel...@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Chris Kuklewicz

Sep 13, 2011, 3:21:55 AM
to haskel...@haskell.org
I wrote regex-tdfa, the efficient (though not yet lightning fast) Posix-like engine.

You are not the first to want an efficient Perl-like engine. The answer you
seek flows from Russ Cox (though in C++):

> http://google-opensource.blogspot.com/2010/03/re2-principled-approach-to-regular.html

> http://code.google.com/p/re2/

Quoting relevant bit:

> It also finds the leftmost-first match, the same match that Perl would, and
> can return submatch information. The one significant exception is that RE2
> drops support for backreferences¹ and generalized zero-width assertions,
> because they cannot be implemented efficiently.

On 13/09/2011 06:40, Roman Cheplyaka wrote:
> Please help make the regex-based parsing library efficient!
>
> https://github.com/feuerbach/regex-applicative/wiki/Call-For-Help
>


Roman Cheplyaka

Sep 13, 2011, 5:29:55 PM
to Chris Kuklewicz, haskel...@haskell.org
* Chris Kuklewicz <has...@list.mightyreason.com> [2011-09-13 08:21:55+0100]

> I wrote regex-tdfa, the efficient (though not yet lightning fast) Posix-like engine.
>
> You are not the first to want an efficient Perl-like engine. The answer you
> seek flows from Russ Cox (though in C++):
>
> > http://google-opensource.blogspot.com/2010/03/re2-principled-approach-to-regular.html
>
> > http://code.google.com/p/re2/
>
> Quoting relevant bit:
>
> > It also finds the leftmost-first match, the same match that Perl would, and
> > can return submatch information. The one significant exception is that RE2
> > drops support for backreferences¹ and generalized zero-width assertions,
> > because they cannot be implemented efficiently.

Hi Chris, thanks for the response.

There's one thing about Cox's work that I don't understand. On the
page [1] he mentions that the algorithm falls back to direct NFA
simulation (search for "If all else fails" on that page).
This defeats his goal of defending against DoS attacks (the second
paragraph of that page).

And I wouldn't be comfortable with an algorithm that is worst-case
exponential either.

Then there's another issue: I specifically want a combinator library,
and not every automaton-based algorithm can serve this purpose in a
statically typed language (unless there's some clever trick I don't
know).

[1]: http://swtch.com/~rsc/regexp/regexp3.html

--
Roman I. Cheplyaka :: http://ro-che.info/


Roman Cheplyaka

Sep 13, 2011, 5:41:51 PM
to Chris Kuklewicz, haskel...@haskell.org
* Roman Cheplyaka <ro...@ro-che.info> [2011-09-14 00:29:55+0300]

> Then there's another issue: I specifically want a combinator library,
> and not every automaton-based algorithm can serve this purpose in a
> statically typed language (unless there's some clever trick I don't
> know).

Just to clarify: by "clever" trick I don't mean a "dirty" trick, like
using unsafeCoerce in frisby[1]. I guess I'm just too young and naive to
use that kind of stuff.

The Glushkov automaton is nice because it can be used to implement parsing
combinators in pure Haskell + GADTs.

[1]: http://hackage.haskell.org/packages/archive/frisby/0.1/doc/html/src/Text-Parsers-Frisby.html

Eugene Kirpichov

Sep 14, 2011, 12:38:10 AM
to Roman Cheplyaka, Chris Kuklewicz, haskel...@haskell.org
Hi,
I don't see how fallback to NFA simulation is really a failure wrt DoS
attacks. It's not exponential in time or memory, just linear memory
(in size of regex) instead of constant, and slower than DFA.

--
Eugene Kirpichov
Principal Engineer, Mirantis Inc. http://www.mirantis.com/
Editor, http://fprog.ru/

Roman Cheplyaka

Sep 14, 2011, 2:32:46 AM
to Eugene Kirpichov, Chris Kuklewicz, haskel...@haskell.org
* Eugene Kirpichov <ekirp...@gmail.com> [2011-09-14 08:38:10+0400]

> Hi,
> I don't see how fallback to NFA simulation is really a failure wrt DoS
> attacks. It's not exponential in time or memory, just linear memory
> (in size of regex) instead of constant, and slower than DFA.

Hi Eugene, thanks for pointing that out.

Indeed, I now see that he uses breadth-first rather than depth-first
search. Then he has the same problem as I do. I shall study his code
more closely.

Chris Kuklewicz

Sep 15, 2011, 5:23:15 AM
to haskel...@haskell.org
If you are worried about Denial Of Service attacks then you are worried about
(1) memory explosion, and (2) long runtime. Long runtime can and should be
solved by running a timeout connected to a thread killer, not at the level of
the regular expression engine. The memory explosion is more interesting, and
here the engine's tradeoffs have something to say:

Since I know simple POSIX regular expression matching better, here is an
overview of this without subexpression capture:

(1) The full DFA may be exponentially large and thus cannot be built ahead of time.

(2) Building the DFA in a lazy fashion will cause memory to grow linearly, which
will be a problem if left to run too long.

(2.1) Pure lazy DFA building won't have a way to determine the currently
computed size of the DFA, and is not going to help you.

(2.2) Cheating the "pure" with unsafePerformIO might be able to keep a count of
the number of computed DFA nodes. Good luck with thread safety.

(2.3) Making it a stateful expanding DFA would allow you to control and monitor
the growing number of DFA states.

(2.4) Adjust the stateful DFA to keep a limited cache of computed DFA nodes.

(3) Building an NFA from your regular expression can be cheap, but you would
have to be more clever than expanding "a(b{2,99}c){100}d" into 100 copies of 99
copies of b.

(4) Using an NFA with multiple states (breadth-first) is like the limit of a DFA
with only a single computed node at a time. The number of states active in the
pure NFA can be easily counted and bounded to keep the system from using too
much memory. Without the lazy or stateful caching of DFA states this may be
slightly slower, depending on whether the cached states would be re-used heavily
or not.

If you want POSIX subexpression capture then your memory requirements go up by a
factor that is, in the worst case, proportional to the regular expression size. This
worst-case factor for "a(b{2,99}c){100}d" does have to track 100 copies of
99 copies of the tracking information. (The size of the NFA can still be
clever when adding capture, but the tracked information you now need cannot be
so clever.)

The Russ Cox engine in re2 uses NFA/DFA techniques to achieve Perl semantics,
where it can check the possibilities in parallel. The same problems with
exploding DFA size and tracked information may still apply.
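
To make point (4) concrete, here is a minimal sketch of breadth-first simulation
over an epsilon-free NFA. The NFA representation and the names are illustrative
only; this is not code from regex-tdfa or regex-applicative.

import qualified Data.IntSet as IS

-- A hypothetical epsilon-free NFA over Char; the names are made up for
-- illustration.
data NFA = NFA
  { start     :: Int
  , step      :: Int -> Char -> [Int]   -- successor states on one character
  , accepting :: Int -> Bool
  }

-- Breadth-first simulation keeps only the *set* of currently active states,
-- so memory is bounded by the number of NFA states and is independent of the
-- input length; the size of this set is exactly the quantity one would count
-- and bound, as in point (4).
match :: NFA -> String -> Bool
match nfa = go (IS.singleton (start nfa))
  where
    go states []     = any (accepting nfa) (IS.toList states)
    go states (c:cs) =
      go (IS.fromList [ s' | s <- IS.toList states, s' <- step nfa s c ]) cs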

On 14/09/2011 07:32, Roman Cheplyaka wrote:
> * Eugene Kirpichov <ekirp...@gmail.com> [2011-09-14 08:38:10+0400]
>> Hi,
>> I don't see how fallback to NFA simulation is really a failure wrt DoS
>> attacks. It's not exponential in time or memory, just linear memory
>> (in size of regex) instead of constant, and slower than DFA.
>
> Hi Eugene, thanks for pointing that out.
>
> Indeed, I now see that he uses breadth-first rather than depth-first
> search. Then he has the same problem as I do. I shall study his code
> more closely.
>


Anton Tayanovskyy

Sep 17, 2011, 9:46:57 PM
to Roman Cheplyaka, haskel...@haskell.org
So you want to encode priorities efficiently as far as I understand
from [1]? Could bit-packing combined with prefix elimination do the
trick? Choice boils down to binary choice. Attach a number N to every
execution thread that sits in a given NFA state. When the thread moves
through a binary choice state, it splits into two threads with
N_{left} = 2 * N + 1 and N_{right} = 2 * N. When two threads arrive at
the same NFA state, the machine picks one with greater N and discards
the other, thus giving priority to early over late and left over
right. This makes these numbers grow quickly, but at any point in the
simulation one can identify the common binary prefix of all the thread
numbers and remove it, because it will not be relevant for comparison.
If this is too costly, amortize and do it every K steps instead of on
every step.
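
To make this concrete, a minimal Haskell sketch of the bookkeeping (the names
and the IntMap-based thread set are mine, and the periodic prefix elimination,
i.e. shifting off the high bits shared by every live priority, is omitted):

import qualified Data.IntMap.Strict as IM

type StateId  = Int                   -- NFA state, assumed to be an Int
type Priority = Integer               -- grows by one bit per choice point
type Threads  = IM.IntMap Priority    -- the active thread at each NFA state

-- Splitting at a binary choice point: the left branch gets the larger number.
splitChoice :: Priority -> (Priority, Priority)
splitChoice n = (2 * n + 1, 2 * n)

-- Two threads arriving at the same NFA state: keep the higher-priority one,
-- i.e. the greater number, and discard the other.
arrive :: StateId -> Priority -> Threads -> Threads
arrive = IM.insertWith max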

Anton

[1] https://github.com/feuerbach/regex-applicative/wiki/Call-For-Help

--
Kind Regards,
Anton Tayanovskyy

WebSharper™ - type-safe JavaScript in F#
http://intellifactory.com

Anton Tayanovskyy

Sep 17, 2011, 10:11:00 PM
to Roman Cheplyaka, haskel...@haskell.org
By the way, can Haskell have a type that admits regular expressions and
only those? I mostly do ML these days, so trying to write up a regex
type in Haskell I was unpleasantly surprised to discover that there
are all sorts of exotic terms inhabiting it, which I did not have to
worry about in ML. Besides `undefined`, you can have terms that try
to define a grammar with nested parentheses. Using regex-applicative
syntax:

expr = ... <|> pure (\ _ x _ -> x) <*> sym "(" <*> expr <*> sym ")"

Is this the so-called 'paucity of types' [1]? What do you do, as a
library designer? Return bottom, or try to detect and error out on
these kinds of values?

Thanks,

A

[1] http://existentialtype.wordpress.com/tag/recursion/

Chris Smith

Sep 17, 2011, 11:05:07 PM
to Anton Tayanovskyy, haskel...@haskell.org
On Sat, 2011-09-17 at 22:11 -0400, Anton Tayanovskyy wrote:
> By the way, can Haskell have a type that admits regular expressions and
> only those? I mostly do ML these days, so trying to write up a regex
> type in Haskell I was unpleasantly surprised to discover that there
> are all sorts of exotic terms inhabiting it, which I did not have to
> worry about in ML. Besides `undefined`, you can have terms that try
> to define a grammar with nested parentheses.

I'm not sure that I understand exactly what the question is... but if
you want to recover the set of values in ML types, you need only add an
'!' before each of the fields of each of your constructors, marking them
as strict fields. Sadly (but unsurprisingly), this does require using a
different version of lists and other fundamental types as well.

You'll be left with bottom, of course, but in strict-language thinking,
that's nothing but an acknowledgement that a function returning that
type may diverge... something you can't exclude in Haskell *or* ML; it's
just that ML has a different natural way of thinking about it. That way
lies Agda and Coq. So you get the next-closest thing: a so-called flat
type, whose values are bottom, and a single layer above it in the
information ordering.
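
A tiny illustration (my own names, not regex-applicative's actual type):

data Regex
  = Eps
  | Sym !Char
  | Alt !Regex !Regex
  | Seq !Regex !Regex

-- This knot-tied definition still type-checks, but because the fields are
-- strict, evaluating it to WHNF forces the whole infinite spine, so the value
-- is just bottom (GHC typically reports <<loop>>) rather than a usable
-- cyclic structure.
expr :: Regex
expr = Alt Eps (Seq (Sym '(') (Seq expr (Sym ')')))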

If I've misunderstood the question, my apologies... I haven't actually
been reading this thread.

--
Chris Smith

Roman Cheplyaka

Sep 18, 2011, 4:47:53 AM
to Anton Tayanovskyy, haskel...@haskell.org
* Anton Tayanovskyy <anton.ta...@gmail.com> [2011-09-17 21:46:57-0400]

> So you want to encode priorities efficiently as far as I understand
> from [1]? Could bit-packing combined with prefix elimination do the
> trick? Choice boils down to binary choice. Attach a number N to every
> execution thread that sits in a given NFA state. When the thread moves
> through a binary choice state, it splits into two threads with
> N_{left} = 2 * N + 1 and N_{right} = 2 * N. When two threads arrive at
> the same NFA state, the machine picks one with greater N and discards
> the other, thus giving priority to early over late and left over
> right. This makes these numbers grow quickly, but at any point in the
> simulation one can identify the common binary prefix of all the thread
> numbers and remove it, because it will not be relevant for comparison.
> If this is too costly, amortize and do it every K steps instead of on
> every step.
>
> Anton
>
> [1] https://github.com/feuerbach/regex-applicative/wiki/Call-For-Help

Just common prefix elimination is not sufficient to put a hard bound on
the memory usage. For example, something like /(a*)|((aa)*)/ would still
require an amount of space linear in the string length (if the string
consists of many a's).

More clever techniques that I tried ("monotonically map lists to
smaller lists") are hard to make correct.

I like the approach of Russ Cox[2]. One of the great ideas there (which I
think he didn't emphasize enough) is to have a queue which allows O(1)
testing whether an element is already there [3]. This solves the problem
with priorities -- the states are guaranteed to be enqueued in the order
of their priorities, and there are no repetitions.

To be able to process the states in the proper order I'll have to give
up the Glushkov automaton and probably use something similar to your
construction [4].

[2] http://swtch.com/~rsc/regexp/regexp2.html
[3] http://code.google.com/p/re2/source/browse/util/sparse_array.h
[4] http://t0yv0.blogspot.com/2011/07/combinatory-regular-expressions-in.html
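
For reference, a small Haskell sketch of such a queue, the "sparse set" trick
behind [3]; this is my own illustration, not code from re2 or
regex-applicative, and it assumes state ids in the range 0..n-1:

import Control.Monad.ST
import Data.Array.ST
import Data.STRef

-- A queue of NFA state ids with O(1) membership testing and insertion,
-- preserving insertion (i.e. priority) order.
data SparseQueue s = SparseQueue
  { dense  :: STUArray s Int Int   -- the enqueued ids, in insertion order
  , sparse :: STUArray s Int Int   -- for each id, its position in 'dense'
  , count  :: STRef s Int
  }

newQueue :: Int -> ST s (SparseQueue s)
newQueue n = do
  d <- newArray (0, n - 1) 0
  s <- newArray (0, n - 1) 0
  c <- newSTRef 0
  return (SparseQueue d s c)

member :: SparseQueue s -> Int -> ST s Bool
member q i = do
  k <- readArray (sparse q) i
  n <- readSTRef (count q)
  if k < n
    then do j <- readArray (dense q) k
            return (j == i)
    else return False

-- Enqueue i unless it is already present, so states come out in priority
-- order and without repetitions.
push :: SparseQueue s -> Int -> ST s ()
push q i = do
  present <- member q i
  if present
    then return ()
    else do
      n <- readSTRef (count q)
      writeArray (dense q) n i
      writeArray (sparse q) i n
      writeSTRef (count q) (n + 1)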

Roman Cheplyaka

Sep 18, 2011, 5:09:22 AM
to Anton Tayanovskyy, haskel...@haskell.org
* Anton Tayanovskyy <anton.ta...@gmail.com> [2011-09-17 22:11:00-0400]

> By the way, can Haskell have a type that admits regular expressions and
> only those? I mostly do ML these days, so trying to write up a regex
> type in Haskell I was unpleasantly surprised to discover that there
> are all sorts of exotic terms inhabiting it, which I did not have to
> worry about in ML. Besides `undefined`, you can have terms that try
> to define a grammar with nested parentheses. Using regex-applicative
> syntax:
>
> expr = ... <|> pure (\ _ x _ -> x) <*> sym "(" <*> expr <*> sym ")"
>
> Is this the so-called 'paucity of types' [1]? What do you do, as a
> library designer? Return bottom, or try to detect and error out on
> these kinds of values?
>
> Thanks,
>
> A
>
> [1] http://existentialtype.wordpress.com/tag/recursion/

Essentially, this is an "infinite" regular expression.
I think it'd be possible to enforce finiteness on the type level. (There was a
thread here some time ago about a type of finite lists.)

If a library can work with such infinite trees, and the user is happy
with the resulting performance, then why not?
For example, this is the case with the weighted-regexp library, and also
with the first (and buggy) version of regex-applicative [2].

Otherwise, just let the user know that this function doesn't work with
infinite data structures. Note that in the presence of referential
transparency (i.e. without StableNames or similar tools) it is not
possible to decide at runtime if a value is circular or not.

[2] https://github.com/feuerbach/regex-applicative/commit/95ddfbbfb01c47e292dc7f03069eefe00008907f

Roman Cheplyaka

Sep 18, 2011, 7:00:34 AM
to Chris Kuklewicz, haskel...@haskell.org
Chris,

Thank you for an interesting overview.

However, I'm not worried directly about DoS. I just want to build a
regex library which would be convenient to use for parsing regular
languages (by providing an applicative interface and Perl semantics), and
not worse than the alternatives performance-wise (including worst-case time
and space complexity).

> (3) Building an NFA from your regular expression can be cheap, but you would
> have be more clever than expanding "a(b{2,99}c){100}d" to have 100 copies of 99
> copies of b.

Are there any known algorithms more clever than that?

--
Roman I. Cheplyaka :: http://ro-che.info/


Brandon Allbery

Sep 18, 2011, 11:28:09 AM
to Anton Tayanovskyy, haskel...@haskell.org
On Sat, Sep 17, 2011 at 22:11, Anton Tayanovskyy <anton.ta...@gmail.com> wrote:
> By the way, can Haskell have a type that admits regular expressions and
> only those? I mostly do ML these days, so trying to write up a regex
> type in Haskell I was unpleasantly surprised to discover that there
> are all sorts of exotic terms inhabiting it, which I did not have to
> worry about in ML. Besides `undefined`, you can have terms that try
> to define a grammar with nested parentheses. Using regex-applicative
> syntax:
>
> expr = ... <|> pure (\ _ x _ -> x) <*> sym "(" <*> expr <*> sym ")"

The general case for this is solving the Halting Problem, so neither Haskell nor any other language can do it.  You can disallow infinite values by encoding the length into the type, but (a) in Haskell this is painful and (b) runtime values now require runtime types, which you can accommodate but at the price of reintroducing the problems you are trying to prevent.  (Dependently typed languages work better for this, but I suspect the result is rather more draconian than you intend.)
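
For what it's worth, one way to read "encoding the length into the type" is a
depth-indexed GADT. A sketch, with made-up names; the index bookkeeping this
forces on every combinator is presumably the painful part:

{-# LANGUAGE DataKinds, GADTs, KindSignatures #-}

data Nat = Z | S Nat

-- A regex indexed by an upper bound on its construction depth.  Each
-- composite constructor strictly decreases the index, so a circular
-- definition such as "expr = Alt expr Eps" no longer type-checks
-- (it would need 'S n ~ n).  `undefined` still inhabits the type, of course.
data RE (n :: Nat) where
  Eps :: RE n
  Sym :: Char -> RE n
  Alt :: RE n -> RE n -> RE ('S n)
  Seq :: RE n -> RE n -> RE ('S n)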

--
brandon s allbery                                      allb...@gmail.com
wandering unix systems administrator (available)     (412) 475-9364 vm/sms

Anton Tayanovskyy

Sep 18, 2011, 7:34:18 PM
to Roman Cheplyaka, haskel...@haskell.org
> I like the approach of Russ Cox[2]. One of the great ideas there (which I
> think he didn't emphasize enough) is to have a queue which allows O(1)
> testing whether an element is already there [3]. This solves the problem
> with priorities -- the states are guaranteed to be enqueued in the order
> of their priorities, and there are no repetitions.

Hm, this sounds great! So then the number of states is at most the size
of the NFA, and the memory does not depend on the input string length?
Have you tried this approach yet? I wouldn't vouch for my code in that
gist; I kind of remember trying to eliminate the duplicates while
preserving order, but I'm not sure I did it correctly there.

>
> To be able to process the states in the proper order I'll have to give
> up the Glushkov automaton and probably use something similar to your
> construction [4].
>
> [2] http://swtch.com/~rsc/regexp/regexp2.html
> [3] http://code.google.com/p/re2/source/browse/util/sparse_array.h
> [4] http://t0yv0.blogspot.com/2011/07/combinatory-regular-expressions-in.html
>
> --
> Roman I. Cheplyaka :: http://ro-che.info/
>

--
Kind Regards,
Anton Tayanovskyy

WebSharper™ - type-safe JavaScript in F#
http://intellifactory.com


Anton Tayanovskyy

Sep 18, 2011, 7:53:25 PM
to Brandon Allbery, haskel...@haskell.org
Chris, Brandon, thank you for the input. I believe I understand what
you are saying; to reiterate, yes, in the *general case*, neither ML
nor Haskell types rule out nastiness such as non-termination. Yes, I
know about and use Coq a bit. However, ML has an important *special
case*: not using functions. Chris, bang patterns in Haskell do
not quite help to statically rule out recursive terms, do they? They
just make for a different runtime error. If I said:

data Regex =
...
| Seq !Regex !Regex

Will it actually give the user who tries to construct a recursive
value a compile-time error? I suspect not; I'll try it myself to make
sure.

Roman, thanks, I will try to look that thread up - this is what I am
indeed curious about - how to enforce this at the type level.

Yes, with a vanilla interface sharing is not observable; some
libraries work with it and some do not. I am spooked out by this,
though: performance guarantees are very different for regular terms
and terms with sharing. Consider the regex engine. The "infinite"
regular expressions (well, they are not really regular) may suddenly
produce an infinite NFA, so memory use may become linear in the length
of the input string. Yes, a smart user will understand this, especially
in an area that is relatively well understood, such as regular expressions.
However, I find it extremely disturbing that library designers cannot
communicate such requirements through the type system and make performance
guarantees for all well-typed programs. ML makes it easy; I guess I am
just an ML convert :)


Thanks,

Anton


Edward Kmett

Sep 21, 2011, 8:05:08 PM
to Brandon Allbery, haskel...@haskell.org


On Sep 18, 2011, at 11:28 AM, Brandon Allbery <allb...@gmail.com> wrote:

> On Sat, Sep 17, 2011 at 22:11, Anton Tayanovskyy <anton.ta...@gmail.com> wrote:
>> By the way, can Haskell have a type that admits regular expressions and
>> only those? I mostly do ML these days, so trying to write up a regex
>> type in Haskell I was unpleasantly surprised to discover that there
>> are all sorts of exotic terms inhabiting it, which I did not have to
>> worry about in ML. Besides `undefined`, you can have terms that try
>> to define a grammar with nested parentheses. Using regex-applicative
>> syntax:
>>
>> expr = ... <|> pure (\ _ x _ -> x) <*> sym "(" <*> expr <*> sym ")"
>
> The general case for this is solving the Halting Problem, so neither Haskell nor any other language can do it.  You can disallow infinite values by encoding the length into the type, but (a) in Haskell this is painful and (b) runtime values now require runtime types, which you can accommodate but at the price of reintroducing the problems you are trying to prevent.  (Dependently typed languages work better for this, but I suspect the result is rather more draconian than you intend.)

You needn't solve the halting problem. Any reasonable total language can prevent this case and there is no need to keep types as simple as these around at runtime. See the Agda total parser combinator library for an example, but yes, dependent types are overkill.

I usually enforce constraints like this with ! patterns in the constructors, which at least lets me know that any attempt to define a cycle like this will bottom out, so I can safely think only inductive thoughts from there on out.

Another way is to just define a Template Haskell regex quasiquoter and enforce the limitation there. That of course costs you runtime generation.

-Edward Kmett

Brandon Allbery

Sep 23, 2011, 1:31:37 AM
to Edward Kmett, haskel...@haskell.org
On Wed, Sep 21, 2011 at 20:05, Edward Kmett <ekm...@gmail.com> wrote:
> I usually enforce constraints like this with ! patterns in the constructors, which at least lets me know that any attempt to define a cycle like this will bottom out, so I can safely think only inductive thoughts from there on out.

But strictness annotations act at runtime (assignment time); he wants *compile* time, which puts him into more interesting type territory.