https://github.com/feuerbach/regex-applicative/wiki/Call-For-Help
--
Roman I. Cheplyaka :: http://ro-che.info/
_______________________________________________
You are not the first to want an efficient Perl-like engine. The answer you
seek flows from Russ Cox (though in C++):
> http://google-opensource.blogspot.com/2010/03/re2-principled-approach-to-regular.html
> http://code.google.com/p/re2/
Quoting relevant bit:
> It also finds the leftmost-first match, the same match that Perl would, and
> can return submatch information. The one significant exception is that RE2
> drops support for backreferences¹ and generalized zero-width assertions,
> because they cannot be implemented efficiently.
On 13/09/2011 06:40, Roman Cheplyaka wrote:
> Please help make the regex-based parsing library efficient!
>
> https://github.com/feuerbach/regex-applicative/wiki/Call-For-Help
>
_______________________________________________
Hi Chris, thanks for the response.
There's one thing about Cox's work that I don't understand. On the
page [1] he mentions that the algorithm falls back to direct NFA
simulation (search for "If all else fails" on that page).
This defeats his goal of defending against DoS attacks (the second
paragraph of that page).
And I wouldn't be comfortable with an algorithm that is worst-case
exponential either.
Then there's another issue: I specifically want a combinator library,
and not every automaton-based algorithm can serve this purpose in a
statically typed language (unless there's some clever trick I don't
know).
[1]: http://swtch.com/~rsc/regexp/regexp3.html
--
Roman I. Cheplyaka :: http://ro-che.info/
_______________________________________________
Just to clarify: by "clever" trick I don't mean a "dirty" trick, like
using unsafeCoerce in frisby[1]. I guess I'm just too young and naive to
use that kind of stuff.
The Glushkov automaton is nice because it can be used to implement parsing
combinators in pure Haskell + GADTs.
[1]: http://hackage.haskell.org/packages/archive/frisby/0.1/doc/html/src/Text-Parsers-Frisby.html
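For illustration, a minimal sketch (with invented constructor names, not
regex-applicative's actual internals) of how a GADT supports a typed
combinator interface:

{-# LANGUAGE GADTs #-}

import Control.Applicative

-- Each node records the type of the value it produces, so the usual
-- Applicative/Alternative combinators can be typed directly.
data RE s a where
  Fail :: RE s a                              -- never matches
  Eps  :: a -> RE s a                         -- match nothing, return a
  Sym  :: (s -> Bool) -> RE s s               -- match one symbol
  Alt  :: RE s a -> RE s a -> RE s a          -- choice
  App  :: RE s (a -> b) -> RE s a -> RE s b   -- sequencing
  Rep  :: RE s a -> RE s [a]                  -- Kleene star

instance Functor (RE s) where
  fmap f = App (Eps f)

instance Applicative (RE s) where
  pure  = Eps
  (<*>) = App

instance Alternative (RE s) where
  empty = Fail
  (<|>) = Alt
  many  = Rep

sym :: Eq s => s -> RE s s
sym c = Sym (== c)

A matcher can then interpret such a tree with a Glushkov-style construction,
without any unsafe tricks.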
--
Eugene Kirpichov
Principal Engineer, Mirantis Inc. http://www.mirantis.com/
Editor, http://fprog.ru/
Hi Eugene, thanks for pointing that out.
Indeed, I now see that he uses breadth-first rather than depth-first
search. Then he has the same problem as I do. I shall study his code
more closely.
Since I know simple POSIX regular expression matching better, here is an
overview of it, first without subexpression capture:
(1) The full DFA may be exponentially large and thus cannot be built ahead of time.
(2) Building the DFA in a lazy fashion will cause memory to grow linearly, which
will be a problem if left to run too long.
(2.1) Pure lazy DFA building won't have a way to determine the currently
computed size of the DFA, and so is not going to help you.
(2.2) Cheating the "pure" with unsafePerformIO might be able to keep a count of
the number of computed DFA nodes. Good luck with thread safety.
(2.3) Making it a stateful expanding DFA would allow you to control and monitor
the growing number of DFA states.
(2.4) Adjust the stateful DFA to keep a limited cache of computed DFA nodes.
(3) Building an NFA from your regular expression can be cheap, but you would
have to be more clever than expanding "a(b{2,99}c){100}d" into 100 copies of 99
copies of b.
(4) Using an NFA with multiple active states (breadth-first) is like the limit of a DFA
with only a single computed node at a time. The number of states active in the
pure NFA can easily be counted and bounded to keep the system from using too
much memory. Without the lazy or stateful caching of DFA states this may be
slightly slower, depending on whether the cached DFA states would be re-used
heavily or not (a sketch of this follows below).
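For illustration, a minimal sketch of (4) under an assumed NFA representation
(integer state ids, explicit epsilon transitions); the active set is bounded
by the number of NFA states, not by the input length:

import qualified Data.IntSet as IntSet

-- Assumed NFA representation: integer states, a character-transition
-- function, and epsilon transitions.
data NFA = NFA
  { nfaStart  :: Int
  , nfaAccept :: IntSet.IntSet
  , nfaStep   :: Int -> Char -> [Int]
  , nfaEps    :: Int -> [Int]
  }

-- Breadth-first simulation: at every input position we hold one set of
-- active states, so memory is bounded by the NFA size.
accepts :: NFA -> String -> Bool
accepts nfa input =
    not . IntSet.null $
      IntSet.intersection (nfaAccept nfa)
        (foldl step (closure (IntSet.singleton (nfaStart nfa))) input)
  where
    -- epsilon-closure via a simple worklist
    closure states = go states (IntSet.toList states)
      where
        go seen []       = seen
        go seen (s:rest) =
          let new = filter (`IntSet.notMember` seen) (nfaEps nfa s)
          in  go (foldr IntSet.insert seen new) (new ++ rest)
    -- advance every active state over one character, then close again
    step states c =
      closure (IntSet.fromList (concatMap (\s -> nfaStep nfa s c) (IntSet.toList states)))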
If you want POSIX subexpression capture then your memory requirements go up by a
factor that is, in the worst case, proportional to the regular expression size. This
worst-case factor for "a(b{2,99}c){100}d" does have to track 100 copies of
99 copies of the tracking information. (The size of the NFA can still be kept
clever when adding capture, but the tracked information you now need cannot be
so clever.)
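For illustration (an assumed layout, not any particular engine's): each
simultaneously active state has to carry its own copy of the capture offsets,
which is where the extra factor comes from:

-- Each active NFA state ("thread") carries its own capture registers, so
-- per-step memory is (number of active states) x (number of capture groups),
-- independent of the input length.
data Thread = Thread
  { threadState    :: Int                 -- which NFA state this thread is in
  , threadCaptures :: [(Int, (Int, Int))] -- group index -> (start, end) offsets
  }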
The Russ Cox engine in re2 uses NFA/DFA techniques to achieve Perl semantics,
where it can check the possibilities in parallel. The same problems with
exploding DFA size and tracked information may still apply.
On 14/09/2011 07:32, Roman Cheplyaka wrote:
> * Eugene Kirpichov <ekirp...@gmail.com> [2011-09-14 08:38:10+0400]
>> Hi,
>> I don't see how fallback to NFA simulation is really a failure wrt DoS
>> attacks. It's not exponential in time or memory, just linear memory
>> (in size of regex) instead of constant, and slower than DFA.
>
> Hi Eugene, thanks for pointing that out.
>
> Indeed, I now see that he uses breadth-first rather than depth-first
> search. Then he has the same problem as I do. I shall study his code
> more closely.
>
_______________________________________________
Anton
[1] https://github.com/feuerbach/regex-applicative/wiki/Call-For-Help
--
Kind Regards,
Anton Tayanovskyy
WebSharper™ - type-safe JavaScript in F#
http://intellifactory.com
expr = ... <|> pure (\ _ x _ -> x) <*> sym "(" <*> expr <*> sym ")"
Is this the so-called 'paucity of types' [1]? What do you do, as a
library designer? Return bottom or try to detect and error out for
these kinds of values?
Thanks,
A
[1] http://existentialtype.wordpress.com/tag/recursion/
I'm not sure that I understand exactly what the question is... but if
you want to recover the set of values in ML types, you need only add an
'!' before each of the fields of each of your constructors, marking them
as strict fields. Sadly (but unsurprisingly), this does require using a
different version of lists and other fundamental types as well.
You'll be left with bottom, of course, but in strict-language thinking,
that's nothing but an acknowledgement that a function returning that
type may diverge... something you can't exclude in Haskell *or* ML, it's
just that ML has a different natural way of thinking about it. That way
lies Agda and Coq. So you get the next-closest thing: a so-called flat
type, whose values are bottom, and a single layer above it in the
information ordering.
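For example, a sketch with invented constructor names:

data Regex
  = Eps
  | Sym  !Char
  | Seq  !Regex !Regex
  | Alt  !Regex !Regex
  | Rep  !Regex

-- With every field strict, a Regex is either bottom or a completely finite
-- tree.  The recursive definition from earlier in the thread,
--
--   expr = Alt (Sym 'x') (Seq (Sym '(') (Seq expr (Sym ')')))
--
-- still type-checks, but forcing expr now demands expr itself, so it is
-- simply bottom rather than a usable infinite value.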
If I've misunderstood the question, my apologies... I haven't actually
been reading this thread.
--
Chris Smith
Just common prefix elimination is not sufficient to put a hard bound on
the memory usage. For example, something like /(a*)|((aa)*)/ would still
require an amount of space linear in the string length (if the string
consists of many a's).
More clever techniques that I tried ("monotonically map lists to
smaller lists") are hard to make correct.
I like the approach of Russ Cox[2]. One of the great ideas there (which I
think he didn't emphasize enough) is to have a queue which allows O(1)
testing whether an element is already there [3]. This solves the problem
with priorities -- the states are guaranteed to be enqueued in the order
of their priorities, and there are no repetitions.
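For illustration, a sketch of such a queue under the assumption of integer
state ids (Data.IntSet membership is O(log n) rather than the O(1) of the
sparse array in [3], but the queueing discipline is the same):

import qualified Data.IntSet as IntSet

-- States are enqueued at most once per step and come out in the order in
-- which they were first added, i.e. in priority order.
data UniqueQueue = UniqueQueue
  { uqSeen  :: IntSet.IntSet  -- states already enqueued during this step
  , uqOrder :: [Int]          -- insertion order, reversed
  }

emptyQueue :: UniqueQueue
emptyQueue = UniqueQueue IntSet.empty []

push :: Int -> UniqueQueue -> UniqueQueue
push s q@(UniqueQueue seen order)
  | s `IntSet.member` seen = q   -- duplicate: keep the priority of the first insertion
  | otherwise              = UniqueQueue (IntSet.insert s seen) (s : order)

inOrder :: UniqueQueue -> [Int]
inOrder = reverse . uqOrder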
To be able to process the states in the proper order I'll have to give
up Glushkov automaton and probably use something similar to your
construction [4].
[2] http://swtch.com/~rsc/regexp/regexp2.html
[3] http://code.google.com/p/re2/source/browse/util/sparse_array.h
[4] http://t0yv0.blogspot.com/2011/07/combinatory-regular-expressions-in.html
Essentially, this is an "infinite" regular expression.
I think it'd be possible to enforce finiteness at the type level. (There was a
thread here some time ago about a type of finite lists.)
If a library can work with such infinite trees, and the user is happy
with the resulting performance, then why not.
For example, this is the case with the weighted-regexp library, and also
with the first (and buggy) version of regex-applicative [2].
Otherwise, just let the user know that this function doesn't work with
infinite data structures. Note that in the presence of referential
transparency (i.e. without StableNames or similar tools) it is not
possible to decide at runtime if a value is circular or not.
[2] https://github.com/feuerbach/regex-applicative/commit/95ddfbbfb01c47e292dc7f03069eefe00008907f
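For illustration, a StableName-based sketch (assumed Regex type): it can
detect a cycle built by sharing, which is how an "infinite" regex is usually
written, but not an infinite value that happens not to share nodes:

import System.Mem.StableName

data Regex = Eps | Sym Char | Seq Regex Regex | Alt Regex Regex | Rep Regex

-- Walk the value, remembering the StableNames of the nodes on the current
-- path; meeting the same node again means the value is cyclic.  This is a
-- heuristic: it observes sharing, which evaluation may change.
isCyclic :: Regex -> IO Bool
isCyclic = go []
  where
    go path r = do
      name <- makeStableName $! r
      if any (== name) path
        then return True
        else case r of
          Seq a b -> orM (go (name : path) a) (go (name : path) b)
          Alt a b -> orM (go (name : path) a) (go (name : path) b)
          Rep a   -> go (name : path) a
          _       -> return False
    orM ma mb = do a <- ma; if a then return True else mb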
Thank you for an interesting overview.
However, I'm not worried directly about DoS. I just want to build a
regex library which would be convenient to use for parsing regular
languages (by providing applicative interface and Perl semantics), and
not worse than alternatives performance-wise (including worst-case time
and space complexity).
> (3) Building an NFA from your regular expression can be cheap, but you would
> have be more clever than expanding "a(b{2,99}c){100}d" to have 100 copies of 99
> copies of b.
Are there any known algorithms more clever than that?
--
Roman I. Cheplyaka :: http://ro-che.info/
_______________________________________________
By the way, can Haskell have a type that admits regular expressions and
only those? I mostly do ML these days, so when trying to write up a regex
type in Haskell I was unpleasantly surprised to discover that there
are all sorts of exotic terms inhabiting it, which I did not have to
worry about in ML. Besides `undefined`, you can have terms that try
to define a grammar with nested parentheses. Using regex-applicative
syntax:
expr = ... <|> pure (\ _ x _ -> x) <*> sym "(" <*> expr <*> sym ")"
Hm, this sounds great! So then the number of states is just the size
of the NFA, so the memory does not depend on the input string length?
Have you tried this approach yet? I wouldn't vouch for my code in that
gist; I kind of remember trying to eliminate the duplicates while
preserving order, but I'm not sure if I did it correctly there.
>
> To be able to process the states in the proper order I'll have to give
> up Glushkov automaton and probably use something similar to your
> construction [4].
>
> [2] http://swtch.com/~rsc/regexp/regexp2.html
> [3] http://code.google.com/p/re2/source/browse/util/sparse_array.h
> [4] http://t0yv0.blogspot.com/2011/07/combinatory-regular-expressions-in.html
>
> --
> Roman I. Cheplyaka :: http://ro-che.info/
>
--
Kind Regards,
Anton Tayanovskyy
WebSharper™ - type-safe JavaScript in F#
http://intellifactory.com
_______________________________________________
data Regex =
...
| Seq !Regex !Regex
Will it actually give the user who tries to construct a recursive
value a compile-time error? I suspect not, I'll try it myself to make
sure.
Roman, thanks, I will try to look that thread up - this is what I am
indeed curious about - how to enforce this at the type level.
Yes, with a vanilla interface sharing is not observable; some
libraries work with it and some do not. I am spooked out by this
though: performance guarantees are very different for regular terms
and terms with sharing. Consider the regex engine. The "infinite"
regular expressions (well, they are not really regular), may suddenly
produce an infinte NFA, so memory use may become linear in the input
string. Yes, a smart user will understand this, especially in an area
that is relatively well-understood such as regular expressions.
However I find it extremely disturbing that library designers do not
communicate requirements through the type system, and make performance
guarantees for all well-typed programs. ML makes it easy, I guess I am
just an ML convert :)
Thanks,
Anton
--
On Sat, Sep 17, 2011 at 22:11, Anton Tayanovskyy <anton.ta...@gmail.com> wrote:
> By the way, can Haskell have a type that admits regular expressions and
> only those? I mostly do ML these days, so when trying to write up a regex
> type in Haskell I was unpleasantly surprised to discover that there
> are all sorts of exotic terms inhabiting it, which I did not have to
> worry about in ML. Besides `undefined`, you can have terms that try
> to define a grammar with nested parentheses. Using regex-applicative
> syntax:
>
> expr = ... <|> pure (\ _ x _ -> x) <*> sym "(" <*> expr <*> sym ")"

The general case for this is solving the Halting Problem, so neither Haskell nor any other language can do it. You can disallow infinite values by encoding the length into the type, but (a) in Haskell this is painful and (b) runtime values now require runtime types, which you can accommodate but at the price of reintroducing the problems you are trying to prevent. (Dependently typed languages work better for this, but I suspect the result is rather more draconian than you intend.)
I usually enforce constraints like this with ! annotations on the constructor fields, which at least lets me know that any attempt to define a cycle like this will bottom out, so I can safely think only inductive thoughts from there on out.
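For the curious, a sketch of the "encode the size into the type" route
(invented names, and it is indeed painful): indexing the regex by a
type-level depth bound rules out cyclic definitions built from the
constructors, though plain bottom still inhabits the type:

{-# LANGUAGE DataKinds, GADTs, KindSignatures #-}

data Nat = Z | S Nat

-- Every constructor that nests sub-expressions bumps the depth index, so a
-- value of type Regex n is a tree of depth at most n.
data Regex (n :: Nat) where
  Eps :: Regex n
  Sym :: Char -> Regex n
  Seq :: Regex n -> Regex n -> Regex ('S n)
  Alt :: Regex n -> Regex n -> Regex ('S n)
  Rep :: Regex n -> Regex ('S n)

-- A definition like
--   expr = Alt (Sym 'x') (Seq (Sym '(') (Seq expr (Sym ')')))
-- is rejected: expr would have to have some depth n and a strictly larger
-- depth at the same time.  (undefined still inhabits every Regex n, as
-- noted above.)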