Low-hanging fruit: A Regex compiler

78 views
Skip to first unread message

Jeffrey Kegler

unread,
Jan 15, 2013, 2:51:25 PM1/15/13
to Marpa Parser Mailing LIst
There are a number of projects that would, I think, be quite popular and
useful but which I simply don't have the cycles to consider doing
myself. One is a Regex compiler -- a compiler from some nice BNF-ish
format, to Perl regular expressions. I'd think this could be very
popular -- it would be very much in the comfort zone of some programmers
who otherwise would not consider using Marpa.

To be specific, this is another specialized Marpa-to-Perl compiler. The
compiler would write a Perl regex, and the Perl regex would be what
actually runs. The value added by Marpa would be that more complex
regexes could be more quickly and easily written, and the output regex
could be nicely pretty-printed and commented.

Sometimes not understood is that one thing regular expressions *cannot*
parse is the representation of a regular expression. Regular
expressions are defined recursively, but they do not themselves deal
with recursion.

One way to think of this project is as a Marpa super-superset of
Regexp::Common, whose functionality could be incorporated. A related
effort within Perl was the DEFINE predicate for sub-patterns, but DEFINE
had horrific syntax and AFAIK was little or never used.

Durand Jean-Damien

unread,
Jan 15, 2013, 3:43:07 PM1/15/13
to marpa-...@googlegroups.com
Interesting, indeed.
Isn't perl6 providing a BNF for it already ? I see in the litterature a BNF tentative for the Perl5 regexps here.
Or do you mean perl6 regexp as well. A popular thing will be perl5+6 compatible.

Ron Savage

unread,
Jan 15, 2013, 4:25:05 PM1/15/13
to marpa-...@googlegroups.com
Hi

Would it make sense to write a new parser for CSV files, using Marpa?

It'd be good to coalesce Text::CSV, Text::CSV_XS and Text::CSV::Encoded.

--
Ron Savage
http://savage.net.au/
Ph: 0421 920 622

Ruslan Zakirov

unread,
Jan 15, 2013, 5:46:48 PM1/15/13
to marpa-...@googlegroups.com
On Wed, Jan 16, 2013 at 12:43 AM, Durand Jean-Damien
<jeandami...@gmail.com> wrote:
> Interesting, indeed.
> Isn't perl6 providing a BNF for it already ? I see in the litterature a BNF
> tentative for the Perl5 regexps here.
> Or do you mean perl6 regexp as well. A popular thing will be perl5+6
> compatible.

It would be awesome to have more targets, for example JavaScript. So
one would be able to write a language (protocol) in BNF form and parse
it in JS on client side and with perl or other language on server
side. It's for sure interesting task and chalenging one.

Markdown language comes to mind as a thing I needed a parser for in JS and Perl.

Also, perl5's regexps are recursive [1].

[1] http://www.catonmat.net/blog/recursive-regular-expressions/

> Le mardi 15 janvier 2013 20:51:25 UTC+1, Jeffrey Kegler a écrit :
>>
>> There are a number of projects that would, I think, be quite popular and
>> useful but which I simply don't have the cycles to consider doing
>> myself. One is a Regex compiler -- a compiler from some nice BNF-ish
>> format, to Perl regular expressions. I'd think this could be very
>> popular -- it would be very much in the comfort zone of some programmers
>> who otherwise would not consider using Marpa.
>>
>> To be specific, this is another specialized Marpa-to-Perl compiler. The
>> compiler would write a Perl regex, and the Perl regex would be what
>> actually runs. The value added by Marpa would be that more complex
>> regexes could be more quickly and easily written, and the output regex
>> could be nicely pretty-printed and commented.
>>
>> Sometimes not understood is that one thing regular expressions *cannot*
>> parse is the representation of a regular expression. Regular
>> expressions are defined recursively, but they do not themselves deal
>> with recursion.
>>
>> One way to think of this project is as a Marpa super-superset of
>> Regexp::Common, whose functionality could be incorporated. A related
>> effort within Perl was the DEFINE predicate for sub-patterns, but DEFINE
>> had horrific syntax and AFAIK was little or never used.



--
Best regards, Ruslan.

Jeffrey Kegler

unread,
Jan 15, 2013, 7:48:36 PM1/15/13
to marpa-...@googlegroups.com
I noted various BNF representations of Perl5 regexes on the Web, but didn't link to any because I haven't judged their quality.  As for whether to include Perl 6 extensions, I didn't have them in mind, but maybe I should have. :-)  But it's really up to whoever decides to tackle the task. -- jeffrey

rns

unread,
Jan 25, 2014, 8:06:58 AM1/25/14
to marpa-...@googlegroups.com
Can Verbal Expressions (stripped of syntactic treacle) serve as such nice BNF-ish format? E.g.

# Create an example of how to test for correctly formed URLs 
start_of_line 
 
"http" maybe "s" then "://" maybe "www." anything_but " "
 
end_of_line

This can be extended as needed for advanced Perl regex features 
and replace function.

Such BNF format would also provide nice syntax highlighting hardly 
possible with regex'es and, with a superset of Regexp::Common 
and pattern reuse functionality turn into a DSL of its own.

Is MarpaX::Regex::Verbal a thing worth trying? What do you think?

Jeffrey Kegler

unread,
Jan 25, 2014, 12:59:06 PM1/25/14
to marpa-...@googlegroups.com
I like the idea.  Regular expressions, as parsed using special-purpose regular expression engines, will always survive, because of the speed.  But a lot of things people do with regular expressions would benefit from more power.  -- jeffrey

--
You received this message because you are subscribed to the Google Groups "marpa parser" group.
To unsubscribe from this group and stop receiving emails from it, send an email to marpa-parser...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Ruslan Shvedov

unread,
Jan 25, 2014, 1:09:56 PM1/25/14
to marpa-...@googlegroups.com
Great, I'll put it on my queue.

Durand Jean-Damien

unread,
Jan 25, 2014, 1:24:06 PM1/25/14
to marpa-...@googlegroups.com
Note that a startup can be the ECMAScript pattern specification - ECMAScript implements a very low subset of perl regexp pattern, but it shows quite well how such a grammar should be structured.

Ruslan Shvedov

unread,
Jan 25, 2014, 1:26:59 PM1/25/14
to marpa-...@googlegroups.com
Thanks Jean Damien! I've already found a couple of some similar but this one is very clear.


Durand Jean-Damien

unread,
Jan 25, 2014, 1:40:17 PM1/25/14
to marpa-...@googlegroups.com
Then you might like my implementation - the only subtility in this simple grammar is that I use these user-defined character classes

Ruslan Shvedov

unread,
Jan 25, 2014, 1:51:23 PM1/25/14
to marpa-...@googlegroups.com
Looks good. I'll certainly have to steal from there. Thanks for pointing me to it.

Ron Savage

unread,
Jan 25, 2014, 5:24:04 PM1/25/14
to marpa-...@googlegroups.com
Regexp::VerbalExpressions is certainly an interesting idea. I had not heard of it.

Nevertheless, I'm a bit worried the end result looks like Cobol, i.e. too wordy. Is it really progress to go from:
  s?
to
  maybe 's'
I'm not yet convinced :-).
Perhaps if you write a use-case I would change my mind.

Ruslan Shvedov

unread,
Jan 25, 2014, 6:35:12 PM1/25/14
to marpa-...@googlegroups.com
Well, verbal expressions are for illustration only, I have in mind smth. along the lines of:

input

../cpan/lib/dev/file.c(11824) : warning C4820: '__unnamed' : '3' bytes padding added after member 'c'
../cpan/lib/dev/file.c(12464) : warning C4100: 'param' : unreferenced formal parameter
file.c(12538) : warning C4127: conditional expression is constant

grammar

    list ::= warning+ separator => [\n]
    
    warning ::= file ('(') line (')') (':' 'warning') code (':') message

    file    ~ [\w\-:/ \\\.]+    # windows and unix paths
    line    ~ int
    code    ~ 'C' int
    message ~ [\w' \-:]+ #'

    int     ~ [\d]+

output

[7,
[8,
[9,'../cpan/lib/dev/file.c'],[10,'11824'],[11,'C4820'],[12,' \'__unnamed\' : \'3\' bytes padding added after member \'c\'']],
[8,
[9,'../cpan/lib/dev/file.c'],[10,'12464'],[11,'C4100'],[12,' \'param\' : unreferenced formal parameter']],
[8,
[9,'file.c'],[10,'12538'],[11,'C4127'],[12,' conditional expression is constant']
]
]

7, 8, 9, 10 are IDs of list, warning, file, line


What (as I think) is remarkable—how the structure of the grammar naturally follows from the input, once you split it to parts, give them names and fit them together.
 
Now, imagine the same grammar in a regex or, worse, Regexp::Grammars or Rec::Descent with their punctuation signs overloaded to line noise. 

When I thought, for comparison, write the same in regex or Regexp::Grammar I just wasn't able to make myself do it, so ugly the syntax/semantics combination started to seem.

Ron Savage

unread,
Jan 25, 2014, 9:00:51 PM1/25/14
to marpa-...@googlegroups.com
Yes, thanx. That's a great example.

Durand Jean-Damien

unread,
Jan 26, 2014, 5:35:53 PM1/26/14
to marpa-...@googlegroups.com
FYI I was reading http://www.unicode.org/reports/tr18/ - less easy to understand but well -;

Ruslan Shvedov

unread,
Jan 27, 2014, 12:14:16 AM1/27/14
to marpa-...@googlegroups.com
Yep, interesting read, thanks for sharing.
Reply all
Reply to author
Forward
0 new messages