Different approaches to enable MULTILINE

milly...@gmail.com

Jun 12, 2018, 10:03:36 AM
to re2-dev
Multi-line mode is very useful. As I understand it, RE2 offers at least two ways to enable it:
1. Call set_one_line(false) together with set_posix_syntax(true) on the RE2 options;
2. Add the 'm' flag to the regex pattern. For example, rewrite the regex "abc" as "(?m:abc)" and then compile it.

What are the differences between these two approaches? It seems to me that the first cannot be applied to patterns that do not follow POSIX syntax, such as "(?:text|application)/xml". Thanks for your time.
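
As a rough illustration of the second approach, the pattern can be wrapped programmatically. This is only a sketch: `WrapMultiline` is a hypothetical helper, not part of RE2, and it assumes the result will be compiled with Perl syntax (RE2's default), where inline flags are accepted.

```cpp
#include <string>

// Hypothetical helper (not part of RE2): wraps a pattern in a non-capturing
// group carrying the 'm' flag, so that ^ and $ also match at line boundaries.
// Assumes the result is compiled with Perl syntax, where (?m:...) is accepted.
std::string WrapMultiline(const std::string& pattern) {
  return "(?m:" + pattern + ")";
}
```

For example, `WrapMultiline("abc")` yields `"(?m:abc)"`, which could then be passed to the `RE2` constructor. Because the added group is non-capturing, the numbering of any capture groups already in the pattern stays the same.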

Best,
Milly 


Paul Wankadia

Jun 12, 2018, 12:43:01 PM
to milly...@gmail.com, re2...@googlegroups.com
It really comes down to whether you want Perl syntax or POSIX syntax. Either way, yes, you can have multi-line mode.

With Perl syntax (!posix_syntax), you get one_line by default. You can't set !one_line via the options, but you can use (?-m) and (?m) in the pattern.

With POSIX syntax (posix_syntax), you get !one_line by default. You can set one_line via the options, but you can't use (?-m) and (?m) in the pattern.

one_line or (?-m) gives you one-line mode.

!one_line or (?m) gives you multi-line mode.
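
The rules above can be summarised as a small decision function. This is a hypothetical model written for illustration only: it does not call RE2, the names `InlineFlag`, `LineMode` and `EffectiveLineMode` are invented here, and it treats "you can't use (?m) in the pattern" under POSIX syntax as the pattern being rejected.

```cpp
// Hypothetical model of the rules described in this thread. It is not RE2
// code and does not call RE2; all names below are invented for illustration.
enum class InlineFlag { kNone, kM, kMinusM };  // nothing, (?m), or (?-m)
enum class LineMode { kOneLine, kMultiLine, kPatternRejected };

LineMode EffectiveLineMode(bool posix_syntax, bool one_line, InlineFlag flag) {
  if (posix_syntax) {
    // POSIX syntax: the option decides, and inline flags are not accepted.
    if (flag != InlineFlag::kNone) return LineMode::kPatternRejected;
    return one_line ? LineMode::kOneLine : LineMode::kMultiLine;
  }
  // Perl syntax: the one_line option is ignored; only (?m) switches modes.
  return flag == InlineFlag::kM ? LineMode::kMultiLine : LineMode::kOneLine;
}
```

Under this model, `EffectiveLineMode(false, false, InlineFlag::kNone)` is one-line mode even though `one_line` is false, which is the asymmetry that tends to surprise people.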

Mi Zhang

Jun 12, 2018, 9:52:34 PM
to jun...@google.com, re2...@googlegroups.com
Hi Paul,

Got it. Thank you!

Best,
Milly

mic...@avinetworks.com

Sep 26, 2018, 11:57:13 AM
to re2-dev
[Apologies for following up so late. I've just hit the same problem.]

Why this asymmetry? This comment in re2.h makes it sound as if multi-line mode didn't work at all in Perl syntax:

    // The following options are only consulted when posix_syntax == true.
    // (When posix_syntax == false these features are always enabled and
    // cannot be turned off.)
    //   perl_classes     (false) allow Perl's \d \s \w \D \S \W
    //   word_boundary    (false) allow Perl's \b \B (word boundary and not)
    //   one_line         (false) ^ and $ only match beginning and end of text

The work-around of (?m) isn't mentioned. I find that confusing, user-unfriendly and unnecessarily complicated.
I have a patch including tests that makes set_one_line(false) work for me with Perl syntax enabled.
Please let me know whether you'd be interested in a PR.

Thanks,
Michael

Paul Wankadia

Sep 27, 2018, 5:25:55 AM
to mic...@avinetworks.com, re2...@googlegroups.com
Thanks, but I think that I would prefer to mention (?m) in the comment. Is there any particular wording that you would have found helpful?

mic...@avinetworks.com

Sep 27, 2018, 8:49:56 AM
to re2-dev
Hi Paul,

thanks for your response. Would you mind elaborating on why you don't like this idea?

Regarding the wording, here's a suggestion:

    // The following options are only consulted when posix_syntax == true.
    // When posix_syntax == false these features are always enabled and
    // cannot be turned off. To use multi-line matching with PCRE syntax,
    // prepend the regular expression with (?m), for example "(?m)^qwer$"
    // will match "one line\nqwer\n another line".
    //   perl_classes     (false) allow Perl's \d \s \w \D \S \W
    //   word_boundary    (false) allow Perl's \b \B (word boundary and not)
    //   one_line         (false) ^ and $ only match beginning and end of text


Paul Wankadia

Sep 28, 2018, 5:44:55 AM
to mic...@avinetworks.com, re2...@googlegroups.com
On Thu, Sep 27, 2018 at 10:49 PM <mic...@avinetworks.com> wrote:

thanks for your response. Would you mind elaborating on why you don't like this idea?

I do like the idea itself. The RE2 options one_line, dot_nl, case_sensitive and never_capture correspond to the Perl modifiers m, s, i and n, respectively, so I wish that RE2 respected one_line when using Perl syntax, but I don't want to make it do so, because it hasn't done so for nearly ten years. Suddenly imbuing an existing option with a new meaning isn't going to break any builds and probably isn't going to cause any tests to fail, and that's why it's risky. If RE2 were rejecting the options instead of ignoring them, then this change could be made with confidence, but RE2 is nodding and smiling regardless, so I can't rightly make it silently do something subtly different.

API design is hard. API change is harder. :(
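
To make the "rejecting instead of ignoring" distinction concrete, here is a minimal sketch of a caller-side check. It is purely hypothetical: RE2 performs no such validation, and `CheckOneLineRequest` and its parameters are names invented for this example.

```cpp
#include <string>

// Hypothetical pre-flight check; RE2 itself performs no such validation.
// Returns an empty string when the request is meaningful, otherwise a
// human-readable reason why the setting would be silently ignored.
std::string CheckOneLineRequest(bool posix_syntax,
                                bool wants_multi_line_via_option) {
  if (!posix_syntax && wants_multi_line_via_option) {
    return "one_line is ignored with Perl syntax; use (?m) in the pattern";
  }
  return "";
}
```

A wrapper that surfaced this message at construction time would turn the silent mismatch into a visible one without changing what any existing option means.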

Regarding the wording, here's a suggestion:

    // The following options are only consulted when posix_syntax == true.
    // When posix_syntax == false these features are always enabled and
    // cannot be turned off. To use multi-line matching with PCRE syntax,
    // prepend the regular expression with (?m), for example "(?m)^qwer$"
    // will match "one line\nqwer\n another line".

Thanks, I will tweak the comment soon.
