Quantification after \Q ... \E


David Wahlstedt

Aug 9, 2024, 6:26:02 PM
to PCRE2 discussion list
Hi, I tried quantifying after a \Q \E quoting, and it got accepted. I thought it was not allowed.

It seems that the quantifier applies to the last character in the quoting. For instance, with \Qabc\E{2,} I got:

PCRE2 version 10.45-DEV 2024-06-09 (8-bit)
/\Qabc\E{2,}/debug
------------------------------------------------------------------
  0  13 Bra
  3     ab
  7     c{2}
 11     c*+
 13  13 Ket
 16     End
------------------------------------------------------------------
Capture group count = 0
First code unit = 'a'
Last code unit = 'c'
Subject length lower bound = 4
abcc
 0: abcc

Is this the intended behaviour?

I found earlier that it is a bit similar with ranges that have quotings in them:
[\Qabc\E-z] means [a-z]: the lowest value in the quoting is the lower bound of the range. Similarly, the upper bound is z in [a-\Qxyz\E], which is [a-z].

Am I right?

Sorry if I have missed something in the man pages; I searched them for quite some time.

/David

Philip Hazel

Aug 10, 2024, 4:08:24 AM
to David Wahlstedt, PCRE2 discussion list
Hi David,
You are discovering interesting corners in PCRE2. The short answer is, yes, it is intended, and is exactly the same behaviour as Perl. A \Q...\E sequence is just a shorthand for escaping (where necessary) all the characters in between. So, for example, \Q()\E-9 is the same as \(\)-9 where it is more obvious that \) is the start of a range. I have made a note to add some clarification to the documentation of \Q...\E.
Regards,
Philip
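
The description above of \Q...\E as shorthand for escaping can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not PCRE2's implementation: the function name expand_quoting is hypothetical, it escapes every non-alphanumeric character (more than strictly "where necessary"), and Python's re module stands in for PCRE2 when checking the expanded pattern.

```python
import re

def expand_quoting(pattern: str) -> str:
    """Rewrite \\Q...\\E spans as backslash-escaped text (hypothetical helper)."""
    out = []
    i = 0
    quoting = False
    while i < len(pattern):
        two = pattern[i:i + 2]
        if quoting:
            if two == r"\E":
                quoting = False          # end of the quoted span
                i += 2
            else:
                ch = pattern[i]
                # Assumption: escaping all non-alphanumerics is harmless.
                out.append(ch if ch.isalnum() else "\\" + ch)
                i += 1
        elif two == r"\Q":
            quoting = True               # an unterminated \Q runs to the end
            i += 2
        elif two == r"\E":
            i += 2                       # isolated \E is ignored
        elif pattern[i] == "\\" and i + 1 < len(pattern):
            out.append(two)              # keep other escapes, e.g. \\ or \d
            i += 2
        else:
            out.append(pattern[i])
            i += 1
    return "".join(out)

expanded = expand_quoting(r"\Qabc\E{2,}")
print(expanded)                              # abc{2,}
print(bool(re.fullmatch(expanded, "abcc")))  # True
print(expand_quoting(r"\Q()\E-9"))           # \(\)-9
```

After expansion the quantifier sits directly after the last quoted character, which is why only "c" is repeated.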


--
You received this message because you are subscribed to the Google Groups "PCRE2 discussion list" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pcre2-dev+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/pcre2-dev/8b42561f-c5a9-4c12-b502-851093795a73n%40googlegroups.com.

David Wahlstedt

Aug 10, 2024, 6:55:44 AM
to PCRE2 discussion list
Thank you for the clarification!

So for instance

[\Qac\E-z] matches a or c-z
 
and

[\Qa-\Ez] matches a, -, or z

so in this case the hyphen does not cause a range a-z to be formed.
This means that the \Q \E is not really just like a macro/preprocessing thing, but actually changes the way the expression is to be interpreted.

So  `a{\Q1\E,2}` matches literally `a{1,2}`, a bit similar to inserting a comment or an isolated \E in such a position.
a{\E1,2} or a{(?#XYZ),2}
are two other ways of "removing meaning" forcing it to be taken as a literal string.

One of the difficulties in understanding this syntax is knowing where the borderline lies between hard and "soft" errors: when to give an error and when to fall back to a literal string interpretation.

I am working on a PCRE2 parser in Haskell that, given a PCRE2 expression, returns an abstract syntax tree, trying to represent the expression as accurately as I can. This leads me into these corner cases. Maybe I'll share it here if it turns out decent, and if it may be of interest.

Best regards,
David


David Wahlstedt

Aug 10, 2024, 8:40:44 AM
to PCRE2 discussion list
Another thing about quantification and quoting:

An empty quoting followed by a quantifier is illegal:

PCRE2 version 10.45-DEV 2024-06-09 (8-bit)
/\Q\E{2,}/debug
Failed: error 109 at offset 7: quantifier does not follow a repeatable item

But if it is inside a group it is allowed:

PCRE2 version 10.45-DEV 2024-06-09 (8-bit)
/(\Q\E){2,}/debug
------------------------------------------------------------------
  0  19 Bra
  3   5 CBra 1
  8   5 Ket
 11   5 SCBra 1
 16   5 KetRmax
 19  19 Ket
 22     End
------------------------------------------------------------------
Capture group count = 1
May match empty string
Subject length lower bound = 0
aa
 0:
 1: 

If something precedes the empty quoting it is also ok:

PCRE2 version 10.45-DEV 2024-06-09 (8-bit)
/(a|b)\Q\E{2,}+/debug
------------------------------------------------------------------
  0  39 Bra
  3  33 Once
  6   7 CBra 1
 11     a
 13   5 Alt
 16     b
 18  12 Ket
 21   7 CBraPos 1
 26     a
 28   5 Alt
 31     b
 33  12 KetRpos
 36  33 Ket
 39  39 Ket
 42     End
------------------------------------------------------------------
Capture group count = 1
Starting code units: a b
Subject length lower bound = 2
aa
 0: aa
 1: a

Sorry for the lengthy posts about these corner cases; I realize there is a lot of work involved in parsing this 100% accurately. I try to make my parser forgiving in uncertain cases, and then one can always impose rules on top of the parsed result if necessary. I think I can safely treat empty quotings as comments (same as isolated \E's).

Best regards,
David


On Saturday, 10 August 2024 at 10:08:24 UTC+2, Philip Hazel wrote:

Philip Hazel

Aug 10, 2024, 12:11:13 PM
to David Wahlstedt, PCRE2 discussion list

> [\Qac\E-z] matches a or c-z

Yes. 

> [\Qa-\Ez] matches a, -, or z

Yes, because an escaped hyphen in a class becomes a literal.
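
The two class examples can be checked by hand-expanding the quoting into backslash escapes. The sketch below uses Python's re module as a stand-in, on the assumption that it treats the expanded classes the same way PCRE2 does. The helper name members and the candidate characters are made up for the illustration.

```python
import re

range_class = re.compile(r"[ac-z]")     # hand-expanded form of [\Qac\E-z]
literal_class = re.compile(r"[a\-z]")   # hand-expanded form of [\Qa-\Ez]

def members(cls, candidates="abcmz-"):
    """Return the candidate characters that the class accepts."""
    return [ch for ch in candidates if cls.fullmatch(ch)]

print(members(range_class))    # ['a', 'c', 'm', 'z']
print(members(literal_class))  # ['a', 'z', '-']
```

In the first class the hyphen is outside the quoting, so it still forms the range c-z. In the second it is quoted, so it is just one more member of the class.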

> This means that the \Q \E is not really just like a macro/preprocessing thing, but actually changes the way the expression is to be interpreted.

Yes. It turns meta-characters into literal characters, just as backslash does on a single character.

> So `a{\Q1\E,2}` matches literally `a{1,2}`,

Oooh! You have found a difference from Perl here. In PCRE2, indeed, it does literally match 'a{1,2}' but in Perl the quantifier applies. I will consider whether to change PCRE2 to match Perl, or just document the difference. I'm leaning towards the former, because in general, non-meta characters within \Q...\E behave the same as they would outside \Q...\E. Conceptually (or possibly even actually) there are two passes: the first pass identifies which characters are meta-characters and which are literals; the second interprets the pattern. So "a{\Q1\E,2}" should behave the same as "a{1,2}" because "1" is not a meta-character so putting it within \Q...\E should have no effect.
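
The first of the two conceptual passes can be sketched as a function that labels each character as a meta-character or a literal. This is a toy model of the idea described above, not PCRE2 code: the name first_pass and the "meta"/"lit" labels are hypothetical, only the metacharacters that apply outside a character class are listed, and escapes such as \d are not given any special handling.

```python
# Metacharacters outside a character class (simplifying assumption: classes
# and alphanumeric escapes like \d are not modelled here).
META = set("\\^$.[|()?*+{")

def first_pass(pattern):
    """Label each character 'meta' or 'lit'; \\Q and \\E emit nothing."""
    tokens = []
    i = 0
    quoting = False
    while i < len(pattern):
        two = pattern[i:i + 2]
        if two == r"\E":
            quoting = False              # ends a quoting, or is ignored
            i += 2
        elif quoting:
            tokens.append((pattern[i], "lit"))
            i += 1
        elif two == r"\Q":
            quoting = True
            i += 2
        elif pattern[i] == "\\" and len(two) == 2 and not two[1].isalnum():
            tokens.append((two[1], "lit"))   # backslash-escaped punctuation
            i += 2
        else:
            ch = pattern[i]
            tokens.append((ch, "meta" if ch in META else "lit"))
            i += 1
    return tokens

print(first_pass(r"a{\Q1\E,2}") == first_pass(r"a{1,2}"))  # True
print(first_pass(r"a{\E1,2}") == first_pass(r"a{1,2}"))    # True
print(first_pass(r"\Q{\E"), first_pass("{"))
```

Under this model a second pass sees the same token stream for all three spellings and would recognise the quantifier in each, which is the Perl-compatible outcome. Quoting only makes a difference for characters that would otherwise be meta, such as "{".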

> a{\E1,2} or a{(?#XYZ),2}
> are two other ways of "removing meaning" forcing it to be taken as a literal string.

For the first of those, PCRE2 and Perl differ, but for the second, they both behave in the same way. An isolated \E should just be ignored, I think; ah yes, that is documented, so PCRE2 is wrong there and needs fixing. Ah yes, I think I see why this has happened. In Perl, the string is probably processed for \Q...\E even before it is interpreted as a regex, whereas in PCRE2 a single pass handles escapes and identifies meta-characters, so when "{" is found without a number or comma following, it is not recognized as a quantifier. I will have to see how easy it is to fix that.

> An empty quoting followed by a quantifier is illegal:

Yes, because an empty quoting might as well not be there; it has no effect whatsoever.

> But if it is inside a group it is allowed

Yes, because an empty capturing group is allowed, and groups may be quantified, though quantifying an empty group doesn't actually make any sense, of course. Any number times nothing is still nothing:

PCRE2 version 10.45-DEV 2024-06-09 (8-bit)
/ab()*cd/
    abcd
 0: abcd
 1: 


Both Perl and PCRE2 detect the infinite match of nothing in that example. They break the loop and carry on, and give the same result.
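
The same pattern can be tried in a third engine. Python's re module is neither Perl nor PCRE2, so this is only a stand-in used on the assumption that it also guards against repeating an empty match forever. It says nothing about how Perl or PCRE2 implement their check.

```python
import re

# A quantified empty capturing group between two literal parts.
m = re.fullmatch(r"ab()*cd", "abcd")

print(m is not None)       # the match completes instead of looping
print(m.group(0))          # overall match
print(repr(m.group(1)))    # whatever the engine recorded for the empty group
```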

The Perl regex pattern syntax has been extended many times over the years (decades, actually). New features don't always fit too well, leading to various edge case oddities, as you have discovered. I will consider the Perl differences and maybe change things, but it will probably be next week at the earliest because there are other things to do...

> One of the difficulties in understanding this syntax is knowing where the borderline lies between hard and "soft" errors: when to give an error and when to fall back to a literal string interpretation.

Ah yes, Perl has historically had a very "soft" approach which sometimes causes problems. A warning mode was introduced and nowadays Perl is somewhat stricter in its behaviour. PCRE has always been stricter. For example, Perl will still treat an undefined escape such as \m as a literal "m" unless called in its warning mode, whereas PCRE has always faulted it. However, you are right in that the "soft" errors are a difficulty. 

Regards,
Philip


David Wahlstedt

Aug 10, 2024, 1:43:20 PM
to PCRE2 discussion list
Thanks for the input!

A similar situation arises with comments: they are stripped away from the expression, but they do affect how an expression is interpreted. For instance, a comment may terminate an escape sequence of the form \ddd for octal digits: \22(?#X)2 is octal 22 followed by a '2'. So it affects how the expression is parsed. This is documented, but I just want to get a general picture of how things that are "stripped away" affect the things that remain. It is a bit different between comments and empty quotings.
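
The octal example can be made concrete with a toy tokenizer. This is a sketch under loud assumptions: tokenize, OCTAL and COMMENT are hypothetical names, every backslash followed by one to three octal digits is read as an octal escape (the rules that decide between octal escapes and backreferences are ignored), and nothing else in the pattern language is modelled.

```python
import re

OCTAL = re.compile(r"\\([0-7]{1,3})")     # backslash plus up to three octal digits
COMMENT = re.compile(r"\(\?#[^)]*\)")     # (?#...) comment

def tokenize(pattern):
    """Toy tokenizer: octal escapes, comments, and plain characters only."""
    tokens, i = [], 0
    while i < len(pattern):
        m = OCTAL.match(pattern, i)
        if m:
            tokens.append(("char", chr(int(m.group(1), 8))))
            i = m.end()
            continue
        m = COMMENT.match(pattern, i)
        if m:
            i = m.end()   # no token emitted, but the previous item has ended
            continue
        tokens.append(("char", pattern[i]))
        i += 1
    return tokens

with_comment = r"\22(?#X)2"
print(tokenize(with_comment))                    # [('char', '\x12'), ('char', '2')]
print(tokenize(COMMENT.sub("", with_comment)))   # [('char', '\x92')]
```

Tokenizing with the comment in place gives two characters, while stripping the comment first turns the pattern into \222 and gives one. That difference is the sense in which the comment is "ignored" but still shapes the parse.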

So, even if empty quotings and comments are "ignored", they have a meaning in the syntax and affect the parsed result. If we really stripped them away completely in a first pass, that wouldn't happen. I realize one has to stay compatible with Perl. It seems to me that PCRE (v1 and/or v2) is a de facto standard for most regexes one finds in config files on the internet, such as RFCs, YAML files, YANG files, PHP, etc. In these environments these corner cases shouldn't occur, and I guess people choose some kind of common-denominator fragment of the language, to be on the safe side. But I haven't been able to find a clear specification anywhere of what that is. I have tried many online regex tools, like regex101.com (one of the best?), but none of them can handle Unicode script names such as \p{greek}; it all seems to be approximations.

Maybe I should also consider making my parser work in two passes. Right now I run a second check over the parse tree and manipulate it afterwards, if necessary. It is workable, but sometimes not so nice.

Best regards,
David

David Wahlstedt

Aug 10, 2024, 4:10:09 PM
to PCRE2 discussion list
Comments and empty quotings are not allowed inside script names, or any \p properties, it seems. So also in this case they are not stripped away in advance.

Philip Hazel

Aug 12, 2024, 4:12:27 AM
to David Wahlstedt, PCRE2 discussion list

> Comments and empty quotings are not allowed inside script names, or any \p properties, it seems. So also in this case they are not stripped away in advance.

Comments of the form (?#comment) are really a form of group whose contents are ignored. Therefore, they can appear where groups may appear, and they terminate any preceding item. In an item such as  [(?#abc)]  there is no comment, for example, because inside a class parentheses are not meta characters. Incidentally, it is also the case that comments cannot appear just anywhere in other languages, e.g. C:

zz.c:4:23: error: expected ',' or ';' before numeric constant
    4 | int i = 123/*comment*/4;
      |                       ^

Similarly the other sort of comment, #-to-end-of-line, which needs the /x option to be enabled, is only recognized at points where a new item might start.
I think the general overview is that comments are just "items" within a Perl regex that happen to have no effect on the execution of the pattern, but which cannot appear within other items. So no, they are not stripped away in advance. 
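
The point that a comment is only a comment where an item may start can be seen with the [(?#abc)] example. Python's re module also supports (?#...) comments, and the sketch below uses it as a stand-in on the assumption that it treats these two patterns the way PCRE2 does. The variable names are made up.

```python
import re

# Inside a class the parentheses are ordinary characters, so this is a
# seven-member class rather than a comment.
in_class = re.compile(r"[(?#abc)]")
print([ch for ch in "(?#abc)x" if in_class.fullmatch(ch)])
# ['(', '?', '#', 'a', 'b', 'c', ')']

# Between two items the same text is a comment and contributes nothing.
as_item = re.compile(r"a(?#abc)b")
print(bool(as_item.fullmatch("ab")))   # True
```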

Regards,
Philip

David Wahlstedt

Aug 12, 2024, 2:49:42 PM
to PCRE2 discussion list
Thank you, it gradually becomes clearer!
I should learn some Perl and have an environment with it, to be able to compare, maybe.

Yes, it makes sense that comments can't be literally everywhere.
One difference between groups and comments, though, is that comments are not quantifiable.

Philip Hazel

Aug 12, 2024, 4:13:35 PM
to David Wahlstedt, PCRE2 discussion list
The script called perltest.sh that is part of PCRE2 takes the same input as pcre2test, with some limitations. It is one way of running Perl tests.

Regards,
Philip
