Octal or backreference, scope rules?

32 views
Skip to first unread message

David Wahlstedt

unread,
Sep 5, 2024, 5:44:18 PM9/5/24
to PCRE2 discussion list
Hello,

I wonder about some situations with the "octal or backreference" construct (sorry if I am vague in some respects):
  1. What does it mean when a backreference is inside the group it refers to?
    For instance `(a\1)`, should never match anything, since the reference is not a recursive call, and even if it would, would lead to an infinite expansion.
    So, can backreferences occurring within the group they refer to always be safely ignored, or are there cases when it makes sense?
  2. Backreferences occurring before the group they "refer" to, can it always be inored?
    For instance in `\1(a)`, one can safely ignore `\1` i.e., it matches the empty string?
  3. What about a backreference in an alternative that it cannot refer to?
    for instance `(a)|(b\1)`, the `\1` refers to `(a)`, but it will never match, right? But the reference is still valid? What does this mean, and when does it make sense?
  4. If an octal esape has a prefix that is a valid backreference, is it always counted as the octal escape? For instance, if we have eleven capture groups followed by \111, it is counted as octal, even if group no eleven is present. (I am quite sure)
    I tried with 999 empty capture groups followed by (y)\10001, and it didn't match `yy1`. This indicates that even if octal escapes may be at most length three, longer octal digit sequences "eat" the characters/obscure the reference, right?
The reason I'm asking is that I am working on a Haskell parser for PCRE2, that is now to be adapted to parsing dependent on what groups are in scope: The parsing of a \ddd (where d are octal digits) sequence depends on that, so the number of digits to consume depends on where the sequence occurrs, and I want to get it right. It seems that groups are "in scope" even if they can never be referred to, as above.

Best regards,

 David

Philip Hazel

unread,
Sep 6, 2024, 4:22:10 AM9/6/24
to David Wahlstedt, PCRE2 discussion list
Coincidentally I was discussing this with some other people only a couple of days ago. It's a mess. I understand that different engines resolve the ambiguity in different ways. PCRE2 copies Perl, and it's documented. This is from the pcre2pattern man page:

      The handling of a backslash followed by a digit other than 0 is  compli-
       cated, and Perl has changed over time, causing PCRE2 also to change.

       Outside  a character class, PCRE2 reads the digit and any following dig-
       its as a decimal number. If the number is less than 10, begins with  the
       digit 8 or 9, or if there are at least that many previous capture groups
       in  the  expression,  the entire sequence is taken as a backreference. A
       description of how this works is given later, following  the  discussion
       of  parenthesized  groups.  Otherwise, up to three octal digits are read
       to form a character code.

       Inside a character class, PCRE2 handles \8 and \9 as the literal charac-
       ters "8" and "9", and otherwise reads up to three octal digits following
       the backslash, using them to generate a data character.  Any  subsequent
       digits stand for themselves. For example, outside a character class:

         \040   is another way of writing an ASCII space
         \40    is the same, provided there are fewer than 40
                   previous capture groups
         \7     is always a backreference
         \11    might be a backreference, or another way of
                   writing a tab
         \011   is always a tab
         \0113  is a tab followed by the character "3"
         \113   might be a backreference, otherwise the
                   character with octal code 113
         \377   might be a backreference, otherwise
                   the value 255 (decimal)
         \81    is always a backreference

       Note  that  octal values of 100 or greater that are specified using this
       syntax must not be introduced by a leading zero, because  no  more  than
       three octal digits are ever read.
       
A pattern such as (a\1) can indeed never match, but that is irrelevant to its interpretation. (There are plenty of ways to write patterns that can never match.) In that pattern \1 is a back reference. 

For your comments 1-3, remember that these things can be inside repetitions. Other than the first time round the loop, a self-reference or forward reference may make sense. Example:  /(\d|\+\1){2}/  (a made-up example) which matches any two digits or a digit followed by + and the same digit. Your item 4 is covered by the rules I quoted above. You can check these things by running pcre2test with the -d option:

  re> /(y)\10001/
------------------------------------------------------------------
  0  19 Bra
  3   7 CBra 1
  8     y
 10   7 Ket
 13     @01
 19  19 Ket
 22     End
------------------------------------------------------------------


This shows that it has interpreted \100 as octal (the character @) followed by a literal 01.



--
You received this message because you are subscribed to the Google Groups "PCRE2 discussion list" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pcre2-dev+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/pcre2-dev/39e692c5-0d14-4975-850e-a388292233ffn%40googlegroups.com.

Zoltan Herczeg

unread,
Sep 6, 2024, 9:16:11 AM9/6/24
to David Wahlstedt, PCRE2 discussion list
PCRE2 is more of a computing language than a "word set descriptor", which was the original concept of regular expressions. In a computing language, you can give any kind of commands, and they can make sense in their current context.

1) /((?(1)\1|.))+/ - not the most efficient way of matching a repeat of the same character, but a valid way
2) No. Similar to above, you put a repeat around the pattern, and now the capturing bracket can have a value.
3) Similar to above.
4) I don't remember.

Regards,
Zoltan

--

David Wahlstedt

unread,
Sep 7, 2024, 3:57:31 PM9/7/24
to PCRE2 discussion list
Thank you!
I think I got the most of it.

It seems a this is a kind of state rather than nested environments: once a group has been seen, it remains in scope thoughout the rest of the expression, no matter how deep it appeared. Or is it possible to remove a group from scope/being valid, by adding sommething that says: "forget this group" ? Maybe the "reset counter" (?|...) can have such effect? I rather believe it shadows old groups, but not invalidating them.

Best,
David

Zoltan Herczeg

unread,
Sep 7, 2024, 9:19:48 PM9/7/24
to David Wahlstedt, PCRE2 discussion list
> Or is it possible to remove a group from scope/being valid

Of course :) For example, recursion restores the value of capturing brackets after a match:

/(?1)(?(DEFINE)(...))\1/

Although the capture bracket is set in the recursion, the backreference does not access it. You can make more interesting patterns, if a capturing bracket is set both inside and outside of recursions.

Regards,
Zoltan


Philip Hazel

unread,
Sep 8, 2024, 3:28:25 AM9/8/24
to David Wahlstedt, PCRE2 discussion list
I guess it depends on what you mean by "scope". It isn't really like a programming language. The set of groups defined in a pattern can always be referenced from anywhere in the pattern, but their captured contents change as matching proceeds. Initially they are all unset, so a reference such as \1 causes a match failure at that point, causing matching to backtrack to a previous backtrack point. If there isn't one, the whole match fails.

Regards,
Philip


David Wahlstedt

unread,
Sep 8, 2024, 5:42:15 AM9/8/24
to PCRE2 discussion list
Thanks, and sorry for me not being clear!
I realize "being in scope" in this respect was a bit misleading. What I'm after is that the secuences of \ followed by digits outside of a character class are parsed as references and not octal escapes. If they capture anything doens't matter in this sense. I think, if I treat all such sequences as references whenever the decimal number they encode is lower or equal to the total number of capture groups of the given regex, it should be a correct syntactical interpretation (not regarding named groups)?

BR,
David

Philip Hazel

unread,
Sep 8, 2024, 10:47:56 AM9/8/24
to David Wahlstedt, PCRE2 discussion list
That is true, but also \1 to \9 are always treated as references - though if the pattern doesn't have that many capturing groups, PCRE2 throws an error:

PCRE2 version 10.44 2024-06-07 (8-bit)
  re> /\4/
Failed: error 115 at offset 1: reference to non-existent subpattern


Also if the number begins with 8 or 9.

Regards,
Philip


David Wahlstedt

unread,
Sep 12, 2024, 11:39:43 AM9/12/24
to PCRE2 discussion list
The follwing matches 'aa':
```
()()()()()()()()()()()()()()()()()()()()()()()()()()(?:()|(a))\28
```
but this one matches 'a 8' ('a', character with octal code 02, '8'):
```
()()()()()()()()()()()()()()()()()()()()()()()()()()(?|()|(a))\28
```
It means that parsing of the \dd sequence depends on where the reference occurs in relation to the group it refers to, including some constructions like `(?|`.

Does recursion have a similar effect? Do you have an example where an "octal or backref" with an octal prefix like this one, is interpreted as an escaped character or reference dependent of where it appears in the recursion?

Best regards,
David

David Wahlstedt

unread,
Sep 12, 2024, 11:58:02 AM9/12/24
to PCRE2 discussion list
Here is a similar example that doesn't behave as I expected:

```
()()()()()()()()()()()()()()()()()()()()()()()()()()(?|()|(a))\27
```

should match aa, but it doesn't, neither in v 10.39 nor 10.45:

```
PCRE2 version 10.39 2021-10-29
/()()()()()()()()()()()()()()()()()()()()()()()()()()(?|()|(a))\27/debug
------------------------------------------------------------------
  0 241 Bra
  3   5 CBra 1
  8   5 Ket
[ ... ] 
232   7 Ket
235  24 Ket
238     \27
241 241 Ket
244     End
------------------------------------------------------------------
Capture group count = 27
Max back reference = 27
May match empty string
Subject length lower bound = 0
aa
Matched, but too many substrings
 0:
 1:
[ ... ]
14: 
```

I tried the same in regex101.com set to PCRE2 mode, and there it matches. (maybe they are wrong?)

BR,
David

Zoltan Herczeg

unread,
Sep 12, 2024, 12:19:14 PM9/12/24
to David Wahlstedt, PCRE2 discussion list
Hi,

I am happy you are using pcre2test, that explains a lot.

The first \28 pattern has 28 capturing brackets. Backslash number
parsing is complicated, with a lot of legacy cases. It is likely
decoded as a backreference.
In the (?| pattern, there are only 27 capturing brackets. Here \28 is
likely decoded as a character.
The \27 example matches as you expected, just the test program does
not print all capturing brackets, as the comment states.
The recursion has no backslash form, you simply get an error if the
index is out of range.

Regards,
Zoltan

On Thu, Sep 12, 2024 at 5:58 PM David Wahlstedt
> To view this discussion on the web visit https://groups.google.com/d/msgid/pcre2-dev/aad0bb1f-2c4a-4817-9389-bb451c77dbb4n%40googlegroups.com.

Philip Hazel

unread,
Sep 12, 2024, 12:29:34 PM9/12/24
to Zoltan Herczeg, David Wahlstedt, PCRE2 discussion list
Zoltan is quite correct. When there are 28 captures \28 will be a back reference; otherwise it will be taken as the octal number 2 followed by the character '8'.

Regards,
Philip


David Wahlstedt

unread,
Sep 13, 2024, 6:22:40 AM9/13/24
to PCRE2 discussion list
I think I undesrstand why
()()()()()()()()()()()()()()()()()()()()()()()()()()(?|()|(a))\27
doesn't match 'aa'.
Inside the (?| group, the empty string in the () always succeeds before the (a), and then \27 refers to an empty string and not an a.
When I tried
()()()()()()()()()()()()()()()()()()()()()()()()()()(?|(b)|(a))\27
It behaved as I expected, and it matches aa or bb.

This means that regex101.com handles the ()()()()()()()()()()()()()()()()()()()()()()()()()()(?|()|(a))\27 incorrectly, as it matches aa.

Am I right?

Best regards,

David

Philip Hazel

unread,
Sep 13, 2024, 11:43:06 AM9/13/24
to David Wahlstedt, PCRE2 discussion list

This means that regex101.com handles the ()()()()()()()()()()()()()()()()()()()()()()()()()()(?|()|(a))\27 incorrectly, as it matches aa.

It is wrong if it actually matches the characters "aa". But it is right if it says "matched" because it can have matched an empty string. That is what PCRE2 (and Perl) does. If you run this under pcre2test you will see that the matched string is actually just the empty string.

Regards,
Philip


Reply all
Reply to author
Forward
0 new messages