Expression Library

Sueann

Aug 5, 2024, 2:37:29 AM

to grunarcider

Both patterns and strings to be searched can be Unicode strings (str) as well as 8-bit strings (bytes). However, Unicode strings and 8-bit strings cannot be mixed: that is, you cannot match a Unicode string with a bytes pattern or vice-versa; similarly, when asking for a substitution, the replacement string must be of the same type as both the pattern and the search string.
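A minimal sketch of the type rule above. The sample strings and variable names are only illustrative.

```python
import re

text_pattern = re.compile(r"\d+")    # str pattern
bytes_pattern = re.compile(rb"\d+")  # bytes pattern

str_hit = text_pattern.search("order 42")
bytes_hit = bytes_pattern.search(b"order 42")

# Searching bytes with a str pattern raises TypeError.
try:
    text_pattern.search(b"order 42")
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(str_hit.group(), bytes_hit.group(), mixed_ok)
```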

A regular expression (or RE) specifies a set of strings that matches it; the functions in this module let you check if a particular string matches a given regular expression (or if a given regular expression matches a particular string, which comes down to the same thing).

Regular expressions can be concatenated to form new regular expressions; if A and B are both regular expressions, then AB is also a regular expression. In general, if a string p matches A and another string q matches B, the string pq will match AB. This holds unless A or B contain low precedence operations; boundary conditions between A and B; or have numbered group references. Thus, complex expressions can easily be constructed from simpler primitive expressions like the ones described here. For details of the theory and implementation of regular expressions, consult the Friedl book [Frie09], or almost any textbook about compiler construction.

Repetition operators or quantifiers (*, +, ?, {m,n}, etc) cannot be directly nested. This avoids ambiguity with the non-greedy modifier suffix ?, and with other modifiers in other implementations. To apply a second repetition to an inner repetition, parentheses may be used. For example, the expression (?:a{6})* matches any multiple of six 'a' characters.
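As a quick illustration of the nesting rule, the sketch below tries a directly nested quantifier and then the parenthesized form. Variable names are illustrative only.

```python
import re

# A quantifier applied directly to another quantifier is rejected.
try:
    re.compile(r"a{6}*")
    nested_ok = True
except re.error:
    nested_ok = False

# Wrapping the inner repetition in a group makes it legal.
six_multiple = re.compile(r"(?:a{6})*")
full_12 = six_multiple.fullmatch("a" * 12)  # 12 is a multiple of six
full_7 = six_multiple.fullmatch("a" * 7)    # 7 is not

print(nested_ok, bool(full_12), bool(full_7))
```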

'.' (Dot.) In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline. (?s:.) matches any character regardless of flags.
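A small sketch of the three cases described for the dot; the test string is made up for illustration.

```python
import re

sample = "a\nb"

default_dot = re.match(r"a.b", sample)             # '.' stops at the newline
dotall_dot = re.match(r"a.b", sample, re.DOTALL)   # flag lets '.' match '\n'
scoped_dot = re.match(r"a(?s:.)b", sample)         # DOTALL applied only inside the group

print(default_dot, bool(dotall_dot), bool(scoped_dot))
```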

Like the '*', '+', and '?' quantifiers, those where '+' is appended also match as many times as possible. However, unlike the true greedy quantifiers, these do not allow back-tracking when the expression following them fails to match. These are known as possessive quantifiers. For example, a*a will match 'aaaa' because the a* will match all 4 'a's, but, when the final 'a' is encountered, the expression is backtracked so that in the end the a* ends up matching 3 'a's total, and the fourth 'a' is matched by the final 'a'. However, when a*+a is used to match 'aaaa', the a*+ will match all 4 'a's, but when the final 'a' fails to find any more characters to match, the expression cannot be backtracked and will thus fail to match. x*+, x++ and x?+ are equivalent to (?>x*), (?>x+) and (?>x?) correspondingly.
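The a*a versus a*+a example can be tried as below. Possessive quantifiers were added in Python 3.11, so this sketch guards the call with a version check; on older interpreters the possessive pattern is simply not compiled.

```python
import re
import sys

# Greedy: a* takes all four 'a's, then gives one back for the final 'a'.
greedy = re.match(r"a*a", "aaaa")

# Possessive: a*+ takes all four 'a's and never gives any back.
if sys.version_info >= (3, 11):
    possessive = re.match(r"a*+a", "aaaa")
else:
    possessive = None  # syntax not available before 3.11

print(greedy.group(), possessive)
```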

{m,n} causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as many repetitions as possible. For example, a{3,5} will match from 3 to 5 'a' characters. Omitting m specifies a lower bound of zero, and omitting n specifies an infinite upper bound. As an example, a{4,}b will match 'aaaab' or a thousand 'a' characters followed by a 'b', but not 'aaab'. The comma may not be omitted or the modifier would be confused with the {m} form, which matches exactly m repetitions.

{m,n}? causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as few repetitions as possible. This is the non-greedy version of the previous quantifier. For example, on the 6-character string 'aaaaaa', a{3,5} will match 5 'a' characters, while a{3,5}? will only match 3 characters.
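The greedy and non-greedy bounded quantifiers can be compared side by side. This is only a sketch using the strings from the descriptions above.

```python
import re

greedy = re.match(r"a{3,5}", "aaaaaa").group()   # takes as many as allowed
lazy = re.match(r"a{3,5}?", "aaaaaa").group()    # takes as few as allowed

# Open upper bound: at least four 'a's, then 'b'.
open_ended = re.fullmatch(r"a{4,}b", "a" * 1000 + "b")
too_short = re.fullmatch(r"a{4,}b", "aaab")

print(greedy, lazy, bool(open_ended), too_short)
```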

{m,n}+ causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as many repetitions as possible without establishing any backtracking points. This is the possessive version of the quantifier above. For example, on the 6-character string 'aaaaaa', a{3,5}+aa attempts to match 5 'a' characters, then, requiring 2 more 'a's, will need more characters than available and thus fail, while a{3,5}aa will match with a{3,5} capturing 5, then 4 'a's by backtracking and then the final 2 'a's are matched by the final aa in the pattern. x{m,n}+ is equivalent to (?>x{m,n}).

To match a literal ']' inside a set, precede it with a backslash, or place it at the beginning of the set. For example, both [()[\]{}] and []()[{}] will match a right bracket, as well as left bracket, braces, and parentheses.
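Both spellings of the set can be checked against the same input. The sample string here is invented for the illustration.

```python
import re

escaped = re.compile(r"[()[\]{}]")  # ']' escaped with a backslash
leading = re.compile(r"[]()[{}]")   # ']' placed first in the set

sample = "f(x)[0]{y}"
escaped_hits = escaped.findall(sample)
leading_hits = leading.findall(sample)

print(escaped_hits, leading_hits)
```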

A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B. An arbitrary number of REs can be separated by the '|' in this way. This can be used inside groups (see below) as well. As the target string is scanned, REs separated by '|' are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once A matches, B will not be tested further, even if it would produce a longer overall match. In other words, the '|' operator is never greedy. To match a literal '|', use \|, or enclose it inside a character class, as in [|].
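A short sketch of the left-to-right rule for alternation, plus the two ways to match a literal '|'. The inputs are illustrative.

```python
import re

# The first branch that matches wins, even if a later one is longer.
first_wins = re.match(r"a|ab", "ab").group()
reordered = re.match(r"ab|a", "ab").group()

# Two ways to match a literal '|'.
escaped_pipe = re.findall(r"\|", "x|y|z")
class_pipe = re.findall(r"[|]", "x|y|z")

print(first_wins, reordered, escaped_pipe, class_pipe)
```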

(...) matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; the contents of a group can be retrieved after a match has been performed, and can be matched later in the string with the \number special sequence, described below. To match the literals '(' or ')', use \( or \), or enclose them inside a character class: [(], [)].

(?...) is an extension notation (a '?' following a '(' is not meaningful otherwise). The first character after the '?' determines what the meaning and further syntax of the construct is. Extensions usually do not create a new group; (?P<name>...) is the only exception to this rule. Following are the currently supported extensions.

Inline flags, written in the form (?aiLmsux), set the corresponding flags for the entire regular expression. (The flags are described in Module Contents.) This is useful if you wish to include the flags as part of the regular expression, instead of passing a flag argument to the re.compile() function. Flags should be used first in the expression string.
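For comparison, here is the same case-insensitive search written with an inline flag and with a flag argument. The text being searched is just an example.

```python
import re

# Flag embedded at the start of the pattern string.
inline = re.search(r"(?i)hello", "Say HELLO")

# Same flag passed as an argument to re.compile().
with_arg = re.compile(r"hello", re.IGNORECASE).search("Say HELLO")

print(inline.group(), with_arg.group())
```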

(?:...) is a non-capturing version of regular parentheses. It matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.
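The difference shows up in what groups() returns. A minimal sketch with an invented input:

```python
import re

capturing = re.match(r"(ab)+(c)", "ababc")
non_capturing = re.match(r"(?:ab)+(c)", "ababc")

# The capturing form reports two groups; the non-capturing form only one.
print(capturing.groups(), non_capturing.groups())
```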

(?>...) attempts to match ... as if it was a separate regular expression, and if successful, continues to match the rest of the pattern following it. If the subsequent pattern fails to match, the stack can only be unwound to a point before the (?>...) because once exited, the expression, known as an atomic group, has thrown away all stack points within itself. Thus, (?>.*). would never match anything because first the .* would match all characters possible, then, having nothing left to match, the final . would fail to match. Since there are no stack points saved in the atomic group, and there is no stack point before it, the entire expression would thus fail to match.
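The (?>.*). case can be reproduced as below. Atomic groups were added in Python 3.11, so the sketch checks the interpreter version first; the input string is arbitrary.

```python
import re
import sys

# Ordinary group: .* gives back one character so the final '.' can match.
plain = re.search(r"(.*).", "abc")

# Atomic group: .* keeps everything, so the final '.' has nothing to match.
if sys.version_info >= (3, 11):
    atomic = re.search(r"(?>.*).", "abc")
else:
    atomic = None  # syntax not available before 3.11

print(plain.group(), atomic)
```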

(?P<name>...) is similar to regular parentheses, but the substring matched by the group is accessible via the symbolic group name name. Group names must be valid Python identifiers, and in bytes patterns they can only contain bytes in the ASCII range. Each group name must be defined only once within a regular expression. A symbolic group is also a numbered group, just as if the group were not named.
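A sketch of named groups, showing access by name and by number, and what happens when a name is defined twice. The group names year and month are made up for the example.

```python
import re

m = re.match(r"(?P<year>\d{4})-(?P<month>\d{2})", "2024-08")

by_name = m.group("year")
by_number = m.group(1)      # a named group is also a numbered group
as_dict = m.groupdict()

# Defining the same name twice is an error.
try:
    re.compile(r"(?P<x>a)(?P<x>b)")
    duplicate_ok = True
except re.error:
    duplicate_ok = False

print(by_name, by_number, as_dict, duplicate_ok)
```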

(?<!...) matches if the current position in the string is not preceded by a match for .... This is called a negative lookbehind assertion. Similar to positive lookbehind assertions, the contained pattern must only match strings of some fixed length. Patterns which start with negative lookbehind assertions may match at the beginning of the string being searched.
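An illustrative use of negative lookbehind: pick out numbers that are not dollar amounts. The sample text is invented, and the \b keeps the match from starting in the middle of a number.

```python
import re

# Whole numbers not immediately preceded by '$'.
not_prices = re.findall(r"(?<!\$)\b\d+", "cost $30, qty 12")

# The assertion also succeeds at the very start of the string.
at_start = re.match(r"(?<!-)\d+", "42")

# The lookbehind body must have a fixed width.
try:
    re.compile(r"(?<!a+)b")
    variable_ok = True
except re.error:
    variable_ok = False

print(not_prices, at_start.group(), variable_ok)
```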

\number matches the contents of the group of the same number. Groups are numbered starting from 1. For example, (.+) \1 matches 'the the' or '55 55', but not 'thethe' (note the space after the group). This special sequence can only be used to match one of the first 99 groups. If the first digit of number is 0, or number is 3 octal digits long, it will not be interpreted as a group match, but as the character with octal value number. Inside the '[' and ']' of a character class, all numeric escapes are treated as characters.
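The backreference example, plus one three-digit octal escape, can be sketched like this:

```python
import re

repeat = re.compile(r"(.+) \1")

hit_words = repeat.search("the the")
hit_digits = repeat.search("55 55")
miss = repeat.search("thethe")      # no space between the copies

# Three octal digits are read as a character: octal 101 is 'A'.
octal = re.match(r"\101", "A")

print(hit_words.group(), hit_digits.group(), miss, octal.group())
```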

\b matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of word characters. Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning or end of the string. This means that r'\bat\b' matches 'at', 'at.', '(at)', and 'as at ay' but not 'attempt' or 'atlas'.

The default word characters in Unicode (str) patterns are Unicode alphanumerics and the underscore, but this can be changed by using the ASCII flag. Word boundaries are determined by the current locale if the LOCALE flag is used.
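The word-boundary examples can be checked directly. The second half of this sketch uses an accented letter (chosen here purely for illustration) to show how the ASCII flag changes what counts as a word character.

```python
import re

word_at = re.compile(r"\bat\b")
candidates = ["at", "at.", "(at)", "as at ay", "attempt", "atlas"]
boundary_hits = [s for s in candidates if word_at.search(s)]

# U+00E9 is a word character by default, but not under re.ASCII.
cafe = "caf\u00e9"
unicode_hit = re.search(r"caf\b", cafe)
ascii_hit = re.search(r"caf\b", cafe, re.ASCII)

print(boundary_hits, unicode_hit, bool(ascii_hit))
```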

\B matches the empty string, but only when it is not at the beginning or end of a word. This means that r'at\B' matches 'athens', 'atom', 'attorney', but not 'at', 'at.', or 'at!'. \B is the opposite of \b, so word characters in Unicode (str) patterns are Unicode alphanumerics or the underscore, although this can be changed by using the ASCII flag. Word boundaries are determined by the current locale if the LOCALE flag is used.
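The same kind of check works for \B, using the words listed above:

```python
import re

inside_word = re.compile(r"at\B")
candidates = ["athens", "atom", "attorney", "at", "at.", "at!"]

# Keep only the strings where 'at' is followed by another word character.
non_boundary_hits = [s for s in candidates if inside_word.search(s)]

print(non_boundary_hits)
```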

For 8-bit (bytes) patterns, \w matches characters considered alphanumeric in the ASCII character set; this is equivalent to [a-zA-Z0-9_]. If the LOCALE flag is used, it matches characters considered alphanumeric in the current locale and the underscore.
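A sketch of ASCII-only \w matching, first on a bytes pattern and then on a str pattern with the ASCII flag. The sample data (a UTF-8 encoded accented letter) is an assumption chosen to make the difference visible.

```python
import re

# In a bytes pattern, the two UTF-8 bytes of U+00E9 are not word characters.
bytes_words = re.findall(rb"\w+", b"caf\xc3\xa9 au_lait 42")

# In a str pattern, \w is Unicode-aware unless re.ASCII is given.
unicode_words = re.findall(r"\w+", "caf\u00e9")
ascii_words = re.findall(r"\w+", "caf\u00e9", re.ASCII)

print(bytes_words, unicode_words, ascii_words)
```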

Octal escapes are included in a limited form. If the first digit is a 0, or if there are three octal digits, it is considered an octal escape. Otherwise, it is a group reference. As for string literals, octal escapes are always at most three digits in length.

The module defines several functions, constants, and an exception. Some of the functions are simplified versions of the full featured methods for compiled regular expressions. Most non-trivial applications always use the compiled form.
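To make the two styles concrete, this sketch runs the same search through a compiled pattern object and through the module-level function. The pattern and input are placeholders.

```python
import re

sample = "a1 b22 c333"

# Compiled form: build the pattern once, reuse its methods.
digits = re.compile(r"\d+")
compiled_result = digits.findall(sample)

# Module-level shortcut: pattern passed on every call.
module_result = re.findall(r"\d+", sample)

print(compiled_result, module_result)
```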

When the MULTILINE flag is specified, the pattern character '^' matches at the beginning of the string and at the beginning of each line (immediately following each newline); and the pattern character '$' matches at the end of the string and at the end of each line (immediately preceding each newline). By default, '^' matches only at the beginning of the string, and '$' only at the end of the string and immediately before the newline (if any) at the end of the string.
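The effect of the flag on '^' and '$' is easiest to see on a small multi-line string; the text below is made up for the example.

```python
import re

text = "first\nsecond\nthird\n"

# Default: '^' only at the start, '$' only at the end (or before a final newline).
default_lines = re.findall(r"^\w+$", text)
default_tail = re.findall(r"\w+$", text)

# MULTILINE: '^' and '$' also match around every newline.
multiline_lines = re.findall(r"^\w+$", text, re.MULTILINE)

print(default_lines, default_tail, multiline_lines)
```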