Matching of optional parts in regular expressions


Markus Elfring

Jul 1, 2004, 5:30:06 AM
I wonder why some variables do not get the specified values. They
remain empty. I do not see any mistake in the patterns. I am trying to
process parts of XML files.
Do you have any suggestions for correcting this?

% info tclversion
8.3
% regexp {<name>(.+)</name>(?:.*?<namespace>(.*?)</namespace>)?.*?<value>(.*?)</value>(?:.*?<quote>["']</quote>)?\s*$} \
{
<name>Wetter</name>
<namespace>dwd</namespace>
<value>bewölkt &amp; sonnig</value>
<quote>"</quote>
} z name namespace value quote
1
% foreach X {name namespace value quote} {puts "$X=|[set $X]|"}
name=|Wetter|
namespace=|dwd|
value=|bewölkt &amp; sonnig|
quote=||
% regexp {<name>(.+)</name>(?:.*?<scope>(SYSTEM|PUBLIC)</scope>.*?<S_URI>(.+)</S_URI>.*?(?:<P_URI>(.+)</P_URI>)?)?(?:.*?<definition>(.*?)</definition>)?(?:.*?<attributes>(.*?)</attributes>)?.*?<content>(.*)</content>\s*$} \
{
<name>gruss</name>
<scope>SYSTEM</scope>
<S_URI>http://XXX/Hallo.dtd</S_URI>
<P_URI>http://YYY/Leute.dtd</P_URI>
<definition><!ELEMENT gruss (#PCDATA)></definition>
<attributes>Versuch="1"</attributes>
<content><h1>Guten Tag!</h1></content>
} z name scope system public definition attributes content
1
% foreach X {name scope system public definition attributes content} {puts "$X=|[set $X]|"}
name=|gruss|
scope=|SYSTEM|
system=|http://XXX/Hallo.dtd|
public=||
definition=||
attributes=||
content=|<h1>Guten Tag!</h1>|

Markus Elfring

Jul 8, 2004, 10:02:47 AM
Does this example show an error in the implementation?

% regexp {\s*(\d+)(?:%(\w+))?} { !123!} z a b
1
% foreach X {a b} {puts "$X=|[set $X]|"}
a=|123|
b=||
% regexp {\s*(\d+)(?:%(\w+))?$} { !123!} z a b
0
% regexp {\s*(\d+)(?:%(\w+))?} { 456%} z a b
1
% foreach X {a b} {puts "$X=|[set $X]|"}
a=|456|
b=||

I think the specified strings should not match the pattern.
What do you think?

Jonathan Bromley

Jul 8, 2004, 12:06:27 PM
On 8 Jul 2004 07:02:47 -0700, Markus....@web.de (Markus Elfring)
wrote:

I don't see any problem. This part of your expression:
(?:%(\w+))?
can match an empty string, which explains what's happening
in your first and third examples. In your second example,
the final $ in your regexp anchors the end of the expression
to the end of your input string, so it doesn't match.
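
Capturing the whole match makes this easier to see. The following is a minimal sketch that reuses the pattern and inputs quoted in this thread; the extra input { 456%abc} and the variable name "whole" are added here for illustration only.

```tcl
# Show what the unanchored pattern really matches.
# The optional group (?:%(\w+))? may match nothing, so digits alone suffice.
set re {\s*(\d+)(?:%(\w+))?}
foreach s [list { !123!} { 456%} { 456%abc}] {
    set found [regexp $re $s whole a b]
    puts "input=|$s| found=$found whole=|$whole| a=|$a| b=|$b|"
}

# Appending the $ anchor rejects input with trailing text after the digits.
puts [regexp "${re}\$" { !123!}]
```

For { 456%} the whole match is " 456": the percent sign is simply left outside the match, which is why b stays empty.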

In your previous post, the XML examples were rather more
complicated. Most of the problems appeared to be related
to misunderstanding of the way lazy/greedy matching works.
Again, I couldn't see any issues. However, you may
find it's helpful to use a visualiser such as TREV
(which has just recently been updated to handle non-capturing
subexpressions correctly):

www.doulos.com/knowhow/tcltk/examples/trev

--
Jonathan Bromley, Consultant

DOULOS - Developing Design Know-how
VHDL, Verilog, SystemC, Perl, Tcl/Tk, Verification, Project Services

Doulos Ltd. Church Hatch, 22 Market Place, Ringwood, BH24 1AW, UK
Tel: +44 (0)1425 471223 mail:jonathan...@doulos.com
Fax: +44 (0)1425 471573 Web: http://www.doulos.com

The contents of this message may contain personal views which
are not the views of Doulos Ltd., unless specifically stated.

Ulrich Schöbel

Jul 8, 2004, 12:03:47 PM
In article <40ed1d8f.0407...@posting.google.com>,

Markus....@web.de (Markus Elfring) writes:
> Does this example show an error in the implementation?

No. All three examples work absolutely correctly.

>
> % regexp {\s*(\d+)(?:%(\w+))?} { !123!} z a b
> 1
> % foreach X {a b} {puts "$X=|[set $X]|"}
> a=|123|
> b=||

Regex is not anchored. It matches no whitespace (\s*),
three digits (\d+) and no word introduced by a % sign.

> % regexp {\s*(\d+)(?:%(\w+))?$} { !123!} z a b
> 0

Same as before, but anchored at the end. The final ! is
not matched, so the entire regex doesn't match.

> % regexp {\s*(\d+)(?:%(\w+))?} { 456%} z a b
> 1
> % foreach X {a b} {puts "$X=|[set $X]|"}
> a=|456|
> b=||

Same as first example.

>
> I think the specified strings should not match the pattern.
> What do you think?

Everything is working ok.

Best regards

Ulrich


--
For those of you who don't get this e-mail, let me know and I'll re-send it.

Bruce Hartweg

Jul 8, 2004, 1:05:47 PM

Markus Elfring wrote:

> I wonder why some variables do not get the specified values. They
> remain empty. I do not see any mistake in the patterns. I am trying to
> process parts of XML files.
> Do you have any suggestions for correcting this?
>
> % info tclversion
> 8.3
> % regexp {<name>(.+)</name>(?:.*?<namespace>(.*?)</namespace>)?.*?<value>(.*?)</value>(?:.*?<quote>["']</quote>)?\s*$} \
> {
> <name>Wetter</name>
> <namespace>dwd</namespace>
> <value>bewölkt &amp; sonnig</value>
> <quote>"</quote>
> } z name namespace value quote
> 1
> % foreach X {name namespace value quote} {puts "$X=|[set $X]|"}
> name=|Wetter|
> namespace=|dwd|
> value=|bewölkt &amp; sonnig|
> quote=||

In this one, you only have 3 sets of capturing parens, so your fourth variable
will never receive a value (I assume from the name that you want to add () around the ["']).
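
The effect can be reproduced on a cut-down pattern. This is only a sketch: the sample string and variable names are made up for illustration, and [^<]* stands in for the non-greedy .*? of the original so that every quantifier is greedy.

```tcl
set text {<value>sunny</value> <quote>"</quote>}

# Only one capturing group: the second variable has no group to fill,
# so regexp sets it to the empty string.
regexp {<value>([^<]*)</value>\s*(?:<quote>["']</quote>)?} $text whole value quote
puts "value=|$value| quote=|$quote|"

# Parentheses around the bracket expression add the missing group.
regexp {<value>([^<]*)</value>\s*(?:<quote>(["'])</quote>)?} $text whole value quote
puts "value=|$value| quote=|$quote|"
```

The (?: ... ) group only groups; it never counts as a capture, so it is the inner (["']) that feeds the variable.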

> % regexp {<name>(.+)</name>(?:.*?<scope>(SYSTEM|PUBLIC)</scope>.*?<S_URI>(.+)</S_URI>.*?(?:<P_URI>(.+)</P_URI>)?)?(?:.*?<definition>(.*?)</definition>)?(?:.*?<attributes>(.*?)</attributes>)?.*?<content>(.*)</content>\s*$} \
> {
> <name>gruss</name>
> <scope>SYSTEM</scope>
> <S_URI>http://XXX/Hallo.dtd</S_URI>
> <P_URI>http://YYY/Leute.dtd</P_URI>
> <definition><!ELEMENT gruss (#PCDATA)></definition>
> <attributes>Versuch="1"</attributes>
> <content><h1>Guten Tag!</h1></content>
> } z name scope system public definition attributes content
> 1
> % foreach X {name scope system public definition attributes content} {puts "$X=|[set $X]|"}
> name=|gruss|
> scope=|SYSTEM|
> system=|http://XXX/Hallo.dtd|
> public=||
> definition=||
> attributes=||
> content=|<h1>Guten Tag!</h1>|

This one is more a problem of mixing greedy and non-greedy quantifiers. You can add certain
constructs to force greedy or non-greedy matching at every step, but even that is tricky and error
prone (see the MATCHING section of the re_syntax man page for the gory details). If you are going
to stick with regexp, I would recommend NOT using any non-greedy quantifiers at all: specify
everything as greedy, but make use of negated bracket expressions ([^...]) and negative lookahead
constraints to keep the wildcards from gobbling up too much.
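
As a rough sketch of the all-greedy style, here is a cut-down version of the second example. The sample text is shortened and its definition body is invented plain text, because [^<]* only suits fields that contain no markup; lookahead constraints are not shown.

```tcl
set text {
<name>gruss</name>
<definition>some text</definition>
<content><h1>Guten Tag!</h1></content>
}

# Every quantifier is greedy. [^<]* cannot run past the next tag,
# so the optional definition block is found without any .*? in front of it.
set re {<name>([^<]*)</name>\s*(?:<definition>([^<]*)</definition>\s*)?<content>(.*)</content>\s*$}

if {[regexp $re $text -> name definition content]} {
    puts "name=|$name| definition=|$definition| content=|$content|"
}
```

Because nothing in front of the optional group can absorb the <definition> element, the group has to match whenever the element is present.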

That being said, if you are parsing XML, you would be better off using an XML parser
rather than REs. Check out <http://wiki.tcl.tk/XML> for info, examples, pointers,
etc. to the various parsers in existence.

Bruce


Markus Elfring

Jul 9, 2004, 5:30:00 AM
...
> In this one, you only have 3 sets of capturing parens, so your fourth variable
> will never receive a value (I assume from the name that you want to add () around the ["']).

You are right. I overlooked the missing parentheses in this case.
Thanks a lot.


...


> That being said, if you are parsing XML, you would be better off using an XML parser
> rather than REs. Check out <http://wiki.tcl.tk/XML> for info, examples, pointers,
> etc. to the various parsers in existence.

I am trying to avoid those libraries in this case because I
only need to pass a few parameters in an XML style to my functions.
They do not need to receive complete XML documents.

Markus Elfring

Jul 9, 2004, 6:30:46 AM
> > % regexp {\s*(\d+)(?:%(\w+))?} { !123!} z a b
> > 1
>
> Regex is not anchored. It matches no whitespace (\s*),
> three digits (\d+) and no word introduced by a % sign.

The command says "1 => match found".
I expect a match only if the string contains a number with optional
leading whitespace and a word follows a percent character. Should
this be expressed by the following pattern?

% regexp {^\s*(\d+)(?:%(\w+))?} { !123!} z a b
0


> > % regexp {\s*(\d+)(?:%(\w+))?} { 456%} z a b
> > 1
>

> Same as first example.

No. I expect that the second subexpression must not match here: an
empty string is not allowed, because at least one character must appear
after the delimiter "%".

Jonathan Bromley

Jul 9, 2004, 6:51:22 AM
On 9 Jul 2004 03:30:46 -0700, Markus....@web.de (Markus Elfring)
wrote:

>> > % regexp {\s*(\d+)(?:%(\w+))?} { !123!} z a b
[...]


>I expect a match only if the string contains a number with optional
>leading whitespace and a word follows a percent character.

So, surely you do NOT want the subexpression %(\w+) to be optional?
But you put an "optional" quantifier after it. Hence, your whole
expression happily matches "123". Note that the capturing
subexpression (\w+) takes no part in the match, so variable "b" is
set to the empty string.

On the other hand...

% regexp {\s*(\d+)(?:%(\w+))} { !123!} z a b
0

does not match, because I've removed the ? quantifier.

Finally, here's what I think you probably wanted:

% regexp {\s*(\d+)(?:%(\w+))} { !123%abc} z a b
1
% puts "!$a!$b!"
!123!abc!

matches as you might expect.

Where's the problem?

Markus Elfring

Jul 9, 2004, 10:22:00 AM
> So, surely you do NOT want the subexpression %(\w+) to be optional?

No.

% regexp {^(\d+)(?:!(\w+))?$} {123!xyz} z a b
1

The pattern can mean a mapping between a key and a value. The text may
contain just a number without a delimiter if the value is empty or
NULL.

Jonathan Bromley

Jul 9, 2004, 11:44:41 AM
On 9 Jul 2004 07:22:00 -0700, Markus....@web.de (Markus Elfring)
wrote:

>> So, surely you do NOT want the subexpression %(\w+) to be optional?

Sorry Markus, I don't quite understand what you are trying to do.

First there's the problem of delimiter - in some of your posts it
seems you're using {%}, in other places {!}. In the example below
I've chosen to use %, because it's easier to see in a small font!

Second, I'm not at all clear how the sample string is "anchored" -
in other words, is there perhaps any more text after the end of the
string? If not, we can use $ to anchor the end of the regexp. But
if there may be more text, it's necessary to use lookaheads or some
other technique.

# Match any string of digits followed by nothing
% regexp -inline {\s*(\d+)(?:%(\w+))?$} {123}
123 123 {}

# Match any string of digits, followed by %, followed by text
% regexp -inline {\s*(\d+)(?:%(\w+))?$} {123%abc}
123%abc 123 abc

# DO NOT match a string of digits, followed by % and NO text
% regexp -inline {\s*(\d+)(?:%(\w+))?$} {123%}
# returns an empty list (no match)

However, if you cannot use the $ anchor, it's harder. You
could use an alternative: the string after the digits is
EITHER nothing at all, OR some string that does not begin
with %, OR % followed by some text:

\s*(\d+)(?:$|[^%\d]|%(\w+))

Notice that the character class [^%\d] needs to include
the digit specifier \d. Without this, the regexp would
still match {123%} because the last digit would be taken
to match [^%]. Another way to do this would be

\m(\d+)\M(?:$|[^%]|%(\w+))

The \m\M word delimiters force (\d+) to match the complete
string of digits.

Markus Elfring

Jul 9, 2004, 11:59:52 AM
> ... However, you may

> find it's helpful to use a visualiser such as TREV
> (which has just recently been updated to handle non-capturing
> subexpressions correctly):
>
> www.doulos.com/knowhow/tcltk/examples/trev

I am going to try it and the tool
"http://royo.is-a-geek.com/iserializable/regulator/".

Jonathan Bromley

Jul 9, 2004, 12:00:06 PM
On Fri, 09 Jul 2004 16:44:41 +0100, Jonathan Bromley
<jonathan...@doulos.com> wrote:

> Another way to do this would be
>
> \m(\d+)\M(?:$|[^%]|%(\w+))
>
>The \m\M word delimiters force (\d+) to match the complete
>string of digits.

Whoops, that's no good. It would NOT match "1234abcd".
For that string I guess you want to see the digit string 1234,
but you don't want to know about the following text.

My alternative solution

> \s*(\d+)(?:$|[^%\d]|%(\w+))

is, I think, OK.
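
A small table-driven check of that pattern, as a sketch only; the four inputs are the ones discussed in this thread and the variable names are arbitrary.

```tcl
# After the digits, accept: end of string, OR a character that is
# neither % nor a digit, OR % followed by a word.
set re {\s*(\d+)(?:$|[^%\d]|%(\w+))}

foreach s {123 123%abc 123% 1234abcd} {
    if {[regexp $re $s whole a b]} {
        puts "input=|$s| match a=|$a| b=|$b|"
    } else {
        puts "input=|$s| no match"
    }
}
```

Only {123%} is rejected. For {1234abcd} the pattern matches with a holding 1234 and b empty, because the [^%\d] branch consumes the "a".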

Markus Elfring

Jul 13, 2004, 8:21:26 AM
> Sorry Markus, I don't quite understand what you are trying to do.

I want to extract values between delimiters that are XML tags.
I turned the discussion to a simple example with a single delimiter.

Markus Elfring

Jul 13, 2004, 8:39:41 AM
> % regexp {<name>(.+)</name>(?:.*?<scope>(SYSTEM|PUBLIC)</scope>.*?<S_URI>(.+)</S_URI>.*?(?:<P_URI>(.+)</P_URI>)?)?(?:.*?<definition>(.*?)</definition>)?(?:.*?<attributes>(.*?)</attributes>)?.*?<content>(.*)</content>\s*$}

This pattern is better.
{^\s*<name>(.+)</name>(?:\s*<scope>(SYSTEM|PUBLIC)</scope>\s*<S_URI>(.+)</S_URI>\s*(?:<P_URI>(.+)</P_URI>)?)?(?:\s*<definition>(.*)</definition>)?(?:\s*<attributes>(.*)</attributes>)?\s*<content>(.*)</content>\s*$}

Markus Elfring

Jul 13, 2004, 11:05:12 AM
> However, if you cannot use the $ anchor, it's harder.

This discussion is evolving into the topic "efficient evaluation of
multi-character 'quotes'".
An optimized "direct approach" starts on page 273 of the book
"Mastering Regular Expressions". It seems that the end result can only
be constructed in a readable way by storing pattern parts in variables
of the host programming language and putting them together. The
pattern language cannot reuse predefined parts so far. Several parts
must be repeated to form the fastest pattern.
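
Building the pattern from parts might look roughly like this in Tcl. This is a sketch aimed at readability, not the optimized pattern from the book; the helper "element", the variable names, and the shortened sample are all hypothetical.

```tcl
# Hypothetical helper: wrap a body pattern in an opening and closing tag.
proc element {tag body} {
    return "<$tag>$body</$tag>"
}

set ws   {\s*}          ;# optional whitespace between elements
set text {([^<]*)}      ;# captured text that contains no markup

set name       [element name $text]
set definition [element definition $text]
set content    [element content {(.*)}]

# join with an empty separator sidesteps quoting traps such as "$ws(",
# which Tcl would read as an array reference inside double quotes.
set re [join [list ^ $ws $name $ws "(?:" $definition $ws ")?" $content $ws \$] ""]

set sample {<name>gruss</name> <content><h1>Guten Tag!</h1></content>}
if {[regexp $re $sample -> n d c]} {
    puts "name=|$n| definition=|$d| content=|$c|"
}
```

Each repeated fragment is written once and reused by name, so a change to the text or whitespace rule touches a single line.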
