Need help extracting the contents of an XML tag

Steven Hines

Mar 11, 2003, 5:26:14 AM

Hi folks,

I've held off asking this question for a while, but this is driving me
mental so I wonder if someone could shed some light?

I'm using regular expressions in TCL 8.4.1 to extract the contents of
XML tags. Running the following should show you what I'm trying to do:

8><------------------------snip!--------------------------------
console show

set xml "<title>Title text 1</title><para>Paragraph text 1</para><title>Title text 2</title><para>Paragraph text 2</para>\n"

puts "$xml"

set startIndex 0
while {[regexp -indices -start $startIndex \
        {<title>([^<]+)</title><para>(.*?)</para>} $xml match titleTag paraTag]} {

    # Set the counter to find the next occurrence
    set startIndex [lindex $match 1]

    # Pull out text we've found
    set titleTagText [string range $xml [lindex $titleTag 0] [lindex $titleTag 1]]
    set paraTagText [string range $xml [lindex $paraTag 0] [lindex $paraTag 1]]

    puts "================="
    puts "titleTagText = $titleTagText"
    puts "paraTag = $paraTagText"
    puts "================="

}
8><------------------------snip!--------------------------------

The code produces the following output:

8><------------------------output!--------------------------------
<title>Title text 1</title><para>Paragraph text 1</para><title>Title text 2</title><para>Paragraph text 2</para>

=================
titleTagText = Title text 1
paraTag = Paragraph text 1</para><title>Title text 2</title><para>Paragraph text 2
=================
8><------------------------output!--------------------------------

I can't for the life of me work out why the result of the non-greedy
.*? stretches from the beginning of the first <para> text down to the
end of the second <para> text!! When I try the regexp with simpler
XML, the non-greedy quantifier behaves and limits itself to the
contents of a single tag. Also, if I try my regexp in a different
tool, it produces the correct output.

Some restrictions:
- <title> text will never contain other tags
- <title> tag always guaranteed to contain text
- <para> text may contain HTML formatting tags (e.g. )
- <para> tags NOT always guaranteed to contain text

Any ideas on what I'm missing?

Thanks a lot,

Steve

PS - Apologies for the length of this posting!!

Michael Schlenker

Mar 11, 2003, 7:09:23 AM

Steven Hines wrote:
> Hi folks,
>
> I've held off asking this question for a while, but this is driving me
> mental so I wonder if someone could shed some light?
>
> I'm using regular expressions in TCL 8.4.1 to extract the contents of
> XML tags. Running the following should show you what I'm trying to do:
>

Why not simply use an XML parser to do it? It's more robust and just works.

Michael Schlenker

Michael A. Cleverly

Mar 11, 2003, 9:58:36 AM

to Steven Hines

On 11 Mar 2003, Steven Hines wrote:

> I've held off asking this question for a while, but this is driving me
> mental so I wonder if someone could shed some light?
>
> I'm using regular expressions in TCL 8.4.1 to extract the contents of
> XML tags. Running the following should show you what I'm trying to do:

If you can use a compiled extension, I'd suggest you look at tDOM and use
it to parse your XML. See http://www.tdom.org and
http://wiki.tcl.tk/tdom.
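
A rough sketch of what the tDOM route might look like, assuming the tdom package is installed. The <doc> wrapper element and the sample data are made up for illustration; the wrapper is there only because the fragment has no single root element.

```tcl
package require tdom

# Illustrative input: one <para> with an inline tag, one empty <para>
set xml "<title>Title text 1</title><para>Paragraph <b>text</b> 1</para><title>Title text 2</title><para></para>"

# Wrap the fragment in a hypothetical <doc> root so it parses as one document
set doc [dom parse "<doc>$xml</doc>"]
set root [$doc documentElement]

set found {}
foreach title [$root selectNodes title] {
    # following-sibling::para[1] selects the first <para> after this <title>
    set para [lindex [$title selectNodes {following-sibling::para[1]}] 0]
    set titleText [$title text]
    # asText gives the text content with any inline markup dropped
    set paraText [$para asText]
    lappend found [list $titleText $paraText]
    puts "title = $titleText"
    puts "para  = $paraText"
}

# Free the DOM tree when done
$doc delete
```

This assumes every <title> is followed by a <para>, which matches the input described in the question.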

However, while general XML is difficult (at best) to parse with regular
expressions, given the restrictions on your input (that you outline
below), it can be done.

> I can't for the life of me work out why the result of the non-greedy
> .*? stretches from the beginning of the first <para> text down to the
> end of the second <para> text!! When I try the regexp with simpler
> XML, the non-greedy quantifier behaves and limits itself to the
> contents of a single tag. Also, if I try my regexp in a different
> tool, it produces the correct output.

Mixing greedy and non-greedy quantifiers in the same regular expression
can have unexpected side effects. See the discussion in this thread for
more details:

http://groups.google.com/groups?th=cda7ef577e79b545
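
As a small illustration of the effect (a sketch based on my reading of the matching rules in the re_syntax manual page, not taken from that thread): the first quantifier in the RE decides whether the RE as a whole prefers the longest or the shortest match, and later quantifiers have to fit inside that overall match. The sample string and variable names are made up.

```tcl
set s "aaabbb"

# First quantifier (a+) is greedy, so the whole RE prefers the longest
# match; the non-greedy b*? is then stretched to fill that match.
regexp {a+(b*?)} $s whole1 bs1

# First quantifier (a+?) is non-greedy, so the whole RE prefers the
# shortest match; b*? matches nothing at all.
regexp {a+?(b*?)} $s whole2 bs2

puts "greedy first:     whole=$whole1 bs=$bs1"
puts "non-greedy first: whole=$whole2 bs=$bs2"
```

In your RE the greedy [^<]+ comes first, which is why the later .*? ends up running to the last </para> instead of the first.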

> Some restrictions:
> - <title> text will never contain other tags
> - <title> tag always guaranteed to contain text
> - <para> text may contain HTML formatting tags (e.g. )
> - <para> tags NOT always guaranteed to contain text
>
> Any ideas on what I'm missing?

Given the above restrictions, I'd use the following:

set RE {(?xi)
<title>([^<]+)</title> \s*
<para>((?:(?!</para>).)*)</para> \s*
}
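
One possible way to apply it, sketched with regexp -all -inline rather than your -indices loop (the sample data and variable names are made up; the second <para> holds an inline tag and the third is empty, as your restrictions allow):

```tcl
set xml "<title>T1</title><para>P1</para><title>T2</title><para>A <b>bold</b> word</para><title>T3</title><para></para>\n"

set RE {(?xi)
    <title>([^<]+)</title> \s*
    <para>((?:(?!</para>).)*)</para> \s*
}

# -all -inline returns a flat list: whole match, title, para, whole match, ...
set pairs {}
foreach {match title para} [regexp -all -inline $RE $xml] {
    lappend pairs [list $title $para]
    puts "title = $title"
    puts "para  = $para"
}
```

The (?x) flag makes the RE ignore the whitespace used for layout, and the negative lookahead (?!</para>) stops the inner group from running past the closing tag, so every quantifier can stay greedy.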

> Thanks a lot,
>
> Steve
>
> PS - Apologies for the length of this posting!!

It was very helpful of you to provide code, output, an explanation of
what you'd expected differently, as well as the information on what type
of input you are dealing with. No need to apologize about the post's
length. :-)

Michael

Steven Hines

Mar 11, 2003, 2:32:45 PM

YOU ABSOLUTE HERO!!!!!

It worked a treat and you've saved me much hairpulling!! I had tried
the negative lookahead but couldn't make any progress with it. It was
the ?: grouping that contained the answer!!

Thank you very very much for a clear and well thought out answer which
gives me much new knowledge to chew on.

Steve

Steven Hines

Mar 11, 2003, 2:34:36 PM

Michael Schlenker <sch...@uni-oldenburg.de> wrote in message

> Why not simply use an XML parser to do it? It's more robust and just works.
>
> Michael Schlenker

I can see your point, but this particular problem is part of a much
larger TCL program which makes changes to the XML depending on what it
finds.

Donald Arseneau

Mar 11, 2003, 7:51:48 PM

steven...@yahoo.com (Steven Hines) writes:

> while {[regexp -indices -start $startIndex
> {<title>([^<]+)</title><para>(.*?)</para>} $xml match titleTag
> paraTag]} {

Since, as others have pointed out, the problem is with bad/obscure
behaviour of mixing greedy and non-greedy regexp, how about just
making the title match be non-greedy for consistency -- it doesn't
affect the title, but fixes the para match:

{<title>([^<]+?)</title><para>(.*?)</para>}

seems to work. (I am not a real expert)
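
For what it's worth, here is a quick check of that pattern against the sample data from the original post, sketched with regexp -all -inline instead of the -indices loop (the variable names are mine):

```tcl
set xml "<title>Title text 1</title><para>Paragraph text 1</para><title>Title text 2</title><para>Paragraph text 2</para>\n"

# Every quantifier is non-greedy, so the RE as a whole prefers the
# shortest match and each <para> capture stops at its own </para>.
set RE {<title>([^<]+?)</title><para>(.*?)</para>}

set paras {}
foreach {match title para} [regexp -all -inline $RE $xml] {
    lappend paras $para
    puts "titleTagText = $title"
    puts "paraTag = $para"
}
```

Note that this pattern has no \s* between the tags, so it assumes each <para> follows its </title> with no whitespace in between.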

Donald Arseneau as...@triumf.ca