Regexp for XML


Stefan Vogel

Nov 9, 2000, 3:00:00 AM
Hi,

I would like to match opening and closing-XML-tags in a very simple
manner with regexp.

Given the following string:
% set text {<anc>s</anc><stw1><anc>M</anc></stw1>}
% regexp "<(anc)>(.*?)</anc>" $text dummy tag body
% set body
s

Fine.
Now I'm doing something similar to cope with attributes of the tags.
With the following string
% set text2 {<anc h="m">s</anc><stw1><anc n="1">M</anc></stw1>}
and
% regexp "<(anc)\\s+(.+?)>(.*?)</anc>" $text2 dummy tag attributes body

I get
% set attributes
h="m"
% set body
s</anc><stw1><anc n="1">M


Why doesn't the non-greedy operator work in this case?
What have I got to do to match the attributes and(!) the nearest closing
tag?

Thanks
Stefan
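
A likely explanation, going by the matching rules in Tcl's re_syntax
documentation: a branch takes its preference from the first quantified
atom that has one. In the second pattern that atom is the greedy \s+, so
the expression as a whole prefers the longest match, and the later .*? is
stretched to reach the last </anc>. A minimal sketch of the usual
workaround, which makes that first quantifier non-greedy as well
(variable names follow the post above):

```tcl
set text2 {<anc h="m">s</anc><stw1><anc n="1">M</anc></stw1>}

# \s+? instead of \s+: every quantifier is now non-greedy, so the
# expression as a whole prefers the shortest match.
regexp {<(anc)\s+?(.+?)>(.*?)</anc>} $text2 dummy tag attributes body

puts $attributes   ;# h="m"
puts $body         ;# s
```

This still stops at the first > and the first </anc>, so it does not
handle nested anc elements or a > inside an attribute value.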


Bob Techentin

Nov 9, 2000, 3:00:00 AM
Stefan Vogel wrote:
>
> I would like to match opening and closing-XML-tags in a very simple
> manner with regexp.

Stefan,

When you get this worked out to your satisfaction, could you add it to
the Regular Expression Examples page on the wiki?

Thanks,
Bob
--
Bob Techentin techenti...@mayo.edu
Mayo Foundation (507) 284-2702
Rochester MN, 55905 USA http://www.mayo.edu/sppdg/sppdg_home_page.html

Richard.Suchenwirth

Nov 9, 2000, 3:00:00 AM
Bob Techentin wrote:
>
> Stefan Vogel wrote:
> >
> > I would like to match opening and closing-XML-tags in a very simple
> > manner with regexp.
>
> Stefan,
>
> When you get this worked out to your satisfaction, could you add it to
> the Regular Expression Examples page on the wiki?
>
http://purl.org/thecliff/tcl/wiki/989.html
--
Schoene Gruesse/best regards, Richard Suchenwirth - +49-7531-86 2703
RC DT2, Siemens Electrocom, Buecklestr. 1-5, D-78467 Konstanz,Germany
-------------- http://purl.org/thecliff/tcl/wiki//Richard*Suchenwirth
AL:The Analytical Engine worketh not! CB:What version dost thou have?

Joe English

Nov 9, 2000, 3:00:00 AM
Stefan Vogel <stefan...@avinci.de> wrote:
>
>I would like to match opening and closing-XML-tags in a very simple
>manner with regexp. [...]

>What have I got to do to match the attributes and(!) the nearest closing
>tag?

Some words of wisdom from Steve Ball:

"My characterisation of using REs for parsing XML is that it
is like performing brain surgery with a chainsaw: you get the
job done, but you have to scrape lots of important bits off
the wall and put them back in where they belong."

If you *still* want to do things this way, take
a look at the grammar productions in the XML specification:

<URL: http://www.w3.org/TR/REC-xml >

The relevant part of the grammar forms a regular language;
translating it into a regexp should be straightforward.
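
As a rough sketch of that translation, the start-tag productions (STag,
Attribute, Eq, AttValue, S, Name) can be assembled piece by piece in Tcl.
The simplifications here are assumptions, not part of the specification:
Name is restricted to ASCII, \s stands in for S, and entity references
inside attribute values are not checked.

```tcl
# Simplified building blocks, named after the XML grammar productions.
set S      {\s+}
set Name   {[A-Za-z_:][A-Za-z0-9._:-]*}
set AttVal {(?:"[^<"]*"|'[^<']*')}
set Eq     "(?:${S})?=(?:${S})?"
set Attr   "${Name}${Eq}${AttVal}"

# STag ::= '<' Name (S Attribute)* S? '>'
# Capture 1 is the element name, capture 2 the raw attribute list.
set STag   "<(${Name})((?:${S}${Attr})*)(?:${S})?>"

set text2 {<anc h="m">s</anc><stw1><anc n="1">M</anc></stw1>}
if {[regexp $STag $text2 whole name attrs]} {
    puts "tag:   $whole"
    puts "name:  $name"
    puts "attrs: $attrs"
}
```

All quantifiers here are greedy, and every piece is bounded by a
character class, so the match cannot run past the end of the start tag.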


--Joe English

jeng...@flightlab.com

Steve Ball

Nov 10, 2000, 4:05:48 PM
Joe English wrote:
>
> Stefan Vogel <stefan...@avinci.de> wrote:
> >
> >I would like to match opening and closing-XML-tags in a very simple
> >manner with regexp. [...]
> >What have I got to do to match the attributes and(!) the nearest closing
> >tag?
>
> Some words of wisdom from Steve Ball:
>
> "My characterisation of using REs for parsing XML is that it
> is like performing brain surgery with a chainsaw: you get the
> job done, but you have to scrape lots of important bits off
> the wall and put them back in where they belong."

Ah, indeed!

> If you *still* want to do things this way, take
> a look at the grammar productions in the XML specification:
>
> <URL: http://www.w3.org/TR/REC-xml >
>
> The relevant part of the grammar forms a regular language;
> translating it into a regexp should be straightforward.

Yep - that's more or less what I've done. It's not always
straightforward, though.

The Tcl-only parser in TclXML is just a big regexp engine.
Stefan, are you *really* sure you want to reinvent the wheel?

> What have I got to do to match the attributes and(!) the nearest
> closing tag?

Big gotcha here - '>' is permitted in an attribute value.
So trying to match all text between the element type name and
the '>' character won't work in general.
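
A small illustration of the gotcha, with a made-up sample string (text3
is hypothetical, not from the original post). The first pattern stops at
the '>' inside the quoted value; the second is one possible quote-aware
alternative and is only a sketch, not how TclXML does it:

```tcl
set text3 {<anc title="a>b">s</anc>}

# Naive: everything up to the first '>' is taken as the attribute text.
regexp {<(anc)\s+?(.+?)>} $text3 dummy tag naive
puts $naive    ;# title="a

# Quote-aware: each attribute value is a complete quoted string, so a
# '>' inside the quotes cannot end the tag.
set startTag {<(anc)((?:\s+[^\s=>]+\s*=\s*(?:"[^"]*"|'[^']*'))*)\s*>}
regexp $startTag $text3 dummy tag attrs
puts $attrs    ;# (leading space) title="a>b"
```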

Cheers,
Steve

--
Steve Ball | waX Me Lyrical XML Editor | Training & Seminars
Zveno Pty Ltd | Web Tcl Complete | XML XSL
http://www.zveno.com/ | TclXML TclDOM | Tcl, Web Development
Steve...@zveno.com +---------------------------+---------------------
Ph. +61 2 6242 4099 | Mobile (0413) 594 462 | Fax +61 2 6242 4099
