
regexp: parsing HTML tokens


are...@comcast.net

May 16, 2005, 8:03:03 PM
I'm attempting to parse simple HTML/ASP tokens. I'm trying to use
regexp to be more firm in my extraction of the token I'm looking for.
My string is:

set string "<%@ Page Language="vb" AutoEventWireup="false"
CodeBehind="AddContact.apsx.vb"
Inherits="MCSDWebAppxVB.AddContact %>"

What I'm currently using is a regexp with subexpressions to find this,
and I want to catch the beginning of the token and the end of the
token, as everything in between can be variable:

regexp -nocase {(\<\%\@ page) (\%\>)} $string result
-> 0

I know this is a primitive question, but I can't seem to get the second
sub-expression to be variable in sequence to the first subexpression.
Again, my regexp would be correct if it produced result =
"<%@ Page %>"

Thank you in advance for your help. I've looked through the literature
and groups on this site to no avail.

Roy Terry

May 16, 2005, 8:57:59 PM
<are...@comcast.net> wrote in message
news:1116288183.7...@g47g2000cwa.googlegroups.com...

> I'm attempting to parse simple HTML/ASP tokens. I'm trying to use
> regexp to be more firm in my extraction of the token I'm looking for.
> My string is:
>
> set string "<%@ Page Language="vb" AutoEventWireup="false"
> CodeBehind="AddContact.apsx.vb"
> Inherits="MCSDWebAppxVB.AddContact %>"

I hope the above is not your actual test code as it is invalid Tcl.
Since your string contains literal double quote chars you should quote
it with curly braces:


set string {<%@ Page Language="vb" AutoEventWireup="false"
CodeBehind="AddContact.apsx.vb"
Inherits="MCSDWebAppxVB.AddContact %>}

>
> What I'm currently using is a regexp with subexpressions to find this,
> and I want to catch the beginning of the token and the end of the
> token, as everything in between can be variable:
>
> regexp -nocase {(\<\%\@ page) (\%\>)} $string result
> -> 0

What this pattern lacks is any RE spec to match the
"everything in between". Also <, % and @ don't need to be
escaped. Finally, it doesn't make much sense to capture the
value of the fixed beginning and ending strings. So ...

% regexp -nocase {<%@ page(.*)%>} $string => middle
1
% puts $middle
Language="vb" AutoEventWireup="false"
CodeBehind="AddContact.apsx.vb"
Inherits="MCSDWebAppxVB.AddContact

Does this help?

Roy
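A note on the (.*) in the pattern above: it is greedy, so when a string holds more than one directive it runs on to the last %> rather than the first. The sketch below is only an illustration; the two-directive sample text is made up and does not come from the thread. It compares the greedy form with a non-greedy (.*?) variant.

```tcl
# Hypothetical input with two directives (not from the thread).
set page {<%@ Page Language="vb" %>
<%@ Import Namespace="System.Data" %>}

# Greedy: (.*) extends to the LAST %> in the string.
regexp -nocase {<%@ page(.*)%>} $page -> greedy

# Non-greedy: (.*?) stops at the FIRST %>.
regexp -nocase {<%@ page(.*?)%>} $page -> lazy

puts "greedy: [list $greedy]"
puts "lazy:   [list $lazy]"
```

With a single directive in the string, as in the original question, both forms capture the same text.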

Donal K. Fellows

May 17, 2005, 6:54:55 AM
are...@comcast.net wrote:
> regexp -nocase {(\<\%\@ page) (\%\>)} $string result
> -> 0

You're probably after this:

regexp -nocase {<%@\s*page\s*((?:(?!%>).)*)%>} $string \
        wholeMatch interestingContent

How does it work? Simple. Firstly, none of <%@> are special chars to RE
by themselves. Secondly, the first part (up to the second \s*) matches
the start of the interesting chunk from your string, and there's a
literal %> at the end. Thirdly, there's a bunch of real RE magic in the
middle:
((?:(?!%>).)*)

Looking at that in detail, we see this in expanded form:

(                  # Stash this sub-RE in a variable
  (?:              # Pure grouping of zero-width assertion and matcher
    (?! %> ) .     # any character, as long as it isn't the start of %>
  ) *              # and as many of them as possible
)                  # That's All Folks!

The key to doing sane parsing is often the negative lookahead assertion.
It makes life *so* much easier!

Donal.
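As a rough check of the lookahead approach described above, the sketch below runs the same pattern through regexp's -expanded switch, which allows whitespace and # comments inside the RE. The two-directive sample string is an assumption made up for illustration; the point is that the capture ends at the first %> and does not swallow the second directive.

```tcl
# Hypothetical sample with two directives (not from the thread).
set page {<%@ Page Language="vb" AutoEventWireup="false" %>
<%@ Import Namespace="System.Data" %>}

# Same pattern as in the post, laid out with -expanded syntax.
set re {
    <%@ \s* page \s*     # fixed opening, flexible spacing
    (                    # capture the directive body
      (?: (?!%>) . )*    # any run of characters that never begins a closing marker
    )
    %>                   # fixed closing
}

regexp -nocase -expanded $re $page whole body
puts "whole: $whole"
puts "body:  [string trim $body]"
```

The string trim call is there only because the capture can keep the space that precedes the closing %>.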

are...@comcast.net

May 17, 2005, 7:14:44 PM
Thank you so much for your help. Your RE expression really made a big
difference.

I have two cases I'm still trying to work with that are causing me some
trouble. Let me see if I can map it out for you; note that this is written
symbolically:

<[1]:[2] [ ]>

The <> at the beginning and the end are expected and searched for; the
expression is contiguous characters.

[1] is a contiguous expression of a plurality of characters allowing
alpha and numeric only.
[2] is a contiguous expression of a plurality of characters allowing
alpha and numeric only, followed by a required space, as indicated
above
[ ] is the same as the RE in your reply; the colon is also required.

I'm using regexp again, since I've been so successful with what you
showed me, here's what doesn't seem to work:
regexp {<*[a-zA-Z0-9]:*[a-zA-Z0-9] ((?:(?!>).)*)>} $string output

It handles the variability of length in the contiguous first two parts
fine, but allows non-alphanumeric characters through.

Your help is as always greatly appreciated!
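For reference, one way to read the symbolic layout above as a pattern is sketched below. In an RE a quantifier applies to the atom before it, so <* means "zero or more <" and each bare [a-zA-Z0-9] matches exactly one character; because regexp is unanchored, the pattern in the post can therefore match just the tail of a tag. The sketch moves a + after each bracket expression instead. The sample strings are made up for illustration, and reading "plurality" as "one or more" is an assumption.

```tcl
# [1] ':' [2], a required space, then everything up to the closing '>'.
set re {<([a-zA-Z0-9]+):([a-zA-Z0-9]+) ((?:(?!>).)*)>}

# Hypothetical test strings (not from the thread).
set samples {{<asp:Label id="x">} {<a$p:Label id="x">} {<asp:La_bel id="x">}}

foreach s $samples {
    if {[regexp $re $s -> first second rest]} {
        puts "match:    $s  (first=$first second=$second rest=$rest)"
    } else {
        puts "no match: $s"
    }
}
```

Only the first sample should match; the other two contain a non-alphanumeric character in [1] or [2].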

Donal K. Fellows

May 18, 2005, 5:16:43 AM
are...@comcast.net wrote:
> I'm using regexp again, since I've been so successful with what you
> showed me, here's what doesn't seem to work:
> regexp {<*[a-zA-Z0-9]:*[a-zA-Z0-9] ((?:(?!>).)*)>} $string output

You've put the *s in the wrong places, and in any case, if you're
parsing XHTML, you'll want this:

set RE {(?i)<(/?)(?:([-a-z0-9]+):)?([-a-z0-9]+)\s*((?:(?!/?>).)*)(/?)>}
regexp $RE $string -> closer NSpfx localName attributes empty

Variable $closer is non-empty if you've got a close-tag.

Variable $NSpfx has the namespace prefix, if present. The meaning of
this is assigned through inheritable xmlns:... attributes... :^/

Variable $localName has the local name of the element.

Variable $attributes has all the extracted attributes, ready for
parsing. If you're being strict, it's an error for this variable and
$closer to have non-empty content at the same time.

Variable $empty is non-empty if you've got an element without content
and where you should not be looking for a close tag. It is an error for
both this and $closer to have non-empty content at the same time.

If you're really serious about parsing HTML or XML, use a proper HTML or
XML parser. There are enough tricky bits and pieces (non-trivial
entities are the classic examples, but there are many many more) that it
is worth using someone else's parser here.

Donal.
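To see how the five variables described above get filled in, here is a small sketch that applies the same RE to a few tags. The helper name parseTag and the sample tags are hypothetical, introduced only for this illustration; it is a quick experiment, not a substitute for a real parser.

```tcl
# Pattern copied from the post above.
set RE {(?i)<(/?)(?:([-a-z0-9]+):)?([-a-z0-9]+)\s*((?:(?!/?>).)*)(/?)>}

# parseTag is a hypothetical helper, not part of Tcl or of the thread.
# Returns {closer NSpfx localName attributes empty}, or {} if nothing matches.
proc parseTag {tag} {
    global RE
    if {![regexp $RE $tag -> closer NSpfx localName attributes empty]} {
        return {}
    }
    # Trim the attribute text; it can keep whitespace before "/>" or ">".
    return [list $closer $NSpfx $localName [string trim $attributes] $empty]
}

# Hypothetical sample tags: open tag, close tag, and two empty elements.
foreach tag {{<asp:Label id="x">} {</asp:Label>} {<br/>} {<img src="a.png" />}} {
    puts "$tag => [parseTag $tag]"
}
```

A caller being strict could then reject any result where both the first and last fields are non-empty, as described in the post.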
