Regex matching non-contiguous shreds of text

DM

Oct 20, 2004, 1:26:17 PM
I'm trying to design a regular expression to match the href attribute of <a>
tags. I'm testing it on the command line (on Redhat Linux Enterprise Server)
using grep with the Perl regex option.

Here's the command I'm using:

# grep -rHInPo --color=auto 'href=.*TEA-21_Side-by-Side\.pdf[^>]*>'
/home/mtc_website/

(On my console, the above is all one line. The URL part --
"TEA-21_Side-by-Side\.pdf" in this example, would be determined at runtime in
the actual Perl script.)

It almost works as expected. I set the color and -o options in order to clearly
show the highlighted match. In most cases it *does* match exactly what I want it to.

However, in a few cases what is matched is totally unexpected.

Here is some sample output:

================================================================================

# grep -rHInPo --color=auto 'href=.*TEA-21_Side-by-Side\.pdf[^>]*>'
/home/mtc_website/
/home/mtc_website/whats_happening/legislative_update/tea21_04-04.htm:43:href="TEA-21_Side-by-Side.pdf">
/home/mtc_website/whats_happening/legislative_update/tea21_06-04.htm:42:href="TEA-21_Side-by-Side.pdf">
<li> <a href="TEA-21_Side-by-Side.pdf">rong>
<ul>ng="5">.ca.gov</a> s tober LATIVE UPDATE" width="340" height="14" border="0" />

================================================================================

In the file "tea21_06-04.htm" it's going beyond what I intend to match and
scooping up a bunch more stuff. But it isn't even clear to me what it's matching
because the output shows discontinuous shreds of text from within the file.
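
One possible explanation for the scattered fragments, offered purely as an assumption since nothing in the thread confirms it: if the file contains bare carriage returns, grep treats several visual lines as a single line, and the terminal draws each carriage-return-separated piece over the previous one. A minimal sketch with a made-up sample string; `cat -v` displays a carriage return as `^M` instead of letting the terminal act on it:

```shell
# Hypothetical sample: two pieces separated by a bare carriage return (\r).
sample='href="TEA-21_Side-by-Side.pdf">Comparison\r<ul>'

# Printed straight to a terminal, "<ul>" would overwrite the start of the line.
# %b expands the \r escape; cat -v makes the control character visible as ^M.
visible=$(printf '%b\n' "$sample" | cat -v)
echo "$visible"
```

Piping the real grep output through `cat -v` would show whether this is what is happening here.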

Here is a sample of that file containing the unexpected match:

================================================================================

<td bgcolor="#CCFFFF"><strong>DOWNLOAD:</strong> <ul>
<li> <a href="TEA-21_Side-by-Side.pdf">Comparison of Highway
Provisions in Surface Transportation Reauthorization Bills</a>
(PDF)
<p> </p>
</li>
<li><a href="HR3550-High-Priority_Proj.xls">H.R. 3550
High-Priority
Projects</a> (Excel)<br />
</li>
</ul></td>
</tr>
</table>
<p><br />
<strong>TEA 21 Reauthorization Conference Committee Comes Closer to
Agreement on Bottom Line Number</strong><br />

================================================================================

Any help would be greatly appreciated.

Thanks,

dm

Jon Ericson

Oct 20, 2004, 2:39:45 PM
DM <elektrophyte-yahoo> writes:

> I'm trying to design a regular expression to match the href attribute
> of <a> tags. I'm testing it on the command line (on Redhat Linux
> Enterprise Server) using grep with the Perl regex option.
>
> Here's the command I'm using:
>
> # grep -rHInPo --color=auto 'href=.*TEA-21_Side-by-Side\.pdf[^>]*>'
> /home/mtc_website/
>
> (On my console, the above is all one line. The URL part --
> "TEA-21_Side-by-Side\.pdf" in this example, would be determined at
> runtime in the actual Perl script.)
>
> It almost works as expected. I set the color and -o options in order
> to clearly show the highlighted match. In most cases it *does* match
> exactly what I want it to.
>
> However, in a few cases what is matched is totally unexpected.

If you were actually using perl, this wouldn't be too difficult with
the HTML::Parser module. See perldoc -q html for some discussion
about the pitfalls of using a regex to parse HTML.

Jon

DM

Oct 20, 2004, 3:39:53 PM
Jon Ericson wrote:

> DM <elektrophyte-yahoo> writes:
>
>
>>I'm trying to design a regular expression to match the href attribute
>>of <a> tags. I'm testing it on the command line (on Redhat Linux
>>Enterprise Server) using grep with the Perl regex option.
>>
>>Here's the command I'm using:
>>
>># grep -rHInPo --color=auto 'href=.*TEA-21_Side-by-Side\.pdf[^>]*>'
>>/home/mtc_website/

[ ... ]

>>However, in a few cases what is matched is totally unexpected.
>
>
> If you were actually using perl, this wouldn't be too difficult with
> the HTML::Parser module. See perldoc -q html for some discussion
> about the pitfalls of using a regex to parse HTML.
>
> Jon

Thanks for the reply. I don't see how the HTML::Parser module would help me in
the task I described in my original post.

I checked perldoc as you recommended, but the "pitfalls" mentioned don't seem to
apply to what I'm doing.

As I explained in my original post, I'm not trying to do some kind of general
HTML parsing operation, such as stripping out HTML tags. I'm trying to find this
string:

href="[SOME_URL_FRAGMENT].pdf">

My regex almost works, but is acting really weird in a few cases. I'm trying to
nail down the reason for that. Perhaps I have a misconception or
misunderstanding of regex syntax?
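
The target string described above can be pinned down more tightly by keeping the whole match inside the attribute's quotes. This is only a sketch, assuming double-quoted attribute values and a GNU grep built with -P support; the variable `name` and the sample line are invented for illustration. `\Q...\E` makes the regex engine treat the runtime fragment literally, so the dot needs no manual escaping:

```shell
# Hypothetical runtime value; in the real script it would come from elsewhere.
name='TEA-21_Side-by-Side.pdf'

# Made-up sample line with two links on it.
line='<a href="TEA-21_Side-by-Side.pdf">one</a> <a href="other.xls">two</a>'

# [^"]* cannot cross a double quote, so the match stays inside one attribute value.
# \Q...\E quotes the fragment, so "." matches only a literal dot.
match=$(printf '%s\n' "$line" | grep -oP "href=\"[^\"]*\Q${name}\E\">")
echo "$match"
```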

Paul Lalli

Oct 20, 2004, 3:52:21 PM
"DM" <elektrophyte-yahoo> wrote in message
news:41769fe2$0$796$2c56...@news.cablerocket.com...

> I'm trying to design a regular expression to match the href attribute
> of <a> tags. I'm testing it on the command line (on Redhat Linux
> Enterprise Server) using grep with the Perl regex option.
>
> Here's the command I'm using:
>
> # grep -rHInPo --color=auto 'href=.*TEA-21_Side-by-Side\.pdf[^>]*>'
> /home/mtc_website/

[^>]*

means match EVERYTHING it can. Being a negated class it cannot cross a >,
so here it stops at the next > in the string.

You can make it non-greedy.

[^>]*?

means to match only as much as necessary to make the pattern match
succeed.
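
A quick check of how the negated class behaves on its own, as a sketch (the sample line is invented, and GNU grep with -P is assumed). Because `[^>]` can never match `>`, the greedy and non-greedy forms produce the same match when the next item in the pattern is `>`:

```shell
# Made-up sample line: one link followed by another tag.
line='<a href="a.pdf" class="x">A</a> <b>bold</b>'

# Greedy: takes every non-">" character it can, then needs a ">".
greedy=$(printf '%s\n' "$line" | grep -oP 'href=[^>]*>')

# Non-greedy: takes as few as possible, but still has to reach the first ">".
lazy=$(printf '%s\n' "$line" | grep -oP 'href=[^>]*?>')

echo "$greedy"
echo "$lazy"
```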

Paul Lalli


Paul Lalli

Oct 20, 2004, 3:53:22 PM
"Paul Lalli" <mri...@gmail.com> wrote in message
news:Vlzdd.4983$bV5.4448@trndny07...

Of course, this applies to the first * in your regexp as well.

href=.*?
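
The difference this makes can be seen on a line that mentions the file twice. A sketch under assumptions: the sample line and the short file name `x.pdf` are invented, and GNU grep with -P is assumed.

```shell
# Invented sample: two links to the same file on one line.
line='<a href="x.pdf">one</a> <a href="x.pdf">two</a>'

# Greedy: .* runs to the last "x.pdf" on the line, giving one long match.
greedy=$(printf '%s\n' "$line" | grep -oP 'href=.*x\.pdf[^>]*>')

# Non-greedy: .*? stops at the first "x.pdf" after each href=, giving two short matches.
lazy=$(printf '%s\n' "$line" | grep -oP 'href=.*?x\.pdf[^>]*>')

echo "greedy: $greedy"
echo "lazy:"
echo "$lazy"
```

Note that `.*?` can still cross a tag boundary if an earlier `href=` on the same line points at a different file, since the match begins at the first `href=` from which the rest of the pattern can succeed.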

Paul Lalli


DM

Oct 20, 2004, 5:07:55 PM
>>>Here's the command I'm using:
>>>
>>># grep -rHInPo --color=auto 'href=.*TEA-21_Side-by-Side\.pdf[^>]*>'
>>>/home/mtc_website/
>>
>>[^>]*
>>
>>means match EVERYTHING it can. Being a negated class it cannot cross a >,
>>so here it stops at the next > in the string.
>>
>>You can make it non-greedy.
>>
>>[^>]*?
>>
>>means to match only as much as necessary to make the pattern match
>>succeed.
>
>
> Of course, this applies to the first * in your regexp as well.
>
> href=.*?
>
> Paul Lalli
>
>

OK, thanks. That seems to help.

dm

Jon Ericson

Oct 20, 2004, 6:30:59 PM
DM <elektrophyte-yahoo> writes:

> Jon Ericson wrote:
>
>> DM <elektrophyte-yahoo> writes:
>>
>>>However, in a few cases what is matched is totally unexpected.

>> If you were actually using perl, this wouldn't be too difficult with
>> the HTML::Parser module. See perldoc -q html for some discussion
>> about the pitfalls of using a regex to parse HTML.
>

> Thanks for the reply. I don't see how the HTML::Parser module would
> help me in the task I described in my original post.
>
> I checked perldoc as you recommended, but the "pitfalls" mentioned
> don't seem to apply to what I'm doing.
>
> As I explained in my original post, I'm not trying to do some kind of
> general HTML parsing operation, such as stripping out HTML tags. I'm
> trying to find this string:
>
> href="[SOME_URL_FRAGMENT].pdf">
>
> My regex almost works, but is acting really weird in a few cases. I'm
> trying to nail down the reason for that. Perhaps I have a
> misconception or misunderstanding of regex syntax?

It looks like you got some help with the regex itself. I hope this
works for you.

If you're doing something quick and dirty, and you don't mind the
occasional mistake, there's nothing wrong with the regex approach.
But little scripts sometimes become mission-critical. If that
happens, the regex might not be a good idea.

Jon
