Multiline regex not matching

Shmuel Metz

Jan 30, 2012, 12:38:48 PM
I'm having a problem doing a regex match on a multi-line string. The
actual program[1] is rather large, but I've come up with a test case
that demonstrates the problem. I'm running on OS/2, which uses CRLF as
a line end.

#y $FWS = qr/ (?:[ \t]*\15?\12)? [ \t]+ /mx;
my $FWS = qr/ (?:[ \t]*\15?\n)? [ \t]+ /mx;
my $CFWS = qr/
(?: (?:$FWS+ $commentPat)+ $FWS?) |
$FWS
/mx;
my $testheader = <<'EOF';
from localhost (localhost [127.0.0.1])
by lincoln-at-leros.patriot.net (Postfix) with ESMTP id 12BBE55E73
for <mari...@patriot.net>; Fri, 27 Jan 2012 09:23:59 -0500 (EST)
EOF
if ($testheader =~ /(\) $CFWS BY)/imx) {
print STDERR "\n\$testheader matched FWS\n";
print STDERR "\n";
print STDERR "\n\$PREMATCH =$`\n";
print STDERR "\n\$POSTMATCH=$'\n";
foreach (sort keys %+) {
print STDERR "\$+{$_}=$+{$_}\n";
}
print STDERR "\n";
} else {
print STDERR "\n\$testheader did not match FWS\n";
}

If you replace the regex with the one that's commented out, the
results are the same. I dumped $header in hex and the lines are
separated by LF ('0A'X) rather than CRLF ('0D0A'X), which is normal
for Perl. Does anybody see what I'm doing wrong? Thanks.

[1] Available offline on request; it's too big to post here.

--
Shmuel (Seymour J.) Metz, SysProg and JOAT <http://patriot.net/~shmuel>

Unsolicited bulk E-mail subject to legal action. I reserve the
right to publicly post or ridicule any abusive E-mail. Reply to
domain Patriot dot net user shmuel+news to contact me. Do not
reply to spam...@library.lspace.org

Peter J. Holzer

Jan 30, 2012, 3:49:19 PM
On 2012-01-30 17:38, Shmuel Metz <spam...@library.lspace.org.invalid> wrote:
> I'm having a problem doing a regex match on a multi-line string. The
> actual program[1] is rather large, but I've come up with a test case
> that demonstrates the problem. I'm running on OS/2, which uses CRLF as
> a line end.
>
> #y $FWS = qr/ (?:[ \t]*\15?\12)? [ \t]+ /mx;
> my $FWS = qr/ (?:[ \t]*\15?\n)? [ \t]+ /mx;
> my $CFWS = qr/
> (?: (?:$FWS+ $commentPat)+ $FWS?) |
> $FWS
> /mx;
> my $testheader = <<'EOF';
> from localhost (localhost [127.0.0.1])
> by lincoln-at-leros.patriot.net (Postfix) with ESMTP id 12BBE55E73
> for <mari...@patriot.net>; Fri, 27 Jan 2012 09:23:59 -0500 (EST)
> EOF

Since the "EOF" is indented by one space I guess that you indented the
whole script by one space before posting. If this is true, there is no
space before "by lincoln-at-leros.patriot.net ..."

> if ($testheader =~ /(\) $CFWS BY)/imx) {

so it doesn't match the regexp.

[...]
> If you replace the regex with the one that's commented out, the
> results are the same. I dumped $header in hex and the lines are
> separated by LF ('0A'X) rather than CRLF ('0D0A'X), which is normal
> for Perl. Does anybody see what I'm doing wrong? Thanks.

It does match if leading whitespace is inserted, so it seems that your
regexp is correct, but your data isn't.
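
A stripped-down way to check this, as a sketch only: it uses the $FWS pattern from the post but leaves out the $commentPat branch of $CFWS (which was not shown), and the two header strings are hypothetical shortened test data.

```perl
use strict;
use warnings;

# $FWS as posted; the comment branch of $CFWS is omitted (assumption:
# plain folding whitespace is all that is needed for this check).
my $FWS = qr/ (?:[ \t]*\15?\n)? [ \t]+ /x;

# Hypothetical test data: the same header without and with a leading
# space on the continuation line.
my %header = (
    unindented => "from localhost (localhost [127.0.0.1])\nby lincoln-at-leros.patriot.net\n",
    indented   => "from localhost (localhost [127.0.0.1])\n by lincoln-at-leros.patriot.net\n",
);

my %matched;
for my $name (sort keys %header) {
    # $FWS requires at least one space or tab after the optional newline.
    $matched{$name} = $header{$name} =~ /\) $FWS BY/ix ? 1 : 0;
    print "$name: ", ($matched{$name} ? "matched" : "did not match"), "\n";
}
```

Only the indented variant should match, because the trailing [ \t]+ in $FWS has nothing to consume when "by" starts in column one.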

hp


--
_ | Peter J. Holzer | Deprecating human carelessness and
|_|_) | Sysadmin WSR | ignorance has no successful track record.
| | | h...@hjp.at |
__/ | http://www.hjp.at/ | -- Bill Code on as...@irtf.org

Shmuel Metz

Jan 30, 2012, 8:31:42 PM
In <slrnjie0if.dk...@hrunkner.hjp.at>, on 01/30/2012
at 09:49 PM, "Peter J. Holzer" <hjp-u...@hjp.at> said:

>Since the "EOF" is indented by one space I guess that you indented
>the whole script by one space before posting.

Yes.

>If this is true, there is no
>space before "by lincoln-at-leros.patriot.net ..."

Ouch! Another case of seeing what I expect instead of what's there.

>It does match if leading whitespace is inserted, so it seems that
>your regexp is correct, but your data isn't.

Yes, I should have indented the continuation line.

I'm making progress, but still have some matches that aren't doing
what I expect.

With $testheader= from localhost (localhost [127.0.0.1])
by lincoln-at-leros.patriot.net (Postfix) with ESMTP id 12BBE55E73
for <mari...@patriot.net>; Fri, 27 Jan 2012 09:23:59 -0500 (EST):
and

my $FWS = qr/ (?:[ \t]*\15?\n)? [ \t]+ /mx;
my $CFWS = qr/
(?: (?:$FWS+ $commentPat)+ $FWS?) |
$FWS
/mx;
my $RecForPat = qr/$CFWS FOR $FWS (?: $RecPathPat |
$MailboxPat)/imx;


The test $testheader =~ /(\) $RecByPat? $RecWithPat? $RecIdPat?
$RecForPat?) /imx matches only up to 12BBE55E73 and skips the for
clause, but a match for just $testheader =~ /((?<FOR>$RecForPat))/imx
does match the for clause and sets the capture buffer.

Tad McClellan

Jan 30, 2012, 11:07:20 PM
Shmuel Metz <spam...@library.lspace.org.invalid> wrote:
> In <slrnjie0if.dk...@hrunkner.hjp.at>, on 01/30/2012
> at 09:49 PM, "Peter J. Holzer" <hjp-u...@hjp.at> said:
>
>>Since the "EOF" is indented by one space I guess that you indented
>>the whole script by one space before posting.
>
> Yes.
>
>>If this is true, there is no
>>space before "by lincoln-at-leros.patriot.net ..."
>
> Ouch! Another case of seeing what I expect instead of what's there.
>
>>It does match if leading whitespace is inserted, so it seems that
>>your regexp is correct, but your data isn't.
>
> Yes, I should have indented the continuation line.


As we have seen, it is critically important for us to see BOTH the
string to be matched against AND the pattern (regex) that is to be
matched if we hope to figure out why it is not behaving as expected.


> I'm making progress, but still have some matches that aren't doing
> what I expect.
>
> With $testheader= from localhost (localhost [127.0.0.1])
> by lincoln-at-leros.patriot.net (Postfix) with ESMTP id 12BBE55E73
> for <mari...@patriot.net>; Fri, 27 Jan 2012 09:23:59 -0500 (EST):


Nobody knows what string is in $testheader (because it is not loaded
up in Real Perl Code) so we cannot help with matching it...


You should make an unambiguous representation of your string:

$testheader = "from localhost (localhost [127.0.0.1])\n"
. "by lincoln-at-leros.patriot.net (Postfix) "
. "with ESMTP id 12BBE55E73\n"
. "for <mari...\@patriot.net>; Fri, 27 Jan 2012 "
. "09:23:59 -0500 (EST):\n";
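
One way to get such a representation out of a live variable, sketched here with Data::Dumper's Useqq setting. The string below is hypothetical data, not the original header.

```perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Useqq = 1;   # show control characters as escapes such as \n and \r
$Data::Dumper::Terse = 1;   # drop the "$VAR1 = " prefix

# Hypothetical data with one LF and one CRLF line ending.
my $testheader = "from localhost\n by example.invalid\r\n";

# The output is a double-quoted Perl string that can be pasted into a post.
my $dumped = Dumper($testheader);
print $dumped;
```

Posting the dumped form makes leading whitespace and the exact line endings visible to anyone reading the post.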

--
Tad McClellan
email: perl -le "print scalar reverse qq/moc.liamg\100cm.j.dat/"
The above message is a Usenet post.
I don't recall having given anyone permission to use it on a Web site.

John W. Krahn

Jan 31, 2012, 3:34:17 AM
Shmuel (Seymour J.) Metz wrote:
>
> I'm having a problem doing a regex match on a multi-line string. The
> actual program[1] is rather large, but I've come up with a test case
> that demonstrates the problem. I'm running on OS/2, which uses CRLF as
> a line end.
>
> #y $FWS = qr/ (?:[ \t]*\15?\12)? [ \t]+ /mx;
> my $FWS = qr/ (?:[ \t]*\15?\n)? [ \t]+ /mx;
> my $CFWS = qr/
> (?: (?:$FWS+ $commentPat)+ $FWS?) |
> $FWS
> /mx;
> my $testheader =<<'EOF';
> from localhost (localhost [127.0.0.1])
> by lincoln-at-leros.patriot.net (Postfix) with ESMTP id 12BBE55E73
> for <mari...@patriot.net>; Fri, 27 Jan 2012 09:23:59 -0500 (EST)
> EOF
> if ($testheader =~ /(\) $CFWS BY)/imx) {

The /m option in a regular expression affects the behavior of the ^ and $
anchors, but you are not using either of those anchors, so the use of /m
is superfluous.
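
A small illustration of that point, using a made-up two-line string:

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\n";   # hypothetical sample data

# Without /m, ^ anchors only at the very start of the string.
my $without_m = () = $text =~ /^\w+/g;
# With /m, ^ also anchors after each embedded newline.
my $with_m    = () = $text =~ /^\w+/mg;

print "line starts found without /m: $without_m\n";
print "line starts found with /m:    $with_m\n";

# A pattern with no ^ or $ gives the same result with or without /m;
# a literal \n in the pattern matches across lines either way.
my $plain   = $text =~ /line\nsecond/  ? 1 : 0;
my $plain_m = $text =~ /line\nsecond/m ? 1 : 0;
print "no anchors: $plain (without /m), $plain_m (with /m)\n";
```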



John
--
Any intelligent fool can make things bigger and
more complex... It takes a touch of genius -
and a lot of courage to move in the opposite
direction. -- Albert Einstein

Shmuel Metz

Jan 31, 2012, 12:49:01 PM
In <slrnjieqmt...@tadbox.sbcglobal.net>, on 01/30/2012
at 10:07 PM, Tad McClellan <ta...@seesig.invalid> said:

>Nobody knows what string is in $testheader (because it is not loaded
>up in Real Perl Code) so we cannot help with matching it...

Okay, this time I'll provide a complete program instead of snippets.
Where I'm stuck is in the discrepancy between the first and second
test of $testheader:

[1] /(\) $RecByPat? $RecWithPat? $RecIdPat? $RecForPat?) /imx

[2] /((?<FOR>$RecForPat))/imx

#!/usr/bin/perl -W

use 5.010;
use Data::Dumper;
use Regexp::Common qw /net URI/;
use Socket;
use strict;
my $decOctetPat = qr/ \d |
[1-9] \d |
1 \d \d |
2 [0-4] \d |
25 [0-5]
/x;
my $IPv4addressPat = qr/ (?:$decOctetPat\.){3} $decOctetPat /x;
my $IPv6h16 = qr/[[:xdigit:]]{1,4}/;
my $IPv6ls32 = qr/ $IPv6h16 \: $IPv6h16 | $IPv4addressPat /x;
my $IPv6AddrPat = qr/ (?: (?: $IPv6h16 \: ){6} $IPv6ls32 ) |
(?: \:\: (?: $IPv6h16 \: ){5} $IPv6ls32 ) |
(?: (?: $IPv6h16 )? \:\: (?: $IPv6h16 \: ){4} $IPv6ls32 ) |
(?: (?: $IPv6h16 \: $IPv6h16 )? \:\: (?: $IPv6h16 \: ){3} $IPv6ls32 ) |
(?: (?: (?: $IPv6h16 \: ){2} $IPv6h16 )? \:\: (?: $IPv6h16 \: ){2} $IPv6ls32 ) |
(?: (?: (?: $IPv6h16 \: ){3} $IPv6h16 )? \:\: $IPv6h16 \: $IPv6ls32 ) |
(?: (?: (?: $IPv6h16 \: ){5} $IPv6h16 )? \:\: $IPv6ls32 ) |
(?: (?: (?: $IPv6h16 \: ){6} $IPv6h16 )? \:\: )
/x;
my $domainPat = qr/[[:alnum:]]+
[[:alnum:]-]*
(?:\. [[:alnum:]]+ [[:alnum:]-]*)*
/x;
my $addressLiteralPat = qr/\[
(?:$IPv4addressPat |
$IPv6AddrPat
)
\]
/x;
my $atextPat = qr"(?:[\w!#\$%&'*+/=?^`{|}~-]+)";
#y $FWS = qr/ (?:[ \t]*\15?\12)? [ \t]+ /mx;
my $FWS = qr/ (?:[ \t]*\15?\n)? [ \t]+ /mx;
my $atomPat = qr/\s* $atextPat+ \s*/x;
my $ctext = '[\x21-\x27\x2A-\x5B\x5D-\x7E]';
my $dotStringPat = qr/$atextPat+ (?:\. $atextPat)*/x;
my $LinkPat = qr/TCP | $atomPat/xi;
my $dtextPat = '[\x21-\x5A\x5E-\x7E]'; # %d33-90 / %d94-126, excludes [ \ ]
my $idLeftPat = qr/$dotStringPat/;
my $noFoldLiteralPat = qr/\[ $dtextPat* \]/x;
my $idRightPat = qr/$dotStringPat | $noFoldLiteralPat/x;
my $msgIdPat = qr/\s* \< $idLeftPat \@ $idRightPat \> \s*/x;
my $quotedPairPat = qr/ \\ [\x20-\x7E] /x;
my $qtextPat = '[\x20-\x21\x23-\x5B\x5d-\x7E]';
my $QcontentPat = qr/$qtextPat | $quotedPairPat/x;
my $QuotedStringPat = qr/"$QcontentPat*"/;
my $commentPat =qr/
\(
(?:$FWS?
(?:$ctext | $quotedPairPat | (?R))
)*
$FWS?
\)
/x;
my $CFWS = qr/
(?: (?:$FWS+ $commentPat)+ $FWS?) |
$FWS
/mx;
my $localPartPat = qr/$dotStringPat | $QuotedStringPat/x;
my $MailboxPat = qr/(?<LOCAL_PART>$localPartPat) \@ (?<DOMAIN>$domainPat | $addressLiteralPat)/x;
my $RecPathPat = qr/
\<
(?:\@ $domainPat (?:, \@ $domainPat)* :)?
$MailboxPat
\>
/x;
my $RecByPat = qr!$CFWS
BY
$FWS
(?<BY1>
(?:$domainPat |
\[ $RE{net}{IPv4} \]
)
)
(?:
\s*
\(
(?<BY2>[\s\w\./-]+)
\)
)?
!mxi;
my $RecForPat = qr/$CFWS FOR $FWS (?: $RecPathPat | $MailboxPat)/imx;
my $RecIdPat = qr/$CFWS ID $FWS (?:$atomPat | $msgIdPat)/imx;
my $RecViaPat = qr/$CFWS VIA $FWS $LinkPat/imx;
my $RecWithMS = qr/Microsoft \s+ (?:ESMTP|SMTP) (?:\s+ Server | SVC\(\d+(?:\.\d+)*\))/ix;
my $RecWithPat = qr/$CFWS
WITH $FWS
(?:
(?:ESMTP|SMTP) |
$RecWithMS |
NNFMP # Yahoo
)
/imx;

my $RecOptInfo = qr/
(?<VIA>$RecViaPat)?
(?<WITH>$RecWithPat)?
(?:$RecIdPat)?
(?<FOR>$RecForPat)?
/imx;

my $testheader = <<'EOF';
from localhost (localhost [127.0.0.1])
by lincoln-at-leros.patriot.net (Postfix) with ESMTP id 12BBE55E73
for <mari...@patriot.net>; Fri, 27 Jan 2012 09:23:59 -0500 (EST)
EOF
msg("\n\$testheader= $testheader\n");
msg("\n\$testheader= ".unpack('H*',$testheader)."\n");
if ($testheader =~ /(\) $RecByPat? $RecWithPat? $RecIdPat? $RecForPat?) /imx) {
print STDERR "\n\$testheader matched FWS\n";
print STDERR "\n";
print STDERR "\n\$PREMATCH =$`\n";
print STDERR "\n\$POSTMATCH=$'\n";
msg("\nDumper(\%+):\n");
msg(Dumper(%+),"\n");
msg("\nDumper(\%-):\n");
msg(Dumper(%-),"\n");
foreach (sort keys %+) {
print STDERR "\$+{$_}=$+{$_}\n";
}
print STDERR "\n";
} else {
print STDERR "\n\$testheader did not match FWS\n";
}
# if ($testheader =~ /(?<WITH>$RecWithPat)/im) {
# if ($testheader =~ /$RecOptInfo?/) {
# if ($testheader =~ /($CFWS FOR $FWS (?: $RecPathPat | $MailboxPat))/imx) {
if ($testheader =~ /((?<FOR>$RecForPat))/imx) {
print STDERR "\n\$testheader matched\n";
msg("\nDumper(\%+):\n");
msg(Dumper(%+),"\n");
msg("\nDumper(\%-):\n");
msg(Dumper(%-),"\n");
foreach (sort keys %+) {
print STDERR "\$+{$_}=$+{$_}\n";
}
print STDERR "\n";
} else {
print STDERR "\n\$testheader did not match\n";
}

sub msg {
print STDERR @_;
}

1;
__END__

Ben Morrow

Jan 31, 2012, 3:40:17 PM

Quoth Shmuel (Seymour J.) Metz <spam...@library.lspace.org.invalid>:
> In <slrnjieqmt...@tadbox.sbcglobal.net>, on 01/30/2012
> at 10:07 PM, Tad McClellan <ta...@seesig.invalid> said:
>
> >Nobody knows what string is in $testheader (because it is not loaded
> >up in Real Perl Code) so we cannot help with matching it...
>
> Okay, this time I'll provide a complete program instead of snippets.
> Where I'm stuck is in the discrepancy between the first and second
> test of $testheader

The problem is here:

> my $atomPat = qr/\s* $atextPat+ \s*/x;

This needs to be something more like

my $atomPat = qr/$FWS? $atextPat+ $FWS??/x;

(that double-? is *not* a typo). Otherwise the $atomPat in

> my $RecIdPat = qr/$CFWS ID $FWS (?:$atomPat | $msgIdPat)/imx;

swallows all of the whitespace following it (running over the newline),
and the non-optional $CFWS at the start of

> my $RecForPat = qr/$CFWS FOR $FWS (?: $RecPathPat | $MailboxPat)/imx;

has nothing to match, so the whole clause is skipped.

That ?? is like ? (0-or-one), but non-greedy. This means it matches as
little text as possible rather than as much as possible. I'm not certain
what the semantics of the RFCs' ABNF are supposed to be, exactly: it's
possible that most or all of the quantifiers should be non-greedy.
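
The difference between ? and ?? on a toy string (not one of the header patterns):

```perl
use strict;
use warnings;

my $text = "aab";

# Greedy ?: the optional "a" is taken whenever it can be.
my ($greedy_first, $greedy_rest) = $text =~ /(a?)(a*)/;

# Non-greedy ??: the optional "a" is skipped unless the rest of the
# pattern cannot match without it.
my ($lazy_first, $lazy_rest) = $text =~ /(a??)(a*)/;

print "greedy: first='$greedy_first' rest='$greedy_rest'\n";
print "lazy:   first='$lazy_first' rest='$lazy_rest'\n";
```

Both patterns match the same two characters overall; only the division between the two groups changes.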

Usually it doesn't matter, of course, because there's some other
constraint that means the pattern can only match one way. In this case,
if you'd insisted on matching all available clauses by including the
semicolon, like this,

/... $RecIdPat? $RecForPat? ; /imx

the 'for' clause would have been forced to match, even though that meant
stealing some whitespace back from the 'id' clause.
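
A cut-down sketch of the effect. $FWS is as posted; everything else ($atext, the simplified 'for' clause, the for_clause helper and the header fragment with its made-up address) is a hypothetical stand-in for the real patterns.

```perl
use strict;
use warnings;

my $FWS   = qr/ (?:[ \t]*\15?\n)? [ \t]+ /x;   # as posted
my $atext = qr/[\w-]+/;                        # simplified stand-in for $atextPat+

my %atom = (
    greedy => qr/\s* $atext \s*/x,        # as in the posted program
    lazy   => qr/$FWS? $atext $FWS??/x,   # the suggested replacement
);

# Simplified 'for' clause; the real one uses $RecPathPat and $MailboxPat.
my $for = qr/$FWS FOR $FWS (?<FOR> <[^>]+> )/ix;

# Hypothetical header fragment with a made-up address.
my $header = " id 12BBE55E73\n for <user\@example.invalid>; Fri, 27 Jan 2012\n";

# Return the captured 'for' clause, or a marker if it was skipped.
sub for_clause {
    my ($string, $re) = @_;
    return ($string =~ $re && defined $+{FOR}) ? $+{FOR} : '(skipped)';
}

my %seen;
for my $name (sort keys %atom) {
    my $id = qr/$FWS ID $FWS $atom{$name}/ix;
    # Optional 'for' clause with nothing required after it.
    $seen{$name}{open} = for_clause($header, qr/$id $for?/x);
    # Requiring the semicolon forces backtracking into the greedy \s*.
    $seen{$name}{semi} = for_clause($header, qr/$id $for? ;/x);
    print "$name: $seen{$name}{open}; with ';': $seen{$name}{semi}\n";
}
```

With the greedy atom the open-ended match should report the 'for' clause as skipped, while the other three combinations capture it.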

Ben

Shmuel Metz

Feb 1, 2012, 12:40:36 AM
In <h60mv8-...@anubis.morrow.me.uk>, on 01/31/2012
at 08:40 PM, Ben Morrow <b...@morrow.me.uk> said:

>The problem is here:

>> my $atomPat = qr/\s* $atextPat+ \s*/x;

>This needs to be something more like

> my $atomPat = qr/$FWS? $atextPat+ $FWS??/x;

Consulting RFC 5321, I see

Atom = 1*atext

which means that it should have been something like

my $atomPat = qr/$atextPat+/x;

I'm not sure why I had the \s* in there. Removing it resolved the
problem. Thanks.

Ben Morrow

Feb 1, 2012, 1:03:34 PM

Quoth Shmuel (Seymour J.) Metz <spam...@library.lspace.org.invalid>:
> In <h60mv8-...@anubis.morrow.me.uk>, on 01/31/2012
> at 08:40 PM, Ben Morrow <b...@morrow.me.uk> said:
>
> >The problem is here:
>
> >> my $atomPat = qr/\s* $atextPat+ \s*/x;
>
> >This needs to be something more like
>
> > my $atomPat = qr/$FWS? $atextPat+ $FWS??/x;
>
> Consulting RFC 5321, I see
>
> Atom = 1*atext
>
> which means that it should have been something like
>
> my $atomPat = qr/$atextPat+/x;
>
> I'm not sure why I had the \s* in there. Removing it resolved the
> problem. Thanks.

Well, 5322 is more liberal, and has

atom = [CFWS] 1*atext [CFWS]

though I'm not sure why you used \s* rather than $CFWS.

Ben

Shmuel Metz

Feb 2, 2012, 5:46:56 AM
In <mcbov8-...@anubis.morrow.me.uk>, on 02/01/2012
at 06:03 PM, Ben Morrow <b...@morrow.me.uk> said:

>Well, 5322 is more liberal,

For analyzing the Received header I need to use the RFC 5321
definition, since I'm primarily concerned with the ones inserted by
SMTP[1] servers. Then I need to throw in some ad hoc code to handle
nonstandard usages encountered in the wild :-(

> atom = [CFWS] 1*atext [CFWS]

That's misleading, because RFC 5322 uses dot-atom-text in places where
RFC 5321 would use Dot-string. What gets confusing is that the ABNF in
5321 has a few references to 5322.

[1] An RFC 5322 message can pass through other types of servers,
some of which also insert a Received header field.