Regular Expression To Match Escape Sequences

Jim Monty

Mar 28, 2000, 3:00:00 AM

I need to reliably match escape sequences in arbitrary text. A
simple pattern such as, for example, C<m/\\n/g> won't work because
it matches where the substring occurs but is not an escape sequence,
as in the string "c:\\norton". In other words, a pair of backslashes
is the escape sequence that represents a literal backslash and these
escaped backslashes must be accounted for.
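To make the false match concrete, here is a minimal sketch that runs the naive substitution on that text. The variable names are illustrative only.

```perl
use strict;
use warnings;

# Single-quoted literal: '\\\\' yields two backslash characters, so $text
# holds exactly  c:\\norton  (an escaped backslash followed by "norton").
my $text = 'c:\\\\norton';

# Naive approach: replace every backslash-n pair with a newline.
(my $naive = $text) =~ s/\\n/\n/g;

# The second backslash and the "n" were wrongly treated as an escape:
# $naive is now  c:\  followed by a newline and "orton".
print "false match\n" if $naive eq "c:\\\norton";
```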

A negative lookbehind assertion won't work because "variable length
lookbehind [is] not implemented" and there can be an arbitrary
number of backslashes preceding any occurrence of an escape
metacharacter (e.g., "n").
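One possible way around the variable-length restriction, offered as a sketch rather than a definitive answer: a fixed-width lookbehind for a single backslash is allowed, and the run of backslash pairs can then be consumed and put back by the match itself. The name unescape_newlines is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: converts backslash-n to a newline only when the
# backslash is not itself escaped. Doubled backslashes are left untouched.
sub unescape_newlines {
    my ($text) = @_;
    # (?<!\\)      not preceded by a backslash (fixed width, so permitted)
    # ((?:\\\\)*)  any number of complete backslash pairs, captured as $1
    # \\n          the backslash-n escape itself
    $text =~ s/(?<!\\)((?:\\\\)*)\\n/$1\n/g;
    return $text;
}

print unescape_newlines('line one\nline two'), "\n";
print unescape_newlines('c:\\\\norton'), "\n";   # left as  c:\\norton
```

Because the lookbehind consumes nothing, adjacent escapes such as two consecutive backslash-n pairs are both converted in a single /g pass.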

The Perl Cookbook gives C<s/\\n/\n/g;> for "[t]urning \ followed
by n into a real newline", but this could match in unintended
places, right?

Any suggestions?

--
Jim Monty
mo...@primenet.com
Tempe, Arizona USA

Jeff Pinyan

Mar 28, 2000, 3:00:00 AM

[posted & mailed]

>I need to reliably match escape sequences in arbitrary text. A
>simple pattern such as, for example, C<m/\\n/g> won't work because
>it matches where the substring occurs but is not an escape sequence,
>as in the string "c:\\norton". In other words, a pair of backslashes
>is the escape sequence that represents a literal backslash and these
>escaped backslashes must be accounted for.

Right, you can't use a negative lookbehind, so you'll have to use
something like:

s/((?:[^\\]|^)(?:\\\\)*)\\n/$1\n/;

Yes, it's disgusting. But it's what I've found works.
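A caveat worth checking before relying on that pattern with /g: it consumes the character in front of the escape, so two adjacent escapes are not both converted in one pass. The sketch below shows this and one possible workaround (repeating the substitution until it stops matching); the variable names are illustrative.

```perl
use strict;
use warnings;

# The pattern from above, compiled once so both substitutions share it.
my $pattern = qr/((?:[^\\]|^)(?:\\\\)*)\\n/;

# Two adjacent escapes: backslash, n, backslash, n.
my $text = '\n\n';

# A single /g pass converts only the first one: after it, the second
# escape has no unconsumed character in front of it for [^\\] to match.
(my $one_pass = $text) =~ s/$pattern/$1\n/g;

# Repeating the substitution until nothing changes converts both.
my $looped = $text;
1 while $looped =~ s/$pattern/$1\n/;

print "one pass left an escape behind\n" if $one_pass eq "\n\\n";
print "loop converted both\n"            if $looped   eq "\n\n";
```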

--
MIDN 4/C PINYAN, NROTCURPI, US Naval Reserve ja...@pobox.com
http://www.pobox.com/~japhy/ http://pinyaj.stu.rpi.edu/
PerlMonth - An Online Perl Magazine http://www.perlmonth.com/
The Perl Archive - Articles, Forums, etc. http://www.perlarchive.com/

Ilmari Karonen

Mar 28, 2000, 3:00:00 AM

In article <Pine.GSO.4.21.000327...@crusoe.crusoe.net>, Jeff Pinyan wrote:
>>simple pattern such as, for example, C<m/\\n/g> won't work because
>>it matches where the substring occurs but is not an escape sequence,
>>as in the string "c:\\norton". In other words, a pair of backslashes
>>is the escape sequence that represents a literal backslash and these
>>escaped backslashes must be accounted for.
>
> s/((?:[^\\]|^)(?:\\\\)*)\\n/$1\n/;
>
>Yes, it's disgusting. But it's what I've found works.

Since you're also collapsing double backslashes, I'd recommend doing
all the substitutions in one pass:

s/\\(.)/control_char($1)/eg;

...or, if you also want to catch hex or octal character codes:

s/\\(c.|x[0-9A-Fa-f]{1,2}|0[0-7]{0,3}|.)/control_char($1)/eg;

Writing the subroutine is left as an exercise - I'd recommend using a
hash for the single-character cases. If you wanted, you could extend
this to handle special escapes like \U and \L, and maybe even do your
own variable interpolation.
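For what it's worth, here is one way the subroutine might look. This is only a sketch: the %simple table, the set of escapes it covers, and the unescape wrapper are my own assumptions, and unknown escapes simply yield the escaped character itself.

```perl
use strict;
use warnings;

# Single-character escapes handled by a plain hash lookup (assumed set).
my %simple = (
    n => "\n", t => "\t", r => "\r", f => "\f",
    a => "\a", e => "\e", b => "\b",
);

# Receives whatever followed the backslash and returns its replacement.
sub control_char {
    my ($code) = @_;
    return $simple{$code}          if exists $simple{$code};
    return chr(hex $1)             if $code =~ /^x([0-9A-Fa-f]{1,2})$/;
    return chr(oct $code)          if $code =~ /^0[0-7]{0,3}$/;
    return chr(ord(uc $1) ^ 64)    if $code =~ /^c(.)$/;
    return $code;    # anything else, including "\\", stands for itself
}

# Hypothetical wrapper around the one-pass substitution.
sub unescape {
    my ($text) = @_;
    $text =~ s/\\(c.|x[0-9A-Fa-f]{1,2}|0[0-7]{0,3}|.)/control_char($1)/eg;
    return $text;
}

print unescape('c:\\\\norton'), "\n";    # prints c:\norton
```

Since every backslash is consumed together with whatever follows it, an escaped backslash can never be mistaken for the start of another escape.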

--
Ilmari Karonen - http://www.sci.fi/~iltzu/
Note: Please ignore the pseudonymous troll in this newsgroup.

Michael Carman

Mar 28, 2000, 3:00:00 AM

Jim Monty wrote:
>
> I need to reliably match escape sequences in arbitrary text. A
> simple pattern such as, for example, C<m/\\n/g> won't work because
> it matches where the substring occurs but is not an escape sequence,
> as in the string "c:\\norton".

One thing to be aware of: Perl interpolates double-quoted strings, but
not single-quoted ones. So "\n" is a newline, while '\n' is just a
backslash followed by the letter 'n'. You wouldn't need the extra \ in
your example if you wrote it as 'c:\norton'.

What this means is that your solution will depend on your
implementation. For single-quoted strings, a negated character class
in front of the escape may help:

/[^\\]\\n/

But this could still have problems with stuff like '\\\\\n', where the
backslash in front of the escape is itself escaped.

For double-quoted strings, /\n/ is sufficient, because Perl works out
what is and isn't an escape when it parses the string literal. So
("c:\\norton" =~ /\n/) will evaluate to false.
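A quick sketch to check the quoting behaviour described above (the variable names are just for illustration):

```perl
use strict;
use warnings;

my $dq = "c:\\norton";   # double-quoted: \\ collapses to one backslash
my $sq = 'c:\norton';    # single-quoted: \n stays a backslash plus "n"

my $same        = $dq eq $sq;     # true: both hold c:\norton
my $has_newline = $dq =~ /\n/;    # false: no newline character in there
my $has_bs_n    = $dq =~ /\\n/;   # true: a literal backslash followed by n

printf "same=%d newline=%d backslash-n=%d\n",
    $same ? 1 : 0, $has_newline ? 1 : 0, $has_bs_n ? 1 : 0;
```

Note that this only covers string literals in Perl source. Text read from a file or from user input is never interpolated, so escapes in such data still have to be handled explicitly.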

-mjc

Bart Lateur

Mar 29, 2000, 3:00:00 AM

Jim Monty wrote:

>The Perl Cookbook gives C<s/\\n/\n/g;> for "[t]urning \ followed
>by n into a real newline", but this could match in unintended
>places, right?

It's unsolvable as you write it. That is why you normally have to escape
backslashes as well, in some way.

Here's a version as I like it:

%replace = ( n => "\n", t => "\t");
s/\\(.)/$replace{$1} || $1/ge;

This will remove all escaping backslashes: the sequence is replaced by a
newline or a tab if it is '\n' or '\t' respectively, and otherwise by the
escaped character itself. It is pretty much like backslash escaping in
double-quotish context in Perl, but for all characters.
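A small sketch of this first version in action, with a made-up sample string:

```perl
use strict;
use warnings;

my %replace = ( n => "\n", t => "\t" );

# Sample text (hypothetical): a, \t, b, an escaped backslash, n, \q
my $text = 'a\tb\\\\n\q';

(my $out = $text) =~ s/\\(.)/$replace{$1} || $1/ge;

# \t became a tab, the escaped backslash became one backslash (so the
# "n" after it is left alone), and \q became a plain "q".
print "as expected\n" if $out eq "a\tb\\nq";
```

The || fallback works here because no replacement value is false. A table containing "0" or an empty string would need an exists test instead.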

Alternatively:

%replace = ( n => "\n", t => "\t", "\\" => "\\");
s/\\([nt\\])/$replace{$1}/g;

is more single-quotish: a sequence is replaced only if its second
character is in the specified character class (each of whose members
must also be a key in the substitution hash). Any other backslash is
left alone.
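A small sketch of this second version on a made-up sample string, to show which sequences it leaves untouched:

```perl
use strict;
use warnings;

my %replace = ( n => "\n", t => "\t", "\\" => "\\" );

# Sample text (hypothetical): a, \t, b, an escaped backslash, n, \q
my $text = 'a\tb\\\\n\q';

(my $out = $text) =~ s/\\([nt\\])/$replace{$1}/g;

# \t became a tab and the escaped backslash became one backslash, but
# \q is not in the character class, so it keeps its backslash.
print "as expected\n" if $out eq "a\tb\\n\\q";
```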

--
Bart.