Matching upper ASCII characters in RE patterns

Jonathan Pool

Nov 29, 2010, 7:11:20 PM

to perl-u...@perl.org, Susan Colowick

Let's say the character NO-BREAK SPACE (U+00A0) appears in a UTF8-encoded text file (so it appears there as C2A0), and I want to match strings that contain this character.
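
For illustration, a minimal sketch of the relationship between the character U+00A0 and the bytes C2 A0, using the core Encode module. The variable names are illustrative and not taken from the script under discussion.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# U+00A0 (NO-BREAK SPACE) as a single character.
my $nbsp = "\x{00A0}";

# Encoding the character to UTF-8 yields the two bytes C2 A0.
my $bytes = encode('UTF-8', $nbsp);
my $hex   = uc(unpack('H*', $bytes));

# Decoding those two bytes gives back one character with code point 0xA0.
my $chars = decode('UTF-8', "\xC2\xA0");

printf "bytes: %s (%d bytes), chars: %d, code point: U+%04X\n",
    $hex, length($bytes), length($chars), ord($chars);
```

A regular expression sees whichever of the two forms the string holds, so whether the data has been decoded matters when a pattern such as /\xa0/ is applied.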

I write a script (itself encoded with UTF8) in Perl 5.10.0 (on OS X 10.6.5) with:

use encoding 'utf8';
use charnames ':full:';

The script opens the file with:

open FH, '<:utf8', filename.txt;

It reads lines in with:

while <FH> {}

Then, in a regular expression in the script, I can match the NO-BREAK SPACE with any of these patterns:

1. /\N{NO-BREAK SPACE}/

2. / / (where the character between slashes looks like a space but is a no-break space)

3. /[\x7f-\x80]/

Patterns 1 and 2 make sense, but pattern 3 is mysterious to me, because the range specified in pattern 3 includes DELETE and an unnamed character but does not include NO-BREAK SPACE.

Moreover, I expect to be able to match the NO-BREAK SPACE with these patterns, but I cannot:

4. /[\xa0]/

5. /\xa0/

In the related documentation, I have not found anything explaining why pattern 3 works, or anything explaining why patterns 4 and 5 do not work.

I have replicated these anomalies in Perl 5.8.8 under Red Hat Enterprise Linux 5.

I would be delighted to receive explanations or references to documentation that I have overlooked or misunderstood.

karl williamson

Nov 30, 2010, 1:25:39 PM

to Jonathan Pool, perl-u...@perl.org, Susan Colowick

Jonathan Pool wrote:
> Let's say the character NO-BREAK SPACE (U+00A0) appears in a UTF8-encoded text file (so it appears there as C2A0), and I want to match strings that contain this character.
>
> I write a script (itself encoded with UTF8) in Perl 5.10.0 (on OS X 10.6.5) with:
>
> use encoding 'utf8';
> use charnames ':full:';
>
> The script opens the file with:
>
> open FH, '<:utf8', filename.txt;

You should always use '<:encoding(utf8)' instead to get utf8 validation.
But that's not the problem here.
I tested it on the very latest development code, and it still fails.
The problem is a bug or bugs in Perl with parsing files encoded in utf8.
I converted the .pl to latin1 and removed the "use encoding 'utf8'",
and it works.
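
A minimal sketch of reading a file through the '<:encoding(utf8)' layer suggested above. The temporary file and its contents are assumptions made so the example is self-contained; they are not the attached test files.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a small test file containing NO-BREAK SPACE as the UTF-8 bytes C2 A0.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "foo\xC2\xA0bar\n", "plain line\n";
close $tmp or die "close failed: $!";

# Lexical filehandle, quoted filename, and a decoding layer that validates
# the input while converting it to characters.
open my $fh, '<:encoding(utf8)', $path or die "Cannot open $path: $!";

# Collect the line numbers of lines that contain U+00A0.
my @hits;
while (my $line = <$fh>) {
    push @hits, $. if $line =~ /\x{00A0}/;
}
close $fh;

print "lines containing NO-BREAK SPACE: @hits\n";
```

This script is pure ASCII and uses neither pragma, so it exercises only the input layer and does not reproduce the problem reported in this thread.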

I believe it is known that there are issues with 'use encoding', but I
suggest filing a bug report, by sending email to per...@perl.org.
Attached are two files I created to test. These should be attached to
the bug report so as to not have to be done again.

nobreak_latin1.pl

nobreak_utf8.pl

nobreak_utf8.txt

karl williamson

Nov 30, 2010, 2:10:16 PM

to Jonathan Pool, perl-u...@perl.org, Susan Colowick

karl williamson wrote:
> Jonathan Pool wrote:
>> Let's say the character NO-BREAK SPACE (U+00A0) appears in a
>> UTF8-encoded text file (so it appears there as C2A0), and I want to
>> match strings that contain this character.
>>
>> I write a script (itself encoded with UTF8) in Perl 5.10.0 (on OS X
>> 10.6.5) with:
>>
>> use encoding 'utf8';
>> use charnames ':full:';
>>
>> The script opens the file with:
>>
>> open FH, '<:utf8', filename.txt;
>
> You should always use '<:encoding(utf8)' instead to get utf8 validation.
> But that's not the problem here.
> I tested it on the very latest development code, and it still fails. The
> problem is a bug or bugs in Perl with parsing files encoded in utf8. I
> converted the .pl to latin1 and removed the "use encoding 'utf8'", and
> it works.
>
> I believe it is known that there are issues with 'use encoding', but I
> suggest filing a bug report, by sending email to per...@perl.org.
> Attached are two files I created to test. These should be attached to
> the bug report so as to not have to be done again.

I thought about it some more, and replaced the "use encoding 'utf8'"
with just "use utf8", and it also works there.
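
A minimal sketch of what "use utf8" does to literals in the source, assuming the script file itself is saved as UTF-8. A visible character (U+00E4) stands in for the no-break space so the literal can be seen; the variable names are illustrative.

```perl
use strict;
use warnings;
use utf8;   # declares that the source text of this script is UTF-8

# Under "use utf8", a non-ASCII literal is read as one character rather
# than as the bytes of its UTF-8 encoding.
my $literal = "ä";
my $length  = length $literal;

# The same character can then be matched by its code point.
my $matches = $literal =~ /\x{e4}/ ? 1 : 0;

printf "length: %d, matches /\\x{e4}/: %s\n",
    $length, $matches ? 'yes' : 'no';
```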

Jonathan Pool

Nov 30, 2010, 2:22:12 PM

to karl williamson, perl-u...@perl.org, Susan Colowick, Neil Shadrach

Thanks very much for your further information about this issue.

I'll be happy to file a bug report, but I should also mention that the problematic behavior not only exists with "use encoding 'utf8'" and "use utf8", but differs between them. Both produce wrong results, but different wrong results:

With “use encoding 'utf8'”:
The NBS is matched by /\N{NO-BREAK SPACE}/
The NBS is matched by / / (a no-break space)
The NBS is matched by /[\x7f-\x80]/
The NBS is NOT matched by /[\xa0]/
The NBS is NOT matched by /\xa0/
The NBS is NOT matched by /\N{U+00a0}/
The NBS is NOT matched by /junk/

With “use utf8”:
The NBS is matched by /\N{NO-BREAK SPACE}/
The NBS is matched by / / (a no-break space)
The NBS is matched by /[\x7f-\x80]/
The NBS is matched by /[\xa0]/
The NBS is matched by /\xa0/
The NBS is matched by /\N{U+00a0}/
The NBS is NOT matched by /junk/

With neither “use encoding 'utf8'” nor “use utf8”:
The NBS is matched by /\N{NO-BREAK SPACE}/
The NBS is NOT matched by / / (a no-break space)
The NBS is matched by /[\x7f-\x80]/
The NBS is matched by /[\xa0]/
The NBS is matched by /\xa0/
The NBS is matched by /\N{U+00a0}/
The NBS is NOT matched by /junk/

(The 3rd and 7th patterns, out of 7, should fail.)

(If I include both statements, the behavior is the same as if "use encoding 'utf8'" alone is present. This testing is with "<:encoding(utf8)".)
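
A minimal sketch of a harness that runs the seven patterns against a string holding one NO-BREAK SPACE and prints results in the format used above. It is not the script from the bug report: the subject string is built from the code point instead of being read from a file, and the literal-NBSP pattern is built the same way so the script stays pure ASCII. With neither pragma in effect, it shows the expected baseline rather than reproducing the problem.

```perl
use strict;
use warnings;
use charnames ':full';

# Subject string: one NO-BREAK SPACE between two ASCII letters.
my $text = "a\x{00A0}b";

# The literal-NBSP pattern is interpolated from the code point.
my $nbsp = "\x{00A0}";

# Label and compiled pattern for each of the seven tests.
my @tests = (
    [ '\N{NO-BREAK SPACE}', qr/\N{NO-BREAK SPACE}/ ],
    [ 'literal NBSP',       qr/$nbsp/              ],
    [ '[\x7f-\x80]',        qr/[\x7f-\x80]/        ],
    [ '[\xa0]',             qr/[\xa0]/             ],
    [ '\xa0',               qr/\xa0/               ],
    [ '\N{U+00a0}',         qr/\N{U+00a0}/         ],
    [ 'junk',               qr/junk/               ],
);

# Record 1 or 0 per pattern and print one report line for each.
my @results;
for my $t (@tests) {
    my ($label, $re) = @$t;
    my $matched = $text =~ $re ? 1 : 0;
    push @results, $matched;
    printf "The NBS is %s by /%s/\n",
        $matched ? 'matched' : 'NOT matched', $label;
}
```

Adding "use utf8" or "use encoding 'utf8'" at the top and re-saving the file as UTF-8 would be one way to compare the three configurations listed above.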

So, I'm confused as to whether this is 1 bug or more than 1, and how best to document it (or them). Could you advise me on this?

> <nobreak_latin1.pl><nobreak_utf8.pl>


karl williamson

Nov 30, 2010, 4:18:56 PM

to Jonathan Pool, perl-u...@perl.org, Susan Colowick, Neil Shadrach

Jonathan Pool wrote:
> Thanks very much for your further information about this issue.
>
> I'll be happy to file a bug report, but I should also mention that the problematic behavior not only exists with "use encoding 'utf8'" and "use utf8", but differs between them. Both produce wrong results, but different wrong results:
>

Just one bug report will be fine. I don't have a Perl 5.10 lying
around to test on, but I can say that the files I sent you did what I
said on 5.13.7. I think that the one that was supposedly in latin1
could have gotten converted to utf8 in the email process. There have
been many significant bug fixes in Perl since 5.10.0.

Jonathan Pool

Nov 30, 2010, 4:27:06 PM

to karl williamson, perl-u...@perl.org, Susan Colowick, Neil Shadrach

> Just one bug report will be fine.

OK.

> I think that the one that was supposedly in latin1 could have gotten converted to utf8 in the email process.

It arrived Latin1-encoded.
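
A minimal sketch of one way to check how a file's bytes are encoded: attempt a strict UTF-8 decode and see whether it fails. The helper name is_valid_utf8 is hypothetical, and the sample byte strings are assumptions standing in for the contents of the attachments.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns 1 if the byte string is well-formed UTF-8,
# 0 otherwise. FB_CROAK makes decode() die on malformed input, and
# LEAVE_SRC keeps decode() from modifying the caller's string.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $ok = eval {
        Encode::decode('UTF-8', $bytes,
            Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $ok ? 1 : 0;
}

my $latin1_nbsp = "a\xA0b";      # NO-BREAK SPACE as the single Latin-1 byte A0
my $utf8_nbsp   = "a\xC2\xA0b";  # the same text encoded as UTF-8

printf "Latin-1 sample is valid UTF-8: %s\n",
    is_valid_utf8($latin1_nbsp) ? 'yes' : 'no';
printf "UTF-8 sample is valid UTF-8:   %s\n",
    is_valid_utf8($utf8_nbsp) ? 'yes' : 'no';
```

A failed decode rules UTF-8 out; a successful one does not prove the file was meant as UTF-8, since any byte sequence is also valid Latin-1.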

Michael Ludwig

Nov 30, 2010, 5:28:58 PM

to perl-u...@perl.org

karl williamson schrieb am 30.11.2010 um 11:25 (-0700):
> Jonathan Pool wrote:
> >
> >use encoding 'utf8';

> I believe it is known that there are issues with 'use encoding', but
> I suggest filing a bug report, by sending email to per...@perl.org.

I was advised not to use the encoding pragma, and I'm passing that
advice on to you.

Re: use encoding 'utf8' and \x{00e4} notation
http://code.activestate.com/lists/perl-unicode/3157/

It is broken by design as "It does at least two things at once that do
not belong together".

Why do my Perl tests fail with `use encoding 'utf8'`?
http://stackoverflow.com/questions/492838/

--
Michael Ludwig

Darren Duncan

Nov 30, 2010, 5:23:52 PM

to perl-u...@perl.org

When encoding issues matter, try putting the files in a tarball or zip archive,
which presumably should treat them as binary files and preserve the encodings
over transport. -- Darren Duncan

Jonathan Pool

Nov 30, 2010, 5:48:05 PM

to Michael Ludwig, perl-u...@perl.org

> I was advised not to use the encoding pragma, and I'm passing that
> advice on to you.

Thank you for the additional references.

As documented in http://rt.perl.org/rt3/Public/Bug/Display.html?id=80030 there seems to be a problem when "use encoding 'utf8'" is removed and replaced with "use utf8", so the problem is not limited to the "encoding" pragma.

Aristotle Pagaltzis

Dec 20, 2010, 6:44:06 AM

to perl-u...@perl.org

* Jonathan Pool <po...@utilika.org> [2010-11-30 23:50]:

> As documented in
> http://rt.perl.org/rt3/Public/Bug/Display.html?id=80030 there
> seems to be a problem when "use encoding 'utf8'" is removed and
> replaced with "use utf8", so the problem is not limited to the
> "encoding" pragma.

However, you can expect the `utf8` pragma to be fixed – though
that won’t help you right now. The `encoding` pragma OTOH is
irretrievably broken.

(There is also consensus that source files in arbitrary encodings
are not a sane idea anyway; if you need more than ASCII, your
code should be in UTF-8 and you should `use utf8`. So no, there is
no replacement for that aspect of the `encoding` pragma coming
down the pipe either, now or ever.)

Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>