Keeping byte-wise processing as an option

Martin Duerst

Jan 2, 2004, 4:09:20 PM

to perl-u...@perl.org, Jungshik Shin

Dear Perl Unicode experts,

http://www.perldoc.com/perl5.8.0/pod/perlunicode.html says:

"In future, Perl-level operations will be expected to work with characters
rather than bytes."

I very much appreciate all your hard work on the internationalization of Perl.
However, recently I have been working on some things that make me think
that the above statement, if taken literally, may be going somewhat too far.

It is in some cases very useful to use Perl for simple byte-oriented
processing. Some examples that I have are:

1) charlint (see http://www.w3.org/International/charlint/). Among other
things, this checks for various 'not-quite-UTF-8' cases such as overlong
encodings. Although both input and output are UTF-8, the program works on
these byte-by-byte.

2) some simple input checking code such as the example at
http://www.w3.org/International/questions/qa-forms-utf-8.html

3) The following simple script (due to Jonathan Coxhead) that
removes a BOM at the start of a UTF-8 file:
#!/usr/bin/perl -pi~ -0777
# program to remove a leading UTF-8 BOM from a file
# works both STDIN -> STDOUT and on the spot (with filename as argument)
s/^\xEF\xBB\xBF//s;

All these were written assuming a simple bytes-in-bytes-out model.
At least the latter fails with Perl 5.8.1 when the PERL_UNICODE
environment variable is defined. Jungshik has also reported that
it fails with Perl 5.8.0 with a UTF-8 locale. I have not been
able to confirm this. Similar things will probably apply to the
first two examples, in which case I would need to patch them soon.
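
As an illustration of the bytes-in-bytes-out style these programs rely on, here is a minimal sketch of an overlong-encoding check in the spirit of charlint. It is not charlint's actual code. The function name has_overlong_utf8 is hypothetical, and the sketch assumes its argument is a plain byte string (for example, read from a handle after binmode), not a string that perl has already decoded into characters.

```perl
use strict;

# Returns 1 if the byte string contains an overlong UTF-8 sequence,
# 0 otherwise. Only bytes are inspected; nothing is decoded.
sub has_overlong_utf8 {
    my ($bytes) = @_;
    return $bytes =~ m{
          [\xC0\xC1]         # 2-byte lead that could only encode U+0000..U+007F
        | \xE0[\x80-\x9F]    # 3-byte form of a code point below U+0800
        | \xF0[\x80-\x8F]    # 4-byte form of a code point below U+10000
    }x ? 1 : 0;
}
```

If the argument carried the UTF-8 flag, \xC0 would match the character U+00C0 instead of the byte 0xC0, which is exactly the kind of surprise described above.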

What I'm looking for is a very simple way to write perl programs
that work on byte streams. This should be possible without depending
on versions, working both on very old versions as well as future
versions.

Many thanks in advance for your help. Regards, Martin.

Jarkko Hietaniemi

Jan 2, 2004, 5:31:18 PM

to Martin Duerst, perl-u...@perl.org, Jungshik Shin

> "In future, Perl-level operations will be expected to work with
> characters rather than bytes."
>
> I very much appreciate all your hard work on the internationalization
> of Perl.
> However, recently I have been working on some things that let me think
> that the above statement, if taken directly, may be going somewhat too
> far.

I don't think there is any fear of Perl ever going that far.
There is just too much legacy code that would go bang.

> All these were written assuming a simple bytes-in-bytes-out model.
> At least the later fails with Perl 5.8.1 when the PERL_UNICODE
> environment variable is defined.

If you have set PERL_UNICODE you have explicitly requested that your
legacy code should go bang.

> Jungshik has also reported that
> it fails with Perl 5.8.0 with an UTF-8 locale.

Perl 5.8.0 was very broken with UTF-8 locales since it "auto-PERL_UNICODEd".
We saw (keep seeing) a lot of that since RedHat 8 and 9 had the unfortunate
combination of both Perl 5.8.0 _and_ UTF-8 locales (which the users didn't
expect/know about/care about). Lots of code that expected to produce e.g.
0xff started to produce 0xc3 0xbf. Bang!
Use rather 5.8.1 or later.
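
The 0xff to 0xc3 0xbf change can be reproduced without touching the locale. The sketch below assumes Perl 5.8 or later (for the Encode module) and uses a hypothetical helper, hex_bytes, to display the bytes. The explicit encode call stands in for what an implicitly applied UTF-8 output layer would do to the same character.

```perl
use strict;
use Encode qw(encode);

# Render a byte string as space-separated hex, e.g. "c3 bf".
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord } split //, $bytes;
}

my $char = chr(0xFF);                 # one character, code point 0xFF
my $utf8 = encode('UTF-8', $char);    # the same character, UTF-8 encoded

print hex_bytes($char), "\n";         # ff
print hex_bytes($utf8), "\n";         # c3 bf
```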

> What I'm looking for is a very simple way to write perl programs
> that work on byte streams. This should be possible without depending
> on versions, working both on very old versions as well as future
> versions.

Off-hand I can say that getting both 5.6 and 5.8 work at the same time
may be impossible in spots simply because 5.6 was badly unfinished as
regards to Unicode. No, it won't get fixed. Beyond 5.8, I don't.
Some people may have some tricks they use to get Unicode code working both
in 5.6 and 5.8, but _in_principle_ the bytes pragma should tell Perl in
both 5.6 and 5.8 that "I want bytes, darn it."
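
A small sketch of what the bytes pragma changes, assuming Perl 5.6 or later. The helper names char_length and byte_length are made up for illustration; the same string is measured once with character semantics and once inside the lexical scope of "use bytes".

```perl
use strict;

# Character semantics: length counts characters.
sub char_length { return length $_[0] }

# Byte semantics: inside the lexical scope of "use bytes", length counts
# the bytes of perl's internal representation of the string.
sub byte_length {
    use bytes;
    return length $_[0];
}

my $smiley = chr(0x263A);             # one character above 0xFF
print char_length($smiley), "\n";     # 1
print byte_length($smiley), "\n";     # 3
```

The pragma is lexically scoped, so it only affects code compiled inside the block (or file) where it appears; strings handed to other modules are still treated according to their own flag.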

> --
Jarkko Hietaniemi <j...@iki.fi> http://www.iki.fi/jhi/
"There is this special biologist word we use for 'stable'. It is 'dead'."
-- Jack Cohen

Martin Duerst

Jan 2, 2004, 6:17:13 PM

to Jarkko Hietaniemi, perl-u...@perl.org, Jungshik Shin

Hello Jarkko,

Many thanks for your very quick answer.

At 00:31 04/01/03 +0200, Jarkko Hietaniemi wrote:
>>"In future, Perl-level operations will be expected to work with
>>characters rather than bytes."
>>
>>I very much appreciate all your hard work on the internationalization of
>>Perl.
>>However, recently I have been working on some things that let me think
>>that the above statement, if taken directly, may be going somewhat too far.
>
>I don't think there is any fear of Perl ever going that far.
>There is just too much legacy code that would go bang.

Very good. I didn't really assume there was, but I'd suggest tweaking
that sentence above a bit to make this clear.

>>Jungshik has also reported that
>>it fails with Perl 5.8.0 with an UTF-8 locale.
>
>Perl 5.8.0 was very broken with UTF-8 locales since it "auto-PERL_UNICODEd".
>We saw (keep seeing) a lot of that since RedHat 8 and 9 had the unfortunate
>combination of both Perl 5.8.0 _and_ UTF-8 locales (which the users didn't
>expect/know about/care about). Lots of code that expected to produce e.g.
>0xff started to produce 0xc3 0xbf. Bang!
>Use rather 5.8.1 or later.

If it were just me, that would be easy. But stating on an FAQ
page 'use Perl 5.8.1 or later' for something that worked
probably even in Perl 4 doesn't look like a good idea.

>>What I'm looking for is a very simple way to write perl programs
>>that work on byte streams. This should be possible without depending
>>on versions, working both on very old versions as well as future
>>versions.
>
>Off-hand I can say that getting both 5.6 and 5.8 work at the same time
>may be impossible in spots simply because 5.6 was badly unfinished as
>regards to Unicode. No, it won't get fixed. Beyond 5.8, I don't.

Sorry, I think you missed something in the last sentence. Did you
want to say "I don't know?".

>Some people may have some tricks they use to get Unicode code working both
>in 5.6 and 5.8, but _in_principle_ the bytes pragma should tell Perl in
>both 5.6 and 5.8 that "I want bytes, darn it."

Yes, that seems to do the job. But is this available in 5.0 or earlier?
Or is it possible to write some little code at the start that says
something like:

if (eval "use bytes;") { use bytes; }

(without restricting the effect of the pragma to the { ... } block)?

Regards, Martin.

Andreas J Koenig

Jan 3, 2004, 12:56:45 AM

to Martin Duerst, Jarkko Hietaniemi, perl-u...@perl.org, Jungshik Shin, mser...@cpan.org

>>>>> On Fri, 02 Jan 2004 18:17:13 -0500, Martin Duerst <due...@w3.org> said:

>>> Jungshik has also reported that
>>> it fails with Perl 5.8.0 with an UTF-8 locale.
>>
>> Perl 5.8.0 was very broken with UTF-8 locales since it "auto-PERL_UNICODEd".
>> We saw (keep seeing) a lot of that since RedHat 8 and 9 had the unfortunate
>> combination of both Perl 5.8.0 _and_ UTF-8 locales (which the users didn't
>> expect/know about/care about). Lots of code that expected to produce e.g.
>> 0xff started to produce 0xc3 0xbf. Bang!
>> Use rather 5.8.1 or later.

> If it were just me, that would be easy. But stating on an FAQ
> page 'use Perl 5.8.1 or later' for something that worked
> probably even in Perl 4 doesn't look like a good idea.

I seem to remember I heard Matt Sergeant (CC'd; Hi Matt, sorry if I
misremember) say that he has a large codebase that works with perl
5.00503, 5.6.x and 5.8.x. I don't think that the tricks you need to
program around the Unicode cliffs through perl versions are collected
in a document.

I can say for sure that I have managed to have the PAUSE code
(ftp://pause.perl.org/pub/PAUSE/PAUSE-code/) run under both 5.6.1 and
5.8.x.

The typical idiom I used was:

if ($] > 5.007) {
    require Encode;
    # let Encode do some tweaking
}

The tricks that I used have found their way into
perlunicode.pod/"Porting code from perl-5.6.X".

I suppose your one-liner would work with (untested)

#!/usr/bin/perl -pi~ -0777
# program to remove a leading UTF-8 BOM from a file
# works both STDIN -> STDOUT and on the spot (with filename as argument)

if ($] > 5.007) {
    require Encode;
    Encode::_utf8_off($_);
}
s/^\xEF\xBB\xBF//s;
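
To see why the extra step matters, the same idea can be sketched as a function. The name strip_utf8_bom is hypothetical. When perl has already decoded the input, the BOM is the single character U+FEFF and the three-byte pattern cannot match it; Encode::_utf8_off (documented by Encode as an internal function) clears the flag so the string is seen as its underlying bytes again.

```perl
use strict;

# Strip a leading UTF-8 BOM, treating the data as bytes even if perl
# has already decoded it into characters (UTF-8 flag on).
sub strip_utf8_bom {
    my ($data) = @_;
    if ($] > 5.007) {
        require Encode;
        Encode::_utf8_off($data);   # view the internal buffer as plain bytes
    }
    $data =~ s/^\xEF\xBB\xBF//s;
    return $data;
}

print strip_utf8_bom("\xEF\xBB\xBFabc"), "\n";   # abc (raw bytes in)
print strip_utf8_bom("\x{FEFF}abc"), "\n";       # abc (decoded string in)
```

Note that the result is always a byte string; any non-ASCII text after the BOM comes back as UTF-8 bytes, not as characters.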

>>> What I'm looking for is a very simple way to write perl programs
>>> that work on byte streams. This should be possible without depending
>>> on versions, working both on very old versions as well as future
>>> versions.
>>
>> Off-hand I can say that getting both 5.6 and 5.8 work at the same time
>> may be impossible in spots simply because 5.6 was badly unfinished as
>> regards to Unicode. No, it won't get fixed. Beyond 5.8, I don't.

> Sorry, I think you missed something in the last sentence. Did you
> want to say "I don't know?".

>> Some people may have some tricks they use to get Unicode code working both
>> in 5.6 and 5.8, but _in_principle_ the bytes pragma should tell Perl in
>> both 5.6 and 5.8 that "I want bytes, darn it."

> Yes, that seems to do the job. But is this available in 5.0 or earlier?
> Or is it possible to write some little code at the start that says
> something like:

> if (eval "use bytes;") { use bytes; }

That would be

use if $] >= 5.006, "bytes";

But you would have to make sure that if.pm is available, which is not an
option IMO.

> (without making the actual invocation restricted to the { ... } ?

--
andreas

Daisuke Maki

Jan 3, 2004, 1:07:19 AM

to Andreas J Koenig, Martin Duerst, Jarkko Hietaniemi, perl-u...@perl.org, Jungshik Shin, mser...@cpan.org

> > if (eval "use bytes;") { use bytes; }
>
> That would be
>
> use if $] >= 5.006, "bytes";
>
> But you would have to make sure that if.pm is available, no option IMO.

I think the way this was handled in AxKit by the Matt/axkit-dev folks was
to put this line

$INC{ "bytes.pm" }++ if $] < 5.006;

before any mention of use bytes, which I remember thinking was very cute
since it doesn't require any external modules and you can write all of
your code to have "use bytes" without having to worry about Perl versions.
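
A minimal sketch of how this trick might be wired up; it is not taken from the AxKit source. One assumption worth stating: because "use" is processed at compile time, the %INC assignment is wrapped in a BEGIN block here so that it runs before perl reaches the first "use bytes". The helper first_byte is hypothetical.

```perl
# Pretend bytes.pm is already loaded on perls that do not ship it.
# BEGIN makes this run during compilation, ahead of any "use bytes" below.
BEGIN { $INC{'bytes.pm'}++ if $] < 5.006; }

# Return the numeric value of the first byte of a string.
sub first_byte {
    use bytes;            # real pragma on 5.6 and later, silent no-op before
    return ord $_[0];
}

print first_byte("\xEF\xBB\xBF"), "\n";   # 239
```

On a pre-5.6 perl the "use bytes" line finds the %INC entry, skips loading the file, and the missing import method is ignored, so the same source compiles everywhere.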

--d

Jarkko Hietaniemi

Jan 3, 2004, 5:46:01 AM

to Martin Duerst, perl-u...@perl.org, Jungshik Shin

> If it were just me, that would be easy. But stating on an FAQ
> page 'use Perl 5.8.1 or later' for something that worked
> probably even in Perl 4 doesn't look like a good idea.

Perl 4? And here I was being afraid that getting 5.6 to work right would
be tricky... :-) I think we need to define "work" here. I think you meant
by "working" something like "how can I be certain (portably across Perl
versions) that my script is only ever processing bytes, and only bytes",
while I was thinking by "working" something like "able to process Unicode".

(Not that even Perl _1_ couldn't be made to _process_ Unicode, but I guess
I meant "builtin support" by "able".)

>> Off-hand I can say that getting both 5.6 and 5.8 work at the same time
>> may be impossible in spots simply because 5.6 was badly unfinished as
>> regards to Unicode. No, it won't get fixed. Beyond 5.8, I don't.
>
> Sorry, I think you missed something in the last sentence. Did you
> want to say "I don't know?".

Yes, something like that.

>> Yes, that seems to do the job. But is this available in 5.0 or
>> earlier?

Certainly not. It's not even available (does not come standard) before 5.6,
and IIRC "use" came in at 5.0.

Jarkko Hietaniemi

Jan 3, 2004, 5:52:29 AM

to Andreas J Koenig, perl-u...@perl.org, Jungshik Shin, mser...@cpan.org, Martin Duerst

> 5.00503, 5.6.x and 5.8.x. I don't think that the tricks you need to
> program around the Unicode cliffs through perl versions are collected
> in a document.

I think now that people have had time to "Unicodify" their applications
with 5.8.x, starting to collect the tricks required and found useful would
not be a bad idea. Depending on how much material there would be, either
the core perlunicode.pod could be augmented, or a completely new pod could
be created. Maybe some of the knowledge could be encapsulated in a module?
(or a .pl if someone really wants to do at least some Unicode with Perl 4?)

Guido Flohr

Jan 3, 2004, 1:38:49 PM

to perl-u...@perl.org, Martin Duerst

Martin Duerst wrote:
>> in 5.6 and 5.8, but _in_principle_ the bytes pragma should tell Perl in
>> both 5.6 and 5.8 that "I want bytes, darn it."

But you still run into problems when you pass UTF-8 flagged variables to
legacy modules that do not use the pragma.

> Yes, that seems to do the job. But is this available in 5.0 or earlier?
> Or is it possible to write some little code at the start that says
> something like:

With Locale::Messages::turn_utf_8_off() you can portably turn the UTF-8
flag off on scalars (works with Perl 5.00?-5.8) and force byte-wise
processing of that scalar globally.

By the way, the code I use for Perl 5.6 to turn the flag off looks like
this:

sub turn_utf_8_off
{
    use bytes;
    $_[0] = join '', split //, $_[0];
}

Is there anything faster available for Perl 5.6 than doing the join/split?

Ciao

Guido
--
Imperia AG, Development
Leyboldstr. 10 - D-50354 Hürth - http://www.imperia.net/