[perl #17439] broken Locale::Language in a UTF environment


per...@perl.org

Sep 19, 2002, 10:42:59 AM
to bugs-bi...@netlabs.develooper.com
# New Ticket Created by ma...@kasei.com
# Please include the string: [perl #17439]
# in the subject line of all future correspondence about this issue.
# <URL: http://rt.perl.org/rt2/Ticket/Display.html?id=17439 >


This is a bug report for perl from ma...@kasei.com,
generated with the help of perlbug 1.34 running under perl v5.8.0.


-----------------------------------------------------------------
When running in a UTF environment, Locale::Language doesn't load:

LANG=en_GB.utf8 perl -we 'use Locale::Language'

Malformed UTF-8 character (unexpected end of string) at /usr/share/perl/5.8.0/Locale/Language.pm line 115, <DATA> line 109.
Malformed UTF-8 character (unexpected end of string) at /usr/share/perl/5.8.0/Locale/Language.pm line 117, <DATA> line 109.
Malformed UTF-8 character (unexpected non-continuation byte 0x6c, immediately after start byte 0xe5) in lc at /usr/share/perl/5.8.0/Locale/Language.pm line 117, <DATA> line 109.
Malformed UTF-8 character (unexpected end of string) at /usr/share/perl/5.8.0/Locale/Language.pm line 115, <DATA> line 178.
Malformed UTF-8 character (unexpected end of string) at /usr/share/perl/5.8.0/Locale/Language.pm line 117, <DATA> line 178.
Malformed UTF-8 character (unexpected non-continuation byte 0x6b, immediately after start byte 0xfc) in lc at /usr/share/perl/5.8.0/Locale/Language.pm line 117, <DATA> line 178.


The fix:

--- lib/Locale/Language.pm.orig 2002-09-19 15:17:16.000000000 +0200
+++ lib/Locale/Language.pm 2002-09-19 15:17:41.000000000 +0200
@@ -231,7 +231,7 @@
my:Burmese

na:Nauru
-nb:Norwegian Bokmål
+nb:Norwegian Bokmal
nd:Ndebele, North
ne:Nepali
ng:Ndonga
@@ -300,7 +300,7 @@
uz:Uzbek

vi:Vietnamese
-vo:Volapük
+vo:Volapuk

wo:Wolof


[Please do not change anything below this line]
-----------------------------------------------------------------
---
Flags:
category=core
severity=medium
---
Site configuration information for perl v5.8.0:

Configured by Debian Project at Sat Sep 14 18:17:32 UTC 2002.

Summary of my perl5 (revision 5.0 version 8 subversion 0) configuration:
Platform:
osname=linux, osvers=2.4.19, archname=i386-linux-thread-multi
uname='linux cyberhq 2.4.19 #1 smp sun aug 4 11:30:45 pdt 2002 i686 unknown unknown gnulinux '
config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN -Dcccdlflags=-fPIC -Darchname=i386-linux -Dprefix=/usr -Dprivlib=/usr/share/perl/5.8.0 -Darchlib=/usr/lib/perl/5.8.0 -Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 -Dvendorarch=/usr/lib/perl5 -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl/5.8.0 -Dsitearch=/usr/local/lib/perl/5.8.0 -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dman1ext=1 -Dman3ext=3perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Uusesfio -Uusenm -Duseshrplib -Dlibperl=libperl.so.5.8.0 -Dd_dosuid -des'
hint=recommended, useposix=true, d_sigaction=define
usethreads=define use5005threads=undef useithreads=define usemultiplicity=define
useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
use64bitint=undef use64bitall=undef uselongdouble=undef
usemymalloc=n, bincompat5005=undef
Compiler:
cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
optimize='-O3',
cppflags='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fno-strict-aliasing -I/usr/local/include'
ccversion='', gccversion='2.95.4 20011002 (Debian prerelease)', gccosandvers=''
intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
alignbytes=4, prototype=define
Linker and Libraries:
ld='cc', ldflags =' -L/usr/local/lib'
libpth=/usr/local/lib /lib /usr/lib
libs=-lgdbm -ldb -ldl -lm -lpthread -lc -lcrypt
perllibs=-ldl -lm -lpthread -lc -lcrypt
libc=/lib/libc-2.2.5.so, so=so, useshrplib=true, libperl=libperl.so.5.8.0
gnulibc_version='2.2.5'
Dynamic Linking:
dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic'
cccdlflags='-fPIC', lddlflags='-shared -L/usr/local/lib'

Locally applied patches:

---
@INC for perl v5.8.0:
/home/marty/Perl
/etc/perl
/usr/local/lib/perl/5.8.0
/usr/local/share/perl/5.8.0
/usr/lib/perl5
/usr/share/perl5
/usr/lib/perl/5.8.0
/usr/share/perl/5.8.0
/usr/local/lib/site_perl
.

---
Environment for perl v5.8.0:
HOME=/home/marty
LANG=en_GB.UTF-8
LANGUAGE (unset)
LD_LIBRARY_PATH (unset)
LOGDIR (unset)
PATH=/home/marty/bin:/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/games:/sbin:/usr/sbin:/usr/local/sbin
PERLLIB=/home/marty/Perl
PERL_BADLANG (unset)
SHELL=/bin/bash


Marty Pauley

Sep 20, 2002, 4:52:16 AM
to perl5-...@perl.org
I should have added a better explanation to this bug report and the
proposed fix, so here goes:

The DATA in Locale::Language contains 2 Latin-1 characters.
When Perl 5.8 is running in a UTF8 locale it expects the DATA to be UTF8
so it dies when it finds a malformed character.
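
A minimal sketch of why those bytes are rejected, assuming a strict UTF-8 decode via the Encode module (the helper name is_valid_utf8 is made up for illustration and is not part of Locale::Language). The Latin-1 bytes 0xE5 and 0xFC from the error messages above do not form valid UTF-8 sequences, while the ASCII spelling is valid in both encodings:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Byte strings as they would appear in a Latin-1 encoded file:
# 0xE5 is a-with-ring, 0xFC is u-with-diaeresis.
my %candidate = (
    'nb (Latin-1)' => "Norwegian Bokm\xE5l",
    'vo (Latin-1)' => "Volap\xFCk",
    'nb (ASCII)'   => "Norwegian Bokmal",
);

# Hypothetical helper: true if the bytes decode cleanly as strict UTF-8.
sub is_valid_utf8 {
    my ($bytes) = @_;
    # Work on a copy: with a CHECK flag set, decode() may modify its input.
    my $copy = $bytes;
    return eval { decode('UTF-8', $copy, Encode::FB_CROAK); 1 } ? 1 : 0;
}

for my $label (sort keys %candidate) {
    printf "%-13s %s\n", $label,
        is_valid_utf8($candidate{$label}) ? "valid UTF-8" : "malformed UTF-8";
}
```

This only illustrates the byte-level problem; it does not reproduce how perl 5.8.0 itself reads the DATA handle under a UTF-8 locale.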

Adding 'use bytes' to Locale::Language would stop the death, but the
included non-ASCII characters don't work properly on non-Latin1 systems.
So I think it is better to replace the 2 problem characters with ASCII.

Here's my suggested patch. I've tried to ensure I've included the
actual Latin1 characters in this email, but as I don't use a Latin1
system they will probably be converted when I send this: sorry.

--- lib/Locale/Language.pm.orig 2002-09-19 15:17:16.000000000 +0200
+++ lib/Locale/Language.pm 2002-09-19 15:17:41.000000000 +0200
@@ -231,7 +231,7 @@
my:Burmese

na:Nauru
-nb:Norwegian Bokm?l
+nb:Norwegian Bokmal
nd:Ndebele, North
ne:Nepali
ng:Ndonga
@@ -300,7 +300,7 @@
uz:Uzbek

vi:Vietnamese
-vo:Volap?k
+vo:Volapuk

wo:Wolof

--- ./lib/Locale/Codes/t/languages.t.orig 2002-09-19 15:17:16.000000000 +0200
+++ ./lib/Locale/Codes/t/languages.t 2002-09-19 15:17:16.000000000 +0200
@@ -47,7 +47,7 @@
'code2language("nd") eq "Ndebele, North"',
'code2language("ng") eq "Ndonga"',
'code2language("nn") eq "Norwegian Nynorsk"',
- 'code2language("nb") eq "Norwegian Bokm?l"',
+ 'code2language("nb") eq "Norwegian Bokmal"',
'code2language("ny") eq "Chichewa; Nyanja"',
'code2language("oc") eq "Occitan (post 1500)"',
'code2language("os") eq "Ossetian; Ossetic"',

--
Marty

Nick Ing-Simmons

Sep 20, 2002, 11:22:02 AM
to mart...@kasei.com, perl5-...@perl.org
Marty Pauley <mart...@kasei.com> writes:
>I should have added a better explanation to this bug report and the
>proposed fix, so here goes:
>
>The DATA in Locale::Language contains 2 Latin-1 characters.
>When Perl 5.8 is running in a UTF8 locale it expects the DATA to be UTF8
>so it dies when it finds a malformed character.
>
>Adding 'use bytes' to Locale::Language would stop the death, but the
>included non-ASCII characters don't work properly on non-Latin1 systems.
>So I think it is better to replace the 2 problem characters with ASCII.

Why not \x{00xx} escape ? - would be more robust for patching as well.
As mailers (including mine) are variously mangling these diffs.
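
As a rough sketch of the escape approach, assuming the names lived in ordinary double-quoted Perl strings (the %name hash is hypothetical and is not how Locale::Language stores its table), the source stays pure ASCII and survives mailers, while the strings still hold the intended characters:

```perl
use strict;
use warnings;

# Pure-ASCII source: the non-ASCII characters are written as escapes.
my %name = (
    nb => "Norwegian Bokm\x{e5}l",   # U+00E5 LATIN SMALL LETTER A WITH RING ABOVE
    vo => "Volap\x{fc}k",            # U+00FC LATIN SMALL LETTER U WITH DIAERESIS
);

# Encode explicitly on output so the result does not depend on the locale.
binmode(STDOUT, ':encoding(UTF-8)');
print "$_: $name{$_}\n" for sort keys %name;
```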

Nick Ing-Simmons
http://www.ni-s.u-net.com/

Marty Pauley

Sep 23, 2002, 8:00:00 AM
to perl5-...@perl.org
On Fri Sep 20 16:22:02 2002, Nick Ing-Simmons wrote:

> Marty Pauley <mart...@kasei.com> writes:
> >Adding 'use bytes' to Locale::Language would stop the death, but the
> >included non-ASCII characters don't work properly on non-Latin1 systems.
> >So I think it is better to replace the 2 problem characters with ASCII.
>
> Why not \x{00xx} escape ? - would be more robust for patching as well.
> As mailers (including mine) are variously mangling these diffs.

The Language.pm file contains the Latin1 characters in the DATA section
so I can't use the escape sequences there. But the other reason was
more important to me: the Latin1 characters cause bad things to happen
when used in a non-Latin1 environment; in EUC-JP for example, they
either don't display at all, or they merge with the next character and
display some obscure kanji.
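
To illustrate the first point: text read from a DATA section is not interpolated, so an escape written there arrives as a literal backslash sequence. The sketch below uses a single-quoted heredoc as a stand-in for __DATA__ (both are read as raw text), and the hand-rolled expansion step is purely hypothetical, not something the module does:

```perl
use strict;
use warnings;

# Stand-in for a __DATA__ section: like DATA, a single-quoted heredoc is
# read as raw text, so escape sequences are not interpreted.
my $data = <<'END';
nb:Norwegian Bokm\x{e5}l
vo:Volap\x{fc}k
END

my (%raw, %expanded);
for my $line (split /\n/, $data) {
    my ($code, $name) = split /:/, $line, 2;
    $raw{$code} = $name;    # still contains a literal backslash, x, braces

    # Hypothetical extra step: expand \x{..} escapes by hand.
    (my $copy = $name) =~ s/\\x\{([0-9a-fA-F]+)\}/chr(hex($1))/ge;
    $expanded{$code} = $copy;
}

printf "raw length %d, expanded length %d\n",
    length($raw{nb}), length($expanded{nb});
```

Even with such an expansion step, the second point still stands: the resulting Latin-1 characters would be passed on to callers in non-Latin1 environments.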

--
Marty

h...@crypt.org

Sep 26, 2002, 6:49:53 AM
to perl5-...@perl.org
Marty Pauley <mart...@kasei.com> wrote:
:Here's my suggested patch. I've tried to ensure I've included the
:actual Latin1 characters in this email, but as I don't use a Latin1
:system they will probably be converted when I send this: sorry.

Thanks, applied as change #17927.

Sending the patch as an attachment, either instead of or as well as
the inline version, is usually the best way to ensure the integrity
of the patch when you are unsure what your mailer will do to it.

Hugo
