
proposed change in utf8 filename semantics


Dmitry Karasik

Sep 18, 2007, 4:03:50 PM
to perl5-...@perl.org
Hello,

For some time I have been wondering how best to incorporate utf8 into
filenames in perl. The problem, as it seems to me, is that unicode support
in this regard is not orthogonal in Perl, because things like locales and
file IO can easily be managed by 'use locale' and IO layers, whereas
unicode characters in file names are left out. For unix this was never a
problem, because there is no special syntax for filenames in unicode, but
for win32 it is, and working with files whose names contain unicode
letters outside the current locale is real trouble.

I would therefore like to propose a (non-default) change in semantics
that will use the OS-level unicode API (for win32, the wide-char API)
when available and when explicitly asked for. The semantics have two
aspects:

1. When a filename-related function is called with a filename scalar that
has the SvUTF8 bit set, the function will try to use the OS-level unicode
API -- if present. On win32, functions like win32_stat() will check the
utf8 context hints and, depending on the value, will call either stat() or
wstat(). For OSes where no special API is present, no changes in the code
are needed, and no additional runtime expenses are incurred.

2. Functions that return file names, like readdir(), are taught to
differentiate between bytes and utf8 context, regardless of whether the OS
supports a unicode API or not. I propose to extend the syntax of binmode
so that two new calls

binmode( DIRHANDLE, ':utf8')
binmode( DIRHANDLE, ':bytes')

will be recognized, and depending on the last such call, readdir() will
return filenames either with or without the SvUTF8 flag on. Again, the OS
unicode API will be used where supported, and where it is not, no
additional code is required. In ':utf8' mode, all results of
PerlIO_readdir() will simply be flagged with SvUTF8, and the validity of
the utf8 string can later be checked with utf8::valid, if necessary.
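
To make the intended usage concrete, here is a minimal sketch (the
binmode-on-a-dirhandle calls below only exist with the attached patch
applied; stock perl rejects binmode on a directory handle):

    opendir my $dh, '.' or die "opendir: $!";
    binmode $dh, ':utf8';   # proposed: readdir returns SvUTF8-flagged names
    my @names = readdir $dh;
    closedir $dh;

    # the names are only flagged, not validated, so a cautious caller
    # can still check them afterwards:
    @names = grep { utf8::valid($_) } @names;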

I'm attaching a patch against 5.10.0 that implements this new behavior
for stat(), opendir(), and readdir() only. I'm unsure whether this patch
would be considered good enough for inclusion, so I don't want to spend
more time implementing all the filename-related functions yet. OTOH if
someone wants to help me with the implementation, that would be really
great.

The patch is split in two sections, one for the code and another for the
configuration files. The code patch concerns only .c and .h files and is
fairly complete. I'm unsure though about the configuration patch - it
applies changes to Configure and win32/config.*, but there are many more
pre-compiled config templates for other platforms, so I didn't touch
those, and would like someone to tell me what I missed (I basically need
to add a new config variable, utf8filenamesemantics).

Of course I'm completely unsure whether the idea of the new utf8 filename
semantics will be accepted at all. I understand that it is win32 users who
will benefit most from it, because on unix a simple 'Encode::_utf8_on($_)
for @readdir' is all that is needed to treat filenames as unicode.
Nevertheless, if accepted, it would bring a little more uniformity to
Perl's cross-platform filename and unicode handling.

This is my first Perl patch, so if I broke some rules here, please don't
just stay silent; tell me what can be done better. I tested it on win32,
freebsd, and linux, and it seems to work as expected; I don't know what
else I should test. Please review and/or test it too. To enable it on
win32, define UTF8_FILENAME_SEMANTICS in win32/Makefile; otherwise re-run
Configure and answer yes to the 'Perl can be built with experimental UTF8
filename semantics enabled' question.

--
Sincerely,
Dmitry Karasik

perl5.10.diff

Juerd Waalboer

Sep 18, 2007, 7:58:37 PM
to perl5-...@perl.org
One big problem with filenames and encodings is that they are incredibly
platform dependent. And here, "platform" includes the mounted filesystem!

/foo may expect encoding A, whereas /foo/bar wants B. This can result in
a single path of /foo/bar/baz requiring that "foo" be encoded as latin1,
"bar" as A, and "baz" as B.

Unless perl can -somehow- tell (or be told) which encoding is required,
there's really no way to get any cross platform compatibility in this
area.

And note that while mixed filesystem encodings may only occur
occasionally in real systems, there might still be the case of dealing
with user preference, where the MP3 collection is UTF-8, but the photo
album is strictly ISO-8859-1 for compatibility with some old program
that adds captions to the images.

> 1. When a filename-related function is called with a filename scalar
> that has SvUTF8 bit set, the function will try to use OS-level unicode
> API -- if present.

No. The SvUTF8 bit indicates that the internal encoding of the string is
UTF8 rather than ISO-8859-1. (Note that ISO-8859-1 is -officially- a
Unicode encoding too, so Unicode semantics ought to apply.)

Perl already uses the UTF8 flag to decide *semantics* in several places.
While from a historical perspective this may have made sense, it is a
huge mistake that causes a lot of pain and subtle hard-to-catch bugs.

In new code, do not use the UTF8 flag to determine whether you're going
to use Unicode semantics. Use something that is visible in Perl code
instead of some internal variable -- for example, a pragma.

Support unicode always or never, or let the user decide. Please do not
apply heuristics here.

> 2. Functions that return file names, like readdir(), are taught to
> differentiate between bytes and utf8 context, regardless of whether OS
> supports unicode API or not.

Instead of "bytes" and "utf8", please let's make that "binary" and
"text", or "bytes" and "characters", because UTF-8 sequences are also
bytes.

> I propose to extend syntax of binmode so that two new calls
> binmode( DIRHANDLE, ':utf8')
> binmode( DIRHANDLE, ':bytes')
> will be recognized, and depending on the last such call, readdir()
> will return filenames either with or without SvUTF8 flag on.

This does not scale to functions like glob and open that don't act
on a DIRHANDLE, but do access directories. When you open
/foo/bar/baz/quux, each part can have its own expected encoding, so you
need to be able to set different encodings for /foo, /foo/bar,
/foo/bar/baz, and /foo/bar/baz/quux.

I think a (non-lexical) pragma or special variable that enables encoding
support (not just UTF-8!) for the filesystem would be a better idea.
When enabled, Perl tries to auto-detect the encoding, with the ability
to override this by explicitly saying that things under "/foo/bar/"
should be encoding B and everything under "/mnt/tmp5" should be encoding
C. Of course, there should also be a way to say that even though
"/foo/bar"'s tree was forced to B, everything under "/foo/bar/baz/quux"
should use auto detection again.

> all results of PerlIO_readdir() will be simply flagged with SvUTF8,
> and the validity of utf8 string can be later checked with utf8::valid,
> if necessary.

That's a scary and potentially dangerous approach. SvUTF8 is treated as
a promise that says "this buffer is valid UTF8". This is why
:encoding(UTF-8) is often a better choice than :utf8. In fact, I'm still
pissed off by the poor huffman coding here.
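
To spell the difference out: ':utf8' merely flags the bytes it reads,
while ':encoding(UTF-8)' pushes them through Encode, which validates
them (by default warning and substituting \x{FFFD} for malformed
sequences). A minimal sketch, assuming $file holds a path:

    # trusting: malformed bytes come through flagged but unchecked
    open my $trusting, '<:utf8', $file or die $!;

    # checking: the input is validated while it is decoded
    open my $checking, '<:encoding(UTF-8)', $file or die $!;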

> I'm attaching a patch against 5.10.0 that implements this new behavior for
> stat(), opendir(), and readdir() only. I'm unsure whether this patch would be
> considered good enough for inclusion

I love it when people send patches. It shows that what they want can
actually be done, and that they're willing to spend time to make it
happen.

However, this is a new feature with potentially major (probably
positive) impact. It lacks documentation and there's practically no time
to test it. I like the idea of having Perl support non-raw-bytes
filenames, but let's first find consensus about what the proper level of
abstraction should be. I'm not the pumpking, of course, but I'd like
this to go into 5.12, not 5.10.

> (I basically need to add a new config variable utf8filenamesemantics).

There's more than just UTF-8. It would be nice if the implementation
went all the way and implemented a framework for other encodings too.

> simple 'Encode::_utf8_on($_) for @readdir' is all that is needed to
> treat filenames as unicode.

Simple but dangerous, and there's no way of knowing that what readdir
returns should actually be interpreted as UTF-8. (AFAIK.)
--
Met vriendelijke groet, Kind regards, Korajn salutojn,

Juerd Waalboer: Perl hacker <#####@juerd.nl> <http://juerd.nl/sig>
Convolution: ICT solutions and consultancy <sa...@convolution.nl>

Jan Dubois

Sep 18, 2007, 8:57:13 PM
to Dmitry Karasik, perl5-...@perl.org
On Tue, 18 Sep 2007, Dmitry Karasik wrote:
> For some time I have been wondering how best to incorporate utf8
> into filenames in perl. The problem, as it seems to me, is that
> unicode support in this regard is not orthogonal in Perl, because
> things like locales and file IO can easily be managed by 'use locale'
> and IO layers, whereas unicode characters in file names are left out.
> For unix this was never a problem, because there is no special syntax
> for filenames in unicode, but for win32 it is, and working with files
> whose names contain unicode letters outside the current locale is
> real trouble.
>
> I would therefore like to propose a (non-default) change in semantics
> that will use the OS-level unicode API (for win32, the wide-char API)
> when available and when explicitly asked for. The semantics have two
> aspects:

I think this is the wrong approach. This topic has been discussed a couple
of times, both here and on the perl-unicode mailing list. The consensus
seems to be the approach described in pod/perltodo.pod under the heading
"Virtualize operating system access".

I'm interested in discussing this further, but I'm going to be offline
from sometime next week until the end of October. However, I think this
is a topic for Perl 5.12, so there is no urgency right now.

Note that I added various workarounds to both Perl 5.10 and the included
Win32 module to make it possible to work with Unicode filenames that
cannot be mapped to the ANSI codepage:

Whenever readdir() or glob() have to return a filename that cannot be
mapped back to the system codepage without substitution characters, then
they will return the short 8.3 name instead. As long as you are using
the NTFS filesystem, this name can always be represented in the ANSI
codepage, and therefore be passed back to open() or passed to other
programs etc. (If you are using FAT, then the 8.3 filename may contain
characters from the OEM character set and may still not be representable
in the ANSI codepage).

If you need the long version of a filename returned by readdir() or glob()
then you can always call Win32::GetLongPathName(), which will return
the full name of the file or directory, switching to UTF8 if the string
cannot be represented in the ANSI codepage.
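
A sketch of that round trip (Win32 only; Win32::GetLongPathName ships
with the Win32 module):

    use Win32;

    my $dir = '.';
    opendir my $dh, $dir or die "opendir: $!";
    for my $name (readdir $dh) {
        next if $name =~ /^\.\.?$/;
        # $name may be an 8.3 alias if the real name cannot be mapped
        # to the ANSI codepage; it is still usable with open(), stat(), etc.
        my $long = Win32::GetLongPathName("$dir\\$name");
        # $long switches to UTF8 if the full name cannot be represented
        # in the ANSI codepage
        print "$name => $long\n";
    }
    closedir $dh;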

So while accessing non-ANSI filenames from Perl isn't exactly easy, it
is certainly already possible, and mostly seamless, as long as you are
using NTFS. Please check out the other filename related functions in the
Win32.pm module and let me know what you think.

Just remember that using the 8.3 filenames is meant as a workaround until
the "virtual operating system access" is properly implemented. The basic
system is already in place for Win32; the problem is just that we continue
to use char* pointers to pass strings to OS calls, so we lose the string
encoding in the process.

Cheers,
-Jan


Jan Dubois

Sep 18, 2007, 9:12:17 PM
to Juerd Waalboer, perl5-...@perl.org
On Tue, 18 Sep 2007, Juerd Waalboer wrote:
> One big problem with filenames and encodings is that it is incredibly
> platform dependent. And here, "platform" includes mounted filesystem!
>
> /foo may expect encoding A, whereas /foo/bar wants B. This can result in
> a single path of /foo/bar/baz requiring that "foo" be encoded as latin1,
> "bar" as A, and "baz" as B.
>
> Unless perl can -somehow- tell (or be told) which encoding is required,
> there's really no way to get any cross platform compatibility in this
> area.
>
> And note that while mixed filesystem encodings may only occur
> occasionally in real systems, there might still be the case of dealing
> with user preference, where the MP3 collection is UTF-8, but the photo
> album is strictly ISO-8859-1 for compatibility with some old program
> that adds captions to the images.

On Windows you can just call the wide-character APIs and the OS / file
system drivers make sure that each part of the filename is encoded
correctly for the filesystem on which it is stored.

The whole issue only becomes messy when you have to use the byte string
API, where you suddenly have to deal with encodings that can only
represent a subset of the full Unicode character set. The real problem of
course is that typical Unix systems don't have a wide-character API that
hides the implementation details from the user.

Cheers,
-Jan



Dmitry Karasik

Sep 19, 2007, 3:25:55 AM
to Jan Dubois, Dmitry Karasik, perl5-...@perl.org

Jan> I'm interested in discussing this further, but I'm going to be
Jan> offline from sometime next week until the end of October. However, I
Jan> think this is a topic for Perl 5.12, so there is no urgency right
Jan> now.

Sure thing. That's why there's only a minimal patch.

Jan> Whenever readdir() or glob() have to return a filename that cannot be
Jan> mapped back to the system codepage without substitution characters,
Jan> then they will return the short 8.3 name instead.

I'm aware of that, but note that filenames do not necessarily come to a
perl program via readdir and glob. If, for example, a user types in a
filename that contains unmappable characters, open() wouldn't be able to
open that file.

Jan> So while accessing non-ANSI filenames from Perl isn't exactly easy,
Jan> it is certainly already possible, and mostly seamless, as long as you
Jan> are using NTFS. Please check out the other filename related functions
Jan> in the Win32.pm module and let me know what you think.

I'm also very much aware of Win32 wide filename support, but my point is that if a
good abstraction of unicode in filenames is found (and I hope that mine is
good), then all that wide filename support can be moved to core.

Jan> Just remember that using the 8.3 filenames is meant as a workaround
Jan> until the "virtual operating system access" is properly implemented.

I have no opinion about "virtual operating system access", and especially
about when it will be implemented, but when it is, there must still be a
change in perl-level semantics anyway. Let's test those semantics on the
proposed implementation now, and if they are good, they can be taken as a
point of reference when implementing virtual OS access.

Jan> The basic system is already in place for Win32; the problem is just
Jan> that we continue to use char* pointers to pass strings to OS calls,
Jan> so we lose the string encoding in the process.

Not necessarily -- if we adopt a set of flags passed in PL_dir_unicode,
then we're just fine with char* pointers.

--
Sincerely,
Dmitry Karasik

Dmitry Karasik

Sep 19, 2007, 3:58:56 AM
to John E. Malmberg, Dmitry Karasik, perl5-...@perl.org
Hi John!

John> What happens on NTFS if you store a filename using UTF8 encoding
John> through the traditional UNIX calls?

The default behavior stays the same -- filenames will be stored using
byte semantics. If, however, a filename scalar has SvUTF8 flag set,
UTF8 semantics will be used.

John> Now here the problem really gets bad. There are programs on VMS that
John> only understand how to represent Unicode filenames in the native
John> language using VTF-8 format.
John> Filenames encoded in the UTF-8 (binary) format are not usable to
John> those programs.

John> Filenames encoded in VTF-7 format are not visible to programs
John> expecting UNIX file syntax.

I have very little knowledge about VMS, but based on what you say, there
is no way to resolve the conflict, as presented.

John> Which Unicode API should VMS use when the SvUTF8 is set? Native,
John> UTF-8 UNIX, or VTF-7?

I wish I were able to answer your question, because VMS support will be
vital for my proposal; however, as I'm not competent here, I simply don't
know, sorry.

John> I can easily tell if I do a readdir() if the filename it is reading
John> is VTF-7 encoded or not. I do not know for sure if it is utf-8
John> encoded.

But the same is true for default unix semantics -- the results of
readdir can easily be analyzed as to whether they are utf8 or not. The
idea is not to analyze the output at all, but rather to let the users
decide what encoding they want their filenames in.

John> First see if your platform will accept UTF-8 encoded filenames in
John> UNIX syntax as different files from other Unicode encoded.

Possibly. Again, this is VMS-related, so I don't know.

John> I do not like build options, is there any way to make it a run time
John> setting like a mode or a pragma.

Me neither, and I guess there should be no problem adding such a pragma.

John> It may be more practical to first build an external overload to the
John> filename functions to do the translations based on an object type of
John> a file specification, and the properties of that object.
John> This way perl modules can use class of a file specification and its
John> properties as an enhancement to the base perl, and it would be clear
John> that the object in question is a file specification.

This is interesting. Possibly I'm doing a premature optimization;
because the actual changes required in system-independent files are
minor, I thought that direct changes to the core that don't change
anything on unix would be good enough.

John> Even more fun would be if someone needed to write a Perl script to
John> rename UTF-8 encoded names to VTF-7 encoded names or the reverse.
John> It might be to maintain hard links between the two encodings.

Would it be (excuse my VMS ignorance) more appropriate to treat the
SvUTF8 flag as an indication of which layer to use? So SvUTF8-flagged
scalars that contain characters > 0x7f would be put through the VTF-7
layer, and the unix layer otherwise? But I see that just one SvUTF8 flag
might not be enough here.

--
Sincerely,
Dmitry Karasik

Dmitry Karasik

Sep 19, 2007, 3:13:16 AM
to Juerd Waalboer, perl5-...@perl.org
Hi Juerd!

Juerd> Unless perl can -somehow- tell (or be told) which encoding is
Juerd> required, there's really no way to get any cross platform
Juerd> compatibility in this area.

Of course. The idea is that it is the caller that tells perl which
encoding is required, so no heuristics are necessary. readdir() would
therefore return unicode filenames only after being told to do so.

Juerd> And note that while mixed filesystem encodings may only occur
Juerd> occasionally in real systems, there might still be the case of
Juerd> dealing with user preference, where the MP3 collection is UTF-8,
Juerd> but the photo album is strictly ISO-8859-1 for compatibility with
Juerd> some old program that adds captions to the images.

If we're talking about unix mounts, that is a non-issue. For win32
mounts, I don't know; I have never encountered them at all, so I don't
know whether the win32 API takes care of the underlying encoding
translations.

Juerd> Perl already uses the UTF8 flag to decide *semantics* in several
Juerd> places. While from a historical perspective this may have made
Juerd> sense, it is a huge mistake that causes a lot of pain and subtle
Juerd> hard-to-catch bugs.

Hm. I was unaware of the point of view that the UTF8 flag was a mistake;
of course, from that point of view the whole proposal would simply be a
continuation of that mistake.

Juerd> Do not use the UTF8 flag to determine if you're going to use
Juerd> Unicode semantics or not, in new code. Use something that is
Juerd> visible in Perl code, instead of some internal variable. For
Juerd> example, a pragma.

I tend to agree; however, pragmas tend to be global, program- or
package-wise, and what suits best here is an individual, per-call flag.

Juerd> Support unicode always or never, or let the user decide. Please do
Juerd> not apply heuristics here.

I must've written something unclear. There are no heuristics, and it is
the user that decides which semantics to use.

Juerd> Instead of "bytes" and "utf8", please let's make that "binary" and
Juerd> "text", or "bytes" and "characters", because UTF-8 sequences are
Juerd> also bytes.

I personally don't really care what the names would be; I just thought
that this would fit better with the existing IO layer names, with
binmode(FILE, ':utf8') and the like.

Juerd> This does not scale to functions like glob and open that don't act
Juerd> on a DIRHANDLE, but do access directories. When you open
Juerd> /foo/bar/baz/quux, each part can have its own expected encoding, so
Juerd> you need to be able to set different encodings for /foo, /foo/bar,
Juerd> /foo/bar/baz, and /foo/bar/baz/quux.

This is true for glob, but untrue for open -- the latter does not return
filenames.

>> all results of PerlIO_readdir() will be simply flagged with SvUTF8, and
>> the validity of utf8 string can be later checked with utf8::valid, if
>> necessary.

Juerd> That's a scary and potentially dangerous approach. SvUTF8 is
Juerd> treated as a promise that says "this buffer is valid UTF8". This is
Juerd> why :encoding(UTF-8) is often a better choice than :utf8. In fact,
Juerd> I'm still pissed off by the poor huffman coding here.

This is also a bit unclear to me. utf8::valid happily returns true when
SvUTF8 is off, and only when it is on does it do the actual validity
check. It would be trivial to enforce the promise that scalars flagged
with SvUTF8 are really valid; however, the proposed behavior is also
based on the behavior of the utf8 IO layer, which simply flags all input
with SvUTF8, valid or not. So this behavior is debatable.
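
For reference, the utf8::valid() behavior relied on above can be
demonstrated in a few lines:

    use Encode ();

    my $name = "\xC3\x28";    # malformed UTF-8, SvUTF8 flag off
    print utf8::valid($name) ? "ok" : "not ok", "\n";  # ok: byte strings always pass

    Encode::_utf8_on($name);  # flag it, as the proposed readdir() would
    print utf8::valid($name) ? "ok" : "not ok", "\n";  # not ok: flagged but malformed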

Juerd> Simple but dangerous, and there's no way of knowing that what
Juerd> readdir returns should actually be interpreted as UTF-8. (AFAIK.)

True, but again, let's look at the utf8 IO layer: whoever uses it
accepts the burden of checking the validity of the input. The same goes
for readdir.

--
Sincerely,
Dmitry Karasik

Juerd Waalboer

Sep 19, 2007, 6:16:38 AM
to perl5-...@perl.org
Dmitry Karasik wrote 2007-09-19 9:13 (+0200):

> Juerd> Perl already uses the UTF8 flag to decide *semantics* in several
> Juerd> places. While from a historical perspective this may have made
> Juerd> sense, it is a huge mistake that causes a lot of pain and subtle
> Juerd> hard-to-catch bugs.
> Hm. I was unaware of a point of view that the UTF8 flag was a mistake,

It's not that the UTF8 flag was a mistake; using it as a heuristic for
semantics was. In essence, the UTF8 flag indicates whether the string is
internally raw bytes or UTF8 encoded. A raw byte string is interpreted
as ISO-8859-1 whenever it needs to be upgraded. However, with lc, uc,
//i, and character classes, a negative UTF8 indication results in ASCII
semantics, ignoring the second half of ISO-8859-1 altogether.

This is wrong, because from the programmer's perspective, you can now
have $foo eq $bar, while $foo =~ /\w/ and $bar !~ /\w/. Abstraction is
broken.
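
A minimal demonstration of the broken abstraction (on the perl of this
era, with no extra pragmas):

    use Encode qw(decode);

    my $foo = "\xE9";                       # e-acute as a byte string, flag off
    my $bar = decode('ISO-8859-1', "\xE9"); # the same character, flag on

    print $foo eq $bar ? "equal\n" : "not equal\n"; # equal: eq compares characters
    print "foo is \\w\n" if $foo =~ /\w/;           # no match: ASCII semantics
    print "bar is \\w\n" if $bar =~ /\w/;           # matches: Unicode semantics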

> Juerd> Do not use the UTF8 flag to determine if you're going to use
> Juerd> Unicode semantics or not, in new code. Use something that is
> Juerd> visible in Perl code, instead of some internal variable. For
> Juerd> example, a pragma.
> I tend to agree; however, pragmas tend to be global, program- or
> package-wise, and what suits best here is an individual, per-call flag.

Global is a problem in most cases, but I feel it would be perfect here,
simply because the filesystem is equally global. In fact, it's even
longer lived than your Perl program :)

Better yet, global variables can be localized to dynamic scope. This is
good, because when you set the encoding for /foo, it should work for
encoding-unaware modules too.

Maybe a hash would be nice:

${^FS_ENCODING}{foo} = 'A';
${^FS_ENCODING}{foo}{bar} = 'B';
${^FS_ENCODING}{foo}{bar}{baz}{quux} = 'auto';

open my $fh, ">", "/foo/bar/baz/quux/blah/hello.txt";

Which then actually does:

open my $fh, ">", join("/",
    "",
    encode(detect_encoding("/"), "foo"),
    encode("A", "bar"),
    encode("B", "baz"),
    encode("B", "quux"),
    encode(detect_encoding("/foo/bar/baz/quux"), "blah"),
    encode(detect_encoding("/foo/bar/baz/quux/blah"), "hello.txt"),
);

Like most things, this would only work if all encodings are ASCII
compatible. (For the "/" separator)

> Juerd> Support unicode always or never, or let the user decide. Please do
> Juerd> not apply heuristics here.
> I must've written something unclear. There are no heuristics, and it is
> the user that decides which semantics to use.

Using the UTF8 flag for that would have been a heuristic.

> Juerd> This does not scale to functions like glob and open that don't act
> Juerd> on a DIRHANDLE, but do access directories. When you open
> Juerd> /foo/bar/baz/quux, each part can have its own expected encoding, so
> Juerd> you need to be able to set different encodings for /foo, /foo/bar,
> Juerd> /foo/bar/baz, and /foo/bar/baz/quux.
> This is true for glob, but untrue for open -- the latter does not
> return filenames.

It does not return them, but it does use them. It has to encode paths
with the same encodings that readdir uses to decode them, or symmetry is
broken and the result of readdir is now useless.

> >> all results of PerlIO_readdir() will be simply flagged with SvUTF8, and
> >> the validity of utf8 string can be later checked with utf8::valid, if
> >> necessary.
> Juerd> That's a scary and potentially dangerous approach. SvUTF8 is
> Juerd> treated as a promise that says "this buffer is valid UTF8". This is
> Juerd> why :encoding(UTF-8) is often a better choice than :utf8. In fact,
> Juerd> I'm still pissed off by the poor huffman coding here.
> This is also a bit unclear to me.

The responsibility for checking the value should be perl's, not the
programmer's.

> that simply flags all input with SvUTF8, valid or not.

Simply flagging is arguably wrong and dangerous. Instead of simply
flagging, the string should be decoded properly. This may result in
exactly the same byte sequence, but provides important checks.
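
In code, the difference is roughly this (a sketch; $name stands for
whatever readdir returned):

    use Encode ();

    # dangerous: asserts validity without checking
    # Encode::_utf8_on($name);

    # safer: actually decode, failing loudly on malformed input.
    # decode() with a CHECK argument may modify its source, so use a copy.
    my $copy = $name;
    my $decoded = eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK) };
    die "name is not valid UTF-8\n" unless defined $decoded;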

> Juerd> Simple but dangerous, and there's no way of knowing that what
> Juerd> readdir returns should actually be interpreted as UTF-8. (AFAIK.)
> True, but again, let's look at the utf8 IO layer: whoever uses it
> accepts the burden of checking the validity of the input. The same goes
> for readdir.

Yes, with no documentation whatsoever pointing out the danger. I'm
looking for tuits to fix this. ":utf8" is used MUCH too easily, because
people DO NOT KNOW that they then have to check for validity themselves.
It's fine when writing (encoding), it's bad when reading (decoding).

John Malmberg

Sep 19, 2007, 10:01:27 AM
to perl5-...@perl.org
Dmitry Karasik wrote:
> Hi John!
>
> John> What happens on NTFS if you store a filename using UTF8 encoding
> John> through the traditional UNIX calls?
>
> The default behavior stays the same -- filenames will be stored using
> byte semantics. If, however, a filename scalar has SvUTF8 flag set,
> UTF8 semantics will be used.

I was referring to outside of Perl.

The VMS ODS-5 file system was developed for use with Pathworks/Advanced
Server to serve files to Microsoft Windows. Pathworks used code
licensed from Microsoft through ATT. So ODS-5 was designed to have
filename behavior similar to NTFS, without the support for 8.3 names.

So there is a possibility that any issues that VMS has with oddities in
Unicode handling, Microsoft Windows may also have the same on NTFS.

> John> Now here the problem really gets bad. There are programs on VMS that
> John> only understand how to represent Unicode filenames in the native
> John> language using VTF-8 format.
> John> Filenames encoded in the UTF-8 (binary) format are not usable to
> John> those programs.
>
> John> Filenames encoded in VTF-7 format are not visible to programs
> John> expecting UNIX file syntax.
>
> I have very little knowledge about VMS, but based on what you say, there
> is no way to resolve the conflict, as presented.

It requires the user or system administrator to set a flag indicating
what mode to use. An enhancement request has been filed for the VMS C
library to have such a flag. I do not know what the status of that
request is and do not have any direct way to find out. In any case, the
C library change would only help for future versions of VMS.

> John> Which Unicode API should VMS use when the SvUTF8 is set? Native,
> John> UTF-8 UNIX, or VTF-7?
>
> I wish I were able to answer your question, because VMS support will be
> vital for my proposal; however, as I'm not competent here, I simply don't
> know, sorry.

Right now, Perl does not fully support VMS ODS-5 for non-Unicode
filenames, and I need to get that working before I can look at adding
Unicode support. The fact that I also have not really worked with Unicode
is a further hindrance, as I do not have any independent test cases to
verify whether I get things right.

Latent in the VMS port of Perl is code that looks for an external flag to
determine whether it should convert UNIX UTF-8 to VTF-7 or pass it
through. Some of the VTF-7 handling is now present, but it is untested.

> John> I can easily tell if I do a readdir() if the filename it is reading
> John> is VTF-7 encoded or not. I do not know for sure if it is utf-8
> John> encoded.
>
> But the same is true for default unix semantics -- the results of
> readdir can easily be analyzed as to whether they are utf8 or not. The
> idea is not to analyze the output at all, but rather to let the users
> decide what encoding they want their filenames in.

Realize that the user may not know or want to care about filename encodings.

With UNIX it is not an issue because everything on the system treats a
filename the same way, regardless of the encoding.

With VMS, it is an issue because there is a traditional native syntax, a
UNIX translation of that syntax, an extended native syntax, and that
extended native syntax requires changes to the UNIX translation.

> John> First see if your platform will accept UTF-8 encoded filenames in
> John> UNIX syntax as different files from other Unicode encoded.
>
> Possibly. Again, this is VMS-related, so I don't know.

No, it is related to the platform that you are using. What you need to
do is a simple test:

Create a file name using characters that require Unicode encoding.

Create a UTF-8 representation of that filename and create a file
with that name in an empty directory.

Create the wide (UCS-2) representation of the above file name.

Use the wide open routine to try to open the existing file that
that you just created.

If that step succeeds, then it means that your platform treats UTF-8 and
UCS-2 representations as the same filename transparently, and it means
that most, if not all, of your hacks are not needed.

If that step fails, then you have the exact same issue as VMS, where
UTF-8 filenames and "wide" filenames are treated as different files, and
that the same special handling is needed to know if a file name string
with the SvUTF8 flag needs to be passed through as binary or converted
to "wide" for use with a "wide" call.

And in the case that the step fails, then you need guidance from
external to the program as to how to handle the UTF-8 code.

> John> I do not like build options, is there any way to make it a run time
> John> setting like a mode or a pragma.
>
> Me neither, and I guess there should be no problem adding such a pragma.
>
> John> It may be more practical to first build an external overload to the
> John> filename functions to do the translations based on an object type of
> John> a file specification, and the properties of that object.
> John> This way perl modules can use class of a file specification and its
> John> properties as an enhancement to the base perl, and it would be clear
> John> that the object in question is a file specification.
>
> This is interesting. Possibly I'm doing a premature optimization;
> because the actual changes required in system-independent files are
> minor, I thought that direct changes to the core that don't change
> anything on unix would be good enough.
>
> John> Even more fun would be if someone needed to write a Perl script to
> John> rename UTF-8 encoded names to VTF-7 encoded names or the reverse.
> John> It might be to maintain hard links between the two encodings.
>
> Would it be (excuse my VMS ignorance) more appropriate to treat the
> SvUTF8 flag as an indication of which layer to use? So SvUTF8-flagged
> scalars that contain characters > 0x7f would be put through the VTF-7
> layer, and the unix layer otherwise? But I see that just one SvUTF8
> flag might not be enough here.

SvUTF8 is a binary flag. I need a flag to indicate how I should
translate UTF-8 encoded file names to native VMS file names.

And I cannot trust the UTF-8 flag to have been set, because things like
File::Spec and VMS::Filespec currently do not appear to deal with it and
may strip it off of a processed or created file specification.

That is why the solution may be to create a class to handle filenames
and file systems with methods and properties that are unique to them.

-John
wb8...@qsl.net
Personal Opinion Only

Dmitry Karasik

Sep 19, 2007, 7:22:00 AM
to Juerd Waalboer, perl5-...@perl.org
Hi Juerd!

Juerd> Global is a problem in most cases, but I feel it would be perfect
Juerd> here, simply because the filesystem is equally global. In fact,
Juerd> it's even longer lived than your Perl program :)

Ok, so let's say that the argument revolves around how exactly we tell a
syscall which semantics to use. If the consensus is that SvUTF8 is not
good enough, surely a global would do. I'm more concerned about the
actual underlying string translation than about using SvUTF8 as (one of
many possible ways to) hint the syscall.

OTOH, if, for the sake of argument, ${^FS_ENCODING} = 'X' is
used, it wouldn't look as beautiful as it would without it. I think
that a simple scalar will do here.

Juerd> This does not scale to functions like glob and open that don't act
Juerd> on a DIRHANDLE, but do access directories. When you open
Juerd> /foo/bar/baz/quux, each part can have its own expected encoding, so
Juerd> you need to be able to set different encodings for /foo, /foo/bar,
Juerd> /foo/bar/baz, and /foo/bar/baz/quux.
>> This is true for glob, but untrue for open -- the latter does not
>> return filenames.

Juerd> It does not return them, but it does use them. It has to encode
Juerd> paths with the same encodings that readdir uses to decode them, or
Juerd> symmetry is broken and the result of readdir is now useless.

If we're talking in light of having one big switch for FS encodings,
then yes, the symmetry matters. But if each individual call is supplied
with a flag, as was my intention, then it's perfectly safe to
opendir(bytes) and readdir(characters).

Juerd> Simple but dangerous, and there's no way of knowing that what
Juerd> readdir returns should actually be interpreted as UTF-8. (AFAIK.)
>> True, but again, let's look at the utf8 IO layer: whoever uses it
>> accepts the burden of checking the validity of the input. The same
>> goes for readdir.

Juerd> Yes, with no documentation whatsoever pointing out the danger. I'm
Juerd> looking for tuits to fix this. ":utf8" is used MUCH too easily,
Juerd> because people DO NOT KNOW that they then have to check for
Juerd> validity themselves. It's fine when writing (encoding), it's bad
Juerd> when reading (decoding).

I agree that this is a problem; however, if you are able to fix it,
whatever the fix may be, I'm sure that the same logic could be applied
to readdir().

--
Sincerely,
Dmitry Karasik

Jan Dubois

Sep 19, 2007, 4:21:49 PM
to Dmitry Karasik, perl5-...@perl.org
On Wed, 19 Sep 2007, Dmitry Karasik wrote:
> Jan> Whenever readdir() or glob() have to return a filename that
> Jan> cannot be mapped back to the system codepage without
> Jan> substitution characters, then they will return the short 8.3
> Jan> name instead.
>
> I'm aware of that, but note that filenames do not necessarily come to a
> perl program via readdir and glob. If, for example, a user types in a
> filename that contains unmappable characters, open() wouldn't be able
> to open that file.

You'll have to call Win32::GetANSIPathName() or Win32::GetShortPathName()
on any user-supplied filename before you pass it to open() if it is
possible that the input contains unmappable characters.
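
For example (a sketch; $typed stands for whatever the user entered, and
Win32::GetANSIPathName() is in the Win32 module bundled with 5.10):

    use Win32;

    # falls back to the 8.3 alias when the name contains characters
    # that the ANSI codepage cannot represent
    my $ansi = Win32::GetANSIPathName($typed);
    open my $fh, '<', $ansi or die "open: $!";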

[...]

> I'm also very much aware of Win32 wide filename support, but my point
> is that if a good abstraction of unicode in filenames is found (and I
> hope that mine is good), then all that wide filename support can be
> moved to core.

Yes, that should be the goal.

> Jan> Just remember that using the 8.3 filenames is meant as a
> Jan> workaround until the "virtual operating system access" is
> Jan> properly implemented.
>
> I have no opinion about "virtual operating system access", and
> especially about when it will be implemented, but when it is, there
> must still be a change in perl-level semantics anyway. Let's test
> those semantics on the proposed implementation now, and if they are
> good, they can be taken as a point of reference when implementing
> virtual OS access.

Why do you need a change in perl-level semantics? I would expect
everything that already works to continue to work as is, but also
for things that are broken right now to start working as well.

> Jan> The basic system is already in place for Win32; the problem is
> Jan> just that we continue to use char* pointers to pass strings to
> Jan> OS calls, so we lose the string encoding in the process.
>
> Not necessarily -- if we adopt a set of flags passed in PL_dir_unicode,
> then we're just fine with char* pointers.

I think this breaks down for functions like link() and rename(). Yes,
you can make the mechanism even more complicated, but it becomes very
hacky. Note that we need to define an API that is also very easy to
use from XS extensions.

Currently we are just redefining the names of the standard C library
routines. We could define a second set of functions that take UTF8
encoded strings, and all internals can then call those when the SV has
the SvUTF8 flag set. For functions that take multiple names (like
rename() and unlink()) we'll have to upgrade the remaining arguments to
UTF8 if at least one argument is UTF8.
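
At the Perl level, that upgrade rule would look roughly like this (a
hypothetical wrapper, purely to illustrate the rule; in the core it
would of course happen inside the syscall wrapper itself):

    sub my_rename {
        my ($from, $to) = @_;
        # if either name is internally UTF8, upgrade the other so that
        # both arguments reach the OS call with a consistent encoding
        if (utf8::is_utf8($from) || utf8::is_utf8($to)) {
            utf8::upgrade($from);
            utf8::upgrade($to);
        }
        return rename $from, $to;
    }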

Cheers,
-Jan


Dmitry Karasik

Sep 19, 2007, 11:22:32 AM
to John Malmberg, perl5-...@perl.org
John> Create a file name using characters that require Unicode
John> encoding.
John> Create a UTF-8 representation of that filename and create a file
John> with that name in an empty directory.
John> Create the wide (UCS-2) representation of the above file name.
John> Use the wide open routine to try to open the existing file that
John> that you just created.
John> If that step succeeds, then it means that your platform treats UTF-8
John> and UCS-2 representations as the same filename transparently, and it
John> means that much if any of your hacks are not needed.
John> If that step fails, then you have the exact same issue as VMS, where
John> UTF-8 filenames and "wide" filenames are treated as different files,
John> and that the same special handling is needed to know if a file name
John> string with the SvUTF8 flag needs to be passed through as binary or
John> converted to "wide" for use with a "wide" call.

That is the case for win32, yes.

John> And in the case that the step fails, then you need guidance from
John> external to the program as to how to handle the UTF-8 code.

For win32 it is not so; there are 2 different cases. For names with
unicode characters that can be mapped to the system codepage, U8 and U16
files will be the same. Otherwise, the U8 API simply won't recognize
these files.

John> filenames and file systems with methods and properties that are
John> unique to them.

For VMS, I tend to agree. Also, as Juerd pointed out, the SvUTF8 flag
might not be the best indicator for the desired semantics, so if there
were any other SV flag or global variable for this, the same could be
done to select the VMS filename layer. I guess.

--
Sincerely,
Dmitry Karasik

Dmitry Karasik

Sep 19, 2007, 5:31:05 PM
to Jan Dubois, perl5-...@perl.org
Hi Jan!

Jan> Why do you need a change in perl-level semantics?

Because I don't see how the semantics in current perl would fit this
purpose. If you have in mind some other syntax that would fit better,
please share. I made the proposal to the best of my knowledge of how
perl unicode works, and of course I'm not claiming authority.

Jan> I think this breaks down for functions like link() and rename().
Jan> Yes, you can make the mechanism even more complicated, but it becomes
Jan> very hacky. Note that we need to define an API that is also very
Jan> easy to use from XS extensions.

Not really. If the semantics are deduced, as I propose, from SvUTF8,
then all you need to tell a 2-parameter function is which parameter is
unicode and which is not. In case one parameter is in bytes and another
is in utf8, the win32 layer will call the OS-level name translation to
upgrade the bytes parameter to wide char.

I agree though that if the semantics are deduced from some global
pragma, then it indeed becomes hacky.

--
Sincerely,
Dmitry Karasik

Jan Dubois

Sep 20, 2007, 1:50:53 AM
to Dmitry Karasik, perl5-...@perl.org
On Wed, 19 Sep 2007, Dmitry Karasik wrote:

Hi Dmitry,



> Jan> Why do you need a change in perl-level semantics?
>
> Because I don't see how the semantics in current perl would fit this
> purpose. If you have in mind some other syntax that would fit better,
> please share. I made the proposal to the best of my knowledge of how
> perl unicode works, and of course I'm not claiming authority.

I envision everything working completely transparently (at least on
Windows), the way Unicode support was originally intended to be: you
should not have to care whether your string is encoded in UTF8 or not;
it behaves exactly the same.

On some operating systems we may need to specify additional hints
at the Perl level via pragmata or whatever, but on systems that have
a Unicode file system interface we can make things work automatically.

> Jan> I think this breaks down for functions like link() and rename().
> Jan> Yes, you can make the mechanism even more complicated, but it becomes
> Jan> very hacky. Note that we need to define an API that is also very
> Jan> easy to use from XS extensions.
>
> Not really. If the semantics are deduced, as I propose, from SvUTF8,
> then all you need to tell a 2-parameter function is which parameter is
> unicode and which is not. In case one parameter is in bytes and another
> is in utf8, the win32 layer will call the OS-level name translation to
> upgrade the bytes parameter to wide char.
>
> I agree though that if the semantics are deduced from some global
> pragma, then it indeed becomes hacky.

I think your and my intended solutions are pretty close: you are using
an interpreter-global variable to specify that a particular parameter is
UTF8 or not. I would just pass the SV instead and not bother with global
state. E.g. I would define an additional fopen_sv() like this:

FILE *fopen_sv(SV *filename, char *mode);

and use that in the Perl internals everywhere instead of fopen(). The
default implementation could be

#define fopen_sv(f,m) fopen(SvPV_nolen(f), (m))

which would give you the same semantics we have right now. Same thing
for all the other file system functions in the CRT. Once that is working
we can define platform specific versions of fopen_sv() etc. that look at
the SvUTF8 flags of the passed in SVs and e.g. call the wide character
APIs when the flag is set.

On systems without a wide character filesystem API we need to do something
more complicated, based on a filesystem encoding pragma or whatnot.

The different implementations of the *_sv() API should best be implemented
by iperlsys.h, which should be supported by all platforms and not just Windows
(this is the "Virtualize Operating System Access" issue from perltodo.pod).

Of course things get a little messier once you realize that pp_open
doesn't just call fopen(), but do_openn() etc. We need to avoid moving
from SV* to char* prematurely in these call chains so that the lower
level functions still have an SV* with all the flags to work with.

Cheers,
-Jan


Dmitry Karasik

Sep 20, 2007, 7:00:51 AM
to Glenn Linderman, Jan Dubois, Dmitry Karasik, perl5-...@perl.org

Glenn> set being equivalent to ISO-8859-1, and the Windows file system
Glenn> default 8-bit character set (code page) being something else
Glenn> (usually, at least from the CMD Prompt)? Which means that ASCII
Glenn> names work fine, but extended ASCII names get unexpected
Glenn> translations?

That is actually true. However, this only matters if a string with the
SvUTF8 flag set contains invalid characters, in which case Windows of
course cannot decide exactly what it should do. Possibly its default
behavior will be good enough - when converting from a supposedly utf8
string, it will either convert invalid characters to the current locale,
replace them with '?', or simply skip them. In either case, I think this
logic shouldn't be treated as a special case by perl (not least because
calling an extra utf8_valid_string on each IO syscall is not kosher :),
and OS-specific behavior should apply.

--
Sincerely,
Dmitry Karasik

Dmitry Karasik

Sep 20, 2007, 6:53:36 AM
to Jan Dubois, Dmitry Karasik, perl5-...@perl.org
Hi Jan!

Jan> I think your and my intended solutions are pretty close: you are
Jan> using an interpreter-global variable to specify that a particular
Jan> parameter is UTF8 or not. I would just pass the SV instead and not
Jan> bother with global state. E.g. I would define an additional
Jan> fopen_sv() like this:
Jan> FILE *fopen_sv(SV *filename, char *mode);

I would also do that; there's no question about it. The
interpreter-global variable was chosen so that other system
implementations don't need to be touched at all, and so that third-party
XS modules can keep the default behavior (and binary compatibility, but
that's just a side-effect).

Wouldn't it then be a good idea to implement unicode support, at least
as a first attempt, using the global variable, and then, if everything
behaves as expected, move on to SV parameters?

--
Sincerely,
Dmitry Karasik

John E. Malmberg

Sep 20, 2007, 9:59:56 AM
to Dmitry Karasik, perl5-...@perl.org
Dmitry Karasik wrote:
> John> Create a file name using characters that require Unicode
> John> encoding.
> John> Create a UTF-8 representation of that filename and create a file
> John> with that name in an empty directory.
> John> Create the wide (UCS-2) representation of the above file name.
> John> Use the wide open routine to try to open the existing file that
> John> that you just created.
> John> If that step succeeds, then it means that your platform treats UTF-8
> John> and UCS-2 representations as the same filename transparently, and it
> John> means that much if any of your hacks are not needed.
> John> If that step fails, then you have the exact same issue as VMS, where
> John> UTF-8 filenames and "wide" filenames are treated as different files,
> John> and that the same special handling is needed to know if a file name
> John> string with the SvUTF8 flag needs to be passed through as binary or
> John> converted to "wide" for use with a "wide" call.
>
> That is the case for win32, yes.

Which is the case? Win32 translates UTF-8 <-> UCS-2 automatically or
does not translate?

> John> And in the case that the step fails, then you need guidance from
> John> external to the program as to how to handle the UTF-8 code.
>
> For win32 it is not so; there are 2 different cases. For names with
> unicode characters that can be mapped to the system codepage, U8 and
> U16 files will be the same. Otherwise, the U8 API simply won't
> recognize these files.

I really do not understand the concept of codepages; they appear to be
translations from the internal storage in 8 bits to the 'index' of where
the character is in the displayed font.

Is this an auto-conversion of some filenames and not others?

Or does it depend on whether the characters outside the ASCII character
set are encoded in UTF-8 or not?

For the following please understand that I do not have a Unicode/UTF-8
translation handy, so these are contrived examples that may not have
valid codepoints in them.

If a VMS filename contains a VTF-7 sequence of '^Uxxxx' such as
'device:[dir]FOO^U0123.type', then the entire file specification on disk
is stored in wide format and treated as UCS-2, and converted back to
VTF-7 when returned to an application. There are iconv routines to
convert from VTF-7 to UCS-2 and such, but not directly to UTF-8 on VMS.


If the VMS filename contains just binary characters such as
'device:[dir]foo^8E^4A.type', then the two hex characters following the
^ are converted to a binary value in the on disk storage.


The VMS C library does translations from UNIX syntax to VMS syntax
automatically and does not have a concept of UTF-8 or VTF-7, so anything
that is not a printing character in DEC-MCS (close to ISO-8859-1) is
converted to ^xx notation when going from UNIX syntax to native.


I think this is also the case with UNIX: it does not check whether a
filename is UTF-8 or not; it just passes the codes through.


So if Win32 requires filenames coming through the UNIX IO and STDIO
library calls to conform to UTF-8, then it is not compatible with UNIX.


It is my understanding that all UCS-2 can be mapped to UTF-8 and back,
so it would be odd for the system code page to affect the automatic
conversion, but then again, I have indicated that I do not fully
understand the concept of the system code page.

> John> filenames and file systems with methods and properties that are
> John> unique to them.
>
> For VMS, I tend to agree. Also, as Juerd pointed out, the SvUTF8 flag
> might not be the best indicator for the desired semantics, so if there
> were any other SV flag or global variable for this, the same could be
> done to select the VMS filename layer. I guess.

The thing is that you really do not want most programs to care whether
the on-disk storage of filenames is UTF-8 or UCS-2; you just want it to
work.

The issue with UTF-8 in filenames is only an issue in the cases of:

1. The native file system does not allow the binary values that can be
present in UTF-8. (VMS ODS-2 for example)

2. The native file system allows UTF-8 and also a different Unicode
encoding, and has different binary storage for each encoding.
(VMS ODS-5 for example)

Where does Win32 fit in with NTFS and FAT, and ?


And this answer can affect the LINUX world, as apparently NTFS file
systems can now be mounted on LINUX, which means that filenames stored
in UCS-2 need to be handled somehow.

I am aware of ODS-2 file readers for UNIX and Windows, but not of
anything that will mount it as a file system. ODS-5 is not that much
different from ODS-2, though, so such programs could be updated.

Dmitry Karasik

Sep 20, 2007, 10:46:16 AM
to John E. Malmberg, Dmitry Karasik, perl5-...@perl.org

John> Which is the case? Win32 translates UTF-8 <-> UCS-2 automatically
John> or does not translate?

No, it does not. Win32 has 2 APIs, one for U16 and another for U8 names.
When a U16 call is issued, no translation is undergone. Otherwise, the
current system codepage is used to map U8 to U16. Win32 doesn't know
anything about UTF8, so a file created with the U8 API will have its
name mangled according to the current codepage.

John> I really do not understand the concept of codepages, they appear to
John> be translations from the internal storage in 8 bits to the the
John> 'index' of where the character is in for the displayed font.

Sort of, because the FS, GUI, and whatnot APIs on Win32 all have two
versions, U8 and U16.

John> Is this an auto-conversion of some filenames and not others?

Never.

John> If a VMS filename contains a VTF-7 sequence of '^Uxxxx' such as
John> 'device:[dir]FOO^U0123.type', then the entire file specification on
John> disk is stored in wide format and treated as UCS-2, and converted
John> back to VTF-7 when returned to an application. There are iconv
John> routines to convert from VTF-7 to UCS-2 and such, but not directly
John> to UTF-8 on VMS.

So, if I understand correctly (and correct me if I'm wrong), VMS in this
regard exposes another layer of the problem, because on Win32 only the
U16 calls can be used to work with unicode filenames, whereas on VMS
both the VTF-7 and UNIX conventions can be used to store UTF-8. That
means that whichever semantics perl might adopt for unicode filenames,
they won't be sufficient for VMS. If I'm correct, then another bit of
information should be introduced along with the unicode flag, namely to
choose between VTF-7 and UNIX. IIUC, Latin-1 names in VTF-7 would also
be prepended with ^U, right?

John> anything that is not a printing character in DEC-MCS (close to
John> ISO-8859-1) is converted to ^xx notation when going from UNIX syntax
John> to native.

But it is not necessarily so, is it? Can one create a UNIX filename
with characters above 0x7f?

John> I think this is also the case with UNIX, it does not check to see if
John> a filename is UTF-8 or not, it just passes the codes through.

Yes.

John> So if Win32 requires filenames coming through the UNIX IO and STDIO
John> library calls to conform to UTF-8, then it is not compatible with UNIX.

No, it doesn't. It's not compatible with UNIX either, though :)

John> The thing is that you really do not want most programs to care if
John> the on disk storage of filenames is UTF-8 or UCS-2, you just want it
John> to work.
John> The issue with UTF-8 in filenames is only an issue in the cases of:
John> 1. The native file system does not allow the binary values that can
John> be present in UTF-8. (VMS ODS-2 for example)
John> 2. The native file system allows UTF-8 and also a different Unicode
John> encoding, and has different binary storage for each encoding. (VMS
John> ODS-5 for example)
John> Where does Win32 fit in with NTFS and FAT, and ?

I think that the question should rather be reversed, in terms of how
information travels between the FS API, perl scalars, and other user
input and IO. Basically, a unicode scheme must allow the following:

a) Given a string of bytes that has meaning as a filename for other
programs (OS-dependent strings), it must be understood by
open/stat/rename etc. It need not necessarily be converted to proper
UTF8 by perl means.

b) readdir() and glob(), if asked to work with unicode semantics, should
return UTF8 strings that make sense when fed to open/stat/etc. They need
not necessarily convert to byte strings that make sense to other
programs.

c) The two 'not necessarily' clauses above can be mitigated by an
OS-dependent XS module providing conversion between bytes/characters and
the OS-dependent representation. Ideally, of course, such a set of
modules should be a part of standard perl IO.

So, back to your question: NTFS and FAT fit nicely here - if a perl
program is given a string of bytes, it will be mapped internally by
Windows to U16 using the current codepage, and it's not our business
anymore whether there's NTFS or FAT behind the scenes. If readdir
returns utf8-flagged scalars, these are understood by the win32 wrappers
for open/stat etc. and converted to U16. Finally, as Windows byte
filenames cannot (sometimes) be converted to UTF8 by perl means,
Win32.xs provides this conversion.

For VMS, if a ^U byte string is given to perl, it should choose the
VTF-7 layer, and the UNIX layer otherwise; this part is fine. However,
what should readdir() do if it encounters a valid UTF8 string using the
UNIX layer? If it flags it UTF8, then further calls to open() will
create a file with ^U prepended, which is wrong. If it doesn't flag it
with UTF8, then open() will be ok, but users will be confused, expecting
a UTF8 name and getting bytes. Clearly, another bit of information is
needed.

John> And this answer can affect the LINUX world, as apparently NTFS file
John> systems can now be mounted on LINUX, which means that filenames
John> stored in UCS-2 need to be handled somehow.

That's beyond our scope - let /sbin/mount_ntfs deal with NTFS naming
business. If a program running under linux attempts IO on a mounted
NTFS/FAT, it should do so using purely unix semantics. After all,
mount_fat and mount_ntfs do have options for character translations.

--
Sincerely,
Dmitry Karasik

Jan Dubois

Sep 20, 2007, 3:27:16 PM
to Dmitry Karasik, perl5-...@perl.org
On Thu, 20 Sep 2007, Dmitry Karasik wrote:
> Hi Jan!
>
> Jan> I think your and my intended solutions are pretty close: you are
> Jan> using an interpreter-global variable to specify that a particular
> Jan> parameter is UTF8 or not. I would just pass the SV instead and not
> Jan> bother with global state. E.g. I would define an additional
> Jan> fopen_sv() like this:
> Jan> FILE *fopen_sv(SV *filename, char *mode);
>
> I would also do that, there's no question about it. The interpreter-global
> variable was chosen so that other system implementations don't need to be
> touched at all, and possibly 3-rd party XS modules keep the default
> behavior (and binary compatibility, but that's just a side-effect).

Note that fopen_sv() would be defined *in addition* to the redirection
of fopen(). Old modules would continue to compile, but would of course
continue to have issues with Unicode filenames on at least some platforms.



> Wouldn't it be then a good idea to implement unicode support, at least a first
> attempt, using the global variable, and then, if everything will behave
> as expected, move on to SV parameters?

I think adding code to set these globals all over the core would result
in a big mess that you later would have to clean up anyways. This makes
forensic investigation into Perl internals so much more painful than it
already is. For example look at the pp_open issue I mentioned before:
Where would you set the globals, and where would you unset them? It would
be much better to just pass an SV to do_openn() instead of setting the
globals before calling into do_openn().

I would therefore prefer to first put the proper infrastructure into place,
making sure everything continues to work, and then implement the changes
one operating system at a time.

Cheers,
-Jan

Jan Dubois

Sep 20, 2007, 3:43:18 PM
to Dmitry Karasik, John E. Malmberg, perl5-...@perl.org
On Thu, 20 Sep 2007, Dmitry Karasik wrote:
> Basically, a unicode scheme must allow the following:
>
> a) Given a string of bytes that has meaning as a filename for other
> programs (OS-dependent strings), it must be understood by
> open/stat/rename etc. It need not necessarily be convertible to proper
> UTF8 by perl means.

I disagree with this. Filenames should always be valid strings in Perl,
either byte strings or UTF8 strings. It is up to the system call
wrappers to transform them into the platform-specific encoding.

> b) readdir() and glob(), if asked to work in unicode semantics, should
> return UTF8 strings that will make sense when fed to
> open/stat/ etc. They should not necessarily convert to byte strings
> that will make sense to other programs.

I disagree with this too. I think readdir() and glob() should always
return valid Perl strings. If possible they should be downgraded to
byte strings to make it possible to pass them to other programs that
are not UTF8 aware. If they can't be downgraded, well, then they can't
and remain UTF8 encoded.

> c) The two 'not necessarily' clauses above can be mitigated by an
> OS-dependent XS module providing conversion between bytes/characters
> and the OS-dependent representation. Ideally, of course, such a set of
> modules should be part of standard perl IO.

Yes, there is certainly a need for additional OS-specific glue. But
I would expect most of the common cases to work automagically.

E.g. one thing that would be useful on Windows is to allow the default
translation between byte strings and UTF8 to assume the system codepage
instead of ISO-8859-1 encoding for byte strings.

Cheers,
-Jan


Jan Dubois

Sep 20, 2007, 4:48:19 PM
to Dmitry Karasik, Glenn Linderman, perl5-...@perl.org
On Thu, 20 Sep 2007, Dmitry Karasik wrote:
> Hi Glenn!
>
> Glenn> If Perl is using ISO-8859-1 and Windows is set to some other code
> Glenn> page, then any characters that do not have the same binary codes in
> Glenn> both code pages would be translated to a different character in the
> Glenn> Unicode file name than the Perl user specified/expected. That is
> Glenn> my point, and you've stated it doesn't matter? Or maybe I don't
> Glenn> understand what you've written.
>
> That is exactly what I've written. Let me expand though why it wouldn't
> matter in the proposed implementation (because in the current state, you're
> absolutely right, it does). Latin1 characters with code > 0x7f are
> presented in UTF8 with 2 bytes, and thus Latin1 byte strings with
> these characters will be different from Latin1 UTF8 strings. If we're
> talking about Latin1 byte and utf8 strings where all characters < 0x80,
> these are indeed the same, but, even being passed through different U8
> and U16 Windows APIs respectively, they will still point to the
> same filename.

I think you are missing the point: strings in Perl are silently upgraded
from byte strings to UTF8, and Perl assumes that the byte strings are
encoded in ISO-8859-1. If they are not, then the upgraded string will no
longer be the correct name for the filesystem object. E.g.

my $dir = "..."; # byte string containing non-ISO-8859-1 character
# from the local ANSI codepage
my $file = "..."; # UTF8 encoded string

open(my $fh, "<$dir/$file") or die;

In this case the $dir name must be upgraded to UTF8 encoding to be
concatenated with $file. If you don't use the local ANSI codepage to
upgrade but simply assume ISO-8859-1, then you have a good chance that
your open() call will now fail.

So at least an option to treat byte strings as being encoded in the
current ANSI codepage would be rather useful on Windows.
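
As a sketch of what such an option could look like today, done by hand with
the existing Win32 and Encode modules ($ansi_bytes and $file below are just
placeholder variables; Win32::GetACP() returns the ANSI codepage number):

use Win32;
use Encode qw(decode);

my $acp = 'cp' . Win32::GetACP();      # e.g. 'cp1252' or 'cp1251'
my $dir = decode($acp, $ansi_bytes);   # upgrade via the ANSI codepage,
                                       # not the implicit ISO-8859-1
open(my $fh, '<', "$dir/$file") or die;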

This is quite orthogonal to providing Unicode filesystem access in
general and would be useful even with the current version of Perl. I
have this in my todo pile, along with 200 other things that would be
nice to have. I did start with the "use 8.3 filenames for Unicode names
that don't map to the ANSI codepage" that is already in 5.10-tobe and
just haven't gotten around to the other things yet.

Cheers,
-Jan

Jan Dubois

Sep 20, 2007, 5:31:46 PM
to Dmitry Karasik, Glenn Linderman, perl5-...@perl.org
On Thu, 20 Sep 2007, Dmitry Karasik wrote:
> Jan> I think you are missing the point: strings in Perl are silently
> Jan> upgraded from byte strings to UTF8, and Perl assumes that the
> Jan> byte strings are encoded in ISO-8859-1. If they are not, then
> Jan> the upgraded string will no longer be the correct name for the
> Jan> filesystem object. E.g.
>
> Ah. I did miss that point, yes. The silent upgrade is indeed evil in
> that case, and will result in broken names. I actually don't know how
> to handle that. Supplying each scalar with an optional encoding field
> that will be used during upgrade..? I don't know, sounds extreme and
> inefficient. Any ideas?

I don't think tracking an encoding per scalar is going to be useful.
You should have one encoding for byte strings (either ISO-8859-1, or
one selected by a user pragma), and UTF8 strings are always encoded
in UTF8. You can always upgrade from byte string to UTF8 without
losing information.
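
That losslessness is easy to check; utf8::upgrade changes only the internal
representation, not the characters:

my $bytes = "\xE4";       # "ä" when read as ISO-8859-1
my $chars = $bytes;
utf8::upgrade($chars);    # now stored as UTF8 internally
print $bytes eq $chars ? "equal\n" : "different\n";   # prints "equal"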

> Jan> So at least an option to treat byte strings as being encoded in
> Jan> the current ANSI codepage would be rather useful on Windows.
>
> IIRC 'use locale' should do that? Should it be accepted that whoever
> relies on concatenation of byte and character strings, should use
> locale (or whatever switches to ANSI codepage).

I think

use encoding ':locale';

is supposed to do that. Except that it doesn't work on Windows because
it doesn't use the POSIX locale system. This needs to be fixed, which
is really the issue we are talking about.

Cheers,
-Jan


Jan Dubois

Sep 20, 2007, 5:56:00 PM
to Glenn Linderman, Dmitry Karasik, perl5-...@perl.org
On Wed, 19 Sep 2007, Glenn Linderman wrote:
> > E.g. I would define an additional fopen_sv() like this:
> >
> > FILE *fopen_sv(SV *filename, char *mode);
> >
> > and use that in the Perl internals everywhere instead of fopen(). The
> > default implementation could be
> >
> > #define fopen_sv(f,m) fopen(SvPV_nolen(f), (m))
> >
> > which would give you the same semantics we have right now.
>
> It may, but isn't there an issue of perl's default 8-bit character set
> being equivalent to ISO-8859-1, and the Windows file system default
> 8-bit character set (code page) being something else (usually, at least
> from the CMD Prompt)? Which means that ASCII names work fine, but
> extended ASCII names get unexpected translations? So doesn't this mean
> that it would be better to use the wide character APIs all the time, to
> get consistent semantics?

As I wrote already to Dmitry, I think we need a way to use the ANSI codepage
in Perl for the byte string encoding. This is necessary for correct upgrading
to UTF8. Everything else will then work automatically (once we pass the
SV's down to the low level functions so we can call the wide char APIs).

Filenames provided via the commandline from cmd.exe are another issue,
as they are encoded in the OEM codepage and not the ANSI codepage for
historical reasons. But we can translate them to either ANSI, or UTF8
if they don't map into ANSI on program startup, so let's not worry about
this for now.
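
A minimal sketch of such a startup translation with existing modules
(Win32::GetOEMCP() returns the console's OEM codepage number; error
handling omitted):

use Win32;
use Encode qw(decode);

my $oem = 'cp' . Win32::GetOEMCP();        # e.g. 'cp866' on Russian systems
@ARGV = map { decode($oem, $_) } @ARGV;    # cmd.exe arguments arrive
                                           # OEM-encoded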

> Using the wide character APIs all the time would mean always translating
> to UCS-2 at the Perl interfaces (inside your fopen_sv), but calling the
> Windows 8-bit APIs goes through a "thunking layer", which does basically
> the same thing, so doing it ourselves shouldn't cost more, and should give
> more consistent character set handling... ???

It doesn't really matter as long as we are using the correct encoding for
byte strings. We need the 8-bit API anyways for old XS code.

Another issue, that I would want to ignore for the moment is Windows 9x
support, which doesn't have functional wide character APIs. This can be
worked around by using the "Microsoft Layer for Unicode" libraries for
Win9X, or we could always use the 8-bit API, or just forget that Win9X
ever existed (I already know that some people disagree with that, so
please ignore the 3rd choice).

Cheers,
-Jan


Jan Dubois

Sep 20, 2007, 10:35:58 PM
to Glenn Linderman, Dmitry Karasik, John E. Malmberg, perl5-...@perl.org
On Thu, 20 Sep 2007, Glenn Linderman wrote:
> Every imaginable Windows code page is not implemented... only a
> specific set. And I doubt any Windows supported code page contains
> characters not in Unicode. That's the example I was looking for...

Since the code page maps a specific encoding to Unicode, it is not
really possible to point to a character not *in* Unicode. You can
point to a codepoint that is not in use currently but that would be
pretty useless (you also wouldn't be able to display the character,
or use it in filenames etc).

> Can you reference software to allow the creation and installation of
> custom Windows code pages, thus allowing user-defined (every
> imaginable) Windows code pages?

User defined code pages are not supported on Windows. You are supposed
to "use Unicode" instead...

Cheers,
-Jan

John E. Malmberg

Sep 21, 2007, 12:26:02 AM
to Glenn Linderman, Dmitry Karasik, perl5-...@perl.org
Glenn Linderman wrote:
>
> Why not convert everything to UTF-8 ?

Unix and ODS-5 on VMS both support filenames that can contain sequences
that are not legal for UTF-8.

> Win32 doesn't even allow UTF-8 filenames. Only UCS-2. And the 8-bit
> API provides the "fiction" that it allows filenames in the current
> code page (which is a subset of Unicode, hence a subset of UCS-2
> characters, and the UCS-2 name is stored on disk).
>
> Has anyone made the claim that Win32 is compatible with UNIX?

I would have to do research to see if there are any claims backed by
Microsoft or one of the providers of the POSIX layers.

> There are a variety of "POSIX" layers and/or "Unix-like mappings"
> that can be run in or on top of Windows, that attempt to make Win32
> more Unix like, but I'm not sure they have achieved full
> compatibility.

It may be important to research and test. The features in ODS-5 were
designed with the goal of being 100% compatible with Win32 filenames,
and the ability to store 8 bit binary filenames.

There is already an issue of programs ported from UNIX, which use UTF-8
on-disk format, ending up on systems that expect VTF-7 on-disk format.

It may be that programs in the POSIX subsystem have the ability to store
binary 8 bit filenames, using a file system API or option flag that is
not well known, which makes it difficult to mix those applications with
native applications.

John E. Malmberg

Sep 20, 2007, 11:59:28 PM
to Dmitry Karasik, perl5-...@perl.org
Dmitry Karasik wrote:

> So, if I understand correctly, and correct me if I'm wrong, VMS in this
> regard exposed another layer to the problem, because on Win32 only the U16
> calls can be used to work with unicode filenames, whereas on VMS both
> VTF-7 and UNIX conventions can be used to store UTF-8. That means, that
> whichever semantics perl might adopt for unicode filenames, it won't
> be sufficient for VMS. If I'm correct, then another bit of information
> should be introduced along with the unicode flag, namely to choose between
> VTF-7 and UNIX. IIUC, Latin-1 names in VTF-7 would also be prepended with
> ^U, right?

Yes, Latin-1 names can be prepended with ^U; however, if all of the
characters turn out to fit in the 8 bit character set, I do not know
which on-disk format will be used.

> John> anything that is not a printing character in DEC-MCS (close to
> John> ISO-8859-1) is converted to ^xx notation when going from UNIX syntax
> John> to native.
>
> But it is not necessarily so, is it? Can one create UNIX filename
> with characters above 0x7f?

Yes, UNIX does not care. UTF-8 was designed to take advantage of that
so that the Unix file systems and programs would not have to be
converted to handle Unicode filenames.

A UNIX filename can be constructed that would be invalid as UTF-8, and
ODS-5 on VMS allows the same.

ODS-5 on VMS allows characters in a filename that are not allowed on
UNIX, as '/', '\\', and '\0' are legal characters in a ODS-5 filename.

> So, back to your question, NTFS and FAT fit nicely here - if a perl
> program is given a string of bytes, it will be mapped internally by Windows
> to U16 using the current codepage, and it's not our business anymore
> whether there's NTFS or FAT behind the scenes. If readdir returns
> utf8-flagged scalars, these are understood by win32-wrappers for open/stat
> etc and converted to U16. Finally, as Windows byte filenames cannot
> (sometimes) be converted to UTF8 by perl, Win32.xs provides this conversion.
>
> For VMS, if a ^U byte string is given to perl, it should choose the VTF-7
> layer, and the UNIX layer otherwise; this part is fine. However, what should
> readdir() do if it encounters a valid UTF8 string using the UNIX layer? If
> it flags it UTF8, then further calls to open() will create a file with ^U
> prepended, which is wrong. If it doesn't flag it with UTF8, then open() will
> be ok, but users will be confused, expecting a UTF8 name and getting bytes.
> Clearly, another bit of information is needed.

Yes, it is an issue because the programmer normally should not need to
know how the unicode is actually going to be stored on disk.

At the same time, a programmer may need to force the issue one way or
another.

However much of Perl assumes that filenames are in UNIX format, and
there are a lot of "hidden" places where Perl on VMS translates
filenames between "UNIX" format and VMS format.

If a VTF-7 file spec is converted to UTF-8, and then is passed to the 8
bit file system API, the file system API will get a legal filename, but
will pass it through as binary instead of converting it back to VTF-7.

In order to handle VTF-7, VMS would need to have a wrapper around every
API that references a filename, and would need to know if it needed
conversion to VTF-7 or passed through as binary.

Right now is the time to come up with a robust design, as the VMS
implementation of Perl is basically ignorant of either way VMS stores
Unicode filenames, so there are no existing VMS-specific perl modules
to break.

There is an issue though that non-UTF-8 and non-Latin filenames can
exist on UNIX and nothing should be done to change this. VMS should
also learn to deal with these same filenames.

I am still thinking that the thing to do at first is to leave the 8 bit
API alone and implement a class to handle filenames, so that we can tag
them with the attributes that are needed.

And this class can help us shake out what can be done to make things
more efficient in the main perl program to handle Unicode file
specifications.

As I pointed out in another post, there are ways of encoding ISO 8 bit
character sets on VMS in use, and a class may be the place to add
support for those.

Dmitry Karasik

Sep 20, 2007, 4:42:36 PM
to Glenn Linderman, Dmitry Karasik, John E. Malmberg, perl5-...@perl.org

Glenn> Give an example of a Windows byte file name that cannot be
Glenn> converted to UTF-8 by perl?

Easily -- my current Russian Windows codepage. The environment doesn't
have $LANG defined -- how would a perl program in that environment
figure out what conversion table to use? Suppose there's a way --
but can anyone guarantee that every imaginable Windows codepage
(GetACP returns integers 1250 and onwards) can be unambiguously
translated to an encoding name that Encode would understand? If yes,
can anyone guarantee the same on any other platform that supports
character semantics in the filesystem?

If yes, then I must say this is excellent -- easier for us.
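
One way to probe that question on a given machine, as a sketch
(Encode::find_encoding returns undef for names it does not know):

use Win32;
use Encode qw(find_encoding);

my $enc = find_encoding('cp' . Win32::GetACP());
print $enc ? "Encode knows this codepage\n"
           : "no Encode mapping for this ACP\n";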

--
Sincerely,
Dmitry Karasik

Dmitry Karasik

Sep 20, 2007, 4:57:34 PM
to Jan Dubois, Dmitry Karasik, perl5-...@perl.org
Hi Jan!


Jan> I think adding code to set these globals all over the core would
Jan> result in a big mess that you later would have to clean up anyways.

You have a point, but I would actually disagree, because setting and clearing
these globals is a matter of a mere two additional lines per syscall.
Yes, the perl code is a mess, but the proposed changes to the core are
actually really minor, so while you're correct about the whole situation, I
think that in this case it is driving the argument to an extreme.

Also, adding a whole new layer of SV-based calls is definitely beyond
both my capabilities and my understanding of perl.

--
Sincerely,
Dmitry Karasik

Dmitry Karasik

Sep 20, 2007, 5:05:16 PM
to Jan Dubois, Dmitry Karasik, Glenn Linderman, perl5-...@perl.org

Jan> I think you are missing the point: strings in Perl are silently
Jan> upgraded from byte strings to UTF8, and Perl assumes that the byte
Jan> strings are encoded in ISO-8859-1. If they are not, then the upgraded
Jan> string will no longer be the correct name for the filesystem
Jan> object. E.g.

Ah. I did miss that point, yes. The silent upgrade is indeed evil in that
case, and will result in broken names. I actually don't know how to handle
that. Supplying each scalar with an optional encoding field that will be
used during upgrade..? I don't know, sounds extreme and inefficient.
Any ideas?

Jan> So at least an option to treat byte strings as being encoded in the
Jan> current ANSI codepage would be rather useful on Windows.

IIRC 'use locale' should do that? Should it be accepted that whoever
relies on concatenation of byte and character strings should use locale
(or whatever switches to the ANSI codepage)?


--
Sincerely,
Dmitry Karasik

Dmitry Karasik

Sep 21, 2007, 12:59:00 AM
to Jan Dubois, Dmitry Karasik, Glenn Linderman, perl5-...@perl.org
Hi Jan!


Jan> use encoding ':locale';

Jan> is supposed to do that. Except that it doesn't work on Windows
Jan> because it doesn't use the POSIX locale system. This needs to be
Jan> fixed, which is really the issue we are talking about.

Yes, I agree. IIUC, should that invocation be the documented and proper use
for byte strings (when fixed)?

--
Sincerely,
Dmitry Karasik

Ben Morrow

Sep 20, 2007, 2:10:55 PM
to perl5-...@perl.org

Quoth wb8...@qsl.net ("John E. Malmberg"):

> Dmitry Karasik wrote:
> > John> Create a file name using characters that require Unicode
> > John> encoding.
> > John> Create a UTF-8 representation of that filename and create a file
> > John> with that name in an empty directory.
> > John> Create the wide (UCS-2) representation of the above file name.
> > John> Use the wide open routine to try to open the existing file that
> > John> that you just created.
> > John> If that step succeeds, then it means that your platform treats UTF-8
> > John> and UCS-2 representations as the same filename transparently, and it
> > John> means that much if any of your hacks are not needed.
> > John> If that step fails, then you have the exact same issue as VMS, where
> > John> UTF-8 filenames and "wide" filenames are treated as different files,
> > John> and that the same special handling is needed to know if a file name
> > John> string with the SvUTF8 flag needs to be passed through as binary or
> > John> converted to "wide" for use with a "wide" call.
> >
> > That is the case for win32, yes.
>
> Which is the case? Win32 translates UTF-8 <-> UCS-2 automatically or
> does not translate?

Does not translate. Win32 has two sets of APIs for handling files; one
expects names to be in the current 8-bit codepage, the other expects
names to be in UTF-16 (not UCS-2).

> > John> And in the case that the step fails, then you need guidance from
> > John> external to the program as to how to handle the UTF-8 code.
> >
> > For win32 it is not so, there are 2 different cases. For names
> > with unicode characters that can be mapped to the system codepage,
> > U8 and U16 files will be the same. Otherwise, the U8 API simply
> > won't recognize these files.
>
> I really do not understand the concept of codepages, they appear to be
> translations from the internal storage in 8 bits to the 'index' of
> where the character is in the displayed font.

A codepage is simply an 8-bit character set; none of the windows
codepages are equivalent to iso8859-1, so none is a direct subset of
Unicode.

> Is this an auto-conversion of some filenames and not others?
>
> Or does it depend on if the characters out side the ASCII character set
> are encoded in UTF-8 or not?

Windows does not understand filenames in UTF8 at all. For Perl to open a
file with a UTF8 name it would have to translate to UTF-16 and then pass
the name to the appropriate 'wide' syscall.
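
A hedged illustration of what such a wrapper would do, using the non-core
Win32API::File module (the filename is arbitrary; error handling omitted):

use Encode qw(encode);
use Win32API::File qw(:ALL);

my $name  = "\x{0444}\x{0430}\x{0439}\x{043B}.txt";   # Cyrillic letters
my $wname = encode('UTF-16LE', $name . "\0");         # NUL-terminated
my $h = CreateFileW($wname, GENERIC_READ(), FILE_SHARE_READ(),
                    [], OPEN_EXISTING(), 0, []);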

> For the following please understand that I do not have a Unicode/UTF-8
> translation handy, so these are contrived examples that may not have
> valid codepoints in them.
>
> If a VMS filename contains a VTF-7 sequence of '^Uxxxx' such as
> 'device:[dir]FOO^U0123.type', then the entire file specification on disk
> is stored in wide format and treated as UCS-2, and converted back to
> VTF-7 when returned to an application. There are iconv routines to
> convert from VTF-7 to UCS-2 and such, but not directly to UTF-8 on VMS.

This sort of escaping in filenames is not supported by Win32 at all. The
names passed to the APIs are expected to be either literal UTF16 or
literal bytes from the codepage.

> I think this is also the case with UNIX, it does not check to see if a
> filename is UTF-8 or not, it just passes the codes through.

Unix treats filenames as a sequence of bytes, which may not include '/'
or "\0". Any further interpretation of those bytes as UTF8 or any other
character set is entirely up to the application. Some unixes are
beginning to standardise on all applications considering filenames to be
UTF8, but the process is not complete. Many Unix filesystems will
contain files whose names are not valid UTF8.
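
For example, a name containing the byte 0xFF, which cannot occur in
well-formed UTF-8, is a perfectly legal Unix filename -- a sketch:

use Encode qw(decode FB_CROAK);

my $name = "not-utf8-\xFF";
open my $fh, '>', $name or die $!;   # fine: Unix stores the raw bytes
close $fh;
my $ok = eval { decode('UTF-8', $name, FB_CROAK); 1 };
print $ok ? "valid UTF-8\n" : "not valid UTF-8\n";   # "not valid UTF-8"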

> > John> filenames and file systems with methods and properties that are
> > John> unique to them.
> >
> > For VMS, I tend to agree. Also, as Juerd pointed out, SvUTF8 flag might
> > not be the best indicator for the desired semantics, and so if there
> > be any other SV flag or global variable for this, the same could be done
> > to select VMS filename layer. I guess.
>
> The thing is that you really do not want most programs to care if the on
> disk storage of filenames is UTF-8 or UCS-2, you just want it to work.
>
> The issue with UTF-8 in filenames is only an issue in the cases of:
>
> 1. The native file system does not allow the binary values that can be
> present in UTF-8. (VMS ODS-2 for example)
>
> 2. The native file system allows UTF-8 and also a different Unicode
> encoding, and has different binary storage for each encoding.
> (VMS ODS-5 for example)
>
> Where does Win32 fit in with NTFS and FAT, and ?

My understanding is that NTFS/VFAT/Joliet(CDROMs) always store filenames
(or, at least, long filenames) as UTF-16. Any translation from the
system codepage is done before accessing the filesystem.

> And this answer can affect the LINUX world as apparently NTFS file
> systems can now be mounted on LINUX, which means that filenames stored
> in UCS-2 need to be handled some how.

All facilities for mounting Win32 filesystems under Linux/BSD/&c. allow
the user to specify what the 'local' character set is; the kernel
filesystem driver then translates from UTF-16. Obviously, unless UTF8 is
chosen as the local character set some filenames are untranslatable; the
various systems have various ways of dealing with that, none of which
should affect Perl.

Ben

Ben Morrow

Sep 20, 2007, 10:55:58 AM
to perl5-...@perl.org

Quoth dmi...@karasik.eu.org (Dmitry Karasik):

>
> Glenn> set being equivalent to ISO-8859-1, and the Windows file system
> Glenn> default 8-bit character set (code page) being something else
> Glenn> (usually, at least from the CMD Prompt)? Which means that ASCII
> Glenn> names work fine, but extended ASCII names get unexpected
> Glenn> translations?
>
> That is actually true. However this only matters if a string with SvUTF8
> flag set contains invalid characters,

No, it matters if a string *without* SvUTF8 contains characters which
are different in iso8859-1 and the current windows codepage. For
instance, with the 1252 codepage and this suggested behaviour,

my $file = "\x80";
open my $ANSI, '<', $file or die "ANSI file doesn't exist";
utf8::upgrade $file;
open my $UNICODE, '<', $file or die "Unicode file doesn't exist";

the first open will attempt to open a file whose name is a Euro
character, while the second will attempt to open a file whose name is
the C1 PAD character. They should have the same effect. Of course, the
situation will be much worse with codepages that differ more from
iso8859-1.

Ben

Dmitry Karasik

Sep 20, 2007, 4:33:37 PM
to Glenn Linderman, Dmitry Karasik, Jan Dubois, perl5-...@perl.org
Hi Glenn!

Glenn> If Perl is using ISO-8859-1 and Windows is set to some other code
Glenn> page, then any characters that do not have the same binary codes in
Glenn> both code pages would be translated to a different character in the
Glenn> Unicode file name than the Perl user specified/expected. That is
Glenn> my point, and you've stated it doesn't matter? Or maybe I don't
Glenn> understand what you've written.

That is exactly what I've written. Let me expand though on why it wouldn't
matter in the proposed implementation (because in the current state, you're
absolutely right, it does). Latin1 characters with codes > 0x7f are
represented in UTF8 with 2 bytes, and thus Latin1 byte strings with
these characters will be different from Latin1 UTF8 strings. If we're
talking about Latin1 byte and utf8 strings where all characters are < 0x80,
these are indeed the same, but, even being passed through the different U8
and U16 Windows APIs respectively, they will still point to the
same filename.

If you're referring to the fact that byte strings passed to the U8
Windows API, both now and in the proposed implementation, will be treated
as raw bytes in the current codepage, then yes, they surely will. This
issue wasn't addressed in my proposal, because I thought the current
behavior is good enough. If it is not, then one might argue that it
would be necessary to first convert all byte strings into UTF8 using the
charmap from the current locale or from 'use encoding'. However, this
approach would be dangerous in the case when the string is already in an
OS-dependent encoding, e.g. was passed to the program from outside. I'd
suggest that the byte semantics stay the same, namely, OS-defined behavior
disregarding perl's current encoding. This is debatable, however, so I'd
like to hear opinions.

Glenn> Actually, it is quite impossible for an 8-bit character string
Glenn> (non-SvUTF8) to contain invalid characters, because all the code
Glenn> pages are fully defined with 256 entries, all of which can be
Glenn> mapped to Unicode.

If the Windows codepage is funky, then it is not necessarily so. Perl's
Encode might not know how exactly to convert these bytes. I'd suggest
the wording "it is impossible for an 8-bit string to contain characters
that will not be recognized by the OS U8 API".

Glenn> A string with SvUTF8 set should be interpreted with Unicode
Glenn> semantics, and should never be passed to the 8-bit Windows calls,

I've never suggested that :)

Glenn> If Windows replaced invalid characters with '?', then '?' is
Glenn> illegal in file names, and would definitely be rejected by APIs
Glenn> expecting file names.
Glenn> I doubt Windows will skip characters it doesn't understand, but
Glenn> rather produce an error, but I haven't tested that either, using
Glenn> the wide character API.

Yes, I'd say if one sends an invalid UTF8 string to an OS syscall, then
all bets are off.


--
Sincerely,
Dmitry Karasik

Dmitry Karasik

Sep 20, 2007, 4:50:48 PM
to Jan Dubois, Dmitry Karasik, John E. Malmberg, perl5-...@perl.org
Hi Jan!


Jan> I disagree with this. Filename should always be valid strings in
Jan> Perl, either byte strings or UTF8 strings.

As I've just answered to Glenn Linderman, I'm far from sure that
this conversion is possible with Encode for every Windows locale.
If we teach Encode to talk to the OS-level translation mechanism
though, then indeed filenames will be valid strings, convertible
to UTF8 with some special encoding. That would also solve the problem
when a filename is supplied to the program from outside ( ARGV, ENV )
in the OS-specific encoding.


--
Sincerely,
Dmitry Karasik

John E. Malmberg

Sep 21, 2007, 9:49:34 AM
to Dmitry Karasik, perl5-...@perl.org
Dmitry Karasik wrote:
> Hi John!
>
> John> Yes, it is an issue because the programmer normally should not need
> John> to know how the unicode is actually going to be stored on disk.
> John> At the same time, a programmer may need to force the issue one way
> John> or another.
>
I'd say a dedicated global variable, something like
${^VMS_UNICODE_IS_VTF7}, should be sufficient for a first attempt.

I have a VMS specific global latent in VMS.C that does that. It is
currently not visible to Perl programs.

However there still is another problem to address.

if you have File::Spec->catfile($Binarydir1, $Binarydir2, $utf8file),
then for file systems like UNIX and VMS ODS-5 it is incorrect to
promote $Binarydir1 and $Binarydir2 to SvUTF8 just because $utf8file has
the SvUTF8 flag set.

The encoding of each component of a directory specification needs to be
tracked separately from each other and of the filename.

The handling of file specifications as strings just does not have the
ability to cover this issue.
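
A sketch of what the string-based handling does today on a Unix box (the
byte values are arbitrary; utf8::is_utf8 reports the SvUTF8 flag):

use File::Spec;

my $bin  = "\xC0\xC1";        # raw bytes, not meant as ISO-8859-1
my $file = "\x{0444}.txt";    # SvUTF8-flagged name
my $path = File::Spec->catfile($bin, $file);
# The join upgraded $bin as if it were ISO-8859-1; the per-segment
# encodings are gone, and one flag now covers the whole string:
print utf8::is_utf8($path) ? "utf8\n" : "bytes\n";    # prints "utf8"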

VMS can tell me if a component is in VTF-7; it can not tell me what
encoding is in use if it is not.

readdir() on UNIX can also not tell me what encoding is present. We can
test for valid UTF-8, but that also adds overhead and may still give the
wrong answer.

In addition, on VMS, there are Pathworks V5 and Pathworks V6, and at least
one NFS encoding, that Perl currently is not aware of. Pathworks V5 and
V6 encode characters that are illegal on ODS-2 with "__XX", where XX is the
ASCII hex code for that character. There are systems with large directory
trees which have these encodings, and while they look normal when viewed
from a mounted share by a non-VMS OS, on the VMS host they are currently
not readable.

This is another reason that it may be better to attack this from the
point of creating a class to deal with volumes, directories, and file
specifications that may or may not be in Unicode.

Dr.Ruud

Sep 22, 2007, 6:41:28 AM
to perl5-...@perl.org
"John E. Malmberg" schreef:

> This is another reason that it may be better to attack this from the
> point of creating a class to deal with volumes, directories, and file
> specifications that may or may not be in Unicode.

Sounds like a sibling of bigint, but for strings.

For example, implemented as an array of string parts, where each
string part has its own encoding etc., and starts at a specified offset
in the previous result (0=prepend, -1=concat).
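
A tiny hedged sketch of that idea (all names here are invented for
illustration):

package String::Parts;
sub new       { bless { parts => [] }, shift }
sub push_part {               # each part carries its own encoding
    my ($self, $bytes, $encoding) = @_;
    push @{ $self->{parts} }, [ $bytes, $encoding ];
    return $self;
}

package main;
my $spec = String::Parts->new
    ->push_part("dir\xC0", "cp1251")    # byte part with a known codepage
    ->push_part("file",    "UTF-8");    # character part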

--
Affijn, Ruud

"Gewoon is een tijger."

Dmitry Karasik

Sep 21, 2007, 8:11:36 AM
to demerphq, Dmitry Karasik, Jan Dubois, Glenn Linderman, perl5-...@perl.org

demerphq> Please can we leave use locale out of this if at all possible?
demerphq> It introduces all kinds of insanity at all kinds of levels and
demerphq> if we can deal with unicode filesystems without involving use
demerphq> locale then it would be a very good thing.

Well. We can leave locale out of it, but then something else that tells
Encode to use the system codepage for upgrading from Latin-1 to Unicode
should be introduced. I'm not sure that I like 'use encoding q(locale)'
either, but nothing looks closer.

--
Sincerely,
Dmitry Karasik

John E. Malmberg

Sep 23, 2007, 12:38:59 AM
to Dmitry Karasik, perl5-...@perl.org
Dmitry Karasik wrote:
> Hi John!
>
> John> if you have File::Spec->catfile($Binarydir1, $Binarydir2,
> John> $utf8file), then for file systems like UNIX and VMS ODS-5 it is
> John> incorrect to promote $Binarydir1 and $Binarydir2 to SvUTF8 just
> John> because $utf8file has the SvUTF8 flag set.
> John> The encoding of each component of a directory specification needs to
> John> be tracked separately from each other and of the filename.
> John> The handling of file specifications as strings just does not have
> John> the ability to cover this issue.
>
> You're right. OTOH, any VMS filename, be it in VTF-7 or UNIX, can be
> unambiguously addressed using byte semantics (and this is not the case
> on win32 with wide characters). Therefore, I'd propose to internally
> convert all parameters passed to File::Spec::VMS::catfile() to bytes,
> and only then concatenate. The conversion, if any, will produce either
> a VTF7 or UTF8 byte string, depending on the value of the global
> ${^VMS_FS_UNICODE}; the result will not be unicode, but will be a valid
> VMS filename nevertheless.

The problem is still that we do not have the information to do the
proper conversion to bytes to or from whatever the encoding is for each
segment when the file specification is in UNIX syntax.

It is extremely easy to lose track of the encoding in a string,
especially one used for file specifications, which is in reality a
packed array of strings and delimiters.

As I have not worked with Unicode directly yet and have not done the
basic research, I do not know what the result in Perl is for $result =
$utf8_string . $byte_string; Is the result tagged as a utf8 string? Is
$byte_string auto-converted?

And what happens if $byte_string has sequences in it that are not valid
Unicode?

The modules in blead perl do string concatenation everywhere, so this is
an issue.

And File::Spec::VMS currently only understands VMS ODS-2 syntax, and not
the ODS-5 syntax, or VMS UNIX compatibility modes.

My current plan is to have VMS::Filespec::vmsify($vmsspec, $unixspec)
and VMS::Filespec::rmsexpand($vmsspec, $filespec) follow a logical name,
which in this case is used like an environment variable to determine
whether UTF-8 sequences found in UNIX file specifications should be
passed through or encoded as VTF-7.

That plan allows manual conversions. That still leaves VTF-7 filenames
that are translated to UNIX format unusable by any of the C library
routines, until wrappers are put around all of them to effectively call
C<rmsexpand> on their filename arguments.

Which means at that point VMS will be close to what you were originally
proposing, but it still has limitations.

I would like to see some generic UNIX to Native and Native to UNIX
routines be developed to replace many of the routines in VMS::Filespec
to reduce the number of tests for $^O that may be needed for cross platform.

A class for handling file specifications may be the best way to do this.
It also allows the structure containing the file specification to contain
multiple representations of that file specification, such as the UNIX
format, the NATIVE format, a SHORT format, and such.

This handles things like /foo.1.2.3/bar.4.5 in UNIX being treated as
[foo_1_2_3]bar_4.5 on VMS ODS-2 and [foo^.1^.2^.3]bar^.4.5 on VMS ODS-5.

With ODS-2, what happens is that if a character in the UNIX filename
is not legal on ODS-2, perl in most places auto-magically substitutes a
C<_> underscore for it. That means that after such a conversion, the
original UNIX filename can not be recovered.

It also handles the case where VMS has special formats to make sure that
a file specification fits into a 255 character format, because VMS
programs written without support for ODS-5, and the DCL (shell), only
allocate 255 characters for buffer space.

Once a long VMS filename is compressed into 255 characters, it is not
possible to always recover the original filename. Only the filename of
the "primary" link to the file can be obtained. So there is reason for
a filename class to track it.

So I see optional auto-conversions around the syscalls as a "do what I
mean" hack, but they can not be a robust solution. And they require
changes to Perl, with the risk of breaking things.

A class or set of classes, on the other hand, can probably be
implemented without changing perl or risking breakage of existing
programs, and can also be implemented on older versions of Perl.

Tels

Sep 23, 2007, 4:15:31 AM
to perl5-...@perl.org, John E. Malmberg, Dmitry Karasik
Moin,

On Sunday 23 September 2007 06:38:59 John E. Malmberg wrote:
> Dmitry Karasik wrote:
> > Hi John!

[snip]


> The problem is still that we do not have the information to do the
> proper conversion to bytes to or from whatever the encoding is for each
> segment when the file specification is in UNIX syntax.
>
> It is extremely easy to lose track of the encoding in a string,

That's why I am always lamenting that we need a string struct with an
attached encoding, not just some byte buffer and a single-bit flag that
is also fiddled with willy-nilly by everyone and their aunt.

> especially one used for file specifications, which is in reality a
> packed array of strings and delimiters.
>
> As I have not worked with Unicode directly yet and have not done the
> basic research, I do not know what the result in Perl is for $result =
> $utf8_string . $byte_string; Is the result tagged as a utf8 string? Is
> $byte_string auto-converted?

Yes, this is what happens. The reason is that utf8 can represent all
characters, but the byte encoding (assumed to be iso-8859-1) can only
represent a few (255 + 0x0) characters. So the safe (but slow) way is to
always convert the result into UTF-8.

Think of it like:

my $x = Math::BigInt->new('123');
$x += 4;

Even tho 127 would fit into a scalar, the result is still a BigInt.

> And what happens if $byte_string has sequences in it that are not valid
> Unicode?

First, the byte string in ISO-8859-1 has only single-byte "sequences". And
every single byte is a valid character IIRC.

If there is a single byte that is not a valid character (this can happen
with some single byte encodings that have bytes that are not assigned to
anything, like 0xa0 in some encodings), it is converted to the Unicode
replacement character (codepoint U+FFFD):

http://www.fileformat.info/info/unicode/char/fffd/index.htm

Any other byte has (IIRC) a Unicode character assigned somewhere.

So afterwards you always have a valid UTF-8 string.
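
The concatenation case asked about above can be observed directly;
utf8::is_utf8 reports the internal flag:

my $utf8_string = "\x{263A}";    # carries the SvUTF8 flag
my $byte_string = "\xE4";        # plain bytes, taken as ISO-8859-1
my $result      = $utf8_string . $byte_string;
print utf8::is_utf8($result) ? "utf8\n" : "bytes\n";   # prints "utf8"
# $byte_string itself is unchanged; only its copy inside $result was
# upgraded.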

> The modules in blead perl do string concatenation everywhere, so this is
> an issue.

Yep. Especially as Perl is free to convert strings from/to UTF-8 at any
point.

As for the VMS stuff, it makes my head spin, so I won't comment on it :)

All the best,

Tels

--
Signed on Sun Sep 23 10:07:47 2007 with key 0x93B84C15.
View my photo gallery: http://bloodgate.com/photos
PGP key on http://bloodgate.com/tels.asc or per email.

"Any sufficiently advanced technology is indistinguishable from a rigged
demo."

-- Andy Finkel, computer guy

Dmitry Karasik

Sep 22, 2007, 5:15:21 AM
to John E. Malmberg, Dmitry Karasik, perl5-...@perl.org
Hi John!

John> if you have File::Spec->catfile($Binarydir1, $Binarydir2,
John> $utf8file), then for file systems like UNIX and VMS ODS-5 it is
John> incorrect to promote $Binarydir1 and $Binarydir2 to SvUTF8 just
John> because $utf8file has the SvUTF8 flag set.
John> The encoding of each component of a directory specification needs to
John> be tracked separately from each other and of the filename.
John> The handling of file specifications as strings just does not have
John> the ability to cover this issue.

You're right. OTOH, any VMS filename, be it in VTF-7 or UNIX, can be
unambiguously addressed using byte semantics (and this is not the case
on win32 with wide characters). Therefore, I'd propose to internally
convert all parameters passed to File::Spec::VMS::catfile() to bytes,
and only then concatenate. The conversion, if any, will produce either
a VTF7 or UTF8 byte string, depending on the value of the global
${^VMS_FS_UNICODE}; the result will not be unicode, but will be a valid
VMS filename nevertheless.

--
Sincerely,
Dmitry Karasik

John E. Malmberg

Sep 23, 2007, 10:18:29 AM
to Dmitry Karasik, John E. Malmberg, perl5-...@perl.org
Dmitry Karasik wrote:
> John> not the ODS-5 syntax, or VMS UNIX compatibility modes.
>
> John> This handles things like /foo.1.2.3/bar.4.5 in UNIX being treated
> John> as [foo_1_2_3]bar_4.5 on VMS ODS-2 and [foo^.1^.2^.3]bar^.4.5 on
> John> VMS ODS-5.
>
> I have a question, the answer to which I think will be crucial here.
> Is there any average, or indeed any, program that works well with
> both VTF7 and UTF8 under unix?

VTF-7 only exists on VMS.

With Unix there is ISO encodings, pure binary, and UTF-8.

> How do _they_ manage to avoid ambiguities?

With Unix, there is no conversion, everything is just a string of bytes.

> Namely, if you ask a program to save a file under
> "/path/^Uvtf7name/unix-utf8name/", what would be a generally accepted
> behavior?

On UNIX, you would get exactly that.

There is no way to express VTF-7 in a Unix syntax.

In VMS syntax you can have path:[^Uxxxx_vtf7.^xx^xx_utf8name].
Internally, '^Uxxxx_vtf7' would be stored in wide characters, and
'^xx^xx_utf8name' would be stored in 8 bit characters.

So this would be a global issue.

> I'm thinking about a hypothesis, if there is a generally accepted notion
> of two kinds of unicode under VMS, and people do not expect that one set
> of programs works with another scheme, would it be a good idea for perl
> to behave the same way?

No, there is not a generally accepted notion of two types of Unicode.

What happened is that VTF-7 was developed to support international
character sets in Pathworks/Advanced Server (CIFS server). It may also
be in use for non-latin versions of VMS.

So there is one major program and unknown number of other programs
expecting VTF-7 convention. The iconv routines on VMS support
conversions of VTF-7 to many other character encodings, but not to or
from UTF-8.

Many VMS programmers are totally ignorant of UTF-8. But due to ODS-5
transparently supporting UTF-8, an unknown body of programs, including
SAMBA (CIFS server), are out in the wild and support and use UTF-8.

The two camps seemed to be separate until someone tried to make SAMBA
and Advanced Server serve the same directories.

> Of course, rather than hardcoding one or another way
> to transcode UTF8, make that conversion depend on ${^UNICODE_IS_VTF7}. The
> consequence would be that UTF8 files will be invisible (or severely
> mangled) when ${^UNICODE_IS_VTF7} is 1, and the same but vice versa for
> VTF7 otherwise. That is probably not a really smart proposition, but if,
> as I'm guessing, it goes well along with general VMS user expectations,
> then possibly there is something in it?

VMS users expect it to just work, and only now some are just finding out
that there is a conflict, because most new development is in C and
transparently supports UTF-8.

Dmitry Karasik

Sep 23, 2007, 5:25:54 AM
to John E. Malmberg, Dmitry Karasik, perl5-...@perl.org
John> not the ODS-5 syntax, or VMS UNIX compatibility modes.

John> This handles things like /foo.1.2.3/bar.4.5 in UNIX being treated as
John> [foo_1_2_3]bar_4.5 on VMS ODS-2 and [foo^.1^.2^.3]bar^.4.5 on VMS
John> ODS-5.

I have a question, the answer to which I think will be crucial here.
Is there any average, or indeed any, program that works well with
both VTF7 and UTF8 under unix? How do _they_ manage to avoid ambiguities?

Namely, if you ask a program to save a file under
"/path/^Uvtf7name/unix-utf8name/", what would be a generally accepted
behavior?

I'm thinking about a hypothesis: if there is a generally accepted notion
of two kinds of unicode under VMS, and people do not expect that one set
of programs works with another scheme, would it be a good idea for perl
to behave the same way? Of course, rather than hardcoding one or another way
to transcode UTF8, make that conversion depend on ${^UNICODE_IS_VTF7}. The
consequence would be that UTF8 files will be invisible (or severely mangled)
when ${^UNICODE_IS_VTF7} is 1, and the same but vice versa for VTF7 otherwise.
That is probably not a really smart proposition, but if, as I'm guessing,
it goes well along with general VMS user expectations, then possibly there
is something in it?

--
Sincerely,
Dmitry Karasik

Jan Dubois

Sep 23, 2007, 2:27:44 PM
to Glenn Linderman, demerphq, John E. Malmberg, Dmitry Karasik, perl5-...@perl.org
On Fri, 21 Sep 2007, Glenn Linderman wrote:
> On approximately 9/21/2007 3:05 AM, came the following characters from
> the keyboard of demerphq:
> > Actually this is not correct. While it is true that Win2k (and maybe
> > earlier NT's) uses UCS-2, XP and later use UTF-16. In XP they added
> > surrogate pair support and so do a proper UTF-16 implementation.

The switch from UCS-2 to UTF-16 was done in Win2K, not WinXP. Note that
this mostly affects the display subsystems, and string sorting functions.
I doubt they changed anything at the filesystem layer. I would suspect
that you could use surrogate characters in NTFS filenames even on WinNT,
they would just not be displayed correctly by the rest of the OS.

> Thanks for the heads up. So that means that Windows is less compatible
> with itself than before, as XP data is now no longer compatible with
> prior versions of Windows. But more compatible with Unicode, so that is
> good.

You are missing some historical perspective. Surrogate code points and
UTF-16 were defined as concepts in Unicode 2, released in July 1996.
Windows NT 4.0 was released in September 1996 and did not yet support
these new concepts.

Note also that UCS-2 and UTF-16 are *identical* representations of
all the characters they have in common (the code points used for
surrogate pairs are unassigned in UCS-2). Therefore there isn't any
compatibility breakage.

It's like claiming that Perl 5.10 is more compatible with Perl 6 than
Perl 5 because the new switch() statement and ~~ operator may work with
Perl 6, but certainly doesn't with Perl 5.8... :)

> If only they'd add UTF-8 support to their 8-bit APIs...

I doubt that will ever happen. The Unicode API is 16 bit, and you have
MultiByteToWideChar() to translate from any codepage (including UTF-7,
and UTF-8) to UTF-16, so it is trivial to write the wrapper yourself.

The 8-bit API is just calling MultiByteToWideChar() for you with the
codepage set to CP_ACP (ANSI codepage), and then passing the UTF-16
result on to the 16-bit API. This is done for backward compatibility
with old applications that use the 8-bit API. But adding a wrapper for
all possible codepages and encodings would just be bloat.

Cheers,
-Jan

John E. Malmberg

Sep 23, 2007, 2:52:13 PM
to Dmitry Karasik, perl5-...@perl.org
Dmitry Karasik wrote:
>
> John> What happened is that VTF-7 was developed to support international
> John> character sets in Pathworks/Advanced Server (CIFS server). It may
> John> also be in use for non-latin versions of VMS.
> John> So there is one major program and unknown number of other programs
> John> expecting VTF-7 convention. The iconv routines on VMS support
> John> conversions of VTF-7 to many other character encodings, but not to
> John> or from UTF-8.
>
> Right. So, for example, if you tell such a program to save a file under
> "/^Uxxxvtf-7/utf8-bytes/file", would that be possible? How would such
> a program resolve the ambiguity?

It is not ambiguous because ^Uxxxx notation can only be used in VMS
syntax, not in UNIX syntax. ^Uxxxx notation can be converted to a wide
character string.

When a UNIX format filename is converted to VMS format, the '^c' and
'^xx' are generated as part of the conversion, where the '^c' are
characters that could be mistaken for punctuation. No conversion
routines for wide character set filenames are available.

When a VMS format filename is converted to UNIX, the '^xx' and '^c'
codes are converted to a single binary 8 bit value. Filenames with
'^Uxxxx' can not be converted by the built in routines.

As an extension, I have been working on getting Perl on VMS to convert
the VMS '^Uxxxx' format to UNIX UTF-8, setting the SvUTF8 flag, and to
follow a logical name that indicates how the reverse conversion should
be done.

> John> Many VMS programmers are totally ignorant of UTF-8. But due to
> John> ODS-5 transparently supporting UTF-8, an unknown body of programs,
> John> including SAMBA (CIFS server) are out in the wild and support and
> John> use UTF-8.
> John> The two camps seemed to be separate until someone tried to make
> John> SAMBA and Advanced Server serve the same directories.
>
> So, if SAMBA tries to access a VTF-7 paths, does it fail?

Yes, they are invisible to it. And Advanced Server displays the
UTF-8 files as hex encoded strings instead of the name that the PC
client saved them as. All in all, a mess.

> >> Of course, rather than hardcoding one or another way to transcode UTF8,
> >> make that conversion depend on ${^UNICODE_IS_VTF7}. The consequence
> >> would be that UTF8 files will be invisible (or severely mangled) when
> >> ${^UNICODE_IS_VTF7} is 1, and the same but vice versa for VTF7
> >> otherwise. That is probably not a really smart proposition, but if, as
> >> I'm guessing, it goes well along with general VMS user expectations,
> >> then possibly there is something in it?

> John> VMS users expect it to just work, and only now some are just finding
> John> out that there is a conflict, because most new development is in C
> John> and transparently supports UTF-8.
>
> Right. So, if I understand it correctly, Perl's readdir/glob would return
> verbatim "^Uxxx" when it reads a VTF-7 name? If so, then why would it be a
> bad idea to return the same "^Uxxx" but with the SvUTF8 bit set when asked
> to work under unicode semantics, if I am right that "^U" is seen explicitly
> in the result?

Because you only see the '^Uxxxx' or '^xx' or '^x' notation on ASCII
filenames in VMS syntax. They will never show up in a UTF-8 encoded
UNIX format path name.

If VMS is in UNIX report mode, then it should be returning the names in
UTF-8 encoding if needed. That mode is not currently implemented.

Jan Dubois

Sep 23, 2007, 6:50:28 PM
to Glenn Linderman, demerphq, John E. Malmberg, Dmitry Karasik, perl5-...@perl.org
On Sun, 23 Sep 2007, Glenn Linderman wrote:
>> The 8-bit API is just calling MultiByteToWideChar() for you with the
>> codepage set to CP_ACP (ANSI codepage), and then passing the UTF-16
>> result on to the 16-bit API. This is done for backward compatibility
>> with old applications that use the 8-bit API. But adding a wrapper
>> for all possible codepages and encodings would just be bloat.
>
> My understanding was that the 8-bit API used the current code page,
> not the ANSI code page. So instead of one code page, and instead of
> all code pages, it would seem to be all but one or two, which is my
> understanding of the current state of affairs... you can't use UTF-8
> (or UTF-7, and maybe a few others I'm unaware of).

You are mixing the ANSI codepage and the OEM codepage here. There is
only a single ANSI codepage on a given Windows system. Changing it
requires a reboot (you can use the AppLocale utility to forge the
setting for a specific application, but that is somewhat of an
extreme measure).

All 8-bit APIs use this ANSI codepage, unless the app has called
SetFileApisToOEM(), in which case it uses the current console code page,
which can be changed using e.g. chcp.exe. I haven't tried it, but you
could try setting the OEM codepage to UTF-8 and switch the filesystem
APIs to OEM to see if that works. But as I pointed out earlier, using
CP_UTF8 as the OEM codepage is not really supported by Microsoft and
does break some functionality, so I would not recommend that.

> That's why it seems to me that adding UTF-8 support would be
> appropriate, but I'm not holding my breath...

I think we are a bit myopic here. Perl is pretty much alone in trying
to have full Unicode support internally while also using UTF-8 for the
internal representation. It looks like most other applications (e.g.
Mozilla), Languages (e.g. Python, Tcl), VMs (e.g. Java, .NET) or GUI
systems (e.g. Windows, OS X, KDE, Qt) are all using UTF-16 for internal
storage because it is a lot easier to process. Maybe there is no huge
demand for an 8-bit Unicode API?

Cheers,
-Jan

Dmitry Karasik

Sep 23, 2007, 1:07:10 PM
to John E. Malmberg, Dmitry Karasik, perl5-...@perl.org

>> I have a question, the answer to which I think will be crucial here.
>> Is there any average, or indeed any, program that works well with both
>> VTF7 and UTF8 under unix?
John> VTF-7 only exists on VMS.

Duh. I made a mistake - the question should read "under VMS", of course.
So, if your answer would be different then,

>> How do _they_ manage to avoid ambiguities?

>> Namely, if you ask a program to save a file under
>> "/path/^Uvtf7name/unix-utf8name/", what would be a generally accepted
>> behavior?

>> I'm thinking about a hypothesis, if there is a generally accepted
>> notion of two kinds of unicode under VMS, and people do not expect that
>> one set of programs works with another scheme, would it be a good idea
>> for perl to behave the same way?

John> What happened is that VTF-7 was developed to support international
John> character sets in Pathworks/Advanced Server (CIFS server). It may
John> also be in use for non-latin versions of VMS.
John> So there is one major program and unknown number of other programs
John> expecting VTF-7 convention. The iconv routines on VMS support
John> conversions of VTF-7 to many other character encodings, but not to
John> or from UTF-8.

Right. So, for example, if you tell such a program to save a file under
"/^Uxxxvtf-7/utf8-bytes/file", would that be possible? How would such
a program resolve the ambiguity?

John> Many VMS programmers are totally ignorant of UTF-8. But due to
John> ODS-5 transparently supporting UTF-8, an unknown body of programs,
John> including SAMBA (CIFS server) are out in the wild and support and
John> use UTF-8.
John> The two camps seemed to be separate until someone tried to make
John> SAMBA and Advanced Server serve the same directories.

So, if SAMBA tries to access a VTF-7 path, does it fail?

>> Of course, rather than hardcoding one or another way to transcode UTF8,
>> make that conversion depend on $^{UNICODE_IS_VTF7}. The consequence
>> would be that UTF8 files will be invisible (or severely mangled) when
>> $^{UNICODE_IS_VTF7} is 1, and the same but vice versa for VTF7
>> otherwise. That is probably not a really smart proposition, but if, as
>> I'm guessing, it goes well along the general VMS user expectations, then
>> possibly there is something in it?

John> VMS users expect it to just work, and only now some are just finding
John> out that there is a conflict, because most new development is in C
John> and transparently supports UTF-8.

Right. So, if I understand it correctly, Perl's readdir/glob would return
verbatim "^Uxxx" when it reads a VTF-7 name? If so, then why would it a
bad idea to return same "^Uxxx" but with SvUTF8 bit set when asked to
work under unicode semantics, if I am right that "^U" is seen explicitly
in the result?

--
Sincerely,
Dmitry Karasik

John E. Malmberg

Sep 23, 2007, 9:20:37 PM
to Dmitry Karasik, perl5-...@perl.org
Dmitry Karasik wrote:
> Hi John!
>
> >> So, if SAMBA tries to access a VTF-7 path, does it fail?
> John> Yes, they are invisible to it. And Advanced Server displays the
> John> UTF-8 files as hex-encoded strings instead of the name that the PC
> John> client saved them as. All in all, a mess.
> ...
> John> Because you only see the '^Uxxxx' or '^xx' or '^x' notation on ASCII
> John> filenames in VMS syntax. They will never show up in a UTF-8 encoded
> John> UNIX format path name.
>
> So, am I right that for a unix-type program it is an accepted behavior to
> be blissfully ignorant of VTF-7 names? Let me propose a fictitious table
> of how, depending on settings and unicode flag, perl would translate file
> names on VMS. Let's say we have 2 global boolean variables, one
> $^{UNICODE_FILENAME_SEMANTICS} and another $^{UNICODE_FILENAME_IS_VTF7},
> which, together with the SvUTF8 flag, select the IO type:
>
> perl scalar|SvUTF8|$U_F_S|$UF_IS_VTF7| IO
> --------------------------------------------
> $ascii | 0 | any | any | UNIX
> $utf8 | 1 | 0 | any | UNIX
> $utf8 | 1 | 1 | 0 | UNIX
> $utf8 | 1 | 1 | 1 | VTF7

> Would that be a satisfactory layout for a new unicode semantics in perl?
> The consequences would be:

First problem is that as near as I can tell, I can not rely on the
SvUTF8 being properly set on the scalar when converting a name to VMS
format. There are too many ways that it can be lost or set when it
should not be.

I also have doubts that there is any way to set the
$^{UNICODE_FILENAME_SEMANTICS} reliably either.

The files on disk are what they are, and they may or may not be Unicode
or UTF-8. A general purpose program should not try to select which type
of files it will deal with.

Any flag that you can set or try to calculate inside the program is
likely to be wrong for some cases.


So to get close to a Do What I Mean mode, without breaking backwards
compatibility, I am working on doing the following in VMS, which is
close to what you were originally proposing.


1. When C<VMS::Filespec::unixify> encounters a VTF-7 sequence, it
will convert it to UTF-8 and set the SvUTF8 flag.

2. When C<unixify> encounters an encoded UTF-8 sequence, it will also
set the UTF-8 flag.

These first two are assuming that no one would put both sequences in the
same file specification, which is not something that can be guaranteed.

3. C<VMS::Filespec::vmsify> can not trust the UTF-8 flag, but will look
at a VMS private logical name (similar to an environment variable) that
indicates that it should produce VTF-7 encoding. The default will be to
just do the translation.
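
A hypothetical sketch of the step-1 expansion, assuming the '^Uxxxx'
escape always carries a 4-hex-digit UCS-2 code point (the helper name
and the escape grammar are illustrative only, not checked against real
VMS behavior):

    # hypothetical: expand ^Uxxxx escapes into characters while unixifying
    sub expand_vtf7 {
        my ($name) = @_;
        my $n = $name =~ s/\^U([0-9A-Fa-f]{4})/chr hex $1/ge;
        utf8::upgrade($name) if $n;   # step 1: make sure SvUTF8 is set
        return $name;
    }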

It is also possible that if unixify() finds a VTF-7 sequence it could
set the environment variable so that future vmsify() calls would
translate it back; however, I am not sure that I want to do that.

Currently many, but not all of the UNIX file system calls on VMS have a
hidden call to vmsify() in them, which means that the ones that do, will
pick up support of both flavors of Unicode.

So the final step for supporting VTF-7 would be to add vmsify() wrappers
to the rest of the filename handling routines after the rest of the
kinks are worked out.

Wrappers are done by adding macros to *ish.h, which means that these
changes usually do not affect common code.

> So, I'd like to ask what would you say, on behalf of VMS users, about
> this scheme? Would you say that there are viable choices in a-c?
> Would those choices, if taken as a prototype of unicode filename
> support for VMS, make the state of perl better with filenames than
> it is now?

Right now, Perl on VMS only supports ODS-2 filenames, which are
restricted to a subset of printable ASCII in upper case.

Some ODS-5 filenames will work, but many still will not. So that
backwards compatibility is maintained, a new API is needed to let perl
modules like File::Spec know what mode VMS is in, and to allow
setting the mode.

> This is the best I can think of. If you have better propositions,
> let me know.

I am trying to get support in for the ODS-5, and that includes trying to
get support for UTF-8 and VTF-7.

What I do not have is enough knowledge of UCS-2 and UTF-8 to create
test file names to verify the work that I have done so far.

So what I think is needed is to have an external environment variable
control how ambiguous cases are handled, which will likely cover most of
the common cases. Right now I have a VMS private one that is not
documented, so it could become one that is common across platforms.

But I still think that a robust solution is to create a class or classes
to handle filenames, so that the attributes of each section can be tracked.

-John
wb8...@qsl.network
Personal Opinion Only


Demerphq

Sep 24, 2007, 4:20:50 AM
to Jan Dubois, Glenn Linderman, John E. Malmberg, Dmitry Karasik, perl5-...@perl.org
On 9/23/07, Jan Dubois <ja...@activestate.com> wrote:
> On Fri, 21 Sep 2007, Glenn Linderman wrote:
> > On approximately 9/21/2007 3:05 AM, came the following characters from
> > the keyboard of demerphq:
> > > Actually this is not correct. While it is true that Win2k (and maybe
> > > earlier NT's) uses UCS-2, XP and later use UTF-16. In XP they added
> > > surrogate pair support and so do a proper UTF-16 implementation.
>
> The switch from UCS-2 to UTF-16 was done in Win2K, not WinXP. Note that
> this mostly affects the display subsystems, and string sorting functions.
> I doubt they changed anything at the filesystem layer. I would suspect
> that you could use surrogate characters in NTFS filenames even on WinNT,
> they would just not be displayed correctly by the rest of the OS.

I should have read the thread further before replying to Glenn, and I
should have checked my references before posting. You are correct. I
must banish the "Win2k is UCS-2" meme from my brain. Sorry about
spreading misconceptions.

> > Thanks for the heads up. So that means that Windows is less compatible
> > with itself than before, as XP data is now no longer compatible with
> > prior versions of Windows. But more compatible with Unicode, so that is
> > good.
>
> You are missing some historical perspective. Surrogate code points and
> UTF-16 were defined as concepts in Unicode 2, released in July 1996.
> Windows NT 4.0 was released in September 1996 and did not yet support
> these new concepts.
>
> Note also that UCS-2 and UTF-16 are *identical* representations of
> all the characters they have in common (the code points used for
> surrogate pairs are unassigned in UCS-2). Therefore there isn't any
> compatibility breakage.

Not only are the surrogate code points unassigned in UCS-2, they are
reserved for use by UTF-16 alone.

BTW, I found it interesting that Python supported UCS-2 for a long
time and then switched to UCS-4, and never did either UTF-8 or UTF-16
(according to Wikipedia).

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

Demerphq

Sep 24, 2007, 4:42:20 AM
to Jan Dubois, Glenn Linderman, John E. Malmberg, Dmitry Karasik, perl5-...@perl.org

To me this is unsurprising, although I wonder if it's not a self-
fulfilling prophecy. Given that Windows supports UTF-16 natively, and
given that Windows is the most widely deployed and used operating
system, there are a lot of arguments in favour of using UTF-16
internally. IMO it's only our strong *NIX bias and coding style that
led us to use UTF-8, for pretty much the same reasons that UTF-8 is
used on *NIX. (Primarily because of the habit of using null as a
string terminator in interfaces and the desire to use legacy unicode-
unaware filesystems for storage (think slash).) Thus I think that if
Windows had chosen UTF-32 then we would see most of the users of
UTF-16 using UTF-32 as well, for simple economic reasons.

BTW, I don't know if it's correct, but according to Wikipedia (as I
mentioned in a previous reply) Python doesn't do UTF-16 but rather
UCS-4.

http://en.wikipedia.org/wiki/UCS-2#Use_in_major_operating_systems_and_environments

Dmitry Karasik

Sep 23, 2007, 3:26:19 PM
to John E. Malmberg, perl5-...@perl.org
Hi John!

>> So, if SAMBA tries to access a VTF-7 path, does it fail?

John> Yes, they are invisible to it. And Advanced Server displays the
John> UTF-8 files as hex-encoded strings instead of the name that the PC
John> client saved them as. All in all, a mess.
...
John> Because you only see the '^Uxxxx' or '^xx' or '^x' notation on ASCII
John> filenames in VMS syntax. They will never show up in a UTF-8 encoded
John> UNIX format path name.

So, am I right that for a unix-type program it is an accepted behavior to
be blissfully ignorant of VTF-7 names? Let me propose a fictitious table
of how, depending on settings and unicode flag, perl would translate file
names on VMS. Let's say we have 2 global boolean variables, one
$^{UNICODE_FILENAME_SEMANTICS} and another $^{UNICODE_FILENAME_IS_VTF7},
which, together with the SvUTF8 flag, select the IO type:

perl scalar|SvUTF8|$U_F_S|$UF_IS_VTF7| IO
--------------------------------------------
$ascii | 0 | any | any | UNIX
$utf8 | 1 | 0 | any | UNIX
$utf8 | 1 | 1 | 0 | UNIX
$utf8 | 1 | 1 | 1 | VTF7

Would that be a satisfactory layout for a new unicode semantics in perl?
The consequences would be:

a) when $U_F_S is off, the behavior either stays as it is now for backward
   compatibility, or, to explicitly access a VTF7 name, '^U' needs to be
   prepended. I can't make a choice here.

b) when $U_F_S is on and $UF_IS_VTF7 is off, any filename (unicode or
   bytes) is treated as is, basically as on any other UNIX box. To
   explicitly address a VTF7 name, either '^U' needs to be prepended,
   or a tough choice is made to disable VTF7 access. Again, I can't make
   a choice.

c) when both flags are on, a filename with SvUTF8(0) is treated using
   UNIX semantics, and one with SvUTF8(1) with VTF7 semantics. To
   explicitly address a filename with unicode characters but stored as a
   unix filename, either the SvUTF8 flag should be off, or a choice is
   made to disable UNIX unicode access. Same choice problem.
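
Reduced to code, the table and the choices above boil down to something
like this (a sketch; all three names are the hypothetical flags from the
table):

    # VTF7 translation only when the scalar is flagged *and* both
    # global flags are on; every other combination uses UNIX semantics
    sub io_type {
        my ($sv_utf8, $ufs, $is_vtf7) = @_;
        return ($sv_utf8 && $ufs && $is_vtf7) ? 'VTF7' : 'UNIX';
    }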

So, I'd like to ask what would you say, on behalf of VMS users, about
this scheme? Would you say that there are viable choices in a-c?
Would those choices, if taken as a prototype of unicode filename
support for VMS, make the state of perl better with filenames than
it is now?

This is the best I can think of. If you have better propositions,
let me know.

--
Sincerely,
Dmitry Karasik

Tels

Sep 24, 2007, 9:10:24 AM
to perl5-...@perl.org, demerphq, Jan Dubois, Glenn Linderman, John E. Malmberg, Dmitry Karasik
Moin,

On Monday 24 September 2007 10:42:20 demerphq wrote:
> On 9/24/07, Jan Dubois <ja...@activestate.com> wrote:
> > On Sun, 23 Sep 2007, Glenn Linderman wrote:
> > > That's why it seems to me that adding UTF-8 support would be
> > > appropriate, but I'm not holding my breath...
> >
> > I think we are a bit myopic here. Perl is pretty much alone in trying
> > to have full Unicode support internally while also using UTF-8 for the
> > internal representation. It looks like most other applications (e.g.
> > Mozilla), languages (e.g. Python, Tcl), VMs (e.g. Java, .NET) and GUI
> > systems (e.g. Windows, OS X, KDE, Qt) are all using UTF-16 for internal
> > storage because it is a lot easier to process. Maybe there is no huge
> > demand for an 8-bit Unicode API?
>
> To me this is unsurprising, although I wonder if it's not a self-
> fulfilling prophecy. Given that Windows supports UTF-16 natively, and
> given that Windows is the most widely deployed and used operating
> system, there are a lot of arguments in favour of using UTF-16
> internally.

It is even worse. The Windows API supports _only_ conversion _to and from_
UTF-16 (via the WideCharToMultiByte and MultiByteToWideChar routines).

Meaning if your string is in ISO-8859-2 and you want it to be in ISO-8859-1
(or UTF-8 or whatever), you basically have to call the conversion routines
twice, with an intermediate step through UTF-16. This also means allocating
memory twice.

And given that MS basically uses UTF-16 everywhere and makes it the
default (every wide-char constant is by default UTF-16, etc.), you can
see that using something different from UTF-16 probably doesn't even
enter the mind of a typical Windows developer at all.

Basically every doc out there talks about wide chars when
internationalization on Windows is mentioned.

> IMO it's only our strong *NIX bias and coding style that
> led us to use UTF-8, for pretty much the same reasons that UTF-8 is
> used on *NIX. (Primarily because of the habit of using null as a
> string terminator in interfaces and the desire to use legacy unicode-
> unaware filesystems for storage (think slash).)

I thought that UTF-8 was chosen because it happens to use the same bytes
for the ASCII range 0..127, and it doesn't double the size of strings (not
relevant for short strings, but try explaining to the user why his 200
MByte string suddenly takes 400 :-).

But I might remember history wrong as I wasn't involved :)

> Thus I think that if
> Windows had chosen UTF-32 then we would see most of the users of
> UTF-16 using UTF-32 as well for simple economic reasons.

I don't think Windows programmers would have minded UTF-32 either. They
already chose UTF-16 over UTF-8, even though for ASCII text UTF-16 is
twice as big.

UTF-16 (if you ignore surrogate pairs, which a lot of applications probably
simply do) has the nice property of bytes/2 == characters, which was
probably viewed as "simpler" than supporting UTF-8 with
its "complicated" bytes <=> characters rules. Plus, typical Windows
programs just use the API, and wide chars are the easiest there (since
they are native).

All the best,

Tels

--
Signed on Mon Sep 24 15:02:56 2007 with key 0x93B84C15.


View my photo gallery: http://bloodgate.com/photos
PGP key on http://bloodgate.com/tels.asc or per email.

"If you think the problem is bad now, just wait until we've solved it."

-- Arthur Kasspe

Dmitry Karasik

Sep 24, 2007, 2:03:11 AM
to John E. Malmberg, perl5-...@perl.org
Hi John!

John> First problem is that as near as I can tell, I can not rely on the
John> SvUTF8 being properly set on the scalar when converting a name to
John> VMS format. There are too many ways that it can be lost or set when
John> it should not be.

Why? One cannot easily lose the flag, once it is there, and one must
be aware of it (at least, in the current state of things) when working
with other unicode aspects.

John> I also have doubts that there is any way to set the
John> $^{UNICODE_FILENAME_SEMANTICS} reliably either.

It would always be set manually, never automatically.

John> The files on disk are what they are, and they may or may not be
John> Unicode or UTF-8. A general purpose program should not try to
John> select which type of files it will deal with.

It is not the program but the user who should select this; that's my point.

John> Any flag that you can set or try to calculate inside the program is
John> likely to be wrong for some cases.

True. Therefore a well-documented behavior for all cases in all combinations
of flags is a must, so the programmer can set these flags knowing
what can and what cannot be done.

John> 1. When the C<VMS::Filespec::unixify> encounters a VTF-7 sequence,
John> it will convert it to UTF-8 and sets the SvUTF8 flag.
John> 2. When C<unixify> encounters an encoded UTF-8 sequence, it will
John> also set the UTF-8 flag.

Yes, but I also tried to put that in the core, so that the File::Spec
modules, being a testbed here, will either be abandoned altogether, or at
least delegate most of their functionality to the core.

John> 3. C<VMS::Filespec::vmsify> can not trust the UTF-8 flag, but will
John> look at a VMS private logical name (similar to an environment
John> variable) that indicates that it should produce VTF-7 encoding. The
John> default will be just do translation.

That's the purpose of the thing I tried to name as ${^UNICODE_IS_VTF7}.

John> Currently many, but not all of the UNIX file system calls on VMS
John> have a hidden call to vmsify() in them, which means that the ones
John> that do, will pick up support of both flavors of Unicode.

So vmsify() is called by CORE::open etc? I'm confused now.

John> So what I think is needed is to have an external environment
John> variable control how ambiguous cases are handled, which will likely
John> cover most of the common cases. Right now I have a VMS private one
John> that is not documented, so it could become one that is common across
John> platforms.

Also, my idea of ${^UNICODE_IS_VTF7} is that it would get initialized from
the external environment variable.

John> But I still think that a robust solution is to create a class or
John> classes to handle filenames, so that the attributes of each section
John> can be tracked.

Here I doubt the robustness. Imagine you have a class that provides
a full emulation of a scalar by adding a set of extra attributes to it.
Let's say that the class's scalars (X) and native perl scalars (S) are
fully interchangeable (an unreachable ideal, but still), and that it is
enough to produce X by adding attributes (A) to S.
Here come the corner cases:

- Would the result of join('', split('', $X)) be identical to $X (won't
  the attributes be lost)?
- If you have another class, Y, that provides a different set of attributes,
  what would be the result of concatenating $X and $Y?
- Would the result of $X =~ s/(.)/chr ord $1/ge be identical to $X?

As I see the VMS problem, introducing an extra layer of abstraction
(File::Spec) to do such simple operations as path concatenation has already
gone beyond a reasonable threshold of complexity. If it were possible to
maintain a good-enough abstraction where scalar concatenation operations
are identical to filename concatenations (which is also one of the goals
for the win32 approach), I think that would greatly simplify the whole VMS
pathname business. Even at the expense that some filenames would be
inaccessible -- but again, I'm not a VMS user, so I don't know how that
would meet average VMS user expectations.

--
Sincerely,
Dmitry Karasik

Jan Dubois

Sep 24, 2007, 11:27:25 AM
to demerphq, Glenn Linderman, John E. Malmberg, Dmitry Karasik, perl5-...@perl.org
On Mon, 24 Sep 2007, demerphq wrote:
> BTW, I dont know if its correct but according to Wikipedia (as i
> mentioned in a previous reply) Python doesnt do Utf-16 but rather does
> UCS-4.
>
> http://en.wikipedia.org/wiki/UCS-2#Use_in_major_operating_systems_and_environments

I think this information is wrong. Python can be compiled with either
--enable-unicode=ucs2 or --enable-unicode=ucs4, but you can still use
surrogate pairs when building with "ucs2" (characters outside the BMP in
string constants are encoded using a surrogate pair instead of throwing
a compilation error, for example). The only reason to compile Python in
"ucs4" mode is when you are using tkinter, and Tcl has been compiled
with UCS4.

Note that there are generally no advantages to UCS4/UTF32 over UTF16
because even with UCS4 you can end up with variable length encodings
due to the use of combining characters.

Cheers,
-Jan

Demerphq

Sep 24, 2007, 11:57:40 AM
to Tels, perl5-...@perl.org, Jan Dubois, Glenn Linderman, John E. Malmberg, Dmitry Karasik
On 9/24/07, Tels <nospam...@bloodgate.com> wrote:
> Moin,
>
> On Monday 24 September 2007 10:42:20 demerphq wrote:
> > IMO it's only our strong *NIX bias and coding style that
> > led us to use UTF-8, for pretty much the same reasons that UTF-8 is
> > used on *NIX. (Primarily because of the habit of using null as a
> > string terminator in interfaces and the desire to use legacy unicode-
> > unaware filesystems for storage (think slash).)
>
> I thought that UTF-8 was chosen because it happens to use the same bytes
> for the ASCII range 0..127, and it doesn't double the size of strings (not
> relevant for short strings, but try explaining to the user why his 200
> MByte string suddenly takes 400 :-).
>
> But I might remember history wrong as I wasn't involved :)

The history is online and I've posted links to it in other threads on this list.

The original design of UTF8 would have allowed low bit characters like
slash or null to be a subcomponent of a longer encoding. This was
recognized as a mistake as it would not allow utf8 to be used as
filenames on legacy filesystems and operating systems.

The properties of UTF-8 that make it useful for legacy systems are:

A) 1:1 correspondence with ASCII for codepoints 0..127, meaning that
slash and null and other "special" characters do not change.
B) No valid UTF-8 encoding of a codepoint may be the substring of any
valid encoding of any other codepoint.

A+B means that the bytes 0..127 are excluded from the encoding of any
codepoint larger than 127, which in turn means that all the software
that expects nulls to end strings and expects slash to be a directory
separator will be able to work with UTF8 just as though it was working
with Latin-1.
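
That property is easy to verify with the core Encode module (a small
self-check; the sample code points are arbitrary):

    use Encode qw(encode);
    # no byte of a multi-byte UTF-8 sequence falls in 0x00-0x7F, so NUL
    # and '/' can never appear inside an encoded character
    for my $cp (0xE9, 0x2260, 0x10400) {
        my @bytes = unpack 'C*', encode('UTF-8', chr $cp);
        die "ASCII byte leaked" if grep { $_ < 0x80 } @bytes;
        printf "U+%04X => %s\n", $cp,
            join ' ', map { sprintf '%02X', $_ } @bytes;
    }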

It would have been easier and more space-efficient to use an encoding
that allowed 0..127 as a substring of a longer encoding; however, this
would have been clumsy, as it would have had to work around the issue
of null and slash, or it would have been unsuitable for use with legacy
operating systems and file systems. So while space efficiency was
probably an objective, any place where it contradicted the objective of
legacy support, the legacy support won.

cheers,

Demerphq

Sep 24, 2007, 12:05:59 PM
to Jan Dubois, Glenn Linderman, John E. Malmberg, Dmitry Karasik, perl5-...@perl.org
On 9/24/07, Jan Dubois <ja...@activestate.com> wrote:

I was under the impression that if we switched to UTF-32 without
supporting combining characters, our regex engine would be exactly
compatible with its current implementation but much easier to
maintain. AFAIR the semantic meaning of combining characters (as opposed
to their visual representation) is a higher-order unicode concept that one
needn't support, and which we don't support. (Jarkko can pipe up and
hit me with a clue-by-four if he wishes :-) I would like to learn more,
but my understanding as it currently stands is that we don't support
combining characters except perhaps during case folding, and that case
folding is pretty much the same whether it's UTF-32 or not (as it's
based on codepoints anyway).

Cheers,

Tels

Sep 24, 2007, 12:14:45 PM
to demerphq, perl5-...@perl.org, Jan Dubois, Glenn Linderman, John E. Malmberg, Dmitry Karasik
Moin,

On Monday 24 September 2007 17:57:40 demerphq wrote:
> On 9/24/07, Tels <nospam...@bloodgate.com> wrote:
> > Moin,
> >
> > On Monday 24 September 2007 10:42:20 demerphq wrote:
> > > IMO it's only our strong *NIX bias and coding style that
> > > led us to use UTF-8, for pretty much the same reasons that UTF-8 is
> > > used on *NIX. (Primarily because of the habit of using null as a
> > > string terminator in interfaces and the desire to use legacy unicode-
> > > unaware filesystems for storage (think slash).)
> >
> > I thought that UTF-8 was chosen because it happens to use the same
> > bytes for the ASCII range 0..127, and it doesn't double the size of
> > strings (not relevant for short strings, but try explaining to the
> > user why his 200 MByte string suddenly takes 400 :-).
> >
> > But I might remember history wrong as I wasn't involved :)
>
> The history is online and I've posted links to it in other threads on
> this list.
>
> The original design of UTF8 would have allowed low bit characters like
> slash or null to be a subcomponent of a longer encoding. This was

You mean UTF-16 here?

> recognized as a mistake as it would not allow utf8 to be used as
> filenames on legacy filesystems and operating systems.
>
> The properties of UTF-8 that make it useful for legacy systems are:
>
> A) 1:1 correspondence with ASCII for codepoints 0..127, meaning that
> slash and null and other "special" characters do not change.

Yes.

> B) No valid UTF-8 encoding of a codepoint may be the substring of any
> valid encoding of any other codepoint.

Yes, this is a very very nice property of UTF-8 - you can jump in the middle
and know if you are "inside" a character or at the start of it.

thanx for the explanations,

tels


--
Signed on Mon Sep 24 18:13:16 2007 with key 0x93B84C15.


View my photo gallery: http://bloodgate.com/photos
PGP key on http://bloodgate.com/tels.asc or per email.

"Ein Vorschlag, das Grundgesetz zu modifizieren, ist kein Anschlag auf
die Verfassung."

-- Günther Beckstein

Jan Dubois

Sep 24, 2007, 5:41:52 PM
to demerphq, Glenn Linderman, John E. Malmberg, Dmitry Karasik, perl5-...@perl.org
On Mon, 24 Sep 2007, demerphq wrote:
> On 9/24/07, Jan Dubois <ja...@activestate.com> wrote:
> > Note that there are generally no advantages to UCS4/UTF32 over UTF16
> > because even with UCS4 you can end up with variable length encodings
> > due to the use of combining characters.
>
> I was under the impression that if we switched to Utf-32 without
> supporting combining characters our regex engine would be exactly
> compatible with its current implementation but be much easier to
> maintain. AFAIR semantic meaning for combining characters (as opposed
> to visual representation) is a high order unicode concept that one
> needn't support and which we don't support.

You get the same benefit by switching to UTF-16 at half the price. :)

Surrogate codepoints *always* appear as pairs, one from the 1024 low-half
surrogate code points, and one from the 1024 high-half surrogate code points.
That means any surrogate pair cannot be a substring of any other string,
including any other surrogate pair, so most of the time you don't need to
know about them.
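
The arithmetic behind the pairs is simple enough; a sketch of splitting
one supplementary code point (U+1D11E is an arbitrary example):

    my $cp    = 0x1D11E;                # any code point above U+FFFF
    my $v     = $cp - 0x10000;          # 20 bits to distribute
    my $lead  = 0xD800 | ($v >> 10);    # high half: D800..DBFF
    my $trail = 0xDC00 | ($v & 0x3FF);  # low half:  DC00..DFFF
    printf "U+%X => %04X %04X\n", $cp, $lead, $trail;  # => D834 DD1E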

Cheers,
-Jan

Demerphq

Sep 25, 2007, 3:46:08 AM
to Jan Dubois, Glenn Linderman, John E. Malmberg, Dmitry Karasik, perl5-...@perl.org

I'm not convinced, actually. You'd still need a lot of logic
checking "is this a pair or not" throughout the code (remember, I'm
thinking about this from the POV of the regex engine); since we don't
support combining characters, using UTF-32 would mean all that logic
just disappears. One of the crucial things is that you want to be able
to jump forward or backward N codepoints efficiently. In UTF-32 this
is trivial; in any other unicode encoding it isn't.

Currently there are things in the regex engine that UTF-8 and UTF-16
would (and do) make exponential because you have to scan and not jump.
UTF-32 allows you proper random access to the string; all the rest
don't.

As an example, certain regex use cases have diabolically worse
performance with unicode strings than with latin-1 strings because of
this (I've been wanting to fix it for some time but have lacked the
tuits). UTF-32 would make this problem just disappear.

Basically the unicode encoding we chose turned our disk drives into
tape drives without us noticing :-)

cheers,
yves

Tels

Sep 25, 2007, 10:37:55 AM
to perl5-...@perl.org, demerphq, Jan Dubois, Glenn Linderman, John E. Malmberg, Dmitry Karasik
Moin,

On Tuesday 25 September 2007 09:46:08 demerphq wrote:
> On 9/24/07, Jan Dubois <ja...@activestate.com> wrote:
> > On Mon, 24 Sep 2007, demerphq wrote:
> > > On 9/24/07, Jan Dubois <ja...@activestate.com> wrote:
> > > > Note that there are generally no advantages to UCS4/UTF32 over
> > > > UTF16 because even with UCS4 you can end up with variable length
> > > > encodings due to the use of combining characters.

[snip]


> > You get the same benefit by switching to UTF-16 at half the price. :)
> >
> > Surrogate codepoints *always* appear as pairs, one from the 1024
> > low-half surrogate code points, and one from the 1024 high-half
> > surrogate code points. That means any surrogate pair cannot be a
> > substring of any other string, including any other surrogate pair, so
> > most of the time you don't need to know about them.
>
> I'm not convinced, actually. You'd still need a lot of logic
> checking "is this a pair or not" throughout the code (remember, I'm
> thinking about this from the POV of the regex engine); since we don't
> support combining characters, using UTF-32 would mean all that logic
> just disappears. One of the crucial things is that you want to be able
> to jump forward or backward N codepoints efficiently. In UTF-32 this
> is trivial; in any other unicode encoding it isn't.
>
> Currently there are things in the regex engine that UTF-8 and UTF-16
> would (and do) make exponential because you have to scan and not jump.
> UTF-32 allows you proper random access to the string; all the rest
> don't.
>
> As an example, certain regex use cases have diabolically worse
> performance with unicode strings than with latin-1 strings because of
> this (I've been wanting to fix it for some time but have lacked the
> tuits). UTF-32 would make this problem just disappear.
>
> Basically the unicode encoding we chose turned our disk drives into
> tape drives without us noticing :-)

There are other benefits to UTF-32:

+ every character is automatically aligned at a 4-byte boundary, and thus
also every substring. Very good for modern 32-bit CPUs.

+ converting from bytes to codepoints is not necessary, compared to UTF-8,
where it is non-trivial to see whether the following X bytes are below a
certain codepoint or not (you first need to re-assemble the codepoint
from the bytes).

- on the downside, the string takes more space, thus possibly overflowing
the cache size, or worse, main memory. With the current insane amount
of CPU on-die cache and main memory this isn't as relevant as one might
think, but it is still a consideration.

For the size reason, it is crucial that the regexp engine continues to
support both single-byte encodings (like latin1) and UTF-8. I don't think
you want to convert 600 Mbyte of genetic data (like "GATTACA") into UTF-32
just to see if it matches /^(G|A)/ ...

All the best,

Tels

--
Signed on Tue Sep 25 16:30:56 2007 with key 0x93B84C15.
Get one of my photo posters: http://bloodgate.com/posters


PGP key on http://bloodgate.com/tels.asc or per email.

Morton's Law: If rats are experimented upon, they will develop cancer.

Jan Dubois

Sep 25, 2007, 1:05:37 PM
to demerphq, Glenn Linderman, John E. Malmberg, Dmitry Karasik, perl5-...@perl.org
On Tue, 25 Sep 2007, demerphq wrote:
> On 9/24/07, Jan Dubois <ja...@activestate.com> wrote:
>> Surrogate codepoints *always* appear as pairs, one from the 1024 low-
>> half surrogate code points, and one from the 1024 high-half surrogate
>> code points. That means any surrogate pair cannot be a substring of
>> any other string, including any other surrogate pair, so most of the
>> time you don't need to know about them.
>
> I'm not convinced, actually. You'd still need a lot of logic
> checking "is this a pair or not" throughout the code (remember, I'm
> thinking about this from the POV of the regex engine); since we don't
> support combining characters, using UTF-32 would mean all that logic
> just disappears. One of the crucial things is that you want to be able
> to jump forward or backward N codepoints efficiently. In UTF-32 this
> is trivial; in any other unicode encoding it isn't.

You can do *exactly* the same in UTF-16 as far as the regexp engine is
concerned: you just increment/decrement your string pointer by 1.
Skipping the second character in a surrogate pair would purely be an
optimization, and I'm not even convinced it would be any faster.

Remember:

1) The code points in a surrogate pair come from exclusive ranges
that aren't used by anything else, so they cannot accidentally
match BMP characters.

2) Surrogate code points *never* appear alone, they only occur in pairs.

3) The first and second code point in a surrogate pair come from
*disjoint* ranges, so you can never mistake the second codepoint of a
surrogate pair for the beginning of a supplemental character if you
are backtracking by just one code point either.

There are a few other places, like chr(), ord(), length(), substr(), index()
etc. that would need to know about surrogate pairs, but the code would
be pretty trivial, e.g. incrementing a pointer by character instead of
by code unit becomes:

    ptr++;
    if (0xDC00 <= *ptr && *ptr <= 0xDFFF)  /* landed on a trail surrogate */
        ptr++;                             /* mid-pair: skip to boundary  */

> Currently there are things in the regex engine that UTF-8 and UTF-16
> would (and do) make exponential because you have to scan and not
> jump. UTF-32 allows you proper random access to the string; all the
> rest don't.
>
> As an example, certain regex use cases have diabolically worse
> performance with unicode strings than with latin-1 strings because of
> this (I've been wanting to fix it for some time but have lacked the
> tuits). UTF-32 would make this problem just disappear.

Please think about this some more; I'm pretty sure that UTF-16 would
work almost the same in the regexp engine as UTF-32. Please point out
what would break in the regexp engine if you always incremented or
decremented string pointers by 1.

The only area I can think of are character classes; their implementation
would need to know about surrogate pairs. But since character classes
already need to be implemented as sparse arrays and not as simple lookup
tables, I think it would be pretty straight-forward to do the lookup
based on 2 code points instead of just one.

This still doesn't mean that you would ever have to scan instead of
incrementing or decrementing because you would never have a sequence in
a character class that would start with the second code point of a
surrogate pair. So while backtracking by just one codepoint you would
simply check against the character class, notice that it doesn't match
and backtrack one more codepoint.
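
A model of that backward step over an array of UTF-16 code units (just a
sketch, not the actual engine code):

    # step back one code unit; if we landed on a trail surrogate we are
    # mid-pair, so step back once more to reach the lead unit
    sub back_one_char {
        my ($units, $i) = @_;
        $i--;
        $i-- if $i >= 0 && $units->[$i] >= 0xDC00 && $units->[$i] <= 0xDFFF;
        return $i;
    }

    my @units = (0x41, 0xD834, 0xDD1E, 0x42);   # "A", U+1D11E, "B"
    print back_one_char(\@units, 3), "\n";      # 1: the lead unit of U+1D11E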

Cheers,
-Jan


Tels

Sep 25, 2007, 1:23:23 PM
to perl5-...@perl.org, Jan Dubois, demerphq, Glenn Linderman, John E. Malmberg, Dmitry Karasik
Moin,

I think the point is that you cannot know *beforehand* how many increments
you have to do if you want to go forwards "N characters".

In UTF-32, it means you go simply forward N*4 bytes. In UTF-16, you need N
steps since at each step you need to check for a pair.

This doesn't (almost) matter if you go forward one character, but imagine
you go forward 50000 characters. With UTF-32, that is one increment (by
50000 * 4), with UTF-16 (and UTF-8) this is 50000 steps.
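
A toy illustration of the fixed-width jump, holding the string as packed
32-bit units (a sketch; the 'L*' pack format just stands in for real
UTF-32 storage):

    my $text  = "caf\x{E9} \x{1D11E} end";
    my $utf32 = pack 'L*', map { ord } split //, $text;
    # character N starts at byte 4*N, so the jump is a single substr()
    my $n  = 5;
    my $cp = unpack 'L', substr($utf32, 4 * $n, 4);
    printf "char %d is U+%04X\n", $n, $cp;      # char 5 is U+1D11E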

All the best,

Tels


--
Signed on Tue Sep 25 19:20:57 2007 with key 0x93B84C15.


View my photo gallery: http://bloodgate.com/photos

PGP key on http://bloodgate.com/tels.asc or per email.

Like my code? Want to hire me to write some code for you? Send email!


Jan Dubois

Sep 25, 2007, 2:14:41 PM
to Tels, perl5-...@perl.org, demerphq, Glenn Linderman, John E. Malmberg, Dmitry Karasik
On Tue, 25 Sep 2007, Tels wrote:
> I think the point is that you cannot know *beforehand* how many
> increments you have to do if you want to go forwards "N characters".
>
> In UTF-32, it means you go simply forward N*4 bytes. In UTF-16, you
> need N steps since at each step you need to check for a pair.

As I pointed out before, in the regexp engine you can also just move
by 2*N bytes and not worry about it. The encoding will still provide
the correct matching semantics.

> This doesn't (almost) matter if you go forward one character, but
> imagine you go forward 50000 characters. With UTF-32, that is one
> increment (by 50000 * 4), with UTF-16 (and UTF-8) this is 50000 steps.

That is true, but moving more than 1 character only really applies to
substr() and the 3-arg form of index()/rindex().

I'm not claiming that UTF-16 is *always* better than UTF-32, but that
it doesn't share the algorithmic complexity of UTF-8, especially when
you are moving backwards in strings, which the regexp engine does a
lot.

I think it requires actual benchmarking to compare the performance of
UTF-16 and UTF-32 storage of larger character strings. UTF-16 has the
advantage if you exhaust your L2 cache; UTF-32 has an advantage when you
jump by a large number of characters. My gut feeling tells me that the
first will be more important in most real-world applications, but I
could easily be wrong.

Anyways, this is an interesting discussion, but I'm leaving in a few
hours and won't have email access again until late October, so I'll
have to leave this thread now...

Cheers,
-Jan

Tels

Sep 25, 2007, 2:59:51 PM
to perl5-...@perl.org, Jan Dubois, demerphq, Glenn Linderman, John E. Malmberg, Dmitry Karasik
Moin,

On Tuesday 25 September 2007 20:14:41 Jan Dubois wrote:
> On Tue, 25 Sep 2007, Tels wrote:
> > I think the point is that you cannot know *beforehand* how many
> > increments you have to do if you want to go forwards "N characters".
> >
> > In UTF-32, it means you go simply forward N*4 bytes. In UTF-16, you
> > need N steps since at each step you need to check for a pair.
>
> As I pointed out before, in the regexp engine you can also just move
> by 2*N bytes and not worry about it. The encoding will still provide
> the correct matching semantics.

Now you have me confused. I don't understand how you can move 2*N bytes
forwards when you want to move N characters forward.

For instance:

/(.{123})/

would need to capture 123 characters. And to find out how many characters
you have, in both UTF-8 and UTF-16, you have to count them. UTF-16 is
simpler than UTF-8 in that regard as it takes fewer steps per character,
but UTF-32 is simpler still: it only takes one step to go forward one
character.

UTF-8: 1..6 steps (depending on the length of the character)
UTF-16: 1..2 steps (depending)
UTF-32: 1 step

> > This doesn't (almost) matter if you go forward one character, but
> > imagine you go forward 50000 characters. With UTF-32, that is one
> > increment (by 50000 * 4), with UTF-16 (and UTF-8) this is 50000 steps.
>
> That is true, but moving more than 1 character only really applies to
> substr() and the 3-arg form of index()/rindex().

I am not that familiar with the regexp engine; I just imagined it might do
similar things. E.g. when you have a buffer with N bytes, it needs to know
how many characters are still in that buffer. In UTF-32, it is simply N/4,
but with UTF-8 or UTF-16 you need to count them. Which means looping over
them.

(This is what Yves meant with his disk vs. tape comment: one is
random-access, the other sequential.)

> I'm not claiming that UTF-16 is *always* better than UTF-32, but that
> it doesn't share the algorithmic complexity of UTF-8, especially when
> you are moving backwards in strings, which the regexp engine does a
> lot.

But you cannot simply move back N characters if you have to check
at each position whether you have to move by 2 or 4 bytes. *confused*

> I think it requires actual benchmarking to compare the performance of
> UTF-16 and UTF-32 storage of larger character strings. UTF-16 has the
> advantage if you exhaust your L2 cache; UTF-32 has an advantage when you
> jump by a large number of characters. My gut feeling tells me that the
> first will be more important in most real-world applications, but I
> could easily be wrong.

I agree that only benchmarks can show this. However, converting the regexp
engine is no small task.

> Anyways, this is an interesting discussion, but I'm leaving in a few
> hours and won't have email access again until late October, so I'll
> have to leave this thread now...

No worries - I gladly wait for your answer. It is not that I have the time
to write a UTF-32 engine :-D

all the best,

Tels

--
Signed on Tue Sep 25 20:52:50 2007 with key 0x93B84C15.


Get one of my photo posters: http://bloodgate.com/posters

PGP key on http://bloodgate.com/tels.asc or per email.

"We have problems like this all of the time," Kirk said, trying to
reassure me. "Sometimes its really hard to get things burning."

-- http://tinyurl.com/qmg5
