Re: uuencode: multi-bytes char in remote file name contains bytes >0x80


Bruce Korb

Jul 3, 2011, 2:52:39 PM
to 張叁, toe lin, bug-gn...@gnu.org
On 07/03/11 04:14, 張叁 wrote:
> Hi.
> I can not speak much in english..
> sorry....
> I hope code tell
>
> my code is just showing my meaning.
> may not works well.
>
> -------
>
> duhuanpeng


Hi Duhuanpeng,

Your patch is based on code that is 6 years old. Please be kind enough
to base it on more recent code. Also, please be kind enough to try to describe
what it is you are trying to do. Since you have difficulty doing it in English,
perhaps I can persuade Eric (Toe Lin/Madmorn) to please be kind enough to
translate (assuming you are writing Chinese). Thank you!

Regards, Bruce

Bruce Korb

Jul 3, 2011, 3:18:08 PM
to 張叁, bug-gnulib List, bug-gn...@gnu.org
On 07/03/11 04:14, 張叁 wrote:
> my code is just showing my meaning.
> may not works well.

Hi Duhuanpeng,

RE: enhancement to have uuencode encode output file name:

A few other things that will be needed:

1. changes to mark the file name as an encoded file name
2. parallel changes to uudecode that will then convert the
hex encoding of the file name to the file name to actually use
3. This should be an option to uuencode, rather than a compile time setting.
uudecode should detect it.
4. research into uuencode/uudecode to ensure the format enhancement
is compatible with the POSIX spec (does nothing to violate it):
http://pubs.opengroup.org/onlinepubs/009604599/utilities/uuencode.html
http://pubs.opengroup.org/onlinepubs/9699919799/utilities/uudecode.html

My reading of it seems to indicate that there isn't any wiggle room on the
"begin" line, for either base 64 or traditional encoding. I think it would
require a preamble line that put uudecode into a non-POSIX state wherein
uudecode would know that the input file name was encoded.

Given that the encoded file may get handed off to a uudecode that
knows nothing about this magical state, the file name encoding should
leave '/' characters as literal '/' characters. The technique in your
patch simply hex encodes the entire file name string.

I've CC-ed the gnulib list because there are folks there much more i18n
literate than I am.

Sorry to make it so difficult....

Regards, Bruce

Bruno Haible

Jul 3, 2011, 4:43:55 PM
to bug-g...@gnu.org, bug-gn...@gnu.org, 張叁, Bruce Korb
Referring to
<http://lists.gnu.org/archive/html/bug-gnu-utils/2011-07/msg00000.html>:

An obvious problem with the patch is that it considers a file name to be a
byte sequence. But different users may work in different locales, with
different encodings. If a Chinese user with file names in GB18030 encoding
sends a file to a user whose file names are UTF-8 encoded, or vice versa, the
file name needs to be converted. The usual approach for such cases is to use
UTF-8 as a "pivot" encoding. For example, in 'pax' [1] file names are
transferred in UTF-8 encoding.
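To make the "pivot" idea concrete, here is a minimal sketch that converts a file name from one legacy encoding to UTF-8. It assumes ISO-8859-1 input only, chosen because that conversion needs no tables; a multi-byte encoding such as GB18030 would need a conversion library instead. The function name latin1_to_utf8 is hypothetical and is not part of sharutils or gnulib.

```c
#include <stdlib.h>
#include <string.h>

/* Convert a NUL-terminated ISO-8859-1 byte string to UTF-8.
 * Returns a malloc'd string the caller must free, or NULL on failure.
 * (Hypothetical helper, for illustration only.) */
char *latin1_to_utf8(const char *src)
{
    /* Worst case: every input byte becomes two output bytes. */
    char *out = malloc(2 * strlen(src) + 1);
    char *dst = out;

    if (out == NULL)
        return NULL;
    for (; *src != '\0'; src++) {
        unsigned char c = (unsigned char)*src;
        if (c < 0x80) {
            *dst++ = (char)c;                    /* ASCII is unchanged */
        } else {
            *dst++ = (char)(0xC0 | (c >> 6));    /* lead byte: 110000xx */
            *dst++ = (char)(0x80 | (c & 0x3F));  /* continuation: 10xxxxxx */
        }
    }
    *dst = '\0';
    return out;
}
```

With this sketch, the ISO-8859-1 bytes 6A F6 72 67 ("jörg") become the UTF-8 bytes 6A C3 B6 72 67. The receiver would then convert from UTF-8 to its own locale encoding.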

But actually, what's the point of the patch? The most frequently used
archive programs for interchange are probably 'tar'/'pax', 'zip', and '7-zip'.
- 'pax' has support for Unicode file names [1]; the biggest problem is that
the 'pax' format is not the default one for GNU 'tar'.
- 'zip' has support for Unicode file names [2][3].
- '7-zip' supports Unicode file names as well [4].

Users who really want to transfer files with non-ASCII names can use one
of these three archive formats and send a uuencoded archive.

Bruno

[1] http://pubs.opengroup.org/onlinepubs/9699919799/utilities/pax.html
[2] http://www.info-zip.org/UnZip.html
[3] http://info.michael-simons.eu/2010/01/05/create-zip-archives-containing-unicode-filenames-with-java/
[4] http://www.7-zip.org/7z.html
--
In memoriam Yuri Shchekochikhin <http://en.wikipedia.org/wiki/Yuri_Shchekochikhin>

Eric

Jul 3, 2011, 8:30:13 PM
to 張叁, bug-gn...@gnu.org
Hello:
What is the problem? Go ahead and tell me in Chinese :)

2011/7/4 Bruce Korb <bk...@gnu.org>

> On 07/03/11 04:14, 張叁 wrote:
>

>> Hi.
>> I can not speak much in english..
>> sorry....
>> I hope code tell
>>

>> my code is just showing my meaning.
>> may not works well.
>>

>> -------
>>
>> duhuanpeng
>>
>
>
> Hi Duhuanpeng,
>
> Your patch is based on code that is 6 years old. Please be kind enough
> to base it on more recent code. Also, please be kind enough to try to
> describe
> what it is you are trying to do. Since you have difficulty doing it in
> English,
> perhaps I can persuade Eric (Toe Lin/Madmorn) to please be kind enough to
> translate (assuming you are writing Chinese). Thank you!
>
> Regards, Bruce
>

--
best *wishes, Eric*

Eli Zaretskii

Jul 4, 2011, 6:52:13 AM
to Bruno Haible, bug-gn...@gnu.org, bug-g...@gnu.org, q24...@gmail.com, bk...@gnu.org
> From: Bruno Haible <br...@clisp.org>
> Date: Sun, 3 Jul 2011 22:43:55 +0200
> Cc: bug-gn...@gnu.org, 張叁 <q24...@gmail.com>,
> Bruce Korb <bk...@gnu.org>

>
> Referring to
> <http://lists.gnu.org/archive/html/bug-gnu-utils/2011-07/msg00000.html>:
>
> An obvious problem with the patch is that it considers a file name to be a
> byte sequence. But different users may work in different locales, with
> different encodings.

Doesn't the same problem exist with the file's data itself? And yet
we are not bothered by that, and boldly go ahead and encode the data.

IMO, it's not uuencode's problem to solve. The correspondents need to
solve it "by some other means" (TM), for file data as for its name.

Bruno Haible

Jul 4, 2011, 4:58:43 PM
to Eli Zaretskii, bug-gn...@gnu.org, bug-g...@gnu.org, q24...@gmail.com, bk...@gnu.org
Eli,

> > An obvious problem with the patch is that it considers a file name to be a
> > byte sequence. But different users may work in different locales, with
> > different encodings.

And users want to see the original filenames. Users don't want to see mojibake,
that is, a mix of garbled characters (see attached screenshot).

> Doesn't the same problem exist with the file's data itself?

No, there is normally no problem with the contents of the files, because users
have learned to use file formats that are independent of locale. When users
send images (.jpeg or .png), text documents (.html, .odt, .rtf, even .doc),
presentations (.pdf, .odp), etc. they have no problem. And those few users who
receive plain text (.txt) files have the option to change the character
encoding in the browser they use to view the text file (in mozilla: via the
View > Encoding menu).

But when uudecode has created files with garbled names on the receiver's disk,
there is no program which will magically fix it.

> IMO, it's not uuencode's problem to solve. The correspondents need to
> solve it "by some other means" (TM), for file data as for its name.

No, communication that matches users' reasonable expectations does not
work like this.

Bruno
--
In memoriam Yonatan Netanyahu <http://en.wikipedia.org/wiki/Yonatan_Netanyahu>

filenames.png
mojibake.png

Eli Zaretskii

Jul 4, 2011, 10:49:44 PM
to Bruno Haible, bug-gn...@gnu.org, bug-g...@gnu.org, q24...@gmail.com, bk...@gnu.org
> From: Bruno Haible <br...@clisp.org>
> Date: Mon, 4 Jul 2011 22:58:43 +0200
> Cc: bug-g...@gnu.org,
> bug-gn...@gnu.org,
> q24...@gmail.com,
> bk...@gnu.org

>
> > Doesn't the same problem exist with the file's data itself?
>
> No, there is normally no problem with the contents of the files, because users
> have learned to use file formats that are independent of locale.

Text files are always in some encoding, so I doubt that this was
really true.

> But when uudecode has created files with garbled names on the receiver's disk,
> there is no program which will magically fix it.

Obviously, Emacs can.

> > IMO, it's not uuencode's problem to solve. The correspondents need to
> > solve it "by some other means" (TM), for file data as for its name.
>
> No, communication that matches users' reasonable expectations does not
> work like this.

What, email is no longer an option?

Bruce Korb

Jul 5, 2011, 10:40:51 AM
to 張叁, bug-g...@gnu.org, "Eric" toe lin, Eric Blake, bug-gn...@gnu.org, Bruno Haible
Hi Duhuanpeng,

On 07/05/11 06:44, 張叁 wrote:
> Let me try to write something in English.
> Please to correct my English. :-)

Eric is helping me in some i18n stuff for NTP, hopefully he can help
translate when things become confused. Please include original
Chinese plus your English so he can detect miscommunication
(Thank you so much for helping, Eric!).

> the problem is users using uuencode to uuencode a file, he may expect every
> byte is ASCII in encoded file, but when a NOT-ASCII file name appears,
> the problem comes.

Either I am not understanding your response, or you misunderstood the points
raised by Bruno, Eli et al. In any event, before going forward, one needs
to understand several things:

1. why not use pax (or some other standard utility) to create an archive
that embeds the file name within it? At that point, the archive
can be uuencoded for transfer by email with no loss in file names.

2. Assuming that you want a localized file name for this archive file,
you thus still want to encode the file name for transmission.
To do this, you would use code like this:
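    dst = malloc(2 * strlen(p) + 1);
    out = dst; // keep the start of the buffer; dst advances below
    while (*p) {
        if (*p == '/') // if I am not mistaken, '/' is always a '/' char
            *(dst++) = '/';
        else
        {
            // go through unsigned char: a negative char would sign-extend
            // and print more than two hex digits
            sprintf(dst, "%02X", (unsigned)(unsigned char)*p);
            dst += 2;
        }
        p++;
    }
    *dst = '\0';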

3. Any uuencode-ed file with an encoded file name in it would need to
be marked so that uudecode could cope (translate the encoded name).
This format change should be compatible with POSIX specifications
for the uuencode output. e.g. a preamble to the "begin"
line and not be part of that begin line? Maybe a prefix line:
puts("encoded-file-name\n");
Eric Blake would be a better person for suggesting ways to "extend"
the POSIX format. If this is worth the bother, then adding options
after the file name on the begin line would surely be "more convenient"....

4. uudecode needs matching changes, to detect the encoded file name.

5. your patch is still based on very, very old code.
Please base it on current code:
http://ftp.gnu.org/gnu/sharutils/sharutils-4.11.1.tar.gz
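
The encoding loop from item 2 can also be written as a self-contained function. This is only a sketch under the assumptions stated in this thread (every byte except '/' becomes two uppercase hex digits); the name hex_encode_name is hypothetical and does not exist in sharutils.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hex-encode every byte of `name` except '/', which is copied through.
 * Returns a malloc'd string the caller must free, or NULL on failure.
 * (Hypothetical helper, for illustration only.) */
char *hex_encode_name(const char *name)
{
    /* Each input byte yields at most two output bytes, plus the NUL. */
    char *out = malloc(2 * strlen(name) + 1);
    char *dst = out;

    if (out == NULL)
        return NULL;
    for (; *name != '\0'; name++) {
        if (*name == '/') {
            *dst++ = '/';
        } else {
            /* unsigned char avoids sign extension for bytes >= 0x80 */
            sprintf(dst, "%02X", (unsigned)(unsigned char)*name);
            dst += 2;
        }
    }
    *dst = '\0';
    return out;
}
```

For example, "ABC" encodes to "414243" and "a/b" encodes to "61/62". Returning the start of the buffer separately from the write cursor keeps the result usable and freeable by the caller.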

> Let's focus on this question before further discuss.
>
> Q: Is it necessary to do this:
> add a option, make uuencode supports the file name encoding.
>
> I also post my code here, but it's still buggy.
> 1. strlen may be wrong to count how many bytes in argv[optind].

It is the correct count, but the output buffer should be allocated.
Regards, Bruce

> 2011/7/5 Bruno Haible <br...@clisp.org <mailto:br...@clisp.org>>


>
> Eli,
>
> > > An obvious problem with the patch is that it considers a file name to be a
> > > byte sequence. But different users may work in different locales, with
> > > different encodings.
>
> And users want to see the original filenames. Users don't want to see mojibake,
> that is, a mix of garbled characters (see attached screenshot).
>

> > Doesn't the same problem exist with the file's data itself?
>
> No, there is normally no problem with the contents of the files, because users

> have learned to use file formats that are independent of locale. When users
> send images (.jpeg or .png), text documents (.html, .odt, .rtf, even .doc),
> presentations (.pdf, .odp), etc. they have no problem. And those few users who
> receive plain text (.txt) files have the option to change the character
> encoding in the browser they use to view the text file (in mozilla: via the
> View > Encoding menu).
>

> But when uudecode has created files with garbled names on the receiver's disk,
> there is no program which will magically fix it.
>

Eric Blake

Jul 5, 2011, 11:06:12 AM
to Bruce Korb, bug-g...@gnu.org, "Eric" toe lin, bug-gn...@gnu.org, 張叁, Bruno Haible
On 07/05/2011 08:40 AM, Bruce Korb wrote:
> 2. Assuming that you want a localized file name for this archive file,
> you thus still want to encode the file name for transmission.
> To do this, you would use code like this:
> dst = malloc(2 * strlen(p) + 1);
> while (*p) {
> if (*p == '/') // if I am not mistaken, '/' is always a '/' char

The next version of POSIX will be enforcing that '/' and '.' are
unambiguous across all POSIX encodings supported by all locales on a
system (it was a happy accident that no POSIX system has attempted to do
otherwise), as well as further clarifying that yes, filenames are not
necessarily character strings in all locales, unless those filenames are
drawn solely from the portable filename character set.

See http://austingroupbugs.net/view.php?id=291

There are, however, some non-POSIX encodings where '/' can appear as the
second byte in a shift-state sequence encoder (ISO-2022-JP-2), although
they are rare in practice these days.

Also, if you worry about systems where backslash is a directory
separator, there are encodings such as Shift_JIS where '\\' can appear
as a second byte within a multi-byte character (hence, '\\' is
ambiguous, even though '/' is not).

> 3. Any uuencode-ed file with an encoded file name in it would need to
> be marked so that uudecode could cope (translate the encoded name).
> This format change should be compatible with POSIX specifications
> for the uuencode output. e.g. a preamble to the "begin"
> line and not be part of that begin line? Maybe a prefix line:
> puts("encoded-file-name\n");
> Eric Blake would be a better person for suggesting ways to "extend"
> the POSIX format. If this is worth the bother, then adding options
> after the file name on the begin line would surely be "more
> convenient"....

I'm not quite sure what you are asking me to do here. Maybe it helps to
read the current POSIX requirements on uuencode output:

http://pubs.opengroup.org/onlinepubs/9699919799/utilities/uuencode.html

Note this statement:

"The standard output shall be a text file"

but if filename is _not_ a character string in the current locale, then
the output would _not_ be a text file (among other things, a text file
has the property that at least one locale can interpret every byte
sequence in the file as valid characters). At which point, we are no
longer constrained by POSIX, and can arguably do whatever we want! That
is, supporting file names that consist of characters outside of the
portable file name character set (a-z, A-Z, 0-9, ., _, /, and -) is
already outside the realm of what POSIX requires uuencode to support,
and it would be just as reasonable for uuencode to refuse to operate on
such file names as it would be for uuencode to emit some sort of header
that tells uudecode how to try and decode a string back into characters
appropriate for the current locale.

>> 1. strlen may be wrong to count how many bytes in argv[optind].

No, strlen is _always_ the way to count how many bytes are in an element
of argv, since each argv entry is always a NUL-terminated sequence of
bytes (that might also, but are not required to, have meaning when
interpreted as multi-byte characters under the current locale).
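
A small sketch of the distinction, assuming UTF-8 input for the character count: strlen() gives the byte count that buffer sizing needs, while the number of characters can be smaller. The helper names below are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Bytes needed to hex-encode `arg` (two digits per byte, plus NUL).
 * strlen() counts bytes, which is exactly what is needed here. */
size_t encoded_name_size(const char *arg)
{
    return 2 * strlen(arg) + 1;
}

/* Number of code points in `s`, assuming `s` is valid UTF-8:
 * count every byte that is not a 10xxxxxx continuation byte. */
size_t utf8_char_count(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}
```

For the UTF-8 name "jörg" (bytes 6A C3 B6 72 67), strlen() returns 5 while utf8_char_count() returns 4; the encoder only cares about the 5.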

--
Eric Blake ebl...@redhat.com +1-801-349-2682
Libvirt virtualization library http://libvirt.org


John Cowan

Jul 5, 2011, 11:45:08 AM
to Eric Blake, 張叁, Eric toe lin, bug-gn...@gnu.org, Bruce Korb, bug-g...@gnu.org, Bruno Haible
Eric Blake scripsit:

> [B]ut if filename is _not_ a character string in the current locale, then
> the output would _not_ be a text file (among other things, a text file
> has the property that at least one locale can interpret every byte
> sequence in the file as valid characters).

Say what? The name of a file is not a byte sequence in the file. I
don't see how it follows that because a file is a text file, its name
is a character string in some locale.

--
John Cowan http://ccil.org/~cowan co...@ccil.org
In might the Feanorians / that swore the unforgotten oath
brought war into Arvernien / with burning and with broken troth.
and Elwing from her fastness dim / then cast her in the waters wide,
but like a mew was swiftly borne, / uplifted o'er the roaring tide.
--the Earendillinwe

Eric Blake

Jul 5, 2011, 11:58:26 AM
to John Cowan, 張叁, Eric toe lin, bug-gn...@gnu.org, Bruce Korb, bug-g...@gnu.org, Bruno Haible
On 07/05/2011 09:45 AM, John Cowan wrote:
> Eric Blake scripsit:
>
>> [B]ut if filename is _not_ a character string in the current locale, then
>> the output would _not_ be a text file (among other things, a text file
>> has the property that at least one locale can interpret every byte
>> sequence in the file as valid characters).
>
> Say what? The name of a file is not a byte sequence in the file. I
> don't see how it follows that because a file is a text file, its name
> is a character string in some locale.

When used according to POSIX, the 'decode_pathname' argument (POSIX
notation, or REMOTEFILE argument in 'uuencode --help' notation) is
output literally in the resulting output of 'uuencode' on the line
starting with "begin"; that resulting output is also required by POSIX
to be a text file. It also helps to read elsewhere in the POSIX
requirements on uuencode: "If there are characters in decode_pathname
that are not in the portable filename character set the results are
unspecified." Therefore, you _cannot_ use uuencode to pass the name of
a file that contains non-portable characters and still have output that
complies with POSIX.

Which means that for our particular implementation of uuencode, if we
encounter a file name that contains any bytes not already in the
portable file name set, then we can do whatever we want (error out, or
output some sort of prefix line that tells knowledgeable uudecode
implementations that we are about to send an encoded form of a file
name, output a binary file rather than a text file [by outputting the
file name as a literal sequence of bytes, even though those bytes are
not characters in the current locale], or anything else), all as an
extension to POSIX. Of course, our goal should be to have the
out-of-the-box behavior provide the most likely use (that is, it would
be better if we could just make uuencode work on all possible file
names, even on the ones where POSIX does not require any particular
behavior).


Bruce Korb

Jul 5, 2011, 12:58:27 PM
to Eric Blake, bug-g...@gnu.org, "Eric" toe lin, bug-gn...@gnu.org, 張叁, Bruno Haible
On 07/05/11 08:06, Eric Blake wrote:
> I'm not quite sure what you are asking me to do here. Maybe it helps to
> read the current POSIX requirements on uuencode output:
>
> http://pubs.opengroup.org/onlinepubs/9699919799/utilities/uuencode.html

I read that, though I'm sure not as carefully as someone regularly in
on the meetings. :) Reading it again:

> The pathname of the file into which the uudecode utility shall place
> the decoded file. ... If there are
> characters in decode_pathname that are not in the portable filename
> character set the results are unspecified.

==>> ^^^^ needs a comma after "set".

leads me to believe that this:

begin 444 hex-encode-EN:414243

can, for example, be validly used to create a file named "ABC"
in the English language domain, with obvious extensions for CN.
The ":" character being non-portable relieves the application from
being required to create a file named "hex-encode-EN:414243".
It is just that a new uudecode might surprise someone trying
to create a file named "hex-encode-DE:BADF". Given that this sharutil
stuff was supposed to be moribund when I took it over a decade ago,
perhaps not a critical incompatibility.....

Cheers - Bruce

Eric Blake

Jul 5, 2011, 1:13:45 PM
to Bruce Korb, bug-g...@gnu.org, "Eric" toe lin, bug-gn...@gnu.org, 張叁, Bruno Haible
On 07/05/2011 10:58 AM, Bruce Korb wrote:
> On 07/05/11 08:06, Eric Blake wrote:
>> I'm not quite sure what you are asking me to do here. Maybe it helps to
>> read the current POSIX requirements on uuencode output:
>>
>> http://pubs.opengroup.org/onlinepubs/9699919799/utilities/uuencode.html
>
> I read that, though I was sure not as carefully as someone regularly in
> on the meeting. :) Reading it again:
>
>> The pathname of the file into which the uudecode utility shall place
>> the decoded file. ... If there are
>> characters in decode_pathname that are not in the portable filename
>> character set the results are unspecified.
> ==>> ^^^^ needs a comma after "set".
>
> leads me to believe that this:
>
> begin 444 hex-encode-EN:414243
>
> can, for example, be validly used to create a file named "ABC"
> in the english language domain, with obvious extensions for CN.
> The ":" character being non-portable relieves the application from
> being required to create a file named "hex-encode-EN:414243".

You are correct that it would work. The ':' cannot appear in any file
created by a POSIX-compliant use of uuencode, therefore uudecode, upon
seeing a ':' in the file name position, can assume that the input file
must have been created by a uuencode version that was using an extension
to POSIX, and therefore uudecode can blindly try to decode that name
without violating POSIX.

> It is just that a new uudecode might surprise someone trying
> to create a file named "hex-encode-DE:BADF".

Under your encoding scheme for all possible filenames that fall outside
the bounds of the POSIX portable file name set, you merely encode such a
file name as:

begin 444 hex-encode-EN:6865782d656e636f64652d44453a42414446

that is, presence of : in the desired output name implies that the file
name must be encoded, just the same as any 8-bit byte also makes that
implication.
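
The receiving side of this scheme could look roughly like the following sketch. It assumes the marker is the literal prefix "hex-encode", optionally followed by a tag such as "-EN", then ':' and the hex digits, with '/' passed through literally as proposed earlier in the thread. None of this is an existing uudecode interface; hex_decode_name and hex_value are hypothetical names, and the tag is ignored here.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Value of one hex digit, or -1 if `c` is not a hex digit. */
static int hex_value(int c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    c = tolower(c);
    if (c >= 'a' && c <= 'f')
        return c - 'a' + 10;
    return -1;
}

/* If `name` has the form "hex-encode[-TAG]:<hex digits>", return the
 * decoded bytes as a malloc'd string; otherwise return NULL so the
 * caller can fall back to using `name` literally. */
char *hex_decode_name(const char *name)
{
    static const char prefix[] = "hex-encode";
    const size_t prefix_len = sizeof prefix - 1;
    const char *p;
    char *out, *dst;

    if (strncmp(name, prefix, prefix_len) != 0)
        return NULL;
    p = strchr(name + prefix_len, ':');
    if (p == NULL)
        return NULL;
    p++;
    out = malloc(strlen(p) + 1);      /* upper bound on decoded size */
    if (out == NULL)
        return NULL;
    dst = out;
    while (*p != '\0') {
        int hi, lo;

        if (*p == '/') {              /* separators travel unencoded */
            *dst++ = *p++;
            continue;
        }
        hi = hex_value((unsigned char)p[0]);
        lo = (hi < 0) ? -1 : hex_value((unsigned char)p[1]);
        /* reject bad digits, odd length, and an embedded NUL byte */
        if (hi < 0 || lo < 0 || (hi == 0 && lo == 0)) {
            free(out);
            return NULL;
        }
        *dst++ = (char)(hi * 16 + lo);
        p += 2;
    }
    *dst = '\0';
    return out;
}
```

Under these assumptions, "hex-encode-EN:414243" decodes to "ABC", and the longer example above decodes to "hex-encode-DE:BADF". A name without the marker yields NULL, so plain names keep their current behaviour.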


John Cowan

Jul 5, 2011, 1:18:53 PM
to Eric Blake, 張叁, Eric toe lin, bug-gn...@gnu.org, Bruce Korb, bug-g...@gnu.org, Bruno Haible
Eric Blake scripsit:

> When used according to POSIX, the 'decode_pathname' argument (POSIX
> notation, or REMOTEFILE argument in 'uuencode --help' notation) is
> output literally in the resulting output of 'uuencode' on the line
> starting with "begin"; that resulting output is also required by POSIX
> to be a text file.

I grasp that, but the definition of "text file" merely requires that the
bytes be interpretable in *some* locale, not in the actually relevant
locale. A locale in which the character encoding is 8859-1, for
example, has a character for every byte, which means that every file is,
*as far as this part of the definition goes*, a text file.

> It also helps to read elsewhere in the POSIX requirements on
> uuencode: "If there are characters in decode_pathname that are not
> in the portable filename character set the results are unspecified."

> Therefore, you _cannot_ use uuencode to pass the name of a file that
> contains non-portable characters and still have output that complies
> with POSIX.

Well, it's good to know about that requirement, since it means Posix
is irrelevant to the case. However, your second sentence is false,
because if Posix does not specify the result, then any result is
Posix-compliant.

> Which means that for our particular implementation of uuencode, if
> we encounter a file name that contains any bytes not already in the
> portable file name set, then we can do whatever we want (error out,
> or output some sort of prefix line that tells knowledgeable uudecode
> implementations that we are about to send an encoded form of a file
> name, output a binary file rather than a text file [by outputting the
> file name as a literal sequence of bytes, even though those bytes
> are not characters in the current locale], or anything else), all as
> an extension to POSIX. Of course, our goal should be to have the
> out-of-the-box behavior provide the most likely use (that is, it would
> be better if we could just make uuencode work on all possible file
> names, even on the ones where POSIX does not require any particular
> behavior).

Agreed.

--
John Cowan co...@ccil.org http://ccil.org/~cowan
The whole of Gaul is quartered into three halves.
--Julius Caesar

Bruce Korb

Jul 5, 2011, 2:12:38 PM
to Eric Blake, bug-g...@gnu.org, "Eric" toe lin, bug-gn...@gnu.org, 張叁, Bruno Haible
On 07/05/11 10:13, Eric Blake wrote:
> begin 444 hex-encode-EN:6865782d656e636f64652d44453a42414446
>
> that is, presence of : in the desired output name implies that the file
> name must be encoded, just the same as any 8-bit byte also makes that
> implication.

Yep, but part of the whole point of uuencode is that the output is 7-bit ASCII,
meaning the output should not contain any funny characters at all, including
the "begin" line.

Eric Blake

Jul 5, 2011, 4:13:52 PM
to Bruce Korb, bug-g...@gnu.org, "Eric" toe lin, bug-gn...@gnu.org, 張叁, Bruno Haible
On 07/05/2011 12:12 PM, Bruce Korb wrote:
> On 07/05/11 10:13, Eric Blake wrote:
>> begin 444 hex-encode-EN:6865782d656e636f64652d44453a42414446
>>
>> that is, presence of : in the desired output name implies that the file
>> name must be encoded, just the same as any 8-bit byte also makes that
>> implication.
>
> Yep, but part of the whole point of uuencode is that the output is 7-bit
> ASCII,

"hex-encode-EN:6865782d656e636f64652d44453a42414446" _is_ 7-bit ASCII
encoded name, which when decoded will result in a filename that happens
to contain a single colon, namely "hex-encode-DE:BADF". That is, if the
file name to be generated into the uuencode output contains any 8-bit
byte or contains any non-portable 7-bit file name character (newline,
colon, ESC, or even comma for that matter!), then uuencode instead
outputs "hex-encode-$LL:..." with $LL determined by the current locale
or encoding (at least, that appears to be how you were envisioning
things), followed by every other byte in the file name being output as
an encoded representation.

> meaning the output should not contain any funny characters at all,
> including
> the "begin" line.

Right - and by you making uuencode use "hex-encode:..." for all
filenames that would otherwise be funny, you've met that goal.


Bruno Haible

Jul 5, 2011, 5:11:22 PM
to Eric Blake, bug-gn...@gnu.org, bug-g...@gnu.org
Eric Blake wrote:
> The next version of POSIX will be enforcing that '/' and '.' are
> unambiguous across all POSIX encodings supported by all locales on a
> system

We already make use of it in lib/mbschr.c and lib/mbsrchr.c.

> There are, however, some non-POSIX encodings where '/' can appear as the
> second byte in a shift-state sequence encoder (ISO-2022-JP-2), although
> they are rare in practice these days.

They have been nonexistent for many years already. In 1999, Stephen Turnbull
had a web page describing some of the weird effects that he got with
non-ASCII file names in an ISO-2022-JP-2 locale. I think this was enough
to convince everyone that locales with stateful encodings are not practical.

> Also, if you worry about systems where backslash is a directory
> separator, there are encodings such as Shift_JIS where '\\' can appear
> as a second byte within a multi-byte character (hence, '\\' is
> ambiguous, even though '/' is not).

Yes, such locales exist, even on glibc systems where such locales are not
ISO C 99 compliant.

Bruno

Bruce Korb

Jul 5, 2011, 5:12:01 PM
to Eric Blake, bug-g...@gnu.org, "Eric" toe lin, bug-gn...@gnu.org, 張叁, Bruno Haible
On 07/05/11 13:13, Eric Blake wrote:
> On 07/05/2011 12:12 PM, Bruce Korb wrote:
>> On 07/05/11 10:13, Eric Blake wrote:
>>> begin 444 hex-encode-EN:6865782d656e636f64652d44453a42414446
>>>
>>> that is, presence of : in the desired output name implies that the file
>>> name must be encoded, just the same as any 8-bit byte also makes that
>>> implication.
>>
>> Yep, but part of the whole point of uuencode is that the output is 7-bit
>> ASCII,
>
> "hex-encode-EN:6865782d656e636f64652d44453a42414446" _is_ 7-bit ASCII

I think we are in vehement agreement (agreeing, but not quite understanding
what each other has meant to say).

So, an option could be omitted by doing a strspn() with the set of
POSIX file name characters, but I don't think that is a good idea.
I suspect there are some folks expecting sloppiness in allowing
non-portable characters in the file name. Therefore, it should be an
explicit request (option) to do it.
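
For reference, the strspn() test mentioned above might be sketched as follows. The accepted set is the POSIX portable filename character set plus '/' as the path separator, as listed earlier in the thread; the function name is_portable_pathname is hypothetical.

```c
#include <string.h>

/* Return 1 if `name` is non-empty and consists only of bytes from the
 * POSIX portable filename character set plus '/', else 0. */
int is_portable_pathname(const char *name)
{
    static const char portable[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "abcdefghijklmnopqrstuvwxyz"
        "0123456789._-/";

    /* strspn() returns the length of the leading run of accepted
     * bytes; the name is portable if that run reaches the NUL. */
    return name[0] != '\0' && name[strspn(name, portable)] == '\0';
}
```

Note that this check rejects "clang+llvm.tar.gz" because of the '+', which is exactly the kind of currently working name that argues for making the encoding an explicit option rather than automatic.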

Meanwhile, I'm still waiting for a good answer to, "Why do it at all?"
since the file(s) could get rolled into a "pax"ball. :)

張叁

Jul 6, 2011, 11:02:41 AM
to Bruce Korb, bug-g...@gnu.org, "Eric" toe lin, Eric Blake, bug-gn...@gnu.org, Bruno Haible
> If there are characters in *decode_pathname* that are not in the portable
> filename character set the results are _unspecified_.

2011/7/6 Bruce Korb <bk...@gnu.org>

> On 07/05/11 13:13, Eric Blake wrote:
>
>> On 07/05/2011 12:12 PM, Bruce Korb wrote:
>>
>>> On 07/05/11 10:13, Eric Blake wrote:
>>>

>>>> begin 444 hex-encode-EN:6865782d656e636f64652d44453a42414446


>>>>
>>>> that is, presence of : in the desired output name implies that the file
>>>> name must be encoded, just the same as any 8-bit byte also makes that
>>>> implication.
>>>>
>>>
>>> Yep, but part of the whole point of uuencode is that the output is 7-bit
>>> ASCII,
>>>
>>

>> "hex-encode-EN:6865782d656e636f64652d44453a42414446" _is_ 7-bit ASCII

Eric

Jul 6, 2011, 1:17:54 PM
to Bruce Korb, Eric Blake, Bruno Haible, Eli Zaretskii, bug-g...@gnu.org, bug-gn...@gnu.org, q24...@gmail.com
Hi all:

> If there are characters in *decode_pathname* that are not in the portable
> filename character set the results are unspecified.


I think we could solve the non-ASCII file name problem by relying on the
provision quoted above. The implementation is not a problem; the point is
whether it is necessary to do so (add an option for uuencode). If not, we
would use other tools to solve the problem; otherwise we could discuss
how to implement it.

So is it necessary to add an option to uuencode?

--
best *wishes, Eric*

Eli Zaretskii

Jul 6, 2011, 1:29:07 PM
to Bruce Korb, q24...@gmail.com, mad...@gmail.com, br...@clisp.org, bug-gn...@gnu.org, bug-g...@gnu.org, ebl...@redhat.com
> Date: Tue, 05 Jul 2011 14:12:01 -0700
> From: Bruce Korb <bk...@gnu.org>
> CC: 張叁 <q24...@gmail.com>,
> Bruno Haible <br...@clisp.org>,
> Eli Zaretskii <el...@gnu.org>, bug-g...@gnu.org, bug-gn...@gnu.org,
> "\"Eric\" toe lin" <mad...@gmail.com>

>
> Meanwhile, I'm still waiting for a good answer to, "Why do it at all?"

Because not doing that would severely limit uuencode's domain of
applicability? Why would users be coerced to use pax when all they
need is send a single file? Sounds like a gratuitous limitation to
me.


Bruce Korb

Jul 6, 2011, 2:43:01 PM
to Eric, Bruno Haible, bug-gn...@gnu.org, Eric Blake, q24...@gmail.com
On 07/06/11 10:17, Eric wrote:
> Hi all:
>
> If there are characters in /decode_pathname/ that are not in the portable
> filename character set the results are unspecified.
>
>
> I think we could do some work to solve the non-ASCII file name problem
> according to the above feature. the implementation is not a problem, the
> point is whether it is necessary or not to do so ( add an option for uuencode),
> if not, then we would use other tools to solve the problem; otherwise we
> could discuss on how to implement.
>
> so is it necessary to add an option to uuencode?

I think that, even though not strictly necessary, an explicit option should be required.
The reason is that some non-portable characters used in file names are
nonetheless 7-bit ASCII characters and work as expected now.
If such names were encoded, current versions of uudecode would become confused.
A common name with a '+' character, e.g. "clang+llvm.tar.gz", becomes
"hex-encode:636c616e672b6c6c766d2e7461722e677a", and the old uudecode
would create a file with that long name. The encoding needs to be an option.

A marker is required on the output so that uudecode can detect an encoded
file name. The encoded file name should be of such a format that old
uudecoders create some sort of file that can get renamed.

========

I think the arguments are sufficient to make the changes.
The change will include uudecode changes so it can detect
and handle the encoded file names, and uuencode will get
an "encode-filename" ("-e") option.

duhuanpeng, I will do this in the coming weeks.
Please be patient.

Thanks. Regards, Bruce

Bruno Haible

Jul 6, 2011, 3:55:01 PM7/6/11
to Bruce Korb, Eric, bug-gn...@gnu.org, Eric Blake, q24...@gmail.com
Bruce Korb wrote:
> I think the arguments are sufficient to make the changes.
> The change will include uudecode changes so it can detect
> and handle the encoded file names, and uuencode will get
> an "encode-filename" ("-e") option.

Where and how will the charset conversion of the filenames be handled?

Remember, in the scope of a single user or a single machine, it can be OK
to treat a file name as a mere sequence of bytes - assuming all users on
that machine use the same encoding. But when a user in a UTF-8 locale
sends a file named "jörg" to some recipients, and some of the recipients
get a file named "jörg" created on their disk and others a file named
"jÃ¶rg" (namely those in an ISO-8859-15 locale) and others a file named
"j枚rg" (namely those in a GB18030 locale), that will be viewed as a bug.

There are two ways to deal with it:
a) Do the charset conversion on the receiver's side, and on the sender's
side only embed the charset. The most well-known encoding of this
kind is probably the way subject lines are encoded in MIME:
"jörg" would become
=?iso-8859-1?Q?j=F6rg?=
or
=?utf-8?Q?j=C3=B6rg?=
or
hex-encode:3F69736F2D383835392D313F513F6A3D463672673F
b) Do the charset conversion both on the sender's side and on the
receiver's side, and always send filenames converted to UTF-8.
Example:
j=C3=B6rg
or
hex-encode:6AC3B67267
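
The difference between the two approaches can be sketched in Python. This is an illustration only: the function names are hypothetical, the "hex-encode:" prefix is the one used in the examples above, and the standard library's email.header module stands in for whatever a real uudecode would use to parse a MIME encoded word.

```python
from email.header import decode_header

PREFIX = "hex-encode:"


def decode_approach_a(word: str) -> str:
    """Receiver side of (a): trust the charset label, then convert."""
    raw, charset = decode_header(word)[0]
    return raw.decode(charset)


def encode_approach_b(name: str) -> str:
    """Sender side of (b): always UTF-8, then flatten to ASCII hex."""
    return PREFIX + name.encode("utf-8").hex().upper()


def decode_approach_b(wire: str) -> str:
    """Receiver side of (b): no label to interpret, the bytes are UTF-8."""
    return bytes.fromhex(wire[len(PREFIX):]).decode("utf-8")


print(decode_approach_a("=?iso-8859-1?Q?j=F6rg?="))
print(encode_approach_b("jörg"))
```

In (a) the receiver depends on the label being right; in (b) there is no label to get wrong.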

Bruno
--
In memoriam Jan Hus <http://en.wikipedia.org/wiki/Jan_Hus>

Bruce Korb

Jul 6, 2011, 4:21:29 PM7/6/11
to Bruno Haible, Eric, bug-gn...@gnu.org, Eric Blake, q24...@gmail.com
On 07/06/11 12:55, Bruno Haible wrote:
> Bruce Korb wrote:
>> I think the arguments are sufficient to make the changes.
>> The change will include uudecode changes so it can detect
>> and handle the encoded file names, and uuencode will get
>> an "encode-filename" ("-e") option.
>
> Where and how will the charset conversion of the filenames be handled?

Yes, it will be.

> There are two ways to deal with it:
> a) Do the charset conversion on the receiver's side, and on the sender's
> side only embed the charset. The most well-known encoding of this
> kind is probably the way subject lines are encoded in MIME:
> "jörg" would become
> =?iso-8859-1?Q?j=F6rg?=
> or
> =?utf-8?Q?j=C3=B6rg?=
> or
> hex-encode:3F69736F2D383835392D313F513F6A3D463672673F
> b) Do the charset conversion both on the sender's side and on the
> receiver's side, and always send filenames converted to UTF-8.
> Example:
> j=C3=B6rg
> or
> hex-encode:6AC3B67267

I pick the way that is most robust and prone to the fewest problems.
You tell me, please. :) I'll do what you suggest and run the result
past both you and our new friend, =?GB2312?B?j4jI/g==?=

Cheers - Bruce

Bruno Haible

Jul 6, 2011, 4:56:00 PM7/6/11
to Bruce Korb, Eric, bug-gn...@gnu.org, Eric Blake, q24...@gmail.com
Hi Bruce,

> I pick the way that is most robust and prone to the fewest problems.
> You tell me, please. :)

OK :)

> > a) Do the charset conversion on the receiver's side, and on the sender's
> > side only embed the charset. The most well-known encoding of this
> > kind is probably the way subject lines are encoded in MIME:
> > "jörg" would become
> > =?iso-8859-1?Q?j=F6rg?=
> > or
> > =?utf-8?Q?j=C3=B6rg?=
> > or
> > hex-encode:3F69736F2D383835392D313F513F6A3D463672673F

This approach was preferred between ca. 1995 and 1999, because at that time,
it was not clear that Unicode would succeed in the way it did.

> > b) Do the charset conversion both on the sender's side and on the
> > receiver's side, and always send filenames converted to UTF-8.
> > Example:
> > j=C3=B6rg
> > or
> > hex-encode:6AC3B67267

Whereas this approach b) is the preferred one since ca. 2001.

> I'll do what you suggest and run the result
> past both you and our new friend, =?GB2312?B?j4jI/g==?=

You are presenting a good argument for b) and against a). Namely, the charset
label is often wrong. As in your example: It claims to be GB2312, but is in
fact CP936, an extension of GB2312 [1].

$ echo -n j4jI/g== | base64 -d | iconv -f GB2312 -t UTF-8
iconv: (stdin):1:0: cannot convert
$ echo -n j4jI/g== | base64 -d | iconv -f CP936 -t UTF-8
張叁
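
The same check can be reproduced without iconv. The following Python sketch assumes only the standard codecs; try_decode is a helper name invented for the example.

```python
import base64

raw = base64.b64decode("j4jI/g==")  # payload of the encoded word above


def try_decode(data: bytes, charset: str):
    """Return the decoded text, or None if the bytes are invalid in charset."""
    try:
        return data.decode(charset)
    except UnicodeDecodeError:
        return None


print(try_decode(raw, "gb2312"))  # the charset claimed by the label
print(try_decode(raw, "cp936"))   # the charset that was actually used
```

The first byte, 0x8F, lies outside the GB2312 lead-byte range, so the strict decoder rejects it, while the CP936 superset accepts it.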

Such mislabeling is present in email and HTML, for historical reasons. It is
better to use approach b), because it does not require that the sender and
receiver have a common understanding what they mean by "GB2312" (or worse:
by "Big5").

Additionally, approach b) usually leads to shorter strings than
approach a), which is also a consideration, given that uuencode's output
should fit in 80 columns.

Bruno

[1] http://www.haible.de/bruno/charsets/conversion-tables/GB2312.html

Eric Blake

Jul 6, 2011, 5:15:54 PM7/6/11
to Bruce Korb, Eric, bug-gn...@gnu.org, Bruno Haible, q24...@gmail.com
On 07/06/2011 02:21 PM, Bruce Korb wrote:
> On 07/06/11 12:55, Bruno Haible wrote:
>> Bruce Korb wrote:
>>> I think the arguments are sufficient to make the changes.
>>> The change will include uudecode changes so it can detect
>>> and handle the encoded file names, and uuencode will get
>>> an "encode-filename" ("-e") option.
>>
>> Where and how will the charset conversion of the filenames be handled?
>
> Yes, it will be.

The only sane approach is to assume that the current locale of the user
running uuencode normally sees sane filenames, and transliterate from
the user's locale into UTF-8. Either the filename is a character string
in the user's current locale (and therefore, every character can be
transliterated into UTF-8; perhaps trivially if the user's locale is
already UTF-8), or the filename is already random bytes that the user
cannot see as characters in their current locale. In the latter case,
you can still do a 1:1 mapping, where all invalid bytes are mapped to a
2nd-half of a UTF-8 surrogate pair.

Then, take that UTF-8 multibyte sequence (including 2nd-half surrogate
pair mappings for all invalid bytes that were not characters), and
flatten it into something that is just ascii.

On the uudecode side, take the ascii and convert it back to UTF-8, then
transliterate into the user's current locale. Here, the transliteration
might be lossy (if the user's charset doesn't support all the characters
that were in the input) - here, I'm not sure whether best practice is to
transliterate from the unrepresentable character to '?' or to leave the
unrepresentable character as raw Unicode bytes (the latter is what leads
to mojibake). But if the receiver's current locale is UTF-8, lossy
transliteration is not an issue. Meanwhile, if the encoded string
contained any unmatched 2nd-half surrogate pairs, you can unambiguously
recover the raw byte that was not a character, and use that byte as-is.

The nice part about this algorithm is that if both sender and receiver
only use a subset of characters that exist in both charsets, then they
both see the same filename, even if the two locations are using
different charsets. If the receiver is using UTF-8 (which is more and
more common these days), they will see whatever name the sender saw
regardless of the sender's charset. The only place where mojibake still
happens is if the sender uses characters that are not in the receiver's
charset - and that's not entirely a real loss, since it was already the
case that the sender is doing non-portable things by sending
non-portable filename characters in the first place.
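
A rough sketch of this round trip, in Python rather than C. It leans on Python's "surrogateescape" and "surrogatepass" error handlers, which map undecodable bytes 0x80..0xFF to the lone code points U+DC80..U+DCFF; this is similar in spirit to the mapping described above but is not necessarily what uuencode would do. The function names and the plain hex flattening are assumptions made for the example.

```python
def name_to_wire(raw: bytes, sender_charset: str) -> str:
    """Sender: locale bytes -> text -> UTF-8 -> ASCII hex."""
    # Bytes that are invalid in the sender's charset become lone
    # surrogates U+DC80..U+DCFF instead of raising an error.
    text = raw.decode(sender_charset, errors="surrogateescape")
    # "surrogatepass" lets those lone surrogates through the UTF-8 encoder.
    return text.encode("utf-8", errors="surrogatepass").hex().upper()


def wire_to_name(wire: str, receiver_charset: str) -> bytes:
    """Receiver: ASCII hex -> UTF-8 -> text -> locale bytes."""
    text = bytes.fromhex(wire).decode("utf-8", errors="surrogatepass")
    # Lone surrogates turn back into the original raw bytes; a real
    # character missing from receiver_charset raises UnicodeEncodeError.
    return text.encode(receiver_charset, errors="surrogateescape")


wire = name_to_wire(b"j\xf6rg", "iso-8859-15")
print(wire, wire_to_name(wire, "utf-8"))
```

The lossy case (a character the receiver's charset cannot represent) is deliberately left as an error here, since the choice between '?' and something else is the open question raised above.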

>> or
>> =?utf-8?Q?j=C3=B6rg?=

You want some sort of utf-8 encoding, and preferably one that encodes
only the non-portable characters. This type of naming looks best to me.


Bruce Korb

Jul 8, 2011, 7:11:38 PM7/8/11
to Eric Blake, Eric, Bruno Haible, bug-gn...@gnu.org, q24...@gmail.com

Hi Eric(s),

This mojibake stuff is mumbo jumbo to me.

I looked into the iconv(3p) function a bit and it seems to be dependent
upon some character strings that are different from what one might
put in LANG or LC_ALL or LC_NAME environment variables. Those guys
take things like EN_us, for example, not character set specifications.
So how am I to know what the current character set is if all I know is
CN_hk, for example? I also didn't find a "this is how you do it" cookbook
or tutorial. I'd have this wired just as soon as I could figure out what
string to pass to iconv_open(3p). Pointers certainly appreciated!

Regards, Bruce

Eric Blake

Jul 8, 2011, 7:25:11 PM7/8/11
to Bruce Korb, Eric, Bruno Haible, bug-gn...@gnu.org, q24...@gmail.com
On 07/08/2011 05:11 PM, Bruce Korb wrote:
>
> Hi Eric(s),
>
> This mojibake stuff is mumbo jumbo to me.

mojibake is what happens when you interpret bytes from one character set
as though they were characters in another character set, and then
convert them according to that wrong assumption. A common symptom is
that when you view UTF-8 text with a unibyte Latin-1 charset, each
multibyte UTF-8 character appears as multiple 8-bit random characters
from Latin-1.
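
A small Python illustration of the effect, using the name "jörg" from earlier in the thread: the bytes stay the same, only the charset used to interpret them changes.

```python
utf8_bytes = "jörg".encode("utf-8")  # the bytes written by a UTF-8 sender

# The same four-plus-one bytes, read under three different assumptions.
for charset in ("utf-8", "iso-8859-15", "gb18030"):
    print(charset, utf8_bytes.decode(charset))
```

Only the first line shows the name the sender intended; the other two show what a recipient in a different locale would see.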

>
> I looked into the iconv(3p) function a bit and it seems to be dependent
> upon some character strings that are different from what one might
> put in LANG or LC_ALL or LC_NAME environment variables. Those guys
> take things like EN_us, for example, not character set specifications.
> So how am I to know what the current character set is if all I know is
> CN_hk, for example?

I suggest using the gnulib module localcharset which provides the
function locale_charset(). That should give an answer which is safe to
pass to iconv() as one of the two charsets, with "utf-8" being the other
charset.
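
The shape of that lookup can be shown with a Python analogue. This does not call gnulib's locale_charset(); it uses the Python standard library's locale.getpreferredencoding() as a stand-in, and the two function names are invented for the sketch.

```python
import locale


def current_charset() -> str:
    """Name of the charset implied by the user's locale settings."""
    try:
        locale.setlocale(locale.LC_CTYPE, "")  # adopt LANG / LC_* settings
    except locale.Error:
        pass  # unknown locale name: stay in the default "C" locale
    return locale.getpreferredencoding(False)


def to_utf8(raw: bytes, charset: str) -> bytes:
    """Convert a file name from the given charset to UTF-8."""
    return raw.decode(charset).encode("utf-8")


print(current_charset())
print(to_utf8(b"j\xf6rg", "iso-8859-1"))
```

The point is the same as in the C case: the locale name is not passed to the converter; the charset name derived from it is.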


Bruno Haible

Jul 8, 2011, 7:52:11 PM7/8/11
to Bruce Korb, Eric, bug-gn...@gnu.org, Eric Blake, q24...@gmail.com
Bruce Korb wrote:
> I'd have this wired just as soon as I could figure out what
> string to pass to iconv_open(3p).  Pointers certainly appreciated!

This would be locale_charset () from the gnulib module 'localcharset'.

Additionally, instead of doing the iconv() calls yourself - the error
handling can be complicated - you could make use of a "streaming iconv",
that is, a stream that is based on another stream and does an iconv()
conversion loop on the fly.

For output streams, this exists in gettext [1][2]; for input streams
it should work in a similar way.

Bruno

[1] http://git.savannah.gnu.org/gitweb/?p=gettext.git;a=blob_plain;f=gnulib-local/lib/iconv-ostream.oo.h;hb=HEAD
[2] http://git.savannah.gnu.org/gitweb/?p=gettext.git;a=blob_plain;f=gnulib-local/lib/iconv-ostream.oo.c;hb=HEAD
--
In memoriam Jean Moulin <http://en.wikipedia.org/wiki/Jean_Moulin>

Eli Zaretskii

Jul 9, 2011, 2:45:28 AM7/9/11
to Eric Blake, mad...@gmail.com, bug-gn...@gnu.org, q24...@gmail.com, br...@clisp.org, bk...@gnu.org
> Date: Fri, 08 Jul 2011 17:25:11 -0600
> From: Eric Blake <ebl...@redhat.com>
> Cc: Eric <mad...@gmail.com>, Bruno Haible <br...@clisp.org>,
> bug-gn...@gnu.org, q24...@gmail.com

>
> I suggest using the gnulib module localcharset which provides the
> function locale_charset().

We are talking about the encoding of file names in the file system,
right? If so, is it necessarily the case that `locale_charset' will
produce the correct value for that? I'm not sure. At least for
Windows, `locale_charset' is not necessarily TRT.

Would it make sense to let the user specify the local encoding, at
least as an option?

Bruno Haible

Jul 9, 2011, 5:59:13 AM7/9/11
to Eli Zaretskii, bug-g...@gnu.org, mad...@gmail.com, bug-gn...@gnu.org, q24...@gmail.com, Eric Blake, bk...@gnu.org
[re-adding CC bug-gnulib]

Eli Zaretskii wrote:
> > I suggest using the gnulib module localcharset which provides the
> > function locale_charset().
>
> We are talking about the encoding of file names in the file system,
> right?

Yes.

> If so, is it necessarily the case that `locale_charset' will
> produce the correct value for that?

Yes. locale_charset() returns the encoding that the user has set - some way
or the other - for his locale. If file names on the part of the disk that the
user accesses are not in this encoding, he would face serious mojibake.

> I'm not sure. At least for Windows, `locale_charset' is not necessarily TRT.

What is the "right thing" for detecting the encoding of file names on Windows,
if not GetACP()?

There is a known bug in locale_charset: If a mingw program calls setlocale()
with a locale different from the system one, locale_charset() will ignore this
setlocale() call. But other than that, what is wrong?

> Would it make sense to let the user specify the local encoding, at
> least as an option?

No, because the encoding is not the only aspect of cultural conventions.
There's also the language, the sort order, and more. The user has the
possibility to set a locale in the Windows Control Panel. But it does not
allow to change the encoding. Why?
1. Because the OS uses GetACP() in many places.
2. Because the ISO C functions like strftime() etc. from MSVCRT wouldn't
work in de_DE.UTF-8 or zh_CN.GB18030 (or similar) locales.

Bruno
--
In memoriam Báb <http://en.wikipedia.org/wiki/Báb>

Eli Zaretskii

Jul 9, 2011, 7:08:25 AM7/9/11
to Bruno Haible, q24...@gmail.com, mad...@gmail.com, bug-gn...@gnu.org, bk...@gnu.org, bug-g...@gnu.org, ebl...@redhat.com
> From: Bruno Haible <br...@clisp.org>
> Date: Sat, 9 Jul 2011 11:59:13 +0200
> Cc: Eric Blake <ebl...@redhat.com>,
> bk...@gnu.org,
> mad...@gmail.com,
> bug-gn...@gnu.org,
> q24...@gmail.com

>
> > I'm not sure. At least for Windows, `locale_charset' is not necessarily TRT.
>
> What is the "right thing" for detecting the encoding of file names on Windows,
> if not GetACP()?

GetACP is correct for ANSI APIs, but not for "wide" (a.k.a. Unicode)
APIs. The latter use UTF-16, because that's how the Windows
filesystems encode the file names, at least with NTFS. I'm sure you
already know all that.

> > Would it make sense to let the user specify the local encoding, at
> > least as an option?
>
> No, because the encoding is not the only aspect of cultural conventions.
> There's also the language, the sort order, and more.

How are these relevant to the specific issue of encoding of the file
name passed to uuencode? The suggestion was to let the user specify
the encoding of the file name, and _only_ that encoding.

> The user has the possibility to set a locale in the Windows Control
> Panel. But it does not allow to change the encoding.

One _can_ change the encoding via the Control Panel, and that does
affect GetACP, AFAIK (if you logout/login after changing the current
setting). But again, I don't see how this is relevant to the issue at
hand.

It sounds possible, even if improbable, that the locale's charset is
set incorrectly, in which case locale_charset will return a wrong
value. I'm asking whether it would be a good idea to cater to such
situations, however improbable they sound, by providing a user option.

Bruce Korb

Jul 9, 2011, 11:17:51 AM7/9/11
to Eli Zaretskii, bug-g...@gnu.org, mad...@gmail.com, ebl...@redhat.com, bug-gn...@gnu.org, q24...@gmail.com, Bruno Haible
On 07/09/11 04:08, Eli Zaretskii wrote:
> ... I'm asking whether it would be a good idea to cater to such
> situations, however improbable they sound, by providing a user option.

There's already one uudecode option:

-o, --output-file=FILE direct output to FILE

but adding another to override the `locale_charset' wouldn't be
too hard to do, either.

Bruce Korb

Apr 1, 2013, 1:28:43 PM4/1/13
to Bruno Haible, Eric, Eric Blake, bug-gn...@gnu.org, q24...@gmail.com
Digging up an old thread:
http://lists.gnu.org/archive/html/bug-gnu-utils/2011-07/msg00000.html

Yesterday's release of sharutils "officially" adds a new option
to uuencode:

> uuencode (GNU sharutils) - encode a file into email friendly text
> Usage: uuencode [ -<flag> | --<name> ]... [<in-file>] <output-name>
>
> -m, --base64 convert using base64
> -e, --encode-file-name encode the output file name
> [...]

that will create an encoded output file name with an extension:

> $ echo hello | uuencode -e -m goodbye
> begin-base64-encoded 644 Z29vZGJ5ZQ==
> aGVsbG8K
> ====

The hyphen-separated words appended to "begin" are, essentially,
decoding options. uudecode is written to accept them in any order
(i.e. "begin-encoded-base64" is also acceptable).
Obviously, these two variations of the "begin" line are not
in the POSIX-specified format for uuencode-ed files.
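
How a decoder might read such a line can be sketched in Python. This is not the sharutils implementation; parse_begin_line is a hypothetical helper, and it assumes the encoded name is base64 text, as in the example above.

```python
import base64


def parse_begin_line(line: str):
    """Split a "begin" line into (options, mode, file name)."""
    keyword, mode, name = line.split(" ", 2)
    words = keyword.split("-")
    if words[0] != "begin":
        raise ValueError("not a begin line")
    options = set(words[1:])  # order of the option words does not matter
    if "encoded" in options:
        # Assumption: the name is base64, as in the example above.
        name = base64.b64decode(name).decode("utf-8")
    return options, int(mode, 8), name


print(parse_begin_line("begin-base64-encoded 644 Z29vZGJ5ZQ=="))
print(parse_begin_line("begin-encoded-base64 644 Z29vZGJ5ZQ=="))
```

Collecting the option words into a set is what makes both orderings equivalent; a plain POSIX "begin" line simply yields an empty set.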
