
Dropping remains of support for non-UTF-8 file paths on Gtk platforms (was: Re: Please do not use GetNativePath and GetNativeTarget in XP code and Windows-specific code)


Henri Sivonen

Dec 4, 2017, 6:20:00 AM
to Makoto Kato, Masatoshi Kimura, dev-platform
On Fri, Dec 1, 2017 at 3:15 AM, Makoto Kato <m_k...@ga2.so-net.ne.jp> wrote:
> I think that we don't have any data when user doesn't use non-UTF-8
> (and C) locale such as ja_JP.eucJP. We should get data via telemetry.

What should the telemetry measure? (Measuring whether we compute paths
to be UTF-8 in the code that still supports non-UTF-8 configurations
would probably be the wrong thing to measure, because the "C" locale
doesn't compute to UTF-8 and no one has cared enough to fix that.)

What kind of telemetry data would we need to see in order to proceed
with https://bugzilla.mozilla.org/show_bug.cgi?id=960957 (removing the
remains of support for non-UTF-8 file paths)?

And if we didn't proceed with that course of action, what would the
alternative course of action be? The current state doesn't really
support non-UTF-8 file paths.
https://bugzilla.mozilla.org/show_bug.cgi?id=1342659 has been open for
9 months, and the reporter only noticed the problem when upgrading from
Firefox 17 to 50, so the bug has existed for longer than those 9
months.
https://bugzilla.mozilla.org/show_bug.cgi?id=848268 has been open for
5 years. It looks like no one cares enough about non-UTF-8
configurations to make Gecko do what arguably would be the right way
to support non-UTF-8 file paths: using the glib file path conversion
functions, which don't do non-UTF-8 things unless the
G_BROKEN_FILENAMES environment variable has been set. The name of the
environment variable is very telling.
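
For illustration, here is a minimal Python sketch of the charset
selection that the glib documentation describes for file names on Unix.
It is not glib code: the function name filename_charset and its
parameters are made up for this example, and only the first entry of
G_FILENAME_ENCODING is considered.

```python
def filename_charset(env, locale_charset):
    """Pick the charset assumed for file names.

    `env` is a dict of environment variables and `locale_charset` is the
    charset of the current locale (both are hypothetical parameters used
    only for this sketch).
    """
    encoding_list = env.get("G_FILENAME_ENCODING")
    if encoding_list:
        first = encoding_list.split(",")[0].strip()
        # The special token "@locale" stands for the locale's charset.
        return locale_charset if first == "@locale" else first
    if "G_BROKEN_FILENAMES" in env:
        # Opt-in legacy mode: file names are in the locale's charset.
        return locale_charset
    # Default: file names are UTF-8 regardless of the locale.
    return "UTF-8"


print(filename_charset({}, "EUC-JP"))
print(filename_charset({"G_BROKEN_FILENAMES": "1"}, "EUC-JP"))
```

The point of the sketch is that the non-UTF-8 behavior is strictly
opt-in: with neither variable set, the locale is not consulted at all.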

Considering that we mainly get Gtk telemetry from Ubuntu, which has had
UTF-8 paths since its introduction, and that our support for non-UTF-8
file paths has been broken for years without much complaint (AFAIK 3
users reporting non-UTF-8: two EUC-JP and one ISO-8859-something in the
past 5 years), can we really expect telemetry to tell us anything useful?
Non-UTF-8 paths are an even more deeply legacy configuration than
non-PulseAudio audio, and telemetry told us it was OK to go
PulseAudio-only.

Four years ago, smontagu said that in the last 5 years (i.e. since
nine years ago now), the code he had contributed assumed UTF-8 on
*nix: https://www.mail-archive.com/dev-pl...@lists.mozilla.org/msg06083.html

I suggest that instead of delaying with a round of telemetry, we make
all non-Windows platforms in nsNativeCharsetUtils.cpp use what's
currently the OSX/Android code path.

--
Henri Sivonen
hsiv...@hsivonen.fi
https://hsivonen.fi/

Masatoshi Kimura

Dec 4, 2017, 7:05:32 AM
to dev-pl...@lists.mozilla.org
On 2017/12/04 20:19, Henri Sivonen wrote:
> I suggest that instead of delaying with a round of telemetry, we make
> all non-Windows platforms in nsNativeCharsetUtils.cpp use what's
> currently the OSX/Android code path.

+1

Some other data points:
* If by any chance a profile path contains non-ASCII characters on
non-UTF-8 UNIX systems, Firefox 57.0.1 must have broken the profile just
like 57.0 broke it on Windows. But we didn't hear any such complaints.

* Our GMP service assumes that the native encoding is always UTF-8
except on Windows. Some media playback must have been broken on UNIX
systems unless the locale is UTF-8.

I agree that telemetry is a waste of time in this case.

Henri Sivonen

Dec 5, 2017, 8:25:04 AM
to Masatoshi Kimura, dev-platform
On Mon, Dec 4, 2017 at 2:04 PM, Masatoshi Kimura <VYV0...@nifty.ne.jp> wrote:
> * If by any chance a profile path contains non-ASCII characters on
> non-UTF-8 UNIX systems, Firefox 57.0.1 must have broken the profile just
> like 57.0 broke it on Windows. But we didn't hear any such complaints.

Are you referring to
https://hg.mozilla.org/mozilla-central/rev/345fe119b8cf using
GetPath() on all platforms and not just Windows?

Experimenting in an Ubuntu VM, Firefox 57.0.1 indeed fails to save
prefs and history (but saves the HTTP disk cache and various other
things) if the profile path has an illegal byte in it. Additionally,
on Debian-based systems generally, adduser only allows usernames (and,
thereby in the common case where the home directory matches the user
name, home directories) that conform to the POSIX portable username
rules (subset of ASCII). useradd appears to have no such safeguards.

Henri Sivonen

Dec 5, 2017, 9:37:34 AM
to dev-platform
If the glibc ja_JP.eucjp locale has been generated prior to login,
LC_ALL is set to ja_JP.eucjp and the profile path contains a byte pair
that's valid EUC-JP but invalid UTF-8, Firefox 57.0.1 saves prefs but
doesn't save history or bookmarks and is unable to complete File: Save
Page As... (fails silently before asking where to save even if the
EUC-JP bytes are just in the profile path and not in the home
directory path). Additionally, downloads fail if the download target
path has non-ASCII EUC-JP bytes in it.
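
To make "valid EUC-JP but invalid UTF-8" concrete, here is a small
Python sketch. The sample name and the helper is_valid_utf8 are chosen
purely for illustration and are not taken from the experiment above.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Illustrative directory name; any kanji name would do for the comparison.
name = "日本"
utf8_bytes = name.encode("utf-8")
euc_jp_bytes = name.encode("euc_jp")

print(is_valid_utf8(utf8_bytes))              # True
print(is_valid_utf8(euc_jp_bytes))            # False
print(euc_jp_bytes.decode("euc_jp") == name)  # True
```

The same bytes that are a perfectly good name under ja_JP.eucjp cannot
be interpreted by code that assumes UTF-8 paths.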

So far, I've found Solaris documentation that rules out EUC-JP user
names: https://docs.oracle.com/cd/E23824_01/html/821-1474/attributes-5.html#scrolltoc
. In addition to Debian adduser enforcing POSIX username portability,
I found an anecdote suggesting that some management tool of RHEL from
years ago did, too.

I haven't found conclusive documentation explaining that there can't
be EUC-JP usernames and matching home directories out there on Linux
or BSD systems. However, per the above observation about failure to
save history, save bookmarks, invoke Save As... or download anything
if the path has EUC-JP bytes, it seems safe to conclude that if an
EUC-JP home directory name exists somewhere out there, Firefox is
already very broken on such a system.

ISHIKAWA,chiaki

Dec 5, 2017, 9:38:49 AM
to dev-pl...@lists.mozilla.org
On 2017/12/05 22:24, Henri Sivonen wrote:
> On Mon, Dec 4, 2017 at 2:04 PM, Masatoshi Kimura <VYV0...@nifty.ne.jp> wrote:
>> * If by any chance a profile path contains non-ASCII characters on
>> non-UTF-8 UNIX systems, Firefox 57.0.1 must have broken the profile just
>> like 57.0 broke it on Windows. But we didn't hear any such complaints.
>
> Are you referring to
> https://hg.mozilla.org/mozilla-central/rev/345fe119b8cf using
> GetPath() on all platforms and not just Windows?
>
> Experimenting in an Ubuntu VM, Firefox 57.0.1 indeed fails to save
> prefs and history (but saves the HTTP disk cache and various other
> things) if the profile path has an illegal byte in it. Additionally,
> on Debian-based systems generally, adduser only allows usernames (and,
> thereby in the common case where the home directory matches the user
> name, home directories) that conform to the POSIX portable username
> rules (subset of ASCII). useradd appears to have no such safeguards.
>

There are other non-ASCII character issues such as
https://bugzilla.mozilla.org/show_bug.cgi?id=1258613

But the bug I mention occurs because some characters are encoded
DIFFERENTLY under iOS and the rest of the world when UTF-8 is used.
So I think the bug will be there no matter whether this non-UTF-8 path
is removed or not.

By mentioning the bug, I just wanted to point out that there *ARE*
obnoxious bugs regarding non-ASCII character handling in mozilla software.
But majority of the Japanese users probably failed to file the non-ASCII
character bugs, and just think, "oh, another instance of Japanese
characters not passed correctly between mozilla applications and the
external programs, etc.": this type of Japanese character mungling has
been so common before, so it is simply ignored as one of those bugs, OR
when the problem happens it is so difficult to figure out WHERE the buck
stops (i.e., on what program either outputs incorrect character strings
or what program parses input incorrectly, OR BOTH.) And if the producer
and consumer seem to disagree on what type of encoding is used, it is
usually not quite clear even to ordinary programming type people WHAT is
the correct way on a given platform, and who is to blame, and thus many
simply failed to analyze the issue thoroughly and give up half way.
Oh well.

I have noted that, in the last three months or so, some Japanese strings
copied from Mozilla TB or mozilla FF are not parsed correctly when
they are inserted into other programs.
This did not happen before.

But due to exactly the same reason I noted above, I am not sure which
program is to blame, and have not bothered to
pester mozilla bugzilla with possibly false-positive bug reports.
If the problem persists in the next few months, I may file a bug.

TIA for people's attention.

Henri Sivonen

Dec 5, 2017, 10:17:34 AM
to ISHIKAWA,chiaki, dev-platform
On Tue, Dec 5, 2017 at 4:37 PM, ISHIKAWA,chiaki <ishi...@yk.rim.or.jp> wrote:
> There are other non-ASCII character issues such as
> https://bugzilla.mozilla.org/show_bug.cgi?id=1258613

Very weird bug! (Summary for others: decomposed voiced sound mark is
rendered on the wrong base character.)

> But the bug I mention occurs because some characters are encoded DIFFERENTLY
> under iOS and the rest of the world when UTF-8 is used.

HFS+ decomposed Unicode leakage to other systems causes pain, but the
topic of this thread isn't affected by Unicode normalization.

> By mentioning the bug, I just wanted to point out that there *ARE* obnoxious
> bugs regarding non-ASCII character handling in mozilla software.
> But majority of the Japanese users probably failed to file the non-ASCII
> character bugs, and just think, "oh, another instance of Japanese characters
> not passed correctly between mozilla applications and the external programs,
> etc."

Possibly, but unfiled bugs don't get fixed and, as seen with non-ASCII
path handling with non-UTF-8 Linux locales, even filed bugs don't get
fixed. As the experiments documented in my previous email to this
thread indicate, non-UTF-8 paths already cause such breakage that at
this point it no longer makes sense to even pretend to support them.

> I have noted that, in the last three months or so, some Japanese strings
> copied from Mozilla TB or mozilla FF are not parsed correctly when
> they are inserted into other programs.

That seems worrying, but without knowing which OS and which apps, I
can't really comment further.

Jonathan Kew

Dec 5, 2017, 11:04:37 AM
to Henri Sivonen, ISHIKAWA,chiaki, dev-platform
On 05/12/2017 15:16, Henri Sivonen wrote:
> On Tue, Dec 5, 2017 at 4:37 PM, ISHIKAWA,chiaki <ishi...@yk.rim.or.jp> wrote:
>> There are other non-ASCII character issues such as
>> https://bugzilla.mozilla.org/show_bug.cgi?id=1258613
>
> Very weird bug! (Summary for others: decomposed voiced sound mark is
> rendered on the wrong base character.)

Not all that weird, really; it's almost certainly due to using a font
that doesn't support the combining mark. Commented in the bug.

JK

ISHIKAWA,chiaki

Dec 5, 2017, 3:10:15 PM
to dev-pl...@lists.mozilla.org
Thank you for the comment in the bug.

But I wonder if there is a clearly discernible property/attribute that
distinguishes a font which supports a combining mark from one that doesn't.

Basically, the issue appears under linux OS (Debian GNU/Linux) which I
use daily.

Without knowing which font is causing the issue (by not supporting the
combining mark), I can't fix it.

Since the "normalization" of string into canonical form under linux
seems to solve the problem, I am inclined to have the OS or OS-supplied
library do that, but I am not entirely sure where the rendering happens.

Come to think of it, I am not sure whether the iOS mail client correctly
handles the filename of an attachment that is sent from Windows or from
Linux. The party with whom I exchanged the problematic e-mails mentioned
that there are e-mails with attachments which cannot be saved under the
original name, and a machine-generated filename seemed to be used instead.
Oh well, I will investigate this a bit during the holiday break.

TIA


Mike Hommey

Dec 5, 2017, 4:58:35 PM
to Henri Sivonen, dev-platform, ISHIKAWA, chiaki
On Tue, Dec 05, 2017 at 05:16:52PM +0200, Henri Sivonen wrote:
> On Tue, Dec 5, 2017 at 4:37 PM, ISHIKAWA,chiaki <ishi...@yk.rim.or.jp> wrote:
> > There are other non-ASCII character issues such as
> > https://bugzilla.mozilla.org/show_bug.cgi?id=1258613
>
> Very weird bug! (Summary for others: decomposed voiced sound mark is
> rendered on the wrong base character.)
>
> > But the bug I mention occurs because some characters are encoded DIFFERENTLY
> > under iOS and the rest of the world when UTF-8 is used.
>
> HFS+ decomposed Unicode leakage to other systems causes pain, but the
> topic of this thread isn't affected by Unicode normalization.
>
> > By mentioning the bug, I just wanted to point out that there *ARE* obnoxious
> > bugs regarding non-ASCII character handling in mozilla software.
> > But majority of the Japanese users probably failed to file the non-ASCII
> > character bugs, and just think, "oh, another instance of Japanese characters
> > not passed correctly between mozilla applications and the external programs,
> > etc."
>
> Possibly, but unfiled bugs don't get fixed and, as seen with non-ASCII
> path handling with non-UTF-8 Linux locales, even filed bugs don't get
> fixed. As the experiments documented in my previous email to this
> thread indicate, non-UTF-8 paths already cause such breakage that at
> this point it no longer makes sense to even pretend to support them.

Wouldn't it make sense, then, to actively fail to even start Firefox in
such cases, instead of pretending it kind of works at all, if we can't
even save history or bookmarks properly?
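
A hypothetical sketch of such a start-up check, in Python for brevity.
The names NonUtf8PathError and ensure_utf8_path are invented for this
example, and this is not what Firefox currently does.

```python
import os


class NonUtf8PathError(Exception):
    """Raised when a path's raw bytes are not valid UTF-8."""


def ensure_utf8_path(path) -> str:
    """Return the path as text, or refuse if its bytes are not UTF-8."""
    # os.fsencode() returns bytes unchanged and converts str back to the
    # raw bytes the operating system handed out.
    raw = os.fsencode(path)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as err:
        raise NonUtf8PathError(
            f"refusing to start: path {raw!r} is not valid UTF-8"
        ) from err
```

Running this once on the profile path before anything is written would
turn partial, silent failures into one clear error.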

Mike

Henri Sivonen

Dec 7, 2017, 5:27:12 AM
to Mike Hommey, dev-platform
On Tue, Dec 5, 2017 at 11:57 PM, Mike Hommey <m...@glandium.org> wrote:
> Wouldn't it make sense, then, to actively fail to even start Firefox in
> such cases, instead of pretending it kind of works at all, if we can't
> even save history or bookmarks properly?

Good point. Filed https://bugzilla.mozilla.org/show_bug.cgi?id=1423855 . Thanks.