Please do not use GetNativePath and GetNativeTarget in XP code and Windows-specific code

Masatoshi Kimura

Nov 29, 2017, 11:08:55 AM

to dev-platform

On Windows, the native charset is not UTF-8. Using these functions will
cause bugs such as [1] or [2].

Note that you can't simply replace all GetNative(Target|Path) with
Get(Target|Path) + NS_ConvertUTF16toUTF8. It will make things *worse*.
If you are using the path in logs or serialization formats (such as
[3]), probably it's OK to use NS_ConvertUTF16toUTF8. But if you are
passing the path to Operating system API functions or CRT functions,
probably you should change the function to a wide character variant
instead of converting the Unicode path. If you are passing the path to
some third-party libraries, read the documentation (or even the source)
of the library to determine the encoding of the path. If the library
wants the native charset, you will have to leave GetNative(Target|Path)
until the library is fixed to accept wide character paths or UTF-8 paths.
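
As a rough illustration of why the native charset is lossy on Windows, here is a minimal Python sketch (not Gecko code). The file name is made up, and `cp1252` is only an assumed stand-in for the system's "ANSI" code page.

```python
# Illustration only: cp1252 stands in for the Windows "ANSI" code page.
name = "日本語.txt"  # hypothetical file name

# UTF-16 (what the wide-character APIs take) and UTF-8 (fine for logs
# and serialization formats) both round-trip the name exactly.
assert name.encode("utf-16-le").decode("utf-16-le") == name
assert name.encode("utf-8").decode("utf-8") == name

# The native code page cannot represent the name at all.
try:
    name.encode("cp1252")
    representable = True
except UnicodeEncodeError:
    representable = False

# A best-effort conversion silently yields a different name: "???.txt".
lossy = name.encode("cp1252", errors="replace").decode("cp1252")
```

A path that has been through the lossy step no longer names the original file, which is why converting to the native charset is not safe even when the call itself succeeds.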

If you are not sure what to do with the Get(Target|Path) result, please
do not hesitate to ask me.

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1418325
[2] https://bugzilla.mozilla.org/show_bug.cgi?id=1420427
[3] https://bugzilla.mozilla.org/show_bug.cgi?id=1363482

Kris Maglione

Nov 29, 2017, 2:54:49 PM

to Masatoshi Kimura, dev-platform

We should really mark these functions as deprecated, or add some
sort of static analysis to make it more difficult to misuse
them. It's not obvious when/where these functions
should/shouldn't be used.

--
Kris Maglione
Senior Firefox Add-ons Engineer
Mozilla Corporation

Doing linear scans over an associative array is like trying to club
someone to death with a loaded Uzi.
--Larry Wall

Karl Tomlinson

Nov 29, 2017, 5:09:17 PM

I've always found this confusing, and so I'll write down the
understanding I've reached, in the hope that either it will help
others, or others can help me by correcting if these are
misunderstandings.

On Unix systems:

`nativePath`

contains the bytes corresponding to the native filename used
by native system calls.

`path`

is a UTF-16 encoding of an attempt to provide a human
readable version of the native filename. This involves
interpreting native bytes according to the character encoding
specified by the current locale of the application as
indicated by nl_langinfo(CODESET).

For different locales, the same file can have a different
`path`.

The native bytes may not be valid UTF-8, and so if the
character encoding is UTF-8, then there may not be a valid
`path` that can be encoded to produce the same `nativePath`.

It is best to use `nativePath` for working with filenames,
including conversion to URI, but use `path` when displaying
names in the UI.
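
A minimal Python sketch of the Unix case described above. The byte string and the two ISO-8859 encodings are illustrative assumptions, standing in for whatever nl_langinfo(CODESET) reports.

```python
# Hypothetical native filename: "café.txt" as written by a Latin-1 application.
native_path = b"caf\xe9.txt"

# The same bytes give a different `path` under different locale encodings.
path_latin1 = native_path.decode("iso-8859-1")
path_greek = native_path.decode("iso-8859-7")

# Under a UTF-8 locale there is no valid `path` for these bytes at all.
try:
    native_path.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
```

Here `path_latin1` is "café.txt", `path_greek` is a different string, and `valid_utf8` is False, while `native_path` stays the same in all three cases.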

On WINNT systems:

`path`

contains wide characters corresponding to the native filename
used by native wide character system APIs. For at least most
configurations, I assume wide characters are UTF-16, in which
case this is also human readable.

`nativePath`

is an attempt to represent the native filename in the native
multibyte character encoding specified by the current locale
of the application.

For different locales, I assume the same file can have a
different `nativePath`.

I assume there is not necessarily a valid multibyte character
encoding, and so there may not be a valid `nativePath` that
can be decoded to produce the same `path`.

It is best to use `path` for working with filenames.
Conversion to URI involves assuming `path` is UTF-16 and
converting to UTF-8.

The parameters mean very different things on different systems,
and so it is not generally possible to write XP code with either
of these, but Gecko attempts to do so anyway.

The numbers of applications not using UTF-8 and filenames not
valid UTF-8 are much smaller on Unix systems than the numbers of
applications not using UTF-8 and non-ASCII filenames on WINNT
systems, and so choosing to work with `path` provides more
compatibility than working with `nativePath`.

Kris Maglione

Nov 29, 2017, 6:00:59 PM

to Karl Tomlinson, dev-pl...@lists.mozilla.org

On Thu, Nov 30, 2017 at 11:09:07AM +1300, Karl Tomlinson wrote:
> The native bytes may not be valid UTF-8, and so if the
> character encoding is UTF-8, then there may not be a valid
> `path` that can be encoded to produce the same `nativePath`.

I think you mean "is not UTF-8"?

> It is best to use `nativePath` for working with filenames,
> including conversion to URI, but use `path` when displaying
> names in the UI.

I don't think this is true. The native filename isn't even
available to JS, which always deals with UTF-16 strings.

And for converting file paths to URIs, we should always be using
NewFileURI. It looks like that tries to use the UTF-16 path,
converted to UTF-8, but falls back to the native path if a
round-trip from and to the native charset doesn't produce the
same path.
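
The fallback described above can be sketched in Python as follows. This is only a reading of that description, not the actual NewFileURI code; the helper name `uri_path_bytes` and the `native_charset` parameter are hypothetical, and URL escaping is left out.

```python
def uri_path_bytes(native_path: bytes, native_charset: str) -> bytes:
    """Hypothetical helper: pick the bytes to put into a file URI."""
    try:
        # native bytes -> Unicode `path` -> native bytes again
        path = native_path.decode(native_charset)
        round_trips = path.encode(native_charset) == native_path
    except UnicodeError:
        round_trips = False
    if round_trips:
        return path.encode("utf-8")  # prefer the Unicode path, as UTF-8
    return native_path               # fall back to the raw native bytes
```

With a Latin-1 locale, `b"caf\xe9.txt"` round-trips and comes out as UTF-8; with a UTF-8 locale the same bytes do not decode, so they are passed through unchanged.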

> contains wide characters corresponding to the native filename
> used by native wide character system APIs. For at least most
> configurations, I assume wide characters are UTF-16, in which
> case this is also human readable.

On Windows, the UTF-16 path always corresponds to the wide
native pathname, which is always UTF-16.

> It is best to use `path` for working with filenames.
> Conversion to URI involves assuming `path` is UTF-16 and
> converting to UTF-8.

I think, in general, it's best to avoid using paths at all,
except when converting to URIs or dealing with paths that need
to be displayed to or read from users. We have APIs to convert
between nsIFiles and NSPR file descriptors, which generally work
best when available, and work correctly cross-platform.

The reason this issue came up was that I needed to serialize
pathnames to a cache file, and while the GetNativePath/SetNativePath
round trip should usually work correctly for that use case, we
sometimes wound up with single-byte UTF-8 input paths for the
pathnames when they came from sources other than the cache file.

Anne van Kesteren

Nov 30, 2017, 2:55:36 AM

to Kris Maglione, dev-platform, Karl Tomlinson

On Thu, Nov 30, 2017 at 12:00 AM, Kris Maglione <kmag...@mozilla.com> wrote:
> I don't think this is true. The native filename isn't even available to JS,
> which always deals with UTF-16 strings.

JS deals with 16-bit unsigned integers. In particular, you can
represent lone surrogates in JS, but not in UTF-16.

> On Windows, the UTF-16 path always corresponds to the wide native pathname,
> which is always UTF-16.

Per https://github.com/rust-lang/rust/issues/12056 that is not true
and Windows has the same flaw as JS.
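
The distinction can be shown with a short Python sketch. Python strings, like JS strings, can hold a lone surrogate, so they serve here as an assumed stand-in for both JS strings and Windows wide-character filenames.

```python
# A lone high surrogate: a legal 16-bit unit, but not well-formed UTF-16.
lone = "\ud800"

try:
    lone.encode("utf-16-le")
    is_valid_utf16 = True
except UnicodeEncodeError:
    is_valid_utf16 = False

# The raw 16-bit unit can still be written out if the check is bypassed.
raw_units = lone.encode("utf-16-le", errors="surrogatepass")
```

`is_valid_utf16` ends up False, while `raw_units` is the two bytes `00 D8`, which is what a wide-character API would be handed.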

--
https://annevankesteren.nl/

Masatoshi Kimura

Nov 30, 2017, 7:03:17 AM

to dev-pl...@lists.mozilla.org

I intentionally ignored non-UTF-8 UNIX locales because our support for
those locales is already half-broken and almost nobody cares about that.
For example, OS.File assumes that the filesystem encoding is always
UTF-8 on UNIX while nsIFile does not. This discrepancy caused a bug[1]
that did not get much attention.

I think it's time to stop pretending to support non-UTF-8 UNIX locales.

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1342659

Karl Tomlinson

Nov 30, 2017, 4:56:41 PM

> On Thu, Nov 30, 2017 at 11:09:07AM +1300, Karl Tomlinson wrote:
>> The native bytes may not be valid UTF-8, and so if the
>> character encoding is UTF-8, then there may not be a valid
>> `path` that can be encoded to produce the same `nativePath`.

Kris Maglione writes:
> I think you mean "is not UTF-8"?

If the bytes in the native filename are not valid UTF-8, then they
cannot be decoded as UTF-8. When the character encoding specified
by the current locale is UTF-8, or sometimes even if not
(e.g. OS.File), attempts are made to interpret the filename as
UTF-8 for `path`. These attempts may fail.

Perhaps similar issues may exist with other assumed encodings, but
UTF-8 is the common one.

If anyone were to ever want to try to display these filenames, then
the reference would be
https://developer.gnome.org/glib/stable/glib-Character-Set-Conversion.html#g-filename-display-name

There is some indication of what this does at
https://developer.gnome.org/glib/stable/glib-Character-Set-Conversion.html#g-get-filename-charsets
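
Conceptually, a display-name conversion is a lossy, one-way decode. The Python sketch below shows the idea only; it is not GLib's algorithm (which also consults the filename charsets), and `display_name` is a hypothetical helper.

```python
def display_name(native_name: bytes) -> str:
    """Hypothetical helper: lossy conversion, for display only."""
    # Invalid bytes become U+FFFD REPLACEMENT CHARACTER.
    return native_name.decode("utf-8", errors="replace")

original = b"caf\xe9.txt"
shown = display_name(original)

# The displayed string cannot be converted back to the original bytes.
assert shown.encode("utf-8") != original
```

The result is suitable for the UI but must never be used to reopen the file.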

> And for converting file paths to URIs, we should always be using
> NewFileURI. It looks like that tries to use the UTF-16 path,
> converted to UTF-8, but falls back to the native path if a
> round-trip from and to the native charset doesn't produce the
> same path.

I don't know what using `path` in URIs is trying to support.
Other applications use `nativePath` and so that is better for IPC
such as drag'n'drop and clipboards.

Makoto Kato

Nov 30, 2017, 8:15:48 PM

to Masatoshi Kimura, dev-platform

I don't think we have any data on how many users run with a non-UTF-8
(and non-C) locale such as ja_JP.eucJP. We should get data via telemetry.

-- Makoto

On Thu, Nov 30, 2017 at 9:02 PM, Masatoshi Kimura <VYV0...@nifty.ne.jp> wrote:
> I intentionally ignored non-UTF-8 UNIX locales because our support for
> those locales is already half-broken and almost nobody cares about that.
> For example, OS.File assumes that the filesystem encoding is always
> UTF-8 on UNIX while nsIFile does not. This discrepancy caused a bug[1]
> that did not get much attention.
>
> I think it's time to stop pretending to support non-UTF-8 UNIX locales.
>
> [1] https://bugzilla.mozilla.org/show_bug.cgi?id=1342659

Gregory Szorc

Dec 4, 2017, 8:01:47 PM

to Karl Tomlinson, dev-platform

My understanding from contributing to Mercurial (version control tools are
essentially virtual filesystems that need to store paths and realize them
on multiple operating systems and filesystems) is as follows.

On Linux/POSIX, a path is any sequence of bytes not containing a NUL byte.
As such, a path can be modeled by char*. Any tool that needs to preserve
paths from and to I/O functions should be using NUL-terminated byte strings
internally. For display, etc, you can attempt to decode those bytes using
an encoding. But there's no guarantee the byte sequences from the
filesystem/OS will be e.g. proper UTF-8. And if you normalize all paths to
e.g. Unicode internally, this can lead to round tripping errors.
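
One way to carry arbitrary bytes through a Unicode string type without losing them is Python's `surrogateescape` error handler (PEP 383), shown here as an example of the byte-preserving approach rather than as a recommendation for Gecko. The byte string is made up.

```python
raw = b"caf\xe9.txt"  # hypothetical filename bytes, not valid UTF-8

# Undecodable bytes are smuggled through str as lone low surrogates...
as_text = raw.decode("utf-8", errors="surrogateescape")

# ...and turned back into the exact original bytes on the way out.
assert as_text.encode("utf-8", errors="surrogateescape") == raw
```

On POSIX, `os.fsdecode()` and `os.fsencode()` use this same error handler, which is how Python keeps such paths round-trippable.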

On Windows, there are multiple APIs for I/O. Under the hood, NTFS is
purportedly using UTF-16 to store filenames. Although I can't recall if it
is actual UTF-16 or just byte pairs. This means you should be using the
*W() functions for all I/O, e.g. CreateFileW(). (Note: if you use
CreateFileW(), you can also use the "\\?\" prefix on the filename to avoid
MAX_PATH (260 character) limitations.) If you want to preserve filenames on
Windows, you should be using these functions. If you use the *A() functions
(e.g. CreateFileA()) or use the POSIX libc compatibility layer (e.g.
open()), it is very difficult to preserve exact byte sequences. Further
clouding matters is that values from environment variables, command line
arguments, etc may be in unexpected/different encodings. I can't recall
specifics here. But there are definitely cases where the bytes being
exposed may not match exactly what is stored in NTFS.
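
For reference, adding the "\\?\" prefix is plain string manipulation. The Python sketch below assumes an absolute, backslash-separated path, since prefixed paths are not normalized by Windows; `to_extended_length` is a hypothetical helper name.

```python
def to_extended_length(path: str) -> str:
    """Hypothetical helper; expects an absolute, backslash-separated path."""
    if path.startswith("\\\\?\\"):
        return path                        # already prefixed
    if path.startswith("\\\\"):
        # UNC path: \\server\share\f -> \\?\UNC\server\share\f
        return "\\\\?\\UNC\\" + path[2:]
    # Drive path: C:\dir\file -> \\?\C:\dir\file
    return "\\\\?\\" + path
```

The result is only meaningful when passed to the wide-character functions such as CreateFileW().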

In addition to that, there are various normalizations that the operating
system or filesystem may apply to filenames. For example, Windows has
various reserved filenames that the Windows API will disallow (but NTFS can
store if you use the NTFS APIs directly) and MacOS or its filesystem will
silently munge certain Unicode code points (this is a fun one because of
security implications).
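
Unicode normalization is one well-known example of this kind of rewriting (HFS+ stores names in a decomposed form). Whether it is the exact case meant above is an assumption; the Python sketch only shows that two strings which look identical can be different filenames at the byte level.

```python
import unicodedata

composed = "caf\u00e9.txt"     # é as a single code point (NFC)
decomposed = "cafe\u0301.txt"  # e followed by a combining acute accent (NFD)

# Same appearance, different code points and different UTF-8 bytes.
assert composed != decomposed
assert composed.encode("utf-8") != decomposed.encode("utf-8")

# A filesystem that normalizes will hand back one form for the other.
assert unicodedata.normalize("NFD", composed) == decomposed
assert unicodedata.normalize("NFC", decomposed) == composed
```

A tool that compares the name it wrote with the name it reads back byte for byte will see a mismatch on such a filesystem.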

In all cases, if a filename originates from something other than the
filesystem, it is probably a good idea to normalize it to Unicode
internally and then spit out UTF-8 (or whatever the most-native API
expects).

Many programming languages paper over these subtle differences leading to
badness. For example, the preferred path APIs for Python and Rust assume
paths are Unicode (they have their own logic to perform encoding/decoding
behind the scenes and depending on settings, run-time failures or data loss
can occur). In both cases, there are OS-specific/native path primitives
that give you access to the raw bytes. If you want to be resilient around
preserving data and not munging byte sequences, these primitives should be
used. But it can be difficult because tons of software normalizes paths to
Unicode and having to deal with a platform-specific data type in all
consumers can be very annoying.

I hope this info is useful!

Gregory Szorc

Dec 4, 2017, 8:15:18 PM

to Gregory Szorc, dev-platform, Karl Tomlinson

Quick follow-up: reading https://simonsapin.github.io/wtf-8/ would be a
good deep dive.

Robert O'Callahan

Dec 4, 2017, 10:23:05 PM

to Gregory Szorc, dev-platform, Karl Tomlinson

On Tue, Dec 5, 2017 at 2:00 PM, Gregory Szorc <g...@mozilla.com> wrote:

> Many programming languages paper over these subtle differences leading to
> badness. For example, the preferred path APIs for Python and Rust assume
> paths are Unicode (they have their own logic to perform encoding/decoding
> behind the scenes and depending on settings, run-time failures or data loss
> can occur).

I don't think that's true for Rust. Rust's `Path` and `PathBuf` are the
preferred data types and wrap an underlying `OsStr`/`OsString`, which
claims to be able to represent any path for the target platform. On Linux
an `OsString` is just an array of bytes with no specified encoding, and on
Windows it appears to be an array of WTF-8 bytes, so that claim seems valid.
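
The idea behind storing ill-formed UTF-16 as WTF-8 can be approximated in Python with the `surrogatepass` error handler. This is a sketch of the concept, not Rust's implementation and not a complete WTF-8 codec; the input units are made up.

```python
# 16-bit units as a Windows API might return them: 'a', lone D800, 'b'.
wide = b"a\x00\x00\xd8b\x00"

# Decode without rejecting the lone surrogate, then store as UTF-8-like bytes.
text = wide.decode("utf-16-le", errors="surrogatepass")
wtf8_like = text.encode("utf-8", errors="surrogatepass")

# The original 16-bit units can be recovered exactly.
back = wtf8_like.decode("utf-8", errors="surrogatepass").encode(
    "utf-16-le", errors="surrogatepass"
)
assert back == wide
```

The lone surrogate is stored as the three bytes `ED A0 80`, which strict UTF-8 decoders reject, so such data should not be labeled as UTF-8.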

Of course PathBuf isn't exactly what you want for an application like
Mercurial, since there I guess you want a type that represents any path
from any platform, including platforms other than the target platform...

Rob
--
lbir ye,ea yer.tnietoehr rdn rdsme,anea lurpr edna e hnysnenh hhe uresyf
toD
selthor stor edna siewaoeodm or v sstvr esBa kbvted,t
rdsme,aoreseoouoto
o l euetiuruewFa kbn e hnystoivateweh uresyf tulsa rehr rdm or rnea
lurpr
.a war hsrer holsa rodvted,t nenh hneireseoouot.tniesiewaoeivatewt sstvr
esn

Gregory Szorc

Dec 4, 2017, 11:26:08 PM

to Robert O'Callahan, dev-platform, Gregory Szorc, Karl Tomlinson

On Mon, Dec 4, 2017 at 7:22 PM, Robert O'Callahan <rob...@ocallahan.org>
wrote:

> On Tue, Dec 5, 2017 at 2:00 PM, Gregory Szorc <g...@mozilla.com> wrote:
>
>> Many programming languages paper over these subtle differences leading to
>> badness. For example, the preferred path APIs for Python and Rust assume
>> paths are Unicode (they have their own logic to perform encoding/decoding
>> behind the scenes and depending on settings, run-time failures or data
>> loss
>> can occur).
>
>
> I don't think that's true for Rust. Rust's `Path` and `PathBuf` are the
> preferred data types and wrap an underlying `OsStr`/`OsString`, which
> claims to be able to represent any path for the target platform. On Linux
> an `OsString` is just an array of bytes with no specified encoding, and on
> Windows it appears to be an array of WTF-8 bytes, so that claim seems valid.
>

Yes. I was confusing Rust's handling of paths with environment variables
and command line arguments. std::env::vars() and std::env::args() panic if
an element isn't "Unicode." (The docs for those types don't say how Rust
chooses which encoding it treats the underlying bytes as and now I'm
legitimately curious.) That's why there are vars_os() and args_os() to get
the raw data. I got mixed up and thought there was a Path and OsPath
distinction as well.

>
> Of course PathBuf isn't exactly what you want for an application like
> Mercurial, since there I guess you want a type that represents any path
> from any platform, including platforms other than the target platform...
>

Indeed. Mercurial treats all paths as bytes (for better or worse). There
are known problems with path handling on Windows. For the curious, read
https://www.mercurial-scm.org/wiki/WindowsUTF8Plan and
https://www.mercurial-scm.org/wiki/EncodingStrategy. And, uh, I should have
linked to EncodingStrategy before because that contains a lot of useful
info. Although it is a bit light on links to back up the research. That
page was mostly authored by mpm, so I generally trust the accuracy of the
content.

Henri Sivonen

Dec 5, 2017, 4:57:00 AM

to dev-platform

On Tue, Dec 5, 2017 at 3:00 AM, Gregory Szorc <g...@mozilla.com> wrote:
> My understanding from contributing to Mercurial (version control tools are
> essentially virtual filesystems that need to store paths and realize them
> on multiple operating systems and filesystems) is as follows.

When writing system tools that work with files, it makes sense to do
things correctly--i.e. the way Rust's standard library does with
Path/PathBuf being distinct types from text strings capable of
round-tripping bytes on *nix and 16-bit units on Windows.

Considering that Firefox isn't a low-level file management tool and
already has a legacy of file paths not staying fully within nsIFile
(there may be places where they travel as JS strings or through
OS.File), I think it's not worthwhile to go through the engineering and
unit testing effort to support profile directory / download directory
/ file upload paths that aren't valid UTF-16 on Windows or that aren't
valid UTF-8 on *nix.

On Windows, though, we probably should run our unit tests with paths
that are valid UTF-16 but contain characters that aren't representable
in the system's "ANSI code page".
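
A test name for this purpose can be checked against several code pages up front. The Python sketch below is one possible way to do that; `representable`, the list of code pages, and the test name (Thai plus Hebrew) are all illustrative assumptions.

```python
def representable(name: str, code_page: str) -> bool:
    """Hypothetical helper: can `name` be encoded in `code_page`?"""
    try:
        name.encode(code_page)
        return True
    except UnicodeEncodeError:
        return False

# A few common "ANSI" code pages: Western, Japanese, Thai, Hebrew.
CODE_PAGES = ["cp1252", "cp932", "cp874", "cp1255"]

# Thai + Hebrew, written with escapes; valid UTF-16, but no single
# code page in the list covers both scripts.
test_name = "\u0e20\u0e32\u0e29\u0e32-\u05e2\u05d1\u05e8\u05d9\u05ea.txt"

unrepresentable_everywhere = not any(
    representable(test_name, cp) for cp in CODE_PAGES
)
```

Mixing two scripts from different code pages keeps the test meaningful regardless of which locale the test machine happens to use.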

--
Henri Sivonen
hsiv...@hsivonen.fi
https://hsivonen.fi/