My understanding from contributing to Mercurial (version control tools are
essentially virtual filesystems that need to store paths and realize them
on multiple operating systems and filesystems) is as follows.
On Linux/POSIX, a path is any sequence of bytes not containing a NUL byte.
As such, a path can be modeled by a char*. Any tool that needs to preserve
paths to and from I/O functions should be using NUL-terminated byte strings
internally. For display, etc., you can attempt to decode those bytes using
an encoding. But there's no guarantee the byte sequences from the
filesystem/OS will be e.g. proper UTF-8. And if you normalize all paths to
e.g. Unicode internally, this can lead to round-tripping errors.
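To make the round-tripping point concrete, here's a small Python sketch. Python's own os.fsencode()/os.fsdecode() use the same surrogateescape trick under the hood on POSIX; the byte value 0xE9 below is just an arbitrary non-UTF-8 example:

```python
# A byte sequence that is a legal POSIX filename but not valid UTF-8
# (0xE9 is "é" in Latin-1).
raw = b"caf\xe9.txt"

# A strict decode fails outright.
try:
    raw.decode("utf-8")
except UnicodeDecodeError:
    print("not valid UTF-8")

# The surrogateescape error handler maps each undecodable byte to a
# lone surrogate (here U+DCE9), so the original bytes survive a
# str round trip exactly.
name = raw.decode("utf-8", "surrogateescape")
assert name == "caf\udce9.txt"
assert name.encode("utf-8", "surrogateescape") == raw
```

A plain `raw.decode("utf-8")` or a normalize-to-NFC step would either fail or lose the original bytes; surrogateescape is the escape hatch that keeps the round trip lossless.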
On Windows, there are multiple APIs for I/O. Under the hood, NTFS stores
filenames as sequences of 16-bit code units. These are nominally UTF-16,
but validity isn't enforced, so names can contain e.g. unpaired surrogates.
This means you should be using the *W() functions for all I/O, e.g.
CreateFileW(). (Note: if you use CreateFileW(), you can also use the "\\?\"
prefix on the filename to avoid the MAX_PATH (260 character) limitation.)
If you want to preserve filenames on Windows, you should be using these
functions. If you use the *A() functions (e.g. CreateFileA()) or the POSIX
libc compatibility layer (e.g. open()), it is very difficult to preserve
exact byte sequences. Further clouding matters is that values from
environment variables, command line arguments, etc. may be in
unexpected/different encodings. I can't recall the specifics here, but
there are definitely cases where the bytes being exposed don't match
exactly what is stored in NTFS.
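The string handling for the long-path prefix can be sketched in a few lines. This is a hypothetical helper, not a real API, and note the catch: "\\?\" also turns off Win32 path normalization, so the path you hand it must already be absolute, backslash-separated, and free of "." / ".." components:

```python
def extended_length(path: str) -> str:
    # Hypothetical helper: prefix an absolute Windows path so the *W()
    # APIs accept more than MAX_PATH characters.  The \\?\ form disables
    # Win32 path normalization, so `path` must already be absolute, use
    # backslashes, and contain no "." or ".." components.
    if path.startswith("\\\\?\\"):
        return path                          # already extended-length
    if path.startswith("\\\\"):              # UNC path: \\server\share\...
        return "\\\\?\\UNC\\" + path[2:]
    return "\\\\?\\" + path

print(extended_length(r"C:\some\deep\tree\file.txt"))
# \\?\C:\some\deep\tree\file.txt
print(extended_length(r"\\server\share\file.txt"))
# \\?\UNC\server\share\file.txt
```

The helper is idempotent, so it's safe to apply at the boundary right before every *W() call rather than tracking which paths have already been prefixed.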
In addition to that, there are various normalizations that the operating
system or filesystem may apply to filenames. For example, Windows reserves
various filenames (CON, NUL, COM1, and so on) that the Win32 API will
disallow (but NTFS can store if you go through the lower-level native
APIs), and macOS or its filesystem will silently munge certain Unicode
code points via normalization (this is a fun one because of the security
implications).
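The macOS munging is Unicode normalization: HFS+ stores filenames in (a variant of) NFD decomposed form, so the precomposed name you create isn't byte-for-byte the name you read back. A quick Python illustration of why two visually identical names can differ on disk:

```python
import unicodedata

# "é" as one precomposed code point vs. "e" + combining acute accent.
nfc = "caf\u00e9.txt"      # NFC form: U+00E9
nfd = "cafe\u0301.txt"     # NFD form: U+0065 U+0301

# They render identically but are different strings...
assert nfc != nfd
assert unicodedata.normalize("NFD", nfc) == nfd
assert unicodedata.normalize("NFC", nfd) == nfc

# ...and different byte sequences on disk: 0xC3 0xA9 vs. 0x65 0xCC 0x81.
assert nfc.encode("utf-8") != nfd.encode("utf-8")
```

Any code that compares a stored path against what the filesystem hands back with plain equality can be tripped up by this, which is where the security implications come from.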
In all cases, if a filename originates from something other than the
filesystem, it is probably a good idea to normalize it to Unicode
internally and then spit out UTF-8 (or whatever the most-native API
expects).
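A minimal sketch of that advice in Python (the helper name is mine, and NFC is just one reasonable choice of canonical form):

```python
import unicodedata

def canonical_name(name: str) -> str:
    # Hypothetical helper for names that come from users, config files,
    # the network, etc. -- anything *not* read back from the filesystem.
    # Pick one canonical form (NFC here) and stick to it internally;
    # encode to UTF-8 (or whatever the native API wants) only at the edges.
    return unicodedata.normalize("NFC", name)

# "e" + combining acute accent collapses to the precomposed "é".
assert canonical_name("cafe\u0301") == "caf\u00e9"
assert canonical_name("caf\u00e9") == "caf\u00e9"   # already canonical
```

The key distinction is the direction of the data: names read from the filesystem are preserved verbatim, while names from any other source get funneled through one canonical form.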
Many programming languages paper over these subtle differences, leading to
badness. For example, the preferred string-based path handling in Python
and Rust assumes paths are Unicode (each has its own logic to perform
encoding/decoding behind the scenes, and depending on settings, run-time
failures or data loss can occur). In both cases, there are
OS-specific/native path primitives that give you access to the raw data:
bytes paths and os.fsencode()/os.fsdecode() in Python, OsStr/OsString in
Rust. If you want to be resilient about preserving data and not munging
byte sequences, these primitives should be used. But it can be difficult,
because tons of software normalizes paths to Unicode, and having to deal
with a platform-specific data type in all consumers can be very annoying.
I hope this info is useful!