[Python-ideas] Introduce some obvious way to encode and decode filenames from Python code

Sven Marnach

Jul 16, 2012, 10:49:52 AM
to python...@python.org
Currently, there is no obvious way to encode a filename in the default
filesystem encoding. To pipe some filenames to the stdin of a
subprocess, I effectively used

encoded_name = file_name.encode(sys.getfilesystemencoding())

which mostly worked. There are cases where this fails, though: on
Linux with LANG=C and filenames that contain non-ASCII characters, for
example, or in any situation where the default filesystem encoding
can't decode a filename.

The correct way to do this seems to be something like

    if sys.platform == "win32":
        errors = "strict"
    else:
        errors = "surrogateescape"
    encoded_name = file_name.encode(sys.getfilesystemencoding(),
                                    errors=errors)
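
The failure mode and the effect of the "surrogateescape" handler can be sketched in a few lines. This is only an illustration: it pins the encoding to UTF-8 instead of calling sys.getfilesystemencoding() so that the result does not depend on the locale, and the filename is made up.

```python
# A byte that is never valid in UTF-8 (or ASCII), as it might appear in a
# POSIX filename. The name itself is an invented example.
raw_name = b"photo_\xff.jpg"

# Decoding with "surrogateescape" maps the bad byte to the lone surrogate
# U+DCFF instead of raising an error.
file_name = raw_name.decode("utf-8", errors="surrogateescape")

# The default "strict" handler refuses to encode a lone surrogate.
try:
    file_name.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Encoding with "surrogateescape" restores the original byte exactly.
round_tripped = file_name.encode("utf-8", errors="surrogateescape")
```

With a strict encode the undecodable name cannot be written out again, while the "surrogateescape" round trip gives back the original bytes.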

I think there should be (1) some documentation on the issue and (2) a
more obvious way to encode filenames.

1. The most useful reference I could find in the docs is

http://docs.python.org/dev/c-api/unicode.html#file-system-encoding

and there is a short paragraph at

http://docs.python.org/dev/library/os.html#file-names-command-line-arguments-and-environment-variables

The filename encoding applies to basically all Python library
functions (including built-ins like `open()`) and should probably
be documented at a more prominent spot. The "surrogateescape"
error handler isn't mentioned here:

http://docs.python.org/dev/howto/unicode.html#unicode-filenames

2. There should be some way to access the C API functions for decoding
and encoding filenames from Python. I don't have a good idea how
to do this – maybe by adding a meta-encoding "filesystem", or by
adding functions to the standard library.

Did I miss something? Any thoughts?

Cheers,
Sven
_______________________________________________
Python-ideas mailing list
Python...@python.org
http://mail.python.org/mailman/listinfo/python-ideas

Antoine Pitrou

Jul 16, 2012, 11:49:56 AM
to python...@python.org
On Mon, 16 Jul 2012 15:49:52 +0100
Sven Marnach <sv...@marnach.net> wrote:
> Currently, there is no obvious way to encode a filename in the default
> filesystem encoding. To pipe some filenames to the stdin of a
> subprocess, I effectively used
>
> encoded_name = file_name.encode(sys.getfilesystemencoding())

Well, how about os.fsencode() and os.fsdecode()?

http://docs.python.org/dev/library/os.html#os.fsencode
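
A minimal sketch of how the two functions pair up, assuming an invented filename whose bytes happen to be valid UTF-8 so that the round trip works on both POSIX and Windows:

```python
import os

# A filename as bytes, the form the operating system uses on POSIX.
# The name is an invented example ("naïve.txt" encoded as UTF-8).
raw_name = b"na\xc3\xafve.txt"

# os.fsdecode() picks the filesystem encoding and error handler itself.
file_name = os.fsdecode(raw_name)

# os.fsencode() is the inverse and returns the original bytes.
encoded_name = os.fsencode(file_name)

# Both functions pass through values that already have the target type.
same_str = os.fsdecode(file_name)
same_bytes = os.fsencode(raw_name)
```

The caller never has to name the encoding or the error handler, which is the part that was easy to get wrong by hand.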

Regards

Antoine.


--
Software development and contracting: http://pro.pitrou.net

Sven Marnach

Jul 16, 2012, 1:04:41 PM
to python...@python.org
Antoine Pitrou schrieb am Mon, 16. Jul 2012, um 17:49:56 +0200:
> On Mon, 16 Jul 2012 15:49:52 +0100
> Sven Marnach <sv...@marnach.net> wrote:
> > Currently, there is no obvious way to encode a filename in the default
> > filesystem encoding. To pipe some filenames to the stdin of a
> > subprocess, I effectively used
> >
> > encoded_name = file_name.encode(sys.getfilesystemencoding())
>
> Well, how about os.fsencode() and os.fsdecode()?
>
> http://docs.python.org/dev/library/os.html#os.fsencode

Oh, great, there they are! I think these functions should be
mentioned in these sections to make them easier to find:

[1]: http://docs.python.org/dev/library/os.html#file-names-command-line-arguments-and-environment-variables
[2]: http://docs.python.org/dev/library/sys.html#sys.getfilesystemencoding
[3]: http://docs.python.org/dev/howto/unicode.html#unicode-filenames

I'll post an issue on the issue tracker.

Cheers,
Sven

Victor Stinner

Jul 16, 2012, 1:23:59 PM
to python...@python.org
>> Well, how about os.fsencode() and os.fsdecode()?
>>
>> http://docs.python.org/dev/library/os.html#os.fsencode
>
> Oh, great, there they are! I think these functions should be
> mentioned in these sections to make them easier to find:
>
> [1]: http://docs.python.org/dev/library/os.html#file-names-command-line-arguments-and-environment-variables
> [2]: http://docs.python.org/dev/library/sys.html#sys.getfilesystemencoding
> [3]: http://docs.python.org/dev/howto/unicode.html#unicode-filenames
>
> I'll post an issue on the issue tracker.

Hi,

I wrote these functions when I worked on this topic for Python 3. Yes,
it would be great if you wrote a patch to mention these functions in
the doc.

Someone also complained that the surrogateescape error handler is not
mentioned in any FS-related function.

Victor

And Clover

Jul 16, 2012, 7:00:32 PM
to python...@python.org
On 16/07/12 18:23, Victor Stinner wrote:
> I wrote these functions when I worked on this topic for Python 3. Yes,
> it would be great if you write a patch to mention these functions in
> the doc.

Sure.

But should we be encouraging their use on Windows? I would have thought
it the best thing to stick with the Unicode string for paths on NT, so
that the native Win32 Unicode APIs are used instead of the
ANSI-code-page-bound C stdio. Encoding down to the fsencoding for
Windows just means that any path including a character that isn't in the
ANSI CP will fail.

In lieu of some kind of abstract filepath object that could represent
either bytes or str (depending on platform), how about a function that
takes a str and only encodes it to bytes if the platform requires it?
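
The kind of helper proposed here might look like the sketch below. The name native_path() is hypothetical and not part of the standard library; the sketch only assumes that os.name is "nt" on Windows.

```python
import os


def native_path(path: str):
    """Hypothetical helper: return the form the platform uses natively."""
    if os.name == "nt":
        # Windows APIs take Unicode, so the string is left alone.
        return path
    # POSIX APIs take bytes; os.fsencode() applies the filesystem encoding.
    return os.fsencode(path)


result = native_path("notes.txt")
```

The return type then differs by platform (str on Windows, bytes elsewhere), which callers would have to handle.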

cheers,

--
And Clover
mailto:a...@doxdesk.com
http://www.doxdesk.com/
gtalk:chat?jid=bob...@gmail.com

Antoine Pitrou

Jul 16, 2012, 7:41:24 PM
to python...@python.org
On Tue, 17 Jul 2012 00:00:32 +0100
And Clover <and...@doxdesk.com> wrote:

> On 16/07/12 18:23, Victor Stinner wrote:
> > I wrote these functions when I worked on this topic for Python 3. Yes,
> > it would be great if you write a patch to mention these functions in
> > the doc.
>
> Sure.
>
> But should we be encouraging their use on Windows? I would have thought
> it the best thing to stick with the Unicode string for paths on NT, so
> that the native Win32 Unicode APIs are used instead of the
> ANSI-code-page-bound C stdio. Encoding down to the fsencoding for
> Windows just means that any path including a character that isn't in the
> ANSI CP will fail.

Well, even under Unix, these functions are only useful for very
specialized cases. For normal usage, PEP 383 guarantees that all
filenames, including theoretically undecodable ones, pass through
properly. When piping filenames between Python processes, you can use
whatever encoding you want (or you can also use json or pickle).

The only remaining use case is sending some filenames to an external
(non-Python) program over a bytes stream, or reading some filenames
emitted by such a program. Here, you need bytes under Windows as well.
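
A rough sketch of that remaining use case follows. A child Python process that echoes its stdin stands in for the external program, and the filenames and the one-name-per-line framing are assumptions made for the example.

```python
import os
import subprocess
import sys

# Filenames as str; built with os.fsdecode() so they can always be encoded
# again under the current filesystem encoding. The names are invented.
file_names = [os.fsdecode(b"a.jpg"), os.fsdecode(b"caf\xc3\xa9.jpg")]

# One encoded name per line, as a bytes payload for the child's stdin.
payload = b"\n".join(os.fsencode(name) for name in file_names) + b"\n"

# Stand-in for an external program: a child that echoes stdin unchanged.
echo_script = "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())"
completed = subprocess.run(
    [sys.executable, "-c", echo_script],
    input=payload,
    stdout=subprocess.PIPE,
    check=True,
)

# Decode the bytes that came back into str filenames.
received = [os.fsdecode(line) for line in completed.stdout.splitlines()]
```

The pipe carries bytes in both directions; os.fsencode() and os.fsdecode() are only applied at the boundary.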

Regards

Antoine.


--
Software development and contracting: http://pro.pitrou.net


Victor Stinner

Jul 16, 2012, 9:03:24 PM
to And Clover, python...@python.org
2012/7/17 And Clover <and...@doxdesk.com>:
> But should we be encouraging their use on Windows? I would have thought it
> the best thing to stick with the Unicode string for paths on NT, so that the
> native Win32 Unicode APIs are used instead of the ANSI-code-page-bound C
> stdio. Encoding down to the fsencoding for Windows just means that any path
> including a character that isn't in the ANSI CP will fail.

os.fsencode() should not be used explicitly on Windows.

> In lieu of some kind of abstract filepath object that could represent either
> bytes or str (depending on platform), how about a function that takes a str
> and only encodes it to bytes if the platform requires it?

You can use the str (Unicode) type on all platforms with Python 3, so
use os.fsdecode(). os.listdir(str) does return str filenames on any
platform for example.
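
The os.listdir() behaviour can be checked with a short sketch. It assumes a temporary directory and an invented filename; the type of the argument decides the type of the returned names.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as directory:
    # Create one file whose name contains a non-ASCII character.
    file_name = os.fsdecode(b"na\xc3\xafve.txt")
    with open(os.path.join(directory, file_name), "w"):
        pass

    # A str argument yields str names; a bytes argument yields bytes names.
    str_names = os.listdir(directory)
    bytes_names = os.listdir(os.fsencode(directory))

# Decoding the bytes names gives the same result as asking for str directly.
decoded_names = [os.fsdecode(name) for name in bytes_names]
```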

Victor

Stefan Behnel

Jul 17, 2012, 12:57:43 AM
to python...@python.org
Victor Stinner, 17.07.2012 03:03:
> 2012/7/17 And Clover:
>> But should we be encouraging their use on Windows? I would have thought it
>> the best thing to stick with the Unicode string for paths on NT, so that the
>> native Win32 Unicode APIs are used instead of the ANSI-code-page-bound C
>> stdio. Encoding down to the fsencoding for Windows just means that any path
>> including a character that isn't in the ANSI CP will fail.
>
> os.fsencode() should not be used explicitly on Windows.
>
>> In lieu of some kind of abstract filepath object that could represent either
>> bytes or str (depending on platform), how about a function that takes a str
>> and only encodes it to bytes if the platform requires it?
>
> You can use the str (Unicode) type on all platforms with Python 3, so
> use os.fsdecode(). os.listdir(str) does return str filenames on any
> platform for example.

That's not the main use case I see, though. When talking to C libraries,
for example, they will usually require a byte encoded file path and also
return one. Getting the encoding right in this case is really not trivial.

I would expect that the above functions do "the right thing" also on
Windows here, unless the library really has a win32 specific file API (and
that's not likely).
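
As a sketch of what this looks like from Python, assuming ctypes is used for the binding: no real C library is called here, and the path is made up. The example only shows the conversion on each side of a `char *` boundary.

```python
import ctypes
import os

# An invented relative path, as the Python side would hold it.
path = os.path.join("data", "input.txt")

# A C function declared as taking `const char *` needs bytes, not str.
c_path = ctypes.c_char_p(os.fsencode(path))

# A path handed back by C code as `char *` arrives as bytes and is decoded.
returned = os.fsdecode(c_path.value)
```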

Stefan

Mark Lawrence

Jul 17, 2012, 9:20:57 AM
to python...@python.org
On 17/07/2012 02:03, Victor Stinner wrote:
>
> os.fsencode() should not be used explicitly on Windows.
>
> Victor
>

Should there be a note in the docs to this effect?

--
Cheers.

Mark Lawrence.

Ned Batchelder

Jul 17, 2012, 9:59:32 AM
to python...@python.org

On 7/16/2012 11:49 AM, Antoine Pitrou wrote:
> On Mon, 16 Jul 2012 15:49:52 +0100
> Sven Marnach <sv...@marnach.net> wrote:
>> Currently, there is no obvious way to encode a filename in the default
>> filesystem encoding. To pipe some filenames to the stdin of a
>> subprocess, I effectively used
>>
>> encoded_name = file_name.encode(sys.getfilesystemencoding())
> Well, how about os.fsencode() and os.fsdecode()?
>
> http://docs.python.org/dev/library/os.html#os.fsencode
It's too bad these are not called os.path.encode() and os.path.decode(),
since they fit so nicely into os.path's charter of manipulating strings
representing file paths.

--Ned.

> Regards
>
> Antoine.

Victor Stinner

Jul 17, 2012, 10:31:57 AM
to Ned Batchelder, python...@python.org
>> Well, how about os.fsencode() and os.fsdecode()?
>>
>> http://docs.python.org/dev/library/os.html#os.fsencode
>
> It's too bad these are not called os.path.encode() and os.path.decode(),
> since they fit so nicely into os.path's charter of manipulating strings
> representing file paths.

os.fsencode()/fsdecode() are not specific to filesystems: you can use
these functions to encode/decode command line arguments, environment
variables, text from/to a console (sys.std*), etc.

The letters "fs" in the name come from the encoding used by these
functions: sys.get*filesystem*encoding().

For example, os.fsencode() is used by the subprocess module and
posixpath.expanduser(), and os.fsdecode() is used by
os.get_exec_path() and shutil.rmtree().
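
The environment-variable case can be sketched as follows. The variable name FSENCODE_DEMO is made up for the example, and os.environb is only consulted where os.supports_bytes_environ says it exists.

```python
import os

# An invented value, built with os.fsdecode() so it can be encoded again.
value = os.fsdecode(b"caf\xc3\xa9")
os.environ["FSENCODE_DEMO"] = value  # hypothetical variable name

if os.supports_bytes_environ:
    # The bytes view of the same environment, kept in sync with os.environ.
    raw_value = os.environb[b"FSENCODE_DEMO"]
else:
    # No bytes environment on this platform (e.g. Windows); encode by hand.
    raw_value = os.fsencode(value)
```

Where the bytes view exists, the raw value matches what os.fsencode() produces for the str value.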

Victor

Sven Marnach

Jul 17, 2012, 5:25:42 PM
to python...@python.org
Victor Stinner schrieb am Tue, 17. Jul 2012, um 03:03:24 +0200:
> os.fsencode() should not be used explicitly on Windows.

What else should I do to pipe filenames to another process? At least,
os.fsencode() seems to work, even with Cyrillic filenames.

Cheers,
Sven

MRAB

Jul 17, 2012, 5:52:57 PM
to python-ideas
On 17/07/2012 22:25, Sven Marnach wrote:
> Victor Stinner schrieb am Tue, 17. Jul 2012, um 03:03:24 +0200:
>> os.fsencode() should not be used explicitly on Windows.
>
> What else should I do to pipe filenames to another process? At least,
> os.fsencode() seems to work, even with Cyrillic filenames.
>
Encode to UTF-8?

Sven Marnach

Jul 18, 2012, 6:36:54 AM
to python...@python.org
MRAB schrieb am Tue, 17. Jul 2012, um 22:52:57 +0100:
> On 17/07/2012 22:25, Sven Marnach wrote:
> >Victor Stinner schrieb am Tue, 17. Jul 2012, um 03:03:24 +0200:
> >>os.fsencode() should not be used explicitly on Windows.
> >
> >What else should I do to pipe filenames to another process? At least,
> >os.fsencode() seems to work, even with Cyrillic filenames.
> >
> Encode to UTF-8?

I don't have control over the other process (it's ExifTool in batch
mode), so I have to use whatever encoding is considered the standard
to encode filenames on Windows. `os.fsencode()` works fine for this,
and Victor answered off-list that it would be fine in this case.

Cheers,
Sven