
Why exception from os.path.exists()?


Marko Rauhamaa

May 31, 2018, 8:03:23 AM

This surprising exception can even be a security issue:

>>> os.path.exists("\0")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.6/genericpath.py", line 19, in exists
    os.stat(path)
ValueError: embedded null byte

Most other analogous reasons *don't* generate an exception, nor is that
possibility mentioned in the specification:

https://docs.python.org/3/library/os.path.html?#os.path.exists

Is the behavior a bug? Shouldn't it be:

>>> os.path.exists("\0")
False


Marko

Chris Angelico

May 31, 2018, 8:46:58 AM
A Unix path name cannot contain a null byte, so what you have is a
fundamentally invalid name. ValueError is perfectly acceptable.

ChrisA

Marko Rauhamaa

May 31, 2018, 9:03:15 AM
Chris Angelico <ros...@gmail.com>:

> On Thu, May 31, 2018 at 10:03 PM, Marko Rauhamaa <ma...@pacujo.net> wrote:
>>
>> This surprising exception can even be a security issue:
>>
>> >>> os.path.exists("\0")
>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in <module>
>>   File "/usr/lib64/python3.6/genericpath.py", line 19, in exists
>>     os.stat(path)
>> ValueError: embedded null byte
>
> [...]
>
> A Unix path name cannot contain a null byte, so what you have is a
> fundamentally invalid name. ValueError is perfectly acceptable.

At the very least, that should be emphasized in the documentation. The
pathname may come from an external source. It is routine to check for
"/", "." and ".." but most developers (!?) would not think of checking
for "\0". That means few test suites would catch this issue and few
developers would think of catching ValueError here. The end result is
unpredictable.
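
A minimal sketch of the kind of check being described, assuming the untrusted input is meant to be a single path component on a POSIX-style system. The name `is_safe_component` is hypothetical, and the check is illustrative rather than a complete sanitizer:

```python
def is_safe_component(name: str) -> bool:
    """Return True if `name` looks usable as a single POSIX path component."""
    # The routine checks: empty names, the current and parent directory
    # entries, and anything containing a path separator.
    if name in ("", ".", ".."):
        return False
    if "/" in name:
        return False
    # The check that is easy to forget: os.stat() rejects an embedded NUL
    # with ValueError before the operating system is ever asked.
    if "\0" in name:
        return False
    return True


for candidate in ["report.txt", "..", "a/b", "evil\0.txt"]:
    print(repr(candidate), is_safe_component(candidate))
```

Rejecting the NUL up front means a later os.path.exists() call never sees a value it cannot hand to the kernel.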


Marko

Chris Angelico

May 31, 2018, 9:10:19 AM
The rules for paths come from the underlying system. You'll get quite
different results on Windows than you do on Unix. What should be
documented? Should it also be documented that you can get strange
errors when your path involves three different operating systems and
five different file systems? Is that Python's responsibility, or
should it be generally accepted that invalid values can cause
ValueError?

Do you have an actual use-case where it is correct for an invalid path
to be treated as not existing?

ChrisA

Marko Rauhamaa

May 31, 2018, 9:38:42 AM
Chris Angelico <ros...@gmail.com>:
> Do you have an actual use-case where it is correct for an invalid path
> to be treated as not existing?

Note that os.path.exists() returns False for other types of errors
including:

* File might exist but you have no access rights

* The pathname is too long for the file system

* The pathname is a broken symbolic link

* The pathname is a circular symbolic link

* The hard disk ball bearings are chipped

I'm not aware of any other kind of a string argument that would trigger
an exception except the presence of a NUL byte.

The reason for the different treatment is that the former errors are
caught by the kernel and converted to False by os.path.exists(). The NUL
byte check is carried out by Python's standard library.
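
The split described above can be sketched as follows. This is a paraphrase of what the traceback suggests os.path.exists() did in Python 3.6, not a copy of the library source, and `exists_sketch` is a hypothetical name:

```python
import os


def exists_sketch(path):
    """Approximate the Python 3.6 behaviour discussed in this thread."""
    try:
        os.stat(path)
    except OSError:
        # Errors reported by the operating system (missing file, no
        # permission, name too long, ...) are folded into False.
        return False
    # ValueError is not an OSError, so the "embedded null byte" error that
    # os.stat() raises itself is not caught above and reaches the caller.
    return True


print(exists_sketch(os.curdir))
try:
    exists_sketch("\0")
except ValueError as exc:
    print("ValueError:", exc)
```

The only difference between the two kinds of failure is which exception class os.stat() happens to raise.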


Marko

Chris Angelico

May 31, 2018, 10:02:10 AM
On Thu, May 31, 2018 at 11:38 PM, Marko Rauhamaa <ma...@pacujo.net> wrote:
> Chris Angelico <ros...@gmail.com>:
>> Do you have an actual use-case where it is correct for an invalid path
>> to be treated as not existing?
>
> Note that os.path.exists() returns False for other types of errors
> including:
>
> * File might exist but you have no access rights
>
> * The pathname is too long for the file system
>
> * The pathname is a broken symbolic link
>
> * The pathname is a circular symbolic link
>
> * The hard disk ball bearings are chipped

All of those are conceptually valid filenames, and it's perfectly
reasonable to ask if the file exists. Running the same program inside
a chroot might result in a True.

> I'm not aware of any other kind of a string argument that would trigger
> an exception except the presence of a NUL byte.

With a zero byte in the file name, it is not a valid file name under
any Unix-based OS. Regardless of the file system, "\0" is not valid.

> The reason for the different treatment is that the former errors are
> caught by the kernel and converted to False by os.path.exists(). The NUL
> byte check is carried out by Python's standard library.

That's because the kernel, having declared that zero bytes are
invalid, uses ASCIIZ filenames. It's way simpler that way. So the
Python string cannot validly be turned into input for the kernel. It's
on par with trying to represent 2**53+1.0 - it's not representable and
will behave differently. With floats, you get something close to the
requested value; with strings, they'd be truncated. But either way,
you absolutely cannot represent the file name "spam\0ham" to any Unix
kernel, because the file name is fundamentally invalid.
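
Both halves of that comparison can be seen from Python itself. This is only an illustration: ctypes stands in here for "a C API that takes NUL-terminated strings" and is not what os.stat() uses internally:

```python
import ctypes

# 2**53 + 1 has no exact double representation; the float result is the
# nearest representable value, which is 2**53 itself.
print(2**53 + 1.0 == 2**53)

# A C string ends at the first zero byte, so everything after it is lost
# when the buffer is read back as a NUL-terminated string.
c_name = ctypes.c_char_p(b"spam\0ham")
print(c_name.value)
```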

Can someone on Windows see if there are other path names that raise
ValueError there? Windows has a whole lot more invalid characters, and
invalid names as well.

ChrisA

Gregory Ewing

May 31, 2018, 10:15:15 AM
Chris Angelico wrote:

> A Unix path name cannot contain a null byte, so what you have is a
> fundamentally invalid name. ValueError is perfectly acceptable.

It would also make sense for it to simply return False, since
a file with such a name can't exist.

This is analogous to the way comparing objects of different types
for equality returns False instead of raising an exception.
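
A small illustration of that analogy, using only built-in types:

```python
# Equality between unrelated types answers False rather than raising.
int_vs_str = (1 == "1")
list_vs_dict = ([] == {})
print(int_vs_str, list_vs_dict)

# Ordering comparisons are stricter and raise TypeError in Python 3.
try:
    1 < "1"
except TypeError as exc:
    print("TypeError:", exc)
```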

--
Greg

MRAB

May 31, 2018, 10:51:29 AM
On Windows, the path '<' is invalid, but os.path.exists('<') returns
False, not an error.

The path '' is also invalid, but os.path.exists('') returns False, not
an error.

I don't see why '\0' should behave any differently.

Paul Moore

May 31, 2018, 10:55:56 AM
On 31 May 2018 at 15:01, Chris Angelico <ros...@gmail.com> wrote:
> Can someone on Windows see if there are other path names that raise
> ValueError there? Windows has a whole lot more invalid characters, and
> invalid names as well.

On Windows:

>>> os.path.exists('\0')
ValueError: stat: embedded null character in path

>>> os.path.exists('?')
False

>>> os.path.exists('\u77412')
False

>>> os.path.exists('\t')
False

Honestly, I think the OP's point is correct. os.path.exists should
simply return False if the filename has an embedded \0 - at least on
Unix. I don't know if Windows allows \0 in filenames, but if it does,
then os.path.exists should respect that...

Although I wouldn't consider this as anything even remotely like a
significant issue...

Paul

Steven D'Aprano

May 31, 2018, 11:14:10 AM
On Thu, 31 May 2018 22:46:35 +1000, Chris Angelico wrote:
[...]
>> Most other analogous reasons *don't* generate an exception, nor is that
>> possibility mentioned in the specification:
>>
>> https://docs.python.org/3/library/os.path.html?#os.path.exists
>>
>> Is the behavior a bug? Shouldn't it be:
>>
>> >>> os.path.exists("\0")
>> False
>
> A Unix path name cannot contain a null byte, so what you have is a
> fundamentally invalid name. ValueError is perfectly acceptable.

It should still be documented.

What does it do on Windows if the path is illegal?



--
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing
it everywhere." -- Jon Ronson

Paul Moore

May 31, 2018, 11:30:39 AM
On 31 May 2018 at 16:11, Steven D'Aprano
<steve+comp....@pearwood.info> wrote:
> On Thu, 31 May 2018 22:46:35 +1000, Chris Angelico wrote:
> [...]
>>> Most other analogous reasons *don't* generate an exception, nor is that
>>> possibility mentioned in the specification:
>>>
>>> https://docs.python.org/3/library/os.path.html?#os.path.exists
>>>
>>> Is the behavior a bug? Shouldn't it be:
>>>
>>> >>> os.path.exists("\0")
>>> False
>>
>> A Unix path name cannot contain a null byte, so what you have is a
>> fundamentally invalid name. ValueError is perfectly acceptable.
>
> It should still be documented.
>
> What does it do on Windows if the path is illegal?

Returns False (confirmed with paths of '?' and ':', among others).

Paul

Terry Reedy

May 31, 2018, 12:59:27 PM
Please open an issue on the tracker if there is not one for this already.


--
Terry Jan Reedy

Marko Rauhamaa

May 31, 2018, 1:21:54 PM
Terry Reedy <tjr...@udel.edu>:
issue 33721 created


Marko

Chris Angelico

May 31, 2018, 1:26:15 PM
On Fri, Jun 1, 2018 at 12:51 AM, MRAB <pyt...@mrabarnett.plus.com> wrote:
> On 2018-05-31 14:38, Marko Rauhamaa wrote:
>>
> On Windows, the path '<' is invalid, but os.path.exists('<') returns False,
> not an error.
>
> The path '' is also invalid, but os.path.exists('') returns False, not an
> error.
>
> I don't see why '\0' should behave any differently.

Okay, if it's just returning False for all the Windows invalid paths,
then sure, the Unix invalid paths can behave the same way.

Thanks for checking that (you and Paul equally).

ChrisA

Grant Edwards

May 31, 2018, 1:46:02 PM
On 2018-05-31, Paul Moore <p.f....@gmail.com> wrote:
> On 31 May 2018 at 15:01, Chris Angelico <ros...@gmail.com> wrote:
>> Can someone on Windows see if there are other path names that raise
>> ValueError there? Windows has a whole lot more invalid characters, and
>> invalid names as well.
>
> On Windows:
>
>>>> os.path.exists('\0')
> ValueError: stat: embedded null character in path
>
>>>> os.path.exists('?')
> False
>
>>>> os.path.exists('\u77412')
> False
>
>>>> os.path.exists('\t')
> False
>
> Honestly, I think the OP's point is correct. os.path.exists should
> simply return False if the filename has an embedded \0 - at least on
> Unix.

Except on the platform in question, filenames _don't_ contain an
embedded \0. What was passed was _not_ a path/filename.

You might as well have passed a floating point number or a dict.

> Although I wouldn't consider this as anything even remotely like a
> significant issue...

Agreed, but the thread will continue for months and generate hundreds
of follow-ups.

--
Grant Edwards               grant.b.edwards        Yow! You were s'posed
                                  at               to laugh!
                              gmail.com

Barry Scott

Jun 1, 2018, 9:02:43 AM
I think the reason for the \0 check is that if the string is passed to the
operating system with the \0 you can get surprising results.

If \0 were not checked for, you would be able to get True from:

os.path.exists('/home\0ignore me')

This is because a POSIX system only sees '/home'.
Surely ValueError is reasonable?

Once you know that all of the string you provided is given to the operating
system it can then do whatever checks it sees fit to and return a suitable
result.

As an aside, Windows has lots of special filenames that you have to know about
if you are writing robust file handling: AUX, COM1, \this\is\also\COM1 etc.

Barry


Paul Moore

Jun 1, 2018, 9:24:00 AM
On 1 June 2018 at 13:15, Barry Scott <ba...@barrys-emacs.org> wrote:
> I think the reason for the \0 check is that if the string is passed to the
> operating system with the \0 you can get surprising results.
>
> If \0 was not checked for you would be able to get True from:
>
> os.path.exists('/home\0ignore me')
>
> This is because a posix system only sees '/home'.

So because the OS API can't handle filenames with \0 in (because that
API uses null-terminated strings) Python has to special case its
handling of the check. That's fine.

> Surely ValueError is reasonable?

Well, if the OS API can't handle filenames with embedded \0, we can be
sure that such a file doesn't exist - so returning False is
reasonable.

> Once you know that all of the string you provided is given to the operating
> system it can then do whatever checks it sees fit to and return a suitable
> result.

As the programmer, I don't care. The Python interpreter should take
care of that for me, and if I say "does file 'a\0b' exist?" I want an
answer. And I don't see how anything other than "no it doesn't" is
correct. Python allows strings with embedded \0 characters, so it's
possible to express that question in Python - os.path.exists('a\0b').
What can be expressed in terms of the low-level (C-based) operating
system API shouldn't be relevant.

Disclaimer - the Python "os" module *does* expose low-level
OS-dependent functionality, so it's not necessarily reasonable to
extend this argument to other functions in os. But it seems like a
pretty solid argument in this particular case.

> As an aside Windows has lots of special filenames that you have to know about
> if you are writing robust file handling. AUX, COM1, \this\is\also\COM1 etc.

I don't think that's relevant in this context.

Paul

Marko Rauhamaa

Jun 1, 2018, 9:38:17 AM
Paul Moore <p.f....@gmail.com>:
> On 1 June 2018 at 13:15, Barry Scott <ba...@barrys-emacs.org> wrote:
>> Once you know that all of the string you provided is given to the
>> operating system it can then do whatever checks it sees fit to and
>> return a suitable result.
>
> As the programmer, I don't care. The Python interpreter should take
> care of that for me, and if I say "does file 'a\0b' exist?" I want an
> answer. And I don't see how anything other than "no it doesn't" is
> correct. Python allows strings with embedded \0 characters, so it's
> possible to express that question in Python - os.path.exists('a\0b').
> What can be expressed in terms of the low-level (C-based) operating
> system API shouldn't be relevant.

Interestingly, you get a False even for existing files if you don't have
permissions to access the file. Arguably, that answer is misleading, and
an exception would be justified. But since os.path.exists() returns a
False even in that situation, it definitely should return False for a
string containing a NUL character.


Marko

Richard Damon

Jun 1, 2018, 9:41:24 AM
On 5/31/18 1:43 PM, Grant Edwards wrote:
> On 2018-05-31, Paul Moore <p.f....@gmail.com> wrote:
>> On 31 May 2018 at 15:01, Chris Angelico <ros...@gmail.com> wrote:
>>
>> Honestly, I think the OP's point is correct. os.path.exists should
>> simply return False if the filename has an embedded \0 - at least on
>> Unix.
> Except on the platform in question filenames _don't_ contain an
> embedded \0. What was passed was _not_ a path/filename.
>
> You might as well have passed a floating point number or a dict.
>
>
I think this is a key point. os.path.exists needs to pass a
null-terminated string to the system to ask it about the file. The
question becomes what we should do if we pass it a value that we can't
(or won't) represent as such a string.

The confusion is that in Python, a string with an embedded null is
something pretty much like a string without an embedded null, so the
programmer might not think of it as being the wrong type. Thus we have
several options.

1) we can treat os.path.exists('foo\0bar') the same as
os.path.exists(1.5) and raise the exception.

2) we can treat os.path.exists('foo\0bar') as specifying a file that can
never exist, bypass the system call, and return False.

3) we can process os.path.exists('foo\0bar') by just passing the string
to the system call, making it the same as os.path.exists('foo')

The last is probably the one that we can say is likely wrong, but
arguments could be made for either of the first two.

--
Richard Damon

Chris Angelico

Jun 1, 2018, 9:59:14 AM
On Fri, Jun 1, 2018 at 11:41 PM, Richard Damon <Ric...@damon-family.org> wrote:
> The confusion is that in python, a string with an embedded null is
> something pretty much like a string without an embedded null, so the
> programmer might not think of it as being the wrong type. Thus we have
> several options.
>
> 1) we can treat os.path.exists('foo\0bar') the same as
> os.path.exists(1.5) and raise the exception.

1.5 raises TypeError, which is correct. But the type of "foo\0bar" is
str, which is a perfectly valid type. ValueError is more correct here.
And that's what currently happens.

Possibly more confusing, though, is this:

>>> os.path.exists(1)
True
>>> os.path.exists(2)
True
>>> os.path.exists(3)
False

I think it's testing that the file descriptors exist, because
os.path.exists is defined in terms of os.stat, which can stat a path
or an FD. So os.path.exists(fd) is True if that fd is open, and False
if it isn't. But os.path.exists is not documented as accepting FDs.
Accident of implementation or undocumented feature? Or maybe
accidental feature?
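
A hedged way to reproduce this without relying on which descriptors the interpreter happens to have open is to create a pipe and probe its descriptors before and after closing them (the variable names are only for illustration):

```python
import os

# Open a pipe to obtain two file descriptors we fully control.
read_fd, write_fd = os.pipe()

open_result = os.path.exists(read_fd)    # descriptor is open

os.close(read_fd)
os.close(write_fd)

closed_result = os.path.exists(read_fd)  # descriptor is now closed

print(open_result, closed_result)
```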

> 2) we can treat os.path.exists('foo\0bar') as specifying a file that can
> never exist, bypass the system call, and return False.

That's what's being proposed.

> 3) we can process os.path.exists('foo\0bar') by just passing the string
> to the system call, making it the same as os.path.exists('foo')
>
> The last is probably the one that we can say is likely wrong, but
> arguments could be made for either of the first two.

More than "likely wrong"; it's definitely wrong, and deceptively so. I
don't think anyone would support this case.

ChrisA

Marko Rauhamaa

Jun 1, 2018, 10:18:36 AM
Chris Angelico <ros...@gmail.com>:

> Possibly more confusing, though, is this:
>
>>>> os.path.exists(1)
> True
>>>> os.path.exists(2)
> True
>>>> os.path.exists(3)
> False
>
> I think it's testing that the file descriptors exist, because
> os.path.exists is defined in terms of os.stat, which can stat a path
> or an FD. So os.path.exists(fd) is True if that fd is open, and False
> if it isn't. But os.path.exists is not documented as accepting FDs.
> Accident of implementation or undocumented feature? Or maybe
> accidental feature?

What's more:

>>> os.path.exists(-100)
False
>>> os.path.exists(-1000000000000)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.6/genericpath.py", line 19, in exists
    os.stat(path)
OverflowError: fd is less than minimum

One could argue -100 is less than minimum...

The common denominator is that "\0" and -1000000000000 are caught by
Python's standard library while "" and -100 are caught by the OS.
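
The same split can be observed directly on os.stat(). This sketch assumes a platform where a C int is 32 bits wide, as in the session above; `stat_outcome` is a hypothetical helper:

```python
import os


def stat_outcome(fd):
    """Report which exception, if any, os.stat() raises for an integer."""
    try:
        os.stat(fd)
    except OverflowError:
        # Raised by Python while converting the int to a C file descriptor.
        return "OverflowError"
    except OSError:
        # Raised after the operating system rejected the descriptor.
        return "OSError"
    return "ok"


print(stat_outcome(-100))
print(stat_outcome(-1000000000000))
```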


Marko

Grant Edwards

Jun 1, 2018, 10:41:32 AM
On 2018-06-01, Paul Moore <p.f....@gmail.com> wrote:

> Python allows strings with embedded \0 characters, so it's possible
> to express that question in Python - os.path.exists('a\0b'). What
> can be expressed in terms of the low-level (C-based) operating
> system API shouldn't be relevant.

Python allows floating point numbers, so it is possible to express
this question in Python: os.path.exists(3.14159). Is the fact that
the underlying OS/filesystem can't identify files via a floating point
number relevant? Should it return False or raise ValueError?

--
Grant Edwards               grant.b.edwards        Yow! How do I get HOME?
                                  at
                              gmail.com

Paul Moore

Jun 1, 2018, 11:03:21 AM
On 1 June 2018 at 15:38, Grant Edwards <grant.b...@gmail.com> wrote:
> On 2018-06-01, Paul Moore <p.f....@gmail.com> wrote:
>
>> Python allows strings with embedded \0 characters, so it's possible
>> to express that question in Python - os.path.exists('a\0b'). What
>> can be expressed in terms of the low-level (C-based) operating
>> system API shouldn't be relevant.
>
> Python allows floating point numbers, so it is possible to express
> this question in python: os.path.exists(3.14159). Is the fact that
> the underlying OS/filesystem can't identify files via a floating point
> number relevant? Should it return False or raise ValueError?

I'm not sure if you're asking a serious question here, or trying to
make some sort of point, but os.path.exists is documented as taking a
string, so passing a float should be a TypeError. And it is.

But as I already said, this is a huge amount of effort spent on a
pretty trivial corner case, so I'll duck out of this thread now.

Paul

Richard Damon

Jun 1, 2018, 11:59:00 AM
On 6/1/18 9:58 AM, Chris Angelico wrote:
> On Fri, Jun 1, 2018 at 11:41 PM, Richard Damon <Ric...@damon-family.org> wrote:
>> The confusion is that in python, a string with an embedded null is
>> something pretty much like a string without an embedded null, so the
>> programmer might not think of it as being the wrong type. Thus we have
>> several options.
>>
>> 1) we can treat os.path.exists('foo\0bar') the same as
>> os.path.exists(1.5) and raise the exception.
> 1.5 raises TypeError, which is correct. But the type of "foo\0bar" is
> str, which is a perfectly valid type. ValueError is more correct here.
> And that's what currently happens.
>
> Possibly more confusing, though, is this:
>
>>>> os.path.exists(1)
> True
>>>> os.path.exists(2)
> True
>>>> os.path.exists(3)
> False
>
> I think it's testing that the file descriptors exist, because
> os.path.exists is defined in terms of os.stat, which can stat a path
> or an FD. So os.path.exists(fd) is True if that fd is open, and False
> if it isn't. But os.path.exists is not documented as accepting FDs.
> Accident of implementation or undocumented feature? Or maybe
> accidental feature?
>
>> 2) we can treat os.path.exists('foo\0bar') as specifying a file that can
>> never exist, bypass the system call, and return False.
> That's what's being proposed.
>
>> 3) we can process os.path.exists('foo\0bar') by just passing the string
>> to the system call, making it the same as os.path.exists('foo')
>>
>> The last is probably the one that we can say is likely wrong, but
>> arguments could be made for either of the first two.
> More than "likely wrong"; it's definitely wrong, and deceptively so. I
> don't think anyone would support this case.
>
> ChrisA

I would say that one way to look at it is that os.path.exists
fundamentally (at the OS level) expects a parameter of the 'type' of
either a nul terminated string or a file descriptor (aka fixed width
integer). One issue we have is that these 'types' don't directly map to
Python types.

We can basically make a call to os.path.exists with 4 different kinds of
parameter:

1) The parameter has a totally wrong type, one that just doesn't map to
any of the expected types. This gives a TypeError exception.

2) The parameter has a Python type that maps to the right OS 'type' but
has a value that prevents us from properly converting it to a
corresponding value of that type. This could be an integral value out of
range for the fixed-width type used, or a string which contains an
embedded nul. Currently these generate an OverflowError for an
out-of-range integer and a ValueError for a bad string.

3) The parameter can be mapped to the proper type but the value is
somehow illegal (the number fits the type, but isn't legal for a file
descriptor, or a string has a value that can't represent a real file).
In this case, os.path.exists doesn't try to validate the parameter but
just passes it along and returns a value based on the answer it gets.

4) The parameter represents a legal value of the right type, so as above
we pass the value and get back the answer.

The fundamental question is about case 2. Should os.path.exists, having
been given a value of the right 'Python type' that doesn't match the type
of the operating system parameter, identify this as an error (as it
currently does)? Or should it be changed to decide that, if it could
somehow get that parameter to the OS, the OS would say that the file
doesn't exist, and so return False? I would say that if you accept that,
you should also ask why, when we pass a totally wrong type, we shouldn't
again return False instead of raising a TypeError. After all, if we pass
it a dictionary, there certainly is no file like that in existence.

The real question becomes which method is more useful, which is most apt
to be the one we want, and which one is the better building block for a
program.

One thing to note as an advantage of the current method: it is trivial
with the current decision to write a mypathexists that would accept
strings with embedded nuls and return False. Just call os.path.exists
inside a try, catch the ValueError and return False. You could also
extend it to catch OverflowError and/or TypeError. On the other hand, if
os.path.exists swallows these errors and just returns False, then it is
a lot more work to make a wrapper that throws the errors. You would
basically need to precheck for bad values and throw, and then if you move
to a system that happened to allow nuls in file names (and the Python
code knew that), your wrapper code is now wrong, because you had to build
implementation knowledge into the user code.
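
The two wrappers contrasted above might look roughly like this. `mypathexists` is the name used in the post; `strict_exists` is a hypothetical name for the harder direction, and its pre-check is an assumption about what "bad values" means:

```python
import os


def mypathexists(path):
    """Lenient wrapper: unrepresentable paths simply do not exist."""
    try:
        return os.path.exists(path)
    except ValueError:
        # e.g. "embedded null byte"; add OverflowError or TypeError here
        # to swallow those as well.
        return False


def strict_exists(path):
    """Strict wrapper: needed if os.path.exists() swallowed the error."""
    # This pre-check hard-codes the assumption that NUL is never legal,
    # which is the implementation knowledge the post warns about.
    if isinstance(path, str) and "\0" in path:
        raise ValueError("embedded null character in path")
    return os.path.exists(path)


print(mypathexists("foo\0bar"))
```

The lenient wrapper works whether or not the underlying call raises, while the strict one has to duplicate a platform rule in user code.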

--
Richard Damon

Steven D'Aprano

Jun 1, 2018, 6:13:14 PM
On Fri, 01 Jun 2018 09:41:04 -0400, Richard Damon wrote:

> I think this is a key point. os.path.exists needs to pass a null
> terminated string to the system to ask it about the file.

That's an implementation detail utterly irrelevant to Python programmers.
If the OS expects a pointer to a block of UTF-16 bytes, would you say "oh
well, since Python doesn't have low level pointers, we simply can't
provide this functionality?"

No of course we would not. os.path.exists should take a regular Python
string and adapt it to whatever the implementation requires.


> The question
> comes what should we do if we pass it a value that can't (or we won't)
> represent as such a string.

You get a TypeError:


py> os.path.exists([])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.5/genericpath.py", line 19, in exists
    os.stat(path)
TypeError: argument should be string, bytes or integer, not list



> The confusion is that in python, a string with an embedded null is
> something pretty much like a string without an embedded null, so the
> programmer might not think of it as being the wrong type.

But it *isn't* the wrong type. It is the same type:


py> type("abc") is type("a\0c")
True


So TypeError is out, since the type is right, only the value is wrong.
But ValueError is wrong too:

What does os.path.exists return when given "the wrong value" (i.e. a
string that doesn't match an existing path)? It returns False; it does
not raise an exception.


> Thus we have several options.

Only one of which is consistent with the rest of os.path.exists()'s
behaviour, only one of which is sensible. Return False, like every other
pathname that doesn't exist.

Chris Angelico

Jun 1, 2018, 6:50:54 PM
On Sat, Jun 2, 2018 at 8:10 AM, Steven D'Aprano
<steve+comp....@pearwood.info> wrote:
> But it *isn't* the wrong type. It is the same type:
>
>
> py> type("abc") is type("a\0c")
> True
>
>
> So TypeError is out, since the type is right, only the value is wrong.

I agree with you so far. This parallels the fact that math.asin(5)
doesn't raise TypeError, since an integer is perfectly valid input to
the arcsine function.

> But ValueError is wrong too:
>
> What does os.path.exists return when given "the wrong value" (i.e. a
> string that doesn't match an existing path)? It returns False, not raise
> an exception.

Hang on, how is "a string that doesn't match an existing path" the
wrong value? The point is to ask if a path exists. A path that doesn't
exist is still valid. There are four possibilities:

1) You're asking a question that doesn't even make sense, like
os.path.exists(Ellipsis)

2) You're asking about a path that exists and you can prove that it
does. Return True.

3) You're asking about a path that doesn't exist and you can prove
that it doesn't. Return False.

4) You're asking about a path that you can't be sure about - maybe
there's a permissions error.

With perms errors, it's less clear what's the right thing to do. I
think letting an OSError bubble up would be appropriate here, but
others may disagree. Similarly if you try to access a network mount
that is currently disconnected, a device that has crashed, etc.
Returning False (as in the current implementation) is plausible; I
don't think it's the ideal, but since it would break backward
compatibility to change it now, I'm not advocating making that change.

The real question is whether os.path.exists("a\0c") is more akin to
os.path.exists(Ellipsis) or to os.path.exists("/mnt/broken/spam"). It
is never going to be valid to ask whether Ellipsis exists, and it is
never going to be valid to ask whether a\0c exists. The broken mount
might potentially become valid, and the same path name could change in
meaning if you are chrooted, so it is a sane question to ask.

My ideal preference would be for True to mean "we know for certain
that this exists" and False "we know for certain that this doesn't
exist" (which might be because one of its parent components doesn't
exist - if /spam doesn't exist, then /spam/ham doesn't either, and
that's just straight False). Everything else should raise an
exception. That said, though, I'm not hugely fussed and don't have
actual use-cases here, beyond the back-of-the-mind feeling that saying
"no that doesn't exist" is deceptively confident about something that
isn't actually certain.
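
A rough sketch of that preference, assuming "we know for certain that this doesn't exist" maps to the FileNotFoundError and NotADirectoryError cases. The name `exists_or_raise` is hypothetical and this is not how os.path.exists() behaves:

```python
import os


def exists_or_raise(path):
    """True or False only when the answer is certain; otherwise raise."""
    try:
        os.stat(path)
    except (FileNotFoundError, NotADirectoryError):
        # The path, or one of its parent components, is definitely absent.
        return False
    # PermissionError, other OSErrors and ValueError are deliberately not
    # caught: in those cases existence cannot be confirmed either way.
    return True


print(exists_or_raise(os.curdir))
```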

ChrisA

Gregory Ewing

Jun 1, 2018, 7:09:20 PM
Marko Rauhamaa wrote:
> Interestingly, you get a False even for existing files if you don't have
> permissions to access the file.

Obviously in that case, instead of True or False it should
return FileNotFound. :-)

https://thedailywtf.com/articles/What_Is_Truth_0x3f_

--
Greg

Steven D'Aprano

Jun 1, 2018, 7:19:03 PM
On Fri, 01 Jun 2018 14:38:56 +0000, Grant Edwards wrote:

> On 2018-06-01, Paul Moore <p.f....@gmail.com> wrote:
>
>> Python allows strings with embedded \0 characters, so it's possible to
>> express that question in Python - os.path.exists('a\0b'). What can be
>> expressed in terms of the low-level (C-based) operating system API
>> shouldn't be relevant.
>
> Python allows floating point numbers, so it is possible to express this
> question in python: os.path.exists(3.14159). Is the fact that the
> underlying OS/filesystem can't identify files via a floating point
> number relevant? Should it return False or raise ValueError?

Certainly not a ValueError, that would be silly. The fact that it is an
illegal value is subordinate to the fact that it is the wrong type.

It should either return False, or raise TypeError. Of the two, since
3.14159 cannot represent a file on any known OS, TypeError would be more
appropriate.

But since "\0" is the correct type (a string), and the fact that it
happens to be illegal on POSIX is a platform-dependent detail of no more
importance than the fact that "?" is illegal on Windows, it should be
treated as any other platform-dependent illegal file and return False.

Gregory Ewing

Jun 1, 2018, 7:37:40 PM
Grant Edwards wrote:
> Python allows floating point numbers, so it is possible to express
> this question in python: os.path.exists(3.14159). Is the fact that
> the underlying OS/filesystem can't identify files via a floating point
> number relevant? Should it return False or raise ValueError?

I don't know about that, but it's clear that
os.path.exists(1j) should raise OnlyInYourDreamsError.

--
Greg

Steven D'Aprano

Jun 1, 2018, 7:39:37 PM
On Thu, 31 May 2018 17:43:28 +0000, Grant Edwards wrote:

> Except on the platform in question filenames _don't_ contain an embedded
> \0. What was passed was _not_ a path/filename.

"/wibble/rubbish/nobodyexpectsthespanishinquistion" is not a pathname on
my system either, and os.path.exists() returns False for that. As it is
supposed to.

I'd be willing to bet that:

import secrets # Python 3.6+
s = "/" + secrets.token_hex(1024) + "/spam"

is not a pathname on any computer in the world. (If it is even legal.)
And yet os.path.exists(s) returns False.

The maximum number of file components under POSIX is (I believe) 256. And
yet:

py> os.path.exists("/a"*1000000)
False

"/a" by one million cannot possibly be a path under POSIX.


> the thread will continue for months and generate hundreds of followup.

Only because some people insist on exercising their right to be wrong.

eryk sun

unread,
Jun 1, 2018, 7:46:53 PM6/1/18
to
On Fri, Jun 1, 2018 at 3:58 PM, Richard Damon <Ric...@damon-family.org> wrote:
>
> The fundamental question is about case 2. Should os.path.exists, having
> been given a value of the right 'Python Type' but not matching the type
> of the operating system parameter identify this as an error (as it
> currently does), or should it be changed to decide that if it could
> somehow get that parameter to the os, then it would say that the file
> doesn't exist, and so return false.

AFAIK, this behavior hasn't been documented. So it can either be
documented, and thus never allow NUL in paths, or else every call that
currently raises ValueError for this case should raise a pretend
FileNotFoundError. No change to exists(), isdir(), and isfile() would
be required.

For Windows, there's another case that's in more of a grey area.
Python 3.6+ uses UTF-8 as the file-system encoding in Windows.
Internally it transcodes between UTF-8 and the native UTF-16 encoding.
The "surrogatepass" error handler is used in order to faithfully
handle invalid surrogates, which the system allows. This leaves no
simple way to smuggle invalid UTF-8 sequences into the filename and
round-trip back to bytes, so UnicodeDecodeError (a subclass of
ValueError) is raised. The same invalid UTF-8 would pass silently in
POSIX, which uses bytes paths and the "surrogateescape" handler.
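The difference between the two error handlers can be sketched without touching the file system at all. The snippet below is only an illustration that calls the UTF-8 codec directly; the byte string `raw` and the other variable names are made up for the example, and it does not reproduce what the interpreter does internally on either platform.

```python
# Illustration only: exercise the two error handlers on the UTF-8 codec
# directly, without involving any real file-system call.
raw = b"caf\xff"  # 0xFF can never occur in valid UTF-8

# "surrogateescape" (the POSIX-side handler): the bad byte is smuggled
# into the str as a lone surrogate and comes back out unchanged.
escaped = raw.decode("utf-8", "surrogateescape")
round_tripped = escaped.encode("utf-8", "surrogateescape")

# "surrogatepass" (the Windows-side handler): lone surrogates themselves
# can be encoded and decoded again...
lone = "\ud800"
lone_bytes = lone.encode("utf-8", "surrogatepass")
lone_back = lone_bytes.decode("utf-8", "surrogatepass")

# ...but arbitrary invalid bytes have no representation, so decoding fails.
caught = None
try:
    raw.decode("utf-8", "surrogatepass")
except UnicodeDecodeError as exc:
    caught = exc  # UnicodeDecodeError is a subclass of ValueError
```

So a handler that catches ValueError around a path call would see both the NUL case and this decoding case.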

Trivia:
The native NT API of Windows can use device names that contain NUL
characters because it uses counted strings in the OBJECT_ATTRIBUTES
record that's used to access named objects (e.g. Device, Section, Job,
Event, Semaphore, etc). I've tested that this works. A file system
could also allow NUL in names, but Microsoft's drivers reserve NUL as
an invalid character, as would any driver that uses the file-system
runtime library. That said, native NT applications have a limited
scope, so it's almost pointless to speculate.

Steven D'Aprano

unread,
Jun 1, 2018, 7:51:17 PM6/1/18
to
On Fri, 01 Jun 2018 11:58:42 -0400, Richard Damon wrote:

> I would say that one way to look at it is that os.path.exists
> fundamentally (at the OS level) expects a parameter of the 'type' of
> either a nul terminated string or a file descriptor (aka fixed width
> integer). One issue we have is that these 'types' don't directly map to
> Python types.


What a strange and unhelpful way to look at it. When calling
os.path.exists() from Python, are you interacting directly with the OS in
low-level C or assembly, or using a high-level language like Python?

That's not a rhetorical question. I would like to know what you think we
are doing when we type into the Python interpreter

import os
os.path.exists("pathname")

Well, actually, no, I lie. Of course it's a rhetorical question: I'm sure you
know full well that you're using Python, a high-level language.


> We can basically make a call to os.path.exists with 4 different types of
> parameter:
>
> 1) The parameter has a totally wrong type that just doesn't map
> to one of the expected types. This gives a TypeError exception.
>
> 2) The parameter has a Python type that maps to the right OS 'type' but has
> a value that prevents us from properly converting it to a corresponding
> value of that type. This could be an integral value out of range for the
> fixed width type used,

That is a reasonable case for either ValueError or OverflowError,
OverflowError being a more specific (and therefore useful) exception to
raise. Being a low-level detail, file descriptors are inherently limited
to a fixed width and it truly is an error to supply something outside of
that width.

> or a string which contains an embedded nul.

The fact that this is illegal under POSIX is an irrelevant platform-
dependent detail. Irrelevant in the sense that it shouldn't change the
API of the function. "<" doesn't raise ValueError on Windows, and ""
doesn't raise ValueError on any platform. Why should POSIX nulls be
treated as special?


> Currently these generate an OverflowError for out of range integer and a
> ValueError for a bad string

Incorrect: as Paul and MRAB (and myself) have pointed out, bad strings
return False, with the surprising exception of null-embedded strings
under POSIX.

Illegal strings like "<" on Windows return False. Excessively long strings,
or strings with too many path components, return False. The empty string
returns False.

There is no good reason to raise ValueError on strings with null in them.


> 3) The parameter can be mapped to the proper type but the value is
> somehow illegal (the number fits the type, but isn't legal for a file
> descriptor, or a string has a value that can't represent a real file).
> In this case, os.path.exists doesn't try to validate the parameter but
> just passes it along and returns a value based on the answer it gets.

Right. And if the answer it gets is "illegal value", it returns False.



> 4) The parameter represents a legal value of a right type, so as above
> we pass the value and get back the answer.
>
> The fundamental question is about case 2. Should os.path.exists, having
> been given a value of the right 'Python Type' but not matching the type
> of the operating system parameter identify this as an error (as it
> currently does),

That's wrong, that is not what it does, except in the surprising case of
nulls under POSIX.


> or should it be changed to decide that if it could
> somehow get that parameter to the os, then it would say that the file
> doesn't exist, and so return false.

Of course.


> I would say that if you accept that,
> should we also say that if we pass a totally wrong type, why shouldn't
> we again return false instead of a TypeError, after all, if we pass it a
> dictionary, there certainly is no file like that in existence,

Because os.path.exists is documented as accepting strings, bytes and
ints, and everything else is a TypeError.
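A small sketch of that split between wrong types and merely odd values. The helper name `classify` is made up for the example; it only reports what os.path.exists() does with each argument.

```python
import os

def classify(arg):
    """Return the bool from os.path.exists(), or the name of the exception it raised."""
    try:
        return os.path.exists(arg)
    except (TypeError, ValueError) as exc:
        return type(exc).__name__

# The empty string is the right type, so it yields a plain bool;
# a list, a float and a dict are the wrong type altogether.
results = {repr(arg): classify(arg) for arg in ("", [], 3.14159, {})}
```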

If you want to make a case for relaxing the type restrictions on
os.path.exists, go right ahead, but I won't be supporting you.

Steven D'Aprano

unread,
Jun 1, 2018, 7:56:47 PM6/1/18
to
On Sat, 02 Jun 2018 08:50:38 +1000, Chris Angelico wrote:

> My ideal preference would be for True to mean "we know for certain that
> this exists" and False "we know for certain that this doesn't exist"

We cannot make that promise, because we might not have permission to view
the file. Since we don't have a three-state True/False/Maybe flag,
os.path.exists() just returns False.
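For anyone who does want the three-state answer, a minimal sketch built on os.stat() might look like the following. The function name `exists_tristate` and the choice of None for "cannot tell" are assumptions made for the example, not anything in the standard library.

```python
import os

def exists_tristate(path):
    """Return True, False, or None when permissions prevent an answer."""
    try:
        os.stat(path)
    except (FileNotFoundError, NotADirectoryError):
        return False   # the OS says there is no such entry
    except PermissionError:
        return None    # we are not allowed to look, so we cannot tell
    return True        # stat() succeeded, so the entry exists
```

Other OSError subclasses, and ValueError for an embedded null, are deliberately left to propagate in this sketch.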

> (which might be because one of its parent components doesn't exist - if
> /spam doesn't exist, then /spam/ham doesn't either, and that's just
> straight False).

And if the path is an illegal value, it also straight doesn't exist. Like
the empty string, like "<" on Windows, like strings with too many path
components.

Chris Angelico

unread,
Jun 1, 2018, 7:57:13 PM6/1/18
to
On Sat, Jun 2, 2018 at 9:37 AM, Steven D'Aprano
<steve+comp....@pearwood.info> wrote:
> On Thu, 31 May 2018 17:43:28 +0000, Grant Edwards wrote:
>
>> Except on the platform in question filenames _don't_ contain an embedded
>> \0. What was passed was _not_ a path/filename.
>
> "/wibble/rubbish/nobodyexpectsthespanishinquistion" is not a pathname on
> my system either, and os.path.exists() returns False for that. As it is
> supposed to.
>
> I'd be willing to bet that:
>
> import secrets # Python 3.6+
> s = "/" + secrets.token_hex(1024) + "/spam"
>
> is not a pathname on any computer in the world. (If it is even legal.)
> And yet os.path.exists(s) returns False.

With both of these, the path cannot exist because its first component
does not exist. Absent a /wibble on your system, the entire long path
is unable to exist. That is a natural consequence of the hierarchical
structure of file systems. I'm fairly sure 2KB of path is valid on all
major OSes today, which means that it's exactly the same as /wibble -
the first component doesn't exist, ergo the path doesn't exist.

> The maximum number of file components under POSIX is (I believe) 256. And
> yet:
>
> py> os.path.exists("/a"*1000000)
> False
>
> "/a" by one million cannot possibly be a path under POSIX.

I can't actually find that listed anywhere. Citation needed. But
assuming you're right, POSIX is still a set of minimum requirements -
not maximums, to my knowledge. If some operating system permits longer
paths with more components, it won't be non-compliant on that basis.
So it's still plausible to ask "does this path exist", and it's
perfectly correct to look at the first "/a/" and check if there's
anything named "a" in your root directory, and return False upon
finding none. The question is sane, unlike os.path.exists([]).

ChrisA

Chris Angelico

unread,
Jun 1, 2018, 8:05:37 PM6/1/18
to
On Sat, Jun 2, 2018 at 9:54 AM, Steven D'Aprano
<steve+comp....@pearwood.info> wrote:
> On Sat, 02 Jun 2018 08:50:38 +1000, Chris Angelico wrote:
>
>> My ideal preference would be for True to mean "we know for certain that
>> this exists" and False "we know for certain that this doesn't exist"
>
> We cannot make that promise, because we might not have permission to view
> the file. Since we don't have a three-state True/False/Maybe flag,
> os.path.exists() just returns False.

Being unable to view the file is insignificant, but presumably you
mean we might not have permission to view the containing directory.
And we do get that information:

>>> os.stat("/root/.bashrc")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
PermissionError: [Errno 13] Permission denied: '/root/.bashrc'
>>> os.stat("/var/.bashrc")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
FileNotFoundError: [Errno 2] No such file or directory: '/var/.bashrc'

The permissions error is reported differently by stat, but then
exists() just says "oh that's an OS error so we're going to say it
doesn't exist". If you truly want the reliable form of os.path.exists,
it would be this:

    def exists(path):
        try:
            os.stat(path)
            return True
        except FileNotFoundError:
            return False

Anything that ISN'T "this file exists" or "this file doesn't exist"
will be signalled with an exception.

ChrisA

Steven D'Aprano

unread,
Jun 1, 2018, 8:17:15 PM6/1/18
to
On Sat, 02 Jun 2018 09:56:58 +1000, Chris Angelico wrote:

> On Sat, Jun 2, 2018 at 9:37 AM, Steven D'Aprano
> <steve+comp....@pearwood.info> wrote:
>> On Thu, 31 May 2018 17:43:28 +0000, Grant Edwards wrote:
>>
>>> Except on the platform in question filenames _don't_ contain an
>>> embedded \0. What was passed was _not_ a path/filename.
>>
>> "/wibble/rubbish/nobodyexpectsthespanishinquistion" is not a pathname
>> on my system either, and os.path.exists() returns False for that. As it
>> is supposed to.
>>
>> I'd be willing to bet that:
>>
>> import secrets # Python 3.6+
>> s = "/" + secrets.token_hex(1024) + "/spam"
>>
>> is not a pathname on any computer in the world. (If it is even legal.)
>> And yet os.path.exists(s) returns False.
>
> With both of these, the path cannot exist because its first component
> does not exist.

Since /wibble doesn't exist, neither does /wibble/a\0b


py> os.path.exists("/wibble")
False
py> os.path.exists("/wibble/a\0b")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/storage/torrents/torrents/python/Python-3.6.4/Lib/genericpath.py", line 19, in exists
os.stat(path)
ValueError: embedded null byte


Oops.


> Absent a /wibble on your system, the entire long path is
> unable to exist. That is a natural consequence of the hierarchical
> structure of file systems. I'm fairly sure 2KB of path is valid on all
> major OSes today,

But probably not 2K in a single path component.

But that's not really my point: I was responding to Grant, who claimed
that \0 is not a pathname (or filename) and therefore ValueError is the
correct response. But there are lots of things which aren't pathnames, or
even which *cannot be* pathnames, and yet they return False. What makes
\0 so special?


> which means that it's exactly the same as /wibble -
> the first component doesn't exist, ergo the path doesn't exist.
>
>> The maximum number of file components under POSIX is (I believe) 256.
>> And yet:
>>
>> py> os.path.exists("/a"*1000000)
>> False
>>
>> "/a" by one million cannot possibly be a path under POSIX.
>
> I can't actually find that listed anywhere. Citation needed.

https://eklitzke.org/path-max-is-tricky



> But
> assuming you're right, POSIX is still a set of minimum requirements -
> not maximums, to my knowledge.

It isn't even a set of minimum requirements. "<" is legal under POSIX,
but not Windows.



> If some operating system permits longer
> paths with more components, it won't be non-compliant on that basis. So
> it's still plausible to ask "does this path exist", and it's perfectly
> correct to look at the first "/a/" and check if there's anything named
> "a" in your root directory, and return False upon finding none. The
> question is sane, unlike os.path.exists([]).

Correct.

Just as it is sane to ask if path "a\0b" exists. If it happens to be
illegal on POSIX, just as "<" is illegal under Windows, it is still sane
to ask, and you should get False returned.

Chris Angelico

unread,
Jun 1, 2018, 8:33:11 PM6/1/18
to
On Sat, Jun 2, 2018 at 10:14 AM, Steven D'Aprano
<steve+comp....@pearwood.info> wrote:
>> But
>> assuming you're right, POSIX is still a set of minimum requirements -
>> not maximums, to my knowledge.
>
> It isn't even a set of minimum requirements. "<" is legal under POSIX,
> but not Windows.

Windows isn't POSIX compliant.

Anyhow, I've come to the conclusion that we're all about equally wrong
here, so I'm going to just stop arguing. Anyone who wants the
behaviour I described can get it easily enough via os.stat; and if you
want ValueError to turn into False, that's also easy enough.
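The second option is a two-line wrapper. This is only a sketch, and the name `exists_or_false` is made up for the example:

```python
import os

def exists_or_false(path):
    """Like os.path.exists(), but treat an unrepresentable path as nonexistent."""
    try:
        return os.path.exists(path)
    except ValueError:
        # e.g. "embedded null byte" on interpreters that raise for it
        return False
```

TypeError is left alone here, so passing a list or a dict still fails loudly.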

ChrisA

eryk sun

unread,
Jun 1, 2018, 8:49:18 PM6/1/18
to
On Sat, Jun 2, 2018 at 12:14 AM, Steven D'Aprano
<steve+comp....@pearwood.info> wrote:
>
> It isn't even a set of minimum requirements. "<" is legal under POSIX,
> but not Windows.

"<" (i.e. the DOS_STAR wildcard character) is valid in device and
stream names. It's only invalid in filenames, since it's reserved for
wildcard matching in file-system calls.

For example, in the following we have a device named
'<DEVICENAME|"*?/>:' (note the name contains a forward slash) and a
stream named '<STREAMNAME|"*?>'. The stream component requires NTFS,
ReFS, or CDFS; FAT doesn't support it. Stream names also allow ASCII
control characters, except NUL.

>>> DefineDosDevice(0, '<DEVICENAME|"*?/>:', 'C:\\Temp')
>>> f = open(r'\\?\<DEVICENAME|"*?/>:\FILENAME.TXT:<STREAMNAME|"*?>', 'w')
>>> os.path.exists(r'\\?\<DEVICENAME|"*?/>:\FILENAME.TXT:<STREAMNAME|"*?>')
True

Grant Edwards

unread,
Jun 1, 2018, 9:53:38 PM6/1/18
to
On 2018-06-01, Steven D'Aprano <steve+comp....@pearwood.info> wrote:

> But since "\0" is the correct type (a string), and the fact that it
> happens to be illegal on POSIX is a platform-dependent detail of no more
> importance than the fact that "?" is illegal on Windows, it should be
> treated as any other platform-dependent illegal file and return False.

That sounds reasonable.

What about the case where somebody calls

os.path.exists("/tmp/foo\x00bar")

If /tmp/foo exists should it return True? That's what would happen if
you passed that string directly to the libc call.

--
Grant



Grant Edwards

unread,
Jun 1, 2018, 9:57:32 PM6/1/18
to
On 2018-06-01, Steven D'Aprano <steve+comp....@pearwood.info> wrote:
> On Thu, 31 May 2018 17:43:28 +0000, Grant Edwards wrote:
>
>> Except on the platform in question filenames _don't_ contain an embedded
>> \0. What was passed was _not_ a path/filename.
>
> "/wibble/rubbish/nobodyexpectsthespanishinquistion" is not a pathname on
> my system either,

I disagree. On Unix systems that _is_ a path. There may or may not
be a file that exists with that path. OTOH "/wibble\x00/whatever" is
not a Unix path.

--
Grant




Serhiy Storchaka

unread,
Jun 2, 2018, 12:54:43 AM6/2/18
to
01.06.18 16:58, Chris Angelico wrote:
> Possibly more confusing, though, is this:
>
>>>> os.path.exists(1)
> True
>>>> os.path.exists(2)
> True
>>>> os.path.exists(3)
> False
>
> I think it's testing that the file descriptors exist, because
> os.path.exists is defined in terms of os.stat, which can stat a path
> or an FD. So os.path.exists(fd) is True if that fd is open, and False
> if it isn't. But os.path.exists is not documented as accepting FDs.
> Accident of implementation or undocumented feature? Or maybe
> accidental feature?

Accident of implementation. In Python 3.3, os.stat() started accepting
integer file descriptors (like os.fstat()). os.path.exists() just passes
its argument to os.stat().
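The effect is easy to observe with a descriptor whose lifetime you control. A minimal sketch using a throwaway pipe (the variable names are made up for the example):

```python
import os

# Create a pipe so that we own two descriptors we can safely close.
read_fd, write_fd = os.pipe()

open_result = os.path.exists(read_fd)    # descriptor is open: stat succeeds

os.close(read_fd)
os.close(write_fd)

closed_result = os.path.exists(read_fd)  # descriptor is closed: stat fails with OSError
```

In a multi-threaded program the number could be reused by another open() between the close and the second check, so treat this as a demonstration rather than a reliable test.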

Serhiy Storchaka

unread,
Jun 2, 2018, 1:04:40 AM6/2/18
to
02.06.18 03:05, Chris Angelico wrote:
> The permissions error is reported differently by stat, but then
> exists() just says "oh that's an OS error so we're going to say it
> doesn't exist". If you truly want the reliable form of os.path.exists,
> it would be this:
>
> def exists(path):
>     try:
>         os.stat(path)
>         return True
>     except FileNotFoundError:
>         return False
>
> Anything that ISN'T "this file exists" or "this file doesn't exist"
> will be signalled with an exception.

And this is how pathlib.Path.exists() was implemented. But for
os.path.exists() we are limited by backward compatibility.

Steven D'Aprano

unread,
Jun 2, 2018, 6:29:47 AM6/2/18
to
On Sat, 02 Jun 2018 10:32:55 +1000, Chris Angelico wrote:

> On Sat, Jun 2, 2018 at 10:14 AM, Steven D'Aprano
> <steve+comp....@pearwood.info> wrote:
>>> But
>>> assuming you're right, POSIX is still a set of minimum requirements -
>>> not maximums, to my knowledge.
>>
>> It isn't even a set of minimum requirements. "<" is legal under POSIX,
>> but not Windows.
>
> Windows isn't POSIX compliant.

Technically, Windows is POSIX compliant. You have to turn off a bunch of
features, turn on another bunch of features, and what you get is the bare
minimum POSIX compliance possible, but it's enough to tick the check box
for POSIX compliance.

But what of it? POSIX is not a minimum set of requirements for Python.
POSIX is a set of standards that describes how Linux/Unix/MacOS systems
are expected to behave. Adhering to the POSIX standard isn't a
requirement for Python.


> Anyhow, I've come to the conclusion that we're all about equally wrong
> here

To paraphrase Isaac Asimov, "People who say the earth is flat are wrong,
and people who say the earth is a sphere are wrong, but if you say that
those two groups of people are equally wrong, you are more wrong than
both of them put together."

Chris Angelico

unread,
Jun 2, 2018, 6:59:07 AM6/2/18
to
On Sat, Jun 2, 2018 at 8:27 PM, Steven D'Aprano
<steve+comp....@pearwood.info> wrote:
> On Sat, 02 Jun 2018 10:32:55 +1000, Chris Angelico wrote:
>
>> On Sat, Jun 2, 2018 at 10:14 AM, Steven D'Aprano
>> <steve+comp....@pearwood.info> wrote:
>>>> But
>>>> assuming you're right, POSIX is still a set of minimum requirements -
>>>> not maximums, to my knowledge.
>>>
>>> It isn't even a set of minimum requirements. "<" is legal under POSIX,
>>> but not Windows.
>>
>> Windows isn't POSIX compliant.
>
> Technically, Windows is POSIX compliant. You have to turn off a bunch of
> features, turn on another bunch of features, and what you get is the bare
> minimum POSIX compliance possible, but it's enough to tick the check box
> for POSIX compliance.

Really? I didn't know that Windows path names were POSIX compliant. Or
do you have to use the Cygwin fudge to count Windows as POSIX? And
what about POSIX signal handling?

Citation needed, big-time.

ChrisA

Steven D'Aprano

unread,
Jun 2, 2018, 7:12:33 AM6/2/18
to
On Sat, 02 Jun 2018 01:51:07 +0000, Grant Edwards wrote:

> What about the case where somebody calls
>
> os.path.exists("/tmp/foo\x00bar")
>
> If /tmp/foo exists should it return True?

That depends on whether /tmp/foo is a directory containing a file \0bar
or not. Since that is not a legal file name on POSIX systems, it should
return False. On Windows, I don't know whether such a file could exist or
not.

> That's what would happen if
> you passed that string directly to the libc call.

Fortunately, as Python programmers, we're not passing the string directly
to the libc call. We're passing a Python string to a Python function.

Nor are we receiving the result directly back from the libc call, since
it neither returns bool objects, nor raises exceptions on error.

Steven D'Aprano

unread,
Jun 2, 2018, 7:17:32 AM6/2/18
to
On Sat, 02 Jun 2018 20:58:43 +1000, Chris Angelico wrote:

>>> Windows isn't POSIX compliant.
>>
>> Technically, Windows is POSIX compliant. You have to turn off a bunch
>> of features, turn on another bunch of features, and what you get is the
>> bare minimum POSIX compliance possible, but it's enough to tick the
>> check box for POSIX compliance.
>
> Really? I didn't know that Windows path names were POSIX compliant. Or
> do you have to use the Cygwin fudge to count Windows as POSIX? And what
> about POSIX signal handling?
>
> Citation needed, big-time.

https://en.wikipedia.org/wiki/Microsoft_POSIX_subsystem

https://technet.microsoft.com/en-us/library/bb463220.aspx

https://brianreiter.org/2010/08/24/the-sad-history-of-the-microsoft-posix-subsystem/

Chris Angelico

unread,
Jun 2, 2018, 7:28:58 AM6/2/18
to
On Sat, Jun 2, 2018 at 9:13 PM, Steven D'Aprano
<steve+comp....@pearwood.info> wrote:
> On Sat, 02 Jun 2018 20:58:43 +1000, Chris Angelico wrote:
>
>>>> Windows isn't POSIX compliant.
>>>
>>> Technically, Windows is POSIX compliant. You have to turn off a bunch
>>> of features, turn on another bunch of features, and what you get is the
>>> bare minimum POSIX compliance possible, but it's enough to tick the
>>> check box for POSIX compliance.
>>
>> Really? I didn't know that Windows path names were POSIX compliant. Or
>> do you have to use the Cygwin fudge to count Windows as POSIX? And what
>> about POSIX signal handling?
>>
>> Citation needed, big-time.
>
> https://en.wikipedia.org/wiki/Microsoft_POSIX_subsystem
>
> https://technet.microsoft.com/en-us/library/bb463220.aspx
>
> https://brianreiter.org/2010/08/24/the-sad-history-of-the-microsoft-posix-subsystem/

Can someone confirm whether or not all the listed signals are actually
supported? We know that Ctrl-C maps to the internal Windows interrupt
handler, and "kill process" maps to the internal Windows "terminate",
but can you send a different process all the different signals and
handle them differently?

I also can't find anything about path names there. What does POSIX say
about the concept of relative paths? Does Windows comply with that?

"Windows has some features which are compatible with the equivalent
POSIX features" is not the same as "Technically, Windows is POSIX
compliant".

ChrisA

Paul Moore

unread,
Jun 2, 2018, 7:42:12 AM6/2/18
to
My apologies, I don't have time to hunt out complete references now,
but my recollection is that Windows (the OS) is POSIX compliant (as
noted, with certain configurations, etc). However, the Win32 API
(which is what most people think of when they say "Windows") is not
POSIX compatible. As an example, Windows (the kernel) has the
capability to implement fork(), but this isn't exposed via the Win32
API. To implement fork() you need to go to the raw kernel layer. Which
is basically what the Windows Linux subsystem (bash on Windows 10)
does - it's a user-level implementation of the POSIX API using Win32
kernel calls.

Paul

Peter J. Holzer

unread,
Jun 2, 2018, 11:21:17 AM6/2/18
to
On 2018-06-02 01:51:07 +0000, Grant Edwards wrote:
> On 2018-06-01, Steven D'Aprano <steve+comp....@pearwood.info> wrote:
> > But since "\0" is the correct type (a string), and the fact that it
> > happens to be illegal on POSIX is a platform-dependent detail of no more
> > importance than the fact that "?" is illegal on Windows, it should be
> > treated as any other platform-dependent illegal file and return False.
>
> That sounds reasonable.
>
> What about the case where somebody calls
>
> os.path.exists("/tmp/foo\x00bar")
>
> If /tmp/foo exists should it return True?

No.

> That's what would happen if you passed that string directly to the
> libc call.

You can't really pass that string directly to the libc call. The libc
calling convention uses \0 as the string delimiter, so you are really
passing "/tmp/foo" to the libc (there happen to be more bytes after the
terminator, but they are not part of the string as far as libc is
concerned), which is not the same as "/tmp/foo\x00bar".

So that has to be handled by Python before calling libc.
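The truncation can be seen from Python with ctypes, which exposes both views of the same buffer. This is just an illustration of the C string convention, not of what os.stat() does internally:

```python
import ctypes

# A C buffer holding the bytes of the Python string, plus the usual
# terminating NUL that create_string_buffer() appends.
buf = ctypes.create_string_buffer(b"/tmp/foo\x00bar")

c_view = buf.value  # what a C function sees: bytes up to the first NUL
whole = buf.raw     # the full buffer, including everything after that NUL
```

Here c_view is b"/tmp/foo", while whole still contains the "bar" part that libc would never look at.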

hp

--
_ | Peter J. Holzer | we build much bigger, better disasters now
|_|_) | | because we have much more sophisticated
| | | h...@hjp.at | management tools.
__/ | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>

Tim Chase

unread,
Jun 2, 2018, 12:05:58 PM6/2/18
to
On 2018-06-02 00:14, Steven D'Aprano wrote:
> Since /wibble doesn't exist, neither does /wibble/a\0b
>
>
> py> os.path.exists("/wibble")
> False
> py> os.path.exists("/wibble/a\0b")
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> File "/storage/torrents/torrents/python/Python-3.6.4/Lib/genericpath.py", line 19, in exists
> os.stat(path)
> ValueError: embedded null byte
>
> Oops.

Existence is a sketchy sort of thing. For example, sometimes the OS
hides certain directory entries from some syscalls while allowing
visibility from others. The following comes from the FreeBSD system
I have at hand:

>>> import os
>>> '.bashrc' in os.listdir() # yes, it sees hidden dot-files
True
>>> '.zfs' in os.listdir() # but the OS hides .zfs/ from listings
False
>>> os.path.exists('.zfs') # yet it exists
True
>>> os.chdir('.zfs') # and you can chdir into it
>>> os.listdir() # but you can't listdir in it
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
OSError: [Errno 22] Invalid argument

Dan Stromberg

unread,
Jun 2, 2018, 2:29:37 PM6/2/18
to
On Sat, Jun 2, 2018 at 4:28 AM, Chris Angelico <ros...@gmail.com> wrote:

> On Sat, Jun 2, 2018 at 9:13 PM, Steven D'Aprano
> <steve+comp....@pearwood.info> wrote:
> > On Sat, 02 Jun 2018 20:58:43 +1000, Chris Angelico wrote:
> Can someone confirm whether or not all the listed signals are actually
> supported? We know that Ctrl-C maps to the internal Windows interrupt
> handler, and "kill process" maps to the internal Windows "terminate",
> but can you send a different process all the different signals and
> handle them differently?
>
> I also can't find anything about path names there. What does POSIX say
> about the concept of relative paths? Does Windows comply with that?
>
> "Windows has some features which are compatible with the equivalent
> POSIX features" is not the same as "Technically, Windows is POSIX
> compliant".
>

The way I heard it, some (US government?) contracts required POSIX
compliance, so Microsoft added just enough of a mostly-useless POSIX layer
to Windows to be able to win the contracts, without actually being a useful
POSIX system.

Gregory Ewing

unread,
Jun 2, 2018, 7:47:55 PM6/2/18
to
Paul Moore wrote:
> Windows (the kernel) has the
> capability to implement fork(), but this isn't exposed via the Win32
> API. To implement fork() you need to go to the raw kernel layer. Which
> is basically what the Windows Linux subsystem (bash on Windows 10)
> does

What people usually mean by "POSIX compliant" is not "it's
possible to implement the POSIX API on top of it". By that
definition, a raw PC without any software is POSIX compliant.

--
Greg

Richard Damon

unread,
Jun 2, 2018, 8:06:49 PM6/2/18
to
But it isn't just that it is possible, Microsoft provides that layer, it
just isn't the normal API they suggest using and needs to be explicitly
enabled.

--
Richard Damon

Steven D'Aprano

unread,
Jun 2, 2018, 8:12:32 PM6/2/18
to
On Sat, 02 Jun 2018 21:28:41 +1000, Chris Angelico wrote:

> On Sat, Jun 2, 2018 at 9:13 PM, Steven D'Aprano
> <steve+comp....@pearwood.info> wrote:
>> On Sat, 02 Jun 2018 20:58:43 +1000, Chris Angelico wrote:
>>
>>>>> Windows isn't POSIX compliant.
>>>>
>>>> Technically, Windows is POSIX compliant. You have to turn off a bunch
>>>> of features, turn on another bunch of features, and what you get is
>>>> the bare minimum POSIX compliance possible, but it's enough to tick
>>>> the check box for POSIX compliance.
>>>
>>> Really? I didn't know that Windows path names were POSIX compliant. Or
>>> do you have to use the Cygwin fudge to count Windows as POSIX? And
>>> what about POSIX signal handling?
>>>
>>> Citation needed, big-time.
>>
>> https://en.wikipedia.org/wiki/Microsoft_POSIX_subsystem
>>
>> https://technet.microsoft.com/en-us/library/bb463220.aspx
>>
>> https://brianreiter.org/2010/08/24/the-sad-history-of-the-microsoft-posix-subsystem/
>
> Can someone confirm whether or not all the listed signals are actually
> supported?

Unless people do their testing under Windows with the POSIX subsystem
installed, such testing is likely to fail.


> We know that Ctrl-C maps to the internal Windows interrupt
> handler, and "kill process" maps to the internal Windows "terminate",
> but can you send a different process all the different signals and
> handle them differently?
>
> I also can't find anything about path names there. What does POSIX say
> about the concept of relative paths? Does Windows comply with that?

That's a curious question to ask. If you don't know what POSIX says about
a feature, why would you question whether Windows complies with it?


> "Windows has some features which are compatible with the equivalent
> POSIX features" is not the same as "Technically, Windows is POSIX
> compliant".

Chris, you seem to be labouring under the misapprehension that the claim
is that a stock standard Windows <whatever version> installation complies
with the latest version of the POSIX standard.

That's not the case.

The claim (and fact) is that Windows NT with the POSIX subsystem
installed (and possibly other changes made) is technically compliant with
version 1 of the POSIX standard, which even in 1997 was only a fraction
of what most Unix systems provided.

https://en.wikipedia.org/wiki/POSIX#Versions

Just enough to allow bean counters to tick the box that says "POSIX
compliant" in a government requirements form, provided the technical
people involved either don't get a say, or do get a say and actually want
Windows but have to satisfy some bureaucratic requirement for POSIX.

Nobody thinks that standard Windows counts as a Unix.

Steven D'Aprano

unread,
Jun 2, 2018, 8:35:53 PM6/2/18
to
On Sun, 03 Jun 2018 11:47:40 +1200, Gregory Ewing wrote:

> Paul Moore wrote:
>> Windows (the kernel) has the
>> capability to implement fork(), but this isn't exposed via the Win32
>> API. To implement fork() you need to go to the raw kernel layer. Which
>> is basically what the Windows Linux subsystem (bash on Windows 10) does
>
> What people usually mean by "POSIX compliant" is not "it's possible to
> implement the POSIX API on top of it".

What people usually mean by "POSIX compliant" is "Unix or Linux".

But that's not what the POSIX standard requires. It requires a set of
APIs. I doubt it cares where or how they are implemented.



> By that definition, a raw PC without any software is POSIX compliant.

Do you really mean to say that a computer that won't boot is POSIX
compliant? Yeah, good luck getting that one past the user acceptance
testing.

Chris Angelico

unread,
Jun 2, 2018, 8:38:47 PM6/2/18
to
Let's just rewind this subthread a little bit. YOU said that the
behaviour of os.path.exists on Unix systems should be "return False
for invalid things" on the basis that the Windows invalid paths return
False. Remember? Or are you just so het up about arguing this point
that you don't care why you're arguing? I said that Windows isn't
POSIX, and pointed out just a couple of ways in which, to a
programmer, Windows behaves very differently to POSIX-compliant
systems. The two examples I gave were signals and relative paths. Now,
if you want to tell me that we can completely ignore drive letters on
Windows, then sure. Go ahead. Tell me that relative paths behave
sanely on Windows just as long as you have only a single drive. Or
tell me that the POSIX standard permits three different types of
relative path. And with signals, can you show me that a process can
send another process a variety of different signals, and that the
receiving process can handle them differently?

Claiming that Windows technically ticks some box is utterly meaningless to that.

ChrisA

Gregory Ewing

Jun 2, 2018, 10:25:32 PM
Steven D'Aprano wrote:
> Do you really mean to say that a computer that won't boot is POSIX
> compliant?

No, I was pointing out the absurdity of saying that the Windows
kernel layer is POSIX compliant, which is what the post I was
replying to seemed to be saying.

--
Greg

Steven D'Aprano

Jun 3, 2018, 1:51:04 AM
On Sun, 03 Jun 2018 10:38:34 +1000, Chris Angelico wrote:

> Let's just rewind this subthread a little bit. YOU said that the
> behaviour of os.path.exists on Unix systems should be "return False for
> invalid things" on the basis that the Windows invalid paths return
> False. Remember?

No, invalid paths on Linux return False too:

py> os.path.exists("")
False


I can make a VFAT partition under Linux:

[steve@ando ~]$ dd if=/dev/zero of=fat.fs bs=1024 count=48
48+0 records in
48+0 records out
49152 bytes (49 kB) copied, 0.0149677 seconds, 3.3 MB/s
[steve@ando ~]$ /sbin/mkfs.vfat fat.fs
mkfs.vfat 2.11 (12 Mar 2005)
[steve@ando ~]$ mkdir dos
[steve@ando ~]$ sudo mount -o loop fat.fs ./dos
[sudo] password for steve:


I can write to it (as root), but not all file names are valid:

[steve@ando ~]$ sudo touch ./dos/foo
[steve@ando ~]$ sudo touch ./dos/"foo?"
touch: setting times of `./dos/foo?': No such file or directory
[steve@ando ~]$ ls ./dos
foo

And even though I'm using Linux, I get the right answer, legal file name
or not legal file name.


[steve@ando ~]$ python3.5 -c "import os; \
> print(os.path.exists('./dos/foo'))"
True
[steve@ando ~]$ python3.5 -c "import os; \
> print(os.path.exists('./dos/foo?'))"
False
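The same behaviour can be sketched without mounting a VFAT image. The snippet below is only an illustration: the candidate names (including the overlong one, chosen to provoke an OS-level error such as ENAMETOOLONG) are assumptions and do not come from the session above.

```python
import os

# Names the operating system either rejects or cannot find.
# The 5000-character name is an assumption picked to trigger an
# OS-level error rather than a simple "not found".
candidates = ["", "no-such-file-for-this-demo", "x" * 5000]

for path in candidates:
    # os.path.exists() swallows OSError from os.stat() and reports False.
    print(repr(path[:20]), os.path.exists(path))
```

In each case the operating system is asked and its refusal is turned into False, which is the contrast being drawn with the NUL case.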



> I said that Windows isn't POSIX,

And I said, as a by-the-by, that technically Windows is POSIX compliant,
for a very pedantically true but dubious in practice value of compliant.


> and pointed out just a couple of ways in which, to a programmer, Windows
> behaves very differently to POSIX-compliant systems.

Is that Windows out of the box, or Windows with the POSIX subsystem
installed and active?

You keep talking about "POSIX-compliant", but POSIX is a family of
standards. A system can be compliant with one POSIX standard without
being compliant to the others.

And ironically, neither Linux, OpenBSD, FreeBSD nor Darwin are fully
POSIX compliant, merely "mostly" compliant. (Or at least, they haven't
been certified as such.)

Not that it matters much in practice.


In any case, the minutiae of POSIX versus Windows, the availability of
drive letters and signals etc. are utterly irrelevant to the question of
what os.path.exists should do.

Just as it ought to be utterly irrelevant that on Linux native C strings
are null terminated.

Barry Scott

Jun 4, 2018, 6:35:21 AM


> On 1 Jun 2018, at 14:23, Paul Moore <p.f....@gmail.com> wrote:
>
> On 1 June 2018 at 13:15, Barry Scott <ba...@barrys-emacs.org> wrote:
>> I think the reason for the \0 check is that if the string is passed to the
>> operating system with the \0 you can get surprising results.
>>
>> If \0 was not checked for you would be able to get True from:
>>
>> os.path.exists('/home\0ignore me')
>>
>> This is because a posix system only sees '/home'.

Turns out that this is a limitation on Windows as well.
The \0 is not allowed for Windows, macOS and Posix.
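The truncation that the check guards against can be illustrated with ctypes, which reads a buffer the way C code reads a NUL-terminated string. This is a minimal sketch of the mechanism only; it does not call any file API.

```python
import ctypes

path = b"/home\0ignore me"

# create_string_buffer copies every byte, but .value reads the buffer
# the way C code reads a string: it stops at the first NUL byte.
buf = ctypes.create_string_buffer(path)

print(buf.raw)    # all stored bytes, plus the terminator ctypes appends
print(buf.value)  # what a NUL-terminated reader would see
```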

>
> So because the OS API can't handle filenames with \0 in (because that
> API uses null-terminated strings) Python has to special case its
> handling of the check. That's fine.
>
>> Surely ValueError is reasonable?
>
> Well, if the OS API can't handle filenames with embedded \0, we can be
> sure that such a file doesn't exist - so returning False is
> reasonable.

I think most of the file APIs check for \0 and raise ValueError on python3
and TypeError on python2.

os.path.exists() is not special and I don't think it should be changed.
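A quick way to see this on Python 3 is to pass a path with an embedded NUL to a few file APIs and record which ones raise ValueError. The helper name below is made up for the illustration.

```python
import os

def rejected_with_value_error(func, path):
    """Return True if func(path) raises ValueError (Python's own check)."""
    try:
        func(path)
    except ValueError:
        return True
    except OSError:
        # The OS was asked and refused; not the case being demonstrated.
        return False
    return False

results = {
    func.__name__: rejected_with_value_error(func, "a\0b")
    for func in (os.stat, os.listdir, open)
}
print(results)
```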

>
>> Once you know that all of the string you provided is given to the operating
>> system it can then do whatever checks it sees fit to and return a suitable
>> result.
>
> As the programmer, I don't care. The Python interpreter should take
> care of that for me, and if I say "does file 'a\0b' exist?" I want an
> answer. And I don't see how anything other than "no it doesn't" is
> correct. Python allows strings with embedded \0 characters, so it's
> possible to express that question in Python - os.path.exists('a\0b').
> What can be expressed in terms of the low-level (C-based) operating
> system API shouldn't be relevant.
>
> Disclaimer - the Python "os" module *does* expose low-level
> OS-dependent functionality, so it's not necessarily reasonable to
> extend this argument to other functions in os. But it seems like a
> pretty solid argument in this particular case.
>
>> As an aside Windows has lots of special filenames that you have to know about
>> if you are writing robust file handling. AUX, COM1, \this\is\also\COM1 etc.
>
> I don't think that's relevant in this context.

I think it is. This started because the OP was surprised that they needed to check for \0.
There are related surprises waiting. I'm pointing out that it's more than \0 that a robust
piece of code will need to consider.
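As a rough illustration of the kind of extra check meant here, the sketch below flags paths whose final component is a classic DOS device name. The function name is hypothetical, and the name list and the stripping rule are simplifying assumptions, not a complete statement of what Windows does.

```python
import ntpath

# Classic DOS device names. Assumed to cover the common cases only.
RESERVED = {"CON", "PRN", "AUX", "NUL"}
RESERVED |= {f"COM{i}" for i in range(1, 10)}
RESERVED |= {f"LPT{i}" for i in range(1, 10)}

def looks_reserved(path: str) -> bool:
    """Heuristic: is the final path component a DOS device name?"""
    name = ntpath.basename(path)
    # Ignore anything after the first dot, and trailing spaces.
    stem = name.split(".")[0].rstrip(" ").upper()
    return stem in RESERVED

for candidate in (r"\this\is\also\COM1", "aux.txt", "auxiliary.txt"):
    print(candidate, looks_reserved(candidate))
```

ntpath is used so the sketch parses backslash-separated paths the same way on any platform.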

Barry

Steven D'Aprano

Jun 4, 2018, 8:03:57 AM
On Mon, 04 Jun 2018 11:16:21 +0100, Barry Scott wrote:

[...]
> Turns out that this is a limitation on Windows as well. The \0 is not
> allowed for Windows, macOS and Posix.

We -- all of us, including myself -- have been terribly careless all
through this discussion. The fact is, this should not be an OS limitation
at all. It is a *file system* limitation.

If I can mount a HFS or HFS-Plus disk on Linux, it can include file names
with embedded NULs or slashes. (Only the : character is illegal in HFS
file names.) It shouldn't matter what the OS is, if I have drivers for
HFS and can mount a HFS disk, I ought to be able to sensibly ask for file
names including NUL.

Marko Rauhamaa

Jun 4, 2018, 8:26:24 AM
Barry Scott <ba...@barrys-emacs.org>:
> os.path.exists() is not special and I don't think it should be changed.

You are right that os.path.exists() might be logically tied to other
os.* facilities. The question is, should the application be cognizant of
the seam between the standard library and the operating system kernel?

When a Linux system call contains an illegal value, it responds with
errno=EINVAL. In Python, that's represented by the OSError exception
with e.errno=EINVAL. However, when Python encounters an illegal value
itself, it usually raises a ValueError. Is it useful for the application
to have to be prepared for OSError/EINVAL and ValueError separately? Or
should the difference be paved over by Python?

As it stands, os.path.exists() really means: the operating system
doesn't have a reason to fail os.stat() on the pathname. Python
intercedes with an exception if it can't even ask the operating system
for its opinion. That dichotomy is not suggested by the os.path.exists()
documentation. In fact, the whole point of os.path.* is to provide for
an abstraction to isolate the application from the intricacies of the
operating system specifics.

BTW, I challenge you to find a test case that tests the proper behavior
of an application if it encounters a pathname with a NUL in it. Or code
that gracefully catches a ValueError from os.path.exists().
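For concreteness, code that papers over the seam might look like the sketch below. The helper name is hypothetical; it treats "the OS said no" and "Python refused to ask" alike. (I believe Python 3.8 changed os.path.exists() itself to behave this way, but check the documentation for your version.)

```python
import os

def path_exists(path) -> bool:
    """Hypothetical helper: False for anything os.stat() cannot confirm."""
    try:
        os.stat(path)
    except (OSError, ValueError):
        # OSError: the OS was asked and declined (ENOENT, EINVAL, ...).
        # ValueError: Python refused to ask, e.g. "embedded null byte".
        return False
    return True

print(path_exists(os.curdir))
print(path_exists("a\0b"))
```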


Marko

Paul Moore

Jun 4, 2018, 8:33:45 AM
On 4 June 2018 at 13:01, Steven D'Aprano
<steve+comp....@pearwood.info> wrote:

>> Turns out that this is a limitation on Windows as well. The \0 is not
>> allowed for Windows, macOS and Posix.
>
> We -- all of us, including myself -- have been terribly careless all
> through this discussion. The fact is, this should not be an OS limitation
> at all. It is a *file system* limitation.
>
> If I can mount a HFS or HFS-Plus disk on Linux, it can include file names
> with embedded NULs or slashes. (Only the : character is illegal in HFS
> file names.) It shouldn't matter what the OS is, if I have drivers for
> HFS and can mount a HFS disk, I ought to be able to sensibly ask for file
> names including NUL.

Agreed, being completely precise in this situation is both pretty
complicated, and essential.

The question of what are legal characters in a filename is, as you
say, a filesystem related issue. People traditionally forget this
point, but in these days of cross-platform filesystem mounting,
networked filesystems[1], etc, it's more and more relevant, and
thankfully people are getting more aware of the point.

But there's also the question of what capability the kernel API has to
express the queries. The fact that the Unix API (and the Windows one,
in most cases - although as Eryk Sun pointed out there are exceptions
in the Windows kernel API) uses NUL-terminated strings means that
querying the filesystem about filenames with embedded \0 characters
isn't possible *at the OS level*. (As another example, the fact that
the Unix kernel treats filenames as byte strings means that there are
translation issues querying an NTFS filesystem that uses Unicode
(UTF-16) natively - and vice versa when Windows queries a Unix-native
filesystem).

So "it's complicated" is about the best we can say :-)

Paul

[1] And of course if you mount (say) an NTFS filesystem over NFS, you
have *two* filesystems involved, each adding its own layer of
restrictions and capabilities.

Steven D'Aprano

Jun 4, 2018, 9:26:35 AM
On Mon, 04 Jun 2018 13:33:28 +0100, Paul Moore wrote:

> But there's also the question of what capability the kernel API has to
> express the queries. The fact that the Unix API (and the Windows one, in
> most cases - although as Eryk Sun pointed out there are exceptions in
> the Windows kernel API) uses NUL-terminated strings means that querying
> the filesystem about filenames with embedded \0 characters isn't
> possible *at the OS level*.

I don't know whether or not the Linux OS is capable of accessing files
with embedded NULs in the file name. But Mac OS is capable of doing so,
so it should be possible. Wikipedia says:

"HFS Plus mandates support for an escape sequence to allow arbitrary
Unicode. Users of older software might see the escape sequences instead
of the desired characters."

Apple File System is an even more modern FS (it replaced HFS Plus in 2017
as Apple's preferred file system) which supports all Unicode code points,
Grant Edwards

Jun 4, 2018, 10:13:52 AM
The normal Win32 API that all Windows apps use is not Posix compliant.

However, there is an API layer Microsoft provides (or provided) that
is/was Posix compliant. At one point, I think it was an add-on that
had to be purchased separately. I've never heard of anybody actually
_using_ it, but it allowed some US government purchasing droid to
check the "Posix Compliant" box on an acquisition checklist back in
the 90's.

--
Grant Edwards grant.b.edwards Yow! But they went to MARS
at around 1953!!
gmail.com

Peter J. Holzer

Jun 4, 2018, 4:14:13 PM
On 2018-06-04 13:23:59 +0000, Steven D'Aprano wrote:
> On Mon, 04 Jun 2018 13:33:28 +0100, Paul Moore wrote:
> > But there's also the question of what capability the kernel API has to
> > express the queries. The fact that the Unix API (and the Windows one, in
> > most cases - although as Eryk Sun pointed out there are exceptions in
> > the Windows kernel API) uses NUL-terminated strings means that querying
> > the filesystem about filenames with embedded \0 characters isn't
> > possible *at the OS level*.
>
> I don't know whether or not the Linux OS is capable of accessing files
> with embedded NULs in the file name. But Mac OS is capable of doing so,
> so it should be possible. Wikipedia says:
>
> "HFS Plus mandates support for an escape sequence to allow arbitrary
> Unicode. Users of older software might see the escape sequences instead
> of the desired characters."

I don't know about MacOS. In Linux there is no way to pass a filename
with an embedded '\0' (or a '/' which is not a path separator) between the
kernel and user space. So if a filesystem contained such a filename, the
kernel would have to map it (via an escape sequence or some other
mechanism) to a different file name. Which of course means that - from
the perspective of any user space process - the filename doesn't contain
a '\0' or '/'.

Theoretically that mapping could be reversed in the standard library of
a language which allows '\0' in strings (like Python), but since that
would mean that programs written in that language see different
filenames than programs written in other languages (especially C, which
covers the majority of the GNU command line tools), this would be a very
bad idea. Much better to have strange but consistent filenames if you
mount a "foreign" file system. (This is btw also what Samba does,
although it does a spectacularly bad job).

eryk sun

Jun 4, 2018, 7:27:19 PM
On Sat, Jun 2, 2018 at 11:28 AM, Chris Angelico <ros...@gmail.com> wrote:
>
> I also can't find anything about path names there. What does POSIX say
> about the concept of relative paths? Does Windows comply with that?

Certainly Windows file-system paths are not POSIX compatible. Seven
path types are supported:

* Extended Local Device (\\?\)
* Local Device (\\.\)
* UNC
* Drive Absolute
* Drive Relative
* Rooted
* Relative

Extended local-device paths only allow backslash as a path separator.
The others allow either backslash or slash. I doubt POSIX would allow
magically reserved DOS device names in every directory or stripping of
trailing dots and spaces from filenames.
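The seven categories can be told apart by prefix alone. The classifier below is a simplified sketch, not the algorithm Windows uses: the function name is made up, and the treatment of a forward-slash "//?/" prefix as an ordinary local-device path is an assumption.

```python
def classify_windows_path(path: str) -> str:
    """Rough prefix-based classification into the seven path types."""
    # Only the backslash form is the "extended" prefix.
    if path.startswith("\\\\?\\"):
        return "extended local device"
    norm = path.replace("/", "\\")
    # Assumption: "//?/" with forward slashes is handled like "\\.\".
    if norm.startswith(("\\\\.\\", "\\\\?\\")):
        return "local device"
    if norm.startswith("\\\\"):
        return "UNC"
    if len(norm) >= 2 and norm[0].isalpha() and norm[1] == ":":
        if len(norm) >= 3 and norm[2] == "\\":
            return "drive absolute"
        return "drive relative"
    if norm.startswith("\\"):
        return "rooted"
    return "relative"

for p in (r"\\?\C:\Temp", r"\\.\COM1", r"\\server\share", r"C:\Temp",
          "C:Temp", r"\Temp", "Temp"):
    print(p, "->", classify_windows_path(p))
```

Three of the seven results ("drive relative", "rooted" and "relative") depend on some current directory or drive, which is the point raised earlier about relative paths.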

But this isn't relevant to NT's POSIX compatibility. A POSIX process
links with psxdll.dll, which connects to the POSIX environment
subsystem (psxss.exe). It gets run from Windows via posix.exe
(console) or psxrun.exe. In the 00s, Microsoft acquired Interix, which
extended the original POSIX subsystem, and integrated it as the
Subsystem for UNIX Applications (SUA). Notably SUA adds a kernel
driver, psxdrv.sys, which facilitates implementing system calls and
signals. There used to be a community website with overviews [1], a
FAQ [2], a forum [3], tool downloads [4], and various documentation
[5]. However, NT's environment subsystems never really had mass
appeal, probably because existing programs had to be ported and
recompiled. SUA is no longer supported as of Windows 8.1 and Server
2012 R2. The community website was closed, and the domain is now held
by a squatter.

Regarding file-system paths, SUA has a single root directory and uses
"/dev/fs/C" for drive "C:" and "/net/server/share" for
"\\server\share".

[1]: http://www.suacommunity.com/SUA_Tools_Env_Start.htm
https://archive.li/45JG
[2]: http://www.suacommunity.com/FAQs.htm
https://archive.li/5LFw
[3]: http://www.suacommunity.com/forum2
https://archive.li/LzZxS
[4]: http://www.suacommunity.com/tool_warehouse.aspx
https://archive.li/0luI9
[5]: http://www.suacommunity.com/dictionary/fork-entry.php
https://archive.li/5k8vW

Windows 10 has a Linux subsystem (WSL), but this is not an NT
environment subsystem. WSL processes do not load ntdll.dll. They're
lightweight pico processes with an associated pico provider in the
kernel (lxss.sys, lxcore.sys), and they directly execute native Linux
binaries (no porting and recompiling from source). WSL only supports
the console, but at least the console was upgraded to support
virtual-terminal mode.

> We know that Ctrl-C maps to the internal Windows interrupt
> handler, and "kill process" maps to the internal Windows "terminate",
> but can you send a different process all the different signals and
> handle them differently?

IIRC, the original POSIX subsystem supported only single-threaded
processes, and SIGKILL called NtTerminateThread. Of course the
subsystem has its own client bookkeeping to handle here as well. (For
the Windows subsystem, csrss.exe also maintains shadow process and
thread structures for clients. This is how an environment subsystem
supplements base NT behavior.)

Regarding Ctrl+C, a console session is started by posix.exe, which is
a Windows console application. It translates console control events to
signals, e.g. CTRL_C_EVENT to SIGINT, CTRL_BREAK_EVENT to SIGQUIT, and
otherwise SIGKILL (e.g. closing the console, logoff, shutdown). It
sends the signal number and session ID to the subsystem, which signals
the processes in the given session.
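The translation just described amounts to a small lookup. The sketch below is a toy model only: the event codes are the Win32 constants, the signal names are plain strings, and nothing here is the actual posix.exe code.

```python
# Win32 console control event codes.
CTRL_C_EVENT = 0
CTRL_BREAK_EVENT = 1
CTRL_CLOSE_EVENT = 2
CTRL_LOGOFF_EVENT = 5
CTRL_SHUTDOWN_EVENT = 6

def event_to_signal(event: int) -> str:
    """Toy model of the console-event-to-signal translation."""
    if event == CTRL_C_EVENT:
        return "SIGINT"
    if event == CTRL_BREAK_EVENT:
        return "SIGQUIT"
    # Closing the console, logoff and shutdown all end up here.
    return "SIGKILL"

for ev in (CTRL_C_EVENT, CTRL_BREAK_EVENT, CTRL_CLOSE_EVENT):
    print(ev, event_to_signal(ev))
```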

One way for the subsystem to implement signal delivery is via NT's
runtime library function RtlRemoteCall (i.e. suspend the target
thread, get its CPU context and copy it to the stack, modify the
context and stack, and resume). Make a remote call to a known client
function (i.e. in psxdll.dll), which delivers the signal and then
continues the thread's original context via NtContinue. This approach
isn't really efficient, but it's basically how the original POSIX
subsystem worked. SUA probably uses NT asynchronous procedure calls
(APCs).

---

Appendix: NT APCs

NT doesn't have anything exactly like POSIX signals. It has
exceptions, which are handled using either Vectored Exception Handling
or Structured Exception Handling (i.e. MSVC __try, __except,
__finally), and asynchronous procedure calls (APCs). Some POSIX
signals correspond to NT exceptions (e.g. SIGSEGV corresponds to a
STATUS_ACCESS_VIOLATION). But APCs are what a POSIX subsystem would
likely use to implement signals.

There are two types of APC: kernel and user. A thread has an APC queue
for each type. User APCs can be queued from user mode via
NtQueueApcThread, or via WinAPI QueueUserAPC. Some APIs such as
ReadFileEx take an optional completion or notification APC routine,
for which a kernel component queues the user APC.

All APCs have a "kernel routine" and most also have a "normal"
routine. The kernel routine is called first, with the CPU in kernel
mode and its interrupt request level (IRQL) at APC_LEVEL (1). The
kernel routine is passed a pointer to the normal routine, which allows
it to set a different function or none at all (i.e. a NULL pointer).
If it's not NULL, the normal routine is called with the CPU IRQL at
PASSIVE_LEVEL (0), either in kernel mode or user mode, depending on
the APC type.

Kernel APCs are "special" if they're inserted in the queue without a
normal routine. Special APCs get placed ahead of normal APCs in the
queue and can preempt the execution of normal APCs. They're used for
high priority operations. For example, completion of an I/O request
queues a special kernel APC to the thread that originated the request.

Queueing a kernel APC either raises an APC interrupt if the thread is
currently running or awakens the thread if it's currently waiting and
has APC delivery enabled. Queueing a user APC does not raise an
interrupt but may awaken the thread if it's currently in a user-mode
wait and either the wait is alertable or the user APC pending flag is
set.

Kernel APC delivery is triggered either by the APC interrupt handler
or immediately after a context switch to a thread (e.g. when
awakened). User APCs are delivered when returning to user mode, but
only if the thread's user APC pending flag is set. Normally this flag
gets set when the thread does an alertable user-mode delay/wait and
its APC queue isn't empty. It's also set by the NtTestAlert system
call, and also specially set for the user APC that initiates
cross-thread termination.

A thread automatically resumes a delay or wait if it's awakened to
deliver a kernel APC. On the other hand, if a thread is awakened by
queueing a user APC, the wait returns with the status code
STATUS_USER_APC. This is the normal way user APCs are delivered, i.e.
upon returning from an alertable delay or wait system call (e.g.
NtDelayExecution, NtWaitForSingleObject). As mentioned above,
NtTestAlert can also be used to pump the user APC queue.

The APC delivery function first drains the kernel APC queue
completely. If delivering user APCs is enabled (i.e. the user APC
pending flag is set and the previous CPU mode is user mode), it also
delivers the first user APC from the head of the queue. Only one user
APC is delivered because it requires switching to the user-mode APC
dispatcher in ntdll.dll. This can't process the whole queue because,
as discussed above, all APCs have a kernel-mode routine that gets
called first. Thus, with user APCs, the pattern is to call the kernel
routine from the APC delivery function; transition to user-mode to
call the normal routine if the kernel routine didn't set it to NULL;
and then return back to kernel mode via the NtContinue system call.
The latter sets the user APC pending flag to deliver the next user
APC, if any. This cycle continues until the user APC queue is empty.
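The delivery cycle in the last paragraph can be modelled with two queues and a flag. This is a toy simulation written for illustration: the class and function names are invented, and it ignores IRQLs, kernel routines and everything else the real kernel does.

```python
from collections import deque

class ToyThread:
    """Toy model of one thread's APC state; not real NT structures."""
    def __init__(self, kernel_apcs=(), user_apcs=(), user_apc_pending=False):
        self.kernel_apcs = deque(kernel_apcs)
        self.user_apcs = deque(user_apcs)
        self.user_apc_pending = user_apc_pending
        self.delivered = []

def deliver_apcs(thread, previous_mode_is_user=True):
    """Drain kernel APCs, then deliver at most one user APC.

    Returns True if a user APC was handed to the user-mode dispatcher.
    """
    while thread.kernel_apcs:
        thread.delivered.append(("kernel", thread.kernel_apcs.popleft()))
    if thread.user_apc_pending and previous_mode_is_user and thread.user_apcs:
        thread.user_apc_pending = False
        thread.delivered.append(("user", thread.user_apcs.popleft()))
        return True
    return False

def nt_continue(thread):
    """Model of the return to kernel mode: re-arm the flag if APCs remain."""
    thread.user_apc_pending = bool(thread.user_apcs)

def pump(thread):
    """Repeat the deliver/continue cycle until no user APC is delivered."""
    while deliver_apcs(thread):
        nt_continue(thread)

t = ToyThread(kernel_apcs=["k1", "k2"], user_apcs=["u1", "u2"],
              user_apc_pending=True)
pump(t)
print(t.delivered)
```

A thread whose pending flag is clear delivers no user APCs at all in this model, matching the description of alertable waits and NtTestAlert as the things that set the flag.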

wxjm...@gmail.com

Jun 5, 2018, 2:43:23 AM
Le jeudi 31 mai 2018 14:03:23 UTC+2, Marko Rauhamaa a écrit :
> This surprising exception can even be a security issue:
>
> >>> os.path.exists("\0")
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> File "/usr/lib64/python3.6/genericpath.py", line 19, in exists
> os.stat(path)
> ValueError: embedded null byte
>
> Most other analogous reasons *don't* generate an exception, nor is that
> possibility mentioned in the specification:
>
> https://docs.python.org/3/library/os.path.html?#os.path.exists
>
> Is the behavior a bug? Shouldn't it be:
>
> >>> os.path.exists("\0")
> False
>
>
> Marko

Do not worry too much.

Even if a path/filename exists, this language is such
a buggy mess when it comes to character encoding that
you may not even be able to open the file or use the path.

I know from experience; it's not worth showing an example.

Exercise: Show that Powershell works, where Python fails.

In fact, practically the totality of "Python" software
is not working.

Steven D'Aprano

Jun 5, 2018, 3:40:07 AM
On Mon, 04 Jun 2018 22:13:47 +0200, Peter J. Holzer wrote:

> On 2018-06-04 13:23:59 +0000, Steven D'Aprano wrote:
[...]

>> I don't know whether or not the Linux OS is capable of accessing files
>> with embedded NULs in the file name. But Mac OS is capable of doing so,
>> so it should be possible. Wikipedia says:
>>
>> "HFS Plus mandates support for an escape sequence to allow arbitrary
>> Unicode. Users of older software might see the escape sequences instead
>> of the desired characters."
>
> I don't know about MacOS. In Linux there is no way to pass a filename
> with an embedded '\0' (or a '/' which is not a path separator) between the
> kernel and user space. So if a filesystem contained such a filename, the
> kernel would have to map it (via an escape sequence or some other
> mechanism) to a different file name. Which of course means that - from
> the perspective of any user space process - the filename doesn't contain
> a '\0' or '/'.

That's an invalid analogy. According to that analogy, Python strings
don't contain ASCII NULs, because you have to use an escape mechanism to
insert them:

string = "Is this \0 not a NULL?"


But we know that Python strings are not NUL-terminated and can contain
NUL. It's just another character.
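This is easy to confirm at the interpreter. The snippet below only inspects the string from the example above.

```python
s = "Is this \0 not a NULL?"

# The escape sequence produced exactly one character, with code point 0.
position = s.index("\0")
print(len(s), position, ord(s[position]))

# The same holds once the string is encoded to bytes.
print(b"\x00" in s.encode("utf-8"))
```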

Chris Angelico

Jun 5, 2018, 6:15:26 AM
No; by that analogy, a Python string cannot contain a non-Unicode
character. Here's a challenge: create a Python string that contains a
character that isn't part of the Universal Character Set.

ChrisA

Steven D'Aprano

Jun 5, 2018, 9:14:23 AM
Huh? In what way is that the analogy being made? Your challenge is
impossible from pure Python, equivalent to "create a Python bytes object
that contains a byte greater than 255". The challenge is rigged to be
doomed to fail.
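Both limits can be checked from pure Python. The helper name below is made up for the illustration.

```python
def raises_value_error(func, arg):
    """Return True if func(arg) raises ValueError."""
    try:
        func(arg)
    except ValueError:
        return True
    return False

# One past the last Unicode code point, and one past the largest byte.
print(raises_value_error(chr, 0x110000))
print(raises_value_error(bytes, [256]))

# The largest valid values are accepted.
print(raises_value_error(chr, 0x10FFFF))
print(raises_value_error(bytes, [255]))
```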

That's not the case when it comes to \0 in file names: we know that Mac
OS can do it, we know HFS and Apple FS support NUL in file names. We have
an existence proof that it is possible.

(Although in your case, it is conceivable that using C you might be able
to solve the challenge: create a string using the UCS-4 implementation
(32-bit code units), then modify some code unit to be a value outside of
the 21-bit range supported by Unicode. But that would require low-level
hacking, it isn't supported by the language or the interpreter except
maybe via ctypes.)

Apple FS, HFS and HFS Plus support \0 as a valid Unicode character. The
Mac OS kernel has an escape mechanism to allow user code to include \0
characters in pathnames, just as Python has an escape mechanism to allow
user code to include \0 in strings.

There's no such escape mechanism for characters outside of Unicode.

Chris Angelico

Jun 5, 2018, 9:27:48 AM
On Tue, Jun 5, 2018 at 11:11 PM, Steven D'Aprano
And an ASCIIZ string cannot contain a byte value of zero. The parallel is exact.

ChrisA

wxjm...@gmail.com

Jun 5, 2018, 10:06:17 AM
Python is the only language with a buggy and
counterproductive (in both memory and performance)
Unicode implementation.

Amen.

Peter J. Holzer

Jun 5, 2018, 11:28:43 AM
On 2018-06-05 07:37:34 +0000, Steven D'Aprano wrote:
> On Mon, 04 Jun 2018 22:13:47 +0200, Peter J. Holzer wrote:
> > On 2018-06-04 13:23:59 +0000, Steven D'Aprano wrote:
> >> I don't know whether or not the Linux OS is capable of accessing
> >> files with embedded NULs in the file name. But Mac OS is capable of
> >> doing so, so it should be possible. Wikipedia says:
> >>
> >> "HFS Plus mandates support for an escape sequence to allow
> >> arbitrary Unicode. Users of older software might see the escape
> >> sequences instead of the desired characters."
> >
> > I don't know about MacOS. In Linux there is no way to pass a
> > filename with an embedded '\0' (or a '/' which is not a path
> > separator) between the kernel and user space. So if a filesystem
> > contained such a filename, the kernel would have to map it (via an
> > escape sequence or some other mechanism) to a different file name.
> > Which of course means that - from the perspective of any user space
> > process - the filename doesn't contain a '\0' or '/'.
>
> That's an invalid analogy. According to that analogy, Python strings
> don't contain ASCII NULs, because you have to use an escape mechanism
> to insert them:
>
> string = "Is this \0 not a NULL?"
>
>
> But we know that Python strings are not NUL-terminated and can contain
> NUL. It's just another character.

I think that's a bad analogy.

The escape mechanism for string literals is mostly for convenience of
the programmer. It's there to make the program's source code more
readable (and yes, also easier to write). But at run time the \0
character is just that: A character with the value 0.

If a disk with a file system which allows embedded NUL characters is
mounted on Linux (let's for the sake of the argument assume it is HFS+,
although I have to admit that I don't know anything about the internals
of that filesystem), then the low level filesystem code has to map that
character to something else. Even the generic filesystem code of the
kernel will never see that NUL character, let alone the user space. As
far as the OS is concerned, that file doesn't contain a NUL character.
The whole system (except for some low-level FS-dependent code) will
always only see the mapped name.

If some application (which might be an interpreter, or it might be a
graphics program, for example) decides that it knows better what the
"real" filename is and reverses that mapping, it can do so - but it
would be very confusing because it would use a different file name than
the rest of the system. The user would see one file name with ls, but
would have to use a different filename in the application. The
application would show one filename in its "save" dialog, but the OS's
file manager would show another. Not a good idea, especially as the
benefits of such a scheme would be extremely narrow (you could share an
HFS+ formatted USB disk between MacOS and Linux with filenames with
embedded NULs and that application would let you use the same filenames
as you would use on MacOS).

Now, if MacOS uses something like that, this is a different matter.
Presumably (since HFS+ is a native file system) the kernel deals with
NUL characters in a straightforward manner. It might even have a
(non-POSIX) API to expose such filenames. Even if it hasn't, presumably
the mapping back and forth is done in a very low level library used by
all (or most) of the applications, so that they all show consistently
the same filename.

But Linux isn't MacOS. On Linux there are no filenames with embedded
NULs, even if you mount an HFS+ disk and even if some application
decides to internally remap filenames in a way that they can contain NUL
characters.

eryk sun

Jun 5, 2018, 1:35:06 PM
On Tue, Jun 5, 2018 at 3:28 PM, Peter J. Holzer <hjp-p...@hjp.at> wrote:
>
> Now, if MacOS uses something like that, this is a different matter.
> Presumably (since HFS+ is a native file system) the kernel deals with
> NUL characters in a straightforward manner. It might even have a
> (non-POSIX) API to expose such filenames. Even if it hasn't, presumably
> the mapping back and forth is done in a very low level library used by
> all (or most) of the applications, so that they all show consistently
> the same filename.

The Linux subsystem in Windows 10 has to use character escaping. The
root file system is stored in the NTFS directory
"%LocalAppData%\Packages\<distro package name>\LocalState\rootfs". It
escapes invalid NTFS characters (as implemented by the ntfs.sys
driver) using the hex code prefixed by "#". Thus "#" itself has to be
escaped as "#0023". For example:

$ touch '\*?<>|#'
$ ls '\*?<>|#'
\*?<>|#

With CMD in the above directory, we can see the real filename:

> dir /b #*
#005C#002A#003F#003C#003E#007C#0023
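The escaping shown in that listing can be reproduced in a few lines. This is a sketch of the scheme as it appears in the example, not the driver's implementation: the function names are invented, and the set of escaped characters is limited to those in the example, which is an assumption (the real set is probably larger).

```python
import re

# Characters escaped in the example above; an assumed subset.
ESCAPED = set('\\*?<>|#')

def escape_name(name: str) -> str:
    """Replace each escaped character with '#' plus four hex digits."""
    return "".join(f"#{ord(c):04X}" if c in ESCAPED else c for c in name)

def unescape_name(name: str) -> str:
    """Reverse escape_name()."""
    return re.sub(r"#([0-9A-F]{4})",
                  lambda m: chr(int(m.group(1), 16)), name)

original = '\\*?<>|#'
stored = escape_name(original)
print(stored)
print(unescape_name(stored) == original)
```

Because "#" itself is escaped, every "#" in a stored name starts an escape sequence, so the mapping can be reversed without ambiguity.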

Steven D'Aprano

Jun 6, 2018, 3:20:34 AM
On Tue, 05 Jun 2018 17:28:24 +0200, Peter J. Holzer wrote:
[...]
> If a disk with a file system which allows embedded NUL characters is
> mounted on Linux (let's for the sake of the argument assume it is HFS+,
> although I have to admit that I don't know anything about the internals
> of that filesystem), then the low level filesystem code has to map that
> character to something else. Even the generic filesystem code of the
> kernel will never see that NUL character,

Even if this were true, why is it even the tiniest bit relevant to what
os.path.exists() does when given a path containing a NUL byte?


> let alone the user space. As
> far as the OS is concerned, that file doesn't contain a NUL character.

I don't care about "as far as the OS". I care about users, people like
me. If I say "Here's a file called "sp\0am" then I don't care what the OS
does, or the FS driver, or the disk hardware. I couldn't care less what
the actual byte pattern on the disk is.

If you told me that the pattern of bytes representing that filename was
0x0102030405 then I'd be momentarily impressed by the curious pattern and
then do my best to immediately forget all about it.

As a Python programmer, *why do you care* about NULs? How does this
special treatment make your life as a Python programmer better?


> The whole system (except for some low-level FS-dependent code) will
> always only see the mapped name.

Yes. So what? That's *already the case*. Even the Python string you pass to
os.path.exists is already mapped, and errors from the kernel are mapped
to False. Why should NUL be treated differently?

Typical Linux file systems (ext3, ext4, btrfs, ReiserFS etc) don't
support Unicode, only bytes 0...255, but we can query "invalid" file
names containing characters like δ ж or ∆, without any problem. We don't
get ValueError just because of some irrelevant technical detail that the
file system doesn't support characters outside of the range of bytes
1...255 (excluding 47). We can do this because Python seamlessly maps
Unicode to bytes and back again.

You may have heard of a little-known operating system called "Windows",
which defaults to NTFS as its file system. I'm told that there are a few
people who use this file system. Even under Linux, you might have
(knowingly or unknowingly) used a network file system or storage device
that used NTFS under the hood.

If so, then every time you query a filename, even an ordinary looking one
like "foo", you could be dealing with multiple NUL bytes, as the NTFS
file system (even under Linux!) uses Unicode file names encoded with
UTF-16. There's a good chance that EVERY filename you've used on a NAS
device or network drive has included embedded NUL bytes.
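To make the UTF-16 point concrete, here is a small sketch. It only compares encoded byte strings; the helper name nul_count is made up for illustration, and nothing here inspects how a particular NTFS driver actually stores names.

```python
# ASCII characters encoded as UTF-16 carry a zero high byte,
# while the UTF-8 form of the same name has no zero bytes at all.
def nul_count(name: str, encoding: str) -> int:
    """Count zero bytes in the encoded form of a file name."""
    return name.encode(encoding).count(0)


utf16_nuls = nul_count("foo", "utf-16-le")  # encoded as b'f\x00o\x00o\x00'
utf8_nuls = nul_count("foo", "utf-8")       # encoded as b'foo'
```

Under these assumptions the UTF-16 form of "foo" holds three zero bytes and the UTF-8 form holds none.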

You've painted a pretty picture of the supposed confusion and difficulty
such NUL bytes would cause, but it's all nonsense. We already can
seamlessly and transparently interact with file systems where file names
include NUL bytes under Linux.

BUT even if what you said was true, that Linux cannot deal with NUL bytes
in file names even with driver support, even if passing a NUL byte to the
Linux kernel would cause the fall of human civilization, that STILL
wouldn't require us to raise ValueError from os.path.exists!

Steven D'Aprano

Jun 6, 2018, 11:58:33 PM
On Tue, 05 Jun 2018 23:27:16 +1000, Chris Angelico wrote:

> And an ASCIIZ string cannot contain a byte value of zero. The parallel
> is exact.

Why should we, as Python programmers, care one whit about ASCIIZ strings?
They're not relevant. You might as well say that file names cannot
contain the character "π" because ASCIIZ strings don't support it.

No they don't, and yet nevertheless file names can and do contain
characters outside of the ASCIIZ range.

Python strings are rich objects which support the Unicode code point \0
in them. The limitation of the Linux kernel that it relies on NULL-
terminated byte strings is irrelevant to the question of what
os.path.exists ought to do when given a path containing NUL. Other
invalid path names return False.

As a Python programmer, how does treating NUL specially make our life
better?

I don't know what the implementation of os.path.exists is precisely, but
in pseudocode I expect it is something like this:


if "\0" in pathname:
    panic("OH NOES A NUL WHATEVER SHALL WE DO?!?!?!")
else:
    ask the OS to do a stat on pathname
    if an error occurs:
        return False
    else:
        return True


Why not just return False instead of panicking?
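A minimal sketch of that alternative, assuming the wrapper simply treats ValueError the same way as OSError. The name exists_or_false is hypothetical and this is not the stdlib implementation, although Python 3.8 and later document os.path.exists() as returning False for paths that cannot be represented at the OS level.

```python
import os


def exists_or_false(pathname):
    """Return True only if os.stat() succeeds on pathname."""
    try:
        os.stat(pathname)
    except (OSError, ValueError):
        # OSError: the OS rejected the request (missing, too long, ...).
        # ValueError: the wrapper could not even pass the path to the OS,
        # for example because of an embedded NUL.
        return False
    return True
```

With this shape the caller gets a plain boolean for every string, which is the behaviour argued for above.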

Chris Angelico

Jun 7, 2018, 3:45:23 AM
On Thu, Jun 7, 2018 at 1:55 PM, Steven D'Aprano
<steve+comp....@pearwood.info> wrote:
> On Tue, 05 Jun 2018 23:27:16 +1000, Chris Angelico wrote:
>
>> And an ASCIIZ string cannot contain a byte value of zero. The parallel
>> is exact.
>
> Why should we, as Python programmers, care one whit about ASCIIZ strings?
> They're not relevant. You might as well say that file names cannot
> contain the character "π" because ASCIIZ strings don't support it.
>
> No they don't, and yet nevertheless file names can and do contain
> characters outside of the ASCIIZ range.

Under Linux, a file name contains bytes, most commonly representing
UTF-8 sequences. So... an ASCIIZ string *can* contain that character,
or at least a representation of it. Yet it cannot contain "\0".

ChrisA

Antoon Pardon

Jun 7, 2018, 4:06:15 AM
On 07-06-18 05:55, Steven D'Aprano wrote:
> Python strings are rich objects which support the Unicode code point \0
> in them. The limitation of the Linux kernel that it relies on NULL-
> terminated byte strings is irrelevant to the question of what
> os.path.exists ought to do when given a path containing NUL. Other
> invalid path names return False.

It is not irrelevant. It makes the distinction clear between possible
values and impossible values. Now you personally may find that distinction
of minor importance but it is a relevant distinction in discussing how
to treat it.

> As a Python programmer, how does treating NUL specially make our life
> better?

By treating possible path values differently from impossible path values.

--
Antoon.

Marko Rauhamaa

Jun 7, 2018, 5:29:40 AM
Antoon Pardon <antoon...@vub.be>:
There are all kinds of impossibility. The os.stat() reports those
impossibilities via an OSError exception. It's just that
os.path.exists() converts the OSError exception into a False return
value. A ValueError is raised by the Python os.stat() wrapper to
indicate that it can't even deliver the request to the kernel.
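The two failure paths described here can be observed directly. This is a small sketch; stat_outcome is a made-up helper name, and the exact OSError subclass raised for a given path depends on the platform.

```python
import os


def stat_outcome(path):
    """Report which layer rejected the path, if any."""
    try:
        os.stat(path)
    except ValueError:
        return "ValueError"  # rejected by the Python wrapper, before any syscall
    except OSError:
        return "OSError"     # rejected by the operating system
    return "ok"
```

For instance, stat_outcome("\0") reports "ValueError", while an empty or missing path reports "OSError".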

The application programmer doesn't give an iota who determined the
impossibility of a pathname. Unfortunately, os.path.exists() forces the
distinction on the application. If I have to be prepared to catch a
ValueError from os.path.exists(), what added value does os.path.exists()
give on top of os.stat()? The whole point of os.path.exists() is

1. To provide an operating-system-independent abstraction.

2. To provide a boolean interface instead of an exception interface.



This is a security risk. Here is a brief demonstration. Copy the example
HTTP server from:

<URL: https://docs.python.org/3/library/http.server.html?highlight=http#http.server.SimpleHTTPRequestHandler>

Run the server. Try these URLs in your browser:

1. http://localhost:8000/

=> The directory listing is provided

2. http://localhost:8000/test.html

=> A file is served or an HTTP error response (404) is generated

3. http://localhost:8000/te%00st.html

=> The server crashes with a ValueError and the TCP connection is
reset


Marko

Marko Rauhamaa

Jun 7, 2018, 5:40:53 AM
Marko Rauhamaa <ma...@pacujo.net>:

> This is a security risk. Here is a brief demonstration. Copy the example
> HTTP server from:
>
> <URL: https://docs.python.org/3/library/http.server.html?highlight=h
> ttp#http.server.SimpleHTTPRequestHandler>
>
> [...]
>
> 3. http://localhost:8000/te%00st.html
>
> => The server crashes with a ValueError and the TCP connection is
> reset

An exercise for the reader: provide a fix for the example server so the
request returns a 404 response just like any other nonexistent resource.
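One possible answer to the exercise, as a sketch rather than a vetted fix: subclass the stock handler and answer 404 before any file system call sees the NUL. The names has_nul, NulSafeHandler and serve are invented for illustration, and the port number is simply the one used in the demonstration.

```python
import http.server
import urllib.parse


def has_nul(request_path):
    """True if the percent-decoded request path contains a NUL."""
    return "\0" in urllib.parse.unquote(request_path)


class NulSafeHandler(http.server.SimpleHTTPRequestHandler):
    def send_head(self):
        # send_head() is shared by GET and HEAD, so one check covers both.
        if has_nul(self.path):
            self.send_error(404, "File not found")
            return None
        return super().send_head()


def serve(port=8000):
    """Serve the current directory until interrupted."""
    with http.server.HTTPServer(("", port), NulSafeHandler) as httpd:
        httpd.serve_forever()
```

Calling serve() and requesting /te%00st.html should then produce an ordinary 404 response instead of a traceback.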


Marko

Chris Angelico

Jun 7, 2018, 5:47:23 AM
On Thu, Jun 7, 2018 at 7:29 PM, Marko Rauhamaa <ma...@pacujo.net> wrote:
> This is a security risk. Here is a brief demonstration. Copy the example
> HTTP server from:
>
> <URL: https://docs.python.org/3/library/http.server.html?highlight=h
> ttp#http.server.SimpleHTTPRequestHandler>
>
> Run the server. Try these URLs in your browser:
>
> 1. http://localhost:8000/
>
> => The directory listing is provided
>
> 2. http://localhost:8000/test.html
>
> => A file is served or an HTTP error response (404) is generated
>
> 3. http://localhost:8000/te%00st.html
>
> => The server crashes with a ValueError and the TCP connection is
> reset
>

Actually, I couldn't even get Chrome to make that request, so it
obviously was considered by the browser to be invalid. Doing the
request with curl produced a traceback on the server and an empty
response in the client. (And then the server returns to handling
requests normally.) How is this a security risk, exactly? To be fair,
it's somewhat unideal behaviour - I would prefer to see an HTTP 500
come back if the server crashes - but I can't see that that's a
security problem. Just a QOS issue, wherein you might get a 500 rather
than a 404 for certain requests.

ChrisA

Antoon Pardon

Jun 7, 2018, 6:21:17 AM
On 07-06-18 11:29, Marko Rauhamaa wrote:
> Antoon Pardon <antoon...@vub.be>:
>
>> On 07-06-18 05:55, Steven D'Aprano wrote:
>>> As a Python programmer, how does treating NUL specially make our life
>>> better?
>> By treating possible path values differently from impossible path
>> values.
> There are all kinds of impossibility. The os.stat() reports those
> impossibilities via an OSError exception. It's just that
> os.path.exists() converts the OSError exception into a False return
> value. A ValueError is raised by the Python os.stat() wrapper to
> indicate that it can't even deliver the request to the kernel.
>
> The application programmer doesn't give an iota who determined the
> impossibility of a pathname.

So? The fact that the application programmer doesn't give an iota who
determined the impossibility of a pathname, doesn't imply he is
equally unconcerned about the specific impossibility he ran into.

> Unfortunately, os.path.exists() forces the
> distinction on the application.

No it doesn't. It forces the distinction between two different kinds
of impossibilities, but you don't have to care where they originate
from.

> If I have to be prepared to catch a
> ValueError from os.path.exists(), what added value does os.path.exists()
> give on top of os.stat()? The whole point of os.path.exists() is
>
> 1. To provide an operating-system-independent abstraction.
>
> 2. To provide a boolean interface instead of an exception interface.

Maybe trying to provide such an interface is inherently flawed. Telling
me that a path doesn't exist when the real cause is a permission problem
is IMO not a good idea.

--
Antoon.


Marko Rauhamaa

Jun 7, 2018, 6:47:20 AM
Chris Angelico <ros...@gmail.com>:

> On Thu, Jun 7, 2018 at 7:29 PM, Marko Rauhamaa <ma...@pacujo.net> wrote:
>> 3. http://localhost:8000/te%00st.html
>>
>> => The server crashes with a ValueError and the TCP connection is
>> reset
>>
>
> Actually, I couldn't even get Chrome to make that request, so it
> obviously was considered by the browser to be invalid.

Wow! Why on earth?

> it's somewhat unideal behaviour - I would prefer to see an HTTP 500
> come back if the server crashes - but I can't see that that's a
> security problem. Just a QOS issue, wherein you might get a 500 rather
> than a 404 for certain requests.

It's a demonstration of how this innocent-looking problem can lead to
surprising and even serious consequences.

The given URI is well-formed and should not give any particular trouble
to any HTTP server.


Marko

Chris Angelico

Jun 7, 2018, 8:15:44 AM
On Thu, Jun 7, 2018 at 8:47 PM, Marko Rauhamaa <ma...@pacujo.net> wrote:
> Chris Angelico <ros...@gmail.com>:
>
>> On Thu, Jun 7, 2018 at 7:29 PM, Marko Rauhamaa <ma...@pacujo.net> wrote:
>>> 3. http://localhost:8000/te%00st.html
>>>
>>> => The server crashes with a ValueError and the TCP connection is
>>> reset
>>>
>> it's somewhat unideal behaviour - I would prefer to see an HTTP 500
>> come back if the server crashes - but I can't see that that's a
>> security problem. Just a QOS issue, wherein you might get a 500 rather
>> than a 404 for certain requests.
>
> It's a demonstration of how this innocent-looking problem can lead to
> surprising and even serious consequences.
>
> The given URI is well-formed and should not give any particular trouble
> to any HTTP server.

You haven't demonstrated a security problem. Don't claim security
risks unless you can show there's at least a possibility of that;
otherwise, it's just FUD.

ChrisA

Steven D'Aprano

Jun 7, 2018, 8:15:58 AM
On Thu, 07 Jun 2018 19:47:03 +1000, Chris Angelico wrote:

> To be fair, it's somewhat unideal behaviour - I would prefer to see an
> HTTP 500 come back if the server crashes - but I can't see that that's a
> security problem.

You think that being able to remotely crash a webserver isn't a security
issue?


If Denial Of Service isn't a security issue in your eyes, what would it
take? "Armed men burst into your house and shoot you"?

*only half a wink*

Steven D'Aprano

Jun 7, 2018, 8:21:28 AM
On Thu, 07 Jun 2018 13:47:07 +0300, Marko Rauhamaa wrote:

> Chris Angelico <ros...@gmail.com>:
>
>> On Thu, Jun 7, 2018 at 7:29 PM, Marko Rauhamaa <ma...@pacujo.net>
>> wrote:
>>> 3. http://localhost:8000/te%00st.html
>>>
>>> => The server crashes with a ValueError and the TCP connection is
>>> reset
>>>
>>>
>> Actually, I couldn't even get Chrome to make that request, so it
>> obviously was considered by the browser to be invalid.
>
> Wow! Why on earth?

It works in Firefox, but Apache truncates the URL:


Not Found
The requested URL /te was not found on this server.


instead of te%00st.html

I wonder how many publicly facing web servers can be induced to either
crash, or serve the wrong content, this way?

Chris Angelico

Jun 7, 2018, 8:46:26 AM
On Thu, Jun 7, 2018 at 10:18 PM, Steven D'Aprano
<steve+comp....@pearwood.info> wrote:
> On Thu, 07 Jun 2018 13:47:07 +0300, Marko Rauhamaa wrote:
>
>> Chris Angelico <ros...@gmail.com>:
>>
>>> On Thu, Jun 7, 2018 at 7:29 PM, Marko Rauhamaa <ma...@pacujo.net>
>>> wrote:
>>>> 3. http://localhost:8000/te%00st.html
>>>>
>>>> => The server crashes with a ValueError and the TCP connection is
>>>> reset
>>>>
>>>>
>>> Actually, I couldn't even get Chrome to make that request, so it
>>> obviously was considered by the browser to be invalid.
>>
>> Wow! Why on earth?
>
> It works in Firefox, but Apache truncates the URL:
>
>
> Not Found
> The requested URL /te was not found on this server.
>
>
> instead of te%00st.html
>
> I wonder how many publicly facing web servers can be induced to either
> crash, or serve the wrong content, this way?
>

Define "serve the wrong content". You could get the exact same content
by asking for "te" instead of "te%00st.html"; what you've done is not
significantly different from this:

http://localhost:8000/te?st.html

Is that a security problem too?

ChrisA

Steven D'Aprano

Jun 7, 2018, 8:49:53 AM
On Thu, 07 Jun 2018 10:04:53 +0200, Antoon Pardon wrote:

> On 07-06-18 05:55, Steven D'Aprano wrote:
>> Python strings are rich objects which support the Unicode code point \0
>> in them. The limitation of the Linux kernel that it relies on NULL-
>> terminated byte strings is irrelevant to the question of what
>> os.path.exists ought to do when given a path containing NUL. Other
>> invalid path names return False.
>
> It is not irrelevant. It makes the disctinction clear between possible
> values and impossible values.

That is simply wrong. It is wrong in principle, and it is wrong in
practice, for reasons already covered to death in this thread.

It is *wrong in practice* because other impossible values don't raise
ValueError, they simply return False:

- illegal pathnames under Windows, those containing special
characters like ? > < * etc, simply return False;

- even on Linux, illegal pathnames like "" (the empty string)
return False;

- invalid pathnames with too many path components, or too many
characters in a single component, simply return False;

- the os.path.exists() function is not documented as making
a three-way split between "exists, doesn't exist and invalid";

- and it isn't even true to say that NULL is illegal in pathnames:
there are at least five file systems that allow either NUL bytes:
FAT-8, MFS, HFS, or Unicode \0 code points: HFS Plus and Apple
File System.
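Two of the cases in this list are easy to check from Python. The sketch below assumes a typical system where a 100,000-character name exceeds the path length limit; the dictionary name checks is only for illustration.

```python
import os

# Both paths are invalid rather than merely missing, yet os.path.exists()
# reports them the same way as any nonexistent file.
checks = {
    "empty string": os.path.exists(""),
    "overlong component": os.path.exists("x" * 100_000),
}
```

On such a system both entries come back False, with no exception raised.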

And it is *wrong in principle* because in the most general case, there is
no way to tell which pathnames are valid or invalid without querying an
actual file system. In the case of Linux, any directory could be used as
a mount point.

Is "/mnt/some?file" valid or invalid? If an NTFS file system is mounted
on /mnt, it is invalid; if an ext4 file system is mounted there, it is
valid; if there's nothing mounted there, the question is impossible to
answer.


>> As a Python programmer, how does treating NUL specially make our life
>> better?
>
> By treating possible path values differently from impossible path
> values.

But it doesn't do that. "Pathnames cannot contain NUL" is a falsehood
that programmers wrongly believe about paths. HFS Plus and Apple File
System support NULs in paths.

So what it does is wrongly single out one *POSSIBLE* path value to raise
an exception, while other so-called "impossible" path values simply
return False.

But in the spirit of compromise, okay, let's ignore the existence of file
systems like HFS which allow NUL. Apart from Mac users, who uses them
anyway? Let's pretend that every file system in existence, now and into
the future, will prohibit NULs in paths.

Have you ever actually used this feature? When was the last time you
wrote code like this?

try:
    flag = os.path.exists(pathname)
except ValueError:
    handle_null_in_path()
else:
    if flag:
        handle_file()
    else:
        handle_invalid_path_or_no_such_file()

I want to see actual, real code used in production, not made up code
snippets, that demonstrate that this is a useful distinction to make.

Until such time that somebody shows me an actual real-world use-case for
wanting to make this distinction for NULs and NULs alone, I call bullshit.

Chris Angelico

Jun 7, 2018, 8:51:48 AM
On Thu, Jun 7, 2018 at 10:13 PM, Steven D'Aprano
<steve+comp....@pearwood.info> wrote:
> On Thu, 07 Jun 2018 19:47:03 +1000, Chris Angelico wrote:
>
>> To be fair, it's somewhat unideal behaviour - I would prefer to see an
>> HTTP 500 come back if the server crashes - but I can't see that that's a
>> security problem.
>
> You think that being able to remotely crash a webserver isn't a security
> issue?
>
>
> If Denial Of Service isn't a security issue in your eyes, what would it
> take? "Armed men burst into your house and shoot you"?
>
> *only half a wink*
>

By "crash" I mean that the request handler popped out an exception.
The correct behaviour is to send back a 500 and go back to handling
requests; with the extremely simple server given in that example, it
fails to send back the 500, but it DOES go back to handling requests.
So it's not a DOS. In any real server environment, this wouldn't have
any significant impact; even in this trivially simple server, the only
way you could hurt the server is by spamming enough of these that it
runs out of file handles for sockets or something.

ChrisA

Steven D'Aprano

Jun 7, 2018, 9:12:18 AM
On Thu, 07 Jun 2018 22:46:09 +1000, Chris Angelico wrote:

>> I wonder how many publicly facing web servers can be induced to either
>> crash, or serve the wrong content, this way?
>>
>>
> Define "serve the wrong content". You could get the exact same content
> by asking for "te" instead of "te%00st.html";

Perhaps so, but maybe you can bypass access controls to te and get access
to it even though it is supposed to be private.

This is a real vulnerability, called null-byte injection.

One component of the system sees a piece of input, truncates it at the
NULL, and validates the truncated input; then another component acts on
the untruncated (and unvalidated) input.
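The mismatch described above can be sketched in a few lines. Everything here is hypothetical: the allow-list, the helper names and the payload are invented to illustrate the pattern, not taken from any real server.

```python
ALLOWED = {"public"}


def c_string_view(text):
    """Mimic a C-style consumer that stops reading at the first NUL."""
    return text.split("\0", 1)[0]


def naive_is_allowed(name):
    # Flawed: validates only what a NUL-terminated reader would see.
    return c_string_view(name) in ALLOWED


def strict_is_allowed(name):
    # Safer: reject NUL outright, then validate the whole string.
    return "\0" not in name and name in ALLOWED


payload = "public\0/../private"
```

The naive check approves the payload because it only sees "public", while any component that later uses the full string is acting on input nobody validated.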

https://resources.infosecinstitute.com/null-byte-injection-php/

https://capec.mitre.org/data/definitions/52.html

Null-byte injection attacks have led to remote attackers executing
arbitrary code. That's unlikely in this scenario, but given that most web
servers are written in C, not Python, it is conceivable that they could
do anything under a null-byte injection attack.

Does the Python web server suffer from that vulnerability? I would be
surprised if it did. But it can be induced to crash (an exception, not a
seg fault) which is certainly a vulnerability.

Since people are unlikely to use this web server to serve mission
critical public services over the internet, the severity is likely low.
Nevertheless, it is still a real vulnerability.

Chris Angelico

Jun 7, 2018, 9:33:03 AM
On Thu, Jun 7, 2018 at 11:09 PM, Steven D'Aprano
<steve+comp....@pearwood.info> wrote:
> On Thu, 07 Jun 2018 22:46:09 +1000, Chris Angelico wrote:
>
>>> I wonder how many publicly facing web servers can be induced to either
>>> crash, or serve the wrong content, this way?
>>>
>>>
>> Define "serve the wrong content". You could get the exact same content
>> by asking for "te" instead of "te%00st.html";
>
> Perhaps so, but maybe you can bypass access controls to te and get access
> to it even though it is supposed to be private.
>
> This is a real vulnerability, called null-byte injection.
>
> One component of the system sees a piece of input, truncates it at the
> NULL, and validates the truncated input; then another component acts on
> the untruncated (and unvalidated) input.
>
> https://resources.infosecinstitute.com/null-byte-injection-php/
>
> https://capec.mitre.org/data/definitions/52.html
>
> Null-byte injection attacks have lead to remote attackers executing
> arbitrary code. That's unlikely in this scenario, but given that most web
> servers are written in C, not Python, it is conceivable that they could
> do anything under a null-byte injection attack.

Fair point. So you should just truncate early and have done with it. Easy.

> Does the Python web server suffer from that vulnerability? I would be
> surprised if it were. But it can be induced to crash (an exception, not a
> seg fault) which is certainly a vulnerability.

"Certainly"? I'm dubious on that. This isn't C, where a segfault
usually comes after executing duff memory, and therefore it's
plausible to transform a segfault into a remote code execution
exploit. This is Python, where we have EXCEPTION handling. Tell me, is
this a vulnerability?

@app.route("/foo")
def foo():
    return "Kaboom", 500

What about this?

@app.route("/bar")
def bar():
    1/0
    return "Won't get here"

Put those into a Flask app and see what they do. One of them will
explicitly return a 500. The other will crash... and will return a
500. Is either of those a security problem? Now let's suppose a more
realistic version of the latter:

@app.route("/paginate/<int:size>")
def paginate(size):
    total_pages = total_data/size
    ...

Yes, it's a bug. If someone tries a page size of zero, it'll divide by
zero and bomb. Great. But how is it a vulnerability? It is a
properly-handled exception.

It's slightly different with SimpleHTTPServer, as it fails to properly
send back the 500. That would be a bug IMO. Even then, though, all you
can do is clog the server with unfinished requests - and you can do
that much more easily by just connecting and being really slow to send
data. (And I doubt that people are using SimpleHTTPServer in
security-sensitive contexts anyway.)

ChrisA

Antoon Pardon

Jun 7, 2018, 9:43:50 AM
On 07-06-18 14:47, Steven D'Aprano wrote:
> On Thu, 07 Jun 2018 10:04:53 +0200, Antoon Pardon wrote:
>
>> On 07-06-18 05:55, Steven D'Aprano wrote:
>>> Python strings are rich objects which support the Unicode code point \0
>>> in them. The limitation of the Linux kernel that it relies on NULL-
>>> terminated byte strings is irrelevant to the question of what
>>> os.path.exists ought to do when given a path containing NUL. Other
>>> invalid path names return False.
>> It is not irrelevant. It makes the disctinction clear between possible
>> values and impossible values.
> That is simply wrong. It is wrong in principle, and it is wrong in
> practice, for reasons already covered to death in this thread.
>
> It is *wrong in practice* because other impossible values don't raise
> ValueError, they simply return False:
>
> - illegal pathnames under Windows, those containing special
> characters like ? > < * etc, simply return False;
>
> - even on Linux, illegal pathnames like "" (the empty string)
> return False;
>
> - invalid pathnames with too many path components, or too many
> characters in a single component, simply return False;
>
> - the os.path.exists() function is not documented as making
> a three-way split between "exists, doesn't exist and invalid";

So? Maybe we should reconsider the above behaviour?

>
> - and it isn't even true to say that NULL is illegal in pathnames:
> there are at least five file systems that allow either NUL bytes:
> FAT-8, MFS, HFS, or Unicode \0 code points: HFS Plus and Apple
> File System.

That doesn't matter much. sqrt(-1) gives a ValueError, while there
are number domains for which it has a value.
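The sqrt analogy can be shown with the standard library, as a small sketch. The helper name real_sqrt_or_none is made up for illustration.

```python
import cmath
import math


def real_sqrt_or_none(x):
    """Square root over the reals, or None where math.sqrt() refuses."""
    try:
        return math.sqrt(x)
    except ValueError:  # "math domain error" for negative input
        return None


complex_root = cmath.sqrt(-1)  # the same input has a value in the complex domain
```

So the real-valued function raises ValueError for -1 even though another domain gives the question an answer.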


> And it is *wrong in principle* because in the most general case, there is
> no way to tell which pathnames are valid or invalid without querying an
> actual file system. In the case of Linux, any directory could be used as
> a mount point.

I don't see how your first statement follows from that explanation. I don't
have a problem with needing to query the actual file system in order to find
out which pathnames are valid or invalid.

> Have you ever actually used this feature? When was the last time you?

This is irrelevant. You are now trying to argue the uselessness. The fact that
after consideration something turns out not very useful, is not a reason
to conclude that the factors that were taken into consideration were irrelevant.

Personally I don't use os.path.exists because it tries to shoehorn too many
possibilities into a boolean result. Do you think os.stat("\0") should
raise FileNotFoundError?

--
Antoon.


Tim Chase

Jun 7, 2018, 9:57:56 AM
On 2018-06-07 22:46, Chris Angelico wrote:
> On Thu, Jun 7, 2018 at 10:18 PM, Steven D'Aprano
> >>>> 3. http://localhost:8000/te%00st.html
> >>> Actually, I couldn't even get Chrome to make that request, so it
> >>> obviously was considered by the browser to be invalid.

It doesn't matter whether Chrome or Firefox can make the request if
it can be made by opening the socket yourself with something as
simple as

$ telnet example.com 80
GET /te%00st.html HTTP/1.1
Host: example.com

If that crashes the server, it's a problem, even if browsers try to
prevent it from happening by accident.
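The telnet session can also be scripted, which makes it easy to probe a test server of your own. This is a sketch under assumptions: build_request and send_raw are invented names, and the "Connection: close" header is an addition so the server hangs up after replying.

```python
import socket


def build_request(host, path):
    """Build the raw bytes of a minimal HTTP/1.1 GET request."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Connection: close",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def send_raw(host, path, port=80, timeout=5.0):
    """Send the request and return the first chunk of the reply (may be empty)."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_request(host, path))
        return sock.recv(4096)
```

Something like send_raw("localhost", "/te%00st.html", port=8000) would then exercise the example server without any browser in the way.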

>> It works in Firefox, but Apache truncates the URL:
>>
>> Not Found
>> The requested URL /te was not found on this server.
>>
>> instead of te%00st.html

This is a sensible result, left up to each server to decide what to
do.

>> I wonder how many publicly facing web servers can be induced to
>> either crash, or serve the wrong content, this way?

I'm sure there are plenty. I mean, I discovered this a while back

https://mail.python.org/pipermail/python-list/2016-August/713373.html

and that's Microsoft running their own stack. They seem to have
fixed that issue at that particular set of URLs, but a little probing
has turned it up elsewhere at microsoft.com since (for the record,
the first set of non-existent URLs return 404-not-found errors while
the second set of reserved filename URLs return
500-Server-Internal-Error pages). Filename processing is full of
sharp edge-cases.

> Define "serve the wrong content". You could get the exact same
> content by asking for "te" instead of "te%00st.html"; what you've
> done is not significantly different from this:
>
> http://localhost:8000/te?st.html
>
> Is that a security problem too?

Depending on the server, it might allow injection for something like

http://example.com/page%00cat+/etc/passwd

Or it might allow the request to be processed in an attack, but leave
the log files without the details:

GET /innocent%00malicious_payload
(where only the "/innocent" gets logged)

Or false data could get injected in log files

http://example.com/innocent%00%0a23.200.89.180+-+-+%5b07/Jun/2018%3a13%3a55%3a36+-0700%5d+%22GET+/nasty_porn.mov+HTTP/1.0%22+200+2326

(`host whitehouse.gov` = 23.200.89.180)

It all depends on the server and how the request is handled.
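For the log-injection case, a sketch of what percent-decoding produces and one way to neutralise it before logging. The helper names and the sample path are invented; a real server may decode and log at different points.

```python
import urllib.parse


def decoded_path(raw_path):
    """Percent-decode a request path the way a handler typically would."""
    return urllib.parse.unquote(raw_path)


def safe_for_log(text):
    """Escape control characters so one request stays on one log line."""
    return text.encode("unicode_escape").decode("ascii")


raw = "/innocent%00%0aFAKE+LOG+LINE"
decoded = decoded_path(raw)      # now contains a real NUL and a real newline
logged = safe_for_log(decoded)   # NUL and newline appear as \x00 and \n escapes
```

Escaping at the logging boundary keeps the attacker's newline from starting a forged log entry, whatever the rest of the server does with the path.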

-tkc




MRAB

Jun 7, 2018, 1:11:12 PM
On 2018-06-07 08:45, Chris Angelico wrote:
> On Thu, Jun 7, 2018 at 1:55 PM, Steven D'Aprano
> <steve+comp....@pearwood.info> wrote:
>> On Tue, 05 Jun 2018 23:27:16 +1000, Chris Angelico wrote:
>>
>>> And an ASCIIZ string cannot contain a byte value of zero. The parallel
>>> is exact.
>>
>> Why should we, as Python programmers, care one whit about ASCIIZ strings?
>> They're not relevant. You might as well say that file names cannot
>> contain the character "π" because ASCIIZ strings don't support it.
>>
>> No they don't, and yet nevertheless file names can and do contain
>> characters outside of the ASCIIZ range.
>
> Under Linux, a file name contains bytes, most commonly representing
> UTF-8 sequences. So... an ASCIIZ string *can* contain that character,
> or at least a representation of it. Yet it cannot contain "\0".
>
I've seen a variation of UTF-8 that encodes U+0000 as 2 bytes so that a
zero byte can be used as a terminator.

It's therefore not impossible to have a version of Linux that allowed a
(Unicode) "\0" in a filename.
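A sketch of that variation, covering only the NUL handling. The function names are made up, and Python's own "utf-8" codec is used as the strict reference decoder.

```python
def encode_nul_as_two_bytes(text):
    """UTF-8 encode text, but write U+0000 as the overlong pair C0 80."""
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")


def strict_decoder_accepts(raw):
    """True if Python's standard UTF-8 decoder accepts the bytes."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


encoded = encode_nul_as_two_bytes("sp\0am")  # no zero byte left in the result
```

The result is safe to NUL-terminate, but a strict UTF-8 decoder rejects it, so both sides have to agree on the variant.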

Chris Angelico

Jun 7, 2018, 1:53:07 PM
Considering that Linux treats filenames as raw bytes, that's not
surprising. The mangled encoding you refer to is a horrendous cheat,
though, and violates several of the design principles of UTF-8, so I
do not recommend it EVER. The correct way for Python to handle and
represent such a file name would be to use the U+DCxx range to carry
the bytes through unchanged - not using "\0".
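The U+DCxx mechanism is the "surrogateescape" error handler, which os.fsdecode() uses on POSIX systems. A short sketch with the codec called explicitly; the sample file name is invented.

```python
raw_name = b"caf\xe9"  # Latin-1 bytes, not valid UTF-8

# The undecodable byte 0xE9 is carried as the lone surrogate U+DCE9 ...
as_text = raw_name.decode("utf-8", errors="surrogateescape")

# ... and encoding with the same handler restores the original bytes.
round_trip = as_text.encode("utf-8", errors="surrogateescape")
```

The name survives the trip through str unchanged, and no "\0" is involved at any point.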

ChrisA

Steven D'Aprano

Jun 7, 2018, 9:20:38 PM
On Thu, 07 Jun 2018 15:38:39 -0400, Dennis Lee Bieber wrote:

> On Fri, 1 Jun 2018 23:16:32 +0000 (UTC), Steven D'Aprano
> <steve+comp....@pearwood.info> declaimed the following:
>
>>It should either return False, or raise TypeError. Of the two, since
>>3.14159 cannot represent a file on any known OS, TypeError would be more
>>appropriate.
>>
> I wouldn't be so sure of that...

I would.

There is no existing file system which uses floats instead of byte- or
character-strings for file names. If you believe differently, please name
the file system.


> Xerox CP/V allowed for embedding
> non-printable characters into file names

Just like most modern file systems.

Even FAT-16 supports a range of non-ASCII bytes with the high-bit set
(although not the control codes with the high-bit cleared). Unix file
systems typically support any byte except \0 and /. Most modern file
systems outside of Unix support any Unicode character (or almost any)
including ASCII control characters.

https://en.wikipedia.org/wiki/Comparison_of_file_systems#Limits



[...]
> With some work, one could probably generate a file name containing the
> bytes used for storing a floating point value.

Any collection of bytes can be interpreted as any thing we like.
(Possibly requiring padding or splitting to fit fixed-width data
structures.) Sounds. Bitmaps. Coordinates in three-dimensional space.
Floating point numbers are no challenge. A Python float is represented by
an eight-byte C double. Provided we agree on a convention for splitting
byte strings into eight-byte chunks, adding padding, and agree on big- or
little-endianness, it is trivial to convert file names to one or more
floats:

/etc is equivalent to 2.2617901550715974e-80

(big endian, padding added to the right)
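The conversion can be reproduced with the struct module. This sketch assumes the padding consists of zero bytes, which the post does not state; name_to_float is a made-up helper.

```python
import struct


def name_to_float(name: bytes) -> float:
    """Interpret up to eight bytes of a file name as a big-endian double."""
    padded = name.ljust(8, b"\0")[:8]  # assumption: pad on the right with zero bytes
    return struct.unpack(">d", padded)[0]


value = name_to_float(b"/etc")
```

Under that padding assumption the result is a tiny positive number on the order of 1e-80, in line with the figure quoted above.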

But just because I can do that conversion, doesn't mean that the file
system uses floats for file names.

Steven D'Aprano

Jun 7, 2018, 10:22:29 PM
On Thu, 07 Jun 2018 23:25:54 +1000, Chris Angelico wrote:

[...]
>> Does the Python web server suffer from that vulnerability? I would be
>> surprised if it were. But it can be induced to crash (an exception, not
>> a seg fault) which is certainly a vulnerability.
>
> "Certainly"? I'm dubious on that. This isn't C, where a segfault usually
> comes after executing duff memory, and therefore it's plausible to
> transform a segfault into a remote code execution exploit.

I just said that I would be surprised if you could get remote code
execution from the Python web server, for exactly the reason you state:
it's an exception, not a segfault.

Stop agreeing with me when we're trying to have an argument! *wink*


[...]
> Yes, it's a bug. If someone tries a page size of zero, it'll divide by
> zero and bomb. Great. But how is it a vulnerability? It is a
> properly-handled exception.

Causing a denial of service is a vulnerability.

Security vulnerabilities are not just about remote code execution. Can
remote attackers bring your service down? If so, you are vulnerable to
having remote attackers bring your service down.

Can remote attackers overwhelm your server with so many errors that they
fill your disks with error logs and either stop logging, or crash? Then
you are vulnerable to having remote attackers crash your server, or hide
their tracks by preventing logging.

Can remote attackers induce your server to serve files it shouldn't? Then
you are vulnerable to attacks that leak sensitive or private information.

There's far more to security vulnerabilities than just "oh well, they
can't get a shell or execute code on my server, so it's all cool" *wink*


In this specific case:

> It's slightly different with SimpleHTTPServer, as it fails to properly
> send back the 500. That would be a bug IMO.

There seems to be some weird interaction occurring on my system between
the SimpleHTTPServer, Firefox, and my web proxy, so I may have
misinterpreted the precise nature of the crash. What I initially saw was
that although the SimpleHTTPServer remained running, it stopped responding
to requests and Firefox would repeatedly respond:

Firefox can't find the server at www.localhost.com

even though the process was still running. But when I tried with a
different browser (links), I don't get that same behaviour. links is
using the web proxy, Firefox isn't, but I'm not quite sure why that makes
a difference.

> Even then, though, all you
> can do is clog the server with unfinished requests - and you can do that
> much more easily by just connecting and being really slow to send data.
> (And I doubt that people are using SimpleHTTPServer in
> security-sensitive contexts anyway.)

Again, you're just repeating what I said in different words. I already
said that *this specific* issue is probably low severity, because people
are unlikely to use SimpleHTTPServer for mission critical services
exposed to the internet.

Steven D'Aprano

Jun 7, 2018, 10:27:32 PM
On Thu, 07 Jun 2018 17:45:06 +1000, Chris Angelico wrote:

> On Thu, Jun 7, 2018 at 1:55 PM, Steven D'Aprano
> <steve+comp....@pearwood.info> wrote:
>> On Tue, 05 Jun 2018 23:27:16 +1000, Chris Angelico wrote:
>>
>>> And an ASCIIZ string cannot contain a byte value of zero. The parallel
>>> is exact.
>>
>> Why should we, as Python programmers, care one whit about ASCIIZ
>> strings? They're not relevant. You might as well say that file names
>> cannot contain the character "π" because ASCIIZ strings don't support
>> it.
>>
>> No they don't, and yet nevertheless file names can and do contain
>> characters outside of the ASCIIZ range.
>
> Under Linux, a file name contains bytes, most commonly representing
> UTF-8 sequences.

The fact that user-space applications like the shell and GUI file
managers sometimes treat file names as UTF-8 Unicode is not really
relevant to what the file system allows. The most common Linux file
systems are fundamentally bytes, not Unicode characters, and while I'm
willing to agree to call the byte 0x41 "A", there simply is no such byte
that means "π" or U+10902 PHOENICIAN LETTER GAML.

File names under typical Linux file systems are not necessarily valid
UTF-8 Unicode. That's why Python still provides a bytes-interface as well
as a text interface.
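
A minimal sketch of the point about invalid UTF-8, using a made-up file name. It assumes UTF-8 as the nominal encoding and uses the "surrogateescape" error handler, which is what os.fsdecode() and os.fsencode() apply on POSIX systems so that arbitrary bytes survive a trip through str.

```python
raw = b"report-\xff.txt"  # hypothetical name; byte 0xFF never occurs in valid UTF-8

# A strict decode fails, so this name is not valid UTF-8 text.
try:
    raw.decode("utf-8")
    strictly_valid = True
except UnicodeDecodeError:
    strictly_valid = False

# surrogateescape smuggles the bad byte through as a lone surrogate code point.
as_text = raw.decode("utf-8", "surrogateescape")
round_tripped = as_text.encode("utf-8", "surrogateescape")

print(strictly_valid, ascii(as_text), round_tripped == raw)
```

The text form is not something you would want to display, but it converts back to exactly the original bytes.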


> So... an ASCIIZ string *can* contain that character, or
> at least a representation of it. Yet it cannot contain "\0".

You keep saying that as if it made one whit of difference to what
os.path.exists should do. I completely agree that ASCIIZ strings cannot
contain NUL bytes. What does that have to do with os.path.exists()?

NTFS file systems use UTF-16 encoded strings. For typical mostly-ASCII
pathnames, the bytes on disk are *full* of NUL bytes. If the
implementation detail that ASCIIZ strings cannot contain NUL is important
to you, it should be equally important that UTF-16 strings typically have
many NULs.

They're actually both equally implementation details and utterly
irrelevant to the behaviour of os.path.exists.
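
To make the UTF-16 remark concrete, here is a small sketch that encodes a made-up, ASCII-only Windows-style path as little-endian UTF-16 and counts the zero bytes. It says nothing about how any particular file system lays out its on-disk structures; it only shows what the encoding itself does.

```python
path = "C:\\Users\\guido\\notes.txt"  # hypothetical, ASCII-only path

encoded = path.encode("utf-16-le")
nul_count = encoded.count(0)  # number of zero bytes in the encoded form

# Every ASCII character becomes two bytes, one of which is zero.
print(len(path), len(encoded), nul_count)
```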

Chris Angelico

Jun 7, 2018, 10:42:28 PM
On Fri, Jun 8, 2018 at 12:16 PM, Steven D'Aprano
<steve+comp....@pearwood.info> wrote:
> On Thu, 07 Jun 2018 23:25:54 +1000, Chris Angelico wrote:
>> Yes, it's a bug. If someone tries a page size of zero, it'll divide by
>> zero and bomb. Great. But how is it a vulnerability? It is a
>> properly-handled exception.
>
> Causing a denial of service is a vulnerability.

Yes, but remember, anyone can build a botnet and send large numbers of
entirely legitimate requests to your server. Since no server has
infinite capacity, a DOS is inherently unavoidable. So to call
something a "DOS vulnerability", you have to show that it makes you
*more vulnerable* than simply getting overloaded with requests. For
example:

1) If the kernel allocates resources for half-open socket connections,
a malicious client can SYN-flood the server, causing massive resource
usage from relatively few packets.

2) If the language can be induced to build a hashtable using values
that all have the same hash, the CPU load required for the O(n²)
operations can easily exceed the cost of making the requests.

3) If the app inefficiently performs many database transactions for a
simple request, a plausible number of such requests could slow the
database to a crawl.

4) If a small request results in an inordinately large response, the
server's outgoing bandwidth can be saturated by a small number of
requests.
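
Item 2 in the list above can be illustrated with a toy key class whose hash is deliberately constant. This is a sketch of the effect only, not a reproduction of any real attack; the class and function names are made up, and real string keys in CPython are protected by hash randomization by default.

```python
class Colliding:
    """Key whose hash is always the same, so every insert collides."""
    eq_calls = 0

    def __init__(self, value):
        self.value = value

    def __hash__(self):
        return 0

    def __eq__(self, other):
        Colliding.eq_calls += 1
        return self.value == other.value


def comparisons_for(n):
    """Count the __eq__ calls needed to build a dict of n colliding keys."""
    Colliding.eq_calls = 0
    table = {Colliding(i): i for i in range(n)}
    assert len(table) == n
    return Colliding.eq_calls


small = comparisons_for(100)
large = comparisons_for(200)
print(small, large)  # doubling n should roughly quadruple the comparisons
```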

Where in this is a simple HTTP 500 from the os.stat() call worse than
a legitimate request for an actual page?

The response is small (far smaller than many legit files - consider a
web app with a large JavaScript bundle, easily multiple megabytes). It
required zero disk operations, so it's as fast as returning a file
from cache. The only way it's more expensive is the actual exception
handling code itself, and if you reckon someone can DOS a server via
the cost of throwing and catching exceptions, I'm going to have to ask
for some serious measurements.
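
For anyone who wants to take such measurements, a rough sketch with timeit follows. It compares a successful os.stat() on the current directory with a call that raises and catches ValueError. The numbers depend entirely on the machine and Python version, so no result is claimed here.

```python
import os
import timeit

def stat_ok():
    # Successful call: a real system call on an existing directory.
    os.stat(".")

def stat_null():
    # Rejected before reaching the OS; costs one raise plus one catch.
    try:
        os.stat("\0")
    except ValueError:
        pass

n = 10000
t_ok = timeit.timeit(stat_ok, number=n)
t_null = timeit.timeit(stat_null, number=n)
print(f"stat('.'):  {t_ok / n * 1e6:.2f} us per call")
print(f"stat('\\0'): {t_null / n * 1e6:.2f} us per call")
```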

Apart from the one odd bug with SimpleHTTPServer not properly sending
back 500s, I very much doubt that the original concern - namely that
os.path.exists() and os.stat() raise ValueError if there's a %00 in
the URL - can be abused effectively.
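
For code that takes path names from an untrusted source and would rather get False than an exception, a defensive wrapper is one option. This is a sketch under that assumption, with a made-up function name. Note that from Python 3.8 onwards os.path.exists() itself returns False for such names, while os.stat() still raises.

```python
import os

def path_exists(path):
    """Like os.path.exists, but treat unrepresentable names as 'does not exist'."""
    try:
        os.stat(path)
    except (OSError, ValueError):
        # OSError: missing file, permission problem, and so on.
        # ValueError: the name cannot be passed to the OS (embedded NUL).
        return False
    return True

print(path_exists(os.curdir), path_exists("\0"))
```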

ChrisA

Richard Damon

Jun 7, 2018, 10:57:09 PM
On 6/7/18 9:17 PM, Steven D'Aprano wrote:
> On Thu, 07 Jun 2018 15:38:39 -0400, Dennis Lee Bieber wrote:
>
>> On Fri, 1 Jun 2018 23:16:32 +0000 (UTC), Steven D'Aprano
>> <steve+comp....@pearwood.info> declaimed the following:
>>
>>> It should either return False, or raise TypeError. Of the two, since
>>> 3.14159 cannot represent a file on any known OS, TypeError would be more
>>> appropriate.
>>>
>> I wouldn't be so sure of that...
> I would.
>
> There is no existing file system which uses floats instead of byte- or
> character-strings for file names. If you believe different, please name
> the file
>
>
>> Xerox CP/V allowed for embedding
>> non-printable characters into file names
> Just like most modern file systems.
>
> Even FAT-16 supports a range of non-ASCII bytes with the high-bit set
> (although not the control codes with the high-bit cleared). Unix file
> systems typically support any byte except \0 and /. Most modern file
> systems outside of Unix support any Unicode character (or almost any)
> including ASCII control characters.
>
> https://en.wikipedia.org/wiki/Comparison_of_file_systems#Limits
>
>
>
This does bring up an interesting point. Since the Unix file system
really has file names that are collections of bytes instead of really
being strings, and the Python API to it wants to treat them as strings,
then we have an issue that we are going to be stuck with problems with
filenames. If we assume they are utf-8 encoded, then there exist
filenames that will trap with invalid encodings (if for example the
name were generated on a system that was using Latin-1 as an 8 bit
character set for file names). On the other hand, if we treat the file
names as 8 bit characters by themselves, if the system was using utf-8
then we are mangling any characters outside the basic ASCII set.
Basically we hit the old problem of confusing bytes and strings.
Ultimately we have a fundamental limitation with trying to abstract out
the format of filenames in the API, and we need a back door to allow us
to define what encoding to use for filenames (and be able to detect that
it doesn't work for a given file, and change it on the fly to try
again), or we need an alternate API that lets us pass raw bytes as file
names and the program needs to know how to handle the raw filename for
that particular file system.
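
Both failure modes described above, and the "try again with another encoding" idea, can be sketched in a few lines. The helper decode_filename and its default list of encodings are assumptions for illustration only. Python's os functions do already accept bytes paths (for example os.listdir(b".") returns bytes names), which is the raw-bytes alternative mentioned.

```python
def decode_filename(raw, encodings=("utf-8", "latin-1")):
    """Try each encoding in turn; return (text, encoding_used)."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Nothing fit: keep the name readable by escaping the odd bytes.
    return raw.decode("ascii", "backslashreplace"), None

latin1_name = b"caf\xe9"             # "café" as written by a Latin-1 system
utf8_name = "café".encode("utf-8")   # the same name from a UTF-8 system

print(decode_filename(latin1_name))  # strict UTF-8 fails, Latin-1 is used
print(decode_filename(utf8_name))    # UTF-8 succeeds on the first try

# Treating UTF-8 bytes as 8-bit characters mangles the non-ASCII part.
mangled = utf8_name.decode("latin-1")
print(mangled)
```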

--
Richard Damon
