Path views, which are views of file system paths


Niall Douglas

Apr 19, 2018, 6:05:19 PM
to ISO C++ Standard - Future Proposals
Beginning to get towards the end of the papers I intend to propose at Rapperswil, thank god.

This proposes a std::filesystem::path_view. It has an orthogonal design and intent to std::filesystem::path, and thus will be controversial.

Also I may have gone into too much detail about my best understanding of the history of how std::filesystem::path came to have its present design. That recounted history may also be inaccurate, or it may offend some people.

Feedback is welcome.

Niall

DDDDDR0 std filesystem path_view draft 1.pdf

Bengt Gustafsson

Apr 19, 2018, 7:09:07 PM
to ISO C++ Standard - Future Proposals
It is unclear to me what happens if a path_view is constructed from a wstring or wchar_t* in a Windows Unicode build. I can see two possibilities: one is that the incoming string is converted to UTF-8 and copied to an internal buffer, and the other is that the path_view has a flag which indicates whether it actually points at narrow or wide characters, and then the c_str struct "does the right thing" for the platform. Your text seems to stress that it is _always_ UTF-8, which seems to indicate the first possibility... but I hope you intend this to be implemented with a flag, as otherwise ALL usage in Windows Unicode builds would do conversions, and if you use an L"filename" literal, two conversions: to UTF-8 and back again.

I have two specific concerns about the owned UTF-8 buffer alternative: 1) It is hard to know how many bytes to allocate for a UTF-16 source string. 2) It seems that if the source owns its buffer, any slices will have to make deep copies of the buffer, as it is unrealistic to think that programmers would understand that the first path_view has to outlive the last remaining slice made from it, not only the original std::wstring or std::filesystem::path you have stored.


On Friday, April 20, 2018 at 00:05:19 UTC+2, Niall Douglas wrote:
> Beginning to get towards the end of the papers I intend to propose at Rapperswil, thank god.
>
> This proposes a std::filesystem::path_view. It has an orthogonal design and intent to std::filesystem::path, and thus will be controversial.

To me it seems to play quite nicely with path, and having both an object and a view of it is becoming quite prevalent now.

Nicol Bolas

Apr 19, 2018, 8:37:55 PM
to ISO C++ Standard - Future Proposals
First, I would say that the proposed interface is not well explained by the proposal. The key aspect of this design is that `path_view` pretends that it stores a UTF-8 encoded string (its iterators return `char`), but it actually stores a pointer+size of the type you specifically provide.

That's never stated directly; it's something you kind of have to infer from things you talk about. But it is the foundation of your design, so it needs to be something you come out and say.

----

> Enter thus the proposed std::filesystem::path_view, which is to std::filesystem::path as
> std::string_view is to std::string. It provides most of the same member functions as
> std::filesystem::path, operating by constant reference upon some character source which is in
> the format of the local platform’s file system path.

This statement is of... dubious accuracy, relative to the actual design. `basic_string` and `basic_string_view` are very tightly related, where functions and typedefs that exist in one do identical things in the other.

By contrast, there are numerous points of divergence between `path` and `path_view`. Indeed, not 3 paragraphs after the above statement, we get our first substantial point of divergence: "One thing which is perhaps surprising is that the value type is always a char, not a std::filesystem::path::value_type."

Even if you agree with the design rationale for this choice, that doesn't change the fact that, by making this choice, you have something which is no longer like "std::string_view is to std::string". It is a distinct type that, while having similarities with `path`, ultimately has its own distinct behavior.

And therefore, it should not have a name that suggests that it is merely a "view" into a `path` when it is rather more complex than that.

At the very least, there's no point in providing `value_type` and its ilk. It is always the same type on every implementation (`path::value_type` exists because it's implementation-defined), so there's not much point in it. And the iterators don't return `value_type`s; they return `path_view`s. After all, all of the `path` interfaces that interact with `value_type` don't exist in `path_view`.

----

> 4.2 Why interpret chars as UTF-8 when std::filesystem::path interprets chars as ‘the native narrow encoding’?

P0482, which I believe was forwarded to CWG at the previous meeting, allows us to distinguish between narrow characters and genuine UTF-8. So there's no reason to assume narrow strings are UTF-8. Indeed, putting such assumptions into the standard library encourages laziness by programmers. It encourages them to skip the `u8` prefix, thinking it optional when it is actually really important.

At the very least, its `value_type` (if you keep that around) should be `char8_t`, not `char`.

Niall Douglas

Apr 20, 2018, 4:19:29 AM
to ISO C++ Standard - Future Proposals
On Friday, April 20, 2018 at 12:09:07 AM UTC+1, Bengt Gustafsson wrote:
> It is unclear to me what happens if a path_view is constructed from a wstring or wchar_t* in a Windows Unicode build. I can see two possibilities: one is that the incoming string is converted to UTF-8 and copied to an internal buffer, and the other is that the path_view has a flag which indicates whether it actually points at narrow or wide characters, and then the c_str struct "does the right thing" for the platform. Your text seems to stress that it is _always_ UTF-8, which seems to indicate the first possibility... but I hope you intend this to be implemented with a flag, as otherwise ALL usage in Windows Unicode builds would do conversions, and if you use an L"filename" literal, two conversions: to UTF-8 and back again.

Oh ok. I am surprised that you read it this way. I thought it very clear that we always pass through unchanged inputs of the native encoding, so on Windows, wchar_t input never causes reencoding.

Thanks for this feedback though. I'll need to do a round of removing ambiguity.
 

> I have two specific concerns about the owned UTF-8 buffer alternative: 1) It is hard to know how many bytes to allocate for a UTF-16 source string.

As the paper explains, we throw 64 KB at the problem, but only on Windows. 64 KB is the maximum a path can ever be on Windows.
 
> 2) It seems that if the source owns its buffer, any slices will have to make deep copies of the buffer, as it is unrealistic to think that programmers would understand that the first path_view has to outlive the last remaining slice made from it, not only the original std::wstring or std::filesystem::path you have stored.

Path views are no different to string views. They don't own their storage, so you need to keep that around.

Niall

Niall Douglas

Apr 20, 2018, 4:42:03 AM
to ISO C++ Standard - Future Proposals
On Friday, April 20, 2018 at 1:37:55 AM UTC+1, Nicol Bolas wrote:
> First, I would say that the proposed interface is not well explained by the proposal. The key aspect of this design is that `path_view` pretends that it stores a UTF-8 encoded string (its iterators return `char`), but it actually stores a pointer+size of the type you specifically provide.
>
> That's never stated directly; it's something you kind of have to infer from things you talk about. But it is the foundation of your design, so it needs to be something you come out and say.

Oh ok. Thanks.
 

>> Enter thus the proposed std::filesystem::path_view, which is to std::filesystem::path as
>> std::string_view is to std::string. It provides most of the same member functions as
>> std::filesystem::path, operating by constant reference upon some character source which is in
>> the format of the local platform’s file system path.
>
> This statement is of... dubious accuracy, relative to the actual design. `basic_string` and `basic_string_view` are very tightly related, where functions and typedefs that exist in one do identical things in the other.
>
> By contrast, there are numerous points of divergence between `path` and `path_view`. Indeed, not 3 paragraphs after the above statement, we get our first substantial point of divergence: "One thing which is perhaps surprising is that the value type is always a char, not a std::filesystem::path::value_type."
>
> Even if you agree with the design rationale for this choice, that doesn't change the fact that, by making this choice, you have something which is no longer like "std::string_view is to std::string". It is a distinct type that, while having similarities with `path`, ultimately has its own distinct behavior.

Ok.
 

> And therefore, it should not have a name that suggests that it is merely a "view" into a `path` when it is rather more complex than that.

It would depend on how tightly one understands "view". I'd call it any const reference to contiguous data not owned by the object. And that it is. It is the c_str child class and the comparison functions which do any just-in-time reencoding.
 

> At the very least, there's no point in providing `value_type` and its ilk. It is always the same type on every implementation (`path::value_type` exists because it's implementation-defined), so there's not much point in it. And the iterators don't return `value_type`s; they return `path_view`s. After all, all of the `path` interfaces that interact with `value_type` don't exist in `path_view`.

Very good point. I hadn't thought of that.
 

>> 4.2 Why interpret chars as UTF-8 when std::filesystem::path interprets chars as ‘the native narrow encoding’?
>
> P0482, which I believe was forwarded to CWG at the previous meeting, allows us to distinguish between narrow characters and genuine UTF-8. So there's no reason to assume narrow strings are UTF-8. Indeed, putting such assumptions into the standard library encourages laziness by programmers. It encourages them to skip the `u8` prefix, thinking it optional when it is actually really important.
>
> At the very least, its `value_type` (if you keep that around) should be `char8_t`, not `char`.

We must be very careful here.

Filesystem paths, despite what a lot of people think, are treated by almost all filesystems as a bunch of bits. As in, comparisons are done via memcmp(), and you can send the almost unfiltered binary output from a random number generator as a filename and it'll work perfectly. Layers above the filesystem might do "case insensitive" fallback comparisons, where "case insensitive" may be not at all, partially, or wholly Unicode aware, and they usually do these if and only if memcmp() failed to find anything. It is also the case that, apart from macOS, every major platform lets you say "only ever use memcmp()", which is an obvious big performance gain. The low level file i/o library I am proposing lets you use memcmp()-compared paths on Windows using a "\\!\" path prefix, for example, and it's a big gain (about 40% faster file opens!).

So paths, therefore, are simultaneously a bunch of bytes, but also may get some ICU applied to them depending on system configuration, maybe. Hence the u8 prefix is inappropriate, but only sometimes, for filesystem paths. I agree all of this is unfortunate, but that's the current standard practice right now, and as I mentioned, on all but macOS we have memcmp() path comparisons at the kernel level.

Me personally I think it best balanced to use char here, as it's a pure sitting-on-the-fence declaration. Filesystem paths are not std::byte, and not char8_t. They are something in between, and probably best pushed onto the programmer to decide. Which is kinda what char* means in C++.

Niall

Niall Douglas

Apr 20, 2018, 4:45:23 AM
to ISO C++ Standard - Future Proposals
> Me personally I think it best balanced to use char here, as it's a pure sitting-on-the-fence declaration. Filesystem paths are not std::byte, and not char8_t. They are something in between, and probably best pushed onto the programmer to decide. Which is kinda what char* means in C++.

And just to be clear here, there is no way of telling the kernel that "this path is in UTF-8 encoding" any more than "this path is in UTF-16 encoding".

Filesystem paths are bunches of bytes with system configuration determined meaning if and only if memcmp() fails.
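
As a minimal sketch of the "bunch of bytes" view described above: the comparison below treats two paths as equal only when their bytes match exactly, with no case folding or Unicode normalisation. The helper name `same_path_bytes` is made up for illustration; it is not taken from the paper or from any real library, and a real filesystem may layer fallback comparisons on top as described.

```cpp
#include <cassert>
#include <cstring>
#include <string_view>

// Hypothetical helper (not from the paper): byte-exact path comparison.
// Equal length and equal bytes, nothing else is considered.
inline bool same_path_bytes(std::string_view a, std::string_view b) noexcept {
    return a.size() == b.size() &&
           (a.empty() || std::memcmp(a.data(), b.data(), a.size()) == 0);
}

// Small usage example.
inline void demo_same_path_bytes() {
    assert(same_path_bytes("dir/File.txt", "dir/File.txt"));
    // Differs only in case: not equal at the byte level.
    assert(!same_path_bytes("dir/File.txt", "dir/file.txt"));
    // Bytes that are not valid UTF-8 still compare fine as raw bytes.
    assert(same_path_bytes("\xff\xfe.bin", "\xff\xfe.bin"));
}
```

Nothing in this comparison needs to know which encoding, if any, the bytes are in.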

Niall

Nicol Bolas

Apr 20, 2018, 12:26:40 PM
to ISO C++ Standard - Future Proposals

That's precisely why I don't think it's a good solution. Because you're not treating `char` as "a pure sitting-on-the-fence declaration". You're treating `char` as UTF-8. You're explicitly doing this, because on Win32, you will generate a native-encoded path by converting from UTF-8 to UTF-16. There is no fence-sitting happening; you've picked a side.

So if you are going to pick a side, best to be explicit about it.

Bengt Gustafsson

Apr 20, 2018, 4:42:37 PM
to ISO C++ Standard - Future Proposals

> Oh ok. I am surprised that you read it this way. I thought it very clear that we always pass through unchanged inputs of the native encoding, so on Windows, wchar_t input never causes reencoding.
>
> Thanks for this feedback though. I'll need to do a round of removing ambiguity.

So you are using a flag or something to remember what type the pointer actually points to for this path_view instance (or a type erased helper object or something like that). I understand the part about the c_str buffer being used in case conversion is required. I am not convinced that allocating 64k on the stack will be appreciated by the committee but that could be seen as a QoI issue. I assume that the c_str type also has some magic to make sure it does not do any copying if the referred original data's type matches, allocating the buffer with alloca() if not, rather than using a member array which always uses up the maximum stack space. QoI again...

Is there a point in making the c_str buffer a template so that it can be used for other character sets than the "native" one? There are cases where you for instance send filenames over network links or store them as bytes in a file being written. But maybe that's too narrow usage to motivate the complication. There is also the issue of character set in addition to just character type in this case, which should be handled by a future codecvt replacement. An intermediate solution could be to provide utf8_str and utf16_str in addition to c_str which do the same thing but convert the characters if needed compared to what the view actually points at. One of these would be the same as c_str depending on platform, but this would provide an easier portable way to get at the individual characters in a random accessible way.

What I don't really understand is how you can iterate over UTF-8 characters if you have UTF-16 in the underlying implementation, but if the iterator has some internal state I guess you can, as long as you don't provide random access. I think the fact that the iterator must keep a few bytes' worth of preprocessed UTF-8 data is worth mentioning.

This being a read only view it can't handle some common path manipulation scenarios in a portable way, such as combining parts of paths to form a full path, or replacing the extension of a file name. An operator/() which takes paths and path_views as lhs and rhs and returns a path seems like a logical addition to the current operator/(path, path). Not providing this would make some patterns cumbersome to code as you would first have to call the path() method doing one copy and then operator/ doing another. Investigating if path should have a constructor from path_view and similar is also of interest, and here the semantics should be the same as between string/string_view if possible.

Thinking about how to get the utf8 string out of there and into storage that is not local on the stack it seems natural to do:

    path_view pv;
    char* buffer = new char[???];
    *std::copy(pv.begin(), pv.end(), buffer) = 0;

So maybe a method to get the max length for this particular view would be of interest?

Also, it seems neat to have wbegin() and wend() which provide iterators over UTF-16 strings converting from utf8 if this is what was referred, improving symmetry.

At this point the similarity with an iterator based codecvt becomes a little too obvious... is there a concrete proposal for what is to replace the deprecated std::codecvt?

And then the question about the #ifdef _WIN32. What is the point of not providing this functionality on linux? Maybe it would not be widely used, but it doesn't seem to hurt keeping it in and it improves portability for software initially written using wstring on Windows and then ported to Linux. I suspect this has to do with the fact that wchar_t is 32 bits on Linux so the wide char encoding would be different than on Windows anyway. Is there a discussion on this topic?

Niall Douglas

Apr 20, 2018, 5:53:04 PM
to ISO C++ Standard - Future Proposals
> That's precisely why I don't think it's a good solution. Because you're not treating `char` as "a pure sitting-on-the-fence declaration". You're treating `char` as UTF-8.

If and only if the native encoding is not char.
 
> You're explicitly doing this, because on Win32, you will generate a native-encoded path by converting from UTF-8 to UTF-16. There is no fence-sitting happening; you've picked a side.

I suppose one could match std::filesystem::path, and treat char input as ANSI?

Would you prefer that instead?
 

> So if you are going to pick a side, best to be explicit about it.

Sure, it looks like I'll need to simplify the paper's argument a bit. Thanks for the feedback.

Niall

Niall Douglas

Apr 20, 2018, 6:19:51 PM
to ISO C++ Standard - Future Proposals
On Friday, April 20, 2018 at 9:42:37 PM UTC+1, Bengt Gustafsson wrote:

>> Oh ok. I am surprised that you read it this way. I thought it very clear that we always pass through unchanged inputs of the native encoding, so on Windows, wchar_t input never causes reencoding.
>>
>> Thanks for this feedback though. I'll need to do a round of removing ambiguity.
>
> So you are using a flag or something to remember what type the pointer actually points to for this path_view instance (or a type erased helper object or something like that).

The current implementation is lazy, and stores both a string_view and a wstring_view. Whichever is not null is the original source. You can see it at https://github.com/ned14/afio/blob/master/include/afio/v2.0/path_view.hpp#L149.
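
A minimal sketch of that "two views, one of them null" layout might look like the following. The class name `path_view_sketch` and the member names are invented here for illustration; this is not the AFIO code at the link above, only the storage idea as described in this thread. `native_size()` mirrors the extension mentioned elsewhere in the thread, returning the size of whichever view is the source.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical sketch of the dual-view storage; not the real path_view.
class path_view_sketch {
    std::string_view narrow_;   // data() is non-null when the source was char
    std::wstring_view wide_;    // data() is non-null when the source was wchar_t
public:
    path_view_sketch(std::string_view s) noexcept : narrow_(s) {}
    path_view_sketch(std::wstring_view s) noexcept : wide_(s) {}

    // Whichever view is not null is the original source.
    bool is_wide() const noexcept { return wide_.data() != nullptr; }

    // Length in code units of the underlying source, with no reencoding.
    std::size_t native_size() const noexcept {
        return is_wide() ? wide_.size() : narrow_.size();
    }
};

// Small usage example.
inline void demo_path_view_sketch() {
    path_view_sketch a("dir/file.txt");    // char source
    path_view_sketch b(L"dir\\file.txt");  // wchar_t source
    assert(!a.is_wide() && a.native_size() == 12);
    assert(b.is_wide() && b.native_size() == 12);
}
```

A tagged pointer or a flag would carry the same information in less space; the two-view form is simply the easiest to write down.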
 
> I understand the part about the c_str buffer being used in case conversion is required. I am not convinced that allocating 64k on the stack will be appreciated by the committee but that could be seen as a QoI issue.

I say meh on Windows. And it's only so large on Windows.
 
> I assume that the c_str type also has some magic to make sure it does not do any copying if the referred original data's type matches, allocating the buffer with alloca() if not, rather than using a member array which always uses up the maximum stack space. QoI again...

We cannot use alloca here. We cannot predict the size of the output, you see. So we allocate on the stack the maximum path size possible. As I mention in the paper, if the compiler deduces it will never be used, the stack allocation is completely removed.
 

> Is there a point in making the c_str buffer a template so that it can be used for other character sets than the "native" one?

No. Path views as proposed are unsuitable for anything other than calling a syscall.
 
> There is also the issue of character set in addition to just character type in this case, which should be handled by a future codecvt replacement.

Unnecessary for path views.
 
> An intermediate solution could be to provide utf8_str and utf16_str in addition to c_str which do the same thing but convert the characters if needed compared to what the view actually points at. One of these would be the same as c_str depending on platform, but this would provide an easier portable way to get at the individual characters in a random accessible way.

Ah, but we don't want people to get at the individual characters via a path view. Path components, yes. Characters, no.

This probably sounds excessively limiting, but as the paper points out, there is no need to permute whole paths anymore in the low level file i/o library. Just leafnames. And those are likely a char buffer on the stack which you sprintf() into or something.
 

> What I don't really understand is how you can iterate over UTF-8 characters if you have UTF-16 in the underlying implementation, but if the iterator has some internal state I guess you can, as long as you don't provide random access. I think the fact that the iterator must keep a few bytes' worth of preprocessed UTF-8 data is worth mentioning.

The iterators don't need to understand UTF because all they search for is the path system separator character as defined by std::filesystem.

They do not care what is in between the separators. They blindly accept whatever bytes are there.
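
To illustrate why no UTF decoding is needed for this, here is a minimal sketch of separator-only splitting over a narrow string. The function `split_components` is hypothetical and deliberately simplified: unlike std::filesystem::path iteration it does not treat a root name or root directory as a component, and it assumes a single `/` separator.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical sketch: split on the separator only, never inspecting the
// bytes in between. Each result is a slice of the original buffer.
inline std::vector<std::string_view> split_components(std::string_view p,
                                                      char sep = '/') {
    std::vector<std::string_view> out;
    std::size_t start = 0;
    while (start <= p.size()) {
        std::size_t end = p.find(sep, start);
        if (end == std::string_view::npos) end = p.size();
        if (end > start) out.push_back(p.substr(start, end - start));  // skip empty runs
        start = end + 1;
    }
    return out;
}

// Small usage example.
inline void demo_split_components() {
    std::string_view p = "usr/lib/\xff\xfe";  // last component is not valid UTF-8
    auto parts = split_components(p);
    assert(parts.size() == 3);
    assert(parts[0] == "usr" && parts[1] == "lib");
    assert(parts[2].size() == 2);         // accepted blindly as two bytes
    assert(parts[0].data() == p.data());  // a slice, not a copy
}
```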
 

> This being a read only view it can't handle some common path manipulation scenarios in a portable way, such as combining parts of paths to form a full path, or replacing the extension of a file name.

If you want to modify a path, convert the view to a path beforehand using .path() so you can manipulate its underlying string.
 
> An operator/() which takes paths and path_views as lhs and rhs and returns a path seems like a logical addition to the current operator/(path, path). Not providing this would make some patterns cumbersome to code as you would first have to call the path() method doing one copy and then operator/ doing another.

I would imagine that, if approved, std::filesystem::path would gain understanding of path views.
 
> Investigating if path should have a constructor from path_view and similar is also of interest, and here the semantics should be the same as between string/string_view if possible.

I think this very likely.
 

> Thinking about how to get the utf8 string out of there and into storage that is not local on the stack it seems natural to do:
>
>     path_view pv;
>     char* buffer = new char[???];
>     *std::copy(pv.begin(), pv.end(), buffer) = 0;

Path view iterators, same as path iterators, iterate path components. Not characters. So iterators return path views which are slices of the original, same as path. See http://en.cppreference.com/w/cpp/filesystem/path/begin.

Thus std::copy would need to copy into an array of path_views, not char[]. Same as for path.
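
Since path_view is only proposed, the same behaviour can be shown with the existing std::filesystem::path, whose iterators also yield components. This is a small sketch; the helper name `components_of` is made up for the example, and a relative path is used so the result does not depend on platform root handling.

```cpp
#include <algorithm>
#include <cassert>
#include <filesystem>
#include <iterator>
#include <vector>

// Copy the components of a path into a vector of paths.
// Iterating a path visits components, not characters.
inline std::vector<std::filesystem::path>
components_of(const std::filesystem::path& p) {
    std::vector<std::filesystem::path> out;
    std::copy(p.begin(), p.end(), std::back_inserter(out));
    return out;
}

// Small usage example.
inline void demo_components_of() {
    auto parts = components_of("dir/sub/file.txt");
    assert(parts.size() == 3);
    assert(parts[0] == std::filesystem::path("dir"));
    assert(parts[1] == std::filesystem::path("sub"));
    assert(parts[2] == std::filesystem::path("file.txt"));
}
```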
 

> So maybe a method to get the max length for this particular view would be of interest?

As with path, the size of a view is the number of path components, not the size of the underlying string.

I do have an extension, native_size(), which returns whatever the underlying string view returns for its size.
 

> Also, it seems neat to have wbegin() and wend() which provide iterators over UTF-16 strings converting from utf8 if this is what was referred, improving symmetry.

As mentioned before, path view iterators return new path views.
 

> And then the question about the #ifdef _WIN32. What is the point of not providing this functionality on linux?

Primarily a far more efficient implementation, as we can hard-assume that char will be the sole backing storage. That eliminates the runtime dispatch.
 
> Maybe it would not be widely used, but it doesn't seem to hurt keeping it in and it improves portability for software initially written using wstring on Windows and then ported to Linux. I suspect this has to do with the fact that wchar_t is 32 bits on Linux so the wide char encoding would be different than on Windows anyway. Is there a discussion on this topic?


I don't personally see any gain. Less is more. We want to encourage people to not cause unnecessary string reencodings with path views. A reencoding is just as expensive as a memory copy. And if they really want to, they can use a std::filesystem::path, and construct a path view from the reencoded path.

Thanks for your feedback!

Niall

Bengt Gustafsson

Apr 20, 2018, 6:50:30 PM
to ISO C++ Standard - Future Proposals
Ok, I'm not too familiar with path iterators, sorry for the confusion. Anyhow, does this mean that the only way to get the actual characters out is using either path() and from the returned path object get a string (two deep copies later) or using c_str buffer, always getting a "platform" friendly string, i.e. non-portable code? While it is of course important to be able to efficiently get the "platform" friendly string I think it is also important to be able to get the character sequence out the other end, i.e. to be able to work with the characters of the view you received in a portable way.

Thiago Macieira

Apr 21, 2018, 12:42:10 PM
to std-pr...@isocpp.org
On Friday, 20 April 2018 13:42:37 PDT Bengt Gustafsson wrote:
> And then the question about the #ifdef _WIN32. What is the point of not
> providing this functionality on linux? Maybe it would not be widely used,
> but it doesn't seem to hurt keeping it in and it improves portability for
> software initially written using wstring on Windows and then ported to
> Linux. I suspect this has to do with the fact that wchar_t is 32 bits on
> Linux so the wide char encoding would be different than on Windows anyway.
> Is there a discussion on this topic?

It would still be Unicode, though UTF-32.

The problem is that the conversion is not guaranteed to be lossless. File
names on Unix systems can be in an arbitrary encoding, practically binary so long as
neither slashes nor nulls are used. So certain file names cannot be converted
to UTF-32 or UTF-16.

This is common when you decompress a very old .zip file.
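
As a rough illustration of the point above, the sketch below checks whether a byte sequence is well-formed UTF-8. A file name stored in a legacy single-byte encoding (the Latin-1 example here is an assumption chosen for the demo) fails the check, so a strict converter has no defined UTF-16 or UTF-32 result for it. The function name `is_well_formed_utf8` is invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical helper: true if the bytes are well-formed UTF-8
// (no stray or missing continuation bytes, no overlong forms,
// no surrogates, nothing above U+10FFFF).
inline bool is_well_formed_utf8(std::string_view s) noexcept {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if (c < 0x80) { ++i; continue; }  // single-byte (ASCII)
        std::size_t extra = 0;
        std::uint32_t cp = 0, min_cp = 0;
        if ((c & 0xE0) == 0xC0)      { extra = 1; cp = c & 0x1F; min_cp = 0x80; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; min_cp = 0x800; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; min_cp = 0x10000; }
        else return false;  // stray continuation byte or invalid lead byte
        if (s.size() - i <= extra) return false;  // sequence is truncated
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < min_cp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;  // overlong, out of range, or surrogate
        i += extra + 1;
    }
    return true;
}

// Small usage example.
inline void demo_is_well_formed_utf8() {
    assert(is_well_formed_utf8("caf\xc3\xa9.txt"));  // "café.txt" in UTF-8
    assert(!is_well_formed_utf8("caf\xe9.txt"));     // same name in Latin-1
}
```

The Latin-1 name is still a perfectly usable file name on such systems; it just has no lossless Unicode form without knowing the original encoding.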

--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
Software Architect - Intel Open Source Technology Center


