How to use the wfopen wchar_t API from Cython on Windows?


Chris Barker

Mar 6, 2024, 7:56:40 PM
to cython-users
I'm wrapping an old-style C lib, that needs a plain old FILE pointer. 

So I'm using wfopen, so I can use unicode filenames.

What used to work:

in the pxd:

from libc.stdio cimport FILE

IF UNAME_SYSNAME == "Windows":
    cdef extern from "<windows.h>":
        ctypedef Py_UNICODE wchar_t
        FILE *_wfopen(const wchar_t *filename, const wchar_t *mode)

and in the pyx:

cdef FILE* open_file(file_path) except *:
    """
    opens a file
    :param path: python str or PathLike
    :returns: File Pointer
    Note: On Windows, it uses a wchar_t, UTF-16 encoded
          On other platforms (Mac and Linux), it assumes utf-8
    """
    cdef FILE* fp
    file_path = os.fspath(file_path)
    fp = NULL
    IF UNAME_SYSNAME == 'Windows':
        fp = _wfopen(file_path, "wb")
    ELSE:
        fp = fopen(file_path.encode('utf-8'), 'wb')
    if fp is NULL:
        raise OSError('could not open the file: {}'.format(file_path))
    return fp

This has worked OK for years -- but now, on Python 3.12, I get:

unresolved external symbol PyUnicode_AsUnicode

I imagine it's the typedef:

ctypedef Py_UNICODE wchar_t

So how do I pass a Python str as a wchar_t?

-CHB

NOTE: in an issue on GitHub, TeamSpen210 suggested:

There's other APIs that allocate and give you a copy of the string in a wchar_t buffer. You'll need to ensure you deallocate the buffer once it's no longer being used though, unlike PyUnicode_AsUnicode().

So I'll look into that -- but it sure sounds harder than it should be -- could Cython provide a utility like the old PyUnicode_AsUnicode()?

Thanks,
-CHB


--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris....@noaa.gov

Chris Barker

Mar 7, 2024, 1:22:40 PM
to cython-users
another note / question:

IIUC, Windows uses UTF-16 for filenames. So should I be able to encode a string to a UTF-16 bytestring, and then cast that to a wchar_t*?

-CHB


da-woods

Mar 8, 2024, 3:09:31 AM
to cython...@googlegroups.com
I think in the past we've recommended:

wchar_str = PyUnicode_AsWideCharString(unicodeobj, NULL)
# use wchar_str ...
PyMem_Free(wchar_str)

Python seems to use that quite a bit internally for handling paths (cpython/modules/getpath.c, cpython/Python/fileutils.c) and Windows registry keys (cpython/PC/winreg.c). However, I'm not absolutely sure what it does encoding-wise. I imagine your UTF-16 bytes scheme would also be fine.
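To see what that call hands back without building an extension, here is a rough sketch that reaches the same two C-API functions through ctypes. The helper name `wide_copy_roundtrip` is made up for illustration, and the whole thing is only a way to poke at the behaviour from plain Python, not a replacement for calling the API from Cython. As far as I know, wchar_t is 16 bits on Windows (so the buffer holds UTF-16) and 32 bits on most other platforms.

```python
import ctypes

# ctypes.pythonapi exposes CPython's own C API and keeps the GIL held.
api = ctypes.pythonapi

api.PyUnicode_AsWideCharString.argtypes = [
    ctypes.py_object,
    ctypes.POINTER(ctypes.c_ssize_t),
]
# Use void* so we keep the raw address and can free it ourselves.
api.PyUnicode_AsWideCharString.restype = ctypes.c_void_p
api.PyMem_Free.argtypes = [ctypes.c_void_p]
api.PyMem_Free.restype = None


def wide_copy_roundtrip(text):
    """Copy `text` into a wchar_t buffer, read it back, free the buffer.

    Hypothetical helper, for illustration only.
    """
    size = ctypes.c_ssize_t(0)
    buf = api.PyUnicode_AsWideCharString(text, ctypes.byref(size))
    if not buf:
        raise MemoryError("PyUnicode_AsWideCharString returned NULL")
    try:
        # `size` counts wchar_t units, not bytes, and excludes the terminator.
        return ctypes.wstring_at(buf, size.value)
    finally:
        # The buffer is a fresh allocation that the caller must release.
        api.PyMem_Free(buf)
```

The `try`/`finally` mirrors the ownership rule: the buffer is not tied to the str object, so it has to be freed on every path once the C call that needs it has returned.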
--

---
You received this message because you are subscribed to the Google Groups "cython-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cython-users...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/cython-users/CALGmxELPQjMPWA8-AvrHugEFkGPj7ti0DKa%2BcZ1Vkcf_kDNj1g%40mail.gmail.com.


Chris Barker

Mar 13, 2024, 2:48:38 PM
to cython...@googlegroups.com
On Fri, Mar 8, 2024 at 12:09 AM da-woods <dw-...@d-woods.co.uk> wrote:
> I think in the past we've recommended:
>
> wchar_str = PyUnicode_AsWideCharString(unicodeobj, NULL)
> # use wchar_str ...
> PyMem_Free(wchar_str)

Thanks -- that seems to be working :-)

We need to update the docs -- they are pretty old, still reference Python2, and also recommend the now deprecated API :-( 

I wonder if it's worth putting something directly into Cython for this so we don't all have to figure out on our own.
 
> Python seems to use that quite a bit internally for handling paths (cpython/modules/getpath.c, cpython/Python/fileutils.c) and Windows registry keys (cpython/PC/winreg.c). However, I'm not absolutely sure what it does encoding-wise.

Well, while wchar_t is technically just a type, I *think* that, at least with the Windows APIs, there's an assumption of UTF-16 encoding -- so I think CPython is doing it for us.

> I imagine your UTF-16 bytes scheme would also be fine.

Which I wanted to do because it seemed I shouldn't NEED to allocate a new C wchar_t array. In theory, if I have a Python bytes object holding UTF-16 encoded bytes, I should be able to grab that pointer and cast it to a wchar_t*.

I suppose that's still allocating the bytes object, but then Python memory management can take care of it.

But I never quite got it to work. This is what I tried:

      IF UNAME_SYSNAME == 'Windows':
          cdef bytes bytes_flag = "wb".encode('utf-16')
          cdef bytes bytes_filepath = file_path.encode('utf-16')
          fp = _wfopen(<wchar_t*> bytes_filepath, <wchar_t*> bytes_flag)
                                                  ^
  ------------------------------------------------------------
 
  py_gd\py_gd.pyx:82:48: Python objects cannot be cast to pointers of primitive types
 
and that's the Cython error -- so how do I get the pointer to the underlying array? memoryview? buffer? 

(note that Cython seems happy casting a bytes object to a char*, why not a wchar_t ?)

No, I haven't looked at the generated C code -- but I expect, particularly given the history of bytes and py2 strings, that bytes is assumed to be a char* -- maybe even a null-terminated one.

-CHB
 



Peter Schay

Mar 13, 2024, 3:19:43 PM
to cython...@googlegroups.com

> But I never quite got it to work. This is what I tried:
>
>       IF UNAME_SYSNAME == 'Windows':
>           cdef bytes bytes_flag = "wb".encode('utf-16')
>           cdef bytes bytes_filepath = file_path.encode('utf-16')
>           fp = _wfopen(<wchar_t*> bytes_filepath, <wchar_t*> bytes_flag)
>                                                   ^
>   ------------------------------------------------------------
>
>   py_gd\py_gd.pyx:82:48: Python objects cannot be cast to pointers of primitive types
>
> and that's the Cython error -- so how do I get the pointer to the underlying array? memoryview? buffer?


In my project I have 1-dimensional contiguous arrays all over the place, and when there is a system call to make with a specific pointer type, I use a memoryview to access whatever python buffer type I have.

Here is an example with strlen; hopefully something similar will work for your _wfopen call:

pete@flow$ cat t.pyx
from libc.string cimport strlen

def foo():
    cdef bytes b = b"hello"
    cdef const char [::1] name
    name = b
    print("len is {}".format(strlen(&name[0])))

foo()

pete@flow$ cythonize --3str --build t.pyx && python -c 'import t'
Compiling /home/pete/src/xpack/t.pyx because it changed.
[1/1] Cythonizing /home/pete/src/xpack/t.pyx
len is 5
You can always cast &name[0] as needed; does <wchar_t *>(&name[0]) work for you? 
I am curious to know if that works for you on Windows, since I am about to take the mad plunge and port my project to Windows soon :-)

Regards,
Pete

da-woods

Mar 13, 2024, 4:25:56 PM
to cython...@googlegroups.com
On 13/03/2024 18:47, 'Chris Barker' via cython-users wrote:

> We need to update the docs -- they are pretty old, still reference Python2, and also recommend the now deprecated API :-(

An update to the docs would definitely be helpful.


> I wonder if it's worth putting something directly into Cython for this so we don't all have to figure out on our own.

The main reason that's hard to do automatically in Cython is because the lifetimes are no longer tied to a Python object so Cython would have to work out when to release the memory. With `bytes` -> `char*` and `unicode` -> `Py_UNICODE*` the storage is owned internally by the Python object.

Obviously a lot of the time it just needs to live for the duration of the statement it's in, but some of the time users will stash the data away for later.

 
> But I never quite got it to work. This is what I tried:
>
>       IF UNAME_SYSNAME == 'Windows':
>           cdef bytes bytes_flag = "wb".encode('utf-16')
>           cdef bytes bytes_filepath = file_path.encode('utf-16')
>           fp = _wfopen(<wchar_t*> bytes_filepath, <wchar_t*> bytes_flag)
>                                                   ^
>   ------------------------------------------------------------
>
>   py_gd\py_gd.pyx:82:48: Python objects cannot be cast to pointers of primitive types
 

I'd cast to `char*` (to get the underlying data using Cython's predefined conversion) then to `wchar_t*`:

          fp = _wfopen(<wchar_t*><char*>bytes_filepath, <wchar_t*><char*>bytes_flag)


> (note that Cython seems happy casting a bytes object to a char*, why not a wchar_t ?)

Because `char*` is always right - every `bytes` object holds a `char*` array (because underneath it's always just an array of C chars), but reinterpreting it as `wchar_t*` may or may not make sense depending on the encoding. If you encoded it as ascii then `wchar_t*` would be wrong.


Chris Barker

Mar 13, 2024, 4:46:06 PM
to cython...@googlegroups.com
On Wed, Mar 13, 2024 at 1:26 PM da-woods <dw-...@d-woods.co.uk> wrote:
>> I wonder if it's worth putting something directly into Cython for this so we don't all have to figure out on our own.
>
> The main reason that's hard to do automatically in Cython is because the lifetimes are no longer tied to a Python object so Cython would have to work out when to release the memory. With `bytes` -> `char*` and `unicode` -> `Py_UNICODE*` the storage is owned internally by the Python object.

Hmm -- I guess that would require a Python object wrapper around the wchar_t array -- much like a bytes type -- and I guess it's not up to Cython to make that.

Then maybe an fopen utility -- though maybe I'm the rare case of using those old-fashioned FILE pointers :-)

>> But I never quite got it to work. This is what I tried:
>>
>>       IF UNAME_SYSNAME == 'Windows':
>>           cdef bytes bytes_flag = "wb".encode('utf-16')
>>           cdef bytes bytes_filepath = file_path.encode('utf-16')
>>           fp = _wfopen(<wchar_t*> bytes_filepath, <wchar_t*> bytes_flag)
>>                                                   ^
>>   ------------------------------------------------------------
>>
>>   py_gd\py_gd.pyx:82:48: Python objects cannot be cast to pointers of primitive types
>
> I'd cast to `char*` (to get the underlying data using Cython's predefined conversion) then to `wchar_t*`:
>
>           fp = _wfopen(<wchar_t*><char*>bytes_filepath, <wchar_t*><char*>bytes_flag)
>
>> (note that Cython seems happy casting a bytes object to a char*, why not a wchar_t ?)
>
> Because `char*` is always right - every `bytes` object holds a `char*` array (because underneath it's always just an array of C chars), but reinterpreting it as `wchar_t*` may or may not make sense depending on the encoding. If you encoded it as ascii then `wchar_t*` would be wrong.

Sure -- but consenting adults and all that -- I think of a bytes object as, well, what it's called -- a wrapper around a bunch of bytes -- the content of those bytes is irrelevant. And a char* is the same: thanks to legacy, there's no distinction between a "character", an "unsigned single-byte integer" and "a byte", so char* is used for arbitrary data buffers.

Though now that I think about it - casting a char* to a wchar_t* would only work if there were an even number of bytes allocated. so it could fail for reasons other than the encoding.  Though I thought casting was inherently unsafe anyway :-) -- but I guess not if you are casting a python type to a C pointer -- that should be a bit safer.

I tried casting to a char* first, and it failed. It did compile, but it didn't work. I suspected it might have to do with null termination -- lots of nulls in UTF-16. But now that I think about it, that doesn't make sense if only the pointer is passed through. So maybe I made another mistake I need to figure out.

And if that doesn't work, I'll try the memoryview trick.

Thanks,
-CHB



Chris Barker

Mar 13, 2024, 6:48:13 PM
to cython...@googlegroups.com
Final post for the archives: 

On Wed, Mar 13, 2024 at 1:45 PM Chris Barker <chris....@noaa.gov> wrote:

> I'd cast to `char*` (to get the underlying data using Cython's predefined conversion) then to `wchar_t*`:
>
>           fp = _wfopen(<wchar_t*><char*>bytes_filepath, <wchar_t*><char*>bytes_flag)

So this compiled and ran without error, but the file path somehow wasn't correct, and the file couldn't be opened. So apparently PyUnicode_AsWideCharString does something other than just encode to utf-16. I could try to debug more to see what's different, but why? that's what PyUnicode_AsWideCharString is for -- might as well use it.

And it's working in our code.

So thanks all!
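One plausible reason the double cast opened the wrong path (a guess, not something confirmed in this thread): Python's generic "utf-16" codec prepends a byte order mark, so the buffer handed to _wfopen would start with U+FEFF rather than the first character of the path. In addition, as far as I know a bytes object only guarantees a single trailing zero byte, whereas a wchar_t string needs a two-byte terminator. The sketch below shows the difference from plain Python; `to_wide_bytes` is a hypothetical helper, and the example path is invented.

```python
import sys

path = "C:/data/résumé.txt"  # invented example path

with_bom = path.encode("utf-16")     # generic codec: BOM + native byte order
no_bom = path.encode("utf-16-le")    # explicit little-endian, no BOM

# The generic codec output is exactly two bytes longer: the byte order mark.
assert len(with_bom) == len(no_bom) + 2
expected_bom = b"\xff\xfe" if sys.byteorder == "little" else b"\xfe\xff"
assert with_bom[:2] == expected_bom


def to_wide_bytes(text):
    """UTF-16-LE bytes plus an explicit two-byte NUL terminator.

    Hypothetical helper: this is the layout a Windows wchar_t* string
    is assumed to need if you insist on going through bytes.
    """
    return text.encode("utf-16-le") + b"\x00\x00"
```

Even with a helper like this, PyUnicode_AsWideCharString remains the simpler route, since it produces the platform's wchar_t layout without any assumptions about byte order or terminators.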

Stefan Behnel

Mar 14, 2024, 5:37:31 AM
to cython...@googlegroups.com
da-woods wrote on 13.03.24 at 21:25:
> On 13/03/2024 18:47, 'Chris Barker' via cython-users wrote:
>> We need to update the docs -- they are pretty old, still
>> reference Python2, and also recommend the now deprecated API :-(
>
> An update to the docs would definitely be helpful.

He probably didn't mean the Cython docs, but I agree that opening a file
from a Python path string in C shouldn't be as hard as it is. There should
be at least a FAQ entry for this.

I could imagine having a

cython.fopen(path: str, mode: str) -> FILE*

that does the right thing on different platforms, at C compile time. It
would use the wchar APIs for the file path on Windows (and ASCII encoding
for the 'mode') and encode the file path to the local file system encoding
(which usually is but may not always be UTF-8) on *nix systems. Sounds
doable. Someone out there probably already has the code for this and could
contribute it.

I also wonder if this shouldn't be in CPython's C-API (as well?). Seems
worth filing a ticket on their side.

BTW, while looking up the details, I noticed that fopen() also supports
UTF-8 encoded file paths on Windows, see the section on Unicode support here:

https://learn.microsoft.com/de-de/cpp/c-runtime-library/reference/fopen-wfopen?view=msvc-170

That might actually be the easiest way to handle this, just append
", ccs=UTF-8"
to the mode if you're on Windows and then encode the file path to UTF-8
normally.

Anyone up for writing a FAQ entry on this?

https://github.com/cython/cython/blob/master/docs/src/userguide/faq.rst

Stefan

Chris Barker

Mar 14, 2024, 12:33:14 PM
to cython...@googlegroups.com
Thanks Stefan,

I actually did mean the Cython docs. The discussion of handling Unicode and C/C++ is recommending methods that won't work in Python3.12. 

It also references Py2 a fair bit -- is there a policy about that? Should the latest docs be Py3 only?

Anyway, I'll try to start a docs PR, but I'm not sure I'll have much time to work on it :-(

Which brings up another question -- should Cython generate code using deprecated APIs that are now removed ?

Probably yes (or a lot of code would break that can still work on not-the-latest Python) -- but perhaps Cython should raise a warning that a Cython API is being used that is deprecated.

> I agree that opening a file
> from a Python path string in C shouldn't be as hard as it is. There should
> be at least a FAQ entry for this.
>
> I could imagine having a
>
>      cython.fopen(path: str, mode: str) -> FILE*
>
> that does the right thing on different platforms, at C compile time. It
> would use the wchar APIs for the file path on Windows (and ASCII encoding
> for the 'mode') and encode the file path to the local file system encoding
> (which usually is but may not always be UTF-8) on *nix systems. Sounds
> doable. Someone out there probably already has the code for this and could
> contribute it.

I have a start on it -- which is how this question started :-) -- I'll clean it up and post it for review.

> (which usually is but may not always be UTF-8) on *nix systems

What the heck to do about this??? -- I recall from the contentious Py2-3 transition that *nix systems simply used a char* with no special requirements for the contents other than null and the ASCII slash being special. I never understood how folks thought that a system with unknown (and maybe mixed) encodings for filenames was not broken, but apparently that's the reality -- so how to deal with that?

Options:
1) let the user specify the encoding for the filename (utf-8 by default)
2) let the user pass in a bytes object -- in which case it would be used directly
3) support utf-8 only, and so be it if it doesn't work. (that's what my code does, and it hasn't been a problem yet -- maybe non utf-8 filesystems are pretty rare these days ...)


Anyway, I'll work on a prototype, once I figure out how to eliminate IF

> I also wonder if this shouldn't be in CPython's C-API (as well?). Seems
> worth filing a ticket on their side.

You'd think -- worth a try -- it would be nice to have it there. The code must be there internally somewhere.

> BTW, while looking up the details, I noticed that fopen() also supports
> UTF-8 encoded file paths on Windows, see the section on Unicode support here:
>
> https://learn.microsoft.com/de-de/cpp/c-runtime-library/reference/fopen-wfopen?view=msvc-170

I'll need to find the English version of that, but good find! This could be very helpful for some other code of ours :-) and maybe the way to solve this problem a lot more easily.

> That might actually be the easiest way to handle this, just append
> ", ccs=UTF-8"
> to the mode if you're on Windows and then encode the file path to UTF-8
> normally.
>
> Anyone up for writing a FAQ entry on this?
>
> https://github.com/cython/cython/blob/master/docs/src/userguide/faq.rst

I'll get some code working and reviewed first ...

Thanks,
-CHB

 




da-woods

Mar 14, 2024, 1:48:27 PM
to cython...@googlegroups.com

A slight tangent to the main point of discussion but the other thing I believe you can do is:

1. open the file in Python
2. use PyObject_AsFileDescriptor to get a file descriptor integer from the file
3. use the POSIX fdopen to get a FILE* from the file descriptor integer.

That means you can leave the opening of the file and handling the encoding in Python, but still get the C file pointer.

I believe scipy do it:

https://github.com/scipy/scipy/blob/1ff9e4801b9e903a1a31564154cad0f4d0d1a966/scipy/_lib/messagestream.pyx#L38
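The descriptor handoff in the three steps above can be tried out from plain Python before writing any Cython. In this sketch `os.fdopen` stands in for the C `fdopen` call and `fileno()` stands in for `PyObject_AsFileDescriptor`; the helper name `second_handle` is hypothetical. Duplicating the descriptor with `os.dup` is my own assumption about a sensible ownership rule, so that closing the second handle does not close the file Python still holds.

```python
import os


def second_handle(py_file, mode="wb"):
    """Return an independent handle on the file behind `py_file`.

    Hypothetical helper. In Cython the last line would call the C
    fdopen() on the duplicated descriptor and return a FILE* instead.
    """
    # Push out anything Python has buffered before another writer joins in.
    py_file.flush()
    # Duplicate the descriptor: the copy shares the file offset, but it
    # can be closed without closing the descriptor that Python owns.
    fd = os.dup(py_file.fileno())
    return os.fdopen(fd, mode)
```

The same caution applies to the C version: whichever side wrote last should flush before the other side writes, otherwise the two buffers can interleave their output in surprising ways.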

Chris Barker

Mar 14, 2024, 3:44:38 PM
to cython...@googlegroups.com
On Thu, Mar 14, 2024 at 10:48 AM da-woods <dw-...@d-woods.co.uk> wrote:

> A slight tangent to the main point of discussion but the other thing I believe you can do is:
>
> 1. open the file in Python
> 2. use PyObject_AsFileDescriptor to get a file descriptor integer from the file
> 3. use the POSIX fdopen to get a FILE* from the file descriptor integer.
>
> That means you can leave the opening of the file and handling the encoding in Python, but still get the C file pointer.

Ahh -- that would be good -- and in other cases, I've needed to get a FILE* from an already open Python file object -- which used to be easy in py2, but I never figured out how with Py3.

but: 
> 3. use the POSIX fdopen to get a FILE* from the file descriptor integer.

Hmm - there does seem to be something similar on Windows:


Is there no CPython API function for "get a FILE* from an already open file"?

-CHB


Chris Barker

Mar 14, 2024, 4:06:51 PM
to cython...@googlegroups.com
On Thu, Mar 14, 2024 at 2:37 AM Stefan Behnel <stef...@behnel.de> wrote:
> BTW, while looking up the details, I noticed that fopen() also supports
> UTF-8 encoded file paths on Windows, see the section on Unicode support here:
>
> https://learn.microsoft.com/de-de/cpp/c-runtime-library/reference/fopen-wfopen?view=msvc-170
>
> That might actually be the easiest way to handle this, just append
> ", ccs=UTF-8"

Darn -- tried this, and no go -- I think that doesn't change how the path is interpreted, only how the contents are processed. From the Microsoft docs:

fopen supports Unicode file streams. To open a Unicode file, pass a ccs=encoding flag that specifies the desired encoding to fopen, as follows.

FILE *fp = fopen("newfile.txt", "rt+, ccs=UTF-8");

Allowed values for ccs encoding are UNICODE, UTF-8, and UTF-16LE.

When a file is opened in Unicode mode, input functions translate the data that's read from the file into UTF-16 data stored as type wchar_t. 

Oh well,

-CHB 


Chris Barker

Mar 14, 2024, 7:18:14 PM
to cython...@googlegroups.com
Hopefully, I'll find some time to add to the FAQ, or even a PR for a new function in Cython. But in the meantime, here's what I've got to work for me:

cdef FILE* open_file(file_path, str mode) except *:
    """
    Opens a file and returns a C FILE pointer.

    :param file_path: python str or PathLike

    :param mode: python str with mode to open the file with:
                 e.g. "wb"

    :returns: FILE* File Pointer

    Note: On Windows, it uses a wchar_t, UTF-16 encoded.
          On other platforms (Mac and Linux), it assumes utf-8.

          If the file system is not utf-8 encoded, this will only
          work for ascii file paths.
    """
    cdef FILE* fp = NULL

    file_path = os.fspath(file_path)

    IF UNAME_SYSNAME == 'Windows':
        cdef Py_ssize_t length
        cdef wchar_t *wchar_flag = PyUnicode_AsWideCharString(mode, &length)
        cdef wchar_t *wchar_filepath = PyUnicode_AsWideCharString(file_path, &length)

        fp = _wfopen(wchar_filepath, wchar_flag)

        PyMem_Free(<void *>wchar_filepath)
        PyMem_Free(<void *>wchar_flag)
    ELSE:
        fp = fopen(file_path.encode('utf-8'), mode.encode('ascii'))


    if fp is NULL:
        raise OSError('could not open the file: {}'.format(file_path))

    return fp

I suppose it would be good to add a parameter for the encoding of the file name for non-Windows systems, though does anyone ever know if it's not utf-8 or ascii ?

I'd also like to remove the IF, but not sure when I'll get a chance to figure that out.

-CHB

Stefan Behnel

Mar 15, 2024, 3:00:55 AM
to cython...@googlegroups.com
'Chris Barker' via cython-users wrote on 14.03.24 at 17:32:
>> (which usually is but may not always be UTF-8) on *nix systems
>
> What the heck to do about this???

https://docs.python.org/3/c-api/unicode.html#file-system-encoding

There's also PEP 529 for Windows, but that only deals with the Python side
of the encoding:

https://peps.python.org/pep-0529/

Stefan
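At the Python level, the file system encoding described at the first link is what `os.fsencode` applies, so one way to avoid hard-coding UTF-8 on *nix is to let Python produce the bytes and pass those to fopen. A minimal sketch, assuming the wrapper only needs a bytes path on non-Windows platforms; the helper name `path_to_fs_bytes` is hypothetical:

```python
import os
import sys


def path_to_fs_bytes(path):
    """Encode a str or PathLike with the interpreter's file system encoding.

    Hypothetical helper. bytes input is passed through unchanged, which
    also covers the "let the user pass in a bytes object" option.
    """
    return os.fsencode(path)


# Purely informational: which encoding and error handler are in effect.
fs_encoding = sys.getfilesystemencoding()
fs_errors = sys.getfilesystemencodeerrors()
```

In a Cython wrapper the result could be handed to fopen in place of `file_path.encode('utf-8')`; on Windows the wchar_t route discussed earlier in the thread would still apply.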

Stefan Behnel

Mar 15, 2024, 3:06:12 AM
to cython...@googlegroups.com
'Chris Barker' via cython-users wrote on 14.03.24 at 20:43:
> Is there no cPython API function for "get a FILE* from an already open
> file"?

I think the main issue here is that, in many cases, files opened by Python
are not plain file objects but some kind of stream these days. Anything
that involves text reading adds at least one wrapper level. Not sure about
features like buffering etc.

For the simple (bytes) cases, PyObject_AsFileDescriptor seems as good as it
gets.

https://docs.python.org/3/c-api/file.html#c.PyObject_AsFileDescriptor

Stefan

Chris Barker

Mar 15, 2024, 6:58:12 PM
to cython...@googlegroups.com
On Fri, Mar 15, 2024 at 12:06 AM Stefan Behnel <stef...@behnel.de> wrote:
> 'Chris Barker' via cython-users wrote on 14.03.24 at 20:43:
>> Is there no cPython API function for  "get a FILE* from an already open
>> file"?
>
> I think the main issue here is that, in many cases, files opened by Python
> are not plain file objects but some kind of stream these days.

Yeah, that's why this is a lot more complicated than it was with Py2.

> For the simple (bytes) cases, PyObject_AsFileDescriptor seems as good as it
> gets.

Hmm -- and a file descriptor may or may not point to a "regular file" that you can get a FILE* for -- so it seems it's just punting the problem to the next step. Oh well.

I see this in the docs:

"""
 third-party code is advised to access the io APIs instead.
"""
Which I suppose is why it doesn't provide the old FILE* pointers.

However, in my use case, which is probably not rare, I'm not trying to read/write files in C -- I'm trying to wrap an existing C library that wants a FILE*.

Perhaps a utility function in Cython for that?

(or a FAQ entry, but the function could be helpful as it's slightly platform dependent, as far as I can see)

-Chris