How to check that Python object provides buffer interface and find out dims / type?


Yury V. Zaytsev

Jun 3, 2013, 7:51:32 AM
to cython...@googlegroups.com
Hi,

I'm working on Python bindings for a C++ application and one part of it
is data conversion routines between Python <-> C++ application.

For Python -> C++ conversion, I basically have something of this sort:

    if isinstance(obj, types.BooleanType):
        ret = <Datum*> new BoolDatum(obj)
    elif isinstance(obj, (types.IntType, types.LongType)):
        ret = <Datum*> new IntegerDatum(obj)
    ...

Now, I'd like to be able to convert 1D arrays (NumPy or Python, and,
potentially, 2D arrays) to C++ vectors, but I would like to avoid having
a compile time dependence on NumPy.

Would Cython memory views help me to solve this problem? If yes, how is it
possible to check whether the passed Python object provides a buffer
interface, and what its dimensionality and type are?

I think that in the worst case I could do isinstance(obj,
numpy.ndarray), extract shape & dtype and then call a function that
takes a memory view of this kind and processes it, but this will come
with compile-time dependency on NumPy, right?

Thanks!

--
Sincerely yours,
Yury V. Zaytsev


Stefan Behnel

Jun 3, 2013, 12:33:30 PM
to cython...@googlegroups.com
Yury V. Zaytsev, 03.06.2013 13:51:
> I'm working on Python bindings for a C++ application and one part of it
> is data conversion routines between Python <-> C++ application.
>
> For Python -> C++ conversion, I basically have something of this sort:
>
>     if isinstance(obj, types.BooleanType):
>         ret = <Datum*> new BoolDatum(obj)
>     elif isinstance(obj, (types.IntType, types.LongType)):
>         ret = <Datum*> new IntegerDatum(obj)
>     ...

Better use the native types here, i.e.

    if isinstance(obj, (int, long))

This is substantially faster and also more portable in Cython ("long"
doesn't exist in Py3).

For bool and None, you could even use pointer tests:

    if obj is None:
        ...
    elif obj is True or obj is False:
        ...

BTW, and just for fun, I just noticed that this is valid Python syntax:

    if obj == True/False:
        ...

It doesn't quite do what one would want, though...
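To see why, here is a small sketch in plain Python 3 (not Cython) that parses the expression and then evaluates it. The names `expression`, `groups_as_division` and `check` are made up for this illustration. The `/` binds tighter than `==`, so the condition groups as `obj == (True / False)`, which is a division by zero.

```python
import ast

# Parse only; nothing is evaluated here.
expression = ast.parse("obj == True/False", mode="eval").body

# The right-hand operand of "==" is a division node,
# so the expression groups as obj == (True / False).
right_operand = expression.comparators[0]
groups_as_division = (
    isinstance(expression, ast.Compare)
    and isinstance(right_operand, ast.BinOp)
    and isinstance(right_operand.op, ast.Div)
)


def check(obj):
    """Evaluate the condition and report what happens."""
    try:
        return obj == True/False
    except ZeroDivisionError:
        # True / False is 1 / 0, which fails before any comparison is made.
        return "ZeroDivisionError"


print(groups_as_division)
print(check(None))
```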


> Now, I'd like to be able to convert 1D arrays (NumPy or Python, and,
> potentially, 2D arrays) to C++ vectors, but I would like to avoid having
> a compile time dependence on NumPy.
>
> Would Cython memory views help me to solve this problem? If yes, how it
> is possible to check whether the passed Python object provides a buffer
> interface, and what is the dimensionality and type?
>
> I think that in the worst case I could do isinstance(obj,
> numpy.ndarray), extract shape & dtype and then call a function that
> takes a memory view of this kind and processes it, but this will come
> with compile-time dependency on NumPy, right?

The question is: what good is support for the interface if you know nothing
about the layout of the data? If you want a completely generic mapping that
handles anything from 1-N dimensional arrays of arbitrary size, you will
have to implement that yourself, using the normal buffer interface C-API
functions. It's somewhat complicated if you want to support it completely,
though, because it supports several different memory layout schemes.

http://docs.python.org/3.4/c-api/buffer.html

The native support in Cython is meant for cases where you know at least the
dimensionality, so that Cython can generate fast access code. If you know
that statically, things will become much easier with Cython's memory views,
without creating an external dependency (at least in Py2.6+).
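For the "does it provide a buffer, and of what kind" part of the question, a minimal sketch in plain Python 3 is to ask the built-in `memoryview` type, which fails with `TypeError` for objects that do not export a buffer. This is an illustration at the Python level rather than the Cython or C-API route described above, and `describe_buffer` is a hypothetical helper name. The format strings follow the `struct` module codes, e.g. `"l"` for a C long and `"d"` for a C double.

```python
from array import array


def describe_buffer(obj):
    """Return (ndim, format, shape) if obj exports a buffer, else None."""
    try:
        view = memoryview(obj)
    except TypeError:
        # obj does not support the buffer protocol
        return None
    with view:
        # read the metadata before the view is released
        return view.ndim, view.format, view.shape


print(describe_buffer(array("d", [1.0, 2.0])))
print(describe_buffer(array("l", [1, 2, 3])))
print(describe_buffer([1, 2, 3]))
```

No NumPy import is needed for this check; NumPy arrays export the same protocol, so they can be inspected the same way at run time.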

Stefan

Yury V. Zaytsev

Jun 3, 2013, 1:01:54 PM
to cython...@googlegroups.com
Hi Stefan,

Your message was extremely enlightening, thank you very much!

On Mon, 2013-06-03 at 18:33 +0200, Stefan Behnel wrote:

> Better use the native types here, i.e.

Oh, I did the opposite, because I thought that this was more portable
and using built-in types in the global scope as opposed to the types
defined in 'types' is bad style! I certainly very much need
compatibility with Python 3, this is one of the main goals of the
re-write in Cython.

Does this now make more sense to you?

    if isinstance(obj, bool):
        ...
    elif isinstance(obj, (int, long)):
        ...
    elif isinstance(obj, float):
        ...
    elif isinstance(obj, str):
        ...
    elif isinstance(obj, (tuple, list, types.XRangeType)):
        ...
    elif isinstance(obj, dict):
        ...
    else:
        raise NESTError("unknown Python type: {0}".format(type(obj)))

I'm worried about the XRangeType. I put it there in the first place,
because I don't want to convert *anything* iterable to a generic list,
but I'd rather like to convert things that expose buffer interface to a
more specialized container (vector of ints or doubles).

> The question is: what good is support for the interface if you know nothing
> about the layout of the data? If you want a completely generic mapping that
> handles anything from 1-N dimensional arrays of arbitrary size, you will
> have to implement that yourself, using the normal buffer interface C-API
> functions. It's somewhat complicated if you want to support it completely,
> though, because it supports several different memory layout schemes.
>
> http://docs.python.org/3.4/c-api/buffer.html
>
> The native support in Cython is meant for cases where you know at least the
> dimensionality, so that Cython can generate fast access code. If you know
> that statically, things will become much easier with Cython's memory views,
> without creating an external dependency (at least in Py2.6+).

I can totally restrict myself to 1-D vectors of longs and doubles, sorry
if this was not clear from my original post; I really don't need
anything more complicated than that.

I would assume users will create them with NumPy, and in my own code
internally, I can stick to array.arrays only.

However, in this conversion cascade that I've just posted, I need
somehow to figure out if I was given something that exposes a 1-D
buffer of longs or doubles, and what the length of this buffer is, so
that I can create an std::vector<long> or std::vector<double> out of it.

Stefan Behnel

Jun 3, 2013, 1:57:56 PM
to cython...@googlegroups.com
Yury V. Zaytsev, 03.06.2013 19:01:
> Your message was extremely enlightening, thank you very much!
>
> On Mon, 2013-06-03 at 18:33 +0200, Stefan Behnel wrote:
>> Better use the native types here, i.e.
>
> Oh, I did the opposite, because I thought that this was more portable
> and using built-in types in the global scope as opposed to the types
> defined in 'types' is bad style! I certainly very much need
> compatibility with Python 3, this is one of the main goals of the
> re-write in Cython.
>
> Does this now make more sense to you?
>
> if isinstance(obj, bool):
> ...
> elif isinstance(obj, (int, long)):
> ...
> elif isinstance(obj, float):
> ...
> elif isinstance(obj, str):

Note that "str" is "bytes" in Py2 and "unicode" in Py3. May or may not be
what you actually want. More likely, you'd want to handle both separately
and explicitly.
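A minimal sketch of "separately and explicitly", written for plain Python 3 where the text type is called `str` (in Cython code compiled for both major versions you would test for `unicode` instead). The helper name `to_byte_string` and the UTF-8 default are assumptions made for this example, not something the wrapped application prescribes.

```python
def to_byte_string(obj, encoding="utf-8"):
    """Return obj as bytes, encoding text explicitly."""
    if isinstance(obj, bytes):
        # already bytes: pass through untouched
        return obj
    if isinstance(obj, str):
        # text: encode with a named encoding instead of a platform default
        return obj.encode(encoding)
    raise TypeError("expected bytes or str, got {0}".format(type(obj).__name__))


print(to_byte_string(b"raw"))
print(to_byte_string("caf\u00e9"))
```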

> elif isinstance(obj, (tuple, list, types.XRangeType)):
> ...
> elif isinstance(obj, dict):
> ...
> else:
> raise NESTError("unknown Python type: {0}".format(type(obj)))
>
> I'm worried about the XRangeType

And rightly so, as it doesn't exist in Py3. You can use "xrange" instead,
though, that should work in both Py2 and Py3 when used in Cython compiled code.


> I put it there in the first place,
> because I don't want to convert *anything* iterable to a generic list,
> but I'd rather like to convert things that expose buffer interface to a
> more specialized container (vector of ints or doubles).

Sounds reasonable, although there are countless iterable types in Python.
Which ones of them are worth special casing depends entirely on your code
and the use cases you anticipate. Maybe you should just test for the buffer
interface earlier and make the iteration case a generic fallback.
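The ordering could look roughly like the following plain Python 3 sketch: scalar types first, then the buffer test, then iteration as the generic fallback. The function `classify`, the kind labels and the restriction to format codes `"l"` and `"d"` are assumptions for illustration; real code would build the C++ containers instead of returning Python lists.

```python
from array import array


def classify(obj):
    """Return a (kind, value) pair describing how obj would be converted."""
    if isinstance(obj, bool):            # bool first: it is a subclass of int
        return "bool", obj
    if isinstance(obj, int):
        return "int", obj
    if isinstance(obj, float):
        return "float", obj
    if isinstance(obj, (bytes, str)):    # before the buffer test: bytes is a buffer too
        return "string", obj
    if isinstance(obj, dict):
        return "dict", obj

    try:
        view = memoryview(obj)
    except TypeError:
        view = None                      # no buffer interface

    if view is not None:
        with view:
            ndim, fmt = view.ndim, view.format
            supported = ndim == 1 and fmt in ("l", "d")
            values = view.tolist() if supported else None
        if values is None:
            raise TypeError("unsupported buffer: ndim={0}, format={1!r}".format(ndim, fmt))
        return ("long_vector" if fmt == "l" else "double_vector"), values

    try:
        return "list", list(obj)         # generic iteration fallback
    except TypeError:
        raise TypeError("unknown Python type: {0}".format(type(obj)))


print(classify(array("d", [1.5, 2.5])))
print(classify(range(3)))
```

Testing for the buffer before iterating matters because buffer exporters such as `array.array` are also iterable and would otherwise end up in the generic list branch.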

What you do here is actually a pretty common thing for "generic" wrapper
code. Here's an example I've written:

https://github.com/scoder/lupa/blob/c7b505369463d989e6865601f602805d46a3578b/lupa/_lupa.pyx#L808


>> The question is: what good is support for the interface if you know nothing
>> about the layout of the data? If you want a completely generic mapping that
>> handles anything from 1-N dimensional arrays of arbitrary size, you will
>> have to implement that yourself, using the normal buffer interface C-API
>> functions. It's somewhat complicated if you want to support it completely,
>> though, because it supports several different memory layout schemes.
>>
>> http://docs.python.org/3.4/c-api/buffer.html
>>
>> The native support in Cython is meant for cases where you know at least the
>> dimensionality, so that Cython can generate fast access code. If you know
>> that statically, things will become much easier with Cython's memory views,
>> without creating an external dependency (at least in Py2.6+).
>
> I can totally restrict myself to 1-D vectors of longs and doubles, sorry
> if this was not clear from my original post; I really don't need
> anything more complicated than that.

In that case, memory views should work just fine for you.

http://docs.cython.org/src/userguide/memoryviews.html


> I would assume users will create them with NumPy, and in my own code
> internally, I can stick to array.arrays only.
>
> However, in this conversion cascade that I've just posted, I need
> somehow to figure out if I was given something that exposes an 1-D
> buffer of longs or doubles, and what is the length of this buffer, so
> that I can create an std::vector<long> or std::vector<double> out of it.

This might help, but again, it depends on your actual code:

http://docs.cython.org/src/userguide/fusedtypes.html

Stefan

Yury V. Zaytsev

Jun 4, 2013, 10:38:31 AM
to cython...@googlegroups.com
On Mon, 2013-06-03 at 19:57 +0200, Stefan Behnel wrote:
> What you do here is actually a pretty common thing for "generic"
> wrapper code. Here's an example I've written:
>
> https://github.com/scoder/lupa/blob/c7b505369463d989e6865601f602805d46a3578b/lupa/_lupa.pyx#L808

Hi Stefan,

Thanks again for your reply, I'm now studying the items that you have
pointed me to, but I have a short question with respect to this
particular routine:

    elif type(o) is float:
        lua.lua_pushnumber(L, <lua.lua_Number>cpython.float.PyFloat_AS_DOUBLE(o))
        pushed_values_count = 1
    ...
    elif isinstance(o, float):
        lua.lua_pushnumber(L, <lua.lua_Number><double>o)
        pushed_values_count = 1

Is this just an oversight, or is there some deep meaning to it that eludes me?

Stefan Behnel

Jun 4, 2013, 10:55:26 AM
to cython...@googlegroups.com
Yury V. Zaytsev, 04.06.2013 16:38:
> On Mon, 2013-06-03 at 19:57 +0200, Stefan Behnel wrote:
>> What you do here is actually a pretty common thing for "generic"
>> wrapper code. Here's an example I've written:
>>
>> https://github.com/scoder/lupa/blob/c7b505369463d989e6865601f602805d46a3578b/lupa/_lupa.pyx#L808
>
> Thanks again for your reply, I'm now studying the items that you have
> pointed me to, but I have a short question with respect to this
> particular routine:
>
>     elif type(o) is float:
>         lua.lua_pushnumber(L, <lua.lua_Number>cpython.float.PyFloat_AS_DOUBLE(o))
>         pushed_values_count = 1
>     ...
>     elif isinstance(o, float):
>         lua.lua_pushnumber(L, <lua.lua_Number><double>o)
>         pushed_values_count = 1
>
> Is this just an oversight, or there is some deep meaning to it that eludes me?

The builtin float type can be subtyped. However, it's very unlikely that
users do that. In order to handle the 99.9% case of exactly a float as fast
as possible, it's special cased above. The unlikely case of finding a
subtype is then handled later for correctness, after testing for further
more likely input types.
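The difference between the two tests can be shown in a few lines of plain Python 3. This is only a sketch of the ordering idea; `MyFloat` and `float_kind` are made-up names, and the speed argument applies to Cython-compiled code rather than to this interpreted version.

```python
class MyFloat(float):
    """A user-defined subtype of the builtin float."""


def float_kind(o):
    if type(o) is float:
        # exact builtin type: the common case, tested first
        return "exact float"
    if isinstance(o, float):
        # subtype: rare, handled later for correctness
        return "float subtype"
    return "not a float"


print(float_kind(1.2))
print(float_kind(MyFloat(1.2)))
print(float_kind(1))
```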

Stefan

Yury V. Zaytsev

Jun 4, 2013, 11:43:38 AM
to cython...@googlegroups.com
Hi Stefan,

On Mon, 2013-06-03 at 19:57 +0200, Stefan Behnel wrote:

> What you do here is actually a pretty common thing for "generic"
> wrapper code. Here's an example I've written:
>
> https://github.com/scoder/lupa/blob/c7b505369463d989e6865601f602805d46a3578b/lupa/_lupa.pyx#L808

Brilliant! Too bad that I've already re-invented much of the same wheel,
but I will use your code for inspiration from now on.

> > elif isinstance(obj, str):
>
> Note that "str" is "bytes" in Py2 and "unicode" in Py3. May or may not be
> what you actually want. More likely, you'd want to handle both separately
> and explicitly.

That's a really good catch! Actually, I'm not sure what's the best
practice here :-( So, sorry, more questions to follow:

The C++ application that I'm wrapping (and I believe this is also valid
for the Lua runtime) uses C++ strings as bytes internally and in theory
it is not supposed to do anything 'smart' about them (as in depending on
the locale/encoding), so all I want is for the users to be able to push
strings back and forth with minimum hassle...

I guess this can be expressed in form of the following rules:

- For Py2, I'd like them to be able to push both str() and unicode()
- For Py3, I guess I can also accept both bytes and unicode strings
- Probably in both cases, it's fine to always return unicode back

Are these the same semantics as what you have implemented for Lua or
not? Your code is as follows:

    elif isinstance(o, bytes): # Python -> Lua
        lua.lua_pushlstring(L, <char*>(<bytes>o), len(<bytes>o))
        pushed_values_count = 1
    elif isinstance(o, unicode) and runtime._encoding is not None:
        pushed_values_count = push_encoded_unicode_string(runtime, L, <unicode>o)

    elif lua_type == lua.LUA_TSTRING: # Lua -> Python
        s = lua.lua_tolstring(L, n, &size)
        if runtime._encoding is not None:
            return s[:size].decode(runtime._encoding)
        else:
            return s[:size]

And how do I handle bytes in my code? Would the following make sense? My
C++ constructor takes std::string.

    cdef object ret
    cdef string obj_str

    elif isinstance(obj, bytes): # Python -> NEST
        obj_str = obj
        ret = <Datum*> new StringDatum(obj_str)
    elif isinstance(obj, str):
        obj_str = obj.encode()
        ret = <Datum*> new StringDatum(obj_str)

    elif datum_type.compare("stringtype") == 0: # NEST -> Python
        ret = (<string> deref_str(<StringDatum*> dat)).decode()

> Sounds reasonable, although there are countless iterable types in Python.
> Which ones of them are worth special casing depends entirely on your code
> and the use cases you anticipate. Maybe you should just test for the buffer
> interface earlier and make the iteration case a generic fallback.
>
> > I can totally restrict myself to 1-D vectors of longs and doubles, sorry
> > if this was not clear from my original post; I really don't need
> > anything more complicated than that.
>
> In that case, memory views should work just fine for you.
>
> http://docs.cython.org/src/userguide/memoryviews.html

Right, I have read this page many times, and every time it's becoming a
bit more clear, but still I don't understand, how do I check whether the
supplied Python object exposes a buffer interface, and, if yes, then
what is the type and the dimensions of this buffer.

Do I understand correctly that you are implying that there is no nice
way to do this (like isinstance) and I should create functions like

    cdef Datum* long_vector_to_datum(long [:] obj):
    cdef Datum* double_vector_to_datum(double [:] obj):

and then

    if isinstance(obj, bool):
        ...
    else:
        try:
            long_vector_to_datum(obj)
        except:
            # doesn't provide this interface
            try:
                double_vector_to_datum(obj)
            except:
                # doesn't provide this interface
etc. ?

> This might help, but again, it depends on your actual code:
>
> http://docs.cython.org/src/userguide/fusedtypes.html

I've seen it, but I'll get back to it again, as soon as the previous
question is cleared up.

Many thanks!

Yury V. Zaytsev

Jun 4, 2013, 11:55:27 AM
to cython...@googlegroups.com
Hi Stefan,

On Tue, 2013-06-04 at 16:55 +0200, Stefan Behnel wrote:
>
> The builtin float type can be subtyped. However, it's very unlikely
> that users do that. In order to handle the 99.9% case of exactly a
> float as fast as possible, it's special cased above. The unlikely case
> of finding a subtype is then handled later for correctness, after
> testing for further more likely input types.

Thank you very much, I suspected something of this sort; when I was
googling for the best way to find out the type, it was always suggested
on StackOverflow to use isinstance() to take care of subclasses.

Just (hopefully) a final question: do you have numbers for type(o) is
foo vs. isinstance(o, foo) ? I have tried it, and I really can't see the
difference:

In []: x = 1.2

In []: %timeit "if type(x) is float: pass"
10000000 loops, best of 3: 20.8 ns per loop

In []: class MyFloat(float): pass

In []: y = MyFloat(1.2)

In []: %timeit "if isinstance(y, float): pass"
10000000 loops, best of 3: 20.9 ns per loop

However, you were probably basing your approach on some solid
benchmarks, so I must be doing it wrong. I'm just wondering what orders
of magnitude or ~ percents are we talking about...

Chris Barker - NOAA Federal

Jun 4, 2013, 12:06:52 PM
to cython...@googlegroups.com
On Tue, Jun 4, 2013 at 8:43 AM, Yury V. Zaytsev <yu...@shurup.com> wrote:

> - For Py2, I'd like them to be able to push both str() and unicode()
> - For Py3, I guess I can also accept both bytes and unicode strings

This is more for my clarification than anything else, but:

I _think_ that you can use "bytes" and "unicode" as types in Cython
code. In py2, an str is a bytes object, and in py3, a str is a unicode
object. But if you avoid using "str" for a type in Cython code, you
will avoid the confusion between the two python versions.

> And how do I handle bytes in my code? Would the following make sense? My
> C++ constructor takes std::string.
>
>     cdef object ret
>     cdef string obj_str
>
>     elif isinstance(obj, bytes): # Python -> NEST
>         obj_str = obj
>         ret = <Datum*> new StringDatum(obj_str)
>     elif isinstance(obj, str):
>         obj_str = obj.encode()
>         ret = <Datum*> new StringDatum(obj_str)

I think this is only going to work in py3, you can't encode a str in
py2. I think you can do:

    elif isinstance(obj, unicode):
        obj_str = obj.encode()
        ret = <Datum*> new StringDatum(obj_str)

and you're all set.

-Chris

--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception

Chris....@noaa.gov

Robert Bradshaw

Jun 4, 2013, 2:14:46 PM
to cython...@googlegroups.com
In both of your timings above, the runtime is dominated by a
dictionary lookup and function call. Cython recognizes both of these
patterns and will handle them much quicker. However, isinstance is
unlikely to be significantly slower as the first thing it checks
(inline) is whether the type matches exactly and branch prediction
will just do the right thing.
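As a rough sketch of how the two checks could be compared from plain Python 3, the standard `timeit` module takes the statement as source code. Note that in the IPython session quoted in this thread the statements were additionally wrapped in quotes, so `%timeit` would have evaluated a string literal rather than the check itself, which may explain the near-identical numbers. The names `namespace` and `time_statement` are made up for this example, and the absolute timings depend on the machine and say nothing about Cython-compiled code.

```python
import timeit


class MyFloat(float):
    pass


# Names visible to the timed statements.
namespace = {"x": 1.2, "y": MyFloat(1.2)}


def time_statement(statement, number=100000):
    """Return the total seconds taken to run `statement` `number` times."""
    return timeit.timeit(statement, globals=namespace, number=number)


exact_check = time_statement("type(x) is float")
subtype_check = time_statement("isinstance(y, float)")
# For comparison: a quoted statement is just a string constant.
quoted_string = time_statement("'if type(x) is float: pass'")

print("type(x) is float     :", exact_check)
print("isinstance(y, float) :", subtype_check)
print("string literal only  :", quoted_string)
```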

- Robert

Stefan Behnel

Jun 4, 2013, 3:16:46 PM
to cython...@googlegroups.com
Robert Bradshaw, 04.06.2013 20:14:
That's in Py2. In Py3, the PyXyz_Check() functions use a bit test on the
type flags, which directly includes the subtype check. So they are plenty
fast even for subtypes. Still, "type(x) is T" is a pointer comparison in
Cython for known types, which is fast in all Python versions. And I do two
of those in a row, so, in many cases, it's just comparing one address to
two different values and branching on the result.

The main reason why I special case the type test, however, is because
subtypes may need a more involved conversion than the known exact builtin
type. For the latter, a simple pointer deref is enough to read the double
value, whereas a subtype may involve more than one C function call.

Whether all of this hassle is worth it for someone else's use case is
certainly up to profiling.

Stefan

Stefan Behnel

Jun 4, 2013, 3:36:20 PM
to cython...@googlegroups.com
Chris Barker - NOAA Federal, 04.06.2013 18:06:
> On Tue, Jun 4, 2013 at 8:43 AM, Yury V. Zaytsev wrote:
>> - For Py2, I'd like them to be able to push both str() and unicode()
>> - For Py3, I guess I can also accept both bytes and unicode strings
>
> This is more for my clarification than anything else, but:
>
> I _think_ that you can use "bytes" and "unicode" as types in Cython
> code. In py2, an str is a bytes object, and in py3, a str is a unicode
> object. But if you avoid using "str" for a type in Cython code, you
> will avoid the confusion between the two python versions.

Correct.


>> And how do I handle bytes in my code? Would the following make sense? My
>> C++ constructor takes std::string.
>>
>>     cdef object ret
>>     cdef string obj_str
>>
>>     elif isinstance(obj, bytes): # Python -> NEST
>>         obj_str = obj
>>         ret = <Datum*> new StringDatum(obj_str)
>>     elif isinstance(obj, str):
>>         obj_str = obj.encode()
>>         ret = <Datum*> new StringDatum(obj_str)
>
> I think this is only going to work in py3, you can't encode a str in
> py2.

Well, the code will also run in Py2, but it won't do what you want. First
of all, bytes *is* str in Py2, so the second branch is dead code. Even if
it was taken, the next line (obj.encode()) is useless. It will first decode
the (byte) string with the default encoding (which is platform specific and
may fail), and then encode the string with the encoding you specify (which,
in the case above, is also the default encoding). So you'd get either an
exception (a UnicodeDecodeError for calling encode()!) or your original
byte string back.

Welcome to the wonderful world of Python 2.


> I think you can do:
>
> elif isinstance(obj, unicode):
> obj_str = obj.encode()

Passing an appropriate encoding name, of course, most likely "utf8" IIUC.


> ret = <Datum*> new StringDatum(obj_str)
>
> and you're all set.

Stefan

Alex Leach

Jun 4, 2013, 3:48:26 PM
to cython...@googlegroups.com
I was attracted to this discussion due to the title more than the
content.. I spent some time trying to expose the C++ std::ios library to
Python, and have been wondering ever since what Cython might do with
iostreams, instead.

I had another look at Cython docs the other day, and it seems that C++
support is much closer to completion now. So I'm curious, does Cython do
anything special with classes derived from std::ios_base, or classes that
provide either the >> or << operators?

Having spent some time looking at the Python buffer interface, it seems at
first glance incompatible with C++ iostreams. The reason being, is that
Py_buffer's require a void* buffer, but in C++, that datatype is either
private or protected. So it seemed at the time that the only option was to
create an extra, temporary buffer and copy its available content, when I
wish I wouldn't need to make that extra copy.

Have any of you given this thought? I would love to read some more
discussion on the topic..

Cheers,
Alex

Nikita Nemkin

Jun 5, 2013, 3:59:22 AM
to cython...@googlegroups.com
On Wed, 05 Jun 2013 01:48:26 +0600, Alex Leach <beame...@gmail.com>
wrote:
Why would you want to expose iostreams to Python?
They are relatively slow, don't offer any unique functionality,
and aren't particularly popular.

Python has its own io library and conventions that Python users know
and expect to be followed.

It really sounds like a problem better solved by other means.

Best regards,
Nikita Nemkin

Alex Leach

Jun 5, 2013, 6:19:42 AM
to cython...@googlegroups.com, Nikita Nemkin
Hi,

Thanks for the response.

On Wed, 05 Jun 2013 08:59:22 +0100, Nikita Nemkin <nik...@nemkin.ru> wrote:

> Why would you want to expose iostreams to Python?

It's the serialisation features I'm most interested in. The implementation
is application-specific, but the ability to serialise a type into a
specific format, with format flag operators, is a standards-defined,
extensible feature of C++. In the C++ application I'm using, there are
format flags for serialising objects into XML, JSON, plain text or binary
forms, allowing, for example:

    FooSerialObj m_obj = FooSerialObj("bar");
    std::cout << eFormat_Xml << m_obj;

I was hoping to expose this same functionality to Python, in as efficient
a manner as possible. The Python equivalent of an IOstream is obviously a
file-like object, so I was hoping I could expose to Python whatever
format-flags are available, and use them when reading from or writing to
file-like objects. e.g.

    >>> from my_ext import FooSerialObj
    >>> m_obj = FooSerialObj("bar")
    >>> print( m_obj.format('xml') )
    ...

> They are relatively slow, don't offer any unique functionality,
> and aren't particularly popular.
>
> Python has its own io library and conventions that Python users know
> and expect to be followed.
>
> It really sounds like a problem better solved by other means.

Thanks for the honesty. I agree with you on this, actually, but as the
serialisation code has already been implemented in C++, I'd like to
minimise the need for Python code that replicates the same functionality.

Perhaps I've been thinking about this the wrong way around, though -
forgive me, I'm pretty new to the C++ STL - it's FooSerialObj that needs
to be serialised, not the iostream.

But the implementation is provided by overloading std:: stream operators,
so I thought that an iostream's internal 'streambuf' would be the best
place to get serialised object data, for Python. I hoped that I could map
std::streambuf to a Py_buffer object, thereby exposing std::streambuf's
data to Python, without needing an intermediary char buffer.

But as I mentioned before, the Py_buffer interface requires `void *`
pointers, but std::streambuf keeps them either private or protected.

Forgetting for now the sheer amount of code that would be needed (believe
me, I know it's a lot!), is there a way you can think of to provide
Python with buffered access to a formatted C++ stream? Copying the entire
stream into a 'void*' buffer in memory is out of the question, due to the
potential for it to max out system memory. But it looked to me that that's
what the Py_buffer API would require.

In terms of Python's own code conventions, I've already done the work to
add Python's Sequence and Mapping Object Protocols to iostreams[1], but
adding the buffer protocol did not go according to plan.. The final
features I would have wanted to add would have been (obviously 'format',
as well as) '__enter__', '__exit__', 'read(into)' and 'write(into)'
methods, but a buffered interface is basically what I started it for...

Any suggestions or ideas on how to fulfill the requirements of the
Py_buffer API in this regard, would be greatly appreciated!

Kind regards,
Alex


[1]: https://github.com/alexleach/bp_helpers/blob/master/IOSTREAM.md


ps. Please don't hate me for using Boost Python instead of Cython...

Nikita Nemkin

Jun 5, 2013, 10:17:25 AM
to cython...@googlegroups.com, Alex Leach
On Wed, 05 Jun 2013 16:19:42 +0600, Alex Leach <beame...@gmail.com>
wrote:
You should create an ostream that wraps Python file object.
The way to do it is to implement a custom streambuf:

    #include <cstring>    // memmove
    #include <ostream>
    #include <streambuf>
    #include <boost/python.hpp>

    namespace py = boost::python;

    // minimal streambuf implementation that uses python bytes object
    // for its buffer and flushes to the provided python file object.
    struct py_streambuf : std::streambuf
    {
        py::object write_func_;
        py::str buffer_;

        py_streambuf(py::object py_file, size_t bufsize=512)
            : write_func_(py_file.attr("write")),
              buffer_(py::str(NULL, bufsize))
        {
            // the storage of the str object doubles as the put area
            const char* data = py::extract<const char*>(buffer_);
            char* buffer = const_cast<char*>(data);
            setp(buffer, buffer + bufsize);
        }

        int_type overflow(int_type ch)
        {
            sync(); // flush
            if (ch != traits_type::eof()) {
                *pptr() = ch;
                pbump(1);
                return ch;
            }
            return 0; // generic success
        }

        int sync() {
            py::str buffer;
            std::ptrdiff_t n = pptr() - pbase();
            if (pptr() == epptr()) {
                buffer = buffer_;
            } else {
                // partial buffer flush
                buffer = buffer_.slice(py::_, n);
            }
            // write() returns None in Py2, so keep the result generic
            py::object written = write_func_(buffer);
            if (!written.is_none()) {
                // Py3 streams return the number of bytes written.
                // For raw streams, this number may be less than requested.
                n = py::extract<size_t>(written);
                if (n < pptr() - pbase())
                    memmove(pbase(), pbase() + n, pptr() - pbase() - n);
            }
            pbump(-n); // rewind
            return 0; // generic success
        }
    };

    // ostream that wraps python file object (anything with a write method)
    struct py_ostream : std::ostream {
        py_streambuf streambuf_;

        py_ostream(py::object f) : std::ostream(&streambuf_), streambuf_(f) {}
    };

    // Implementation of FooSerialObj.tofile(f, format)
    // ("this" is a C++ keyword, so the object parameter is named "self")
    void FooSerialObj_tofile(FooSerialObj* self, py::object f, py::str format) {
        py_ostream stream(f);

        if (format == "xml")
            stream << eFormat_Xml;
        else if (format == "text")
            stream << eFormat_Text;
        else ...

        stream << *self; // voila
    }

(Warning: this code is very rough, just to give you an idea. I don't
normally use boost.python and this is my first time writing a custom
streambuf too.)

So, whenever your C++ object wants an ostream, you give it py_ostream
(which is just as good as any other C++ ostream).
Python side only deals with its native file objects and never sees any
C++ specifics.

Partial stream flushes still require one copy, but they should be rare.
Common case does not copy anything.
(Except, standard file objects are buffered, so there would be double
buffering in any case.)
Py2 could be further optimized to use direct FILE* access when available.
Py3 provides no such thing.

Similar code could be written in Cython, probably with some hacks around
limited C++ support.


Best regards,
Nikita Nemkin

Yury V. Zaytsev

Jun 5, 2013, 11:10:21 AM
to cython...@googlegroups.com
Hi Stefan and Chris,

Thank you very much once again for your clarifications regarding string
conversion and type checks! I think I get it now.

Still, I have the original question open re. how to check in this
cascade whether the Python object actually exposes a buffer interface
and what kind of buffer it is.

Could you please comment on the text below, whenever you get a chance?

On Tue, 2013-06-04 at 17:43 +0200, Yury V. Zaytsev wrote:

> Right, I have read this page many times, and every time it's becoming
> a bit more clear, but still I don't understand, how do I check whether
> the supplied Python object exposes a buffer interface, and, if yes,
> then what is the type and the dimensions of this buffer.
>
> Do I understand correctly that you are implying that there is no nice
> way to do this (like isinstance) and I should create functions like
>
>     cdef Datum* long_vector_to_datum(long [:] obj):
>     cdef Datum* double_vector_to_datum(double [:] obj):
>
> and then
>
>     if isinstance(obj, bool):
>         ...
>     else:
>         try:
>             long_vector_to_datum(obj)
>         except:
>             # doesn't provide this interface
>             try:
>                 double_vector_to_datum(obj)
>             except:
>                 # doesn't provide this interface
>
> etc. ?

Alex Leach

Jun 6, 2013, 10:18:31 AM
to cython...@googlegroups.com, Nikita Nemkin
Nikita,

Before actually getting around to using your code and ideas, I wanted to
thank you for the time and effort you put into the last email. Really
appreciated!

On Wed, 05 Jun 2013 15:17:25 +0100, Nikita Nemkin <nik...@nemkin.ru> wrote:
>
> (Warning: this code is very rough, just to give you an idea. I don't
> normally use boost.python and this is my first time writing a custom
> streambuf too.)
>

In such a case, that was a damn impressive effort, especially in such a
short space of time!


> Partial stream flushes still require one copy, but they should be rare.
> Common case does not copy anything.

Excellent. That's exactly what I was hoping for!

> (Except, standard file objects are buffered, so there would be double
> buffering in any case.)

In such a Python-C++ extension, would calling
`::sync_with_stdio(false)`[1] be appropriate in any / all ios_base
instances? Looks to me like I would want to call it each time such an
instance is exposed to Python..

[1]: http://www.cplusplus.com/reference/ios/ios_base/sync_with_stdio/

> Py2 could be further optimized to use direct FILE* access when available.
> Py3 provides no such thing.

I don't remember seeing FILE* objects available anywhere in the Python
buffer API, nor C++ iostream interfaces, so hadn't really considered using
them. I was trying more to create a class template that could expose any
class derived from ios_base, which includes all the std:: fstream classes.
(Unfortunately, I hit a major stumbling block with C++ class constructors;
attempts to replicate bp's init<> functionality on class_ instances were
unsuccessful, in the time I had available.)

>
> Similar code could be written in Cython, probably with some hacks around
> limited C++ support.
>

No doubt! I bet it could be done better, too. I started to use the Python
C-API about as much as the bp API, found it a lot simpler to use,
understand and a lot faster to compile. Also, much smaller binaries are
produced, especially when not stripped! I imagine all Cython-generated C
code could be considered an improvement, in some of those respects.

Thanks again for your help. I'm in the middle of something else atm, but
will hopefully have time to fix up those parts of the code in coming weeks
/ months.

Kind regards,
Alex


