Could we generate .pxd files from .pyx files?

Matthew Honnibal

Jul 17, 2015, 9:25:31 PM

to cython...@googlegroups.com

Hi,

Currently my main code-base contains 4,820 lines in .pyx files, and 1,270 lines in .pxd files. Most of the information in the .pxd files is redundant, and maintaining them is frequently painful.

It's very common for me to change a small type declaration, e.g. to add or remove an except value or a nogil, and have to make the change in two places. Adding or removing an argument from a function signature is also very common. If I move a helper class to its own file, suddenly I have to create a new file, and delete the attribute type declarations from the original. And inlining things is really bad — then the implementation lives in the .pxd files, which will surely surprise me later. I really dislike using inline for this reason. Maybe after trying to inline the function, I'll see it made no difference to performance, so I'll remove it. Then I have to move the whole function back to the .pyx file --- except the declaration in the .pxd file.

Is there any reason we couldn't be generating the .pxd files, given some annotations in the .pyx declarations? Even if it meant always exporting everything, I would much prefer that to the current system. If a class is in the .pxd at all, every C attribute must be exported anyway, so I don't get much value out of choosing what's in the .pxd at the moment. Probably a better solution, though, would be to export names that don't start with an underscore. This is what Python does. Decorators could also be used, e.g. to specify public/readonly.
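
To make the underscore idea concrete, here is a rough Python sketch of such a generator. It is not an existing Cython feature: the regular expression and the name `sketch_pxd` are invented for illustration, and it assumes single-line, top-level cdef function headers only. Anything beyond that would need a real parser.

```python
import re

# Hypothetical toy pattern, not Cython's parser: it only recognises single-line,
# top-level headers such as "cdef int add(int a, int b) nogil:".
# cdef classes, their methods, cpdef functions and multi-line signatures are ignored.
_CDEF_FUNC = re.compile(
    r"^cdef(?![ \t]+class\b)[ \t]+"
    r"(?P<sig>[^=\n]*?\b(?P<name>\w+)[ \t]*\([^)\n]*\)[^:\n]*):[ \t]*$",
    re.MULTILINE,
)


def sketch_pxd(pyx_source: str) -> str:
    """Return .pxd-style declarations for cdef functions without a leading underscore."""
    declarations = []
    for match in _CDEF_FUNC.finditer(pyx_source):
        if match.group("name").startswith("_"):
            continue  # leading underscore means "private", following Python's convention
        declarations.append("cdef " + match.group("sig").strip())
    return "\n".join(declarations)


example_pyx = """\
cdef int add(int a, int b) nogil:
    return a + b

cdef int _helper(int a) except -1:
    return a

cdef class Parser(Base):
    cdef int depth
"""

print(sketch_pxd(example_pyx))  # prints: cdef int add(int a, int b) nogil
```

Only `add` is exported here; `_helper` is skipped because of its underscore, and the class is skipped because the sketch does not try to handle struct layouts.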

I've been tempted at various points to hack something together along these lines, since I'm pretty sure something like this could at least meet my own needs. But I thought I'd check whether there was a dimension to the problem that I was missing; some reason why this is harder than it seems.

Thanks,
Matthew Honnibal

http://spacy.io

Stefan Behnel

Jul 18, 2015, 2:15:04 AM

to cython...@googlegroups.com

Matthew Honnibal wrote on 18.07.2015 at 03:25:
> Currently my main code-base contains 4,820 lines in .pyx files, and 1,270
> lines in .pxd files. Most of the information in the .pxd files is
> redundant, and maintaining them is frequently painful.
>
> It's very common for me to change a small type declaration, e.g. to add or
> remove an except value or a nogil, and have to make the change in two
> places.

At least for some changes, making them in the .pxd file should be enough,
as long as both files don't contain *contradicting* declarations.

> Adding or removing an argument from a function signature is also
> very common. If I move a helper class to its own file, suddenly I have to
> create a new file, and delete the attribute type declarations from the
> original. And inlining things is really bad — then the implementation lives
> in the .pxd files, which will surely surprise me later. I really dislike
> using inline for this reason. Maybe after trying to inline the function,
> I'll see it made no difference to performance, so I'll remove it. Then I
> have to move the whole function back to the .pyx file --- except the
> declaration in the .pxd file.

For trial and error, you can rename the function, e.g. prefix it with an
underscore. Saves typing and copying. And going back to a previous version
should just mean backing out a single change from VCS or so, rather than
manually undoing changes.

> Is there any reason we couldn't be generating the .pxd files, given some
> annotations in the .pyx declarations?

IIRC, there is support for cimporting also directly from .pyx files. See
the "cimport_from_pyx.srctree" test. Don't know if there are limitations to
what it can do. There is certainly the limitation that you cannot limit the
public parts of an implementation, so you'd have to take care of not
writing spaghetti code yourself. And .pyx files also tend to be
substantially larger than .pxd files, so compilation would take longer.
(Not that it matters, given that there's still a slowish C compiler coming
after us anyway...)

> Even if it meant always exporting
> everything, I would much prefer that to the current system. If a class is
> in the .pxd at all, every C attribute must be exported anyway

Yes, because Cython has to know about the object struct layout in order to
handle it correctly. Especially subclassing would otherwise be impossible.

>, so I don't
> get much value out of choosing what's in the .pxd at the moment. Probably a
> better solution, though, would be to export names that don't start with an
> underscore. This is what Python does. Decorators could also be used, e.g.
> to specify public/readonly.

Sounds like a reasonable restriction.

> I've been tempted at various points to hack something together along these
> lines, since I'm pretty sure something like this could at least meet my own
> needs. But I thought I'd check whether there was a dimension to the problem
> that I was missing; some reason why this is harder than it seems.

Shouldn't be. I'm not a big fan of this myself, but given that there's
support for this already, I wouldn't object to extending it a bit more if
there is a need for improvement.

Stefan

Robert Bradshaw

Jul 18, 2015, 3:10:04 AM

to cython...@googlegroups.com

On Fri, Jul 17, 2015 at 11:15 PM, Stefan Behnel <stef...@behnel.de> wrote:
> Matthew Honnibal wrote on 18.07.2015 at 03:25:
>> Currently my main code-base contains 4,820 lines in .pyx files, and 1,270
>> lines in .pxd files. Most of the information in the .pxd files is
>> redundant, and maintaining them is frequently painful.
>>
>> It's very common for me to change a small type declaration, e.g. to add or
>> remove an except value or a nogil, and have to make the change in two
>> places.
>
> At least for some changes, making them in the .pxd file should be enough,
> as long as both files don't contain *contradicting* declarations.
>
>> Adding or removing an argument from a function signature is also
>> very common. If I move a helper class to its own file, suddenly I have to
>> create a new file, and delete the attribute type declarations from the
>> original. And inlining things is really bad — then the implementation lives
>> in the .pxd files, which will surely surprise me later. I really dislike
>> using inline for this reason. Maybe after trying to inline the function,
>> I'll see it made no difference to performance, so I'll remove it. Then I
>> have to move the whole function back to the .pyx file --- except the
>> declaration in the .pxd file.
>
> For trial and error, you can rename the function, e.g. prefix it with an
> underscore. Saves typing and copying. And going back to a previous version
> should just mean backing out a single change from VCS or so, rather than
> manually undoing changes.

Note also that inline functions inherently must live in the pxd file, as
(other than in the case below) only the pxd files are inspected during
compilation.

>> Is there any reason we couldn't be generating the .pxd files, given some
>> annotations in the .pyx declarations?
>
> IIRC, there is support for cimporting also directly from .pyx files. See
> the "cimport_from_pyx.srctree" test. Don't know if there are limitations to
> what it can do.

This redundancy of pyx and pxd files, and having to have two files
around, is exactly why cimport-from-pyx was added. The only limitation
is not distinguishing between implementation and public interface
(though one may argue that such distinctions aren't very Pythonic
anyways... using conventions like underscore names is more standard).
This is essentially the same restriction auto-generated pxd files
would have as well.

> There is certainly the limitation that you cannot limit the
> public parts of an implementation, so you'd have to take care of not
> writing spaghetti code yourself. And .pyx files also tend to be
> substantially larger than .pxd files, so compilation would take longer.
> (Not that it matters, given that there's still a slowish C compiler coming
> after us anyway...)

Even for pxd files, parsing is expensive, so I've toyed with the idea
of caching a parsed version of the source anyways...

>> Even if it meant always exporting
>> everything, I would much prefer that to the current system. If a class is
>> in the .pxd at all, every C attribute must be exported anyway
>
> Yes, because Cython has to know about the object struct layout in order to
> handle it correctly. Especially subclassing would otherwise be impossible.
>
>
>>, so I don't
>> get much value out of choosing what's in the .pxd at the moment. Probably a
>> better solution, though, would be to export names that don't start with an
>> underscore. This is what Python does. Decorators could also be used, e.g.
>> to specify public/readonly.
>
> Sounds like a reasonable restriction.
>
>
>> I've been tempted at various points to hack something together along these
>> lines, since I'm pretty sure something like this could at least meet my own
>> needs. But I thought I'd check whether there was a dimension to the problem
>> that I was missing; some reason why this is harder than it seems.
>
> Shouldn't be. I'm not a big fan of this myself, but given that there's
> support for this already, I wouldn't object to extending it a bit more if
> there is a need for improvement.
>
> Stefan

Robert Bradshaw

Jul 18, 2015, 3:45:13 AM

to cython...@googlegroups.com

Also, a .pxd file without a corresponding .pyx file does not generate
an actual module (useful for C library declarations, or decorating
existing modules). But having to have a .pyx and .pxd file is often
cumbersome and confusing.

Matthew Honnibal

Jul 18, 2015, 4:31:27 AM

to cython...@googlegroups.com, stef...@behnel.de

> At least for some changes, making them in the .pxd file should be enough,
> as long as both files don't contain *contradicting* declarations.

<snip>

> For trial and error, you can rename the function, e.g. prefix it with an
> underscore. Saves typing and copying. And going back to a previous version
> should just mean backing out a single change from VCS or so, rather than
> manually undoing changes.

Thanks for the suggestions. I have to say that temporarily renaming the function isn't very appealing --- and if I'm going to have the code in two places, I really do want them to match exactly. I agree that there are various practical things that can reduce the pain. There always are. Do you agree the current situation isn't ideal? For me it's very high on the list of language warts / pain points.
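
If the two copies do have to match exactly, a small script can at least flag where they have drifted. The following is a rough Python sketch, not part of Cython: the regular expression and the function names `signatures` and `mismatches` are invented for illustration, and it assumes single-line, top-level cdef function headers. It compares text, so it is stricter than Cython's own compatibility check.

```python
import re

# Assumption: every declaration of interest is a single-line, top-level
# "cdef <type> name(args) [modifiers]" header, with a trailing ":" in the .pyx.
_HEADER = re.compile(
    r"^cdef(?![ \t]+class\b)[ \t]+"
    r"([^=\n]*?\b(\w+)[ \t]*\([^)\n]*\)[^:\n]*?)[ \t]*:?[ \t]*$",
    re.MULTILINE,
)


def signatures(source):
    """Map each function name to its whitespace-normalised signature text."""
    return {m.group(2): " ".join(m.group(1).split()) for m in _HEADER.finditer(source)}


def mismatches(pyx_source, pxd_source):
    """Return names declared in both sources whose signature text differs."""
    pyx, pxd = signatures(pyx_source), signatures(pxd_source)
    return sorted(name for name in pxd if name in pyx and pyx[name] != pxd[name])


pyx_text = (
    "cdef int add(int a, int b) nogil:\n"
    "    return a + b\n"
    "\n"
    "cdef int scale(int a) except -1:\n"
    "    return 2 * a\n"
)
pxd_text = (
    "cdef int add(int a, int b) nogil\n"
    "cdef int scale(int a)\n"
)

print(mismatches(pyx_text, pxd_text))  # prints: ['scale']
```

Here `scale` is reported because the except value was added in the .pyx but not in the .pxd, which is exactly the kind of edit that has to be made twice.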

>> Is there any reason we couldn't be generating the .pxd files, given some
>> annotations in the .pyx declarations?
>
> IIRC, there is support for cimporting also directly from .pyx files. See
> the "cimport_from_pyx.srctree" test.

That would work too!

> Don't know if there are limitations to
> what it can do. There is certainly the limitation that you cannot limit the
> public parts of an implementation, so you'd have to take care of not
> writing spaghetti code yourself. And .pyx files also tend to be
> substantially larger than .pxd files, so compilation would take longer.
> (Not that it matters, given that there's still a slowish C compiler coming
> after us anyway...)

Well, I can see a few reasons why it might be hard. But at this point I'm agnostic about which solution to pursue. I just want to maintain 20% less code. I'll investigate the cimport_from_pyx.

> Shouldn't be. I'm not a big fan of this myself, but given that there's
> support for this already, I wouldn't object to extending it a bit more if
> there is a need for improvement.

Hmm. Do your modules mostly not have C-level APIs? I'm curious as to why the current set up feels quite bad to me, but you'd rather nothing changed.

Stefan Behnel

Jul 18, 2015, 4:48:35 AM

to cython...@googlegroups.com

Matthew Honnibal wrote on 18.07.2015 at 10:31:
>> I'm not a big fan of this myself, but given that there's
>> support for this already, I wouldn't object to extending it a bit more if
>> there is a need for improvement.
>
> Hmm. Do your modules mostly not have C-level APIs? I'm curious as to why
> the current set up feels quite bad to me, but you'd rather nothing changed.

The modules tend to be large and their C-API tends to be fairly small. In
fact, most of the implementation is entirely internal, including many of
the cdef classes, with public spots only here and there. That's why it's
important for me to be explicit about what's really meant to be public, so
that users don't start to depend on implementation details. Exposing those
is not great in Python either, but it can really harm future development
when people start depending on them at the C/Cython level.

Stefan

Matthew Honnibal

Jul 18, 2015, 5:33:01 AM

to cython...@googlegroups.com, stef...@behnel.de

>> Hmm. Do your modules mostly not have C-level APIs? I'm curious as to why
>> the current set up feels quite bad to me, but you'd rather nothing changed.
>
> The modules tend to be large and their C-API tends to be fairly small. In
> fact, most of the implementation is entirely internal, including many of
> the cdef classes, with public spots only here and there. That's why it's
> important for me to be explicit about what's really meant to be public, so
> that users don't start to depend on implementation details. Exposing those
> is not great in Python either, but it can really harm future development
> when people start depending on them at the C/Cython level.

Looking through the lxml codebase, I think it seems like we prefer different styles. The lxml.etree.pyx module is ~3500 lines. I note that the Cython parts of pandas seem to work the same way.

I can see the technical advantage to that, since there's much greater clarity about what's really external and what's really internal. But for my own code I definitely prefer to have much smaller modules. I don't go as far as to insist that each class have its own file, but my files tend to be about 100-500 lines long. I find it easier to navigate the code and name things if I have smaller files. Often I want to export an interface I'll reuse heavily within a package, even if I don't document it for users. That part of the code will have its own private implementation details. I use the file structure to help me keep these distinctions straight.

If I really have a minority use-case, and most people don't like to do it this way, then maybe I really should just hack up something to satisfy myself (if for some reason I don't like the existing cimport_from_pyx thing). But, I think the way I'm structuring my code is fairly popular across a number of languages, so my hunch is that I'm not alone.

Stefan Behnel

Jul 18, 2015, 7:06:10 AM

to cython...@googlegroups.com

Matthew Honnibal wrote on 18.07.2015 at 11:33:
>>> Hmm. Do your modules mostly not have C-level APIs? I'm curious as to why
>>> the current set up feels quite bad to me, but you'd rather nothing
>>> changed.
>>
>> The modules tend to be large and their C-API tends to be fairly small. In
>> fact, most of the implementation is entirely internal, including many of
>> the cdef classes, with public spots only here and there. That's why it's
>> important for me to be explicit about what's really meant to be public, so
>> that users don't start to depend on implementation details. Exposing those
>> is not great in Python either, but it can really harm future development
>> when people start depending on them at the C/Cython level.
>
> Looking through the lxml codebase, I think it seems like we prefer
> different styles. The lxml.etree.pyx module is ~3500 lines. I note that the
> Cython parts of pandas seem to work the same way.

It's more like 18K lines. There are literal includes for .pxi files that
contribute to the module as well.

That wouldn't be strictly necessary anymore; it just grew that way (lxml
actually predates Cython). If I had to redo the source layout, I'd also
split some minor parts of the functionality into separate modules that
contribute to a façade package or Python module via selective imports. And
then there would be a utility helper .pxd file with inline functions that
the others could use.

But that shouldn't change the size of the overall public API. Most of the
code would still live in the main module and it would be the only one that
exports its little public C-API. All the other modules would just provide
essentially one extension type and not expose it at the C level at all.

The main drawback of multiple modules is that it complicates static linking
against the external C libraries, if users want or need to do that.

> I can see the technical advantage to that, since there's much greater
> clarity about what's really external and what's really internal. But for my
> own code I definitely prefer to have much smaller modules. I don't go as
> far as to insist that each class have its own file, but my files tend to be
> about 100-500 lines long. I find it easier to navigate the code and name
> things if I have smaller files. Often I want to export an interface I'll
> reuse heavily within a package, even if I don't document it for users. That
> part of the code will have its own private implementation details. I use
> the file structure to help me keep these distinctions straight.

My guess is that there's also a given domain model involved on your side,
which would then suggest a certain code split.

> If I really have a minority use-case, and most people don't like to do it
> this way, then maybe I really should just hack up something to satisfy
> myself (if for some reason I don't like the existing cimport_from_pyx
> thing). But, I think the way I'm structuring my code is fairly popular
> across a number of languages, so my hunch is that I'm not alone.

Just try the direct from-pyx cimport and report back. It's been there for a
while and certainly not been heavily used, so I'd guess that there are
minor issues with it due to some interaction with more recent feature
additions etc. If so, finding and eventually fixing them would be appreciated.

Stefan

Matthew Honnibal

Jan 24, 2016, 12:03:19 PM

to cython-users, stef...@behnel.de

> Just try the direct from-pyx cimport and report back. It's been there for a
> while and certainly not been heavily used, so I'd guess that there are
> minor issues with it due to some interaction with more recent feature
> additions etc. If so, finding and eventually fixing them would be appreciated.

Finally got around to trying this.

Overall it works very well. A small problem is that I couldn't figure out how to get the command-line cython tool to use it. This was a bit of a pain for my regular workflow, since I like using the -a mode a lot.

I'd be reluctant to ship libraries that rely on this, though. It requires users to do extra configuration before using the Cython API of the library, and it seems like another thing that could go wrong. I'd also worry that it's an undocumented feature that not many people are using.
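
For reference, the extra configuration is, as far as I can tell, a module-level switch in `Cython.Compiler.Options`, which is what the cimport_from_pyx.srctree test sets in its setup.py. A minimal sketch, assuming the option is still called `cimport_from_pyx` in your Cython version; the helper name `enable_cimport_from_pyx` is made up for illustration.

```python
def enable_cimport_from_pyx() -> bool:
    """Turn on cimport-from-.pyx for this process; return False if Cython is missing."""
    try:
        from Cython.Compiler import Options
    except ImportError:
        return False
    # Module-level compiler option: it has to be set before any compilation starts.
    Options.cimport_from_pyx = True
    return True


enabled = enable_cimport_from_pyx()
```

In a setup.py this would be called before `cythonize(...)`. Because it is set from Python code, it does not by itself help with the command-line cython tool.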
