Guidelines for reducing size of Cython-generated C++ code?

Antoine Pitrou

unread,

Feb 20, 2024, 1:43:51 PMFeb 20

to cython...@googlegroups.com

Hello,

We use Cython for generating the PyArrow bindings to Arrow C++ (see
https://arrow.apache.org/docs/python/).

While reading through the logs of one our CI builds (on Windows, but I
don't think that's factor), I noticed the following messages:

```
lib.cpp
C:\Python312\Lib\site-packages\numpy\_core\include\numpy\ndarraytypes.h(1250,38): warning C4200: nonstandard extension used: zero-sized array in struct/union [C:\arrow\python\build\temp.win-amd64-cpython-312\lib.vcxproj]
C:\Python312\Lib\site-packages\numpy\_core\include\numpy\ndarraytypes.h(1250,38): message : This member will be ignored by a defaulted constructor or copy/move assignment operator [C:\arrow\python\build\temp.win-amd64-cpython-312\lib.vcxproj]
C:\arrow\python\build\temp.win-amd64-cpython-312\lib.cpp(335145,17): warning C4244: '=': conversion from 'Py_ssize_t' to 'long', possible loss of data [C:\arrow\python\build\temp.win-amd64-cpython-312\lib.vcxproj]
C:\arrow\python\build\temp.win-amd64-cpython-312\lib.cpp(335645,17): warning C4244: '=': conversion from 'Py_ssize_t' to 'long', possible loss of data [C:\arrow\python\build\temp.win-amd64-cpython-312\lib.vcxproj]
C:\arrow\python\build\temp.win-amd64-cpython-312\lib.cpp(335855,17): warning C4244: '=': conversion from 'Py_ssize_t' to 'long', possible loss of data [C:\arrow\python\build\temp.win-amd64-cpython-312\lib.vcxproj]
```

`lib.cpp` is our workhorse extension module hosting much of PyArrow's
core functionality, though we do have a dozen of extension modules as
well. What stands out in the messages I pasted above is the line
numbers: they imply that `lib.cpp`, which is generated by Cython from
PyArrow's Cython code, is *at least* 350_000 lines long!

This can obviously lead to very long compile times (on the Windows CI
builds above, compiling this file seems to take minutes, perhaps
because there's not enough RAM to keep all of the C++ compiler's
working set in memory).

We should certainly think about splitting that module into several
independent parts (*), but regardless, are there any guidelines to
make the length of Cython-generated C/C++ source code shorter?

Regards

Antoine.

(*) issue created at https://github.com/apache/arrow/issues/40166

da-woods

unread,

Feb 20, 2024, 5:38:06 PMFeb 20

to cython...@googlegroups.com

I don't know of a hugely good way of doing it.

My initial suggestions would be:
* turn off the binding directive (at the cost of reduced introspectability)
* turn off the auto-pickle directive (at the cost of cdef classes not
being pickleable by default)

I had a bit of a look into it a few years ago to try to reduce the
compile-time of some of the bigger files in Cython itself. See
https://github.com/cython/cython/issues/4425. It's a slightly distant
memory now but I think the conclusion was:

* the problem is a few big functions, not a big file,
* the two main culprits are the module init function, and the function
that generates all the module global constants,
* I think I got some benefit to splitting the module init function
line-by-line into a separate function per-line. This increased the size
of the C code, but decreased the compilation time a bit. It was just a
hacky proof of concept though, so I never seriously tried to merge it.
* I think we probably cache too many module-level constants (things like
strings, tuples, ints). For a good chunk of them it's possible to show
they're only used once and we could probably shrink the code-size by
skipping these. It isn't an easy change though.
* This was also mainly looking at GCC on Linux; MSVC on Windows might be
different

None of that is much use to you right now of course.

David

da-woods

unread,

Feb 20, 2024, 5:40:49 PMFeb 20

to cython...@googlegroups.com

On 20/02/2024 22:37, da-woods wrote:
> * I think I got some benefit to splitting the module init function
> line-by-line into a separate function per-line. This increased the
> size of the C code, but decreased the compilation time a bit. It was
> just a hacky proof of concept though, so I never seriously tried to
> merge it.

The hacky patch still looks to merge cleanly though
(https://github.com/cython/cython/pull/4386/files). It might be worth a
quick test to see what it does to your code. I don't know what we'd do
with the information but maybe something...

Antoine Pitrou

unread,

Feb 21, 2024, 8:29:17 AMFeb 21

to cython...@googlegroups.com

I'm looking at the culprit file (lib.cpp) and a bunch of things stand
out:

1. the module state is huge, it has thousands of individual PyObject*
variables (why not a single tuple or array of PyObject* instead, that
you would index using #define'd constants? that would make tp_clear and
tp_traverse much shorter as well).

2. tons of hand-written optimizations are inlined into the generated
code, for example different code paths for calling a function. For
example this trivial method generates not one but two C++ functions,
each ~80 lines of C/C++ code. That's 160 lines of generated code for a
trivial method that's not even performance critical!

```
/* "pyarrow/error.pxi":45
*
* class ArrowKeyError(KeyError, ArrowException):
* def __str__(self): # <<<<<<<<<<<<<<
* # Override KeyError.__str__, as it uses the repr() of the key
* return ArrowException.__str__(self)
*/
```

3. many functions, generated or not, marked CYTHON_INLINE for no
obvious reason. Of course, this is just a hint that the compiler is
free to ignore, but it might still have an undesirable effect (some of
those CYTHON_INLINE functions are really long and would probably not
benefit from inlining).

4. heroic micro-optimizations accessing PyLong internals in "inline"
functions such as __Pyx_PyInt_As_long.

5. worse, these heroic micro-optimizations are *replicated* (not even
delegated to a helper routine) in generated code for C/C++ enums. For
example, I see a generated function:

```
static CYTHON_INLINE enum arrow::TimeUnit::type
__Pyx_PyInt_As_enum____arrow_3a__3a_TimeUnit_3a__3a_type(PyObject *x)
```

... that is 170 lines long for a trivial functionality (unpack a
Python int into a C++ arrow::TimeUnit::type enum) that's not even
performance-critical to us.

6. It's a bit weird and makes the generated code less readable that
Cython uses C/C++-level #define'd constants for Cython-related
compile-time options, instead of switching at code generation time.

7. Indeed, the module initialization function is huge, but it doesn't
need to be, as much of it is mechanical. You don't have to generate 123
times the same global function creation code. You could instead of
generate a constant array of 123 structures with the necessary
information (function name, etc.), that you loop on to achieve to same
effect.

8. More generally, generated code can be structured and you can let the
compiler decide whether to inline for performance or not, which is much
better than *forcing* the compiler to always compile the same snippets.

Regards

Antoine.

da-woods

unread,

Feb 21, 2024, 2:00:20 PMFeb 21

to cython...@googlegroups.com

Just replying to a few points below

On 21/02/2024 13:09, Antoine Pitrou wrote:

1. the module state is huge, it has thousands of individual PyObject*
variables (why not a single tuple or array of PyObject* instead, that
you would index using #define'd constants? that would make tp_clear and
tp_traverse much shorter as well).

tp_clear and tp_traverse are probably ifdefed out so unlikely to be your problem. Although the overall idea may be a good one.

6. It's a bit weird and makes the generated code less readable that
Cython uses C/C++-level #define'd constants for Cython-related
compile-time options, instead of switching at code generation time.

The idea is that you should be able to run Cython once on a single platform and get code that you can then recompile on any platform that Cython supports. Obviously there's pros and cons to that but it'd be a fairly big change to change that now. It should have a fairly minimal effect of compile time though because it's preprocessed out fairly early.

7. Indeed, the module initialization function is huge, but it doesn't
need to be, as much of it is mechanical. You don't have to generate 123
times the same global function creation code. You could instead of
generate a constant array of 123 structures with the necessary
information (function name, etc.), that you loop on to achieve to same
effect.

It's not quite that simple in that they're def statements so the order matters so you couldn't just run through a big table, because there might be other work to be done between rows. That doesn't mean that you can't factor much of it out.

8. More generally, generated code can be structured and you can let the
compiler decide whether to inline for performance or not, which is much
better than *forcing* the compiler to always compile the same snippets.

It might be worth seeing what impact it has if you redefine CYTHON_INLINE to nothing (just as a compiler flag). While we might not want to do that generally, you could definitely apply it to your project.

Stefan Behnel

unread,

Feb 23, 2024, 6:28:35 AMFeb 23

to cython...@googlegroups.com

Salut Antoine,

Antoine Pitrou schrieb am 21.02.24 um 14:09:

> I'm looking at the culprit file (lib.cpp) and a bunch of things stand
> out:
>
> 1. the module state is huge, it has thousands of individual PyObject*
> variables (why not a single tuple or array of PyObject* instead, that
> you would index using #define'd constants? that would make tp_clear and
> tp_traverse much shorter as well).

See

https://github.com/cython/cython/issues/3689

https://github.com/cython/cython/issues/4926

> 2. tons of hand-written optimizations are inlined into the generated
> code

Many of those optimisations depend on the preprocessor, but yes, they
usually add to the C compile time when compiling in CPython. We use helper
functions most of the time when possible, but often, the generated code is
not entirely the same on each occurrence, which leads to more code being
generated locally. Cython usually has a mix of known and unknown types
available, for example, which often leads to slghtly different code being
generated.

Some of these adaptations could possibly be handled with similar effect in
inline functions, but you don't seem to be entirely happy with those either.

>, for example different code paths for calling a function. For
> example this trivial method generates not one but two C++ functions,
> each ~80 lines of C/C++ code. That's 160 lines of generated code for a
> trivial method that's not even performance critical!
>
> ```
> /* "pyarrow/error.pxi":45
> *
> * class ArrowKeyError(KeyError, ArrowException):
> * def __str__(self): # <<<<<<<<<<<<<<
> * # Override KeyError.__str__, as it uses the repr() of the key
> * return ArrowException.__str__(self)
> */
> ```

Not sure, but maybe you're looking at the Python wrapper of the function
(i.e. its signature implementation) and its user code implementation?

I'm only aware of one case where we're really duplicating user code, and
that's the exit part of a "try-finally" statement. Functions aren't duplicated.

> 3. many functions, generated or not, marked CYTHON_INLINE for no
> obvious reason. Of course, this is just a hint that the compiler is
> free to ignore, but it might still have an undesirable effect (some of
> those CYTHON_INLINE functions are really long and would probably not
> benefit from inlining).

A few helper functions were historically marked with "CYTHON_INLINE" to
avoid "unused" warnings when it's difficult to inject them conditionally.
In many cases, however, they are inlined because we expect the C compiler
to cut them down considerably based on the calling code.

As always, some of those modifiers are there for historical reasons. It's
easier to keep code than to revisit and change code. You've probably
noticed. :)

> 4. heroic micro-optimizations accessing PyLong internals in "inline"
> functions such as __Pyx_PyInt_As_long.

Yep – I'm actually quite proud of that. :) It's easy to disable with a C
macro guard as well, if you find it excessive.

> 5. worse, these heroic micro-optimizations are *replicated* (not even
> delegated to a helper routine) in generated code for C/C++ enums. For
> example, I see a generated function:
>
> ```
> static CYTHON_INLINE enum arrow::TimeUnit::type
> __Pyx_PyInt_As_enum____arrow_3a__3a_TimeUnit_3a__3a_type(PyObject *x)
> ```
>
> ... that is 170 lines long for a trivial functionality (unpack a
> Python int into a C++ arrow::TimeUnit::type enum) that's not even
> performance-critical to us.

We could possibly cut down on those a bit, but then, it's only one such
function per user defined type. You probably have tons of usages for each
type that you define in your code, which should outweigh the conversion
functions by far.

> 6. It's a bit weird and makes the generated code less readable that
> Cython uses C/C++-level #define'd constants for Cython-related
> compile-time options, instead of switching at code generation time.

That's intentional and IMHO a really great feature. It makes experimenting
with the code configuration very easy, by enabling or disabling different C
macros. It also makes it quite easy for us to adapt the C code to different
Python implementations and runtimes, by switching the fine-granular C
compile time macros on and off, depending on the available C-API features.

> 7. Indeed, the module initialization function is huge, but it doesn't
> need to be, as much of it is mechanical. You don't have to generate 123
> times the same global function creation code. You could instead of
> generate a constant array of 123 structures with the necessary
> information (function name, etc.), that you loop on to achieve to same
> effect.

David already mentioned that it's not quite that simple due to ordered
interdependencies and intermediate code sections.

Thanks for starting the discussion, though. It's always good to revisit the
status quo from time to time, even if it's not always easy to change it
drastically. Not everything that emerged over the years was intentional.

Stefan

Reply all

Reply to author

Forward