Hi all,
For the past few weeks, the yt developers have
been struggling with a problem in our docs build.
We recently switched to the latest development version of Sphinx
to take advantage of its new parallel reading capabilities. I'm
currently using the tip of the default branch (Sphinx 1.3b3).
I'm using the Sphinx docs build for the yt project:
I've modified our Sphinx configuration to run a parallel build on
8 cores and to skip some custom directives we use to
evaluate code during the docs build:
The memory errors also happen with our custom Sphinx extensions
turned on; I've only disabled them here to isolate the issue to the
autosummary extension.
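For anyone trying the same bisection, here is a minimal conf.py sketch of how extensions could be switched off from the command line instead of by editing the file. The `DOCS_DISABLE_EXT` variable and the `yt_custom.eval_directive` module name are hypothetical placeholders, not anything yt or Sphinx actually defines; only the `extensions` list itself is standard Sphinx configuration.

```python
import os

# Full extension list. "yt_custom.eval_directive" is a hypothetical stand-in
# for the project's own code-evaluating directives.
ALL_EXTENSIONS = [
    "sphinx.ext.autodoc",
    "sphinx.ext.autosummary",
    "yt_custom.eval_directive",
]


def enabled_extensions(environ=os.environ):
    """Return ALL_EXTENSIONS minus any names listed (comma-separated)
    in the hypothetical DOCS_DISABLE_EXT environment variable."""
    raw = environ.get("DOCS_DISABLE_EXT", "")
    disabled = {name.strip() for name in raw.split(",") if name.strip()}
    return [ext for ext in ALL_EXTENSIONS if ext not in disabled]


# Sphinx reads this module-level name from conf.py.
extensions = enabled_extensions()
```

With that in place, something like `DOCS_DISABLE_EXT=sphinx.ext.autosummary sphinx-build -j 8 -b html source build/html` would run the 8-process build without autosummary (`-j` is the parallel-build option added in Sphinx 1.2; the paths are placeholders).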
The reading phase proceeds without issues, but during the writing
phase, after 10% of the documents have been written, I see (via top)
memory usage spike on all of the worker processes: each worker consumes
tens of gigabytes of RAM. Shortly after this, the whole build
crashes. I don't fully understand the traceback, but I think
the crash happens when Sphinx tries to serialize a huge amount of
data and send it between the workers and the master process. I've
pasted the output from the build here, along with the traceback Sphinx
reports when the build crashes:
One more hint: if I turn off the autosummary extension,
the parallel build proceeds fine. My guess is that we're
exposing a scaling issue in Sphinx or docutils that only becomes
significant for a build with a very large number of
output pages.
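One way to check the serialization theory would be to measure how large candidate objects are once pickled, since pickling is how Python's multiprocessing moves data between processes. The helpers below are a hypothetical sketch, not part of Sphinx; which objects Sphinx 1.3 actually sends between workers is exactly the open question here, so treat any target you feed it as a guess.

```python
import pickle


def pickled_size(obj):
    """Number of bytes obj occupies after pickling with the highest protocol."""
    return len(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))


def largest_pickled(named_objects, top=5):
    """Return up to `top` (name, size_in_bytes) pairs, largest first.

    Objects that cannot be pickled at all (lambdas, open files, references
    back to the application, ...) are skipped rather than aborting the report.
    """
    sizes = []
    for name, obj in named_objects.items():
        try:
            sizes.append((name, pickled_size(obj)))
        except Exception:
            continue
    sizes.sort(key=lambda pair: pair[1], reverse=True)
    return sizes[:top]
```

For example, a small extension could connect a handler to the `env-updated` event and print `largest_pickled(vars(env))`, to see whether the autosummary-generated pages make some environment attribute balloon. Whether that attribute is what the workers actually ship around is an assumption this would only help test.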
We've used the autosummary extension for years without
trouble, and we've used parallel writing since it became
available in Sphinx 1.2, so I suspect this is due to a change in the
way parallel writing works in Sphinx 1.3. We've also
been able to reproduce the problem on three different build machines.
Does anyone have hints on debugging memory leaks like this?
In particular, I'd like to get a traceback from one of the worker
processes once its memory usage has started to spike.
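One approach that might work is the standard library's `faulthandler` module, which can dump every thread's stack when the process receives a signal. This is a sketch under a few assumptions: Python 3.3+ (on Python 2 `faulthandler` is a separate PyPI package), a Unix platform (`faulthandler.register` isn't available on Windows), and worker processes that are forked from the main `sphinx-build` process so that they inherit the handler. Calling it near the top of conf.py is one plausible place to install it.

```python
import faulthandler
import signal
import sys


def install_traceback_dump(signum=signal.SIGUSR1):
    """Make the process print all threads' tracebacks when it receives signum.

    Output goes to sys.__stderr__, the real process stderr, even if
    sys.stderr has been redirected. Signal handlers installed before the
    workers are forked are inherited by them, so each worker should respond
    to the signal as well.
    """
    faulthandler.register(signum, file=sys.__stderr__, all_threads=True)


install_traceback_dump()
```

Once top shows a worker's memory climbing, `kill -USR1 <worker pid>` should print that worker's current Python stack to the terminal without stopping the build.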
Thanks for your help!
-Nathan