I now have the addmul_multi code running multi-threaded, and have introduced an extra cancellation step that significantly reduces the complexity of the problem. All of the test cases pass, and they have pretty good code coverage, and valgrind runs clean.
Last night, a 12-core machine multiplied 440 polynomials spread over 40 terms and generated a 6870314768 term polynomial in 6.5 hours. The working set set was around 50 GB, but most of that was used for buffering. The problem might require as little as 8 GB of RAM, but that would probably slow things down because having lots of buffer space is crucial to running the processors at 100% utilization. The basic algorithm is to let each processor multiple the factors in a single term, writing into a buffer, then add up the terms by merging the buffers together.
I say "generated", because the machine didn't save the output at all. It just checked to make sure the exponents were in descending order, collected some statistics, and then discarded the results. The result needs to be processed further and saved to disk, but I haven't written that code yet.
To facilitate this, I've added an "output_function" field to the fmpz_mpoly structure. If this function pointer is NULL, the subroutine proceeds normally, writing the output into memory and resizing the arrays as needed, just like always. If "output_function" is not NULL, the routine calls it for each output polynomial term, and does nothing else.
I've currently got a single output_function that just checks exponent ordering and collects some statistics. I'm planning to write several more: one that writes out to disk, one that does a variable substitution, one that evaluates the polynomial at a given coordinate vector.
An "input_function" also seems to be needed, to deal with polynomials that were previously written to disk.
I'm not sure what the API should be, but this seems to be moving in the direction of having input and output functions for polynomials that allow a chain of calculations to be configured and then run as a block, stalling as needed to wait for buffers to fill rather than expecting everything to be in memory at once.
Obviously this only works for algorithms that can read or write polynomials sequentially, though this can be fudged a little bit. For example, I'm planning to write an output_function that does variable substitution by writing a series of intermediate files to disk, each one a sorted polynomial produced from a block of the input polynomial by a variable substitution, then adding them up at the end using a merge tree to produce the final result. The idea here is that I can write a portion of the input polynomial to a memory buffer (as it's being produced by addmul_multi), run the variable substitution when either that buffer fills up or when the substituted variables change (put them at the beginning of a lex ordering so they don't change too often), and dump the results straight to disk so no additional memory is required.
I've got the fmpz_mpoly_addmul_multi subroutine, that's probably close to being suitable for a PR, but I'm adding this other feature (output_function) that's going to be a whole new issue, and the code is starting to get a little mixed together, but it's not real bad (yet) because I can just cut out the blocks that start with "if (output_function)".
agape
brent