Alex,
I spent some time testing this and was surprised by the results. With large mechanisms (I ran the LLNL gasoline surrogate with 1389 species), where the computation time is completely dominated by the LU factorization of the Jacobian, the computation time does indeed scale badly when several independent processes run at once, and how badly depends on the BLAS/LAPACK implementation used. Using the LU factor/solve implementation included with Sundials (which is what you will be using if you use the Windows binaries for Cantera), I get the following computation times on a quad-core machine:
1 process: 12:22
2 processes: 23:45
3 processes: 34:25
My suspicion is that the straightforward implementation of the LU factorization ends up being constrained by memory bandwidth, which all of the processes share, rather than by actual processing speed. What's interesting is that optimized BLAS/LAPACK implementations, which take into account the sizes of the various levels of processor cache available, scale much better across multiple processes. Using ATLAS:
1 process: 4:06
3 processes: 5:50
Using MKL:
1 process: 2:10
3 processes: 2:41
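To put the scaling into ratios, here is a rough Python sketch. It assumes each time listed above is the wall time for all N processes to finish the same job, which is my reading of the test rather than something checked separately. The helper names (to_seconds, slowdown, throughput_gain) are made up for illustration. Since only ratios are computed, it does not matter whether the times are read as min:sec or hr:min.

```python
def to_seconds(stamp):
    """Convert an 'M:SS' style string into a count of its smaller base-60 unit."""
    major, minor = stamp.split(":")
    return int(major) * 60 + int(minor)


# Timings copied from the runs above: {library: {process count: wall time}}.
timings = {
    "Sundials built-in": {1: "12:22", 2: "23:45", 3: "34:25"},
    "ATLAS": {1: "4:06", 3: "5:50"},
    "MKL": {1: "2:10", 3: "2:41"},
}


def slowdown(times):
    """Ratio of each run's wall time to the single-process wall time."""
    base = to_seconds(times[1])
    return {n: to_seconds(t) / base for n, t in times.items()}


def throughput_gain(times):
    """Jobs finished per unit time relative to one process.

    Assumption: every one of the n processes completes one full job
    in the listed wall time.
    """
    return {n: n / s for n, s in slowdown(times).items()}


for name, times in timings.items():
    gains = throughput_gain(times)
    for n, s in slowdown(times).items():
        print(f"{name}: {n} process(es), slowdown {s:.2f}x, throughput {gains[n]:.2f}x")
```

Under that assumption, three processes with the Sundials routines each take about 2.78x as long as one process alone, so total throughput improves by only about 1.08x. With ATLAS the three-process slowdown is about 1.42x (throughput about 2.11x), and with MKL about 1.24x (throughput about 2.42x).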
I think the takeaway here is that if you want good performance, you really need to use an optimized BLAS/LAPACK library, both for the improvement in single-core performance and for better scaling to multiple processes.
Regards,
Ray