One thing to be careful of is the multiple levels of parallelism available. Generally in the Python interface you have two different levels of parallelism:
1. BLAS/LAPACK parallelism -- i.e., how many threads are used during matrix LU factorization and other linear algebra operations. This is typically controlled by the OMP_NUM_THREADS environment variable. For small models (say, 50-1000 species depending on your machine), you'll likely see the best performance with a single thread here. However, it seems you're using a very large model (since your ignition delay calculations take so long!), so you may want to experiment with the number of BLAS/LAPACK threads to see whether more threads help. Note: this assumes you're using MKL or another parallel library as the BLAS/LAPACK implementation; I believe that's the only option for the conda install currently, but I thought I'd mention it.
2. Python-level parallelism, e.g., via the multiprocessing library. To see what's going on here, we'll need the info that Bryan requested. However, one thing I'll note is that for best performance, the product of OMP_NUM_THREADS and the number of Python processes you spawn shouldn't exceed the total physical core count of your machine. Note: hyperthreading is evil for chemical kinetics -- be really, really sure you're using the actual physical core count, e.g., via psutil's cpu_count(logical=False), or you'll take a performance hit.
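To make the two levels concrete, here's a minimal sketch of how they might be combined: pin BLAS/LAPACK to one thread, size the process pool by the physical core count, and farm out the cases with multiprocessing. The function `run_ignition_delay` is a hypothetical placeholder for your actual simulation, not anything from Cantera itself.

```python
import os

# Pin BLAS/LAPACK to a single thread per worker. This must be set
# before numpy/cantera are imported so the BLAS library picks it up.
os.environ["OMP_NUM_THREADS"] = "1"

import multiprocessing


def run_ignition_delay(temperature):
    # Hypothetical stand-in for the real Cantera reactor simulation;
    # returns a dummy value just to make the sketch runnable.
    return temperature * 1.0e-6


def physical_core_count():
    # Prefer psutil's physical core count; fall back to os.cpu_count(),
    # which counts *logical* (hyperthreaded) cores, if psutil is absent.
    try:
        import psutil
        n = psutil.cpu_count(logical=False)
        if n:
            return n
    except ImportError:
        pass
    return os.cpu_count()


if __name__ == "__main__":
    temperatures = [1000.0, 1100.0, 1200.0, 1300.0]
    # One worker per physical core, so that
    # OMP_NUM_THREADS * number_of_workers == physical core count.
    with multiprocessing.Pool(processes=physical_core_count()) as pool:
        delays = pool.map(run_ignition_delay, temperatures)
    print(delays)
```

With this layout, each worker process does serial linear algebra and the parallelism comes entirely from the pool, which is usually the right trade-off for many independent 0-D ignition cases.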
Best,
Nick