The parallelism of parts of the implementation is still limited by:
(1) the Python "global interpreter lock" (GIL)
(2) a single corpus-iteration thread, which batches text examples to the worker threads
You've already got a pretty simple corpus-iterator, so that may not be a bottleneck. (Still, if the data is coming from a spinning disk, moving it to a faster volume or bringing it all into RAM might help a bit.)
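If you want to rule out I/O, a minimal sketch of both options, assuming a whitespace-tokenized file at a hypothetical path `corpus.txt` with one document per line:

```python
from gensim.models.doc2vec import TaggedDocument

class StreamCorpus:
    """Streams from disk, re-reading the file on every pass (every epoch)."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding='utf-8') as f:
            for i, line in enumerate(f):
                yield TaggedDocument(words=line.split(), tags=[i])

# RAM-cached alternative: pay the read cost once, then iterate from memory
in_memory_corpus = list(StreamCorpus('corpus.txt'))
```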
Contention around that single distributing thread (and the GIL) is likely the larger issue, and it often means peak throughput is achieved with just 3-8 worker threads, even if more physical cores are available.
The exact optimal number of threads can vary with the system and libraries, the meta-parameters, and the vocabulary size. For example, larger values for `window`, the `negative` count, or the vector `size` each cause the highly-optimized, multi-threadable code blocks to take longer, thus *lessening* contention and potentially getting more work done 'for free' from cores that would otherwise sit idle. Of course, increasing those values may not give optimal results for your other project goals.
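If you want to find that sweet spot empirically, you can time a single training pass at several `workers` values. A rough sketch, reusing the RAM-cached `in_memory_corpus` from the earlier snippet and assuming a recent gensim (which takes `vector_size` where older releases took `size`); the other parameter values here are placeholders, not recommendations:

```python
import time
from gensim.models.doc2vec import Doc2Vec

for workers in (2, 3, 4, 6, 8):
    model = Doc2Vec(dm=1, vector_size=100, window=5, negative=5,
                    min_count=5, workers=workers)
    model.build_vocab(in_memory_corpus)
    start = time.time()
    # One epoch is enough to compare relative throughput between settings
    model.train(in_memory_corpus, total_examples=model.corpus_count, epochs=1)
    print(f'{workers} workers: {time.time() - start:.1f}s per epoch')
```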
A few other tangential comments on your set-up:
* using Intel's MKL (as installed, for example, by the `conda` environment-manager) may outperform OpenBLAS; the sketch after this list shows one way to check which library your NumPy is linked against
* it's rare to use PV-DM with summing rather than averaging (your `dm=1, dm_mean=0`)
* keeping words that appear only once (`min_count=1`), or even just a few times, usually serves more to interfere with, or dilute, other vectors than to help.
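To illustrate those last points, a sketch combining the BLAS check with the more conventional parameter choices (again reusing the hypothetical `in_memory_corpus`; the `workers` value is just a placeholder):

```python
import numpy as np
from gensim.models.doc2vec import Doc2Vec

# Look for 'mkl' vs 'openblas' in the printed build configuration
np.show_config()

model = Doc2Vec(
    in_memory_corpus,
    dm=1, dm_mean=1,   # PV-DM with context-vector *averaging*, the usual choice
    min_count=5,       # discard rare words rather than keeping everything
    workers=4,         # tune per the thread-count discussion above
)
```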
Hope this helps,
- Gordon