Creating your document vectors as a plain average of the word-vectors is a very cheap calculation. Note that with a pool-of-processes, there can be notable overhead moving the data to each process, then collecting the results. (Typically: a Python pickle-unpickle operation each way.) So, even if your memory issue is overcome, that new per-doc overhead might swamp any benefits of parallelism.
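(For reference, that per-doc step is typically just a few lines. A sketch, where `ft_model` is a loaded gensim `FastText` model and the simple whitespace tokenization is an assumption, not necessarily your exact preprocessing:)

```python
import numpy as np

def doc_vector(text, ft_model):
    """Average the FastText word-vectors for one document's tokens."""
    tokens = text.split()  # assumption: simple whitespace tokenization
    if not tokens:
        return np.zeros(ft_model.vector_size, dtype=np.float32)
    # FastText can synthesize a vector for any token, even out-of-vocabulary ones
    return np.mean([ft_model.wv[t] for t in tokens], axis=0)
```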
But also: that points to what's likely the biggest issue with your current setup: the giant `chunksize`. This value indicates how many items should be sent to each process as one task – in one big lump. By picking a chunksize exactly equal to 1/3 of the full dataset, for your 3 processes, there'll be 3 giant sends, then 3 giant results. Imagine your data is 6GB: 2GB gets assigned to each chunk, which is pickled into at least 2GB of serialized data (generic serialization usually means some expansion) in the sending process, relayed to the child process as another 2GB+ of pickled data, then unpickled back into 2GB of objects on the receiving side. That's perhaps a 3x-or-more expansion in RAM usage – before any processing is done or any results are returned.
First try a `chunksize` of 1 – just to see if things then work. (Perhaps use a subset of your data, to get a sense of the speedup/slowdown compared to a non-pooled approach.) Then try larger chunksizes – in the hundreds to thousands, but probably not millions – to see what sort of speedup over the `chunksize=1` case is possible. (The benefit of somewhat-larger batches may be noticeable at first, but become negligible as chunksizes grow larger and larger.)
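For example, a rough timing harness (a sketch; `compute_doc_vector` and the `texts` sequence are stand-ins for your own module-level worker function and data):

```python
from multiprocessing import Pool
import time

for chunksize in (1, 100, 1000, 10000):
    start = time.time()
    with Pool(processes=3) as pool:
        # imap() streams tasks to workers in batches of `chunksize`,
        # rather than sending one giant lump per process
        vectors = list(pool.imap(compute_doc_vector, texts, chunksize=chunksize))
    print(f"chunksize={chunksize}: {time.time() - start:.1f}s")
```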
You still may not match the throughput of a single-process solution, due to the overhead mentioned above, but you should prevent memory from being a bottleneck.
Further optimizations you could then consider:
* if all the data is in memory before the `Pool` is created, each subprocess may already have its own addressable copy (via copy-on-write sharing after a `fork()`, on OSes that support it). So, rather than (redundantly) pushing the data to the subprocesses, you could just pass int indexes for them to use to look up the texts locally – much less serialization/RAM overhead (see the first sketch after this list)
* ensure that any time these doc-vecs are calculated, they're written in a canonical place/format alongside the source data – and then re-use that cache whenever it's available. (Only if your data or FT model changes would it need to be discarded.) Then the overhead of one lengthy single-process calculation may matter less, across whatever analysis cycle you're repeating that can reuse the data (see the caching sketch below)
* avoid loading the full data into a dataframe – if you only ever need the `Text` column, and its native format is amenable to being read in an incremental streaming fashion, this could save a lot of memory. (You could conceivably even just tell the subprocesses which ranges of the source data to read, and have them do their own IO – see the streaming sketch below.)
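A sketch of the index-passing idea from the first bullet. It assumes a fork-based start method (the default on Linux), so that module-level globals loaded before the `Pool` is created are visible to workers copy-on-write; `load_texts()` and `load_ft_model()` are hypothetical stand-ins:

```python
from multiprocessing import Pool
import numpy as np

texts = load_texts()        # hypothetical: all texts in memory before the Pool exists
ft_model = load_ft_model()  # hypothetical: the loaded FastText model

def vector_for_index(i):
    # Each forked worker reads `texts` and `ft_model` from its inherited,
    # copy-on-write view of the parent's memory; only a small int is pickled.
    tokens = texts[i].split()
    return np.mean([ft_model.wv[t] for t in tokens], axis=0)

if __name__ == '__main__':
    with Pool(processes=3) as pool:
        doc_vectors = pool.map(vector_for_index, range(len(texts)), chunksize=1000)
```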
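For the caching bullet, something as simple as a `.npy` file alongside the data can work (a sketch; the path and `compute_all_doc_vectors()` are hypothetical):

```python
import os
import numpy as np

CACHE_PATH = 'data/doc_vectors.npy'  # hypothetical canonical location

if os.path.exists(CACHE_PATH):
    doc_vectors = np.load(CACHE_PATH)        # reuse the earlier calculation
else:
    doc_vectors = compute_all_doc_vectors()  # the one lengthy single-process pass
    np.save(CACHE_PATH, doc_vectors)         # cache for every later analysis cycle
```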
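And for the last bullet, if the source is (say) a CSV, pandas can stream just the one column instead of materializing the whole dataframe (a sketch; the filename is a placeholder):

```python
import pandas as pd

# read only the `Text` column, in modest chunks, never the full dataframe
for chunk in pd.read_csv('docs.csv', usecols=['Text'], chunksize=100_000):
    for text in chunk['Text']:
        ...  # hand each text (or its index range) to a worker
```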
- Gordon