Creating your document vectors as a plain average of the word-vectors is a very cheap calculation. Note that with a pool-of-processes, there can be notable overhead moving the data to each process, then collecting the results. (Typically: a Python pickle-unpickle operation each way.) So, even if your memory issue is overcome, that new per-doc overhead might swamp any benefits of parallelism.
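(For reference, that per-doc step is typically just a few lines. A sketch, where `ft_model` is a loaded gensim `FastText` model and the simple whitespace tokenization is an assumption, not necessarily your exact preprocessing:)

```python
import numpy as np

def doc_vector(text, ft_model):
    """Average the FastText word-vectors for one document's tokens."""
    tokens = text.split()  # assumption: simple whitespace tokenization
    if not tokens:
        return np.zeros(ft_model.vector_size, dtype=np.float32)
    # FastText can synthesize a vector for any token, even out-of-vocabulary ones
    return np.mean([ft_model.wv[t] for t in tokens], axis=0)
```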
But also: that points to what's likely the biggest issue with your current setup: the giant `chunksize`. This value indicates how many items should be sent to each process as one task – in one big lump. By picking a chunksize exactly equal to 1/3 of the full dataset, for your 3 processes, there'll be 3 giant sends, then 3 giant results. Imagine your data is 6GB: 2GB gets assigned to each chunk, which is pickled into at least 2GB of serialized data (generic serialization usually means some expansion) in the sending process, relayed to the child process as another 2GB+ of pickled data, then unpickled back into 2GB of objects on the receiving side. That's perhaps a 3x-or-more expansion in RAM usage – before any processing is done or any results are returned.
First try a `chunksize` of 1 – just to see if things then work. (Perhaps use a subset of your data, to get a sense of the speedup/slowdown compared to a non-pooled approach.) Then try larger chunksizes – in the hundreds to thousands, but probably not millions – to see what sort of speedup over the `chunksize=1` case is possible. (The benefit of somewhat-larger batches may be noticeable at first, but become negligible as chunksizes grow larger and larger.)
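For example, a rough timing harness (a sketch; `compute_doc_vector` and the `texts` sequence are stand-ins for your own module-level worker function and data):

```python
from multiprocessing import Pool
import time

for chunksize in (1, 100, 1000, 10000):
    start = time.time()
    with Pool(processes=3) as pool:
        # imap() streams tasks to workers in batches of `chunksize`,
        # rather than sending one giant lump per process
        vectors = list(pool.imap(compute_doc_vector, texts, chunksize=chunksize))
    print(f"chunksize={chunksize}: {time.time() - start:.1f}s")
```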
You still may not match the throughput of a single-process solution, due to the overhead mentioned above, but you should prevent memory from being a bottleneck.
Further optimizations you could then consider:
* if all the data is in memory before the `Pool` is created, each subprocess may already have its own addressable copy (via copy-on-write sharing after a `fork()`, on OSes that support it). So, rather than (redundantly) pushing the data to the subprocesses, you could just pass int indexes for them to use to look up the texts locally – much less serialization/RAM overhead (see the first sketch after this list)
* ensure that any time these doc-vecs are calculated, they're written in a canonical place/format alongside the source data – and then re-use that cache whenever it's available. (Only if your data or FT model changes would it need to be discarded.) Then the overhead of one lengthy single-process calculation may matter less, across whatever analysis cycle you're repeating that can reuse the data (see the caching sketch below)
* avoid loading the full data into a dataframe – if you only ever need the `Text` column, and its native format is amenable to being read in an incremental streaming fashion, this could save a lot of memory. (You could conceivably even just tell the subprocesses which ranges of the source data to read, and have them do their own IO – see the streaming sketch below.)
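A sketch of the index-passing idea from the first bullet. It assumes a fork-based start method (the default on Linux), so that module-level globals loaded before the `Pool` is created are visible to workers copy-on-write; `load_texts()` and `load_ft_model()` are hypothetical stand-ins:

```python
from multiprocessing import Pool
import numpy as np

texts = load_texts()        # hypothetical: all texts in memory before the Pool exists
ft_model = load_ft_model()  # hypothetical: the loaded FastText model

def vector_for_index(i):
    # Each forked worker reads `texts` and `ft_model` from its inherited,
    # copy-on-write view of the parent's memory; only a small int is pickled.
    tokens = texts[i].split()
    return np.mean([ft_model.wv[t] for t in tokens], axis=0)

if __name__ == '__main__':
    with Pool(processes=3) as pool:
        doc_vectors = pool.map(vector_for_index, range(len(texts)), chunksize=1000)
```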
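For the caching bullet, something as simple as a `.npy` file alongside the data can work (a sketch; the path and `compute_all_doc_vectors()` are hypothetical):

```python
import os
import numpy as np

CACHE_PATH = 'data/doc_vectors.npy'  # hypothetical canonical location

if os.path.exists(CACHE_PATH):
    doc_vectors = np.load(CACHE_PATH)        # reuse the earlier calculation
else:
    doc_vectors = compute_all_doc_vectors()  # the one lengthy single-process pass
    np.save(CACHE_PATH, doc_vectors)         # cache for every later analysis cycle
```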
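And for the last bullet, if the source is (say) a CSV, pandas can stream just the one column instead of materializing the whole dataframe (a sketch; the filename is a placeholder):

```python
import pandas as pd

# read only the `Text` column, in modest chunks, never the full dataframe
for chunk in pd.read_csv('docs.csv', usecols=['Text'], chunksize=100_000):
    for text in chunk['Text']:
        ...  # hand each text (or its index range) to a worker
```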
- Gordon