Memory limitation

ziv rader

Oct 10, 2016, 7:51:05 AM
to bob-devel
Hi,
We are running bob.spear to train a UBM on our own collected database, which contains over 5M frames (each of the ~4000 utterances/speech samples has between 3000 and 15000 frames).
The experiment (running ./bin/verify.py -vv -d 'bigdata' -p 'energy-2gauss' -e 'mfcc-60' -a 'ivec-cosine-bigdata' -s 'ivec-cosine-bigdata' -v --parallel 8) fails with a memory error in ivector.py, in the function train_projector, on the line:

#train UBM
data = numpy.vstack(train_features_flatten) 

We are running on an 8-core Ubuntu system with 32 GB of RAM, and the total size of the HDF5 files produced by the extractor is over 32 GB (~34 GB, to be more exact). The data is passed to the function successfully, but the numpy array created by numpy.vstack makes a copy of the data and exhausts the memory.

We are wondering whether there is a way to load the data in chunks, or whether there is another solution to this problem; one workaround we considered is sketched below.
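One idea we had is to preallocate the stacked array once and release each per-utterance array as soon as it has been copied, so that the peak memory is roughly one full copy plus one utterance rather than two full copies. This is only a rough, untested sketch (the function name is ours, and we assume train_features_flatten is a list of 2D frames-by-dimensions arrays):

import numpy

def stack_features(feature_list):
    """Stack 2D per-utterance feature arrays into a single array, releasing
    each source array after it has been copied (note: this empties the
    caller's list on purpose, so that the copies can be garbage-collected)."""
    n_frames = sum(f.shape[0] for f in feature_list)
    n_dims = feature_list[0].shape[1]
    data = numpy.empty((n_frames, n_dims), dtype=feature_list[0].dtype)
    offset = 0
    while feature_list:
        f = feature_list.pop(0)  # drop the reference so the array can be freed
        data[offset:offset + f.shape[0]] = f
        offset += f.shape[0]
    return data

# data = stack_features(train_features_flatten)  # instead of numpy.vstack(...)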

Many thanks in advance,
Ziv

Manuel Günther

Oct 10, 2016, 11:51:42 AM
to bob-devel
One way around this limitation is to split the data into more than 8 jobs, but run only 8 jobs in parallel. To do that, you can specify the "--grid" command-line option, such as: ``... --grid local-p16`` (or you can create a configuration file with more jobs, if needed; see the sketch below). This will write a ``GridTK`` job database (see http://pythonhosted.org/gridtk/manual.html#the-job-manager), which you can then run, e.g., with the following command:

./bin/jman --local -vv run-scheduler --parallel 8 --die-when-finished
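If local-p16 does not split the data finely enough, a custom configuration file could look roughly like the following; the parameter names are assumed to match the local-pN configurations shipped with bob.bio.base, so please double-check them against your installed version:

import bob.bio.base

# hypothetical file mygrid.py: split the training data into 32 local jobs
grid = bob.bio.base.grid.Grid(
  grid_type = 'local',
  number_of_parallel_processes = 32
)

You would then point ``--grid`` at this configuration (either by registering it as a resource or, if your version supports it, by passing the file directly) and still cap the actual parallelism with ``jman ... --parallel 8`` as above.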


If the memory issue still occurs, you can also reduce the number of jobs that run in parallel.

I hope that helps. I know that the GMM training is very memory hungry and that it indeed makes a copy of the data; unfortunately, we haven't come up with a solution to avoid that yet. If you have something in mind, let us know.

Cheers
Manuel

ziv rader

Oct 25, 2016, 11:02:58 AM
to bob-devel
Hi Manuel,

It seems that the KMeans initialization is not parallel, and as a result I can't avoid the memory issue by splitting the work into multiple jobs.
I have also noticed that the M-step is not parallel, but I haven't gotten there yet.
Am I missing something about your answer and the parallelization process?

Thanks,
Ziv

Manuel Günther

Oct 25, 2016, 12:05:18 PM
to bob-devel
Indeed, the KMeans initialization is not parallelizable. However, you don't really need all of the data to initialize KMeans. As far as I remember, `bin/verify_gmm.py` has a command-line option to limit the number of data items used for the KMeans and GMM initialization. Try running with the `--limit-training-data` option (https://gitlab.idiap.ch/bob/bob.bio.gmm/blob/master/bob/bio/gmm/tools/command_line.py#L14, or see `bin/verify_gmm.py --help`).
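For example, the call could then look roughly like this (the number is only an illustration, and you should check ``--help`` for what exactly the limit counts):

./bin/verify_gmm.py <your usual options> --limit-training-data 1000000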

The M-step is also not parallel, but I think it does not need so much memory, as it does not load the training samples; it only uses the statistics accumulated in the E-step to compute the new means.
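To illustrate the point, here is a rough, self-contained sketch (not Bob's actual implementation) of EM for a diagonal-covariance GMM written in terms of sufficient statistics: the E-step reduces each chunk of frames to accumulators whose size scales with the number of Gaussians times the feature dimension, accumulators from several chunks or jobs can simply be summed, and the M-step only ever touches those accumulators:

import numpy

def e_step_accumulate(chunk, weights, means, variances):
    """Reduce one chunk of frames (n_frames x n_dims) to the sufficient
    statistics of a diagonal-covariance GMM.  For illustration only: a real
    implementation would avoid materializing the (n_frames, n_gaussians,
    n_dims) intermediate array."""
    # log of the diagonal Gaussian densities, shape (n_frames, n_gaussians)
    log_norm = -0.5 * numpy.log(2 * numpy.pi * variances).sum(axis=1)
    diff = chunk[:, None, :] - means[None, :, :]
    log_prob = log_norm[None, :] - 0.5 * (diff ** 2 / variances[None, :, :]).sum(axis=2)
    # responsibilities via a numerically stable soft-max over the Gaussians
    log_weighted = numpy.log(weights)[None, :] + log_prob
    log_weighted -= log_weighted.max(axis=1, keepdims=True)
    resp = numpy.exp(log_weighted)
    resp /= resp.sum(axis=1, keepdims=True)
    n_k = resp.sum(axis=0)                   # shape (n_gaussians,)
    sum_x = numpy.dot(resp.T, chunk)         # shape (n_gaussians, n_dims)
    sum_xx = numpy.dot(resp.T, chunk ** 2)   # shape (n_gaussians, n_dims)
    return n_k, sum_x, sum_xx

def m_step(n_k, sum_x, sum_xx):
    """Update weights, means and variances from the accumulators alone;
    no training frame is touched here."""
    weights = n_k / n_k.sum()
    means = sum_x / n_k[:, None]
    variances = sum_xx / n_k[:, None] - means ** 2
    return weights, means, variances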

Cheers
Manuel