distributed swivel


Chris Waterson

May 5, 2017, 5:32:55 PM
to Swivel Embeddings
Hello Swivelers! (There are dozens of us!)

I merged some changes to master this morning to make it straightforward to use Swivel with Distributed TensorFlow. This allows you to run Swivel workers on multiple machines to parallelize training, which was sort of the whole point of Swivel in the first place. :D

It's also how I'd recommend using Swivel on a multi-GPU machine. I've had a few reports of poor training results when trying to use multiple GPUs within a single multi-threaded process.

Please take a look at this Bash script for the basics of a single-machine, multi-process setup. 
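The gist of such a launcher is that it starts one parameter-server process and several worker processes that all see the same cluster description but different task indices. Here is a minimal Python sketch of that pattern; the `--ps_hosts`/`--worker_hosts`/`--job_name`/`--task_index` flags follow common distributed-TensorFlow conventions and are assumptions for illustration, not necessarily swivel.py's actual flag names:

```python
# Sketch of a single-machine, multi-process launcher for distributed
# TensorFlow training: one parameter server plus several workers, all
# on localhost, each assigned its own port and task index.
# Flag names are illustrative assumptions; check the real script.

def build_commands(num_workers, base_port=2222):
    """Build command lines for one parameter server and N workers."""
    ps_hosts = "localhost:%d" % base_port
    worker_hosts = ",".join(
        "localhost:%d" % (base_port + 1 + i) for i in range(num_workers))
    common = ["python", "swivel.py",
              "--ps_hosts", ps_hosts,
              "--worker_hosts", worker_hosts]
    commands = [common + ["--job_name", "ps", "--task_index", "0"]]
    for i in range(num_workers):
        commands.append(
            common + ["--job_name", "worker", "--task_index", str(i)])
    return commands
```

Each command would then be started with `subprocess.Popen`; the point is simply that every process shares the same cluster description while taking a different job name and task index.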

I've also switched the default optimizer to RMSProp (instead of AdaGrad), and eliminated (well, turned off) some of the strange hyper-parameters from the paper. I've routinely achieved results that are as good as, and usually better than, the paper's by using RMSProp with a simple weighting of the L2 loss by the square root of the count. I suspect that some of the bizarre parameters reported in the paper had to do with over-tuning for AdaGrad and the specific corpus that we'd been using. (AdaGrad has a "clamping" behavior that can cause it to stop learning too quickly if it receives large gradient updates.)
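To make the weighting concrete, here is a small NumPy sketch of the simplified objective described above: an L2 loss between the embeddings' dot products and the log co-occurrence counts, weighted by the square root of the count. This illustrates the weighting idea only; it is not the exact swivel.py code.

```python
import numpy as np

def weighted_l2_loss(w, c, counts):
    """Sqrt-of-count weighted L2 loss for observed co-occurrences.

    w:      (n, d) row (focus) embeddings
    c:      (m, d) column (context) embeddings
    counts: (n, m) observed co-occurrence counts (> 0)
    """
    pred = w @ c.T                       # predicted log co-occurrence
    target = np.log(counts)              # actual log co-occurrence
    weight = np.sqrt(counts)             # the simple sqrt(count) weighting
    return 0.5 * np.sum(weight * (pred - target) ** 2)
```

The sqrt weighting means frequent pairs pull harder on the loss than rare ones, but only sub-linearly, which is the behavior the tuned hyper-parameters in the paper were approximating.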

But if your goal is to reproduce the results from the paper, then it's possible to re-enable AdaGrad and set the hyper-parameter values to what we used there.

Please let me know if you have any questions!

chris


Phuong Nguyen

May 22, 2017, 12:20:16 PM
to Swivel Embeddings
Hi Chris and All:

I ran into a couple of difficulties when running the Swivel model on a larger corpus or with more GPUs. I would appreciate your help and comments on these.

1. I tried to run distributed Swivel using 4 GPUs, with 1 PS and 8 workers, on the text8 dataset (vocabulary of ~71,290 unique frequent words). It ran much slower than with 2 GPUs, about twice as slow.

2. When I tried to run Swivel on a dataset with ~2.7 million unique frequent tokens (out of 3 billion tokens in total), prep.py stopped with a "too many open files" error. I increased the shard size; prep.py then produced a list of tmp files (e.g., shard-001-015.tmp, ~240 MB each) but terminated without an error message and did not produce all the files needed by swivel.py.
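For a sense of scale on the "too many open files" error: if the preprocessor keeps one temporary file open per (row-shard, column-shard) pair, the open-file count grows quadratically with the number of shards. A rough sketch of that arithmetic (the shard size of 4096 is an assumption for illustration):

```python
# Rough arithmetic for the "too many open files" failure mode:
# a sharded co-occurrence matrix has one file per (row, column)
# shard pair, so the file count is (number of shards) squared.

def open_files_needed(vocab_size, shard_size):
    """Number of (row, column) shard files for a sharded matrix."""
    shards = -(-vocab_size // shard_size)   # ceiling division
    return shards * shards
```

With a ~2.7 million-word vocabulary and a shard size of 4096, this is hundreds of thousands of files, far past typical ulimit settings, which is consistent with the error above.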

I also tried running glove_to_shards.py to create the tf.Record format from a GloVe co-occurrence matrix built from my ~2.7 million-word vocabulary. glove_to_shards.py also terminated after producing temporary files, with no error message explaining why it stopped.

Also, the swivel.py code does not have an option to read shards.recs.

Chris, do you have a version that reads shards.recs to train Swivel, instead of using the input files produced by prep.py?

Thanks for your time.
Best Regards,
Phuong

Chris Waterson

May 22, 2017, 2:48:02 PM
to Phuong Nguyen, Swivel Embeddings
On Mon, May 22, 2017 at 9:20 AM, Phuong Nguyen <thuphu...@gmail.com> wrote:
Hi Chris and All:

I ran into a couple of difficulties when running the Swivel model on a larger corpus or with more GPUs. I would appreciate your help and comments on these.

1. I tried to run distributed Swivel using 4 GPUs, with 1 PS and 8 workers, on the text8 dataset (vocabulary of ~71,290 unique frequent words). It ran much slower than with 2 GPUs, about twice as slow.

This is a difficult case to debug, and I'm sorry to say I'm not really sure where we'd begin investigating it. :(

 
2. When I tried to run Swivel on a dataset with ~2.7 million unique frequent tokens (out of 3 billion tokens in total), prep.py stopped with a "too many open files" error. I increased the shard size; prep.py then produced a list of tmp files (e.g., shard-001-015.tmp, ~240 MB each) but terminated without an error message and did not produce all the files needed by swivel.py.

So, to be honest, if you're going to do a corpus of 3 billion tokens, you may want to use prep.py as a guide for writing a map-reduce job that you could run on a cluster of machines.
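The map-reduce structure here is straightforward: each mapper counts co-occurring pairs in one slice of the corpus, and the reduce step merges the partial counts. A toy Python sketch of the idea (this mimics what prep.py computes, not its actual code, and ignores cross-slice boundary pairs for simplicity):

```python
from collections import Counter

# Toy map-reduce-style sketch of co-occurrence counting: each "map"
# call handles one slice of the corpus; "reduce" merges the partial
# counts from all slices into one table.

def map_cooccurrences(tokens, window=2):
    """Count symmetric (center, context) pairs within a window."""
    counts = Counter()
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), i):
            counts[(center, tokens[j])] += 1
            counts[(tokens[j], center)] += 1
    return counts

def reduce_cooccurrences(partials):
    """Merge per-slice counts into a single co-occurrence table."""
    total = Counter()
    for part in partials:
        total.update(part)
    return total
```

On a real 3-billion-token corpus the map step would run per input split on many machines, with the reducer keyed by (row-shard, column-shard) so each shard file is written exactly once.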

You could also give fastprep.cc a try. There's an outstanding pull request that further improves its performance, but it hasn't been well tested yet: https://github.com/tensorflow/models/pull/1108
 
I also tried running glove_to_shards.py to create the tf.Record format from a GloVe co-occurrence matrix built from my ~2.7 million-word vocabulary. glove_to_shards.py also terminated after producing temporary files, with no error message explaining why it stopped.

Also, the swivel.py code does not have an option to read shards.recs.

Chris, do you have a version that reads shards.recs to train Swivel, instead of using the input files produced by prep.py?

So, I think what you're asking here is: can you make glove_to_shards.py output something that can be read by swivel.py directly?

If that's so, then I see that there was some filename drift between the two implementations. Specifically, I think we'd want to fix glove_to_shards.py so that it outputs a separate shard.pb-????? record for each tf.Example rather than a file of tf.Records.
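To illustrate the layout being described, here is a small sketch of a per-shard naming scheme: one output file per (row-shard, column-shard) pair rather than a single file of records. The exact filename pattern the tools use may differ; this only shows the shape of the fix.

```python
# Sketch of a one-file-per-shard layout: each (row, column) shard of
# the co-occurrence matrix gets its own file, so the trainer can read
# shards independently. The filename pattern is illustrative.

def shard_filenames(num_row_shards, num_col_shards):
    """Enumerate one output filename per (row, column) shard."""
    return ["shard-%03d-%03d.pb" % (r, c)
            for r in range(num_row_shards)
            for c in range(num_col_shards)]
```

Each such file would hold the serialized tf.Example for that one submatrix, matching what the trainer's input pipeline expects to glob.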

But the fact that that job isn't completing either is a bit disheartening.

Phuong Nguyen

May 23, 2017, 2:17:47 PM
to Swivel Embeddings
Chris, thank you for taking the time to look into these. I am going to try fastprep.cc first.
Best Regards,
Phuong

al...@sourced.tech

Oct 24, 2017, 8:01:25 AM
to Swivel Embeddings
Hi,

So, to be honest, if you're going to do a corpus of 3 billion tokens, you may want to use prep.py as a guide for writing a map-reduce job that you could run on a cluster of machines.
 
Just FYI, we have implemented this distributed preprocessing step using Apache Spark at https://github.com/src-d/swivel-spark-prep

--
Alex