.bin and .vec file sizes


Henry Thornton

Oct 2, 2016, 3:57:56 PM
to fastText library
While testing with a small dataset (~8,000 documents), I noticed that fastText generates a very large .bin file of more than 3 GB, while the .vec text file is only ~70 MB. Is this normal?

isneh...@gmail.com

Oct 17, 2016, 3:05:23 PM
to fastText library
Yes. The .bin file contains more information than the .vec file, so it is larger.

Edouard G.

Oct 17, 2016, 11:51:10 PM
to fastText library
Hi,

You can reduce the size of the output model by reducing the number of buckets used for word / character ngram features, with the -bucket option. By default, fastText uses 2M buckets, but on small datasets you can probably reduce this number to 200k, or even lower.

You can also try to reduce the dimension of the word vectors, with the -dim option (default is 100). Finally, you can try to reduce the size of the vocabulary, by increasing the -minCount value (default is 1).
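As a rough illustration of why these options matter, the sketch below estimates the size of the input matrix. It assumes the matrix holds one row of `dim` 4-byte floats per vocabulary word plus one row per bucket. The function name and the 50,000-word vocabulary are hypothetical, and the real .bin file also stores other data (such as the output matrix and the dictionary), so treat the result as an approximate lower bound rather than an exact file size.

```python
BYTES_PER_FLOAT = 4  # assumption: weights are stored as 32-bit floats


def estimate_input_matrix_bytes(vocab_size, bucket=2_000_000, dim=100):
    """Rough size of the input matrix: one row per word plus one per bucket."""
    return (vocab_size + bucket) * dim * BYTES_PER_FLOAT


vocab = 50_000  # hypothetical vocabulary size for a small corpus
for bucket in (2_000_000, 200_000):
    size_mb = estimate_input_matrix_bytes(vocab, bucket=bucket) / 1e6
    print(f"bucket={bucket:>9,} dim=100 -> about {size_mb:,.0f} MB")
```

Under these assumptions, the default 2M buckets account for about 820 MB at dim 100, while 200k buckets bring the estimate down to about 100 MB. The estimate scales linearly with dim, so halving the dimension halves it.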

Best,
Edouard.

Henry Thornton

Oct 18, 2016, 3:14:38 AM
to fastText library
I'm using a small 8K-document dataset to trial fastText. Production datasets are tens of millions of items, so I guess the .bin file will be even bigger?

Edouard G.

Dec 22, 2016, 6:42:47 AM
to fastText library
Hi Henry,

On larger datasets, the size occupied by the buckets for ngram features will not change. Thus the .bin file will not necessarily be much bigger.

To reduce the model file size for very large training datasets, you can reduce the size of the vocabulary by increasing the -minCount parameter (e.g. from 1 to 5).
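To illustrate the effect of -minCount, here is a minimal sketch that counts how many distinct words survive a frequency threshold. It assumes plain whitespace tokenization and that a word is kept when it occurs at least `min_count` times. The function name and the toy corpus are made up for the example, and this is not fastText's own code.

```python
from collections import Counter


def vocab_size_after_min_count(lines, min_count):
    """Count distinct whitespace-separated tokens seen at least min_count times."""
    counts = Counter(token for line in lines for token in line.split())
    return sum(1 for freq in counts.values() if freq >= min_count)


# Toy corpus, purely illustrative.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "a cat and a dog",
]
for min_count in (1, 2, 3):
    print(min_count, vocab_size_after_min_count(corpus, min_count))
```

On real text, most distinct words are rare, so a small increase in the threshold can remove a large share of the vocabulary rows while the bucket rows stay the same.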

Best,
Edouard.