Can I use different corpus for fasttext build_vocab than train?


Ghawadi Bassam

Feb 23, 2022, 10:14:49 PM
to Gensim
Hello,

I am curious about the implications of using a different source for `build_vocab()` than for `train()`. My motivation is that there is a specific set of words I want vector representations for, and when calling `model.wv.most_similar` I only want words from this vocab list to be returned, rather than all possible words in the training corpus. I would use the results to decide whether to group those words as related to each other, based on a similarity threshold.

Below is the code snippet I am using; I'd appreciate your thoughts on whether there are any concerns with this approach.

- vocab.txt contains a list of unique words that are of interest
- corpus.txt contains full conversation text (i.e. chat messages) where each line represents a chunk of sentences per chat

```
%%time
from gensim.models.fasttext import FastText

model = FastText(min_count=1, vector_size=300,)
print(model, model.corpus_count, model.epochs,)

corpus_path = f'data/{client_id}-corpus.txt'
vocab_path = f'data/{client_id}-vocab.txt'
corpus_count = get_lines_count(corpus_path)

# build the vocabulary
model.build_vocab(corpus_file=vocab_path)
print(model, f'corpus_count:{corpus_count}', f'model.corpus_count:{model.corpus_count}')

# train the model
model.train(corpus_file=corpus_path, epochs=100, 
    total_examples=corpus_count, total_words=model.corpus_total_words,
)

print(model, f'corpus_count:{corpus_count}', f'model.corpus_count:{model.corpus_count}')
```

Ghawady

Gordon Mohr

Feb 24, 2022, 1:58:11 AM
to Gensim
You can try it, but I wouldn't expect it to work well for most purposes. 

The `build_vocab()` call establishes the known vocabulary of the model, & caches some stats about the corpus.

If you then supply another corpus – & especially one with *more* words – then:

* You'll want your `train()` parameters to reflect the actual size of your training corpus. With regard to your example code: you'll *not* want to supply a words-count taken from the vocabulary corpus (as with `model.corpus_total_words`), but rather `total_examples` & `total_words` counts that are accurate for the actual training-corpus. 
* Every word in the training corpus that's not in the known vocabulary is ignored completely, as if it weren't even there. So you might as well filter your corpus down to just the words-of-interest first, then use that same filtered corpus for both steps. Will the example texts still make sense? Will that be enough data to train meaningful, generalizable word-vectors for the words-of-interest, alongside other words-of-interest, without the full texts? (You could look at your pre-filtered corpus to get a sense of that.) I'm not sure – it could depend on how severely trimming to just the words-of-interest changed the corpus. In particular, to train high-dimensional dense vectors – as with `vector_size=300` – you need a lot of varied data. Such pre-trimming might thin the corpus so much as to make the word-vectors for the words-of-interest far less useful. 
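If you do want to try that pre-filtering route, a minimal sketch in plain Python (file names & the token lists here are hypothetical placeholders) would keep only words-of-interest in each line, so the same filtered file can feed both `build_vocab()` & `train()`:

```python
# Sketch: filter a corpus down to a set of words-of-interest, so the same
# filtered file can be used for both build_vocab() and train().
# The example lines and vocab below are made-up placeholders.

def filter_corpus(lines, words_of_interest):
    """Keep only tokens present in words_of_interest; drop lines that
    end up empty after filtering."""
    keep = set(words_of_interest)
    filtered = []
    for line in lines:
        tokens = [tok for tok in line.split() if tok in keep]
        if tokens:
            filtered.append(' '.join(tokens))
    return filtered

lines = ['hello world foo', 'bar baz', 'foo hello']
vocab = {'hello', 'foo'}
print(filter_corpus(lines, vocab))  # → ['hello foo', 'foo hello']
```

Reviewing the output of such a filter on a sample of your real chats would give a quick sense of whether the surviving texts retain enough context to be worth training on.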

You could certainly try it both ways – pre-filtered to just your words-of-interest, or with the full original corpus – and see which works better on downstream evaluations. 
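(And if the only real goal is restricting `most_similar()` output to your words-of-interest, a third option is to train on the full corpus & simply filter the returned pairs afterwards. A sketch, with a stand-in `results` list in place of the `(word, score)` pairs that `model.wv.most_similar()` would return:)

```python
# Sketch: train on the full corpus, then restrict similarity results to a
# words-of-interest set after the fact. `results` stands in for the
# (word, score) pairs a trained model's most_similar() call would return.

def restrict_to_vocab(results, words_of_interest, topn=10):
    """Keep only (word, score) pairs whose word is in words_of_interest."""
    keep = set(words_of_interest)
    filtered = [(word, score) for (word, score) in results if word in keep]
    return filtered[:topn]

results = [('cat', 0.91), ('dog', 0.88), ('xyzzy', 0.80), ('fish', 0.75)]
print(restrict_to_vocab(results, {'dog', 'fish'}))  # → [('dog', 0.88), ('fish', 0.75)]
```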

More generally, if the concern is training time with the full corpus, there are likely other ways to get an adequate model in an acceptable amount of time. 

If using `corpus_file` mode, you can increase `workers` up to the local CPU core count for a nearly-linear speedup with the number of cores. (In traditional `corpus_iterable` mode, max throughput is usually reached somewhere in the range of 6-12 `workers` threads, as long as you have that many cores.)

`min_count=1` is usually a bad idea for these algorithms: discarding the lowest-frequency words, as the default `min_count=5` does, tends to make models train faster, use less memory, & leave better vectors for the remaining words. (It's possible `FastText` can eke a little bit of benefit out of lower-frequency words via their contribution to character-n-gram training, but I'd only ever lower the default `min_count` if I could confirm it was actually improving relevant results.)

If your corpus is so large that training time is a concern, a more-aggressive (smaller) `sample` parameter value not only speeds training (by dropping many redundant high-frequency words), but often improves final word-vector quality for downstream purposes as well (by letting the rarer words have relatively more influence on the model in the absence of the downsampled words).

And again, if the corpus is so large that training time is a concern, then `epochs=100` is likely overkill. I believe the `GoogleNews` vectors were trained using only 3 passes – over a gigantic corpus. A sufficiently large & varied corpus, with plenty of examples of all words all throughout, could potentially train in 1 pass – because each word-vector can then get more total training-updates than many epochs with a small corpus would provide. (In general, larger `epochs` values are more often used when the corpus is thin, to eke out something – not on a corpus so large you're considering non-standard shortcuts to speed the steps.)

- Gordon