select Doc2Vec parameters


Seda Aydin

Jan 7, 2022, 4:08:05 AM
to Gensim
Hi, 

How can I select optimal Doc2Vec parameters? My task is measuring the similarity between texts.
I have a dataset with 14240 texts.
Here are the parameters of my doc2vec model. I have tried many parameter combinations, but without success.
How can I fix this calculation?

# python code
import multiprocessing
import gensim
from gensim.models.doc2vec import TaggedDocument

docLabels = []
for req in data_tokenize["REQ_NO"]:
    docLabels.append(str(req))

corpus_list = []
for col in data_tokenize['TOKENIZED_SENTENCE']:
    l_word = col.split(" ")
    corpus_list.append(l_word)
   
len(corpus_list) #14240
   
class DocIterator(object):
    def __init__(self, doc_list, labels_list):
        self.labels_list = labels_list
        self.doc_list = doc_list
    def __iter__(self):
        for idx, doc in enumerate(self.doc_list):
            yield TaggedDocument(words=doc, tags=[self.labels_list[idx]])

it = DocIterator(corpus_list, docLabels) #(data, docLabels)

workers= multiprocessing.cpu_count()-1
corpus_count = len(docLabels)  # 14240
token_count = sum([len(sentence) for sentence in corpus_list]) # 656644
model_name = 'd2v_30'

model = gensim.models.Doc2Vec(vector_size = 100
                            , min_count = 5
                            , alpha = 0.025
                            , min_alpha = 0.0001
                            , dm = 0
                            , sample = 10e-3 # dbow =10e-3 - dm = 10e-6
                            , workers = workers
                            )
model.build_vocab(corpus_iterable = it)
for epoch in range(100):
        model.train(it, total_examples = corpus_count, total_words = token_count, epochs = 10 )
        model.alpha -= 0.002
        model.min_alpha = model.alpha
   
model.save('C:/Users/doc2vec/'+model_name+'.model')

Regards,
Seda 

Gordon Mohr

Jan 7, 2022, 9:01:26 PM
to Gensim
14k docs (of only ~47 words each on average, and less after rare words discarded) is on the small side for training a Doc2Vec model, but I'd still expect you to see some useful modeling from training. (Increasing `epochs` & decreasing `vector_size` can help a bit with thinner corpora, but getting more compatible data would be best.)

Your main error is trying to manage the training iterations and `alpha` learning rate with multiple calls to `train()` inside your own loop – and further severely mismanaging `alpha` such that it's going to a nonsensical negative value long before training ends. 

Doing this yourself is almost always a mistake – unnecessary and error-prone. You should even have seen a `WARNING`-level message in your logs about the atypical situation: "Effective 'alpha' higher than previous training cycles".

See this SO answer for more details: https://stackoverflow.com/q/62801052/130288

If you can let whatever tutorial/example/etc site you copied this practice from know that they're steering users astray, please do! (And if by chance they offered exactly these wrong values for `alpha`, the looping range, and the `alpha`-decrement, note that they didn't know what they were doing or check their results effectively before offering this example, so take any of their other guidance with a big grain of salt in the future.)

Separate comments:

* enabling logging at the INFO level will give you far more insight into the process & progress of the model

* when you do call `.train()` with the same corpus as was just passed to `build_vocab()`, it's sufficient to use the `.corpus_count` that's cached inside the model (as per the code in the SO answer) – there's no need for the extra complexity & error-risk of supplying your own calculation. 

* using a non-default `sample` parameter is typically most beneficial with much-larger corpora, where you might want to make it more aggressive (even smaller than the default `1e-3`) - but starting out, with a tiny corpus, I'd not specify anything at all. (And, contrary to the comment in your code, I've never noticed a good reason to make it different between PV-DM and PV-DBOW modes.)

* setting `workers` to a number equal to, or just under, the count of CPU cores is only a good policy up to about 8-core processors; if by chance you're on a machine with even more cores, a `workers` value somewhere in the 6-12 range is likely to achieve the highest training throughput (though with a tiny corpus, no choice here will make that big of a running-time difference).

- Gordon

Seda Aydin

Jan 11, 2022, 1:10:47 PM
to Gensim
Thanks for your help.

I have a new question.
I revised my model. These are the parameters that give me the best results after clustering, but when I check cross-text similarity, the model reports irrelevant texts as similar.
What could be the reason for this? Is a vector size of 5 too small for 14240 texts? I've tried larger vector sizes, but 5 works best.

This is the code:
model_name = 'd2v_36'      
max_epochs = 30
model = gensim.models.Doc2Vec(vector_size=5,dm=0,dbow_words=2)  
model.build_vocab(it)
model.train(it, total_examples=model.corpus_count, epochs=max_epochs)

Regards,
Seda.
On Saturday, January 8, 2022, at 05:01:26 UTC+3, Gordon Mohr wrote:

Gordon Mohr

Jan 12, 2022, 2:53:45 PM
to Gensim
If clustering is the real goal, then optimize for good results in that process. 

If similar-doc-retrieval is the ultimate goal, then optimize for better results in that task.

These algorithms are typically used on training sets of (at least) many tens-of-thousands of documents, and tens-of-millions of training words, to train "dense embeddings", for words or docs, of 100 dimensions or more. 

I've only used smaller datasets, and smaller dimensionalities, in tiny instructive examples meant to demonstrate the steps more so than quality results. And 5 dimensions is so tiny that I wouldn't expect most of the usual motivations for pursuing dense embeddings to still apply. So whether these techniques can be scaled down to your specific data & goals, and still give useful results, is something only your own experimental results can tell you. So sure, experiment with larger vectors, but if at all possible, get far more training data.

Separately: `dbow_words=2` is a nonsensical (but harmless) setting. `dbow_words` is interpreted as a boolean toggle, enabling or disabling simultaneous skip-gram training, so `dbow_words=2` has the same effect as `dbow_words=1` or `dbow_words=True`. (If you want to affect the number of skip-gram neighbors considered, the `window` parameter still controls that.)

- Gordon
