Doc2Vec - Finding good hyperparameters and preprocessing combinations


Denis Candido

Sep 27, 2017, 1:54:13 PM
to gensim
Hello,

For a few weeks now I've been doing research using word2vec/doc2vec to optimize the search mechanism over several specific documents (containing logs, fault codes, descriptions, etc.). The search engine currently in use is Elasticsearch.
At first the objective was to outperform Elasticsearch in all cases, but after some tests I realized that doc2vec does not perform as well as Elasticsearch in most cases. Even so, there are specific cases where doc2vec does beat Elasticsearch, for example when the objective is a context match.

So I made it my objective to use both mechanisms together to improve the system; later I will decide how they will be merged... The main problem here is the training and preprocessing needed to achieve good performance.

The dataset consists of ~360k documents.

I'm testing several preprocessing combinations, such as stemming, filtering stopwords, and filtering special characters.
For the hyperparameters, I'm using random search to look for a good combination.
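As an illustration of how such preprocessing combinations could be toggled on and off (the step implementations below are simplified stand-ins, not the actual pipeline; a real setup would use e.g. NLTK's stemmer and stopword lists):

```python
import re

# Stand-in stopword list; a real pipeline would use a full list (e.g. NLTK's).
STOPWORDS = {'the', 'a', 'of', 'and', 'to', 'in'}

def preprocess(text, stem=False, filter_stopwords=False, filter_special=False):
    """Tokenize text, with each preprocessing step individually switchable,
    so different combinations can be tested alongside the hyperparameter search."""
    text = text.lower()
    if filter_special:
        text = re.sub(r'[^a-z0-9 ]', ' ', text)  # drop special characters
    tokens = text.split()
    if filter_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    if stem:
        # naive suffix-stripping stand-in for a real stemmer
        tokens = [re.sub(r'(ing|ed|s)$', '', t) for t in tokens]
    return tokens
```

Each boolean flag then becomes one more dimension of the search space, alongside the model hyperparameters.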

These are the combinations that I'm using:

hyperparams = {
    'size': [100, 200],
    'min_count': [1, 2, 3, 4, 5],
    'iter': [50, 100, 150],
    'window': [4, 5, 6, 7, 8],
    'alpha': [0.025, 0.01, 0.05],
    'min_alpha': [0.025, 1e-4],
}

Of all the possible combinations generated from these parameters, I test 50 of them, unique and randomly selected.
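The 50-of-900 sampling described above can be sketched like this (a minimal illustration; `random.sample` draws without replacement, so the 50 combinations are unique):

```python
import itertools
import random

hyperparams = {
    'size': [100, 200],
    'min_count': [1, 2, 3, 4, 5],
    'iter': [50, 100, 150],
    'window': [4, 5, 6, 7, 8],
    'alpha': [0.025, 0.01, 0.05],
    'min_alpha': [0.025, 1e-4],
}

# Enumerate the full grid (2*5*3*5*3*2 = 900 combinations) ...
keys = list(hyperparams)
all_combos = [dict(zip(keys, values))
              for values in itertools.product(*(hyperparams[k] for k in keys))]

# ... then draw 50 unique combinations at random.
random.seed(0)  # optional: reproducible sampling
sampled = random.sample(all_combos, 50)
```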

The train function:

def start_training(hyperparams, train_corpus):
    model = gensim.models.doc2vec.Doc2Vec(size=hyperparams['size'],
                                          min_count=hyperparams['min_count'],
                                          iter=hyperparams['iter'],
                                          workers=4,
                                          window=hyperparams['window'],
                                          alpha=hyperparams['alpha'],
                                          min_alpha=hyperparams['min_alpha'])
    print("Building vocabulary")
    model.random.seed(0)
    model.build_vocab(train_corpus)
    print("Training the model")
    print(model)
    model.train(train_corpus, total_examples=model.corpus_count, epochs=model.iter)


The evaluation method consists of searching for specific documents using a text that I'm sure is linked to the document.
The higher the target document ranks for that text, the better.

For an accuracy rate I made a weighted average based on the rank. Unfortunately there are only a few evaluation files (about 15).

The evaluation function:

def eval_model(model, eval_dir, hyperparams):
    ranked_eval = {}
    correct = 0

    eval_files_list = os.listdir(eval_dir)
    for file in eval_files_list:
        eval_file = eval_dir + file
        words_vec = get_word_vec(eval_file)
        model.random.seed(0)
        steps = hyperparams['iter'] + 50
        inferred_vector = model.infer_vector(words_vec, alpha=hyperparams['alpha'],
                                             min_alpha=hyperparams['min_alpha'],
                                             steps=steps)
        similars = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
        target = eval_file[-18:]

        for i in range(len(similars)):
            sim = similars[i]
            if sim[0] == target:
                print(file, "found in position", i)
                ranked_eval[file] = i
                if i == 0:
                    correct += 1
                elif 1 <= i < 5:
                    correct += 0.9
                elif 5 <= i < 10:
                    correct += 0.7
                elif 10 <= i < 20:
                    correct += 0.4
                elif 20 <= i < 50:
                    correct += 0.2
                break

    accuracy_rate = (correct / len(eval_files_list)) * 100

    return accuracy_rate, ranked_eval


It's important to note that I preprocess the "input text" of the eval files the same way as the training set.

I strongly believe that Doc2Vec can meaningfully improve on the currently most used search engines, but it's still not getting acceptable results.
Remember that the objective is not to beat the Elasticsearch indexing algorithm, but to complement it. So the idea is not to achieve great results in every case, but at least to find good documents that Elasticsearch can't.

For example, this is the best combination I've found so far:


{
    'size': 100,
    'min_count': 4,
    'iter': 100,
    'window': 4,
    'alpha': 0.025,
    'min_alpha': 0.025,
    'accuracy_rate': 34.285714285714285,
    'model_file': './trained_models/test09_360k/test09_360krandom_21.model'
}
{   # These are the ranked positions of the documents. The lower, the better.
    'doc1.txt': 1567,
    'doc2.txt': 396,
    'doc3.txt': 10929,
    'doc4.txt': 3,
    'doc5.txt': 3,
    'doc6.txt': 0,
    'doc7.txt': 70868,
    'doc8.txt': 2334,
    'doc9.txt': 486,
    'doc10.txt': 0,
    'doc11.txt': 30569,
    'doc12.txt': 1571,
    'doc13.txt': 2088,
    'doc14.txt': 0
}


Do you have any suggestions for improving this? Maybe searching over other hyperparameters (e.g. the negative sampling values), different preprocessing, or other hyperparameter ranges.
Am I doing something wrong? Any opinion helps and is very motivating...

Thanks,
Denis.

Denis Candido

Sep 27, 2017, 1:59:00 PM
to gensim
This is the preprocessing used for the 'best combination found' mentioned above:

soup = BeautifulSoup(out, 'html.parser')
out = soup.get_text()

out = out.lower()  # set all characters to lower case
out = re.sub(r'^.*\wmissing.*?$', ' ', out, flags=re.M)  # remove lines with productmissing due to HTML
out = re.sub(r'\s+', ' ', out)  # replace each run of whitespace with a single space

out = re.sub(r'_', '', out)   # merge words joined with _
out = re.sub(r'=', '', out)   # merge words joined with =
out = re.sub(r'-', '', out)   # merge words joined with -
out = re.sub(r'/', '', out)   # merge words joined with /
out = re.sub(r"'", '', out)   # merge words joined with '

out = re.sub(r'[^a-zA-Z0-9 ]', ' ', out)  # remove all remaining special characters

Gordon Mohr

Sep 27, 2017, 3:04:38 PM
to gensim
I'm not sure what you mean by "affordable results". It's hard to evaluate your scoring method - it seems a bit narrow (just 14 querydoc-to-desired-result probes?) and complicated (ad hoc scoring). But if you think it accurately models your desired results, it's better than just 'eyeballing' in that its quantitative & repeatable. 

Your corpus size in documents is reasonable, especially if the documents themselves are not tiny. You should try to make sure they're not arranged in some way that all similar documents (in size, topic, etc) are clumped together. (For example, if there's a risk of that, one initial shuffle should help.)
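That one-time shuffle could look like this (a minimal sketch; the string list is a placeholder for the actual list of TaggedDocument objects):

```python
import random

# Placeholder corpus standing in for the real list of TaggedDocument objects.
train_corpus = ['doc_%d' % i for i in range(1000)]

# A single up-front shuffle, so similar documents (by size, topic, etc.)
# aren't clumped together during training.
random.seed(0)  # optional: makes the ordering reproducible
random.shuffle(train_corpus)
```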

Regarding your preprocessing, does examining the tokenized end-result seem to retain tokens you'd expect to be significant for your purposes? If so, it's a fine base, but other experimentation should be driven by your domain-familiarity. Do make sure your preprocessing on the querydocs is the same as that performed on the bulk-training, so it's an apples-to-apples calculation. 

Regarding your meta-parameter search space:

* you should probably never tinker with `min_alpha` – the algorithm is based on the learning-rate value decaying to something tiny. With your inclusion of a end-value as high as the typical default starting-value (0.025), I'd fear a lot of 'noise' in the model's final configuration (and thus ranked evals), and some of your meta-parameter combinations may even include an alpha learning-rate that *increases* over the course of training (eg, `alpha=0.01, min_alpha=0.025`). 

* with only 14 evaluation datapoints, but tuning 6 metaparameters over a total of (2*5*3*5*3*2=) 900 permutations, and some of those metaparameters in tight ranges (esp `min_count` and `window`), and the `alpha` issue mentioned above, slight differences in your ranked results may be more a function of jitter/noise/overfitting than any tangible difference in the meta-parameter appropriateness. More eval data, and more contrast in the meta-parameters, may help deliver more reliable contrast in your evaluation scores. 

* In particular, a `min_count` of 1 often adds a lot of noise to doc-vector training (lowering quality), and eliminating more words helps more than you'd think. So trying a `min_count` range of [2, 5, 10, 20] may reveal more than just 1...5

* Similarly, tiny `window` values can be surprisingly good with large word2vec training sets, but also large `window` values tend to emphasize topical-domain-similarity more – so searching a larger range here may also help, eg [2, 5, 10, 20]

* 10 or 20 iterations are mentioned in the original Paragraph Vector (Doc2Vec) papers - so only if you're really confident in your evaluation method, and in the improving value of so many more iterations, would I test 100+ iterations

* But generally, whenever meta-parameter searching with a robust scoring method, if one of the most 'extreme' offered values performs best, it can make sense to add another more-extreme value in that same direction

* PV-DBOW mode (`dm=0`) is worth trying, especially on short documents. (Adding the extra option `dbow_words=1` to PV-DBOW trains word-vectors interleaved simultaneously, at an extra time cost proportional to `window`, which sometimes also helps.)
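As a sketch of how those mode flags could be folded into the random search above (the `mode_options` list and `with_mode` helper are hypothetical names; in gensim's 2017-era API the resulting dict would be passed as keyword arguments to `Doc2Vec` - newer releases renamed `size` to `vector_size` and `iter` to `epochs`):

```python
# Hypothetical extension of the search space: try both PV-DM and PV-DBOW.
mode_options = [
    {'dm': 1},                    # PV-DM (the default)
    {'dm': 0},                    # plain PV-DBOW
    {'dm': 0, 'dbow_words': 1},   # PV-DBOW with interleaved word-vector training
]

def with_mode(combo, mode):
    """Merge one sampled hyperparameter combination with a mode setting."""
    merged = dict(combo)
    merged.update(mode)
    return merged

params = with_mode({'size': 100, 'window': 5}, mode_options[2])
# params could then be passed as **params to gensim.models.doc2vec.Doc2Vec
```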

Finally, if I understand "classified position of the document" correctly, there's *wild* variation in the position of your desired results. (Five docs appear in the top-4 positions, but then 3 other docs are lower-than-10,000.) This makes me wonder about the quality of the test probes, and think that you should look individually at each, to see what's contributing to the specific results – are some of the probes/targets really small, really generic, poorly-handled-by-tokenization, etc. It's odd enough to be worthy-of-investigation  if something you've hand-picked to be "a desired near-top result" is actually behind 70K other documents. (But maybe, if the docs are very small and very similar, and thus the corpus isn't as "big" as it looks, it's not odd. Hard to tell.)

- Gordon

Denis Candido

Sep 28, 2017, 9:00:54 AM
to gensim
Hello Gordon,

Very grateful for your answer.

I also think that 14 files is too low a number for a test set. Unfortunately it takes a large amount of time to collect even one of these eval files, and I have to ask another person to do it because I don't know the system. I think I will have to learn to use it so I can do it myself and speed this process up.
Given that there are around 360k documents, how many evaluation files do you think would be acceptable?

I think this evaluation method has some gaps, but I don't know a better way to improve it.
Taking into account that the objective is to beat Elasticsearch in the cases where it performs poorly (cases where it ranks the document at position > 30, approximately), what if I use only the eval files where that happens?

See this table for a comparison of the rankings:


About the documents: the training set consists of documents that have a problem title and description followed by a solution (called an 'experience record') for the specified problem (the solution text is not used in training, just the title and cause description).
The eval files are the title and description of a problem (called a 'user report') reported by a user; they are written by people but often include log files. Unfortunately a lot of these reports are not well written and have syntax faults (which is not something doc2vec/word2vec can deal with...).

In the system, some problems have a link to a solution, stating that report X was solved by experience record Y. These are different systems and I don't have full access to them, so the process of finding eval files is complicated.

In order not to bias the tests, I just copy & paste report X's title/description and find the position at which experience record Y is ranked. As I said, unfortunately some of them are not well written, but on the ones that have a 'context description' of the fault, doc2vec performs well.


I don't know if you consider this large, but each doc averages around 1000 characters after preprocessing.

Do you have any ideas for an evaluation method, or a general opinion on core preprocessing steps?

Thanks,
Denis.

Gordon Mohr

Sep 28, 2017, 4:16:54 PM
to gensim
There's no magic threshold for ranking-evaluation, but more examples of desired results are better, and examples that accurately reflect the needed final-system behavior. 

Are people really querying based on the full "title and description"? 

It's not clear from your description why the 14 evaluation file target-results are considered "good results" for the associated queries. To be a `most_similar()` result at all, the document had to have been in the training set. 

Is it even possible for domain-level experts to tell, from just the sometimes poorly-written "title and description", that one of the training-set documents is well-related? Can *you* tell, especially looking at the evaluation data for expected-ranked docs like `doc3.txt` or `doc7.txt`, that the training-set doc is meaningfully related to the query-doc? (Those docs, especially, make me wonder about the evaluation dataset construction, because neither of your retrieval methods seems to come anywhere close to ranking them highly.)

Doc2Vec/Word2Vec can be tolerant of sloppy writing, including misspellings and formatting glitches, if there are enough examples to learn common/repeated glitches. (A unique typo renders a token meaningless... but if the same error repeats greater than `min_count` times, the token will have influence, and with enough examples it should acquire an influence similar to its properly-typed synonym.)

The Paragraph Vectors followup paper, "Document Embedding With Paragraph Vectors" <https://arxiv.org/abs/1507.07998>, is worth checking out for its evaluation method for tuning meta-parameters – using extra categorical fields in the corpora to identify pairs of documents that "should" have closer doc-vectors than other, randomly chosen 3rd documents that don't share the same category. (Each set of meta-parameters is scored by the % of doc-triplets, where only 2 have the same category, that match this goal.) If the "experience record" fields are more carefully written, or there are other controlled-vocabulary fields that are tended more carefully in your source data and that are shared by documents that "should" be close, a similar method might help for your dataset and goals. But if the true final goal is to have certain docs rank highly for certain queries, nothing will be better than having plenty of examples that effectively describe the desired end-behavior.
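A minimal sketch of that triplet scoring, with hypothetical helper names and a toy two-category dataset (real doc-vectors would come from a trained model, and categories from the extra fields in the source data):

```python
import random
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def triplet_score(docvecs, categories, n_triplets=200, seed=0):
    """Fraction of (anchor, same-category, other-category) triplets in which
    the same-category doc-vector is the closer one to the anchor."""
    rng = random.Random(seed)
    ids = list(docvecs)
    hits = trials = 0
    while trials < n_triplets:
        a, b, c = rng.sample(ids, 3)
        # keep only triplets where a and b share a category and c differs
        if categories[a] == categories[b] and categories[a] != categories[c]:
            trials += 1
            if cosine(docvecs[a], docvecs[b]) > cosine(docvecs[a], docvecs[c]):
                hits += 1
    return hits / n_triplets

# Toy illustration: two well-separated categories should score near 1.0.
docvecs = {'a1': [1.0, 0.1], 'a2': [0.9, 0.2], 'a3': [1.0, 0.0],
           'b1': [0.1, 1.0], 'b2': [0.0, 1.0], 'b3': [0.2, 0.9]}
categories = {'a1': 'A', 'a2': 'A', 'a3': 'A', 'b1': 'B', 'b2': 'B', 'b3': 'B'}
score = triplet_score(docvecs, categories)
```

Each candidate set of meta-parameters would be ranked by this score instead of the hand-weighted rank average.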

- Gordon

Rajat Mehta

Oct 30, 2018, 10:30:42 AM
to Gensim
Hi Denis,

I have been working on something similar, training a doc2vec model on my train set, and I am stuck. Maybe you can help me a little bit with this: I am not able to figure out how to tune my doc2vec model. I just have a set of documents, and my model is trained to create embeddings for those documents. In order to implement GridSearchCV I need labels to evaluate my models, but I don't have any. I looked at many blogs and could not figure out a solution.
Could you please help me with this and give me some idea of how to tune the hyperparameters of my model, and also how you implemented GridSearchCV for your model? I would be really thankful for your help.

Regards,
Rajat