Hello,
For a few weeks I've been doing research with word2vec/doc2vec to optimize the search mechanism over a set of specific documents (containing logs, fault codes, descriptions, etc.). The search engine currently in use is Elasticsearch.
At first the objective was to outperform Elasticsearch in all cases, but after some tests I realized that doc2vec does not perform as well as Elasticsearch in most of them. Even so, there are specific cases where doc2vec beats Elasticsearch, for example when the goal is a context match.
So I changed the objective to using both mechanisms together to improve the system; later I will decide how to merge them... The main problem now is the training and preprocessing needed to achieve good performance.
The dataset consists of ~360k documents.
I'm testing several preprocessing combinations, such as stemming, stopword filtering, and special-character filtering.
For the hyperparameters, I'm using random search to find a good combination.
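As a sketch, this is roughly the shape of the preprocessing combinations I'm testing (dependency-free here for illustration; the stopword set is just a tiny subset of a real list, and in practice a stemmer would be applied as well):

```python
import re

# Illustrative subset only; a real run would use a full stopword list.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "or", "is", "are"}

def preprocess(text, filter_stopwords=True, filter_special=True):
    """One preprocessing combination: lowercase, optionally drop
    special characters, optionally drop stopwords."""
    text = text.lower()
    if filter_special:
        # Replace anything that is not a letter, digit, or space.
        text = re.sub(r"[^a-z0-9\s]", " ", text)
    tokens = text.split()
    if filter_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens

print(preprocess("Fault code E-42: sensor is offline!"))
# → ['fault', 'code', 'e', '42', 'sensor', 'offline']
```

Each on/off flag gives one combination to evaluate, alongside the hyperparameter search below.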
These are the ranges I'm searching over:
hyperparams = {
    'size': [100, 200],
    'min_count': [1, 2, 3, 4, 5],
    'iter': [50, 100, 150],
    'window': [4, 5, 6, 7, 8],
    'alpha': [0.025, 0.01, 0.05],
    'min_alpha': [0.025, 1e-4],
}
Of all the possible combinations of these parameters, I test 50 of them, unique and randomly selected.
The training function:
import gensim

def start_training(hyperparams, train_corpus):
    model = gensim.models.doc2vec.Doc2Vec(
        size=hyperparams['size'], min_count=hyperparams['min_count'],
        iter=hyperparams['iter'], workers=4, window=hyperparams['window'],
        alpha=hyperparams['alpha'], min_alpha=hyperparams['min_alpha'])
    print("Building vocabulary")
    model.random.seed(0)  # fixed seed for reproducibility
    model.build_vocab(train_corpus)
    print("Training the model")
    print(model)
    model.train(train_corpus, total_examples=model.corpus_count,
                epochs=model.iter)
    return model  # the return was missing in my original snippet
The evaluation method consists of searching for specific documents using a text that I know is linked to each target document.
The higher the target document is ranked for that text, the better.
For an accuracy rate I compute a weighted average based on the rank. Unfortunately there are only a few evaluation files (about 15).
The evaluation function:
import os

def eval_model(model, eval_dir, hyperparams):
    ranked_eval = {}
    correct = 0
    eval_files_list = os.listdir(eval_dir)
    for file in eval_files_list:
        eval_file = eval_dir + file
        words_vec = get_word_vec(eval_file)
        model.random.seed(0)  # fixed seed so inference is reproducible
        steps = hyperparams['iter'] + 50
        inferred_vector = model.infer_vector(
            words_vec, alpha=hyperparams['alpha'],
            min_alpha=hyperparams['min_alpha'], steps=steps)
        similars = model.docvecs.most_similar(
            [inferred_vector], topn=len(model.docvecs))
        target = eval_file[-18:]  # the doc tag is the last 18 chars of the path
        for i, (doc_tag, _score) in enumerate(similars):
            if doc_tag == target:  # was target_ER, an undefined name
                print(file, "found in position", i)
                ranked_eval[file] = i
                # Weighted credit: full credit at rank 0, decaying with rank.
                if i == 0:
                    correct += 1
                elif i < 5:
                    correct += 0.9
                elif i < 10:
                    correct += 0.7
                elif i < 20:
                    correct += 0.4
                elif i < 50:
                    correct += 0.2
                break
    accuracy_rate = (correct / len(eval_files_list)) * 100
    return accuracy_rate, ranked_eval
It's important to note that I preprocess the "input text" of the eval files the same way as the training set.
I strongly believe that Doc2Vec can bring real improvements over the currently most used search engines, but I'm still not getting acceptable results.
To reiterate: the objective is not to beat Elasticsearch's indexing algorithm, but to complement it. The idea is not to achieve great results in every case, but at least to find good documents that Elasticsearch can't.
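As a first idea for the merge, I'm considering something like reciprocal rank fusion, which only needs the ranked lists from each engine (a minimal sketch; k=60 is the constant commonly used for RRF, and the doc ids here are made up):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of doc ids (best first) into one.

    Each document scores sum(1 / (k + rank)) over the lists it
    appears in, so agreement between engines pushes a doc up.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

es_hits = ["docA", "docB", "docC"]    # hypothetical Elasticsearch results
d2v_hits = ["docC", "docA", "docD"]   # hypothetical doc2vec results
print(reciprocal_rank_fusion([es_hits, d2v_hits]))
# → ['docA', 'docC', 'docB', 'docD']
```

The nice property is that it needs no score calibration between the two engines, only their rankings.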
For example, this is the best combination I have found so far:
{
'size': 100,
'min_count': 4,
'iter': 100,
'window': 4,
'alpha': 0.025,
'min_alpha': 0.025,
'accuracy_rate': 34.285714285714285,
'model_file': './trained_models/test09_360k/test09_360krandom_21.model'
}
{ # Ranked position of each target document; lower is better.
'doc1.txt': 1567,
'doc2.txt': 396,
'doc3.txt': 10929,
'doc4.txt': 3,
'doc5.txt': 3,
'doc6.txt': 0,
'doc7.txt': 70868,
'doc8.txt': 2334,
'doc9.txt': 486,
'doc10.txt': 0,
'doc11.txt': 30569,
'doc12.txt': 1571,
'doc13.txt': 2088,
'doc14.txt': 0
}
Do you have any suggestions for improvement? Maybe searching over other hyperparameters (e.g. negative sampling values), different preprocessing, or different hyperparameter ranges.
Am I doing something wrong? Any opinion helps and is very motivating...
Thanks,
Denis.