Hi,
I am currently encountering some difficulties with a doc2vec model, particularly in the results given by the model.
The context of my study:
I am working on a corpus of documents that represent the qualifications of telephone contacts at a call center for an insurance client.
Each document is a concatenation of:
1/ words from the drop-down menus that agents select to describe the call coarsely
2/ words from the comment they leave to qualify the call more finely.
My corpus is composed of about 1 million relatively short documents (between 3 and 100 words each), and my goal is to cluster these contact qualifications into small families based on their similarity, especially with regard to the content of the comment field.
What motivates the doc2vec + clustering approach:
1/ the drop-down menus are poorly used by agents: a few catch-all categories are often used to describe very different situations
2/ agents write mainly in abbreviations in the comment field because the number of characters is limited, and each agent has their own abbreviations.
The results obtained:
With word2vec:
I’ve trained word2vec on the corpus, mainly to see whether the model can capture the meaning behind abbreviations, and I got very good results.
For example, it finds almost every possible abbreviation (or typo) of the French word “reglement”.
modelw2v.wv.most_similar('reglement', topn=20)
[('rglmt', 0.6037212610244751), ('reglt', 0.5418980717658997),('rglement', 0.5389199256896973), ('regl', 0.51054847240448), ('rglmnt', 0.49834704399108887), ('regul', 0.49558568000793457), ('regelement', 0.48182356357574463), ('rglt', 0.4810987114906311), ('eglement', 0.4797542095184326),('rgt', 0.475763738155365), ('paiement', 0.47565507888793945), ('encaissement', 0.46454089879989624), ('reglemnt', 0.46448570489883423), ('reversement', 0.4572943449020386), ('rlgt', 0.4441632330417633), ('cheq', 0.4148927330970764), ('rgltm', 0.4078122675418854), ('rlg', 0.40633389353752136), ('regt', 0.4033368229866028), ('cheque', 0.40114736557006836)]
With doc2vec:
Now I want to use this ability to handle abbreviations to find similar documents by training a doc2vec model. However, now that I am at the document level, I get very bad results.
Example:
numtest = 860
print(ctct_corpus[numtest])
inferred_vector = modeld2vec.infer_vector(ctct_corpus[numtest])
print("++++++++++++++++++++++++++++++++++++++++++++++++")
for el in modeld2vec.docvecs.most_similar(numtest, topn=2):
    print(el, ctct_corpus[el[0]])
    print()
Output:
vie du contrat informations generales garanties garanties etablissmnt hospi ai info limitation de 30 jrs pr sejour en etablissmnt reducation etablissmnt hospi ai info limitation de 30 jrs pr sejour en etablissmnt reducation
++++++++++++++++++++++++++++++++++++++++++++++++
(228450, 0.19079256057739258) vie du contrat informations generales contrat contrat co lui ai envoye par mail let XXXX pour les niveau X et X avec pack con frt niveau privilege et le detail des gties de ce ct co lui ai envoye par mail let 1022 pour les niveau 4 et 4 avec pack con frt niveau privilege et le detail des gties de ce ct
(589278, 0.19007737934589386) prestations indemnisations devis dentaire dentaire mr ai info sur conditions de rb prothese dentaire XXX y compris ss cient mecontent souhaite resilier mr ai info sur conditions de rb prothese dentaire XXX y compris ss cient mecontent souhaite resilier
Here, the tested document (number 860) refers to a 30-day limit on the time spent in rehabilitation institutions.
The 2 most similar documents found by the model:
1/ a document about sending the details of the guarantees of a customer's contract to the broker
2/ a document about a client's dissatisfaction with dental reimbursement levels and a threat to cancel the contract
What’s wrong? Where is my error?
Here are my parameters:
For the word2vec model:
gensim.models.word2vec.Word2Vec(size=150, window=5, min_count=5, workers=7, iter=15)
For the doc2vec model:
gensim.models.doc2vec.Doc2Vec(dm=1, size=150, window=5, min_count=5, workers=7, iter=20)
Any idea? Recommendations?
Thanks in advance for your help,
Mathieu
import re
import string
import unicodedata

def getsentenceCTCT(row, weights):
    [wa, wb, wc, wd, we] = weights
    cmtr = row['Cmtr']
    cmtr = cmtr.lower()
    cmtr = re.sub(r'^(ct)', "", cmtr)
    # A word break (an extra space) was spotted at the 75th character
    if len(cmtr) > 75:
        cmtr = cmtr[:75] + cmtr[76:]
    # Concatenate the fields, repeating each one according to its weight
    s = (row['Mét'] + " ") * wa + (row['Obj'] + " ") * wb + (row['Mot'] + " ") * wc + (row['Ss mot'] + " ") * wd + (cmtr + " ") * we
    s = s.lower()
    # Steps that avoid producing a stray word 'fr' once punctuation is removed,
    # so it is not confused with the abbreviation 'fr', which can mean 'frais' or 'faire'
    # replace e-mail addresses with a generic word
    s = re.sub(r'\b[\w.-]+?@\w+?\.\w+?\b', 'adressemail', s)
    # replace the client's website with a generic word
    s = re.sub(r'client.fr', 'siteclientfr', s)
    # replace IBANs with a generic word
    s = re.sub(r'^(.*fr76)[0-9]+', 'numiban', s)
    # Remove punctuation
    s = re.compile('[%s]' % re.escape(string.punctuation)).sub(' ', s)
    # Normalize the sentence (strip accents, collapse repeated spaces)
    s = unicodedata.normalize('NFD', s).encode('ascii', 'ignore').decode()
    while "  " in s:
        s = s.replace("  ", " ")
    s = s.strip()
    return s
dfCTCTSAN['motif_CTCT'] = dfCTCTSAN.apply(lambda row: getsentenceCTCT(row, [1, 1, 1, 2, 2]), axis=1)
CTCT_corpus = dfCTCTSAN['motif_CTCT']
CTCT_corpus = CTCT_corpus.reset_index()
print("Stats for length of Documents")
print(CTCT_corpus['motif_CTCT'].map(lambda s: len(s.split())).describe())
Stats for length of Documents
count 945946.00000
mean 25.76403
std 19.99681
min 2.00000
25% 7.00000
50% 21.00000
75% 40.00000
max 98.00000
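To make the effect of the weights [1, 1, 1, 2, 2] concrete, here is a minimal sketch of the same string-repetition trick used in getsentenceCTCT. The helper name weighted_concat and the field values are hypothetical, chosen only for illustration; the point is that a weight of 2 on the comment field makes every comment appear twice in the document, which is why the examples in this thread show repeated text.

```python
def weighted_concat(fields, weights):
    """Repeat each field `weight` times, separated by single spaces."""
    # Same idea as (field + " ") * weight in getsentenceCTCT.
    s = "".join((text + " ") * weight for text, weight in zip(fields, weights))
    return s.strip()


# Hypothetical field values, for illustration only.
fields = ["gestion", "cotisations", "paiement", "relance", "appel cli"]
doc = weighted_concat(fields, [1, 1, 1, 2, 2])
print(doc)
print(len(doc.split()))
```

With these weights the last two fields are duplicated, so the token count of a document is roughly the menu words plus twice the sub-reason and comment words.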
def read_corpus(corpus):
    for ix, row in corpus.iterrows():
        yield gensim.models.doc2vec.TaggedDocument(row['motif_CTCT'].split(), [row['index']])
train_corpus = list(read_corpus(CTCT_corpus))
modeld2vec = gensim.models.doc2vec.Doc2Vec(dm=1, size=150, window=5, min_count=5, workers=7, iter=20)
print(modeld2vec)
Doc2Vec(dm/m,d150,n5,w5,mc5,s0.001,t7)
modeld2vec.build_vocab(train_corpus)
print("Count of Documents: {}".format(modeld2vec.corpus_count))
print("Count of unique different words in these documents: {}".format(len(modeld2vec.wv.vocab)))
Count of Documents: 945946
Count of unique different words in these documents: 24593
%time modeld2vec.train(train_corpus, total_examples=modeld2vec.corpus_count, epochs=modeld2vec.iter)
Wall time: 29min 10s
319575569
Here are the results:
# Test on a few contacts
for numtest in [6591, 78986, 564131]:
    print("=============================================================")
    print("Initial document (as string):")
    print(CTCT_corpus[CTCT_corpus['index'] == numtest].values)
    print()
    print("Document in doc2vec corpus (as tokens):")
    print(train_corpus[numtest])
    print()
    print("-------------------------------------------")
    inferred_vector = modeld2vec.infer_vector(train_corpus[numtest][0], alpha=0.01, steps=20)
    # print(inferred_vector)
    for el in modeld2vec.docvecs.most_similar([inferred_vector], topn=3):
        print("Best match index & similarity score: ", el)
        print(CTCT_corpus[CTCT_corpus['index'] == el[0]].values)
        print()
=============================================================
Initial document (as string):
[[6591 'prestation individuel demande rembourst bipage par prestations rb fj au tiers cxXXXXX info fille de l adh rb fj au tiers cxXXXXX info fille de l adh']]

Document in doc2vec corpus (as tokens):
TaggedDocument(['prestation', 'individuel', 'demande', 'rembourst', 'bipage', 'par', 'prestations', 'rb', 'fj', 'au', 'tiers', 'cxXXXXX', 'info', 'fille', 'de', 'l', 'adh', 'rb', 'fj', 'au', 'tiers', 'cxXXXXX', 'info', 'fille', 'de', 'l', 'adh'], [6591])

-------------------------------------------
Best match index & similarity score: (6591, 0.6939924955368042)
[[6591 'prestation individuel demande rembourst bipage par prestations rb fj au tiers cxXXXXX info fille de l adh rb fj au tiers cxXXXXX info fille de l adh']]

Best match index & similarity score: (249594, 0.5970174074172974)
[[249594 'prestation individuel demande rembourst bipage par prestations rb transport rb transport']]

Best match index & similarity score: (845638, 0.5846525430679321)
[[845638 'gestion relance instance relance tel 1 da par adh et manda par tiers da par adh et manda par tiers']]

# Here, while the second match is fairly similar, the last one has nothing to do with the tested document.
=============================================================
Initial document (as string):
[[78986 'telephonie prestations indemnisations devis autres autres praticien ai info garantie appareil auditif 600 renouv ts les 3 ans rmbs ro inclus praticien ai info garantie appareil auditif 600 renouv ts les 3 ans rmbs ro inclus']]

Document in doc2vec corpus (as tokens):
TaggedDocument(['telephonie', 'prestations', 'indemnisations', 'devis', 'autres', 'autres', 'praticien', 'ai', 'info', 'garantie', 'appareil', 'auditif', '600', 'renouv', 'ts', 'les', '3', 'ans', 'rmbs', 'ro', 'inclus', 'praticien', 'ai', 'info', 'garantie', 'appareil', 'auditif', '600', 'renouv', 'ts', 'les', '3', 'ans', 'rmbs', 'ro', 'inclus'], [78986])

-------------------------------------------
Best match index & similarity score: (55324, 0.8848056197166443)
[[55324 'prestation individuel demande rembourst bipage par mdf devis chir devis chir']]

Best match index & similarity score: (711873, 0.8842121958732605)
[[711873 'gestion cotisations paiement demande de remboursement demande de remboursement demande de remboursement demande de remboursement ']]

Best match index & similarity score: (505816, 0.8803945779800415)
[[505816 'gestion resiliation suspension annula resiliation acs acs']]

# Here the document does not find itself among the top 3 most similar documents. The last match is a very different business case, even though the similarity score is high.
=============================================================
Initial document (as string):
[[564131 'telephonie suivi d action appel suite reclamation faire appel telephonique d info cli pas joignable faire appel telephonique d info cli pas joignable']]

Document in doc2vec corpus (as tokens):
TaggedDocument(['telephonie', 'suivi', 'd', 'action', 'appel', 'suite', 'reclamation', 'faire', 'appel', 'telephonique', 'd', 'info', 'cli', 'pas', 'joignable', 'faire', 'appel', 'telephonique', 'd', 'info', 'cli', 'pas', 'joignable'], [564131])

-------------------------------------------
Best match index & similarity score: (891814, 0.9361838102340698)
[[891814 'gestion ajout adhesion intrclient']]

Best match index & similarity score: (681130, 0.9340693950653076)
[[681130 'gestion credit impots valide injection cs septembre 2015 injection cs septembre 2015']]

Best match index & similarity score: (902394, 0.9340336322784424)
[[902394 'gestion modification affaire modification date d effet modification date d effet']]

| dm | 0 |
| size | 200 |
| window | 9 |
| min_count | 20 |
| iterate | 20 |
| dbow_words | 1 |
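The "does a document find itself?" check used in the three tests above can be turned into a number: the rank of a document's own trained vector when all document vectors are sorted by cosine similarity to its re-inferred vector. Below is a minimal sketch in plain Python. The names cosine and self_rank and the toy 3-dimensional vectors are hypothetical; in the real setting the query would presumably be the output of modeld2vec.infer_vector and the candidates the trained document vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def self_rank(query_vec, doc_vecs, doc_id):
    """0-based rank of doc_id when documents are sorted by
    descending cosine similarity to query_vec (0 = best match)."""
    ranked = sorted(doc_vecs,
                    key=lambda k: cosine(query_vec, doc_vecs[k]),
                    reverse=True)
    return ranked.index(doc_id)


# Toy vectors standing in for trained document vectors (hypothetical values).
doc_vecs = {
    6591: [1.0, 0.1, 0.0],
    249594: [0.9, 0.3, 0.1],
    845638: [0.0, 1.0, 0.2],
}
# Stand-in for the vector re-inferred from the tokens of document 6591.
inferred = [0.95, 0.15, 0.0]
rank = self_rank(inferred, doc_vecs, 6591)
print(rank)
```

Counting how often this rank is 0 over a sample of documents gives a single figure to compare parameter settings, instead of judging a handful of examples by eye.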