Poor results with doc2vec, but very good results with word2vec

MG MG

Sep 19, 2017, 1:49:26 PM

to gensim

Hi,

I am having trouble with a doc2vec model, specifically with the quality of the results it returns.

The context of my study:

I am working on a corpus of documents that record how telephone contacts were qualified at a call center for an insurance client.

Each document is a concatenation of:

1/ the words from the drop-down menus that agents use to describe the call in a coarse manner

2/ the words from the free-text comment they leave to qualify the call more finely.

My corpus is composed of about 1 million relatively short documents (between 3 and 100 words each), and my goal is to cluster these contact qualifications into small families based on their similarity, especially with regard to the content of the comment field.

What motivates the doc2vec + clustering approach:

1/ the drop-down menus are poorly used by agents. A few catch-all categories are often used to describe very different situations

2/ agents write mainly in abbreviations in the comment field because the number of characters is limited. And each agent has their own abbreviations...

The results obtained:

With word2vec:

I trained word2vec on the corpus, mainly to see whether the model can capture the meaning behind abbreviations, and I got really good results.

For example, it finds almost every possible abbreviation (or typo) of the French word “reglement”.

modelw2v.wv.most_similar('reglement', topn=20)

[('rglmt', 0.6037212610244751), ('reglt', 0.5418980717658997),('rglement', 0.5389199256896973), ('regl', 0.51054847240448), ('rglmnt', 0.49834704399108887), ('regul', 0.49558568000793457), ('regelement', 0.48182356357574463), ('rglt', 0.4810987114906311), ('eglement', 0.4797542095184326),('rgt', 0.475763738155365), ('paiement', 0.47565507888793945), ('encaissement', 0.46454089879989624), ('reglemnt', 0.46448570489883423), ('reversement', 0.4572943449020386), ('rlgt', 0.4441632330417633), ('cheq', 0.4148927330970764), ('rgltm', 0.4078122675418854), ('rlg', 0.40633389353752136), ('regt', 0.4033368229866028), ('cheque', 0.40114736557006836)]

With doc2vec:

Now I want to use this ability to handle abbreviations to find similar documents by training a doc2vec model. However, now that I am working at the document level, I get very bad results.

Example:

numtest = 860
print(ctct_corpus[numtest])
inferred_vector = modeld2vec.infer_vector(ctct_corpus[numtest])
print("++++++++++++++++++++++++++++++++++++++++++++++++")
for el in modeld2vec.docvecs.most_similar(numtest, topn=2):
    print(el, ctct_corpus[el[0]])
    print()

Output:

vie du contrat informations generales garanties garanties etablissmnt hospi ai info limitation de 30 jrs pr sejour en etablissmnt reducation etablissmnt hospi ai info limitation de 30 jrs pr sejour en etablissmnt reducation

++++++++++++++++++++++++++++++++++++++++++++++++

(228450, 0.19079256057739258) vie du contrat informations generales contrat contrat co lui ai envoye par mail let XXXX pour les niveau X et X avec pack con frt niveau privilege et le detail des gties de ce ct co lui ai envoye par mail let 1022 pour les niveau 4 et 4 avec pack con frt niveau privilege et le detail des gties de ce ct

(589278, 0.19007737934589386) prestations indemnisations devis dentaire dentaire mr ai info sur conditions de rb prothese dentaire XXX y compris ss cient mecontent souhaite resilier mr ai info sur conditions de rb prothese dentaire XXX y compris ss cient mecontent souhaite resilier

Here, the tested document (number 860) refers to a 30-day limit on stays in rehabilitation institutions.

The 2 most similar documents found by the model:

1/ a document about sending the broker the details of the guarantees in a customer's contract

2/ a document about a client's dissatisfaction with dental reimbursement levels, and a threat of cancellation

What’s wrong? Where is my mistake?

Here are my parameters:

For the word2vec model:

gensim.models.word2vec.Word2Vec(size=150, window=5, min_count=5, workers=7, iter=15)

For the doc2vec model:

gensim.models.doc2vec.Doc2Vec(dm=1, size=150, window=5, min_count=5, workers=7, iter=20)

Any idea? Recommendations?

Thanks in advance for your help,

Mathieu

Gordon Mohr

Sep 20, 2017, 12:55:51 PM

to gensim

Your data – a million documents – seems sufficient, though the quality of vectors on very-short documents (under a dozen or two words) may be weak.

Your example code, reproduced here, may not be doing what you expect:

numtest = 860
print(ctct_corpus[numtest])
inferred_vector = modeld2vec.infer_vector(ctct_corpus[numtest])
print("++++++++++++++++++++++++++++++++++++++++++++++++")
for el in modeld2vec.docvecs.most_similar(numtest, topn=2):
    print(el, ctct_corpus[el[0]])
    print()

You didn't show how document tags are specified during training. Only if the document at index 860 was passed to training with the (plain integer) tag 860 will a `most_similar(860)` use that same doc's bulk-trained vector as the origin-vector for similarity-finding. Because your results show int tags, the right thing is *probably* occurring, but sometimes unexpected results are because of an off-by-one or similar mismatch in what doc-vector is being used, versus intended.
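One way to rule out the off-by-one mismatch mentioned above is to check that each document's tag equals its position in the training list. This is a minimal sketch in plain Python: the `TaggedDocument` namedtuple is a stand-in with the same `words` and `tags` fields as the gensim class, and `find_tag_mismatches` is a hypothetical helper, not part of gensim.

```python
from collections import namedtuple

# Stand-in with the same two fields as gensim's TaggedDocument (assumption:
# swap in the real class when checking an actual training corpus).
TaggedDocument = namedtuple("TaggedDocument", "words tags")


def find_tag_mismatches(corpus):
    """Return (position, tags) pairs where the tag list is not exactly [position]."""
    return [(pos, doc.tags) for pos, doc in enumerate(corpus)
            if list(doc.tags) != [pos]]


good = [TaggedDocument(["vie", "du", "contrat"], [0]),
        TaggedDocument(["devis", "dentaire"], [1])]
shifted = [TaggedDocument(["vie", "du", "contrat"], [1]),
           TaggedDocument(["devis", "dentaire"], [2])]

print(find_tag_mismatches(good))     # empty list: tags match positions
print(find_tag_mismatches(shifted))  # every document is off by one
```

An empty result only shows that tags and list positions agree; it says nothing about whether the list itself is aligned with the original data source.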

Separately, while you're inferring a vector, that vector is *not* being used for the following `most_similar()`. It's being calculated then ignored.

If you were using the inferred-vector as the origin, and not getting good results, some usual things to try would be: (1) ensure you're supplying a list-of-tokens, tokenized the same as the training data, as the `infer_vector()` argument – it doesn't take strings; (2) try non-default values for `infer_vector()` optional parameters, especially many more `steps` or a smaller `alpha` more like the training starting value.

Given that the matches each have rather low similarity-scores (0.19...), I also wonder if perhaps your training is occurring on the right units (lists-of-words) – as opposed to raw strings (lists-of-characters). If `len(modeld2vec.wv.vocab)` is very small, and `modeld2vec.wv.index2word` is all single-characters, rather than full words, this is the problem. Make sure the items passed in as the training corpus – shaped like the `TaggedDocument` example class – have a list-of-tokens as their `words` property, rather than a simple string.
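The string-versus-token-list problem described above can be illustrated without training anything. The sketch below uses a namedtuple stand-in with the same `words` and `tags` fields as gensim's `TaggedDocument`; `words_problem` is a hypothetical helper name introduced only for this example.

```python
from collections import namedtuple

# Stand-in for gensim's TaggedDocument (same field names, assumption).
TaggedDocument = namedtuple("TaggedDocument", "words tags")


def words_problem(doc):
    """Describe what is wrong with doc.words, or return None if it looks fine."""
    if isinstance(doc.words, str):
        return "words is a raw string; it will be iterated character by character"
    if doc.words and all(len(tok) == 1 for tok in doc.words):
        return "every token is a single character"
    return None


text = "etablissmnt hospi ai info"
bad = TaggedDocument(text, [0])          # raw string: each character becomes a "word"
ok = TaggedDocument(text.split(), [0])   # list of tokens: what training expects

print(list(bad.words)[:5])   # the first five "tokens" are single characters
print(words_problem(bad))
print(words_problem(ok))
```

Running such a check over a sample of the training corpus before `build_vocab()` is cheaper than discovering the problem from a tiny vocabulary afterwards.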

- Gordon

MG MG

Sep 21, 2017, 10:04:54 AM

to gensim

Hi Gordon,

Thanks for your answer.

After a series of improvements and sanity checks (especially the "words" vs. "characters" token check, which did turn out to be an error on my side, thanks for that),

I wrote the code below and trained the model.

The results are different, with better similarity scores, but there is still no strong business similarity between contact qualifications...

I also played with the steps & alpha parameters of infer_vector(), with no real success...

Sometimes it finds good similarities, but that looks more like luck to me than anything else.

I say that because the inferred vector for the same tagged sentence changes significantly each time I compute it.

It seems to me that the model I've trained is really "unstable" and doesn't produce "reproducible" results that could identify similarities with any certainty.

How could I make it "stronger"?

Could the parameters dm, dm_mean, dm_concat, dbow_words help me with this task?

Should I increase/decrease the size? The number of iterations?

Is the short length of my documents an obstacle that even doc2vec can't overcome?

Is it useful to duplicate the comment field in the preprocessing step to give it more weight?

Thanks for your help,

Mathieu

import re
import string
import unicodedata


def getsentenceCTCT(row, weights):
    [wa, wb, wc, wd, we] = weights

    cmtr = row['Cmtr']
    cmtr = cmtr.lower()
    cmtr = re.sub(r'^(ct)', "", cmtr)

    # A spurious word break (an extra space) was spotted at the 75th character
    if len(cmtr) > 75:
        cmtr = cmtr[:75] + cmtr[76:]

    # Concatenate the fields, repeating each one according to its weight
    s = (row['Mét'] + " ") * wa + (row['Obj'] + " ") * wb + (row['Mot'] + " ") * wc + (row['Ss mot'] + " ") * wd + (cmtr + " ") * we
    s = s.lower()

    # These replacements run before punctuation is stripped so that no stray 'fr'
    # token is produced; it would be confused with the abbreviation 'fr', which
    # can mean 'frais' (fees) or 'faire' (to do)
    # replace e-mail addresses with a generic word
    s = re.sub(r'\b[\w.-]+?@\w+?\.\w+?\b', 'adressemail', s)
    # replace the client's website with a generic word (dot escaped to match literally)
    s = re.sub(r'client\.fr', 'siteclientfr', s)
    # replace IBANs with a generic word
    # (as written, the match also covers everything from the start of the string up to the IBAN)
    s = re.sub(r'^(.*fr76)[0-9]+', 'numiban', s)

    # Remove punctuation
    s = re.compile('[%s]' % re.escape(string.punctuation)).sub(' ', s)

    # Normalize the sentence (strip accents, collapse repeated spaces)
    s = unicodedata.normalize('NFD', s).encode('ascii', 'ignore').decode()
    while "  " in s:
        s = s.replace("  ", " ")
    s = s.strip()
    return s

 
dfCTCTSAN['motif_CTCT'] = dfCTCTSAN.apply(lambda row: getsentenceCTCT(row, [1, 1, 1, 2, 2]), axis=1)
CTCT_corpus = dfCTCTSAN['motif_CTCT']
CTCT_corpus = CTCT_corpus.reset_index()
print("Stats for length of Documents")
print(CTCT_corpus['motif_CTCT'].map(lambda s: len(s.split())).describe())

Stats for length of Documents
count    945946.00000
mean         25.76403
std          19.99681
min           2.00000
25%           7.00000
50%          21.00000
75%          40.00000
max          98.00000

def read_corpus(corpus):
    for ix, row in corpus.iterrows():
        yield gensim.models.doc2vec.TaggedDocument(row['motif_CTCT'].split(), [row['index']])

train_corpus = list(read_corpus(CTCT_corpus))
modeld2vec = gensim.models.doc2vec.Doc2Vec(dm=1, size=150, window=5, min_count=5, workers=7, iter=20)
print(modeld2vec)

Doc2Vec(dm/m,d150,n5,w5,mc5,s0.001,t7)

modeld2vec.build_vocab(train_corpus)
print("Count of Documents: {}".format(modeld2vec.corpus_count))
print("Count of unique different words in these documents: {}".format(len(modeld2vec.wv.vocab)))

Count of Documents: 945946

Count of unique different words in these documents: 24593

%time modeld2vec.train(train_corpus, total_examples=modeld2vec.corpus_count, epochs=modeld2vec.iter)

Wall time: 29min 10s

319575569

Here are the results:

# Test on a few contacts
for numtest in [6591,78986, 564131]:
    print("=============================================================")
    print("Initial document (as string):")
    print(CTCT_corpus[CTCT_corpus['index'] == numtest].values)
    print()
    print("Document in doc2vec corpus (as tokens):")
    print(train_corpus[numtest])
    print()
    print("-------------------------------------------")
    inferred_vector = modeld2vec.infer_vector(train_corpus[numtest][0], alpha=0.01, steps=20)
    # print(inferred_vector)
    for el in modeld2vec.docvecs.most_similar([inferred_vector], topn=3):
        print("Best match index & similarity score: ", el)
        print(CTCT_corpus[CTCT_corpus['index'] == el[0]].values)
        print()

=============================================================

Initial document (as string):

[[6591

  'prestation individuel demande rembourst bipage par prestations rb fj au tiers cxXXXXX info fille de l adh rb fj au tiers cxXXXXX info fille de l adh']]

Document in doc2vec corpus (as tokens):

TaggedDocument(['prestation', 'individuel', 'demande', 'rembourst', 'bipage', 'par', 'prestations', 'rb', 'fj', 'au', 'tiers', 'cxXXXXX', 'info', 'fille', 'de', 'l', 'adh', 'rb', 'fj', 'au', 'tiers', 'cxXXXXX', 'info', 'fille', 'de', 'l', 'adh'], [6591])

-------------------------------------------

Best match index & similarity score:  (6591, 0.6939924955368042)

[[6591

  'prestation individuel demande rembourst bipage par prestations rb fj au tiers cxXXXXX info fille de l adh rb fj au tiers cxXXXXX info fille de l adh']]

Best match index & similarity score:  (249594, 0.5970174074172974)

[[249594

  'prestation individuel demande rembourst bipage par prestations rb transport rb transport']]

Best match index & similarity score:  (845638, 0.5846525430679321)

[[845638

  'gestion relance instance relance tel 1 da par adh et manda par tiers da par adh et manda par tiers']]

# Here, while the second match is fairly similar, the last one really has nothing to do with the tested sentence.

=============================================================

Initial document (as string):

[[78986

  'telephonie prestations indemnisations devis autres autres praticien ai info garantie appareil auditif 600 renouv ts les 3 ans rmbs ro inclus praticien ai info garantie appareil auditif 600 renouv ts les 3 ans rmbs ro inclus']]

Document in doc2vec corpus (as tokens):

TaggedDocument(['telephonie', 'prestations', 'indemnisations', 'devis', 'autres', 'autres', 'praticien', 'ai', 'info', 'garantie', 'appareil', 'auditif', '600', 'renouv', 'ts', 'les', '3', 'ans', 'rmbs', 'ro', 'inclus', 'praticien', 'ai', 'info', 'garantie', 'appareil', 'auditif', '600', 'renouv', 'ts', 'les', '3', 'ans', 'rmbs', 'ro', 'inclus'], [78986])

-------------------------------------------

Best match index & similarity score:  (55324, 0.8848056197166443)

[[55324

  'prestation individuel demande rembourst bipage par mdf devis chir devis chir']]

Best match index & similarity score:  (711873, 0.8842121958732605)

[[711873

  'gestion cotisations paiement demande de remboursement demande de remboursement demande de remboursement demande de remboursement ']]

Best match index & similarity score:  (505816, 0.8803945779800415)

[[505816 'gestion resiliation suspension annula resiliation acs acs']]

# Here the document doesn't find itself among the top 3 similar sentences. The last one is really different from a business point of view, even though the similarity score is high.

=============================================================

Initial document (as string):

[[564131

  'telephonie suivi d action appel suite reclamation faire appel telephonique d info cli pas joignable faire appel telephonique d info cli pas joignable']]

Document in doc2vec corpus (as tokens):

TaggedDocument(['telephonie', 'suivi', 'd', 'action', 'appel', 'suite', 'reclamation', 'faire', 'appel', 'telephonique', 'd', 'info', 'cli', 'pas', 'joignable', 'faire', 'appel', 'telephonique', 'd', 'info', 'cli', 'pas', 'joignable'], [564131])

-------------------------------------------

Best match index & similarity score:  (891814, 0.9361838102340698)

[[891814 'gestion ajout adhesion intrclient']]

Best match index & similarity score:  (681130, 0.9340693950653076)

[[681130

  'gestion credit impots valide injection cs septembre 2015 injection cs septembre 2015']]

Best match index & similarity score:  (902394, 0.9340336322784424)

[[902394

  'gestion modification affaire modification date d effet modification date d effet']]

# Here none of the matched sentences has anything to do with the tested one.

Gordon Mohr

Sep 21, 2017, 2:36:16 PM

to gensim

It may be useful to compare the `most_similar()` results using the re-inferred vector (the output you've shown) against the `most_similar()` results using the bulk-trained vector already in the model.

If the `most_similar()` results using just bulk-trained vectors are still unhelpful, then more adjustment of initial training may help, and should be the first priority. (There's no reason to be trying inference on a model that doesn't already do well in reporting similarities among its native training docs.)

You may wish to try PV-DBOW mode – `dm=0` – as it often does well with shorter documents. In PV-DBOW mode, you could also try with `dbow_words=1`, especially if you separately need word-vectors from the same training. (Toggling `dm_mean` is unlikely to help, and `dm_concat` mode is slow and generally unproven.)

You could try more iterations – but 10 or 20 is already typical in published work. A lower or higher `min_count` might help. Sometimes eliminating more low-frequency words makes the remaining vectors stronger, but perhaps your 4, 3, 2 occurrence words can also help the doc-vectors. (Single-occurrence words are usually no better than noise.) In `dm=1` mode, a smaller or larger window may help – usually larger windows emphasize domain-topicality better. (In pure PV-DBOW, `window` has no effect, but will still affect word-vector quality & training-time if `dbow_words=1`.)
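Before retraining with a different `min_count`, it can help to see how many distinct words would survive each threshold. The following is a rough sketch in plain Python, independent of gensim; `vocab_size_by_min_count` is a hypothetical helper and the four toy documents are made up from tokens that appear in this thread.

```python
from collections import Counter


def vocab_size_by_min_count(token_lists, thresholds):
    """Map each threshold to the number of distinct words occurring at least that often."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return {t: sum(1 for c in counts.values() if c >= t) for t in thresholds}


docs = [
    "rb fj au tiers".split(),
    "rb transport".split(),
    "rb prothese dentaire".split(),
    "devis dentaire".split(),
]

# 8 distinct words overall; only 'rb' and 'dentaire' occur twice or more.
print(vocab_size_by_min_count(docs, [1, 2, 3]))  # {1: 8, 2: 2, 3: 1}
```

On the real corpus, the thresholds of interest would be values such as 2, 5, 10 and 20, and the surviving counts give a feel for how much vocabulary each setting discards.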

It's not surprising to see a very-short (4-token) document intrude in results unexpectedly, or get poor most-similar results. Doc-vectors for short documents are getting the least training during bulk-training, and (for a constant choice of `steps`) get less inference-adjustment during inference. It *might* help to expand very short documents by concatenating them with themselves, until they reach some minimum token-count. (5? 10? 15? – would need to be experimentally tested.)
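The self-concatenation idea above could be sketched as follows. `repeat_to_min_length` is a hypothetical helper, and the minimum of 10 tokens is only an example value that would need to be tuned experimentally, as noted.

```python
def repeat_to_min_length(tokens, min_tokens):
    """Repeat the whole token list until it holds at least min_tokens tokens."""
    if not tokens or len(tokens) >= min_tokens:
        return list(tokens)
    repeats = -(-min_tokens // len(tokens))  # ceiling division
    return list(tokens) * repeats


short = ["gestion", "ajout", "adhesion", "intrclient"]
padded = repeat_to_min_length(short, 10)

print(len(padded))  # 12: the 4-token document is repeated 3 times
print(padded)
```

Repeating the whole document keeps word order intact inside each copy, at the cost of overshooting the minimum slightly.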

It would also make sense to enable INFO logging and watch the output for hints something may not be working as intended.
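INFO logging can be switched on with the standard-library `logging` module, which gensim uses for its progress messages. A minimal setup, run before building the vocabulary and training, might look like this (the format string is just one common choice):

```python
import logging

# Standard-library logging setup. force=True (Python 3.8+) replaces any
# handlers that were already installed, e.g. by a notebook environment.
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s",
    level=logging.INFO,
    force=True,
)
```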

Once the similarities using just the bulk-trained vectors seem meaningful, then you can focus on whether inference is working. In general we'd hope that re-inferring a vector for the same tokens as were used in training should result in a vector close to that bulk-trained vector – that the same document ID appears in the top or top-few `most_similar()` results for an inferred vector. Until inference is giving that, more experimentation with its parameters may be warranted. In particular, `steps` of even 100 or more, and `alpha` of 0.05 or 0.025. (In the code you've shared, any benefit from upping the `steps` by 4x from its default to 20 may be overwhelmed by having reduced starting `alpha` by 10x from its default to 0.01.)
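The self-similarity check described above can be expressed as "what rank does a document get when the trained vectors are sorted by similarity to its own re-inferred vector?". The sketch below uses plain Python and toy 3-dimensional vectors as stand-ins; in practice `trained` would hold the model's bulk-trained doc-vectors and `inferred` the output of `infer_vector()`. The helper names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def self_rank(doc_id, inferred, trained):
    """0-based rank of doc_id when trained vectors are sorted by similarity to inferred."""
    ranked = sorted(trained, key=lambda tag: cosine(inferred, trained[tag]), reverse=True)
    return ranked.index(doc_id)


# Toy stand-ins (made-up numbers, not real model output).
trained = {
    6591: [0.9, 0.1, 0.0],
    249594: [0.5, 0.5, 0.1],
    845638: [0.0, 0.2, 0.9],
}
inferred = [0.8, 0.2, 0.05]

print(self_rank(6591, inferred, trained))  # 0 means the document found itself first
```

Averaging this rank, or counting how often it is 0, over a random sample of training documents gives a single number to compare across `steps` and `alpha` settings.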

If none of these help, it's possible the corpus and goals don't work well with the algorithm. A million documents with ~25 million words seems in line with other published work, but maybe there's not the same variety of word-contexts that help power the algorithm in other projects.

- Gordon

MG MG

Sep 22, 2017, 6:17:49 AM

to gensim

Hi Gordon,

A big thank you for all the areas for improvement you gave me in your answer. You rock!

I will try all of them and get back to you ASAP.

Thanks

Mathieu

MG MG

Sep 27, 2017, 5:19:38 AM

to gensim

Hi Gordon,

As the training time was 25 min (+/- 15 min), I decided to automatically generate a batch of models (approx. 30) with different combinations of parameters and pick the best one.

The best set of parameters I found, judging from a business point of view by eyeballing some manual tests, was indeed:

dm          0
size        200
window      9
min_count   20
iter        20
dbow_words  1
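A parameter sweep like the one described above could be generated along these lines. This is only a sketch: the keys mirror the Doc2Vec keyword arguments used in this thread, but the candidate values and the `parameter_sets` helper are illustrative assumptions, not the combinations that were actually tried.

```python
from itertools import product

# Hypothetical search space (illustrative values only).
grid = {
    "dm": [0, 1],
    "size": [150, 200],
    "window": [5, 9],
    "min_count": [5, 20],
}
fixed = {"iter": 20, "workers": 7}


def parameter_sets(grid, fixed):
    """Yield one keyword-argument dict per combination of grid values."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values), **fixed)
        # dbow_words is only enabled in PV-DBOW mode (dm=0)
        params["dbow_words"] = 1 if params["dm"] == 0 else 0
        yield params


configs = list(parameter_sets(grid, fixed))
print(len(configs))  # 16 combinations
print(configs[0])
```

Each dict could then be passed as keyword arguments to the model constructor, with the trained model saved under a name derived from the parameters for later manual comparison.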

Now the results are good enough to perform clustering on the data. Thanks a lot for your help.

Just one last question concerning this next step:

I see many threads that recommend using cosine distance or, "equivalently", normalized Euclidean distance for clustering after LSI, for example.

The main argument is that we care about similarity of meaning (the angle between vectors), regardless of whether their magnitudes (vector lengths) are very different.

With doc2vec vectors, do you think I should apply the same logic to my task of grouping similar contact qualifications?

Actually, I wonder what a doc2vec feature vector really represents. Should I consider it the same thing as for LSI, i.e. a kind of concept?
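On the "equivalently" in the question above: for vectors scaled to unit length, squared Euclidean distance and cosine similarity are tied by the identity ||a - b||^2 = 2 * (1 - cos(a, b)), so both rank neighbours in the same order. The sketch below checks this numerically on two made-up vectors of different magnitude. It does not settle whether normalization is the right choice for doc2vec vectors, although gensim's `most_similar()` itself ranks by cosine similarity.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(u, v):
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))


def squared_euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


a = [3.0, 4.0, 0.0]
b = [1.0, 0.0, 2.0]  # deliberately a different magnitude from a

cos = cosine_similarity(a, b)
dist2 = squared_euclidean(normalize(a), normalize(b))

# The two printed distances agree up to floating-point error.
print(dist2, 2 * (1 - cos))
```

In practice this means a Euclidean-only clustering algorithm such as k-means can be run on unit-normalized doc-vectors to approximate clustering by angle.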