The alpha values your code is passing to `train()` are insane.
A negative ending alpha means that by the end of training, the model will literally be trying to make itself worse with each training example. Documents with very-similar words may still wind up near each other, having gone through the same wild ride into opposite-land, but overall model utility will likely be weak, and docs that go through a later inference process won't land anywhere similar. (Inference will, by default, use the sane alpha values you specified at initialization... but that won't be much use on a model that's been anti-trained so much.)
But also: in your code, you don't even need to call `train()`. By supplying the corpus on the line that creates the model, the initialization will do all necessary training using the supplied corpus. (You only need to call `build_vocab()` and then `train()` later if you *didn't* supply a corpus at model initialization.) So the bulk-trained vectors in your model went through one training with sensible alpha values, then an extra training with nonsense values. On the other hand, the inferred vectors are going through one training with sensible alpha values – but on a model whose internal weights were last trained in the nonsense mode and then frozen. It's quite understandable the vectors wouldn't be comparable.
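For illustration, a rough sketch of the two equivalent patterns (assuming `corpus` is an iterable of `TaggedDocument` objects; the parameter values are just placeholders):

```python
from gensim.models.doc2vec import Doc2Vec

# Pattern A: supply the corpus at initialization – vocabulary-discovery and
# all training happen right here; no separate train() call is needed.
model = Doc2Vec(documents=corpus, vector_size=100, epochs=20)

# Pattern B: create the model without a corpus, then do the steps explicitly.
model = Doc2Vec(vector_size=100, epochs=20)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
```

Do one or the other, not both – doing both just repeats training.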
I highly suggest always enabling logging at the INFO level. That would likely have made it clear that training was happening twice.
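For example, the usual Python incantation (the format string is just one common choice):

```python
import logging

logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s",
    level=logging.INFO,
)
```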
You may get better results simply by not calling `train()` at all, since you've already done training in the instance initialization.
Separately:
* 4000 docs, especially if many are just 3-4 words each, is a very, very small corpus for `Doc2Vec`. Published work uses tens-of-thousands to millions of docs, each of dozens to hundreds or thousands of words. It might never work well in such a case, but also: it may work better with a smaller model (fewer dimensions), to avoid overfitting, or with a simpler mode like `dm=0` (PV-DBOW) – see the sketch after this list.
* it's unclear what `tags` were actually attached to each document in your setup. You're accessing the doc by both the raw int 572, and by a long string like '572: 06_246.xml | Silia v Minister for Immigration & Multicultural & Indigenous Affairs [2006] FCA 246 (1 March 2006)' – but then the results indicate the primary tags may be strings-of-integers, like "572". For clarity, you should pick one canonical ID for each document and stick with it.
* it looks like there may be significant duplication in the dataset: documents 572, 630, and 650, at least, are the same 4 words. That's usually not good: it makes the effective size of the corpus – its essential variety for distinguishing documents – even smaller than the overall document count. (It could make sense to train the model on deduplicated data, but then externally assign all documents with the same words the same vector.)
* it looks like the corpus may be sorted to group similar topics together. Training works better if there's not such clumping, so at least one initial shuffle could help. (And, after a shuffle, being consistent in how documents are tagged becomes extra-important, or you might be mixing pre-shuffle and post-shuffle document positions if using plain int indexes.)
* You shouldn't really think of the similarity values as 'percentages'. They range from -1.0 to 1.0, but their effective ranges within a model can be strongly influenced by other model parameters, and they're most meaningful only when compared to other values from the same model – not as any absolute idea of "how much overlap" two docs exhibit. So if, for example, you were testing different `vector_size` dimensionalities, one model might tell you 2 quite-similar documents have a similarity of X, while another model reports similarity Y for those same 2 docs, with X and Y very different. But if in both models they're still each other's nearest neighbor, and the relative rankings compared to other docs are sensible, there's no real meaning to the difference between X and Y.
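As a rough sketch pulling several of the points above together – dedupe, one canonical string tag per doc, an initial shuffle, a smaller PV-DBOW model – where `raw_docs` is a made-up placeholder for a list of `(canonical_id, word_list)` pairs with string IDs, and `model.dv` is the gensim-4.x doc-vectors attribute:

```python
import random
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Keep one training doc per unique word-sequence, remembering which
# original IDs share that text, so duplicates can get the same vector later.
unique = {}
for doc_id, words in raw_docs:
    unique.setdefault(tuple(words), []).append(doc_id)

# One canonical string tag per unique text.
tagged = [TaggedDocument(words=list(words), tags=[ids[0]])
          for words, ids in unique.items()]

# Shuffle once so similar-topic docs aren't clumped together in training order.
random.shuffle(tagged)

# Smaller, simpler model for a tiny corpus: fewer dimensions, PV-DBOW mode.
model = Doc2Vec(documents=tagged, dm=0, vector_size=50, min_count=2, epochs=40)

# Externally give every duplicate the vector of its canonical twin.
vectors = {dup_id: model.dv[ids[0]]
           for ids in unique.values() for dup_id in ids}
```

Whether those exact parameter values help is something only experiments on your own data can settle.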
- Gordon