Issues with using Doc2Vec to get vector representation of documents


Jyothish Vidyadharan

Dec 29, 2016, 12:11:57 PM
to gensim
Hello,

I have a set of 1000 documents and am trying to use Doc2Vec to convert them to vectors. I am a complete newbie to Word2Vec and Doc2Vec. The program I have written is based on this tutorial: https://medium.com/@klintcho/doc2vec-tutorial-using-gensim-ab3ac03d3a1#.th1f2mxns. My code is given below:

import json

def TicketIterator(fileName="Data/text.json"):
    with open(fileName) as f:
        data = json.load(f)

    tickets = data['tickets']

    ticket_data = []

    for ticket in tickets:
        desc = ticket['description']
        ##print desc
        ticket_data.append(desc)

    return ticket_data

import json

def labels(fileName="Data/text.json"):
    with open(fileName) as f:
        data = json.load(f)

    labels = ["ticket%04d"%(i+1) for i in xrange(data['count'])]

    return labels

from gensim.models.doc2vec import LabeledSentence

class DocIterator(object):
    def __init__(self, doc_iter, labels_list):
        print len(doc_iter)
        self.doc_iter = doc_iter
        self.labels_list = labels_list

    def __iter__(self):
        for (idx, doc) in enumerate(self.doc_iter):
            yield LabeledSentence(words=doc.split(),\
                tags=self.labels_list[idx])

import gensim

from TicketIterator import TicketIterator

from labels import labels

from DocIterator import DocIterator

def similarity():
    docs = TicketIterator()
    docLabels = labels()

    ##print docLabels

    corpora = DocIterator(docs, docLabels)

    model = gensim.models.Doc2Vec(size=300, window=10,
     min_count=5, workers=2, alpha=0.025, min_alpha=0.025)

    model.build_vocab(corpora)

    for epoch in xrange(10):
        model.train(corpora)
        model.alpha -= 0.002
        model.min_alpha = model.alpha
        model.train(corpora)

    model.save("DocVectors.model")

    print model['ticket0000']


if __name__=="__main__":
    similarity()

Now, the issue I am having is that I was expecting to get 1000 vectors, one per document, and I was also expecting to be able to access each of those vectors using the command
print model.docvecs["labelNameHere"]

Instead, the model seems to have just 15 vectors, as demonstrated by the following command and its output:

>>> len(model.docvecs)
15

Also, I am not able to access these vectors by label (tag) name, though I can access individual vectors by numeric index.

My question is: what are these 15 vectors? Are they vectors representing 15 groups of similar documents? How can I get vector representations of all 1000 documents, as I originally set out to do, and how can I access them?

Gordon Mohr

Dec 29, 2016, 3:20:41 PM
to gensim
You can see the tags the model discovered/trained in `model.docvecs.offset2doctag`. 

Because a text-example (sentence/paragraph/document) can have multiple tags, the `tags` part of an example should be a list-of-tags. If you supply a single plain string, it looks like a list-of-single-characters - and thus your first example, rather than having the string tag 'ticket0000', will have the character tags 't', 'i', 'c', 'k', 'e', 't', '0', '0', '0', '0'. (Your 15 unique 'tags' are thus likely: 't', 'i', 'c', 'k', 'e', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9'.) Even if an example has just one tag, it should be in a 1-element list: ['ticket0000']

If you're using the latest version of gensim, `TaggedDocument` is the preferred class name (rather than `LabeledSentence`), and there should also be a logged warning when all your tags are single characters – because this is a common mistake. (Do you have logging enabled and see any log output?)

When your examples have the right kind of tags, you'll be able to access doctag vectors with `model.docvecs['ticket0000']` as you expected.

A few other notes about your setup: 

(1) You don't need to manage the epochs/alpha yourself - just supply an `iter=10` parameter to Doc2Vec, and one call to `train()` will do 10 passes, with `alpha` falling linearly from its initial value to whatever `min_alpha` you've set.

(2) Because the default `iter` is 5, each call to `train()` does 5 passes - and since your loop actually calls `train()` twice per iteration, that's 20 calls and 100 passes total. You'd really only want to call `train()` multiple times yourself if you were (a) doing something fancier with `alpha`, or (b) doing something extra, like printing progress or mid-training evaluation info, after each pass.

(3) Note that 1000 documents is a very small corpus, and Word2Vec/Doc2Vec generally needs many more examples to give sensible results. Using fewer dimensions or more iterations may help make small-corpus results a bit more stable/generalizable, but really you'll want a larger dataset if at all possible. 

- Gordon

Jyothish Vidyadharan

Jan 1, 2017, 12:17:09 PM
to gensim
Hello,

   I have a few more questions. Does the resulting model group similar documents together, or will I have to do that myself? If the model already has vectors for groups of documents, how can I access them? If I have to group them manually, is the Euclidean distance between vector tips a good criterion for deciding whether two documents are closely related?

Gordon Mohr

Jan 1, 2017, 3:38:35 PM
to gensim
There's no automatic grouping/clustering. 

Usually cosine-similarity between unit-normed vectors is used for comparisons, though once all vectors are unit-normed, the ranked lists of nearest-neighbors will be the same by either (lowest) euclidean distance or (highest) cosine-similarity. To what extent that value usefully indicates document-relatedness will depend on how well Doc2Vec works with your corpus & training-parameters.
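The equivalence is easy to check with plain numpy: for unit-normed vectors, squared euclidean distance is exactly `2 - 2 * cosine_similarity`, so the two measures always produce the same nearest-neighbor ranking. (The vectors below are made-up examples.)

```python
import numpy as np

def unit(v):
    # scale a vector to unit length
    return v / np.linalg.norm(v)

a = unit(np.array([1.0, 2.0, 3.0]))
b = unit(np.array([3.0, 2.0, 1.0]))

cos_sim = float(np.dot(a, b))        # cosine similarity of unit vectors
euc = float(np.linalg.norm(a - b))   # euclidean distance between them

# identity linking the two measures for unit vectors
assert abs(euc ** 2 - (2 - 2 * cos_sim)) < 1e-12
```

Within gensim, `model.docvecs.most_similar('someTag')` does this cosine-similarity ranking over the trained doc-vectors for you.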

- Gordon