doc2vec vectors per doc


Peter Krejzl

Oct 7, 2015, 3:00:32 PM
to gensim
Hi, I like gensim a lot, but I'm currently struggling with one thing.

In my test example I have just 9 sentences (the vocabulary size is 42).
I'm basically trying to find a way to generate one vector per sentence, so I can later use them in a classifier or similar.

When I simply print `model.syn0.shape`, it returns (42, 5), where 42 is the vocabulary size and 5 is the number of features.
When I try `model.docvecs['label'].shape`, it returns (1, 13, 5). So, again, it does not produce one vector.

Maybe I'm not fully understanding it, but I thought I could get 9 different vectors (one per sentence). Is that true? Can somebody clarify this for me?

Thank you very much.
Peter

Gordon Mohr

Oct 7, 2015, 9:19:40 PM
to gensim
It does not appear your model has any trained data for the tag 'label'.

For some reason I don't quite understand, index-accessing a numpy array with the index `None` gets back the whole array, inside an array of one larger dimension. When you ask for `model.docvecs['label']`, there's no integer index corresponding to the string tag 'label', so the internal array with the per-tag vectors (`model.docvecs.doctag_syn0`) is accessed with `None`, giving that large result.
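That numpy behavior is easy to reproduce on its own. A quick illustration, using a zero array with the same (13, 5) shape as the internal per-tag vectors (the values here are just placeholders):

```python
import numpy as np

# Indexing a numpy array with None (an alias for np.newaxis) returns
# the whole array wrapped in one extra leading dimension of size 1.
doctag_syn0 = np.zeros((13, 5))   # stand-in for model.docvecs.doctag_syn0

print(doctag_syn0[None].shape)    # (1, 13, 5) - the surprising shape
print(doctag_syn0[0].shape)       # (5,) - a single per-tag vector
```

So a lookup that silently resolves to the index `None` hands back the entire internal array, one dimension larger, instead of raising an error.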

If you request a tag that is actually in the trained corpus – which appears to have 13 unique document tags, rather than 9 – you should get a vector like you expect. 

(Note that more typical vector dimensionalities are in the hundreds, rather than 5, and you might have trouble achieving even toy/demo results from so little text.)

- Gordon

Peter Krejzl

Oct 10, 2015, 2:40:20 PM
to gensim
Hi Gordon, thanks a lot for the reply. Here's my code:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

tagged_docs = []
for i, line in enumerate(documents):
    tg = 'for_' + str(i)
    tagged_docs.append(TaggedDocument(words=line.lower().split(), tags=tg))

model = Doc2Vec(alpha=0.025, min_alpha=0.025, min_count=1, workers=8, size=5)
model.build_vocab(tagged_docs)

for epoch in range(10):
    model.train(tagged_docs)
    model.alpha -= 0.002
    model.min_alpha = model.alpha

So, the labels are for_0 ... for_8.
I'm not too worried about results in this example, but I have 9 documents, so I'm looking for a way to "export" 9 vectors of size 5.

print(model.syn0.shape) returns (42, 5)

and print(model.docvecs['for_1'].shape) returns (1, 13, 5).

So, should I call model.docvecs[] for each document and reshape the returned array from (13, 5) to 13*5 to get one row per doc?

Thanks for the help.

Peter 

Gordon Mohr

Oct 10, 2015, 4:33:23 PM
to gensim
The `tags` value for a TaggedDocument should be a *list* of tags. (It's typical for there to be only one, a unique document ID – but possible and sometimes useful for there to be more than one.) Since you're providing a single string as `tags`, and a string behaves like a list of characters, your first document is actually being given the tags...

['f', 'o', 'r', '_', '0']

...and overall you're actually training up a total of 13 one-character tags...

['f', 'o', 'r', '_', '0', '1', '2', '3', '4', '5', '6', '7', '8']

...instead of the 9 tags you intend. 
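You can see both effects with plain Python, no gensim required, since iterating a string yields its characters:

```python
tg = 'for_0'
print(list(tg))  # ['f', 'o', 'r', '_', '0'] - each character becomes a "tag"

# Collecting the characters across all nine strings 'for_0' ... 'for_8',
# the way the model collects unique tags during vocabulary building:
unique = []
for i in range(9):
    for ch in 'for_' + str(i):
        if ch not in unique:
            unique.append(ch)

print(unique)       # ['f', 'o', 'r', '_', '0', '1', ..., '8']
print(len(unique))  # 13 one-character tags
```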

(The value of `model.docvecs.index2doctag` will be roughly that list for your code.)

Replace your `tags = tg` with `tags = [tg]` and you should get the behavior you expect. 
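A minimal sketch of the difference, using a namedtuple stand-in for gensim's TaggedDocument (which is itself essentially a namedtuple of `words` and `tags`):

```python
from collections import namedtuple

# Stand-in for gensim.models.doc2vec.TaggedDocument
TaggedDocument = namedtuple('TaggedDocument', ['words', 'tags'])

line = "Graph minors A survey"

wrong = TaggedDocument(words=line.lower().split(), tags='for_8')    # bare string
right = TaggedDocument(words=line.lower().split(), tags=['for_8'])  # list of tags

print(list(wrong.tags))  # ['f', 'o', 'r', '_', '8'] - five character tags
print(list(right.tags))  # ['for_8'] - one whole-string tag
```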

- Gordon

Peter Krejzl

Oct 15, 2015, 4:48:22 PM
to gensim
You're right, Gordon. It works now. Thank you very much!

Peter