Bigrams and Trigrams along with Doc2Vec training

Pankaj Anand

Jan 4, 2017, 2:34:44 PM
to gensim
Hello,

I have written a recommendation system based on Doc2Vec. I have about 600K articles, and I get good document similarity results with the Doc2Vec implementation.
I train the model as shown below:

# build the model
print('Now building doc2vec model with {} CPUs'.format(str(n_cpus)))
doc_iterator = DocIterator()
model = gensim.models.Doc2Vec(
    documents=doc_iterator, workers=n_cpus, size=100, dbow_words=1)

model.save(
    os.path.join(general_settings["save_dir"],
                 general_settings["doc2vec_model"]))
model.save_word2vec_format(
    os.path.join(general_settings["save_dir"],
                 general_settings["word2vec_model"]))

The DocIterator simply reads from a Redis database where I have already stored the tagged documents:

class DocIterator(object):
    def __iter__(self):
        for docs in keystore.get_docs(size=100):
            for doc in docs:
                yield TaggedDocument(doc["words"], doc["tags"])


My token extraction code is quite simple:


@staticmethod
def get_tagged_document(text, label):
    # collapse runs of newlines into one space (str.replace does not accept a regex)
    text = re.sub(r'\n+', ' ', text).strip()

    words = re.findall(r"[\w']+|[.,!?;]", text)
    # # lowercase. perhaps lemmatize too?
    words = [word.lower() for word in words]

    # remove stop words from tokens
    stop_words = TextProcessor._en_stop
    words = [i for i in words if (i not in stop_words) and (len(i) > 1)]

    # # remove numbers
    words = [re.sub(r'[\d]', ' ', i) for i in words]
    words = ' '.join(words).split()

    tags = [label]
    return {"words": words, "tags": tags}


I also get the word similarity because I am saving the word2vec model generated as a result of my Doc2Vec training.

My challenge at the moment is that I would also like to get bigrams and trigrams (phrases) from the word2vec model. Currently the tokens in the tagged documents are all single words.

I think I have to use the Phrases class to generate the multi-word phrases, but I am at a loss as to how to combine that with Doc2Vec. Or should I train a word2vec model separately? The reason I want to do this is to facilitate searches and show related concepts to the user.

My other question is how to find the similarity between phrases and documents. For example, when a user searches for Artificial Intelligence, I want to return semantically related articles and at the same time show related concepts like AI, Neural Networks, etc.

Thanks a lot.

Lev Konstantinovskiy

Jan 5, 2017, 9:09:53 AM
to gensim
Hi Pankaj,

1) How to use the phraser. You can add the line words = trigram[bigram[words]] to get_tagged_document. Before that, you need to train Phrases on your training corpus. Tokens like 'new_york_times' will then appear as tokens in your Doc2Vec model.

2) How to find related documents. The trigrams will be just like any other token in the model.
Transform the search query using Phrases and then look for the tags closest to its tokens.

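# search_query is a list of tokens
search_query = trigram[bigram[search_query]]
# docvecs.most_similar expects document tags or vectors, so infer a vector for the query first
doc2vec_model.docvecs.most_similar([doc2vec_model.infer_vector(search_query)])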
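
To make the two-step transformation concrete, here is a minimal pure-Python sketch of what applying a bigram model and then a trigram model does to a token list. It is not gensim's implementation: merge_phrases, to_phrase_tokens and the two phrase tables are hypothetical names invented for illustration, and a trained Phrases model would derive its phrases from corpus statistics instead of a hand-written set.

```python
def merge_phrases(tokens, phrases, delimiter="_"):
    """Join adjacent token pairs listed in `phrases` into single tokens."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            merged.append(tokens[i] + delimiter + tokens[i + 1])
            i += 2  # skip both words of the detected pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical phrase tables standing in for trained bigram/trigram models.
bigram_phrases = {("new", "york"), ("artificial", "intelligence")}
trigram_phrases = {("new_york", "times")}


def to_phrase_tokens(tokens):
    """Second pass sees the bigram tokens, so it can extend them to trigrams."""
    return merge_phrases(merge_phrases(tokens, bigram_phrases), trigram_phrases)


query_tokens = to_phrase_tokens(
    ["new", "york", "times", "artificial", "intelligence"])
print(query_tokens)  # ['new_york_times', 'artificial_intelligence']
```

The same function has to be applied to both the training documents and the search query, otherwise the query tokens will not match the tokens the model was trained on.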

Please correct me if I misunderstood your question,
Lev

Pankaj Anand

Jan 6, 2017, 3:16:42 AM
to gensim
You understood the question correctly, and thanks for your help.
I have a follow-up question though. Can Phrases be trained while I am reading the sentences, by using add_vocab, or do I have to train it by iterating over the entire dataset first and then iterate again?

Can you point me to any sample code that does that?

Lev Konstantinovskiy

Jan 6, 2017, 5:46:21 PM
to gensim
Hi Pankaj,

The phraser needs to be trained first. It does add an extra corpus iteration to the process.
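
As a rough illustration of why the extra iteration is needed, the sketch below learns bigrams in one full pass and applies them in a second pass. It assumes a restartable corpus object (like the DocIterator above, whose __iter__ starts over each time) and uses a plain count threshold instead of gensim's scoring. ListCorpus, learn_bigrams and apply_bigrams are hypothetical names, not gensim APIs.

```python
from collections import Counter


class ListCorpus(object):
    """Hypothetical stand-in for DocIterator: each __iter__ call restarts the stream."""

    def __init__(self, docs):
        self.docs = docs

    def __iter__(self):
        for words in self.docs:
            yield words


def learn_bigrams(corpus, min_count=2):
    """Pass 1: count adjacent pairs, keep those seen at least min_count times."""
    pair_counts = Counter()
    for words in corpus:
        pair_counts.update(zip(words, words[1:]))
    return {pair for pair, n in pair_counts.items() if n >= min_count}


def apply_bigrams(corpus, bigrams):
    """Pass 2: read the corpus again and join the learned pairs."""
    for words in corpus:
        out, i = [], 0
        while i < len(words):
            if i + 1 < len(words) and (words[i], words[i + 1]) in bigrams:
                out.append(words[i] + "_" + words[i + 1])
                i += 2
            else:
                out.append(words[i])
                i += 1
        yield out


corpus = ListCorpus([
    ["neural", "networks", "learn"],
    ["deep", "neural", "networks"],
])
bigrams = learn_bigrams(corpus)                      # first full iteration
phrased_docs = list(apply_bigrams(corpus, bigrams))  # second full iteration
print(phrased_docs)
# [['neural_networks', 'learn'], ['deep', 'neural_networks']]
```

The counts for a pair are only complete once the whole corpus has been seen, which is why the first document cannot be phrased reliably before the first pass has finished. A plain generator would be exhausted after the first pass, so the corpus needs to be an object that can be iterated more than once.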

By the way, here is some related word2vec code that detects compound words (a noun followed by a noun) and concatenates them. Just thought you might find it useful.

Regards
Lev

Pankaj Anand

Jan 6, 2017, 5:47:11 PM
to gen...@googlegroups.com
Thank you so much.

--


You received this message because you are subscribed to a topic in the Google Groups "gensim" group.


To unsubscribe from this topic, visit https://groups.google.com/d/topic/gensim/VmCai9ciUz0/unsubscribe.


To unsubscribe from this group and all its topics, send an email to gensim+un...@googlegroups.com.


For more options, visit https://groups.google.com/d/optout.




Pankaj Anand

Jan 6, 2017, 5:51:45 PM
to gen...@googlegroups.com
Quick question.

How do I save the Phrases model and reload it? I will have to save this model to disk in the first iteration. I know export_phrases can be used to generate a TSV-like format, but how do I load that into the Phraser object?



Pankaj Anand

Jan 12, 2017, 8:14:06 PM
to gensim
Hi Lev, building the Phrases model is really slow. I have about 600K documents, and running Doc2Vec takes almost no time compared to building phrases from the sentences. I think it is because Phrases is single-threaded. Is there a faster or multi-threaded implementation that I am missing? I saw a pull request from last year suggesting it would be done. Was that work ever finished or merged into the master branch?

Pankaj

Lev Konstantinovskiy

Jan 12, 2017, 9:47:02 PM
to gensim
Hi Pankaj,

You can try Phraser. It is somewhat faster than Phrases, though its main advantage is memory efficiency.
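
The memory saving comes from the fact that a Phraser is built from an already trained Phrases model and keeps only the phrases that passed the threshold, not the counts for every word and word pair. Because it is built after training, it speeds up applying phrases rather than the initial counting pass. The sketch below shows the idea in plain Python with a simple count threshold; count_vocab and freeze are hypothetical names, not gensim functions.

```python
from collections import Counter


def count_vocab(sentences):
    """Full model: counts for every single word and every adjacent pair."""
    vocab = Counter()
    for words in sentences:
        vocab.update(words)
        vocab.update(zip(words, words[1:]))
    return vocab


def freeze(vocab, min_count=2):
    """Frozen model: keep only the pairs that pass the threshold, drop all counts."""
    return frozenset(
        key for key, n in vocab.items()
        if isinstance(key, tuple) and n >= min_count)


sentences = [
    ["machine", "learning", "rocks"],
    ["machine", "learning", "works"],
]
vocab = count_vocab(sentences)
frozen = freeze(vocab)
print(len(vocab), len(frozen))  # 7 1
```

On a real corpus the full vocabulary can hold millions of entries while the frozen set holds only the accepted phrases, which is where the smaller footprint comes from.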

In terms of performance speed-up, are you referring to this PR?

Also, you can save and load a Phraser with its .save and .load methods.

Regards
Lev

Pankaj Anand

Jan 12, 2017, 10:10:49 PM
to gen...@googlegroups.com
I will give Phraser a try.
Yes, I was referring to that PR.

Thanks for the save/load tip. I was using the utils SaveLoad class for that.


