Doc2Vec Sentence Clustering

Boy

Apr 18, 2017, 12:35:14 PM
to gensim

Hi, I am fairly new to gensim, so hopefully one of you can help me solve this problem.


I have multiple documents, each containing multiple sentences. I want to use doc2vec to get sentence vectors and then cluster them (e.g. with k-means) using sklearn.

The idea is that similar sentences end up grouped together in several clusters. However, it is not clear to me whether I have to train a model on every single document separately and then run a clustering algorithm on the sentence vectors, or whether I could infer a sentence vector from doc2vec without training on every new sentence.
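From the docs I gather infer_vector might cover the second option, something like this (a rough sketch, assuming a model trained as below; the example sentence is made up):

# infer a vector for an unseen sentence without retraining
new_sentence = "the suspect fled in a blue sedan"
new_vector = model.infer_vector(new_sentence.split())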

Right now this is a snippet of my code:


from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument
from sklearn.cluster import KMeans
import pandas as pd

# tag each sentence so doc2vec learns one vector per sentence
sentenceLabeled = []
for sentenceID, sentence in enumerate(example_sentences):
    sentenceL = TaggedDocument(words=sentence.split(), tags=['SENT_%s' % sentenceID])
    sentenceLabeled.append(sentenceL)

model = Doc2Vec(size=300, window=10, min_count=0, workers=11,
                alpha=0.025, min_alpha=0.025)
model.build_vocab(sentenceLabeled)
for epoch in range(20):
    # gensim >= 1.0 needs total_examples and epochs passed explicitly
    model.train(sentenceLabeled, total_examples=model.corpus_count, epochs=1)
    model.alpha -= 0.002  # decrease the learning rate
    model.min_alpha = model.alpha  # fix the learning rate, no decay
textVect = model.docvecs.doctag_syn0  # one row per tagged sentence

## K-means ##
num_clusters = 3
km = KMeans(n_clusters=num_clusters)
km.fit(textVect)
clusters = km.labels_.tolist()

## Print Sentence Clusters ##
cluster_info = {'sentence': example_sentences, 'cluster' : clusters}
sentenceDF = pd.DataFrame(cluster_info, index=[clusters], columns = ['sentence','cluster'])

for num in range(num_clusters):
    print()
    print("Sentence cluster %d:" % (num + 1))
    # .loc replaces the deprecated .ix indexer
    for sentence in sentenceDF.loc[num]['sentence'].values.tolist():
        print(' %s' % sentence)
    print()


Basically, what I am doing right now is training on every labeled sentence in the document. However, I have the feeling this could be done in a simpler way.

Eventually, the sentences that contain similar words should be clustered together and printed. At this point, training every document separately does not reveal any clear logic within the clusters.

Normally, the different documents contain 20 to 30 sentences to be clustered. 


Hopefully someone can steer me in the right direction. Thanks.

Lev Konstantinovskiy

Apr 18, 2017, 6:53:54 PM
to gensim
Hi Boy,

Your code is correct in general for the purpose of clustering sentences, though the doc2vec training part should be updated for the latest version of gensim: train() now requires total_examples and epochs to be passed explicitly, and it can handle the learning-rate decay itself, so the manual alpha loop is no longer needed.
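For example, the whole training loop can collapse to something like this (a sketch, keeping your original parameters):

model = Doc2Vec(size=300, window=10, min_count=0, workers=11)
model.build_vocab(sentenceLabeled)
# let gensim run all 20 epochs and decay alpha internally
model.train(sentenceLabeled, total_examples=model.corpus_count, epochs=20)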

Is your goal to cluster documents or sentences?

Regards
Lev

Boy van Dijk

Apr 19, 2017, 1:39:10 AM
to gensim
Hi Lev,

Thanks for your response. 
My ultimate goal is to cluster sentences of various documents containing crime-related information. 
For example:
Cluster 1: Sentences regarding the getaway vehicle.
Cluster 2: Sentences regarding the victim/perpetrator.
Etc.

At this point, however, I am training on just one document (i.e. example_sentences). I think I could also use a pre-trained doc2vec model and use it to infer vectors for the sentences, to better capture their semantics, because right now the clusters do not make much sense to me.
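Something like this is what I mean (a sketch; the model file name is made up):

# hypothetical: load a doc2vec model trained elsewhere and infer
# vectors for my sentences without further training
pretrained = Doc2Vec.load('pretrained_doc2vec.model')
sentence_vectors = [pretrained.infer_vector(s.split()) for s in example_sentences]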

So hopefully you have some idea of how to do that.
Best regards,

Boy van Dijk

Apr 19, 2017, 6:58:36 AM
to gensim
Following up on the latter, another thing I am considering is using LDA for topic modelling and 'clustering' the sentences based on the inferred topics. Would this be an approach worth pursuing as well?
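Roughly what I have in mind (a sketch with gensim's LdaModel, treating each sentence as a mini-document):

from gensim import corpora, models

texts = [s.split() for s in example_sentences]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
lda = models.LdaModel(corpus, num_topics=3, id2word=dictionary, passes=20)
# 'cluster' each sentence by its most probable topic
topics = [max(lda[bow], key=lambda x: x[1])[0] for bow in corpus]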

Lev Konstantinovskiy

Apr 20, 2017, 2:58:26 AM
to gensim
A sentence is very short, which makes the task harder.
That said, if the words in the topics are very different, then even simple TF-IDF might work. LDA, LSI and NMF are also worth exploring.
Also, consider assigning an averaged word2vec vector to each document, like in this example in the ShortText package.
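The averaging idea, sketched with plain gensim and numpy rather than the ShortText code itself:

import numpy as np
from gensim.models import Word2Vec

# train (or load) a word2vec model, then average word vectors per sentence
w2v = Word2Vec([s.split() for s in example_sentences], size=100, min_count=1)

def avg_vector(sentence):
    words = [w for w in sentence.split() if w in w2v.wv]
    return np.mean([w2v.wv[w] for w in words], axis=0)

sentence_vectors = np.vstack([avg_vector(s) for s in example_sentences])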

Marco Mastrogirolamo

Mar 15, 2018, 10:04:41 AM
to gensim
Hi, why are you using K-Means with Euclidean distance instead of cosine distance?
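One workaround I know of, since sklearn's KMeans only supports Euclidean distance, is to L2-normalize the vectors first, so that Euclidean k-means roughly behaves like cosine (spherical) k-means. A sketch:

from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

textVect_norm = normalize(textVect)  # unit-length rows
km = KMeans(n_clusters=3).fit(textVect_norm)
clusters = km.labels_.tolist()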