clustering gensim corpus with sklearn dbscan

Stefano Zacchiroli

Oct 26, 2021, 4:03:17 PM
to Gensim
Hello all,

I'm trying to cluster a gensim corpus using DBSCAN (as implemented in sklearn), but the number of labels I get back equals the number of terms in the corpus rather than the number of documents, which is odd.

Here's what I'm doing (on a test subsample of the entire corpus):

>>> corpus.num_docs
16563
>>> corpus.num_terms
18880
>>> corpus_csc = matutils.corpus2csc(corpus)
>>> corpus_csc
<18880x16563 sparse matrix of type '<class 'numpy.float64'>'
with 1922716 stored elements in Compressed Sparse Column format>
>>> clustering = DBSCAN(eps=3, min_samples=2).fit(corpus_csc)
>>> len(set(clustering.labels_))
206
>>> len(clustering.labels_)
18880

I've checked the list archives and I believe I'm on the right track, but I'm not sure whether I'm using corpus2csc correctly. Any suggestions would be appreciated!

Thanks in advance,
Cheers

Stefano Zacchiroli

Oct 26, 2021, 4:46:11 PM
to Gensim
To be clear: I'm aware I can just transpose the sparse matrix produced by corpus2csc. What I'd like to understand is whether what I'm doing above is the intended way to perform DBSCAN (or other, for what it's worth) clustering on gensim corpora.

Thanks
