Hello all,
I'm trying to cluster a gensim corpus using DBSCAN (as implemented in sklearn), but the number of labels I get back equals the number of terms in the corpus rather than the number of documents, which is odd.
Here's what I'm doing (on a test subsample of the entire corpus):
>>> from gensim import matutils
>>> from sklearn.cluster import DBSCAN
>>> corpus.num_docs
16563
>>> corpus.num_terms
18880
>>> corpus_csc = matutils.corpus2csc(corpus)
>>> corpus_csc
<18880x16563 sparse matrix of type '<class 'numpy.float64'>'
with 1922716 stored elements in Compressed Sparse Column format>
>>> clustering = DBSCAN(eps=3, min_samples=2).fit(corpus_csc)
>>> len(set(clustering.labels_))
206
>>> len(clustering.labels_)
18880
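One thing the shapes above seem to point at: as far as I understand, corpus2csc puts terms in rows and documents in columns, while scikit-learn estimators treat each row as a sample. If that is right, transposing before fitting should give one label per document. Here is a minimal sketch of that idea; the toy matrix is made up for illustration and only stands in for the corpus2csc output, so gensim itself is not needed to run it:

```python
import numpy as np
from scipy.sparse import csc_matrix
from sklearn.cluster import DBSCAN

# Hypothetical stand-in for matutils.corpus2csc(corpus):
# 5 terms (rows) x 3 documents (columns).
terms_by_docs = csc_matrix(np.array([
    [1.0, 1.0, 0.0],
    [2.0, 2.0, 0.0],
    [0.0, 0.0, 5.0],
    [0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
]))

# scikit-learn expects (n_samples, n_features), so transpose
# to documents x terms before clustering.
docs_by_terms = terms_by_docs.T.tocsr()

clustering = DBSCAN(eps=3, min_samples=2).fit(docs_by_terms)

print(docs_by_terms.shape)        # (n_documents, n_terms)
print(len(clustering.labels_))    # one label per document
```

On the real data this would presumably amount to fitting on corpus_csc.T instead of corpus_csc, though I have not confirmed that the eps and min_samples values still make sense once the documents are the samples.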
I've checked the list archives and I believe I'm on the right track, but I'm not sure whether I'm using corpus2csc correctly. Any suggestions would be appreciated!
Thanks in advance,
Cheers