Corpus from Apache Solr

Pedro Vitor Quinta de Castro

Jul 21, 2017, 2:35:53 PM
to gensim
Hi there!

I'm planning on creating a corpus from documents stored in an instance of an Apache Solr server. I'm going to read all documents stored in Solr and create models for LDA Topic Modeling and doc2vec similarity.

What's the best way to approach this, considering I can't load all retrieved documents in memory? I was thinking about using SolrClient to retrieve the documents, but can't load all of them at once due to memory issues.

Thanks!

Ivan Menshikh

Jul 25, 2017, 4:25:42 AM
to gensim
Hi Pedro,

Most gensim models can take a "generator" (i.e. a stream) as input. So you can create a stream from your Solr instance and feed it to LDA, for example.
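
For illustration, here is a minimal sketch of what such a stream might look like. `FAKE_INDEX` and `fetch_page` are hypothetical stand-ins for a paged Solr query (they are not part of gensim or of any Solr client), and the whitespace tokenizer is only a placeholder. The sketch assumes the index does not change while it is being read.

```python
from typing import Callable, Iterator, List

# Hypothetical stand-in for the Solr index; a real version would query the server.
FAKE_INDEX = [
    {"id": "1", "text": "solr stores documents"},
    {"id": "2", "text": "gensim streams documents"},
    {"id": "3", "text": "topic models need a corpus"},
]


def fetch_page(start: int, rows: int) -> List[dict]:
    """Hypothetical paged query: up to `rows` documents starting at offset `start`."""
    return FAKE_INDEX[start:start + rows]


class PagedCorpus:
    """Restartable iterable: every pass re-reads the index one page at a time."""

    def __init__(self, fetch: Callable[[int, int], List[dict]], rows: int = 2):
        self.fetch = fetch
        self.rows = rows

    def __iter__(self) -> Iterator[List[str]]:
        start = 0
        while True:
            page = self.fetch(start, self.rows)
            if not page:
                break
            for doc in page:
                # Naive whitespace tokenization, for illustration only.
                yield doc["text"].lower().split()
            start += len(page)


corpus = PagedCorpus(fetch_page)
first_pass = list(corpus)
second_pass = list(corpus)  # a second pass works, unlike with a plain generator
```

Because `__iter__` starts a fresh pass each time, an object like this can be handed to code that iterates over the data more than once, e.g. `gensim.corpora.Dictionary(corpus)` followed by another pass to build the bag-of-words vectors. Only one page of documents is held in memory at a time, at the cost of re-querying on every pass.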

Pedro Vitor Quinta de Castro

Jul 25, 2017, 8:08:57 AM
to gensim
Hi Ivan, thanks for the response!

Do you mean something like this?

class SolrCorpus(object):
    def __iter__(self):
        for doc in solr.query('default', query_def).docs:
            yield doc

I was wondering whether something like this would avoid keeping the entire result set in memory while still performing the query only once.

Also, I noticed that for measuring coherence for LDA models I need to pass my texts as parameters for CoherenceModel. Wouldn't this mean that I'd need to keep all processed texts that were used to create my corpus and dictionary in memory, to pass to CoherenceModel?

Thanks!

Radim Řehůřek

Jul 26, 2017, 11:05:06 PM
to gensim, Ivan Menshikh
That's a great question. I don't think CoherenceModel supports streaming, which should be made very explicit in its docs (as streaming is gensim's core value proposition, that's what people expect).

@Ivan, is there any way to make CoherenceModel streamed? Or does it inherently need random access?

Cheers,
Radim

Ivan Menshikh

Jul 31, 2017, 10:08:17 AM
to gensim, iv...@rare-technologies.com
No, as far as I know CoherenceModel doesn't support streaming.

Pedro, yes, you understood me correctly. For CoherenceModel you can use the `u_mass` mode (where `texts` isn't needed), or pass only a subset of the texts (as much as you can fit in memory).
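
One possible way to pick that subset without loading everything first is reservoir sampling (Algorithm R), which keeps a uniform random sample of fixed size from a stream of unknown length. This is only a sketch of that swapped-in technique, not something gensim provides; `reservoir_sample` and `token_stream` are hypothetical names, and the stream below is dummy data standing in for the tokenized Solr documents.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Keep a uniform random sample of at most k items from a stream (Algorithm R)."""
    rng = random.Random(seed)
    sample: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


def token_stream():
    """Hypothetical stand-in for a stream of tokenized documents."""
    for n in range(1000):
        yield ["doc", str(n)]


# At most 50 tokenized documents are ever held in memory.
texts_sample = reservoir_sample(token_stream(), k=50)
```

The resulting list could then be passed as the `texts` argument of CoherenceModel. Keep in mind that the coherence score is then estimated from the sample only, so it may differ from the value computed on the full collection.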

Pedro Vitor Quinta de Castro

Jul 31, 2017, 10:12:09 AM
to gensim, iv...@rare-technologies.com
OK, Thanks Ivan and Radim!