Corpus from Apache Solr

Pedro Vitor Quinta de Castro

Jul 21, 2017, 2:35:53 PM

to gensim

Hi there!

I'm planning on creating a corpus from documents stored in an instance of an Apache Solr server. I'm going to read all documents stored in Solr and create models for LDA Topic Modeling and doc2vec similarity.

What's the best way to approach this, considering I can't load all retrieved documents in memory? I was thinking about using SolrClient to retrieve the documents, but can't load all of them at once due to memory issues.

Thanks!

Ivan Menshikh

Jul 25, 2017, 4:25:42 AM

to gensim

Hi Pedro,

Most gensim models can take a "generator" (aka stream) as input. So you can create a "stream" from your Solr instance and use it for LDA, for example.

Pedro Vitor Quinta de Castro

Jul 25, 2017, 8:08:57 AM

to gensim

Hi Ivan, thanks for the response!

Do you mean something like this?

class SolrCorpus(object):
    def __iter__(self):
        for doc in solr.query('default',query_def).docs:
            yield doc

I was wondering whether something like this would avoid keeping the entire result set in memory while still performing the query only once.
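A minimal sketch of a paged variant of the class above, for comparison. It assumes the documents can be fetched in slices; `fetch_page(start, rows)` is a hypothetical callable standing in for whatever paged Solr query is available, and `PagedCorpus`, `FAKE_INDEX` and `fake_fetch_page` are illustrative names, not part of gensim or SolrClient. The in-memory list only exists so the sketch runs without a server.

```python
class PagedCorpus(object):
    """Restartable stream: every __iter__ call re-reads the source page by page."""

    def __init__(self, fetch_page, rows=100):
        # fetch_page is a hypothetical callable: (start, rows) -> list of docs.
        self.fetch_page = fetch_page
        self.rows = rows

    def __iter__(self):
        start = 0
        while True:
            page = self.fetch_page(start, self.rows)
            if not page:
                break  # an empty page means the result set is exhausted
            for doc in page:
                yield doc
            start += len(page)


# Stand-in for the Solr index (assumption, for illustration only).
FAKE_INDEX = ["solr stores documents", "gensim streams corpora", "lda finds topics"]


def fake_fetch_page(start, rows):
    """Return one slice of the fake index, like a paged query would."""
    return FAKE_INDEX[start:start + rows]


corpus = PagedCorpus(fake_fetch_page, rows=2)
tokenized = [doc.split() for doc in corpus]
```

With this shape only one page is held in memory at a time, but the source is queried once per page, and again on every new pass over the corpus. That matters if the model iterates over its input more than once, so "query only once" and "never hold everything in memory" may not both be achievable.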

Also, I noticed that for measuring coherence for LDA models I need to pass my texts as parameters for CoherenceModel. Wouldn't this mean that I'd need to keep all processed texts that were used to create my corpus and dictionary in memory, to pass to CoherenceModel?

Thanks!

Radim Řehůřek

Jul 26, 2017, 11:05:06 PM

to gensim, Ivan Menshikh

That's a great question. I don't think CoherenceModel supports streaming, which should be made very explicit in its docs (as streaming is gensim's core value proposition, that's what people expect).

@Ivan, any way to make CoherenceModel streamed? Or does it inherently need the random access?

Cheers,

Radim

Ivan Menshikh

Jul 31, 2017, 10:08:17 AM

to gensim, iv...@rare-technologies.com

No, as far as I know CoherenceModel doesn't support streaming.

Pedro, yes, you understood me correctly. For CoherenceModel you can use the `u_mass` mode (texts aren't needed), or pass only a part of the texts (as much as fits into your memory).
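How to pick "a part of the texts" is not specified above. One possible approach, sketched here as an assumption rather than as a recommendation from this thread, is reservoir sampling: it keeps a uniform random sample of at most `k` tokenized texts from a stream of unknown length in a single pass. `reservoir_sample` and `token_stream` are hypothetical names used only for illustration.

```python
import random


def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of at most k items from a stream of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


def token_stream():
    """Stand-in for tokenized documents streamed from Solr (assumption)."""
    for n in range(1000):
        yield ["token%d" % n, "solr", "gensim"]


texts_sample = reservoir_sample(token_stream(), k=50)
```

The resulting list stays at `k` texts no matter how long the stream is, so it could be passed as the `texts` argument of CoherenceModel. Whether coherence measured on such a sample is representative of the full collection is a separate question that this sketch does not answer.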

Pedro Vitor Quinta de Castro

Jul 31, 2017, 10:12:09 AM

to gensim, iv...@rare-technologies.com

OK, Thanks Ivan and Radim!
