Gensim Doc2Vec for large corpus (py4j connection refused error)


sandeep chitta

Mar 28, 2017, 11:53:38 AM
to gensim
Hi,

I am using a very large corpus (~200K documents) with gensim Doc2Vec. When I run the .collect() command, I get a 'java server connection error' (error screenshot attached). In the official documentation, for the 'documents' parameter they suggest using an iterable that streams the data (to quote: "but for larger corpora, consider an iterable that streams the documents directly from disk/network"). Can you show me how to use an iterable to stream the data?

Any suggestions would also be appreciated!

Thanks,
Sandeep
java server error.PNG

Lev Konstantinovskiy

Mar 28, 2017, 1:11:05 PM
to gensim
Hi,

Glad you are using doc2vec on Spark. Could you please post more code? How did you define documents1? 

sandeep chitta

Mar 28, 2017, 3:07:21 PM
to gen...@googlegroups.com
Hello,

Thanks for replying! I am very eager to make gensim Doc2Vec work for this dataset!

Below is the relevant code:

from gensim.models.doc2vec import TaggedDocument

# rename the columns: 'label' becomes the tag, 'patentbody' the words
df_final = patentsDF.selectExpr("label as tags", "patentbody as words")
# one TaggedDocument per row: whitespace-split words, quote-stripped label as the single tag
documents1 = df_final.rdd.map(lambda x: TaggedDocument(x[1].split(), [x[0].replace('"', '')]))

Datatypes: 'df_final' is a Spark DataFrame, and 'documents1' is a pyspark.rdd.PipelinedRDD.

Please find the screenshots attached for the structure of variables 'df_final' and 'documents1'.

Looking forward to your response!


--
Sandeep Chitta
Masters in Business analytics
University of Minnesota
Minneapolis, MN 55414, USA
documents1.PNG
df_final.PNG

Gordon Mohr

Mar 28, 2017, 5:38:03 PM
to gensim
As the error is being generated from the PySpark `collect()` method, the problem seems entirely independent of gensim code.

That said, a 'connection refused' error like this typically indicates the targeted server (localhost:43963) isn't actually listening/running. Are you sure the Spark cluster and local server have been started, and are still running? Do the other commands that previously ran successfully, like `take(1)`, still work if tried again?
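
For example, a quick re-check against the same RDD from the earlier post (purely diagnostic):

# if the py4j gateway is still alive, this should again return a one-element list
documents1.take(1)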

Some other observations:

* 200,000 isn't that large a corpus; there may not be any benefit to streaming it from Spark, unless it's simply that the Spark cluster is the convenient or usual place where this data lives. Still, you may just want to stream the docs to a local flat-file for Doc2Vec training (see the first sketch below) - after that finishes once, further local Doc2Vec experiments won't be reliant on the cluster/servers remaining up.

* note that if supplied text examples have more than 10,000 words, gensim Doc2Vec only trains on the first 10,000. If you have larger docs and need to simulate their effect on the resulting doc-vectors, you'd want to create individual TaggedDocument instances with 10,000 or fewer words each, but with the same ID in `tags` (see the second sketch below).

* it looks like the body-text has already been preprocessed to no longer be (easily) readable natural-text. This might help or hurt; just note it's not strictly necessary, as much published Word2Vec/Doc2Vec work even leaves in stop words.
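
First sketch, illustrating the flat-file approach from the first point. This is a minimal sketch, assuming the `documents1` RDD from the earlier post; the 'patents.txt' file name and its one-line-per-document tag<TAB>text format are illustrative choices, not anything from the original code:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# one-time export: pull the documents out of Spark into a local flat file,
# writing one "tag<TAB>space-separated-words" line per document
with open('patents.txt', 'w', encoding='utf-8') as out:
    for doc in documents1.toLocalIterator():
        out.write(doc.tags[0] + '\t' + ' '.join(doc.words) + '\n')

# a re-iterable corpus that streams TaggedDocuments from disk; Doc2Vec
# iterates over the corpus multiple times (vocabulary scan plus training
# passes), so this must be a class with __iter__, not a one-shot generator
class TaggedLineCorpus(object):
    def __init__(self, path):
        self.path = path
    def __iter__(self):
        with open(self.path, encoding='utf-8') as f:
            for line in f:
                tag, _, text = line.partition('\t')
                yield TaggedDocument(words=text.split(), tags=[tag])

model = Doc2Vec(documents=TaggedLineCorpus('patents.txt'))

Once that export succeeds, training no longer depends on the Spark cluster staying up.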
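
Second sketch, a rough illustration of the same-tag splitting from the second point (the 10,000-word cap is gensim's internal per-text limit; the helper function is made up for illustration):

from gensim.models.doc2vec import TaggedDocument

MAX_WORDS = 10000  # gensim silently ignores words beyond this in any one text example

def split_long_doc(words, tag, max_words=MAX_WORDS):
    # yield consecutive chunks of at most max_words tokens, all carrying
    # the same tag, so they train toward one shared doc-vector
    for start in range(0, len(words), max_words):
        yield TaggedDocument(words=words[start:start + max_words], tags=[tag])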

- Gordon


Tummala Armitha

Apr 16, 2018, 4:45:28 PM
to gensim
Hi,
I am using DeepDist and gensim Doc2Vec, and I have faced the same issue. Can you please tell me how you solved it? Can you share your results of training Doc2Vec?

Gordon Mohr

Apr 16, 2018, 7:18:05 PM
to gensim
If your error is actually "py4j connection refused error", then it's coming from the non-gensim software you're running, and you'd be most likely to get help with that specific error in forums dedicated to that other software. 

- Gordon