Since the error is raised from the PySpark `collect()` method, the problem seems entirely independent of the gensim code.
That said, a 'connection refused' error like this typically indicates the targeted server (localhost:43963) isn't actually listening/running. Are you sure the Spark cluster and local server have been started, and are still running? Do the other commands that did run successfully, like the `take(1)`, still work if tried again?
Some other observations:
* 200,000 isn't that large a corpus; there may not be any benefit to streaming it from Spark, unless the Spark cluster is simply the convenient or usual place where this data lives. Still, you may just want to stream the docs to a local flat-file for Doc2Vec training (see the first sketch after this list) - after that finishes once, further local Doc2Vec experiments won't be reliant on the cluster/servers remaining up.
* note that if supplied text examples have more than 10,000 words, gensim Doc2Vec only trains on the first 10,000. If you have larger docs and need all of their text to influence the resulting doc-vectors, you'd want to split each into individual TaggedDocument instances of 10,000 or fewer words each, all sharing the same ID in `tags` (see the second sketch below).
* it looks like the body-text has already been preprocessed so it's no longer (easily) readable natural text. That might help or hurt - just note it's not strictly necessary, as much published Word2Vec/Doc2Vec work even leaves in stop words.
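
Here's a minimal sketch of the flat-file approach, assuming (hypothetically) that your RDD is named `docs_rdd` and yields `(doc_id, tokens)` pairs - adjust for your actual data shape:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Stream the corpus out of Spark once, one doc per line. toLocalIterator()
# pulls partitions to the driver one at a time, avoiding a giant collect().
# (Assumes tokens contain no tabs/newlines.)
with open('corpus.txt', 'w', encoding='utf-8') as fout:
    for doc_id, tokens in docs_rdd.toLocalIterator():
        fout.write(str(doc_id) + '\t' + ' '.join(tokens) + '\n')

# A restartable iterable over the flat file - gensim needs multiple passes.
class FlatFileCorpus:
    def __init__(self, path):
        self.path = path
    def __iter__(self):
        with open(self.path, encoding='utf-8') as fin:
            for line in fin:
                doc_id, text = line.rstrip('\n').split('\t', 1)
                yield TaggedDocument(words=text.split(), tags=[doc_id])

model = Doc2Vec(FlatFileCorpus('corpus.txt'),
                vector_size=100, epochs=20, workers=4)
```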
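And for the 10,000-word limit, a sketch of the split-but-share-a-tag workaround (the helper name here is just illustrative):

```python
from gensim.models.doc2vec import TaggedDocument

MAX_WORDS = 10000  # gensim's internal per-text training limit

def chunked_tagged_docs(doc_id, tokens, max_words=MAX_WORDS):
    """Split an over-long token list into <=10,000-word TaggedDocuments
    that all re-use the same tag, so every chunk's training updates the
    one shared doc-vector."""
    for start in range(0, len(tokens), max_words):
        yield TaggedDocument(words=tokens[start:start + max_words],
                             tags=[doc_id])
```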
- Gordon