MongoDB PySpark connector gets stuck (long pause with huge data)


Akshesh Doshi

Sep 6, 2018, 3:23:58 AM
to mongodb-user
Hi

I am using PySpark to read data from MongoDB via DataFrames (simply using spark.read.format('com.mongodb.spark.sql.DefaultSource').option('uri', uri).load()).

I am observing a behaviour which I am not able to understand:
When reading my data, Spark first performs a treeAggregate at the MongoInferSchema.scala:78 stage, then pauses for a seemingly random amount of time before it actually takes the next step with the data. I have also observed that this pause grows as the data in the source MongoDB database grows.
It may be worth mentioning that the InferSchema task itself is shown as completed within around 30 seconds.

Does anyone here have any idea what is going on during this pause? I monitored my cluster manually for IO, CPU, RAM, and inode utilization, but nothing seems to vary during this period.
I would really appreciate it if anyone could help me understand what is happening during this period and how I can avoid it.


Regards

Akshesh Doshi

Sep 6, 2018, 3:28:20 AM
to mongodb-user
If it makes any difference, I am using the following command to submit my job: /path/to/spark-submit --master spark://master:7077 --packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.3,com.databricks:spark-avro_2.11:4.0.0 --jars ./jar_files/elasticsearch-hadoop-5.6.4.jar --driver-class-path ./jar_files/elasticsearch-hadoop-5.6.4.jar main_df.py. We are trying to read from MongoDB and index the data into Elasticsearch.

Also, when the data size in our MongoDB collection was around 1 TB, this pause went beyond an hour (and we had to cancel our Spark job).

Ross Lawley

Sep 7, 2018, 11:47:18 AM
to mongod...@googlegroups.com
Hi Akshesh,

Inferring the schema can be expensive, as it samples the collection. You can avoid this cost by providing your own schema; is that viable in your scenario?

Ross

--
You received this message because you are subscribed to the Google Groups "mongodb-user" group.
For other MongoDB technical support options, see: https://docs.mongodb.com/manual/support/
To unsubscribe from this group and stop receiving emails from it, send an email to mongodb-user...@googlegroups.com.
To post to this group, send email to mongod...@googlegroups.com.
Visit this group at https://groups.google.com/group/mongodb-user.
To view this discussion on the web visit https://groups.google.com/d/msgid/mongodb-user/51e27f9c-feb9-4556-a88e-c79cc435904f%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


--


{ name     : "Ross Lawley",
  title    : "Senior Software Engineer",
  location : "London, UK",
  twitter  : ["@RossC0", "@MongoDB"],
  facebook : "MongoDB" }

Akshesh Doshi

Sep 14, 2018, 3:54:04 AM
to mongod...@googlegroups.com
Hi Ross

Thank you very much for taking the time to respond to my query.

What I understand is that the inferSchema stage completes in a few seconds, and the process then spends its time doing something else before starting the next stage (writing to ES). I've attached screenshots of these two stages; please note that once the first stage completes, the next stage doesn't start immediately but only after quite a long pause (which is my concern).

I can try providing the schema explicitly myself to solve the problem for now, but that is not a scalable approach in my case.
Is there anything else I can do? For instance, reducing the sample size?

Thanks again for the reply.


Attachments: Selection_069.png, Selection_070.png

Ross Lawley

Sep 19, 2018, 9:23:43 AM
to mongod...@googlegroups.com
Hi,

It looks like the cost of inferring the schema is relatively small (~20-40 seconds). That cost should drop in the next release of the Spark Connector because of the work done in SPARK-210, which limits the number of documents sampled and should vastly reduce the cost of inferring the schema on large collections.
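[For reference, the connector also exposes a sampleSize read option to cap how many documents are sampled for inference. A sketch, assuming that option name per the connector's configuration reference; the uri value is a placeholder:]

```python
# Hypothetical read configuration capping schema-inference sampling.
# "sampleSize" is the connector's read option for the number of
# documents sampled; the uri below is a placeholder, not a real cluster.
read_options = {
    "uri": "mongodb://host:27017/db.collection",
    "sampleSize": "1000",
}

# df = (spark.read
#       .format("com.mongodb.spark.sql.DefaultSource")
#       .options(**read_options)
#       .load())
```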

From the image, the cost is in EsSparkSQL at line 97: that stage creates 59,214 tasks, and it takes time to work through that many tasks in order to complete the stage.

I hope that helps,

Ross

