Proper way to update reasonably large data using mongo-spark connector via pyspark

Xiao Cui

May 20, 2017, 12:30:32 AM

to mongodb-user

Hello everyone,

I'm using Spark as an ETL tool for our data pipeline (mainly pyspark, hosted on EMR). At the very end of the pipeline we use the mongo-spark connector to export the results to a Mongo database. We have seen these writes drag down the performance of the whole mongo instance, e.g.:

command: insert { insert: "xxx_intermediate_wip", ordered: true, documents: 309 } ninserted:309 keyUpdates:0 writeConflicts:0 numYields:0 reslen:80 locks:{ Global: { acquireCount: { r: 315, w: 315 } }, MMAPV1Journal: { acquireCount: { w: 322 }, acquireWaitCount: { w: 4 }, timeAcquiringMicros: { w: 7690 } }, Database: { acquireCount: { w: 315 }, acquireWaitCount: { w: 1 }, timeAcquiringMicros: { w: 13713400 } }, Collection: { acquireCount: { W: 6 }, acquireWaitCount: { W: 6 }, timeAcquiringMicros: { W: 86102310 } }, Metadata: { acquireCount: { w: 309 } }, oplog: { acquireCount: { W: 309 }, acquireWaitCount: { W: 1 } } } protocol:op_query 100165ms
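For anyone reading the log line above, here is a small sketch (plain Python, standard library only) that pulls the timeAcquiringMicros values out of the locks section and converts them to seconds. The regular expression is my own and only handles the layout seen in this one line; it is not a general mongod log parser.

```python
import re

# Locks section of the slow-insert log line quoted above (copied verbatim).
LOG = (
    'locks:{ Global: { acquireCount: { r: 315, w: 315 } }, '
    'MMAPV1Journal: { acquireCount: { w: 322 }, acquireWaitCount: { w: 4 }, '
    'timeAcquiringMicros: { w: 7690 } }, '
    'Database: { acquireCount: { w: 315 }, acquireWaitCount: { w: 1 }, '
    'timeAcquiringMicros: { w: 13713400 } }, '
    'Collection: { acquireCount: { W: 6 }, acquireWaitCount: { W: 6 }, '
    'timeAcquiringMicros: { W: 86102310 } }, '
    'Metadata: { acquireCount: { w: 309 } }, '
    'oplog: { acquireCount: { W: 309 }, acquireWaitCount: { W: 1 } } } '
    'protocol:op_query 100165ms'
)

# Matches only resources that report a wait time (assumed layout, see lead-in).
LOCK_WAIT = re.compile(
    r'(\w+): \{ acquireCount: [^}]*\}, acquireWaitCount: [^}]*\}, '
    r'timeAcquiringMicros: \{ \w+: (\d+) \}'
)


def lock_waits_seconds(line):
    """Map each lock resource to the seconds spent waiting to acquire it."""
    return {name: int(micros) / 1_000_000 for name, micros in LOCK_WAIT.findall(line)}


def total_seconds(line):
    """Total operation time, taken from the trailing '<n>ms' field."""
    match = re.search(r'(\d+)ms\s*$', line)
    return int(match.group(1)) / 1000 if match else None


waits = lock_waits_seconds(LOG)
total = total_seconds(LOG)
for name, secs in waits.items():
    print(f"{name}: waited {secs:.2f}s of {total:.2f}s total")
```

On this line the Collection lock wait comes to about 86 s of the roughly 100 s total, and the Database lock wait to about 14 s, which suggests the insert spent most of its time queued behind other operations rather than actually writing.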

The questions/issues I have:

1. Is there a way to call ensureIndex on the collection through the mongo-spark connector's Python API? Would this help speed up the writes?

2. Is there a way that we can override the default batch size here:

https://github.com/mongodb/mongo-spark/blob/master/src/main/scala/com/mongodb/spark/MongoSpark.scala

I ask because I noticed that every insert is around 300 documents.

3. Are there any recommended "best practices" for this kind of scenario?

(The dataframe is around 4M records; we are using Mongo 3.2 without WiredTiger.)

Thanks in advance!

Wan Bachtiar

Jun 28, 2017, 3:56:01 AM

to mongodb-user

We have seen these writes drag down the performance of the whole mongo instance, e.g.:

Hi Xiao,

It’s been a while since you posted this question; have you found a way to improve insert performance?

Before going deeper into the Spark config, I would recommend limiting the scope of the performance test, for example by testing how your MongoDB instance on its own handles 4M inserts. The goal is to find the bottleneck with simple tests, such as checking MongoDB’s memory and disk I/O during the load. See also MongoDB Capacity Planning.
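As a rough illustration of such a scoped-down test, the sketch below times batched inserts without Spark in the picture. The function names, batch size, and document shape are made up for the example; the insert call is passed in as a parameter so the harness can be dry-run against a plain list before pointing it at a real collection.

```python
import time


def batches(docs, batch_size):
    """Yield lists of at most batch_size documents from any iterable."""
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def time_inserts(insert_many, docs, batch_size=1000):
    """Return (documents inserted, total seconds, slowest batch seconds)."""
    inserted = 0
    slowest = 0.0
    start = time.perf_counter()
    for batch in batches(docs, batch_size):
        t0 = time.perf_counter()
        insert_many(batch)  # e.g. collection.insert_many with pymongo
        slowest = max(slowest, time.perf_counter() - t0)
        inserted += len(batch)
    return inserted, time.perf_counter() - start, slowest


# Dry run against an in-memory list (no database involved).
sink = []
sample_docs = ({"_id": i, "value": i * 2} for i in range(10_000))
count, elapsed, slowest = time_inserts(sink.extend, sample_docs, batch_size=500)
print(f"{count} docs in {elapsed:.4f}s, slowest batch {slowest:.4f}s")

# Against a real server (assumes pymongo is installed; the URI, database and
# collection names here are hypothetical):
#   from pymongo import MongoClient
#   coll = MongoClient("mongodb://localhost:27017")["test"]["insert_benchmark"]
#   time_inserts(coll.insert_many, sample_docs, batch_size=500)
```

If the standalone run is already slow, or the slowest batch is far above the average, the bottleneck is more likely on the MongoDB side than in the Spark job.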

You may find the following performance related resources useful:

Is there a way to call ensureIndex on the collection through the mongo-spark connector’s Python API? Would this help speed up the writes?

Generally, adding an index won’t improve insert operations; every index on the collection has to be maintained on each insert, which adds write work. An index may speed up the query part of an update operation, but that depends on the update itself.
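I am not aware of an index-creation call in the connector's Python API, so one option is to create indexes directly with the driver once the bulk load has finished. The sketch below assumes pymongo for the real call; ensure_indexes, the stand-in class, and the field names are all hypothetical and only there so the example runs without a database.

```python
ASCENDING, DESCENDING = 1, -1  # same values as pymongo.ASCENDING / pymongo.DESCENDING


def ensure_indexes(collection, index_specs):
    """Create each index and return the index names reported back.

    `collection` only needs a create_index(keys) method, which is what
    pymongo's Collection provides.
    """
    return [collection.create_index(keys) for keys in index_specs]


class StandInCollection:
    """Minimal stand-in for a pymongo Collection, for a dry run only."""

    def __init__(self):
        self.created = []

    def create_index(self, keys):
        self.created.append(keys)
        # Mimics the default name pymongo generates, e.g. "customer_id_1".
        return "_".join(f"{field}_{direction}" for field, direction in keys)


# Hypothetical fields; replace with the fields your updates actually query on.
specs = [
    [("customer_id", ASCENDING)],
    [("created_at", DESCENDING), ("status", ASCENDING)],
]
stand_in = StandInCollection()
index_names = ensure_indexes(stand_in, specs)
print(index_names)

# With a real server (assumes pymongo; names are hypothetical):
#   from pymongo import MongoClient
#   coll = MongoClient("mongodb://localhost:27017")["etl"]["xxx_intermediate_wip"]
#   ensure_indexes(coll, specs)
```

Running this after the load rather than before keeps the inserts from paying the index-maintenance cost on every document.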

we are using Mongo 3.2 without wiredTiger

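If possible, I would recommend testing the WiredTiger storage engine for your use case. The log line in your post shows most of the time being spent waiting on MMAPv1’s collection-level lock, whereas WiredTiger uses document-level concurrency control.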
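On the batch-size question from the original post: the connector's output configuration documents a maxBatchSize option (default 512, as far as I recall), but whether the connector version in use at the time of this thread already supports it is an assumption worth checking. A minimal sketch of building the write options in Python follows; the helper name, URI, and database name are hypothetical.

```python
def mongo_write_options(uri, database, collection, max_batch_size=512):
    """Build string-valued options for the connector's DataFrame writer.

    "maxBatchSize" is taken from the connector's output configuration docs;
    support in older connector versions is an assumption.
    """
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    return {
        "uri": uri,
        "database": database,
        "collection": collection,
        "maxBatchSize": str(max_batch_size),
    }


opts = mongo_write_options(
    "mongodb://localhost:27017",  # hypothetical URI
    "etl",                        # hypothetical database name
    "xxx_intermediate_wip",       # collection name from the log line
    max_batch_size=1000,
)
print(opts)

# In the pyspark job (assumes a DataFrame `df` and the 2.x connector on the
# classpath):
#   df.write.format("com.mongodb.spark.sql.DefaultSource") \
#       .mode("append").options(**opts).save()
```

Larger batches mean fewer round trips, but with the lock waits seen in the log they may not help much on their own, so it is worth measuring before and after.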