Use cases for the aggregation pipeline from the Spark connector

rbru...@ippon.fr

Feb 3, 2017, 2:24:43 PM

to mongodb-user

Hi,

I'm playing with MongoDB, Spark and the MongoDB connector for Spark.

Specifically, I'm interested in the feature to send an aggregation pipeline definition from Spark to MongoDB to do the processing on the MongoDB side instead of the Spark nodes.

The connector already does a pretty good job of pushing down the predicates (projection + filter) from Spark to MongoDB to minimize the amount of data loaded in Spark.

I can't think of a use case where you would need to use the aggregation pipeline from Spark when Spark is already set up on top of MongoDB.

Especially since the pipeline seems more constrained than Spark.

Do you know use cases where it is better to use the aggregation pipeline from Spark than only Spark?

Thanks.

Wan Bachtiar

Feb 12, 2017, 6:39:10 AM

to mongodb-user

> Do you know use cases where it is better to use the aggregation pipeline from Spark than only Spark?

Hi,

As you have mentioned, using the MongoDB aggregation pipeline can minimise the amount of data loaded into your Spark workers. This is especially useful if you have large documents but only need to process certain fields, because it reduces the amount of data transferred over the network.
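As a rough illustration of the point above, here is a minimal Python sketch that builds a `$match` + `$project` pipeline and hands it to the connector when reading from PySpark. The field names (`qty`, `item`) and both helper functions are hypothetical. The data source name and the `pipeline` read option are assumptions based on the 2.x connector; they differ in other versions, and whether a multi-stage JSON array is accepted also depends on the version, so check the documentation for the connector you run.

```python
import json


def build_pipeline(min_qty, fields):
    """Build a two-stage aggregation pipeline as a JSON string.

    Hypothetical helper: "qty" is a placeholder field name.
    """
    stages = [
        # Stage 1: keep only documents whose qty is at least min_qty.
        {"$match": {"qty": {"$gte": min_qty}}},
        # Stage 2: keep only the requested fields and drop _id,
        # so large documents are trimmed before leaving MongoDB.
        {"$project": {**{name: 1 for name in fields}, "_id": 0}},
    ]
    return json.dumps(stages)


def load_with_pipeline(spark, pipeline_json):
    """Read a DataFrame through the connector with the given pipeline.

    Assumptions: the MongoDB Spark connector is on the classpath, the
    input URI is already set in the Spark configuration, and this
    connector version exposes the format name and "pipeline" option
    used below.
    """
    return (
        spark.read.format("com.mongodb.spark.sql.DefaultSource")
        .option("pipeline", pipeline_json)
        .load()
    )
```

With an existing `SparkSession` named `spark`, this would be called as `load_with_pipeline(spark, build_pipeline(5, ["item", "qty"]))`, so only matching documents, reduced to two fields, travel over the network.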

Apache Spark is a framework designed for data processing. If you already have a Spark deployment, it makes sense to use it for the processing itself.

Kind regards,

Wan.

rbru...@ippon.fr

Feb 14, 2017, 4:34:38 PM

to mongodb-user

Thanks for your answer.

This confirms what I suspected.

When using Apache Spark on top of MongoDB, the aggregation pipeline is mostly useful for filtering the data before it is transferred to the Spark nodes.

The connector already does this automatically for projections and WHERE clauses.