Querying Data with MongoDB Connector for Apache Spark


Koko

Nov 1, 2016, 10:05:25 AM
to mongodb-user
Hi,

I'm currently evaluating the MongoDB Connector for Apache Spark.
I have some collections where I want to query data from and process that data with Spark afterwards.

Previously, we have been running traditional find() queries, such as:
  • $geoWithin queries
  • time range queries on an indexed timestamp
  • "normal" queries such as find({$and: [ field_x: {$elemMatch: { field_y: "whatever" } }, ... ]) on indexed fields
  • combinations of $geoWithin and time range queries with $and
  • combinations of time range and field-value queries with $and

These queries let us narrow the complete dataset down to only the subset we needed.

Ideally, I would like to reuse the existing queries with the mongo-spark-connector. 


So my questions are:

  1. How can I run find() queries and process the results with Spark afterwards? As far as I can see, the connector only exposes the aggregation pipeline.
  2. If there is no way to use find(), why not? How do find and aggregate compare in terms of performance? I tried to look up the differences and stumbled upon this: https://www.codeschool.com/discuss/t/differences-between-find-and-aggregate-query-and-projection-stages/21004/2. The tl;dr: "Something to keep in mind, aggregations are not a replacement for find(). Generally, you'll use an aggregate to calculate something specific. Whereas, you'll be using find() to retrieve documents". I basically want to pre-filter the data being passed from MongoDB to Spark, not add fields, group, or the like.


Thanks

Ross Lawley

Nov 1, 2016, 2:36:25 PM
to mongod...@googlegroups.com
Hi Koko,

The MongoDB Spark Connector uses the aggregation framework for two main reasons:  

Firstly, the aggregation framework is designed around extensible pipelines. This is super useful and reduces the risk of bugs when it comes to partitioning a collection. The pipeline pattern also allows filters and projections coming from DataFrames / Datasets in Spark to be pushed down into MongoDB, reducing the amount of data sent across the wire to Spark.
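
For example, with the DataFrame API a filter and a projection are pushed down automatically. A rough sketch, assuming connector 2.x with a SparkSession and made-up field names ("status", "field_x", "timestamp"):

  import com.mongodb.spark._
  import org.apache.spark.sql.SparkSession

  // Assumes spark.mongodb.input.uri points at your database/collection.
  val spark = SparkSession.builder().appName("pushdown-example").getOrCreate()

  // Load the collection as a DataFrame (schema inferred by sampling).
  val df = MongoSpark.load(spark)

  // The filter and column selection are converted into $match / $project
  // stages, so only matching documents (and only these fields) cross the wire.
  val subset = df.filter(df("status") === "active").select("field_x", "timestamp")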

The second reason is to let users precompute with an aggregation before the data is sent to Spark. For some users this can bring substantial speed improvements, as initial transformations of the data can be done inside MongoDB, on the local data, before it crosses the wire.
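
For instance, with the RDD API you can attach your own pipeline and have MongoDB make the first pass over the data. Again just a sketch; the stages and field names are placeholders:

  import com.mongodb.spark._
  import org.bson.Document

  // Assumes sc is a SparkContext configured with spark.mongodb.input.uri.
  val rdd = MongoSpark.load(sc)

  // MongoDB runs these stages locally (per shard on a sharded cluster)
  // before any documents are sent to Spark.
  val precomputed = rdd.withPipeline(Seq(
    Document.parse("{ $match: { status: \"active\" } }"),
    Document.parse("{ $project: { field_x: 1, timestamp: 1 } }")
  ))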

For those reasons the use of find() rather than aggregate() was discounted. There are also numerous optimisations the pipeline optimiser can perform to improve efficiency. It's worth keeping in mind that when loading data into Spark from MongoDB, the cost of a find query versus an aggregation performing a $match is minimal compared to the cost of transmitting the data across the wire. Taken over a whole Spark job, where the data is then computed on further, the real difference between a find and an aggregate shrinks even more.

I hope that answers your questions! Please do use $match inside the aggregation pipeline so that only the active subset of the data is passed to Spark, taking advantage of sending less data across the wire.
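
In practice that means your existing find() filters can be reused verbatim as the query document of a $match stage. Sketched below with the field names from your example; the time range values are purely illustrative:

  import com.mongodb.spark._
  import org.bson.Document

  // The original filter, dropped straight into $match:
  // find({$and: [{ field_x: { $elemMatch: { field_y: "whatever" } } }, ...]})
  val pipeline = Seq(Document.parse(
    """{ $match: { $and: [
         { field_x: { $elemMatch: { field_y: "whatever" } } },
         { timestamp: { $gte: { $date: "2016-10-01T00:00:00Z" },
                        $lt:  { $date: "2016-11-01T00:00:00Z" } } }
       ] } }"""))

  val matched = MongoSpark.load(sc).withPipeline(pipeline)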

Ross




--


{ name     : "Ross Lawley",
  title    : "Senior Software Engineer",
  location : "London, UK",
  twitter  : ["@RossC0", "@MongoDB"],
  facebook :"MongoDB"}
