MongoDB Spark Connector slow when inferring schema


Isart Montane

Oct 19, 2017, 4:33:15 AM
to mongodb-user
Hello,

I'm trying to use the MongoDB Spark Connector to read data from MongoDB into Spark, and I'm having some trouble when reading from a big collection (300+ GB, 800+ million rows).

The connector gets stuck for more than 30 minutes on the aggregate query that tries to infer the schema from the collection.

The query looks something like this:
```
"pipeline" : [
{
"$match" : {
"updateAt" : {
"$gt" : ISODate("2017-10-06T00:00:00Z")
}
}
},
{
"$sample" : {
"size" : 1000
}
}
],
"planSummary" : "IXSCAN { updateAt: 1 }",
```

Most of our documents look the same, so reading the first 1000 rows to infer the schema would be enough. Is there a way to force that?

I've looked at the connector code, and it looks like this line is where the code decides to use a sample: it does so if `hasSampleAggregateOperator` returns true, and that returns true when we use a MongoDB version > 3.2.


Is there any config option to force it to use the first N rows instead of a random sample?

I've seen that we can use "case classes" to set the schema instead of inferring it, but since we need to read from a few collections, that might get a bit messy.

We've tried MongoDB versions 3.2.9, 3.2.17 and 3.4.9, and all showed the same behaviour.

Thanks,
Isart Montane Mogas


Wan Bachtiar

Nov 15, 2017, 2:15:30 AM
to mongodb-user

> I've seen that we can use "case classes" to set the schema instead of inferring it, but since we need to read from a few collections, that might get a bit messy.

Hi Isart,

It’s been a while since you posted this question, have you found a solution yet?

Specifying the schema explicitly would be preferable if you know the collection's schema, particularly since you said that the documents look the same.
For example, to specify the schema explicitly:

  import com.mongodb.spark.MongoSpark

  case class MyDocument(name: String, age: Int)

  // The schema comes from the case class, so no sampling is needed
  MongoSpark.load[MyDocument](sparkSession).printSchema()

See also the SparkSQL examples.

Alternatively, you can specify a different sample size via the read config. See the `sampleSize` option in ReadConfig.
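
For instance, something along these lines (just a sketch, assuming a `SparkSession` named `sparkSession` whose configuration already carries the connection URI and collection):

```
import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.ReadConfig

// Override only the sample size; everything else falls back to the
// defaults already set on the Spark configuration.
val readConfig = ReadConfig(
  Map("sampleSize" -> "100"),
  Some(ReadConfig(sparkSession.sparkContext))
)

MongoSpark.load(sparkSession, readConfig).printSchema()
```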

Regards,
Wan.

Isart Montane

Nov 15, 2017, 5:46:48 AM
to mongod...@googlegroups.com
Hi Wan,

I've managed to find a workaround: read 10k rows from the collection and then infer the schema from that subset, instead of inferring it from the whole collection.

That seems to work fine for us.

Tuning the sampleSize didn't work because it still reads random rows from a really big collection, and that was very slow in our case.
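
For reference, the workaround looks roughly like this (only a sketch of the idea; the `$limit` pipeline, the format name and the variable names are illustrative, and `sparkSession` is assumed to be already configured):

```
import org.bson.Document
import com.mongodb.spark.MongoSpark

// 1) Read a small subset as raw documents; loading an RDD does not trigger
//    the connector's $sample-based schema inference.
val subset = MongoSpark
  .load(sparkSession.sparkContext)                          // MongoRDD[Document]
  .withPipeline(Seq(Document.parse("""{ "$limit": 10000 }""")))

// 2) Let Spark infer a schema from just those 10k documents.
val inferredSchema = sparkSession.read.json(subset.map(_.toJson)).schema

// 3) Read the full collection with the pre-computed schema, so the
//    connector skips its own inference step.
val df = sparkSession.read
  .format("com.mongodb.spark.sql")
  .schema(inferredSchema)
  .load()
```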

Thanks a lot for your answer anyway, really appreciated.

Isart



