Spark connector: Mongo Null String is converted to "null"

dan...@instaclustr.com

Nov 12, 2016, 10:10:16 PM
to mongodb-user
Hi All,

I am using mongo-spark-connector 1.1.1 and my MongoDB version is 3.2.0. In my mongo collection, there is a field called "middle_name" whose value can be null or a string. I manually defined the collection schema as follows:

import org.apache.spark.sql.types.{DataTypes, StructType}

val event_coll_schema = new StructType()
  .add("first_name", DataTypes.StringType, nullable = true)
  .add("middle_name", DataTypes.StringType, nullable = true)
  .add("last_name", DataTypes.StringType, nullable = true)

I tried to apply the following filter. 

import org.apache.spark.sql.functions.col

val df = sqlContext.read.mongo(event_coll_schema)
df.filter(col("middle_name").isNotNull)

The above filter does not filter out documents that contain a null middle_name.
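
For reference, here is a quick way to see how the values actually come through (a sketch, assuming the df defined above and the standard Spark SQL imports):

df.select("middle_name").distinct().show()         // does the empty value print as null or as the literal "null"?
df.filter(col("middle_name").isNull).count()       // rows Spark treats as SQL NULL
df.filter(col("middle_name") === "null").count()   // rows holding the actual string "null"

If the second count is zero and the third is not, the values really did load as the string "null".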

Then I tried to apply the following filter:

val df = sqlContext.read.mongo(event_coll_schema)
df.filter(col("middle_name") !== "null")

The above filter works!

So it looks like the null value is being read as the string "null".
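
One way to confirm whether the documents store BSON null or the literal string "null" is to bypass the DataFrame schema and inspect the raw documents (a sketch; in the 1.x connector MongoSpark.load(sc) returns an RDD of org.bson.Document, and this assumes the connection details are set in the Spark configuration):

import com.mongodb.spark.MongoSpark

MongoSpark.load(sc)
  .map(doc => Option(doc.get("middle_name")))   // raw value, untouched by the Spark schema
  .map {
    case None    => "BSON null (or missing field)"
    case Some(v) => s"${v.getClass.getSimpleName}: $v"
  }
  .distinct()
  .foreach(println)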

Does anyone have the same issue?

Wan Bachtiar

Nov 15, 2016, 1:29:10 AM
to mongodb-user

I am using mongo-spark-connector 1.1.1

Hi danyang,

Do you mean mongo-spark v1.1.0? The next release up is version 2.0.0.

In my mongo collection, there is a field called "middle_name" whose value can be null or a string. I manually defined the collection schema as follows:

I ran a test with mongo-spark v1.1.0, MongoDB v3.2.x, and Apache Spark 1.6.2, with three documents in a collection called names, as below:

{"first_name": "Feisty", "middle_name": "String", "last_name": "Fawn"}
{"first_name": "Lucid", "middle_name": "String", "last_name": "Lynx"}
{"first_name": "Maverick", "middle_name": null, "last_name": "Meerkat"}

Using Scala code example below:

> import com.mongodb.spark.config.ReadConfig
> import com.mongodb.spark.sql._
> import org.apache.spark.sql.functions.col
> import org.apache.spark.sql.types.{DataTypes, StructType}

> val readConfigNames: ReadConfig = ReadConfig(Map("uri" -> "mongodb://host:27017/dbName.collName"))

> val schema = new StructType()
    .add("first_name", DataTypes.StringType, nullable = true)
    .add("middle_name", DataTypes.StringType, nullable = true)
    .add("last_name", DataTypes.StringType, nullable = true)

> val names = sqlContext.read.mongo(schema, readConfigNames)

> names.filter(col("middle_name").isNotNull).foreach(print)
[Feisty,String,Fawn][Lucid,String,Lynx] // filtering out 'Maverick'

It successfully queries only the documents where the value of middle_name is not null. You may also find some of the examples in mongo-spark: spark-sql useful.
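
For completeness, printing the schema and querying the complementary null side should give (expected output, given the three documents above):

> names.printSchema()
root
 |-- first_name: string (nullable = true)
 |-- middle_name: string (nullable = true)
 |-- last_name: string (nullable = true)

> names.filter(col("middle_name").isNull).foreach(print)
[Maverick,null,Meerkat] // only the document with the null middle_name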

Does anyone have the same issue?

If you have further questions, could you provide the following information:

  • Specific version of mongo-spark
  • Specific version of Apache Spark
  • Example documents input

regards,

Wan.
