MongoDB-Spark conflict datatype issue


Siva B

Jul 20, 2016, 7:37:51 PM
to mongodb-user
Hi All,

I am using the mongo-spark-connector_2.10:1.0.0 package to connect to MongoDB v3.2 from Spark v1.6.
While mapping a collection to a DataFrame, most of the fields end up with the conflict data type.

>>> df.printSchema()
root
 |-- Feature1: conflict (nullable = true)
 |-- Feature2: conflict (nullable = true)
 |-- Feature3: conflict (nullable = true)
 |-- Feature4: integer (nullable = true)

While retrieving these fields using a SQL select or a DataFrame select operation, I get errors like "ValueError: Could not parse datatype: conflict".

I tried casting these fields to String, but that operation also failed:
Caused by: com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast STRING into a IntegerType (value: BsonString{value='NA'})

How can I handle this case?

Thanks,
Siva




Ross Lawley

Jul 21, 2016, 6:16:09 AM
to mongod...@googlegroups.com
Hi Siva,

The ConflictType indicates that the field was found to contain disparate data types that cannot be coerced into a single unifying type. In simple terms, the field contains varying types of data, for example numbers and strings, or strings and documents.

What types do these Feature fields contain? You may be able to manually specify the schema and allow Spark to coerce the data, along the lines of the sketch below.
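
For example, a minimal PySpark sketch (hedged: it assumes Spark 1.6 with connector 1.0.0, a SQLContext named sqlContext, and that spark.mongodb.input.uri is already configured; the field names come from your printSchema() output):

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Declare the schema explicitly instead of letting the connector infer it;
# StringType on the conflicting fields lets mixed values load as strings.
schema = StructType([
    StructField("Feature1", StringType(), True),
    StructField("Feature2", StringType(), True),
    StructField("Feature3", StringType(), True),
    StructField("Feature4", IntegerType(), True),
])

df = sqlContext.read \
    .format("com.mongodb.spark.sql.DefaultSource") \
    .schema(schema) \
    .load()
df.printSchema()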

Ross




--


{ name     : "Ross Lawley",
  title    : "Senior Software Engineer",
  location : "London, UK",
  twitter  : ["@RossC0", "@MongoDB"],
  facebook : "MongoDB" }

Damon Woo

Jul 26, 2016, 3:09:00 AM
to mongodb-user
Hi, Ross

I had a similar issue.
It is OK when I load from MongoDB:
val rdd = MongoSpark.load(sc)
and printSchema() runs fine.
But when I execute
rdd.count()

it throws an exception like

com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast OBJECT_ID into a ConflictType (value: BsonObjectId{value=5570245373af6caaed4efe02})
How can I identify the wrong data?

Thanks,
Damon

Ross Lawley

Jul 26, 2016, 4:46:40 AM
to mongodb-user
Hi Damon,

It looks like one of your fields contains mixed types, so it has been assigned a ConflictType.  It should be visible when you run printSchema(); however, I've added https://jira.mongodb.org/browse/SPARK-70 to improve the error message regarding conflict types.
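
If you want to track down the offending documents directly, one option is to query MongoDB for values of an unexpected type. A rough pymongo sketch (the URI, database, collection, and field names are placeholders; on MongoDB 3.2, $type takes a numeric BSON type code, where 7 means ObjectId):

from pymongo import MongoClient

coll = MongoClient("mongodb://localhost:27017")["mydb"]["mycollection"]  # placeholders

# Find documents whose suspect field holds an ObjectId (BSON type code 7),
# while the rest of the collection presumably holds another type there.
for doc in coll.find({"someField": {"$type": 7}}).limit(10):
    print(doc["_id"], doc["someField"])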

Ross

Damon Woo

Jul 26, 2016, 9:43:37 PM
to mongodb-user
Hi, Ross
Thanks for your help; there is indeed a field that contains mixed types.

Siva B

Jul 27, 2016, 12:19:09 PM
to mongodb-user
But mixed data should be treated as "string", right? The Stratio Spark connector maps these columns as string only.

Is there any reason for handling this data as a conflict data type?

Siva B

Jul 28, 2016, 4:23:58 AM
to mongodb-user
Is there any workaround for querying conflict datatype columns from a DataFrame?

Ross Lawley

Aug 1, 2016, 9:07:36 AM
to mongodb-user
Hi Siva,

Currently, the workaround is to manually set the schema for the DataFrame to string, rather than using the MongoConnector to infer the schema, which can result in ConflictTypes.

There are downsides to using String as the base type; for example, filters that are converted into aggregation framework queries may not work as expected (for example when querying ObjectIds). In those instances, caching the DataFrame and using Spark to apply the filter post-conversion would return the correct results.
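
A hypothetical PySpark sketch of that caching approach (the column name and ObjectId value are placeholders; assumes a SQLContext named sqlContext and a configured input URI):

from pyspark.sql.types import StructType, StructField, StringType

# All-string schema so the ObjectId field loads as a plain string
schema = StructType([StructField("refId", StringType(), True)])

df = sqlContext.read \
    .format("com.mongodb.spark.sql.DefaultSource") \
    .schema(schema) \
    .load() \
    .cache()
df.count()  # materialize the cache so the filter below runs in Spark

# String comparison on the stringified ObjectId, applied post-conversion
# rather than pushed down to MongoDB as an aggregation query
matched = df.filter(df["refId"] == "5570245373af6caaed4efe02")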

I'm currently investigating alternative workarounds, some of which may rely on changes forthcoming in Spark.

Ross

Jason Zhang

Oct 10, 2017, 9:52:29 PM
to mongodb-user

Hi all,


I think we can leverage the lazy evaluation mechanism to handle the dynamic schema during the read phase. However, due to limited support in the Mongo Spark Connector, we can't write an RDD, which would support a dynamic schema, to Mongo.

How can I get support from the MongoDB Spark team?



*Dynamic Schema Challenge*

  • Spark has RDDs and DataFrames; by design, RDDs support a dynamic schema, while DataFrames do not, in exchange for better performance.
  • The Mongo Spark Connector's Scala API supports RDD read & write, but the Python API does not; the Python API only supports DataFrames, which by Spark's design do not support a dynamic schema.

----Workaround for the read phase, completed (a sketch follows the list below)
1. Read the Mongo documents into a DataFrame
2. Dump the data to JSON strings
3. Transfer it to the TD Spark application
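
A rough PySpark sketch of those three steps (hypothetical; assumes a Spark 2.x session named spark with spark.mongodb.input.uri configured, and the output path is a placeholder):

# 1. Read the Mongo documents into a DataFrame
df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()

# 2. Dump the data to JSON strings; each record keeps its own shape,
#    which sidesteps the DataFrame's fixed schema
json_rdd = df.toJSON()

# 3. Hand the RDD of JSON strings over to the downstream application
json_rdd.saveAsTextFile("hdfs:///tmp/mongo-dump")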

----Blocking issue in the write phase, pending on the Mongo Spark team
For the write phase, we parse the strings into dynamic-schema dictionaries in an RDD; however, we can't push it to the connector without transforming it into a DataFrame.
I think we need to consult with the Mongo Spark team; once Mongo Spark supports RDD writes, we can migrate all code to Python.

Issue history:
1. The RDD approach was deprecated in the mongo-hadoop project in March 2016.
RDD saveAsNewAPIHadoopFile, which used to write data into MongoDB, has been deprecated:
rdd.saveAsNewAPIHadoopFile(
    path='file:///this-is-unused',
    outputFormatClass='com.mongodb.hadoop.MongoOutputFormat',
    keyClass='org.apache.hadoop.io.Text',
    valueClass='org.apache.hadoop.io.MapWritable',
    conf={'mongo.output.uri': 'mongodb://t2cUserQA:G05h...@qa-t2c-node1.paradata.io:27017/t2c.JasonpartFromSpark2'}
)
Announced at: https://github.com/mongodb/mongo-hadoop/wiki/Spark-Usage

2. ObjectId issue: resolved. When converting to a DataFrame we found: "TypeError: not supported type: <class 'bson.objectid.ObjectId'>"
Tracked by: https://jira.mongodb.org/browse/HADOOP-277

Schema-related issues:
3. StructType issue: "com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast ARRAY into a StructType"
Tracked by: https://groups.google.com/forum/#!topic/mongodb-user/lQjppYa21mQ

4. Repartition issue:
"Cannot cast ARRAY into a StructType(StructField(0,StringType,true), StructField(1,StringType,true), StructField(2,StringType,true), StructField(3,StringType,true), StructField(4,StringType,true)) (value: BsonArray{values=[BsonString{value='Logic'}, BsonString{value='Logic ICs'}]})"
Tracked by: https://groups.google.com/forum/#!topic/mongodb-user/lQjppYa21mQ

Murtaza Chiba

Oct 17, 2017, 7:41:15 PM
to mongodb-user


Hello Ross,

I am running the MongoDB Spark example for MovieLens and getting the same error. What is the workaround or, hopefully by now, a solution to the issue?

// Calculating the best model
val bestModel = trainedAndValidatedModel.fit(movieRatings)

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 20.0 failed 1 times, most recent failure: Lost task 0.0 in stage 20.0 (TID 20, localhost, executor driver): com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast STRING into a IntegerType (value: BsonString{value='userId'})

Wan Bachtiar

Oct 18, 2017, 3:04:22 AM
to mongodb-user

Hi Jason,

I believe this question has also been posted and responded to on SPARK-146.
Please keep the discussion there for continuity.

Regards,
Wan.

Wan Bachtiar

Oct 18, 2017, 3:19:35 AM
to mongodb-user

com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast STRING into a IntegerType (value: BsonString{value='userId'})

Hi Murtaza,

This likely means the field was found to contain different data types that cannot be coerced into a unifying type. In other words, the field userId contains varying types of data, e.g. integers and strings.

Note that in the MongoDB Connector for Spark v2, the base type for conflicting types is string.
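
One hedged PySpark sketch of working with that (only the conflicting field is shown; assumes a Spark 2.x session named spark and a configured input URI): load userId as a string, keep only the rows that parse as integers, then cast:

from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([StructField("userId", StringType(), True)])
df = spark.read.format("com.mongodb.spark.sql.DefaultSource").schema(schema).load()

# Non-numeric strings cast to null, so this drops the stray rows before casting
ratings = df.filter(df["userId"].cast("int").isNotNull()) \
    .withColumn("userId", df["userId"].cast("int"))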

If you have further questions, please open a new discussion thread with the relevant details of your environment:

  • MongoDB Spark Connector version
  • Apache Spark version
  • Example document(s)
  • More information about the code snippet

Regards,
Wan.

Murtaza Chiba

Oct 18, 2017, 11:35:34 AM
to mongodb-user
Hello Wan,

Thanks. Based on your input, I figured out that when I loaded the original CSV file, the header row got added as a document. Things worked after dropping the header document from the collection.
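
For anyone hitting the same thing, the cleanup can be as simple as this hedged pymongo one-liner (the URI, database, collection, and field names are placeholders):

from pymongo import MongoClient

coll = MongoClient("mongodb://localhost:27017")["movies"]["ratings"]  # placeholders
# The imported CSV header became a document whose userId field holds the
# literal string "userId"; delete that single document.
coll.delete_one({"userId": "userId"})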

Best,
-Murtaza