Updating an Existing Document in MongoDB from Spark Using the mongo-spark Connector


luqman ulkhair

Nov 12, 2016, 10:10:04 PM
to mongodb-user
Hi,

I want to update some fields of a collection using a Spark SQL DataFrame. I made some changes to a field of a document and then wrote the DataFrame back to MongoDB using APPEND_MODE, and the fields were updated successfully. However, when I try to update some fields, then after writing the DataFrame using the save method the remaining fields of the document disappear.

Wan Bachtiar

Nov 14, 2016, 9:44:48 PM
to mongodb-user

I made some changes to a field of a document and then wrote the DataFrame back to MongoDB using APPEND_MODE, and the fields were updated successfully. However, when I try to update some fields, then after writing the DataFrame using the save method the remaining fields of the document disappear.

Hi luqman,

To clarify, you are able to update one field of a document successfully, however when updating multiple fields the non-updated fields disappear?

I’ve just tested this behaviour in the following environment: mongo-spark v1.1.0, Apache Spark v1.6.2, and MongoDB v3.2.x, using ‘append’ mode, and successfully updated multiple field values from a DataFrame:

MongoSpark.save(df.write.option("collection", "collNameToUpdate").mode("append"))
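
For reference, here is a rough sketch of the steps I used in the spark-shell (the collection and field names are illustrative, and spark.mongodb.input.uri / spark.mongodb.output.uri were set when starting the shell):

import com.mongodb.spark.MongoSpark
import org.apache.spark.sql.functions.lit

// Load the collection into a DataFrame; connection details come from
// the spark.mongodb.input.uri property.
val df = MongoSpark.load(sqlContext)

// Change a couple of field values; "status" and "qty" are illustrative names.
val updated = df.withColumn("status", lit("processed")).withColumn("qty", lit(100))

// Write back in append mode; rows with a matching _id update the existing documents.
MongoSpark.save(updated.write.option("collection", "collNameToUpdate").mode("append"))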

If you have further questions, could you provide the following information:

  • Your mongo-spark version

  • Your Apache Spark version

  • Your MongoDB version

  • An example code snippet of the failing update

  • Example input/output documents showing the failed update operation

  • Any error log or output, if available.

Best regards,

Wan

luqman ulkhair

Nov 15, 2016, 12:59:51 PM
to mongodb-user
Dear Wan,

The actual problem is not with updating a single field versus multiple fields. The real problem is that when your DataFrame holds a subset of the fields of the actual collection and you then try to update a column, the document is replaced by the fields in the DataFrame. What I want is for the fields in the document which are not present in the DataFrame to remain unchanged, and for the fields in the DataFrame to be changed/added in the document.


A scenario could be: load the data into a DataFrame, filter out some fields (keeping _id), and then save it using append. The document in MongoDB is replaced; I don't want the other fields to disappear.

Code:

dataFrame.write.format("com.mongodb.spark.sql").mode("append").save()


Using:

mongo-spark-connector 2.0
mongo-java-driver 3.2
Apache Spark SQL core 2.0.1



Regards,
Luqman Ul Khair

Wan Bachtiar

Nov 23, 2016, 2:43:02 AM
to mongodb-user

A scenario could be: load the data into a DataFrame, filter out some fields (keeping _id), and then save it using append. The document in MongoDB is replaced; I don't want the other fields to disappear.

Hi luqman,

To clarify your case with an example: you have a document with fields fieldA and fieldB, plus the _id field, loaded into a DataFrame via the connector. After filtering out fieldA, updating the value of fieldB, and saving with append, the document in the source collection now contains only fieldB and _id.
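
For instance, with hypothetical values:

Original document in MongoDB:
{ "_id" : 1, "fieldA" : "a", "fieldB" : "b" }

DataFrame row written back after filtering out fieldA and updating fieldB:
{ "_id" : 1, "fieldB" : "b2" }

Document in the source collection after the append save (fieldA has disappeared):
{ "_id" : 1, "fieldB" : "b2" }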

If the above case description is correct, then there is currently an open tracking ticket SPARK-100 for this improvement. Please feel free to watch/upvote the ticket for updates.

However, if the above description is not what you’re referring to, could you please clarify further with some code steps and examples: for example, the original document, the DataFrame, the document after saving, etc.

Regards,

Wan.

Vijay Das

Jul 8, 2017, 8:25:24 PM
to mongodb-user
Hi wan,

I have the exact same requirement.

I'm currently updating the document using the API below. It overwrites the existing document. The DataFrame has fewer fields than the existing document. My requirement is to update only those fields that are available in the DataFrame; the remaining fields in the existing document should stay intact.

MongoSpark.save(df.write.option("collection", "xxxx").mode("append"))

Do you have a code snippet showing how we can achieve this?

Ross Lawley

Jul 13, 2017, 5:35:37 AM
to mongod...@googlegroups.com
Hi Vijay,

With the Spark Mongo Connector 2.1 you can do:

MongoSpark.save(df.write.option("collection", "xxxx").option("replaceDocument", "false").mode("append")) 

As long as the DataFrame has an _id field, it will update the existing document, setting the fields as per the DataFrame rather than replacing it.
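
For example, a minimal sketch of that pattern (assuming a spark-shell with spark.mongodb.input.uri and spark.mongodb.output.uri configured; the collection name "xxxx" and the "status" field are illustrative):

import com.mongodb.spark.MongoSpark
import org.apache.spark.sql.functions.lit

// Load the collection and keep only _id plus the field to change.
val df = spark.read.format("com.mongodb.spark.sql")
  .option("collection", "xxxx")
  .load()
  .select("_id", "status")

// With replaceDocument=false only the fields present in the DataFrame are set
// on the matching document; fields not in the DataFrame are left untouched.
MongoSpark.save(df.withColumn("status", lit("done")).write
  .option("collection", "xxxx")
  .option("replaceDocument", "false")
  .mode("append"))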

Ross

--


{ name     : "Ross Lawley",
  title    : "Senior Software Engineer",
  location : "London, UK",
  twitter  : ["@RossC0", "@MongoDB"],
  facebook :"MongoDB"}

Thang

May 4, 2018, 9:36:11 AM
to mongodb-user
Hi guys,

I have tried your recommendation but it doesn't work as expected, or maybe I'm misunderstanding it? Please take a look.

Spark version: 2.2.0
Mongo-Spark-Connector version: 2.2.2

Original collection:
{ "_id" : ObjectId("5aec54da78062173ca30f1bd"), "id" : 1, "value" : 1 }
{ "_id" : ObjectId("5aec54da78062173ca30f1be"), "id" : 2, "value" : 1 }
{ "_id" : ObjectId("5aec54da78062173ca30f1bf"), "id" : 3, "value" : 1 }


Dataframe:

+--------------------------+---+
|_id                       |id |
+--------------------------+---+
|[5aec54da78062173ca30f1be]|4  |
+--------------------------+---+

Update command:
val writeConfig = WriteConfig(Map("uri" -> s"mongodb://$host/$database.$table"))
MongoSpark.save(df1.write.option("replaceDocument", "false").mode("append"), writeConfig)


Expected:
{ "_id" : ObjectId("5aec54da78062173ca30f1bd"), "id" : 1, "value" : 1 }
{ "_id" : ObjectId("5aec54da78062173ca30f1be"), "id" : 4, "value" : 1 }
{ "_id" : ObjectId("5aec54da78062173ca30f1bf"), "id" : 3, "value" : 1 }


Actual result (the value field was removed):
{ "_id" : ObjectId("5aec54da78062173ca30f1bd"), "id" : 1, "value" : 1 }
{ "_id" : ObjectId("5aec54da78062173ca30f1be"), "id" : 4 }
{ "_id" : ObjectId("5aec54da78062173ca30f1bf"), "id" : 3, "value" : 1 }

Regards,
Thang

Ross Lawley

May 8, 2018, 11:49:02 AM
to mongodb-user
Hi Thang,

Can you set the replaceDocument value on the writeConfig? That should correct the issue.

Ross.

Thang

May 8, 2018, 1:18:26 PM
to mongodb-user
Hi Ross,

I actually just figured out the problem.

Even if I put replaceDocument into the writeConfig:

val writeConfig = WriteConfig(Map("uri" -> s"mongodb://$host/$database.$table", "replaceDocument" -> "false"))

If I use:

MongoSpark.save(data.write.mode("append"), writeConfig)

It will not work, because this call goes through Spark's DataFrameWriter and doesn't take the MongoDB writeConfig options into account. Instead I should use:

MongoSpark.save(data, writeConfig)

This will call the correct MongoSpark writer function.
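
Put together, a rough end-to-end sketch of the pattern that worked for me ($host, $database and $table are placeholders, and the field names follow my earlier example):

import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.{ReadConfig, WriteConfig}
import org.apache.spark.sql.functions.lit

val readConfig  = ReadConfig(Map("uri" -> s"mongodb://$host/$database.$table"))
val writeConfig = WriteConfig(Map(
  "uri" -> s"mongodb://$host/$database.$table",
  "replaceDocument" -> "false"))

// Load the existing documents so the DataFrame carries their _id values.
val data = MongoSpark.load(spark, readConfig)
  .filter("id = 2")
  .withColumn("id", lit(4))
  .select("_id", "id")        // only _id and id are written back

// Pass the DataFrame itself so the writeConfig options are honoured;
// the "value" field, absent from the DataFrame, stays intact in MongoDB.
MongoSpark.save(data, writeConfig)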

I think this point should be noted in the documentation because it is indeed confusing.

Regards,
Thang

Ross Lawley

May 9, 2018, 4:45:09 AM
to mongodb-user
Hi Thang,

Glad you got that sorted - I think that is a bug and can be fixed. I've added SPARK-180 to track it.

Ross

Yayati Sule

Feb 14, 2019, 7:18:42 AM
to mongodb-user
Hi Ross,
I know this thread may be long dead, but I had a query on the reply you posted.
If I want to insert a DataFrame which does not contain the "_id" field into a collection, and I want the insert to behave as an update, is it not possible to do so?

Anish Gupta

May 1, 2019, 12:13:52 PM
to mongodb-user
Hi Thang,
I have a similar requirement where my source collection is the same as the target: if a document is present I want to update it, otherwise I would like to insert a new document with a fresh ObjectId. I tried all the approaches above but an in-place update doesn't work for me. I was wondering how you created the "data" used in the above comment.

Thanks & Regards,
Anish

Thang

May 7, 2019, 4:49:17 AM
to mongodb-user
Hi Anish,

I'm not sure which "data" you're referring to. But as I understand it, MongoSpark writes based on the "_id" field only. In my case, I read the data from the target (which already included "_id"), do some analysis, then write back, hence it always updates the existing "_id".

Regards,
Thang