Hi Bernie,
thanks for the fast reply. I looked through the code (mine and PyMongo's) and realized it's harder than I thought.
I identified the following cases:
1) Assuming an unordered bulk insert like this:
* bulk.insert({'_id': 1, ...})
* bulk.insert({'_id': 2, ...})
.......
* bulk.insert({'_id': 20000, ...})
* bulk.execute()
Looking inside pymongo/message.py I can see that, if the entire bulk is too big, the driver splits it into smaller batches. Let's assume my 20k inserts are divided into two batches.
The first batch ran successfully on the primary, but while the second batch was being prepared, that node stepped down to secondary.
In that case I get an AutoReconnect exception, with no indication that 10k of the documents were already inserted.
I could retry, or do a find to check whether each document is already present. But then I'm unable to tell whether another part/thread/process of my application already inserted that document, which makes it impossible to notify the caller about it.
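For what it's worth, here is a minimal sketch of the retry pattern I currently have in mind for case 1, assuming every document carries a unique _id so a re-inserted document surfaces as a duplicate key error (the helper name insert_with_retry and the retry count are just placeholders of mine):

    from pymongo import MongoClient
    from pymongo.errors import AutoReconnect, BulkWriteError

    client = MongoClient()
    coll = client.test.docs

    def insert_with_retry(docs, max_retries=5):
        for attempt in range(max_retries):
            # a bulk can only be executed once, so rebuild it per attempt
            bulk = coll.initialize_unordered_bulk_op()
            for doc in docs:
                bulk.insert(doc)
            try:
                return bulk.execute()
            except AutoReconnect:
                continue  # primary changed mid-bulk; retry everything
            except BulkWriteError as bwe:
                # documents that made it in on an earlier attempt come
                # back as duplicate key errors (code 11000); drop those
                real = [e for e in bwe.details['writeErrors']
                        if e['code'] != 11000]
                if real:
                    raise
                return bwe.details
        raise AutoReconnect('no primary after %d retries' % max_retries)

This still has exactly the weakness I described above: it cannot tell my own earlier insert apart from one made by another thread or process.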
2) Using $inc as the update modifier:
* bulk.find({'id': 1}).update_one({'$inc': {'x': 1}, ...})
* bulk.find({'id': 2}).update_one({'$inc': {'x': 1}, ...})
.......
* bulk.find({'id': 20000}).update_one({'$inc': {'x': 1}, ...})
Let's assume again that the server stepped down after the first 10k updates. There is no easy way to know which documents have already been incremented and which have not.
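The only workaround I've come up with so far (no idea if it's the intended pattern) is to tag each logical increment with an operation id and guard the update with it, so that retrying the bulk cannot apply the same increment twice. A sketch, where the field name applied_ops and the helper are my own invention:

    from bson import ObjectId

    def queue_idempotent_inc(bulk, doc_id, op_id):
        # Only matches if this op_id has not been applied yet, so a
        # retry after AutoReconnect cannot double-increment.
        # applied_ops grows unboundedly here; real code would prune it.
        bulk.find({'id': doc_id, 'applied_ops': {'$ne': op_id}}) \
            .update_one({'$inc': {'x': 1},
                         '$push': {'applied_ops': op_id}})

    # one ObjectId per logical increment, stored by the caller so a
    # retry reuses the same id
    op_id = ObjectId()

That feels heavyweight for 20k updates, though, which is why I'm asking.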
I was thinking of avoiding this 'split of the entire bulk into batches' in my application, for example by sending only 1000 operations per bulk so it never reaches the maximum batch size. However, I'm not sure that solves the problem entirely: even during the execution of such a small bulk the server can still step down, right? So I would still get an AutoReconnect after some of the operations have already been applied. The chunking I had in mind looks roughly like the sketch below.
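Here it is, with 1000 as an arbitrary chunk size of mine, well under the batch limits:

    def execute_in_chunks(coll, docs, chunk_size=1000):
        # Keep each bulk small enough that pymongo never splits one
        # execute() into several batches behind my back. An
        # AutoReconnect then at least pinpoints the failed chunk,
        # but that chunk itself may still be partially applied.
        results = []
        for start in range(0, len(docs), chunk_size):
            bulk = coll.initialize_unordered_bulk_op()
            for doc in docs[start:start + chunk_size]:
                bulk.insert(doc)
            results.append(bulk.execute())
        return results

So it shrinks the window, but doesn't close it.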
Anyway, do you know of existing application code (any project, Java/C++/...) that deals with bulks and AutoReconnect?
A link to GitHub is enough; I just want to see how other people tackle this problem.
Regards,
Stephan