Duplicate documents in sharded environment

Duplicate documents in sharded environment Patrick Scott 9/26/12 5:24 AM
I have a 2 shard setup and I recently discovered duplicate documents between shards. I have turned off the balancer so it is not an issue with an in-progress balancer operation. Is there a tool that I can use to clean up those duplicates? If not, is there a command that will determine which shard is the owner of the document?

Thanks,
Patrick
Re: Duplicate documents in sharded environment Gianfranco 9/26/12 9:01 AM
Hi,

I'm assuming that you have a unique index (unique:true) and that the duplicates exist because a migration from one shard to another failed.
This left 2 shards holding the same data, and the config metadata didn't get updated.

Unfortunately, there isn't a single command that will fix this problem.

If this is the case, you'll need a script that finds and removes orphaned documents.
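
For illustration only, here is a minimal sketch of the ownership check such a script has to make. It is plain JavaScript run against sample data, not against a live cluster, and it is not the actual cleanup script. It assumes a single-field shard key with plain numeric values; the field name uid, the shard names, and the helper names owningShard and findOrphans are all hypothetical. The chunk entries are shaped like config.chunks documents (min, max, shard), with -Infinity and Infinity standing in for MinKey and MaxKey.

```javascript
// Sample chunk ranges shaped like config.chunks entries (min, max, shard).
// -Infinity / Infinity are stand-ins for MinKey / MaxKey in this sketch.
var chunks = [
  { min: { uid: -Infinity }, max: { uid: 100 }, shard: "shard0000" },
  { min: { uid: 100 }, max: { uid: Infinity }, shard: "shard0001" }
];

// Return the name of the shard whose chunk range contains the value,
// or null if no chunk matches. Ranges are min-inclusive, max-exclusive.
function owningShard(chunks, keyField, value) {
  for (var i = 0; i < chunks.length; i++) {
    var min = chunks[i].min[keyField];
    var max = chunks[i].max[keyField];
    if (value >= min && value < max) {
      return chunks[i].shard;
    }
  }
  return null;
}

// A document found on shardName is an orphan if the chunk metadata
// says a different shard owns its shard-key value.
function findOrphans(chunks, keyField, shardName, docs) {
  return docs.filter(function (doc) {
    return owningShard(chunks, keyField, doc[keyField]) !== shardName;
  });
}

// Documents physically present on shard0001 in this made-up example.
var docsOnShard0001 = [{ uid: 42 }, { uid: 150 }];
var orphans = findOrphans(chunks, "uid", "shard0001", docsOnShard0001);
// orphans holds only { uid: 42 }: that key falls in shard0000's range.
```

A real check would have to use BSON comparison order and handle compound shard keys, which this sketch does not attempt.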
Re: [mongodb-user] Re: Duplicate documents in sharded environment Patrick Scott 9/26/12 10:32 AM
So how can I find out which shard "owns" the document?

Re: [mongodb-user] Re: Duplicate documents in sharded environment Gianfranco 10/2/12 2:12 AM
Hi Patrick,

Sorry for the delay.

Could you load the orphanage.js script, giving the path to the file?

Note: The script must be run from a 2.x shell, and you must connect to the primary.

If it is in the current working directory (where you started the mongo shell), the command is:
load("orphanage.js")

Afterwards, you'll see a list of commands you can now run:

Balancer.stop() -- Do this first, if it's not stopped already
Orphans.find('db.collection') -- Find orphans in a given namespace
Orphans.findAll() -- Find orphans in all namespaces
Orphans.remove('db.collection') -- Remove all orphans in a namespace
Balancer.start()

Please follow the directions, and make sure the list of documents to delete is correct before running remove.

Re: [mongodb-user] Re: Duplicate documents in sharded environment Patrick Scott 10/2/12 6:35 AM
How is db.collection.count() computed? I noticed that it was decreasing as orphaned documents were deleted. It scared me enough that I stopped the script, but then I checked the document count on each shard individually, and together they equaled the result of a call to db.collection.count() from mongos.

My guess is that count() reflects the total number of documents in the collection on each shard, which may include orphaned documents.
Re: [mongodb-user] Re: Duplicate documents in sharded environment Gianfranco 10/2/12 8:30 AM
The db.collection.count() call from mongos is a global operation, so it has to communicate with the shards containing that collection.

What version of mongo are you running? Is it the same everywhere?
Re: [mongodb-user] Re: Duplicate documents in sharded environment Patrick Scott 10/2/12 8:40 AM
My shards and mongos instances are all running 2.0.6.
Re: [mongodb-user] Re: Duplicate documents in sharded environment Gianfranco 10/2/12 9:10 AM
If you are doing updates with upserts, there is a fix in 2.1.0 that prevents this from happening again.

The latest release in the 2.1.x branch is 2.1.2.

If you want to look into upgrading to the latest version (2.2.0), please read the release notes on how to proceed:
Re: [mongodb-user] Re: Duplicate documents in sharded environment Patrick Scott 10/2/12 9:18 AM
I'm doing updates but not with upserts. I just want to make sure I'm deleting only true orphaned documents. I have about 100,000 out of ~83 million, which isn't a lot. If collection.count() includes orphaned documents, then it makes perfect sense for the global count to decrease as I delete orphans. I just want to verify that behavior.
Re: [mongodb-user] Re: Duplicate documents in sharded environment Gianfranco 10/3/12 3:07 AM
Sorry, I'm not sure which count() function you're referring to.
The normal one in the shell, or a similar one in the script? Which line?

If you want to be able to go back in case a non-duplicate is deleted, you should, as in any similar situation, back up the data files or use mongoexport, especially if it's a production system.
Re: [mongodb-user] Re: Duplicate documents in sharded environment Patrick Scott 10/3/12 5:25 AM
I'm referring to the shell command db.<collection>.count(). Does it include orphaned documents?
Re: [mongodb-user] Re: Duplicate documents in sharded environment Gianfranco 10/3/12 5:35 AM
Yes, it does. It counts all the documents across the shards for that collection (when connected to mongos).
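
As a rough illustration of that arithmetic, here is a small plain-JavaScript simulation. It does not talk to a cluster, and the shard names, documents, and the globalCount helper are made up for the example. It assumes the global count is simply the sum of the per-shard counts, so an orphaned copy is counted once on each shard that holds it.

```javascript
// Made-up per-shard contents. The document with _id 3 exists on both
// shards; in this example the copy on shard0001 is the orphan.
var shards = {
  shard0000: [{ _id: 1 }, { _id: 2 }, { _id: 3 }],
  shard0001: [{ _id: 3 }, { _id: 4 }]
};

// Simulated global count: add up what each shard reports,
// with no attempt to filter out orphaned copies.
function globalCount(shards) {
  var total = 0;
  for (var name in shards) {
    total += shards[name].length;
  }
  return total;
}

var before = globalCount(shards); // 5, although only 4 distinct documents exist

// Remove the orphaned copy from shard0001.
shards.shard0001 = shards.shard0001.filter(function (doc) {
  return doc._id !== 3;
});

var after = globalCount(shards); // 4, the count drops by one per orphan removed
```

So a global count that shrinks while orphans are being removed, and that still equals the sum of the per-shard counts, is the expected behavior.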
Re: [mongodb-user] Re: Duplicate documents in sharded environment Patrick Scott 10/3/12 5:56 AM
Ok. Then that explains why the count was decreasing. Thanks!