Get all Orphaned Documents in a collection from MongoDB Shard server

Mukesh Kumar

May 19, 2017, 11:43:29 AM
to mongodb-user
Hello,

I have to clean up orphaned documents on MongoDB shard servers, but before deleting them with cleanupOrphaned, I want to save all of those documents as a backup in a collection or a file.

Please suggest how to do this.

Mukesh Kumar

May 23, 2017, 6:22:59 AM
to mongodb-user
Any update on this? I don't want to clean up orphaned documents without knowing which documents they are and how many there are.

Conchi Bueno

May 28, 2017, 11:55:47 PM
to mongodb-user
Hi,

It may be useful to check the meaning of "orphaned document". Could you elaborate on the reason for backing up these documents? Are you trying to do a "dry run" of the cleanupOrphaned command?

Two main points to consider:

- Orphaned documents are typically the result of an interrupted migration process (e.g. power loss, hardware issues, etc.).
- Since an orphaned document does not belong to the shard, updates (including deletions) to that document are not routed to that shard; they go to the shard that actually owns the document. The orphaned document is thus a (possibly) outdated duplicate of the correct document, or a copy of a document that no longer exists.

If you're trying to do a "dry run", there is a feature request in SERVER-17013 which I believe addresses your requirement. Please watch/upvote the ticket to receive updates.

As of MongoDB 3.4, there is no built-in feature that can show you orphaned documents; you can only delete them (using the cleanupOrphaned command). However, working from the definition of orphaned documents, it is relatively straightforward to write a script that finds orphaned documents on a particular shard.

The script would have to check whether the relevant collection on the shard contains documents that don't belong to the shard key ranges of the chunks owned by that shard. You can discover which key ranges are owned by a particular shard from the output of sh.status() or from the content of the chunks collection in the config database. Keep in mind that this manual process could yield the wrong result if the balancer is active while it is running, as chunks can be moved around by the balancer.
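
The core of that check can be sketched in plain JavaScript. This is only a rough illustration under stated assumptions, not a tested tool: it assumes a range-based (not hashed) shard key on a single field whose values are all of one comparable type, and the names findOrphans, chunkContains, compareKeys, MIN_KEY, MAX_KEY and the userId field are all made up for the example.

```javascript
// Hypothetical stand-ins for BSON MinKey / MaxKey (compared by identity).
var MIN_KEY = { sentinel: "MinKey" };
var MAX_KEY = { sentinel: "MaxKey" };

// Compare two shard key values. Handles only the sentinels plus values of one
// comparable type (all numbers or all strings), not the full BSON ordering.
function compareKeys(a, b) {
  if (a === b) return 0;
  if (a === MIN_KEY || b === MAX_KEY) return -1;
  if (a === MAX_KEY || b === MIN_KEY) return 1;
  return a < b ? -1 : a > b ? 1 : 0;
}

// A chunk covers [min, max): min is inclusive, max is exclusive.
function chunkContains(chunk, key) {
  return compareKeys(chunk.min, key) <= 0 && compareKeys(key, chunk.max) < 0;
}

// Return the documents found on `shardName` whose shard key value lies
// outside every chunk that the metadata assigns to that shard.
function findOrphans(docs, chunks, shardName, keyField) {
  var owned = chunks.filter(function (c) { return c.shard === shardName; });
  return docs.filter(function (d) {
    return !owned.some(function (c) { return chunkContains(c, d[keyField]); });
  });
}

// Made-up data: rep1A owns [MinKey, 100), rep2A owns [100, MaxKey).
var chunks = [
  { min: MIN_KEY, max: 100, shard: "rep1A" },
  { min: 100, max: MAX_KEY, shard: "rep2A" }
];
var docsOnRep1A = [{ userId: 5 }, { userId: 99 }, { userId: 100 }, { userId: 250 }];
var orphans = findOrphans(docsOnRep1A, chunks, "rep1A", "userId");
console.log(JSON.stringify(orphans));
```

In a real cluster the chunk list would come from the chunks collection in the config database, and the documents would be read by connecting directly to the shard rather than through mongos, ideally with the balancer stopped for the duration. The orphans found this way could then be written to a backup collection or file before running cleanupOrphaned.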

If you have further questions, could you please provide your MongoDB version?

Regards,
Conchi

Mukesh Kumar

May 31, 2017, 9:47:44 AM
to mongodb-user
Got it, thanks.

Mukesh Kumar

Jun 5, 2017, 10:58:32 AM
to mongodb-user
Hi,

With range-based sharding this works fine, because I can see the ranges of the shard key. With hashed sharding, however (for example, I hashed on the key _id), the ranges I see are on the hashed value, which I can't find in a document. Please help with this.

mongos> db.version()
3.0.4
mongos> version()
3.0.4


 sharding version: {
        "_id" : 1,
        "minCompatibleVersion" : 5,
        "currentVersion" : 6,
        "clusterId" : ObjectId("Alphanumeric Numbers")
}


db.collection_name
                        shard key: { "_id" : "hashed" }
                        chunks:
                                rep1A   2
                                rep2A   2
                                rep3A   2
                                rep9A   2
                        { "_id" : { "$minKey" : 1 } } -->> { "_id" : NumberLong("-8198552921648689600") } on : rep1A Timestamp(9, 2)
                        { "_id" : NumberLong("-8198552921648689600") } -->> { "_id" : NumberLong("-7173733806442603400") } on : rep1A Timestamp(9, 3)
                        { "_id" : NumberLong("-7173733806442603400") } -->> { "_id" : NumberLong("-6148914691236517200") } on : rep2A Timestamp(9, 4)
                        { "_id" : NumberLong("-6148914691236517200") } -->> { "_id" : NumberLong("-5124095576030431000") } on : rep2A Timestamp(9, 5)
                        { "_id" : NumberLong("-5124095576030431000") } -->> { "_id" : NumberLong("-4099276460824344800") } on : rep3A Timestamp(9, 6)
                        { "_id" : NumberLong("-4099276460824344800") } -->> { "_id" : NumberLong("-3074457345618258600") } on : rep3A Timestamp(9, 7)
                        { "_id" : NumberLong("7173733806442603400") } -->> { "_id" : NumberLong("8198552921648689600") } on : rep9A Timestamp(9, 18)
                        { "_id" : NumberLong("8198552921648689600") } -->> { "_id" : { "$maxKey" : 1 } } on : rep9A Timestamp(9, 19)

Pravin Dwiwedi

Jun 5, 2017, 12:04:04 PM
to mongodb-user
I came across a script that may work for you:

MongoDB Sharded Cluster Orphaned (Duplicate) Document Finder/Remover

Kevin Adistambha

Jun 6, 2017, 1:05:08 AM
to mongodb-user

Hi Mukesh

With range-based sharding this works fine, because I can see the ranges of the shard key. With hashed sharding, however (for example, I hashed on the key _id), the ranges I see are on the hashed value, which I can’t find in a document. Please help with this.

By design, hashed indexes only support equality matches and cannot perform range-based queries (https://docs.mongodb.com/manual/indexes/#hashed-indexes). Since MongoDB doesn’t provide a user-facing command to compute the hash, and querying for orphaned documents involves range queries, it is currently not possible (as of MongoDB 3.4.4) to find orphaned documents for collections that use a hashed shard key.

In this case, since orphaned documents are likely to be an outdated version of a document (or possibly even a non-existent document), the command cleanupOrphaned is the recommended method to remove orphaned documents.
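
For reference, cleanupOrphaned may return before it has covered every range, in which case its reply includes a stoppedAtKey that can be passed back as startingFromKey. A rough sketch of that loop follows, based on the behaviour documented for the 3.x releases discussed here. The function name cleanupAllOrphans, the runAdminCommand callback (a stand-in for db.adminCommand when connected to the primary of a shard, not to mongos) and the namespace mydb.collection_name are assumptions made for illustration; the canned replies only simulate a server so that the loop can be run anywhere.

```javascript
// Repeatedly issue cleanupOrphaned until the reply no longer contains
// stoppedAtKey. `runAdminCommand` is a hypothetical callback that wraps
// db.adminCommand(...) in the mongo shell.
function cleanupAllOrphans(runAdminCommand, namespace) {
  var nextKey = {};
  var passes = 0;
  while (nextKey != null) {
    var result = runAdminCommand({ cleanupOrphaned: namespace, startingFromKey: nextKey });
    if (result.ok !== 1) {
      throw new Error("cleanupOrphaned failed or timed out: " + JSON.stringify(result));
    }
    passes += 1;
    nextKey = result.stoppedAtKey; // absent once the last range has been processed
  }
  return passes;
}

// Simulated replies (made-up values) in place of a real server.
var replies = [
  { ok: 1, stoppedAtKey: { _id: 10 } },
  { ok: 1, stoppedAtKey: { _id: 20 } },
  { ok: 1 }
];
var sent = [];
var passes = cleanupAllOrphans(function (cmd) {
  sent.push(cmd);
  return replies[sent.length - 1];
}, "mydb.collection_name");
console.log("cleanupOrphaned was issued " + passes + " times");
```

In the mongo shell the callback would be something like function (cmd) { return db.adminCommand(cmd); }. Keep in mind that this deletes the orphaned documents for good, so it is not a substitute for the "dry run" discussed in this thread.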

Having said that, if a “dry-run” of cleanupOrphaned is what you require, please upvote SERVER-17013, and please comment on the ticket with a detailed description of your use case.

Best regards,
Kevin

Mukesh Kumar

Jun 6, 2017, 1:17:22 PM
to mongodb-user
Thanks for the information, Kevin.