Query to discover documents that have more than one field with the same name


Chad Scharf

Aug 31, 2016, 5:54:34 PM
to mongodb-user
Per MongoDB's documentation on https://docs.mongodb.com/manual/core/document/ I see the following:
BSON documents may have more than one field with the same name. Most MongoDB interfaces, however, represent MongoDB with a structure (e.g. a hash table) that does not support duplicate field names. If you need to manipulate documents that have more than one field with the same name, see the driver documentation for your driver.

This, however, is causing issues in our C# application: the C# driver does not handle this scenario, and BsonDocument throws "element name already exists" exceptions because it expects each field name to be unique. How we got into this mess from the C# driver in the first place is a whole other topic involving extra-elements serialization, but we're here now, and we need to detect where this exists in our data so we can correct those records, determine the correct property value to keep, and clean all of this up. Doing it manually isn't an option because there are literally millions of documents in this collection.

I haven't been able to figure out a query that can detect this. Even with mapReduce it appears to be impossible: the interpreter simply overwrites the duplicate copy of the property when it builds the document for the map phase, so we only ever see a single property, and because JavaScript does not support duplicate property names on an object we can't just enumerate the keys. I'm hoping someone has run into this, or needed it, before: either a language we can write a quick routine in (Python, Ruby, Node.js, whatever) or a way in the C# driver or the native mongo shell to query for these documents and possibly even correct them.

db.getCollection('test').insert({ "key" : "value", "key" : "other value"});
db.getCollection('test').find({/*where field name "key" appears more than once*/});



Kirby Kohlmorgen

Sep 9, 2016, 12:05:18 PM
to mongodb-user
Hey Chad,

Like you said, this is a difficult situation to debug because most MongoDB drivers arbitrarily keep just one of the duplicate fields.

Fortunately, several drivers don't behave this way. I've written a short example in Go that prints every document that has duplicate fields.

package main

import (
	"fmt"
	"log"

	// mgo v2 and its bson package
	"gopkg.in/mgo.v2"
	"gopkg.in/mgo.v2/bson"
)

func main() {
	// connect to MongoDB
	session, err := mgo.Dial("127.0.0.1:27017")
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// initialize the test collection on the test database
	c := session.DB("test").C("test")

	// get all the documents in the test collection; bson.D is an ordered
	// list of elements, so repeated names are kept
	var results []bson.D
	err = c.Find(bson.M{}).All(&results)
	if err != nil {
		log.Fatal(err)
	}

	// iterate over all the docs
	for _, doc := range results {
		// for each doc make a map for the keys in the doc
		keys := make(map[string]bool)

		// iterate over each key-value pair in each doc
		for _, kv := range doc {
			// if a key has been seen before, print the doc once and
			// move on to the next doc
			if _, ok := keys[kv.Name]; ok {
				fmt.Println(doc)
				break
			}

			// mark that we've seen this key
			keys[kv.Name] = true
		}
	}
}
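
Since the goal is also to decide which value to keep, it may help to report every value stored under a repeated name rather than only printing the document. The following is a minimal sketch of that idea, not part of the original reply. Elem is a hypothetical stand-in for the driver's ordered element type (a Name plus a Value), used here so the snippet runs without the driver or a database; duplicateValues is likewise an illustrative name.

```go
package main

import "fmt"

// Elem is a hypothetical stand-in for an ordered document element
// (name plus value), so this sketch compiles without the driver.
type Elem struct {
	Name  string
	Value interface{}
}

// duplicateValues returns, for every field name that occurs more than
// once in doc, all of that name's values in document order.
func duplicateValues(doc []Elem) map[string][]interface{} {
	all := make(map[string][]interface{})
	for _, e := range doc {
		all[e.Name] = append(all[e.Name], e.Value)
	}
	for name, vals := range all {
		if len(vals) < 2 {
			delete(all, name) // keep only names that repeat
		}
	}
	return all
}

func main() {
	doc := []Elem{{"_id", 1}, {"key", "value"}, {"key", "other value"}}
	for name, vals := range duplicateValues(doc) {
		// prints: "key" appears 2 times: [value other value]
		fmt.Printf("%q appears %d times: %v\n", name, len(vals), vals)
	}
}
```

With the values side by side, a cleanup routine could apply whatever rule fits the data (first wins, last wins, or manual review for conflicts) before writing the document back.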

Here's a link to the documentation for v2 of the MongoDB Go driver (mgo), https://godoc.org/labix.org/v2/mgo, in case you'd like to better understand how this code works or make any necessary modifications.
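
If no driver at hand preserves duplicates, another option is to walk the raw BSON bytes directly, since the duplicate names are still present in the stored document. Below is a rough sketch of that approach using only the Go standard library and the element sizes from the BSON specification. The function names (valueSize, topLevelNames, repeated) are illustrative, it looks at top-level fields only, and where the raw bytes come from (for example a driver's raw document type or a mongodump .bson file) is left as an assumption.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
)

var errMalformed = errors.New("malformed BSON")

// valueSize returns the encoded length of a value of BSON type t that
// starts at b[0].
func valueSize(t byte, b []byte) (int, error) {
	i32 := func() (int, error) {
		if len(b) < 4 {
			return 0, errMalformed
		}
		return int(int32(binary.LittleEndian.Uint32(b))), nil
	}
	switch t {
	case 0x06, 0x0A, 0x7F, 0xFF: // undefined, null, max key, min key
		return 0, nil
	case 0x08: // boolean
		return 1, nil
	case 0x10: // int32
		return 4, nil
	case 0x01, 0x09, 0x11, 0x12: // double, datetime, timestamp, int64
		return 8, nil
	case 0x07: // ObjectId
		return 12, nil
	case 0x13: // decimal128
		return 16, nil
	case 0x02, 0x0D, 0x0E: // string, code, symbol: int32 length + bytes
		n, err := i32()
		return n + 4, err
	case 0x03, 0x04, 0x0F: // document, array, code with scope: length counts itself
		return i32()
	case 0x05: // binary: int32 length + subtype byte + bytes
		n, err := i32()
		return n + 5, err
	case 0x0C: // DBPointer: string + 12-byte ObjectId
		n, err := i32()
		return n + 16, err
	case 0x0B: // regex: two consecutive C strings
		p := bytes.IndexByte(b, 0)
		if p < 0 {
			return 0, errMalformed
		}
		q := bytes.IndexByte(b[p+1:], 0)
		if q < 0 {
			return 0, errMalformed
		}
		return p + q + 2, nil
	}
	return 0, errMalformed
}

// topLevelNames lists the top-level field names of one raw BSON
// document, in order and including repeats.
func topLevelNames(doc []byte) ([]string, error) {
	if len(doc) < 5 || int(binary.LittleEndian.Uint32(doc)) != len(doc) {
		return nil, errMalformed
	}
	var names []string
	b := doc[4:]
	for len(b) > 0 && b[0] != 0x00 {
		t := b[0]
		end := bytes.IndexByte(b[1:], 0) // terminator of the field name
		if end < 0 {
			return nil, errMalformed
		}
		names = append(names, string(b[1:1+end]))
		b = b[end+2:]
		n, err := valueSize(t, b)
		if err != nil || n < 0 || n > len(b) {
			return nil, errMalformed
		}
		b = b[n:]
	}
	return names, nil
}

// repeated returns each name that occurs more than once, listed once.
func repeated(names []string) []string {
	count := make(map[string]int)
	var out []string
	for _, n := range names {
		count[n]++
		if count[n] == 2 {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	// Hand-encoded BSON for { "key": "value", "key": "other value" }.
	doc := []byte("\x29\x00\x00\x00" +
		"\x02key\x00\x06\x00\x00\x00value\x00" +
		"\x02key\x00\x0c\x00\x00\x00other value\x00" +
		"\x00")
	names, err := topLevelNames(doc)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(names, repeated(names)) // prints: [key key] [key]
}
```

Nested documents and arrays are skipped as opaque values here; checking them too would mean calling topLevelNames recursively on the bytes of each 0x03 or 0x04 value.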

Best of luck with your data cleansing!

Kirby