Creating index on dynamically added fields in MongoDB


Astro

Jun 14, 2014, 1:54:01 AM
to mongod...@googlegroups.com

I have a schema for a MongoDB collection where fields are added to documents dynamically. Queries are tremendously slow when I search for documents by those dynamically added fields. A sample document looks like this:

{
    "_id" : ObjectId("4657439ad7f616df3f49"),
    "name" : "test1",
    "created_at" : ISODate("2014-06-13T10:48:41.501Z"),
    "Status" : "NEW",
    "email" : "te...@test.com",
    "rank" : 1278
}

The fields name, email, and rank are created dynamically by the user and added to the document. Querying on any of these fields takes very long to respond. No indexes exist on these fields, since the fields themselves are added dynamically. I have tried .explain(); here is the output:

{
    "cursor" : "BasicCursor",
    "isMultiKey" : false,
    "n" : 1,
    "nscannedObjects" : 1683063,
    "nscanned" : 1683063,
    "nscannedObjectsAllPlans" : 1683063,
    "nscannedAllPlans" : 1683063,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 4,
    "nChunkSkips" : 0,
    "millis" : 1708,
    "indexBounds" : {

    }
}

This clearly shows the need for an index. But I have no clue how to create an index on these non-deterministic fields. Note: the document has no embedded documents, as discussed here.


s.molinari

Jun 14, 2014, 5:30:53 AM
to mongod...@googlegroups.com
Just for curiosity's sake: would the field key also be created by the user? In other words, if I, as a user of your system, wanted a "foo" field with the value "bar", I could create it myself, and no other users would have it in their own records? Is that what you mean by dynamically added fields?

Scott

Astro

Jun 14, 2014, 5:39:57 AM
to mongod...@googlegroups.com


On Saturday, June 14, 2014 3:00:53 PM UTC+5:30, s.molinari wrote:
Just for curiosity's sake: would the field key also be created by the user? In other words, if I, as a user of your system, wanted a "foo" field with the value "bar", I could create it myself, and no other users would have it in their own records? Is that what you mean by dynamically added fields?

Scott


The collection itself keeps growing as users add fields to it. No fields are created before a user adds fields of his own choice; that is how the collection is formed. Hence, querying on those fields becomes time-consuming.

William Zola

Jun 14, 2014, 2:14:13 PM
to mongod...@googlegroups.com
Hi Astro!

If you don't know in advance what the fields are going to be, then your best bet is to create a generic index using the key/value design pattern.  Here are some links describing this schema design:


Once you do this, you can query using the $elemMatch operator, or using the more sophisticated technique described here:
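To make the pattern concrete, here is a minimal sketch of the k/v design (all collection, array, and field names here are illustrative, not from the thread; the plain-JS function only stands in for what $elemMatch computes):

```javascript
// Sketch of the key/value (k/v) design pattern: instead of top-level
// dynamic fields, each user-defined field becomes a {k, v} entry in an
// array, so a single compound index covers all of them.

// One document reshaped this way (names are illustrative):
const doc = {
  _id: 1,
  Status: "NEW",
  attrs: [
    { k: "name",  v: "test1" },
    { k: "email", v: "te...@test.com" },
    { k: "rank",  v: 1278 },
  ],
};

// In the mongo shell you would then index and query roughly like:
//   db.things.ensureIndex({ "attrs.k": 1, "attrs.v": 1 })
//   db.things.find({ attrs: { $elemMatch: { k: "rank", v: 1278 } } })

// Plain-JS stand-in for the $elemMatch test, to show the semantics:
function elemMatch(d, k, v) {
  return (d.attrs || []).some(function (a) { return a.k === k && a.v === v; });
}

console.log(elemMatch(doc, "rank", 1278)); // true
console.log(elemMatch(doc, "rank", 9999)); // false
```

The point is that every user-created field, whatever its name, lands in the same two indexed paths ("attrs.k" and "attrs.v"), so one index serves them all.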

-William 

William Zola

Jun 14, 2014, 7:36:23 PM
to mongod...@googlegroups.com
One more link describing this schema design:

s.molinari

Jun 15, 2014, 12:19:39 AM
to mongod...@googlegroups.com
William!!!  Thank you! 

I was wondering the whole time how to control (standardize) indexing on fields dynamically created through the "system". I even came up with a convoluted way to store the data, but was talked out of it by some smart people here.;) Yet, the question still remained in my mind, "How can we build indexes smartly and controllably?" This seems to be the answer. 

In your opinion, could the resulting large index become a really costly factor? Is it as flexible as normal indexes would be? Would it make sense to put fields of similar data types (i.e. integer, text, date) into their own subdocuments and build separate indexes for each? Are there possibly any other methods to standardize indexing for dynamically created fields?

Scott

William Zola

Jun 15, 2014, 9:12:30 AM
to mongod...@googlegroups.com
Hi Scott!

I'm very fond of the key/value (k/v) design pattern, where you represent arbitrary keys and values using an array of documents {k:'keyvalue', v:'valuevalue'}.  I'm glad that you like it too.  I'm having a hard time with some of your questions, because I'm finding them a bit ill-defined.  

First, though -- an important point. You can *combine* the k/v index with other indexes.  For example, I worked with a customer building a tenanted system: there were some fields in the documents that would always be present since they were used by the overall system (think: "CustomerID" and "ApplicationID": also "DateCreated" and "DateUpdated"), and then there were a set of fields that were created by the individual customer and which would vary wildly [0].

My overall advice was to store the customer fields in a 'custfields' array, using the k/v design, and to build a *compound* index, with the mandatory fields first and the 'custfields' array last.  IIRC, they knew that every query would always use CustomerID, ApplicationID, and one of the dates, so the overall index was {CustomerID :1, ApplicationID:1, DateCreated:1, custfields:1}.  One good thing about this design was that {CustomerID:1, ApplicationID:1, DateCreated:1} made a *very* nice shard key.   

They wanted to do searching and sorting based on the values in the 'custfields' array.  Searching is easy: you just have to use $elemMatch -- "{ custfields: {$elemMatch: {k:'color', v:'blue'} } }" (or whatever).  Sorting is more challenging: you have to use the aggregation framework, do something like this [1] to move the sorted fields to the top level, and then apply $sort to the generated fields.  

This was using 2.2, so they had to make sure that the result sets would fit in a single document and that no single $sort stage used too much memory.  If you use 2.6, where aggregation returns a cursor and the {allowDiskUse:1} parameter is available, neither of these is an issue any more.  
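As a rough illustration of that sort-via-reshaping step, here is a sketch (the pipeline stages are shown as comments with made-up names; the plain-JS code below them only stands in for what the pipeline computes):

```javascript
// Sorting on a k/v value: the aggregation framework can unwind the array,
// keep the entries for the key you care about, and sort on their values.
// Pipeline sketch (illustrative names):
//   db.things.aggregate([
//     { $unwind: "$custfields" },
//     { $match:  { "custfields.k": "rank" } },
//     { $sort:   { "custfields.v": 1 } },
//   ])

// The same reshaping done in plain JS, to show the result shape:
const docs = [
  { _id: "a", custfields: [{ k: "rank", v: 1278 }] },
  { _id: "b", custfields: [{ k: "rank", v: 42 }] },
];

const sorted = docs
  .map(function (d) {
    // Pull the "rank" entry up to the top level, as $unwind/$match would.
    const f = d.custfields.filter(function (a) { return a.k === "rank"; })[0];
    return { _id: d._id, rank: f.v };
  })
  .sort(function (x, y) { return x.rank - y.rank; });

console.log(sorted.map(function (d) { return d._id; })); // [ 'b', 'a' ]
```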

Going through your questions:

In your opinion, could the resulting large index become a really costly factor? 

It depends [2].  

I'm having a hard time understanding this question.  The obvious answer is "yes, if you design the schema badly, this can be a problem".  :-)  If you use the k/v section as part of a larger index (see above), then you'll get decent index locality.  Note that you *will* need enough RAM to keep the entire index in memory, since entries can be scattered anywhere, but that's no different from any other MongoDB index that isn't right-balanced or left-balanced.  

The other question is "compared to what?".  It's generally a good idea to have as few indexes as possible.  If you can replace 8 or 10 indexes with a single k/v index, it's a net win on everything: on disk usage, on RAM usage, on disk I/O, on the time required to build and select a query plan. 

On the other hand, if you have a schema where all of your queries can be satisfied with only 2-3 compound indexes, it will be more efficient to use those "conventional" compound indexes than a k/v index.  

Where's the crossover point?  It depends on the details of your application: you'd have to test and measure.  

Is it as flexible as normal indexes would be? 

I'm not sure what you mean by "flexible".  I can't think of anything that you could do with a normal index that you couldn't do with a k/v index; you just have to be more clever about it.  What operations were you thinking about when you said "flexible"?   

Would it make sense to put fields of similar data types (i.e. integer, text, date) into their own subdocuments and build separate indexes for each? 

Nope.  MongoDB indexes handle multiple types in a single index just fine, and MongoDB will not allow you to insert a document which has two indexed array fields [3].

Are there possibly any other methods to standardize indexing for dynamically created fields?

I don't know of anything better.  I think this is one of the coolest MongoDB design patterns ever.

 -William 


[0] Sound familiar? :-)
[2] Inside joke, folks.  

s.molinari

Jun 16, 2014, 3:11:26 AM
to mongod...@googlegroups.com
Thanks William. Great stuff!!!!

Foot Note [0] - Ooohhh yes.;)
Foot Note [2] - LOL!:D I get it!

And you've actually answered all of my questions. 

But to clarify what you weren't sure of. By "just as flexible", I meant being able to query on the K/V index just like you could with any other index. You said, yes, if you are clever. Cool! You also mentioned sorting being the weak point. Couldn't sorting be done on the application side too? Obviously, I am thinking "user" viewable data only. As for reporting, sounds like the AggF would be an acceptable solution.

The k/v index and storage pattern is actually the solution we are looking for too! Thanks so much for this jewel of information. I am really happy knowing this pattern exists and can be built upon. (And the mix of the same sort of fields we are going to be using makes for a good shard key too. ;) That was actually also a worry in the back of my mind: how do you make a shard key on things you have no idea exist?)

A bigger competitor to what we are aiming for seems to have a similar indexing system. They always offer "my own records" (for data which has an owner field) and "dates" as the first available filters for a "self-made" query for reporting or data listing purposes. After those selections, the user is able to create multiple filters on any other field available (except for larger text fields). This k/v index sounds like our solution to making the same kind of index possible.

It just sounds so "bulky" too, which is why I asked about the size of the index getting costly. I guess the ability to create a query on practically any field at any time in any circumstance, and have it perform well, is a fantastic feature we need, which should also have its price.

Hmmmm...... I've asked a couple of times about RAM-to-disk storage ratios and never got a straight answer. It's always been "it depends". I understand that now, and this form of index seems to me to be a RAM-to-disk ratio killer. In other words, a lot of data that might not be very useful/hot will still be in RAM for this type of index. This makes the price of the database storage costlier. I guess we'll need to do some real-world testing to see what this solution really means in terms of the data storage mix. 

One thing we are also looking at is to not only have our high level (and heavier) "data objects" available, but also more low level data objects (collections) for background data storage (needed for possible programming reasons). This is where I envision possible issues with indexing choices. We'll see how that goes. Oh, and we are also going to offer capped collections as a data store too. Why not, if it is sitting there waiting to be used?

Foot Note [3] - well, that just means we are forced to keep it simple. Simple is always good, as long as simple allows enough flexibility too. Sounds like we have just that, fantastic!

One last question on footnote 3. Would it then not be possible to offer an array of values as a supplementary field alongside the standard fields, like a tags array, because of the single-indexed-array limit? Or could it be possible to have a field in the k/v pairs where the value is an array too, and searchable with the index? Oh, that would be friggin awesome if that could work. If it does work, how deep can the arrays go? For instance, can the values also be k/v pairs with arrays as values?

Scott

Astro

Jun 17, 2014, 9:24:43 AM
to mongod...@googlegroups.com

Hi William,

Thank you..!
All of the solutions were perfect.
I found http://www.kamsky.org/stupid-tricks-with-mongodb/using-aggregation-framework-to-reshape-schema very useful. That said, is it always necessary to put the dynamic fields into an array of key:value pairs? I have my schema already designed, and changing it is almost a "no" for me. Is there any way to incorporate this multi-value array of key:value pairs into my existing schema? Or is modifying the schema the only way? My schema normally looks like the one posted in the question.


Thanks
Atish
 

William Zola

Jun 17, 2014, 9:50:14 AM
to mongod...@googlegroups.com
Hi Ati!

Per your original comments, your requirements are: 
  • Support an arbitrary number of fields that you don't know ahead of time
  • Support arbitrary queries using those fields
  • Have indexes on those fields so that the queries (and updates!) will be fast
Given those requirements, the only solution I know of that will work well is to change your schema to use key/value documents.
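For concreteness, a sketch of how the document from the original post could be reshaped (the helper name, the "attrs" array name, and the choice of fixed fields are all illustrative assumptions, not part of William's answer):

```javascript
// Reshaping a document to the k/v pattern: known, always-present fields
// stay at the top level; user-created fields move into a k/v array that
// a single compound index can cover.
function toKV(doc, fixed) {
  const out = { attrs: [] };
  Object.keys(doc).forEach(function (key) {
    if (fixed.indexOf(key) !== -1) {
      out[key] = doc[key];                       // fixed, known field
    } else {
      out.attrs.push({ k: key, v: doc[key] });   // user-defined field
    }
  });
  return out;
}

// The document shape from the original post (values abbreviated):
const original = {
  _id: "4657439ad7f616df3f49",
  created_at: "2014-06-13T10:48:41.501Z",
  Status: "NEW",
  name: "test1",
  email: "te...@test.com",
  rank: 1278,
};

const reshaped = toKV(original, ["_id", "created_at", "Status"]);
console.log(reshaped.attrs.length); // 3

// Then, in the mongo shell, one index covers every dynamic field:
//   db.coll.ensureIndex({ "attrs.k": 1, "attrs.v": 1 })
```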

 -William 