Data denormalization in schema design using MongoDB: need help with this


ender0...@gmail.com

Mar 3, 2015, 1:32:43 PM
to mongod...@googlegroups.com
Hi guys,

So I am designing a schema for an e-commerce app and I am evaluating MongoDB for it. I have no prior experience with NoSQL databases and decent experience with relational databases (though not at a large scale).

Say there are 200 products in my catalog and there are 10,000 vendors selling these products. Each vendor can price any product at their own price, and can choose whether or not to sell a given product at all. Further, these products are grouped into certain categories (which I understand is a non-issue with a document database anyway). Say there are 10 such categories and each category has 20 of the above products. (10 categories times 20 products in each category gives us our total of 200 products.)

I want to represent the above schema with this use case in mind: the end user browses the products belonging to a certain category for a particular vendor. This simplifies implementation, because I am quite clear that on the front end the user will first select the vendor, then the category, and then see the list of products belonging to that category along with the product details.

After much thought, and an attempt not to think in an RDBMS way, I figured: why not replicate the exact 'document' that the user will see on the front end? But this would involve a lot of denormalization and a potentially huge database. Let me explain:

Let's say we call this collection "vendor_category_product_list". A document will look like:
{
   "_id" : 1,
   "vendor_id" : 9600,
   "vendor_name" : "Abc Electronics",
   "category_id" : 34,
   "category_name" : "Tablets",
   "products" : [
      {
            "name" : "iPad 7",
            "description" : "Apple iPad with iOS 8",
            "price" : 500,
            "discount" : 10,
            "size" : "7 inches",
            "display" : "Retina Display"
      },
      {
            "name" : "Galaxy Ace Tablet",
            "description" : "Samsung tablet",
            "price" : 400,
            "size" : "7x5 inches",
            "resolution" : "20000 pixels"
      }
      ... 18 more products ...
   ]
}

Another vendor might have the exact same products in the exact same category, but the schema will still store each vendor/category/product-list combination in a separate document.
So, with the numbers I used above, there would be 10,000 vendors times 10 categories = 100,000 documents in this collection, and each document will contain denormalized data for 20 products.

Is this a meaningful design? 100,000 documents seems like a huge number to me, and I have little idea how much space this will actually occupy or what its effects will be. One of the reasons I am going to this extreme of denormalization is to avoid application-level joins. Is this overkill?
I feel using application-level joins might defeat the purpose of a document database itself, and I would rather just go the RDBMS way.


Dwight Merriman

Mar 3, 2015, 2:17:13 PM
to mongod...@googlegroups.com
It's very use-case-specific as to what is best, but typically for a system like this, the products aren't embedded in the vendor objects.

Typically I would have a vendors collection and a products collection, where each product document has a vendor_id field. Typically there is also a category field in each product document.

With that sort of setup you could easily query all products in a category, or all products for a vendor, or both. You could not easily query something that spans the vendor and product collections. Also, your scope of atomicity is the document level, so as I have laid it out here that is a fairly small scope.
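To illustrate, the linking layout above can be sketched with in-memory stand-ins for the two collections (the data and field values here are hypothetical; against a real database each filter would be a db.products.find({...}) query rather than a list comprehension):

```python
# Normalized "linking" layout: vendors and products as separate collections,
# modeled here as plain Python lists of dicts (hypothetical sample data).
vendors = [
    {"_id": 9600, "name": "Abc Electronics"},
    {"_id": 9601, "name": "Xyz Gadgets"},
]

products = [
    {"_id": 1, "vendor_id": 9600, "category": "Tablets", "name": "iPad 7", "price": 500},
    {"_id": 2, "vendor_id": 9600, "category": "Tablets", "name": "Galaxy Ace Tablet", "price": 400},
    {"_id": 3, "vendor_id": 9601, "category": "Tablets", "name": "iPad 7", "price": 520},
    {"_id": 4, "vendor_id": 9600, "category": "Phones", "name": "Galaxy S5", "price": 600},
]

# All products in a category, roughly db.products.find({"category": "Tablets"}):
tablets = [p for p in products if p["category"] == "Tablets"]

# All products for a vendor, roughly db.products.find({"vendor_id": 9600}):
vendor_products = [p for p in products if p["vendor_id"] == 9600]

# Both at once, roughly db.products.find({"vendor_id": 9600, "category": "Tablets"}):
vendor_tablets = [p for p in products
                  if p["vendor_id"] == 9600 and p["category"] == "Tablets"]

print(len(tablets), len(vendor_products), len(vendor_tablets))  # 3 3 2
```

With indexes on vendor_id and category, all three query shapes stay cheap even at millions of product documents.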

So that suggests this is a "linking" case, not an embedding case, as normally done. So what are some examples where one would embed, to contrast? Here are a couple:
- A shopping cart document would contain an array of cart line items.
- Similarly, an order document for a customer would contain an array of line items purchased in that order.
- The product documents discussed above might include an embedded list of changes over time (such as price changes), in a field revision_history which is an array of documents that look something like { time : ..., who_changed : ..., what_changed : ... }
- If a customer has multiple shipping addresses, I'd store them in an array of subdocuments in the customer document.
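As a sketch of the revision_history idea above, here is the embedded-array update written in plain Python (the field names follow the suggestion above; in MongoDB the same change would be a single update using $push on the array plus $set on the field):

```python
from datetime import datetime, timezone

# A product document with an embedded revision_history array (hypothetical data).
product = {
    "_id": 1,
    "vendor_id": 9600,
    "name": "iPad 7",
    "price": 500,
    "revision_history": [],
}

def record_change(doc, who, field, new_value):
    """Append a change entry and apply the change. In MongoDB this would be
    one update: {"$push": {"revision_history": entry}, "$set": {field: new_value}},
    so the history and the new value change atomically within the document."""
    entry = {
        "time": datetime.now(timezone.utc),
        "who_changed": who,
        "what_changed": {field: {"old": doc[field], "new": new_value}},
    }
    doc["revision_history"].append(entry)
    doc[field] = new_value

record_change(product, "admin", "price", 450)
print(product["price"], len(product["revision_history"]))  # 450 1
```

Because the history lives inside the product document, the document-level atomicity Dwight mentions covers both the value and its audit trail.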

> there would be 10000 vendors times 10 categories = 100,000 documents in this collection
I don't understand this sentence.

ender0...@gmail.com

Mar 3, 2015, 2:49:07 PM
to mongod...@googlegroups.com
Hi Dwight,

I like your suggestion of a vendors collection and a products collection, with each product document having a vendor_id field. Assuming the price (or any other attribute) of a product differs per vendor, and assuming I have 200 products and 10,000 vendors, this schema would result in 2,000,000 documents in the products collection [since each vendor can price each product differently].

I am aware that I could use an array of vendor_ids instead (if a product were exactly the same across vendors, but that is likely not the case here).

My point is: is such a huge denormalization advisable?
If I am still not clear, think of it as if each of the 10,000 vendors wants to add one custom attribute to each of the 200 products.
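For comparison, here is what the application-level "join" being debated actually looks like under the one-product-document-per-vendor layout: two lookups assemble the same view document that the fully denormalized design would store precomputed. (In-memory dicts stand in for the collections; the data is hypothetical, and against MongoDB this would be a db.vendors.find_one plus a db.products.find.)

```python
# Normalized layout: one product document per (vendor, product) pair, since
# each vendor prices each product differently -- up to 200 x 10,000 = 2,000,000
# documents in the worst case.
vendors = {9600: {"_id": 9600, "name": "Abc Electronics"}}

products = [
    {"vendor_id": 9600, "category": "Tablets", "name": "iPad 7", "price": 500},
    {"vendor_id": 9600, "category": "Tablets", "name": "Galaxy Ace Tablet", "price": 400},
    {"vendor_id": 9600, "category": "Phones", "name": "Galaxy S5", "price": 600},
]

def category_page(vendor_id, category):
    """Build the front-end view with two lookups instead of reading one
    denormalized vendor_category_product_list document."""
    vendor = vendors[vendor_id]
    items = [p for p in products
             if p["vendor_id"] == vendor_id and p["category"] == category]
    return {
        "vendor_id": vendor["_id"],
        "vendor_name": vendor["name"],
        "category_name": category,
        "products": items,
    }

page = category_page(9600, "Tablets")
print(page["vendor_name"], len(page["products"]))  # Abc Electronics 2
```

The trade-off is two indexed reads per page view instead of one, in exchange for storing each vendor/product fact exactly once.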

Thanks for your reply.

-ender.