low cardinality shard keys

dst

Nov 3, 2011, 5:11:08 PM11/3/11
to mongodb-user
I am fairly new to Mongo, and started looking into sharding to solve a
problem where, I am beginning to think, it may have been the wrong
choice. The problem, simplified: let's say I have three environments:
admin, product, review. I was thinking I could partition my
environments into three shards matching the environments. I put the
skids on after reading this recently:

From the "Scaling MongoDB" book (O'Reilly), on Low-Cardinality Shard
Keys:

"""
Some people don't really trust or understand how MongoDB automatically
distributes data, so they think something along the lines of "I have
four shards, so I will use a field with four possible values for my
shard keys." This is a really, really bad idea
...
If you choose a shard key with low cardinality, you will end up with
large, unmovable, unsplittable chunks that will make your life
miserable
"""

That's scary -- and it's exactly what I am proposing! It's not that I
mistrust Mongo; I think I may be using the wrong tool for the job. I
simply want to move data across environments efficiently, and also
keep data close to the applications running in those environments. If
anyone else has similar experience or advice in this area, I would
appreciate hearing about it.

I ran a prototype of the concept, and it seemed to "work". I've
included the output from my session for anyone who is interested (I
learned a lot from it)...

-------------
First, I set up the shard servers

## start config server
./mongod --dbpath /dbs/config --port 20000 --fork

## start shard routing server
./mongos --port 30000 --configdb localhost:20000

## start admin shard
./mongod --dbpath /dbs/admin-master --port 10000 --fork

## start prod1 shard
./mongod --dbpath /dbs/prod1-master --port 10001 --fork


## start review shard
./mongod --dbpath /dbs/review-master --port 10003 --fork

## connect to Administrative Server (mongos)
./mongo localhost:30000/admin

db.runCommand({addshard: "localhost:10000", name: "admin", allowLocal: true})
db.runCommand({addshard: "localhost:10001", name: "prod1", allowLocal: true})
db.runCommand({addshard: "localhost:10003", name: "review", allowLocal: true})
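As a sanity check (not part of my original session; listshards is the
standard command for this, as far as I know), you can confirm all three
shards registered while still connected to the admin database on mongos:

db.runCommand({listshards: 1})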

# Enable sharding on database mlt
db.runCommand({"enablesharding": "mlt"});

# Enable sharding on the dogs collection
db.runCommand({"shardcollection": "mlt.dogs", "key" : {"env": 1}})

# split by environment
db.runCommand({split: "mlt.dogs", middle: {"env": "admin"}})
{ "ok" : 1 }
db.runCommand({split: "mlt.dogs", middle: {"env": "prod"}})
{ "ok" : 1 }
db.runCommand({split: "mlt.dogs", middle: {"env": "review"}})
{ "ok" : 1 }

Here is a look at my "mlt" database's sharding info. Everything by
default went to the admin shard (on: admin)

{ "_id" : "mlt", "partitioned" : true, "primary" : "admin" }
mlt.dogs chunks:
admin 4
{ "env" : { $minKey : 1 } } -->> { "env" : "admin" } on :
admin { "t" : 1000, "i" : 1 }
{ "env" : "admin" } -->> { "env" : "prod" } on : admin
{ "t" : 1000, "i" : 3 }
{ "env" : "prod" } -->> { "env" : "review" } on : admin
{ "t" : 1000, "i" : 5 }
{ "env" : "review" } -->> { "env" : { $maxKey : 1 } } on :
admin { "t" : 1000, "i" : 6 }

Next, move the chunks while the database is still empty

db.runCommand({moveChunk: "mlt.dogs", find: {"env": "admin"}, to: "admin"});
{ "ok" : 0, "errmsg" : "that chunk is already on that shard" }
db.runCommand({moveChunk: "mlt.dogs", find: {"env": "prod"}, to: "prod1"});
{ "millis" : 2094, "ok" : 1 }
db.runCommand({moveChunk: "mlt.dogs", find: {"env": "review"}, to: "review"});
{ "millis" : 2180, "ok" : 1 }

Now the chunks are allocated across the shards, in ranges bounded by
the reserved values $minKey and $maxKey

{ "_id" : "mlt", "partitioned" : true, "primary" : "admin" }
mlt.dogs chunks:
admin 2
prod1 1
review 1
{ "env" : { $minKey : 1 } } -->> { "env" : "admin" } on : admin
{ "t" : 3000, "i" : 1 }
{ "env" : "admin" } -->> { "env" : "prod" } on : admin { "t" :
1000, "i" : 3 }
{ "env" : "prod" } -->> { "env" : "review" } on : prod1 { "t" :
2000, "i" : 0 }
{ "env" : "review" } -->> { "env" : { $maxKey : 1 } } on : review
{ "t" : 3000, "i" : 0 }

Next, when connected to the mongos routing node, I inserted 4
records

db.dogs.insert("name": "Collins", "env": "prod"}
db.dogs.insert("name": "Bluesy", "env": "review"}
db.dogs.insert("name": "Billy": "env": "prod"}
db.dogs.insert("name": "Spot", "env": "admin")

When I connect to each individual shard, here is what I get

mongo localhost:10003 // connecting to review shard
use mlt
db.dogs.find()
{ env: review, name: Bluesy}

mongo localhost:10000 // connect to admin shard
use mlt
db.dogs.find()
{env: admin, name: Spot}

mongo localhost:10001 // connect to prod1 shard
use mlt
db.dogs.find()
{env: prod, name: Collins}
{env: prod, name: Billy}
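For a cross-shard sanity check you can also query through mongos, which
aggregates results from all shards (a sketch, not part of the original
session):

./mongo localhost:30000/mlt
db.dogs.count()                        // 4, totalled across all three shards
db.dogs.find({env: "prod"}).count()    // 2, served only by the prod1 shard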


One thing that I learned: you cannot modify shard key values, which
implies that once an object is on a shard, it stays there.

db.dogs.update({"name": "Billy"}, { "$set": { "env": "review"}})
>> can't do non-multi update with query that doesn't have a valid shard key

db.dogs.update({"name": "Billy", "env": "prod"}, { "$set": { "env":
"review"}})
>> Can't modify shard key's value fieldenv for collection: mlt.dogs
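The usual workaround I've seen (a sketch, untested against this exact
setup) is to remove the document and re-insert it with the new shard
key value, letting mongos route the new copy:

var billy = db.dogs.findOne({"name": "Billy", "env": "prod"})
db.dogs.remove({"name": "Billy", "env": "prod"})
billy.env = "review"
db.dogs.insert(billy)    // routed to the review shard by the new env value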

For fun, let's say a value gets stored with an environment that isn't one of the three defined ones.

db.dogs.insert({name: "Homeless", env: "shelter"})

Which shard did it end up in? Mongo routes by the chunk ranges, which
are ordered by standard string comparison, so it ends up in the
"review" shard: "shelter" sorts after "review", and the review chunk
covers everything up to $maxKey.

Brandon Diamond

Nov 4, 2011, 10:20:38 AM11/4/11
to mongodb-user
Unfortunately, this is not how sharding is intended to be used.

Sharding is a technique that allows you to serve a larger dataset than
can fit on a single instance by coordinating and balancing data across
multiple clusters of replicated MongoDB instances. The replication
provides redundancy and therefore availability, while sharding allows
you to predictably split data across a number of machines.

Sharding can be somewhat inflexible in order to achieve high
performance in the intended use case. Thus, it is rather difficult to
modify a shard key once it has been established.

Note also that sharding utilizes a balancer to help ensure each shard
services as uniform a quantity of data as is possible.

This all seems at odds with what you're trying to do. In this case,
I'm not totally certain as to why you're using sharding instead of
simply establishing 3 separate replica sets with 3 separate purposes.
It seems more elegant to simply connect to the appropriate
admin/product/review replica set, since that's what you're hoping to
get out of sharding. It also seems like you should be using multiple
collections or databases here.
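For illustration, a rough sketch of that alternative using the same
directory layout as the prototype above (replica set names are
hypothetical, and each set would normally get additional members plus
an rs.initiate() call):

./mongod --replSet admin-rs --dbpath /dbs/admin-master --port 10000 --fork
./mongod --replSet prod-rs --dbpath /dbs/prod1-master --port 10001 --fork
./mongod --replSet review-rs --dbpath /dbs/review-master --port 10003 --fork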

Alternatively -- if size is your concern -- you CAN use sharding, but
you should adhere to Kristina's recommendations in her book. Let
mongos worry about where to send your data.

Hope this adds clarity,
- Brandon

David Sean Taylor

Nov 4, 2011, 11:25:33 AM11/4/11
to mongod...@googlegroups.com
Hi Brandon,

That really does bring more clarity. I've been learning a lot about Mongo in the last few days, thanks largely to people like yourself on this list and to reading informative books like Kristina's. Thanks.

> I'm not totally certain as to why you're using sharding instead of
> simply establishing 3 separate replica sets with 3 separate purposes.

Our data will often be published from one environment to another, so I didn't want to have to manage and move data across three (or more) connections to three replica sets. I tried to keep my example simple. The full shard key I was planning on using was account-group + environment (with maybe a fine-grained third key, as suggested in one of the books I read). I was also trying to keep indexes and working sets in RAM to a minimum, with the added benefit of keeping replication of data to a minimum with sharding, and then having apps connect to their respective environment/shards, with the additional bonus of optimized locality of CRUD operations.
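A rough sketch of what that compound key declaration would look like
(the field names here are just placeholders for the account-group and
fine-grained fields, not from the prototype above):

db.runCommand({"shardcollection": "mlt.dogs", "key": {"accountGroup": 1, "env": 1, "name": 1}})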

> It seems more elegant to simply connect to the admin/product/review
> replica set since that's what you're hoping to get out of sharding. It

My idea was for applications in each environment to connect directly to each shard (replica sets), and for the administrative app to connect to mongos.

> Alternatively -- if size is your concern -- you CAN use sharding, but
> you should adhere to Kristina's recommendations in her book. Let
> mongos worry about where to send your data.


Yes, it appears I am going against the grain of what sharding is meant to do. By pre-allocating all my shards, I am getting in the way of the balancer doing its job. Actually, it's not clear to me whether the balancer would overwrite my initial splits. I was considering turning off the balancer, but I don't have enough knowledge to know what side effects that would have.
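For reference, the way I understand the balancer is disabled in this
version (an assumption on my part; it's a flag in the config database,
run from a mongos connection):

use config
db.settings.update({_id: "balancer"}, {$set: {stopped: true}}, true)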
