Shard tagging: shards only accepting data with a tag

Glenn Maynard

Jan 30, 2014, 11:37:15 AM

to mongod...@googlegroups.com

Can I configure sharding so certain shards will *only* receive data with a particular tag? I want certain shards to never store data that has no tag. It looks like data with no tag can always go to any shard.

I'd like to be able to have separate shards for my main, smaller data that's accessed and written very often (using SSDs) and big, bulk data that's not accessed as often (using HDDs). It looks like shard tagging should let me do this, by tagging the bulk data with "HDD" so it gets moved to HDD shards. However, I don't want any other data to ever go to those shards, since they'll be intentionally tuned for lower I/O capacity.

I could manually tag every other collection with "SSD", but that's tedious and easy to miss.

--
Glenn Maynard

Andrew Ryder

Feb 9, 2014, 11:20:01 PM

to mongod...@googlegroups.com

Hi Glenn!

As you have discovered, there are basically 3 data divisions in a sharded cluster:
1. Collections which are hosted by a single shard (default)
2. Collections which are hosted (somewhat evenly) across all shards (sh.shardCollection)
3. Collections which are hosted across particular shards (sh.shardCollection, sh.addShardTag, sh.addTagRange)

The outcome you want is covered by #3, and tags are the mechanism by which that is achieved. You cannot do it without using tags.
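A minimal sketch of option 3 for the HDD case described in the question. The shard names ("shard0002", "shard0003"), the namespace "mydb.bulk" and the `_id` shard key are hypothetical, and the collection is assumed to be sharded already. The `sh` helper and the key bounds are passed in as arguments only so that the function can also be tried outside the shell with a stub.

```javascript
// Sketch only: shard names, namespace and shard key below are made-up examples.
// `sh` is the mongo shell sharding helper; minKey/maxKey are the shell's MinKey/MaxKey.
function tagBulkData(sh, minKey, maxKey) {
  // Mark the HDD-backed shards as eligible for ranges tagged "HDD".
  sh.addShardTag("shard0002", "HDD");
  sh.addShardTag("shard0003", "HDD");

  // Assign the whole key space of the bulk collection to the "HDD" tag,
  // so the balancer keeps its chunks on shards carrying that tag.
  sh.addTagRange("mydb.bulk", { _id: minKey }, { _id: maxKey }, "HDD");
}

// In the mongo shell this would be called as:
//   tagBulkData(sh, MinKey, MaxKey);
```

Note that this only constrains where the tagged range may live. It does not stop chunks without a tag from being placed on the HDD shards, which is the gap the question is about.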

However, what you describe is a hardware division between 'fast' data and 'slow' data. You have described 2 clusters. Would you be able to separate your configuration into 2 clusters, one that is 'fast' and one that is 'slow'? This will mitigate the need for tags entirely.

Kind regards,
Andrew

Glenn Maynard

Feb 10, 2014, 9:36:00 AM

to mongod...@googlegroups.com

On Sun, Feb 9, 2014 at 10:20 PM, Andrew Ryder <andrew...@10gen.com> wrote:

> Hi Glenn!
>
> As you have discovered, there are basically 3 data divisions in a sharded cluster:
> 1. Collections which are hosted by a single shard (default)
> 2. Collections which are hosted (somewhat evenly) across all shards (sh.shardCollection)
> 3. Collections which are hosted across particular shards (sh.shardCollection, sh.addShardTag, sh.addTagRange)
>
> The outcome you want is covered by #3, and tags are the mechanism by which that is achieved. You cannot do it without using tags.
>
> However, what you describe is a hardware division between 'fast' data and 'slow' data. You have described 2 clusters. Would you be able to separate your configuration into 2 clusters, one that is 'fast' and one that is 'slow'? This will mitigate the need for tags entirely.

That means maintaining two separate database connections from each client, moving data as its age moves it from fast to slow, running some queries on both servers (if it happens to overlap with "slow") and merging the results together, and so on. That's precisely what mongos is for. We want to mitigate doing all that complex stuff by using tags; not the other way around, mitigating tags by doing a lot of complex stuff. :)

This seems at least conceptually easy for Mongo to do, with a tweak to tag values. Today (correct me if my mental model is wrong), each shard has zero or more tags, and each document has zero or one tag. If a document has zero tags, it can go on any shard. If a document has one tag, it can go on any shard with that tag.

Modify this a bit:

- Each shard has zero or more tags. By default, a new shard has the tag "untagged", which can be removed from the shard.

- A document always has a tag. If no tag ranges match, the tag "untagged" is assumed. Documents can't have no tag.

By itself, this gives the same behavior. If no tag matches, then you get the tag "untagged". Every shard has it, so that untagged data can go to any shard.

For my use case, I'd remove the "untagged" tag from the slow shards, and add "slow" to them. That way, they would receive *only* documents with the "slow" tag, and never receive untagged data. On the other hand, the "fast" shards would keep the "untagged" tag, so they do continue to receive untagged data.
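To make the difference concrete, here is a toy model of the two rule sets. It is not MongoDB code: the shard objects, the function names and the "untagged" tag are illustrative only, and the "current" rule encodes the mental model described above rather than a verified description of the server.

```javascript
// Toy model of the tag-matching rules discussed above; not MongoDB code.
// A shard is { name: string, tags: [string] }; a document's tag is a string or null.

// Current behaviour as described in this thread:
// an untagged document may be placed on any shard.
function eligibleShardsCurrent(shards, docTag) {
  return shards.filter(function (s) {
    return docTag === null || s.tags.indexOf(docTag) !== -1;
  }).map(function (s) { return s.name; });
}

// Proposed behaviour: a missing tag is treated as "untagged",
// and a shard only accepts it if it explicitly carries that tag.
function eligibleShardsProposed(shards, docTag) {
  var effectiveTag = docTag === null ? "untagged" : docTag;
  return shards.filter(function (s) {
    return s.tags.indexOf(effectiveTag) !== -1;
  }).map(function (s) { return s.name; });
}

// Fast shards keep the default tag; the slow shard has it removed and "slow" added.
var shards = [
  { name: "ssd0", tags: ["untagged"] },
  { name: "ssd1", tags: ["untagged"] },
  { name: "hdd0", tags: ["slow"] }
];

console.log(eligibleShardsCurrent(shards, null));   // all three shards
console.log(eligibleShardsProposed(shards, null));  // only ssd0 and ssd1
console.log(eligibleShardsProposed(shards, "slow")); // only hdd0
```

With every shard carrying "untagged", both functions return the same result, which is the backward-compatible default.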

I can emulate this by manually setting an "untagged" tag on new collections, but that's brittle since the creation of simple collections is often implicit. It would be nice if Mongo could support this directly, so it's easy to add shards to a cluster that are never used unless a tag explicitly instructs it to be.

--
Glenn Maynard

Asya Kamsky

Feb 10, 2014, 11:13:07 PM

to mongodb-user

Your model is slightly off. Any given document can only be associated with zero or one tag.

You cannot have a document be associated with multiple tags.

If your original description is accurate (small, fast, current data versus large, old, slow data), then just tag your old range with an HDD tag and tag your slow shards with HDD. Since there will be a lot more chunks on them than on the "current" shards, the current data won't be migrated to the slow shards.

Asya

Asya Kamsky

Feb 10, 2014, 11:15:25 PM

to mongodb-user

One other difference you might consider - you say creation of collections can be implicit - but sharding of collections can never be implicit. It *must* be explicitly specified.

Asya

Glenn Maynard

Feb 12, 2014, 10:57:05 AM

to mongod...@googlegroups.com

On Mon, Feb 10, 2014 at 10:13 PM, Asya Kamsky <as...@mongodb.com> wrote:

> Your model is slightly off. Any given document can only be associated with zero or one tag.
>
> You cannot have a document be associated with multiple tags.

That's what I said: each document has zero or one tag.

> If your original description is accurate (small, fast, current data versus large, old, slow data), then just tag your old range with an HDD tag and tag your slow shards with HDD. Since there will be a lot more chunks on them than on the "current" shards, the current data won't be migrated to the slow shards.

That seems like it'd depend on the implementation details of the chunk balancer. For example, the HDD shards would have far more storage: maybe 1 TB on an HDD shard vs. 40 GB on an SSD shard. Maybe the chunk balancer today is based on total chunks, but in the future the shard's max storage might be part of the equation too. Also, when a new HDD shard is added to the cluster and the balancer is moving data to it, it doesn't know to only move HDD data to it.
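The new-shard case can be illustrated with a toy model. It assumes, purely for illustration, that the balancer sends a chunk to the eligible shard holding the fewest chunks; the real balancer is more involved, and the shard names and chunk counts here are made up.

```javascript
// Toy model only: pick the eligible shard with the fewest chunks.
// A shard is { name, tags: [string], chunks: number }; chunkTag is a string or null.
function pickDestination(shards, chunkTag) {
  var best = null;
  shards.forEach(function (s) {
    // An untagged chunk is eligible for every shard; a tagged chunk needs a matching shard tag.
    var eligible = chunkTag === null || s.tags.indexOf(chunkTag) !== -1;
    if (eligible && (best === null || s.chunks < best.chunks)) {
      best = s;
    }
  });
  return best === null ? null : best.name;
}

var cluster = [
  { name: "ssd0", tags: [], chunks: 40 },
  { name: "hdd0", tags: ["HDD"], chunks: 900 }
];

// While the HDD shard is full of bulk chunks, untagged data stays on the SSD shard.
console.log(pickDestination(cluster, null)); // ssd0

// A freshly added, empty HDD shard becomes the preferred target for untagged chunks.
cluster.push({ name: "hdd1", tags: ["HDD"], chunks: 0 });
console.log(pickDestination(cluster, null)); // hdd1
```

Under this assumption, relying on chunk counts keeps untagged data off the HDD shards only while those shards are already heavily loaded.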

On Mon, Feb 10, 2014 at 10:15 PM, Asya Kamsky <as...@mongodb.com> wrote:

> One other difference you might consider - you say creation of collections can be implicit - but sharding of collections can never be implicit. It *must* be explicitly specified.

The same issue applies to unsharded collections: I never want them to live on HDD shards by default.

It sounds like explicitly tagging whole collections separately is the way to go for now, but being able to mark shards as never storing the "null tag" would be nice to have.
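A rough sketch of what that explicit tagging could look like, so that no sharded collection is missed. It assumes the `config.collections` and `config.tags` layouts of the server versions current at the time of this thread (`_id` holding the namespace, `key` holding the shard key pattern, a `dropped` flag, and `ns` on each tag range). The function names and the "SSD" tag are hypothetical, and the shell objects are passed in as arguments so the logic can be tried with stubs.

```javascript
// Build a range spanning the whole key space of a shard key pattern,
// e.g. { a: 1, b: 1 } gives min { a: minKey, b: minKey } and max { a: maxKey, b: maxKey }.
function wholeKeyRange(keyPattern, minKey, maxKey) {
  var min = {};
  var max = {};
  Object.keys(keyPattern).forEach(function (field) {
    min[field] = minKey;
    max[field] = maxKey;
  });
  return { min: min, max: max };
}

// Give every sharded collection that has no tag range yet a whole-collection
// range with `defaultTag`. Returns the namespaces that were tagged.
// Assumption: configDb exposes `collections` and `tags` as described in the lead-in.
function tagUntaggedCollections(sh, configDb, defaultTag, minKey, maxKey) {
  var tagged = [];
  configDb.collections.find({ dropped: { $ne: true } }).forEach(function (coll) {
    if (configDb.tags.count({ ns: coll._id }) > 0) {
      return; // this collection already has explicit tag ranges; leave it alone
    }
    var range = wholeKeyRange(coll.key, minKey, maxKey);
    sh.addTagRange(coll._id, range.min, range.max, defaultTag);
    tagged.push(coll._id);
  });
  return tagged;
}

// In the mongo shell this might be run periodically as:
//   tagUntaggedCollections(sh, db.getSiblingDB("config"), "SSD", MinKey, MaxKey);
```

This only covers sharded collections, so it does not address where unsharded collections are placed.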

--
Glenn Maynard
