Slow import data sharding

Christian Jiménez

unread,

Apr 1, 2016, 12:42:10 AM4/1/16

to mongodb-user

Hi
I am implementing a cluster in mongodb (v 3.0.10.), My components are:
- 3 ConfigServer on different servers (2 MAC and 1 UBUNTU on port 27020 each)
- 3 shards on different servers (2 MAC with 64 RAM and 1 UBUNTU 4 RAM)
- 1 mongos MAC RAM 64
- Shard key ("_id": "hashed")

The problem I have is that the data loading is very slow when I import a 100GB database through the command mongoimport

I can fix this?

Avinash Katore

unread,

Apr 2, 2016, 6:52:38 AM4/2/16

to mongodb-user

Hi,

My suggestion would be, you first import data without applying any indexes and shard keys. Once you import then apply indexes and shard keys.

Apply indexes in following manner

db.collection.createIndex({"column": 1}, {"background": true})

Apply indexes in background because you are having 100GB of data to index.

Apply shard keys in following manner.

db.collection.ensureIndex({"column": "hashed"})

sh.shardCollection("dbname.collection", { "column": "hashed" })

Christian Jiménez

unread,

Apr 2, 2016, 11:17:52 AM4/2/16

to mongodb-user

Thank you for answering , take into account your suggestion and I commented that I get results

Christian Jiménez

unread,

Apr 2, 2016, 12:54:10 PM4/2/16

to mongodb-user

Hello again

I tell you that thanks to your suggestion dramatically improved loading times data, but instead the shards distribution is very slow.

If you know how best this, I appreciate immensely

Kevin Adistambha

unread,

Apr 12, 2016, 8:36:15 PM4/12/16

to mongodb-user

Hi Christian,

To import into a sharded cluster deployment, you can pre-split the collection before importing any data.

You can do this as follows:

Deploy the sharded cluster.
Create an empty collection using db.createCollection() and also create the necessary indexes using db.collection.createIndex(), including the index that corresponds to the shard key.
Shard the collection using sh.shardCollection()
Pre-split the destination (empty) collection as described in Create Chunks in a Sharded Cluster. You should see the balancer distributes the empty chunks across the shards. Note: this pre-split step is automatically done if you are using a hashed shard key.
Import your data via mongos.

Since the empty chunks are already distributed and balanced, the imported data will go straight into the proper shard, and your import should run faster.

Please note that you should only pre-split an empty collection.

Choosing the correct shard key is extremely important for a sharded cluster deployment, since it is immutable once you shard a collection. For more information regarding shard keys, please see Considerations for Selecting Shard Keys.

Regarding the hashed shard key: A hashed shard key helps distribute writes if the indexed field you are hashing has high cardinality. However, a hashed shard key only supports equality queries based on the shard key, so range queries on a collection with a hashed shard key will always be less efficient scatter/gather queries (see: Distributed Queries). You may be able to choose a shard key that better supports your use case than a hashed shard key.

However, I also have some questions about your deployment:

I am implementing a cluster in mongodb (v 3.0.10.)

I would recommend using the latest 3.0.x series, which is currently 3.0.11.

3 ConfigServer on different servers (2 MAC and 1 UBUNTU on port 27020 each)

3 shards on different servers (2 MAC with 64 RAM and 1 UBUNTU 4 RAM)

Regarding the hardware and setup:

Do you mean that the Macs are equipped with 64 GB of RAM, and the Ubuntu machine is equipped with 4 GB of RAM? If that is so, is there a reason for using machines with such a large discrepancy in hardware capabilities? This is fine if this is a development setup, but not recommended if this is a production setup, because MongoDB balances the cluster by counting chunks, not data size.
Where are the config servers located (e.g. in the same machine as the shards, or on separate hardware), and what is the order of the config servers in the mongos --configdb setting? The first config server listed in the setting will get more read traffic vs. the other config servers.

Please note that the recommended sharded cluster deployment in a production environment involves using replica sets as shards, as noted in the Production Cluster Architecture page.

Best regards,
Kevin

Reply all

Reply to author

Forward