What is the best method to design the shard key for this dataset?


Jake

Oct 31, 2011, 3:42:36 PM
to mongodb-user
Hello all,

I have referred to the book "Scaling MongoDB" and still need some help
with how to design the shard key for my specific dataset.

Here is the structure of the dataset:

-2001 (contains hundreds of zip files)
== file_abcde.zip (each contains a csv file)
....
== file_afcde.zip
-2002
== file_cbede.zip
....
== file_ffcde.zip
-2003
== file_abtde.zip
....
== file_gafde.zip
...
-2100

Fact 1> Each folder contains hundreds of zip files.
Fact 2> Each zip file contains a csv file.
Fact 3> Each csv file contains hundreds of records.
Fact 4> Roughly, each year contains the same amount of data.
Fact 5> The total volume of all data will reach around 10TB.

Each zip file contains a single csv file with hundreds of records,
each including a field named "StartDate" (for example: "20070501").

I thought about using MONTH as the shard key, but the problem is
that most queries are run against a single year.
However, cross-year queries do exist.

So I think the better choice may be (month, year) as the shard key.

Any suggestion and comments are welcome.

Thank you

Kristina Chodorow

Nov 1, 2011, 7:18:33 PM
to mongodb-user
Can you give examples of the types of queries you'd like to do? Are
you continuing to store new .zip/csv files (or is it just one initial
import)? Is your insert rate something one machine can handle or do
you want it distributed across multiple machines?

Jake

Nov 2, 2011, 12:42:16 PM
to mongodb-user
Hello Kristina,

> Can you give examples of the types of queries you'd like to do?  
For now, we are not sure about those queries. But our clients need to
run different kinds of queries, and we are investigating which DB should
be used to handle their many terabytes of data.
To me, MongoDB is the obvious choice, but we are still investigating. I
have double-checked your book chapter on how to choose a good
shard key. For such a data volume, I just want to choose the best
way to define the shard key.

> Are you continuing to store new .zip/csv files (or is it just one initial
> import)?  
We don't store the files as zips in the DB; instead we extract the data
records from the csv files (stored in the file system as .zip files) and
load them into MongoDB as regular documents.

> Is your insert rate something one machine can handle or do
> you want it distributed across multiple machines?
The initial loading volume will be around 1TB and more data will come
each year.
We expect the final volume to reach around 60TB.

Based on the suggestion from http://tinyurl.com/5u94q4u ("MongoDB Pre-
Splitting for Faster Data Loading and Importing"), I would like to pre-
split the data while I do the initial data load. The problem is that I
need a method to design the shard key. In the craigslist case, Jeremy
simply uses the postID as the shard key and pre-splits the chunks across
different shard servers. But your book indicates that using an
increasing value as the key is not a good idea, so I am a little
confused about how further insertions work for craigslist.
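
For reference, a rough sketch of what that pre-split might look like from
the mongo shell, assuming (purely for illustration) a mydb.records
namespace and a compound shard key of { year: 1, month: 1 }:

    // run against a mongos, after sharding the collection but before the
    // bulk load; each year becomes its own empty chunk that can then be
    // placed on a specific shard
    for (var y = 2001; y <= 2100; y++) {
        db.adminCommand({ split: "mydb.records",
                          middle: { year: y, month: MinKey } });
    }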

In my case, I think the better way is to send each year's data to one
chunk on a specific shard server and keep rotating through the shards,
assuming the total data size for each year stays under 200M.
For example (a rough moveChunk sketch for this kind of manual placement
follows the examples below):
2000 => shard001
2001 => shard002
2002 => shard003
2003 => shard001
2004 => shard002
2005 => shard003
..
Or, instead we can send each month to a specific shard server.
For example,
2000/01 => shard001
2000/02 => shard002
2000/03 => shard003
2000/04 => shard004
2000/05 => shard005
...
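
A rough sketch of that kind of manual placement (the shard names are
hypothetical, and this again assumes a { year: 1, month: 1 } key):

    // round-robin whole years onto three shards, per the first example above
    var shards = ["shard001", "shard002", "shard003"];
    for (var y = 2000; y <= 2005; y++) {
        db.adminCommand({ moveChunk: "mydb.records",
                          find: { year: y, month: MinKey },
                          to: shards[(y - 2000) % shards.length] });
    }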

However, we still have to decide how to design the key for those
records. The number of records in each month varies from 1 to 4,000, so
it is hard to predict the id range without querying the whole data set
in advance. Also, using the default MongoDB _id is not a good idea,
because I need to split the data as part of the initial loading step.
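
One option (just a sketch, nothing we have settled on) would be to break
the year and month out of StartDate at load time, so the shard key can be
built from explicit fields instead of from _id:

    // hypothetical document shape for one csv record
    db.records.insert({
        StartDate: "20070501",
        year: 2007,        // parsed from StartDate
        month: 5,          // parsed from StartDate
        // ...remaining csv columns...
    });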

Your suggestion is highly appreciated.

Thank you

Kristina Chodorow

Nov 9, 2011, 4:46:49 PM
to mongodb-user
It's difficult to say what a good shard key would be if you don't know
what the usage will be.

The examples you gave might work fine, but under certain types of load
you could easily get hot spots (e.g., I'd imagine the shard with the
current year would be busier than the others). However, if you have a
system that could handle it, that's fine. If you want to go that
direction, MongoDB doesn't support assigning certain ranges to certain
shards yet. Keep your eye on https://jira.mongodb.org/browse/SERVER-2545
(you could currently maintain that manually by turning off the
balancer and using splitChunk/moveChunk).
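
Roughly, turning the balancer off looks like this from the shell (so any
chunks you place by hand with split/moveChunk stay where you put them):

    // run from a mongos; set "stopped" back to false to re-enable balancing
    db.getSiblingDB("config").settings.update(
        { _id: "balancer" }, { $set: { stopped: true } }, true);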

If you want to let MongoDB handle splitting up load, you might want to
go with a shard key more like {year:1, someOtherField:1}.
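
For example (the database/collection names and the choice of StartDate as
the second field are just placeholders):

    db.adminCommand({ enablesharding: "mydb" });
    db.adminCommand({ shardcollection: "mydb.records",
                      key: { year: 1, StartDate: 1 } });

    // a single-year query is still routed off the key's prefix:
    db.records.find({ year: 2007,
                      StartDate: { $gte: "20070101", $lt: "20080101" } });

A single year's data still lives in a contiguous key range, but MongoDB
can split within a year once it grows past the chunk size.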

