Hello Kristina,
> Can you give examples of the types of queries you'd like to do?
For now, we are not sure about the exact queries. Our clients will
need to run many different kinds of queries, and we are investigating
which database can handle their data, which runs to many terabytes.
To me, MongoDB is the obvious choice, but we are still investigating.
I have re-read your book chapter on how to choose a good shard key.
For a data volume this large, I want to get the shard key design
right.
> Are you continuing to store new .zip/csv files (or is it just one initial
> import)?
We don't store the .zip files in the database. Instead, we extract
the data records from the CSV files (kept on the file system as .zip
archives) and load each record into MongoDB as a regular document.
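Roughly, the load step looks like the sketch below. This is just an
illustration: the connection string, database/collection names, file
name, and batch size are all placeholders, and each CSV row simply
becomes one document.

    import csv
    import io
    import zipfile
    from pymongo import MongoClient

    # Placeholder connection and namespace -- not our real names.
    client = MongoClient("mongodb://mongos-host:27017")
    coll = client["mydb"]["records"]

    def load_archive(path):
        # Extract every CSV inside a .zip and insert its rows as documents.
        with zipfile.ZipFile(path) as archive:
            for name in archive.namelist():
                if not name.endswith(".csv"):
                    continue
                with archive.open(name) as raw:
                    reader = csv.DictReader(io.TextIOWrapper(raw, "utf-8"))
                    batch = []
                    for row in reader:
                        batch.append(row)        # one CSV row -> one document
                        if len(batch) == 1000:   # insert in batches, not one by one
                            coll.insert_many(batch)
                            batch = []
                    if batch:
                        coll.insert_many(batch)

    load_archive("2000_records.zip")   # hypothetical file name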
> Is your insert rate something one machine can handle or do
> you want it distributed across multiple machines?
The initial load will be around 1 TB, and more data will arrive each
year. We expect the final volume to reach around 60 TB.
Based on the suggestion in "MongoDB Pre-Splitting for Faster Data
Loading and Importing" (http://tinyurl.com/5u94q4u), I would like to
pre-split the data during the initial load. The problem is that I
still need a method for designing the shard key. In the craigslist
case, Jeremy simply uses the postID as the shard key and pre-splits
the chunks across the shard servers. But your book says that using a
monotonically increasing value as the key is not a good idea, so I am
a little confused about how further insertions work for craigslist.
In my case, I think a better way is to send each year's data to one
chunk on a specific shard server, rotating through the shards, as
long as the total data size for each year stays under 200 MB.
For example,
2000 => shard001
2001 => shard002
2002 => shard003
2003 => shard001
2004 => shard002
2005 => shard003
...
Or instead, we could send each month's data to a specific shard
server (a rough sketch of this pre-splitting follows the example).
For example,
2000/01 => shard001
2000/02 => shard002
2000/03 => shard003
2000/04 => shard004
2000/05 => shard005
...
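Here is a rough pymongo sketch of what I have in mind: pre-split one
empty chunk per month and park the chunks on the shards round-robin
before loading anything. The namespace "mydb.records", the shard
names, the year range, and the "month" field are all assumptions for
illustration; the same approach would work per year.

    from pymongo import MongoClient

    client = MongoClient("mongodb://mongos-host:27017")  # the mongos router
    admin = client.admin

    admin.command("enableSharding", "mydb")
    admin.command("shardCollection", "mydb.records", key={"month": 1})

    shards = ["shard001", "shard002", "shard003"]
    i = 0
    for year in range(2000, 2012):          # illustrative year range
        for month in range(1, 13):
            boundary = "%04d/%02d" % (year, month)
            # Split off a chunk whose range starts at this month...
            admin.command("split", "mydb.records", middle={"month": boundary})
            # ...and move it to a shard, round-robin, while it is still empty.
            admin.command("moveChunk", "mydb.records",
                          find={"month": boundary},
                          to=shards[i % len(shards)])
            i += 1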
However, we still have to decide how to design the key for the
individual records. The number of records in each month varies from 1
to 4000, so it is hard to predict the id range without querying the
whole data set in advance. Also, using the default MongoDB _id is not
a good idea, because I need to split the data as part of the initial
loading step.
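One idea I am considering is a compound shard key that leads with the
month and appends some unique per-record value taken from the CSV, so
the pre-splits only need the month boundaries. A minimal sketch,
where "month" and "seq" are made-up field names (this compound key
would replace the single-field "month" key in the sketch above):

    from pymongo import MongoClient

    client = MongoClient("mongodb://mongos-host:27017")

    # Coarse component first (drives the pre-splits and chunk placement),
    # fine component second (distinguishes records within a month).
    client.admin.command("shardCollection", "mydb.records",
                         key={"month": 1, "seq": 1})

    client["mydb"]["records"].insert_one({
        "month": "2000/01",   # "YYYY/MM", matches the pre-split boundaries
        "seq": 1234,          # any unique per-record value from the CSV
        # ... remaining CSV columns ...
    })

The thinking is that the leading month field keeps inserts targeted
at the pre-split chunks, while the second field still lets a chunk be
split further if one month turns out to have many records.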
Any suggestions would be highly appreciated.
Thank you