On Sunday, April 10, 2011 at 9:58 PM, Shalom Rav wrote:
Better question: Why do you need to update 10M documents at once? That's incorrect schema design. Scalability isn't determined by how fast you can update a lot of records.
Joshua,
There are other ways for me to model my data, but they will require
document sizes of 50MB or more. Since MongoDB has a document size
limit of 16[MB], I have to split the information between 10,000,000
documents. It doesn't mean I will have to update all of them every
second, but in theory I might need to. In practice, perhaps 50,000 or
so will be updated every couple of seconds.
I just want to get a sense of the write/update scalability of MongoDB
before I start coding.
Thanks,
Shalom.
The document size limitation is a compile-time constant, and while you
can change it to something much larger, it is best to think about how
to store data more efficiently in general. It sounds like you have
some very specific ideas about how your system might work. Have you
thought about whether you will hit limitations in your processing
layer (python) or over the network with so much changing data? For
example, 50,000 x 4MB is a lot of data to push across a network (or
many networks, for that matter); if you are suggesting that would be
the average and not a peak, I'd start to be very concerned.
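(For a rough sense of scale, taking the 4MB average document size
mentioned above and an update cycle of roughly 2 seconds: 50,000 x
4MB = 200GB per cycle, which works out to roughly 100GB/s of
sustained traffic, far beyond what a typical network link can carry.)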
On Sunday, April 10, 2011 at 10:45 PM, Shalom Rav wrote:
Gentlemen,
Thank you for your help. Here's my situation: I would like to save
statistics on 10,000,000 (ten million) pairs of particles and how
they relate to one another in any given interval of time.
So suppose that within a total experiment time of T1..T1000 (assume
that T1 is when the experiment starts, and T1000 is when the
experiment ends) I would like, for each pair of particles, to measure
the relationship over every Tn..T(n+1) interval:
T1..T2 (this is the first interval)
T2..T3
T3..T4
......
......
T999..T1000 (this is the last interval)
For each such particle pair (there are 10,000,000 pairs) I would
like to save some figures for each interval [ Tn..T(n+1) ].
Once saved, the query I will be using to retrieve this data is as
follows: "give me all particle pairs on time interval [ Tn..T(n+1) ]
where the distance between the two particles is smaller than X and the
angle between the two particles is greater than Y". Meaning, the query
will always take place for *all particle pairs* on a certain interval
of time.
How would you model this in MongoDB so that the writes/reads are
optimized? Any suggestions from experienced users will be greatly
appreciated.
On Sunday, April 10, 2011 at 11:06 PM, Shalom Rav wrote:
Joshua,
Thank you. Yes, I would like to store thousands of experiments (they
will need to be stored on disk -- memory is not big enough).
Regarding the suggested format:
{ time: [n], pair: [pair number], distance: [x], angle : [y],
experiment: [z] }
Do these figures refer to scalars? (Meaning, will there be ONE
DOCUMENT per combination of {n, pair_number, distance, angle,
experiment}?)
For example:
{ time: 1, pair: 734, distance: 0.23, angle : 3.62, experiment: 1 }
{ time: 1, pair: 734, distance: 0.1, angle : 85.62, experiment: 2 }
.......................
.......................
Did I get it right?
If so, then there will be millions of documents in such a
collection. Is that a problem?
Also, would it be a good idea to assign unique IDs myself, or to let
MongoDB do it for me?
Finally, will time-based queries be fast across this huge list of
millions of documents?
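(For what it's worth, a minimal pymongo sketch of that layout might
look like the following; the database and collection names are made
up for illustration, and nothing here goes beyond the one-document-per
{experiment, time, pair} format discussed above:

    # Sketch only: assumes pymongo and a local mongod; the database and
    # collection names ("particle_db", "pair_stats") are made up.
    from pymongo import MongoClient

    coll = MongoClient("mongodb://localhost:27017").particle_db.pair_stats

    # One document per {experiment, time interval, pair}, as in the
    # example documents above.
    coll.insert_many([
        {"time": 1, "pair": 734, "distance": 0.23, "angle": 3.62, "experiment": 1},
        {"time": 1, "pair": 734, "distance": 0.1,  "angle": 85.62, "experiment": 2},
    ])

    # "All pairs on interval T1..T2 of experiment 1 where distance < X
    # and angle > Y"
    X, Y = 0.5, 45.0
    for doc in coll.find({"experiment": 1, "time": 1,
                          "distance": {"$lt": X}, "angle": {"$gt": Y}}):
        print(doc["pair"], doc["distance"], doc["angle"])
)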
On Sunday, April 10, 2011 at 11:25 PM, Shalom Rav wrote:
Josh,
Thank you.
Is there a benefit to generating the unique ID (per experiment, time,
pair) myself? (Meaning, if I do so, will I be able to narrow down the
query search space?)
Given the right hardware, can you approximate how long a query by
'time_interval' should take? (1[sec]? 10[sec]? 1[min]?) What shall I
do to optimize the query so that it runs fast?
Best,
Shalom.
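(As a rough, hedged illustration of the usual answer to the "how do I
make the time-interval query fast" question: build a compound index
whose leading fields are the equality parts of the query and whose
trailing fields are the range parts, then check with explain() that
the query uses it. Field names follow the format discussed earlier in
the thread; the collection name is again made up:

    # Sketch: compound index for queries that fix experiment and time
    # and then range over distance and angle.
    from pymongo import MongoClient, ASCENDING

    coll = MongoClient("mongodb://localhost:27017").particle_db.pair_stats

    coll.create_index([
        ("experiment", ASCENDING),
        ("time", ASCENDING),
        ("distance", ASCENDING),
        ("angle", ASCENDING),
    ])

    plan = coll.find({
        "experiment": 1,
        "time": 7,
        "distance": {"$lt": 0.5},
        "angle": {"$gt": 45.0},
    }).explain()

    # The winning plan should contain an index scan (IXSCAN) rather
    # than a full collection scan (COLLSCAN).
    print(plan["queryPlanner"]["winningPlan"])
)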
Shalom Rav wrote:
> Joshua,
>
>> You will have to generate your own time id, unless there is a separate collection you wish to reference. The benefit to generating the experiment and pair ids yourself is that it won't look like 4c9582af3e6dfb1b4b4f044e and will be a reasonably round number.
>
> I am sorry for not clarifying myself properly. What I wanted to ask
> was, suppose I do provide my own 'name' for every document, IS THERE A
> WAY I could query ONLY DOCUMENTS THAT are 'related' to certain names?
>
> For example, for simplicity purposes, suppose that I give the
> following names to documents:
>
> `exp0`
> `exp1`
> `exp2`
> ......
> `expN`
>
> Is there a way to have the query run ONLY ON DOCUMENTS that are (say)
> between `exp5` .. `exp287324`, and ignore the rest?
Store the 'name' on each document, create an index on it and perform
a range search using $gt and $lt operators (lexicographical ordering
of strings applies). Otherwise store the number of the experiment as
an integer if you don't need the 'exp' prefix.
-aj
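(A short pymongo sketch of the range search described above; the
'name' values come from the question and the collection name is made
up. Note that with unpadded names lexicographic order does not match
numeric order, which is why zero-padding or an integer field is the
safer choice:

    # Sketch: index the per-document 'name' and range-query it.
    from pymongo import MongoClient, ASCENDING

    coll = MongoClient("mongodb://localhost:27017").particle_db.pair_stats
    coll.create_index([("name", ASCENDING)])

    # Caveat: as plain strings, "exp5" sorts AFTER "exp287324", so either
    # zero-pad the names so lexicographic order matches numeric order...
    cursor = coll.find({"name": {"$gte": "exp0000005", "$lte": "exp0287324"}})

    # ...or store the experiment number as an integer and range over that:
    # coll.create_index([("experiment", ASCENDING)])
    # cursor = coll.find({"experiment": {"$gte": 5, "$lte": 287324}})
)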