Updating Millions of Documents In a Collection -- Performance?


Shalom Rav

Apr 10, 2011, 9:32:52 PM
to mongodb-user
Suppose that my Python client performs an operation that returns a list
with 10,000,000 items, and I'd now like to update the corresponding
10,000,000 documents in MongoDB with the values stored in that Python
list.

Will this update process be significantly slower than, say, updating
only 10,000 documents?

Can any of the more experienced users suggest some performance
benchmarks on how many documents can be updated per second?

Is MongoDB really scalable when one needs to update 10,000,000
documents at the same time?
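
For concreteness, a minimal sketch of what such a bulk update might look like from Python (assuming a modern PyMongo driver; the collection, field names, and the new_values list are hypothetical):

    from pymongo import MongoClient, UpdateOne

    coll = MongoClient().mydb.particles        # hypothetical database/collection

    # new_values: the in-memory list of (document_id, new_value) pairs
    ops = []
    for doc_id, value in new_values:
        ops.append(UpdateOne({"_id": doc_id}, {"$set": {"value": value}}))
        if len(ops) == 10000:                  # flush in bounded batches
            coll.bulk_write(ops, ordered=False)
            ops = []
    if ops:
        coll.bulk_write(ops, ordered=False)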

Joshua Kehn

Apr 10, 2011, 9:36:40 PM
to mongod...@googlegroups.com
Better question: Why do you need to update 10M documents at once? That's incorrect schema design. Scalability isn't determined by how fast you can update a lot of records.

Regards,

-Josh
____________________________________________
Joshua Kehn | Josh...@gmail.com

Shalom Rav

Apr 10, 2011, 9:58:07 PM
to mongodb-user
> Better question: Why do you need to update 10M documents at once? That's incorrect schema design. Scalability isn't determined by how fast you can update a lot of records.

Joshua,

There are other ways for me to model my data, but they would require
document sizes of 50MB or more. Since MongoDB has a document size
limit of 16[MB], I have to split the information across 10,000,000
documents. That doesn't mean I will have to update all of them every
second, but in theory I might need to. In practice, perhaps 50,000 or
so will be updated every couple of seconds.

I just want to get a sense of the write/update scalability of MongoDB
before I start coding.

Thanks,
Shalom.


Joshua Kehn

Apr 10, 2011, 10:00:58 PM
to mongod...@googlegroups.com

On to a different problem: there is very little reason for you to store that much information in one document. 16MB is a lot of data for a single record in a database.

Can you describe what data you're storing and how you are going about it?

Regards,

-Josh
____________________________________________
Joshua Kehn | Josh...@gmail.com

Nat

Apr 10, 2011, 10:02:17 PM
to mongodb-user
Update performance depends heavily on your documents and your workload.
The best way is to benchmark it yourself. Some pieces of advice:

- If an update grows a document so that it no longer fits in its
allocated space, the document has to be moved and the update will be slow.
- If you have an index on the updated field, the update will be slower
than if you don't have one.
- If your indexes don't fit in memory, you are bound to have update
performance problems.

Note that if your document size is 50MB or more, you should think about
breaking the document into smaller pieces. For example, store
binary data in GridFS and keep references to it instead of
embedding it directly.
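
A rough sketch of that GridFS approach in PyMongo (the payload, names, and collection are illustrative assumptions):

    import gridfs
    from pymongo import MongoClient

    db = MongoClient().mydb
    fs = gridfs.GridFS(db)

    # Keep only a reference to the large binary payload in the main document,
    # so the document itself stays far below the size limit.
    payload_id = fs.put(binary_blob, filename="experiment-42.bin")  # binary_blob: bytes
    db.experiments.insert_one({"experiment": 42, "payload": payload_id})

    # Later, fetch the payload back through the stored reference.
    data = fs.get(payload_id).read()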

Scott Hernandez

Apr 10, 2011, 10:14:23 PM
to mongod...@googlegroups.com
With sharding, write scalability is determined by the number of
shards you have multiplied by the (write) throughput of each shard.

The document size limitation is a (compile-time) constant, and while you
can change it to something much larger, it is best to think about how
to store the data more efficiently in general. It sounds like you have
some very specific ideas about how your system might work. Have you
thought about whether you will hit limitations in your processing layer
(Python) or over the network with so much changing data? 50,000 x 4MB
seems like a lot of data to push across a network (or many networks, for
that matter); if you are suggesting that would be the average and not a
peak, I'd start to be very concerned.
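
(For scale, and taking the 4MB-per-document figure purely as an illustrative assumption: 50,000 documents x 4MB is roughly 200GB per update cycle, i.e. on the order of 100GB/s if a cycle really happens every couple of seconds -- far beyond what a typical network link or disk subsystem can sustain.)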

Shalom Rav

Apr 10, 2011, 10:45:30 PM
to mongodb-user
Gentlemen,

Thank you for your help. Here's my situation: I would like to save
statistics on 10,000,000 (ten million) pairs of particles -- how they
relate to one another in any given interval of time.

So suppose that within a total experiment time of T1..T1000 (assume
that T1 is when the experiment starts, and T1000 is when the
experiment ends) I would like, for each pair of particles, to measure
the relationship over every Tn..T(n+1) interval:

T1..T2 (this is the first interval)
T2..T3
T3..T4
......
......
T9,999,999..T10,000,000 (this is the last interval)

For each such particle pair (there are 10,000,000 pairs) I would
like to save some figures for each interval [ Tn..T(n+1) ].

Once saved, the query I will be using to retrieve this data is as
follows: "give me all particle pairs on time interval [ Tn..T(n+1) ]
where the distance between the two particles is smaller than X and the
angle between the two particles is greater than Y". Meaning, the query
will always run over *all particle pairs* for a certain interval
of time.

How would you model this in MongoDB so that the writes/reads are
optimized? Any suggestions from experienced users will be greatly
appreciated.




Joshua Kehn

Apr 10, 2011, 10:52:02 PM
to mongod...@googlegroups.com


Can I assume that the pairs are held in memory and not in the database for the duration of the experiment?

Regardless, I would suggest that you use MongoDB in a log format. Structure your documents like this:

    { time: [n], pair: [pair number], distance: [x], angle : [y], [experiment: [z]] }

I put experiment in there in case you want to store more than one experiment in the same collection.
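
From the Python client, that layout might look roughly like this (PyMongo; the compound index is an assumption based on the query described earlier in the thread):

    from pymongo import MongoClient, ASCENDING

    coll = MongoClient().mydb.records          # hypothetical collection

    coll.insert_one({
        "experiment": 1,
        "time": 1,              # interval index n, i.e. the interval [Tn, Tn+1)
        "pair": 734,
        "distance": 0.23,
        "angle": 3.62,
    })

    # Matches the stated query: all pairs in interval n of experiment z
    # with distance < X and angle > Y.
    coll.create_index([("experiment", ASCENDING), ("time", ASCENDING),
                       ("distance", ASCENDING)])
    cursor = coll.find({"experiment": 1, "time": 1,
                        "distance": {"$lt": 0.5}, "angle": {"$gt": 30}})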

Regards,

-Josh
____________________________________________
Joshua Kehn | Josh...@gmail.com

Shalom Rav

Apr 10, 2011, 11:06:19 PM
to mongodb-user
Joshua,

Thank you. Yes, I would like to store thousands of experiments (they
will need to be stored on disk -- memory is not big enough).

Regarding the suggested format:

{ time: [n], pair: [pair number], distance: [x], angle : [y],
experiment: [z] }

Do these figures refer to scalars? (Meaning, will there be ONE
DOCUMENT for each combination of {n, pair_number, distance, angle,
experiment}?)

For example:

{ time: 1, pair: 734, distance: 0.23, angle : 3.62, experiment: 1 }
{ time: 1, pair: 734, distance: 0.1, angle : 85.62, experiment: 2 }
.......................
.......................

Did I get it right?

If so, then there will be millions of documents in such a collection.
Is that a problem?

Also, would it be a good idea to assign the unique IDs myself, or to let
MongoDB do it for me?

Finally, will time-based queries be fast across such a huge collection of
millions of documents?

Joshua Kehn

Apr 10, 2011, 11:12:14 PM
to mongod...@googlegroups.com

Yes, they are scalars (one document per combination). Millions of documents in a collection is fine; you should have no issue.

MongoDB will give each document an ObjectId. If you need specific IDs (per experiment, time, pair) it would be best to generate those yourself.

Databases are designed to handle millions of records, so you should have no speed issues there.
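
A small sketch of the two options in PyMongo (the compound _id shown is only one possible convention, not something prescribed in this thread):

    from pymongo import MongoClient

    coll = MongoClient().mydb.records

    # Option 1: let the driver assign an ObjectId automatically.
    coll.insert_one({"experiment": 1, "time": 1, "pair": 734,
                     "distance": 0.23, "angle": 3.62})

    # Option 2: supply your own _id built from the natural keys.
    # (_id must be unique; exact-match lookups then need the same field order.)
    coll.insert_one({"_id": {"e": 1, "t": 1, "p": 734},
                     "distance": 0.23, "angle": 3.62})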

Regards,

-Josh
____________________________________________
Joshua Kehn | Josh...@gmail.com

Shalom Rav

Apr 10, 2011, 11:25:32 PM
to mongodb-user
Josh,

Thank you.

Is there a benefit to generating the unique ID (per experiment, time,
pair) myself? (Meaning, if I do so, will I be able to narrow down
the query search space?)
Given the right hardware, can you approximate how long a query by
'time_interval' should take? (1[sec]? 10[sec]? 1[min]?) -- and what should
I do to optimize the query so that it runs fast?

Best,
Shalom.

Joshua Kehn

Apr 10, 2011, 11:30:52 PM
to mongod...@googlegroups.com

I can't run an estimate off the top of my head; I have some data pools I could check on Monday to give you an approximate number. No more than 10 seconds would be my guess.

Perhaps there was a misunderstanding. You will have to generate your own time id, unless there is a separate collection you wish to reference. The benefit of generating the experiment and pair ids yourself is that they won't look like 4c9582af3e6dfb1b4b4f044e and will be reasonably round numbers.

If you need the queries to run faster you can index the fields you are searching by. I don't have any numbers for before / after on indexing that large a record set, or for how long indexing will take initially and per-update.
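
One way to see that difference for yourself rather than relying on second-hand numbers (PyMongo sketch; field names assumed from earlier in the thread):

    from pymongo import MongoClient, ASCENDING

    coll = MongoClient().mydb.records
    query = {"experiment": 1, "time": 1}

    # Without an index this is a full collection scan.
    print(coll.find(query).explain())

    # Build the index, then the same explain() should report it being used.
    coll.create_index([("experiment", ASCENDING), ("time", ASCENDING)])
    print(coll.find(query).explain())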

Regards,

-Josh
____________________________________________
Joshua Kehn | Josh...@gmail.com
 

Shalom Rav

Apr 10, 2011, 11:44:53 PM
to mongodb-user
Joshua,

> You will have to generate your own time id, unless there is a separate collection you wish to reference. The benefit of generating the experiment and pair ids yourself is that they won't look like 4c9582af3e6dfb1b4b4f044e and will be reasonably round numbers.

I am sorry for not clarifying myself properly. What I wanted to ask
was: suppose I do provide my own 'name' for every document, IS THERE A
WAY I could query ONLY DOCUMENTS THAT are 'related' to certain names?

For example, for simplicity purposes, suppose that I give the
following names to documents:

`exp0`
`exp1`
`exp2`
......
`expN`

Is there a way to have the query run ONLY ON DOCUMENTS that are (say)
between `exp5` .. `exp287324`, and ignore the rest?

Thank you for your patience and help. If it's not too hard, I would be
happy to get some performance numbers, as you suggested.

Best,
Shalom.


Andreas Jung

Apr 10, 2011, 11:50:30 PM
to mongod...@googlegroups.com


Store the 'name' on each document, create an index on it, and perform
a range search using the $gt and $lt operators (lexicographical ordering
of strings applies). Otherwise, store the number of the experiment as an
integer if you don't need the 'exp' prefix.
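
A sketch of both variants (PyMongo; the collection and field names are assumptions):

    from pymongo import MongoClient, ASCENDING

    coll = MongoClient().mydb.records

    # Variant 1: string names with a lexicographic range. Note that 'exp10'
    # sorts before 'exp5' as a string, so zero-pad the number if you go this way.
    coll.create_index([("name", ASCENDING)])
    coll.find({"name": {"$gte": "exp0000005", "$lte": "exp0287324"}})

    # Variant 2: store the experiment number as an integer instead.
    coll.create_index([("experiment", ASCENDING)])
    coll.find({"experiment": {"$gte": 5, "$lte": 287324}})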

-aj

Joshua Kehn

Apr 10, 2011, 11:51:26 PM
to mongod...@googlegroups.com
Absolutely.

Suppose you have three experiments, named 1, 2, and 3:

db.records.find({experiments : {$in : [1, 2, 3]}})

Add a time restriction (for times 100, 110, 210):

db.records.find({experiments : {$in : [1, 2, 3]}, time : {$in : [100, 110, 210]}})

I would recommend checking out the MongoDB Advanced Queries documentation.
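
From the Python client those same queries would look roughly like this (PyMongo; field names follow the shell example above):

    from pymongo import MongoClient

    coll = MongoClient().mydb.records

    coll.find({"experiments": {"$in": [1, 2, 3]}})
    coll.find({"experiments": {"$in": [1, 2, 3]},
               "time": {"$in": [100, 110, 210]}})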

Regards,

-Josh
____________________________________________
Joshua Kehn | Josh...@gmail.com

Shalom Rav

Apr 10, 2011, 11:54:24 PM
to mongodb-user
Perfect, thank you Joshua! (Andreas, thank you too!)


Shalom Rav

Apr 11, 2011, 12:44:49 AM
to mongodb-user
Joshua,

Instead of modeling the experiment using scalars:

{ pair: [pair number], time: [n], distance: [x], angle: [y],
experiment: [z] }

would it be a good idea to transform this into an array representation?

{ pair: [pair number], time: [t1, t2, t3, t4, t5, ...., tn], distance:
[x1, x2, x3, x4, x5, ...., xn], angle: [y1, y2, y3, y4, y5, ....,
yn], experiment: [z1, z2, z3, z4, z5, ...., zn] }

Meaning, every document would contain all results (as a function of time,
distance, angle, experiment) for ONE PARTICLE PAIR.
This way, I would only have to store 10,000,000 documents in the
collection, rather than hundreds of millions / billions of documents
(as there are many combinations).

Is this design good, especially as compared to a scalar-based
design? Will it improve the performance of writes / reads? What are
the benefits / drawbacks?

Thanks,
Shalom.

Sam Millman

Apr 11, 2011, 4:51:53 AM
to mongod...@googlegroups.com
16MB won't allow for the number of subdocuments you want per root document. Since 16MB is the largest document size allowed, that is not the best idea.

Don't worry about having potentially billions of documents; Mongo can handle it.

Shalom Rav

Apr 11, 2011, 7:23:35 AM
to mongodb-user
Sam,

Thank you for your help. Suppose I can 'fit' this data into less than
16[MB] per record. Is it still preferable to design around 'scalar'
documents rather than 'array' documents (one per particle pair)?

Also, I know MongoDB can handle billions of documents; the question
is, what will the query performance be? (Any hints on this?)

Thanks,
Shalom.


Joshua Kehn

Apr 11, 2011, 7:27:28 AM
to mongod...@googlegroups.com
I would still use a scalar design. You would end up with much more complex queries if you stored arrays of data for each particle pair.

Regards,

-Josh
____________________________________________
Joshua Kehn | Josh...@gmail.com

Sam Millman

Apr 11, 2011, 7:31:58 AM
to mongod...@googlegroups.com
I would say design scalar (without arrays). You have the typical problem that archived data might come back and bite you in the ass later. I have found this with design patterns before, where I thought my document format would work perfectly but then it just grew too big too quickly.

Performance for billions of documents will be about as good as for millions, really (depending on the memory and I/O performance of the server). Mongo is insanely scalable.

The only thing you have to watch is skip() and the like; use range queries instead and cache the last seen item. For normal find() queries it is pretty much the same as a 10-million-document collection.

Also, as Joshua says, the arrays will create more complex queries, which might or might not harm your ability to correctly query the data. It will also create complexity in your code, which in turn may slow your application down.

Shalom Rav

Apr 11, 2011, 9:59:57 AM
to mongodb-user
Joshua, Sam, thank you for your advice.

Sam, you wrote: "The only thing you have to watch is skip() and the like;
use range queries instead and cache the last seen item. For normal find()
queries it is pretty much the same as a 10-million-document collection."

Can you kindly explain what you mean? My plan is to assign the Mongo '_id'
for all documents myself, so that the document names follow a certain
pattern.

For instance, suppose I will be giving the following unique IDs to
documents:

'doc9.54'
'doc10.52'
'doc9.29'
'doc10.51'
'doc1.69223'
'doc18907.9673'
........
........

Would it then be easy to ask Mongo to query and return all documents
matching the pattern (say) 'doc9.X' (meaning, all Xs of 'doc9')?

Suppose even that I have billions of records per collection (but only,
say, 10 million of type 'doc9.X'). Will a query to get all 'doc9.X'
documents be fast? Or will Mongo scan all the billions of documents?

Under such conditions, can I avoid indexing and thus increase write
performance? (As, ideally, the document name will be the 'index', and
documents will always be queried by the document name.)


Joshua Kehn

Apr 11, 2011, 10:08:18 AM
to mongod...@googlegroups.com
I wouldn't give them alphabetic prefixes; stick to just numeric ids. I am going to run a few queries to see what "fast" is, but the best way to figure this out would probably be to spin up a server (Rackspace has on-demand cloud servers, Slicehost, Linode), push some sample data in there, and then run a few benchmarking queries. That will be much more helpful than numbers from me.

Regards,

-Josh
___________________________________________

Shalom Rav

Apr 11, 2011, 10:12:56 AM
to mongodb-user
Thank you Joshua, I appreciate your help.

In terms of the unique 'id's, would it help to have them like this:

Num1.Num2 (for instance: 1.16, 347.33434, 27.23782349, 3877823.231)?

Or should I just use integers?


Sam Millman

Apr 11, 2011, 10:21:24 AM
to mongod...@googlegroups.com
"Would it be then easy to ask Mongo to query and return all documents
where the pattern is (say) 'doc9.X' (meaning, all Xs of 'doc9')?"

This would only require a simple find. But say you wanted to page the results for some reason and said skip(1000000) and limit(1000005)

It would be extremely slow due to the way that skip works. You would use a range query instead and query from the last item that you grabbed from your previous set.

So if your just doing simple finds you don't really need to worry about speed issues.
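
A rough sketch of that range-based paging pattern in PyMongo (collection and page size are assumptions):

    from pymongo import MongoClient, ASCENDING

    coll = MongoClient().mydb.records
    page_size = 10000
    last_id = None

    while True:
        query = {} if last_id is None else {"_id": {"$gt": last_id}}
        page = list(coll.find(query).sort("_id", ASCENDING).limit(page_size))
        if not page:
            break
        last_id = page[-1]["_id"]   # remember the last seen _id instead of using skip()
        # ... process page ...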

"Suppose even that I have billions of records per collection (but only,
say, 10 million of type 'doc9.X'). Will a query to get all 'doc9.X'
documents be fast? or will Mongo scan all the billions of documents?"

If hitting on an index it should be very fast.

"Under such conditions, can I avoid indexing and thus increase write
performance? (as, ideally, the document name will be the 'index', and
documents will always be queried by the document name)."

Well it is all down to your apps preferences and really this needs benchmarking.

If you are making a more write intensive app I would say it is possible to rid the index (only possible though), however, if you are looking to create more graphical outputs than writes then you need that index. So that last one is a little difficult to answer.

Joshua Kehn

Apr 11, 2011, 10:22:37 AM
to mongod...@googlegroups.com
That would work fine; it depends on what those ids mean to you. Do they represent a pair of numbers? If so, consider splitting that up into id1 and id2, in case you need to query specifically by one or the other.

Running a few sample queries, they take longer than I'd like. It's an unindexed set of 2.7M records. Searching for a specific id set took 23.9 seconds without an index. Re-running the same query (it's cached) performed better, at 2.1 seconds. Searching for one specific record (halfway down) was extremely quick, at 0.2 seconds.

Indexing will speed up the queries significantly. If you were to run the experiment (pumping 10B new records in, assuming 10M records * 10K times) and then build the indexes prior to querying, I think you'll notice a significant improvement in query time.
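
That load-then-index pattern, sketched in PyMongo (the batching helper and field names are assumptions):

    from pymongo import MongoClient, ASCENDING

    coll = MongoClient().mydb.records

    # 1) Bulk-load with no secondary indexes defined, in large unordered batches.
    for batch in batches_of_docs:              # batches_of_docs: hypothetical iterator of doc lists
        coll.insert_many(batch, ordered=False)

    # 2) Build the query indexes once, after the load and before the query phase.
    coll.create_index([("experiment", ASCENDING), ("time", ASCENDING)])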

Regards,

-Josh
___________________________________________

Shalom Rav

Apr 11, 2011, 10:49:34 AM
to mongodb-user
Sam,

Thank you.

I keep getting confused on the 'index' issue.

If I choose to give the documents their own unique '_id' (and provide no
other indexing), would that still be helpful for querying?

So suppose I assign the '_id' value for all documents: the first document
has an '_id' of 0, the second document has an '_id' of 1, and so on. This
way, I have billions and billions of documents.

The question is, suppose that I now want to retrieve ALL DOCUMENTS
with '_id' between (say) 1 and 10,000,000.

Would that query be FAST? Will Mongo have to scan all documents, or
will it use my '_id' index?


Shalom Rav

Apr 11, 2011, 10:54:29 AM
to mongodb-user
Joshua,

Yes, those IDs represent a particle pair.

By the way, is there a way to update a record in MongoDB based on
criteria?

Suppose that the particleID is 5. Is there a way to update (in place)
the values for the distance and the angle if the new angle is (say)
greater than the existing one?

Something like: update particleIDs set angle = newAngle where angle <
newAngle?

Or must I have my client READ the record and do the comparison /
update on the client side?
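
For what it's worth, this kind of conditional update can usually be expressed by folding the condition into the query filter, so no client-side read is needed; a sketch (modern PyMongo, untested against this schema, field names assumed):

    from pymongo import MongoClient

    coll = MongoClient().mydb.records
    new_angle = 42.0

    # Only documents whose stored angle is smaller than the new value match,
    # so the $set effectively means "replace the angle only if the new one is greater".
    coll.update_many({"pair": 5, "angle": {"$lt": new_angle}},
                     {"$set": {"angle": new_angle}})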


Sam Millman

Apr 11, 2011, 10:56:34 AM
to mongod...@googlegroups.com
It will use your index; it won't have to scan all the documents.

Shalom Rav

Apr 11, 2011, 11:03:47 AM
to mongodb-user
Great, Sam. So queries should be fast then.


Sam Millman

Apr 11, 2011, 11:11:26 AM
to mongod...@googlegroups.com
Yep


Shalom Rav

Apr 11, 2011, 11:14:54 AM
to mongodb-user
Sam / Joshua,

If I save, for every particleID, the following info:

distance (float32)
angle (float32)
intValue1 (int32)
intValue2 (int32)

what will the total size per document be, in bytes? (Is it 4+4+4+4 = 16,
plus _id (say 8 bytes) --> 24 bytes? Or am I missing something?)


Joshua Kehn

Apr 11, 2011, 11:16:44 AM
to mongod...@googlegroups.com
No idea. What are you trying to estimate? 

Regards,

-Josh
___________________________________________

Shalom Rav

Apr 11, 2011, 11:18:33 AM
to mongodb-user
I'm trying to get a sense of what the size of my database will be,
assuming I know the number of records:

num_of_records x size_per_record + some_overhead = size_of_db?


Joshua Kehn

Apr 11, 2011, 11:23:01 AM
to mongod...@googlegroups.com
I know MongoDB will pre-allocate blocks on disk to increase write speed. I can't give you a decent estimate; perhaps someone with more knowledge of the space allotted per document and the overhead can give you a closer one.

My rule of thumb is to just not mind how big it is; it's not meant to fit the data in the smallest possible space. If it's getting too big, I increase the disk space.
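
For a rough back-of-envelope only: BSON stores every field name inside every document, has no 4-byte float (distance and angle would be stored as 8-byte doubles), and an auto-generated _id is a 12-byte ObjectId. A document like { _id, distance, angle, intValue1, intValue2 } therefore encodes to roughly 85 bytes of BSON rather than 24, before any per-document storage padding or index overhead -- so on the order of 100GB of raw BSON per billion documents, plus indexes. Shortening the field names is one of the few easy ways to shrink that.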

Regards,

-Josh
___________________________________________

Joshua Kehn

Apr 11, 2011, 11:03:34 AM
to mongod...@googlegroups.com
Hmmm, interesting question and not one I have had to deal with. Someone else should be able to tell you if that is possible or not. 

Regards,

-Josh
___________________________________________