MongoDB replication is very slow for removing records


john liu

Mar 6, 2012, 12:02:47 AM3/6/12
to mongod...@googlegroups.com
We're using a MongoDB replica set (5 nodes: 1 primary + 3 secondaries + 1 arbiter). We bulk insert about 30 million records a day but keep only a few days of records, so we have to run a cleanup cron job overnight. However, our operation runs 24/7, and nighttime is still very busy.

The nightly remove (20+ million records of x-day-old data) delays replication badly, by up to a couple of hours, which slows down all queries on the secondaries. The data plus indexes is only about 9 GB. We remove data on the primary based on the ObjectId _id timestamp to limit index space, and there is enough RAM (45 GB) to hold data + indexes + oplog on the secondaries.
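A minimal sketch of the _id-timestamp trick described above (names here are illustrative, not from the thread): since the first 4 bytes of an ObjectId encode its creation time in seconds, you can build a synthetic boundary _id for any cutoff and remove everything below it with one range condition on the default _id index.

```javascript
// Sketch: build a synthetic ObjectId hex string whose leading 4 bytes encode
// a cutoff Unix timestamp; documents created before the cutoff all have a
// smaller _id, so a single $lt range on the default _id index matches them.
function objectIdFromTimestamp(unixSeconds) {
  // An ObjectId is 12 bytes (24 hex chars); the first 4 bytes are the
  // creation timestamp. Zero-pad the remaining 8 bytes to get the smallest
  // possible _id at that second.
  const tsHex = Math.floor(unixSeconds).toString(16).padStart(8, '0');
  return tsHex + '0'.repeat(16);
}

// In the mongo shell the remove would then look like (collection name
// "events" is hypothetical):
//   db.events.remove({ _id: { $lt: ObjectId(objectIdFromTimestamp(cutoff)) } })
const cutoff = Date.now() / 1000 - 5 * 24 * 3600; // 5 days ago
console.log(objectIdFromTimestamp(cutoff));
```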

Since we only keep 3-5 days of records, we're not planning to shard the data. We need a high insert rate and fast query speed on the three secondaries, so we didn't split the single collection into daily collections, unless we can get similar query speed by combining queries over 3-5 small collections compared to a single collection. We don't know exactly how many records we may receive daily, or exactly how many days of records to keep, so a size-based capped collection may or may not be a good option for us.

Let me know if there are any good options.

thanks

John

Tyler Brock

Mar 6, 2012, 12:03:20 PM3/6/12
to mongodb-user
You could create a collection for each day and just drop the
collection via the cron job once it is more than 5 days old. This should
happen near-instantaneously.

In fact it would be better to create a database per day so that the
disk space is reclaimed when the database is dropped.
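To make that cron job concrete, here's a small sketch (the `events_YYYYMMDD` naming scheme is an assumption for illustration, not something from this thread): compute which daily databases fall outside the retention window, then drop each one with `db.dropDatabase()`.

```javascript
// Sketch (hypothetical naming scheme): given daily databases named
// "events_YYYYMMDD", select the ones older than the retention window so a
// cron job can drop them with db.dropDatabase() instead of a slow remove().
function namesToDrop(dbNames, today, keepDays) {
  const cutoff = new Date(today.getTime() - keepDays * 24 * 3600 * 1000);
  const cutoffTag = cutoff.toISOString().slice(0, 10).replace(/-/g, '');
  return dbNames.filter(n => {
    const m = n.match(/^events_(\d{8})$/);
    // Lexicographic compare is safe because the tags are fixed-width YYYYMMDD.
    return m !== null && m[1] < cutoffTag;
  });
}
```

In the shell, the cron job would then iterate the returned names and run `db.getSiblingDB(name).dropDatabase()` on each.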

-Tyler

john liu

Mar 6, 2012, 12:20:21 PM3/6/12
to mongod...@googlegroups.com
Will 5 queries against 5 collections or dbs be much slower than a single query?

John
--
You received this message because you are subscribed to the Google Groups "mongodb-user" group.
To post to this group, send email to mongod...@googlegroups.com.
To unsubscribe from this group, send email to mongodb-user...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/mongodb-user?hl=en.


Tyler Brock

Mar 6, 2012, 12:47:12 PM3/6/12
to mongodb-user
I'm not sure what you mean.

The idea is that dropping a collection or db is much faster than
removing documents.

Even running db.collection.drop() is much faster than running
db.collection.remove() with no arguments.

-Tyler

Tyler Brock

Mar 6, 2012, 12:49:31 PM3/6/12
to mongodb-user
I think what you are asking is:

"Is running 5 drops slower than 1 query?"

The answer is no. Dropping 5 collections or databases will run much
faster than running the single query to remove old documents.

-Tyler

john liu

Mar 6, 2012, 1:15:15 PM3/6/12
to mongod...@googlegroups.com
Hi Tyler,
Sorry for the confusion -

We have to query results across 5 days of data. If we split this single collection into 5 daily collections, we have to issue 5 queries in our application code, plus extra connection overhead (if we split into 5 daily dbs), instead of the single query/connection we have now.

The challenge is that we don't want to slow down query time.

I tested dropping a collection or db, and it is very fast in a replica set, except that dropping a collection, as you noted, will not reclaim space. Currently, removing 10 million records takes 5 minutes on the primary, but takes 1-2 hours on a secondary, delaying replication of new inserts and blocking queries as well.

We're wondering whether MongoDB is simply not designed for our use case.

John

Tyler Brock

Mar 6, 2012, 1:45:41 PM3/6/12
to mongodb-user
OK, are you sure that the secondaries have an index on the field
you are querying during the remove (presumably insertion_date or
something like that)?

Is it possible that they don't have the same indexes as the primary?

Regarding the db per day solution:

There will be some overhead and added complexity in doing 5 separate
queries, but that is the quickest way to remove stale data. It should
also be easy to combine the data returned from each query, since the
data sets in each db are distinct. The performance difference should
be negligible if you are using connection pooling and possibly
threading.
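Because the daily data sets are disjoint, the combining step really is trivial. A minimal sketch of what the application side would do after issuing the 5 per-day queries (the `ts` field name is hypothetical):

```javascript
// Sketch: results fetched from each daily database are distinct, so combining
// them is just concatenation plus one sort; no de-duplication is needed.
function combineDailyResults(resultSets) {
  // resultSets: an array of arrays of documents, one array per daily db
  return [].concat(...resultSets).sort((a, b) => a.ts - b.ts);
}
```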

-Tyler

john liu

Mar 6, 2012, 2:55:45 PM3/6/12
to mongod...@googlegroups.com
Hi Tyler,
thanks for your comment -

We use the _id's timestamp to do the deletion on the primary; explain() shows it uses the default _id index there, and the remove operation is relatively fast on the primary. I'm not sure whether MongoDB's replication of removes is implemented as statement-based. Or is it possibly a bug that the secondary can't use the _id index for a $lt on the timestamp during replication?

Your suggestion to split into daily collections is a feasible one. Another option would be to find a way to use a capped collection -

If we know 5 days of data will not exceed 15 GB or 150 million records, can we use a capped collection plus a compound index on user_id + dt (Unix time)? Are index entries aged out automatically when old data ages out of a capped collection?

Both solutions are not ideal for us if we later need to keep x days instead of y days.

If there were a way to speed up statement-based removal so it ran at a speed similar to inserting, that would be great. Maybe that's not a feasible feature in MongoDB right now, for the same reason TTL is not yet on a release?

thanks

John

Tyler Brock

Mar 6, 2012, 3:50:37 PM3/6/12
to mongodb-user
Regarding speed: with replication, the primary applies the remove
operation by finding the starting point of the date range in the
index, then walking that portion of the index and removing each
document until it reaches the end of the range.

However, the oplog is applied to the secondaries by _id, one document
at a time, and while there is an index on _id, the operations are done
sequentially and individually rather than in a batch.
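To illustrate the fan-out described above (a simplified model, not MongoDB's actual oplog format in every detail): one ranged remove on the primary becomes one delete entry per matched document, and the secondary replays those entries serially.

```javascript
// Sketch: model how a single ranged remove on the primary fans out into one
// oplog delete entry per matched document, each identified only by _id.
// The secondary replays these one at a time, which is why a remove that is
// fast on the primary can take far longer to replicate.
function oplogEntriesForRemove(matchedIds) {
  return matchedIds.map(id => ({ op: 'd', o: { _id: id } }));
}
```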

You could make a capped collection with a size that greatly exceeds
the 5 day size, that might work very well for you.

Capped collections are not automatically aged out, and the indexes
will contain every document in the capped collection. However, that
shouldn't be a problem, as you could simply provide a date range
in the query.
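A rough way to pick that oversized cap (all numbers here are illustrative placeholders, not measurements from this thread): multiply the expected daily volume by the retention window and a headroom factor, since a capped collection only discards the oldest documents once the size limit is reached.

```javascript
// Sketch (illustrative numbers): size a capped collection with generous
// headroom over the expected working set. Pass the result as the "size"
// option to db.createCollection(name, { capped: true, size: ... }).
function cappedSizeBytes(docsPerDay, avgDocBytes, keepDays, safetyFactor) {
  return docsPerDay * avgDocBytes * keepDays * safetyFactor;
}

// e.g. 30M docs/day * 100 bytes * 5 days * 2x headroom = 30 GB
console.log(cappedSizeBytes(30e6, 100, 5, 2));
```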

TTL is on the roadmap here: https://jira.mongodb.org/browse/SERVER-211

You can watch it and vote for it if you would like.

-Tyler

john liu

Mar 7, 2012, 4:30:17 PM3/7/12
to mongod...@googlegroups.com
thanks, Tyler.

On a related subject: with high-frequency inserting/deleting, what's the best practice for how often to run compact to avoid disk fragmentation? What's the best indicator to go by? And will sequential deletion cause disk fragmentation more slowly?

John

Barrie

Mar 13, 2012, 8:33:55 PM3/13/12
to mongod...@googlegroups.com
Hey John,

In general, if you look at the output of db.stats() and see that the dataSize is much less than the storageSize, it likely means you've added and deleted a lot of data. The existing data may or may not be fragmented at that point. There's no exact recipe for deciding when to compact, but you can look in MMS or at the output of mongostat, and if you're getting a lot of page faults or index misses even when your data size is small, this could be the result of fragmentation, and it would be a good time to try running a compact.
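A tiny sketch of that check (the numbers and the ~3x threshold below are illustrative, not a hard rule): compare `storageSize` to `dataSize` from the db.stats() document and treat a large ratio as a hint that space is being held by deleted records.

```javascript
// Sketch: a rough fragmentation heuristic from db.stats() output. If
// storageSize greatly exceeds dataSize, much of the allocated space is
// occupied by deleted records and compact may be worth trying.
function fragmentationRatio(stats) {
  return stats.storageSize / stats.dataSize;
}

// e.g. 9 GB of live data spread over 27 GB of files gives a ratio of 3
console.log(fragmentationRatio({ dataSize: 9e9, storageSize: 27e9 }));
```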

Hope this helps.

Barrie 

