Polling for updates

Sergey Shinderuk

Oct 25, 2010, 2:47:50 PM
to mongodb-user
I'd like to know the right way to check for updates to a mongo
collection.

In our application we have a mongo collection of text documents, and
we use Apache Solr for full-text search over this collection. Several
times a day the collection is updated: mostly new documents are added,
and a few old documents are updated. We need to keep the Solr full-text
index in sync, but a full reindex of the collection is impractical.
There are around 5M documents in the collection, and only a few
thousand documents are added or updated.

So we need a delta-indexing scheme that sends updates to the Solr
index only for the documents added or updated since the last time we
checked the collection.

What is the best way to get the documents updated since the last time?

One way is to watch for oplog events. But IMHO it's not the right or
convenient way. I'd like to be able to get a list of updated documents
no matter how much time has passed since the update. But with the oplog
you have to catch the events before they are gone.

Another way, which seems quite natural, is to add a timestamp to each
document and update it whenever the document is updated. Then you can
remember the timestamp of the last indexed document, and next time look
for documents newer than this timestamp. The problem here is generating
server timestamps. You cannot use client timestamps, because you would
need to synchronize the clocks of all clients; moreover, in a bulk
update all documents from the same bulk would get the same timestamp.

Before switching to mongo we used CouchDB. Delta updates with CouchDB
were a piece of cake. Every database in CouchDB has an update
sequence number, and after each update (insert or delete) this number
is incremented. You can query for updates to the database starting
from a certain update sequence (the one you finished at last time). It
is just what we need. But there is nothing similar in mongo.
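
To make concrete what we are after, here is a minimal in-memory sketch of the update-sequence idea described above. It is not CouchDB code: the class UpdateSeqSketch and its methods write and changesSince are hypothetical names, and the rule that a rewritten document shows up only under its latest sequence number is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of a database-wide update sequence; not a CouchDB or MongoDB API.
final class UpdateSeqSketch {
    private long updateSeq = 0;                                       // incremented on every write
    private final Map<String, Long> seqById = new HashMap<>();        // latest sequence per document
    private final TreeMap<Long, String> idBySeq = new TreeMap<>();    // documents ordered by sequence

    // Records a write to the given document and returns the new sequence number.
    long write(String id) {
        Long old = seqById.put(id, ++updateSeq);
        if (old != null) {
            idBySeq.remove(old); // assumption: a document is listed only under its latest sequence
        }
        idBySeq.put(updateSeq, id);
        return updateSeq;
    }

    // Returns the ids of documents written after the given sequence, oldest change first.
    List<String> changesSince(long since) {
        return new ArrayList<>(idBySeq.tailMap(since, false).values());
    }

    public static void main(String[] args) {
        UpdateSeqSketch db = new UpdateSeqSketch();
        db.write("a");
        long checkpoint = db.write("b");
        db.write("a");
        System.out.println(db.changesSince(checkpoint));
    }
}
```

An indexer would store the last sequence number it processed and pass it to changesSince on the next run.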

Right now we use a workaround that looks like a hack. While reading the
mongo source code, we found out that a BSON timestamp of zero gets
special treatment. If you put a zero timestamp in the first or second
field of a document being inserted or updated, the server will
put the actual timestamp in this field. This way you essentially get
server timestamps. And each inserted or updated document (whether it
is in a bulk or not) receives a unique timestamp. We use this
timestamp for tracking changes in the collection and delta-indexing of
the updated documents.

Is this an ugly hack or not? We failed to find anything in the
documentation or commit comments about this feature and its use.

In any case, what is the right way to implement delta-indexing with
mongo?
Any suggestions or experience will be appreciated.



Nathan Ehresman

Oct 25, 2010, 2:55:53 PM
to mongod...@googlegroups.com
2 thoughts:

1. Use a queue to keep track of documents updated, then whatever process
updates your Solr index could simply work its way through the queue.
You could even have multiple workers if you wanted.
2. This could be a nice application for using triggers, when they get
implemented. See http://jira.mongodb.org/browse/SERVER-124
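
A rough sketch of the queue idea from point 1, assuming the writers push the id of every changed document onto a shared queue. Everything here (UpdateQueueSketch, drain, the stop marker) is made up for illustration, and the "indexing" step only records the id instead of talking to Solr.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch: workers drain a queue of changed document ids.
// Nothing here is a MongoDB or Solr API.
final class UpdateQueueSketch {
    // Marker that tells one worker to stop; assumed never to be a real document id.
    private static final String STOP = "__stop__";

    // Processes the queued ids with the given number of worker threads and
    // returns the distinct ids that were "indexed".
    static Set<String> drain(List<String> updatedIds, int workers) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(updatedIds);
        for (int i = 0; i < workers; i++) {
            queue.add(STOP); // one stop marker per worker, behind all real ids
        }
        Set<String> indexed = ConcurrentHashMap.newKeySet();
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < workers; i++) {
            Thread t = new Thread(() -> {
                try {
                    for (String id = queue.take(); !id.equals(STOP); id = queue.take()) {
                        indexed.add(id); // a real worker would fetch the document and send it to Solr
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            t.start();
            threads.add(t);
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return indexed;
    }

    public static void main(String[] args) {
        System.out.println(drain(List.of("doc1", "doc2", "doc1", "doc3"), 2));
    }
}
```

Because the workers share one queue, each queued id is handled by exactly one of them, which avoids most of the coordination work.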

Nathan

Alvin Richards

Oct 25, 2010, 3:14:08 PM
to mongodb-user
So a couple of thoughts

1. To get the servertime, you can do this using db.eval

> myfunc = function(x){ return new Date(); };
> servertime = db.eval( myfunc) ;

2. Process the events
Yes your scheme would work, but bear in mind any concurrency. For
example, what if you need two threads to process the updates, you will
then need to coordinate what is being worked on by the two threads
etc.

3. Tailable cursors

If you track the update events in a collection, then you can use a
tailable cursor to get these changes as they happen. The restriction is
that you need to use a capped collection, so you will need to size
it correctly so that you do not lose events before they are
processed.

http://www.mongodb.org/display/DOCS/Tailable+Cursors
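
To illustrate the sizing concern, here is a small in-memory model of a capped event log. It is only a sketch under my own assumptions, not the MongoDB API: CappedLogSketch, append, readAfter and missedEvents are hypothetical names, and the capacity is counted in events rather than bytes.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical in-memory model of a capped event log; not a MongoDB API.
final class CappedLogSketch {
    private final int capacity;
    private final Deque<long[]> events = new ArrayDeque<>(); // each entry: {sequence, documentId}
    private long nextSeq = 1;

    CappedLogSketch(int capacity) {
        this.capacity = capacity;
    }

    // Appends an event, discarding the oldest one when the log is full.
    void append(long documentId) {
        if (events.size() == capacity) {
            events.removeFirst();
        }
        events.addLast(new long[] {nextSeq++, documentId});
    }

    // Returns the sequence numbers after lastSeen that are still in the log.
    List<Long> readAfter(long lastSeen) {
        List<Long> out = new ArrayList<>();
        for (long[] e : events) {
            if (e[0] > lastSeen) {
                out.add(e[0]);
            }
        }
        return out;
    }

    // True when events after lastSeen were overwritten before the consumer read them.
    boolean missedEvents(long lastSeen) {
        long oldest = events.isEmpty() ? nextSeq : events.peekFirst()[0];
        return oldest > lastSeen + 1;
    }

    public static void main(String[] args) {
        CappedLogSketch log = new CappedLogSketch(3);
        for (long id = 1; id <= 5; id++) {
            log.append(id);
        }
        System.out.println("missed events: " + log.missedEvents(0));
    }
}
```

A consumer that falls more than the capacity behind has no way to recover the dropped events from the log itself, so the log has to be large enough to cover the longest expected consumer delay.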

-Alvin


Sergey Shinderuk

Oct 25, 2010, 3:29:13 PM
to mongodb-user
Thank you for the comments.

Right now we do use server timestamps by sending an empty timestamp in
the ts field and letting the server fill in the timestamp (as described
above).
This approach is the most concurrency-friendly one I can think of.

On Oct 25, 23:14, Alvin Richards <al...@10gen.com> wrote:
> So a couple of thoughts
>
> 1. To get the servertime, you can do this using db.eval
>
> > myfunc = function(x){ return new Date(); };
> > servertime = db.eval( myfunc) ;

As far as I know, db.eval blocks other operations and will be a
bottleneck in an update-heavy scenario.

>
> 2. Process the events
> Yes your scheme would work, but bear in mind any concurrency. For
> example, what if you need two threads to process the updates, you will
> then need to coordinate what is being worked on by the two threads
> etc.

Our scheme allows many consumers of updates. Actually we have
many instances of Solr and other external update handlers, not just
one.
Every handler just remembers the last timestamp, and next time it
polls for updates with the query:

db.mycoll.find({ ts : { $gt : last_ts } }).sort({ ts : 1 })

This way we can have multiple external processes keeping in sync
with the documents collection.
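
The polling step each handler runs can be sketched in memory like this. It is only an illustration under stated assumptions: a TreeMap stands in for the collection ordered by ts, a timestamp is packed into one long (seconds in the high 32 bits, counter in the low 32 bits), and the names DeltaPollerSketch, ts and poll are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical sketch of one update handler; the map stands in for the mongo collection.
final class DeltaPollerSketch {
    private long lastTs = 0; // high-water mark: timestamp of the last document processed

    // Packs seconds and a per-second counter into one ordered value.
    static long ts(long seconds, int increment) {
        return (seconds << 32) | (increment & 0xFFFFFFFFL);
    }

    // In-memory analogue of find({ ts : { $gt : last_ts } }).sort({ ts : 1 }).
    List<String> poll(NavigableMap<Long, String> collectionByTs) {
        List<String> changed = new ArrayList<>();
        for (Map.Entry<Long, String> e : collectionByTs.tailMap(lastTs, false).entrySet()) {
            changed.add(e.getValue()); // a real handler would send the document to Solr here
            lastTs = e.getKey();       // advance the mark only after the document is handled
        }
        return changed;
    }

    public static void main(String[] args) {
        TreeMap<Long, String> coll = new TreeMap<>();
        coll.put(ts(100, 1), "doc a");
        coll.put(ts(100, 2), "doc b");
        DeltaPollerSketch handler = new DeltaPollerSketch();
        System.out.println(handler.poll(coll));
        System.out.println(handler.poll(coll));
    }
}
```

Each handler keeps its own lastTs, so handlers do not need to coordinate with each other. The scheme relies on every write getting a distinct timestamp, since two documents sharing one timestamp could be split across polls and the second one skipped.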

> 3. Tailable cursors
>
> If you track the update events in a collection, then you can use a
> tailable cursor to get these changes as they happen. Restriction is
> that you need to use a capped collection, so you will need to size
> this correctly so that you do not lose events before they are
> processed
>
> http://www.mongodb.org/display/DOCS/Tailable+Cursors

Yes, it's a possible solution. But I think timestamps are easier, because
you don't need to join update events with documents to process the
update.

Sergey Shinderuk

Oct 25, 2010, 3:34:36 PM
to mongodb-user
Dear MongoDB developers!

Is it correct to send an empty BSONTimestamp in a document to get
server-side timestamps?
We exploit this hack or feature now and want to know if it's OK or
whether there is a better way.

Thanks

jdill

Oct 25, 2010, 4:33:20 PM
to mongodb-user
I would like to know this too. Thanks,

Kristina Chodorow

Oct 25, 2010, 7:24:46 PM
to mongod...@googlegroups.com
The timestamp hack is fine.  You might want to keep your eye on/vote for http://jira.mongodb.org/browse/SERVER-1650, too.




jdill

Oct 26, 2010, 1:11:47 AM
to mongodb-user
Hey Sergey,

Can you post a code example of how you use the BSONTimestamp object in
your updates/inserts as you explained above?

Thanks,

Jeremy


Sergey Shinderuk

Oct 26, 2010, 3:09:02 AM
to mongodb-user
Here is an example of the server timestamping trick.
I use Java because you need a driver that supports the BSON Timestamp
type. The JS and Ruby drivers don't support it.

import com.mongodb.*;
import org.bson.types.BSONTimestamp;

public class Dummy {
    public static void main(String[] args) throws Exception {
        // Connect to the default local server and pick a collection.
        Mongo mongo = new Mongo();
        DB db = mongo.getDB("test");
        DBCollection coll = db.getCollection("mycoll");

        BasicDBObject doc = new BasicDBObject();
        // An empty (zero) timestamp; the server replaces it with the actual time.
        doc.put("ts", new BSONTimestamp());
        doc.put("message", "hello mongo");
        coll.save(doc);
    }
}

After running this code, issue a query in the db shell:

> db.mycoll.find()
{ "_id" : ObjectId("4cc67c83107166a69ced7c63"), "ts" : { "t" : 1288076419000, "i" : 1 }, "message" : "hello mongo" }

Note that "ts" was filled in by the server.

The name of the field doesn't matter, but it should be the first or
the second field in the document. Otherwise the server won't bother.

jdill

Oct 26, 2010, 12:29:41 PM
to mongodb-user
Ah, thanks Sergey. Maybe that's why I can't find it. The PHP and C
drivers don't have BSONTimestamp. :( Shucks.

Kristina Chodorow

Oct 26, 2010, 12:36:37 PM
to mongod...@googlegroups.com
PHP does, it's called MongoTimestamp.



mwaschkowski

Oct 27, 2010, 9:08:34 AM
to mongodb-user
Anyone reading this thread and interested in this or other similar
functionality, please vote for triggers:

http://jira.mongodb.org/browse/SERVER-124

I think this issue is close to getting chosen to be implemented, so a
few more votes might convince 10gen that this is a worthy feature for
the next release!

Thanks,

Mark

Nat

Oct 27, 2010, 10:47:16 AM
to mongodb-user
Why is a tailable cursor on the oplog not a viable solution?