How to efficiently update (hash) a field in big collection

andbed

Jun 2, 2018, 2:05:38 AM
to mongodb-user
Hi,
I'm trying to update (hash) a field in a collection of ~40M documents, and I must be doing something wrong, because every attempt either times out or never finishes. My most recent query looks like this:

db.getCollection('test').find().noCursorTimeout().snapshot().forEach( function(data) {  
  data.firstname = hex_md5(data.firstname);   
  db.getCollection('test').save(data); 
});

but it takes ages to finish. What is the better (more efficient) way?

Many thanks for any help,
Andrzej

Kevin Adistambha

Jun 28, 2018, 3:03:43 AM
to mongodb-user

Hi Andrzej

It’s been some time since you posted this question. Have you managed to finish the update?

It sounds to me like the hardware you have is struggling to update all 40M documents in one go. To minimize the impact on your other operations, I would suggest doing the update in controlled batches, e.g.:

db.getCollection('test').find({_id: <some restricted range of _id>}).forEach( function(data) {  
  db.getCollection('test').update({_id: data._id}, {$set: {firstname: hex_md5(data.firstname)}})
});
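On top of batching by _id range, the per-document round trips can be collapsed by grouping the updates into a single bulkWrite() call per batch. A minimal sketch of building the bulk payload, assuming the same test collection; hashField below is a stand-in for hex_md5 so the snippet is self-contained, and the db.* call is shown as a comment because it needs a live mongod:

```javascript
// Placeholder for hex_md5 (available in the mongo shell, not in plain JS).
function hashField(s) { return 'md5:' + s; }

// Turn one batch of documents into an array of updateOne operations,
// each setting firstname to its hashed value.
function buildBulkOps(docs) {
  return docs.map(function (doc) {
    return {
      updateOne: {
        filter: { _id: doc._id },
        update: { $set: { firstname: hashField(doc.firstname) } }
      }
    };
  });
}

// In the mongo shell, each batch is then sent in one round trip:
// db.getCollection('test').bulkWrite(buildBulkOps(batch), { ordered: false });

var batch = [
  { _id: 1, firstname: 'Ann' },
  { _id: 2, firstname: 'Bob' }
];
var ops = buildBulkOps(batch);
```

With `ordered: false` the server can keep going past individual write errors, which is usually what you want for an idempotent re-hash like this.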

Having said that, if your update touches every single document in the collection, the operation is likely to be bound by disk speed (unless the whole collection fits in memory).

On another note, it’s generally not recommended to use noCursorTimeout(): it instructs the server to keep the cursor open indefinitely. If the application crashes or stops working for any reason, the server never closes that cursor, resulting in a resource leak that can only be cleared with a restart.
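An alternative to holding one long-lived cursor is to walk the collection in _id order, issuing a fresh short-lived query per page so the default cursor timeout never comes into play. A sketch of the loop logic in plain JavaScript, assuming _id values sort ascending; the real query is shown as a comment, and the page size is arbitrary:

```javascript
// Walk a sorted key space in fixed-size pages, resuming from the last _id seen.
// In the mongo shell, each page would come from a fresh query:
//   db.getCollection('test').find({_id: {$gt: lastId}})
//     .sort({_id: 1}).limit(size).toArray()
function pageThrough(allIds, size, visit) {
  var lastId = null;
  for (;;) {
    var page = allIds
      .filter(function (id) { return lastId === null || id > lastId; })
      .slice(0, size);
    if (page.length === 0) break;    // no documents left
    page.forEach(visit);             // e.g. issue the $set update per _id
    lastId = page[page.length - 1];  // resume point for the next query
  }
}

var seen = [];
pageThrough([1, 2, 3, 4, 5], 2, function (id) { seen.push(id); });
```

Because each page is a new query that finishes quickly, a crash mid-run leaks nothing, and the job can resume from the last _id it recorded.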

Best regards
Kevin
