disco and ddfs blacklisting

dtho...@mozilla.com

Oct 6, 2014, 5:53:06 PM

to disc...@googlegroups.com

Hello,

I have a disco cluster running in AWS, made up of 8 slave nodes and 1 master. I need to replace all 8 slave nodes with 8 new slave nodes. My plan was to create the new nodes and add them to the cluster, then blacklist the old nodes one at a time, waiting for each to turn green in the web interface before blacklisting another node.

I got as far as blacklisting my first node (for both disco and ddfs), but replication seems to be taking forever to complete. I'm manually triggering garbage collection runs, each of which seems to take less than a minute, but I've done this at least a dozen times and I feel like I'm doing something wrong. I looked in the master logs to determine the replication status and found that each garbage collection run only seems to replicate 114 blobs. Each of the nodes has as many as 7k blobs, and I don't want to have to trigger 70 GC runs just to speed up migration of the node.

Is there any way to force replication to actively continue until I'm back at 3 replicas per blob?

Also, if I blacklist all 8 old nodes at once (in both ddfs and disco, or just in ddfs), will the master replicate data from the blacklisted nodes to the new nodes, or will replication fail because I have no non-blacklisted sources for the blobs?

-Daniel Thornton

Shayan Pooya

Oct 6, 2014, 10:21:49 PM

to disc...@googlegroups.com

Hello Daniel,

1. I have to investigate the small number of blobs replicated in each GC run. Would you please tell me which version you are using?

2. Disco should be able to recover the blobs from blacklisted nodes, so you should be able to blacklist all eight nodes and then run the GC. However, please note that this is not tested routinely and there might be a bug that results in data loss. I can test this and get back to you. If you decide to follow this path, make sure to check the logs for the following message:
"GC: all replicas missing for"

which will be shown for any lost data.

3. You can definitely automate the cycle of blacklisting one node, running the GC until the node is safe to remove, and then moving on to the next node. All of these steps are accessible through the REST API. However, this should be the last resort.
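
For illustration only, the cycle described in point 3, together with the log check from point 2, might be sketched roughly as follows. This is not Disco code: the thread does not name the REST endpoints, so `blacklist`, `run_gc`, `is_safe_to_remove`, and `read_log` are hypothetical callables that you would have to wire to your own cluster. The only string taken from the thread is the log message to watch for.

```python
import time
from typing import Callable, Dict, Iterable, List

# Log message quoted in the reply above; it is reported for lost data.
LOST_DATA_MARKER = "GC: all replicas missing for"


def lost_data_lines(log_lines: Iterable[str]) -> List[str]:
    """Return the log lines that contain the lost-data message."""
    return [line.rstrip("\n") for line in log_lines if LOST_DATA_MARKER in line]


def drain_node(
    node: str,
    blacklist: Callable[[str], None],           # hypothetical: blacklists the node
    run_gc: Callable[[], None],                 # hypothetical: runs one GC to completion
    is_safe_to_remove: Callable[[str], bool],   # hypothetical: node fully replicated?
    max_runs: int = 200,
    pause: float = 0.0,
) -> int:
    """Blacklist one node, then repeat GC runs until it is safe to remove.

    Returns the number of GC runs that were needed.
    """
    blacklist(node)
    for runs in range(1, max_runs + 1):
        run_gc()
        if is_safe_to_remove(node):
            return runs
        time.sleep(pause)  # optional breathing room between runs
    raise RuntimeError(f"{node} still not safe to remove after {max_runs} GC runs")


def drain_nodes(
    nodes: Iterable[str],
    blacklist: Callable[[str], None],
    run_gc: Callable[[], None],
    is_safe_to_remove: Callable[[str], bool],
    read_log: Callable[[], Iterable[str]],      # hypothetical: yields master log lines
    **kwargs,
) -> Dict[str, int]:
    """Drain nodes one at a time, stopping if the log reports lost data."""
    runs_per_node: Dict[str, int] = {}
    for node in nodes:
        runs_per_node[node] = drain_node(
            node, blacklist, run_gc, is_safe_to_remove, **kwargs
        )
        lost = lost_data_lines(read_log())
        if lost:
            raise RuntimeError(f"lost data reported after draining {node}: {lost[0]}")
    return runs_per_node
```

Passing the cluster operations in as callables keeps the loop independent of any particular endpoint or client library, and lets it be tried against a fake cluster before touching a real one. With the figures from the original question (about 7k blobs on a node and 114 blobs replicated per run), such a loop would need roughly 62 GC runs per node.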

