Hi Chris,
First of all, thank you very much for SeaweedFS. We have been using it in production for a few weeks with roughly 1 billion files in it. We chose it for its simple yet powerful design and API.
However, I’d like to discuss some issues we encountered during that short period of migration and usage:
1. We use 3 masters to provide master failover. We currently plan to switch back to a single master with VRRP on two machines for failover. The reason is that twice within one week we had a split-brain situation in which one master did not “find” the other masters. This resulted in 1 volume server attached to master 1 and 3 volume servers attached to master 2. This has the following effects:
When doing round-robin among masters, about ⅓ of all requests will fail.
Both masters start to create new volumes whose IDs can overlap. In our case we had one volume ID created with different collections and different replication settings. When masters and volumes are reconciled again, this leads to write and/or read failures in the best case and to unpredictable inconsistencies in the worst case.
Generated file IDs can overlap, which will produce inconsistencies on reconciliation.
No matter why this happened - it happened. So I / we have to deal with it.
Proposal:
In a multi-master environment, each master should have an ID and/or know the total number of masters, so that each master can create globally unique volume and file IDs.
Example:
Master 1 creates 1, 4, 7
Master 2 creates 2, 5, 8
Master 3 creates 3, 6, 9
We have used this scheme for many years and billions of records for sequence generation. MySQL can be configured like that to create auto-increments.
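To make the numbering scheme concrete, here is a minimal sketch in Python. It assumes each master is configured with a 1-based ID and the total number of masters; the class and field names are made up for illustration and are not part of SeaweedFS. It follows the same idea as MySQL's auto_increment_increment and auto_increment_offset settings.

```python
from dataclasses import dataclass


@dataclass
class InterleavedSequence:
    """Hands out IDs from one master's private slice of the number line."""
    master_id: int        # 1-based ID of this master (assumed config value)
    master_count: int     # total number of masters in the cluster
    last_issued: int = 0  # highest ID this master has issued so far

    def __post_init__(self) -> None:
        if not 1 <= self.master_id <= self.master_count:
            raise ValueError("master_id must be in 1..master_count")

    def next_id(self) -> int:
        # Find the smallest value greater than last_issued that is
        # congruent to master_id modulo master_count.
        gap = (self.master_id - self.last_issued - 1) % self.master_count
        self.last_issued += gap + 1
        return self.last_issued


if __name__ == "__main__":
    for master in (InterleavedSequence(i, 3) for i in (1, 2, 3)):
        print(master.master_id, [master.next_id() for _ in range(3)])
```

Because the slices never intersect, two masters that cannot see each other still cannot hand out the same ID. Changing the number of masters later would need a coordinated switch, which this sketch does not cover.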
2. Seaweed does not guarantee consistency when a node is down. This applies to deletes ALL the time, and to puts in the period between a node going down and the master recognizing that and marking the volumes as unwritable. The same applies if the replication of a file fails for some other reason: the server then returns a 500 but does not ensure consistency, so the uploaded file exists on the coordinator volume but not on one or more failed replicas.
For this situation a repair process is required. I have built one with external tools, which works more or less, but it could be much more efficient and safer if done in SW itself.
There are several ways to guarantee eventual consistency. If we take a look at Cassandra, this can be achieved by hinted handoffs and/or a repair process.
As a first step, a repair would probably be easier to implement. As far as I can see, you also write deletes to the index file as a kind of tombstone, which also makes it possible to repair deletes.
Rough repair process (with incremental repair):
set repair_time = now()
read last_repair_time
foreach volume:
    list volume index of all replicas where
        timestamp < repair_time (avoids race conditions)
        timestamp >= last_repair_time
    build diff based on timestamp + crc
    stream files and deletions with original timestamp + remaining ttl
    set last_repair_time = repair_time on volume
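The "build diff" step of the process above could look roughly like the following Python sketch. It assumes that each replica's index can be listed as a mapping from file key to an entry carrying a timestamp, a CRC and a tombstone flag; IndexEntry and build_diff are hypothetical names, not existing SeaweedFS code. It also assumes "newest timestamp wins", with a tombstone winning a timestamp tie.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class IndexEntry:
    timestamp: int         # time of the write or delete (assumed available)
    crc: int               # checksum of the stored content
    deleted: bool = False  # True if this entry is a tombstone


Index = Dict[str, IndexEntry]             # file key -> entry
Action = Tuple[str, str, IndexEntry]      # (target replica, file key, entry)


def build_diff(replicas: Dict[str, Index],
               last_repair_time: int,
               repair_time: int) -> List[Action]:
    """Return the entries that must be streamed to each replica."""
    # Restrict every replica's index to the repair window.
    windowed = {
        name: {key: entry for key, entry in index.items()
               if last_repair_time <= entry.timestamp < repair_time}
        for name, index in replicas.items()
    }
    all_keys = set().union(*(idx.keys() for idx in windowed.values()))

    actions: List[Action] = []
    for key in sorted(all_keys):
        candidates = [idx[key] for idx in windowed.values() if key in idx]
        # Assumption: newest timestamp wins, tombstone wins a tie,
        # CRC is only an arbitrary but deterministic final tie-break.
        newest = max(candidates,
                     key=lambda e: (e.timestamp, e.deleted, e.crc))
        for name, full_index in replicas.items():
            current = full_index.get(key)
            if current == newest:
                continue  # replica already has this exact version
            if current is not None and current.timestamp > newest.timestamp:
                continue  # replica is newer; left for the next repair run
            actions.append((name, key, newest))
    return actions
```

Each returned action says which entry (file or deletion) would have to be streamed to which replica, keeping its original timestamp. Handling of the remaining TTL is left out here.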
To improve performance of incremental repairs, and maybe to maintain backwards compatibility (if necessary), you could create a separate index file (like a commit log) for repairs, which can be truncated to timestamp > repair_time after each repair.
To make compaction safe, there should either be a grace period for tombstones, or the compaction should only throw away tombstones after they have been repaired (tombstone_timestamp < last_repair_time). That way it can be guaranteed that there are no stale deletes around.
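The second variant of the compaction rule could be sketched like this in Python. The entry layout (key, timestamp, is_tombstone) and the function name are assumptions for illustration only, and the sketch shows just the tombstone rule, not the removal of overwritten or deleted file data that a real compaction also performs.

```python
from typing import Iterable, List, Tuple

Entry = Tuple[str, int, bool]  # (file key, timestamp, is_tombstone)


def drop_repaired_tombstones(entries: Iterable[Entry],
                             last_repair_time: int) -> List[Entry]:
    """Keep everything except tombstones older than the last repair."""
    kept: List[Entry] = []
    for key, timestamp, is_tombstone in entries:
        if is_tombstone and timestamp < last_repair_time:
            # A completed repair has already propagated this delete,
            # so dropping it cannot resurrect the file on a replica.
            continue
        kept.append((key, timestamp, is_tombstone))
    return kept
```

A tombstone written at or after last_repair_time is kept, because some replica may not have seen the delete yet.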
Unfortunately, I cannot build an external repair tool with the tools SW currently provides, as I have no access to tombstones with “weed export”.
3. Unfortunately, it is not possible to delete or move a volume online. I think this would be very easy to implement with a few routes:
http://vol_server/admin/volume/delete: Does what it says. I had many stale volumes around after master split-brain.
http://vol_server/admin/volume/detach: Unregisters the volume from the master, so that the volume deliberately becomes unwritable
http://vol_server/admin/volume/attach: Registers a volume so that it becomes available without vol_server restart
This would make it possible to transfer volumes across servers e.g. for node-rebalancing.
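As an illustration, moving a volume with the three proposed routes might follow the order sketched below in Python. The routes are only the proposal from this mail and do not exist yet; the "volume" query parameter and the out-of-band copy step are additional assumptions. The function only builds the plan and sends no requests.

```python
from typing import List, Tuple
from urllib.parse import urlencode


def plan_volume_move(source: str, target: str,
                     volume_id: int) -> List[Tuple[str, str]]:
    """Return the ordered steps for moving one volume between servers."""
    # Hypothetical parameter name; the proposal does not define one.
    query = urlencode({"volume": volume_id})
    return [
        # 1. Make the volume unwritable so its files stop changing.
        ("detach", f"http://{source}/admin/volume/detach?{query}"),
        # 2. Copy the volume files by any external means (assumption).
        ("copy", f"copy volume {volume_id} from {source} to {target}"),
        # 3. Register the copied volume on the target without a restart.
        ("attach", f"http://{target}/admin/volume/attach?{query}"),
        # 4. Remove the now stale copy from the source server.
        ("delete", f"http://{source}/admin/volume/delete?{query}"),
    ]


if __name__ == "__main__":
    for step, detail in plan_volume_move("vol1:8080", "vol2:8080", 7):
        print(step, detail)
```

Deleting on the source comes last, so a failure in any earlier step still leaves an intact copy of the volume.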
Summary
Points 1 and 2 are critical for us, and I can imagine they are for other companies and projects as well. In fact, every distributed system should have one or more ways to recover completely and consistently from system failures. I think these proposals (esp. 2.) would also be helpful when planning to implement async replication or even something like “tunable consistency” as known from Cassandra, which lets the admin decide where to position the system with respect to the CAP theorem.
What do you think of all that?
I’d love to contribute to that by writing all the code, but unfortunately 1. I am not very familiar with Go, and 2. I have very little time to maintain more projects.
I hope you appreciate my feedback, and I look forward to hearing your opinion on it.
Thanks,
Benjamin
--
You received this message because you are subscribed to a topic in the Google Groups "Seaweed File System" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/seaweedfs/v-i5VtlTf_U/unsubscribe.
To unsubscribe from this group and all its topics, send an email to seaweedfs+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.