Seaweed maintainability + consistency

Benjamin Roth

unread,

Dec 19, 2016, 9:39:25 AM12/19/16

to Seaweed File System

Hi Chris,

First of all, thank you very much for SeaweedFS. We use it in production for a few weeks with roughly 1 billion files in it. We chose it for its simple, yet powerful design and API.

However I’d like to discuss some issues we encountered during that short period of migration and usage:

1. Master split-brain CRITICAL

We use 3 masters to provide master-failover. Currently we plan to switch back to a single master with VRRP on two machines for failover. The reason it occured two times in 1 week that we had a split brain situation where 1 master did not “find” the other masters. This resulted in 1 volume servers attached to master 1 and 3 volume servers on master 2. This has the following effects:

When doing round-robin among masters, mostly ⅓ of all request will fail
Both masters start to create new volume which can overlap. In our case we had one volume id with different collections and different replications created. When masters and volumes are reconciled again, this leads to write and or read fails in best case and to unpredictable inconsistencies in the worst case.
Generated file ids can overlap which will produce inconsistencies on reconciliation

No matter why this happened - it happened. So I / we have to deal with it.

Proposal:

In a multi-master environment, each master should have an ID and or know the total amount of masters. So each master can create globally unique volume + file ids.

Example:

Master 1 creates 1, 4, 7

Master 2 creates 2, 5, 8

Master 3 creates 3, 6, 9

We used this behavior for many many years and billions of records for sequence generation. MySQL can be configured like that to create auto-increments.

2. Volume repair and eventual consistency - SOMEHOW CRITICAL

Seaweed does not guarantee consistency when a node is down. This applies to deletes ALL the time and puts in the period between a node goes down and the master recognizes that and marks the volumes as unwriteable. Or: If the replication of a file fails for some other reason. The server returns a 500 then but it does not assert consistency, so the uploaded file exists on the coordinator volume but not on one or more failed replicas.

For this situation a repair process is required. I have built one with external tools, which somehow works but it could be much more efficient and secure when done in SW.

There are several ways to guarantee eventual consistency. If we take a look at Cassandra, this can be achieved by hinted handoffs and or a repair process.

As a first step, a repair would probably be easier to implement. As far as I can see you also write deletes to index file as a kind of tombstone what also make it possible to repair deletes.

Rough repair process (with incremental repair):

set repair_time = now()
read last_repair_time
foreach volume:
list volume index of all replicas where

timestamp < repair_time (avoids race conditions)
timestamp >= last_repair_time

build diff based on timestamp + crc
stream files and deletions with original timestamp + remaining ttl
set last_repair_time on volume

To improve performance on incremental repairs and maybe to maintain backwards compatibility (if necessary) you could create a separate index file (like a commit log) for repairs which can be truncated to timestamp > repair_time after each repair.

To make compaction safe, there should be either a grace period for tombstones or the compaction should should only throw away tombstones after they have been repaired (tombstone_timestamp < last_repair_time). So it can be guaranteed, there are no stale deletes around.

Unfortunately I cannot build an external repair tool with the tools SW provides currently as I have no access to tombstones with “weed export”.

3. Maintenance helpers - HELPFUL

Unfortunately it is not possible to delete or move a volume online. I think this would be very easy to implement with a few routes:

http://vol_server/admin/volume/delete: Does what it says. I had many stale volumes around after master split-brain.

http://vol_server/admin/volume/detach: Unregisters the volume at the master, so that the volume becomes unwritable on purpose

http://vol_server/admin/volume/attach: Registers a volume so that it becomes available without vol_server restart

This would make it possible to transfer volumes across servers e.g. for node-rebalancing.

Summary

1. + 2. is critical for us and I can imagine it is also for other companies and projects. Actually every distributed system should have one or more ways to completely recover from system failures in a consistent way. I think these proposals (esp. 2.) would be also helpful when planning to implement async replication or even something like “tunable consistency” as known from Cassandra to let the admin decide where to place yourself on the CAP theorem.

What do you think of all that?

I’d love to contribute to that by writing all the code but unfortunately 1. I am not very familiar with go, 2. I have very little time to maintain more projects.

I hope you appreciate my feedback and I am excited about your opinion on it.

Thanks,

Benjamin

ChrisLu

unread,

Dec 19, 2016, 4:06:57 PM12/19/16

to Seaweed File System

Hi, Benjamin,

Thanks for your detailed suggestions!

Please file issues on github and we can address them accordingly.

In general, I feel one master should be solid enough. Current master cluster is only proxying traffic to the elected master. I wonder how much traffic the master is handling in your case.

Chris

Benjamin Roth

unread,

Dec 19, 2016, 4:34:39 PM12/19/16

to seaw...@googlegroups.com

Hi Chris,

Thanks for your response! I will create the issues later.

Master:

As you said, the current multi-master setup won't help in a situation where the master is overloaded. For us it is only a failover to avoid a SPOF. We thought, if SW offers this out of the box, we will use it but it turned out that it is not 100% rock solid. But we absolutely need a way to switch over to a different master process in case of failure. This e.g. includes scheduled downtimes of the master host.

Question: Can that (safely and reliably) be achieved by using VRRP / Heartbeat to switch over to a clean master or is there some data to be replicated among master instances to be sure there are no duplicate file ids or volume ids?

Master performance:

The limit where a single master is overloaded is indeed very high but it can be reached, because:

The master has to be queried for every single read operation OR the query has to be cached somewhere else. If you have VERY much read requests, there is actually a limit. Not only by CPU (yes, master is a very light process) but also by network. On the network side, not the bandwidth is a problem but the amount of packages _may_ become one - depending on your hardware and drivers. We actually brought memcache (also a very light process) to its limits with a single queue network interface as a single CPU was not able to handle the incoming software interrupts any more that were produced by the network driver. That memcache node had maybe 50k-100k req/s or sth like that. The limit on a modern network card with good drivers is approx 1million packets / s.

So there is a theoretical limit which actually could still be avoided if there were several masters with independant increments (fids + volumeids) and all vol servers connect to each master. So there would be absolutely no SPOF and still a scalable solution.

- This is nothing I would consider as critical right now. I don't think we will hit that limit in the near future. I was just thinking out loud.

But: Avoiding a SPOF is absolutely crucial for every distributed system.

Btw.: We roughly have 500-600 req/s on the vol servers in total right now, but we cache all hot requests in memcache, which is done by nginx. I cannot tell exactly how many master req/s we have or how much they were if we did not cache them. I guess it's around 5k-6k req/s - so nothing that overloads a single master. In the past it turned out that it's worth to cache all the requests. Just for the records: We used mogilefs for many years and now migrated to seaweed.

--
You received this message because you are subscribed to a topic in the Google Groups "Seaweed File System" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/seaweedfs/v-i5VtlTf_U/unsubscribe.
To unsubscribe from this group and all its topics, send an email to seaweedfs+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--

Benjamin Roth
Prokurist

Jaumo GmbH · www.jaumo.com
Wehrstraße 46 · 73035 Göppingen · Germany
Phone +49 7161 304880-6 · Fax +49 7161 304880-1
AG Ulm · HRB 731058 · Managing Director: Jens Kammerer

Chris Lu

unread,

Dec 19, 2016, 5:00:31 PM12/19/16

to Seaweed File System

Again, thanks for all the details!

To avoid overloading the master, the volumeId~locations lookup can be cached.

To make fast master switch, there is only one major meta information: the high watermark of assigned volume id. The other states are all soft state.

"the high watermark of assigned volume id" is actually also soft state. But if switched very fast, there would be duplicated volumes created. This is where the split brain problem comes from. It's just one single number. Need some ideas on how to make it work and keep it simple.

Chris

--
You received this message because you are subscribed to the Google Groups "Seaweed File System" group.
To unsubscribe from this group and stop receiving emails from it, send an email to seaweedfs+unsubscribe@googlegroups.com.

Benjamin Roth

unread,

Dec 20, 2016, 1:40:02 PM12/20/16

to seaw...@googlegroups.com

I created issues #418, #419, #420 on that.

For fast master switching / vol-id creation:

It would be best to not use an increment for volume ids. Instead UUIDs or just a unique but maybe not so long number could be used.

I could imagine of something like unix timestamp + per node modulo increment with max value encoded in 0-9,a-z for shorter URLs.

A simple increment is always exposed to race conditions and race conditions are always bad.
An offset increment per node (1, 4, 7 + 2, 5, 8 + 3, 6, 9, ...) works until the number of masters changes or the initial value is lost. This either also requires a hard state or introduces a race condition again. Not the same but similar to 1.
A predictable and unique offset helps to reduce risks if soft state is lost. Current timestamp is always larger than the last vol created. To allow creation of multiple volumes per timestamp (depending on timestamp resulution) you can add an increment value (like in 2.). So different masters can create volumes at the same time without ever reusing an existing vol id or to have a race condition if 2 split brain masters create a volume at the same time(stamp).

I don't know if it is completely backwards compatible as this changes the vol id from int to string. On the other side, a fileId is also always a string and normally stored as a single reference to a file.

What do you think?

Reply all

Reply to author

Forward