delete consistency during volume failure

Benjamin Roth

Nov 23, 2016, 1:27:02 PM
to Seaweed File System
Hi,

We are about to run Seaweed in production - so first thank you for building such a great product!
Today I ran some failure tests and I noticed that partially failed deletes are not recovered.

Example:
- 3 Volumes A, B, C with RF=3
- I store a file replicated on all volumes
- Stop Volume A
- Delete file on Volume B
- Start Volume A
- File is deleted on B + C but still exists on A
- Volume B logs a lot of errors like: "I1123 18:02:48  2162 store_replicate.go:117] replicating opetations [2] is less than volume's replication copy count [3]"

There does not seem to be anything like a persistent commit log that could recover replication when node A comes back.
This leads to inconsistencies that can never be fixed. Even if I copied Volume B or C over to A, Volume A is down during the copy process, so all deletes issued during the copy are lost again.

The only way I see to hack around this is to do some checks in my application, like:
- If a delete to a volume fails (e.g. the volume failure has not been propagated to the master yet), enqueue the delete and retry later
- If a lookup returns fewer volumes than expected (the application knows the replication count from some config), enqueue the delete and retry later
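
The two checks above could be wired into a small retry queue on the application side. The following Python sketch is not part of Seaweed: `DeferredDeleteQueue`, `delete_fn`, and `retry_delay` are hypothetical names, and `delete_fn` is assumed to return True only when the delete reached every replica. The queue lives in memory only, so a real deployment would have to persist it somewhere.

```python
import time
from collections import deque
from typing import Callable, Deque, Tuple


class DeferredDeleteQueue:
    """Application-side queue for deletes that did not reach every replica."""

    def __init__(self, delete_fn: Callable[[str], bool],
                 retry_delay: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._delete_fn = delete_fn      # hypothetical: True only on full success
        self._retry_delay = retry_delay  # seconds to wait before the next attempt
        self._clock = clock
        self._pending: Deque[Tuple[float, str]] = deque()

    def delete(self, fid: str) -> bool:
        """Try the delete now; enqueue it for a later retry if it failed."""
        if self._try(fid):
            return True
        self._pending.append((self._clock() + self._retry_delay, fid))
        return False

    def retry_due(self) -> int:
        """Retry all queued deletes that are due; return how many succeeded."""
        now = self._clock()
        succeeded = 0
        for _ in range(len(self._pending)):
            due, fid = self._pending.popleft()
            if due > now:
                self._pending.append((due, fid))  # not due yet, keep waiting
            elif self._try(fid):
                succeeded += 1
            else:
                self._pending.append((now + self._retry_delay, fid))
        return succeeded

    def pending(self) -> int:
        """Number of deletes still waiting for a successful retry."""
        return len(self._pending)

    def _try(self, fid: str) -> bool:
        try:
            return bool(self._delete_fn(fid))
        except OSError:  # network errors count as a failed attempt
            return False
```

A background thread or cron-style job would call `retry_due()` periodically. Deletes are idempotent from the application's point of view, so retrying one that already succeeded on some replicas should be harmless, but that is an assumption to verify against the volume server's behaviour.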

But Seaweed neither tells me that something went wrong nor blocks the delete until all volumes are available again.

So my question:
1. Are my observations correct?
2. Are there plans to implement hinted handoff? It should be enough to buffer hints for a certain (configurable) period like 1h, just long enough to copy the volumes over. For comparison, Cassandra stops storing hints when a node has been offline for more than 3h.
3. Any other recommendations on how to deal with this situation?
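
To make the hinted-handoff idea in question 2 concrete, here is a minimal Python sketch of a hint buffer with a bounded window. According to this thread Seaweed has no such feature; `HintBuffer`, `record`, and `replay` are hypothetical names, and expiring hints by their age is a simplification of how a real system would bound them.

```python
import time
from collections import defaultdict
from typing import Callable, Dict, List, Tuple


class HintBuffer:
    """Remembers deletes a down node missed, for a limited time window."""

    def __init__(self, window_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._window = window_seconds
        self._clock = clock
        self._hints: Dict[str, List[Tuple[float, str]]] = defaultdict(list)

    def record(self, node: str, fid: str) -> None:
        """Store a hint: `node` was down when the delete of `fid` happened."""
        self._hints[node].append((self._clock(), fid))

    def replay(self, node: str, apply: Callable[[str], bool]) -> List[str]:
        """Replay unexpired hints for a recovered node; return applied fids."""
        now = self._clock()
        applied: List[str] = []
        keep: List[Tuple[float, str]] = []
        for recorded_at, fid in self._hints.pop(node, []):
            if now - recorded_at > self._window:
                continue  # too old: the node needs a full volume copy instead
            if apply(fid):
                applied.append(fid)
            else:
                keep.append((recorded_at, fid))  # retry on the next replay
        if keep:
            self._hints[node] = keep
        return applied
```

The window only has to cover the time needed to bring the node back or copy its volumes over; anything older is dropped, which is why a full copy remains the fallback.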

Thanks in advance for any helpful answer :)
Benjamin

Chris Lu

Nov 23, 2016, 8:39:22 PM
to Seaweed File System
1. Correct.
2. The volumes will become read-only if any replica is down for maintenance.
3. The recommendation is to ask the weed master to assign a writable volume to write to.

Chris

--
You received this message because you are subscribed to the Google Groups "Seaweed File System" group.
To unsubscribe from this group and stop receiving emails from it, send an email to seaweedfs+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Benjamin Roth

Nov 23, 2016, 11:52:11 PM
to seaw...@googlegroups.com

Thanks for your reply. But I think you misunderstood me.
I was talking about deletes. I cannot delete a file from an arbitrary volume. I have to delete it from a specific volume, even if that volume is not completely available, or I have to postpone the delete.
What you described is a different use case, I guess.


Chris Lu

Nov 24, 2016, 2:10:32 AM
to Seaweed File System
I see. You can send "http://server:port/xxxx?type=replicate" to the volume server itself to delete the entry directly.
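
As a rough illustration of this suggestion, the Python sketch below sends the delete to each replica's volume server individually and reports which ones failed. Several things here are assumptions not confirmed in the thread: that `xxxx` stands for the file id, that the request uses the HTTP DELETE method, and that any 2xx status means success. The helper names (`replica_delete_url`, `send_delete`, `delete_on_each_replica`, `failed_replicas`) are hypothetical.

```python
import urllib.request
from typing import Callable, Dict, Iterable, List


def replica_delete_url(server: str, fid: str) -> str:
    """Build the per-replica URL; assumes 'xxxx' in the thread is the file id."""
    return f"http://{server}/{fid}?type=replicate"


def send_delete(url: str, timeout: float = 5.0) -> bool:
    """Issue an HTTP DELETE (assumed method); True on any 2xx response."""
    req = urllib.request.Request(url, method="DELETE")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:  # covers URLError, HTTPError, and socket timeouts
        return False


def delete_on_each_replica(servers: Iterable[str], fid: str,
                           send: Callable[[str], bool] = send_delete
                           ) -> Dict[str, bool]:
    """Send the delete to every replica; map each server to its outcome."""
    return {server: send(replica_delete_url(server, fid)) for server in servers}


def failed_replicas(results: Dict[str, bool]) -> List[str]:
    """Servers whose delete did not succeed and should be retried later."""
    return [server for server, ok in results.items() if not ok]
```

The caller has to know all replica addresses itself, including the one that is currently down, since a replica that is offline will simply show up in `failed_replicas` and must be retried once it is back.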



Chris

Benjamin Roth

Nov 24, 2016, 2:55:06 AM
to seaw...@googlegroups.com
Ok thanks!
But there are one or two things left:
  1. Can I ask the master if a volume is healthy? As with assign, it would be great to have a field in the lookup response that says whether a volume is healthy or a node is failing. I guess the master has to know that anyway, as in the assign case, where a failed volume is not eligible for assignment. An application would then be able to detect whether a volume is ready for a delete and either deny or defer the delete in case of failure.
  2. What about the short period between a volume failure and the moment the master recognizes that the volume is down? I noticed that this may take some time, like 10s or so. What if I do a write or a delete in that period? I guess both the write and the delete are lost, right?
A little more documentation in the wiki on failure cases would be perfect. Most probably it would save people like me from asking questions like this :D

Benjamin Roth

Chris Lu

Nov 24, 2016, 3:44:54 AM
to Seaweed File System
1. The master does not know. Only healthy volume servers send heartbeats to the master. But the master does know when the number of replica copies is less than required.
2. Correct. This should not happen often, and you can increase the heartbeat frequency to mitigate the problem if it seems critical.
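
Since the master only knows how many replicas are currently reporting, an application can compare that number with the replication count it expects before issuing a delete. The Python sketch below assumes the master's `/dir/lookup?volumeId=...` endpoint returns JSON with a `locations` list whose entries have a `url` field; that response shape should be checked against the Seaweed version in use. `fetch_json`, `lookup_locations`, and `fully_replicated` are hypothetical helper names.

```python
import json
import urllib.parse
import urllib.request
from typing import Any, Callable, Dict, List


def fetch_json(url: str, timeout: float = 5.0) -> Dict[str, Any]:
    """GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def lookup_locations(master: str, volume_id: str,
                     fetch: Callable[[str], Dict[str, Any]] = fetch_json
                     ) -> List[str]:
    """Return the volume servers the master currently lists for a volume."""
    query = urllib.parse.urlencode({"volumeId": volume_id})
    body = fetch(f"http://{master}/dir/lookup?{query}")
    # Assumed response shape: {"locations": [{"url": "host:port"}, ...]}
    return [loc["url"] for loc in body.get("locations") or []]


def fully_replicated(master: str, fid: str, expected_copies: int,
                     fetch: Callable[[str], Dict[str, Any]] = fetch_json
                     ) -> bool:
    """True if the master lists at least `expected_copies` replicas."""
    volume_id = fid.split(",", 1)[0]  # file ids look like "<volumeId>,<key>"
    return len(lookup_locations(master, volume_id, fetch)) >= expected_copies
```

This check cannot close the window described in point 2: until the master misses a heartbeat it may still list a server that is already down, so a True result is only a hint, and failed deletes still need to be deferred and retried.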

Chris