Sporadic errors during replication with larger blobs & consistency properties of async filer replication


Thilo-Alexander Ginkel

Nov 28, 2020, 12:32:10 PM
to seaw...@googlegroups.com
Hi there,

while experimenting with SeaweedFS (via S3) I noticed that, when
uploading somewhat larger blobs (~4 MiB) with the MinIO warp
benchmark, the volume process sporadically logs an error during
asynchronous replication:

-- 8< --
E1128 16:59:57 1 upload_content.go:234] upload 4194304 bytes to http://10.132.15.198:8080/46,06edfbb86bdefa?ts=1606582797&ttl=&type=replicate: Post http://10.132.15.198:8080/46,06edfbb86bdefa?ts=1606582797&ttl=&type=replicate: read tcp 172.17.0.3:37852->10.132.15.198:8080: read: connection reset by peer
goroutine 1011466 [running]:
runtime/debug.Stack(0x109, 0x0, 0x0)
        /usr/lib/go/src/runtime/debug/stack.go:24 +0x9d
runtime/debug.PrintStack()
        /usr/lib/go/src/runtime/debug/stack.go:16 +0x22
github.com/chrislusf/seaweedfs/weed/operation.upload_content(0xc000a1c200, 0x4d, 0xc0012d3928, 0x0, 0x0, 0x100, 0x400000, 0x0, 0x0, 0xc0012d3ba0, ...)
        /go/src/github.com/chrislusf/seaweedfs/weed/operation/upload_content.go:235 +0xd36
github.com/chrislusf/seaweedfs/weed/operation.doUploadData(0xc000a1c200, 0x4d, 0x0, 0x0, 0xc000a1c200, 0xc00be9e000, 0x400000, 0x7ffe00, 0x0, 0x0, ...)
        /go/src/github.com/chrislusf/seaweedfs/weed/operation/upload_content.go:169 +0x49d
github.com/chrislusf/seaweedfs/weed/operation.retriedUploadData(0xc000a1c200, 0x4d, 0x0, 0x0, 0x0, 0xc00be9e000, 0x400000, 0x7ffe00, 0xc000ac9000, 0x0, ...)
        /go/src/github.com/chrislusf/seaweedfs/weed/operation/upload_content.go:96 +0x1ba
github.com/chrislusf/seaweedfs/weed/operation.UploadData(...)
        /go/src/github.com/chrislusf/seaweedfs/weed/operation/upload_content.go:69
github.com/chrislusf/seaweedfs/weed/topology.ReplicatedWrite.func1(0xc007dfe7a0, 0x12, 0xc007dfe7c0, 0x12, 0x7243a3, 0xc0035f8940)
        /go/src/github.com/chrislusf/seaweedfs/weed/topology/store_replicate.go:85 +0x670
github.com/chrislusf/seaweedfs/weed/topology.distributedOperation.func1(0xc00097a2d0, 0xc007dfe7a0, 0x12, 0xc007dfe7c0, 0x12, 0xc001b2c300)
        /go/src/github.com/chrislusf/seaweedfs/weed/topology/store_replicate.go:152 +0x55
created by github.com/chrislusf/seaweedfs/weed/topology.distributedOperation
        /go/src/github.com/chrislusf/seaweedfs/weed/topology/store_replicate.go:151 +0xda
W1128 16:59:57 1 upload_content.go:100] uploading to http://10.132.15.198:8080/46,06edfbb86bdefa?ts=1606582797&ttl=&type=replicate: upload 4194304 bytes to http://10.132.15.198:8080/46,06edfbb86bdefa?ts=1606582797&ttl=&type=replicate: Post http://10.132.15.198:8080/46,06edfbb86bdefa?ts=1606582797&ttl=&type=replicate: read tcp 172.17.0.3:37852->10.132.15.198:8080: read: connection reset by peer
-- 8< --

The volume process on the other end does not log an error. Will this
be retried and compensated for?

Another thought came up regarding the consistency properties of
multiple filer instances backed by a leveldb2 store and configured for
active-active replication (via `-peers`): Am I correct in assuming
that this replication is eventually consistent? I sometimes observed
that a file that had just been uploaded was not retrievable via `get`
shortly thereafter; the request probably arrived at a different filer
instance. Is there a way to compensate for this (other than setting up
a "real" distributed filer store)?

Thanks,
Thilo

Chris Lu

Nov 28, 2020, 4:16:17 PM
to Seaweed File System
1. Yes. They will be retried. Please let me know the details of how to reproduce this.
2. The metadata is asynchronously replicated. Could you make the clients sticky to a filer? Maybe nginx?

Chris


Thilo-Alexander Ginkel

Nov 29, 2020, 11:24:41 AM
to seaw...@googlegroups.com
Hi Chris,

thanks for your reply!

On Sat, Nov 28, 2020 at 10:16 PM Chris Lu <chri...@gmail.com> wrote:
> 1. Yes. They will be retried. Please let me know the details of how to reproduce this.

a) Set up a cluster as described in [3]
b) Run run-warp.sh (from [2], you'll need to replace the hostnames
with those from your setup) like this:

run-warp.sh 4 list --concurrent 5 --objects 100000 --obj.size=4KiB
--duration 60s

This makes use of the MinIO warp benchmark [1].
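
In case it helps, the wrapper essentially boils down to a direct warp
invocation along these lines (endpoint and credentials are
placeholders; the actual script is in the gist [2]):

warp list --host s3.example.com:8333 \
  --access-key <access-key> --secret-key <secret-key> \
  --concurrent 5 --objects 100000 --obj.size=4KiB --duration 60s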

> 2. The metadata is asynchronously replicated. Could you make the clients sticky to a filer? Maybe nginx?

While this would probably work for a synthetic benchmark, it won't
work in practice when readers and writers are separate processes.
So there is no way to avoid a distributed store?

Thanks,
Thilo

[1] https://github.com/minio/warp
[2] https://gist.github.com/ginkel/fe64ab8ff5f0ab0b4261e577daa4eb09
[3] https://github.com/chrislusf/seaweedfs/issues/1622

Chris Lu

Nov 29, 2020, 8:38:24 PM
to Seaweed File System

> run-warp.sh 4 list --concurrent 5 --objects 100000 --obj.size=4KiB
> --duration 60s


Thanks! I ran it a couple of times and did not see the stack trace yet.

Also, I added it to the wiki.
If you have more benchmark results, please share them as well. By the way, you should be able to edit the wiki directly.
 
> > 2. The metadata is asynchronously replicated. Could you make the clients sticky to a filer? Maybe nginx?
>
> While this would probably work for a synthetic benchmark, it won't
> work out in practice when readers and writers are separate processes.
> So there is no way to avoid a distributed store?

You can put an nginx in front to hash traffic by file path across the filers. See https://stackoverflow.com/questions/31994395/how-to-use-url-pathname-as-upstream-hash-in-nginx

upstream backend {
    hash $request_uri consistent;

    server filer1.example.com;
    server filer2.example.com;
}
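
A minimal server block that routes requests through that upstream
could look like this (the listen port is just an example):

server {
    listen 8888;

    location / {
        proxy_pass http://backend;
        proxy_set_header Host $host;
    }
}
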
Anyway, "Read-your-own-writes" is a feature requiring a true distributed filer store.

Chris

Thilo-Alexander Ginkel

Dec 7, 2020, 9:44:13 AM
to seaw...@googlegroups.com
Hi Chris,

On Mon, Nov 30, 2020 at 2:38 AM Chris Lu <chri...@gmail.com> wrote:
>> > 2. The metadata is asynchronously replicated. Could you make the clients sticky to a filer? Maybe nginx?
>>
>> While this would probably work for a synthetic benchmark, it won't
>> work out in practice when readers and writers are separate processes.
>> So there is no way to avoid a distributed store
>
> You can have an nginx to hash traffic based on file path to different filers. See https://stackoverflow.com/questions/31994395/how-to-use-url-pathname-as-upstream-hash-in-nginx
>
> upstream backend {
>     hash $request_uri consistent;
>
>     server filer1.example.com;
>     server filer2.example.com;
> }

That could be a possible solution, thanks!

> Anyway, "Read-your-own-writes" is a feature requiring a true distributed filer store.

Hm, after reading the Haystack paper I am wondering whether the
proposed O(1) lookup complexity still applies when using a distributed store.

I gave etcd a try (due to its relatively low operational complexity),
but it failed badly: range queries timed out after more than 30
seconds. There may be better-performing alternatives for a distributed
filer store, but with many of the available options I would, IMHO,
lose much of the ease and simplicity of deploying SeaweedFS.

In the end, figuring out how to deal with eventual consistency may be
more worthwhile than trying to implement strong consistency and
sacrificing some of the key properties of SeaweedFS...
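
For our workload that would probably boil down to something as simple
as retrying reads that may have hit a filer whose metadata has not
caught up yet, e.g. (the URL is just a placeholder):

# naive read-after-write workaround: retry briefly on a miss
for i in $(seq 1 5); do
  curl -sf http://filer1.example.com:8888/buckets/warp/obj-123 && break
  sleep 0.2
done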

Regards,
Thilo

Chris Lu

Dec 7, 2020, 1:12:27 PM
to seaw...@googlegroups.com
Can you try Cassandra? 
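
Switching the filer store should mostly be a matter of editing
filer.toml, roughly like this (keyspace and hosts are placeholders,
and you also need to create the keyspace and the filemeta table as
described in the wiki; `weed scaffold -config=filer` prints the full
template with the exact options for your version):

[cassandra]
enabled = true
keyspace = "seaweed_fs"
hosts = [
    "cassandra1:9042",
    "cassandra2:9042",
]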
