Quay and FEATURE_STORAGE_REPLICATION


Frank van Gemeren

Mar 29, 2021, 1:29:39 PM
to quay-sig
Hi all,

I'm currently running the lando version from projectquay. 
I'd like to replicate my images to another S3 bucket.
I've set FEATURE_STORAGE_REPLICATION to "true", but when Quay tries to replicate, it can't find the layers (judging by the code, this looks like a database-related lookup error) and then fails. Something similar happens with the backfill script. Quay itself works perfectly fine as far as I can tell.

Question 1: is there anything that I can do to debug further? The logs unfortunately don't help me as much as I'd like.

Question 2: I'm currently contemplating whether to use native S3 replication instead. Would this give the same result as the FEATURE_STORAGE_REPLICATION feature? Are there any gotchas when doing a failover?

Thank you,
Frank

Daniel Messer

Apr 1, 2021, 6:16:40 AM
to Frank van Gemeren, quay-sig
Hi Frank,

Just to level set: what have you created so far in terms of deployed databases, pods, and load balancers? Quay in geo-replication needs access to a single shared Postgres/MySQL database, and all Quay pods need access to all S3 buckets. Is this the case?

I don't think native S3 replication will work, because it does not allow transparent redirects for clients: they have to connect to either one bucket or the other. So in the Quay use case you would have to connect the two Quay deployments to different buckets. The problem arises when a client requests blobs that haven't been mirrored yet: the S3 lookup will fail, and so will the client. That would keep happening until the S3 replication eventually caught up.
Quay Geo-replication is aware of this case and transparently redirects the requests to the remote bucket that has the original copy of the blob.
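
The difference Daniel describes can be sketched roughly as follows; all class and method names here are hypothetical illustration, not Quay's actual code:

```python
# Rough sketch of the transparent-redirect idea: serve a blob locally when it
# has been replicated, otherwise fall back to a remote location that still has
# the original copy (instead of failing, as a raw S3 lookup would).
# Buckets are modeled as plain dicts mapping blob digest -> bytes.

class GeoReplicatedStorage:
    def __init__(self, local_bucket, remote_buckets):
        self.local = local_bucket        # this site's bucket
        self.remotes = remote_buckets    # buckets in the other regions

    def get_blob(self, digest):
        # Fast path: the blob has already been replicated to this site.
        if digest in self.local:
            return self.local[digest]
        # Fallback: transparently "redirect" to a remote bucket that holds
        # the original copy, so clients never see a missing-blob error.
        for remote in self.remotes:
            if digest in remote:
                return remote[digest]
        raise KeyError(f"blob {digest} not found in any location")

# Usage: a blob pushed in the other region, not yet replicated here,
# is still served successfully.
us_east = {"sha256:aaa": b"layer-a"}
eu_west = {}                             # replication hasn't caught up yet
storage = GeoReplicatedStorage(eu_west, [us_east])
print(storage.get_blob("sha256:aaa"))    # b'layer-a'
```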

HTH,
Daniel



--
Daniel Messer

Product Manager Operator Framework & Quay

Red Hat OpenShift

Frank van Gemeren

Apr 20, 2021, 1:32:19 PM
to quay-sig
Hi Daniel,

Sorry for the late response. Holiday and debugging ;)

We're using kube2iam and the pods can all contact the RDS and S3 buckets.

In our replication config, instead of using names per geographic location like in the examples, we use "logical" names like "default" and "peer", which we switch depending on whether the deployment is running in the primary location or the secondary. Our current assumption is that this switching of the location values in DISTRIBUTED_STORAGE_CONFIG is messing things up.

DISTRIBUTED_STORAGE_CONFIG:
  default:
    - S3Storage
    - {s3_bucket: ${default_bucket}, storage_path: /datastorage/registry}
  peer:
    - S3Storage
    - {s3_bucket: ${peer_bucket}, storage_path: /datastorage/registry}
DISTRIBUTED_STORAGE_DEFAULT_LOCATIONS: [default, peer]
DISTRIBUTED_STORAGE_PREFERENCE: [default, peer]

You can see the mako (templating) variables.
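
One way to picture why the switched location names could break lookups (a toy sketch with invented bucket and blob names; not a claim about how Quay actually resolves locations):

```python
# Toy sketch of the suspected failure mode: the database records a blob's
# storage location by *name* ("default"), but that name resolves to a
# different physical bucket at each site because the template values are
# switched per deployment. All bucket/blob names here are invented.

def resolve(config, location_name):
    # Map a logical location name to the physical bucket it names on this site.
    return config[location_name]

# Rendered DISTRIBUTED_STORAGE_CONFIG (bucket values only) per site:
primary_config   = {"default": "quay-bucket-eu", "peer": "quay-bucket-us"}
secondary_config = {"default": "quay-bucket-us", "peer": "quay-bucket-eu"}

# Physical contents: the blob was only ever uploaded to the EU bucket.
buckets = {"quay-bucket-eu": {"sha256:abc"}, "quay-bucket-us": set()}

recorded_location = "default"  # what the database remembers for the blob

# Resolves to the right bucket at the primary site ...
assert "sha256:abc" in buckets[resolve(primary_config, recorded_location)]
# ... but to the wrong (empty) bucket at the secondary site.
assert "sha256:abc" not in buckets[resolve(secondary_config, recorded_location)]
```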

In the logs, Quay checks whether the file exists at a path in "default", and it indeed does not exist in S3. Since it can't find it, it stops the replication. The same is visible when manually running the backfillreplication script.

Questions:
- Does this variable switcheroo look like a plausible reason why we don't see any replication?
- Are the names in DISTRIBUTED_STORAGE_CONFIG (in our case "default" and "peer") stored in the database in some way, for example as lookup keys? We'd like to bring back clarity by being explicit about primary and secondary in the config, but we're afraid that if we do, the database might no longer be able to find the existing layers.

Thanks,

Frank



On Thursday, April 1, 2021 at 12:16:40 UTC+2, dme...@redhat.com wrote:

Frank van Gemeren

Jul 14, 2021, 2:39:17 PM
to quay-sig
To close this one out: the current theory is that this behavior was caused by a failover test a long time ago, which left the (primary?) database with some inconsistent entries. The secondary S3 bucket, which was probably primary during that test, has since been emptied, so the database could no longer find those layers.

I believe we fixed it by setting Quay up from scratch with a full database wipe (this was our testing installation), but I'm not sure. We probably also tried to sync from secondary to primary, but my memory fails me.

Thanks,
Frank

On Tuesday, April 20, 2021 at 19:32:19 UTC+2, Frank van Gemeren wrote:

Daniel Messer

Jul 16, 2021, 1:15:30 PM
to Frank van Gemeren, quay-sig
If you delete blobs in S3 buckets directly, Quay in general has no way of knowing this happened.
