aws s3 metadata


Peter Marshall

Jun 14, 2023, 2:49:16 AM
to s3...@googlegroups.com
I'm impressed using s3ql (v 3.8.1) - it does (nearly) exactly what I want!

I've run into two issues recently with s3 metadata - one I've managed to
fix; for the other I'm looking for advice.

When syncing an s3 bucket containing an s3ql file system to a new one,
and then trying to mount the new one, I was getting errors with the
metadata. The reason was that aws s3 sync loses metadata when it uses
multipart transfers.

I fixed this using "aws configure set default.s3.multipart_threshold
20MB" to increase the multipart threshold above the size of the s3ql
data blocks.

I also used "aws s3 sync --metadata-directive COPY" but I believe that
is the default anyway, so it was probably superfluous.

Then I tried to use AWS Backup on s3 buckets to do a similar job. This
is where the problem came in - the ETag is not always preserved on
restore, and is not an MD5 when multipart uploads get used. The decision
on whether AWS uses multi-part uploads on the restore seems to be out of
my control.

I've not yet looked into the internals of s3ql - is it possible to fix
the ETag issue? I know ETags can't be changed, but could s3ql recover
from the ETags changing?

Thanks,

Pete.

Nikolaus Rath

Jun 14, 2023, 5:07:49 AM
to noreply-spamdigest via s3ql
Hi,

The etags are checksums calculated over the contents of the data. If the etags change, then this means the content of the object has changed. So this should not happen on backup/restore.

Best,
-Nikolaus
> --
> You received this message because you are subscribed to the Google
> Groups "s3ql" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to s3ql+uns...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/s3ql/8cf45e3e-d5c0-584a-467d-ab2cdd18d0bf%40goteck.co.uk.

Peter Marshall

Jun 14, 2023, 5:27:53 AM
to s3...@googlegroups.com
Thanks for your reply. This does happen on restore (I've tried!), and I
think it is because for multipart uploads and other encryption methods,
the ETag is calculated differently. So even though the object is the
same, if it is multipart-uploaded by AWS then the ETag is different. On
backing up and restoring the bucket, you can see the ETag change in the
AWS S3 console between the restored object and the original.

See https://docs.aws.amazon.com/AmazonS3/latest/API/API_Object.html,
ETag bullet 3.

In my case, all the objects are 10MB in size, but the AWS restore still
changes the ETag.

I need to check that the encryption I was using on the restored bucket
was the same as the original - maybe I am falling foul of ETag bullet 2
by switching encryption.

Pete.


Peter Marshall

Jun 14, 2023, 7:17:25 AM
to s3...@googlegroups.com
I've experimented, and it's not due to encryption, so I presume it is
related to multipart during the AWS backup or restore process.

e.g. original bucket, with mountable s3ql:

object ncs3ql_data_1 has ETag 6af237cd6aa167ec276eb58f9f9a52c6

Same object on restored bucket: ETag dc3d145a13fa955d024aaa4165826530-1

If I download the file from the restored bucket, and run md5 on it, I
get back the original ETag, so it is the same data.
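The check described above (download the object, hash it, compare with the ETag) can be sketched in Python roughly as follows. This is only an illustration: `file_md5` and `is_plain_md5_etag` are hypothetical helper names, and the assumption that a multipart ETag carries a `-N` suffix is based on the two values quoted in this message.

```python
import hashlib
import re

# Assumption: a plain (non-multipart) ETag is exactly 32 hex digits.
_PLAIN_MD5 = re.compile(r"[0-9a-f]{32}")


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 of a local file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_plain_md5_etag(etag):
    """True if the ETag looks like a bare MD5 (no '-N' suffix)."""
    return _PLAIN_MD5.fullmatch(etag.strip('"').lower()) is not None
```

With the values above, `is_plain_md5_etag("6af237cd6aa167ec276eb58f9f9a52c6")` is true, while the restored object's `dc3d145a13fa955d024aaa4165826530-1` is not a plain MD5, so comparing it against `file_md5(...)` can never succeed.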

If I try fsck on the restored bucket, it's very unhappy:

WARNING: MD5 mismatch for s3ql_passphrase:
b428f7203f2bfd8b547be1ade86a74a3-1 vs 0c8d069f75210a79332014f7cb38454a
WARNING: MD5 mismatch for s3ql_passphrase:
b428f7203f2bfd8b547be1ade86a74a3-1 vs 0c8d069f75210a79332014f7cb38454a
WARNING: MD5 mismatch for s3ql_passphrase:
b428f7203f2bfd8b547be1ade86a74a3-1 vs 0c8d069f75210a79332014f7cb38454a
Encountered BadDigestError (BadDigest: ETag header does not agree with
calculated MD5), retrying Backend.perform_read (attempt 3)...
WARNING: MD5 mismatch for s3ql_passphrase:
b428f7203f2bfd8b547be1ade86a74a3-1 vs 0c8d069f75210a79332014f7cb38454a
Encountered BadDigestError (BadDigest: ETag header does not agree with
calculated MD5), retrying Backend.perform_read (attempt 4)...

Nikolaus Rath

Jun 15, 2023, 12:17:54 PM
to s3...@googlegroups.com
Hi Peter,

Thanks for following up. Looking at
https://docs.aws.amazon.com/AmazonS3/latest/API/API_Object.html, it
looks like there is a bug in S3QL: The S3 backend expects the ETag to
match the MD5 of the content.

This hasn't been a problem so far because when S3QL itself uploads the
objects, this is the case. But when you're modifying objects with an
external tool, this assumption no longer holds.

I'm not sure how to best fix it. One way would be to just not verify the
content. As long as encryption is being used, it will detect any
corruption. However, for un-encrypted buckets this could result in
undetected corruption.

The above page talks about the "algorithm that was used to create a
checksum of an object", which seems to be what we want. However, there
is no mention of an actual checksum other than the ETag (which seemingly
cannot be validated by the client). Does anyone know if Amazon provides
other checksums that could be used (e.g. Content-MD5)?

Best,
-Nikolaus

Peter Marshall

Jun 15, 2023, 12:49:41 PM
to s3...@googlegroups.com
I found this info on how to calculate ETags on a local file:

https://teppen.io/2018/06/23/aws_s3_etags/ and the linked
https://teppen.io/2018/10/23/aws_s3_verify_etags/

Nikolaus Rath

Jun 16, 2023, 5:29:46 AM
to s3...@googlegroups.com
Hi,

I'm skeptical that this is a future-proof solution, since
https://docs.aws.amazon.com/AmazonS3/latest/API/API_Object.html
just says that "If an object is created by either the Multipart
Upload or Part Copy operation, the ETag is not an MD5 digest, regardless
of the method of encryption.". If there is no documentation from AWS
about how to compute the ETag for these cases, we should not attempt to
do so.


Best,
-Nikolaus

r0ps3c

Jun 17, 2023, 9:50:23 AM
to s3ql
In case it helps, https://docs.aws.amazon.com/AmazonS3/latest/userguide/checking-object-integrity.html#large-object-checksums seems to document the algorithm. I'd had some similar issues to the OP a while ago and did some testing that indicated the documentation was accurate, but haven't checked recently.
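A minimal Python sketch of the scheme that page appears to describe: take the MD5 digest of each part, concatenate the raw digests, take the MD5 of that concatenation, and append a dash plus the number of parts. The function name and the `part_size` parameter are made up for illustration. The part size that AWS Backup uses on restore is not stated anywhere in this thread, so the function can only reproduce a given ETag if the part boundaries are guessed correctly.

```python
import hashlib


def multipart_etag(data: bytes, part_size: int) -> str:
    """Compute a multipart-style ETag for data split into fixed-size parts.

    Assumption: every part except the last has exactly part_size bytes.
    """
    if part_size <= 0:
        raise ValueError("part_size must be positive")

    # Raw (binary) MD5 digest of each part, in order.
    part_digests = [
        hashlib.md5(data[offset:offset + part_size]).digest()
        for offset in range(0, len(data), part_size)
    ]
    if not part_digests:
        # Treat empty input as a single empty part.
        part_digests = [hashlib.md5(b"").digest()]

    # MD5 over the concatenated digests, then "-<number of parts>".
    combined = hashlib.md5(b"".join(part_digests)).hexdigest()
    return f"{combined}-{len(part_digests)}"
```

For an object uploaded as a single part, this yields the MD5 of the object's own MD5 digest followed by `-1`, which would be consistent with an ETag that ends in `-1` yet differs from the plain MD5 of the same data.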

Peter Marshall

Jun 19, 2023, 2:30:14 AM
to s3...@googlegroups.com
From reading the code, it looks like the ETag is used on writing to
check that what was written is what AWS received. This is still ok,
since the writing is under s3ql control. Although "Using Content-MD5
when uploading objects" in
https://docs.aws.amazon.com/AmazonS3/latest/userguide/checking-object-integrity.html
suggests we could provide the MD5 and AWS will itself reject the upload
if it doesn't match.
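For reference, the value of a Content-MD5 header is the base64 encoding of the raw 16-byte MD5 digest, not the hex string that appears in an ETag. A small sketch of computing it; `content_md5` is a hypothetical helper and not part of S3QL.

```python
import base64
import hashlib


def content_md5(body: bytes) -> str:
    """Return the Content-MD5 header value for a request body."""
    raw_digest = hashlib.md5(body).digest()  # 16 raw bytes, not hex
    return base64.b64encode(raw_digest).decode("ascii")
```

The result would be sent as the `Content-MD5` request header alongside the upload, so that the server can compare it with what it actually received.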

The ETag is used on reading to check that the calculated md5 of the
received data matches the ETag - which we now know is not always the case.

Could the md5 (or some other signature) of the data be stored in
metadata, and we check that instead of the Etag on reading? I've only
briefly looked at the source - maybe an existing header is suitable.

It would be great if we could get the AWS automatic S3 backup/restore to
work.


Nikolaus Rath

Jun 19, 2023, 3:55:33 PM
to s3...@googlegroups.com
Thanks for digging this up!

I do not think this is of much help, unfortunately. S3QL doesn't know
how the object was originally uploaded (let alone what the MD5 of the
parts was), so it still can't check the ETag on download.

Best,
-Nikolaus

Nikolaus Rath

Jun 19, 2023, 3:57:58 PM
to s3...@googlegroups.com
On Jun 19 2023, Peter Marshall <pe...@goteck.co.uk> wrote:
> Could the md5 (or some other signature) of the data be stored in metadata, and we check
> that instead of the Etag on reading? I've only briefly looked at the source - maybe an
> existing header is suitable.

Yes, that is possible in principle but not currently done. We'd have to
extend the metadata format to store this checksum.
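As a rough illustration of that idea (not S3QL's actual metadata format), a checksum could be stored with the object's metadata at upload time and compared on read, independently of whatever ETag the server reports. The `data-md5` key and both function names below are invented for this sketch, and MD5 is used only because it is the digest mentioned in the question.

```python
import hashlib

# Hypothetical metadata key; S3QL does not currently define one.
CHECKSUM_KEY = "data-md5"


def add_checksum(metadata: dict, body: bytes) -> dict:
    """Return a copy of metadata with the MD5 of body added."""
    result = dict(metadata)
    result[CHECKSUM_KEY] = hashlib.md5(body).hexdigest()
    return result


def verify_checksum(metadata: dict, body: bytes):
    """Compare body against the stored checksum.

    Returns True on match, False on mismatch, and None if the
    metadata carries no checksum (e.g. an object written earlier).
    """
    expected = metadata.get(CHECKSUM_KEY)
    if expected is None:
        return None
    return hashlib.md5(body).hexdigest() == expected
```

Because the checksum travels with the object's own metadata, it would survive a restore that changes the ETag, provided the restore preserves that metadata.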

I'm just not convinced that it's worth it, since this effectively
duplicates what's already done when using encryption.

So perhaps the right answer is to disable ETag checking completely and
require encryption to be used? Or disable it when encryption is active,
so that it affects fewer cases?

Best,
-Nikolaus

Peter Marshall

Jun 20, 2023, 4:42:38 AM
to s3...@googlegroups.com
Or just disable etag checking based on a configuration option, and let
the user decide?

The documentation can explain the consequences - i.e. none if encryption
is used, and potential undetectable corruption if not.

That way when all etags are md5 checksums, we don't lose anything. And
when they are not, we can mount what is currently a broken FS by
disabling the check.
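The proposed behaviour could look something like the sketch below. It is not taken from the S3QL source: `check_etag`, the `verify` flag and this `BadDigestError` class are stand-ins for whatever a real patch would use.

```python
import hashlib


class BadDigestError(Exception):
    """Raised when the ETag does not match the MD5 of the received data."""


def check_etag(body: bytes, etag: str, verify: bool = True) -> None:
    """Check downloaded data against its ETag unless the user opted out."""
    if not verify:
        # Check disabled by configuration: rely on encryption (if any)
        # to detect corruption instead.
        return

    computed = hashlib.md5(body).hexdigest()
    # ETags are often returned wrapped in double quotes.
    if etag.strip('"').lower() != computed:
        raise BadDigestError(f"MD5 mismatch: {etag} vs {computed}")
```

With `verify=True` this behaves like the current check and rejects an ETag such as `b428f7203f2bfd8b547be1ade86a74a3-1`; with `verify=False` the same object would be accepted without any comparison.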

Nikolaus Rath

Jun 21, 2023, 3:46:25 AM
to s3...@googlegroups.com
Hi,

That'd be another option.

In any case, someone would need to write the code and submit a pull
request for this to happen.

Best,
-Nikolaus

Peter Marshall

Jun 21, 2023, 4:03:15 AM
to s3...@googlegroups.com
Understood.

I'll try to submit a preliminary PR soon for comment, which will allow
the user to disable the checking of ETags on read.

I've opened this issue: https://github.com/s3ql/s3ql/issues/297
