offset 47914999 WARC-Record-ID <urn:uuid:d53099ff-352d-4c09-adff-0e0f1584c398> response
block digest failed: sha1:LMVQVNB5Z5IL5CERGAMWR5GFK4ZRFFVS
So I check my CSV and find that the record comes from this file and byte range:
```
aws s3api get-object --bucket commoncrawl --key crawl-data/CC-MAIN-2023-50/segments/1700679100942.92/warc/CC-MAIN-20231209170619-20231209200619-00674.warc.gz --range bytes=1060598339-1060607367 test.warc.gz
{
"AcceptRanges": "bytes",
"LastModified": "2023-12-09T21:31:20+00:00",
"ContentLength": 9029,
"ETag": "\"e16fcc6e8fab5aaeb081acb4b268c9c5-18\"",
"VersionId": "null",
"ContentRange": "bytes 1060598339-1060607367/1176231142",
"ContentType": "application/octet-stream",
"ServerSideEncryption": "AES256",
"Metadata": {},
"StorageClass": "INTELLIGENT_TIERING"
}
```
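As a sanity check on the range itself: HTTP byte ranges are inclusive at both ends, so the `ContentLength` of 9029 is consistent with the `--range` I passed. A minimal sketch of that arithmetic, assuming the CSV holds the record's offset and compressed length (the helper name `byte_range` is made up for illustration):

```python
def byte_range(offset: int, length: int) -> str:
    """Build an HTTP Range value covering `length` bytes starting at `offset`."""
    # Both ends of an HTTP range are inclusive, hence the -1 on the end position.
    return f"bytes={offset}-{offset + length - 1}"


print(byte_range(1060598339, 9029))  # bytes=1060598339-1060607367
```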
It extracts fine locally in my re-packed WARC file, even with a failed digest:
```
warcio extract /Volumes/WD-1TB-USB/cc/crawl-001.warc.gz 47914999 | head
WARC/1.0
WARC-Type: response
WARC-Date: 2023-12-09T18:19:57Z
WARC-Record-ID: <urn:uuid:d53099ff-352d-4c09-adff-0e0f1584c398>
Content-Length: 33839
Content-Type: application/http; msgtype=response
WARC-Warcinfo-ID: <urn:uuid:d841a84d-0608-4a76-9204-1d150241495b>
WARC-Concurrent-To: <urn:uuid:36acccc5-32b3-419f-8495-a36570686067>
WARC-IP-Address: 47.98.117.69
WARC-Target-URI: https://www.shifudao.com/
```
As for the freshly downloaded copy, it both passes the digest check and extracts:
```
warcio check test.warc.gz
[no output]
warcio extract test.warc.gz 0 | head
WARC/1.0
WARC-Type: response
WARC-Date: 2023-12-09T18:19:57Z
WARC-Record-ID: <urn:uuid:d53099ff-352d-4c09-adff-0e0f1584c398>
Content-Length: 33840
Content-Type: application/http; msgtype=response
WARC-Warcinfo-ID: <urn:uuid:d841a84d-0608-4a76-9204-1d150241495b>
WARC-Concurrent-To: <urn:uuid:36acccc5-32b3-419f-8495-a36570686067>
WARC-IP-Address: 47.98.117.69
WARC-Target-URI: https://www.shifudao.com/
```
So I can only assume that some aspect of opening this with `warcio.archiveiterator` and then writing it back out into a larger file with `WARCWriter` causes this unexpected behavior. But I'm not familiar with Python and don't know why that might be.
I do notice just now that the Content-Length header is off by 1: 33839 in my re-packed copy versus 33840 in the fresh download. Maybe that's the issue? I can't guess how that comes to be, but there it is. Is it a clue, or unrelated?
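One way to see which copy is telling the truth would be to compare the declared Content-Length with the number of block bytes actually present, and to recompute the block digest by hand. Below is a rough standard-library sketch, not how warcio does it internally. It assumes the input is a single uncompressed record (as in a one-record file like `test.warc.gz` after decompression) and that the digest is the usual `sha1:` plus base32 form seen in the output above. The function name `inspect_record` and the returned keys are made up for illustration.

```python
import base64
import gzip
import hashlib


def inspect_record(raw: bytes) -> dict:
    """Compare a WARC record's declared Content-Length with its actual block.

    `raw` is one uncompressed WARC record: the header lines, a blank line,
    the block, then the CRLF CRLF that terminates the record.
    """
    head, sep, rest = raw.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("no blank line after the WARC headers")

    headers = {}
    for line in head.split(b"\r\n")[1:]:  # skip the "WARC/1.0" version line
        name, _, value = line.partition(b":")
        headers[name.strip().lower()] = value.strip()

    declared = int(headers[b"content-length"])
    # Drop the record terminator so only the block bytes remain.
    block = rest[:-4] if rest.endswith(b"\r\n\r\n") else rest
    digest = "sha1:" + base64.b32encode(hashlib.sha1(block).digest()).decode("ascii")
    return {
        "declared": declared,
        "actual": len(block),
        "computed_digest": digest,
        "header_digest": headers.get(b"warc-block-digest", b"").decode("ascii"),
    }
```

For the fresh download this could be called as `inspect_record(gzip.decompress(open("test.warc.gz", "rb").read()))`. If `declared` and `actual` differ by one for the re-packed record, that would point at the block having been altered (or the header rewritten) somewhere between reading and writing, which would also explain the digest failure.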