Combine individual downloaded warc.gz files?


Jason Boxman

Feb 20, 2024, 10:26:33 PM
to Common Crawl
Hi,

I've successfully downloaded the WARC files that I was looking for earlier. Thanks for that!

My original storage plan is less than optimal, and I'd like to ask if there's any way to re-pack the WARC files that I have into a handful of much larger files. My local disks perform poorly with small files, and I have about 330GB (millions) of individual WARC files now.

My current approach is to open the individual files with warcio and rewrite their records into a larger file, but I'm getting the occasional invalid digest when I subsequently verify (both fastwarc and `warcio check` agree on this):

```
block digest failed: sha1:IS4GVWDBVMYYEITQR5Y4ZMP3ET5TUPWR
offset 438573128 WARC-Record-ID <urn:uuid:7799c9e2-49d6-4041-b1bc-de76e34ec71
```

So I think my approach is flawed. I am able to extract the record with the failed digest successfully though. Any suggestions on how I might re-pack/concat these files together? Is the digest failure not fatal?
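(For context: my understanding is that the block digest in the error above is just an SHA-1 over the record's raw block, base32-encoded per RFC 4648. A minimal stdlib-only sketch to recompute one; the function name is mine:)

```python
import base64
import hashlib

def block_digest(raw_record_block: bytes) -> str:
    """Compute a WARC-style block digest: SHA-1 of the raw block,
    base32-encoded, prefixed with the algorithm label."""
    h = hashlib.sha1(raw_record_block).digest()
    return "sha1:" + base64.b32encode(h).decode("ascii")
```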

Thanks!

Greg Lindahl

Feb 20, 2024, 11:07:26 PM
to common...@googlegroups.com
Jason,

You should pack your warc records together until they're about 1
gigabyte. cdx_toolkit does that when it creates "extraction" warcs:

$ cdxt --ia --limit 100 warc 'commoncrawl.org/*'

The digest check failure thing is interesting. I was relatively new to
web archiving when I added "check" to warcio. I never did the obvious
next step, which was to survey Common Crawl to see how many records
fail. For example, there's a disagreement about what the payload
digest should be for chunked encoding.

There are probably other bugs lurking. If you could look into what
these failed digest records have in common, that would be very good. I
see you've already checked that fastwarc complains about the same
records, that's good to know.

cdx_toolkit can also get digests from the Internet Archive, and it
would be good to know if any of those digests don't agree.
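To make the chunked-encoding disagreement concrete: one tool may hash the raw chunked bytes while another hashes the de-chunked payload, and those two digests will never match. A minimal de-chunker sketch (stdlib only, no trailers or error handling; the function name is mine):

```python
def dechunk(body: bytes) -> bytes:
    """Decode an HTTP/1.1 chunked transfer-encoded body."""
    out, i = b"", 0
    while True:
        j = body.index(b"\r\n", i)
        size = int(body[i:j].split(b";")[0], 16)  # ignore chunk extensions
        if size == 0:
            return out
        out += body[j + 2:j + 2 + size]
        i = j + 2 + size + 2  # skip chunk data plus its trailing CRLF
```

Hashing `body` versus `dechunk(body)` is exactly the kind of disagreement that shows up as a payload digest mismatch.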

-- greg
> --
> You received this message because you are subscribed to the Google Groups "Common Crawl" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to common-crawl...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/common-crawl/ea6c0d33-b975-4fed-badd-cc6180e6d252n%40googlegroups.com.

Jason Boxman

Feb 21, 2024, 1:20:57 AM
to Common Crawl
It seems to be something that's happened locally:

```
  offset 47914999 WARC-Record-ID <urn:uuid:d53099ff-352d-4c09-adff-0e0f1584c398> response
    block digest failed: sha1:LMVQVNB5Z5IL5CERGAMWR5GFK4ZRFFVS
```

So I check my CSV, and find it is this:

```
aws s3api get-object --bucket commoncrawl --key crawl-data/CC-MAIN-2023-50/segments/1700679100942.92/warc/CC-MAIN-20231209170619-20231209200619-00674.warc.gz --range bytes=1060598339-1060607367 test.warc.gz

{
    "AcceptRanges": "bytes",
    "LastModified": "2023-12-09T21:31:20+00:00",
    "ContentLength": 9029,
    "ETag": "\"e16fcc6e8fab5aaeb081acb4b268c9c5-18\"",
    "VersionId": "null",
    "ContentRange": "bytes 1060598339-1060607367/1176231142",
    "ContentType": "application/octet-stream",
    "ServerSideEncryption": "AES256",
    "Metadata": {},
    "StorageClass": "INTELLIGENT_TIERING"
}
```

It extracts fine locally in my re-packed WARC file, even with a failed digest:

```
warcio extract /Volumes/WD-1TB-USB/cc/crawl-001.warc.gz 47914999 | head
WARC/1.0
WARC-Type: response
WARC-Date: 2023-12-09T18:19:57Z
WARC-Record-ID: <urn:uuid:d53099ff-352d-4c09-adff-0e0f1584c398>
Content-Length: 33839
Content-Type: application/http; msgtype=response
WARC-Warcinfo-ID: <urn:uuid:d841a84d-0608-4a76-9204-1d150241495b>
WARC-Concurrent-To: <urn:uuid:36acccc5-32b3-419f-8495-a36570686067>
WARC-IP-Address: 47.98.117.69
WARC-Target-URI: https://www.shifudao.com/
```

And as for the freshly downloaded copy, it both passes the digest check and extracts:

```
warcio check test.warc.gz
[no output]

warcio extract test.warc.gz 0 | head
WARC/1.0
WARC-Type: response
WARC-Date: 2023-12-09T18:19:57Z
WARC-Record-ID: <urn:uuid:d53099ff-352d-4c09-adff-0e0f1584c398>
Content-Length: 33840
Content-Type: application/http; msgtype=response
WARC-Warcinfo-ID: <urn:uuid:d841a84d-0608-4a76-9204-1d150241495b>
WARC-Concurrent-To: <urn:uuid:36acccc5-32b3-419f-8495-a36570686067>
WARC-IP-Address: 47.98.117.69
WARC-Target-URI: https://www.shifudao.com/
```

So I can only assume that some aspect of opening this with `warcio.archiveiterator` and then writing it back out into a larger file with `WARCWriter` causes this unexpected behavior. But I'm not familiar with Python and don't know why that might be.

I do notice just now that the Content-Length header is off by 1. Maybe that's the issue? I can't guess how that comes to be, but there it is. A clue or unrelated?

Thanks!

Henry S. Thompson

Feb 21, 2024, 5:36:14 AM
to common...@googlegroups.com
Jason Boxman writes:

> ...

> I do notice just now that the Content-Length header is off by
> 1. Maybe that's the issue? I can't guess how that comes to be, but
> there it is. A clue or unrelated?

That rings a bell. In my old notes I find the following:

from https://commoncrawl.org/2018/08/august-2018-crawl-archive-now-available/

"Please note that the WARC files of August 2018 (CC-MAIN-2018-34)
are affected by a WARC format error and contain an extra \r\n
between HTTP header and payload content. Also the given
'Content-Length' is off by 2 bytes. For more information about this
bug see this post on our user forum."

Long shot, but might be relevant.

ht
--
Henry S. Thompson, School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND
e-mail: h...@inf.ed.ac.uk
URL: https://www.ltg.ed.ac.uk/~ht/
[mail from me _always_ has a .sig like this -- mail without it is forged spam]

Jason Boxman

Feb 21, 2024, 5:20:41 PM
to Common Crawl
My confusion here is that the file I downloaded from CC passes an sha1 check, but after the following treatment it does not. For the vast majority of files it works fine; for a few outliers, though, I ultimately get an sha1 check failure when re-checking the new, large WARC file.

Does WARCWriter take the existing sha1 signature as-is when it writes these records? I'm not familiar with how any of this is supposed to work in an idealized world, or whether the sha1 check failure is anything I need to worry about in practice if each individual record looks correct when re-read.

Thanks!

```
import os
import sys

from warcio.archiveiterator import ArchiveIterator
from warcio.statusandheaders import StatusAndHeadersParserException
from warcio.warcwriter import WARCWriter

[...]
  output = open(current_packfile_path, 'ab')  # append mode to continue an existing pack file
  writer = WARCWriter(output, gzip=True)
  current_packfile_size = os.path.getsize(current_packfile_path)

  for dirpath, _, filenames in os.walk(input_directory):
      for filename in filenames:
          if filename.endswith(".warc.gz") or filename.endswith(".warc"):
              warc_path = os.path.join(dirpath, filename)
              try:
                  with open(warc_path, 'rb') as stream:
                      for record in ArchiveIterator(stream):
                          output.flush()  # flush so getsize() sees what's been written
                          current_packfile_size = os.path.getsize(current_packfile_path)
                          if current_packfile_size >= max_size_gb * 1024**3:
                              # Rotate to a new pack file once the size cap is reached
                              output.close()
                              seq += 1
                              current_packfile_path = os.path.join(output_directory, f"{packfile_basename}{str(seq).zfill(3)}.warc.gz")
                              output = open(current_packfile_path, 'wb')
                              writer = WARCWriter(output, gzip=True)
                              current_packfile_size = 0

                          writer.write_record(record)
              except StatusAndHeadersParserException as e:
                  print(f"Error processing {warc_path}: {e}", file=sys.stderr)
                  continue  # skip to the next file

  output.close()
```
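One way to sidestep re-serialization entirely, assuming the one-gzip-member-per-record convention (which Common Crawl's warc.gz files follow): concatenated gzip members form a valid gzip stream, so plain byte concatenation of warc.gz files already yields a valid warc.gz and cannot perturb the stored digests. A sketch (the function name is mine):

```python
import shutil

def concat_warc_gz(inputs, output_path):
    # Each record is an independent gzip member, and gzip members
    # concatenate into a valid stream, so raw byte concatenation
    # re-packs the files without decoding or rewriting any record.
    with open(output_path, "wb") as out:
        for path in inputs:
            with open(path, "rb") as f:
                shutil.copyfileobj(f, out)
```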

Greg Lindahl

Feb 21, 2024, 6:09:44 PM
to common...@googlegroups.com
Amusingly enough, this was a bug that Sebastian fixed after my "warcio
test" code (which was never completed and shipped) found it.

But it's a systematic "off by 2" problem, not "off by 1".

By the way, we have an ongoing project to find all of the errata which
apply to each particular crawl. Our hope is to not only have
human-readable documentation but to also have code you can use to fix
up our captures.

As an example of how this would be useful: the Internet Archive tried
to incorporate some of our captures into the Wayback Machine, but a
bug we had for a while, where we mislabeled truncations, caused a
significant number of Wayback user complaints.

-- greg

Greg Lindahl

Feb 22, 2024, 3:24:25 AM
to common...@googlegroups.com
Jason,

It's been 5 years since I looked at warcio's code, but it could easily
be the case that rewriting a record isn't going to check or change the
digest that's already present.

So if you have the wrong length for the content, for example fetching
(offset+length) instead of (offset+length-1), that's bad. But usually
if that was the cause, it would affect every record.
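That off-by-one can be sketched as follows (the helper name is mine; HTTP Range headers are inclusive on both ends):

```python
def s3_range(offset: int, length: int) -> str:
    # Range is inclusive on both ends, hence the -1; requesting
    # bytes=offset-(offset+length) fetches one byte too many.
    return f"bytes={offset}-{offset + length - 1}"
```

For the record discussed earlier in the thread, `s3_range(1060598339, 9029)` gives exactly the `bytes=1060598339-1060607367` range that was used.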

cdx_toolkit has working code that creates extracted warcs based on a
cdx query. You can look at it at
https://github.com/cocrawler/cdx_toolkit/blob/5eddafc63e47cf4047a3eb345d273555e4c4fa40/cdx_toolkit/warc.py#L123

One debugging tool I'd recommend is cdxj-indexer.

-- greg

Jason Boxman

Feb 23, 2024, 4:58:28 PM
to Common Crawl
As I work through this, I think what actually happened is that, given my lack of any competency with Python, the code I wrote to download these WARC files in parallel from S3 was writing the occasional corrupt file. As near as I can tell, I have many hundreds or thousands of these out of several million files. (I also seem to be missing over 50% of the files I intended to download, compared to the total lines in the CSV from my Parquet query.)

So this might be a false alarm and programmer error on my part. I've gone back and re-downloaded a few with the AWS CLI directly, and they extract fine. This would also explain the `StatusAndHeadersParserException` and `ArchiveLoadFailed` exceptions I'm catching. This is what happens when I try to be clever.
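For anyone hitting the same thing, a general pattern that avoids leaving truncated files when a parallel downloader dies mid-write (a sketch, not the code from this thread): verify the expected length, write to a temp file, then rename into place.

```python
import os
import tempfile

def safe_write(dest_path: str, data: bytes, expected_len: int) -> None:
    """Verify length, write to a temp file in the same directory, then
    atomically rename into place, so a crash mid-write can never leave
    a truncated file at dest_path."""
    if len(data) != expected_len:
        raise IOError(f"short read: got {len(data)}, expected {expected_len}")
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(dest_path) or ".")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp, dest_path)  # atomic on POSIX within one filesystem
    except BaseException:
        os.unlink(tmp)
        raise
```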

Greg Lindahl

Feb 23, 2024, 5:01:28 PM
to common...@googlegroups.com
Ah, if you're doing that in parallel, I'd also recommend that you look at

https://status.commoncrawl.org/

And make sure that you aren't one of the bumps on the performance graphs.

Jason Boxman

Feb 23, 2024, 5:12:33 PM
to Common Crawl
I noticed that in January and wanted to avoid hammering the service. My approach was highly inefficient, though, and my network graph showed download speeds of around 250k-500k per second in total. It took about 4 weeks to (fail to) get all my target files. I was nowhere near 40k requests a second! The configuration was:

```
s3_config = Config(
  s3={
    'multipart_threshold': '4GB',
    'max_concurrent_requests': 12,
    'multipart_chunksize': '32MB'
  },
  retries={
    'mode': 'adaptive',
    'max_attempts': 100
  }
)
```
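For reference, a sketch of how a `Config` like this gets wired into a client (assuming boto3/botocore; `UNSIGNED` enables anonymous reads of the public bucket):

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

config = Config(
    signature_version=UNSIGNED,  # the commoncrawl bucket allows anonymous reads
    retries={'mode': 'adaptive', 'max_attempts': 100},
)
s3 = boto3.client('s3', config=config)
# Ranged read of one record; offset/length come from the index, e.g.:
# s3.get_object(Bucket='commoncrawl', Key=key, Range=f'bytes={o}-{o + l - 1}')
```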

Greg Lindahl

Feb 25, 2024, 6:13:08 PM
to common...@googlegroups.com
Jason,

That looks good: it does one request per gigabyte-sized file, which is
the minimum number of requests.