Using local parquet index with cdx-toolkit

Vittorio Rossi

Apr 4, 2024, 12:27:40 PM
to Common Crawl
Unfortunately, I missed the Discord server invite link.

Does cdx-toolkit have a way for its warc option (which downloads Common Crawl WARC records) to read filenames and offsets from a local parquet file?
I downloaded five parquet files from 2022 and filtered them down to around 100k rows, which I would like to retrieve as WARC records.

Greg Lindahl

Apr 15, 2024, 3:02:49 PM
to common...@googlegroups.com
Vittorio,

Indeed, cdx_toolkit has all of the code to do that, but it's not
hooked up on the command line. It's been on my TODO list for about a
year now.

If you can do a little Python, the code you want to call is
cdx_toolkit.warc.fetch_warc_record(). That gets you the record, then
you'll also have to set up a CDXToolkitWARCWriter and call
CDXToolkitWARCWriter.write_record() for each of your 100k rows. It's
not much code, but of course if you're not already familiar with
cdx_toolkit development, it will take a while.

-- greg
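
The approach Greg describes might be sketched as below. This is only an illustration under stated assumptions, not tested against a particular cdx_toolkit release: it assumes fetch_warc_record() takes a capture dict with the keys url, filename, offset and length plus a download prefix, and that CDXToolkitWARCWriter takes (prefix, subprefix, info). The parquet column names are those of Common Crawl's columnar index; the helper names (row_to_capture, byte_range, export_rows, main) and the info values are made up for this example.

```python
from typing import Any, Callable, Dict, Iterable

# Columns of Common Crawl's columnar index, mapped to the capture keys
# that cdx_toolkit is ASSUMED to expect (check your installed version).
PARQUET_TO_CAPTURE = {
    "url": "url",
    "warc_filename": "filename",
    "warc_record_offset": "offset",
    "warc_record_length": "length",
}


def row_to_capture(row: Dict[str, Any]) -> Dict[str, Any]:
    """Turn one parquet row (as a dict) into a capture dict."""
    missing = [col for col in PARQUET_TO_CAPTURE if col not in row]
    if missing:
        raise ValueError("row is missing columns: " + ", ".join(missing))
    capture = {dst: row[src] for src, dst in PARQUET_TO_CAPTURE.items()}
    # pandas hands back numpy integers; plain ints are safer to pass along.
    capture["offset"] = int(capture["offset"])
    capture["length"] = int(capture["length"])
    return capture


def byte_range(capture: Dict[str, Any]) -> str:
    """HTTP Range header value covering exactly one WARC record."""
    start = capture["offset"]
    end = start + capture["length"] - 1  # Range end is inclusive
    return f"bytes={start}-{end}"


def export_rows(rows: Iterable[Dict[str, Any]],
                fetch: Callable[[Dict[str, Any]], Any],
                write: Callable[[Any], None]) -> int:
    """Fetch each row's record and hand it to the writer; return the count."""
    count = 0
    for row in rows:
        write(fetch(row_to_capture(row)))
        count += 1
    return count


def main(parquet_path: str, prefix: str = "myrecords") -> int:
    # Imported lazily so the helpers above work without these packages.
    import pandas as pd
    from cdx_toolkit.warc import CDXToolkitWARCWriter, fetch_warc_record

    frame = pd.read_parquet(parquet_path, columns=list(PARQUET_TO_CAPTURE))
    # Illustrative warcinfo fields; adjust to your project.
    info = {"software": "local-parquet-export", "isPartOf": prefix,
            "description": "records selected from a local parquet index",
            "format": "WARC file version 1.0"}
    # ASSUMED constructor arguments: (prefix, subprefix, info).
    writer = CDXToolkitWARCWriter(prefix, None, info)
    # ASSUMED signature: fetch_warc_record(capture, warc_download_prefix).
    return export_rows(
        frame.to_dict("records"),
        fetch=lambda cap: fetch_warc_record(cap, "https://data.commoncrawl.org"),
        write=writer.write_record,
    )
```

Calling main("filtered.parquet") would then issue one ranged request per row, so for 100k rows it may be worth adding retries and a polite delay. byte_range() is not called by the export; it only shows which bytes each request covers.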

Greg Lindahl

Apr 15, 2024, 5:47:55 PM
to common...@googlegroups.com
Also, here's a permanent Discord invite link:
https://discord.com/invite/njaVFh7avF

It's also linked on our website.

Vittorio Rossi

Apr 16, 2024, 11:27:59 AM
to Common Crawl
Thanks for both the invite and the advice! If I can spare some time in my semester project, I'll try it in the coming month and come back with feedback.