Query task: How is a WARC file transformed to a WET file? Are tools available for this?

Richard Hill

Feb 8, 2022, 3:25:48 PM
to Common Crawl

Hello folks

To improve my understanding of the Common Crawl project and the WARC, WAT and WET formats, I have been using https://pypi.org/project/warcio/ to create WARC files locally from selected URLs, so that I can compare the output with the record for the same URL in Common Crawl.

I now want to transform these WARC files to WET files. How do you do that?

The JSON structure of a WET file is clear, but how is the plain text efficiently extracted from the WARC file and the webpage?

I have found the program below, but to be honest I am struggling to understand the context in which it is used. What is used to create the WET files for the monthly crawl?

Sebastian Nagel

Feb 11, 2022, 10:23:53 AM
to common...@googlegroups.com
Hi Richard,

> What is used to create the WET files for the monthly crawl?

The linked Java class is used.

You'll find instructions on how to use it here:
https://groups.google.com/g/common-crawl/c/hsb90GHq6to/m/SSVocyq8AAAJ
(tested with Java 8 and 11)
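
For a rough idea of what the conversion involves, here is a minimal Python sketch built on warcio. It is not the Java class used for the monthly crawl, and its output will not match the official WET files. The helper names html_to_text and warc_to_wet are made up for illustration, the text extraction is deliberately naive (standard-library HTML parser, script and style content dropped), and it assumes the pages are UTF-8 HTML. It writes one "conversion" record with Content-Type text/plain per HTML response, which is the general shape of a WET record.

```python
from html.parser import HTMLParser
from io import BytesIO


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping <script> and <style> content."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text nodes and collapse runs of whitespace.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(" ".join(data.split()))

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(html):
    """Naive HTML-to-text conversion (hypothetical helper, not the crawl's extractor)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def warc_to_wet(warc_path, wet_path):
    """Write one text/plain 'conversion' record per HTML response in warc_path."""
    # Imported here so html_to_text stays usable without warcio installed.
    from warcio.archiveiterator import ArchiveIterator
    from warcio.warcwriter import WARCWriter

    with open(warc_path, "rb") as src, open(wet_path, "wb") as dst:
        writer = WARCWriter(dst, gzip=True)
        for record in ArchiveIterator(src):
            if record.rec_type != "response" or record.http_headers is None:
                continue
            ctype = record.http_headers.get_header("Content-Type") or ""
            if "html" not in ctype.lower():
                continue
            # Assumption: the payload is UTF-8; real pages need charset detection.
            html = record.content_stream().read().decode("utf-8", errors="replace")
            text = html_to_text(html)
            if not text:
                continue
            # Carry over the capture date and point back to the source record.
            headers = {}
            date = record.rec_headers.get_header("WARC-Date")
            if date:
                headers["WARC-Date"] = date
            source_id = record.rec_headers.get_header("WARC-Record-ID")
            if source_id:
                headers["WARC-Refers-To"] = source_id
            wet_record = writer.create_warc_record(
                record.rec_headers.get_header("WARC-Target-URI"),
                "conversion",
                payload=BytesIO(text.encode("utf-8")),
                warc_content_type="text/plain",
                warc_headers_dict=headers,
            )
            writer.write_record(wet_record)
```

Calling warc_to_wet("example.warc.gz", "example.warc.wet.gz") (both file names are placeholders) would then produce a file you can read back with warcio's ArchiveIterator and compare against a WET file from the crawl.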

Let me know if you need more information.

> I have been using https://pypi.org/project/warcio/ to
> create WARC files locally from selected URLs, so that I can compare
> the output with the record for the same URL in Common Crawl.

Interesting, if there are any substantial differences, I'd like to hear
about them.

Thanks and best!

Sebastian