Query task: How is a WARC file transformed to a WET file? Are tools available for this?
Richard Hill
Feb 8, 2022, 3:25:48 PM
to Common Crawl
Hello folks
To increase my understanding of the commoncrawl project and the WARC, WAT and WET formats, I have been using https://pypi.org/project/warcio/ to locally create WARC files from selected URLs, so that I can compare the output with the record for the same URL in commoncrawl.
I now want to transform these WARC files to WET files. How do you do that?
The JSON structure of a WET file is clear, but how is the plain text efficiently extracted from the webpage stored in the WARC file?
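To make the question concrete, here is a minimal sketch of the kind of conversion being asked about, using only the Python standard library. It is an illustration under stated assumptions, not the tool used for the monthly crawl: the names `TextExtractor`, `html_to_text` and `make_conversion_record` are hypothetical, the text extraction is deliberately naive, and the header set is a reduced version of what appears in published WET files (WARC records of type `conversion` with a `text/plain` payload). The HTML input is assumed to be the already decoded payload of a WARC `response` record, for example one read with warcio.

```python
import uuid
from datetime import datetime, timezone
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping script/style/noscript content."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            text = " ".join(data.split())  # collapse runs of whitespace
            if text:
                self.parts.append(text)


def html_to_text(html):
    """Naive HTML-to-text conversion (assumption: one line per text node)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def make_conversion_record(target_uri, text, refers_to=None):
    """Serialize one uncompressed WET-style 'conversion' record as bytes."""
    payload = text.encode("utf-8")
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",
        "WARC-Type: conversion",
        f"WARC-Target-URI: {target_uri}",
        f"WARC-Date: {now}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
    ]
    if refers_to:
        # Record ID of the WARC response record the text was derived from.
        headers.append(f"WARC-Refers-To: {refers_to}")
    headers += ["Content-Type: text/plain", f"Content-Length: {len(payload)}"]
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # A WARC record ends with two CRLFs after the payload.
    return head + payload + b"\r\n\r\n"


if __name__ == "__main__":
    page = "<html><body><h1>Title</h1><script>var x = 1;</script><p>Body text</p></body></html>"
    record = make_conversion_record("http://example.com/", html_to_text(page))
    print(record.decode("utf-8"))
```

In a real pipeline each record would also be gzip-compressed individually, and a more robust extractor would be needed to handle character encodings and malformed HTML.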
I have found the program below but, to be honest, I am struggling to understand the context in which it is meant to be used. What is used to create the WET files for the monthly crawl?
> I have been using https://pypi.org/project/warcio/ to
> locally create WARC files from selected URLs to compare the file output
> with that of the same URL in commoncrawl.
Interesting. If there are any substantial differences, I'd like to hear
about them.