to common...@googlegroups.com
Hi Moid,
the pages are shuffled before they are written to the WARC and WET files, so every WARC/WAT/WET file
contains a (pseudo)random sample of web pages.
The easiest way is to use the URL index to look up a domain, then iterate over the pages found
and fetch them from the WARC files. How to do this was discussed recently in this group [1].
Here's a short example:
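A minimal sketch of that lookup, assuming the public CDX index server at index.commoncrawl.org, the CC-MAIN-2017-22 crawl ID, and the data.commoncrawl.org download host (adjust these for the crawl you actually want):

```python
# Hedged sketch: query the Common Crawl URL index for one domain,
# then fetch a single page record from the matching WARC file via
# an HTTP range request. Hosts and crawl ID below are assumptions;
# see https://index.commoncrawl.org/ for the available crawls.
import gzip
import json
import urllib.parse
import urllib.request

INDEX_HOST = "https://index.commoncrawl.org"
DATA_HOST = "https://data.commoncrawl.org"
CRAWL_ID = "CC-MAIN-2017-22"  # example crawl, change as needed


def index_query_url(domain, crawl_id=CRAWL_ID):
    """Build a CDX index query matching every captured page of a domain."""
    params = urllib.parse.urlencode({
        "url": domain + "/*",  # prefix match over the whole domain
        "output": "json",      # one JSON record per line
    })
    return "%s/%s-index?%s" % (INDEX_HOST, crawl_id, params)


def warc_range(record):
    """Extract (warc_path, offset, length) from one index record."""
    return (record["filename"], int(record["offset"]), int(record["length"]))


def fetch_record(warc_path, offset, length):
    """Fetch one gzipped WARC record using an HTTP Range header."""
    req = urllib.request.Request(DATA_HOST + "/" + warc_path)
    req.add_header("Range", "bytes=%d-%d" % (offset, offset + length - 1))
    with urllib.request.urlopen(req) as resp:
        return gzip.decompress(resp.read())


if __name__ == "__main__":
    # Iterate over the index hits for a domain, fetching each WARC record.
    with urllib.request.urlopen(index_query_url("example.com")) as resp:
        for line in resp:
            record = json.loads(line)
            print(fetch_record(*warc_range(record))[:200])
            break  # just the first page as a demo
```

Each index record carries the WARC filename plus the byte offset and length of the record inside it, so you only download the pages you asked for rather than whole WARC files.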
On 05/26/2017 11:18 AM, has....@gmail.com wrote:
> Hi Sebastian,
>
> Thanks for the help!
> I will try the approaches mentioned here.
> One more thing: can Common Crawl data also be used for commercial purposes?
>
> Thanks & Regards,
> Moid Hassan
>
> On Thursday, May 25, 2017 at 2:48:01 PM UTC+5:30, has....@gmail.com wrote:
>
> Is it possible to download the WARC or WET files which have data related to only one website?