Hi,
Have you looked at http://commoncrawl.org/the-data/get-started/? It has links to a page per monthly crawl, which contains a link to the WET files.
The above page also explains how to build the URL of the WET file listing from the YYYY-WW crawl ID (year and ISO week) of a specific monthly crawl.
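As a sketch of that URL construction (the data.commoncrawl.org host, the crawl-data path layout, and the CC-MAIN-2019-04 crawl ID are assumptions based on the current hosting; the get-started page above is authoritative):

```python
# Sketch: build the URL of the gzipped list of WET file paths for one crawl.
# The host and path layout are assumptions; verify against the get-started page.
def wet_paths_url(crawl_id):
    """Return the URL of the wet.paths.gz listing for a CC-MAIN crawl ID."""
    return ("https://data.commoncrawl.org/crawl-data/"
            "CC-MAIN-{}/wet.paths.gz".format(crawl_id))

print(wet_paths_url("2019-04"))
```

Each line of the (gunzipped) listing is then the path of one WET file relative to the same host.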
Yossi.
--
You received this message because you are subscribed to the Google Groups "Common Crawl" group.
To unsubscribe from this group and stop receiving emails from it, send an email to common-crawl...@googlegroups.com.
To post to this group, send email to common...@googlegroups.com.
Visit this group at https://groups.google.com/group/common-crawl.
For more options, visit https://groups.google.com/d/optout.
offset, length = int(record['offset']), int(record['length'])
offset_end = offset + length - 1
resp = requests.get(prefix + record['filename'].replace("/warc/", "/wet/").replace(".warc.gz", ".warc.wet.gz"),
                    headers={'Range': 'bytes={}-{}'.format(offset, offset_end)})
b'<?xml version="1.0" encoding="UTF-8"?>\n<Error><Code>InvalidRange</Code><Message>The requested range is not satisfiable</Message><RangeRequested>bytes=847111823-847128797</RangeRequested><ActualObjectSize>144936160</ActualObjectSize><RequestId>B2FACC3B4B986CB1</RequestId><HostId>rhOJvMJX/tay0JK953e5KFdOK9TJcWLeN6Z677/jdJrpHFGGXE15ijxn8S7GdmKIx1vlHuG4joU=</HostId></Error>'
resp = requests.get(prefix + record['filename'].replace("/warc/", "/wet/").replace(".warc.gz", ".warc.wet.gz"),
                    headers={'WARC-Target-URI': 'http://advocatehealth.com/condell/emergencyservices3'})
But one more problem arises: Common Crawl provides offsets for the WARC files, not for the WET files, so the WARC offsets produce the InvalidRange error above. It turns out this is also not a problem: a WET file is essentially a WARC file whose payloads are the plain text with the HTML tags removed, so you can use this library.
(1) use the WET file and read it from the beginning until the WARC-Target-URI is found
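A minimal stand-alone sketch of that sequential scan over an uncompressed WET stream; real code should use a proper WARC parsing library (e.g. warcio) rather than this hand-rolled reader, and the function names here are illustrative:

```python
import io

def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed
    WARC/WET byte stream. Minimal reader for illustration only."""
    while True:
        line = stream.readline()
        if not line:
            return                      # end of stream
        if not line.strip():
            continue                    # blank lines between records
        if not line.startswith(b"WARC/"):
            return                      # not a record boundary; stop
        headers = {}
        while True:
            line = stream.readline()
            if not line.strip():
                break                   # blank line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload

def find_by_target_uri(stream, uri):
    """Read from the beginning until WARC-Target-URI matches `uri`."""
    for headers, payload in iter_warc_records(stream):
        if headers.get("WARC-Target-URI") == uri:
            return payload
    return None
```

For a downloaded .warc.wet.gz you would wrap the file in gzip.open(...) and pass that as the stream; since the records are read one at a time, the whole file never has to be held in memory.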
Hi,
> removing the header information
You need to pass the fetched WARC record to a WARC parser first.
Then you can get the payload (the HTML document) and pass it
to an HTML parser.
> ignore requests with HTTP 404 or 302
The HTTP status is contained in the index. The easiest way is to filter
on the "status" field and process only page captures with status "200".
Best,
Sebastian
On 1/30/19 5:59 PM, TradPhy wrote:
> Besides removing the header information, I want to ignore requests with HTTP 404 or 302 if the page is not
> available. How can I manage that?
>
>
> Many thanks!
>
> On Wednesday, January 30, 2019 at 14:48:27 UTC+1, TradPhy wrote:
>
> (1) use the WET file and read it from the beginning until the WARC-Target-URI is found
>
>
> Do you mean to first read the whole file into memory, or is there really a way to read it step by step?
>