Hi,
> I mean, I want to download the HTML page of a specific URL that is saved in the WARC file without downloading
> the whole WARC file. Any idea?
Given the WARC file name, offset and length, just send an HTTP range request for bytes $offset to ($offset+$length-1)
to fetch the WARC record from commoncrawl.s3.amazonaws.com, then decompress the record. Here is a solution using curl and gzip:
curl -s -r778636498-$((778636498+30547-1)) \
  "https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2020-40/segments/1600400189928.2/warc/CC-MAIN-20200919013135-20200919043135-00712.warc.gz" \
  | gzip -dc
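
The same steps can be wrapped in two small shell functions. This is only a sketch: the function names cc_byte_range and cc_fetch_record are made up for illustration and are not part of any Common Crawl tool. It assumes a POSIX-like shell with curl and gzip installed, and that filename, offset and length are taken from an index record.

```shell
# Hypothetical helpers; the names are illustrative only.

# Print the inclusive byte range "first-last" for a record
# that starts at OFFSET and is LENGTH bytes long.
cc_byte_range() {
  local offset=$1 length=$2
  echo "${offset}-$((offset + length - 1))"
}

# Fetch a single WARC record with an HTTP range request and
# decompress it to stdout.
# Usage: cc_fetch_record FILENAME OFFSET LENGTH
cc_fetch_record() {
  local filename=$1 offset=$2 length=$3
  curl -s -r "$(cc_byte_range "$offset" "$length")" \
    "https://commoncrawl.s3.amazonaws.com/${filename}" \
    | gzip -dc
}
```

For the record discussed here, the call would be cc_fetch_record followed by the "filename" value, 778636498 and 30547. The last byte is offset+length-1 because HTTP byte ranges are inclusive on both ends.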
If you want to fetch many records, please have a look at the following discussions on how to do this
efficiently:
https://groups.google.com/g/common-crawl/c/iZVW5ai9jQI/m/9RKQll_lAQAJ
https://groups.google.com/g/common-crawl/c/Gk8lVd222y0/m/hxEnVBj2AgAJ
> suppose I retrieve columnar index (this link
> <https://index.commoncrawl.org/CC-MAIN-2020-40-index?url=https%3A%2F%2Fbbc.com%2F*&output=json>)
That's the CDX index, not the columnar index. But never mind. For the "columnar" index, see
https://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/
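
The CDX index returns one JSON record per line, and the three values needed for the range request are the "filename", "offset" and "length" fields. As a rough sketch, they can be pulled out with sed. The helper name cdx_field is made up for illustration, and the pattern assumes a flat, single-line record with quoted string values, as in the record quoted further down. A real JSON parser such as jq would be more robust.

```shell
# Hypothetical helper: print the string value of KEY from a flat,
# single-line JSON record. Not a general JSON parser.
cdx_field() {
  local record=$1 key=$2
  printf '%s\n' "$record" | sed -n "s/.*\"${key}\": *\"\([^\"]*\)\".*/\1/p"
}

# The CDX record from this thread, on a single line.
record='{"urlkey": "com,bbc)/news/av/business-52834236", "timestamp": "20200919025807", "url": "https://www.bbc.com/news/av/business-52834236", "mime": "text/html", "mime-detected": "text/html", "status": "200", "digest": "3HOICXBC6KLMNHTEXF5EHIUS47XDDM73", "length": "30547", "offset": "778636498", "filename": "crawl-data/CC-MAIN-2020-40/segments/1600400189928.2/warc/CC-MAIN-20200919013135-20200919043135-00712.warc.gz", "charset": "UTF-8", "languages": "eng"}'

filename=$(cdx_field "$record" filename)
offset=$(cdx_field "$record" offset)
length=$(cdx_field "$record" length)

# Show the values that go into the range request.
echo "file=$filename offset=$offset length=$length"
```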
Best,
Sebastian
On 2/16/21 1:04 PM, mili lali wrote:
> Hi Dears,
> First of all, thanks for your nice work.
> Suppose I retrieve the columnar index (this link
> <https://index.commoncrawl.org/CC-MAIN-2020-40-index?url=https%3A%2F%2Fbbc.com%2F*&output=json>). It shows me JSON records like the one below. Each record has
> a "url" field and gives the filename of the WARC file in which this URL is saved. To access just the WARC record of this URL, must I download
> the whole WARC file, or is it possible to download the WARC record of only that URL?
>
> I mean, I want to download the HTML page of a specific URL that is saved in the WARC file without downloading the whole WARC file. Any idea?
>
> best regards
>
> {"urlkey": "com,bbc)/news/av/business-52834236", "timestamp": "20200919025807",
> "url": "https://www.bbc.com/news/av/business-52834236",
> "mime": "text/html", "mime-detected": "text/html", "status": "200", "digest": "3HOICXBC6KLMNHTEXF5EHIUS47XDDM73", "length": "30547",
> "offset": "778636498", "filename":
> "crawl-data/CC-MAIN-2020-40/segments/1600400189928.2/warc/CC-MAIN-20200919013135-20200919043135-00712.warc.gz", "charset": "UTF-8",
> "languages": "eng"}