retrieve columnar index and Don't download whole WARC file for specific URL

mili lali

Feb 16, 2021, 7:04:27 AM
to Common Crawl
Hi all,
First of all, thanks for your great work.
Suppose I retrieve the columnar index (this link); it shows me JSON records like the one below. Each record has a "url" field and a "filename" field naming the WARC file in which that URL is stored. To access the WARC record for just this URL, must I download the whole WARC file, or is it possible to download only the record for that URL?

I mean, I want to download the HTML page of a specific URL that is saved in a WARC file without downloading the whole WARC file. Any ideas?

Best regards

{"urlkey": "com,bbc)/news/av/business-52834236", "timestamp": "20200919025807", "url": "https://www.bbc.com/news/av/business-52834236", "mime": "text/html", "mime-detected": "text/html", "status": "200", "digest": "3HOICXBC6KLMNHTEXF5EHIUS47XDDM73", "length": "30547", "offset": "778636498", "filename": "crawl-data/CC-MAIN-2020-40/segments/1600400189928.2/warc/CC-MAIN-20200919013135-20200919043135-00712.warc.gz", "charset": "UTF-8", "languages": "eng"}

Sebastian Nagel

Feb 16, 2021, 7:31:55 AM
to common...@googlegroups.com
Hi,

> I mean, I want to download the HTML page of a specific URL that is saved in a WARC file without
> downloading the whole WARC file. Any ideas?

Given the WARC file name, offset and length, just send an HTTP range request for bytes $offset to ($offset+$length-1) to
commoncrawl.s3.amazonaws.com to fetch the WARC record, then uncompress the record. Here is a solution using curl and gzip:

  curl -s -r778636498-$((778636498+30547-1)) \
    "https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2020-40/segments/1600400189928.2/warc/CC-MAIN-20200919013135-20200919043135-00712.warc.gz" \
    | gzip -dc
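
The same range request can be sketched in Python using only the standard library. This is a minimal illustration, not an official client: the function names `range_header` and `fetch_record` are made up for this example, and `BASE_URL` simply reuses the host from the curl command. It assumes the index line carries "filename", "offset" and "length" as in the JSON record from the question.

```python
import gzip
import json
import urllib.request

# Assumption: same public endpoint as in the curl command.
BASE_URL = "https://commoncrawl.s3.amazonaws.com/"


def range_header(offset: int, length: int) -> str:
    """Build the inclusive HTTP Range header value covering one WARC record."""
    return f"bytes={offset}-{offset + length - 1}"


def fetch_record(index_line: str, base_url: str = BASE_URL) -> bytes:
    """Fetch and decompress the single WARC record described by one index JSON line."""
    entry = json.loads(index_line)
    # The index stores offset and length as strings, so convert them first.
    offset = int(entry["offset"])
    length = int(entry["length"])
    request = urllib.request.Request(
        base_url + entry["filename"],
        headers={"Range": range_header(offset, length)},
    )
    with urllib.request.urlopen(request) as response:
        compressed = response.read()
    # Each record is stored as its own gzip member, so the slice
    # can be decompressed independently of the rest of the file.
    return gzip.decompress(compressed)
```

Calling `fetch_record()` with the JSON line from the question should return the raw bytes of that one record: the WARC headers, followed by the HTTP response headers and the HTML payload.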

If you want to fetch many records, please have a look at the following discussions on how to do this
efficiently:
https://groups.google.com/g/common-crawl/c/iZVW5ai9jQI/m/9RKQll_lAQAJ
https://groups.google.com/g/common-crawl/c/Gk8lVd222y0/m/hxEnVBj2AgAJ

> suppose I retrieve columnar index (this link
> <https://index.commoncrawl.org/CC-MAIN-2020-40-index?url=https%3A%2F%2Fbbc.com%2F*&output=json>)

That's the CDX index. But never mind. For the "columnar" index, see
https://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/

Best,
Sebastian

