Hi Gregory,
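there is probably nothing wrong with your code.
Common Crawl focuses on HTML pages, with data and web science,
natural language processing and data mining as its main use cases.
Other document formats make up only a small percentage of the crawl,
and those are captured by mistake rather than by design. Images,
stylesheets and scripts referenced by a page are therefore usually
not in the archive, so pywb has nothing to replay for them.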
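One way to check this for a given crawl is to tally the MIME types reported in the index output. The following is only a minimal sketch, not an official tool: it assumes each line of the index response is a JSON record with a `mime` field, and the helper name `count_mime_types` and the sample records are made up for illustration.

```python
import json
from collections import Counter


def count_mime_types(index_lines):
    """Tally the 'mime' field over JSON index records, one record per line.

    Assumption: records carry a 'mime' key; missing ones count as 'unknown'.
    """
    counts = Counter()
    for line in index_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines, e.g. a trailing newline
        record = json.loads(line)
        counts[record.get("mime", "unknown")] += 1
    return counts


# Hypothetical sample records standing in for a real index response.
sample = [
    '{"url": "http://example.com/", "mime": "text/html"}',
    '{"url": "http://example.com/about", "mime": "text/html"}',
    '{"url": "http://example.com/logo.png", "mime": "image/png"}',
]
mime_counts = count_mime_types(sample)
for mime, n in mime_counts.most_common():
    print(mime, n)
```

Applied to the response from your script, for example `count_mime_types(r.text.splitlines())`, I would expect `text/html` to dominate.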
Best,
Sebastian
On 11/30/2016 06:56 PM, Gregory Petropoulos wrote:
> My goal is to use Common Crawl to generate screenshots of websites.
>
> To do this I am using Python. Below is my code to copy the WARC record for a given URL.
>
> import json
>
> import requests
>
> # Query the URL index; the response is plain text with one JSON record per line.
> r = requests.get('http://index.commoncrawl.org/CC-MAIN-2015-11-index?url=*.google.com/*&output=json')
> hits = r.text.splitlines()
>
> page = json.loads(hits[100])
> offset, length = int(page['offset']), int(page['length'])
> offset_end = offset + length - 1
> base = 'https://commoncrawl.s3.amazonaws.com/'
> filename = page['filename']
> # Request only the bytes of this one record from the larger WARC file.
> resp = requests.get(base + filename, headers={'Range': 'bytes={}-{}'.format(offset, offset_end)})
> # The byte range is already gzip-compressed, so write it out unchanged.
> with open('full.warc.gz', 'wb') as f_out:
>     f_out.write(resp.content)
>
> After downloading the file I use pywb <https://github.com/ikreymer/pywb> to host the file:
>
> pip install pywb
> wb-manager init my_coll
> wb-manager add my_coll <path/to/warc>
> wayback
>
> The webpage lacks images and background information. I have tried loading other pages as well. I
> only get text and links. Any suggestions on what I am doing wrong?
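
To see what the downloaded record actually holds, its headers can be inspected before handing the file to pywb. This is a rough sketch, assuming the file contains a single gzipped WARC `response` record (WARC headers, then HTTP headers, then the payload). The helper `record_headers` and the sample record are hypothetical, and this is not a full WARC parser.

```python
import gzip


def _parse_header_block(block):
    """Turn 'Name: value' lines into a dict, skipping the first (status) line."""
    headers = {}
    for line in block.decode("utf-8", "replace").split("\r\n")[1:]:
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    return headers


def record_headers(gz_bytes):
    """Return (warc_headers, http_headers) for one gzipped WARC response record."""
    raw = gzip.decompress(gz_bytes)
    # Blank lines separate the WARC headers, the HTTP headers and the payload.
    warc_block, http_block, _payload = raw.split(b"\r\n\r\n", 2)
    return _parse_header_block(warc_block), _parse_header_block(http_block)


# Hypothetical record built in memory so the sketch runs without a download.
sample_record = gzip.compress(
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"\r\n"
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"<html><img src='logo.png'></html>\r\n\r\n"
)
warc_headers, http_headers = record_headers(sample_record)
print(warc_headers["warc-type"], http_headers["content-type"])
```

For the file written by the script above, `record_headers(open('full.warc.gz', 'rb').read())` would presumably report `text/html`, meaning the record holds only the page markup. The images it references live at separate URLs and would each need their own record.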