No images with Common Crawl .warc files and pywb


Gregory Petropoulos

Nov 30, 2016, 12:56:28 PM
to Common Crawl
My goal is to use Common Crawl to generate screenshots of websites.

To do this I am using Python. Below is my code to download the .warc record for a given URL.

from bs4 import BeautifulSoup
import requests
import json

# Query the Common Crawl URL index; the response is one JSON record per line.
r = requests.get('http://index.commoncrawl.org/CC-MAIN-2015-11-index?url=*.google.com/*&output=json')
soup = BeautifulSoup(r.content, "html.parser")
hits = soup.contents[0].split("\n")

# Pick one hit and work out the byte range of its record in the WARC file.
page = json.loads(hits[100])
offset, length = int(page['offset']), int(page['length'])
offset_end = offset + length - 1

# Fetch just that range from S3. The range is itself a complete gzip member,
# so it can be written out directly as a .warc.gz file.
base = 'https://commoncrawl.s3.amazonaws.com/'
filename = page['filename']
resp = requests.get(base + filename, headers={'Range': 'bytes={}-{}'.format(offset, offset_end)})
with open('full.warc.gz', 'wb') as f_out:
    f_out.write(resp.content)
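
Each JSON line returned by the index also carries metadata about the capture, so a hit can be checked before its record is downloaded. A minimal sketch reusing the hits list above (the mime and status field names follow the CDX-style JSON the index returns):

# Sketch: inspect a hit's index metadata before fetching its WARC record.
hit = json.loads(hits[100])
print(hit.get('mime'), hit.get('status'), hit.get('url'))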

After downloading the file, I use pywb (https://github.com/ikreymer/pywb) to serve it:

pip install pywb
wb-manager init my_coll
wb-manager add my_coll <path/to/warc>
wayback
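
With pywb's default settings, wayback serves the collection at http://localhost:8080/, and an archived page can be opened as http://localhost:8080/my_coll/<url>.
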
The replayed page, however, lacks images and background styling. I have tried loading other pages as well, but I only get text and links. Any suggestions on what I am doing wrong?

Sebastian Nagel

Nov 30, 2016, 1:02:09 PM
to common...@googlegroups.com
Hi Gregory,

there is probably nothing wrong with your program code.

Common Crawl focuses on HTML pages, with data and web science,
natural language processing, and data mining as the main use cases.
Images, stylesheets, and other page resources are not crawled, so pywb
cannot replay them from the WARC files. There is only a small percentage
of other document formats in the archives, and those appear by chance
rather than by intention.
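
You can verify this by listing the content types of the HTTP responses in
the slice you downloaded, e.g. with the warcio library (a minimal sketch,
assuming "pip install warcio" and the full.warc.gz produced by your script):

# Print the Content-Type of each HTTP response record in the WARC slice.
from warcio.archiveiterator import ArchiveIterator

with open('full.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == 'response':
            print(record.http_headers.get_header('Content-Type'))

For Common Crawl data this will essentially always print text/html, which
is why the replayed pages come up without images or styling.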

Best,
Sebastian

