Hi José,
images are normally excluded from the crawl and are not fetched. Common Crawl is about textual
content, not images, sound, video, etc. Sometimes a server returns an image for a URL
that does not look like an image URL, or a redirect points to an image. That's why some images and
other multimedia documents slipped in. In total, there are a few million images. The easiest way
to find them is the URL index. However, the index server API [1] does not allow searching for them
directly; it's only possible to filter by MIME type as a secondary filter on a URL query, e.g.:
http://index.commoncrawl.org/CC-MAIN-2016-40-index?url=*.de&filter=mime:image/jpeg&output=json
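If you want to script this kind of query, here is a minimal Python sketch. It rebuilds the example URL above and parses the response, assuming that output=json returns one JSON object per line. The function names (build_query, parse_records, fetch_records) are my own and not part of any API.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the example query above.
INDEX_ENDPOINT = "http://index.commoncrawl.org/CC-MAIN-2016-40-index"


def build_query(url_pattern, mime, endpoint=INDEX_ENDPOINT):
    """Build an index query URL restricted to one MIME type."""
    params = {"url": url_pattern, "filter": "mime:" + mime, "output": "json"}
    # Keep '*', ':' and '/' unescaped so the URL looks like the example.
    return endpoint + "?" + urlencode(params, safe="*:/")


def parse_records(body):
    """Parse a response body, assuming one JSON object per non-empty line."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


def fetch_records(url_pattern, mime):
    """Run the query against the index server (needs network access)."""
    with urlopen(build_query(url_pattern, mime)) as resp:
        return parse_records(resp.read().decode("utf-8"))
```

Calling fetch_records("*.de", "image/jpeg") sends the same request as the URL above. A pattern as broad as *.de can match a lot of records, so a narrower domain is probably a better first test.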
It's also possible to get the pages/images via the index server API.
Alternatively, you can grep the index files for a list of desired MIME types.
The files are also available in the commoncrawl bucket on S3, 300 objects/files per index:
s3://commoncrawl/cc-index/collections/CC-MAIN-2016-40/indexes/cdx-*.gz
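Once you have downloaded one of these cdx-*.gz files, the "grep" could look roughly like the Python sketch below. It assumes each index line has the layout "<SURT key> <timestamp> <JSON>" and that the JSON part carries a "mime" field; please verify this against a real file first. The MIME list and the function names are only illustrative.

```python
import gzip
import json

# Illustrative list of wanted MIME types; extend as needed.
IMAGE_MIME_TYPES = {"image/jpeg", "image/png", "image/gif"}


def parse_cdx_line(line):
    """Return the JSON part of an index line, or None if it does not parse.

    Assumed layout: "<SURT key> <timestamp> <JSON>".
    """
    parts = line.rstrip("\n").split(" ", 2)
    if len(parts) != 3:
        return None
    try:
        record = json.loads(parts[2])
    except ValueError:
        return None
    return record if isinstance(record, dict) else None


def filter_by_mime(lines, wanted=IMAGE_MIME_TYPES):
    """Yield the records whose "mime" field is one of the wanted types."""
    for line in lines:
        record = parse_cdx_line(line)
        if record is not None and record.get("mime") in wanted:
            yield record


def scan_index_file(path, wanted=IMAGE_MIME_TYPES):
    """Scan one locally stored, gzip-compressed index file."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        yield from filter_by_mime(fh, wanted)
```

Each file is read as a stream, so memory use stays small even for large index files, and the 300 files of one index can be processed independently of each other.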
Alternatively, you might also have a look at [2].
Best,
Sebastian
[1] https://github.com/ikreymer/pywb/wiki/CDX-Server-API
[2] https://github.com/centic9/CommonCrawlDocumentDownload/