Is non-html information being indexed?


José González-Brenes

Oct 19, 2016, 3:34:58 PM
to Common Crawl
Hello,

I've tried querying a few different websites, but I'm only getting textual entries. For example:


Does not return a single image. What am I doing wrong?

Thanks,
Jose


Sebastian Nagel

Oct 20, 2016, 2:59:36 AM
to common...@googlegroups.com
Hi José,

images are normally excluded from the crawl and are not fetched. Common Crawl is about textual
content, not images, sound, video, etc. Sometimes a server returns an image for a URL
that does not look like an image URL, or a redirect points to an image. That's why some images and
other multimedia documents slipped in. In total, there are a few million images. The easiest way
to find them is the URL index. However, the index server API [1] does not allow searching for them
directly; it's only possible to filter by MIME type as a secondary step, e.g.:
http://index.commoncrawl.org/CC-MAIN-2016-40-index?url=*.de&filter=mime:image/jpeg&output=json
It's also possible to get the pages/images via the index server API.
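
A minimal Python sketch of such a MIME-filtered query, in case it helps. The helper names (build_query_url, parse_records, fetch_records) are made up for illustration, and the code assumes that output=json returns one JSON object per line with a "mime" field; check an actual response before relying on the field names.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Host taken from the example URL above.
INDEX_HOST = "http://index.commoncrawl.org"


def build_query_url(collection, url_pattern, mime):
    """Build an index query URL that filters results by MIME type."""
    params = {"url": url_pattern, "filter": "mime:" + mime, "output": "json"}
    # Keep '*', ':' and '/' unescaped so the URL matches the hand-written form.
    query = urlencode(params, safe="*:/")
    return INDEX_HOST + "/" + collection + "-index?" + query


def parse_records(body):
    """Parse a response body, assuming one JSON object per non-empty line."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


def fetch_records(collection, url_pattern, mime):
    """Run the query over the network and return the parsed records."""
    with urlopen(build_query_url(collection, url_pattern, mime)) as response:
        return parse_records(response.read().decode("utf-8"))
```

Calling fetch_records("CC-MAIN-2016-40", "*.de", "image/jpeg") would request the same URL as the example above. Large result sets may need paging, which this sketch does not handle.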

Alternatively, it's possible to grep the index files for a list of desired MIME types.
The files can also be found in the commoncrawl bucket on S3. There are 300 objects/files per index:
s3://commoncrawl/cc-index/collections/CC-MAIN-2016-40/indexes/cdx-*.gz
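
A rough Python sketch of the grep approach for one index file that has already been downloaded. It assumes each line has the layout "urlkey timestamp {JSON}" and that the JSON part carries a "mime" field; the function names are hypothetical, and lines that do not fit the assumed layout are skipped.

```python
import gzip
import json


def filter_cdx_lines(lines, wanted_mimes):
    """Yield (urlkey, timestamp, record) for lines whose MIME type is wanted."""
    for line in lines:
        # Assumed layout: urlkey, timestamp, then a JSON object (may contain spaces).
        parts = line.rstrip("\n").split(" ", 2)
        if len(parts) != 3:
            continue  # skip lines that do not match the assumed layout
        urlkey, timestamp, payload = parts
        try:
            record = json.loads(payload)
        except ValueError:
            continue  # skip lines whose third field is not valid JSON
        if isinstance(record, dict) and record.get("mime") in wanted_mimes:
            yield urlkey, timestamp, record


def filter_cdx_file(path, wanted_mimes):
    """Scan one locally stored, gzip-compressed index file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        yield from filter_cdx_lines(handle, wanted_mimes)
```

For example, looping over filter_cdx_file(path, {"image/jpeg", "image/png"}) and printing record["url"] would list the image URLs in that file, where path points to a local copy of one of the cdx-*.gz objects.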

Alternatively, you might also have a look at [2].

Best,
Sebastian

[1] https://github.com/ikreymer/pywb/wiki/CDX-Server-API
[2] https://github.com/centic9/CommonCrawlDocumentDownload/

José Pablo González

Oct 20, 2016, 10:43:00 AM
to common...@googlegroups.com
Thank you for the clarification! : )