Finding URLs by content type / file extension

Tom Morris

Apr 22, 2015, 4:44:12 PM
to common...@googlegroups.com
[Pulling this into its own thread, separate from the announcement thread]

On Wed, Apr 22, 2015 at 3:50 PM, Dominik Stadler <dominik...@gmail.com> wrote:

I am also interested in accessing the index in a different way for my
small research project to get a large number of files of certain file
types to mass-test some frameworks that handle files, e.g. Apache Tika
and Apache POI, see
https://github.com/centic9/CommonCrawlDocumentDownload, currently I am
using the previous URL Index which stored the data in a different
format.

It sounds like you were using the file extension, rather than the content type, to determine the type of the target document. That's going to miss URLs like this (from one of the new indexes):


If you're happy just looking for patterns in the URL, whether it be .pdf or format=pdf, and don't care about URLs where format=1 means PDF, you can adapt the technique that I just posted in another thread to process the CDX files.
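As a rough illustration of URL-pattern matching over CDX lines (not necessarily the technique from the other thread), here is a minimal Python sketch. It assumes each index line has the layout "<key> <timestamp> <json>" and that the JSON block carries a "url" field; the function name `urls_matching` and the regular expression are hypothetical and would need adjusting to your own patterns.

```python
import json
import re

# Matches a ".pdf" path ending or a "format=pdf" query parameter (case-insensitive).
# It deliberately does NOT catch opaque cases such as "format=1".
PDF_URL = re.compile(r"\.pdf(?=$|[?#])|[?&]format=pdf(?=$|[&#])", re.IGNORECASE)


def urls_matching(lines, pattern=PDF_URL):
    """Yield the URL of every CDX line whose URL matches `pattern`."""
    for line in lines:
        # Assumed layout: "<key> <timestamp> <json block>"
        parts = line.rstrip("\n").split(" ", 2)
        if len(parts) != 3:
            continue  # skip lines that do not fit the assumed layout
        try:
            record = json.loads(parts[2])
        except ValueError:
            continue  # skip lines whose third field is not valid JSON
        url = record.get("url", "")
        if pattern.search(url):
            yield url
```

To run it over a locally downloaded, uncompressed index file you could pass an open file object, e.g. `urls_matching(open(path, encoding="utf-8"))`, where `path` is whichever file you are processing.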

Tom

ikre...@gmail.com

Apr 22, 2015, 5:24:00 PM
to common...@googlegroups.com
Hi,

Also, the new index (CC-MAIN-2015-14) now has the MIME type and status as fields in the JSON block.

This allows you to filter like this:

which should filter out results that had a content type of text/html only. This has been added starting with the 2015-14 index.
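If you are working from index lines you have already downloaded, the same kind of selection can be sketched locally in Python. This is only an illustration, not the index server's filter syntax: it assumes each line is "<key> <timestamp> <json>" and that the JSON block uses the field names "mime" and "status"; the helper name `records_with_mime` is hypothetical.

```python
import json


def records_with_mime(lines, mime, status="200"):
    """Yield parsed JSON blocks whose MIME type and status match.

    Assumes the field names "mime" and "status"; indexes older than
    2015-14 lack these fields, so their records are simply skipped.
    """
    for line in lines:
        # Assumed layout: "<key> <timestamp> <json block>"
        parts = line.rstrip("\n").split(" ", 2)
        if len(parts) != 3:
            continue
        try:
            record = json.loads(parts[2])
        except ValueError:
            continue
        # Compare as strings so "200" and 200 are treated alike.
        if record.get("mime") == mime and str(record.get("status")) == str(status):
            yield record
```

For example, `records_with_mime(lines, "application/pdf")` would keep only successful captures that the crawler recorded as PDF, regardless of what the URL looks like.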


Ilya

Aline Bessa

Apr 22, 2015, 8:45:37 PM
to common...@googlegroups.com
Great! Thanks, guys.