Finding URLs by content type / file extension

Tom Morris

Apr 22, 2015, 4:44:12 PM

to common...@googlegroups.com

[Pulling this into its own thread, separate from the announcement thread]

On Wed, Apr 22, 2015 at 3:50 PM, Dominik Stadler <dominik...@gmail.com> wrote:

I am also interested in accessing the index in a different way for my
small research project to get a large number of files of certain file
types to mass-test some frameworks that handle files, e.g. Apache Tika
and Apache POI, see
https://github.com/centic9/CommonCrawlDocumentDownload, currently I am
using the previous URL Index which stored the data in a different
format.

It sounds like you were using the file extension, rather than the content type, to determine the type of the target document. That's going to miss URLs like these (from one of the new indexes):

cdx-00000-urls.gz:http://www.izha.edu.al/index.php?view=article&catid=38%3Abiblioteka-elektronike&id=114%3Aprograme-klasa-10&format=pdf&option=com_content&Itemid=18

cdx-00000-urls.gz:http://www.akbn.gov.al/index.php/sq/hidrokarburet/kuadri-ligjor?format=pdf

cdx-00000-urls.gz:http://www.nationalfilmcenter.gov.al/index.php?view=article&catid=42%3Arreth-nesh-&id=107%3Aorganizimi-&format=pdf&option=com_content&Itemid=109

If you're happy just looking for patterns in the URL, whether it be .pdf or format=pdf, and don't care about URLs where format=1 means PDF, you can adapt the technique that I just posted in another thread to process the CDX files.
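As a rough illustration of the URL-pattern approach, here is a minimal Python sketch. It assumes the input is a sequence of plain URLs, one per line (with any grep-style `filename:` prefix already stripped); the function names `looks_like_pdf` and `filter_pdf_urls` are made up for this example. It matches a path ending in .pdf or a format=pdf query parameter, and, as noted above, it will not catch sites where something like format=1 means PDF.

```python
from urllib.parse import urlsplit, parse_qs


def looks_like_pdf(url):
    """Guess from the URL alone whether it points at a PDF.

    Matches a path ending in .pdf or a query parameter format=pdf.
    This is only a heuristic; it does not look at the content type.
    """
    parts = urlsplit(url.strip())
    if parts.path.lower().endswith(".pdf"):
        return True
    params = parse_qs(parts.query)
    return any(value.lower() == "pdf" for value in params.get("format", []))


def filter_pdf_urls(lines):
    """Yield only those URLs (one per input line) that look like PDFs."""
    for line in lines:
        url = line.strip()
        if url and looks_like_pdf(url):
            yield url


if __name__ == "__main__":
    sample = [
        "http://www.akbn.gov.al/index.php/sq/hidrokarburet/kuadri-ligjor?format=pdf",
        "http://example.com/report.PDF",
        "http://example.com/index.php?format=1",
    ]
    for url in filter_pdf_urls(sample):
        print(url)
```

To run it over a compressed list of URLs, the lines could be read with `gzip.open(path, "rt")` and passed straight to `filter_pdf_urls`.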

Tom