I am also interested in accessing the index in a different way. For a
small research project I need a large number of files of certain file
types to mass-test frameworks that handle files, e.g. Apache Tika and
Apache POI; see
https://github.com/centic9/CommonCrawlDocumentDownload. Currently I am
using the previous URL Index, which stored the data in a different
format.