When I call CDXFormatIndex.getPrefixIterator(prefix), I expect to get only the entries that match the prefix.
Instead, it finds the first entry matching the prefix and then returns every entry from that point until the end of the archive, so it also returns entries that do not match the prefix.
Why does it behave this way?
How can I get *only* the entries that match the prefix?
Example (written in Scala):
package application

import org.archive.wayback.core.CaptureSearchResult
import org.archive.wayback.resourceindex.cdx.CDXFormatIndex
import org.archive.wayback.util.url.AggressiveUrlCanonicalizer

import scala.collection.JavaConverters._

object ResponseWarcReaderExample {

  val canonicalizer = new AggressiveUrlCanonicalizer()

  // Turn a URL into the canonical key form used for CDX lookups.
  def canonicalize(url: String): String =
    canonicalizer.urlStringToKey(url)

  def main(args: Array[String]): Unit = {
    val index = new CDXFormatIndex()
    index.setPath("/dataset/files.warc.cdx")

    val key = canonicalize("http://www.rmspumptools.com/innovation.php")

    // Print every record the prefix iterator yields for this key.
    index.getPrefixIterator(key).asScala.foreach { (r: CaptureSearchResult) =>
      println(s"${r.getFile}:${r.getOffset}: ${r.getOriginalUrl}")
    }
  }
}
The output is as follows:
files.warc.gz:1053529: http://www.rmspumptools.com/innovation.php
files.warc.gz:1181319: https://www.sjm.com/en/legal-notices-patents/patents/cardiac-rhythm-management-patents
files.warc.gz:11538: http://www.slaperoo.com/
files.warc.gz:1268086: https://www.smarttech.com/patents
files.warc.gz:826021: http://speckip.com/
...
Note that CDXFormatIndex/CDXIndex.getPrefixIterator calls FlatFile.getRecordIterator(final String prefix),
which returns an input stream starting at the offset of the first entry matching the prefix and then reads everything until the end of the archive:
RandomAccessFile raf = new RandomAccessFile(file,"r");
findKeyOffset(raf, prefix);
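
To make the behaviour concrete, here is a minimal in-memory sketch in Java (the library's language) of what I think is happening. It is not the FlatFile implementation: the class PrefixScanSketch, its methods, and the sample keys are all hypothetical, and the keys are made up to resemble the sorted output above. It contrasts "seek to the first key at or after the prefix and read to the end" with the behaviour I expected, which stops at the first key that no longer starts with the prefix.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of a sorted CDX key file; not part of the wayback library.
final class PrefixScanSketch {

    // Binary search for the index of the first key >= prefix (a lower bound).
    static int firstAtOrAfter(List<String> sortedKeys, String prefix) {
        int lo = 0;
        int hi = sortedKeys.size();
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sortedKeys.get(mid).compareTo(prefix) < 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    // Observed behaviour: everything from the first match to the end.
    static List<String> fromFirstMatch(List<String> sortedKeys, String prefix) {
        int start = firstAtOrAfter(sortedKeys, prefix);
        return new ArrayList<>(sortedKeys.subList(start, sortedKeys.size()));
    }

    // Expected behaviour: stop at the first key that does not start with the prefix.
    static List<String> onlyMatching(List<String> sortedKeys, String prefix) {
        List<String> result = new ArrayList<>();
        for (int i = firstAtOrAfter(sortedKeys, prefix); i < sortedKeys.size(); i++) {
            String key = sortedKeys.get(i);
            if (!key.startsWith(prefix)) {
                break; // keys are sorted, so no later key can match
            }
            result.add(key);
        }
        return result;
    }

    public static void main(String[] args) {
        // Illustrative keys only, already in sorted order.
        List<String> keys = Arrays.asList(
                "example.org/",
                "rmspumptools.com/innovation.php",
                "sjm.com/patents",
                "slaperoo.com/",
                "smarttech.com/patents",
                "speckip.com/");
        String prefix = "rmspumptools.com/innovation.php";

        System.out.println("from first match: " + fromFirstMatch(keys, prefix));
        System.out.println("only matching:    " + onlyMatching(keys, prefix));
    }
}
```

Because the keys are sorted, the early break in onlyMatching is safe. If the real iterator exposes each record's canonicalized key, the same stop condition could presumably be applied while consuming it, but I have not verified that against the library.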