CDXFormatIndex.getPrefixIterator(prefix) returns entries that do not match the prefix

David Portabella

Feb 8, 2017, 12:20:16 PM
to openwayback-dev
Using CDXFormatIndex.getPrefixIterator(prefix),
I would expect to get only the entries that match this prefix.
Instead, it finds the first entry matching the prefix and then returns every entry from that point until the end of the archive,
so it also returns entries that do not match the prefix.

Why?
How can I get *only* the entries that match the prefix?


Example (written in Scala):

  package application

  import org.archive.wayback.core.CaptureSearchResult
  import org.archive.wayback.resourceindex.cdx.CDXFormatIndex
  import org.archive.wayback.util.url.AggressiveUrlCanonicalizer

  import scala.collection.JavaConverters._

  object ResponseWarcReaderExample {
    def main(args: Array[String]) {
      val index = new CDXFormatIndex()
      index.setPath("/dataset/files.warc.cdx")

      val key = canonicalize("http://www.rmspumptools.com/innovation.php")
      val it = index.getPrefixIterator(key).asScala.foreach { (r: CaptureSearchResult) =>
        println(s"${r.getFile}:${r.getOffset}: ${r.getOriginalUrl}")
      }
    }

    val canonicalizer = new AggressiveUrlCanonicalizer()

    def canonicalize(url: String): String =
      canonicalizer.urlStringToKey(url)
  }


The output is as follows:

  files.warc.gz:1053529: http://www.rmspumptools.com/innovation.php
  files.warc.gz:1181319: https://www.sjm.com/en/legal-notices-patents/patents/cardiac-rhythm-management-patents
  files.warc.gz:11538: http://www.slaperoo.com/
  files.warc.gz:1268086: https://www.smarttech.com/patents
  files.warc.gz:826021: http://speckip.com/
  ...

Note that CDXFormatIndex/CDXIndex.getPrefixIterator calls FlatFile.getRecordIterator(final String prefix),
which returns an input stream starting at the offset of the first entry matching the prefix, and then reads everything until the end of the archive:

  RandomAccessFile raf = new RandomAccessFile(file,"r");
  findKeyOffset(raf, prefix);



andrew.jackson

Feb 13, 2017, 5:40:57 AM
to openwayback-dev
I believe the idea is to allow you to use the same API to perform other queries, such as listing all URIs that start with a given prefix (e.g. all URIs for a host).

If you only want the list that matches a specific URI, you should stop pulling results from the iterator when the URI is no longer the one you want.
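Stopping early is safe because the entries are sorted by key, so all entries sharing a prefix sit next to each other. Here is a minimal, library-free sketch of that idea. The keys and the `seekFrom` and `matching` helpers are hypothetical stand-ins for illustration, not OpenWayback APIs, and `seekFrom` only approximates the seek-then-read-to-end behaviour described in this thread.

```scala
object SortedPrefixScanSketch {
  // Hypothetical canonicalized keys, kept in sorted order like the lines of a CDX file.
  val keys: Vector[String] = Vector(
    "example.com/",
    "rmspumptools.com/about.php",
    "rmspumptools.com/innovation.php",
    "sjm.com/en/legal-notices-patents",
    "slaperoo.com/"
  )

  // Assumed behaviour: seek to the first key that is >= the prefix,
  // then return everything from there to the end.
  def seekFrom(prefix: String): Iterator[String] =
    keys.iterator.dropWhile(_ < prefix)

  // Client-side bound: stop at the first key that no longer starts with the prefix.
  def matching(prefix: String): Iterator[String] =
    seekFrom(prefix).takeWhile(_.startsWith(prefix))

  def main(args: Array[String]): Unit =
    matching("rmspumptools.com/").foreach(println)
}
```

Because the keys are sorted, the first non-matching key guarantees that no later key can match, so `takeWhile` loses nothing compared with filtering the whole tail.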

HTH,
Andy Jackson

David Portabella

Feb 13, 2017, 10:43:26 AM
to openwayback-dev
The API contains these two functions: getPrefixIterator and getUrlIterator.

With getPrefixIterator, I would expect to get all entries that match a given prefix. With getUrlIterator, I would expect to get all entries that match a URL. The problem is that getPrefixIterator returns the first entry that matches the given prefix, and then all the following entries until the end of the archive (which do not match the prefix). The same goes for getUrlIterator: it returns the first entry that matches the full URL, and then all the following entries until the end of the archive.

I think the example makes this pretty clear. I asked for all the entries that match the prefix rmspumptools.com/innovation.php,
and I get the entry http://www.rmspumptools.com/innovation.php (this is correct), but also https://www.sjm.com/en, http://www.slaperoo.com/ and so on until the end of the archive (none of which match the prefix).

andrew.jackson

Feb 13, 2017, 11:41:03 AM
to openwayback-dev
This admittedly surprising behaviour appears to be intended: https://github.com/iipc/openwayback/blob/6475121cef79240b5e18a5f2c224ff9ba933b43d/wayback-core/src/main/java/org/archive/wayback/util/flatfile/FlatFile.java#L39-L41

For example, here is a case where the match is tested in the client rather than in the iterator: https://github.com/iipc/openwayback/blob/6475121cef79240b5e18a5f2c224ff9ba933b43d/wayback-core/src/main/java/org/archive/wayback/resourceindex/adapters/CaptureToUrlSearchResultIterator.java#L79

I'm a bit wary of changing FlatFile.java as this is used in a number of places that may depend on this behaviour. However, it would seem reasonable to modify the CDXIndex or CDXFormatIndex child classes to behave in a less surprising fashion.
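One way to get the less surprising behaviour without touching FlatFile.java would be to wrap the iterator that the index returns. The sketch below is generic Scala and is not taken from the OpenWayback code base: `BoundedIterator`, `matches` and `onEnd` are hypothetical names, and `onEnd` is only a hook where a real implementation might close the underlying iterator.

```scala
// Wraps an iterator over key-sorted records and ends it at the first record
// that does not satisfy `matches`. `onEnd` runs exactly once, when the wrapper
// reaches its end (first non-matching record or exhausted underlying iterator).
final class BoundedIterator[A](
    underlying: Iterator[A],
    matches: A => Boolean,
    onEnd: () => Unit
) extends Iterator[A] {
  private var lookahead: Option[A] = None
  private var finished = false

  // Pull one record ahead so hasNext can answer without consuming a match.
  private def advance(): Unit =
    if (!finished && lookahead.isEmpty) {
      if (underlying.hasNext) {
        val candidate = underlying.next()
        if (matches(candidate)) lookahead = Some(candidate)
        else finish()
      } else finish()
    }

  private def finish(): Unit = {
    finished = true
    onEnd()
  }

  override def hasNext: Boolean = {
    advance()
    lookahead.isDefined
  }

  override def next(): A = {
    advance()
    val value = lookahead.getOrElse(
      throw new NoSuchElementException("no more matching records"))
    lookahead = None
    value
  }
}
```

A subclass of CDXIndex or CDXFormatIndex could, under these assumptions, return such a wrapper with `matches` testing the record's URL key against the requested prefix, leaving callers that rely on the current FlatFile behaviour unaffected.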


Best,
Andy

David Portabella

Feb 13, 2017, 11:59:42 AM
to openwayback-dev
I see.

This is my current workaround:

  def filterByPrefix(prefix: String): Iterator[CaptureSearchResult] = {
    val key = canonicalize(prefix)
    getIndex.getPrefixIterator(key).asScala.takeWhile(_.getUrlKey.startsWith(key))
  }

  def filterByUrl(url: String): Iterator[CaptureSearchResult] = {
    val key = canonicalize(url)
    getIndex.getUrlIterator(key).asScala.takeWhile(key == _.getUrlKey)
  }

  val canonicalizer = new AggressiveUrlCanonicalizer()
  def canonicalize(url: String): String =
    canonicalizer.urlStringToKey(url)


What is also "surprising" is that calling getPrefixIterator a second time throws an exception.
You need to close the CDXFormatIndex and open a new one.


Cheers,
David