Re: Initial benchmarks for full text search on IndexedDB

Rick Byers

Jul 19, 2017, 2:58:21 PM
to Conrad Irwin, stora...@chromium.org, Victor Costan, Joshua Bell
[Resending to storage-dev now that we believe the group permissions are fixed]

On Wed, Jul 19, 2017 at 3:25 AM, <con...@superhuman.com> wrote:
Hi storage-dev!

(I'm happy for these emails to be public; I'm not sure if this list is. Should I also
CC blink-dev?)

In response to Victor's question of "what does IndexedDB need to do to replace
WebSQL?", I've implemented a very naive search engine on IndexedDB. So far I've
only tried searching for a single term, and haven't written the logic for
field-specific queries, AND-ing, OR-ing, inverting, or "quoting".

The dataset I'm using is most of my emails, case-normalized, with punctuation
removed, and then stemmed using the Porter stemmer (for easy comparison with
WebSQL). There are 28M instances of 747,621 unique terms across 146,121
documents. The distribution of terms is predictably skewed:

       33 terms with > 100,000 occurrences
      366 terms with >  10,000
    2,507 terms with >   1,000
   11,948 terms with >     100
   61,452 terms with >      10
  272,121 terms with >       1

For now, I'm just replicating the approach of WebSQL's full-text search engine (i.e.
expecting the same big-O performance), in the simplest possible way over
IndexedDB.

The way terms are stored in WebSQL's FTS [1] is an optimized version of this
logical structure:

  <term><docid><field><position>

There are two ways I can easily insert this into IndexedDB with different
trade-offs:

1. Have two stores: one for the reverse index, and one for the documents.
2. Have one store which contains the documents, plus an extra field holding the search
   index entries that is then indexed with multiEntry (both layouts are sketched below).
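
To make the difference concrete, here's a minimal sketch of the two layouts (my own
naming and setup code, not the benchmark's): approach 1 does one put per index entry,
approach 2 one put per document.

    let open = indexedDB.open("fts-sketch", 1);
    open.onupgradeneeded = (e) => {
      let db = e.target.result;

      // Approach 1: a separate reverse-index store; every <term><docid><field><position>
      // entry becomes its own record, keyed by the entry string.
      db.createObjectStore("documents_v1");
      db.createObjectStore("index_v1");

      // Approach 2: one store; each document record carries an "entries" array, and a
      // multiEntry index fans that array out into individual index records.
      let docs = db.createObjectStore("documents_v2");
      docs.createIndex("search", "entries", {multiEntry: true});
    };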

With some experimentation we can rule out approach 1:

* It requires 28M put operations (instead of 146k). In benchmarking, it took 3069
  seconds to create the index (vs. 348s for the second approach) [for comparison,
  WebSQL can index all these documents in 51s].
* It uses more disk space than approach 2 [1.6GB vs 1.4GB]. I was surprised by this,
  because approach 2 stores the search index twice, but I guess the combination of prefix
  compression [2] and snappy is enough to skew my intuition. [For comparison,
  WebSQL can pack this index into 635MB.]

I'm happy to share more code if you're interested, but the main thing is how the
index entries are constructed. I wanted to ensure that the logical structure was
the same as WebSQL's, so I used a 64-bit sorted docid and
<term><docid><field><position> ordering.


   for (const field in email) {
     let i = 0;
     for (const token of tokenize(email[field])) {
        // Newest-first ordering: invert the timestamp (in seconds).
        let sort = 0xffffffff - Math.floor(date / 1000);
        // The low bits of the (hex) thread id serve as the rest of the docid;
        // together sort + thrid form the 64-bit sorted docid from above.
        let thrid = parseInt("0x" + thread_id.substr(8));

        let key = token + "\x00" +
          String.fromCharCode((sort >> 16) & 0xffff) + String.fromCharCode(sort & 0xffff) +
          String.fromCharCode((thrid >> 16) & 0xffff) + String.fromCharCode(thrid & 0xffff) +
          String.fromCharCode(ids[field]) + String.fromCharCode(i++);

        entries.push(key);
     }
   }

So we have two databases, one in IndexedDB:

      db.createObjectStore("documents").createIndex("search", "entries", {multiEntry: true})

And one in WebSQL:

      CREATE VIRTUAL TABLE search_test USING fts3(
            thread_id, subject, body, from, to, cc,
            bcc, attachments, labels, meta, rfc822msgid,
            deliveredto, replyto, tokenize=porter
            )
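
(Populating that table is one INSERT per document; FTS3 tokenizes and indexes each
column as it goes. A rough sketch, with only a few of the columns shown:)

      database.transaction((tx) => {
        tx.executeSql(
          "INSERT INTO search_test (thread_id, subject, body) VALUES (?, ?, ?)",
          [email.thread_id, email.subject, email.body]);
      });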

So far I've only tried single-term searches, mostly because the IndexedDB API
is extremely fiddly for anything complicated:

To query WebSQL:

    let start = performance.now();
    let results;
    database.transaction((tx) => {
        tx.executeSql("SELECT * FROM search_test WHERE search_test MATCH ? LIMIT 25",
          [query], (tx, r) => {
            results = r.rows;
          });
      }, null, () => {
        console.log('websql took ' + (performance.now() - start));
        console.table(results);
      });

To query IndexedDB:

    let start = performance.now();
    let results = [];
    let tx = idb.transaction(["documents", "index"], "readonly");
    let bound = IDBKeyRange.bound(token + "\x00", token + "\x01", true, false);
    let seen = {};
    tx.objectStore("documents").index("search").openCursor(bound).onsuccess = (e) => {
      let cursor = e.target.result;

      // A document matches once per occurrence of the term, so de-duplicate by its primary key.
      if (cursor && !seen[cursor.primaryKey]) {
        seen[cursor.primaryKey] = true;
        results.push(cursor.value.document);
      }

      if (cursor && results.length < 25) {
        cursor.continue();
      } else {
        console.log('indexeddb took', performance.now() - start);
        console.table(results);
      }
    };

Running simple searches for the first 25 results for my name, I get the
following performance characteristics:

* WebSQL: median 3.8ms, p(99) 5.5ms  [3]
* IndexedDB: median 23ms, p(99) 39ms [4]
* (The rejected two-store IndexedDB approach, with no index, is remarkably similar: median 23ms,
  p(99) 35ms. I think that makes sense, given that that's essentially how indexes are implemented,
  but it hints that the fix for my problems isn't going to be simply "push stuff into the C++ layer".)

This is consistent with our benchmarks from about a year ago, which pegged
IndexedDB at about 5-10x slower than WebSQL for most simple queries.

Running queries for non-existent terms is a bit strange:

* WebSQL: median 1.9ms, p(99) 3.1ms [5]
* IndexedDB: weirdly modal (see [6]), with peaks at 0.5, 5, and 10ms [median 4.6ms, p(99) 9.0ms]

From this little bit of playing it's hard to suggest anything very sensible, but it would certainly help if:

* IndexedDB was consistently ~5-10x faster. Without being too facetious, these queries
  are the ones where the indexes used are squarely in IndexedDB's wheelhouse, and it's an order
  of magnitude behind WebSQL on both indexing and searching. Some of the indexing time
  (25-30%) is the Porter stemmer in JavaScript, so that's probably reducible outside
  the database; but I don't know how to make querying significantly faster.

* It'd also be awesome to know why the zero-results search has such clear modal peaks
  in its output. Any ideas?

* It was possible to express operations on indexes more clearly. In this example,
  it'd be interesting to know how fast the cursor could be iterated if the query were pushed
  down to the LevelDB layer of abstraction using some kind of API like this:

  tx.objectStore("documents").select(
    IDBQuery.new({
     index: "index",
     bound: IDBKeyRange.bound(token + "\x00", token + "\x01", true, false)
     limit: 25,
     unique: true
   })

  [though obviously that API is not nearly powerful enough to represent most search queries using
   the current index layout, and it's ugly compared to the SQL equivalent]

I'll continue playing and see if I can hook up some of the remaining core
pieces: queries with more than one term, field queries and "adjacent word"
queries.
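
As a teaser for the AND case, the naivest possible sketch with the current layout (this is
my illustration, not code from the benchmark) is to walk each term's key range, pull out the
docid portion of each index key (the 4 code units after the "\x00"), and intersect; a real
engine would leapfrog the cursors with continue() rather than materialising full posting lists:

    // Collect the docids for one term by scanning its key range in the "search" index.
    function docidsFor(tx, token) {
      return new Promise((resolve) => {
        let ids = new Set();
        let bound = IDBKeyRange.bound(token + "\x00", token + "\x01", true, false);
        tx.objectStore("documents").index("search").openCursor(bound).onsuccess = (e) => {
          let cursor = e.target.result;
          if (!cursor) return resolve(ids);
          ids.add(cursor.key.substr(token.length + 1, 4)); // sort + thrid = the 64-bit docid
          cursor.continue();
        };
      });
    }

    // Two-term AND: intersect the two docid sets.
    async function searchAnd(idb, tokenA, tokenB) {
      let tx = idb.transaction(["documents"], "readonly");
      let [a, b] = await Promise.all([docidsFor(tx, tokenA), docidsFor(tx, tokenB)]);
      return [...a].filter((id) => b.has(id));   // docids present for both terms
    }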

I also want to play with maintaining my own b-tree in JavaScript and using IndexedDB as a page store (similar to how FTS on WebSQL works). This should hopefully reduce the dependence on IndexedDB's performance, but significantly increase the JavaScript overhead penalty.
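
(The page-store half of that idea is simple to sketch; everything here is hypothetical
naming, not existing code. The b-tree logic, node layout, splits, and binary search would
live in JavaScript, and IndexedDB would only be asked for point gets and puts of pages:)

    // Read one fixed-size page (e.g. an ArrayBuffer) keyed by page number.
    function readPage(idb, pageNo) {
      return new Promise((resolve, reject) => {
        let req = idb.transaction(["pages"], "readonly")
                     .objectStore("pages").get(pageNo);
        req.onsuccess = () => resolve(req.result);
        req.onerror = () => reject(req.error);
      });
    }

    // Write one page back, keyed by page number.
    function writePage(idb, pageNo, buffer) {
      return new Promise((resolve, reject) => {
        let tx = idb.transaction(["pages"], "readwrite");
        tx.objectStore("pages").put(buffer, pageNo);
        tx.oncomplete = () => resolve();
        tx.onerror = () => reject(tx.error);
      });
    }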

Conrad




Conrad Irwin

Jul 21, 2017, 2:57:30 AM
to Victor Costan, stora...@chromium.org, Engineering-archive
(trying to keep replies on list)

On Wed, Jul 19, 2017 at 2:44 PM, Victor Costan <pwn...@chromium.org> wrote:

Is there any chance you could place your benchmarks in a GitHub repository, in a form in which we could run them?  We have to be able to measure these metrics ourselves, in order to be able to improve on them.

I've uploaded the code here: https://github.com/superhuman/fts (I'd hoped to find time to tidy up the code, so sorry in advance!). It's missing most of what is needed for a full-text search engine; I'll try to find time to push on it a little soon. If you want to add any of the missing features, please go ahead and I can add you as a contributor to the repository.


We don't want to depend on your inbox's content :) so we can replace that with our own public data, like blink-dev + chromium-dev traffic. As long as there's a clear format for the data, we can take care of that.

I've documented the input data format, and there are some tools that can convert from gmail's JSON to the correct format in the tools/ directory (obviously if you're going to use a public archive, you might need to write a tool to parse directly from MIME).

If you do end up getting it working, please send a pull request! I'd love to have a sharable data set we can collaborate on.

Conrad


Victor Costan

Jul 21, 2017, 7:02:59 PM
to Conrad Irwin, stora...@chromium.org, Engineering-archive
On Thu, Jul 20, 2017 at 11:57 PM, Conrad Irwin <con...@superhuman.com> wrote:
(trying to keep replies on list)

Thank you very much! I'm not sure what happened with my previous message, but my intent was for this thread to stay on storage-dev, so others can contribute.

On Wed, Jul 19, 2017 at 2:44 PM, Victor Costan <pwn...@chromium.org> wrote:

Is there any chance you could place your benchmarks in a GitHub repository, in a form in which we could run them?  We have to be able to measure these metrics ourselves, in order to be able to improve on them.

I've uploaded the code here: https://github.com/superhuman/fts (I'd hoped to find time to tidy up the code, so sorry in advance!). It's missing most of what is needed for a full-text search engine; I'll try to find time to push on it a little soon. If you want to add any of the missing features, please go ahead and I can add you as a contributor to the repository.

I was just about to e-mail you to assure you that quick and raw is better than late and polished. Also, for our benchmarking purposes, small code = good, so please don't worry about fancier queries.
 
We don't want to depend on your inbox's content :) so we can replace that with our own public data, like blink-dev + chromium-dev traffic. As long as there's a clear format for the data, we can take care of that.

I've documented the input data format, and there are some tools that can convert from gmail's JSON to the correct format in the tools/ directory (obviously if you're going to use a public archive, you might need to write a tool to parse directly from MIME).

If you do end up getting it working, please send a pull request! I'd love to have a sharable data set we can collaborate on.

I'll try to find some time soon for that. Thank you very much for sharing your approach and code!

    Victor

Conrad Irwin

Jul 24, 2017, 5:30:25 PM
to Victor Costan, stora...@chromium.org, Engineering-archive




On Fri, Jul 21, 2017 at 4:02 PM, Victor Costan <pwn...@chromium.org> wrote:
I was just about to e-mail you to assure you that quick and raw is better than late and polished. Also, for our benchmarking purposes, small code = good, so please don't worry about fancier queries.

Makes sense, but note that I expect the AND query to be about an order of magnitude slower still. It would be a shame for you to conclude that full-text search works just because this benchmark passes; that is absolutely not the case :). This is a necessary, but not sufficient, example.

Conrad

Victor Costan

Aug 4, 2017, 3:18:44 AM
to Conrad Irwin, stora...@chromium.org, Engineering-archive
Sorry I haven't replied to this earlier! Thank you very much for drawing our attention to AND queries.

We're going to try things out for a while, so we won't have much to say in the meantime. I'll come back to this thread when we have interesting results.

    Victor 

Conrad Irwin

Aug 4, 2017, 12:17:51 PM
to Victor Costan, stora...@chromium.org, Engineering-archive
No worries!

Thanks again for taking this on, and please let me know if you do discover anything. (With the caveat that if you do manage to build an IndexedDB search engine, you may find yourself with a new project to maintain for life; so many people would love it :)

conrad

