Hi storage-dev!

(Happy for these emails to be public; I'm not sure if this list is. Should I also CC blink-dev?)

In response to Victor's request for "what does IndexedDB need to do to replace WebSQL", I've implemented a very naive search engine on IndexedDB. I've so far only tried searching for a single term, and haven't written the logic for field-specific queries, and-ing, or-ing, invert-ing or "quoting".

The dataset I'm using is most of my emails, case-normalized, punctuation removed, and then stemmed using the Porter stemmer (for easy comparison with WebSQL). There are 28M instances of 747,621 unique terms across 146,121 documents. The distribution of terms is predictably skewed:

         33 terms > 100,000 occurrences
        366       >  10,000
      2,507       >   1,000
     11,948       >     100
     61,452       >      10
    272,121       >       1

For now, I'm just replicating WebSQL's full-text search engine's approach (i.e. expecting the same big-O performance) in the simplest possible way over IndexedDB.

The way terms are stored in WebSQL's FTS [1] is an optimized version of this logical structure:

    <term><docid><field><position>

There are two ways I can easily insert this into IndexedDB, with different trade-offs:

1. Have two stores, one for the reverse index and one for the documents.
2. Have one store which contains the documents, including an extra field for the search index entries that is then multiEntry-indexed.

With some experimentation we can rule out approach 1:

* It requires 28M put operations (instead of 146k). In benchmarking it took 3069 seconds to create the index (vs. 348s for the second approach). [For comparison, WebSQL can index all these documents in 51s.]
* It uses more disk space than 2 (1.6Gb vs 1.4Gb). I was surprised by this because 2 is storing the search index twice, but I guess the combination of prefix compression [2] and snappy is enough to skew my intuition. [For comparison, WebSQL can pack this index into 635Mb.]

I'm happy to share more code if you're interested, but the main thing is "how are the index entries constructed".
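As a rough sketch of the text-normalization step described above (case-normalize, strip punctuation, split into terms), something like the following; note the function name is mine, and the real pipeline additionally runs each term through the Porter stemmer, which I've omitted here:

```javascript
// Hypothetical tokenizer sketch: lowercase, drop punctuation, split on
// whitespace. The actual benchmark also applies the Porter stemmer to
// each term; that step is omitted here for brevity.
function tokenize(text) {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, " ") // replace punctuation with spaces
    .split(/\s+/)
    .filter((t) => t.length > 0);
}

console.log(tokenize("Hello, storage-dev!")); // ["hello", "storage", "dev"]
```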
I wanted to ensure that the logical structure was the same as WebSQL's, so I used a 64-bit sorted docid and <term><docid><field><position> ordering:

    for (let field in email) {
      let i = 0;
      for (let token of tokenize(email[field])) {
        let sort = 0xffffffff - Math.floor(date / 1000);
        let thrid = parseInt("0x" + thread_id.substr(8));
        let key = token + "\x00" +
          String.fromCharCode((sort >> 16) & 0xffff) + String.fromCharCode(sort & 0xffff) +
          String.fromCharCode((thrid >> 16) & 0xffff) + String.fromCharCode(thrid & 0xffff) +
          String.fromCharCode(ids[field]) + String.fromCharCode(i++);
        entries.push(key);
      }
    }

So we have two databases. One in IndexedDB:

    db.createObjectStore("documents").createIndex("search", "entries", { multiEntry: true })

And one in WebSQL:

    CREATE VIRTUAL TABLE search_test USING fts3(
      thread_id, subject, body, from, to, cc,
      bcc, attachments, labels, meta, rfc822msgid,
      deliveredto, replyto, tokenize=porter)

So far I've only tried single-term searches, mostly because the IndexedDB API is extremely fiddly for anything complicated.

To query WebSQL:

    let start = performance.now();
    database.transaction((tx) => {
      tx.executeSql("SELECT * from search_test WHERE search_test MATCH ?
LIMIT 25", [query], (tx, r) => {
        results = r.rows;
      })
    }, null, () => {
      console.log('websql took ' + (performance.now() - start));
      console.table(results);
    });

To query IndexedDB:

    let start = performance.now();
    let tx = idb.transaction(["documents", "index"], "readonly");
    let bound = IDBKeyRange.bound(token + "\x00", token + "\x01", true, false);
    let seen = {};
    tx.objectStore("documents").index("search").openCursor(bound).onsuccess = (e) => {
      let cursor = e.target.result;
      if (cursor && !seen[cursor.primaryKey]) {
        seen[cursor.primaryKey] = true;
        results.push(cursor.value.document);
      }
      if (cursor && results.length < 25) {
        cursor.continue();
      } else {
        console.log('indexeddb took', performance.now() - start);
        console.table(results);
      }
    };

Running simple searches for the first 25 results for my name, I get the following performance characteristics:

* WebSQL: median 3.8ms, p(99) 5.5ms [3]
* IndexedDB: median 23ms, p(99) 39ms [4]
* (The rejected two-store, no-index IndexedDB approach is remarkably similar: median 23ms, p(99) 35ms.)

I think that makes sense considering that's how indexes are implemented, but it hints that the fix for my problems isn't going to just be "push stuff into the C++ layer". This is consistent with our benchmarks from about a year ago, which pegged IndexedDB at about 5-10x slower than WebSQL for most simple queries.

Running queries for non-existent terms is a bit strange:

* WebSQL: median 1.9ms, p(99) 3.1ms [5]
* IndexedDB: weirdly modal (see [6]), with peaks at 0.5, 5 and 10ms [median 4.6ms, p(99) 9.0ms]

From this little playing it's hard to suggest any sensible things, but it would certainly help if:

* IndexedDB was consistently ~5-10x faster. Without being too facetious, these queries are the ones where the indexes used are in IndexedDB's wheelhouse, and it's an order of magnitude behind WebSQL both on indexing and searching.
Some of the indexing time (25-30%) is the Porter stemmer in JavaScript, so that's probably reducible outside the database; but I don't know how to make querying significantly faster.

* It'd also be awesome to know why the zero-results search has such clear modal peaks in its output; any ideas?

* It was possible to express operations on indexes more clearly. In this example, it'd be interesting to know how fast the cursor could be iterated over if pushed down to the leveldb layer of abstraction using some kind of API like this:

    tx.objectStore("documents").select(IDBQuery.new({
      index: "index",
      bound: IDBKeyRange.bound(token + "\x00", token + "\x01", true, false),
      limit: 25,
      unique: true
    }))

[Though obviously that API is not nearly powerful enough to represent most search queries using the current index layout, and it's ugly compared to the SQL equivalent.]

I'll continue playing and see if I can hook up some of the remaining core pieces: queries with more than one term, field queries and "adjacent word" queries.

I also want to play with maintaining my own b-tree in JavaScript and using IndexedDB as a page store (similar to how FTS on WebSQL works). This should hopefully reduce dependence on IndexedDB's performance, but significantly increase the JavaScript overhead penalty.

Conrad
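As an aside, the median / p(99) figures quoted above can be reproduced from raw per-query timings with a nearest-rank percentile helper along these lines (a sketch; the helper is mine, not part of the benchmark code):

```javascript
// Nearest-rank percentile over an array of timing samples (ms).
// A sketch for reproducing "median" (p=50) and "p(99)" (p=99) figures;
// this is not the benchmark's actual reporting code.
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  // Nearest-rank: the smallest sample such that at least p% of samples
  // are <= it; clamp the rank into [1, n].
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.min(sorted.length, Math.max(1, rank)) - 1];
}

const timings = [3.5, 3.8, 3.9, 4.1, 5.5]; // example per-query times in ms
console.log(percentile(timings, 50), percentile(timings, 99)); // 3.9 5.5
```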
Is there any chance you could put your benchmarks in a GitHub repository, in a form we could run? We need to be able to measure these metrics ourselves in order to improve on them.
We don't want to depend on your inbox's content :), so we can replace that with our own public data, like blink-dev + chromium-dev traffic. As long as there's a clear format for the data, we can take care of that.
(trying to keep replies on list)
On Wed, Jul 19, 2017 at 2:44 PM, Victor Costan <pwn...@chromium.org> wrote:

> Is there any chance you could place your benchmarks in a GitHub repository, in a form in which we could run them? We have to be able to measure these metrics ourselves, in order to be able to improve on them.

I've uploaded the code here: https://github.com/superhuman/fts (I hoped I'd find time to tidy up the code, so sorry in advance!). It's missing most of what is needed for a full-text search engine; I'll try to find time to push on it a little soon. If you want to add any of the missing features, please go ahead and I can add you as a contributor to the repository.
> We don't want to depend on your inbox's content :) so we can replace that with our own public data, like blink-dev + chromium-dev traffic. As long as there's a clear format for the data, we can take care of that.

I've documented the input data format, and there are some tools in the tools/ directory that can convert from Gmail's JSON to the correct format (obviously, if you're going to use a public archive, you might need to write a tool to parse directly from MIME).

If you do end up getting it working, please send a pull request! I'd love to have a sharable data set we can collaborate on.
I was just about to e-mail you to assure you that quick and raw is better than late and polished. Also, for our benchmarking purposes, small code = good, so please don't worry about fancier queries.
On Mon, Jul 24, 2017 at 2:30 PM, Conrad Irwin <conrad@superhuman.com> wrote: