Storage for Web (not just SQL Storage)

Nikunj Mehta

Jun 24, 2009, 6:11:44 AM
to nikunj...@oracle.com
(please don't remove me from the cc list, since I am not subscribed to
this newsgroup.)

First of all, I am posting this as a new topic rather than replying
separately to each post in the previous thread [0]. Arun R. suggested
this wonderful and timely discussion and I am energized to see the
discussion move past SQL after such a long wait. Please pardon my
assertive tone in the arguments below; I have just been ignored for
too long by too many people on this topic. Apologies also for such a
long post, but I presume that it is better to put together a
comprehensive argument even at the risk of people skimming it. I hope
you don't skim it.

On Jun 22, 5:36 pm, Arun Ranganathan <a...@mozilla.com> wrote:
> The dust is just about settling on our 1.9.1 branch release. For future
> versions of Firefox following Fx 3.5, it's time to think about the SQL
> Storage feature exposed to web content.

Secondly, let me say this: Local persistence requirements != SQL
Storage.

Regrettably, the thinking reflected in the above text highlights just
how badly the current Web Storage spec [1] has mangled the WebApps
WG's interest in finding a solution for storage needs of Web clients.
I have attempted several times to point this out to the whole community
[2]. In attempts to draw the Mozilla participants into this
discussion I have even posted on mozilla.dev.platform [3], but to no
avail.

Anyway, I hope the discussants here acknowledge the misplaced hope
heretofore in the current SQL-dependent Web Storage spec.

> Prominent alternatives so far are CouchDB (or BrowserCouch[2], which is
> a JavaScript implementation of CouchDB). Arguments given in favor of
> this so far are that the actual query semantics are delegated out to
> JavaScript, making it easy to learn and standardize. Generally,
> arguments in favor of JavaScript APIs for storage (including using HTTP
> + JSON mechanisms) make the case that these ways are more "web-like"
> than raw SQL statements. Additionally, performance advantages using a
> MapReduce paradigm (such as in CouchDB) are desirable to take advantage
> of multiple processor cores. Arguments against suggest that yet another
> API is a reinvention of the wheel -- *why not standardize something many
> developers are already familiar with?*

(emphasis mine)

IMHO, while agility is good and fail-fast is a noble objective, the
urge to standardize that which is familiar is the principal hazard in
the debate on WebStorage so far. There have been no substantive
discussions on requirements, save for those which were again initiated
following my complaints [4] and no well-documented investigation of
the alternatives. Let us not rush in and tie ourselves in knots for
the next decade.

On Jun 22, 8:53 pm, Mark Finkle <mfin...@mozilla.com> wrote:
> SQL storage is about creating a different type of storage _backend_.

As Jonas already mentioned, in the context of Web Storage, SQL is part
of the front-end, not the back end. If it were just in the back-end,
no one would care since the browser would hide it from apps and
JavaScript. Since script (no matter in whose library) needs to talk
SQL inside browsers, interoperability is compromised.

On Jun 22, 9:41 pm, Mark Finkle <mfin...@mozilla.com> wrote:
> There is plenty of work ahead to create a "web profile" for SQL, no doubt. In any case, I firmly believe that JS wrappers will be used over any form of SQL backend. Beyond convenience, those wrappers will also be used to mitigate incompatibilities - which will surely exist, even in my rosy future.

I am loath to use the "experience" word, but do you have any idea what it
would take to create a useful "Web profile"? We would be sitting here
discussing it when my kindergartner enters high school.

On Jun 22, 10:29 pm, Dion Almaer <d...@mozilla.com> wrote:
> It personally feels too early to say either:
> - We shouldn't do SQL at all as it isn't "Webby" enough

While I haven't personally seen "Webby" defined, let me suggest what it
implies here. To the best of my knowledge, no one has offered a
rebuttal to these points:
1. SQL's transaction model doesn't match HTTP's idempotence and
failure tolerance.
2. The Web deals with resources that have uniform identifiers; SQL's
identifiers are parochial.
3. Typical hypertext design performs non-linear denormalization; SQL's
relational model favors normalization.
4. Relational stores have been replicated successfully, but never at
Web scale and not through "open standard" protocols.

The fact that one developer or even a few were able to work around
these limitations in a closed way shouldn't mean too much to a
standards body.

> Maybe we can charge forward in a couple of dimensions?

On Jun 23, 8:35 am, Jonas Sicking <jo...@sicking.cc> wrote:
> Is DB replication something that sites will want? One use case that
> keeps being brought up is offline gmail. I would not expect gmail to
> want to replicate the full mail database to the client simply because
> it's really large.
>
> Or is the idea to replicate only part of the database (sorry, i'm
> revealing my lack of DB knowledge here)
>
> However, if DB replication is something that we expect that people do
> want, then I agree it's great if synchronization is part of the
> specification and implementation.
>
> As others have said, we really need to collect use cases and requirements.

Jonas and I have had several conversations about off-line. At least he
confesses his lack of expertise about SQL. Most of the WebApps WG stays
mum on this topic. Anyway, I have written up these use cases and
requirements in response to your previous requests [3a].

Here's one of my suggested dimensions. Instead of thinking about local
persistence, how about we think about off-line access to network data?
From my blog [4],

[[
There are probably two camps of developers interested in Web Storage -
one that develops local applications and another that develops
offline-ready applications. The first camp doesn't really care for the data
that lives on the server; it is just a back up of the data that lives
locally. The second camp has tons of data that is shared among users
and was previously only available in a connected mode. The first camp
probably doesn't deal with HTTP or REST abstractions for their data
but the second one has to deal with the HTTP interface to data as well
as situations in which offline access to that data is required.

For the second category, IMHO, it may be better to focus on getting
the right HTTP abstractions emulated locally for offline purposes.
]]

On Jun 22, 11:42 pm, Jonas Sicking <jo...@sicking.cc> wrote:
> My second concern is that if we standardize a SQL dialect, how much
> work will we have to do in order to exactly conform to that dialect.
> The prospect of even writing an SQL parser is not that exciting to me,
> much less if we have to implement parts of that dialect.

Perhaps you haven't taken my previous answers to this question
seriously [5a, 5b]. Forget the second part (parser development), just
the first (standardize a SQL dialect) is enough to throw me into a
tizzy. (I have been waiting for you to get back with comments on the
BITSY draft [6]).

On Jun 22, 11:48 pm, Chris Anderson <jch...@apache.org> wrote:
> Vladimir already mentioned that BrowserCouch isn't capable of working
> on larger data sets than local storage can manage. I'd propose that we
> consider a more complete strawman, actual CouchDB in the browser. For
> the purposes of this discussion, this would mean a BrowserCouch
> capable of working with larger data sets. But it would mean something
> more important as well.

Not a bad idea to start with CouchDB as a strawman. AtomDB could be
another strawman [7].

> Against this background, CouchDB's other features, like the REST API
> and JavaScript Map Reduce, are merely nice to have. I think when we
> talk about next-generation web storage, we should be thinking about
> offline replication above all else. This is the feature that allows
> users to treat cloud hosts as commodities, paving the way for the p2p
> web.

On Jun 23, 6:27 am, Chris Anderson <jch...@apache.org> wrote:
> I'll reiterate the importance of offline replication. Most existing
> offline stores (I'm looking at you Gears) require application
> developers to implement code to pull changes in from the server and
> store them locally. This is a waste of developer time and results in
> half-baked replication schemes.

Chris is right on the money with his suggestion about built-in
off-line replication and the fact that this has been missing from Web
Storage and its predecessor, WHATWG's HTML5, a view I have fearlessly
espoused for long [8]. CouchDB's map-reduce and replication protocol
are incidental to this discussion, and their inclusion in Web Storage
would preclude many fine alternatives. I implore this group to
consider the primitives that are needed in a standard (and, by
implication, in the browser). I have enunciated two of these
primitives in an Oracle submission to WebApps WG called BITSY [9]. The
primitives are:

1. programmable HTTP cache
2. HTTP interception in JavaScript
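
A minimal sketch of how these two primitives might fit together, in
JavaScript. Every name here (ProgrammableCache, Interceptor, the online
flag, the request and response shapes) is hypothetical and invented for
illustration; this is not the BITSY API, and it runs purely in memory
with a stand-in function instead of real HTTP.

```javascript
// Hypothetical sketch: none of these names come from BITSY or a browser API.

// Primitive 1: a cache that script can populate and query explicitly.
class ProgrammableCache {
  constructor() {
    this.entries = new Map(); // URL -> { status, body }
  }
  put(url, response) {
    this.entries.set(url, response);
  }
  match(url) {
    return this.entries.get(url) || null;
  }
}

// Primitive 2: script-registered handlers that can answer a request locally.
class Interceptor {
  constructor(cache, network) {
    this.cache = cache;
    this.network = network; // function(request) -> response; stands in for HTTP
    this.handlers = [];     // [{ prefix, handler }]
    this.online = true;
  }
  register(prefix, handler) {
    this.handlers.push({ prefix, handler });
  }
  request(req) {
    if (this.online) {
      const res = this.network(req);
      // Remember successful GETs so they can be served later while offline.
      if (req.method === "GET" && res.status === 200) this.cache.put(req.url, res);
      return res;
    }
    // Offline: the first handler whose prefix matches answers the request.
    for (const { prefix, handler } of this.handlers) {
      if (req.url.startsWith(prefix)) return handler(req, this.cache);
    }
    return { status: 504, body: "offline and no handler registered" };
  }
}

// Usage: fetch once while online, then serve the same URL while offline.
const cache = new ProgrammableCache();
const fakeNetwork = (req) => ({ status: 200, body: "server copy of " + req.url });
const client = new Interceptor(cache, fakeNetwork);
client.register("/mail/", (req, c) =>
  c.match(req.url) || { status: 404, body: "not cached" });

client.request({ method: "GET", url: "/mail/inbox" });
client.online = false;
const offline = client.request({ method: "GET", url: "/mail/inbox" });
console.log(offline.status, offline.body);
```

The point of the split is that the cache knows nothing about formats or
protocols; any sync policy would live in the handlers a library
registers.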

On Jun 23, 12:01 pm, Mike Beltzner <beltz...@mozilla.com> wrote:
> Are we conflating issues here? As I recall HTML5 has a specification
> for offline application notification which already indicates that
> dealing with sync, etc, is the responsibility of the application author.

This is an oft-repeated, casual argument about author responsibility.
When you can't trust an author to get his <table> tags right, what
hope does he have of syncing data? Another conflation is the idea
that, just because the term sync is used, HTML5 offline applications
should be sufficient. Anyone who has seriously attempted to follow the
argument to its logical conclusion knows better.

On Jun 23, 5:09 am, Mark Finkle <mfin...@mozilla.com> wrote:
> ----- "Vladimir Vukicevic" <vladi...@mozilla.com> wrote:
> > The key difference here is that anything implemented on top of local storage can't really operate on large (out-of-core) data sets, nor is there any provision for efficient indexing/querying of that data. Those are the component pieces that we're trying to figure out how to expose, without just going down the SQL (whether SQLite or something else) path because that path is almost completely undefined and underspecified.
>
> Yes, I think this is key as well. It's the main reason to consider a storage option that is _not_ Local Storage.
>
>
>
> > A good exercise would be to consider what the minimum set of missing capabilities from the web are, perhaps something like:
>
> > 1) How do you provide access to out-of-core data in content javascript (that is, how do you expose the difference between "disk" and "memory");
>
> > 2) Is it possible to implement efficient query execution purely in javascript, interpreting either a SQL-like syntax or a JS-like syntax.
>
> Excellent points. If we could create a storage solution that meets these goals, I think we'd be on the right track. With such a solution, JS wrappers could manipulate the API in many different ways, but still be more performant than a Local Storage (single blob) solution.
>
> > Would that be enough to implement efficient queries over large data using whatever wrapper the user wishes? Deciding on some testcases/benchmarks here would be useful (e.g. 10,000 email messages, searching for those that contain some word)..
>
> Yeah, coming up with some testcases would be very helpful.

On Jun 23, 11:39 am, Vladimir Vukicevic <vladi...@mozilla.com> wrote:
> Yes, there are variations on this approach (jsLINQ as a proof of
> concept, some others). All of them need an underlying storage mechanism
> that they can be built on top of, though, and I think that's something
> that we haven't really examined (the low level solution).

A standard B-tree will give you amazing performance, and the good
thing is that there is not as much disagreement about what a B-tree
does, nor has this interface been hidden away.

If you combine the above two primitives with an API for standard
B-Tree operations, such as the Berkeley DB Java Edition's
com.sleepycat.je package API [10], then you have for yourself a
scalable, synchronization-ready, Web-friendly architectural solution
to local persistence*.

One possible advanced access mechanism, implemented in JavaScript,
might be XQuery, which would maintain its indices using the B-Tree
primitives and access data from the programmable HTTP cache. Another
access mechanism could be CouchDB, which would also store its views in
the B-Tree storage. Heck, you could even rewrite all (well, almost) of
the HTML5 application cache using these primitives.
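
As a rough illustration of how a higher-level access mechanism could
keep its indices in an ordered key/value primitive, here is a sketch in
JavaScript. OrderedStore is a hypothetical stand-in that keeps a sorted
array and uses binary search in place of a real B-tree; the
composite-key scheme (author, separator, document id) is likewise an
assumption made for the example, not something proposed in this thread.

```javascript
// Hypothetical stand-in for a B-tree: a sorted array of [key, value] pairs.
class OrderedStore {
  constructor() {
    this.items = [];
  }
  // Index of the first entry whose key is >= the given key (binary search).
  lowerBound(key) {
    let lo = 0, hi = this.items.length;
    while (lo < hi) {
      const mid = (lo + hi) >>> 1;
      if (this.items[mid][0] < key) lo = mid + 1;
      else hi = mid;
    }
    return lo;
  }
  put(key, value) {
    const i = this.lowerBound(key);
    if (i < this.items.length && this.items[i][0] === key) this.items[i][1] = value;
    else this.items.splice(i, 0, [key, value]);
  }
  get(key) {
    const i = this.lowerBound(key);
    return i < this.items.length && this.items[i][0] === key
      ? this.items[i][1]
      : undefined;
  }
  // Visit entries with start <= key < end, in key order.
  range(start, end, visit) {
    for (let i = this.lowerBound(start);
         i < this.items.length && this.items[i][0] < end; i++) {
      visit(this.items[i][0], this.items[i][1]);
    }
  }
}

const SEP = "\u0000"; // sorts below every other character

const docs = new OrderedStore();     // primary data: id -> document
const byAuthor = new OrderedStore(); // secondary index: author + SEP + id -> id

// Note: this sketch does not remove stale index entries when a document
// is saved again with a different author.
function saveDoc(id, doc) {
  docs.put(id, doc);
  byAuthor.put(doc.author + SEP + id, id);
}

function findByAuthor(author) {
  const ids = [];
  // Every key with the prefix author + SEP sorts below author + "\u0001".
  byAuthor.range(author + SEP, author + "\u0001", (key, id) => ids.push(id));
  return ids.map((id) => docs.get(id));
}

saveDoc("m1", { author: "ann", subject: "hello" });
saveDoc("m2", { author: "bob", subject: "re: hello" });
saveDoc("m3", { author: "ann", subject: "again" });
console.log(findByAuthor("ann").map((d) => d.subject));
```

The lookup touches only the matching slice of the index, which is the
property an ordered store gives you and a plain key/value store does not.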

On Jun 23, 12:12 am, Mike Shaver <sha...@mozilla.com> wrote:
> I like the Couch API better than SQL because it's easier to represent
> non-relational data with it (esp. including hierarchies, which are
> brutal to work with in SQL), and because it feels to me like its
> semantics can be more clearly extended to cover both local and remote
> data sources.

Aha, I have called this seamless on-line/off-line data access [11a,
11b]. I am glad to see more people latching on to that concept [11c].

On Jun 23, 8:41 am, Dion Almaer <d...@mozilla.com> wrote:
> The reason that Gears didn't implement syncing was because the team
> couldn't agree on a generic sync primitive that would work for the various
> Google products that were being worked on. Syncing is hard, so they punted.
> The team spent a LOT of time on this side of offline Gmail and I still find
> it to be fairly buggy. (I know Chris... maybe they should have used CouchDB
> ;).

Pardon me for injecting sarcasm into this, but if the GMail team
can't handle SQL database sync over HTTP, I doubt anyone is going to
be seriously interested in solving that problem.

On Jun 23, 10:23 am, Mark Finkle <mfin...@mozilla.com> wrote:
> ----- "Dion Almaer" <d...@mozilla.com> wrote:
> Having done this for a Windows app in a previous life, I know that it's hard. I also wonder whether a "one-size-fits-all" sync mechanism will work in general.
<snip>
> I don't see this as a shame. There are a lot of moving parts to a sync system and it can be highly data format specific. As long as the storage system has hooks for bolting on a sync system, I think that vacuum will be filled.

On Jun 23, 12:16 pm, Dion Almaer <d...@mozilla.com> wrote:
> Sorry, I am just riffing on the "sync" point that came up on the thread and
> warning that coming up with a generic sync solution should probably be a
> non-goal (Mark also talked about his experience with sync there
> too). Focusing on the low level storage solution makes sense as Vlad
> and others have said (with an eye on the overall developer
> experience).

This is why BITSY moved from a single format and single protocol
approach to a format/protocol-agnostic approach [12]. And it was easy
to do so since the premises were sound, unlike the SQL APIs where
synchronous vs. asynchronous consumed so much time that we didn't
actually think about other implications of using SQL.

On Jun 23, 12:09 pm, Mark Finkle <mfin...@mozilla.com> wrote:
> ----- "Mike Beltzner" <beltz...@mozilla.com> wrote:
> Are we conflating issues here? As I recall HTML5 has a specification for offline application notification which already indicates that dealing with sync, etc, is the responsibility of the application author.
>
> Exactly. I'm saying let's keep it that way. Conflating the issues would be to add a sync spec to the storage spec. The two do not need to be merged. Let's keep them apart. The storage spec should allow for the existence of a sync mechanism, without actually defining it. Events and callbacks have been enough in the past to allow for such things. Most web specs contain some form of events and callbacks. We should be fine.
>
> Perhaps what you're saying is that whatever storage mechanism we provide should have easy-to-call methods for saying "syncUp()"?
>
> No. That's too explicit. The storage mechanism doesn't need to be that aware of my data format and syncing requirements.

Correct, there is no need for a sync primitive in Web Storage - leave
it to a library to deal with parsing formats and applying protocols &
network communications. See BITSY 0.5 for more details [9].

On Jun 23, 12:18 pm, Shawn Wilsher <sdwi...@mozilla.com> wrote:
> On 6/23/09 12:16 PM, Dion Almaer wrote:
> > Sorry, I am just riffing on the "sync" point that came up on the thread
> > There are also practical
> > issues such as: would we be better off having SQL in most modern
> > browsers so developers have something they can work with now-ish, versus
> > pushing back and re-thinking what the Web should really do for
> > storage.... and doing these in parallel (and again, letting the market
> > decide who wins in the long run).
>
> The problem with just going with SQL for now is that we'll be stuck with it.

Again, my +1 for this. I argued earlier [13] that

[[
Oracle does not support the substance of the current Web Storage
draft [1][2][3]. This is a path-breaking change to the Web applications
platform and rushing such a major change without substantive
consideration of alternatives is not in its own best interest.
]]

On Jun 23, 2:51 pm, Arun Ranganathan <a...@mozilla.com> wrote:
> [ Attached Message ]
> From: Adrian Bateman <adria...@microsoft.com>
> To: Chris Wilson <Chris.Wil...@microsoft.com>, "a...@mozilla.com" <a...@mozilla.com>
> Cc: Rob Sayre <rsa...@mozilla.com>
> Date: Tue, 23 Jun 2009 14:30:46 -0700
> Local: Tues, Jun 23 2009 2:30 pm
> Subject: RE: SQL Storage | What Should Be Done?
>
> Hi Arun,
>
> I'm not convinced a relational store is necessarily the best answer. We all seem to spend a lot of time writing object/relational mappings and maybe it would be best to take a higher level approach for the web platform. I note also that there was a suggestion the other day on the WebApps mailing list for a more XML focused approach to structured storage using XPath or XQuery as the query language. That is one kind of alternative that I have been considering given that it may address many of the use cases for structured data and already has a bunch of ratified and implemented standards to support it.

I remember discussing this matter and my proposal with Adrian at TPAC
2008 and it is good to see Adrian's public opinion matches our
conversation.

> I'm keen for us to continue this conversation - I think it would be helpful and interesting to see what common ground we can agree on. If we could form a concrete counter-proposal that would surely benefit the wider discussion.

I hope the group seriously considers the proposal I have laid out and
allows an opportunity to explore this more formally.

Nikunj
http://o-micron.blogspot.com

[0] http://groups.google.com/group/mozilla.community.web-standards/msg/7835a1b9956d8ac1
[1] http://dev.w3.org/html5/webstorage/
[2] http://lists.w3.org/Archives/Public/public-webapps/2009AprJun/0142.html
[3] http://groups.google.com/group/mozilla.dev.platform/msg/b501b1602bf59c2c
[3a] http://lists.w3.org/Archives/Public/public-webapps/2008OctDec/0104.html
[3b] http://lists.w3.org/Archives/Public/public-webapps/2009AprJun/0153.html
[4] http://o-micron.blogspot.com/2009/04/getting-to-offline-web-data-via-bitsy.html
[5a] http://lists.w3.org/Archives/Public/public-webapps/2009AprJun/0137.html
[5b] http://lists.w3.org/Archives/Public/public-webapps/2009AprJun/0136.html
[6] http://lists.w3.org/Archives/Public/public-webapps/2009AprJun/0343.html
[7] http://o-micron.blogspot.com/2008/07/uniform-data-synchronization-for-mobile.html
[8] http://lists.w3.org/Archives/Public/public-webapps/2009AprJun/0131.html
[9] http://www.oracle.com/technology/tech/feeds/spec/bitsy.html
[10] http://www.oracle.com/technology/documentation/berkeley-db/je/java/index.html
[11a] http://www.aqualab.cs.northwestern.edu/HotWeb08/papers/Mehta-MAA.pdf
[11b] http://o-micron.blogspot.com/2008/06/mobile-databases-or-write-through-web.html
[11c] http://o-micron.blogspot.com/2009/05/write-through-web-caches-and-gmail.html
[12] http://o-micron.blogspot.com/2009/04/bitsy-050-develop-seamlessly-on-lineoff.html
[13] http://lists.w3.org/Archives/Public/public-webapps/2009AprJun/0142.html

* We may be able to avoid byte-level data manipulation if we limit to
a small set of data types.

Chris Anderson

Jun 24, 2009, 7:01:45 AM
to Nikunj Mehta, nikunj...@oracle.com, community-w...@lists.mozilla.org
On Wed, Jun 24, 2009 at 3:11 AM, Nikunj Mehta<nrme...@gmail.com> wrote:
> (please don't remove me from the cc list, since I am not subscribed to
> this newsgroup.)
>

[snip]

> A standard B-tree will give you amazing performance, and the good
> thing is that there is not as much disagreement about what a B-tree
> does, nor has this interface been hidden away.
>

I prefer not to make technical conjectures without code to back them
up, but I've been thinking about what the bare-minimum required to
back BrowserCouch in a serious way would be. A key/value JSON store
would almost work, but without efficient key-range queries we'd have a
hard time being performant.

Access to a large, persisted JavaScript object is almost enough to
implement a real-deal BrowserCouch. Adding a fast key-order (with both
forward and reverse traversal) iterator API to that JavaScript object
is all that's needed to make it a viable backend for a BrowserCouch
that can handle real world workloads. A B-Tree is a great way to do
this.

So my guesstimate as to the narrowest API that's needed to support a fast
BrowserCouch would be something like:

=== js code ===

var btree = LocalStore.open("dbname");

btree["mydocid"] = {"some":"json"};

btree.forEach(function(key, value) {
  // in order traversal
});

btree.forEach(function(key, value) {
  // reverse order traversal
}, false);

btree.forEach("startkey", function(key, value) {
  // in order traversal, starting from "startkey"
  // we could use throw() to stop traversal
});

btree.forEach("endkey", function(key, value) {
  // reverse order traversal, starting from "endkey"
  // use throw() to stop traversal
}, false);

// it's worth discussing whether or not JS devs would be in charge of
// triggering persistence. if so, add this call. otherwise, all changes
// are persisted.

btree.persist();

// delete a btree

LocalStore.drop("dbname")


===
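
To experiment with the traversal semantics of the API sketched above,
one could mock it in memory along the following lines. LocalStoreMock
and its STOP sentinel are hypothetical names invented here; nothing is
persisted, every traversal sorts all keys (a real B-tree would seek
instead), and stored keys named "forEach" or "persist" would collide
with the methods.

```javascript
// In-memory mock, for experimentation only; nothing is written to disk.
var LocalStoreMock = (function () {
  var stores = {};

  function makeStore() {
    var store = {};
    // Non-enumerable, so the method never shows up as a stored key.
    Object.defineProperty(store, "forEach", {
      value: function (startKey, callback, forward) {
        if (typeof startKey === "function") { // forEach(callback, forward)
          forward = callback;
          callback = startKey;
          startKey = undefined;
        }
        var keys = Object.keys(store).sort(); // ascending string order
        if (forward === false) keys.reverse();
        try {
          keys.forEach(function (key) {
            var inRange = startKey === undefined ||
              (forward === false ? key <= startKey : key >= startKey);
            if (inRange) callback(key, store[key]);
          });
        } catch (e) {
          if (e !== LocalStoreMock.STOP) throw e; // only swallow the stop signal
        }
      }
    });
    // No-op: this mock has no durable storage to flush to.
    Object.defineProperty(store, "persist", { value: function () {} });
    return store;
  }

  return {
    STOP: {}, // throw this from a callback to end traversal early
    open: function (name) {
      return stores[name] || (stores[name] = makeStore());
    },
    drop: function (name) {
      delete stores[name];
    }
  };
})();

var btree = LocalStoreMock.open("dbname");
btree["b"] = { n: 2 };
btree["a"] = { n: 1 };
btree["c"] = { n: 3 };

var seen = [];
btree.forEach("b", function (key) { seen.push(key); });        // forward from "b"
btree.forEach("b", function (key) { seen.push(key); }, false); // reverse from "b"
console.log(seen);
```

Using a dedicated sentinel for early exit, rather than swallowing any
thrown value, keeps genuine errors inside a callback from being hidden.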

This basic API doesn't implement sync. In this case I'd expect the
BrowserCouch.js code to handle HTTP replication to a remote CouchDB.
Thinking about it more I'm starting to think it would be too hard to
come to agreement about a sync() API, so it's better to shoot for the
simpler solution.

I like the B-Tree because you can build more than just CouchDB on it,
and even brand-new programmers will understand it.

Chris

--
Chris Anderson
http://jchrisa.net
http://couch.io

Nikunj Mehta

Jun 24, 2009, 6:56:22 PM
On Jun 24, 4:01 am, Chris Anderson <jch...@apache.org> wrote:

> On Wed, Jun 24, 2009 at 3:11 AM, Nikunj Mehta<nrmeht...@gmail.com> wrote:
> > (please don't remove me from the cc list, since I am not subscribed to
> > this newsgroup.)
>
> [snip]
>
> > A standard B-tree will give you amazing performance, and the good
> > thing is that there is not as much disagreement about what a B-tree
> > does, nor has this interface been hidden away.
>
> I prefer not to make technical conjectures without code to back them
> up, but I've been thinking about what the bare-minimum required to
> back BrowserCouch in a serious way would be. A key/value JSON store
> would almost work, but without efficient key-range queries we'd have a
> hard time being performant.

Unless you can come up with a performant JavaScript implementation of
a B-tree over key/value pairs.

>
> Access to a large, persisted JavaScript object is almost enough to
> implement a real-deal BrowserCouch. Adding a fast key-order (with both
> forward and reverse traversal) iterator API to that JavaScript object
> is all that's needed to make it a viable backend for a BrowserCouch
> that can handle real world workloads. A B-Tree is a great way to do
> this.

Great to see this.

>
> So my guesstimate as to narrowest API that's needed to support a fast
> BrowserCouch would be something like:
>
> === js code ===

<snip>
Good to have the example. I was also thinking about a small set of
requirements to enable basic indexing that you would require in
CouchDB. Here's my first cut:

1. Store, remove, and retrieve individual data items
   a. items identified by a string key
   b. item values can be either null, object, or string
2. Allow duplicate values for the same key
3. Allow organization of items into more than one database per
   application
4. Sequentially walk items in a cursor starting at a point identified
   by a key value
   a. items should be presented in increasing key order, as determined
      by JavaScript's string comparison operator '<'
   b. once the highest key value is reached, no more items are provided
   c. if the matching key value is empty, the first key in the database
      is the starting point
   d. can walk through items in reverse order
   e. items obtained in the cursor are deletable - IOW, the cursor is
      modifiable
5. Provide a sequence object in each database to generate a
   monotonically increasing sequence of numbers
6. A database belongs to a single storage unit.
7. Applications are able to open or create a storage unit.
8. Allow storage of multiple items to be committed as part of an
   atomic transaction
9. Offer READ_COMMITTED and READ_UNCOMMITTED cursors
10. Transaction scope is limited to operations on databases within the
    same storage unit.
11. All APIs are synchronous
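
To make requirements 1, 2, 4 and 5 concrete, here is a minimal
in-memory sketch in JavaScript. The names Database, openCursor, next,
remove and sequence are hypothetical, chosen only for this example.
Keys are ordered with JavaScript's "<" on strings, and treating an
empty start key on a reverse cursor as "start from the last key" is an
assumption that the list above does not settle.

```javascript
// Hypothetical names throughout; an in-memory sketch, not a proposed API.
class Database {
  constructor() {
    this.items = [];  // [key, value] pairs kept sorted by key; duplicates allowed
    this.nextSeq = 0;
  }
  // Requirement 5: monotonically increasing sequence of numbers.
  sequence() {
    return ++this.nextSeq;
  }
  // Index of the first item whose key is >= key (binary search).
  seek(key) {
    let lo = 0, hi = this.items.length;
    while (lo < hi) {
      const mid = (lo + hi) >>> 1;
      if (this.items[mid][0] < key) lo = mid + 1;
      else hi = mid;
    }
    return lo;
  }
  // Requirement 2: an item with an existing key goes after its duplicates.
  put(key, value) {
    let i = this.seek(key);
    while (i < this.items.length && this.items[i][0] === key) i++;
    this.items.splice(i, 0, [key, value]);
  }
  // Requirement 4: walk items from startKey; "" means the first key (4c).
  openCursor(startKey = "", reverse = false) {
    const items = this.items;
    let upcoming = this.seek(startKey); // forward: first item with key >= startKey
    if (reverse) {
      if (startKey === "") {
        upcoming = items.length - 1;    // assumption: reverse walk starts at the end
      } else {
        // last item with key <= startKey
        while (upcoming < items.length && items[upcoming][0] === startKey) upcoming++;
        upcoming--;
      }
    }
    let current = -1; // index of the item most recently returned, if any
    return {
      next() {
        if (upcoming < 0 || upcoming >= items.length) return null; // 4b
        current = upcoming;
        upcoming += reverse ? -1 : 1;
        return { key: items[current][0], value: items[current][1] };
      },
      // 4e: delete the item most recently returned by next().
      remove() {
        if (current < 0) return false;
        items.splice(current, 1);
        if (!reverse) upcoming--; // later items shifted down by one
        current = -1;
        return true;
      }
    };
  }
}

const db = new Database();
db.put("b", "first b");
db.put("a", null);
db.put("b", "second b");
db.put("c", { n: 3 });

const cursor = db.openCursor("b");
const walked = [];
const firstItem = cursor.next();
cursor.remove(); // drop "first b" while the cursor is positioned on it
for (let item = cursor.next(); item !== null; item = cursor.next()) {
  walked.push(item.value);
}
console.log(firstItem.value, walked);
```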

> // it's worth discussing whether or not JS devs would be in charge of
> triggering persistence. if so, add this call. otherwise, all changes
> are persisted.
>
> btree.persist();

This is assuming that btree is an in-memory object. I think this is
unproductive. Best to treat every key-value storage operation as
resulting in persistence taking place. Of course, if transactions are
used, then one can delay actually writing to durable storage until
commit time.
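
A small sketch of that behavior, in JavaScript: writes outside a
transaction reach "durable" storage immediately, while writes inside
one are buffered until commit. The names Store, begin, commit and abort
are hypothetical, and the durable map merely stands in for disk; the
sketch does not model crash safety or isolation levels.

```javascript
// Hypothetical sketch: `durable` stands in for on-disk storage.
class Store {
  constructor() {
    this.durable = new Map();
    this.flushes = 0; // how many times we "wrote to disk"
  }
  // Outside a transaction every operation persists immediately.
  put(key, value) {
    this.durable.set(key, value);
    this.flushes++;
  }
  get(key) {
    return this.durable.get(key);
  }
  begin() {
    const store = this;
    const pending = new Map(); // buffered writes, invisible outside until commit
    let done = false;
    return {
      put(key, value) {
        if (done) throw new Error("transaction already finished");
        pending.set(key, value);
      },
      // Reads inside the transaction see its own uncommitted writes.
      get(key) {
        return pending.has(key) ? pending.get(key) : store.durable.get(key);
      },
      commit() {
        if (done) throw new Error("transaction already finished");
        for (const [key, value] of pending) store.durable.set(key, value);
        store.flushes++; // one flush for the whole batch
        done = true;
      },
      abort() {
        pending.clear();
        done = true;
      }
    };
  }
}

const store = new Store();
const tx = store.begin();
tx.put("msg:1", "draft");
tx.put("msg:2", "draft");
const visibleBeforeCommit = store.get("msg:1"); // not yet durable
tx.commit();
console.log(visibleBeforeCommit, store.get("msg:1"), store.flushes);
```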


>
> // delete a btree
>
> LocalStore.drop("dbname")
>
> ===
>
> This basic API doesn't implement sync. In this case I'd expect the
> BrowserCouch.js code to handle HTTP replication to a remote CouchDB.
> Thinking about it more I'm starting to think it would be too hard to
> come to agreement about a sync() API, so it's better to shoot for the
> simpler solution.

Not clear what you mean here. I thought CouchDB's primary access model
is via HTTP. How can you replicate that behavior without intercepting
HTTP requests (using mechanisms such as those proposed in BITSY)?
Regardless, I concur that synchronization can be bolted on provided
there are interception and a programmable cache. Robust CouchDB-like
storage solutions are possible only provided all three primitives are
available:

1. interception
2. programmable cache
3. b-tree
