Purge and Weak References

J. Chris Anderson

Nov 21, 2013, 6:52:43 PM
to mobile-c...@googlegroups.com
We've been talking to users and one thing that comes up is "how do I delete data from the device that I don't need anymore?" The short answer is purge, but this is an area where we could use some more sugar. Here are my scattered thoughts on the general requirements around purge. I'm posting them here because I know there are some folks on this list who have more experience using purge in application code than me, so please do reply with anything you'd like to see enhanced.

There are a bunch of use cases where users could end up with more data on the device than they should. Think about a Yelp-style app that syncs documents for businesses near your current location, but you travel a lot. Without purge you'd still have documents in there from your last trip to London. Currently app developers can manually clean things up, but that's not fun. So there's a lot we can do to help the developer deal with device storage limits. Here are a few ideas, just to sketch a direction. There are open GitHub issues for some of these, others may be bad ideas... the goal here isn't to share implementation plans, it's to get feedback so we can see what would help you the most in real-world apps.

We could keep an LRU and occasionally sweep documents we haven't read in forever.
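A minimal sketch of what such an LRU sweep might look like. `LruPurgeTracker` and its methods are hypothetical names invented for illustration, not Couchbase Lite API, and the sketch assumes read timestamps are recorded in non-decreasing order.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Couchbase Lite: remembers when each
// document was last read so a periodic sweep can pick purge candidates.
class LruPurgeTracker {
    // accessOrder = true keeps entries ordered from least to most recently used.
    private final LinkedHashMap<String, Long> lastRead =
            new LinkedHashMap<>(16, 0.75f, true);

    // Record that the app read a document at the given time (milliseconds).
    // Assumes callers pass non-decreasing timestamps.
    void recordRead(String docId, long nowMillis) {
        lastRead.put(docId, nowMillis);
    }

    // Remove and return the IDs of documents not read since the cutoff.
    // The caller would then purge each returned ID from the local database.
    List<String> sweep(long cutoffMillis) {
        List<String> stale = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = lastRead.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (entry.getValue() >= cutoffMillis) {
                break; // every later entry was read more recently
            }
            stale.add(entry.getKey());
            it.remove();
        }
        return stale;
    }
}
```

The sweep stops at the first sufficiently recent entry, so its cost is proportional to the number of stale documents rather than the size of the database.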

We also need a mechanism for transparently loading documents (or attachments) from the cloud on request. For example, an app tries to load a doc from the local CBL, but it is not there. The developer has configured a cloud fallback option, so if we're online, CBL attempts to fetch the data from the cloud before returning anything to the client.

Once we have the ability to fall back to a Sync Gateway, we get more flexibility to purge data from the device. So we can have an autopurge feature that removes large attachments that haven't been viewed in a long time. And if the user requests one again, it will just reload from the cloud. This kind of reload is what I mean by a weak reference.
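To make the weak-reference idea concrete, here is a rough sketch under stated assumptions: `WeakReferenceStore` is a made-up class, the remote fetch is modeled as a plain function, and nothing here reflects an actual Couchbase Lite or Sync Gateway interface.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch: a local store that falls back to a remote fetcher.
class WeakReferenceStore {
    private final Map<String, String> local = new HashMap<>();
    // Stand-in for "ask the cloud for this document".
    private final Function<String, Optional<String>> remoteFetch;

    WeakReferenceStore(Function<String, Optional<String>> remoteFetch) {
        this.remoteFetch = remoteFetch;
    }

    void putLocal(String docId, String body) {
        local.put(docId, body);
    }

    // Purging drops only the local copy; the remote copy is untouched.
    void purge(String docId) {
        local.remove(docId);
    }

    boolean isLocal(String docId) {
        return local.containsKey(docId);
    }

    // Return the local copy if present. Otherwise, when online, try the
    // remote and keep whatever comes back so the next read is local.
    Optional<String> get(String docId, boolean online) {
        String body = local.get(docId);
        if (body != null) {
            return Optional.of(body);
        }
        if (!online) {
            return Optional.empty();
        }
        Optional<String> fetched = remoteFetch.apply(docId);
        fetched.ifPresent(b -> local.put(docId, b));
        return fetched;
    }
}
```

With this shape, autopurge is safe for any document the remote still holds: a purged document simply reloads on the next online `get`.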

We also may want some kind of per-document metadata to prioritize sync and storage. E.g., if you set doc._sync_priority = 0, it always gets synced before documents that have the default sync_priority, and if you want a document to sync only over Wi-Fi by default, you could tag it with a higher number; maybe sync_priority = 10 would be the cell-data cut-off, and sync_priority > 100 would mean don't sync, just lazy load / weak reference.

I use 0 as the sync level with the most priority (and 1 as the default priority) because otherwise you get an arms race with everyone trying to be more important. Sync Priority 0 would be useful for foundational elements that are required to draw the initial UI. It would frequently be coupled with Purge Priority 0 (never purge this document) especially in cases where parts of an application's UI chrome are distributed via Sync.
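The priority scheme above could be sketched roughly as follows. The class names are hypothetical, and the treatment of the boundary value (priority 10 itself waits for Wi-Fi) is an assumption, since the thresholds were only floated as a possibility.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical document descriptor carrying the proposed metadata.
class PrioritizedDoc {
    final String id;
    final int syncPriority;

    PrioritizedDoc(String id, int syncPriority) {
        this.id = id;
        this.syncPriority = syncPriority;
    }
}

// Hypothetical planner: decides which documents to sync now, and in what order.
class SyncPlanner {
    static final int CELL_CUTOFF = 10;  // assumed: priority >= 10 waits for Wi-Fi
    static final int LAZY_CUTOFF = 100; // priority > 100 is lazy-loaded only

    static List<String> planSync(List<PrioritizedDoc> docs, boolean onWifi) {
        List<PrioritizedDoc> eligible = new ArrayList<>();
        for (PrioritizedDoc d : docs) {
            if (d.syncPriority > LAZY_CUTOFF) {
                continue; // weak reference: fetched on demand, never synced eagerly
            }
            if (!onWifi && d.syncPriority >= CELL_CUTOFF) {
                continue; // too low a priority to spend cell data on
            }
            eligible.add(d);
        }
        // Lowest number first, so priority 0 documents lead the queue.
        eligible.sort(Comparator.comparingInt((PrioritizedDoc d) -> d.syncPriority));
        List<String> ids = new ArrayList<>();
        for (PrioritizedDoc d : eligible) {
            ids.add(d.id);
        }
        return ids;
    }
}
```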

Which brings me to Purge Priority, another kind of metadata we could track per document. A document with purge_priority = 0 will never be autopurged, so you'd tag foundational data with that. There could be ephemeral documents () where you can purge almost immediately, and a range of documents in the middle. For instance you'd rather purge comments on a business in the Yelp-style app, than purge the business address and phone number. You'd rather purge chat logs than contact lists, etc. 
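A small sketch of how an autopurge pass might use purge priority. It assumes that a larger number means "purge sooner" (only the meaning of 0 is stated above), and `PurgeCandidate` and `AutoPurger` are invented names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical descriptor for a locally stored document.
class PurgeCandidate {
    final String id;
    final int purgePriority;
    final long sizeBytes;

    PurgeCandidate(String id, int purgePriority, long sizeBytes) {
        this.id = id;
        this.purgePriority = purgePriority;
        this.sizeBytes = sizeBytes;
    }
}

class AutoPurger {
    // Pick documents to purge until at least bytesToFree would be reclaimed.
    // Priority 0 is never chosen; higher numbers go first (an assumption).
    static List<String> selectForPurge(List<PurgeCandidate> docs, long bytesToFree) {
        List<PurgeCandidate> purgeable = new ArrayList<>();
        for (PurgeCandidate c : docs) {
            if (c.purgePriority != 0) {
                purgeable.add(c);
            }
        }
        purgeable.sort(
                Comparator.comparingInt((PurgeCandidate c) -> c.purgePriority).reversed());

        List<String> chosen = new ArrayList<>();
        long freed = 0;
        for (PurgeCandidate c : purgeable) {
            if (freed >= bytesToFree) {
                break;
            }
            chosen.add(c.id);
            freed += c.sizeBytes;
        }
        return chosen;
    }
}
```

In the Yelp-style example, comments would carry a higher purge priority than the business address, so they are reclaimed first and the address survives unless it is tagged purgeable too.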

For a sensor app, you'd be able to purge documents as soon as you replicated them upstream. So we'd want a way to configure that.

So far I have been talking about special per-document fields that can configure sync behavior. But maybe it'd be better to do it per-channel... I won't sketch that here, but 

Once we get autopurge/lazy load nailed down maybe the next horizon is using queries to do targeted sync. So maybe you fire a query to Couchbase Server and get a set of documents that match, along with a summary. And you can use the summary to display UI while you wait for the documents to sync.

As we head down this path we end up emphasizing the programming model and the low-latency advantage, over the offline capability...

I'm also curious how this sort of lazy load / weak reference stuff can play in a p2p scenario. Even further out there is using query-based sync in a p2p app.

Andrew Reslan

Nov 22, 2013, 6:12:50 AM
to mobile-c...@googlegroups.com
This is seriously exciting. I'll share some of the things I have been working on in my app; hopefully it will be useful input.

Rather than automatically pull-syncing large attachments (images, videos, etc.), these are pulled lazily to reduce space requirements on the device and can be purged if not accessed (LRU).

My current solution when talking to sync_gateway is to write an interest document on the client. The interest contains a reference to the document that the user wishes to access. Once the client syncs with sync_gateway, if the interest is valid, the sync function gives the user access to the document via a channel.

On the client side, the client deletes the interest once it has received the document; that deletion reaches the server on the next sync, which removes the access. On sync_gateway I have a process that deletes interest documents after a period of time, for clients that have not synced for a while. A client will create a new interest on a subsequent run if the file is still not found locally.
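The interest lifecycle described above might be modeled like this. It is an in-memory sketch only: `InterestTracker` is a hypothetical name, real interests would be documents replicated through sync_gateway, and the expiry method stands in for the server-side cleanup process.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory model of the interest-document pattern.
class InterestTracker {
    private final Set<String> localDocs = new HashSet<>();
    // Target document ID -> creation time of its interest document.
    private final Map<String, Long> interests = new HashMap<>();

    // Create an interest only if the document is missing locally and no
    // interest is already outstanding. Returns true if one was created.
    boolean expressInterest(String docId, long nowMillis) {
        if (localDocs.contains(docId) || interests.containsKey(docId)) {
            return false;
        }
        interests.put(docId, nowMillis);
        return true;
    }

    // Called when sync delivers the document: keep it and delete the
    // interest, which would revoke access after the next sync.
    void onDocumentReceived(String docId) {
        localDocs.add(docId);
        interests.remove(docId);
    }

    // Stand-in for the server-side cleanup of interests left behind by
    // clients that have not synced for a while. Returns how many expired.
    int expireInterests(long nowMillis, long maxAgeMillis) {
        int before = interests.size();
        interests.values().removeIf(created -> nowMillis - created > maxAgeMillis);
        return before - interests.size();
    }

    boolean hasInterest(String docId) {
        return interests.containsKey(docId);
    }

    boolean hasDocument(String docId) {
        return localDocs.contains(docId);
    }
}
```

After an interest expires, a later call to `expressInterest` for the same missing document succeeds again, which matches the "create a new interest on a subsequent run" behavior.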

It would be great to see this type of functionality appear in the core; it would be a big win for the user experience.

Andy






J. Chris Anderson

Nov 22, 2013, 9:36:59 AM
to mobile-c...@googlegroups.com


On Friday, November 22, 2013 3:12:50 AM UTC-8, Andrew Reslan wrote:
This is seriously exciting. I'll share some of the things I have been working on in my app; hopefully it will be useful input.

Rather than automatically pull-syncing large attachments (images, videos, etc.), these are pulled lazily to reduce space requirements on the device and can be purged if not accessed (LRU).


This sounds like the sort of semantics we'd like to offer. Hopefully by having a built-in implementation we can beat the scalability of the channel-per-interest approach you describe. 

I'm glad you are having success cleaning up documents which are no longer interesting. I would like to have more advanced client-side control for this as well, so that you don't have to jump through any hoops on the server to clean up data on the client. 

But in the meantime your pattern can apply to a wide variety of apps. Imagine the geo-reviews app I described earlier, where each channel corresponds to a region. You register an interest in your nearby regions and the Sync Gateway sets up access. When you travel again (come home from London), all your device has to do is change which channels it has registered an interest in, and as long as Sync Gateway removes the old ones from your access list, they are cleaned from your device.

Implementing this in the core would allow us to better support the case where you have multiple devices with different interests, running as the same user. So you wouldn't have to revoke access to Mountain View data for the device that went to London, the device could just passively know that the Mountain View channels can be purged before the London ones. 

We could spend weeks just coming up with features in this vein. I think that's a great idea. Only after we've really soaked our minds in the full depth of what we could offer for weak references / data cleanup / compaction / autopurge / deferred attachments will we be ready to implement. I think a lot of these patterns might start out as application specific, and then be something we can boil down to library code, and then from the libraries take hints as to what features to implement in core. (As an unrelated example of this process, I'm building an app for iOS 7 and in the process hopefully will have an iOS library that handles iOS native Facebook login, which we can all extend to add other login mechanisms. From that library we may eventually take advice about core sync features.)

One way to kick start this process is with imaginary code. Here's some imaginary code that assumes if we close all the channels associated with a document, the document will be cleaned up on next compaction. That's a lot to assume. :)

// Stop syncing any channel we no longer care about.
// Imaginary API: allChannels, CBLChannel and -close are made up for this sketch.
- (void) closeOldChannels {
  // all the channels we are trying to sync right now
  NSArray *channels = [db allChannels];
  for (CBLChannel *ch in channels) {
    if (![self isInterestedIn:ch]) {
      [ch close];
    }
  }
}

Andrew Reslan

Nov 22, 2013, 10:03:15 AM
to mobile-c...@googlegroups.com
I meant to add that one side effect of this approach is that async becomes a first-class consideration (if it wasn't already).

I adopted the CCNX base API abstraction for a content manager class.

Public Member Functions

ContentObject put (ContentObject co) throws IOException
 Put a single content object into the network. 
ContentObject get (Interest interest, long timeout) throws IOException
 Get a single piece of content from CCN. 
void registerFilter (ContentName filter, CCNInterestHandler callbackHandler) throws IOException
 Register a standing interest filter with callback to receive any matching interests seen. 
void unregisterFilter (ContentName filter, CCNInterestHandler callbackHandler)
 Unregister a standing interest filter. 
void expressInterest (Interest interest, CCNContentHandler handler) throws IOException
 Query, or express an interest in particular content. 
void cancelInterest (Interest interest, CCNContentHandler handler)
 Cancel this interest. 


When an attachment arrives via sync, the manager checks for outstanding interests for that document; if any are found, they are deleted and a notification is fired. Any UIView that is waiting for the content registers for the notification, and if the UIView goes out of scope it deregisters itself from the Notification Center. This is critical where each cell in a table or collection view is reused as rows are scrolled through, and content may not be delivered for seconds or longer after the first interest.
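The register / deregister / notify flow can be sketched as below. This is a generic listener registry written for illustration; `ContentNotifier` is a hypothetical name and does not correspond to the CCNx API listed above or to NSNotificationCenter.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical registry of views waiting for content to arrive via sync.
class ContentNotifier {
    private final Map<String, List<Consumer<String>>> waiting = new HashMap<>();

    // A view registers when it starts waiting for a document's content.
    void register(String docId, Consumer<String> listener) {
        waiting.computeIfAbsent(docId, k -> new ArrayList<>()).add(listener);
    }

    // A view deregisters when it is reused or goes out of scope, so a late
    // delivery cannot update a cell that now shows something else.
    void unregister(String docId, Consumer<String> listener) {
        List<Consumer<String>> listeners = waiting.get(docId);
        if (listeners == null) {
            return;
        }
        listeners.remove(listener);
        if (listeners.isEmpty()) {
            waiting.remove(docId);
        }
    }

    // Called when sync delivers the content: clear the outstanding entry and
    // notify whoever is still waiting. Returns the number of listeners notified.
    int contentArrived(String docId) {
        List<Consumer<String>> listeners = waiting.remove(docId);
        if (listeners == null) {
            return 0;
        }
        for (Consumer<String> listener : listeners) {
            listener.accept(docId);
        }
        return listeners.size();
    }
}
```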

Jens Alfke

Nov 22, 2013, 11:40:09 AM
to mobile-c...@googlegroups.com

On Nov 21, 2013, at 3:52 PM, J. Chris Anderson <jch...@couchbase.com> wrote:

We also need a mechanism for transparently loading documents (or attachments) from the cloud on request. So eg an app tries to load a doc from the local CBL, but it is not there. The developer has configured a cloud fallback option, so if we're online, CBL attempts to fetch the data from the cloud before returning anything to the client.

This will be fairly easy to do for attachments, and we have a few issues filed on that already. The replication protocol is already flexible enough to let us pull docs without attachments and to fetch individual attachments on demand. The question-marks seem to be
  • How does the API reflect that getting the contents of an attachment can now be asynchronous, and can potentially fail?
  • How does the app tell Couchbase Lite whether to download attachments?
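One possible answer to the first question, sketched with a future-based accessor. This is an assumption about API shape, not a description of Couchbase Lite: `LazyAttachmentStore` and its downloader function are hypothetical, and the second question (how the app opts in or out of downloading) is not addressed here.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical attachment accessor whose result is asynchronous and may fail.
class LazyAttachmentStore {
    private final Map<String, byte[]> cache = new ConcurrentHashMap<>();
    // Stand-in for "fetch this attachment from the server on demand".
    private final Function<String, CompletableFuture<byte[]>> downloader;

    LazyAttachmentStore(Function<String, CompletableFuture<byte[]>> downloader) {
        this.downloader = downloader;
    }

    boolean isCached(String name) {
        return cache.containsKey(name);
    }

    // Completes immediately if the attachment is already on the device;
    // otherwise completes (or fails) when the download finishes.
    CompletableFuture<byte[]> content(String name) {
        byte[] cached = cache.get(name);
        if (cached != null) {
            return CompletableFuture.completedFuture(cached);
        }
        return downloader.apply(name).thenApply(bytes -> {
            cache.put(name, bytes); // keep it so the next read is local
            return bytes;
        });
    }
}
```

A caller would attach success and failure handlers to the returned future, which makes both the latency and the possibility of failure explicit in the type rather than hidden behind a synchronous getter.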

It gets trickier to support docs/revisions that are known (to the client) but not downloaded. A revision that’s not downloaded can’t be indexed, so it won’t show up in view queries. Also, for smallish documents the relative savings of not having the JSON (but having the metadata) probably aren’t that big. So it may be that if your docs are big enough that you want to keep them only partially downloaded, then it may make more sense to put the big part of the doc in a JSON-formatted attachment that’s downloaded on demand.

Going further, I can imagine being able to send view queries to the server. So you could run a query against a huge data set, then examine the results and pull down the docs you need. I think this mostly involves implementing parts of the CouchDB view API in Sync Gateway, and bringing in part of the old CouchQuery class as a subclass(?) of CBLQuery.

—Jens

J. Chris Anderson

Nov 22, 2013, 12:16:34 PM
to mobile-c...@googlegroups.com
I like chasing these feature ideas down because they can uncover even more interesting questions. For instance, for something like this I'd be more interested in integrating with Couchbase Server's query capability, than adding any new big Sync Gateway features. That way we can use existing Couchbase Server connectors like full text, graph query, and the upcoming N1QL ad-hoc query language. The problem becomes tighter Couchbase Server integration, rather than Sync Gateway features...

Thinking about it like: "I want to run a query to get this list of items to display for the user, and I also want the documents responsible for drawing that list to start syncing down right away, so when the user drills into one of them the data is there." ... It may be a configuration option as to whether the documents matching the query are synced once vs. subscribed to...

I think a well-thought-out weak reference / autopurge / lazy load system will be a foundation for a lot of the more fun (especially ad-hoc) sync queries.

Chris
 

Jens Alfke

Nov 25, 2013, 2:59:08 PM
to mobile-c...@googlegroups.com

On Nov 22, 2013, at 9:16 AM, J. Chris Anderson <jch...@couchbase.com> wrote:

for something like this I'd be more interested in integrating with Couchbase Server's query capability, than adding any new big Sync Gateway features. That way we can use existing Couchbase Server connectors like full text, graph query, and the upcoming N1QL ad-hoc query language. The problem becomes tighter Couchbase Server integration, rather than Sync Gateway features...

I agree, actually. As I see it, the query the app sends the gateway would be passed on (with a bit of translation) to the Couchbase Server bucket, and the results similarly sent back. This would be pretty easy to do for regular Couchbase views.

—Jens

Anton Anisimov

Nov 26, 2013, 5:57:22 AM
to mobile-c...@googlegroups.com
I think it would be great to be able to set a priority for each document. For example, priority 1 means the document is loaded from the server only on demand and then removed; priority 2 means sync only the document, without attachments; priority 3 means auto sync (or maybe no priority means auto sync). As one more thing, priority 4 could set a manual retention time in the local database, where you can set 5 days, 30 days, etc.

Such an approach would automate the process a little bit.

Tony Luong

May 31, 2014, 3:08:58 PM
to mobile-c...@googlegroups.com
Hi Chris,

I've just recently started experimenting with Couchbase Lite and think that this feature is going to make CBL very useful for many apps. Has this been implemented as of version 1.0?

J. Chris Anderson

Jun 1, 2014, 11:35:19 AM
to mobile-c...@googlegroups.com


On Saturday, May 31, 2014 12:08:58 PM UTC-7, Tony Luong wrote:
Hi Chris,

I've just recently started experimenting with Couchbase Lite and think that this feature is going to make CBL very useful for many apps. Has this been implemented as of version 1.0?


No, it's not implemented yet, so the more we know about how you'd want to use it, the better.

Chris 