Speeding up secondary indexes in a security-freak environment


Ricardo Carraretto

Nov 12, 2015, 7:45:49 PM
to PouchDB
Hello,

I've been doing some research on the group to understand the best approach to speed up my app while keeping its security aspects, but couldn't really find a final solution. These are some of the implementation requirements/characteristics:

- Server side is hosted on Cloudant;
- Security is a must-have, so after the doc is created, the final _id becomes a hash of the actual _id and the data is stored encrypted (the encrypted data includes the _id from before hashing; see the sketch after this list). The app handles sensitive data, so I can't trust the hosting side;
- The remote DB is replicated as-is to a local PouchDB;
- The local DB is kept encrypted in case of device loss;
- A view is built over all docs of the local DB, decrypting them and emitting the actual _id as the index key;
- All queries are made against the index (I had to give up on allDocs for the sake of security), and the docs are decrypted when retrieved and encrypted when saved. If the device is stolen, the only "exposed" data are the index keys, but one can't make much use of those since they are not highly sensitive info;
- Live bi-directional replication keeps data consistent across devices when the device is online;
- It is supposed to run offline too (the main reason I chose Pouch/Couch as the persistence mechanism).
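
Roughly, the write path and the decrypting view look like this (a minimal sketch; `sha256`, `encrypt` and `decrypt` stand in for whatever crypto primitives are actually used):

```js
// Write path: hash the real _id, encrypt the body (real _id kept inside).
// sha256/encrypt/decrypt are placeholders for the actual crypto primitives.
function storeDoc(db, realId, payload, key) {
  var plaintext = JSON.stringify(Object.assign({ _id: realId }, payload));
  return db.put({
    _id: sha256(realId),          // only the hash is visible at rest
    data: encrypt(plaintext, key)
  });
}

// The view decrypts each doc and emits the real _id as the index key.
// Caveat: a stringified map function can't close over `decrypt` or `key`,
// so they'd have to be reachable as globals where the view gets built.
var ddoc = {
  _id: '_design/by_real_id',
  views: {
    by_real_id: {
      map: function (doc) {
        var plain = JSON.parse(decrypt(doc.data, key));
        emit(plain._id);
      }.toString()
    }
  }
};
```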

This is working pretty well on Chrome, with a few bumps on Safari (both on iOS), but that's a topic for another thread. Right now I'm concerned about performance as the stored data starts to grow. I expect this to land in the range of 15-30K small documents. Before coding any crazy speed-up strategy, I'd like to ask the experts here what would work best:

1) Have one single DB and one index per document type? Maybe querying smaller indexes will be faster, but building/maintaining all these indexes can be cumbersome;
2) Have one DB per document type and one index per DB? More DBs to handle, each with its own index; not sure if this helps or not...
3) Have a second local DB, running on top of the in-memory adapter, that would be a replica of the index from option 1 (see the sketch after this list). I could query this DB with allDocs, get the hashed ID and then query the actual encrypted local DB;
4) Apply the same strategy as in 3 to option 2?
5) Any simpler idea I just haven't thought of yet?
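
For option 3, something along these lines is what I have in mind (illustrative names; assumes a build of PouchDB with the memory adapter is loaded):

```js
// Option 3 sketch: mirror (real _id -> hashed _id) pairs into an in-memory
// DB and prefix-scan it with allDocs().
var mirror = new PouchDB('id-mirror', { adapter: 'memory' });

// Record the mapping whenever a doc is written or first decrypted.
function indexDoc(realId, hashedId) {
  return mirror.put({ _id: realId, hashedId: hashedId })
    .catch(function (err) {
      if (err.status !== 409) { throw err; } // 409 = already indexed
    });
}

// Prefix lookup on real ids, then bulk-fetch the encrypted docs by hash.
function findByPrefix(encryptedDb, prefix) {
  return mirror.allDocs({
    startkey: prefix,
    endkey: prefix + '\ufff0',
    include_docs: true
  }).then(function (res) {
    var hashes = res.rows.map(function (r) { return r.doc.hashedId; });
    return encryptedDb.allDocs({ keys: hashes, include_docs: true });
  });
}
```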

I'm pretty excited about all the possibilities that exist around Pouch/Couch. This app was initially designed with a MySQL backend and Apigility as the RESTful API, and I had coded my own local caching mechanism. At some point I realized that I was only coding the supporting infrastructure and not actually moving my app forward. PouchDB really did save my life here :-)

Kind regards
Ricardo Carraretto

Nolan Lawson

Nov 21, 2015, 1:10:08 PM
to PouchDB
Hi Ricardo,

Thanks for the thoughtful post! Please see my comments below.


On Thursday, November 12, 2015 at 7:45:49 PM UTC-5, Ricardo Carraretto wrote:
Hello,

I've been doing some research on the group to understand the best approach to speed up my app while keeping its security aspects, but couldn't really find a final solution. These are some of the implementation requirements/characteristics:

- Server side is hosted on Cloudant;
- Security is a must-have, so after the doc is created, the final _id becomes a hash of the actual _id and the data is stored encrypted (the encrypted data includes the _id from before hashing). The app handles sensitive data, so I can't trust the hosting side;
- The remote DB is replicated as-is to a local PouchDB;
- The local DB is kept encrypted in case of device loss;


Encryption (crypto-pouch I assume?) does have a performance overhead. Watch out for that!
 

- A view is built over all docs of the local DB, decrypting them and emitting the actual _id as the index key;
- All queries are made against the index (I had to give up on allDocs for the sake of security), and the docs are decrypted when retrieved and encrypted when saved. If the device is stolen, the only "exposed" data are the index keys, but one can't make much use of those since they are not highly sensitive info;
- Live bi-directional replication keeps data consistent across devices when the device is online;
- It is supposed to run offline too (the main reason I chose Pouch/Couch as the persistence mechanism).

This is working pretty well on Chrome, with a few bumps on Safari (both on iOS), but that's a topic for another thread. Right now I'm concerned about performance as the stored data starts to grow. I expect this to land in the range of 15-30K small documents. Before coding any crazy speed-up strategy, I'd like to ask the experts here what would work best:

1) Have one single DB and one index per document type? Maybe querying smaller indexes will be faster, but building/maintaining all these indexes can be cumbersome;


That works. You may also get better performance from 1 DB per type and avoiding map/reduce entirely, but it would need to be benchmarked; I'm not sure. I certainly tend to do that in my own projects, merely because it's simpler. (Then I handle the joins myself, using `allDocs()` or `get()` on the individual DBs.)
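
For example, the general shape of a hand-rolled join (names and ids made up):

```js
// One DB per type; the "join" is just two queries stitched together.
var projects = new PouchDB('projects');
var milestones = new PouchDB('milestones');

function getProjectWithMilestones(projectId) {
  return projects.get(projectId).then(function (project) {
    // assumes milestone ids are prefixed with the project id
    return milestones.allDocs({
      startkey: projectId + ':',
      endkey: projectId + ':\ufff0',
      include_docs: true
    }).then(function (res) {
      project.milestones = res.rows.map(function (r) { return r.doc; });
      return project;
    });
  });
}
```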
 

2) Have one DB per document type and one index per DB? More DBs to handle, each with its own index; not sure if this helps or not...

3) Have a second local DB, running on top of the in-memory adapter, that would be a replica of the index from option 1. I could query this DB with allDocs, get the hashed ID and then query the actual encrypted local DB;
4) Apply the same strategy as in 3 to option 2?
5) Any simpler idea I just haven't thought of yet?

I'm pretty excited about all the possibilities that exist around Pouch/Couch. This app was initially designed with a MySQL backend and Apigility as the RESTful API, and I had coded my own local caching mechanism. At some point I realized that I was only coding the supporting infrastructure and not actually moving my app forward. PouchDB really did save my life here :-)

Kind regards
Ricardo Carraretto



It sounds like a lot of your questions revolve around the best way to structure your database for maximum... performance? ergonomics? Not sure what you're aiming for here. In terms of performance, we definitely don't have enough benchmarks to really give one strong recommendation one way or another, but for ergonomics, you'd probably be best off with the pouchdb-find or relational-pouch plugins. Hope that helps!
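
In case you haven't tried it, the pouchdb-find basics look like this (field names are just examples, and of course any field you query on has to be visible to the index, i.e. unencrypted):

```js
PouchDB.plugin(require('pouchdb-find'));
var db = new PouchDB('mydb');

// Build a Mango-style index, then query it with a selector.
db.createIndex({ index: { fields: ['type', 'name'] } }).then(function () {
  return db.find({
    selector: { type: 'project', name: { $gte: 'A' } }
  });
}).then(function (res) {
  console.log(res.docs);
});
```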

Cheers,
Nolan

Ricardo Carraretto

Nov 22, 2015, 7:42:10 PM
to pou...@googlegroups.com
Hi Nolan,

Thanks for the reply. Let me comment inline:


On Saturday, November 21, 2015, Nolan Lawson <no...@nolanlawson.com> wrote:
Hi Ricardo,

Thanks for the thoughtful post! Please see my comments below.

On Thursday, November 12, 2015 at 7:45:49 PM UTC-5, Ricardo Carraretto wrote:
Hello,

I've been doing some research on the group to understand the best approach to speed up my app while keeping its security aspects, but couldn't really find a final solution. These are some of the implementation requirements/characteristics:

- Server side is hosted on Cloudant;
- Security is a must-have, so after the doc is created, the final _id becomes a hash of the actual _id and the data is stored encrypted (the encrypted data includes the _id from before hashing). The app handles sensitive data, so I can't trust the hosting side;
- The remote DB is replicated as-is to a local PouchDB;
- The local DB is kept encrypted in case of device loss;


Encryption (crypto-pouch I assume?) does have a performance overhead. Watch out for that! 

I did look into crypto-pouch, but then decided to go with a custom model (hashed _ids as keys and encrypted JSON as data). The idea behind hashed _ids is that, assuming the plaintext _id is unique, the hashed one will be too. I also store the same id within the encrypted data, to be able to rebuild it; that wouldn't be possible if I kept only the hash, since hashes are one-way functions. (The read path is sketched below.)
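
The read path is just the reverse of the write path (a sketch, with `decrypt` again standing in for the actual primitive):

```js
// Fetch by hashed _id, decrypt, and rebuild the original doc.
function loadDoc(db, hashedId, key) {
  return db.get(hashedId).then(function (doc) {
    var plain = JSON.parse(decrypt(doc.data, key));
    // plain._id carries the original, pre-hash _id
    return plain;
  });
}
```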

- A view is built over all docs of the local DB, decrypting them and emitting the actual _id as the index key;
- All queries are made against the index (I had to give up on allDocs for the sake of security), and the docs are decrypted when retrieved and encrypted when saved. If the device is stolen, the only "exposed" data are the index keys, but one can't make much use of those since they are not highly sensitive info;
- Live bi-directional replication keeps data consistent across devices when the device is online;
- It is supposed to run offline too (the main reason I chose Pouch/Couch as the persistence mechanism).

This is working pretty well on Chrome, with a few bumps on Safari (both on iOS), but that's a topic for another thread. Right now I'm concerned about performance as the stored data starts to grow. I expect this to land in the range of 15-30K small documents. Before coding any crazy speed-up strategy, I'd like to ask the experts here what would work best:

1) Have one single DB and one index per document type? Maybe querying smaller indexes will be faster, but building/maintaining all these indexes can be cumbersome;


That works. You may also get better performance from 1 DB per type and avoiding map/reduce entirely, but it would need to be benchmarked; I'm not sure. I certainly tend to do that in my own projects, merely because it's simpler. (Then I handle the joins myself, using `allDocs()` or `get()` on the individual DBs.)
 

I can't get rid of a secondary index because I need to rebuild the _ids in order to be able to search on them. They are all meaningful; for example, for projects, all _ids are "projects:XXXX9999" (4 letters + 4 digits). To avoid the relational aspect, the milestones (which are smaller pieces of a project) are named "milestone:XXXX9999:0001" through "milestone:XXXX9999:00nn".

I guess if I replicate from a single Cloudant DB into several PouchDBs (one per type), and then create one secondary index (to decrypt the actual _id) per DB, I'll have smaller indexes that are faster to query (rough sketch below).
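
Something like this is what I picture (a sketch; it assumes each doc exposes an unencrypted `type` field, otherwise the filter has nothing to go on, and note that an ad-hoc filter function runs on the PouchDB side, so every change still travels over the wire):

```js
// Fan a single remote DB out into one local DB per type.
// The Cloudant URL is illustrative.
var remote = new PouchDB('https://myaccount.cloudant.com/mydb');
var locals = {
  project: new PouchDB('projects'),
  milestone: new PouchDB('milestones')
};

Object.keys(locals).forEach(function (type) {
  PouchDB.replicate(remote, locals[type], {
    live: true,
    retry: true,
    // assumes an unencrypted doc.type; a design-doc filter on the
    // Cloudant side would avoid shipping every change to the client
    filter: function (doc) { return doc.type === type; }
  });
});
```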

2) Have one DB per document type and one index per DB? More DBs to handle, each with its own index; not sure if this helps or not...

3) Have a second local DB, running on top of the in-memory adapter, that would be a replica of the index from option 1. I could query this DB with allDocs, get the hashed ID and then query the actual encrypted local DB;
4) Apply the same strategy as in 3 to option 2?
5) Any simpler idea I just haven't thought of yet?

I'm pretty excited about all the possibilities that exist around Pouch/Couch. This app was initially designed with a MySQL backend and Apigility as the RESTful API, and I had coded my own local caching mechanism. At some point I realized that I was only coding the supporting infrastructure and not actually moving my app forward. PouchDB really did save my life here :-)

Kind regards
Ricardo Carraretto



It sounds like a lot of your questions revolve around the best way to structure your database for maximum... performance? ergonomics? Not sure what you're aiming for here. In terms of performance, we definitely don't have enough benchmarks to really give one strong recommendation one way or another, but for ergonomics, you'd probably be best off with the pouchdb-find or relational-pouch plugins. Hope that helps!

Performance will be the challenging piece, I think. The way the data is modeled, there won't be complex relationships that could benefit from relational-pouch. By using meaningful _ids I could get by fairly well without secondary indexes, just leveraging allDocs (before adding the cryptography). Since a view is actually another PouchDB, does the query leverage allDocs on the automatically generated PouchDB view? (I think I read that somewhere, but can't recall where now.)

Kind regards
Ricardo Carraretto 


Cheers,
Nolan 


Nolan Lawson

Dec 6, 2015, 4:06:08 PM
to PouchDB
Hi Ricardo,


> Since a view is actually another PouchDB, does the query leverage allDocs on the automatically generated PouchDB view?

Yes. keys/key/startkey/endkey in query() work more-or-less as you'd imagine, using allDocs() on the secondary database.
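
So prefix ranges like the ones you described map straight onto the view, e.g. (view name illustrative):

```js
// Range query on the secondary index; under the hood this becomes an
// allDocs() call on the view's internal database.
db.query('by_real_id', {
  startkey: 'milestone:ABCD1234:',
  endkey: 'milestone:ABCD1234:\ufff0',
  include_docs: true
});
```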