SQLite and how to ensure proper (and fast) synchronization with in-memory objects.

Vincent

Jan 23, 2011, 11:15:57 AM
to cocoa-unbound
Hi guys, hi Brent :)

Planning to start a project that kind of behaves like an RSS reader,
I've come to the point where one has to ask himself "SQLite or Core
Data?".

My app is going to be doing quite a lot of (complex) SQL-style
querying, which can be quite painful with Core Data and certainly not
fast, once you hit the higher n-thousands.
This and other reasons led me to the conclusion I should rather use
SQLite than Core Data.

And then there also was this famous article by you, Brent, on your
decisions to dump Core Data in favor of SQLite & FMDB:
http://inessential.com/2010/02/26/on_switching_away_from_core_data

All four listed arguments against Core Data would totally apply to my
own app. And I can totally see the penalties with Core Data in these
four scenarios.

There is one thing however that I couldn't make any sense of:
SQLite maintains an on-disk representation of your app's data, right?
Your app however deals with an in-memory representation of same data.
This requires syncing: "[Brent:] Making the switch did mean I had to
do some things manually that Core Data would have done for me: keeping
any in-memory items synced with the database storage, mostly."

If running "[feedItem markAsUnread];" on 10,000 feedItems in Core Data
was too slow for NNW, and SQLite on the other side super fast…
…then wouldn't you lose all the benefits of SQLite's performance at
the very point where you have to sync your in-memory objects (the
stuff that gets presented to the user) with their newly updated/
created/deleted/etc on-disk clones in SQLite?

"[Brent:] With one query the app can set the status for a whole bunch
of items at once, without having to instantiate them as objects:
update newsItems set read = 1 where..."
Right, but this doesn't update any of the already allocated in-memory
NSObjects in your NSArrayController, feeding your NSTableView, e.g.
How about those?

I searched and searched but couldn't find any insightful articles/
tutorials/documentations on how to deal with the syncing between
NSObjects & SQLite.

Thanks in advance,
Vincent

Gus Mueller

Jan 28, 2011, 12:45:36 PM
to cocoa-...@googlegroups.com
On Jan 23, 2011, at 8:15 AM, Vincent wrote:

> There is one thing however that I couldn't make any sense of:
> SQLite maintains an on-disk representation of your app's data, right?
> Your app however deals with an in-memory representation of same data.
> This requires syncing: "[Brent:] Making the switch did mean I had to
> do some things manually that Core Data would have done for me: keeping
> any in-memory items synced with the database storage, mostly."
>
> If running "[feedItem markAsUnread];" on 10,000 feedItems in Core Data
> was too slow for NNW, and SQLite on the other side super fast…
> …then wouldn't you lose all the benefits of SQLite's performance at
> the very point where you have to sync your in-memory objects (the
> stuff that gets presented to the user) with their newly updated/
> created/deleted/etc on-disk clones in SQLite?

One of my apps (VoodooPad) is in a bit of a similar situation (it can have thousands of pages, which can be in memory and, of course, in an SQLite file). My approach to synchronization is to do as little of it as possible. I try to keep as many of the page objects on disk as possible - and whenever I need to grab one, I pull a fresh one out of the database (so it's possible to have two or more objects in memory that represent the same page).

If there's a change made to that object that the other copies need to know about, a notification is sent out and the other page objects pick it up, or they just get discarded and a new copy is pulled out of the db. This is pretty rare though.

In general, instead of handing around objects all the time I'll hand around uuids which represent the page in the database. Then when I need to act on that data or pull it out, I'll just ask my database manager object to hand me back a new object with the uuid. And since I get a new one each time, there's no problem using those guys on multiple threads.
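This uuid-passing pattern is language-agnostic. Here's a minimal sketch in Python with sqlite3 (the Page/DatabaseManager names are illustrative, not VoodooPad's actual API), showing that two fetches for the same UUID hand back two independent objects:

```python
import sqlite3
import uuid

class Page:
    """Plain in-memory object; nothing ties it to the database."""
    def __init__(self, page_uuid, title):
        self.uuid = page_uuid
        self.title = title

class DatabaseManager:
    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS pages (uuid TEXT PRIMARY KEY, title TEXT)")

    def insert_page(self, title):
        page_uuid = str(uuid.uuid4())
        self.conn.execute("INSERT INTO pages VALUES (?, ?)", (page_uuid, title))
        return page_uuid

    def page_for_uuid(self, page_uuid):
        # Pull a *fresh* object out of the database on every call.
        row = self.conn.execute(
            "SELECT title FROM pages WHERE uuid = ?", (page_uuid,)).fetchone()
        return Page(page_uuid, row[0]) if row else None

db = DatabaseManager()
home_uuid = db.insert_page("Home")
first = db.page_for_uuid(home_uuid)
second = db.page_for_uuid(home_uuid)
# first and second are distinct instances representing the same page.
```

Because each caller gets its own instance, nothing is shared between threads; in a real app each thread would also want its own connection.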

Hopefully that made sense.

-gus

--

August 'Gus' Mueller
Flying Meat Inc.
http://flyingmeat.com/

Brent Simmons

Jan 30, 2011, 12:25:15 AM
to cocoa-...@googlegroups.com

> Hi guys, hi Brent :)
>
> Planning to start a project that kind of behaves like an RSS reader,
> I've come to the point where one has to ask himself "SQLite or Core
> Data?".
>
> My app is going to be doing quite a lot of (complex) SQL-style
> querying, which can be quite painful with Core Data and certainly not
> fast, once you hit the higher n-thousands.
> This and other reasons led me to the conclusion I should rather use
> SQLite than Core Data.
>
> And then there also was this famous article by you, Brent, on your
> decisions to dump Core Data in favor of SQLite & FMDB:
> http://inessential.com/2010/02/26/on_switching_away_from_core_data
>
> All four listed arguments against Core Data would totally apply to my
> own app. And I can totally see the penalties with Core Data in these
> four scenarios.


Nevertheless, I *strongly* recommend using Core Data. As I pointed out in my initial post on the subject, Core Data is awesome. There are a few things that Core Data does slowly -- but devices and Macs are getting faster, and it's a good bet Core Data will keep improving. (Remember the old hockey adage: skate to where the puck is going to be.)

I myself use FMDB when I'm working with non-object things. But for something like an article from a feed (or tweet from Twitter, or post from Facebook, or similar), I highly recommend using Core Data. Any time you're dealing with things that are more object-y than database-y, go with Core Data.

-Brent


Vincent

Jan 30, 2011, 12:29:51 PM
to cocoa-unbound
Thank you both for your replies, much appreciated,

having given your responses some thought, I've come to the
conclusion that probably the best way to handle my case (a feed-
reader-like app, just more feeds, less reading) would be (in steps):

1. Receive (potentially thousands of) items from feeds/streams/
whatever.
2. Add all received items to the SQLite db via INSERT OR REPLACE.
(It's probably faster to just replace existing rows than to check each
for existence; needs some testing.)
3. At this point I'd need to do some filtering and additional
processing on all new/changed items. I should be able to do most of it
in SQLite via SELECTs, though. In memory I'd thus just need to keep the
items' SQLite row ids. Just an NSSet of NSNumbers. Very
lightweight.
4. When the user selects a particular feed/stream, run an SQLite SELECT
for all items with row ids matching those from "3.", plus all other
items that remained "unread" from previous refreshes, and ignore all
others.
5. Allocate Core Data objects (~100) from the SQLite query in "4." and
display them (and get rid of them as soon as they're not needed anymore).
6. Run an SQLite UPDATE query on object property changes (use an
optimized compound query for updating multiple items).

In short: try to reduce the number of objects that have to be allocated
and touched by using smart SQLite queries up front to narrow
searching/filtering. And for those objects that eventually do get
allocated, use Core Data for the best possible memory use and all those
other conveniences.
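The plan above can be sketched end to end. Here's roughly what steps 2, 3, 4, and 6 might look like (an illustrative Python/sqlite3 sketch; the SQL statements are the same ones you'd issue from Objective-C via FMDB, and the table/column names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (guid TEXT PRIMARY KEY, title TEXT, read INTEGER DEFAULT 0)")

# Step 2: upsert everything that came in from the feeds.
incoming = [("guid-1", "First item"), ("guid-2", "Second item")]
new_rowids = set()
for guid, title in incoming:
    cur = conn.execute(
        "INSERT OR REPLACE INTO items (guid, title) VALUES (?, ?)", (guid, title))
    new_rowids.add(cur.lastrowid)  # Step 3: keep only lightweight row ids in memory.

# Step 4: materialize full rows only for the ids of interest.
placeholders = ",".join("?" * len(new_rowids))
ids = tuple(new_rowids)
selected = conn.execute(
    "SELECT rowid, title FROM items WHERE rowid IN (%s)" % placeholders, ids).fetchall()

# Step 6: one compound UPDATE marks the whole batch instead of N single-row updates.
conn.execute("UPDATE items SET read = 1 WHERE rowid IN (%s)" % placeholders, ids)
```

One caveat worth testing: when INSERT OR REPLACE actually replaces a row, SQLite deletes the old row and inserts a new one, so the replaced item gets a new rowid and any previously saved ids for it go stale.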

Does this make any sense SQLite-/CoreData-wise?

Landon Fuller

Jan 31, 2011, 7:28:42 PM
to cocoa-...@googlegroups.com


I actually argue the opposite view: I never use Core Data. Here's the rationale behind my position:

In any given application that serializes data to disk, there are really two different models:
- The in-memory application object model.
- The on-disk serialized data model.

The in-memory application object model serves the needs of the application; it is a literal representation of the state of your application, in memory. The API and data it provides need only be maintained in the context of that specific version of the application, during that single runtime. You're free to modify the application model during development without constraint, as the specific concepts it models do not need to be shared across releases of the application -- if you add a bit of data or API to your application model and remove it later, there's no concern about long-term maintenance of that data, migrating it across versions, etc.

In contrast to the application model, the on-disk serialized data model is a high-level, abstract representation of your application's data that must be maintained across iterations of your application and, when optimally expressed, will likely not even map 1:1 to the optimal application model. It requires unique consideration on a number of fronts:
- Data longevity: ensuring that the concepts as modeled are maintainable across the lifetime of your data.
- Data validity: care to avoid duplicate, corrupt, or invalid data. Whereas there's often little harm in maintaining multiple read-only in-memory data records, on-disk data should be updated carefully.
- Atomic or transactional updates: often data changes must be implemented as an all-or-nothing transactional/atomic update.

Core Data attempts to weld these two different abstract representations together, allowing developers to leverage a single application model as both an on-disk serialization and an in-memory representation. Unfortunately this abstraction is extremely leaky, and Core Data's requirements spread pervasively through the application's in-memory model:
- Every managed object must be a subclass of NSManagedObject, which results in objects inheriting a large number of non-overridable methods and strict requirements that dictate object behavior per Core Data's internal needs, preventing the application author from expressing an API more suited to their in-memory model (such as overriding -hash, -isEqual, etc.).
- Managed objects are not thread-safe -- special care must be taken when using GCD or threads directly:
  - Managed objects should not be shared across threads (as per Apple's recommendation).
  - If shared across threads, it is the application author's responsibility to implement extremely complex and difficult locking to allow sharing of instances.
  - Care must be taken to safely merge data changes made in multiple managed object contexts on multiple threads, which may occur asynchronously.
- The most lightweight transactional/atomic update mechanism is the NSManagedObjectContext; however, when using multiple managed object contexts, as noted above, care must be taken to merge data correctly between contexts.

The leakiness of this abstraction is extremely similar to that of distributed objects, where the idea was that complexity of network communications could be hidden behind a simple object model; your application's model could also be your network model. This led to the same problems that Core Data has today, but instead with the network protocol poorly welded into the application model.

These approaches towards unifying models fail (in my opinion) because they fail to take into account that the models are genuinely different, and actually express different data:
- The application model represents the application's in-memory state. It is transient and thus may be iterated on freely to suit the application.
- The on-disk serialization model is an abstract representation of the data the application operates on using the in-memory model. It is not the in-memory model, and must be maintained across versions of the application (or even potentially across different applications). The data must be updated according to the transactional requirements of the application, errors must be safely handled in disk serialization, etc.
- In the case of distributed objects, which I posit are extremely similar to Core Data, the network protocol defines not just a means of accessing the peer's state, but a dialog between independent applications with their own, potentially very different internal application model/state. The protocol must handle errors, retries, and other complexities that cannot be adequately represented through seemingly transparent method calls.

This may be an unpopular view -- I honestly don't know, since this is the first time I've taken any time to explain it -- but I think that ultimately, application implementations benefit from a separation of application and serialization models in terms of implementation time, complexity, cost, and stability. It may be possible to achieve this by maintaining distinct application and Core Data-specific object hierarchies, but such an approach would seem to discard most of the advertised value of using Core Data in the first place.
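To make that separation concrete, here is a minimal sketch (Python, with hypothetical Note/NoteRepository names): a plain application object with no base-class contract, plus a repository that owns the schema, the mapping, and the transactional boundary:

```python
import sqlite3

class Note:
    """Application model: a plain object, free to change between releases."""
    def __init__(self, title, body):
        self.title = title
        self.body = body

class NoteRepository:
    """Serialization model: owns the schema, validity rules, and transactions."""
    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS notes (title TEXT PRIMARY KEY, body TEXT)")

    def save(self, note):
        # The `with` block is an all-or-nothing transaction: commit on
        # success, rollback if anything raises.
        with self.conn:
            self.conn.execute(
                "INSERT OR REPLACE INTO notes VALUES (?, ?)", (note.title, note.body))

    def load(self, title):
        row = self.conn.execute(
            "SELECT title, body FROM notes WHERE title = ?", (title,)).fetchone()
        return Note(*row) if row else None

repo = NoteRepository()
repo.save(Note("welcome", "first draft"))
loaded = repo.load("welcome")
```

The application model can gain or lose fields freely; only the repository has to worry about schema migration and data longevity.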

Cheers,
Landon

Vincent

Feb 3, 2011, 11:25:57 AM
to cocoa-unbound
Landon, thank you very much for your detailed and interesting input!
I, for one, very much welcome your divergent opinion on Core Data in
this matter -- not to forget the very real and quite essential potential
issues with Core Data that you mentioned.

My app is pretty database-y though, so I've decided to stick with
SQLite for the heavy stuff and storage and use Core Data (probably
even store-less) as a sandbox for handling in-memory objects.

Ben T

Feb 3, 2011, 5:12:39 PM
to cocoa-unbound


On Jan 23, 8:15 am, Vincent <satsu...@googlemail.com> wrote:
> Hi guys, hi Brent :)
>
> Planning to start a project that kind of behaves like an RSS reader,
> I've come to the point where one has to ask himself "SQLite or Core
> Data?".
>
> My app is going to be doing quite a lot of (complex) SQL-style
> querying, which can be quite painful with Core Data and certainly not
> fast, once you hit the higher n-thousands.

As I'll get to in a moment, Brent's discussion regarding Core Data
performance was primarily regarding bulk updates & deletions. Nearly
all query performance issues we've ever seen are pilot error.
NSFetchRequest offers a lot of options to tune a query to your
specific needs, whether it's just pulling back primary keys,
performing batch queries, prefetching related objects, or doing
aggregate calculations.

For query performance, the typical issues are (a) faulting because
related but necessary objects weren't prefetched, (b) fetching all or
most of the data, or (c) doing unrealistic things with Unicode strings.
All three can be dealt with, including canonicalizing Unicode text into
a denormalized search column that's more computationally tractable for
prefix or equality matching. There's an ADC example called
DerivedProperty showing how to do that. The first two are visible in
Instruments and SQL logging.
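The canonicalization technique might look like this (a Python/sqlite3 sketch of the idea, not the DerivedProperty sample itself): fold case and strip diacritics into a separate indexed column, then express beginsWith as an index-friendly range comparison:

```python
import sqlite3
import unicodedata

def canonicalize(s):
    # Decompose, drop combining marks, and casefold: "Déjà Vu" -> "deja vu".
    decomposed = unicodedata.normalize("NFD", s)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (title TEXT, title_search TEXT)")
conn.execute("CREATE INDEX notes_search ON notes (title_search)")
for title in ("Déjà Vu", "Résumé tips", "Plain title"):
    conn.execute("INSERT INTO notes VALUES (?, ?)", (title, canonicalize(title)))

# beginsWith expressed as a range scan on the canonical column, which the
# index can satisfy (unlike LIKE, which is case-insensitive by default).
prefix = canonicalize("Déja")
hits = conn.execute(
    "SELECT title FROM notes WHERE title_search >= ? AND title_search < ?",
    (prefix, prefix + "\uffff")).fetchall()
```

The original mixed-case, accented titles stay untouched for display; only the throwaway search column is simplified.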

We have one developer getting query times of 10ms - 200ms for
beginsWith predicates on a 400,000-entry table. On a phone. The large
variance is because he's doing live type-ahead querying, and the first
couple of letters are more expensive. We gave him advice to make it even
faster, but it's already at the point where he has better things to do
with his time.

On a Mac Pro, Core Data can return about 750,000 rows per second. Or
if you want to identify the matching rows and iterate through them in
batches, you can hit nearly 3 million rows per second and process all
of them in O(1) memory.

On multi-core machines for N-thousand row queries, Core Data nearly
always outperforms direct SQLite usage. Once you play in that league,
it's about I/O (which is the same between the two) and really careful
memory management. Which we do full time, while you have to write the
rest of your application. And sometimes also run a business.

If your experience is otherwise, please file a bug with
bugreport.apple.com and we'll address your circumstances. The only
significant outstanding issue with query performance I'm aware of is
the need for compound indices.

> This and other reasons led me to the conclusion I should rather use
> SQLite than Core Data.
>
> And then there also was this famous article by you, Brent, on your
> decisions to dump Core Data in favor of SQLite & FMDB:http://inessential.com/2010/02/26/on_switching_away_from_core_data
>
> All four listed arguments against Core Data would totally apply to my
> own app. And I can totally see the penalties with Core Data in these
> four scenarios.

The bulk modifications and bulk deletions are genuine issues for some
specific usage patterns on lower end devices. It's very rare for
these to be real issues on a Mac. We're aware of save performance,
particularly on older embedded devices, needing more love. If these
are issues for your app, please file a bug on the "reports are votes"
principle.

Brent's issue #3 has a solution in Core Data: import the new objects
in batches instead of one at a time. Instead of looking up existing
objects with O(N) fetches, use an IN clause on a respectable batch
size, like 500 or 1000. This is the database equivalent of 1000 x
read(1 byte) vs. 1 x read(1000 bytes). "insert or replace"-style
uniquing has substantial issues when it comes to object identity and
relationship maintenance; it's pretty easy to bork your foreign keys
this way. People sometimes try to cheat around that by using their
unique criteria as the primary key itself, fixing the FK maintenance
with some indirection -- and paying for it by having slower everything
else, all the time.

While issue #3 has a solution in Core Data, I'll be the first to admit
it's not as elegant as seamless uniquing for many circumstances.
However, for bulk import, the batch IN query is probably superior as
uniquing would have to be more expensive to accommodate
relationships. "insert or replace" is great if you have a simplistic
schema without relationships
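The batch-IN lookup might be sketched like this (illustrative Python/sqlite3; in Core Data the equivalent is a fetch with an IN predicate over each batch of candidate keys):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (guid TEXT PRIMARY KEY, title TEXT)")
conn.execute("INSERT INTO articles VALUES ('guid-1', 'Already here')")

# Incoming items keyed by their unique criteria (guid).
incoming = {"guid-1": "Updated copy", "guid-2": "New one", "guid-3": "Another new one"}
BATCH = 500  # a respectable batch size, per the advice above

guids = list(incoming)
for i in range(0, len(guids), BATCH):
    batch = guids[i:i + BATCH]
    placeholders = ",".join("?" * len(batch))
    # One IN query per batch instead of one existence fetch per item.
    existing = {row[0] for row in conn.execute(
        "SELECT guid FROM articles WHERE guid IN (%s)" % placeholders, batch)}
    conn.executemany(
        "INSERT INTO articles VALUES (?, ?)",
        [(g, incoming[g]) for g in batch if g not in existing])
```

Existing rows keep their identity (guid-1's row is untouched here), which is exactly what an "insert or replace" would have clobbered.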

Brent is simply wrong about #4. Use -countForFetchRequest. If you
have the objectID, use -existingObjectWithID:, which is even faster
than hand-written SQL as it may short-circuit in memory and not do any
I/O.

> There is one thing however that I couldn't make any sense of:
> SQLite maintains an on-disk representation of your app's data, right?
> Your app however deals with an in-memory representation of same data.
> This requires syncing: "[Brent:] Making the switch did mean I had to
> do some things manually that Core Data would have done for me: keeping
> any in-memory items synced with the database storage, mostly."
>
> If running "[feedItem markAsUnread];" on 10,000 feedItems in Core Data
> was too slow for NNW, and SQLite on the other side super fast…
> …then wouldn't you lose all the benefits of SQLite's performance at
> the very point where you have to sync your in-memory objects (the
> stuff that gets presented to the user) with their newly updated/
> created/deleted/etc on-disk clones in SQLite?

That is often the case with developers using SQLite directly in
anything remotely resembling an MVC application design. Typically,
developers start with something simple, and then accrete features, and
eventually find themselves doing their own relationship maintenance,
change tracking, delete propagation, and notifications.

If you're not using MVC, or your schema / features remain simplistic
(like no relationships), then balancing the issues is different.

Brent's issue with bulk modifications is something of an exception
here.


> "[Brent:] With one query the app can set the status for a whole bunch
> of items at once, without having to instantiate them as objects:
> update newsItems set read = 1 where..."
> Right, but this doesn't update any of the already allocated in-memory
> NSObjects in your NSArrayController, feeding your NSTableView, e.g.

Correct.

> How about those?
>
> I searched and searched but couldn't find any insightful articles/
> tutorials/documentations on how to deal with the syncing between
> NSObjects & SQLite.

Going down this path will involve writing a huge amount of
infrastructure. If you want to do this as an educational endeavor,
knock yourself out. If you're interested in shipping a product, why
don't you write a few simple test projects (CLIs) for the key
operations you're concerned about and benchmark them with
representative sample data? If the performance is adequate, you
don't need to write your own ER framework. If not, file some bugs
and attach the CLIs & sample data. If you keep the sample projects'
code bases small enough, we'll take a look and propose solutions for
you. Then you can decide if those work for you, or if it's just not a
good fit.

- Ben

Vincent

Feb 5, 2011, 7:28:26 PM
to cocoa-unbound
Wow, this thread turned out way hotter and more insightful than I ever
expected. :D
Thanks, guys, and thanks, Ben, for your very detailed comment!

Based on my own experiences and what I've read from others, I've
decided to make my project use a combo of Core Data & SQLite,
as some aspects of the app are very object-y (feeds, feed entries,
etc.) and meant for interactivity, while others are pretty much
database-y (analytical and statistical log entries for certain heavily
database-y calculations, or just stuff that -- being pretty much pure
"state containers" -- would make little to no sense as actual objects,
in particular as there will easily be 1,000k+ of them and none will
ever be used for anything other than calculations).

And Ben, be assured I will do thorough testing before opting for this
or that. ;)

Thanks again guys, you rock!

- Vincent