Hi Jon,
At the moment, none of the indexes bundled with CQEngine write to disk. But it's fairly easy to write a new index class for CQEngine which wraps one of the many on-disk data structures available.
CQEngine does support external (e.g. on-disk/off heap) indexes, and also there is support to keep the collection itself external. It's possible to use CQEngine as an indexing/query engine for data which is spread across multiple sources (in memory/on disk/in remote systems).
In terms of performance, CQEngine benefits greatly from having indexes in RAM. Lookups in RAM are measured in nanoseconds, while lookups on disk are measured in milliseconds. However, if you have locality of reference in your access patterns (and your on-disk index structure takes advantage of this), then the OS page cache could keep the most frequently accessed sections of an on-disk index in RAM.
==Persisting indexes to disk==
There are a few open source on-disk Map-like data structures, the classic one being BerkeleyDB[1], and newer ones like LevelDB[2] and MapDB[3].
It would be quite easy to write a new index class (i.e. which implements either the Index[4] or the AttributeIndex[5] interface), where the implementation delegates to one of those data structures.
It would also be possible to write an index based on Lucene[6] (for advanced text indexing), or even a conventional database (for whatever reason).
Then add the new type of index to your IndexedCollection as normal.
For your dataset, if some indexes would fit in memory and some would not, CQEngine can dynamically choose between in-memory indexes and on-disk indexes based on the retrieval costs reported by the indexes. On-disk indexes should report fairly high retrieval costs, so that when CQEngine has a choice of indexes for a query, it picks the in-memory ones.
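To make the idea concrete, here is a minimal, self-contained sketch of an index which delegates to a Map and reports a retrieval cost. Note the assumptions: SimpleIndex, MapBackedIndex and cheapest() are hypothetical names invented for this illustration, and they are not the real CQEngine Index or AttributeIndex interfaces. A TreeMap stands in for an on-disk Map, and the cost values 30 and 1000 are arbitrary.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class RetrievalCostSketch {

    /** Hypothetical, simplified contract; NOT the real CQEngine Index interface. */
    interface SimpleIndex<A, O> {
        int retrievalCost();
        Set<O> retrieve(A attributeValue);
        void add(A attributeValue, O object);
    }

    /** Delegates storage to any Map; an on-disk Map implementation could be passed in instead. */
    static class MapBackedIndex<A, O> implements SimpleIndex<A, O> {
        private final Map<A, Set<O>> backingMap;
        private final int retrievalCost;

        MapBackedIndex(Map<A, Set<O>> backingMap, int retrievalCost) {
            this.backingMap = backingMap;
            this.retrievalCost = retrievalCost;
        }

        @Override
        public int retrievalCost() {
            return retrievalCost;
        }

        @Override
        public Set<O> retrieve(A attributeValue) {
            Set<O> result = backingMap.get(attributeValue);
            return result == null ? Collections.<O>emptySet() : result;
        }

        @Override
        public void add(A attributeValue, O object) {
            // Copy, modify, then put back: maps which serialize values on put()
            // would not see in-place changes to a previously stored Set.
            Set<O> existing = backingMap.get(attributeValue);
            Set<O> updated = (existing == null) ? new HashSet<O>() : new HashSet<O>(existing);
            updated.add(object);
            backingMap.put(attributeValue, updated);
        }
    }

    /** Picks the candidate index reporting the lowest retrieval cost. */
    static <A, O> SimpleIndex<A, O> cheapest(List<SimpleIndex<A, O>> candidates) {
        SimpleIndex<A, O> best = candidates.get(0);
        for (SimpleIndex<A, O> candidate : candidates) {
            if (candidate.retrievalCost() < best.retrievalCost()) {
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        SimpleIndex<String, Integer> inMemory =
                new MapBackedIndex<String, Integer>(new HashMap<String, Set<Integer>>(), 30);
        // TreeMap is only a stand-in here for a Map which persists to disk
        SimpleIndex<String, Integer> onDisk =
                new MapBackedIndex<String, Integer>(new TreeMap<String, Set<Integer>>(), 1000);
        inMemory.add("red", 1);
        onDisk.add("red", 1);

        SimpleIndex<String, Integer> chosen = cheapest(Arrays.asList(onDisk, inMemory));
        System.out.println("Chosen cost: " + chosen.retrievalCost()
                + ", results: " + chosen.retrieve("red"));
    }
}
```

A real implementation would implement the CQEngine interfaces instead, but the shape is the same: the index is a thin adapter over the storage, and the cost it reports is what steers the choice.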
By the way if you do write any of those indexes, it would be awesome if you'd consider contributing them back to the project!
==Reducing memory usage==
Above, we looked at storing indexes on disk. But what if the collection itself is too big to fit in memory?
Option 1 - efficient in-memory representation
* Take a look at HugeCollections[7], which essentially store large collections in a more compact form. CQEngine can currently build indexes on HugeCollections without any problem.
Option 2 - don't store objects in memory, store foreign keys
* Note that CQEngine doesn't actually index a collection of objects, although it might seem that way. CQEngine indexes Attributes, which are actually functions.
* Since CQEngine indexes Attributes, not objects, the full objects whose fields are indexed don't need to be stored in memory at all. Instead of maintaining an IndexedCollection<SomeObject>, consider maintaining an IndexedCollection<Integer>, where the Integers are actually foreign keys into some on-disk or remote data source, from which the attributes read the associated values.
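As a rough illustration of Option 2, the sketch below treats an attribute as a plain function from a foreign key to a value, and builds an index over the keys only. This is plain Java, not CQEngine code: EXTERNAL_STORE, CITY and buildIndex() are hypothetical names, and a HashMap stands in for the on-disk or remote data source.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

public class ForeignKeySketch {

    // Stand-in for an on-disk or remote data source: foreign key -> {name, city}
    static final Map<Integer, String[]> EXTERNAL_STORE = new HashMap<Integer, String[]>();

    // An "attribute" is just a function: given a foreign key, it reads a value from the store
    static final Function<Integer, String> CITY = key -> EXTERNAL_STORE.get(key)[1];

    /** Indexes the foreign keys by attribute value; the full records are never held by the index. */
    static Map<String, Set<Integer>> buildIndex(Collection<Integer> keys,
                                                Function<Integer, String> attribute) {
        Map<String, Set<Integer>> index = new HashMap<String, Set<Integer>>();
        for (Integer key : keys) {
            index.computeIfAbsent(attribute.apply(key), v -> new HashSet<Integer>()).add(key);
        }
        return index;
    }

    public static void main(String[] args) {
        EXTERNAL_STORE.put(1, new String[] {"Alice", "Dublin"});
        EXTERNAL_STORE.put(2, new String[] {"Bob", "London"});
        EXTERNAL_STORE.put(3, new String[] {"Carol", "Dublin"});

        Map<String, Set<Integer>> cityIndex = buildIndex(EXTERNAL_STORE.keySet(), CITY);

        // The query returns foreign keys; the caller fetches full records only if it needs them
        System.out.println("Keys in Dublin: " + cityIndex.get("Dublin"));
    }
}
```

The collection and the index hold only Integers; the attribute function is the only place which touches the external store.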
Option 3 - don't even store foreign keys in memory, store foreign keys on disk
* If even IndexedCollection<Integer> were too big to fit in memory, it is also possible to replace the in-memory java.util.Set in which CQEngine would store these integers with one which persists to disk -- see [8]. If you do this, then CQEngine would not actually store any data in memory at all.
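If the on-disk library you pick exposes a java.util.Map rather than a java.util.Set (an assumption; check what your library provides), one way to get a Set view over it is the JDK's Collections.newSetFromMap. The sketch below shows only that wrapping step, with a TreeMap standing in for the on-disk Map; setBackedBy() is a hypothetical helper name, and hooking the resulting Set into CQEngine as described in [8] is not shown.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class MapBackedSetSketch {

    /**
     * Returns a Set whose elements are stored as the keys of the given Map.
     * Collections.newSetFromMap requires the map to be empty when it is wrapped,
     * and throws IllegalArgumentException otherwise.
     */
    static <E> Set<E> setBackedBy(Map<E, Boolean> emptyBackingMap) {
        return Collections.newSetFromMap(emptyBackingMap);
    }

    public static void main(String[] args) {
        // TreeMap is only a stand-in here for a Map which persists to disk
        Map<Integer, Boolean> backingMap = new TreeMap<Integer, Boolean>();
        Set<Integer> foreignKeys = setBackedBy(backingMap);

        foreignKeys.add(42);
        foreignKeys.add(7);
        foreignKeys.add(42); // duplicate, ignored as with any Set

        // Every element added to the Set ends up as a key in the backing map
        System.out.println("Set size: " + foreignKeys.size()
                + ", backing map keys: " + backingMap.keySet());
    }
}
```

One limitation to be aware of: because the map must be empty when wrapped, this approach can't reopen a map which was already populated in an earlier run. For that you would need your own Set implementation (e.g. extending java.util.AbstractSet) over the map.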
Good luck!
Niall
[8] Override IndexedCollectionImpl.createSet(int initialSize) method to return an on-disk implementation of java.util.Set