Does CQEngine Write to Disk?


Jon

Jan 28, 2013, 3:16:56 PM
to cqengine...@googlegroups.com
Hi,

First off, I'd like to say this is a great tool. I just have one question.

Let's say I want to index a dataset that's larger than the amount of memory on my machine. Can CQEngine write the indexed data to disk?


Thanks a lot!

Niall

Jan 29, 2013, 6:15:07 PM
to cqengine...@googlegroups.com
Hi Jon,

There are no indexes bundled with CQEngine at the moment which write to disk. But it's fairly easy to write a new index class for CQEngine, which wraps one of the many on-disk data structures available.

CQEngine does support external (e.g. on-disk/off heap) indexes, and also there is support to keep the collection itself external. It's possible to use CQEngine as an indexing/query engine for data which is spread across multiple sources (in memory/on disk/in remote systems).

In terms of performance, obviously CQEngine benefits greatly from having indexes in RAM. Lookups in RAM are measured in nanoseconds, while lookups on disk are measured in milliseconds. However, if you have locality of reference in your access patterns (and your on-disk index structure takes advantage of this), then the OS page cache could keep the most frequently accessed sections of an on-disk index in RAM.
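As a rough illustration of leaning on the page cache, here is a minimal sketch that keeps a sorted array of keys in a memory-mapped file and binary-searches it. Memory mapping is just one way to do this, not something CQEngine provides; the class and method names (`MappedSortedKeys`, `writeAndMap`, `indexOf`) are hypothetical. Pages touched often by lookups are the ones the OS is likely to keep in RAM.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.LongBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical example: sorted keys stored in a file and read through a memory mapping.
class MappedSortedKeys {

    // Writes the (already sorted) keys to a temp file via a read-write mapping
    // and returns the mapped view. The mapping stays valid after the channel closes.
    static LongBuffer writeAndMap(long[] sortedKeys) {
        try {
            Path file = Files.createTempFile("sorted-keys", ".idx");
            file.toFile().deleteOnExit();
            try (FileChannel channel = FileChannel.open(file,
                    StandardOpenOption.READ, StandardOpenOption.WRITE)) {
                LongBuffer keys = channel
                        .map(FileChannel.MapMode.READ_WRITE, 0, (long) sortedKeys.length * Long.BYTES)
                        .asLongBuffer();
                keys.put(sortedKeys);
                return keys;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Binary search using absolute reads; returns the position of target, or -1 if absent.
    static int indexOf(LongBuffer keys, long target) {
        int low = 0;
        int high = keys.capacity() - 1;
        while (low <= high) {
            int mid = (low + high) >>> 1;
            long value = keys.get(mid);
            if (value < target) {
                low = mid + 1;
            } else if (value > target) {
                high = mid - 1;
            } else {
                return mid;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        LongBuffer keys = writeAndMap(new long[] {2L, 5L, 9L, 14L});
        System.out.println(indexOf(keys, 9L)); // prints 2
    }
}
```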

==Persisting indexes to disk==

There are a few open source on-disk Map-like data structures, the classic one being Berkeley DB[1], and newer ones like LevelDB[2] and MapDB[3].

It would be quite easy to write a new index class (i.e. which implements either the Index[4] or the AttributeIndex[5] interface), where the implementation delegates to one of those data structures.
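To make the delegation idea concrete, here is a minimal sketch in plain Java. It does not implement CQEngine's real Index or AttributeIndex interfaces (their methods aren't shown in this thread); `MapBackedIndex`, `add` and `retrieveEqual` are hypothetical names. The backing store is an ordinary `java.util.Map`, here an in-memory `TreeMap` standing in for whichever Map-like on-disk structure you choose.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.function.Function;

// Hypothetical index: maps an attribute value to the set of objects having that value.
// All storage is delegated to the supplied Map, so a persistent Map could be swapped in.
class MapBackedIndex<A, O> {
    private final Function<O, A> attribute; // reads the indexed value from an object
    private final Map<A, Set<O>> store;     // backing store (in-memory in this sketch)

    MapBackedIndex(Function<O, A> attribute, Map<A, Set<O>> store) {
        this.attribute = attribute;
        this.store = store;
    }

    void add(O object) {
        A key = attribute.apply(object);
        // Copy, modify, and put back: a persistent store may not notice in-place mutation.
        Set<O> bucket = new HashSet<>(store.getOrDefault(key, Collections.emptySet()));
        bucket.add(object);
        store.put(key, bucket);
    }

    Set<O> retrieveEqual(A value) {
        return Collections.unmodifiableSet(store.getOrDefault(value, Collections.emptySet()));
    }
}

class MapBackedIndexDemo {
    public static void main(String[] args) {
        MapBackedIndex<Integer, String> byLength =
                new MapBackedIndex<>(String::length, new TreeMap<>());
        for (String word : Arrays.asList("disk", "heap", "index")) {
            byLength.add(word);
        }
        // Looks up the words whose length attribute equals 4 ("disk" and "heap").
        System.out.println(byLength.retrieveEqual(4));
    }
}
```

The copy-and-put in `add` is a deliberate choice for this sketch: it costs more than mutating the bucket in place, but it does not rely on the store tracking changes to values it has already handed out.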

It would also be possible to write an index based on Lucene[6] (for advanced text indexing), or even a conventional database (for whatever reason).

Then add the new type of index to your IndexedCollection as normal.

For your dataset, if some indexes would fit in memory and some would not, CQEngine can dynamically choose between in-memory indexes and on-disk indexes based on the retrieval costs reported by the indexes. On-disk indexes should return fairly high retrieval costs, so that given a choice for a query it will choose the in-memory ones.
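The selection logic can be pictured roughly like this. It is only a sketch of the idea, not CQEngine's actual query engine: `CostedIndex`, `IndexChooser` and the cost numbers are all made up for illustration.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for an index that reports how expensive a lookup is.
class CostedIndex {
    final String name;
    final int retrievalCost; // lower is cheaper; the values used below are illustrative only

    CostedIndex(String name, int retrievalCost) {
        this.name = name;
        this.retrievalCost = retrievalCost;
    }
}

class IndexChooser {
    // Picks the candidate index with the lowest reported retrieval cost.
    static CostedIndex cheapest(List<CostedIndex> candidates) {
        return candidates.stream()
                .min(Comparator.comparingInt((CostedIndex index) -> index.retrievalCost))
                .orElseThrow(() -> new IllegalArgumentException("no candidate indexes"));
    }

    public static void main(String[] args) {
        CostedIndex inMemory = new CostedIndex("in-memory hash index", 30);
        CostedIndex onDisk = new CostedIndex("on-disk index", 500);
        System.out.println(cheapest(Arrays.asList(onDisk, inMemory)).name); // prints "in-memory hash index"
    }
}
```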

By the way, if you do write any of those indexes, it would be awesome if you'd consider contributing them back to the project!

==Reducing memory usage==

Above, we looked at storing indexes on disk. But what if the collection itself is too big to fit in memory?

Option 1 - efficient in-memory representation

  * Take a look at HugeCollections[7], which essentially stores large collections in a more compact form. CQEngine can currently build indexes on HugeCollections without any problem.

Option 2 - don't store objects in memory, store foreign keys

  * Note that CQEngine doesn't actually index a collection of objects, although it might seem that way. CQEngine indexes Attributes, which are actually functions.
  * Since CQEngine indexes Attributes, not objects, the full objects whose fields are indexed don't need to be held in memory at all. Instead of maintaining an IndexedCollection<SomeObject>, consider maintaining an IndexedCollection<Integer>, where the Integers are actually foreign keys into some on-disk or remote data source, from which the attributes read the associated values.
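A minimal sketch of the foreign-key idea in plain Java, without CQEngine's own classes: the collection holds only Integer keys, and the "attribute" is a function that reads the value for a key from an external source. The `HashMap` here is a stand-in for that on-disk or remote source, and all names (`ForeignKeyDemo`, `CITY`, `buildIndex`) and data are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class ForeignKeyDemo {
    // Stand-in for an on-disk or remote data source, keyed by record id.
    static final Map<Integer, String> EXTERNAL_CITY_BY_ID = new HashMap<>();
    static {
        EXTERNAL_CITY_BY_ID.put(1, "Dublin");
        EXTERNAL_CITY_BY_ID.put(2, "London");
        EXTERNAL_CITY_BY_ID.put(3, "Dublin");
    }

    // The attribute is a function from foreign key to value, not a field of a stored object.
    static final Function<Integer, String> CITY = EXTERNAL_CITY_BY_ID::get;

    // Builds an index from attribute value to the foreign keys that have that value.
    static <A> Map<A, List<Integer>> buildIndex(Iterable<Integer> keys,
                                                Function<Integer, A> attribute) {
        Map<A, List<Integer>> index = new HashMap<>();
        for (Integer key : keys) {
            index.computeIfAbsent(attribute.apply(key), value -> new ArrayList<>()).add(key);
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> byCity = buildIndex(Arrays.asList(1, 2, 3), CITY);
        System.out.println(byCity.get("Dublin")); // prints [1, 3]
    }
}
```

A query against the index returns foreign keys only; the full records are fetched from the external source afterwards, and only for the keys that matched.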

Option 3 - don't even store foreign keys in memory, store foreign keys on disk

  * If even an IndexedCollection<Integer> were too big to fit in memory, it is also possible to replace the in-memory java.util.Set in which CQEngine would store these integers with one which persists to disk (see [8]). If you do this, then CQEngine would not actually store any data in memory at all.

Good luck!

Niall

[8] Override the IndexedCollectionImpl.createSet(int initialSize) method to return an on-disk implementation of java.util.Set

Jon

Feb 4, 2013, 11:24:01 PM
to cqengine...@googlegroups.com
Thanks for the response! This gives me a lot to think about when approaching a problem like this.

Niall Gallagher

Feb 5, 2013, 3:12:35 AM
to cqengine...@googlegroups.com
You probably got more info than you bargained for there :D

I looked into MapDB and others in a bit more detail, and I think this is something I want CQEngine to provide, for working with datasets larger than available RAM.

I started a new project which will provide additional on-disk support for CQEngine (as plugins) using MapDB, so watch this space! - http://code.google.com/p/cqengine-mapdb/


--
-- You received this message because you are subscribed to the "cqengine-discuss" group.
http://groups.google.com/group/cqengine-discuss
---
You received this message because you are subscribed to the Google Groups "cqengine-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cqengine-discu...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
 
 

Behrang QasemiZadeh

Oct 26, 2013, 12:28:04 PM
to cqengine...@googlegroups.com
Hi Niall,

Any update on the topic of MapDB + CQEngine?
I am interested in its application for text indexing. Do you know of any application of CQEngine for text indexing?

Best Regards,

Behrang

Niall

Oct 29, 2013, 8:19:59 PM
to cqengine...@googlegroups.com
Hi Behrang,

I'm afraid there is no update on CQEngine+MapDB yet. It is still something I'd like to do, but requires a few days to really think about it, and I've not had a few consecutive days free like that recently.

Regarding text indexing - "text indexing" is quite a broad area of research. I'm not sure which aspects of it you are interested in. If you mean simply retrieving objects whose text fields match some query, then yes, you could use the radix/suffix tree indexes to make those queries very fast. Think of CQEngine as a low-latency database.
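To show why a suffix-based index helps with "contains" queries, here is a deliberately naive sketch. It is not CQEngine's radix/suffix tree implementation: it stores every suffix of every value in a sorted map, which uses far more memory than a real suffix tree, and the names (`NaiveSuffixIndex`, `containing`) are hypothetical. The point is only that a value contains a fragment exactly when one of its suffixes starts with that fragment, so the query becomes a range scan instead of a scan of the whole collection.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical, memory-hungry stand-in for a suffix tree index.
class NaiveSuffixIndex {
    // Every suffix of every indexed value, mapped to the values it came from.
    private final TreeMap<String, Set<String>> valuesBySuffix = new TreeMap<>();

    void add(String value) {
        for (int start = 0; start < value.length(); start++) {
            valuesBySuffix
                    .computeIfAbsent(value.substring(start), suffix -> new TreeSet<>())
                    .add(value);
        }
    }

    // Returns the indexed values that contain the given fragment.
    Set<String> containing(String fragment) {
        Set<String> matches = new TreeSet<>();
        // Suffixes that start with the fragment sort contiguously, beginning at the fragment.
        for (Map.Entry<String, Set<String>> entry
                : valuesBySuffix.tailMap(fragment, true).entrySet()) {
            if (!entry.getKey().startsWith(fragment)) {
                break; // left the contiguous range of matching suffixes
            }
            matches.addAll(entry.getValue());
        }
        return matches;
    }

    public static void main(String[] args) {
        NaiveSuffixIndex index = new NaiveSuffixIndex();
        index.add("truck");
        index.add("car");
        index.add("cart");
        System.out.println(index.containing("ar")); // prints [car, cart]
    }
}
```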

But a database alone does not a search engine make (as Yoda might say :).

So if you are interested in semantic understanding of the query (I search for "vehicle" and get cars, trucks and other subclasses of vehicle, plus synonyms), then I'm afraid those areas are outside of CQEngine's remit. With CQEngine, you need to build hierarchies/equivalences like that into your dataset explicitly; CQEngine won't do it for you.

Take a look at Lucene; it is probably the de facto library for doing that kind of indexing in Java. If you'd like to use Lucene with CQEngine though, you could write a simple implementation of CQEngine's Index interface that wraps a Lucene index, and that way you could query Lucene indexes within broader CQEngine queries or using CQEngine syntax.

Hope that helps?!
Niall

Suminda Dharmasena

Jun 18, 2014, 1:08:15 PM
to cqengine...@googlegroups.com
For indexing, this might be an option also: http://mg4j.di.unimi.it/

Niall Gallagher

Jun 18, 2014, 2:27:28 PM
to cqengine...@googlegroups.com

That looks interesting. Is it an alternative to Lucene?


Suminda Dharmasena

Jun 18, 2014, 3:02:49 PM
to cqengine...@googlegroups.com
I have not used Lucene to make a proper comparison, but I have used other projects from http://di.unimi.it/