Is using ElasticSearch with OrientDB possible?


Kevin I

Mar 16, 2015, 11:41:36 AM
to orient-...@googlegroups.com
I can see that Lucene indices can be created in OrientDB through orientdb-lucene, but is there a way to use ElasticSearch with OrientDB? In TitanDB, ElasticSearch support is built in. It would be great if OrientDB had that too.

If not, can I make the two work together out of the box? I haven't used ElasticSearch before, so it would be of great help if anyone can help me out with this.

Thanks.

Colin

Mar 16, 2015, 4:08:11 PM
to orient-...@googlegroups.com
Hi Patrick,

ES uses Lucene indices, replicated and sharded across nodes, and then aggregates the results when queried.

I don't know if OrientDB will ever work with ES, since they share many of the same features, and it would probably make more sense to just extend some of OrientDB's distributed capabilities with the index when using Lucene.

Regards,
-Colin

Orient Technologies

The Company behind OrientDB

Nicolas Harraudeau

Mar 19, 2015, 12:37:18 PM
to orient-...@googlegroups.com
Hi Patrick,
I have looked for a way to do it myself but didn't find a correct one. Here is what I found:

In my experience with indexing problems on another search engine and other sources, there are always two different jobs:
- The first one does a full scan of the source. With OrientDB this is possible using a simple JDBC driver and a few queries. OrientDB can be scanned completely using pagination: http://www.orientechnologies.com/docs/last/Pagination.html
- The second job is more complex. It has to fetch only the modified documents, as often as you need, in order to keep the results up to date.
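
A minimal sketch of the first job (the full scan), assuming the source can be queried one page at a time with something of the form `SELECT FROM Foo WHERE @rid > :after LIMIT :limit`. The function `fetch_page`, the class name `Foo` and the in-memory stand-in `fake_fetch_page` are hypothetical; a real version would run the query through whatever driver you use.

```python
from typing import Callable, Iterator, List, Optional, Tuple

Rid = Tuple[int, int]  # (cluster id, position), e.g. #10:20 -> (10, 20)
Record = dict


def full_scan(fetch_page: Callable[[Optional[Rid], int], List[Record]],
              page_size: int = 100) -> Iterator[Record]:
    """Yield every record by repeatedly asking for the page after the last RID seen."""
    last_rid: Optional[Rid] = None
    while True:
        page = fetch_page(last_rid, page_size)
        if not page:
            return  # an empty page means the scan is complete
        yield from page
        last_rid = page[-1]["rid"]


# Hypothetical in-memory stand-in for the database, already sorted by RID.
_FAKE_CLUSTER = [{"rid": (10, i), "name": f"doc-{i}"} for i in range(7)]


def fake_fetch_page(after: Optional[Rid], limit: int) -> List[Record]:
    # Mimics: SELECT FROM Foo WHERE @rid > :after LIMIT :limit
    rows = [r for r in _FAKE_CLUSTER if after is None or r["rid"] > after]
    return rows[:limit]


scanned = list(full_scan(fake_fetch_page, page_size=3))
```

Paging on the last RID seen, rather than on an offset, keeps each page query cheap and is not thrown off by records inserted behind the cursor.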

When fetching updates you want to scan from the start date of the last scan because modifications can happen during the scan itself. Let's name this start date "checkpoint".

My first thought was that I could save the last modification timestamp in the OrientDB documents. But I didn't find any way to generate it during commit. It MUST NOT be generated by the application, as this would produce dates which are generated BEFORE the checkpoint but saved AFTER this same checkpoint. Think of your application making a modification that spans the start of the update scan.
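
To make the problem concrete, here is a small simulation (not OrientDB code). All names and the integer clock are assumptions made for illustration: `modified_at` is the timestamp stored in the document and `visible_at` is the moment the commit becomes readable. Each update scan looks for documents modified since the start of the previous scan.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Doc:
    doc_id: str
    modified_at: int  # timestamp stored in the document
    visible_at: int   # time at which the commit becomes visible to readers


def incremental_scan(docs: List[Doc], checkpoint: int, scan_time: int) -> List[str]:
    """Return ids of docs visible at scan_time whose timestamp is >= checkpoint."""
    return [d.doc_id for d in docs
            if d.visible_at <= scan_time and d.modified_at >= checkpoint]


def run_two_scans(docs: List[Doc]) -> Set[str]:
    # Scan 1 starts at t=10 and reads at t=11, looking back to t=0.
    first = incremental_scan(docs, checkpoint=0, scan_time=11)
    # Scan 2 starts at t=20 and looks back to the start of scan 1 (t=10).
    second = incremental_scan(docs, checkpoint=10, scan_time=21)
    return set(first) | set(second)


# "a" is stamped by the application at t=9 but only committed at t=12.
app_stamped = [Doc("a", modified_at=9, visible_at=12),
               Doc("b", modified_at=11, visible_at=11)]
# Same history, but the timestamp is generated at commit time.
commit_stamped = [Doc("a", modified_at=12, visible_at=12),
                  Doc("b", modified_at=11, visible_at=11)]

missed_with_app_stamp = {"a", "b"} - run_two_scans(app_stamped)
missed_with_commit_stamp = {"a", "b"} - run_two_scans(commit_stamped)
```

With the application-generated timestamp, document "a" is invisible to the first scan and too old for the second one, so it is never indexed. With a commit-time timestamp, the second scan picks it up.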

The second approach would be to create a "Modifications to scan" vertex and link every modified document to it. This would not scale, as transactions touching that single vertex would conflict more and more.

The third approach is to use Hooks which would mark documents as modified. However, the documentation is rather poor on those. In order to be used by an update scan, hook registration needs to be transactional. I asked here if adding a hook invalidates the running transactions (https://groups.google.com/forum/#!topic/orient-database/FBHiZg68b1s) but did not receive any answer. I tested it myself and found that it does not work as I would like (https://github.com/orientechnologies/orientdb/issues/3763). There is still no information as to how it SHOULD work. No specifications.

Maybe one of those features will make a correct update stream possible:

In the meantime, I don't see any way to index OrientDB correctly. If someone has succeeded at indexing OrientDB, I am interested too.

OrientDB-Lucene is promising but it is too limited for me right now. I cannot work without features like highlights or complex scoring.

Enrico Risa

Mar 19, 2015, 12:44:18 PM
to orient-...@googlegroups.com
Hi Guys,

I'm the maintainer of the Lucene plugin. For the plugin I implemented a custom index engine.
You can see some documentation here:
http://www.orientechnologies.com/docs/2.0/orientdb.wiki/Custom-Index-Engine.html

The integration should not be too hard. Once it is implemented,
you could create an Elasticsearch index directly with OrientDB SQL syntax like

'create index Foo.bar on Foo (bar) FULLTEXT ENGINE ELASTICSEARCH'

Could be a really good project :D
I really don't have time now, but I can help with some code if someone is interested.

Enrico

--

---
You received this message because you are subscribed to the Google Groups "OrientDB" group.
To unsubscribe from this group and stop receiving emails from it, send an email to orient-databa...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Nicolas Harraudeau

Mar 19, 2015, 2:04:11 PM
to orient-...@googlegroups.com, enric...@gmail.com
Hi Enrico,

Thank you for your rapid answer.

This is indeed an interesting possibility. However, I see some problems:

- If I understand correctly there is one index per OrientDB node. Elasticsearch has its own replication and consistency mechanism. Thus the index should be updated only once. This might also create problems with transactions.
- Does an OIndexEngine contain the full documents or only the RIDs? The goal of indexing in Elasticsearch is to be able to query it directly as it offers different features (like highlights and ngram autocomplete).
- I'm not sure that running Elasticsearch and OrientDB in the same process is a good idea. Elasticsearch is known to have out of memory and split brain problems. It might create nightmarish situations. I would prefer to index OrientDB from time to time using a separate process.
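
For the separate-process option, a minimal sketch of what the indexing side could send: it builds the newline-delimited JSON body expected by Elasticsearch's bulk API (an action line followed by a source line for each document). The index name `orientdb_docs`, the sample documents and the choice of the RID as `_id` are assumptions for illustration; older Elasticsearch versions also expected a `_type` in the action line. Sending the full document content, not just the RID, is what would let Elasticsearch be queried directly.

```python
import json
from typing import Iterable, Mapping


def build_bulk_body(index: str, docs: Iterable[Mapping]) -> str:
    """Build an NDJSON bulk body: one action line plus one source line per doc."""
    lines = []
    for doc in docs:
        # Use the RID as the Elasticsearch id so re-indexing overwrites the old copy.
        action = {"index": {"_index": index, "_id": doc["rid"]}}
        source = {k: v for k, v in doc.items() if k != "rid"}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    if not lines:
        return ""
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


body = build_bulk_body("orientdb_docs", [
    {"rid": "#10:0", "name": "Alice"},
    {"rid": "#10:1", "name": "Bob"},
])
```

The resulting string would be POSTed to the `_bulk` endpoint by the replication process, keeping Elasticsearch in its own process as suggested above.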

Am I making wrong assumptions?

Regards,

Nicolas Harraudeau

Mar 20, 2015, 7:30:14 AM
to orient-...@googlegroups.com
Hi Kevin,

There might be a way to mark documents as updated. This is not an easy solution and I haven't tried it yet. It uses MVCC and Optimistic Transactions (you can read more about this here: http://www.orientechnologies.com/docs/2.0/orientdb.wiki/Transactions.html).
Let's say you have your application on one side, which is adding, deleting and updating documents in OrientDB. On the other side you have your replication process, which reads from OrientDB and writes to Elasticsearch.

When your replication process starts scanning OrientDB, it first creates or replaces a unique vertex (let's call it the "checkpoint vertex") which contains the start date of the scan.
Each time your application modifies OrientDB, it reads the checkpoint vertex and sets the modification date of each indexed vertex/edge to the checkpoint's date. If a scan started during the modification, the checkpoint vertex has been changed and the transaction should fail.
For deletes, a vertex describing the delete has to be created.
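
A rough in-memory model of this idea, to show the intended behaviour only. It is not OrientDB code, and every class and method name here is hypothetical. In particular, it simply assumes that a transaction fails when the checkpoint vertex it read has changed before commit; whether OrientDB's optimistic transactions behave that way for a record that was only read is not confirmed here.

```python
class ConcurrentModificationError(Exception):
    """Raised when the checkpoint vertex changed during a transaction."""


class Store:
    def __init__(self):
        self.checkpoint_date = 0     # start date of the latest scan
        self.checkpoint_version = 0  # MVCC-style version of the checkpoint vertex
        self.docs = {}               # doc id -> {"value": ..., "modified": date}
        self.deletes = []            # vertices describing deletes

    def start_scan(self, now):
        # The replication process replaces the checkpoint vertex.
        self.checkpoint_date = now
        self.checkpoint_version += 1

    def begin(self):
        return Transaction(self)


class Transaction:
    def __init__(self, store):
        self.store = store
        # The application reads the checkpoint vertex first.
        self.seen_date = store.checkpoint_date
        self.seen_version = store.checkpoint_version
        self.writes = {}
        self.deleted = []

    def put(self, doc_id, value):
        self.writes[doc_id] = {"value": value, "modified": self.seen_date}

    def delete(self, doc_id):
        self.deleted.append(doc_id)

    def commit(self):
        # Optimistic check: fail if a scan started since the checkpoint was read.
        if self.store.checkpoint_version != self.seen_version:
            raise ConcurrentModificationError("a scan started during the transaction")
        self.store.docs.update(self.writes)
        for doc_id in self.deleted:
            self.store.docs.pop(doc_id, None)
            self.store.deletes.append({"doc_id": doc_id, "modified": self.seen_date})


store = Store()
store.start_scan(now=100)

tx1 = store.begin()
tx1.put("v1", "hello")
tx1.commit()                # no scan in between: succeeds, stamped with date 100

tx2 = store.begin()
tx2.put("v2", "world")
store.start_scan(now=200)   # a scan starts while tx2 is still open
try:
    tx2.commit()
    conflicted = False
except ConcurrentModificationError:
    conflicted = True       # the application would retry with the new checkpoint
```

A retried transaction would read the new checkpoint date (200) and stamp its changes with it, so the following scan finds them.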

This has some drawbacks:
- the application either has to know what is indexed in ES, or it has to set a date on every vertex/edge.
- you must use transactions even when you want to modify a single vertex/edge.

I don't like this solution very much but it might be ok for you.

You might also use a file or something else as a modifications log. But then you can't back up both the modifications log and the OrientDB graph at the same time.

Regards,

Kevin I

Mar 23, 2015, 4:01:33 AM
to orient-...@googlegroups.com
Thanks all for the suggestions. From what I've understood so far, I think it's better to go with OrientDB-Lucene for now. I don't think it's worth the effort to make ES work with OrientDB: it clearly requires additional work to keep the indices in ES updated, and the application I am working on is not going to have millions of users.

I have used neither ES nor Lucene before, but I guess Lucene should be enough since ES is built on top of Lucene. So I guess I'll give it a try.

--
Always remember that the world around you is made by people that are no smarter than you and me.