Hi Patrick,
I have looked for a way to do this myself but haven't found a correct one. Here is what I found:
Having worked on indexing problems before, with another search engine and other data sources, I have found that there are always two different jobs:
- The second job is more complex. It has to fetch only the modified documents, as often as you need, in order to keep results up to date.
When fetching updates you want to scan from the start date of the last scan because modifications can happen during the scan itself. Let's name this start date "checkpoint".
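To make the checkpoint idea concrete, here is a minimal sketch in Python. It is not OrientDB code: the `Doc` class, the `modified_at` field and the `incremental_scan` function are hypothetical names, and it assumes every document carries a modification timestamp assigned at commit time.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: str
    modified_at: float  # assumed to be assigned at commit time (hypothetical field)


def incremental_scan(docs, previous_checkpoint, scan_start):
    """Return the documents to re-index and the checkpoint for the next scan.

    The new checkpoint is the START time of this scan, not its end time,
    so a modification committed while the scan is running is picked up
    by the next scan instead of being lost.
    """
    changed = [d for d in docs if d.modified_at >= previous_checkpoint]
    return changed, scan_start


docs = [Doc("a", 10.0), Doc("b", 25.0), Doc("c", 31.0)]
changed, checkpoint = incremental_scan(docs, previous_checkpoint=20.0, scan_start=30.0)
changed_ids = [d.doc_id for d in changed]
```

A document modified during the scan (like "c" above) may be indexed twice, once by this scan and once by the next. That is harmless as long as indexing is idempotent.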
My first thought was that I could save the last modification timestamp in the OrientDB documents. But I didn't find any way to generate it during commit. It MUST NOT be generated by the application, as this would produce dates which are generated BEFORE the checkpoint but saved AFTER that same checkpoint. Think of your application making a modification that spans the start of the update scan.
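The timing problem with application-generated timestamps can be simulated in a few lines of Python. This is only an illustration with made-up times and hypothetical names (`scan`, `simulate_missed_update`); no real database is involved.

```python
def scan(committed, since):
    """Return ids of committed documents whose timestamp is >= since."""
    return [doc_id for doc_id, ts in committed if ts >= since]


def simulate_missed_update():
    committed = []            # what the scanner can see in the database

    app_timestamp = 99.0      # t=99: application stamps the document, not yet committed
    scan_start = 100.0        # t=100: update scan starts, this is the checkpoint
    first = scan(committed, since=0.0)            # the document is not visible yet

    committed.append(("doc-1", app_timestamp))    # t=101: commit lands after the scan

    # The next scan starts from the checkpoint, but 99.0 < 100.0.
    second = scan(committed, since=scan_start)
    return first, second


first, second = simulate_missed_update()
missed = "doc-1" not in first and "doc-1" not in second
```

Here `missed` ends up true: the document is never seen by any scan, because its timestamp predates the checkpoint while its commit came after it. A timestamp assigned at commit time would not have this gap.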
The second approach would be to create a "Modifications to scan" vertex and link every modified document to it. This would not scale: every writing transaction would touch the same vertex, so transactions would conflict more and more as write activity grows.
Maybe one of these features will make a correct update stream possible:
In the meantime, I don't see any way to index OrientDB correctly. If someone has succeeded at indexing OrientDB, I am interested too.
OrientDB-Lucene is promising, but it is too limited for me right now. I cannot work without features like highlighting or complex scoring.