Naive question: Does Zoie really support real-time search?

Renat Gilfanov

Aug 26, 2011, 2:31:31 AM
to zoie
Hello,

I've started learning Zoie recently and ran into a "trivial" problem:
my indexed documents are not visible to an immediate search without
explicit calls to flushEvents() / flushEventsToMemoryIndex() /
syncWithVersion().

I found a similar topic in this group, but the things recommended there
(using NoopReaderCache, setting freshness to a small value, setting the
batch size to 1) didn't help.

In that topic I saw a comment that the Zoie developers themselves
use flushEvents() only for debugging, and that it's not recommended in
a case like mine.
Another point: my measurements show that a
flushEventsToMemoryIndex() call is quite expensive. It takes 100-200 ms
after indexing a simple document with 2 or 3 fields. I guess
even Lucene NRT (whose performance doesn't seem good enough for our
project) would be several times faster in this case.

I'd really like to know: maybe I haven't understood some important
concepts and did something wrong in my code? How can a search
framework with the proud slogan "a document is made available as soon as
it is added to the index" on its title page fail to handle the
simplest example you can imagine?

Is it possible to build a Zoie-based system where indexed documents
become available for search within a reasonably short time?

I'd be very thankful for any advice regarding my example /
implementation of a "realtime search" system.

I use version 3.0.0; here is part of my code (almost everything is
from the "code samples" section):

//Indexing part

ZoieConfig config = new ZoieConfig();
config.setFreshness(10);
config.setBatchSize(1);
config.setReadercachefactory(new ReaderCacheFactory() {
    @Override
    public <R extends IndexReader> AbstractReaderCache<R> newInstance(
            IndexReaderFactory<ZoieIndexReader<R>> readerfactory) {
        return new NoopReaderCache(readerfactory);
    }
});

ZoieSystem indexingSystem = new ZoieSystem(idxDir,
        new DataIndexableInterpreter(), decorator, config);
indexingSystem.start(); // ready to accept indexing events
Data[] data = new Data[1]; // build a batch of data objects to index
data[0] = new Data();
data[0].setId(3);
data[0].setContent("a");

// construct a collection of indexing events
ArrayList<DataEvent> eventList = new ArrayList<DataEvent>(data.length);
for (Data datum : data) {
    eventList.add(new DataEvent<Data>(datum, String.valueOf(0)));
}

indexingSystem.consume(eventList);

// long t = System.currentTimeMillis();
// indexingSystem.flushEventsToMemoryIndex(6000);
// System.out.println("took: " + (System.currentTimeMillis() - t));

SearchThread thread = new SearchThread(indexingSystem);
thread.start();
thread.join();

indexingSystem.shutdown();

-----------------------------
//Search part

List<ZoieIndexReader> readerList = indexingSystem.getIndexReaders();
MultiReader reader = new MultiReader(
        readerList.toArray(new IndexReader[readerList.size()]), false);
IndexSearcher searcher = new IndexSearcher(reader);
// the query names its field explicitly, so one parser is enough
QueryParser parser = new QueryParser(Version.LUCENE_20, "id",
        indexingSystem.getAnalyzer());
TopDocs docs = searcher.search(parser.parse("id:3"), 10);
System.out.println("Totally found: " + docs.totalHits);
searcher.close();
indexingSystem.returnIndexReaders(readerList);

John Wang

Aug 27, 2011, 7:27:06 AM
to zo...@googlegroups.com
Try SimpleReaderCache instead of NoopReaderCache.

By default, Zoie uses DefaultReaderCache, which accepts some indexing latency in exchange for lower search latency, controlled by a freshness parameter.

-John
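
To make the trade-off described above concrete, here is a minimal toy model in plain Java. It is not Zoie's DefaultReaderCache: the class FreshnessCacheSketch, its fields, and the injected clock are hypothetical names invented for illustration, and the model is single-threaded. It only shows why a cached snapshot of readers makes searches cheap while hiding documents added since the last refresh.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongSupplier;

/** Toy model of a freshness-bounded reader cache; NOT Zoie's implementation. */
class FreshnessCacheSketch {
    private final List<String> index = new ArrayList<>(); // stands in for the live index
    private List<String> snapshot = new ArrayList<>();    // stands in for the cached readers
    private final long freshnessMs;
    private final LongSupplier clock;                     // injected so the demo is deterministic
    private long lastRefresh;

    FreshnessCacheSketch(long freshnessMs, LongSupplier clock) {
        this.freshnessMs = freshnessMs;
        this.clock = clock;
        this.lastRefresh = clock.getAsLong();
    }

    void add(String doc) {
        index.add(doc);
    }

    /** Returns the cached snapshot, rebuilding it only once it is at least freshnessMs old. */
    List<String> getReaders() {
        long now = clock.getAsLong();
        if (now - lastRefresh >= freshnessMs) {
            snapshot = new ArrayList<>(index);
            lastRefresh = now;
        }
        return snapshot;
    }

    public static void main(String[] args) {
        long[] now = {0};
        FreshnessCacheSketch cache = new FreshnessCacheSketch(100, () -> now[0]);
        cache.add("doc-5");
        System.out.println(cache.getReaders().size()); // 0: the snapshot is still considered fresh
        now[0] = 100;
        System.out.println(cache.getReaders().size()); // 1: the snapshot was rebuilt
    }
}
```

In this model a freshness of 0 rebuilds the snapshot on every call, which is the "always current, but pay on every search" end of the trade-off.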


--
You received this message because you are subscribed to the Google Groups "zoie" group.
To post to this group, send email to zo...@googlegroups.com.
To unsubscribe from this group, send email to zoie+uns...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/zoie?hl=en.


Renat Gilfanov

Aug 27, 2011, 10:19:02 AM
to zoie
Hi,

Thanks for the reply, but there is no such class as SimpleReaderCache in
the Zoie core. Is it part of some other subproject or a later version of
Zoie core?

The available subclasses are: DefaultReaderCache, NoopReaderCache,
SmartReaderCache.

Or did you mean SmartReaderCache instead? I tried it; alas, the result is
the same.

François Cassistat

Aug 27, 2011, 2:35:38 PM
to zo...@googlegroups.com
Interesting post!

I am working on a prototype of a system that uses Zoie for search, and I call flushEventsToMemoryIndex() because I have the same immediate-search requirement. flushEventsToMemoryIndex() takes 800-900 ms in my case for a small number of updated documents...

Please let us know if either of you finds a solution.


Frank

John Wang

Aug 28, 2011, 10:39:02 AM
to zo...@googlegroups.com
Hi guys:

    My bad, I thought we had a SimpleReaderCache. Anyhoo, I just committed an implementation:

    https://github.com/javasoze/zoie/commit/eb6fc1c9a8088d07f9c1f2ee0110e...

    You can just set it via:

    config.setReadercachefactory(SimpleReaderCache.FACTORY);

    and pass the config to your Zoie instance.

    Alternatively, you can also set freshness to a very small number to ensure realtime.

    Give it a try and let me know.

Thanks

-John

Renat Gilfanov

Aug 28, 2011, 3:19:43 PM
to zoie
Hi,

Unfortunately it still doesn't work :( I wonder what can be wrong.
Maybe I should create a small test case and send it to you?

Btw, one test failed during the build:

testTrimming(proj.zoie.test.HourglassTest) Time elapsed: 250.514 sec <<< FAILURE!
java.lang.AssertionError:
at org.junit.Assert.fail(Assert.java:91)
at org.junit.Assert.assertTrue(Assert.java:43)
at org.junit.Assert.assertTrue(Assert.java:54)
at proj.zoie.test.HourglassTest.testTrimming(HourglassTest.java:147)

On Aug 28, 18:39, John Wang <john.w...@gmail.com> wrote:
> Hi guys:
>
>     My bad, I thought we had a SimpleReaderCache. Anyhoot, I just committed
> an implementation:
>
> https://github.com/javasoze/zoie/commit/eb6fc1c9a8088d07f9c1f2ee0110e...

Renat Gilfanov

Aug 29, 2011, 4:33:57 PM
to zoie
Additional details:

1. When I add an intentional delay before the search:
Thread.sleep(200);
it finds documents with SimpleReaderCache enabled (but there are no
search results when I use a value < 200 ms in my example). Changing the
freshness doesn't help at all.

2. The search fails to find anything because
indexingSystem.getIndexReaders() returns an empty list.
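
A fixed Thread.sleep(200) only works when indexing happens to finish within that window. One generic alternative for a test harness is to poll with a deadline. The sketch below is plain Java and uses no Zoie API; AwaitSketch and awaitTrue are hypothetical names invented here. In a test like the one above, the condition would presumably wrap the search call (for example, a check that the hit count is greater than zero), with any checked exceptions handled inside the lambda.

```java
import java.util.function.BooleanSupplier;

/** Generic poll-until-true helper; a test utility sketch, unrelated to Zoie's API. */
class AwaitSketch {
    /**
     * Checks the condition every pollMs milliseconds until it holds or
     * timeoutMs has elapsed. Returns true if the condition became true.
     */
    static boolean awaitTrue(BooleanSupplier condition, long timeoutMs, long pollMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // The condition becomes true on its third evaluation.
        boolean ok = awaitTrue(() -> ++calls[0] >= 3, 1000, 1);
        System.out.println(ok + " after " + calls[0] + " checks");
    }
}
```

Polling bounds the wait without guessing a delay, but it still only observes the race rather than removing it.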

John Wang

Aug 30, 2011, 4:11:30 PM
to zo...@googlegroups.com
Hi Renat:

     By default, SimpleReaderCache is not hooked up in the test. I just ran the test with both SimpleReaderCache and the default, and both pass. Can you provide more details on your configuration?

     When using SimpleReaderCache, the freshness setting is no longer needed. The Thread.sleep(200) should not be needed either.

     So that we can better assist you with your problem, can you provide a reproducible test case?

Thanks

-John

Renat Gilfanov

Aug 31, 2011, 4:51:19 PM
to zoie
Hi John,

if you don't mind, I'll paste my code right here. If you find it
inconvenient, I'll send the sources to your email.

Btw, I enabled logging and found an interesting detail: when I
invoke Thread.sleep(), the line
"proj.zoie.impl.indexing.AsyncDataConsumer: flushBuffer: post-flush: currentVersion: 0"
appears before the search, which seems to guarantee search results.
When there is no delay, the search finds nothing and this line appears
after the search.

-----------------------Data.java--------------------------
package test.zoie.domain;

public class Data {
    private long id;
    private String content;

    public void setId(long id) {
        this.id = id;
    }

    public long getId() {
        return id;
    }

    public void setContent(String content) {
        this.content = content;
    }

    public String getContent() {
        return content;
    }
}
-------------------DataIndexable.java-------------------------
package test.zoie.domain;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;
import proj.zoie.api.indexing.ZoieIndexable;

public class DataIndexable implements ZoieIndexable {
    private Data _data;

    public DataIndexable(Data data) {
        _data = data;
    }

    public long getUID() {
        return _data.getId();
    }

    public IndexingReq[] buildIndexingReqs() {
        Document doc = new Document();
        doc.add(new Field("content", _data.getContent(), Store.YES, Index.ANALYZED));
        doc.add(new Field("id", String.valueOf(_data.getId()), Store.YES,
                Index.ANALYZED_NO_NORMS));
        return new IndexingReq[]{new IndexingReq(doc)};
    }

    public boolean isDeleted() {
        return "_MARKED_FOR_DELETE".equals(_data.getContent());
    }

    public boolean isSkip() {
        return "_MARKED_FOR_SKIP".equals(_data.getContent());
    }
}
-----------DataIndexableInterpreter.java---------------------------
package test.zoie.domain;

import proj.zoie.api.indexing.ZoieIndexable;
import proj.zoie.api.indexing.ZoieIndexableInterpreter;

public class DataIndexableInterpreter implements ZoieIndexableInterpreter<Data> {
    public ZoieIndexable interpret(Data src) {
        return new DataIndexable(src);
    }

    @Override
    public ZoieIndexable convertAndInterpret(Data src) {
        return new DataIndexable(src);
    }
}
---------------------Main.java------------------------------
package test.zoie.main;

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.DefaultSimilarity;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Similarity;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.util.Version;

import proj.zoie.api.ZoieException;
import proj.zoie.api.ZoieIndexReader;
import proj.zoie.api.DataConsumer.DataEvent;
import proj.zoie.api.indexing.IndexReaderDecorator;
import proj.zoie.impl.indexing.DefaultIndexReaderDecorator;
import proj.zoie.impl.indexing.SimpleReaderCache;
import proj.zoie.impl.indexing.ZoieConfig;
import proj.zoie.impl.indexing.ZoieSystem;
import test.zoie.domain.Data;
import test.zoie.domain.DataIndexableInterpreter;

public class Main {
    public static final String INDEX_DIR = "myIdxDir";

    private static ZoieSystem indexingSystem;

    // recursively deletes a file or directory
    public static void delete(File file) {
        if (!file.exists()) {
            return;
        }
        if (file.isDirectory()) {
            for (File child : file.listFiles()) {
                delete(child);
            }
        }
        // delete the (now empty) directory or the plain file
        file.delete();
    }

    public static void indexDoc(int id) throws ZoieException {
        Data[] data = new Data[1];
        data[0] = new Data();
        data[0].setId(id);
        data[0].setContent("tst");
        ArrayList<DataEvent> eventList = new ArrayList<DataEvent>(data.length);
        for (Data datum : data) {
            eventList.add(new DataEvent<Data>(datum, String.valueOf(0)));
        }
        // non-blocking: hands the events to the indexing thread and returns
        indexingSystem.consume(eventList);
    }

    public static int findDoc(String id) throws IOException, ParseException {
        List<ZoieIndexReader> readerList = indexingSystem.getIndexReaders();
        MultiReader reader = new MultiReader(
                readerList.toArray(new IndexReader[readerList.size()]), false);

        IndexSearcher searcher = new IndexSearcher(reader);
        QueryParser parser = new QueryParser(Version.LUCENE_20, "id",
                indexingSystem.getAnalyzer());

        TopDocs docs = searcher.search(parser.parse("id:" + id), 10);
        int found = docs.totalHits;

        searcher.close();
        indexingSystem.returnIndexReaders(readerList);

        return found;
    }

    public static void main(String[] args) throws ZoieException,
            IOException, ParseException, InterruptedException {

        File idxDir = new File(INDEX_DIR);
        delete(idxDir);
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_20);
        Similarity similarity = new DefaultSimilarity();
        IndexReaderDecorator decorator = new DefaultIndexReaderDecorator();
        ZoieConfig config = new ZoieConfig();
        config.setReadercachefactory(SimpleReaderCache.FACTORY);
        indexingSystem = new ZoieSystem(idxDir,
                new DataIndexableInterpreter(), decorator, config);
        indexingSystem.start();

        indexDoc(5);

        // Thread.sleep(200);

        System.out.println("Found: " + findDoc("5"));

        indexingSystem.shutdown();
        delete(idxDir);
    }
}

John Wang

Sep 2, 2011, 12:41:59 PM
to zo...@googlegroups.com
Hi Renat:

     I ran your code. Some comments:

1) The reason you are not seeing the hit is that indexDoc -> indexingSystem.consume() is non-blocking. consume() puts the event on the indexing thread for processing and then returns, so calling search immediately afterwards races with the indexing logic. Having your code in the same thread makes this latency more prominent. That's why Thread.sleep() works. Normally, search and indexing are done on different threads.

2) config.setReadercachefactory(SimpleReaderCache.FACTORY) does make a difference, as you have observed.

Hopefully this helps.

-John
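
The race described in point 1 can be reproduced without Zoie. The following is a toy model in plain Java, with every name in it (AsyncConsumeSketch, consume, search, the gate latch) invented for illustration. A latch holds the indexing thread back so that the "search right after consume" outcome is deterministic instead of timing-dependent; in a real system the delay is simply however long the indexing thread takes.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Toy model of a non-blocking consume() racing with search; NOT Zoie code. */
class AsyncConsumeSketch {
    private final Set<Long> index = ConcurrentHashMap.newKeySet();
    private final ExecutorService indexer = Executors.newSingleThreadExecutor();

    /** Hands the document to the indexing thread and returns at once. */
    CountDownLatch consume(long id, CountDownLatch gate) {
        CountDownLatch done = new CountDownLatch(1);
        indexer.execute(() -> {
            awaitQuietly(gate); // simulated indexing latency
            index.add(id);
            done.countDown();
        });
        return done;
    }

    boolean search(long id) {
        return index.contains(id);
    }

    void shutdown() {
        indexer.shutdown();
    }

    static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Returns {hit before the indexer ran, hit after the indexer finished}. */
    static boolean[] demo() {
        AsyncConsumeSketch s = new AsyncConsumeSketch();
        CountDownLatch gate = new CountDownLatch(1);
        CountDownLatch done = s.consume(5L, gate);
        boolean before = s.search(5L); // consume() has returned, nothing is indexed yet
        gate.countDown();              // let the indexing thread proceed
        awaitQuietly(done);            // wait until the document is really indexed
        boolean after = s.search(5L);
        s.shutdown();
        return new boolean[] {before, after};
    }

    public static void main(String[] args) {
        boolean[] r = demo();
        System.out.println("before=" + r[0] + ", after=" + r[1]); // before=false, after=true
    }
}
```

Moving the search to another thread does not change this picture: whichever thread searches, it sees the document only once the indexing thread has applied it.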


Renat Gilfanov

Sep 9, 2011, 4:56:25 AM
to zoie
Hi,

The problem is that it doesn't work even if I do the search in a
separate thread right after indexing...

John Wang

Sep 9, 2011, 12:34:04 PM
to zo...@googlegroups.com
Correct, Renat. What I am pointing out is that the asynchronous nature of indexing creates a race condition between the indexing and search threads. If you want to be absolutely deterministic, call syncWithVersion() before searching, provided you know the version of the event you are sending. (The version is the number you send with the DataEvent.)

-John
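
The idea behind waiting on a version can be sketched in plain Java. This is a toy model, not Zoie's syncWithVersion(): the class VersionSyncSketch and the method awaitVersion are hypothetical, and versions are modelled as long values purely for simplicity (the code earlier in the thread passes the version to DataEvent as a String). The indexing thread records the highest version it has applied, and the searcher blocks until that value reaches the version it sent.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Toy model of "block until version N is indexed"; NOT Zoie's implementation. */
class VersionSyncSketch {
    private final Set<Long> index = ConcurrentHashMap.newKeySet();
    private final ExecutorService indexer = Executors.newSingleThreadExecutor();
    private final Object lock = new Object();
    private long indexedVersion = -1; // guarded by lock

    /** Non-blocking: queues the document together with its version. */
    void consume(long id, long version) {
        indexer.execute(() -> {
            index.add(id);
            synchronized (lock) {
                indexedVersion = Math.max(indexedVersion, version);
                lock.notifyAll(); // wake up any thread waiting in awaitVersion()
            }
        });
    }

    /** Blocks until the given version is indexed or the timeout expires. */
    boolean awaitVersion(long timeoutMs, long version) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        synchronized (lock) {
            while (indexedVersion < version) {
                long remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
                if (remainingMs <= 0) {
                    return false; // timed out
                }
                try {
                    lock.wait(remainingMs);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
            return true;
        }
    }

    boolean search(long id) {
        return index.contains(id);
    }

    void shutdown() {
        indexer.shutdown();
    }

    /** Indexes one document, waits for its version, then searches. */
    static boolean demo() {
        VersionSyncSketch s = new VersionSyncSketch();
        s.consume(5L, 0L);
        boolean synced = s.awaitVersion(5000, 0L);
        boolean hit = s.search(5L);
        s.shutdown();
        return synced && hit;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // true
    }
}
```

Unlike a fixed sleep, the wait here ends as soon as the version is applied and fails explicitly if it never is.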
