Lucene in Neo4j has some misbehaviours in terms of reliable search queries - compared to OrientDB


Curtis Mosters

Oct 21, 2014, 6:29:50 PM
to orient-...@googlegroups.com
I'm still evaluating Neo4j vs. OrientDB. Most importantly, I need Lucene as the full-text index engine. So I created the same schema with the same data (300 million lines) in both databases. I'm experienced with querying both systems, and I used the Standard Analyzer on both sides. The OrientDB test queries all return good results, both in reliability and in speed. Neo4j's speed is also OK, but its results are poor in most cases. So let's go through the issues I have with Neo4j's Lucene indexing. For each one I give the equivalent OrientDB query and the result set you should be getting from it.

In these examples there are Applns that have one or more titles. Titles are indexed with Lucene in both databases. Applns also have an ID, which is used here just to demonstrate the ordering. After each query I list some questions about it. It would be great to get some feedback or even answers to them.

Query #0: One word query with no order

This query is very simple: it tests how each database behaves with a single word and nothing else. As you can see, the titles Neo4j returns are much longer than the ones from OrientDB. OrientDB uses TF-IDF, which keeps the results short and closer to the actual search term. The first OrientDB result is the title SOLAR, which is missing from the Neo4j results entirely.

In Neo4j: START n=node:titles('title:solar') RETURN n.title,n.ID LIMIT 10
  1. SOLAR RADIATION SHIELDING PARTICULATE AND SOLAR RADIATION SHIELDING RESIN MATERIAL DISPERSED WITH ...    38321319
  2. Solar module for cooling solar cells on the underside of a solar panel has air inlet and outlet openings ...    12944121
  3. Solar construction component for solar thermal assemblies, solar thermal assembly, method for operating a solar...    324146113
  4. ...
In OrientDB: SELECT title,ID FROM Appln WHERE title LUCENE "solar" LIMIT 10
  1. SOLAR    24900187
  2. Solar unit and solar apparatus    1876343
  3. Solar module with solar concentrator    13496706
  4. ...
Questions:
  1. Why is Neo4j not using TF-IDF, and what does it use instead?
  2. Can Neo4j order results by how well they match the keyword?
  3. Is it possible to change TF-IDF to something else in OrientDB?
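
To make the ranking difference in Query #0 concrete, here is a minimal Python sketch of a simplified TF-IDF score with length normalisation, loosely modelled on Lucene's classic similarity (square-root term frequency, logarithmic IDF, and a 1/sqrt(length) norm). It is only an illustration: the tokenizer, the `score` function, and the four-title corpus are assumptions made for this example, and whether either database applies exactly this formula is not confirmed here.

```python
import math
import re

def tokenize(text):
    """Lower-case and split on non-alphanumerics (a crude stand-in for an analyzer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def score(term, title, corpus):
    """Simplified TF-IDF: sqrt(tf) * idf^2 * 1/sqrt(number of tokens)."""
    tokens = tokenize(title)
    freq = tokens.count(term)
    if freq == 0:
        return 0.0
    doc_freq = sum(1 for t in corpus if term in tokenize(t))
    idf = 1.0 + math.log(len(corpus) / (doc_freq + 1))
    length_norm = 1.0 / math.sqrt(len(tokens))
    return math.sqrt(freq) * idf ** 2 * length_norm

# Sample titles taken from the result lists in this post.
titles = [
    "Stackable flat-roof/floor frame for solar panels",
    "Method for producing contact for solar cells",
    "Solar unit and solar apparatus",
    "SOLAR",
]

ranked = sorted(titles, key=lambda t: score("solar", t, titles), reverse=True)
for t in ranked:
    print(f"{score('solar', t, titles):.3f}  {t}")
```

With these four titles the one-word title SOLAR scores highest, because the length norm penalises longer titles even when they mention the term more than once.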
Query #1: One word query with order by ID

Neo4j orders by ID before any TF-IDF ranking is applied. As seen in Query #0, Neo4j is not using TF-IDF, so it basically just orders the raw matches of the Lucene query. OrientDB, by contrast, still picks the matches with the best TF-IDF scores and then orders them by ID.

In Neo4j: START n=node:titles('title:solar') RETURN n.title,n.ID ORDER BY n.ID ASC LIMIT 10
  1. Stackable flat-roof/floor frame for solar panels    318
  2. Method for producing contact for solar cells    636
  3. Solar cell and fabrication method thereof    1217
  4. ...
In OrientDB: SELECT title,ID FROM Appln WHERE title LUCENE "solar" ORDER BY ID ASC LIMIT 10
  1. Solar unit and solar apparatus     1876343
  2. Solar module with solar concentrator    13496706
  3. SOLAR TRACKER FOR SOLAR COLLECTOR    16543688
  4. ...
Questions:
  1. What would a search in OrientDB look like that is ordered by ID and still returns the best TF-IDF matches?
  2. Is there a way in Neo4j to order the Lucene match before ordering by the ID?
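
The difference described in Query #1 comes down to the order of two steps. Below is a minimal Python sketch using titles and IDs from the results above with made-up relevance scores. The `matches` list, the scores, and both function names are assumptions for illustration, not either database's actual execution plan.

```python
# (title, ID, relevance score); the scores are made up for illustration.
matches = [
    ("Stackable flat-roof/floor frame for solar panels", 318, 0.21),
    ("Method for producing contact for solar cells", 636, 0.23),
    ("Solar unit and solar apparatus", 1876343, 0.63),
    ("Solar module with solar concentrator", 13496706, 0.63),
    ("SOLAR", 24900187, 1.00),
]

def order_then_limit(rows, limit):
    """Sort every match by ID, then cut: relevance plays no role."""
    return sorted(rows, key=lambda r: r[1])[:limit]

def rank_then_order(rows, limit):
    """Keep the `limit` best-scoring matches, then sort those by ID."""
    best = sorted(rows, key=lambda r: r[2], reverse=True)[:limit]
    return sorted(best, key=lambda r: r[1])

for title, appln_id, _ in order_then_limit(matches, 3):
    print("order-then-limit:", appln_id, title)
for title, appln_id, _ in rank_then_order(matches, 3):
    print("rank-then-order: ", appln_id, title)
```

The first function reproduces the pattern of the Neo4j output (lowest IDs first, regardless of match quality); the second reproduces the pattern of the OrientDB output (best matches only, shown in ID order).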
Query #2: One word with a star (wildcard) search

Star search had no influence on the Neo4j results. OrientDB results changed in a good way.

In Neo4j: START n=node:titles('title:solar*') RETURN n.title,n.ID ORDER BY n.ID ASC LIMIT 10
  1. Stackable flat-roof/floor frame for solar panels    318
  2. Method for producing contact for solar cells    636
  3. Solar cell and fabrication method thereof    1217
  4. ...
In OrientDB: SELECT title,ID FROM Appln WHERE title LUCENE "solar*" ORDER BY ID ASC LIMIT 10
  1. High performance solar methane generator    8354701
  2. All-plastic honeycomb solar water-heater    8355379
  3. Plate type solar energy heat collector plate core and its manufacturing method    8356173
  4. ...
Questions:
  1. Does Neo4j ignore star searches?
Query #3: Searching for 2 words divided by a space

The strange thing here is that you have to change 'title:solar panel' to the query form shown below; otherwise you just get errors. OrientDB looks good so far.

In Neo4j: START n=node:titles(title="solar panel") RETURN n.title,n.ID ORDER BY n.ID ASC LIMIT 10
  1. Returned 0 rows in 817 ms
In OrientDB: SELECT title,ID FROM Appln WHERE title LUCENE "solar panel" ORDER BY ID ASC LIMIT 10
  1. SOLAR PANEL    1584567
  2. SOLAR PANEL    1616547
  3. SOLAR PANEL    2078382
  4. SOLAR PANEL    2078383
  5. Solar panel    2178466
  6. ...
Questions:
  1. Why does Neo4j need a special query form here just to avoid throwing an error?
  2. Why does the query fail to return anything? I know that Neo4j is searching for lower-case letters here, so the match is case sensitive. But why is that? I use the default analyzer, and the Neo4j Lucene documentation says to_lower_case is true by default, so the terms should be lower-cased.
Query #4: Now searching for the same query in capital letters

The same issue as in #3. Neo4j returns only the titles written in capital letters. The OrientDB results look fine again.

In Neo4j: START n=node:titles(title="SOLAR PANEL") RETURN n.title,n.ID ORDER BY n.ID ASC LIMIT 10
  1. SOLAR PANEL    348800
  2. SOLAR PANEL    420683
  3. SOLAR PANEL    1393804
  4. SOLAR PANEL    1584567
  5. SOLAR PANEL    1616547
  6. ...
In OrientDB: SELECT title,ID FROM Appln WHERE title LUCENE "SOLAR PANEL" ORDER BY ID ASC LIMIT 10
  1. SOLAR PANEL    1584567
  2. SOLAR PANEL    1616547
  3. SOLAR PANEL    2078382
  4. SOLAR PANEL    2078383
  5. Solar panel    2178466
  6. ...
Questions:
  1. Same question as in #3: how can I search with lower-casing (to_lower_case) applied?
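
The behaviour in Queries #3 and #4 is what one would see if the equals form compared the raw stored title exactly, while the Lucene query form matched lower-cased tokens. The Python sketch below only illustrates that distinction. `exact_lookup`, `analyzed_match`, and the sample titles are hypothetical, and this is an assumption, not a confirmed description of how Neo4j handles title="...".

```python
import re

titles = [
    "SOLAR PANEL",
    "Solar panel",
    "SOLAR PANELS",
    "Solar cell and fabrication method thereof",
]

def exact_lookup(value, rows):
    """Case-sensitive comparison against the whole stored string."""
    return [t for t in rows if t == value]

def analyze(text):
    """Lower-case and tokenize, roughly what a standard analyzer does."""
    return re.findall(r"[a-z0-9]+", text.lower())

def analyzed_match(query, rows):
    """Match rows containing every analyzed query term, ignoring case."""
    terms = analyze(query)
    return [t for t in rows if all(term in analyze(t) for term in terms)]

print(exact_lookup("solar panel", titles))    # no row is stored in exactly this form
print(exact_lookup("SOLAR PANEL", titles))    # only the all-caps title
print(analyzed_match("solar panel", titles))  # both spellings of "solar panel"
```

Under this assumption, a lower-case exact lookup finds nothing, an upper-case one finds only the all-caps titles, and the analyzed match finds both spellings, which is the pattern seen in the results above.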
Query #5: Combining two words and using the star search

Here I want to combine a multi-word search with a star search. With the equals form I can't find any matches, because Neo4j treats the star as a literal character in the title. But I also can't write 'title:SOLAR PANEL*'; that is rejected as well. In OrientDB everything is fine.

In Neo4j: START n=node:titles(title="SOLAR PANEL*") RETURN n.title,n.ID ORDER BY n.ID ASC LIMIT 10
  1. Returned 0 rows in 895 ms
In OrientDB: SELECT title,ID FROM Appln WHERE title LUCENE "SOLAR PANEL*" ORDER BY ID ASC LIMIT 10
  1. SOLAR PANELS     1405717
  2. SOLAR PANEL     1584567
  3. SOLAR PANEL     1616547
  4. SOLAR PANEL     2705081
  5. Solar Panel     2766555
  6. ...
Questions:
  1. How can you combine some words with the star search in Neo4j?
Query #6: Counting query results

The last thing I really need is a fast lookup of how many results there are overall. Here Neo4j is finding a result way faster but always finds fewer matches than OrientDB. For Solar the counts are fairly close to each other, but in another test they were not that close.

In Neo4j: START n=node:titles("title:Solar") RETURN count(*)

143211 in 220 sec

In OrientDB: SELECT count(*) title FROM Appln WHERE title LUCENE "Solar" LIMIT -1

148029 in 50 sec

Questions:
  1. How can these lookup times be improved on both systems?
  2. Why do the two systems find a different number of matches? This also happens with other keywords. Maybe a different indexing engine is used?
That is everything for now. If you need any other query, just tell me and I will deliver it.
I think it's very important to compare the Lucene implementations, because with millions of nodes Lucene has too many advantages to ignore. Thanks for any small tip.

Btw: please don't suggest using Java code for the query instead. I want to use Cypher because the request has to be made in the browser, like in OrientDB. I know that everything here can easily be done with Java code. Thank you.