Perform Spell check and correction in Neo4j

258 views
Skip to first unread message

Anas

unread,
Apr 10, 2013, 5:04:31 AM4/10/13
to ne...@googlegroups.com

Hello,

i am having a Neo4J database where a huge amount of data related to airlines and airports is stored. And this data can be accessed through a python/django based webapp from which a user can check for airports and get their localization on a map.

The problem is that i need to handle the case where a query is misspelled.

 i saw that there were many solutions based on Lucene, Solr, PyEnchant,... but these latter require generally that a dictionary (containing keywords in each line) is built before making spell checking. And for a big number of entries, this would be costly in time, and this would get worst if the data is dynamic and database must be updated and the dictionary would as well.

Currently, data is inserted in the database using a BatchInserter in java and entries are indexed at the same time.

So my question is : how can i perform spell check & correction while traversing the neo4j db for a given query ?

Thank you for your help

Peter Neubauer

unread,
Apr 10, 2013, 6:27:38 AM4/10/13
to Neo4j User
Anas,
not sure I am solving your problem, but look at http://typology.de/
for a project where students took Google N-Grams to predict and
correct spelling and next words using Neo4j. Is that something
relevant here?

/peter

Cheers,

/peter neubauer

G: neubauer.peter
S: peter.neubauer
P: +46 704 106975
L: http://www.linkedin.com/in/neubauer
T: @peterneubauer

The authoritative book on graph databases - http://graphdatabases.com
Neo4j questions? Please use SO - http://stackoverflow.com/search?q=neo4j
> --
> You received this message because you are subscribed to the Google Groups
> "Neo4j" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to neo4j+un...@googlegroups.com.
> For more options, visit https://groups.google.com/groups/opt_out.
>
>

Anas Zakki

unread,
Apr 10, 2013, 8:50:27 AM4/10/13
to ne...@googlegroups.com
Hello Peter,

Thank you for the link.

I will try to read the sources carefully and give you my feedback asap.


You received this message because you are subscribed to a topic in the Google Groups "Neo4j" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/neo4j/6ZpBt4DtZHM/unsubscribe?hl=en.
To unsubscribe from this group and all its topics, send an email to neo4j+un...@googlegroups.com.

Michael Hunger

unread,
Apr 10, 2013, 2:58:54 PM4/10/13
to ne...@googlegroups.com
Ususally as you said people use lucene for this, to do a wildcard search on the starting letters and present those as choices. (b/c fuzzy search is so slow). Depending on the data set size (e.g. if it is just airports it should be tiny) it might be even more sensible to build up a b-tree in memory for the auto-suggester.

Is it just the airport names and abbreviations?

Anas Zakki

unread,
Apr 11, 2013, 2:51:56 AM4/11/13
to ne...@googlegroups.com
Hello,

@ Peter: i saw the code and i found it interesting for auto-completion , but i don't think it would be suitable to do this in case the user misspelt the word he's looking for in the search engine ...

@ Michael: actually, the graph contains several types of nodes : airlines, airports, countries and continents. Each node is related to a root node ( all airports are related to the root node airport), has its properties and relationships: for node airport for instance, we have iata code , its name, and its ID ( for example : ORY, Paris Orly  airport,  id=4189 according to official documents) and is related to the node France, and airlines that go from/to Orly are related to it also.
So a user may search for Orly for example and get the airport, the airlines , and the country and can navigate between them.
Assume now that the user has misspelt his query ( Rorly, olrly or whatever), we should be able to correct this to perform the right query to the db.
Either we should have a  dictionary (dict.txt)  with all the words that are present in the db and check for the most similar word to the query( not realistic ) , or to have a tree as you said, and use it to perform the spell correction ( should it be constructed while inserting data in db and how ?) . But for this one, i don't know actually how to proceed...

Hope what i am aiming to do is more clear :)

Thanks

Michael Hunger

unread,
Apr 11, 2013, 3:25:40 AM4/11/13
to ne...@googlegroups.com
Hmm misspelling is always difficult, most people use an external provider like solr or elasticsearch for things like that.

1. auto-suggestion to reduce the # of misspellings
2. then search with the input data and 
3. only on no results do the fuzzy search which is much more expensive (full index scan with distance computation)

Within neo I would limit it to 1 and 2


Anas Zakki

unread,
Apr 11, 2013, 10:08:59 AM4/11/13
to ne...@googlegroups.com
You mean that if the user is typing "Pa", we'll display in the search field / a box the words starting by this prefix "Paris
that suppose that we can make instant search in the graph db according to what is being entered. How could this be made  ?
 
Concerning Lucene fuzzy search, how can i exploit neo4j  indexes to check for the most similar word to the query in the most effective way ? because otherwise, i would have to list all the properties of all the database nodes and compare each one to the query ! 

Michael Hunger

unread,
Apr 11, 2013, 10:37:27 AM4/11/13
to ne...@googlegroups.com
For prefix search you just use an index query on top of a fulltext index (case-insensitive) like "field:pa*"

in cypher:
start n=node:serach("field:pa*") return n limit 10

in java.

db.index().forNodes("search").query("field:pa*") 

// you can pass in a more advanced lucene query object in there instead of a query that controls sorting and top-n limits too

but it is much slower as lucene has to do a full index scan internally and compute the distance for each word in the index to your search term and then order and return the top-n results.

Michael

Anas Zakki

unread,
Apr 12, 2013, 3:55:30 AM4/12/13
to ne...@googlegroups.com
 the auto suggestion is working fine . Thank you

Concerning the fuzzy search, i found this on neo4j website : http://docs.neo4j.org/chunked/1.9.M05/rest-api-indexes.html#rest-api-find-node-by-query
that permit to use the "~" to perfom it.
For instance, i tried this link on the REST api : http://localhost:7474/db/data/index/node/keywords?query=name:parris~0.6 and it worked perfectly !
But for composed words like "New-York" or "New York" it is not working at all ! i am even getting an exception for New York !

How can i handle the second case ?
and is it possible to get this feature in java  ?

Michael Hunger

unread,
Apr 12, 2013, 4:04:58 AM4/12/13
to ne...@googlegroups.com
Put them in double quotes

Sent from mobile device

Anas Zakki

unread,
Apr 12, 2013, 4:07:08 AM4/12/13
to ne...@googlegroups.com
is it possible to have this in java ?

Michael Hunger

unread,
Apr 12, 2013, 7:04:57 AM4/12/13
to ne...@googlegroups.com, ne...@googlegroups.com
Yes, I already gave the examples

You can pass lucene query objects to the index.query method

Sent from mobile device

Anas Zakki

unread,
Apr 15, 2013, 10:42:55 AM4/15/13
to ne...@googlegroups.com
Hello,

I tried what you suggested me but i got something strange :
when i am making a cypher query through the neo4j dashboard i get more nodes than when i make it through java.
here is the cypher codes, both return the same result :

START root=node(*)
RETURN count(root);

START n=node:keywords("*:*")
return n.name 

and in java : IndexHits<Node> hits = graphDb.index().forNodes("keywords").query("*:*");
                    System.out.println("size of hits : "+hits.size()

"keywords" is an index that indexes all the nodes in the db.

here are some snapshots of what i am getting.

thanks
Screenshot-Neo4j Monitoring and Management Tool - Mozilla Firefox.png
Screenshot2.png

Michael Hunger

unread,
Apr 15, 2013, 11:00:37 AM4/15/13
to ne...@googlegroups.com, ne...@googlegroups.com
Indexhits.size() is just an estimate

Iterate and count 

you can use IteratorUtil.count(indexHits)

Both queries are nothing you should run in your db

What's the use-case for this?

Indexhits count depend on what you have indexed how often with diff. keys and values



Sent from mobile device
<Screenshot-Neo4j Monitoring and Management Tool - Mozilla Firefox.png>
<Screenshot2.png>

Anas Zakki

unread,
Apr 15, 2013, 11:09:24 AM4/15/13
to ne...@googlegroups.com
actually my purpose wasn't initially to check whether i could have the same results using java or Cypher.

the problem is that i could have access to some nodes through the neo4j dashboard whereas i cannot through java ... that's why i tried to make sure that i have them concretely in the db.

the following screenshots show this : 
Screenshot.png
Screenshot-1.png

Anas Zakki

unread,
Apr 16, 2013, 2:26:19 AM4/16/13
to ne...@googlegroups.com
Hello Michael

Maybe i can send you my graph db folder and then my problem would be more clear i think?

Thanks

Michael Hunger

unread,
Apr 16, 2013, 7:56:01 AM4/16/13
to ne...@googlegroups.com
I think you are not using the same database,

one is the one of the server, /path/to/server/data/graph.db

the other is /home/anas/db

Anas Zakki

unread,
Apr 16, 2013, 8:13:59 AM4/16/13
to ne...@googlegroups.com
you are absolutely right 

i was using two different databases ... 

Thank you very much for your help 
Reply all
Reply to author
Forward
0 new messages