Spell correction possible?


tom-m...@att.net

Apr 5, 2013, 9:33:12 PM
to concurrent-t...@googlegroups.com
Nice project. Extremely well put together. I'm wondering if there is a way to do high-speed spell correction with the underlying data model? Like doing a Levenshtein distance calculation to find close matches to a query word in the dictionary. It seems the graph nature of a trie could make the calculations more efficient?

Niall

Apr 7, 2013, 5:32:57 PM
to concurrent-t...@googlegroups.com
Thanks!

Spelling correction is an interesting question.

I guess there are a few levels of spelling correction/suggestion, some of which are supported, some not yet...

RadixTree.getKeysStartingWith(...) can be used for precise real-time autocomplete. Say the user is typing keywords into a search box: every time a new character is entered, the tree can be queried to suggest keywords which are prefixed by the characters entered so far. If the user makes a typo such that the input no longer matches any keywords, the list of suggestions would disappear.
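
For example, here's a minimal sketch (the sample keys are arbitrary):

import com.googlecode.concurrenttrees.radix.ConcurrentRadixTree;
import com.googlecode.concurrenttrees.radix.RadixTree;
import com.googlecode.concurrenttrees.radix.node.concrete.DefaultCharArrayNodeFactory;

public class PrefixAutocomplete {
    public static void main(String[] args) {
        RadixTree<Integer> tree = new ConcurrentRadixTree<Integer>(new DefaultCharArrayNodeFactory());
        tree.put("test", 1);
        tree.put("team", 2);
        tree.put("toast", 3);

        // User has typed "te" so far: suggest keys sharing that prefix
        for (CharSequence suggestion : tree.getKeysStartingWith("te")) {
            System.out.println(suggestion); // e.g. "team", "test"
        }

        // A typo like "tx" matches no keys, so no suggestions are returned
        System.out.println(tree.getKeysStartingWith("tx").iterator().hasNext()); // false
    }
}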

RadixTree.getClosestKeys(...) can be used for real-time autocomplete which is tolerant of typos in the most recently entered characters. As above, every time a new character is entered, the tree can be queried to suggest keywords which are prefixed by the characters entered so far. But if the user makes a typo such that the input no longer matches any keywords, the list of suggestions will continue to include keywords matching the characters before the typo.
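
Same scenario, using getClosestKeys(...) on the tree from the sketch above:

// Tree contains "test", "team", "toast".
// "tezz" matches nothing exactly, but its longest valid prefix is "te",
// so keys under "te" are still suggested
for (CharSequence suggestion : tree.getClosestKeys("tezz")) {
    System.out.println(suggestion); // e.g. "team", "test"
}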

I'm working on adding a concurrent permuterm tree at the moment...

PermutermTree.getMatchingKeys(...) could be used as a basis for real-time autocomplete which is tolerant of typos entered anywhere in the string. The permuterm tree itself supports straight wildcard-based retrieval. But for typo correction, it would require some higher-level algorithm to shuffle the wildcards around in the query string, and query the tree multiple times for every character entered by the user.
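
To illustrate the idea, here's a hypothetical sketch of a permuterm index built on top of the existing radix tree (the actual PermutermTree API may end up looking different). Every rotation of key + terminator maps back to the original key, and a query containing a single '*' wildcard is rotated so the wildcard falls at the end, turning it into an ordinary prefix search:

import com.googlecode.concurrenttrees.radix.ConcurrentRadixTree;
import com.googlecode.concurrenttrees.radix.RadixTree;
import com.googlecode.concurrenttrees.radix.node.concrete.DefaultCharArrayNodeFactory;
import java.util.LinkedHashSet;
import java.util.Set;

public class PermutermSketch {

    // Maps each rotation of (key + "$") to the original key
    private final RadixTree<String> rotations =
        new ConcurrentRadixTree<String>(new DefaultCharArrayNodeFactory());

    public void addKey(String key) {
        String terminated = key + "$";
        for (int i = 0; i < terminated.length(); i++) {
            String rotation = terminated.substring(i) + terminated.substring(0, i);
            rotations.put(rotation, key);
        }
    }

    // Supports exactly one '*' wildcard, e.g. "he*o" matches "hello", "heyo"...
    public Set<String> getMatchingKeys(String pattern) {
        int star = pattern.indexOf('*');
        // Rotate "he*o" into "o$he*", drop the '*', and prefix-search
        String rotated = pattern.substring(star + 1) + "$" + pattern.substring(0, star);
        Set<String> matches = new LinkedHashSet<String>();
        for (String key : rotations.getValuesForKeysStartingWith(rotated)) {
            matches.add(key);
        }
        return matches;
    }
}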

Let's say the user made a typo in the first character of the input and continued typing, but the rest of the input was valid. When the higher-level algorithm placed a wildcard over the first character, suggestions would be returned. The higher-level algorithm could effectively combine suggestions from multiple placements of the wildcard. This would be sufficient for spelling suggestion which is tolerant of single typos entered anywhere in the string.
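
The wildcard-shuffling part might look something like this sketch, where WildcardLookup is a stand-in for whatever single-wildcard lookup the permuterm tree ends up exposing:

import java.util.LinkedHashSet;
import java.util.Set;

public class SingleTypoSuggester {

    // Stand-in for the permuterm tree's wildcard lookup
    public interface WildcardLookup {
        Set<String> getMatchingKeys(String patternWithWildcard);
    }

    public static Set<String> suggest(String input, WildcardLookup index) {
        Set<String> suggestions = new LinkedHashSet<String>();
        for (int i = 0; i < input.length(); i++) {
            // Place the wildcard over position i: "xello" -> "*ello", "x*llo", ...
            String pattern = input.substring(0, i) + "*" + input.substring(i + 1);
            suggestions.addAll(index.getMatchingKeys(pattern));
        }
        return suggestions;
    }
}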

To account for multiple typos in the string, the algorithm would need additional logic. Maybe multiple wildcards could be placed at a time. Maybe each wildcard would span multiple characters in the input. Maybe the algorithm would vary its approach based on the number of suggestions returned.

Levenshtein distance could be used to rank the suggestions from an approach like the above, essentially assigning a score to each suggestion. There would be significant savings in running Levenshtein distance only on the subset of keywords returned from the (accelerated) wildcard queries, as opposed to running it on all keywords every time the user enters a character.
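
For example, ranking a small candidate set with the standard dynamic-programming Levenshtein distance (again just a sketch):

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Set;

public class SuggestionRanker {

    // Sort candidates (e.g. from the wildcard queries above) by edit distance to the input
    public static List<String> rank(final String input, Set<String> candidates) {
        List<String> ranked = new ArrayList<String>(candidates);
        Collections.sort(ranked, new Comparator<String>() {
            public int compare(String a, String b) {
                return levenshtein(input, a) - levenshtein(input, b);
            }
        });
        return ranked;
    }

    // Standard two-row dynamic-programming Levenshtein distance
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }
}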

Of course, in terms of the relevance of its suggestions, an approach like the above will not compete with context-aware/probabilistic spell checkers which are based on a model of the language or domain. But it would be faster/less CPU-intensive for sure.

I suspect that ultimately, to get the best performance, a custom traversal algorithm would be needed. There has been some research on tree-accelerated spell checking: http://www.javaworld.com/javaworld/jw-02-2001/jw-0216-ternary.html Of course there are n-gram-based algorithms too. I suppose there are pros and cons to every approach.

Maybe at some point there could be a SpellingSolver or something. BTW I'd be happy to accept contributions in this area, or to help out if you want to start this as a project of your own.

Thanks!
Niall