Fast spell checking or entity linking with bbkh


Amirouche Boubekki

Jun 28, 2020, 8:37:51 AM
to conceptnet-users
This is not strictly related to conceptnet, but it might come in handy if you want to use conceptnet as part of an NLP / NLU pipeline where you need to spell check a given text and link it to conceptnet.

So the idea is that you have a text where there might be spelling mistakes. The easiest option would be to use an existing spell checker like hunspell / aspell / ispell. The problem with that approach is that any time you add items to the knowledge base you need to update the spell checker dictionary. My idea is to rely on a single source-of-truth database that I can drive from Python or Scheme.

It seems to me the most used fuzzy matching library in Python is fuzzywuzzy, so I tried using it as a spell checker; here are a few results with timings. As far as I understand, fuzzywuzzy does not compile, preprocess, or index the "choices" before guessing a match, which leads to a very long run time:

$ python data/conceptnet-assertions-5.7.0.english-words-to-concept.tsv 10 resaerch

('öres', 90)
('erc', 90)
('e', 90)
('rch', 90)
('c', 90)
('c̄', 90)
('sae', 90)
('sé', 90)
('öre', 90)
('re', 90)


In the above query 'e' and 'a' are swapped, and fuzzywuzzy fails to find anything even remotely similar. Mind that the last line is the run time in seconds.

$ python data/conceptnet-assertions-5.7.0.english-words-to-concept.tsv 10 reserch

('research', 93)
('c̄', 90)
('öre', 90)
('rc', 90)
('ré', 90)
('ser', 90)
('rese', 90)
('re', 90)
('ch', 90)
('öres', 90)


$ python data/conceptnet-assertions-5.7.0.english-words-to-concept.tsv 10 research

('research', 100)
('researchy', 94)
('ré', 90)
('sear', 90)
('rê', 90)
('öres', 90)
('ar', 90)
('nonresearcher', 90)
('c@', 90)
('unresearched', 90)


As you can see the run time is very long, and it will only get longer as the KB grows with more words.
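To see why the brute-force approach scales badly, here is a minimal stdlib-only stand-in for what fuzzywuzzy's process.extract does (scoring with difflib instead of fuzzywuzzy's Levenshtein ratio, so the numbers differ slightly): every query re-scores the entire vocabulary, so the cost is linear in the size of the knowledge base.

```python
import difflib

def extract(query, choices, limit=10):
    """Brute-force fuzzy match: score every choice against the query.

    Like fuzzywuzzy's process.extract, this scans the whole vocabulary
    on every call, so run time grows linearly with the knowledge base.
    """
    scored = ((choice, int(100 * difflib.SequenceMatcher(None, query, choice).ratio()))
              for choice in choices)
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]

# tiny illustrative vocabulary; the real one has millions of entries
words = ["research", "researcher", "reserve", "öre", "banana"]
print(extract("reserch", words, limit=3))
```

With millions of conceptnet labels, this inner loop is exactly what dominates the timings above.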

To help with that task I created a hash, in the spirit of simhash, that preserves similarity in the prefix of the hash so that it is easy to query in an Ordered Key-Value Store (OKVS). Here are the same queries using that algorithm:
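The core trick can be sketched with plain simhash over character trigrams. Note this is a simplified illustration of the idea, not the actual bbkh algorithm: similar words share most of their trigrams, so their hashes end up close in Hamming distance, and the bit string can be used as a sortable key.

```python
import hashlib

def simhash(word, nbits=64):
    """Similarity-preserving hash over character trigrams.

    A simplified simhash sketch, NOT the actual bbkh algorithm:
    each trigram votes +1/-1 per bit position, and the sign of the
    tally gives the output bit. Words sharing trigrams share bits.
    """
    padded = "$" + word + "$"  # mark word boundaries
    grams = [padded[i:i + 3] for i in range(len(padded) - 2)]
    counts = [0] * nbits
    for gram in grams:
        # md5 (unlike the built-in hash) is stable across runs
        digest = int.from_bytes(hashlib.md5(gram.encode()).digest()[:8], "big")
        for bit in range(nbits):
            counts[bit] += 1 if (digest >> bit) & 1 else -1
    return "".join("1" if c > 0 else "0" for c in counts)

def hamming(a, b):
    """Number of differing bit positions between two hashes."""
    return sum(x != y for x, y in zip(a, b))
```

Because the hash is just a bit string, it can be stored as (part of) a key in an OKVS, and candidates can then be re-ranked by exact edit distance.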

$ python query 10 resaerch
* most similar according to bbk fuzzbuzz
** research      -2

$ python query 10 reserch
* most similar according to bbk fuzzbuzz
** research      -1
** resch      -2
** resercher      -2

$ python query 10 research
* most similar according to bbk fuzzbuzz
** research      0
** researches      -2
** researchee      -2
** researcher      -2

As you can see it is much, much faster and the results seem more relevant. The algorithm can be found at: