Fast spell checking or entity linking with bbkh


Amirouche Boubekki

Jun 28, 2020, 8:37:51 AM
to conceptnet-users
This is not strictly related to conceptnet, but it might be useful if you want to use conceptnet as part of an NLP / NLU pipeline where you need to spell check a given text and link it to conceptnet.

So the idea is that you have a text where there might be spelling mistakes. The easiest option would be to use an existing spell checker like hunspell / aspell / ispell. The problem with that approach is that any time you add items to the knowledge base you need to update the spell checker's dictionary. My idea is to rely on a single source-of-truth database that I can drive from Python or Scheme.

It seems to me the most used tool for this kind of fuzzy matching in Python is fuzzywuzzy. I tried to use it and here are a few results with timings. As far as I understand, fuzzywuzzy does not compile, preprocess, or index the "choices" before guessing a match, which leads to very long run times:
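For reference, fw.py is roughly along these lines. This is my reconstruction rather than the exact script, and the assumption that the label sits in the first tab-separated column of the TSV is mine:

import sys
import time

from fuzzywuzzy import process


def main():
    path, limit, query = sys.argv[1], int(sys.argv[2]), sys.argv[3]

    # Collect the candidate labels; assumes one label per line,
    # tab separated, with the label in the first column.
    choices = set()
    with open(path) as f:
        for line in f:
            choices.add(line.split("\t")[0].strip())

    start = time.time()
    # fuzzywuzzy scores every single choice against the query on each
    # call; nothing is indexed or precomputed beforehand, hence the
    # ~26 second timings below.
    for match, score in process.extract(query, choices, limit=limit):
        print((match, score))
    print(time.time() - start)


if __name__ == "__main__":
    main()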


$ python fw.py data/conceptnet-assertions-5.7.0.english-words-to-concept.tsv 10 resaerch

('öres', 90)
('erc', 90)
('e', 90)
('rch', 90)
('c', 90)
('c̄', 90)
('sae', 90)
('sé', 90)
('öre', 90)
('re', 90)

26.097001791000366

In the above query "e" and "a" are swapped, and fuzzywuzzy fails to find anything even remotely similar. Note that the last line is the run time in seconds.

$ python fw.py data/conceptnet-assertions-5.7.0.english-words-to-concept.tsv 10 reserch

('research', 93)
('c̄', 90)
('öre', 90)
('rc', 90)
('ré', 90)
('ser', 90)
('rese', 90)
('re', 90)
('ch', 90)
('öres', 90)

26.26053023338318

$ python fw.py data/conceptnet-assertions-5.7.0.english-words-to-concept.tsv 10 research

('research', 100)
('researchy', 94)
('ré', 90)
('sear', 90)
('rê', 90)
('öres', 90)
('ar', 90)
('nonresearcher', 90)
('c@', 90)
('unresearched', 90)

26.261364459991455

As you can see, the run time is very long, and it will only get longer as the knowledge base grows with more words.

To help with that task I created a hash, in the spirit of simhash, that preserves similarity in the prefix of the hash, so that it is easy to query in an Ordered Key-Value Store (OKVS); there is a rough sketch of the query pattern after the timings below. Here are the same queries using that algorithm:

$ python fuzz.py query 10 resaerch
* most similar according to bbk fuzzbuzz
** research      -2
0.011413335800170898


$ python fuzz.py query 10 reserch
* most similar according to bbk fuzzbuzz
** research      -1
** resch      -2
** resercher      -2
0.011811494827270508


$ python fuzz.py query 10 research
* most similar according to bbk fuzzbuzz
** research      0
** researches      -2
** researchee      -2
** researcher      -2
0.012357711791992188


As you can see it is much, much faster, and the results seem more relevant. The algorithm can be found at: https://stackoverflow.com/a/58791875/140837
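To make the idea more concrete, here is a minimal, self-contained sketch of the query pattern. It is not bbkh itself: it uses a plain simhash over character trigrams, keeps the (hash, word) pairs in a sorted list standing in for the OKVS key space, only looks at the keys around where the query's hash would land, and re-ranks those candidates by edit distance (I am assuming the negative numbers in the fuzz.py output above are negated edit distances). Plain simhash does not guarantee that similar strings end up with nearby keys; making that true is exactly what bbkh is about, so refer to the Stack Overflow answer for the real construction. The names FuzzyIndex, HASH_BITS and the helper functions are mine.

import hashlib
from bisect import bisect_left

HASH_BITS = 64


def trigrams(string):
    padded = "$" + string + "$"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def simhash(string):
    # Classic simhash: every trigram votes on every bit, and the sign of
    # the vote total decides the final bit.
    counts = [0] * HASH_BITS
    for gram in trigrams(string):
        digest = int.from_bytes(
            hashlib.blake2b(gram.encode("utf-8"), digest_size=8).digest(), "big")
        for bit in range(HASH_BITS):
            counts[bit] += 1 if digest & (1 << bit) else -1
    value = 0
    for bit in range(HASH_BITS):
        if counts[bit] > 0:
            value |= 1 << bit
    return value


def levenshtein(a, b):
    # Plain dynamic-programming edit distance, used for the final ranking.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


class FuzzyIndex:
    """Keep (hash, word) pairs sorted, the way an OKVS keeps its keys."""

    def __init__(self, words):
        self.keys = sorted((simhash(word), word) for word in words)

    def query(self, needle, limit=10, window=50):
        # Jump to where the needle's hash would sit in the key space and
        # only consider the neighbouring keys, instead of scanning every
        # word like fuzzywuzzy does.
        position = bisect_left(self.keys, (simhash(needle), needle))
        lo, hi = max(0, position - window), position + window
        candidates = {word for _, word in self.keys[lo:hi]}
        ranked = sorted(candidates, key=lambda word: levenshtein(needle, word))
        return [(word, -levenshtein(needle, word)) for word in ranked[:limit]]


if __name__ == "__main__":
    index = FuzzyIndex(["research", "researcher", "researches", "resch", "öres"])
    print(index.query("reserch", limit=3))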