Fast spell checking or entity linking with bbkh

Amirouche Boubekki

Jun 30, 2020, 7:08:38 AM
to opencog

This is not strictly related to OpenCog, but it might come in useful if you want to use it as part of an NLP / NLU pipeline where you need to spell check a given text and link it to a knowledge base.

So the idea is that you have a text in which there might be spelling mistakes. The easiest option would be to use an existing spell checker like hunspell / aspell / ispell. The problem with that approach is that any time you add items to the knowledge base, you need to update the spell checker dictionary. My idea is to rely on a single source-of-truth database that I can drive from Python or Scheme.

It seems the most used fuzzy string matching library in Python is fuzzywuzzy. I tried it, and here are a few results with timings. As far as I understand, fuzzywuzzy does not compile, preprocess, or index the "choices" before guessing a match, which leads to very long run times:
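The brute-force approach can be sketched in a few lines. This is a minimal stand-in, not the fw.py script used below: difflib (stdlib) replaces fuzzywuzzy's Levenshtein-based ratio, and the word list is invented, but the shape of the computation is the same: every query rescans the entire choice list, so the cost grows linearly with the knowledge base.

```python
# Brute-force fuzzy matching, in the spirit of fuzzywuzzy's process.extract.
# difflib.SequenceMatcher stands in for fuzzywuzzy's Levenshtein ratio.
from difflib import SequenceMatcher

def ratio(a, b):
    # 0-100 similarity score, comparable to fuzzywuzzy's ratio()
    return round(100 * SequenceMatcher(None, a, b).ratio())

def extract(query, choices, limit=10):
    # Scores EVERY choice on every query: O(len(choices)) per lookup.
    scored = ((choice, ratio(query, choice)) for choice in choices)
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]

words = ["research", "researcher", "ore", "sear", "rch"]
print(extract("reserch", words, limit=3))
```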


$ python fw.py data/conceptnet-assertions-5.7.0.english-words-to-concept.tsv 10 resaerch

('öres', 90)
('erc', 90)
('e', 90)
('rch', 90)
('c', 90)
('c̄', 90)
('sae', 90)
('sé', 90)
('öre', 90)
('re', 90)

26.097001791000366

In the above query the e and a are swapped, and fuzzywuzzy fails to find anything even remotely similar. Note that the last line of each run is the run time in seconds.

$ python fw.py data/conceptnet-assertions-5.7.0.english-words-to-concept.tsv 10 reserch

('research', 93)
('c̄', 90)
('öre', 90)
('rc', 90)
('ré', 90)
('ser', 90)
('rese', 90)
('re', 90)
('ch', 90)
('öres', 90)

26.26053023338318

$ python fw.py data/conceptnet-assertions-5.7.0.english-words-to-concept.tsv 10 research

('research', 100)
('researchy', 94)
('ré', 90)
('sear', 90)
('rê', 90)
('öres', 90)
('ar', 90)
('nonresearcher', 90)
('c@', 90)
('unresearched', 90)

26.261364459991455

As you can see, the run time is very long, and it will only get longer as the KB grows with more words.

To help with that task, I created a hash in the spirit of simhash that preserves similarity in the prefix of the hash, so that it is easy to query in an Ordered Key-Value Store (OKVS). Here are the same queries using that algorithm:

$ python fuzz.py query 10 resaerch
* most similar according to bbk fuzzbuzz
** research      -2
0.011413335800170898


$ python fuzz.py query 10 reserch
* most similar according to bbk fuzzbuzz
** research      -1
** resch      -2
** resercher      -2
0.011811494827270508


$ python fuzz.py query 10 research
* most similar according to bbk fuzzbuzz
** research      0
** researches      -2
** researchee      -2
** researcher      -2
0.012357711791992188

I tried similar queries over Wikidata labels; it gives good results in under 250 ms.

As you can see, it is much, much faster, and the results seem more relevant. The algorithm can be found at: https://stackoverflow.com/a/58791875/140837
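For readers who don't want to click through: the following is not the exact bbkh algorithm from that answer, but a simhash-style sketch of the underlying idea (with invented parameters): hash each character bigram to a bit pattern, sum the patterns per bit, and threshold, so that similar strings produce fingerprints that are close in Hamming distance and can share key prefixes in an OKVS.

```python
# Simhash-style sketch of the idea behind a similarity-preserving hash:
# similar strings get similar bit patterns, so the high bits of the
# fingerprint can serve as a key prefix in an ordered key-value store.
# This illustrates the principle; it is NOT the exact bbkh algorithm.
import hashlib

BITS = 64

def features(word):
    # Character bigrams, with sentinels marking the word boundaries.
    padded = "$" + word + "$"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

def fingerprint(word):
    counts = [0] * BITS
    for feature in features(word):
        digest = int.from_bytes(
            hashlib.blake2b(feature.encode(), digest_size=8).digest(), "big")
        for bit in range(BITS):
            counts[bit] += 1 if digest & (1 << bit) else -1
    value = 0
    for bit in range(BITS):
        if counts[bit] > 0:
            value |= 1 << bit
    return value

def hamming(a, b):
    return bin(a ^ b).count("1")

# Similar strings end up with closer fingerprints than unrelated ones:
print(hamming(fingerprint("research"), fingerprint("resaerch")))
print(hamming(fingerprint("research"), fingerprint("banana")))
```

Stored big-endian, similar words tend to share leading key bytes, which is what makes prefix/range queries in an ordered store effective.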

I would be glad if someone could try that algorithm in their system.

Similarly, I would be glad for pointers on how to evaluate it (precision / recall?) against a gold standard.
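One common setup for this kind of evaluation is a gold standard of (misspelling, intended word) pairs, scored with precision@1 and recall@k. The sketch below uses invented names and toy data purely for illustration:

```python
# Hedged sketch: scoring a spelling corrector against a gold standard
# of (misspelling, intended word) pairs. Data and names are invented.

def evaluate(corrector, gold, k=10):
    # precision@1: fraction of queries whose top suggestion is correct.
    # recall@k: fraction of queries whose correct word is in the top k.
    top1 = topk = 0
    for misspelling, expected in gold:
        suggestions = corrector(misspelling)[:k]
        if suggestions and suggestions[0] == expected:
            top1 += 1
        if expected in suggestions:
            topk += 1
    n = len(gold)
    return top1 / n, topk / n

gold = [("reserch", "research"), ("teh", "the")]
# Toy corrector standing in for a real one:
corrector = lambda word: {"reserch": ["research", "ore"],
                          "teh": ["ten", "the"]}[word]
print(evaluate(corrector, gold))
```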

This is one step toward the goal of re-implementing link-grammar using only a SAT solver and an OKVS.

Linas Vepstas

Jun 30, 2020, 2:04:09 PM
to opencog, link-grammar
Hi Amirouche,

There are other, far more sophisticated ways of doing spell checking. The best way is to do context-dependent checking... and link-grammar provides extremely precise context. For example, "I gave him teh hammer" -- the LG spelling guesser offers up "the", "then", "ten" as possible fixes. However, only one of these choices leads to a grammatically-correct sentence. Thus, the other spelling guesses can be discarded because they lead to grammatical nonsense.

Because of this, spelling checkers and POS taggers are NOT used in the language pipeline; based on practical experience, they actually make things worse. That is, they make suggestions and provide tags that are misleading or wrong, and lower the quality of the results. It turns out that English grammar provides tight constraints on what is possible, and those constraints are much tighter (and have higher accuracy/recall/precision) than single-word taggers.

I assume that large statistical datasets using multi-word phrases are also possible. (I presume that the excruciatingly annoying Grammarly uses such a system).  Such statistical systems might currently do as well as or better than grammar, since, it turns out, writing a complete grammar is impossible; there is always yet one more, extremely rare exception to every rule.  Roughly speaking, grammar rules have a Zipfian distribution.

=========

"A sat solver and an okvs" -- we've played with SAT. Basically, the SAT solvers are slower. They used to be sometimes-faster, but Amir's work fixed that.  I don't know what OKVS is.

Please realize that the grammar rules have NOTHING AT ALL to do with English, or any other natural language -- it's a generic structural property that applies to graphs. This includes biology, biochemistry, and 3D conformance structures. For example, zebrafish provide a good way of studying immunoglobulin, and one can compare mutual information to Ising models, to mechanical shape conformance. Although it looks just like link-grammar, there are NO "words" in here! You could maybe use a SAT solver by trying different levels of a "potential" for interaction.

I've started work on a "generic" grammar that can handle both human language and also biology, here: https://github.com/opencog/generate  It's at version 0.1.0... it's in C++, but Anton is toying with the idea of porting it to Java. (He wants to run it on cell phones.)

I had wanted to/hoped I could recycle some of the LG code base for this more-generic parsing/generation system, but I found that too hard, and a clean-room implementation was easier. That said, the algorithms are difficult, and it's at version 0.1.0 because there are obvious avenues for improvements/enhancements.

-- Linas

p.s. LG is written in C, and so it can only use those spelling-guessers that have a C API. The current choices are aspell and hunspell -- once upon a time, "hunspell" was "better" but seems to no longer be maintained. Again -- aspell is used to provide suggestions, and parsing determines which of these suggestions is correct (if any).

The dictionaries themselves also have a handful of corrections built in. For example:

rather_then.#rather_than: [rather_than]bad-spelling;
there.#their: [[their.p]0.65]bad-spelling;

where the 0.65 is (minus) the log-likelihood.
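As a quick sanity check on what such a cost means (assuming natural logarithms, which "minus the log-likelihood" suggests), the corresponding probability is easy to recover:

```python
# Convert a (minus) log-likelihood cost back into a probability,
# assuming natural logarithms.
import math

cost = 0.65
probability = math.exp(-cost)
print(round(probability, 2))
```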

Accents are handled similarly:
% Bad German accent
vas.#was-v-d: [[was.v-d]0.05]bad-spelling;
vas.#what: [[what]0.05]colloquial;
das.#this-p: [[this.p]0.05]colloquial;
das.#this-d: [[this.d]0.05]colloquial;

Also:
% Initial unstressed syllable.
'Cause.#because 'Fore.#before 'Fraid.#afraid 'Gainst.#against

% Poetic contractions; Shakespearean contractions
'r.#our ’r.#our: [[our]0.5]colloquial;
e'en.#even: [even.e]colloquial;
e'er.#ever: [ever]colloquial;
ha'.#have ha’.#have: [have.v]colloquial;
heav’n.#heaven: [heaven.s]colloquial;
o'.#of: [of]colloquial;
o'.#on: [on]colloquial;
o'er.#over: [over]colloquial;

There are also some interesting games to be played with phonetics... but these remain mostly unexplored.

There are also regex rules for the automatic recognition of unknown words, for Latin terms, and for micro-biology/biochemistry terminology. Regex works, because these tend to have a very regular structure.
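The pattern-classes idea can be sketched as follows. These regexes are invented for illustration and are not copied from the LG dictionaries; the point is only that regular structure (Latin suffixes, gene-name conventions, numbers) makes class assignment mechanical:

```python
# Hedged sketch of regex-based recognition of unknown-word classes,
# in the spirit of Link Grammar's regex rules. Patterns are invented,
# not taken from the actual LG dictionaries.
import re

CLASSES = [
    ("LATIN-TERM", re.compile(r"^[a-z]+us$")),         # e.g. "streptococcus"
    ("GENE-NAME", re.compile(r"^[A-Z]{2,5}[0-9]+$")),  # e.g. "BRCA1"
    ("NUMBER", re.compile(r"^[0-9]+(\.[0-9]+)?$")),
]

def classify(token):
    # Return the first class whose pattern matches, else None (unknown).
    for name, pattern in CLASSES:
        if pattern.match(token):
            return name
    return None

print(classify("BRCA1"))
print(classify("3.14"))
```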

-- Linas



--
Verbogeny is one of the pleasurettes of a creatific thinkerizer.
        --Peter da Silva

Amirouche Boubekki

Jul 1, 2020, 12:42:33 PM
to link-grammar, opencog
On Tue, Jun 30, 2020 at 8:04 PM, Linas Vepstas <linasv...@gmail.com> wrote:
>
> Hi Amirouche,
>
> There are other, far more sophisticated ways of doing spell checking. The best way is to do context-dependent checking... and link-grammar provides extremely precise context. For example, "I gave him teh hammer" -- the LG spelling guesser offers up "the", "then", "ten" as possible fixes. However, only one of these choices leads to a grammatically-correct sentence. Thus, the other spelling guesses can be discarded because they lead to grammatical nonsense.

Exactly! That is my goal.

> Because of this, spelling checkers and POS taggers are NOT used in the language pipeline, and, based on practical experience, they actually make things worse. That is, they make suggestions and provide tags that are misleading or wrong, and lower the quality of the results -- It turns out that English grammar provides tight constraints on what is possible, and those constraints are much tighter (and have higher accuracy/recall/precision) than single-word taggers.
>
> Such statistical systems might currently do as well as or better than grammar,

They will always fail on new grammar rules, and re-training the
algorithm on them is painful, I guess. Maybe new grammar rules are
not a use case, since nowadays there is a grammar-police robot in every
browser...

> since, it turns out, writing a complete grammar is impossible; there is always yet one more, extremely rare exception to every rule. Roughly speaking, grammar rules have a Zipfian distribution.

Probably yes, with the approach taken by link-grammar, where only
a couple of experts can edit the grammar. My goal is to also make it
easy for the user to improve the grammar, hence the single
source-of-truth database that will include the dictionary of words,
grammar relations between words, and test cases for each relation. I am
still nurturing the idea of allowing a user, as part of the
conversational interface, to extend the grammar based on a subset of
English grammar that is not ambiguous and that would be the seed of the
system. I still need to create a proof of concept of this.

> "A sat solver and an okvs" -- we've played with SAT. Basically, the SAT solvers are slower. They used to be sometimes-faster, but Amir's work fixed that. I don't know what OKVS is.

How much slower? 2 or 3 times slower, or more like 10 or 100 times
slower? My primary use for parsing sentences is to be able to have some
kind of limited conversation with a human. As a _narrow_ first step, I
do not plan to parse essays by Kant or Goethe for the time being.

> p.s. LG is written in C, and so it can only use those spellling-guessers that have a C api. The current choices are aspell and hunspell, -- once upon a time, "hunspell" was "better" but seems to no longer be maintained. Again -- aspell is used to provide suggestions, and parsing determines which of these suggestions is correct (if any).

Thanks for the enlightening conversation. I was imagining such a thing
but did not have time to look into aspell's intricacies.

I think a clean implementation of the ideas behind Link Grammar can be
useful for OpenCog, at least to help people approach the Link Grammar
C/C++ codebase. You can compare my project to MINIX vs. Linux: MINIX is
pedagogical, Linux is industrial. It seems to me that learning grammar
in an unsupervised way is not ready yet?

(A little bit unrelated to this topic: I think we already discussed
this. My plan is to do most of the still-fuzzy stuff in Scheme and
re-use existing off-the-shelf libraries like minisat where it makes
sense. Performance is very important, but being able to comprehend the
whole system is much more important to me. Spell checking, or candidate
selection, is a good example where one can fine-tune the accuracy and
speed of the algorithm; in my case, I prefer a simple algorithm that
can possibly scale and that is easy to use, instead of complex
machinery that handles all the cases but is difficult to work with. I
will learn patience :)

Thanks!

Amirouche Boubekki

Jul 1, 2020, 12:44:54 PM
to link-grammar, opencog
Even if minisat is slow at the moment, maybe some people will take an
interest in the subject and optimize it for that use case.

Linas Vepstas

Jul 1, 2020, 2:10:53 PM
to opencog, link-grammar
On Wed, Jul 1, 2020 at 11:42 AM Amirouche Boubekki <amirouche...@gmail.com> wrote:
On Tue, Jun 30, 2020 at 8:04 PM, Linas Vepstas <linasv...@gmail.com> wrote:
>
> Hi Amirouche,
>
> Such statistical systems might currently do as well as or better than grammar,

> They will always fail on new grammar rules, and re-training the
> algorithm over those is painful I guess. Maybe new grammar rules is
> not a use-case since nowadays there is grammar-police-robot in every
> browser...

Well, I was trying to be expansive. Some people want to build a grammar-police-robot, and there are various ways of doing that.

For opencog, a goal is to understand speech, and if you look at transcripts of zoom videos, you quickly appreciate that no one speaks in a "grammatical" fashion ... Written and spoken English are almost two different languages.


> since, it turns out, writing a complete grammar is impossible; there is always yet one more, extremely rare exception to every rule.  Roughly speaking, grammar rules have a Zipfian distribution.

> I mean probably yes with the approach taken by link-grammar where only
> a couple of experts can edit the grammar. My goal is to make it also
> easy for the user to improve the grammar,

I believe that this is a fundamentally impossible task. The proof would be: if it were easy, some linguist would have done it by now. I mean, linguists have been working on grammar since before there were computers, and what have we got?

> hence the single source of
> truth database

There is no single source of truth for language. Every human being speaks and understands a language differently. You are not aware of it, because there is enough overlap that you can communicate with others. But if you take samples of Irish-American or German-American from the 1920s, or even Black-American or Hispanic-American from modern times, you will discover that its grammar is quite drastically different from New York Times English. And there are more gradations: the rules that work for Irish-American fail on British-Irish. The geographical separation, and the separation of 2-5 decades in time, means that these two dialects have diverged. They are quite different. There is no single source of truth.
 
> that will include the dictionary of words, grammar
> relations between words, and test cases for each relation.

There is not even one unique parse for any given sentence. Poetry is filled with ambiguous sentences. But if you want a textbook example: "I saw the man with the telescope" has two parses that are both equally plausible. This is surprisingly common. Mild alterations of wording can cause dramatic changes in parses and meaning. Simple substitutions of "synonymous" words can do the same.

> based on a subset
> of English grammar that is not ambiguous

There is no such subset. It does not exist.
 
> "A sat solver and an okvs" -- we've played with SAT. Basically, the SAT solvers are slower. They used to be sometimes-faster, but Amir's work fixed that. 

> How much slower ? 2 or 3 times slower ?

Now? Maybe 1.1x to 5x slower. It depends on the sentence. It depends on the dictionary (say, the Russian vs. the English dict). The algorithms are different; the performance profile is different. If you are starting from scratch, then the SAT solver is "good enough". In LG, there is no incentive to pursue it; the SAT solver is not as flexible, and it has trouble with parse ranking.

I mean, there is nothing wrong with minisat as a SAT solver. It's fine. What's wrong is the idea that SAT is suitable for parsing. It's not.

Basically, if you assign each grammar rule a probability (or a probability and a confidence), then you would like to rank parses according to probability. The SAT algos don't know how to avoid exploring low-probability regions. It's a classic combinatorial-explosion problem: you have to perform hill-climbing, or simulated annealing, or whatever, as you parse, and trim the search space as much as possible. SAT algos don't do that -- they explore the entire search space and return a yes-no answer. You don't want a yes/no answer, you want a likelihood. SAT solvers choke on any problem that has a combinatoric explosion.

There is an interesting middle ground, obtained by stealing the key SAT idea. The key idea behind SAT is: prune away all trees, all DAGs, and then exhaustively explore the remaining multiply-connected, combinatorially-explosive core. SAT is "fast" because "most" (!??) problems, after pruning, have tiny cores. So, likewise, in the "ideal parser", one would prune away all branches, and then explore the combinatorially-explosive core with your favorite combinatorial explorer: hill-climbing, simulated annealing, bayesian net, neural net, genetic algo, or whatever. I've been trying to get Ben and others interested in this idea for umpteen years, but no luck yet. I'm even writing grant proposals. But no one understands what I say 😢 I've been told that it's either "trivially obvious" or it's "gibberish"... I can't find the middle ground...
 

> Thanks for the enlightening conversation. I was imagining such a thing
> but did not have time to look into aspell intricates.

There is nothing at all intricate about aspell. You give it a misspelled word, and it offers up alternative suggestions. It's based on ispell, which is maybe 40 years old now, or more? Pre-internet software.


> It seems to me learning grammar in
> an unsupervised way, is not ready?

It's not ready. I encountered two very different problems. 

In some of the learned grammars, a large number of alternatives are produced, and the existing link-grammar parser is too slow. There are 5 or 10-word sentences that take a day or two to parse, because the lexis contains 10K alternatives per word. If you have the patience to wait that long, the resulting parse looks pretty good... This is the combinatorial explosion I was talking about. Of course, I don't have to generate those kinds of dictionaries -- there are others which parse very quickly, but with lower accuracy.

The other, more important problem is the lack of a reference standard. The solution to that is obvious: treat parsing as one half of a transducer pair, with a generator as the other half. Think of, for example, a microphone and a loudspeaker. Ideally, you push a sine wave through a loudspeaker and get a sine wave back on the microphone. You cannot measure the quality of the microphone, or the loudspeaker, without this.

The solution is to generate random grammars, consisting of N words, M grammatical rules, and a Zipfian or other probability distribution for the use of words/rules. Then generate a corpus of K "random" sentences. Now apply the learning tool: can you learn all N words and all M rules correctly? How does the accuracy depend on the size K of the corpus? How fast can one learn, as a function of N, M, and the word distribution?
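The loop described above can be sketched in a few lines. Everything here is a deliberate simplification invented for illustration (rules modeled as word bigrams, a trivial "learner" that reads bigrams back out of the corpus); the real generator would produce proper link-grammar rules:

```python
# Hedged sketch of the evaluation loop: build a random grammar of N words
# and M rules with Zipfian rule frequencies, generate K sentences, then
# score a learner on how many rules it recovers. Rules are modeled as
# word bigrams, a deliberate simplification for illustration.
import random

def random_grammar(n_words, m_rules, seed=0):
    rng = random.Random(seed)
    words = [f"w{i}" for i in range(n_words)]
    rules = set()
    while len(rules) < m_rules:
        rules.add((rng.choice(words), rng.choice(words)))
    return words, sorted(rules)

def zipfian_weights(m, exponent=1.0):
    # Rank r gets weight 1/r**exponent.
    return [1.0 / (rank ** exponent) for rank in range(1, m + 1)]

def generate_corpus(rules, k_sentences, length=5, seed=0):
    rng = random.Random(seed)
    weights = zipfian_weights(len(rules))
    corpus = []
    for _ in range(k_sentences):
        picked = rng.choices(rules, weights=weights, k=length)
        # Chain the chosen bigram rules into a "sentence".
        corpus.append([w for rule in picked for w in rule])
    return corpus

words, rules = random_grammar(n_words=50, m_rules=20)
corpus = generate_corpus(rules, k_sentences=100)

# A trivial "learner": read the rule bigrams back out of the corpus,
# then measure what fraction of the true rules it recovered.
learned = {tuple(s[i:i + 2]) for s in corpus for i in range(0, len(s), 2)}
recall = len(learned & set(rules)) / len(rules)
print(recall)
```

The interesting experiments are then how recall varies with the corpus size K and the rule distribution exponent.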

I've started to build that generator, but I am not done yet.

Trying to use human-annotated corpora gives garbage. It's a beginner's mistake.
 
--linas
