I created a SQLite word database, and the basic spell suggester is
working (it looks for a word with the same letters, ignoring the vowels),
but I want to make it smarter while keeping reasonable performance.
Any leads?
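For readers following along, the vowel-stripping lookup described above can be sketched roughly as follows. This is only an illustration: the proc names (skeleton, buildIndex, suggest) are made up here, and an in-memory dict stands in for the SQLite table, whose schema is not shown in the thread.

```tcl
# Reduce a word to its "skeleton": lowercase, with the vowels removed.
proc skeleton {word} {
    return [regsub -all {[aeiou]} [string tolower $word] {}]
}

# Map each skeleton to the dictionary words that share it.
# (Assumption: a dict replaces the SQLite table used by the original poster.)
proc buildIndex {words} {
    set index [dict create]
    foreach w $words {
        dict lappend index [skeleton $w] $w
    }
    return $index
}

# Return all known words whose skeleton matches that of the input.
proc suggest {index word} {
    set key [skeleton $word]
    if {[dict exists $index $key]} {
        return [dict get $index $key]
    }
    return {}
}

set index [buildIndex {spell spill table tcl tickle}]
puts [suggest $index "spoll"]   ;# prints: spell spill
```

The limitation is visible right away: a wrong or missing consonant changes the skeleton, so such misspellings find nothing, which is where the edit-distance approach discussed below comes in.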
You're proposing two leads which seem orthogonal:
(a) port Norvig's code to Tcl
(b) write something from scratch, with a completely different
mechanism
Now if it's (a), it is easy, just ask ;-)
-Alex
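As background for lead (a): the heart of Norvig's corrector is generating every string one edit away from the input (deletions, transpositions, replacements, insertions) and keeping those that are known words. A minimal Tcl sketch of that one step, assuming a lowercase a-z alphabet, might look like this. It is not the code from the wiki page, and the proc name edits1 is simply borrowed from Norvig's Python.

```tcl
# Return the set of strings at edit distance 1 from $word
# (like Norvig's version, this also yields the word itself via replacement).
proc edits1 {word} {
    set alphabet {a b c d e f g h i j k l m n o p q r s t u v w x y z}
    set n [string length $word]
    set seen [dict create]   ;# dict keys act as a set, removing duplicates
    for {set i 0} {$i <= $n} {incr i} {
        set head [string range $word 0 [expr {$i - 1}]]
        set tail [string range $word $i end]
        if {$i < $n} {
            # deletion: drop the first character of the tail
            dict set seen $head[string range $tail 1 end] 1
        }
        if {$i < $n - 1} {
            # transposition: swap the first two characters of the tail
            dict set seen $head[string index $tail 1][string index $tail 0][string range $tail 2 end] 1
        }
        foreach c $alphabet {
            if {$i < $n} {
                # replacement: overwrite the first character of the tail
                dict set seen $head$c[string range $tail 1 end] 1
            }
            # insertion: add a character between head and tail
            dict set seen $head$c$tail 1
        }
    }
    return [dict keys $seen]
}

puts [llength [edits1 tcl]]
```

Distance-2 candidates are then obtained by applying edits1 to every distance-1 result, which is where performance starts to matter.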
Of course it would be great if some noble soul would port it, but I
would not dare to ask for it. I was also hoping for:
(c) code that does this already exists somewhere, but for some
reason hides from Google's prying eye.
> Of course it would be great if some noble soul would port it, but I
> would not dare to ask for it. I was also hoping for:
> (c) code that does this already exists somewhere, but for some
> reason hides from Google's prying eye.
TclPython creates a slave interpreter that understands Python:
http://wiki.tcl.tk/5630
http://jfontain.free.fr/tclpython.htm
uwe
Sort of - I built a minimal wrapper on libpspell years ago, but I don't
remember if I ever released it. If the API:

    pspell::check_word /word/
        returns true if /word/ appears to be correct
    pspell::suggest_words /word/
        returns a list of suggested corrections for /word/

is what you're looking for then I can put it up somewhere. I also
vaguely remember building a text-widget-based megawidget that did the
usual red-underlining of errors with a right-click selection popup.
It would have been based on Itk though, so it would probably need quite
a lot of TLC to get rid of the years of code rot.
Cyan
OK, I've done this for you:
-Alex
Looking at that code (not studying it in detail) it seems to me that:
- putting the main loop in a proc would speed things up
- using the split-up version of the strings would speed things up as
  well
I do not know by how much, nor whether it would have a noticeable effect,
but that is my first impression.
Regards,
Arjen
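The first suggestion can be checked with the built-in "time" command. The sketch below is not taken from the wiki code: countVowels and the word list are made-up stand-ins for the real main loop. It runs the same loop once at global scope and once inside a proc. Proc bodies are byte-compiled with indexed local variables, which is usually why they run faster, but the actual numbers depend on the Tcl version and machine, so none are quoted here.

```tcl
# Hypothetical stand-in for the spell checker's main loop:
# count the vowels in a list of words.
proc countVowels {words} {
    set n 0
    foreach w $words {
        foreach ch [split $w ""] {
            if {$ch in {a e i o u}} { incr n }
        }
    }
    return $n
}

set words [lrepeat 2000 "orthogonal"]

# The loop at global scope: n, w and ch are global variables.
set globalTime [time {
    set n 0
    foreach w $words {
        foreach ch [split $w ""] {
            if {$ch in {a e i o u}} { incr n }
        }
    }
} 10]

# The same loop wrapped in a proc: n, w and ch are proc-local variables.
set procTime [time { countVowels $words } 10]

puts "global scope: $globalTime"
puts "inside proc:  $procTime"
```

Measuring before and after each change, as suggested, is the only reliable way to tell whether an optimization is worth keeping.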
Yes. As I said on that wiki page: feel free to optimize, measure, and
edit the page :)
(and of course report here, for those - like me - who aren't hooked to
wiki updates ...)
-Alex
Hardly anyone is; we just use the wiki's rather nice history
functions. :-)
Donal.
Amazing, thanks a lot for the beautiful brand new wiki page. Playing
with it, one can see that the real challenge in the spell checker is
how to build the probabilities correctly.
It would be nice for everyone to have this as an option. I need it for a
starpack, so it is only useful to me if it can work in such a
configuration.
Yes. Note that, in line with the huge amount of literature on
statistical language models, the obvious next step if you're not
satisfied with this unigram (single-word probabilities) model is to
go bigram. To do this, you'll need to:
- populate the "model" array with both single words (as the
  current code does) and consecutive pairs from the input corpus, joined
  by a nonalpha separator like ":". For example, "there:is" will likely
  get a hefty count.
- refine the search strategy in the following way:
    - allow entering full sentences instead of single words;
      collapse nonalpha to a space instead of "".
    - do the usual distance-2 search for the single word $w (not
      pruning to distance 1 even if there are solutions)
    - also do it for $prev:$w and $w:$next (where $prev and $next
      are the words immediately surrounding $w in the input sentence)
    - give preference to candidates with bigram matches (still
      with subpreference for distance 1 over distance 2).
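To make the steps above concrete, here is a rough sketch of the bigram-augmented "model" array and of the hard preference for bigram matches. Assumptions are flagged loudly: the proc names (words, train, score, best) are made up for this example, the candidate list is taken as given rather than produced by the distance-2 search, the distance-1 over distance-2 subpreference is left out for brevity, and bigrams are counted across sentence boundaries because punctuation is collapsed to a space.

```tcl
# Split text into lowercase words; every nonalpha run acts as a separator.
proc words {text} {
    return [regexp -all -inline {[a-z]+} [string tolower $text]]
}

# Fill the "model" array with unigram counts (key: word) and
# bigram counts (key: prev:word).
proc train {text} {
    global model
    set prev ""
    foreach w [words $text] {
        incr model($w)
        if {$prev ne ""} {
            incr model(${prev}:$w)
        }
        set prev $w
    }
}

# Return {bigramCount unigramCount} for a candidate between $prev and $next.
proc score {prev cand next} {
    global model
    set bi 0
    foreach key [list ${prev}:$cand ${cand}:$next] {
        if {[info exists model($key)]} { incr bi $model($key) }
    }
    set uni [expr {[info exists model($cand)] ? $model($cand) : 0}]
    return [list $bi $uni]
}

# Pick the candidate with the highest bigram count; ties fall back
# to the unigram count.
proc best {prev candidates next} {
    set bestCand ""
    set bestScore {-1 -1}
    foreach c $candidates {
        set s [score $prev $c $next]
        lassign $s bi uni
        lassign $bestScore bbi buni
        if {$bi > $bbi || ($bi == $bbi && $uni > $buni)} {
            set bestCand $c
            set bestScore $s
        }
    }
    return $bestCand
}

train "the the the the the cat. and then he left. and then she left"
# Correcting "thew" in "and thew he": unigram counts alone favour "the",
# but the bigrams and:then and then:he favour "then".
puts [best and {the they then} he]   ;# prints: then
```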
A secondary refinement for the last rule could be a soft decision:
compute a composite probability based on both the bigram and
unigram values, so that a high-prob unigram can still take over a rare
bigram.
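One common way to build such a composite is linear interpolation of the two estimates. The post does not specify a formula, so this is just one possible choice, sketched under stated assumptions: the counts, the total, the proc name composite and the weight lambda are all hypothetical.

```tcl
# Hypothetical counts in the same "model" layout:
# word -> unigram count, prev:word -> bigram count.
array set model {
    the       9000
    thee      2
    and       500
    and:thee  1
}
set total 10000   ;# assumed total number of words in the corpus

# Composite score: lambda * P(cand | prev) + (1 - lambda) * P(cand).
proc composite {prev cand {lambda 0.5}} {
    global model total
    set uni [expr {[info exists model($cand)] ? $model($cand) : 0}]
    set pUni [expr {double($uni) / $total}]
    set pBi 0.0
    set key ${prev}:$cand
    if {[info exists model($key)] && [info exists model($prev)]} {
        set pBi [expr {double($model($key)) / $model($prev)}]
    }
    return [expr {$lambda * $pBi + (1.0 - $lambda) * $pUni}]
}

# "thee" has a (rare) bigram match after "and", while "the" has none;
# a hard bigram preference would pick "thee", the soft score does not.
foreach cand {the thee} {
    puts "$cand: [composite and $cand]"
}
```

With these made-up counts the frequent unigram "the" outscores the rare bigram "and:thee", which is the behaviour described above. The weight would have to be tuned on real data.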
Disclaimer: this is off the top of my head -- I'm no expert in text
language models, though I do have experience in similar methods for
speech recognition.
HTH,
-Alex
Many years ago I integrated GNU ispell into an application. As I recall,
the code is in C but not all that complicated, and could probably be
ported without too much trouble. The nice thing about going that route
is that it has a built-in dictionary that is GNU licensed. The dictionary
is in root-word-plus-suffix form, IIRC.
tomk
And the bad thing about it is that it is GNU licensed and not BSD licensed.
--
+------------------------------------------------------------------------+
| Gerald W. Lester, President, KNG Consulting LLC |
| Email: Gerald...@kng-consulting.net |
+------------------------------------------------------------------------+