>I would like to find a pattern once then tot up the occurrences of the
>pattern. I can do this for inflectional and derivational errors but am
>having problems when it comes to errors in a root word. Without a rule
>I can say what the difference between the dictionary form and the error
>form is, but some rules require information about surrounding graphemes
>and here I am stumped.
I have worked on this kind of issue before. Here are some rough
overall categories of errors. If anyone knows more categories,
please follow up and post them.
1. Typographical: a missing letter, an extra letter, a transposition of
two letters, or the substitution of one letter for another (for
example, the adjacent letter on a typewriter). There is a lot of
published literature on these kinds of "string edits." Given a pair of
strings, the minimum number of edits needed to transform one into the
other is called the "edit distance." Most automatic spelling correction
programs today can handle misspelled words that are at an edit distance
of 1 from a dictionary word, and will return a list of all dictionary
words within 1 string edit.
2. Phonetic: substitution of one spelling for another phonetically
related spelling. An example of this would be to misspell the word
"rough" as "ruf." This kind of error arises in languages like English,
where spelling is only loosely phonetic. To handle it, you can create
tables of correspondences between letter groups and phonetic units,
transform both the misspelling and the dictionary words into phonetic
units, and then apply the edit distance techniques to those.
There is a small body of literature related to this sort of thing.
3. Context-sensitive: substitution of a correctly spelled dictionary
word that is wrong in context for the intended word. For example,
substitution of "passed" for "past." Groups of words that are often inappropriately
substituted for each other in this way are called "confusion sets" in
the literature. This is a more difficult task and you can use all of
the techniques of natural language processing to tackle it. Most
spelling correction programs today are not capable of handling
this kind of error.
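To make the three categories concrete, here is a rough Python sketch.
It is only an illustration under stated assumptions, not how any
particular spelling corrector works: the names edit_distance,
to_phonetic, suggest and flag_confusables are made up for this post,
the phonetic table is a toy with three entries, and the distance
counts an adjacent transposition as a single edit.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions, substitutions, or
    adjacent transpositions (each costing 1) to turn a into b."""
    prev2 = None
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # delete a[i-1]
                         cur[j - 1] + 1,      # insert b[j-1]
                         prev[j - 1] + cost)  # substitute or match
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                cur[j] = min(cur[j], prev2[j - 2] + 1)  # transpose
        prev2, prev = prev, cur
    return prev[len(b)]

# Toy letter-group -> phonetic-unit table (an assumption for this
# example; a real table would be far larger and context-dependent).
TOY_PHONETIC_TABLE = {"ough": "uf", "ph": "f", "ck": "k"}

def to_phonetic(word):
    """Rewrite a word left to right, longest letter group first."""
    groups = sorted(TOY_PHONETIC_TABLE, key=len, reverse=True)
    out, i = [], 0
    while i < len(word):
        for g in groups:
            if word.startswith(g, i):
                out.append(TOY_PHONETIC_TABLE[g])
                i += len(g)
                break
        else:  # no letter group matched here: keep the letter as is
            out.append(word[i])
            i += 1
    return "".join(out)

def suggest(word, dictionary, max_dist=1):
    """Dictionary words within max_dist edits of word, measured on
    either the spelling or the phonetic form."""
    key = to_phonetic(word)
    return [w for w in dictionary
            if edit_distance(word, w) <= max_dist
            or edit_distance(key, to_phonetic(w)) <= max_dist]

# One confusion set taken from the example above.
CONFUSION_SETS = [{"passed", "past"}]

def flag_confusables(tokens):
    """List (position, token, alternatives) for tokens that belong to
    a confusion set. This only flags them; choosing the right word
    needs context and is the hard part."""
    return [(i, t, sorted(s - {t}))
            for i, t in enumerate(tokens)
            for s in CONFUSION_SETS if t in s]
```

With the dictionary ["the", "rough", "roof"], suggest("teh", ...)
finds "the" by a single transposition, while suggest("ruf", ...) finds
"rough" only through the phonetic form, since the two spellings are 3
edits apart. flag_confusables("he past the test".split()) reports the
token "past" with "passed" as the alternative to check.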
Hope this helps.
Thomas Raffill