Spelling Error analysis

Simon

Mar 19, 2003, 6:22:47 PM

I am researching the analysis of spelling errors with a view to
offering advice on remedial action that can be taken. I have decided
to attempt to build a shell that analyses the data, offering a rule for
each pattern it finds, or showing the pattern itself if no rule is
available. I want to be able to group 'like errors', then find a rule
to match the most prevalent errors, or just offer the error pattern if
a rule is not available (i.e. I have not coded it).

I would like to find a pattern once and then tot up the occurrences of
the pattern. I can do this for inflection and derivation errors, but I
am having problems when it comes to errors in a root word. Without a
rule I can say what the difference is between the dictionary form and
the error form, but some rules require information about the
surrounding graphemes, and here I am stumped.
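
A rough sketch of one way to record the surrounding graphemes along with
each difference and then tally the patterns, assuming Python and its
standard difflib module. The function name error_patterns, the "^" and
"$" word-boundary markers, and the sample word pairs are illustrative
assumptions, not part of any existing shell.

```python
from collections import Counter
from difflib import SequenceMatcher


def error_patterns(correct, misspelt, context=1):
    """Yield (left, expected, written, right) for each differing span.

    left/right are the graphemes of the dictionary form that surround
    the difference; "^" and "$" mark the start and end of the word.
    """
    matcher = SequenceMatcher(a=correct, b=misspelt, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        left = correct[max(0, i1 - context):i1] or "^"
        right = correct[i2:i2 + context] or "$"
        yield (left, correct[i1:i2], misspelt[j1:j2], right)


# Hypothetical (dictionary form, error form) pairs.
pairs = [
    ("separate", "seperate"),
    ("definite", "definate"),
    ("desperate", "desparate"),
    ("until", "untill"),
]

# Find each pattern once, then tot up its occurrences.
tally = Counter(
    pattern
    for correct, misspelt in pairs
    for pattern in error_patterns(correct, misspelt)
)

for (left, expected, written, right), count in tally.most_common():
    print(f"{left}[{expected or '-'} -> {written or '-'}]{right}: {count}")
```

A rule could then be keyed on the whole tuple rather than on the changed
letters alone. Note that difflib may report a transposition as a separate
insertion and deletion, so transposed letters would probably need their
own handling.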

[ comp.ai is moderated. To submit, just post and be patient, or if ]
[ that fails mail your article to <com...@moderators.isc.org>, and ]
[ ask your news administrator to fix the problems with your system. ]

Thomas Raffill

Mar 19, 2003, 9:52:46 PM

>I would like to find a pattern once and then tot up the occurrences of
>the pattern. I can do this for inflection and derivation errors, but I
>am having problems when it comes to errors in a root word. Without a
>rule I can say what the difference is between the dictionary form and
>the error form, but some rules require information about the
>surrounding graphemes, and here I am stumped.

I have worked on this kind of issue before. Here are some rough
overall categories of errors. If anyone knows more categories,
please follow up and post them.

1. Typographical: missing letter, extra letter, transposition of
letters, or substitution of a letter for another (for example, the
adjacent letter on a typewriter). There is a lot of published
literature on these kinds of "string edits." Given a pair of strings,
the minimum number of edits needed to transform from one to the other
is called the "edit distance." Most of the automatic spelling correction
programs today can handle misspelled words that are at an edit distance
of 1 from a dictionary word and will return a list of all dictionary words
within 1 string edit.

2. Phonetic: substitution of a particular spelling for another
phonetically related spelling. An example of this would be to
misspell the word "rough" as "ruf." This kind of error arises in
languages like English, where spelling is only loosely phonetic.
To handle this kind of error, you can create tables of correspondences
between letter sets and phonetic units. You can transform everything
into phonetic units, then use the edit distance techniques on them.
There is a small body of literature related to this sort of thing.

3. Context-sensitive: substitution of a correctly spelled but
inappropriate dictionary word for the intended word. For example, substitution
of "passed" for "past." Groups of words that are often inappropriately
substituted for each other in this way are called "confusion sets" in
the literature. This is a more difficult task and you can use all of
the techniques of natural language processing to tackle it. Most
spelling correction programs today are not capable of handling
this kind of error.
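
As a rough illustration of category 1, here is a minimal Python sketch of
an edit distance that counts a missing letter, an extra letter, a
substitution, or a transposition of adjacent letters as one edit each.
It uses the "optimal string alignment" variant, a restricted form of
Damerau-Levenshtein distance, which is only one of several possible
definitions. The function names and the tiny word list are assumptions
made for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Optimal string alignment distance between a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # missing letter
                d[i][j - 1] + 1,         # extra letter
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Transposition of two adjacent letters.
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def suggestions(word, dictionary, max_distance=1):
    """Return the dictionary words within max_distance edits of word."""
    return sorted(w for w in dictionary
                  if edit_distance(word, w) <= max_distance)


# Hypothetical miniature dictionary.
words = {"the", "ten", "tea", "toe"}
print(suggestions("teh", words))
```

Scanning the whole dictionary like this is fine for a small word list; a
real checker would more likely generate or index candidates.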

Hope this helps.

Thomas Raffill

Kyongho Min

Mar 20, 2003, 8:53:05 PM

spj.w...@virgin.net (Simon) wrote in message news:<b5au47$h6p$1...@mulga.cs.mu.OZ.AU>...

> I am researching the analysis of spelling errors with a view to
> offering advice on remedial action that can be taken. I have decided
> to attempt to build a shell that analyses the data, offering a rule for
> each pattern it finds, or showing the pattern itself if no rule is
> available. I want to be able to group 'like errors', then find a rule
> to match the most prevalent errors, or just offer the error pattern if
> a rule is not available (i.e. I have not coded it).
>
> I would like to find a pattern once and then tot up the occurrences of
> the pattern. I can do this for inflection and derivation errors, but I
> am having problems when it comes to errors in a root word. Without a
> rule I can say what the difference is between the dictionary form and
> the error form, but some rules require information about the
> surrounding graphemes, and here I am stumped.

I have studied this task and implemented a system covering three levels:
lexical, syntactic, and semantic.
If you visit the following URL, you will find some spelling-error source
data, a bibliography, and my papers.
I hope it will be helpful to you.

URL: www.cse.unsw.edu.au/~min

Regards,

Kyongho MIN

Apokrif

Mar 22, 2003, 4:26:01 AM

Thomas Raffill :

> I have worked on this kind of issue before. Here are some rough
> overall categories of errors. If anyone knows more categories,
> please follow up and post them.
> 1. Typographical

> 2. Phonetic

I have written a Pascal program for these two categories. It's in French,
but one can easily adapt it (e.g. by changing the tables used for phonetic
equivalences).
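
For readers who want a feel for what such a table does, here is a small
Python sketch (not the Pascal program mentioned above, which is not shown
in this thread). The table entries are hypothetical and deliberately
crude, chosen only to cover the "rough"/"ruf" example given earlier; a
usable table for English or French would be far larger and would need
context-dependent entries.

```python
# Hypothetical table of spelling -> rough phonetic unit.
# "ough" -> "uf" is only right for words like "rough" and "tough".
PHONETIC_TABLE = {
    "ough": "uf",
    "ph": "f",
    "gh": "f",
    "ff": "f",
    "ck": "k",
}


def phonetic_key(word, table=PHONETIC_TABLE):
    """Rewrite word using the longest matching table entry at each position."""
    word = word.lower()
    longest = max(len(spelling) for spelling in table)
    out = []
    i = 0
    while i < len(word):
        for size in range(min(longest, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            if chunk in table:
                out.append(table[chunk])
                i += size
                break
        else:
            # No table entry starts here: keep the letter unchanged.
            out.append(word[i])
            i += 1
    return "".join(out)


for written, intended in [("ruf", "rough"), ("fone", "phone")]:
    same = phonetic_key(written) == phonetic_key(intended)
    print(written, intended, "match" if same else "differ")
```

Adapting this to another language would mean swapping the table. The
keys could also be compared with an edit distance instead of strict
equality, so that near misses are still found.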

km

Mar 27, 2003, 12:51:58 AM

"Kyongho Min" <kyong...@aut.ac.nz> wrote in message
news:b5dra1$ob1$1...@mulga.cs.mu.OZ.AU...

> URL: www.cse.unsw.edu.au/~min
"Damerau's Misspelt Words(...)"

regional differences should be factored in as well, lest you take
misspelt for misspelled...

- Paul