correcting word groups (general spelling question)

Yuri D'Elia

Jun 30, 2015, 7:16:52 AM

to help-gn...@gnu.org

I have a very general question about spelling and word correction in
general (not even emacs related), which I think cannot really be solved
currently with the underlying tools, but I'll ask nonetheless...

Whenever I edit some text, I might use a group of words (for the sake of
argument, say "Hello Kitteh"), for which a word of the group (Kitteh) is
obviously incorrect with the current dictionary, but it's correct in the
context of the group itself "Hello Kitteh".

If I add Kitteh to the dictionary, I will obviously allow Kitteh to roam
free in my text, which I don't want. I really want to learn "Hello
Kitteh" instead, and also give auto-correct suggestions based on this
specific meaning, and not on the individual word.

AFAIK, aspell/ispell cannot do that and hunspell's manual doesn't
suggest anything beyond word-by-word support.

Ideas?

Emanuel Berg

Jun 30, 2015, 8:10:03 PM

to help-gn...@gnu.org

Yuri D'Elia <wav...@thregr.org> writes:

> If I add Kitteh to the dictionary, I will obviously
> allow Kitteh to roam free in my text, which I don't
> want. I really want to learn "Hello Kitteh" instead,
> and also give auto-correct suggestions based on this
> specific meaning, and not on the individual word.

You can do it like this:

(setq ispell-skip-region-alist
      (append ispell-skip-region-alist '(("Hello" . "Kitteh"))))

However, I consider this over-engineering. How many
such combinations do you have?

Why not just add "Kitteh"?

Are you really that worried that it will appear instead
of the correct spelling?

Remember, "the only thing to fear is fear itself".

--
underground experts united
http://user.it.uu.se/~embe8573

Richard Wordingham

Jun 30, 2015, 11:47:14 PM

to help-gn...@gnu.org

On Wed, 01 Jul 2015 02:08:32 +0200
Emanuel Berg <embe...@student.uu.se> wrote:

> Yuri D'Elia <wav...@thregr.org> writes:
>
> > If I add Kitteh to the dictionary, I will obviously
> > allow Kitteh to roam free in my text, which I don't
> > want. I really want to learn "Hello Kitteh" instead,
> > and also give auto-correct suggestions based on this
> > specific meaning, and not on the individual word.
>
> You can do it like this:
>
> (setq ispell-skip-region-alist
>       (append ispell-skip-region-alist '(("Hello" . "Kitteh"))))

No you can't. It would skip all of, "Hello James. We noq go around
saying, 'Hello Kitteh'.". "Noq" is a typo for "now".
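
The over-skipping can be reproduced outside ispell. The following is a minimal sketch, assuming that a (START . END) entry makes ispell skip from a match of START to the end of the next match of END, as the docstring of ispell-skip-region-alist describes. The helper my-skipped-text is hypothetical and only imitates that behaviour; it does not call ispell.

```elisp
(defun my-skipped-text (start-re end-re text)
  "Return the span of TEXT from a START-RE match to the end of the next END-RE.
Return nil when either regexp fails to match.  Hypothetical helper that
imitates a (START-RE . END-RE) skip entry; it does not call ispell."
  (with-temp-buffer
    ;; Bind inside the temp buffer so the binding applies to the searches.
    (let ((case-fold-search nil))
      (insert text)
      (goto-char (point-min))
      (when (re-search-forward start-re nil t)
        (let ((beg (match-beginning 0)))
          ;; The END-RE search starts right after the START-RE match.
          (when (re-search-forward end-re nil t)
            (buffer-substring-no-properties beg (point))))))))

(my-skipped-text "Hello" "Kitteh"
                 "Hello James. We noq go around saying, 'Hello Kitteh'.")
;; => "Hello James. We noq go around saying, 'Hello Kitteh"
```

The returned span starts at the first "Hello" and contains "noq", so under the stated assumption the typo would never reach the spell checker.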

> However, I consider this over-engineering. How many
> such combinations do you have?
>
> Why not just add "Kitteh"?
>
> Are you really that worried that it will appear instead
> of the correct spelling?

A better example is "fro", which only appears in "to and fro". "Fro"
is a common typo for "for" and "from".

Hunspell seems to have a method of handling these words. It consists
of the qualifier -r on hunspell(1) and WARN in the affix file (see
hunspell(4)), along with a qualifier on the word itself in the
dictionary file. I haven't tried it, but presumably it relies on the
author being careless rather than ignorant of spelling.

Another approach is to move these words from the domain of
spell-checking to that of grammar-checking. It looks as though the
LibreOffice add-on langtool allows one to add grammar rules that would
check that Kitteh is preceded by Hello, and that 'fro' occurs in 'to
and fro'.
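
The grammar-rule idea can also be approximated in plain Emacs Lisp. This is only a sketch of the check itself, not of any LibreOffice or Hunspell feature: the hypothetical function my-stray-occurrences reports where a word appears outside the one phrase that licenses it. It assumes the phrase is written with single spaces and on one line.

```elisp
(defun my-stray-occurrences (word phrase text)
  "Return 0-based offsets in TEXT where WORD occurs outside PHRASE.
WORD and PHRASE are literal strings and PHRASE should contain WORD.
Hypothetical helper; whitespace variants of PHRASE are not handled."
  (with-temp-buffer
    (let ((case-fold-search t)
          (word-re (concat "\\b" (regexp-quote word) "\\b"))
          (phrase-re (concat "\\b" (regexp-quote phrase) "\\b"))
          (allowed nil)
          (stray nil))
      (insert text)
      ;; First pass: remember the buffer spans covered by PHRASE.
      (goto-char (point-min))
      (while (re-search-forward phrase-re nil t)
        (push (cons (match-beginning 0) (match-end 0)) allowed))
      ;; Second pass: collect WORD matches lying outside every span.
      (goto-char (point-min))
      (while (re-search-forward word-re nil t)
        (let ((beg (match-beginning 0))
              (inside nil))
          (dolist (span allowed)
            (when (and (<= (car span) beg) (< beg (cdr span)))
              (setq inside t)))
          (unless inside
            ;; Buffer positions are 1-based; report 0-based offsets.
            (push (1- beg) stray))))
      (nreverse stray))))

(my-stray-occurrences "fro" "to and fro"
                      "He paced to and fro, fro one wall to the other.")
;; => (21)
```

The word-boundary anchors keep "fro" from matching inside "from", so only the free-standing second "fro" is reported.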

Richard.

Emanuel Berg

Jul 1, 2015, 1:29:03 PM

to help-gn...@gnu.org

Richard Wordingham <richard.w...@ntlworld.com>
writes:

>> You can do it like this ...

>
> No you can't. It would skip all of, "Hello James.
> We noq go around saying, 'Hello Kitteh'.". "Noq" is
> a typo for "now".

Yes you can, with a small modification that you would
have found yourself if you'd stopped to think one
second instead of crying out from the holster:

(setq ispell-skip-region-alist
      (append ispell-skip-region-alist '(("Hello Kitteh" . ""))))

Try it here:

Hello Kitteh.

Hello James. Noq I'm misspelling both "now" and
kitten: kitteh. Will it ignore Hello Kitteh, but still
get the two misspellings?

John Mastro

Jul 1, 2015, 1:47:40 PM

to help-gn...@gnu.org

Emanuel Berg <embe...@student.uu.se> wrote:
> (setq ispell-skip-region-alist
>       (append ispell-skip-region-alist '(("Hello Kitteh" . ""))))
>
> Try it here:
>
> Hello Kitteh.
>
> Hello James. Noq I'm misspelling both "now" and
> kitten: kitteh. Will it ignore Hello Kitteh, but still
> get the two misspellings?

That gives me an error ("matching region not found"), but these two
seem to work as intended (after, admittedly, very little testing).

;; 1
(push (list "Hello Kitteh" #'ignore) ispell-skip-region-alist)
;; 2
(push (list "Hello Kitteh" (lambda (&rest _args) (match-end 0)))
      ispell-skip-region-alist)

--
john

Richard Wordingham

Jul 1, 2015, 3:31:28 PM

to help-gn...@gnu.org

Both need further tweaking for various non-zero quantities of white
space. A new line between 'Hello' and 'Kitteh' defeats these two
refinements. (Tested on Emacs 24.4.2.)
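
One possible tweak is to build the key regexp from the phrase so that any run of spaces, tabs, or newlines is accepted between the words. The sketch below reuses the `ignore' form posted earlier in the thread; my-phrase-regexp is a hypothetical helper, and whether ispell then skips a phrase broken across lines has not been verified here.

```elisp
(require 'ispell)

(defun my-phrase-regexp (phrase)
  "Return a regexp matching PHRASE with any whitespace run between its words.
Hypothetical helper; each word of PHRASE is matched literally."
  (mapconcat #'regexp-quote
             (split-string phrase)      ; split on whitespace, drop empties
             "[ \t\n]+"))

;; Register the phrase with the (KEY FUNCTION) form used above in the thread.
(push (list (my-phrase-regexp "Hello Kitteh") #'ignore)
      ispell-skip-region-alist)
```

Quoting each word with regexp-quote keeps phrases containing characters such as "." or "+" from being read as regexp syntax.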

Richard.

Yuri D'Elia

Jul 1, 2015, 4:34:22 PM

to help-gn...@gnu.org

On 07/01/2015 05:46 AM, Richard Wordingham wrote:
>> However, I consider this over-engineering. How many
>> such combinations do you have?
>>
>> Why not just add "Kitteh"?
>>
>> Are you really that worried that it will appear instead
>> of the correct spelling?
>
> A better example is "fro", which only appears in "to and fro". "Fro"
> is a common typo for "for" and "from".

I know that when I'm editing text this comes up more frequently than I
imagined, even though I never paid much attention to it. I usually just
added the word and was done with it.

In an article that I was drafting the other day I had 3-4 cases of words
combined with acronyms that would have made perfect sense to check only
as a group and not individually, also because the words make no sense
individually and I really don't want them in the dictionary.

I would also like to check for the correct capitalization of the entire
group, which makes sense as well.

> Another approach is to move these words from the domain of
> spell-checking to that of grammar-checking. It looks as though the
> LibreOffice add-on langtool allows one to add grammar rules that would
> check that Kitteh is preceded by Hello, and that 'fro' occurs in 'to
> and fro'.

I fear that adding this as grammar would indeed be overengineering. Also
because I would be too lazy to add a grammar rule and lose all the
benefits. This might be perfectly dumb (that is: select words -> add to
dictionary).

I was actually thinking of using some
utf8-nonprinting-character-as-whitespace to do the trick (sooo bad,
although it might work with libreoffice as well).

Emanuel Berg

Jul 1, 2015, 6:49:12 PM

to help-gn...@gnu.org

John Mastro <john.b...@gmail.com> writes:

> That gives me an error ("matching region not
> found"), but these two seem to work as intended
> (after, admittedly, very little testing).

Indeed, I forgot, I have since long disabled the popup
debugger so for me that error was logged in the
background. It is a good way to deal with errors by
the way...

But of course it works, because the principle
is clear.

Perhaps this is not so stupid after all. Because if
(when) it works it is a small thing to set up an
interface to just boost such exceptions. I don't know
how many there are, but in some strange contexts where
the whitelisted combinations contain words that are
very similar to ordinary words - why not?

Try this:

(setq ispell-skip-region-alist
      (append ispell-skip-region-alist '(("Hello Kitteh" . "[:word:]"))))

Hello Kitteh.

Hello James. Noq I'm misspelling both "now" and
kitten: kitteh. Will it ignore Hello Kitteh, but still
get the two misspellings?

Emanuel Berg

Jul 1, 2015, 6:55:13 PM

to help-gn...@gnu.org

Yuri D'Elia <wav...@thregr.org> writes:

> I fear that adding this as grammar would indeed be
> overengineering. Also because I would be too lazy to
> add a grammar rule and lose all the benefits.
> This might be perfectly dumb (that is: select words
> -> add to dictionary).

No, no, I'm sorry I said that. What other examples did
you come across, if it isn't a secret? Perhaps testing
will be more fun and natural if you provide us with
those as well.

Yuri D'Elia

Jul 8, 2015, 6:19:43 AM

to help-gn...@gnu.org

On 07/02/2015 12:50 AM, Emanuel Berg wrote:
>> I fear that adding this as grammar would indeed be
>> overengineering. Also because I would be too lazy to
>> add a grammar rule and lose all the benefits.
>> This might be perfectly dumb (that is: select words
>> -> add to dictionary).
>
> No, no, I'm sorry I said that. What other examples did
> you come across, if it isn't a secret? Perhaps testing
> will be more fun and natural if you provide us with
> those as well.

It's hard to make examples without a lot of context unfortunately.

But one thing that came up while I was drafting a paper a couple of
weeks ago was about the names of the various studies and consortia.

These are usually (very bad) acronyms, which are often spelled in full
with bad capitalization on purpose. One example is the "SardiNIA study",
which I would like to learn as a group, since Sardinia is also often
used in the paper (the name of the island itself): SardiNIA by itself
would be ambiguous, but "SardiNIA study" would always be
correct. Now this is a borderline example, since you have three
capitalized letters that are easy to spot, but it's not always so easy.

Technical papers often mix technical jargon and words in very specific
contexts. Often they contain localized words, which are just 1-2 edits
away from a regular one.

Again, these are words I never put in the dictionary, because they're
just too easy to mistype. I'd rather have them marked as spelling errors
to inspect them. However, they can very often be made unique using one
or two words of context around them, which would avoid the continuous
hassle of seeing 5-6% of the document marked with a spelling error. They
would also be unique enough to be consistent between documents, which
would save me even further time.

Using ispell-skip-region-alist is a "nice" hack. It just needs some
polish to read/write to a simple text file, and maybe a support function
to add the current region to it. I can do that myself ;).
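
The polish described above might look roughly like this. It is a sketch under several assumptions: the file name, its location, and all my-ispell-* names are hypothetical, phrases are stored one per line, and each phrase is registered with the `ignore' form from earlier in the thread with whitespace between words relaxed.

```elisp
(require 'ispell)

(defvar my-ispell-phrase-file
  (expand-file-name "ispell-phrases.txt" user-emacs-directory)
  "Hypothetical file holding one accepted phrase per line.")

(defun my-ispell-register-phrase (phrase)
  "Make ispell skip PHRASE, allowing any whitespace between its words."
  (let ((regexp (mapconcat #'regexp-quote (split-string phrase) "[ \t\n]+")))
    ;; `add-to-list' compares with `equal', so reloading adds no duplicates.
    (add-to-list 'ispell-skip-region-alist (list regexp #'ignore))))

(defun my-ispell-load-phrases ()
  "Register every non-empty line of `my-ispell-phrase-file'."
  (when (file-readable-p my-ispell-phrase-file)
    (with-temp-buffer
      (insert-file-contents my-ispell-phrase-file)
      (dolist (line (split-string (buffer-string) "\n" t "[ \t]+"))
        (my-ispell-register-phrase line)))))

(defun my-ispell-add-region-as-phrase (beg end)
  "Register the text between BEG and END and append it to the phrase file."
  (interactive "r")
  (let ((words (split-string (buffer-substring-no-properties beg end))))
    (unless words
      (user-error "The region contains no words"))
    ;; Normalize the phrase to single spaces before storing it.
    (let ((phrase (mapconcat #'identity words " ")))
      (my-ispell-register-phrase phrase)
      (write-region (concat phrase "\n") nil my-ispell-phrase-file 'append))))
```

Calling my-ispell-load-phrases from the init file and binding my-ispell-add-region-as-phrase to a key would give the "select words -> add" workflow; the phrase file stays a plain list that can be shared between documents.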

But I cannot stop thinking that I cannot be the only guy with this
"problem", and the solution doesn't strike me as particularly
complicated for the benefit that you have. I would have expected "word"
or "libreoffice" to have something similar for the sake of the user, but
it doesn't.

Emanuel Berg

Jul 11, 2015, 9:55:40 PM

to help-gn...@gnu.org

Yuri D'Elia <wav...@thregr.org> writes:

> Again, these are words I never put in the
> dictionary, because they're just too easy to mistype.
> I'd rather have them marked as spelling errors to
> inspect them. However, they can very often be made
> unique using one or two words of context around
> them, which would avoid the continuous hassle of
> seeing 5-6% of the document marked with a spelling
> error. They would also be unique enough to be
> consistent between documents, which would save me
> even further time.

As for me, I'd just add them to the dictionary and
not worry about them appearing in normal text, and
even if they did, I don't see a typo like that really
having a negative effect to speak of. But yes, it
might be a small annoyance now that you mention it.

> But I cannot stop thinking that I cannot be the only
> guy with this "problem", and the solution doesn't
> strike me as particularly complicated for the
> benefit that you have. I would have expected "word"
> or "libreoffice" to have something similar for the
> sake of the user, but it doesn't.

I think something to this effect should be added.
Perhaps you can report it as a bug (report it as
a suggestion which is the same thing) and see if the
ispell people like the idea. Maybe they can use
a separate file for such combinations, to be
preprocessed, if it would imply a performance penalty
to have it in the normal wordlist.