[vim/vim] Ship UTF-8 affix files for spell checking (#3747)


Warren Young

Dec 31, 2018, 2:04:42 PM
to vim/vim, Subscribed

For English, at least, Vim currently does not ship spell/en/main.aff, which means that text like the following is flagged with a spelling error:

 I don‘t like Emacs.

Note the curled Unicode quote.

Adding the following as the content of a new file called .../spell/en/main.aff fixes it:

SET utf-8

MIDWORD '-‘

I want this done in the main project because some packaging schemes cause local changes like that to be lost on updates. My immediate use case is MacVim, where /Applications/MacVim.app/ is deleted before the new one is unpacked, but this may happen with other packaging schemes.

I realize this will cause Vim to assume UTF-8, but I think that's been a safe default for years now.



Bram Moolenaar

Dec 31, 2018, 4:03:34 PM
to vim/vim, Subscribed

Warren Young wrote:

> For English, at least, Vim currently does not ship
> `spell/en/main.aff`, which means that text like the following is
> flagged with a spelling error:
>
> I don‘t like Emacs.
>
> Note the curled Unicode quote.
>
> Adding the following as the content of a new file called `.../spell/en/main.aff` fixes it:
>
> SET utf-8
> MIDWORD '-‘

What is main.aff? I don't see it used anywhere. There are several .aff
files, e.g. en_US.aff.

The encoding mentioned here is the encoding of the spell file. It is
already utf-8:
SET UTF-8

You also add the dash here which I think is incorrect. The dash
already is a word character, also when it's at the start or end of a
word.


> I want this done in the main project because some packaging schemes
> cause local changes like that to be lost on updates. My immediate use
> case is MacVim, where `/Applications/MacVim.app/` is deleted before
> the new one is unpacked, but this may happen with other packaging
> schemes.
>
> I realize this will cause Vim to assume UTF-8, but I think that's been
> a safe default for years now.

No, it only specifies the encoding of the spell files. So yes, it works
fine.

Perhaps you were referring to the Mac version of Vim? I would not know
why it has different spell files.

--
From "know your smileys":
|-P Reaction to unusually ugly C code

/// Bram Moolenaar -- Br...@Moolenaar.net -- http://www.Moolenaar.net \\\
/// sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
\\\ an exciting new programming language -- http://www.Zimbu.org ///
\\\ help me help AIDS victims -- http://ICCF-Holland.org ///

Warren Young

Dec 31, 2018, 4:12:20 PM
to vim/vim, Subscribed

The original posting is based on some incorrect thinking.

The primary one is that my chosen example is bad: "don" is an English word, so I was misled into thinking my proposed fix helps. Let's use a different example text:

I couldn’t do that in Emacs.

That gets flagged as a spelling error because "couldn" isn't an English word.
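
The effect can be illustrated with a small Python sketch. It is a toy model, not Vim's actual word-splitting code: the function name `split_words` and its `midword` parameter are made up for illustration, and the only rule modeled is that a MIDWORD-style character counts as part of a word when it sits between two letters.

```python
def split_words(text: str, midword: str) -> list:
    """Split text into words.

    Letters are always word characters. A character listed in `midword`
    is kept only when it is preceded and followed by a letter.
    (Hypothetical helper; Vim's real implementation differs.)
    """
    words = []
    current = ""
    for i, ch in enumerate(text):
        is_mid = (
            ch in midword
            and current != ""
            and i + 1 < len(text)
            and text[i + 1].isalpha()
        )
        if ch.isalpha() or is_mid:
            current += ch
        else:
            if current:
                words.append(current)
            current = ""
    if current:
        words.append(current)
    return words


sample = "I couldn\u2019t do that"  # U+2019 is the curly apostrophe

# Only the ASCII apostrophe is a mid-word character: the word breaks apart.
print(split_words(sample, "'"))
# The curly apostrophe is also a mid-word character: the word stays whole.
print(split_words(sample, "'\u2019"))
```

With only the ASCII apostrophe allowed, the checker sees "couldn" and "t" as separate words. Even when the word stays whole, the word list would still need an entry spelled with the curly apostrophe for it to be accepted.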

Now we're left with new problems, the primary one being that my main.aff fix is ineffective. More skimming and searching in :help spell tells me that this is because the affix file is only used by mkspell, and that "only developers need to know about it."

From that I infer that what's needed isn't for Vim to ship these affix files or for it to provide a way for normal end users to supply their own local version, but instead for the ones Vim developers use on their end to be modified to account for Unicode curly quotes in contractions and such.

This isn't about English specifically or even about English contractions. I assume it applies widely, such as to French m’aidez.

> Perhaps you were referring to the Mac version of Vim? I would not know why it has different spell files.

I filed the issue first against MacVim, but they closed it and sent me here. From that I assume they're not doing any MacVim customization to Vim's spell checking mechanism.

Bram Moolenaar

Dec 31, 2018, 6:56:43 PM
to vim/vim, Subscribed

Warren Young wrote:

> The original posting is based on some incorrect thinking.
>
> The primary one is that my chosen example is bad: "don" is an English
> word, so I was misled into thinking my proposed fix helps. Let's use
> a different example text:
>
> I couldn’t do that in Emacs.
>
> That gets flagged as a spelling error because "couldn" isn't an English word.
>
> Now we're left with new problems, the primary one being that my
> `main.aff` fix is ineffective. More skimming and searching in `:help
> spell` tells me that this is because the affix file is only used by
> `mkspell`, and that "only developers need to know about it."

Correct, the spell code needs to know exactly what characters make up
a correct word. That is processed into a complicated data structure used
to find spelling mistakes (it actually finds correct spellings, and what's
left are mistakes).


> From that I infer that what's needed isn't for Vim to ship these affix
> files or for it to provide a way for normal end users to supply their
> own local version, but instead for the ones Vim developers use on
> their end to be modified to account for Unicode curly quotes in
> contractions and such.
>
> This isn't about English specifically or even about English
> contractions. I assume it applies widely, such as to French m’aidez.

Which quotes are valid inside which words is language specific. The
normal single quote is used by most languages; this special kind of
quote added by Unicode is more specific and is only valid in a number of
languages.


> > Perhaps you were referring to the Mac version of Vim? I would not
> > know why it has different spell files.
>
> I filed the issue first against MacVim, but they closed it and sent me
> here. From that I assume they're not doing any MacVim customization to
> Vim's spell checking mechanism.

OK, I was thinking that MacVim doesn't have Mac-specific spell checking.


--
From "know your smileys":
(X0||) Double hamburger with lettuce and tomato



Dominique Pellé

Mar 25, 2019, 5:45:44 AM
to vim/vim, Subscribed

I think that the *.aff file should contain something like this (among other ICONV rules):

ICONV ’ '

At least recent French Hunspell files have this.
But unfortunately, Vim's import of Hunspell files does not yet take ICONV into account.

Bram Moolenaar

Mar 25, 2019, 5:22:04 PM
to vim/vim, Subscribed

Dominique wrote:

> I think that the *.aff file should contain something like this (among other ICONV rules):
> ```
> ICONV ’ '
> ```
> At least recent French Hunspell files have this.
> But unfortunately, vim import of Hunspell files does not take yet into
> account `ICONV`.

Is there documentation about what ICONV does exactly?

--
Support your right to bare arms! Wear short sleeves!



Dominique Pellé

Mar 25, 2019, 6:18:21 PM
to vim/vim, Subscribed

@brammool wrote:

> Is there documentation about what ICONV does exactly?

From https://linux.die.net/man/4/hunspell:

ICONV pattern pattern2

    Define input conversion table.

My understanding is that Hunspell transforms input using ICONV rules before probing the directory.
So that various apostrophes can become the regular ' apostrophe for example.
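
A minimal Python sketch of that idea follows. It is not Hunspell's or Vim's implementation. It assumes that each ICONV pattern is literal text (no wildcards) and that the longest matching pattern wins at each position; neither assumption is confirmed by the man page excerpt above. The names `parse_iconv`, `convert_input`, and `is_known`, and the two-word dictionary, are hypothetical.

```python
def parse_iconv(aff_text: str) -> dict:
    """Collect 'ICONV pattern pattern2' pairs from affix-file text.

    The 'ICONV <count>' header line has only two fields and is skipped.
    """
    table = {}
    for line in aff_text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0] == "ICONV":
            table[parts[1]] = parts[2]
    return table


def convert_input(word: str, table: dict) -> str:
    """Apply the conversion table, trying longer patterns first (assumption)."""
    patterns = sorted(table, key=len, reverse=True)
    out = []
    i = 0
    while i < len(word):
        for pat in patterns:
            if word.startswith(pat, i):
                out.append(table[pat])
                i += len(pat)
                break
        else:
            # No pattern matched here: copy the character unchanged.
            out.append(word[i])
            i += 1
    return "".join(out)


AFF = "ICONV 1\nICONV \u2019 '\n"        # the single rule quoted in this thread
DICTIONARY = {"couldn't", "m'aidez"}      # hypothetical stems, ASCII apostrophes
TABLE = parse_iconv(AFF)


def is_known(word: str) -> bool:
    """Convert the typed word first, then look it up."""
    return convert_input(word, TABLE) in DICTIONARY


print(convert_input("couldn\u2019t", TABLE))
print(is_known("m\u2019aidez"))
```

In this model the dictionary only ever stores one apostrophe form, and the conversion runs on the typed word before lookup.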

You can see many ICONV rules in this French dictionary, with rules for apostrophes, digraphs and various ways to write diacritics:

https://github.com/titoBouzout/Dictionaries/blob/master/French.aff

ICONV 38
ICONV ’ '
ICONV ffi ffi
ICONV ffl ffl
ICONV ff ff
ICONV fi fi
ICONV fl fl
ICONV à à
ICONV â â
ICONV ä ä
ICONV é é
ICONV è è
ICONV ê ê
etc.

Bram Moolenaar

Mar 27, 2019, 5:54:16 PM
to vim...@googlegroups.com, Dominique Pellé

> @brammool wrote:
>
> > Is there documentation about what ICONV does exactly?
>
> From https://linux.die.net/man/4/hunspell:
>
> ```
>
> ICONV pattern pattern2
>
> Define input conversion table.
>
> ```
>
> My understanding is that Hunspell transforms input using ICONV rules
> before probing the directory.

You mean dictionary?

> So that various apostrophes can become the regular ' apostrophe for example.

But what is a "pattern"? Just text (white space separated) or are there
wildcards?

What does "input" mean? I assume the original (typed) text that needs
to be checked. Then when suggesting a fix, I wonder how we revert the
conversion. Always do the opposite, or under some conditions?

I also wonder what rules there are. E.g. quotes at the start and end of
a word can be handled differently.

> You can see many ICONV rules in this French dictionary with rules for
> apostrophe, digraphs and various forms of ways to write diacritics:
>
> https://github.com/titoBouzout/Dictionaries/blob/master/French.aff
>
> ```
> ICONV 38
> ICONV ’ '
> ICONV ffi ffi
> ICONV ffl ffl
> ICONV ff ff
> ICONV fi fi
> ICONV fl fl
> ICONV à à
> ICONV â â
> ICONV ä ä
> ICONV é é
> ICONV è è
> ICONV ê ê
> etc.
> ```

This also handles composing characters rewritten to a single character.
That is not language specific, should be handled elsewhere.

--
Never eat yellow snow.

Dominique Pellé

Mar 28, 2019, 2:25:37 PM
to vim/vim, Subscribed

The documentation of ICONV at https://linux.die.net/man/4/hunspell
is short and a bit vague. Let's ask the author of Hunspell @laszlonemeth
what ICONV is for exactly, and whether ICONV is the right way to
recognize apostrophe ‘ or '.

László Németh

Mar 29, 2019, 4:00:57 AM
to vim/vim, Subscribed

Indeed, in Hunspell, you can convert Unicode or typographical apostrophe ’ (U+2019) to the ASCII one (') using ICONV in the case of UTF-8 encoded dictionary stems (there is a “SET UTF-8” in the affix file). But I don't know the ICONV support of mkspell (Vim’s version of MySpell/Hunspell developed by Bram Moolenaar).

There is no ideal solution, because it's still common to use ASCII apostrophes in plain text files, although in document editing that already counts as a typographical error.

With a UTF-8 encoded dictionary, you can store the correct typographical apostrophes in the dic file, and optionally, add a

ICONV 1
ICONV ' ’

definition to the dictionary to recognize and accept the words with ASCII apostrophes automatically. Otherwise it’s worth using MAP or REP to recognize and correct the words with ASCII apostrophes.
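
As a rough Python illustration of this arrangement (a toy model, not Hunspell code): the word list holds the typographical form, and a single `ICONV ' ’` style rule maps the ASCII apostrophe to U+2019 before lookup. The names `STEMS`, `TO_TYPOGRAPHIC`, and `is_known` are hypothetical.

```python
# Hypothetical word list stored with typographical apostrophes (U+2019).
STEMS = {"couldn\u2019t", "don\u2019t"}

# Toy equivalent of the rule "ICONV ' ’": ASCII apostrophe becomes U+2019.
TO_TYPOGRAPHIC = str.maketrans({"'": "\u2019"})


def is_known(word: str) -> bool:
    """Normalize the typed word to the stored form, then look it up."""
    return word.translate(TO_TYPOGRAPHIC) in STEMS


print(is_known("couldn't"))        # typed with the ASCII apostrophe
print(is_known("couldn\u2019t"))   # typed with the typographical apostrophe
```

Both spellings are accepted because every typed word is normalized to the stored form; nothing in this sketch corrects the ASCII spelling, which is what the MAP or REP route would be for.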

Note: The future is to use the typographical one everywhere, but that's not easy in document editors either (for example, modifying the shortcut Shift-1 to type the typographical apostrophe instead of the ASCII one in LibreOffice resulted in some surprising problems. The last one: https://bugs.documentfoundation.org/show_bug.cgi?id=108423).

Shady Alfred

May 11, 2024, 6:34:54 PM
to vim/vim, Subscribed

Any updates on what is the optimal way to achieve this?
What I did was this:

  1. open vim
  2. :spelldump
  3. save the buffer to ~/.config/vim/en/main.aff file
  4. add SET utf-8 MIDWORD '-’ to the beginning of the file.

It works fine.

