Hunspell for Japanese

Tak Kunihiro

Feb 17, 2018, 9:28:05 AM

to help-gn...@gnu.org, 国広卓也

I want to use `hunspell' to spellcheck English phrases that are mixed
into Japanese text. When I call M-x ispell-word, the responses from
`aspell' and `hunspell' differ, and this difference affects how underlines
are drawn in flyspell-mode: `hunspell' puts many unnecessary underlines on
Japanese phrases. So I have added the following to my ~/.emacs.d/inits.el
for now.

(defun flyspell-ignore-non-ascii (beg end info)
  "Tell flyspell to ignore non ascii characters.
Call this on `flyspell-incorrect-hook'."
  (string-match "[^!-~]" (buffer-substring beg end)))
(add-hook 'flyspell-incorrect-hook 'flyspell-ignore-non-ascii)

Is it possible to make `hunspell' behave like `aspell'?

GNU Emacs 25.3.1 (x86_64-apple-darwin13.4.0, NS appkit-1265.21 Version 10.9.5 (Build 13F1911))
of 2017-09-19

##
## Aspell
##

$ which aspell
/opt/local/bin/aspell
$ Emacs -Q
M-: (insert "Emacsは日本ではイーマックスと呼ばれる")
C-a
M-: (setq ispell-program-name "aspell")
M-x ispell-word
C-x b *Messages*

> Starting new Ispell process /opt/local/bin/aspell with default dictionary...
> Checking spelling of EMACSは日本語ではイーマックスと呼ばれる...
> EMACSは日本語ではイーマックスと呼ばれる is correct
> You can run the command ‘ispell-word’ with M-$

##
## Hunspell
##

$ which hunspell
/opt/local/bin/hunspell
$ hunspell -D
...
/opt/local/share/hunspell/en_US
LOADED DICTIONARY:
/opt/local/share/hunspell/en_US.aff
/opt/local/share/hunspell/en_US.dic
Hunspell 1.6.2
$ Emacs -Q
M-: (insert "Emacsは日本ではイーマックスと呼ばれる")
C-a
M-: (setq ispell-program-name "hunspell")
M-x ispell-word
C-x b *Messages*

> Starting new Ispell process hunspell with default dictionary...
> Checking spelling of EMACSは日本語ではイーマックスと呼ばれる...
> ispell-word: Ispell and its process have different character maps

Eli Zaretskii

Feb 17, 2018, 10:18:29 AM

to help-gn...@gnu.org

> From: Tak Kunihiro <t...@misasa.okayama-u.ac.jp>
> Date: Sat, 17 Feb 2018 22:53:50 +0900
> Cc: 国広卓也 <t...@misasa.okayama-u.ac.jp>

>
> I want to spellcheck English phrases that are mixed in Japanese
> phrases by `hunspell'. When I call M-x ispell-word, responses from `aspell' and
> `hunspell' differ. The difference results in how underlines are drawn in
> flyspell-mode. The `hunspell' gives many unnecessary underlines on Japanese phrases.

If your dictionary is for English, why do you expect flyspell-mode to
work correctly with words in another language? It can't do anything
sensible with such foreign words. The underlines flyspell-mode shows
in Japanese words when the dictionary is for English could be
anything; you should simply disregard any such underlines in
non-English words.

Can you tell why you pay attention to underlines in non-English words
in this situation?

> Is it possible to make `hunspell' behave like `aspell'?

They are very different programs, so they cannot behave the same.

> $ which hunspell
> /opt/local/bin/hunspell
> $ hunspell -D
> ...
> /opt/local/share/hunspell/en_US
> LOADED DICTIONARY:
> /opt/local/share/hunspell/en_US.aff
> /opt/local/share/hunspell/en_US.dic
> Hunspell 1.6.2
> $ Emacs -Q
> M-: (insert "Emacsは日本ではイーマックスと呼ばれる")
> C-a
> M-: (setq ispell-program-name "hunspell")
> M-x ispell-word
> C-x b *Messages*
>
> > Starting new Ispell process hunspell with default dictionary...
> > Checking spelling of EMACSは日本語ではイーマックスと呼ばれる...
> > ispell-word: Ispell and its process have different character maps

I see the same message. It is caused by Hunspell somehow considering
the string "は日本語ではイーマックスと呼ばれる" as more than one word,
and it therefore returns 3 misspellings, which then trigger the above
cryptic error message.

But once again, you've set up flyspell-mode to work in English, so you
shouldn't pay attention to what it does with Japanese. For starters,
I believe the encoding Emacs uses is incorrect in that case, because
the en_US.aff file probably states that it wants a Latin-1 encoding,
not UTF-8. But even using UTF-8 will not help here, AFAIU.

Tak Kunihiro

Feb 18, 2018, 12:54:56 AM

to help-gn...@gnu.org, el...@gnu.org, t...@misasa.okayama-u.ac.jp

Thank you for the reply.

I see. It is true that I should not expect either Aspell or Hunspell
to handle Japanese correctly when their task is to check English. It
was just luck that flyspell-mode with Aspell ignores Japanese words
and shows no underlines.

> Can you tell why you pay attention to underlines in non-English
> words in this situation?

When I write Japanese, English words such as `Emacs' are very often
mixed in. Thus I (and, I think, most Japanese users) run flyspell-mode
with an English dictionary all the time. I expect flyspell-mode to
ignore all Japanese words and check only English words, as LibreOffice
does.

With flyspell-mode and Hunspell, underlines are shown under many
Japanese phrases (though not all of them), and I cannot tell which
underlines correspond to misspelled English words. As mentioned
already, Aspell underlines only misspelled English words.

> But once again, you've set up flyspell-mode to work in English, so you
> shouldn't pay attention to what it does with Japanese.

I agree. I also see a problem with M-x ispell-buffer, and I found a
solution.

(defvar ispell-regexp-non-ascii "[^\000-\377]+"
  "Regular expression to match a non-ascii word.")
(add-to-list 'ispell-skip-region-alist (list ispell-regexp-non-ascii))

If I accept this solution for M-x ispell-buffer, I would also accept
the following solution for flyspell-mode.

(defun flyspell-skip-non-ascii (beg end info)
  "Tell flyspell to skip a non-ascii word.
Call this on `flyspell-incorrect-hook'."
  (string-match ispell-regexp-non-ascii (buffer-substring beg end)))
(add-hook 'flyspell-incorrect-hook 'flyspell-skip-non-ascii)

It took me a while to figure this out. I think that what M-x
ispell-buffer and flyspell-mode provide is fundamental functionality,
and it would be good to document this somewhere in Emacs, such as in
(info "(emacs) Spelling"). Can you give a suggestion?

Eli Zaretskii

Feb 18, 2018, 11:00:03 AM

to help-gn...@gnu.org

> Date: Sun, 18 Feb 2018 14:31:56 +0900 (JST)
> Cc: t...@misasa.okayama-u.ac.jp
> From: Tak Kunihiro <t...@misasa.okayama-u.ac.jp>

>
> (defvar ispell-regexp-non-ascii "[^\000-\377]+"
> "Regular expression to match a non-ascii word.")
> (add-to-list 'ispell-skip-region-alist (list ispell-regexp-non-ascii))
>
> Once I accept this solution for M-x ispell-buffer, I would accept a
> solution for flyspell-mode as shown below.
>
> (defun flyspell-skip-non-ascii (beg end info)
> "Tell flyspell to skip a non-ascii word.
> Call this on `flyspell-incorrect-hook'."
> (string-match ispell-regexp-non-ascii (buffer-substring beg end)))
> (add-hook 'flyspell-incorrect-hook 'flyspell-skip-non-ascii)
>
> It took me a while to figure this out. I think that what M-x
> ispell-buffer and flyspell-mode provide is fundamental functionalities
> and it is good to be documented in somewhere in Emacs such for (info
> "(emacs) Spelling"). Can you give suggestion?

On the Wiki?

You see, the solution you propose has one significant disadvantage: it
will skip words used in English prose which are written using
non-ASCII characters. It's true that there aren't many of those, but
they do exist.
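
To illustrate that disadvantage and one possible way around it, here is
a minimal sketch of a narrower hook that suppresses highlighting only
when the flagged text contains hiragana, katakana, or a CJK ideograph,
so that a word such as "naïve" is still checked. The names
`my-flyspell-japanese-regexp' and `my-flyspell-skip-japanese' are
hypothetical, and the character ranges are an assumption about what
should count as Japanese; this has not been tested against Hunspell's
word splitting.

```elisp
;; Sketch only: the two `my-' names are hypothetical, and the ranges
;; (hiragana, katakana, CJK unified ideographs) are an assumption
;; about which characters should count as "Japanese".
(require 'flyspell)

(defvar my-flyspell-japanese-regexp "[\u3040-\u30ff\u4e00-\u9fff]"
  "Regexp matching one hiragana, katakana, or CJK ideograph character.")

(defun my-flyspell-skip-japanese (beg end _info)
  "Return non-nil if the text between BEG and END contains Japanese.
Intended for `flyspell-incorrect-hook': a non-nil return value
tells flyspell not to highlight the word as incorrect."
  (string-match-p my-flyspell-japanese-regexp
                  (buffer-substring-no-properties beg end)))

(add-hook 'flyspell-incorrect-hook #'my-flyspell-skip-japanese)
```

Unlike a regexp that matches every non-ASCII character, this leaves
accented Latin words to the English dictionary. It only affects
flyspell-mode highlighting; M-x ispell-buffer is not changed by it.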

You could instead try using 2 dictionaries at the same time, one for
English, the other for Japanese. This will only work with Hunspell,
and only in Emacs 26 or later. Caveat: I never tried it with these
two languages, so I don't know whether this combination has some
subtle problems with that feature.
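
A rough sketch of what such a two-dictionary setup might look like in
an init file follows. It assumes a Hunspell dictionary for Japanese is
installed under the name "ja_JP"; that name is hypothetical, and
whether a usable Japanese dictionary exists for your Hunspell
installation needs to be checked with `hunspell -D'. As with the
suggestion itself, this is untested for this language pair.

```elisp
;; Sketch only: "ja_JP" is a hypothetical dictionary name.  Replace it
;; with whatever `hunspell -D' lists on your system, if anything.
(with-eval-after-load 'ispell
  (setq ispell-program-name "hunspell")
  ;; Comma-separated list of dictionaries to load together.
  (setq ispell-dictionary "en_US,ja_JP")
  ;; The spellchecker parameters must be initialized before the
  ;; multi-dictionary entry is registered.
  (ispell-set-spellchecker-params)
  (ispell-hunspell-add-multi-dic "en_US,ja_JP"))
```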

Tak Kunihiro

Feb 23, 2018, 9:02:47 PM

to help-gn...@gnu.org, el...@gnu.org, t...@misasa.okayama-u.ac.jp

> On the Wiki?

OK. I put the solution on EmacsWiki.

https://www.emacswiki.org/emacs/FlySpell#toc14