[patch] exclude East Asian characters from spell checking

Ken Takata

Oct 7, 2013, 8:02:58 AM
to vim...@googlegroups.com
Hi,

I wrote a patch for the following items from todo.txt:

> Have an option for spell checking to not mark any Chinese, Japanese or other
> double-width characters as error. Or perhaps all characters above 256.
> (Bill Sun) Helps a lot for mixed Asian and latin text.

> - have some way not to give spelling errors for a range of characters.
> E.g. for Chinese and other languages with specific characters for which we
> don't have a spell file. Useful when there is also text in other
> languages in the file.

When I write mixed Japanese and English text, the Japanese characters are
marked as spelling errors, which really annoys me.
Vim's current spell checking algorithm doesn't support Chinese, Japanese or
other East Asian languages, so I just exclude these characters from spell
checking. (There is no option to control this.)
Please check the attached patch.

Regards,
Ken Takata

exclude-east-asian-chars-from-spell-checking.patch
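
[Editor's illustration] The attached patch is not reproduced in this thread. As a rough sketch of what "excluding East Asian characters" could look like, the following C fragment tests a Unicode code point against a small table of block ranges. The names cp_range, cjk_ranges and is_cjk_char are hypothetical, and the table is an illustrative subset of Unicode blocks, not the list used by the patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Inclusive range of Unicode code points (hypothetical helper type). */
struct cp_range {
    uint32_t first;
    uint32_t last;
};

/* Illustrative, non-exhaustive subset of East Asian Unicode blocks.
 * This is an assumption for the sketch, not the table from the patch. */
static const struct cp_range cjk_ranges[] = {
    {0x1100, 0x11FF},   /* Hangul Jamo */
    {0x2E80, 0x2FDF},   /* CJK Radicals Supplement, Kangxi Radicals */
    {0x3040, 0x309F},   /* Hiragana */
    {0x30A0, 0x30FF},   /* Katakana */
    {0x3400, 0x4DBF},   /* CJK Unified Ideographs Extension A */
    {0x4E00, 0x9FFF},   /* CJK Unified Ideographs */
    {0xAC00, 0xD7AF},   /* Hangul Syllables */
    {0xF900, 0xFAFF},   /* CJK Compatibility Ideographs */
    {0xFF00, 0xFFEF},   /* Halfwidth and Fullwidth Forms */
    {0x20000, 0x2A6DF}, /* CJK Unified Ideographs Extension B */
};

/* Return true if code point c falls in one of the ranges above. */
static bool is_cjk_char(uint32_t c)
{
    size_t n = sizeof(cjk_ranges) / sizeof(cjk_ranges[0]);

    for (size_t i = 0; i < n; i++) {
        if (c >= cjk_ranges[i].first && c <= cjk_ranges[i].last)
            return true;
    }
    return false;
}
```

A spell checker could call such a predicate on each character and treat characters for which it returns true as non-word characters, so that they are never flagged.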

Tony Mechelynck

Oct 7, 2013, 8:59:43 AM
to vim...@googlegroups.com
"All characters above 256" would seem a little rash IMHO: after all,
Russian, Ukrainian, Bulgarian, Greek, etc. can (or should be able to)
use spell checking even though their writing systems are entirely above
U+00FF, and even in Latin script, some French nouns such as œil (eye),
œuf (egg), bœuf (ox or beef), œil-de-bœuf (a small round window), vœu
(wish), Œdipe (Oedipus), œsophage (oesophagus), etc., use characters
(the oe / OE digraphs, which in French are one character each) above
U+00FF. Similarly for the accented letters of non-West-European
languages, many of which fall outside the Latin1 range.

I suppose that excluding CJK is the right thing to do, since the nearest
thing to "spell checking" for handwritten CJK would mean checking that
the correct brush strokes were used, but "wrong" brush stroke
combinations (other than simplified vs. traditional glyphs, or than
Japanese "national" /kokuji/ characters in a Chinese text, etc.) cannot
be produced as computer text even in Unicode; or else it might mean
checking that word elements ("Han syllables") are meaningfully combined,
which IMHO is more akin to checking semantics or syntax than orthography.

Best regards,
Tony.
--
By perseverance the snail reached the Ark.
-- Charles Spurgeon
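
[Editor's illustration] To make the objection to "all characters above 256" concrete, here is a minimal C sketch of that naive threshold test. The function name above_latin1 is hypothetical. The code points in the comments are the standard Unicode values for the characters named.

```c
#include <stdbool.h>
#include <stdint.h>

/* Naive rule from the todo item: skip every character outside Latin1. */
static bool above_latin1(uint32_t c)
{
    return c > 0xFF;
}

/* Characters that belong to spell-checkable languages but would be
 * skipped by the rule above:
 *   U+0153  LATIN SMALL LIGATURE OE   (French "oeil", "boeuf", ...)
 *   U+0436  CYRILLIC SMALL LETTER ZHE (Russian, Ukrainian, Bulgarian)
 *   U+03B1  GREEK SMALL LETTER ALPHA  (Greek)
 * By contrast, U+00E9 (e with acute) is inside Latin1 and is kept. */
```

Under this rule a French, Russian or Greek word would be partly or wholly excluded from checking, which is why a test on specific East Asian ranges is the safer choice.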

Bram Moolenaar

Oct 7, 2013, 5:21:22 PM
to Tony Mechelynck, vim...@googlegroups.com

Tony Mechelynck wrote:

> "All characters above 256" would seem a little rash IMHO: after all,
> Russian, Ukrainian, Bulgarian, Greek, etc. can (or should be able to)
> use spell checking even though their writing systems are entirely above
> U+00FF, and even in Latin script, some French nouns such as œil (eye),
> œuf (egg), bœuf (ox or beef), œil-de-bœuf (a small round window), vœu
> (wish), Œdipe (Oedipus), œsophage (oesophagus), etc., use characters
> (the oe / OE digraphs, which in French are one character each) above
> U+00FF. Similarly for the accented letters of non-West-European
> languages, many of which fall outside the Latin1 range.
>
> I suppose that excluding CJK is the right thing to do, since the nearest
> thing to "spell checking" for handwritten CJK would mean checking that
> the correct brush strokes were used, but "wrong" brush stroke
> combinations (other than simplified vs. traditional glyphs, or than
> Japanese "national" /kokuji/ characters in a Chinese text, etc.) cannot
> be produced as computer text even in Unicode; or else it might mean
> checking that word elements ("Han syllables") are meaningfully combined,
> which IMHO is more akin to checking semantics or syntax than orthography.

I was wondering if this should be an option or a spell setting of some
kind. So, you argue that we won't ever have useful spell checking for
CJK characters, so we should just ignore them.

What if I have some text in a language that is spell checked, and by
some mistake a few CJK characters show up (copy/paste error, encoding
conversion mistake, etc.)? Then they should be marked as errors, right?

For me, I occasionally get these characters when an Asian name is used.
I don't really care if that is highlighted as an error or not (can't
read it anyway). Other names are marked as errors, so perhaps foreign
names should be as well?

Following that line of thinking it should be an option. Perhaps a
special entry in 'spelllang' "cjk" ?

--
DEAD PERSON: I'm getting better!
CUSTOMER: No, you're not -- you'll be stone dead in a moment.
MORTICIAN: Oh, I can't take him like that -- it's against regulations.
The Quest for the Holy Grail (Monty Python)

/// Bram Moolenaar -- Br...@Moolenaar.net -- http://www.Moolenaar.net \\\
/// sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
\\\ an exciting new programming language -- http://www.Zimbu.org ///
\\\ help me help AIDS victims -- http://ICCF-Holland.org ///

Ken Takata

Oct 8, 2013, 11:53:16 AM
to vim...@googlegroups.com
Hi,

2013/10/08 Tue 6:21:22 UTC+9 Bram Moolenaar wrote:
> I was wondering if this should be an option or a spell setting of some
> kind. So, you argue that we won't ever have useful spell checking for
> CJK characters, so we should just ignore them.
>
> What if I have some text in a language that is spell checked, and by
> some mistake a few CJK characters show up (copy/paste error, encoding
> conversion mistake, etc.)? Then they should be marked as errors, right?
>
> For me, I occasionally get these characters when an Asian name is used.
> I don't really care if that is highlighted as an error or not (can't
> read it anyway). Other names are marked as errors, so perhaps foreign
> names should be as well?
>
> Following that line of thinking it should be an option. Perhaps a
> special entry in 'spelllang' "cjk" ?

My previous patch excludes only CJK characters, not "all characters above 256".
But I agree that checking CJK characters is useful for catching some kinds of
mistakes. How about adding "nocjk" to 'spelllang'? For example, if you want to
check English but exclude CJK characters:
:set spelllang=en,nocjk

Please check the attached patch.
(I also merged another patch of mine:
https://groups.google.com/d/msg/vim_dev/UxuwQaj1HAc/BvjwIJg6WGIJ )

Regards,
Ken Takata
spelllang-nocjk.patch

Bram Moolenaar

Oct 8, 2013, 5:05:24 PM
to Ken Takata, vim...@googlegroups.com

Thanks. "nocjk" is a bit strange: the other entries in 'spelllang'
specify languages for which words will be recognized and not marked as
errors. I suggested "cjk" as it would see all CJK letters as OK.
Perhaps "ignore-cjk" would be clearer, but it's a bit long.

I don't think there will ever be a "cjk" language, thus there is no
reason to avoid the name just in case we do get a "cjk" spell checker.

--
[clop clop]
MORTICIAN: Who's that then?
CUSTOMER: I don't know.
MORTICIAN: Must be a king.
CUSTOMER: Why?
MORTICIAN: He hasn't got shit all over him.

Ken Takata

Oct 8, 2013, 6:50:26 PM
to vim...@googlegroups.com, Ken Takata
Hi Bram,

2013/10/09 Wed 6:05:24 UTC+9 Bram Moolenaar wrote:
> Thanks. "nocjk" is a bit strange: the other entries in 'spelllang'
> specify languages for which words will be recognized and not marked as
> errors. I suggested "cjk" as it would see all CJK letters as OK.
> Perhaps "ignore-cjk" would be clearer, but it's a bit long.
>
> I don't think there will ever be a "cjk" language, thus there is no
> reason to avoid the name just in case we do get a "cjk" spell checker.

Ah, I understand.
I have updated the patch.

Regards,
Ken Takata

spelllang-cjk.patch
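
[Editor's illustration] The thread settles on a special "cjk" entry in 'spelllang', used for example as ":set spelllang=en,cjk". The final patch is not shown here, so the following C sketch only illustrates one plausible way to detect such an entry in the comma-separated option value. The function name spelllang_has_cjk is hypothetical and is not taken from the Vim source.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Return true if the comma-separated list contains the exact entry "cjk".
 * Entries such as "nocjk" or "cjkx" do not match. */
static bool spelllang_has_cjk(const char *spelllang)
{
    const char *p = spelllang;

    while (*p != '\0') {
        /* Length of the current entry, up to the next comma or the end. */
        size_t len = strcspn(p, ",");

        if (len == 3 && strncmp(p, "cjk", 3) == 0)
            return true;

        p += len;
        if (*p == ',')
            p++;        /* step over the separator */
    }
    return false;
}
```

With a flag like this computed once when the option is set, the checker would skip East Asian characters only when the user asked for it, which keeps the default behaviour Bram describes: stray CJK characters in otherwise checked text are still marked as errors.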