word segmentation in Vim

Xie

Jan 20, 2009, 11:36:56 AM
to vim_dev
Hi everybody,

Vim is used around the world, in many different languages. As the
help indicates, a "word" in Vim is defined as "a sequence of
letters, digits and underscores ... bla bla bla ...". But that's a
word for alphabetic languages. Has Vim considered extending this
concept to more complex multi-byte languages such as Chinese, Japanese
or Korean, and using some word segmentation algorithm accordingly for
"w"/"b" etc.?


--
Xie

Tony Mechelynck

Jan 20, 2009, 8:30:57 PM
to vim...@googlegroups.com

Well, not only "this can be changed" (for single-byte characters) "by
the 'iskeyword' option", but also (for multibyte characters) Vim "knows"
that most characters are "word characters", but that some (such as
U+3000 IDEOGRAPHIC SPACE, U+3001 IDEOGRAPHIC COMMA, U+3002 IDEOGRAPHIC
FULL STOP etc.) are non-word characters.

What Vim does _not_ do AFAIK is regard every CJK character as a separate
"word". If you want that, you should use the commands for "character
under cursor" etc. rather than "word under cursor" etc.
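
To make that distinction concrete, here is a rough Python sketch. It is
not Vim's actual code: the `char_class` and `runs` helpers and the
sample sentence are made up for illustration, and Unicode general
categories are only a stand-in for whatever tables Vim really uses. It
groups a line into runs of "word" characters and runs of other
characters, so ideographic punctuation breaks a run but consecutive
ideograms do not.

```python
import unicodedata
from itertools import groupby


def char_class(ch):
    """Very rough character classification (an assumption, not Vim's rules)."""
    if ch.isspace():  # includes U+3000 IDEOGRAPHIC SPACE
        return "space"
    # Letters, marks, numbers and the underscore count as "word" characters.
    if ch == "_" or unicodedata.category(ch)[0] in "LMN":
        return "word"
    # Everything else (e.g. U+3001, U+3002) is treated as non-word.
    return "other"


def runs(text):
    """Split text into maximal runs of one class, dropping whitespace runs."""
    return ["".join(group)
            for cls, group in groupby(text, key=char_class)
            if cls != "space"]


sample = "今天天气很好。\u3000我们去公园、散步"
print(runs(sample))
```

Each run of ideograms comes out as one "word", however many dictionary
words it contains, which is the behaviour described above.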


Best regards,
Tony.
--
A lack of leadership is no substitute for inaction.

Xie

Jan 20, 2009, 10:11:09 PM
to vim_dev
Thank you for your reply, Tony. I don't know if my English is good
enough to make myself clear but I'll try.

In English, semantically, a "word" is a sequence of characters (a-zA-Z)
and is the smallest meaningful unit. Word segmentation is not needed in
English because words are naturally separated by whitespace. The
situation is different in CJK languages. It takes several CJK
characters to form a "word", but this "word" sits in an unbroken series
of characters and is not easily distinguishable by a computer. That's
why a word segmentation algorithm is needed to recognize a "word".
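
A tiny Python illustration of the difference (the helper name and both
sample sentences are invented for this example): splitting on
whitespace recovers the English words, but leaves the Chinese sentence
as one undivided run.

```python
def whitespace_words(text):
    """Split on whitespace, the way English text is naturally delimited."""
    return text.split()


english = "Vim is a text editor"
chinese = "今天天气很好"  # roughly: "The weather is nice today"

print(whitespace_words(english))  # one entry per English word
print(whitespace_words(chinese))  # the whole sentence as a single entry
```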

As far as I know, Vim simply takes a sequence of whatever characters
(not ,./?><...) as a "word", which is correct semantically for
English, but not for CJK languages. What I want to know is whether Vim
has ever considered adding support for this.

Thanks
Xie

Tony Mechelynck

Jan 20, 2009, 11:57:39 PM
to vim...@googlegroups.com

I don't think it would be feasible, especially since OT1H there exist
compound words which can exist either as distinct words or as part of
larger compounds, and OTOH there exist characters which cannot appear as
separate words in contemporary Chinese but can do so in poetic or
archaic language (and you wouldn't prevent Vim from being usable with,
let's say, commentaries of ancient writers, would you?). So IIUC Vim
would need an extensive dictionary of compounds, and the logic to go
with it, in order to "intelligently" break CJK words (and I'm not sure
it could do so when spelling is not being checked). So I suppose
treating all ideograms (but not ideographic punctuation) as "word"
characters may be less than perfect but at least it's doable (and
someone who doesn't speak CJK languages can program it and test it).
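
As an illustration of the compound problem, here is a minimal Python
sketch of dictionary-based "maximum matching", one common textbook
approach to segmentation. Everything in it is an assumption made for
the example: the four-entry dictionary, the function names, and the
fallback of emitting a single character when nothing matches. Scanning
the same text forwards and backwards gives two different answers.

```python
def forward_max_match(text, dictionary):
    """Greedy left-to-right: take the longest dictionary entry at each step."""
    max_len = max((len(w) for w in dictionary), default=1)
    result, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A lone character is always accepted as a fallback.
            if size == 1 or candidate in dictionary:
                result.append(candidate)
                i += size
                break
    return result


def backward_max_match(text, dictionary):
    """Greedy right-to-left variant of the same idea."""
    max_len = max((len(w) for w in dictionary), default=1)
    result, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            candidate = text[j - size:j]
            if size == 1 or candidate in dictionary:
                result.append(candidate)
                j -= size
                break
    result.reverse()
    return result


# Hypothetical toy dictionary: "research", "graduate student", "life", "origin".
TOY_DICT = {"研究", "研究生", "生命", "起源"}
text = "研究生命起源"  # "research the origin of life"

print(forward_max_match(text, TOY_DICT))   # ['研究生', '命', '起源']
print(backward_max_match(text, TOY_DICT))  # ['研究', '生命', '起源']
```

The forward scan swallows "研究生" as a compound and strands "命", while
the backward scan happens to find the intended reading, so even with a
dictionary the result depends on extra logic.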

What might be possible (but I'm not sure it is) would be to define
spelling dictionaries for mainland Chinese, Taiwanese, Hong Kong
Chinese, Japanese, South Korean and North Korean, containing only the
"acceptable" isolated words and "indivisible" compounds. This might give
a basis for what you're asking for; but how would you treat a CJK
character which is not used alone in some language, and appears (maybe
as a result of some typo, or maybe in a quotation from some other CJK
language) in a context where it doesn't make an "acceptable" compound
with the hanzi-kanji-hanja/kana/hangeul-chosŏngŭl surrounding it? In
alphabetic languages you could scan both ways to the nearest space, tab,
linebreak or punctuation mark; but I'm not sure how to do it with CJK text.
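
For comparison, the "scan both ways" approach for alphabetic text can
be sketched in a few lines of Python (the `word_under_cursor` name and
the rule "letters, digits and underscore" are assumptions for the
example, not Vim's implementation). Run on unsegmented CJK text, the
same scan simply returns the whole run between punctuation marks.

```python
def word_under_cursor(text, pos):
    """Expand left and right from pos while characters look like word chars."""
    def is_word(ch):
        return ch.isalnum() or ch == "_"

    if not 0 <= pos < len(text) or not is_word(text[pos]):
        return ""
    start = pos
    while start > 0 and is_word(text[start - 1]):
        start -= 1
    end = pos + 1
    while end < len(text) and is_word(text[end]):
        end += 1
    return text[start:end]


print(word_under_cursor("scan both ways, please", 6))  # cursor inside "both"
print(word_under_cursor("今天天气很好。", 2))            # whole run of ideograms
```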


Best regards,
Tony.
--
When a fly lands on the ceiling, does it do a half roll or a half
loop?

Xie

Jan 21, 2009, 3:57:54 AM
to vim_dev
In order to support word segmentation in Chinese, a dictionary is
indispensable. I understand your doubts, but I'm not going to deal with
these details and reinvent the wheel. As far as I know there are some
open source Chinese word segmentation libraries available on the web.
It would be much better if they could be integrated into Vim. Although
I've been using Vim for a while and I'm a programmer, I'm still new to
the Vim source. I don't know if the concept of "word" in Vim can be
extended to meet this challenge. If not, that would be a waste of time.
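
To show the kind of hook this would need, here is a hedged Python
sketch of how the output of any external segmenter could drive "w"-
and "b"-like motions. Nothing here reflects the Vim source: the
function names are invented, the token list is assumed to come from
some segmentation library, and offsets are counted in characters
rather than bytes.

```python
from bisect import bisect_left, bisect_right
from itertools import accumulate


def word_starts(tokens):
    """Character offset at which each token begins."""
    return [0] + list(accumulate(len(t) for t in tokens))[:-1] if tokens else []


def next_word_start(tokens, cursor):
    """Like 'w': first token start after the cursor, or end of text."""
    starts = word_starts(tokens)
    idx = bisect_right(starts, cursor)
    return starts[idx] if idx < len(starts) else sum(len(t) for t in tokens)


def prev_word_start(tokens, cursor):
    """Like 'b': last token start before the cursor, or 0 at the beginning."""
    starts = word_starts(tokens)
    idx = bisect_left(starts, cursor) - 1
    return starts[idx] if idx >= 0 else 0


# Assumed segmenter output for "研究生命起源".
tokens = ["研究", "生命", "起源"]
print(next_word_start(tokens, 0))  # 2
print(prev_word_start(tokens, 3))  # 2
```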

--
Xie
