Set line breaks, word wraps and word searching for Thai and other non-latin languages

Brian Wilson

Nov 13, 2015, 1:22:19 AM

to vim...@vim.org

I posted the following question on the vi/vim stack exchange and was told that the vim-dev mailing list would be a more appropriate place to ask.

Brian

It is edited here as best I can, on the assumption that the entered text is UTF-8.

My purpose is a Thai solution, but rather than a hack, a more general solution should be available that will help the more than 1 billion speakers of the various Indic languages.

****

I can set the text width and can manually line break imported paragraphs with the following as an example.

set textwidth=72
gqq

I can also navigate English text files with the standard 'w' 'b' 'e' '*' commands, etc.

This works well for English. However, Thai and other Brahmic scripts of South and Southeast Asia put spaces at the phrasal level, not between words. LibreOffice, Word, InDesign, TeX, etc. "know" where line breaks should occur. They also "know" where individual words are, even though there are no spaces, so I can navigate by Thai word in these programs. I can even type English, Thai and Lao in the Chrome address bar and then use Alt (Option) + arrow on my Mac to navigate at the word level in all three languages. It seems that these programs are tapping into work that has already been done at some lower level. If Vim could tap into the same work, then someone could edit a multi-language document without having to do anything fancy: 'w', 'dw', etc. would just work from one word to the next regardless of the language.

Line breaking poses a different challenge. Because these languages put spaces at the phrasal level, the presence or absence of a trailing space at the end of a line carries meaning when breaking and joining lines. By way of example, the spaces behave like an Oxford comma or other punctuation: they are the difference between whether or not we had Grandma for breakfast ("Let's eat Grandma." vs. "Let's eat, Grandma."). One, also, doesn't, want, random, spaces, appearing, when, they, are, not, needed.

My question is twofold: 1. How can Vim tap into already available libraries in order to recognize words from Indic languages (including and especially Thai) for the purpose of navigation and other Vim word-level commands? 2. Is it possible to add language awareness for the purpose of line breaking, so that Vim does not strip/add spaces when breaking/joining lines at words in Thai or other Indic languages?
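The second question (keeping or dropping the space when lines are joined) can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not anything Vim does today: the helper names `is_thai` and `join_lines` are hypothetical, and only the Unicode Thai block (U+0E00 to U+0E7F) is checked.

```python
def is_thai(ch):
    """True if ch is in the Unicode Thai block (U+0E00 to U+0E7F)."""
    return "\u0e00" <= ch <= "\u0e7f"


def join_lines(first, second):
    """Join two lines, adding a space only where the text calls for one.

    Hypothetical helper: a sketch of a script-aware join, not Vim's code.
    """
    # Thai on both sides of the break: the break fell inside a phrase,
    # so join with nothing in between (like Vim's `gJ`).
    if first and second and is_thai(first[-1]) and is_thai(second[0]):
        return first + second
    # Otherwise behave roughly like `J`: trim whitespace at the seam and
    # put back a single space, unless one side is empty.
    left, right = first.rstrip(), second.lstrip()
    if not left or not right:
        return left + right
    return left + " " + right


print(join_lines("ภาษา", "ไทย"))    # no space was present, none is added
print(join_lines("ภาษา ", "ไทย"))   # the phrasal space is kept
print(join_lines("Let's eat,", "Grandma."))
```

The design choice here is that a trailing space on the first line is treated as meaningful, so it survives the join, while a break between two Thai characters is treated as purely visual.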

Bram Moolenaar

Nov 14, 2015, 4:53:24 PM

to Brian Wilson, vim...@vim.org

Brian Wilson wrote:

> I posted the following question on the vi/vim stack exchange
> <http://vi.stackexchange.com/questions/5452/set-line-breaks-word-wraps-and-word-searching-for-thai-and-other-non-latin-lang>
>
> *My question is two fold:* 1. How can vim tap into already available
> libraries in order to recognize words from Indic languages (including and
> especially Thai) for the purpose of navigation and other vim word level
> commands. 2. Is it possible to add language awareness for the purpose of
> line breaking so that vim does not strip/add spaces when breaking/joining
> lines at words in Thai or other Indic languages.

Can we see the start and/or end of a word by recognizing characters?
Or do we need to recognize words?

The spell checker does have some knowledge about where words start and
end. It's a bit slow doing it that way, but might still be acceptable.

I suppose we could have a character class that indicates no spaces are
used to separate words. That will assume the character is only used in
that kind of language.
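
A minimal sketch of such a character class, written in Python rather than Vim's C source. The class names, the function names and the list of script prefixes are assumptions for illustration; the list is incomplete, and it also classifies Thai digits and symbols as "unspaced", which a real implementation might not want.

```python
import unicodedata

# Assumption: an illustrative, incomplete list of scripts that are
# written without spaces between words.
UNSPACED_SCRIPT_PREFIXES = ("THAI ", "LAO ", "KHMER ", "MYANMAR ")


def is_unspaced_script_char(ch):
    """True if the Unicode name of ch starts with one of the prefixes."""
    name = unicodedata.name(ch, "")
    return name.startswith(UNSPACED_SCRIPT_PREFIXES)


def char_class(ch):
    """Classify one character (hypothetical class names)."""
    if ch.isspace():
        return "space"
    # Checked before the generic word test, because Thai letters also
    # count as alphanumeric.
    if is_unspaced_script_char(ch):
        return "unspaced"
    if ch.isalnum() or ch == "_":
        return "word"
    return "punct"


for ch in ("a", " ", ",", "ภ", "ກ"):
    print(repr(ch), char_class(ch))
```

A class like this only says that word boundaries cannot be found from spaces; it does not say where the boundaries are.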

--
hundred-and-one symptoms of being an internet addict:
102. When filling out your driver's license application, you give
your IP address.

/// Bram Moolenaar -- Br...@Moolenaar.net -- http://www.Moolenaar.net \\\
/// sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
\\\ an exciting new programming language -- http://www.Zimbu.org ///
\\\ help me help AIDS victims -- http://ICCF-Holland.org ///

Random832

Nov 14, 2015, 6:04:44 PM

to vim...@vim.org

Bram Moolenaar <Br...@moolenaar.net> writes:
> Can we see the start and/or end of a word by recognizing characters?
> Or do we need to recognize words?

Everything I can find online indicates that word boundary detection (and
line breaking, which requires _syllable_ boundary detection) requires
recognizing words rather than just characters. ICU provides algorithms
for this, which use dictionaries for Thai, Khmer, Chinese, and Japanese,
though I don't know if this is what is used by the platforms that provide
this capability in standard editing controls.

Cynically, I suppose that users of these scripts are probably used to
minor inconsistencies between different software packages, and that
matching platform behavior exactly is less important than having
reasonable behavior 99% of the time.

http://userguide.icu-project.org/boundaryanalysis
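
To illustrate what "dictionary-based" boundary detection means, here is a toy greedy longest-match segmenter in Python. It is not ICU's algorithm or API, which is considerably more sophisticated; the two-entry word list `WORDS` and the function name `segment` are hypothetical.

```python
# Assumption: a tiny stand-in for a real dictionary.
WORDS = {"ภาษา", "ไทย"}


def segment(text, words=WORDS):
    """Split text into known words by greedy longest match.

    Characters that start no known word are emitted one at a time.
    """
    max_len = max(map(len, words), default=1)
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for length in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + length] in words:
                break
        else:
            length = 1  # nothing matched: emit a single character
        pieces.append(text[i:i + length])
        i += length
    return pieces


print(segment("ภาษาไทย"))  # the two dictionary words, in order
```

Greedy matching picks the wrong split whenever a shorter word would have allowed a better continuation, which is one reason real implementations do more than this.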

One thing I wonder about is, will \< and \> be in scope for such a
feature? I don't think they can be on the same column right now.

Nikolay Pavlov

Nov 15, 2015, 12:02:24 AM

to vim_dev, vim-dev Mailingliste

`gq` behaviour is controlled by the &formatexpr and &formatprg option values, and you may use them if you know a program which serves your purposes. `w` and other motions can be remapped, and the same goes for `J` (in the last case you may manually choose between `J` (join with spaces) and `gJ` (join without inserting spaces, but also without removing them)). So you can have some minor level of convenience by configuring Vim without patching it. But this does not work for

1. Motions inside “nore” mappings.

2. expand('<cword>') and other means of getting word under the cursor (e.g. :edit <cword>).

3. Behaviour when &linebreak option is set.

4. `\<`/`\>`. Though I am unsure that this should be fixed: I always parsed this as “place between non-word and word character” and “place between word and non-word character”, and not “place where word starts” and “place where word ends”. The documentation describes the second interpretation, but I have a strong impression (based on the wording, the actual implementation and the fact that this is my interpretation) that the author meant the first variant.

--
--
You received this message from the "vim_dev" maillist.
Do not top-post! Type your reply below the text you are replying to.
For more information, visit http://www.vim.org/maillist.php

---
You received this message because you are subscribed to the Google Groups "vim_dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to vim_dev+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Brian Wilson

Nov 16, 2015, 2:08:22 PM

to vim_dev, boun...@gmail.com, vim...@vim.org

>Bram Moolenaar wrote:
> Can we see the start and/or end of a word by recognizing characters?
> Or do we need to recognize words?
>
> The spell checker does have some knowledge about where words start and
> end. It's a bit slow doing it that way, but might still be acceptable.
>
> I suppose we could have a character class that indicates no spaces are
> used to separate words. That will assume the character is only used in
> that kind of language.
>

Thank you Bram. I believe there are C algorithms (someplace) that define what is a Thai syllable. (I am not sure if this is the same as the ICU algorithms or not.) This would allow wrapping and navigation etc. at syllable level. I wouldn't recommend it for line breaking, but it would be less CPU intensive than a dictionary solution. However, for use of '*' and other nifty word level commands, tapping into ICU dictionary algorithms seems necessary. In my naivety, I ask, could we enable a setting that turns on/off ICU dictionary algorithms for Southeast and Southern Asian languages "in one fell swoop"? Or does this need to be hammered out one language at a time?

And: if ICU dictionary algorithms are supported, would this mean Thai spelling would be naturally supported, or would this be a separate step?

Brian Wilson

Nov 16, 2015, 2:14:56 PM

to vim_dev, vim...@vim.org, rand...@fastmail.com

On Saturday, November 14, 2015 at 3:04:44 PM UTC-8, Random832 wrote:

Yes! Thank you. It is the ICU algorithms that I am thinking of.

Correct. Depending on the language, expectations vary. Lao has been under-supported and has only had ICU algorithms for a couple of years now. Thai, on the other hand, is closer to 99.++%.

Not sure I understand the question about \< and \> being on the same column.

Brian

Brian Wilson

Nov 16, 2015, 2:29:27 PM

to vim_dev, vim...@vim.org

On Saturday, November 14, 2015 at 9:02:24 PM UTC-8, ZyX wrote:
> `gq` behaviour is controlled by the &formatexpr and &formatprg option values, and you may use them if you know a program which serves your purposes. `w` and other motions can be remapped, and the same goes for `J` (in the last case you may manually choose between `J` (join with spaces) and `gJ` (join without inserting spaces, but also without removing them)). So you can have some minor level of convenience by configuring Vim without patching it. But this does not work for
>
> 1. Motions inside “nore” mappings.
> 2. expand('<cword>') and other means of getting word under the cursor (e.g. :edit <cword>).
> 3. Behaviour when &linebreak option is set.
> 4. `\<`/`\>`. Though I am unsure that this should be fixed: I always parsed this as “place between non-word and word character” and “place between word and non-word character”, and not “place where word starts” and “place where word ends”. The documentation describes the second interpretation, but I have a strong impression (based on the wording, the actual implementation and the fact that this is my interpretation) that the author meant the first variant.

Thank you ZyX,
Learning the difference between J and gJ is very helpful.

Regarding remapping of w, I do not know how to remap it such that it would move to the next Thai word.

4. Perhaps an example will clarify the technical description of what I mean, since I am not sure of the difference between the two interpretations that you give. :) If I type '*' while sitting on a Thai word, I would expect it to go to the next matching word and not try to match the entire unspaced phrase. 'diw' should not delete the entire phrase, but only the Thai word that I am sitting on, etc.

Brian

Nikolay Pavlov

Nov 16, 2015, 3:01:30 PM

to vim_dev, vim-dev Mailingliste

2015-11-16 22:29 GMT+03:00 Brian Wilson <boun...@gmail.com>:

> On Saturday, November 14, 2015 at 9:02:24 PM UTC-8, ZyX wrote:
> > `gq` behaviour is controlled by the &formatexpr and &formatprg option values, and you may use them if you know a program which serves your purposes. `w` and other motions can be remapped, and the same goes for `J` (in the last case you may manually choose between `J` (join with spaces) and `gJ` (join without inserting spaces, but also without removing them)). So you can have some minor level of convenience by configuring Vim without patching it. But this does not work for
> >
> > 1. Motions inside “nore” mappings.
> > 2. expand('<cword>') and other means of getting word under the cursor (e.g. :edit <cword>).
> > 3. Behaviour when &linebreak option is set.
> > 4. `\<`/`\>`. Though I am unsure that this should be fixed: I always parsed this as “place between non-word and word character” and “place between word and non-word character”, and not “place where word starts” and “place where word ends”. The documentation describes the second interpretation, but I have a strong impression (based on the wording, the actual implementation and the fact that this is my interpretation) that the author meant the first variant.
>
> Thank you ZyX,
> Learning the difference between J and gJ is very helpful.
>
> Regarding remapping of w, I do not know how to remap it such that it would move to the next Thai word.

You will have to implement the needed algorithm in VimL. If you know how to determine Thai word boundaries using ICU, you can take the Python ICU bindings and a Vim built with +python[3], and implement the needed motion in “VimL” where most of the work is done by Python and ICU, and VimL is only used to run Python.

In any case I think that writing this in Python will be much easier even without ICU: Python at least has the unicodedata module built in, and Vim has nothing like it.
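
As a rough sketch of the Python side of such a motion: given a line that has already been split into segments (by ICU or any other segmenter), the target of a `w`-like motion is the first word start after the cursor. The function names are hypothetical, and the sketch works in character indices, whereas Vim's cursor column is a byte index, so a real mapping would still have to convert between the two.

```python
def word_starts(segments):
    """Character offsets at which a non-whitespace segment begins."""
    starts = []
    pos = 0
    for seg in segments:
        if not seg.isspace():
            starts.append(pos)
        pos += len(seg)
    return starts


def next_word_start(segments, col):
    """Offset of the first word start after col, or None if there is none."""
    for start in word_starts(segments):
        if start > col:
            return start
    return None


# Assumption: the line "ภาษาไทย abc" has already been segmented.
line = ["ภาษา", "ไทย", " ", "abc"]
print(word_starts(line))
print(next_word_start(line, 0))
```

Wiring this into Vim (reading the cursor position, calling the function from a mapping) is left out here, since it depends on how Vim was built and on the segmenter used.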

> 4. Perhaps an example will clarify the technical description of what I mean, since I am not sure of the difference between the two interpretations that you give. :) If I type '*' while sitting on a Thai word, I would expect it to go to the next matching word and not try to match the entire unspaced phrase. 'diw' should not delete the entire phrase, but only the Thai word that I am sitting on, etc.

Most likely the only option here will be remapping `*` and `#`: I predict that the behaviour of `\<`/`\>` will not be changed. Though if somebody implements changes for the `[ai]?w|[eb]` motions, it may be possible that `*` will also produce a different regex out of the box.
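
A sketch of what a remapped `*` could compute, again in Python and again assuming the line has already been segmented into words. It picks the segment under the cursor and builds a Vim search pattern without `\<` and `\>` (Vim's built-in `g*` also omits them, but it still takes the whole unspaced run as the word). The pattern uses `\V` ("very nomagic"), under which only the backslash and the search delimiter need escaping. The function names are hypothetical.

```python
def word_at(segments, col):
    """Return the segment covering character offset col, or None."""
    pos = 0
    for seg in segments:
        if pos <= col < pos + len(seg):
            return seg
        pos += len(seg)
    return None


def star_pattern(segments, col):
    """Build a Vim `/` search pattern for the word under the cursor."""
    word = word_at(segments, col)
    if word is None or word.isspace():
        return None
    # Under \V only the backslash and the "/" delimiter are special.
    escaped = word.replace("\\", "\\\\").replace("/", "\\/")
    return "\\V" + escaped


print(star_pattern(["ภาษา", "ไทย"], 5))
```

Without word-boundary atoms the pattern also matches the word inside longer words, which is the trade-off Random832's question about `\<` and `\>` points at.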

> Brian


Random832

Nov 16, 2015, 4:08:29 PM

to vim...@vim.org

Brian Wilson <boun...@gmail.com> writes:
> Not sure I understand the question about \< and \> being on the same
> column.

\< and \> are zero-width regex specifiers matching the beginning and
end of a word. For the English words AAA BBB the pattern would be
^\<AAA\> \<BBB\>$, but for the Thai words AAABBB it would be
^\<AAA\>\<BBB\>$, and I was wondering if that would break any
assumptions in the regex engine.
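
The difference can be made concrete with a small Python sketch that computes where word starts and word ends fall for a segmented line. The function name `boundaries` is hypothetical, and the segments stand in for whatever a word-aware Vim would produce.

```python
def boundaries(segments):
    """Return (starts, ends): sets of offsets where words begin and end."""
    starts, ends = set(), set()
    pos = 0
    for seg in segments:
        if not seg.isspace():
            starts.add(pos)
            ends.add(pos + len(seg))
        pos += len(seg)
    return starts, ends


# Spaced text: no offset is both a word end and a word start.
starts, ends = boundaries(["AAA", " ", "BBB"])
print(sorted(starts & ends))

# Unspaced text: offset 3 is the end of AAA and the start of BBB.
starts, ends = boundaries(["AAA", "BBB"])
print(sorted(starts & ends))
```

In the unspaced case both zero-width atoms would have to match at the same position, which is the situation the question is about.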

Bram Moolenaar

Nov 16, 2015, 5:08:47 PM

to Brian Wilson, vim_dev, vim...@vim.org

Brian Wilson wrote:

> >Bram Moolenaar wrote:
> > Can we see the start and/or end of a word by recognizing characters?
> > Or do we need to recognize words?
> >
> > The spell checker does have some knowledge about where words start and
> > end. It's a bit slow doing it that way, but might still be acceptable.
> >
> > I suppose we could have a character class that indicates no spaces are
> > used to separate words. That will assume the character is only used in
> > that kind of language.
> >
>
> Thank you Bram. I believe there are C algorithms (someplace) that
> define what is a Thai syllable. (I am not sure if this is the same as
> the ICU algorithms or not.) This would allow wrapping and navigation
> etc. at syllable level. I wouldn't recommend it for line breaking, but
> it would be less CPU intensive than a dictionary solution. However,
> for use of '*' and other nifty word level commands, tapping into ICU
> dictionary algorithms seems necessary. In my naivety, I ask, could we
> enable a setting that turns on/off ICU dictionary algorithms for
> Southeast and Southern Asian languages "in one fell swoop"? Or does
> this need to be hammered out one language at a time?

If someone can make a patch to support that ICU library, with configure
checks, tests and everything, I could include that.

> And: if ICU dictionary algorithms are supported, would this mean Thai
> spelling would be naturally supported, or would this be a separate
> step?

Thai spell checking is already available. I have no idea how well it
works though.