[vim/vim] Properly break CJK lines (#3875)


依云

Jan 28, 2019, 7:05:07 AM
to vim/vim, Subscribed

This patch makes Vim break lines according to the rules for CJK punctuation (for example, a comma should not start a line).

This is an old patch. It was not written by me but by yswzing, who did not intend to submit it upstream but has agreed to let me do so. The original patch was posted on a Chinese forum.

I've been using this patch for years, but I'm not familiar with the code. I hope Vim can include it, but I may not be able to answer some questions about the code.


You can view, comment on, or merge this pull request online at:

  https://github.com/vim/vim/pull/3875

Commit Summary

  • cjk linebreak patch


You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub

依云

Jan 28, 2019, 8:00:46 AM
to vim/vim, Push

@lilydjwg pushed 1 commit.

  • ba29015 Add a simple test for CJK linebreaks



Codecov

Jan 28, 2019, 8:24:24 AM
to vim/vim, Subscribed

Codecov Report

Merging #3875 into master will decrease coverage by <.01%.
The diff coverage is 71.91%.

@@            Coverage Diff             @@
##           master    #3875      +/-   ##
==========================================
- Coverage   78.76%   78.76%   -0.01%
==========================================
  Files         104      104
  Lines      141919   142002      +83
==========================================
+ Hits       111789   111852      +63
- Misses      30130    30150      +20

Impacted Files       Coverage Δ
src/ops.c            83.53% <16.66%> (-0.13%) ⬇️
src/mbyte.c          65.44% <62.22%> (-0.1%) ⬇️
src/edit.c           85.53% <92.1%> (+0.12%) ⬆️
src/if_xcmdsrv.c     84.2% <0%> (-0.18%) ⬇️
src/os_unix.c        58.81% <0%> (-0.14%) ⬇️
src/window.c         83.36% <0%> (-0.04%) ⬇️
src/gui.c            58.05% <0%> (ø) ⬆️
src/message.c        76.5% <0%> (+0.04%) ⬆️
src/gui_gtk_x11.c    48.42% <0%> (+0.14%) ⬆️
... and 1 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 26d9821...ba29015.

Bram Moolenaar

Jan 31, 2019, 9:08:30 AM
to vim/vim, Subscribed

The patch generally looks OK. But the testing is inadequate. It would be good if someone who knows how the rules should work adds more testing.
The dbcs_ functions are either unfinished or pointless, since they always return the same value.

依云

Feb 8, 2019, 3:09:27 AM
to vim/vim, Subscribed

I can add more tests. However, I'm not very clear about those dbcs_ functions. They are incomplete and seem to be used when 'encoding' is not utf-8. (I've tested that it works with 'fileencoding' set to gbk, even though the comments above #define DBCS_ mention 'fileencoding'.)

If so, is it OK for this to work only in UTF-8 mode? There are a lot of issues with UI or plugins when 'encoding' is not utf-8 anyway.

Bram Moolenaar

Feb 10, 2019, 4:14:51 PM
to vim...@googlegroups.com, 依云

> I can add more tests.

Thanks.

> However, I'm not very clear about those `dbcs_`
> functions. They are incomplete and seem to be used when 'encoding' is
> not utf-8. (I've tested that it works with 'fileencoding' set to gbk,
> even though the comments above `#define DBCS_` mention 'fileencoding'.)

These double-byte encodings all have characters that exist in the
Unicode space, since they can be converted to Unicode and back. Thus
it's just a matter of finding a way to know which Unicode character is
used in which DBCS encoding.

> If so, is it OK for this to work only in UTF-8 mode? There are a lot
> of issues with UI or plugins when 'encoding' is not utf-8 anyway.

I know quite a few users in Asia still use DBCS encodings, for various
reasons. Thus it would be really nice if we can make this work. But we
could do it in two steps: first implement it for Unicode, and then have
someone knowledgeable about a specific DBCS encoding do a follow-up CL.
For now, we can make those dbcs functions a no-op.
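
The two-step plan above could be sketched roughly as follows. This is only an illustration, not code from the patch: `enc_kind_T`, `unicode_allow_break`, `dbcs_allow_break` and `allow_break_for` are hypothetical names, the Unicode rule shown is a single placeholder check, and reading "no-op" as "never forbid a break" is an assumption.

```c
#include <stdbool.h>

// Hypothetical encoding tag; Vim itself tracks 'encoding' differently.
typedef enum { ENC_UTF8, ENC_DBCS } enc_kind_T;

// Step 1: rules for Unicode.  Only one placeholder rule is shown:
// an ASCII or fullwidth comma must not start a line.
static bool
unicode_allow_break(int cc, int ncc)
{
    (void)cc;  // the placeholder rule only looks at the next character
    return ncc != 0x002c && ncc != 0xff0c;
}

// Step 2 placeholder: no DBCS rules yet, so never forbid a break.
// (Assumption: this is what "no-op" would mean in practice.)
static bool
dbcs_allow_break(int cc, int ncc)
{
    (void)cc;
    (void)ncc;
    return true;
}

// Dispatch on the encoding; a later change could replace the DBCS stub.
static bool
allow_break_for(enc_kind_T enc, int cc, int ncc)
{
    return enc == ENC_UTF8 ? unicode_allow_break(cc, ncc)
                           : dbcs_allow_break(cc, ncc);
}
```

With this split, DBCS users keep the old behaviour until someone fills in the stub, while UTF-8 users get the new rules right away.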


--
Why is "abbreviation" such a long word?

/// Bram Moolenaar -- Br...@Moolenaar.net -- http://www.Moolenaar.net \\\
/// sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
\\\ an exciting new programming language -- http://www.Zimbu.org ///
\\\ help me help AIDS victims -- http://ICCF-Holland.org ///

Anton Kochkov

May 28, 2019, 9:06:44 AM
to vim/vim, Subscribed

Is this still active? Would be nice to have this one eventually.

Bram Moolenaar

Sep 10, 2019, 4:51:14 PM
to vim/vim, Subscribed

So how about dropping the dbcs_ part and adding some better testing?

依云

Sep 12, 2019, 8:11:05 AM
to vim/vim, Push

@lilydjwg pushed 1 commit.

  • a245ebf fix merge error and some compiler warnings



依云

Sep 12, 2019, 8:48:20 AM
to vim/vim, Push

@lilydjwg pushed 1 commit.

  • c91522e test_cjk_linebreak: set scriptencoding to utf-8



依云

Sep 12, 2019, 9:31:33 AM
to vim/vim, Push

@lilydjwg pushed 1 commit.



依云

Sep 12, 2019, 9:54:23 AM
to vim/vim, Push

@lilydjwg pushed 1 commit.

  • 34e06a2 cjk_linebreak: still fix for single-width punctuations



依云

Sep 16, 2019, 12:57:31 PM
to vim/vim, Subscribed

Hi Bram, I'm done with the updates for this patch. Could you take a look?

Shidong Wang

Sep 17, 2019, 9:57:43 AM
to vim_dev, vim/vim, Subscribed
Awesome work. I also need this patch for writing Chinese documents. Thanks!


--
Best regards,
wsdjeg
--
--
You received this message from the "vim_dev" maillist.
Do not top-post! Type your reply below the text you are replying to.
For more information, visit http://www.vim.org/maillist.php

---
You received this message because you are subscribed to the Google Groups "vim_dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to vim_dev+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/vim_dev/vim/vim/pull/3875/c531864473%40github.com.

依云

Oct 5, 2019, 1:55:45 AM
to vim/vim, Push

@lilydjwg pushed 9 commits.

  • 28bad33 cjk linebreak patch
  • 0f1eb88 Add a simple test for CJK linebreaks
  • 8f6d922 remove dbcs_* functions from CJK line breaking
  • d523549 add more tests for line breaking
  • 861fb7b fix merge error and some compiler warnings
  • b5fc5ca test_cjk_linebreak: set scriptencoding to utf-8
  • 9dce4f9 cjk_linebreak: fix tests
  • 1c1f239 cjk_linebreak: still fix for single-width punctuations
  • 7626d22 reapply changes to src/ops/c



Anton Kochkov

Oct 5, 2019, 2:52:20 AM
to vim/vim, Subscribed

Failed test here:

1 FAILED:
Found errors in Test_state():
function RunTheTest[40]..Test_state[32]..WaitForAssert[2]..<SNR>10_WaitForCommon[11]..<lambda>10101 line 1: Pattern 'state: mSc; mode: n' does not match 'state: mc; mode: n\[  occurs 39 times]1,1           All'
SKIPPED Test_timer_peek_and_get_char(): only works in the GUI
# without the +eval feature test_result.log is a copy of test.log

依云

Oct 5, 2019, 3:45:48 AM
to vim/vim, Push

@lilydjwg pushed 1 commit.

  • a8765a6 cjk_linebreak: test line joining too



Anton Kochkov

Oct 5, 2019, 11:33:52 PM
to vim/vim, Subscribed

Yes, now seems ready 👍



Bram Moolenaar

Oct 20, 2019, 3:34:30 PM
to vim/vim, Subscribed

What I don't understand: The ] flag is added to 'formatoptions' but it is not used in tests.
In the code the only check for it checks that it is excluded. That doesn't seem right.



依云

Dec 5, 2019, 3:48:37 AM
to vim/vim, Push

@lilydjwg pushed 1 commit.

  • 0904a4e Merge remote-tracking branch 'upstream/master' into cjk_linebreak_pr



依云

May 3, 2020, 7:35:33 AM
to vim/vim, Push

@lilydjwg pushed 1 commit.

  • bf106d5 Merge remote-tracking branch 'upstream/master' into cjk_linebreak_pr



Bram Moolenaar

May 31, 2020, 2:47:51 PM
to vim/vim, Subscribed

I still do not see a test case where the new "]" flag is added to 'formatoptions'.
If I understand it correctly, the new code is used by default and can be disabled by that flag.
Tests are failing...

The code still uses old /* */ comments.
One place has "{" after the if () instead of on the next line.



依云

Jun 4, 2020, 2:47:01 AM
to vim/vim, Push

@lilydjwg pushed 3 commits.

  • 4a9f0ad Merge remote-tracking branch 'upstream/master' into cjk_linebreak_pr
  • b73cca4 make logical operator order explicite
  • e74846a test for the 'fo' ']' option



K.Takata

Jun 4, 2020, 5:20:57 AM
to vim/vim, Subscribed

@k-takata commented on this pull request.

C99-style comments might be better?


In runtime/doc/change.txt:

> @@ -1688,6 +1688,10 @@ B	When joining lines, don't insert a space between two multi-byte
 	characters.  Overruled by the 'M' flag.
 1	Don't break a line after a one-letter word.  It's broken before it
 	instead (if possible).
+]	Respect textwidth rigorously. With this flag set, no line can be
+	longer than textwidth, unless line-break-prohibition rules make this
+	impossible.  Mainly for multi-byte scripts and work only for UTF-8
+        'encoding'.

⬇️ Suggested change
-        'encoding'.
+	'encoding'.

Tab should be used here.


In src/mbyte.c:

> @@ -3842,6 +3842,161 @@ utf_head_off(char_u *base, char_u *p)
     return (int)(p - q);
 }
 
+/*
+ * whether space is allowed before/after 'c'
+ * return  TRUE    if not allowed(eat space)

⬇️ Suggested change
- * return  TRUE    if not allowed(eat space)
+ * return  TRUE    if not allowed (eat space)

It seems that Asian people tend to forget the space before the parentheses.


In src/mbyte.c:

> @@ -3842,6 +3842,161 @@ utf_head_off(char_u *base, char_u *p)
     return (int)(p - q);
 }
 
+/*
+ * whether space is allowed before/after 'c'
+ * return  TRUE    if not allowed(eat space)
+ *	   FALSE   otherwise
+ */
+    int
+utf_eat_space(cc)
+    int		cc;
+{
+    if ((cc >= 0x2000 && cc <= 0x206F)	      /* General punctuations */
+	    || (cc >= 0x2e00 && cc <= 0x2e7f)   /* Supplemental punctuations */
+	    || (cc >= 0x3000 && cc <= 0x303f)   /* CJK symbols and punctuations */
+	    || (cc >= 0xff00 && cc <= 0xffef))  /* Full width ASCII punctuations */

U+FF00 to U+FFEF includes not only punctuations but also full width alphabets and half width katakana.
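
To make the objection concrete, here is a minimal sketch of a narrower check. It is not taken from the patch: `is_fullwidth_punct` is a hypothetical helper name, and the subranges are read off the Unicode "Halfwidth and Fullwidth Forms" chart. It accepts only the counterparts of ASCII punctuation plus the halfwidth CJK punctuation, and rejects fullwidth digits and letters, halfwidth katakana and halfwidth Hangul.

```c
#include <stdbool.h>

// Hypothetical helper (not part of the patch): true only for the
// punctuation-like code points inside U+FF00..U+FFEF.
static bool
is_fullwidth_punct(int c)
{
    return (c >= 0xff01 && c <= 0xff0f)     // fullwidth '!' .. '/'
        || (c >= 0xff1a && c <= 0xff20)     // fullwidth ':' .. '@'
        || (c >= 0xff3b && c <= 0xff40)     // fullwidth '[' .. '`'
        || (c >= 0xff5b && c <= 0xff65);    // fullwidth '{' .. halfwidth
                                            // katakana middle dot
}
```

The gaps between the ranges are exactly the characters mentioned in the comment: U+FF10 to U+FF19 are fullwidth digits, U+FF21 to U+FF3A and U+FF41 to U+FF5A are fullwidth Latin letters, and U+FF66 onwards starts the halfwidth katakana.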


In src/mbyte.c:

> +    {
+	0x0021, /* ! */
+	0x0025, /* % */
+	0x0029, /* ) */
+	0x002c, /* , */
+	0x003a, /* : */
+	0x003b, /* ; */
+	0x003e, /* > */
+	0x003f, /* ? */
+	0x005d, /* ] */
+	0x007d, /* } */
+	0x2019, /* ’ right single quotation mark */
+	0x201d, /* ” right double quotation mark */
+	0x2020, /* † dagger */
+	0x2021, /* ‡ double dagger */
+	0x2026, /* … horizontal ellipis*/

⬇️ Suggested change
-	0x2026, /* … horizontal ellipis*/
+	0x2026, /* … horizontal ellipsis */


In src/mbyte.c:

> +	0x300b, /* 》 right double angle bracket */
+	0x300d, /* 」 right corner bracket */
+	0x300f, /* 』 right white corner bracket */
+	0x3011, /* 】 right black lenticular bracket */
+	0x3015, /* 〕 right tortoise shell bracket */
+	0x3017, /* 〗 right white lenticular bracket */
+	0x3019, /* 〙 right white tortoise shell bracket */
+	0x301b, /* 〛 right white square bracket */
+	0xff01, /* ！ fullwidth exclamation mark */
+	0xff09, /* ） fullwidth right parenthesis */
+	0xff0c, /* ， fullwidth comma */
+	0xff0e, /* ． fullwidth full stop */
+	0xff1a, /* ： fullwidth colon */
+	0xff1b, /* ； fullwidth semicolon */
+	0xff1f, /* ？ fullwidth question mark */
+	0xff3d, /* ］ fullwidth right squre bracket */

⬇️ Suggested change
-	0xff3d, /* ］ fullwidth right squre bracket */
+	0xff3d, /* ］ fullwidth right square bracket */


In src/mbyte.c:

> +}
+
+/*
+ * whether line break is allowed between cc and ncc
+ * return  TRUE    if allowed
+ *	   FALSE   otherwise
+ */
+    int
+utf_allow_break(cc, ncc)
+    int		cc;
+    int		ncc;
+{
+    /* don't break between two-letter punctuations */
+    if (cc == ncc
+	    && (cc == 0x2014 /* em dash */
+		|| cc == 0x2026 /* horizontal ellipis */))

⬇️ Suggested change
-		|| cc == 0x2026 /* horizontal ellipis */))

+		|| cc == 0x2026 /* horizontal ellipsis */))
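
For readers following the review, the break rule discussed in these hunks can be approximated by the self-contained sketch below. It is not the patch itself: `allow_break_sketch`, `in_sorted_table` and `no_line_start` are hypothetical names, the table is an abbreviated subset of the code points quoted above, and treating that table as "characters that must not start a line" is an assumption based on the entries all being closing punctuation.

```c
#include <stdbool.h>
#include <stddef.h>

// Abbreviated, illustrative table of closing punctuation that should
// not start a line.  Must stay sorted for the binary search below.
static const int no_line_start[] = {
    0x0021, 0x0029, 0x002c, 0x003a, 0x003b, 0x003f, 0x005d, 0x007d,
    0x2019, 0x201d, 0x2026, 0x300b, 0x300d, 0x300f, 0x3011,
    0xff01, 0xff09, 0xff0c, 0xff0e, 0xff1a, 0xff1b, 0xff1f, 0xff3d
};

// Binary search for c in a sorted table of n code points.
static bool
in_sorted_table(const int *table, size_t n, int c)
{
    size_t lo = 0;
    size_t hi = n;

    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;

        if (table[mid] == c)
            return true;
        if (table[mid] < c)
            lo = mid + 1;
        else
            hi = mid;
    }
    return false;
}

// Return true if a line break is allowed between cc and ncc.
static bool
allow_break_sketch(int cc, int ncc)
{
    // Keep a doubled em dash or a doubled ellipsis together.
    if (cc == ncc && (cc == 0x2014 || cc == 0x2026))
        return false;

    // Do not let closing punctuation start the next line.
    if (in_sorted_table(no_line_start,
                sizeof(no_line_start) / sizeof(no_line_start[0]), ncc))
        return false;

    return true;
}
```

For example, a break between two Han characters (U+4E2D, U+6587) is allowed, while a break before a fullwidth comma (U+FF0C) is refused, so the comma stays at the end of the previous line.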



依云

Jun 4, 2020, 5:52:42 AM
to vim/vim, Push

@lilydjwg pushed 2 commits.

  • 0f3e898 typo and style fixes, narrow character ranges to actual punctuations in utf_eat_space
  • 9cde616 change all added oneline comments to //-style



依云

Jun 4, 2020, 5:53:49 AM
to vim/vim, Subscribed

@lilydjwg commented on this pull request.


In src/mbyte.c:

> @@ -3842,6 +3842,161 @@ utf_head_off(char_u *base, char_u *p)
     return (int)(p - q);
 }
 
+/*
+ * whether space is allowed before/after 'c'
+ * return  TRUE    if not allowed(eat space)
+ *	   FALSE   otherwise
+ */
+    int
+utf_eat_space(cc)
+    int		cc;
+{
+    if ((cc >= 0x2000 && cc <= 0x206F)	      /* General punctuations */
+	    || (cc >= 0x2e00 && cc <= 0x2e7f)   /* Supplemental punctuations */
+	    || (cc >= 0x3000 && cc <= 0x303f)   /* CJK symbols and punctuations */
+	    || (cc >= 0xff00 && cc <= 0xffef))  /* Full width ASCII punctuations */

I've changed it to include only the punctuation in that range.



依云

Jun 4, 2020, 5:54:20 AM
to vim/vim, Subscribed

Thanks for your quick review! (I didn't notice that there were typos...)

Bram Moolenaar

Jun 4, 2020, 12:21:45 PM
to vim/vim, Subscribed

I'll include this now and clean it up a bit. However, some of the test cases are failing; I'll add TODO items for these. Please fix!

Bram Moolenaar

Jun 4, 2020, 12:23:06 PM
to vim/vim, Subscribed

Closed #3875 via e52702f.
