explanation for MB_MAXBYTES value ?


Matt Anonyme

Jun 5, 2020, 8:21:33 PM
to vim_dev
hi,

I am trying to hack on vim's codebase but there is something I don't get: the value of
MB_MAXBYTES, defined at:
https://github.com/vim/vim/blob/c17e66c5c0acd5038f1eb3d7b3049b64bb6ea30b/src/vim.h#L1771

Here is the description:
====
/// Maximum number of bytes in a multi-byte character. It can be one 32-bit
/// character of up to 6 bytes, or one 16-bit character of up to three bytes
/// plus six following composing characters of three bytes each.
#define MB_MAXBYTES 21
====

I understand that 3 + 6 * 3 = 21, but I don't see how one could input a multibyte character of 21 bytes. In what encoding, or in what way, is that possible?

Cheers

Tony Mechelynck

Jun 5, 2020, 9:22:12 PM
to vim_dev
Well, originally Unicode codepoints were foreseen as possibly someday extending from U+0000 to U+7FFFFFFF; UTF-32 (aka UCS-4) and UTF-8 can address that, and Vim too; but UCS-2 could only address the BMP (the Basic Multilingual Plane), i.e. up to U+FFFF. Later UCS-2 was expanded to UTF-16 by means of surrogate code points, and UTF-16 can go as high as U+10FFFF but no higher, so the authorities responsible for Unicode decided that no codepoint higher than U+10FFFF would ever be given a value, or indeed considered valid. Now the earlier maximum, U+7FFFFFFF, is represented by the hex bytes FD BF BF BF BF BF (6 bytes) while the newer maximum, U+10FFFF, is represented as F4 8F BF BF (4 bytes). Since Vim goes by the earlier standard, it still reserves 6 bytes per spacing character.
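To make the byte counts concrete, here is a minimal C sketch of the original UTF-8 scheme, which allows sequences of up to 6 bytes. The function name legacy_utf8_encode is made up for this illustration and is not a function from Vim's sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Encode code point `cp` (at most 0x7FFFFFFF) with the original UTF-8
 * scheme, which allows sequences of up to 6 bytes.
 * Writes the bytes to `out` and returns the number of bytes written.
 * Hypothetical helper for illustration only; not Vim's own code.
 */
static size_t legacy_utf8_encode(uint32_t cp, unsigned char out[6])
{
    /* Largest code point that fits in a sequence of 1..6 bytes. */
    static const uint32_t limit[6] = {
        0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF
    };
    /* Marker bits of the lead byte for each sequence length. */
    static const unsigned char lead[6] = {
        0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC
    };
    size_t len = 0;

    assert(cp <= 0x7FFFFFFF);
    while (cp > limit[len])
        len++;
    len++;                      /* len is now the byte count, 1..6 */

    /* Fill the continuation bytes from the end, 6 payload bits each. */
    for (size_t i = len - 1; i > 0; i--) {
        out[i] = (unsigned char)(0x80 | (cp & 0x3F));
        cp >>= 6;
    }
    /* Whatever bits remain go into the lead byte. */
    out[0] = (unsigned char)(lead[len - 1] | cp);
    return len;
}
```

Called with 0x7FFFFFFF this returns 6 and yields FD BF BF BF BF BF; called with 0x10FFFF it returns 4 and yields F4 8F BF BF. Anything in the BMP (up to U+FFFF) needs at most 3 bytes.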

But this is not all. Unicode also knows combining characters, which occupy no space by themselves but are printed on top of the previous codepoint, sometimes modifying its shape (think of accents, underlines, overlines, etc.). Each of these also gets its own codepoint, and there may be several on a single spacing character. The 'maxcombine' option, which can range from 0 to 6 (default 2) defines how many Vim will accept. Arabic can usually print even the most complex vocalised Coranic text with no more than 2 combining characters per spacing character, Hebrew may require 4 in some cases, so Vim took some safety margin and allows up to 6.

But why only three bytes for each combining character? Well, 3 UTF-8 bytes can address everything in the BMP (i.e. U+0000 to U+FFFF) and I suppose that it is not foreseen to have combining characters higher than that. Additionally, as I read the code you quoted, Vim assumes that only BMP spacing characters (U+0000 to U+FFFF) will ever need combining characters, so we arrive at either 6 bytes for a spacing character no matter how high with no combining characters, or 7 times 3 for one spacing character plus up to 6 combining characters, all of them in the BMP.
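The arithmetic can be written out as a small C sketch. The enum names and the function max_cell_bytes are invented for this illustration; only the numbers 6, 3 and 6 come from the comment above MB_MAXBYTES.

```c
/* Byte budgets taken from the comment above MB_MAXBYTES. */
enum {
    MAX_BYTES_32BIT_CHAR = 6,   /* one character up to U+7FFFFFFF */
    MAX_BYTES_BMP_CHAR   = 3,   /* one character up to U+FFFF */
    MAX_COMPOSING        = 6    /* upper limit of 'maxcombine' */
};

/* Hypothetical helper: worst-case number of bytes in one screen cell. */
static int max_cell_bytes(void)
{
    /* Case 1: a single spacing character, however high, no composing. */
    int alone = MAX_BYTES_32BIT_CHAR;
    /* Case 2: a BMP spacing character plus six BMP composing characters. */
    int composed = MAX_BYTES_BMP_CHAR + MAX_COMPOSING * MAX_BYTES_BMP_CHAR;

    return composed > alone ? composed : alone;
}
```

The second case is the larger one, 3 + 6 * 3 = 21, which matches the value of the macro.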

Best regards,
Tony.

--
--
You received this message from the "vim_dev" maillist.
Do not top-post! Type your reply below the text you are replying to.
For more information, visit http://www.vim.org/maillist.php

---
You received this message because you are subscribed to the Google Groups "vim_dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to vim_dev+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/vim_dev/92f5d462-d469-4237-b2f4-242a98b2be85o%40googlegroups.com.

Bram Moolenaar

Jun 6, 2020, 8:03:02 AM
to vim...@googlegroups.com, Matt Anonyme
We allow for six composing characters; they all go in the same screen cell.

--
Facepalm reply #3: "I had a great time in Manhattan" "I thought you were
going to New York?"

/// Bram Moolenaar -- Br...@Moolenaar.net -- http://www.Moolenaar.net \\\
/// sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
\\\ an exciting new programming language -- http://www.Zimbu.org ///
\\\ help me help AIDS victims -- http://ICCF-Holland.org ///

Matt

Jun 6, 2020, 10:18:52 AM
to Bram Moolenaar, vim...@googlegroups.com
Thanks for the detailed explanation, I would like to combine 6
characters to create one multibyte code as you mentioned here:

>But this is not all. Unicode also knows combining characters, which occupy no space by themselves but are printed on top of the previous codepoint, sometimes modifying its shape (think of accents, underlines, overlines, etc.). Each of these also gets its own codepoint, and there may be several on a single spacing character. The 'maxcombine' option, which can range from 0 to 6 (default 2) defines how many Vim will accept. Arabic can usually print even the most complex vocalised Coranic text with no more than 2 combining characters per spacing character, Hebrew may require 4 in some cases, so Vim took some safety margin and allows up to 6.

I've `set maxcombine=6`, but then what? How do I combine 4 or 6
characters into one (I'm looking for a real example, since I don't know
Arabic or Hebrew)? If someone can guide me through it, I could submit
an update to the help if that's of interest.


On Sat, Jun 6, 2020 at 14:02, Bram Moolenaar <Br...@moolenaar.net> wrote:

Tony Mechelynck

Jun 6, 2020, 5:52:22 PM
to vim_dev, Bram Moolenaar
On Sat, Jun 6, 2020 at 4:18 PM Matt <matt...@gmail.com> wrote:
>
> Thanks for the detailed explanation, I would like to combine 6
> characters to create one multibyte code as you mentioned here:
>
> >But this is not all. Unicode also knows combining characters, which occupy no space by themselves but are printed on top of the previous codepoint, sometimes modifying its shape (think of accents, underlines, overlines, etc.). Each of these also gets its own codepoint, and there may be several on a single spacing character. The 'maxcombine' option, which can range from 0 to 6 (default 2) defines how many Vim will accept. Arabic can usually print even the most complex vocalised Coranic text with no more than 2 combining characters per spacing character, Hebrew may require 4 in some cases, so Vim took some safety margin and allows up to 6.
>
> I've `set maxcombine=6`, but then what? How do I combine 4 or 6
> characters into one (I'm looking for a real example, since I don't know
> Arabic or Hebrew)? If someone can guide me through it, I could submit
> an update to the help if that's of interest.

Neither do I know of a real case where a single spacing character
would realistically accept more than one; but the principle is simple:
with 'encoding' set to utf-8, you first type the spacing character;
then you type the (first or only) combining character (either directly
if you have it on your keyboard or in the currently enabled keymap, or
else by Ctrl-V u + hex value); then repeat zero or more times for any
additional combining characters.

See
:help mbyte-keymap
:help 'keymap'
:help i_CTRL-^
:help 'iminsert'
:help i_CTRL-V_digit

The following is an example of how to use combining characters to
write a Greek upsilon with both diaeresis and accent. I think the
equivalent exists for iota, but for upsilon I couldn't find it. Even
if it exists, or even if it wouldn't actually be used in Greek text,
the following is meant to illustrate the principle of how to use more
than one combining character on a single spacing character:

0. Make sure that Vim is in Insert (or Replace) mode.
1. Type a plain Greek upsilon, either with the help of a Greek
keyboard or keymap, or with |digraphs| as Ctrl-K u *
2. Add a combining diaeresis with Ctrl-V u 0308
3. Add a combining acute accent (AFAICT Unicode has no combining
vertical accent) with Ctrl-V u 0301

You'll notice that at steps 2 and 3, the cursor doesn't move, but the
contents of the screen cell immediately to the left of the cursor
change at the end of each step.
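As a rough check of what those three steps put into the buffer, the following C sketch spells out the UTF-8 bytes of the sequence and counts its code points. The names UPSILON_DIAERESIS_ACUTE and utf8_codepoints are made up for this illustration and do not come from Vim.

```c
#include <stddef.h>

/*
 * The three code points entered in steps 1 to 3, as UTF-8 bytes.
 * Each of them happens to take 2 bytes.
 */
static const char UPSILON_DIAERESIS_ACUTE[] =
    "\xCF\x85"   /* U+03C5 GREEK SMALL LETTER UPSILON (spacing) */
    "\xCC\x88"   /* U+0308 COMBINING DIAERESIS */
    "\xCC\x81";  /* U+0301 COMBINING ACUTE ACCENT */

/*
 * Count the code points in a well-formed UTF-8 string by counting
 * every byte that is not a continuation byte (10xxxxxx).
 */
static size_t utf8_codepoints(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}
```

Here utf8_codepoints(UPSILON_DIAERESIS_ACUTE) gives 3, while the string itself is 6 bytes long: one spacing character plus two combining characters, all shown in a single screen cell and well below the 21-byte limit.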

To find the hex code for any Unicode codepoint, there are the
following resources:
http://www.unicode.org/charts/ (by script and by family of symbols)
https://www.unicode.org/charts/charindex.html (alphabetical
index by name)

Best regards,
Tony.

Matt

Jun 9, 2020, 10:30:28 AM
to vim...@googlegroups.com, Bram Moolenaar
Thanks. That was helpful. Figure 2-21 at
https://www.unicode.org/versions/Unicode13.0.0/ch02.pdf was helpful
too (the whole document is great).
I guess I also got confused by neovim's (outdated?) implementation of
Unicode; see https://github.com/vim/vim/commit/c6f93d6f6ce23a24970bcbb90b72f7cf6f5a352c

On Sat, Jun 6, 2020 at 23:52, Tony Mechelynck <antoine.m...@gmail.com> wrote: