[vim/vim] Not able to convert between byte index and UTF indices (PR #12216)


Yegappan Lakshmanan

Apr 1, 2023, 11:42:53 AM
to vim/vim, Subscribed

The language server protocol supports specifying offsets in text documents using UTF-8 or UTF-16 or UTF-32 code units.
The UTF-16 code unit is the default.

https://microsoft.github.io/language-server-protocol/specifications/lsp/3.17/specification/#textDocuments

Different language servers have different levels of support for using the different code units. Vim uses the UTF-32
code units for the offsets. This makes it difficult to support different language servers from a Vim LSP plugin.

Port the strutfindex() and strbyteindex() functions from Neovim to support this.
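
To make the three units concrete, here is a small Python sketch. It is illustrative only and not part of the PR; the helper name `code_unit_lengths` is made up for this example. It counts how long the same string is in UTF-8, UTF-16 and UTF-32 code units. An LSP position inside the string is an offset in one of these units, so the counts show how far the offsets can drift apart.

```python
def code_unit_lengths(text: str) -> dict:
    """Length of text measured in UTF-8, UTF-16 and UTF-32 code units."""
    return {
        "utf-8": len(text.encode("utf-8")),            # 1 byte per unit
        "utf-16": len(text.encode("utf-16-le")) // 2,  # 2 bytes per unit
        "utf-32": len(text.encode("utf-32-le")) // 4,  # 4 bytes per unit
    }


# 'a' is one unit in every encoding; each emoji needs 4 UTF-8 bytes,
# 2 UTF-16 code units (a surrogate pair) and 1 UTF-32 code unit.
print(code_unit_lengths("a😊😊"))  # {'utf-8': 9, 'utf-16': 5, 'utf-32': 3}
```

For plain ASCII text all three counts are equal, which is why offset mismatches tend to show up only on lines that contain non-ASCII characters.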

Co-authored-by: bfredl bjorn...@gmail.com


You can view, comment on, or merge this pull request online at:

  https://github.com/vim/vim/pull/12216

Commit Summary

  • efb120a Not able to convert between byte index and UTF indices

File Changes

(8 files)


Reply to this email directly, view it on GitHub.
You are receiving this because you are subscribed to this thread.
Message ID: <vim/vim/pull/12216@github.com>

codecov[bot]

Apr 1, 2023, 11:54:26 AM
to vim/vim, Subscribed

Codecov Report

Merging #12216 (efb120a) into master (39c9ec1) will decrease coverage by 0.83%.
The diff coverage is 89.70%.

@@            Coverage Diff             @@
##           master   #12216      +/-   ##
==========================================
- Coverage   81.94%   81.12%   -0.83%     
==========================================
  Files         164      154      -10     
  Lines      194103   183711   -10392     
  Branches    43830    41417    -2413     
==========================================
- Hits       159067   149032   -10035     
+ Misses      22197    21714     -483     
- Partials    12839    12965     +126     
Flag                 Coverage Δ
huge-clang-none      82.63% <91.04%> (-0.01%) ⬇️
huge-gcc-none        ?
huge-gcc-testgui     ?
huge-gcc-unittests   0.29% <0.00%> (-0.01%) ⬇️
linux                81.12% <89.70%> (-1.28%) ⬇️
mingw-x64-HUGE       ?
mingw-x86-HUGE       ?
windows              ?


Impacted Files    Coverage Δ
src/evalfunc.c    87.73% <ø> (-2.66%) ⬇️
src/mbyte.c       74.29% <88.46%> (+1.93%) ⬆️
src/strings.c     91.79% <90.47%> (-0.97%) ⬇️

... and 145 files with indirect coverage changes



Yegappan Lakshmanan

Apr 1, 2023, 10:15:18 PM
to vim/vim, Push

@yegappan pushed 1 commit.

  • 9604b6f Not able to convert between byte index and UTF indices


Yegappan Lakshmanan

Apr 5, 2023, 10:30:14 PM
to vim/vim, Push

@yegappan pushed 1 commit.

  • a91fe4b Not able to convert between byte index and UTF indices

Bram Moolenaar

Apr 12, 2023, 1:36:13 PM
to vim/vim, Subscribed


Yegappan wrote:

> The language server protocol supports specifying offsets in text
> documents using UTF-8 or UTF-16 or UTF-32 code units.
> The UTF-16 code unit is the default.
>
> https://microsoft.github.io/language-server-protocol/specifications/lsp/3.17/specification/#textDocuments
>
> Different language servers have different levels of support for using
> the different code units. Vim uses the UTF-32 code units for the
> offsets. This makes it difficult to support different language
> servers from a Vim LSP plugin.
>
> Port the strutfindex() and strbyteindex() functions from Neovim to
> support this.

I find the function names hard to read and confusing. We might be able
to think of better names when the exact functionality is described.

The terminology is confusing. "UTF-32 byte index" contradicts itself,
since each character is four bytes. I think what is meant is "UTF-32
encoded character index", which is equal to "character index", since
there is no Unicode character that takes more than one UTF-32 code
unit.

In Vim all Unicode characters are internally encoded with UTF-8. Thus
the "{string}" argument of strbyteindex() will be UTF-8 encoded. This
is also confusing. The help should be clearer about what this means
exactly. I'm not sure how, saying something like "the character index
of "{string}" if it would be encoded with UTF-32" makes it complex. I
think that instead of using "UTF-32 index" we can just use "character
index", and somewhere mention that "UTF-32" can be considered the same
(if we need to mention this at all, since the term "UTF-32" isn't widely
used).

For "UTF-16" it gets more complicated, we can't avoid mentioning that
the index applies to "{string}" encoded as UTF-16. Looking back UTF-16
should have never been made a standard IMHO, but it exists and it is
used (especially on MS-Windows), thus we need to support it.

Conversion between UTF-8 and character index already exists, you can use
charidx() and byteidx()/byteidxcomp(). Possibly we only need to add
functions to convert between UTF-8 and UTF-16 indexes? Or between
character (UTF-32) and UTF-16 indexes? The latter makes more sense.

It should also be possible to specify the handling of composing
characters. Either as an argument, like with charidx(), or using
separate functions, as with byteidx()/byteidxcomp().
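
As an illustration of why this choice matters, the Python sketch below counts the characters of a string both ways. It is an approximation under assumptions: `char_count` is a hypothetical helper, and `unicodedata.combining()` only approximates Vim's own notion of a composing character.

```python
import unicodedata


def char_count(text: str, count_composing: bool = False) -> int:
    """Count characters, optionally counting composing characters separately."""
    if count_composing:
        # Every code point counts, including combining marks.
        return len(text)
    # Combining marks are skipped, so each one is folded into the base
    # character before it. A string that starts with a combining mark is
    # not handled specially in this sketch.
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


sample = "e\u0301x"  # 'e' + COMBINING ACUTE ACCENT + 'x'
print(char_count(sample))        # 2
print(char_count(sample, True))  # 3
```

Any index conversion has to pick one of these two counting rules, which is why the functions need either a flag or a separate variant.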

--
My girlfriend told me I should be more affectionate.
So I got TWO girlfriends.

/// Bram Moolenaar -- ***@***.*** -- http://www.Moolenaar.net \\\
/// \\\
\\\ sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ ///
\\\ help me help AIDS victims -- http://ICCF-Holland.org ///


Dominique Pelle

Apr 12, 2023, 2:07:28 PM
to vim/vim, Subscribed

This feature looks related to one of my earlier posts at https://groups.google.com/g/vim_dev/c/AVpp8DT2_Vc/m/L_p6gzATBQAJ


Dominique Pelle

Apr 12, 2023, 2:08:06 PM
to vim/vim, Subscribed

@DominiquePelle-TomTom commented on this pull request.


In runtime/doc/builtin.txt:

> @@ -604,6 +606,7 @@ strptime({format}, {timestring})
 strridx({haystack}, {needle} [, {start}])
 				Number	last index of {needle} in {haystack}
 strtrans({expr})		String	translate string to make it printable
+strutfindex({expr} [, {index}])	List	byte index to utf-32 and ut-16 indices

ut-16? I assume you meant utf-16.


Dominique Pelle

Apr 12, 2023, 2:12:09 PM
to vim/vim, Subscribed

@DominiquePelle-TomTom commented on this pull request.


In runtime/doc/builtin.txt:

> @@ -8975,8 +8978,22 @@ str2nr({string} [, {base} [, {quoted}]])			*str2nr()*
 
 		Can also be used as a |method|: >
 			GetText()->str2nr()
+<
+strbyteindex({string} [, {index} [, {use_utf16}])	*strbyteindex()*
+		Convert a UTF-32 or UTF-16 {index} to a byte index. If

Sometimes the doc in the PR uses "UTF-16" and sometimes "utf-16".
Let's be consistent (the capitalized one is better IMO).


Bram Moolenaar

Apr 12, 2023, 3:11:43 PM
to vim/vim, Subscribed


> This feature looks related to one of my earlier posts at
> https://groups.google.com/g/vim_dev/c/AVpp8DT2_Vc/m/L_p6gzATBQAJ

The essential part of that post is to count characters from the start of
the file. This PR is about an index relative to the start of a string.
And also about conversion to/from UTF-16 index. Looking from the
implementation side there is not much in common.

--
In Africa some of the native tribes have a custom of beating the ground
with clubs and uttering spine chilling cries. Anthropologists call
this a form of primitive self-expression. In America we call it golf.


/// Bram Moolenaar -- ***@***.*** -- http://www.Moolenaar.net \\\
/// \\\
\\\ sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ ///
\\\ help me help AIDS victims -- http://ICCF-Holland.org ///


Shane-XB-Qian

Apr 12, 2023, 7:24:12 PM
to vim/vim, Subscribed

> This feature looks related to one of my earlier posts at https://groups.google.com/g/vim_dev/c/AVpp8DT2_Vc/m/L_p6gzATBQAJ

This is for LSP implementations. The default position encoding of an LSP server is UTF-16, so some symbols (for example, those containing characters that take more than one UTF-16 code unit) may be located incorrectly on the client unless Vim itself provides functions such as the ones in this PR.


Yegappan Lakshmanan

Apr 13, 2023, 12:52:08 AM
to vim...@googlegroups.com, reply+ACY5DGAXCPIE7Q5SMV...@reply.github.com, vim/vim, Subscribed
Hi Bram,

What about introducing a function that converts a character index in a string
to a UTF-16 index?

utf16idx({string}, {idx} [, {countcc}])

This is similar to the existing charidx() function.  The "idx" here specifies
the character index in {string} and this function returns the corresponding
UTF-16 index.

To convert from a UTF-16 index to a character index, we can either introduce
a new function or modify the existing charidx() function to accept an additional
boolean argument.  If this argument is specified, then {idx} is a UTF-16 index
instead of a byte index.  If we are going with a new function for this, what
do you think about naming the function as utf16tocharidx()?

- Yegappan

Yegappan Lakshmanan

Apr 14, 2023, 12:31:55 AM
to vim/vim, vim-dev ML, Push

@yegappan pushed 1 commit.

  • 0fc119c Add the utf16idx() function and add UTF-16 flag to the byteidx() and byteidxcomp() functions

Yegappan Lakshmanan

Apr 14, 2023, 12:32:13 AM
to vim/vim, vim-dev ML, Push

@yegappan pushed 2 commits.

  • 0c79e56 Not able to convert between byte index and UTF indices
  • 3f269e6 Add the utf16idx() function and add UTF-16 flag to the byteidx() and byteidxcomp() functions

Yegappan Lakshmanan

Apr 14, 2023, 12:52:11 AM
to vim/vim, vim-dev ML, Push

@yegappan pushed 1 commit.

Yegappan Lakshmanan

Apr 14, 2023, 12:56:24 AM
to vim...@googlegroups.com, reply+ACY5DGAXCPIE7Q5SMV...@reply.github.com, vim/vim, Subscribed
Hi Bram,

I have updated the PR to add the utf16idx() function and introduced an optional
UTF-16 flag to the byteidx() and byteidxcomp() functions.

- Yegappan
 

Bram Moolenaar

Apr 14, 2023, 3:55:39 PM
to vim...@googlegroups.com, vim-dev ML

Yegappan wrote:

> I have updated the PR to add the utf16idx() function and introduced an
> optional UTF-16 flag to the byteidx() and byteidxcomp() functions.

Hmm, then when converting a UTF-16 index to a character index one would
need to use byteidx() plus charidx(). Not ideal. See my other message
for a charidx() variant that does this in one step.

Would it be needed to convert a UTF-16 index into a byte index?
Depends on what it is used for, some functions work with a byte index,
others with a character index. So either we have functions for both, or
there need to be two function calls. I can't say I have a clear
preference for either.

--
User: I'm having problems with my text editor.
Help desk: Which editor are you using?
User: I don't know, but it's version VI (pronounced: 6).
Help desk: Oh, then you should upgrade to version VIM (pronounced: 994).

/// Bram Moolenaar -- Br...@Moolenaar.net -- http://www.Moolenaar.net \\\

vim-dev ML

Apr 14, 2023, 3:55:57 PM
to vim/vim, vim-dev ML, Your activity


Yegappan wrote:

> > > The language server protocol supports specifying offsets in text
> > > documents using UTF-8 or UTF-16 or UTF-32 code units.
> > > The UTF-16 code unit is the default.
> > >
> > >
> > https://microsoft.github.io/language-server-protocol/specifications/lsp/3.17/specification/#textDocuments
> > >
> > > Different language servers have different levels of support for using
> > > the different code units. Vim uses the UTF-32 code units for the
> > > offsets. This makes it difficult to support different language
> > > servers from a Vim LSP plugin.
> > >
> > > Port the strutfindex() and strbyteindex() functions from Neovim to
> > > support this.
> >
> > I find the function names hard to read and confusing. We might be able
> > to think of better names when the exact functionality is described.
> >
> > The terminology is confusing. "UTF-32 byte index" contradicts itself,
> > since each character is four bytes. I think what is meant is "UTF-32
> > encoded character index", which is equal to "character index", since
> > there is no Unicode character that takes more than one UTF-32 code
> > unit.
> >
> > In Vim all Unicode characters are internally encoded with UTF-8. Thus
> > the "{string}" argument of strbyteindex() will be UTF-8 encoded. This
> > is also confusing. The help should be clearer about what this means
> > exactly. I'm not sure how, saying something like "the character index
> > of "{string}" if it would be encoded with UTF-32" makes it complex. I
> > think that instead of using "UTF-32 index" we can just use "character
> > index", and somewhere mention that "UTF-32" can be considered the same
> > (if we need to mention this at all, since the term "UTF-32" isn't widely
> > used).
> >
> > For "UTF-16" it gets more complicated, we can't avoid mentioning that
> > the index applies to "{string}" encoded as UTF-16. Looking back UTF-16
> > should have never been made a standard IMHO, but it exists and it is
> > used (especially on MS-Windows), thus we need to support it.
> >
> > Conversion between UTF-8 and character index already exists, you can use
> > charidx() and byteidx()/byteidxcomp(). Possibly we only need to add
> > functions to convert between UTF-8 and UTF-16 indexes? Or between
> > character (UTF-32) and UTF-16 indexes? The latter makes more sense.
>
> What about introducing a function that converts a character index in a
> string to a UTF-16 index?
>
> utf16idx({string}, {idx} [, {countcc}])
>
> This is similar to the existing charidx() function. The "idx" here
> specifies the character index in {string} and this function returns
> the corresponding UTF-16 index.

charidx() converts a byte index of a UTF-8 encoded string to a
character index. This can't simply be changed to UTF-16, since we don't
support UTF-16 encoded strings. We could (pretend to) convert the
string to UTF-16 and then apply {idx}. But that is doing the opposite
of what you suggested.


> To convert from a UTF-16 index to a character index, we can either introduce
> a new function or modify the existing charidx() function to accept an
> additional boolean argument. If this argument is specified, then
> {idx} is a UTF-16 index instead of a byte index. If we are going with
> a new function for this, what do you think about naming the function
> as utf16tocharidx()?

The function still returns a character index, thus using "charidx" with
something appended works better. At least then they sort next to each
other.

For the other direction an equivalent to byteidx(). That could be
utf16idx() perhaps.

--
ARTHUR: What does it say?
BROTHER MAYNARD: It reads ... "Here may be found the last words of Joseph of
Aramathea." "He who is valorous and pure of heart may find
the Holy Grail in the aaaaarrrrrrggghhh..."
ARTHUR: What?
BROTHER MAYNARD: "The Aaaaarrrrrrggghhh..."
"Monty Python and the Holy Grail" PYTHON (MONTY) PICTURES LTD

/// Bram Moolenaar -- ***@***.*** -- http://www.Moolenaar.net \\\

/// \\\
\\\ sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ ///
\\\ help me help AIDS victims -- http://ICCF-Holland.org ///


Yegappan Lakshmanan

Apr 16, 2023, 1:53:39 PM
to vim/vim, vim-dev ML, Push

@yegappan pushed 1 commit.

  • 8e387d5 Add support for converting from byte or character index in a string to UTF-16 index and vice versa

Yegappan Lakshmanan

Apr 16, 2023, 1:58:26 PM
to vim...@googlegroups.com, vim-dev ML
Hi Bram,

On Fri, Apr 14, 2023 at 12:55 PM Bram Moolenaar <Br...@moolenaar.net> wrote:
>
>
> Yegappan wrote:
>
> > I have updated the PR to add the utf16idx() function and introduced an
> > optional UTF-16 flag to the byteidx() and byteidxcomp() functions.
>
> Hmm, then when converting a UTF-16 index to a character index one would
> need to use byteidx() plus charidx(). Not ideal. See my other message
> for a charidx() variant that does this in one step.
>
> Would it be needed to convert a UTF-16 index into a byte index?
> Depends on what it is used for, some functions work with a byte index,
> others with a character index. So either we have functions for both, or
> there need to be two function calls. I can't say I have a clear
> preference for either.
>

I have updated the PR to support conversion from a byte or character index
in a string to a UTF-16 index and vice versa.

A summary of these functions is below:

byteidx()
byteidxcomp()
    Convert from a character or UTF-16 index to a byte index.
charidx()
    Convert from a byte or UTF-16 index to a character index.
utf16idx()
    Convert from a byte or character index to a UTF-16 index.
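
The last direction in the summary, from a character index to a UTF-16 index, comes down to adding up UTF-16 code units. A minimal Python sketch follows, assuming no composing characters; `utf16_index_from_char` is a hypothetical name used for illustration and is not how the PR implements utf16idx().

```python
def utf16_units(ch: str) -> int:
    """Number of UTF-16 code units needed for one code point."""
    return 2 if ord(ch) > 0xFFFF else 1


def utf16_index_from_char(text: str, char_idx: int) -> int:
    """UTF-16 index at which character number char_idx starts.

    Out-of-range indexes are not checked in this sketch.
    """
    return sum(utf16_units(ch) for ch in text[:char_idx])


print(utf16_index_from_char("a😊😊", 1))  # 1
print(utf16_index_from_char("a😊😊", 2))  # 3
```

Starting from a byte index instead would work the same way after first finding which character the byte belongs to.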

Regards,
Yegappan

Yegappan Lakshmanan

Apr 16, 2023, 4:20:21 PM
to vim/vim, vim-dev ML, Push

@yegappan pushed 2 commits.

  • 87c7f0f Add the utf16idx() function and add UTF-16 flag to the byteidx() and byteidxcomp() functions
  • 84147e3 Add support for converting from byte or character index in a string to UTF-16 index and vice versa

Yegappan Lakshmanan

Apr 16, 2023, 9:44:54 PM
to vim/vim, vim-dev ML, Push

@yegappan pushed 1 commit.

Yegappan Lakshmanan

Apr 18, 2023, 11:39:56 PM
to vim/vim, vim-dev ML, Push

@yegappan pushed 1 commit.

  • 9f3457c Add addtional tests and the strutf16len() function

Yegappan Lakshmanan

Apr 18, 2023, 11:57:51 PM
to vim/vim, vim-dev ML, Push

@yegappan pushed 2 commits.

  • 5fb8cb9 Add the utf16idx() function and add UTF-16 flag to the byteidx(), byteidxcomp() and charidx() functions
  • c011dbc Add addtional tests and the strutf16len() function

Yegappan Lakshmanan

Apr 20, 2023, 11:44:20 AM
to vim/vim, vim-dev ML, Push

@yegappan pushed 1 commit.

  • f782796 Add the utf16idx() and strutf16len() functions and add UTF-16 flag to the byteidx(), byteidxcomp() and charidx() functions

Bram Moolenaar

Apr 20, 2023, 2:38:54 PM
to vim...@googlegroups.com, Yegappan Lakshmanan

Yegappan wrote:

> @yegappan pushed 2 commits.
>
> 87c7f0f888bd61604659930276973374dc408e92 Add the utf16idx() function
> and add UTF-16 flag to the byteidx() and byteidxcomp() functions
> 84147e31e7f05403bfaab20ccb7689c74a87befb Add support for converting
> from byte or character index in a string to UTF-16 index and vice
> versa

This looks like the right way to do this, but I find the help a bit
difficult to interpret. I hope others, especially those who want to use
the functionality, have a good look and make comments if something is
missing or unclear.

For byteidx() there is an extra argument, which, when TRUE, makes the
{nr} argument used differently:

When {utf16} is TRUE, {nr} is used as the UTF-16 index in the
String {expr} instead of as the character index.

The first thing that is unclear: what is "the UTF-16 index"? In the
context of the discussion we had I can understand it is the index in the
string when it is encoded with UTF-16, thus with 16 bit words. This
should be explained better. I do not expect many to understand what
UTF-16 encoding means.

The examples are supposed to help understand this:

echo byteidx('a😊😊', 2) returns 5
echo byteidx('a😊😊', 2, 1) returns 1

However, this raises questions: why does the second call return 1?

For the first call I can compute the result: when {nr} is 2 then the
index of the third character is returned, thus the bytes of the first
two characters are added together. These are 1 and 4, total 5. You can
see the second character is 4 bytes by using "g8" on it.

With the second call the second character would take two UTF-16 words.
With {nr} being 2 we refer to the third UTF-16 word, thus halfway through the
second character. This is apparently rounded down and only the one byte
for "a" is counted.

This rounding down is new, it should be explained. Perhaps adding this
explanation of how the two examples work is sufficient. But it would be
good to add a third call that is more likely to happen:

echo byteidx('a😊😊', 3, 1) returns 5

This refers to the same character as the first call, thus has the same
return value. This also makes clear (esp. for those who don't know
UTF-16 well) that a character can consist of two words.
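
The three calls above can be traced with a short Python model. This is only a sketch of the arithmetic as described in this message, not Vim's C code: `byte_index` is a hypothetical stand-in for byteidx(), it ignores composing characters, and its handling of out-of-range indexes (the string length when the index is exactly one past the end, otherwise -1) is an assumption.

```python
def utf16_units(ch: str) -> int:
    """Number of UTF-16 code units needed for one code point."""
    return 2 if ord(ch) > 0xFFFF else 1


def byte_index(text: str, nr: int, utf16: bool = False) -> int:
    """Byte offset of the character that index nr falls in.

    nr is a character index, or a UTF-16 index when utf16 is True.
    An index in the middle of a surrogate pair is rounded down to the
    start of that character.
    """
    if nr < 0:
        return -1  # assumption: negative indexes are rejected
    byte_off = 0
    pos = 0  # index (in the chosen unit) where the current character starts
    for ch in text:
        width = utf16_units(ch) if utf16 else 1
        if nr < pos + width:
            return byte_off
        pos += width
        byte_off += len(ch.encode("utf-8"))
    return byte_off if nr == pos else -1


print(byte_index("a😊😊", 2))        # 5
print(byte_index("a😊😊", 2, True))  # 1
print(byte_index("a😊😊", 3, True))  # 5
```

The second call lands on the low half of the first emoji's surrogate pair, so the comparison `nr < pos + width` matches that emoji and its starting byte offset of 1 is returned.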

For charidx() there is this example:

echo charidx('a😊😊', 4, 0, 1) returns 3

I would think that index 4 is halfway through the third character, thus I would
expect a return value of 2. Am I wrong?

--
"I know that there are people who don't love their fellow man,
and I hate those people!" - Tom Lehrer

/// Bram Moolenaar -- Br...@Moolenaar.net -- http://www.Moolenaar.net \\\

Yegappan Lakshmanan

Apr 21, 2023, 12:18:55 AM
to vim/vim, vim-dev ML, Push

@yegappan pushed 1 commit.

Yegappan Lakshmanan

Apr 21, 2023, 12:28:42 AM
to vim...@googlegroups.com, Yegappan Lakshmanan
Hi Bram,

On Thu, Apr 20, 2023 at 11:38 AM Bram Moolenaar <Br...@moolenaar.net> wrote:
>
>
> Yegappan wrote:
>
> > @yegappan pushed 2 commits.
> >
> > 87c7f0f888bd61604659930276973374dc408e92 Add the utf16idx() function
> > and add UTF-16 flag to the byteidx() and byteidxcomp() functions
> > 84147e31e7f05403bfaab20ccb7689c74a87befb Add support for converting
> > from byte or character index in a string to UTF-16 index and vice
> > versa
>
> This looks like the right way to do this, but I find the help a bit
> difficult to interpret. I hope others, especially those who want to use
> the functionality, have a good look and make comments if something is
> missing or unclear.
>

These functions are mostly useful for LSP plugin developers. I am going to
use them in the Vim9 LSP plugin. Hopefully other LSP plugin authors can comment
on these functions.

>
> For byteidx() there is an extra argument, which, when TRUE, makes the
> {nr} argument used differently:
>
> When {utf16} is TRUE, {nr} is used as the UTF-16 index in the
> String {expr} instead of as the character index.
>
> The first thing that is unclear: what is "the UTF-16 index"? In the
> context of the discussion we had I can understand it is the index in the
> string when it is encoded with UTF-16, thus with 16 bit words. This
> should be explained better. I do not expect many to understand what
> UTF-16 encoding means.
>

I have updated the help text. Let me know if this needs to be expanded further.

>
> The examples are supposed to help understand this:
>
> echo byteidx('a😊😊', 2) returns 5
> echo byteidx('a😊😊', 2, 1) returns 1
>
> However, this raises questions: why does the second call return 1?
>

The byteidx() function returns the index of the first byte in a character
(as you have mentioned below). In the second call, the specified UTF-16
index refers to the second UTF-16 code unit of the second character in
the string, so the index of that character's first byte (1) is returned.

>
> For the first call I can compute the result: when {nr} is 2 then the
> index of the third character is returned, thus the bytes of the first
> two characters are added together. These are 1 and 4, total 5. You can
> see the second character is 4 bytes by using "g8" on it.
>
> With the second call the second character would take two UTF-16 words.
> With {nr} being 2 we refer to the third UTF-16 word, thus halfway through the
> second character. This is apparently rounded down and only the one byte
> for "a" is counted.
>

Yes.

>
> This rounding down is new, it should be explained. Perhaps adding this
> explanation of how the two examples work is sufficient. But it would be
> good to add a third call that is more likely to happen:
>
> echo byteidx('a😊😊', 3, 1) returns 5
>
> This refers to the same character as the first call, thus has the same
> return value. This also makes clear (esp. for those who don't know
> UTF-16 well) that a character can consist of two words.
>

I have updated the help with this example and added a note about the
round-down.

>
> For charidx() there is this example:
>
> echo charidx('a😊😊', 4, 0, 1) returns 3
>
> I would think that index 4 is halfway through the third character, thus I would
> expect a return value of 2. Am I wrong?
>

Good catch. The example is wrong. It does return 2. I have updated the help.
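
The corrected result can be checked with a small Python model of the UTF-16 to character index conversion. This is a sketch under assumptions: `char_index_from_utf16` is a hypothetical stand-in for charidx() with the UTF-16 flag set, composing characters are ignored, and the results for out-of-range indexes are an assumption.

```python
def char_index_from_utf16(text: str, idx: int) -> int:
    """Character index of the character that UTF-16 index idx falls in."""
    if idx < 0:
        return -1  # assumption: negative indexes are rejected
    pos = 0  # UTF-16 index where the current character starts
    for char_idx, ch in enumerate(text):
        width = 2 if ord(ch) > 0xFFFF else 1  # surrogate pair or single unit
        if idx < pos + width:
            return char_idx
        pos += width
    # Assumption: one past the end maps to the character count.
    return len(text) if idx == pos else -1


print(char_index_from_utf16("a😊😊", 3))  # 2
print(char_index_from_utf16("a😊😊", 4))  # 2
```

UTF-16 indexes 3 and 4 are the two halves of the second emoji's surrogate pair, so both map to character index 2.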

Regards,
Yegappan

Yegappan Lakshmanan

Apr 21, 2023, 12:07:33 PM
to vim/vim, vim-dev ML, Push

@yegappan pushed 1 commit.

Yegappan Lakshmanan

Apr 21, 2023, 9:38:57 PM
to vim/vim, vim-dev ML, Push

@yegappan pushed 1 commit.

Yegappan Lakshmanan

Apr 23, 2023, 10:26:28 AM
to vim/vim, vim-dev ML, Push

@yegappan pushed 1 commit.

  • 67ea267 Add the utf16idx() and strutf16len() functions and add UTF-16 flag to the byteidx(), byteidxcomp() and charidx() functions

Bram Moolenaar

Apr 24, 2023, 4:10:27 PM
to vim/vim, vim-dev ML, Comment

Closed #12216 via 67672ef.



Bram Moolenaar

May 2, 2023, 7:39:13 PM
to vim...@googlegroups.com, Yegappan Lakshmanan, reply+ACY5DGAXCPIE7Q5SMV...@reply.github.com

[resend, picky postmaster refused the message]


