Getting the byte index (column) given the character column number


Yegappan Lakshmanan

Nov 20, 2022, 12:05:21 PM
to vim_dev
Hi all,

The language server protocol messages use character column number whereas many
of the built-in Vim functions (e.g. matchaddpos()) deal with byte column number.

Several built-in functions were added to convert between the character and byte
column numbers (byteidx(), charcol(), charidx(), getcharpos(),
getcursorcharpos(), etc.). But these functions deal with strings, the current
cursor position or the position of a mark.

We currently don't have a function to return the byte number given the
character number in a line in a buffer. The workaround is to use getbufline()
to get the entire buffer line and then use byteidx() to get the byte number
from the character number.
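
A minimal sketch of that workaround, assuming 1-based character and byte
columns and 'encoding' set to utf-8. CharColToByteCol() is a hypothetical
helper name used only for illustration, not an existing Vim function:

```vim
" Hypothetical helper: return the 1-based byte column of the {charcol}'th
" character (1-based) in line {lnum} of buffer {buf}.
" Returns 0 when the line is unavailable or has too few characters.
function! CharColToByteCol(buf, lnum, charcol) abort
  " getbufline() always returns a List, even for a single line.
  let lines = getbufline(a:buf, a:lnum)
  if empty(lines)
    return 0
  endif
  " byteidx() is 0-based and returns -1 on failure, so -1 + 1 gives 0.
  return byteidx(lines[0], a:charcol - 1) + 1
endfunction
```

With a line made of four 3-byte characters, CharColToByteCol(bufnr('%'), 1, 3)
gives 7.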

I am thinking of introducing a new function named charcol2bytecol() that
accepts a buffer number, line number and the character number in the line and
returns the corresponding byte number. Any suggestions/comments on this?

We should also modify the matchaddpos() function to accept character numbers
in a line in addition to the byte numbers.

Regards,
Yegappan

Yegappan Lakshmanan

Nov 20, 2022, 12:13:23 PM
to vim_dev
On Sun, Nov 20, 2022 at 9:05 AM Yegappan Lakshmanan <yega...@gmail.com> wrote:
>
> Hi all,
>
> The language server protocol messages use character column number whereas many
> of the built-in Vim functions (e.g. matchaddpos()) deal with byte column number.
>
> Several built-in functions were added to convert between the character and byte
> column numbers (byteidx(), charcol(), charidx(), getcharpos(),
> getcursorcharpos(), etc,).
> But these functions deal with strings, current cursor position or the
> position of a mark.
>
> We currently don't have a function to return the byte number given the character
> number in a line in a buffer. The workaround is to use getbufline()
> to get the entire
> buffer line and then use byteidx() to get the byte number from the
> character number.
>
> I am thinking of introducing a new function named charcol2bytecol() that accepts
> a buffer number, line number and the character number in the line and
> returns the
> corresponding byte number. Any suggestions/comments on this?
>

Another alternative is to extend the col() function. The col() function
currently accepts a list with two numbers (a line number and a byte number
or "$") and returns the byte number. This can be modified to also accept a
list with three numbers (line number, column number and a boolean indicating
character column or byte column) and return the byte number.

- Yegappan

Bram Moolenaar

Nov 20, 2022, 7:04:15 PM
to vim...@googlegroups.com, Yegappan Lakshmanan
Just to make sure we understand what we are talking about: This is
always about text in a buffer? Thus the buffer text is somehow passed
through the LSP to a server, which then returns information with
character indexes.

One detail that matters: Are composing characters counted separately, or
not counted (part of the base character)?

Also, I assume a Tab is counted as just one character, not the number of
display cells it occupies.

I wonder if it's really helpful to add a new function if it can
currently be done with two. You already mention that the text can be
obtained with getbufline(), and then get the byte index from the
character index with byteidx(). What is the problem with doing it that
way?

Other message:

> Another alternative is to extend the col() function. The col()
> function currently accepts a list with two numbers (a line number and
> a byte number or "$") and returns the byte number.
> This can be modified to also accept a list with three numbers (line
> number, column number and a boolean indicating character column or
> byte column) and return the byte number.

I don't like this, the first line for the col() help is:

The result is a Number, which is the byte index of the column

When the boolean is true this would be the character index, that is hard
to explain. A user would have to look really hard to find this
functionality.

There is also charcol(), it appears to be doing what you want already.


--
hundred-and-one symptoms of being an internet addict:
92. It takes you two hours to check all 14 of your mailboxes.

/// Bram Moolenaar -- Br...@Moolenaar.net -- http://www.Moolenaar.net \\\
/// \\\
\\\ sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ ///
\\\ help me help AIDS victims -- http://ICCF-Holland.org ///

Yegappan Lakshmanan

Nov 20, 2022, 8:35:49 PM
to Bram Moolenaar, vim...@googlegroups.com
Hi Bram,

On Sun, Nov 20, 2022 at 4:04 PM Bram Moolenaar <Br...@moolenaar.net> wrote:
>
>
> Yegappan wrote:
>
> > The language server protocol messages use character column number
> > whereas many of the built-in Vim functions (e.g. matchaddpos()) deal
> > with byte column number.
> >
> > Several built-in functions were added to convert between the character
> > and byte column numbers (byteidx(), charcol(), charidx(),
> > getcharpos(), getcursorcharpos(), etc,).
> > But these functions deal with strings, current cursor position or the
> > position of a mark.
> >
> > We currently don't have a function to return the byte number given the
> > character number in a line in a buffer. The workaround is to use
> > getbufline() to get the entire buffer line and then use byteidx() to
> > get the byte number from the character number.
> >
> > I am thinking of introducing a new function named charcol2bytecol()
> > that accepts a buffer number, line number and the character number in
> > the line and returns the corresponding byte number. Any
> > suggestions/comments on this?
> >
> > We should also modify the matchaddpos() function to accept character numbers
> > in a line in addition to the byte numbers.
>
> Just to make sure we understand what we are talking about: This is
> always about text in a buffer? Thus the buffer text is somehow passed
> through the LSP to a server, which then returns information with
> character indexes.
>

Yes. The location information returned by the LSP server is about the
text in the buffer.

>
> One detail that matters: Are composing characters counted separately, or
> not counted (part of the base character)?
>

I think composing characters are not counted. But I couldn't find this
mentioned in the LSP specification:

https://microsoft.github.io/language-server-protocol/specifications/lsp/3.17/specification/#position

>
> Also, I assume a Tab is counted as just one character, not the number of
> display cells it occupies.
>

Yes. Tab is counted as one character.

>
> I wonder if it's really helpful to add a new function if it can
> currently be done with two. You already mention that the text can be
> obtained with getbufline(), and then get the byte index from the
> character index with byteidx(). What is the problem with doing it that
> way?
>

If the conversion has to be done too many times then it is not efficient.

>
> Other message:
>
> > Another alternative is to extend the col() function. The col()
> > function currently accepts a list with two numbers (a line number and
> > a byte number or "$") and returns the byte number.
> > This can be modified to also accept a list with three numbers (line
> > number, column number and a boolean indicating character column or
> > byte column) and return the byte number.
>
> I don't like this, the first line for the col() help is:
>
> The result is a Number, which is the byte index of the column
>
> When the boolean is true this would be the character index, that is hard
> to explain. A user would have to look really hard to find this
> functionality.
>

The boolean doesn't change the return value of the col() function. It just
changes how the col() function interprets the column number in the list.
If it is true, then the col() function will use the column number as the
character number. If it is false or not specified, then the col() function
will use it as the byte number. In both cases the col() function will always
return the byte index of the column.

>
> There is also charcol(), it appears to be doing what you want already.
>

The charcol() function returns the character number in a line. This function
cannot be used to get the byte index given the character index.

Regards,
Yegappan

Bram Moolenaar

Nov 21, 2022, 6:23:52 AM
to vim...@googlegroups.com, Yegappan Lakshmanan
Disappointing to not mention such an important part of the interface.
Since I do not see any mention of composing characters, I would guess
that each utf-8 character is counted separately.
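
For comparison, Vim's own string functions offer both counting conventions.
A small illustration, assuming 'encoding' is utf-8: byteidx() treats a
composing character as part of its base character, while byteidxcomp()
counts it separately.

```vim
" 'e' followed by U+0301 (combining acute accent): 1 byte + 2 bytes.
let s = 'e' .. nr2char(0x301)

" Composing character belongs to the base character, so "character 1"
" starts after both: byte index 3 (the end of the string).
echo byteidx(s, 1)

" Composing character counted on its own: character 1 is U+0301 itself,
" at byte index 1, and character 2 is the end of the string at 3.
echo byteidxcomp(s, 1)
echo byteidxcomp(s, 2)
```

Which of the two matches a given server depends on how that server counts,
which is exactly the detail in question here.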

> > Also, I assume a Tab is counted as just one character, not the number of
> > display cells it occupies.
>
> Yes. Tab is counted as one character.
>
> > I wonder if it's really helpful to add a new function if it can
> > currently be done with two. You already mention that the text can be
> > obtained with getbufline(), and then get the byte index from the
> > character index with byteidx(). What is the problem with doing it that
> > way?
>
> If the conversion has to be done too many times then it is not efficient.

How can you say that without trying? Getting the buffer line means
making a copy of the text, that's quite cheap. The only added overhead
is two function calls instead of one, which has really minimal impact in
the context of all the other things being done. Also, if there are
multiple positions in one line then getbufline() only needs to be called
once, thus performance should be very close to whatever function we
would use instead.

> > Other message:
> >
> > > Another alternative is to extend the col() function. The col()
> > > function currently accepts a list with two numbers (a line number and
> > > a byte number or "$") and returns the byte number.
> > > This can be modified to also accept a list with three numbers (line
> > > number, column number and a boolean indicating character column or
> > > byte column) and return the byte number.
> >
> > I don't like this, the first line for the col() help is:
> >
> > The result is a Number, which is the byte index of the column
> >
> > When the boolean is true this would be the character index, that is hard
> > to explain. A user would have to look really hard to find this
> > functionality.
>
> The boolean doesn't change the return value of the col() function. It just
> changes how the col() function interprets the column number in the list.
> If it is true, then the col() function will use the column number as the
> character number. If it is false or not specified, then the col() function
> will use it as the byte number. In both cases the col() function will always
> return the byte index of the column.

I was confused. Currently in the [lnum, col] value of {expr} the column
is the character offset. Since you are converting from character offset
to byte index, I don't see how you would pass the byte index here, since
you'll get the same byte index back. What would be the point in passing
[lnum, col, false]? BTW, leaving out the flag must mean using the
column number (for backwards compatibility).


> > There is also charcol(), it appears to be doing what you want already.
>
> The charcol() function returns the character number in a line. This
> function cannot be used to get the byte index given the character
> index.

But then using col() would already work without any changes...


--
hundred-and-one symptoms of being an internet addict:
95. Only communication in your household is through email.

Yegappan Lakshmanan

Nov 21, 2022, 10:50:29 AM
to Bram Moolenaar, vim...@googlegroups.com
I used the attached Vim9 script to measure the performance of
getbufline() + byteidx() compared to calling the col() function. I see
that the first one takes three times longer to get the column number
compared to the second one.

Currently in the [lnum, col] value of {expr}, the column is the byte offset.
For example, if you use multibyte characters in a line and get the column
number:

=====================================================
new
call setline(1, "\u2345\u2346\u2347\u2348")
echo col([1, 3])
=====================================================

The above script echoes 3 instead of 7. The byte index of the third
character is 7.

Regards,
Yegappan
[Attachment: profile_col.vim]

Bram Moolenaar

Nov 21, 2022, 5:17:58 PM
to vim...@googlegroups.com, Yegappan Lakshmanan
This must be because getbufline() always returns a list of strings.
Creating the list, adding a list item and then making a copy of the text
takes longer. Using getline() (just to try it out, wouldn't work in
your actual code) brings the difference down to less than two times.

Not storing the result of getbufline() in a variable, but passing it to
byteidx() with "->" also helps make it faster.

The range should be bigger, I used 10x to get more stable results. As a
rule of thumb: the profiling time should be at least 100 msec to avoid
too much fluctuation.

After making some adjustments it is now only about 16% slower.
I'll make a patch to add getbufoneline(), since just getting the string
for one line would be very common and it is about twice as fast.

The name getbufoneline() isn't nice, couldn't come up with something
better. Should have called the existing function getbuflines() instead
of getbufline(), but we can't change that now.

The resulting essential line in ProfByteIdxFunction():

idx = getbufoneline('', 5344)->byteidx(77)
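
A rough sketch of how such a comparison could be timed, assuming a Vim that
already has getbufoneline(). TimeIt() and the repetition count are
hypothetical, and this is not the attached profile_col.vim script:

```vim
" Hypothetical timing helper: call {Fn} {reps} times and return the
" elapsed time in seconds as a Float.
function! TimeIt(Fn, reps) abort
  let start = reltime()
  for i in range(a:reps)
    call a:Fn()
  endfor
  return reltimefloat(reltime(start))
endfunction

" Scratch buffer with one line of 100 three-byte characters.
new
call setline(1, repeat("\u2345", 100))

" Raise the count until each measurement takes at least ~100 msec,
" otherwise the numbers fluctuate too much to compare.
let reps = 100000
echo 'getbufline():   ' TimeIt({-> byteidx(getbufline('%', 1)[0], 77)}, reps)
echo 'getbufoneline():' TimeIt({-> byteidx(getbufoneline('%', 1), 77)}, reps)
```

Both lambdas compute the same byte index; only the way the line is fetched
differs, so any gap between the two timings comes from building the List.
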
Should really update the help to avoid the term "column number", it is
confusing. The remark "Most useful when the column is "$"" is a hint
that is easily missed.

OK, I finally see your point, sorry it took so long.

Unfortunately, adding a third argument that is a flag, indicating whether
the second argument means bytes or characters, conflicts with other
places where the third argument is "coloff". This is used with
virtcol() for example.

You also still have the limitation that col() only works for the current
buffer.

Making matchaddpos() accept a character index instead of a byte index is
going to trigger doing this in many more places. And internally the
conversion will have to be done anyway. Therefore sticking to using a
byte index in most places that deal with text avoids a lot of complexity
in the arguments of the functions.

So let's go back to making the character index to byte index conversion
fast. That is a generic solution and avoids changes all over the place.
Please try out the new getbufoneline() function, as mentioned above.

If the performance is indeed quite bad, adding a function that converts
a text location in a buffer specified by character index to a byte index
could be a solution. Perhaps:

    bufcol({buf}, {expr})             {expr} a string like with col()
    bufcol({buf}, {lnum}, {expr})     {expr} a string like with col()
    bufcol({buf}, {lnum}, {charidx})


--
hundred-and-one symptoms of being an internet addict:
100. The most exciting sporting events you noticed during summer 1996
was Netscape vs. Microsoft.

Yegappan Lakshmanan

Nov 22, 2022, 1:14:42 AM
to Bram Moolenaar, vim...@googlegroups.com
Hi Bram,
I tested the new getbufoneline() function and the performance is much
better. Thanks for adding this function.

>
> If the performance is indeed quite bad, adding a function that converts
> a text location in a buffer specified by character index to a byte index
> could be a solution. Perhaps:
>
> bufcol({buf}, {expr}) {expr} a string like with col()
> bufcol({buf}, {lnum}, {expr}) {expr} a string like with col()
> bufcol({buf}, {lnum}, {charidx})
>

For now, I think we can use the getbufoneline() and byteidx() functions.
If another use case for this comes up in the future, we can add this.

Regards,
Yegappan

Dominique Pellé

Nov 22, 2022, 2:05:22 AM
to vim...@googlegroups.com
Related to this thread, the grammar checker LanguageTool has
changed its API [1] and now defines the position of errors as:
- an offset in Unicode characters from the beginning of the
document (not from the beginning of the line! newlines \n are
counted as 1 character)
- and length in Unicode characters.

This API change breaks my LanguageTool grammar checker
plugin [2] with the latest LanguageTool.

LanguageTool API is poorly documented, but experimenting
with it, I see that combining Unicode characters such as
U+0065 + U+0301 for e-acute are counted as 2
characters.

I wonder whether Vim has suitable functions to find the
corresponding byte offset of a line/column with such input
data (i.e. Unicode character offset from start of file + Unicode
character length). At first glance, I did not see any suitable
Vim function.

Regards
Dominique

[1] https://languagetool.org/http-api/#!/default/post_check
[2] https://github.com/dpelle/vim-LanguageTool

Bram Moolenaar

Nov 22, 2022, 6:07:00 AM
to vim...@googlegroups.com, Dominique Pellé

Dominique wrote:

> Related to this thread, the grammar checker LanguageTool has
> changed its API [1] and now defines the position of errors as:
> - an offset in Unicode characters from the beginning of the
> document (not from the beginning of the line! newlines \n are
> counted as 1 character)
> - and length in Unicode characters.
>
> This API change breaks my LanguageTool grammar checker
> plugin [2] with the latest LanguageTool.
>
> LanguageTool API is poorly documented, but experimenting
> with it, I see that combining Unicode characters such as
> U+0065 + U+0301 for e-acute are counted as 2
> characters.
>
> I wonder whether vim has suitable functions() to find the
> corresponding byte offset of a line/column with such input
> data (i.e. Unicode character offset from start of file + Unicode
> character length). At first glance, I did not see any suitable
> Vim function.

There isn't one. And there probably will not be one.

We do have the byte2line() function, which works with the byte offset.
This was quite a bit of work to implement and get right. And it adds
overhead, since any change to the text requires the cached values to be
updated. Adding character offset on top of that would make it less
efficient, and since it wasn't asked for until now I expect it to be
rarely used.

I would suggest to the authors, explain that the API can't be used this
way, and that they should fix that. They could provide a setting to
either use byte or character offsets. Or at least provide the line
number, then the computation on the Vim side is likely fast enough.
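
Until then, a plugin can do the conversion itself with a linear scan. A
minimal sketch, assuming the offset is 0-based (an assumption, the API
description above does not say), 'encoding' is utf-8, composing characters
count separately, and each newline counts as one character.
CharOffsetToPos() is a hypothetical name and works on the current buffer
only:

```vim
" Hypothetical helper: map a document-wide character offset (0-based)
" to [lnum, byte column] in the current buffer.
" Returns [0, 0] when the offset lies beyond the end of the buffer.
function! CharOffsetToPos(offset) abort
  let remaining = a:offset
  for lnum in range(1, line('$'))
    let text = getline(lnum)
    " strchars() without {skipcc} counts composing characters separately.
    let nchars = strchars(text)
    if remaining <= nchars
      " byteidxcomp() is 0-based, byte columns are 1-based.
      " remaining == nchars points at the newline, i.e. col('$').
      return [lnum, byteidxcomp(text, remaining) + 1]
    endif
    " Skip this line plus its newline.
    let remaining -= nchars + 1
  endfor
  return [0, 0]
endfunction
```

This walks every line before the target, so it gets slower the further the
offset is from the start of the buffer. With a line number supplied by the
server only the byteidxcomp() call would remain.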

--
A meeting is an event at which the minutes are kept and the hours are lost.