Getting a character from a string.

Bram Moolenaar

Apr 2, 2016, 12:47:40 PM
to vim...@googlegroups.com

I (again) noticed how difficult it is to deal with multi-byte strings in
Vim script. A simple thing, getting the Nth character, isn't so simple.

One can turn the string into a list of characters:

let chars = split(mystring, '\zs')

Then an index in the list works:

let c = chars[idx]

But it's very inefficient and clumsy.

How about adding strgetchar(string, idx)?
It's not that efficient when using it often, but it's simple to use.

A more drastic solution would be to have another String format, where
each character is a number. Internally it would be an array of int.
The problem with this is that you end up with lots of conversions
between the byte string and this character string. And lots of
functions would need to be changed to accept both types.


--
Some of the well known MS-Windows errors:
EMULTI Multitasking attempted, system confused
EKEYBOARD Keyboard locked, try getting out of this one!
EXPLAIN Unexplained error, please tell us what happened
EFUTURE Reserved for our future mistakes

/// Bram Moolenaar -- Br...@Moolenaar.net -- http://www.Moolenaar.net \\\
/// sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
\\\ an exciting new programming language -- http://www.Zimbu.org ///
\\\ help me help AIDS victims -- http://ICCF-Holland.org ///

Nikolay Aleksandrovich Pavlov

Apr 2, 2016, 4:33:19 PM
to vim_dev
2016-04-02 19:47 GMT+03:00 Bram Moolenaar <Br...@moolenaar.net>:
>
> I (again) noticed how difficult it is to deal with multi-byte strings in
> Vim script. A simple thing, getting the Nth character, isn't so simple.
>
> One can turn the string into a list of characters:
>
> let chars = split(mystring, '\zs')
>
> Then an index in the list works:
>
> let c = chars[idx]
>
> But it's very inefficient and clumsy.
>
> How about adding strgetchar(string, idx)?
> It's not that efficient when using it often, but it's simple to use.
>
> A more drastic solution would be to have another String format, where
> each character is a number. Internally it would be an array of int.
> The problem with this is that you end up with lots of conversions
> between the byte string and this character string. And lots of
> functions would need to be changed to accept both types.

An array of int is very space-inefficient. I would suggest either
adding `strcharsub(string, start[, end])`, which works like
string[start] or string[start:end] but uses character offsets. It is
not identical to indexing `split(string, '\zs')`, because such a split
groups composing characters with the preceding code point.
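
To illustrate the difference between the two ways of counting, here is
a minimal sketch in C. It works on an array of already decoded code
points. The names `is_composing` and `group_start` are made up for this
example, and as a loud simplification only U+0300..U+036F are treated
as composing characters; a real implementation needs the full Unicode
tables.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplification for this sketch: only the Combining Diacritical Marks
 * block counts as composing. */
int is_composing(uint32_t cp)
{
    return cp >= 0x0300 && cp <= 0x036F;
}

/* Position in cps[] where the idx'th *grouped* character starts, i.e. a
 * base code point together with the composing characters that follow it.
 * This is the kind of item split(string, '\zs') is described as giving.
 * Returns n when idx is past the end. */
size_t group_start(const uint32_t *cps, size_t n, size_t idx)
{
    size_t i = 0;

    assert(cps != NULL || n == 0);
    while (i < n && idx > 0) {
        i++;                                  /* the base code point */
        while (i < n && is_composing(cps[i])) /* and what is attached */
            i++;
        idx--;
    }
    return i;
}
```

For the code points 'e', U+0301, 'x', plain code point indexing gives
U+0301 for index 1, while the grouped index 1 starts at 'x'.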

Or draw on experience from other languages for the new type: e.g. in
Python 3 a unicode string is something like

struct {
  enum {
    kStringTypeAscii,
    kStringTypeLatin1,
    kStringTypeUTF16,  // Only for characters up to U+FFFF
    kStringTypeUTF32,
  } type;
  size_t size;
  union {
    uint8_t *one_byte;  // ASCII or latin1
    uint16_t *utf16;
    uint32_t *utf32;
  } data;
};

(there are more optimizations, though: e.g. a cache of the string
encoded as UTF-8). When a new string is created, the type is selected
automatically based on the largest code point in the string.
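
The selection step could be sketched as below. This is only an
illustration of the idea: `StringKind`, `pick_kind` and `kind_width`
are hypothetical names that mirror the struct above, not CPython's
actual identifiers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names, following the struct sketched in this message. */
typedef enum {
    kStringTypeAscii,
    kStringTypeLatin1,
    kStringTypeUTF16,  /* only for code points up to U+FFFF */
    kStringTypeUTF32,
} StringKind;

/* Pick the narrowest storage that can hold every code point. */
StringKind pick_kind(const uint32_t *cps, size_t n)
{
    uint32_t max = 0;
    size_t i;

    assert(cps != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (cps[i] > max)
            max = cps[i];
    }
    if (max < 0x80)
        return kStringTypeAscii;
    if (max < 0x100)
        return kStringTypeLatin1;
    if (max < 0x10000)
        return kStringTypeUTF16;
    return kStringTypeUTF32;
}

/* Bytes used per character for a given kind. */
size_t kind_width(StringKind kind)
{
    switch (kind) {
    case kStringTypeAscii:
    case kStringTypeLatin1:
        return 1;
    case kStringTypeUTF16:
        return 2;
    default:
        return 4;
    }
}
```

With fixed-width storage, getting the n'th code point is a plain array
index, at the cost of one wide character making the whole string wide.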

The question, though, is what you need the n’th character for. I do
not write in Rust or Go, but based on their documentation it appears
that these modern and rather thoroughly designed languages do not have
anything like Python unicode strings.

I would say that while I sometimes need strcharsub(), it is not too
hard to work around. But when I need things like \p{…} (i.e. matching
a character with specific unicode properties: e.g. \p{East Asian
Width=A} searches for characters with ambiguous width, and
\p{Punctuation} is like [[:punct:]] but matches a lot more
characters), it is impossible to work around without invoking Python,
Perl or something similar.

I also do not remember the last time I needed the n’th character (if
ever). When I need to select a character it is usually “the character
at position n reported by col(), match(), etc.”, and there n is a byte
offset, so strcharsub() would be absolutely useless if it indexed
characters. I guess adding a strcharsub() that gets exactly the n’th
character will just introduce another possible error in addition to
`let char=line[col]`, namely `let char=strcharsub(line, col)`, which
may be even worse: the first takes *part* of the *right* multibyte
character, the second takes the *whole* of the *wrong* multibyte
character if the line contains multibyte characters.

AFAIR this point has already been discussed, and the idea was already
expressed that a function like strcharsub() should work like
`let char=nr2char(char2nr(line[col:]))` (a hack to get the character
at byte offset `col`, assuming that `line` contains only valid
&encoding characters), differing only in the handling of invalid
characters: specifically, strcharsub("\x80", 0) should return "\x80"
and not "\u0080" as my hack does.
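
The byte offset versus character index confusion can be shown with a
small C sketch. It assumes the string is UTF-8, and `utf8_len_at` and
`byte_offset_of_char` are names invented for this example. Invalid
bytes are kept as one-byte "characters", in line with the "\x80"
behaviour asked for above; overlong forms and surrogates are not
checked.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length in bytes of the character starting at s[0], with `avail` bytes
 * left in the string. An invalid or truncated sequence counts as a single
 * byte, so a lone "\x80" stays "\x80". */
size_t utf8_len_at(const char *s, size_t avail)
{
    unsigned char c;
    size_t need, i;

    assert(s != NULL);
    if (avail == 0)
        return 0;
    c = (unsigned char)s[0];
    if (c < 0x80)
        return 1;
    else if ((c & 0xE0) == 0xC0)
        need = 2;
    else if ((c & 0xF0) == 0xE0)
        need = 3;
    else if ((c & 0xF8) == 0xF0)
        need = 4;
    else
        return 1;  /* stray continuation byte or invalid lead byte */
    if (need > avail)
        return 1;  /* truncated sequence */
    for (i = 1; i < need; i++) {
        if (((unsigned char)s[i] & 0xC0) != 0x80)
            return 1;  /* missing continuation byte */
    }
    return need;
}

/* Byte offset of the idx'th character, counting code points.
 * Returns len when idx is past the end. */
size_t byte_offset_of_char(const char *s, size_t len, size_t idx)
{
    size_t off = 0;

    while (idx > 0 && off < len) {
        off += utf8_len_at(s + off, len - off);
        idx--;
    }
    return off;
}
```

For the four bytes of "aéb", a 0-based byte offset of 3 is 'b', but
character index 3 is already past the end; 'b' is character index 2.
Passing one kind of number where the other is expected silently picks
the wrong character.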

With the last suggestion, the signature of strcharsub() should be
`strcharsub(string, byte_offset[, length=1])`.
