2016-04-02 19:47 GMT+03:00 Bram Moolenaar <Br...@moolenaar.net>:
>
> I (again) noticed how difficult it is to deal with multi-byte strings in
> Vim script. A simple thing, getting the Nth character, isn't so simple.
>
> One can turn the string into a list of characters:
>
> let chars = split(mystring, '\zs')
>
> Then an index in the list works:
>
> let c = chars[idx]
>
> But it's very inefficient and clumsy.
>
> How about adding strgetchar(string, idx)?
> It's not that efficient when using it often, but it's simple to use.
>
> A more drastic solution would be to have another String format, where
> each character is a number. Internally it would be an array of int.
> The problem with this is that you end up with lots of conversions
> between the byte string and this character string. And lots of
> functions would need to be changed to accept both types.

This is very space-inefficient. I would suggest either
`strcharsub(string, start[, end])`, which works like string[start] or
string[start:end] but uses character offsets. It is not identical to
indexing `split(string, '\zs')`, because such a split groups
composing characters with the preceding code point.
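
To illustrate the difference, here is a minimal Python sketch (Python only because it is easy to run outside Vim). `grouped` is a hypothetical helper, and `unicodedata.combining()` only approximates Vim's own notion of composing characters. It contrasts plain code-point indexing with the kind of grouping that `split(string, '\zs')` performs:

```python
import unicodedata


def grouped(text):
    # Attach every combining mark to the preceding code point,
    # roughly what split(string, '\zs') yields in Vim (an approximation).
    groups = []
    for ch in text:
        if groups and unicodedata.combining(ch):
            groups[-1] += ch
        else:
            groups.append(ch)
    return groups


text = "e\u0301x"            # 'e', COMBINING ACUTE ACCENT, 'x'
by_code_point = list(text)   # what character offsets would index
by_group = grouped(text)     # what indexing the split result would index
```

Here `by_code_point[1]` is the accent alone, while `by_group[1]` is "x", so the two kinds of index disagree as soon as a composing character appears.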

Or use the experience of other languages for the new type: e.g. in
Python 3 a unicode string is something like

    struct {
      enum {
        kStringTypeAscii,
        kStringTypeLatin1,
        kStringTypeUTF16,  // Only for characters up to U+FFFF
        kStringTypeUTF32,
      } type;
      size_t size;
      union {
        uint8_t *one_byte;  // ASCII or latin1
        uint16_t *utf16;
        uint32_t *utf32;
      } data;
    };

(There are more optimizations, though: e.g. a cache for the string encoded
as UTF-8.) When a new string is created, its type is selected automatically
based on the largest code point in the string.
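
A minimal Python sketch of that selection rule, assuming the four kinds from the struct above. `pick_storage` and the returned labels are hypothetical names, and CPython's real implementation differs in detail:

```python
def pick_storage(text):
    # Return (kind, bytes_per_code_point) for the narrowest representation
    # that can hold every code point in text.
    largest = max(map(ord, text), default=0)
    if largest < 0x80:
        return ("ascii", 1)
    if largest < 0x100:
        return ("latin1", 1)
    if largest < 0x10000:
        return ("utf16", 2)   # only code points up to U+FFFF
    return ("utf32", 4)
```

Because every element of such a string has the same width, the n’th code point sits at a fixed offset (n times the width) and can be fetched without scanning.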

The question, though, is what you need the n’th character for. I do
not write in Rust or Go, but based on their documentation it appears
that these modern and rather thoroughly designed languages do not have
anything like Python's unicode strings.

I would say that while I sometimes need strcharsub(), it is not too
hard to work around. But when I need things like \p{…} (i.e. matching
characters with specific Unicode properties: e.g. \p{East Asian Width=A}
searches for characters with ambiguous width, and \p{Punctuation} is
like [[:punct:]] but matches a lot more characters), it is impossible
to work around without invoking Python, Perl or something similar.
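
For reference, the Python side of such a workaround could look roughly like the sketch below. It uses only the standard `unicodedata` module; the two function names are made up for illustration, and this is not a Vim interface:

```python
import unicodedata


def ambiguous_width(text):
    # Characters whose East Asian Width property is "A" (ambiguous).
    return [ch for ch in text if unicodedata.east_asian_width(ch) == "A"]


def punctuation(text):
    # Characters whose general category starts with "P",
    # i.e. roughly what a Punctuation property class would match.
    return [ch for ch in text if unicodedata.category(ch).startswith("P")]
```

For example, `punctuation("a!«b¿")` also picks up "«" and "¿", which an ASCII-only punctuation class would miss.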

I also do not remember the last time I needed the n’th character (if ever).
When I need to select some character, it is usually “the character at
position n reported by col(), match(), etc.”, and there n is a byte
offset, so strcharsub() would be absolutely useless if it indexed
characters. I guess adding a strcharsub() which gets exactly the n’th
character will just introduce another possible error in addition to
`let char=line[col]`: `let char=strcharsub(line, col)`, which may be
even worse (the first takes *part* of the *right* multibyte character,
the second takes the *whole* *wrong* multibyte character if the line
contains multibyte characters).

AFAIR this point was already discussed, and the idea was already
expressed that a function like strcharsub() should work like
`let char=nr2char(char2nr(line[col:]))` (a hack to get the character at
byte offset `col`, assuming that `line` contains only valid &encoding
characters), with a difference in the handling of invalid characters
(specifically, strcharsub("\x80", 0) should return "\x80" and not
"\u0080" as with my hack).

With the last suggestion, the signature of strcharsub() should be
`strcharsub(string, byte_offset[, length=1])`.
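
The three behaviours compared above can be modelled on UTF-8 bytes in a short Python sketch. All three function names are hypothetical, the length is fixed at one character, and continuation bytes are not validated, so this only illustrates the idea and is not a proposed implementation:

```python
def byte_index(line, col):
    # Model of `line[col]` in Vim script: exactly one byte.
    return line[col:col + 1]


def char_index(line, n):
    # Model of an n'th-character lookup (character offsets).
    return line.decode("utf-8")[n].encode("utf-8")


def char_at_byte(line, col):
    # Model of strcharsub(string, byte_offset): the whole character
    # that starts at byte offset col.
    lead = line[col]
    if lead < 0x80:
        length = 1
    elif lead >> 5 == 0b110:
        length = 2
    elif lead >> 4 == 0b1110:
        length = 3
    elif lead >> 3 == 0b11110:
        length = 4
    else:
        length = 1  # invalid lead byte is returned unchanged
    return line[col:col + length]


line = "ñéa".encode("utf-8")   # bytes: c3 b1 c3 a9 61
col = 2                        # byte offset of "é"
```

With these values `byte_index(line, col)` is the lone byte `\xc3` (part of the right character), `char_index(line, col)` is "a" (a whole but wrong character), and `char_at_byte(line, col)` is "é". An invalid byte such as `\x80` comes back unchanged from `char_at_byte`.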