I've received a couple of requests about getting Align.vim to work with
utf-8 characters. As an example, consider:
  let x='grün'
  echo "strlen(x)=".strlen(x)
Thus, strlen() returns 5, not 4 as one might (sometimes) expect. So, I
tried a workaround:
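  fun! Strlen(x)
    " Put the string on line 1 of a scratch buffer in a one-line split
    1split
    enew
    call setline(1,a:x)
    " virtcol("$") is one past the last screen column of the line
    let ret= virtcol("$") - 1
    bwipe!
    return ret
  endfun
  echo Strlen(x)
now returns 4 (at the price of using interpreted code instead of the
built-in strlen()). So, is this the best that can be done?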
I'd prefer to have a built-in compiled function for this.
Regards,
Chip Campbell
> let x='grün'
> echo "strlen(x)=".strlen(x)
>
> Thus, strlen() returns 5, not 4 as one might (sometimes) expect.
Here's what I have in one of my base libraries:
  function now#mbc#len(str)
    return strlen(substitute(a:str, '.', 'c', 'g'))
  endfunction
Which is incredibly much better than your solution ;-).
nikolai
Well, I came up with another solution, but it still isn't as good as
yours! Shouldn't strlen() just handle this on its own? In C or C++,
one may want the output of strlen() to help with allocating memory to
hold a string; I don't see any such use for it in Vim.
Regards,
Chip Campbell
The multibyte strlen() is even suggested/documented here:
:h strlen()
--
Andy
It all depends on what exactly you want to do. (I haven't read the Align.vim
docs.) The length of a UTF-8 string can be counted in several nonequivalent ways:

- number of bytes (Latin a + combining circumflex is three bytes):
    strlen(string)

- number of codepoints (Latin a + combining circumflex is two codepoints):
    strlen(substitute(string, '.', 'x', 'g'))

- number of spacing codepoints (Latin a + combining circumflex is one spacing
  codepoint; a hard tab is one; wide and narrow CJK are one each; etc.): (untested)
    strlen(substitute(string, '.\Z', 'x', 'g'))

- virtual length (counting, for instance, tabs as anything between 1 and
  'tabstop', wide CJK as 2 rather than 1, Arabic alif as zero when immediately
  preceded by lam, one otherwise, etc.): I guess something like what you're
  doing above will be necessary because of the wide range of things that can happen.

The first two above are documented at ":help strlen()", the third (in
addition) at ":help patterns-composing".
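
The different counts can be compared side by side. What follows is only a
minimal sketch, assuming 'encoding' is utf-8 and a reasonably recent Vim:
strchars() and strdisplaywidth() are not used anywhere in this thread (they
appeared in Vim 7.3, and the optional second argument of strchars() came
later), and the Count* function names are made up for illustration.

```vim
" Assumes :set encoding=utf-8. All function names here are hypothetical.
scriptencoding utf-8

" Bytes: what the built-in strlen() reports.
function! CountBytes(s)
  return strlen(a:s)
endfunction

" Characters via the substitute() idiom quoted in this thread.
function! CountChars(s)
  return strlen(substitute(a:s, '.', 'x', 'g'))
endfunction

" Codepoints, counting each composing character separately.
function! CountCodepoints(s)
  return strchars(a:s)
endfunction

" Characters with composing characters folded into their base character.
function! CountSpacing(s)
  return strchars(a:s, 1)
endfunction

" Screen cells, with tabs expanded as if the string started at column 0.
function! CountCells(s)
  return strdisplaywidth(a:s)
endfunction

let x = 'grün'
echo CountBytes(x) CountChars(x) CountCodepoints(x)

" Latin a followed by U+0302, the combining circumflex
let y = "a\u0302"
echo CountBytes(y) CountCodepoints(y) CountSpacing(y)

echo CountCells(x) CountCells("a\tb")
```

For 'grün' the first line should come out as 5, 4 and 4; for the combining
example, as 3 bytes, 2 codepoints and 1 spacing character. The cell counts on
the last line depend on settings such as 'tabstop' and 'ambiwidth', so no
fixed result is claimed for them.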
Best regards,
Tony.
Thank you, Tony, for that explanation! I've modified Align so that the
method used is selectable by the user. Align v33d, which includes these
changes, is available at my website
(http://mysite.verizon.net/astronaut/vim/index.html#ALIGN).
Regards,
Chip Campbell
... and, in addition, when 'fileencoding' is nonempty and different from
'encoding', the number of disk bytes used might be useful, but I don't know
how Vim could get it, especially for encodings such as those used in Eastern
Asia, where the number of bytes per character may vary in a way which is often
not easily predictable from the UTF-8 representation. (The 2-or-4-bytes of
UTF-16 is peanuts next to that, but Vim cannot use UTF-16 for its internal
representation of the data because of the intervening nulls.)
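
One possible way to approximate the disk byte count is to convert the string
with iconv() before calling strlen(). This is only a sketch under assumptions:
the helper name DiskBytes is made up, conversions other than UTF-8 to latin1
and back generally need a Vim built with +iconv, and it presumably cannot work
for encodings whose output contains null bytes (UTF-16, for instance), since
Vim cannot hold those in a string.

```vim
" Hypothetical helper: number of bytes a:str would occupy in encoding a:fenc.
" Returns -1 when the conversion fails completely (iconv() then returns
" an empty string).
function! DiskBytes(str, fenc)
  " Empty 'fileencoding', or the same as 'encoding': no conversion on write
  if a:fenc ==# '' || a:fenc ==? &encoding
    return strlen(a:str)
  endif
  let converted = iconv(a:str, &encoding, a:fenc)
  if empty(converted) && !empty(a:str)
    return -1
  endif
  return strlen(converted)
endfunction

echo DiskBytes('grün', 'latin1')
echo DiskBytes('grün', &fileencoding)
```

With 'encoding' set to utf-8, the latin1 call should give 4, against the 5
bytes that strlen() reports for the internal representation. Characters that
the target encoding cannot represent may be replaced during conversion, so
the result is an estimate, not a guarantee of what ends up on disk.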
Best regards,
Tony.
--
Weiler's Law:
Nothing is impossible for the man who doesn't have to do it
himself.