Is there any way to count all latin characters in utf-8 as 1 byte?

rameo

Mar 7, 2016, 6:54:36 AM
to vim_use
I use searchpos() to capture the start/end columns of matches.
Then I use the results in Python code to transform the text.

However, I noticed that Latin characters such as 'èéàòìù' are counted as 1 byte in Python but as 2 bytes in Vim, so the output is not as expected.

Is there any way to resolve this problem?

Nikolay Aleksandrovich Pavlov

Mar 7, 2016, 9:05:42 AM
to vim...@googlegroups.com
In Python you are not using *byte* counts, it indexes *unicode
codepoints*. You may convert unicode Python objects to bytes objects
by using `string.encode(vim.options['encoding'])`, use
`.decode(vim.options['encoding'])` to convert back. bytes objects are
indexed by bytes. You may also count codepoints on Vim side by using
`strchars()`.
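
A minimal sketch of the difference described above. It assumes Vim's 'encoding' option is utf-8 and hard-codes that value, because the `vim` module only exists inside Vim; there the value would come from `vim.options['encoding']` as suggested. The sample text is made up for illustration.

```python
# Assumption: Vim's 'encoding' is utf-8 (hard-coded here so that the
# snippet also runs outside Vim).
ENCODING = 'utf-8'

text = '\u00e8\u00e9 abc'        # the str "èé abc": indexed by code points
raw = text.encode(ENCODING)      # a bytes object: indexed by bytes

print(len(text))                 # number of code points
print(len(raw))                  # number of bytes; è and é take 2 bytes each
print(raw.decode(ENCODING) == text)  # decoding gives back the original str
```

The two lengths differ by exactly the number of extra bytes used by the accented characters, which is the mismatch seen between Vim columns and Python indexes.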

rameo

Mar 7, 2016, 5:32:26 PM
to vim_use

> In Python you are not using *byte* counts, it indexes *unicode
> codepoints*. You may convert unicode Python objects to bytes objects
> by using `string.encode(vim.options['encoding'])`, use
> `.decode(vim.options['encoding'])` to convert back. bytes objects are
> indexed by bytes. You may also count codepoints on Vim side by using
> `strchars()`.
>

Thank you, ZyX. Can you please tell me where to put `string.encode(vim.options['encoding'])`? Before searchpos()? And `.decode(vim.options['encoding'])` after searchpos()?

Nikolay Aleksandrovich Pavlov

Mar 7, 2016, 5:40:07 PM
to vim...@googlegroups.com
When using byte indexes, you apply them to the encoded Unicode string
(a bytes object) in Python. Decoding is needed to convert byte strings
(which are rather inconvenient to use in Python 3) back to Unicode
strings when you are done working with the indexes. I cannot say
anything more because it is your code.

rameo

Mar 8, 2016, 4:23:51 AM
to vim_use
Can't find anything on the net about string.encode(vim.options[encoding]).
No info either in Vim documentation: if_pyth

Let's say I create my list "MyPositions" with the start/end positions of matches using searchpos() in Vim.

Then in my python code I have to do something like this to convert it to byte strings:

python3 << endpython
import vim
myposPyth = str(vim.eval("MyPositions"))
myposPyth = myposPyth.encode(vim.options['utf8'])

?
I still don't get it.
(btw above returns a key-error)

Nikolay Aleksandrovich Pavlov

Mar 8, 2016, 4:58:25 AM
to vim...@googlegroups.com
You do not understand what you are doing. `string` in my code means
the *string* to which the positions apply, not the *position*, and not
even a stringified position. To convert byte offsets into character
ones you need to get the string to which the position applies, convert
it to a byte string, slice it using the found positions, convert the
slice back to a Unicode string and find its length.
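
The steps just listed can be sketched as follows. This is only an illustration: the helper name `byte_to_char_offset`, the sample line and the byte column are hypothetical, and the encoding is assumed to be utf-8 rather than read from Vim.

```python
def byte_to_char_offset(line, byte_offset, encoding='utf-8'):
    """Return how many characters precede the given 0-based byte offset."""
    raw = line.encode(encoding)          # str -> bytes
    prefix = raw[:byte_offset]           # slice using the byte position
    return len(prefix.decode(encoding))  # bytes -> str, then count characters


line = 'caf\u00e9 au lait'               # "café au lait"
# Hypothetical searchpos() result: 1-based byte column where "au" starts.
byte_col = 7
char_index = byte_to_char_offset(line, byte_col - 1)
print(line[char_index:char_index + 2])
```

The byte offset must fall on a character boundary, as match positions do; slicing in the middle of a multi-byte character would make the decode step fail.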

And I explicitly wrote `vim.options['encoding']`; where on Earth
have you seen `vim.options['utf8']`?

rameo

Mar 8, 2016, 5:22:16 AM
to vim_use
Thank you.
You're right. It is not a question of decoding a list but of decoding a string. I have never done anything with string encoding before. I've got it: I cannot use searchpos() with Python. Searching for positions must be done in Python (e.g. with finditer).

BTW I thought that ['encoding'] was a placeholder for utf8/latin1.

Nikolay Aleksandrovich Pavlov

Mar 8, 2016, 5:34:51 AM
to vim...@googlegroups.com
2016-03-08 13:22 GMT+03:00 rameo <rai...@gmail.com>:
> Thank you.
> You're right. It is not a question of decoding a list but of decoding a string. I have never done anything with string encoding before. I've got it: I cannot use searchpos() with Python. Searching for positions must be done in Python (e.g. with finditer).

You can use searchpos() with Python. searchpos() searches strings in
the buffer, which *is* accessible through Python:
`vim.current.buffer[linenr - 1]` or `vim.buffers[bufnr][linenr - 1]`.
Line numbers are returned by searchpos().

When using Python 3, though, be prepared for
`vim.current.buffer[linenr - 1]` to raise UnicodeDecodeError, because
the buffer is not guaranteed to contain only valid &encoding strings.
Using `vim.bindeval('getline(%u)' % linenr)` you may get a byte string
without decoding anything and thus without errors.
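
A small sketch of why working on byte strings avoids the error. The sample line is an assumption made up for illustration: Latin-1 bytes in a buffer whose 'encoding' is utf-8. It runs outside Vim and uses no Vim API.

```python
# Assumed situation: the line contains the Latin-1 byte 0xE9, which is not
# valid UTF-8 in this position.
raw_line = b'caf\xe9 au lait'

try:
    raw_line.decode('utf-8')
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
print(decoded_ok)

# The byte string can still be sliced with byte positions, no decoding needed.
match = raw_line[5:7]
print(match)

# If a str is needed anyway, invalid bytes can be replaced instead of raising.
lenient = raw_line.decode('utf-8', errors='replace')
print(lenient)
```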

>
> BTW I thought that ['encoding'] was a placeholder for utf8/latin1.
>

Nikolay Aleksandrovich Pavlov

Mar 8, 2016, 5:36:04 AM
to vim...@googlegroups.com
2016-03-08 13:34 GMT+03:00 Nikolay Aleksandrovich Pavlov <zyx...@gmail.com>:
> Using
> `vim.bindeval('getline(%u)' % linenr)` you may get byte string without

Better `vim.Function('getline')(linenr)`

rameo

Mar 8, 2016, 6:03:56 AM
to vim_use
> Better `vim.Function('getline')(linenr)`

You seem to know everything in every computer language :)
Yes, I use Python 3, and I often use `vim.current.buffer[linenr-1]`
or things like `r = vim.current.buffer[startline:endline]`.

To avoid decoding errors, is it better to switch all these statements to
`vim.Function('getline')(linenr)`?
E.g. line 3 in Vim: `vim.current.buffer[2]` --> `vim.Function('getline')(2)`?
How does Python know which buffer to read the line from, and what would the vim.Function statement be for a slice such as `vim.current.buffer[startline:endline]`?

Nikolay Aleksandrovich Pavlov

Mar 8, 2016, 6:38:13 AM
to vim...@googlegroups.com
2016-03-08 14:03 GMT+03:00 rameo <rai...@gmail.com>:
:h getline()
:h getbufline()

vim.Function does nothing more than create a reference to a Vim function.
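
As a rough illustration of the two help topics: getline() reads from the current buffer, while getbufline() takes the buffer as its first argument, and both use 1-based, inclusive line numbers (so line 3 is `vim.current.buffer[2]` but `getline(3)`). The helper below is a hypothetical sketch, not part of Vim's Python API. It takes the `vim` module as a parameter only so that it can be read and tried outside Vim, and it does not handle negative slice indexes.

```python
def buffer_slice(vim, start, end, bufnr=None):
    """Rough equivalent of vim.current.buffer[start:end] via Vim functions.

    `vim` is the module available inside Vim. Whatever type Vim returns for
    the list of lines is passed through unchanged.
    """
    if start >= end:
        return []
    # Python slices are 0-based with an exclusive end; getline() and
    # getbufline() take 1-based, inclusive line numbers.
    first, last = start + 1, end
    if bufnr is None:
        return vim.Function('getline')(first, last)        # current buffer
    return vim.Function('getbufline')(bufnr, first, last)  # a given buffer
```

Under these assumptions, `buffer_slice(vim, 2, 5)` would correspond to `vim.current.buffer[2:5]`, that is, lines 3 to 5 as Vim counts them.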

rameo

Mar 8, 2016, 7:18:57 AM
to vim_use
Thanks, but I still don't understand it. No problem.
I hope I won't have problems with `vim.current.buffer[linenr - 1]`.

My experience with Python tells me that the Python language is much easier than Vim script. Thanks to the many Python modules, things can be done more easily and with less code than in Vim script. Even Python regex is much easier (e.g. lookahead/lookbehind). I wish both languages were more compatible. ;)
