When a vim variable has a value that is not a valid UTF-8 string, :py3 vim.eval('variable') raises UnicodeDecodeError.
This is causing plugins such as UltiSnips to crash when they try to process a list of mappings that involve <A-@>, since Vim thinks <A-@> is byte 0xC0. Here's what happens in the plugin:
1. vim.command('redir => _tmp_smaps | smap | redir END') to get the mappings
2. vim.eval('_tmp_smaps').splitlines() to see what select-mode mappings there are
Now, I have the following mapping in my .vimrc:
map! <Esc>@ <A-@>
(I've many more, but this one seems to be the one causing the problem)
When I run smap, vim shows this mapping as

but when I do
:redir => tmp | smap <Esc>@ | redir END
:echo tmp
I see
and I can reproduce the crash with
:py3 import vim; vim.eval('tmp')
Traceback (most recent call last):
File "<string>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in position 18: invalid start byte
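The same failure can be reproduced outside Vim. Here is a minimal sketch in plain Python, assuming the redirected text contains a lone 0xC0 byte where Vim stores <A-@>. The sample bytes and the helper name try_strict_decode are made up for illustration; they are not the actual smap output or anything from the Vim sources.

```python
# Hypothetical stand-in for the bytes Vim hands to Python. Only the lone
# 0xC0 byte matters; the surrounding text is invented for illustration.
raw = b"mapping: \xc0 x"


def try_strict_decode(data: bytes) -> str:
    """Decode as strict UTF-8 and describe the failure, if there is one."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return f"{type(exc).__name__} at position {exc.start}: {exc.reason}"


result = try_strict_decode(raw)
print(result)
```

0xC0 can never start a valid UTF-8 sequence, so a strict decoder has to reject it wherever it appears.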
Here is another test case
:0put =printf('%c',0xFF)
:py3 print(repr(vim.current.buffer[0]))
:py3 vim.current.buffer[0] += "x"
:py3 print(vim.eval("getbufline('', 1)"))
Apparently the first :py3 is OK because it uses surrogateescape, but the other two are not. Probably :python3 should always use surrogateescape, both when decoding and when encoding. (This is what Neovim's :python3 implementation does, BTW.)
Maybe like this:
diff --git a/src/if_py_both.h b/src/if_py_both.h
index c44fc93..19f8584 100644
--- a/src/if_py_both.h
+++ b/src/if_py_both.h
@@ -134,7 +134,7 @@ StringToChars(PyObject *obj, PyObject **todecref)
 {
     PyObject *bytes;
-    if (!(bytes = PyUnicode_AsEncodedString(obj, ENC_OPT, NULL)))
+    if (!(bytes = PyUnicode_AsEncodedString(obj, ENC_OPT, CODEC_ERROR_HANDLER)))
         return NULL;
     if(PyBytes_AsStringAndSize(bytes, (char **) &str, NULL) == -1
@@ -4117,7 +4117,7 @@ StringToLine(PyObject *obj)
     }
     else if (PyUnicode_Check(obj))
     {
-    if (!(bytes = PyUnicode_AsEncodedString(obj, ENC_OPT, NULL)))
+    if (!(bytes = PyUnicode_AsEncodedString(obj, ENC_OPT, CODEC_ERROR_HANDLER)))
         return NULL;
     if (PyBytes_AsStringAndSize(bytes, &str, &len) == -1
@@ -6197,7 +6197,7 @@ _ConvertFromPyObject(PyObject *obj, typval_T *tv, PyObject *lookup_dict)
     PyObject *bytes;
     char_u *str;
-    bytes = PyUnicode_AsEncodedString(obj, ENC_OPT, NULL);
+    bytes = PyUnicode_AsEncodedString(obj, ENC_OPT, CODEC_ERROR_HANDLER);
     if (bytes == NULL)
         return -1;
diff --git a/src/if_python.c b/src/if_python.c
index 622634d..edb6400 100644
--- a/src/if_python.c
+++ b/src/if_python.c
@@ -90,6 +90,9 @@ struct PyMethodDef { Py_ssize_t a; };
 # define PySequenceMethods Py_ssize_t
 #endif
+/* The "surrogateescape" error handler is new in Python 3.1 */
+#define CODEC_ERROR_HANDLER NULL
+
 #if defined(PY_VERSION_HEX) && PY_VERSION_HEX >= 0x02070000
 # define PY_USE_CAPSULE
 #endif
diff --git a/src/if_python3.c b/src/if_python3.c
index 53a1313..5d9c058 100644
--- a/src/if_python3.c
+++ b/src/if_python3.c
@@ -96,7 +96,7 @@
 # define PyString_Check(obj) PyUnicode_Check(obj)
 #endif
 #define PyString_FromString(repr) \
-    PyUnicode_Decode(repr, STRLEN(repr), ENC_OPT, NULL)
+    PyUnicode_Decode(repr, STRLEN(repr), ENC_OPT, CODEC_ERROR_HANDLER)
 #define PyString_FromFormat PyUnicode_FromFormat
 #ifndef PyInt_Check
 # define PyInt_Check(obj) PyLong_Check(obj)
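For readers less familiar with the error handler, the behaviour the patch is aiming for can be sketched in plain Python. This is only an illustration of "surrogateescape" used in both directions, not a model of Vim's C code; the 0xFF byte mirrors the :0put test case above.

```python
# Bytes that are not valid UTF-8, like the 0xFF line in the test case.
raw = b"\xff"

# Decoding with surrogateescape maps each undecodable byte 0xNN to the
# lone surrogate U+DCNN instead of raising UnicodeDecodeError.
text = raw.decode("utf-8", errors="surrogateescape")

# Edit the text as an ordinary str, then encode with the same handler so
# the surrogate turns back into the original byte.
edited = text + "x"
round_tripped = edited.encode("utf-8", errors="surrogateescape")

print(repr(text))
print(round_tripped)
```

The round trip only works if the handler is passed on the encode side too, which is why the patch touches the PyUnicode_AsEncodedString calls as well as PyUnicode_Decode.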
To incorporate @bfredl 's patch into our build system, it looks like we need an additional patch:
diff --git a/src/if_python.c b/src/if_python.c
index edb6400..ffad23e 100644
--- a/src/if_python.c
+++ b/src/if_python.c
@@ -91,7 +91,9 @@ struct PyMethodDef { Py_ssize_t a; };
 #endif
 /* The "surrogateescape" error handler is new in Python 3.1 */
-#define CODEC_ERROR_HANDLER NULL
+#if defined(PY_VERSION_HEX) && PY_VERSION_HEX < 0x03010000
+# define CODEC_ERROR_HANDLER NULL
+#endif
 #if defined(PY_VERSION_HEX) && PY_VERSION_HEX >= 0x02070000
 # define PY_USE_CAPSULE
With those patches, I confirm that the resulting vim works well with the examples @bfredl mentioned above.
But I don't think I am a good python tester. It would be far better if someone else could confirm that, ideally, with other examples.
Hmm, but ain't PY_VERSION_HEX < 0x03010000 always true when you are in if_python.c ? If the version hex is >= 0x03000000, if_python3.c should be used...
615351832d75df3dfbc3f22694e675583e0b325d
What is wrong with just using CODEC_ERROR_HANDLER directly?
But python2 does not have surrogateescape, in python2 one would represent a byte string as str with no problems. The very point of "surrogateescape" is that you use it in both directions, so that a bytestring can be roundtripped losslessly as a python3 str and then back to a bytestring (but only in that direction).
I'm asking because after your patch CODEC_ERROR_HANDLER is only ever used to #define another macro. Why not just use it directly, as in my patch? If indirection is needed later, it can be added when it's needed.
@bfredl Python2 has unicode() strings. And if you want to write python23 scripts you will use from __future__ import unicode_literals (thus Python->Vim will receive unicode strings with no non-unicode characters unless explicitly requested), but Vim API will still produce byte strings in Python 2, so using this in “both directions” is impossible. Also my Python2 has surrogateescape error handler, though after some investigation it appeared that it was coming from some site package provided for compatibility with Python3 so this idea does not make much sense.
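Whether a given interpreter has the handler registered can be checked at run time. Here is a small sketch using the standard codecs.lookup_error function; the helper name has_error_handler is made up for this example.

```python
import codecs


def has_error_handler(name: str) -> bool:
    """Return True if a codec error handler is registered under this name."""
    try:
        codecs.lookup_error(name)
    except LookupError:
        return False
    return True


print(has_error_handler("surrogateescape"))
```

This only reports that the name is registered, not which package registered it. On a Python 2 where a compatibility package has added the handler, as described above, it would presumably also report True.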
In case it helps, here is an updated patch from @ZyX-I:
diff --git a/src/if_py_both.h b/src/if_py_both.h
index 7b748b25e..e657624dd 100644
--- a/src/if_py_both.h
+++ b/src/if_py_both.h
@@ -130,7 +130,8 @@ StringToChars(PyObject *obj, PyObject **todecref)
 {
     PyObject *bytes;
-    if (!(bytes = PyUnicode_AsEncodedString(obj, ENC_OPT, NULL)))
+    if (!(bytes = PyUnicode_AsEncodedString(obj, ENC_OPT,
+                                            ERRORS_ENCODE_ARG)))
         return NULL;
     if(PyBytes_AsStringAndSize(bytes, (char **) &str, NULL) == -1
@@ -4243,7 +4244,8 @@ StringToLine(PyObject *obj)
     }
     else if (PyUnicode_Check(obj))
     {
-    if (!(bytes = PyUnicode_AsEncodedString(obj, ENC_OPT, NULL)))
+    if (!(bytes = PyUnicode_AsEncodedString(obj, ENC_OPT,
+                                            ERRORS_ENCODE_ARG)))
         return NULL;
     if (PyBytes_AsStringAndSize(bytes, &str, &len) == -1
@@ -6290,7 +6292,7 @@ _ConvertFromPyObject(PyObject *obj, typval_T *tv, PyObject *lookup_dict)
     PyObject *bytes;
     char_u *str;
-    bytes = PyUnicode_AsEncodedString(obj, ENC_OPT, NULL);
+    bytes = PyUnicode_AsEncodedString(obj, ENC_OPT, ERRORS_ENCODE_ARG);
     if (bytes == NULL)
         return -1;
diff --git a/src/if_python.c b/src/if_python.c
index 6338a5b8d..29f7ed560 100644
--- a/src/if_python.c
+++ b/src/if_python.c
@@ -69,6 +69,9 @@
 # undef PY_SSIZE_T_CLEAN
 #endif
+#define ERRORS_DECODE_ARG NULL
+#define ERRORS_ENCODE_ARG ERRORS_DECODE_ARG
+
 #undef main // Defined in python.h - aargh
 #undef HAVE_FCNTL_H // Clash with os_win32.h
diff --git a/src/if_python3.c b/src/if_python3.c
index a51be2949..ea4fd7dd8 100644
--- a/src/if_python3.c
+++ b/src/if_python3.c
@@ -81,12 +81,15 @@
 // Python 3 does not support CObjects, always use Capsules
 #define PY_USE_CAPSULE
+#define ERRORS_DECODE_ARG CODEC_ERROR_HANDLER
+#define ERRORS_ENCODE_ARG ERRORS_DECODE_ARG
+
 #define PyInt Py_ssize_t
 #ifndef PyString_Check
 # define PyString_Check(obj) PyUnicode_Check(obj)
 #endif
 #define PyString_FromString(repr) \
-    PyUnicode_Decode(repr, STRLEN(repr), ENC_OPT, NULL)
+    PyUnicode_Decode(repr, STRLEN(repr), ENC_OPT, ERRORS_DECODE_ARG)
 #define PyString_FromFormat PyUnicode_FromFormat
 #ifndef PyInt_Check
 # define PyInt_Check(obj) PyLong_Check(obj)
@@ -1088,8 +1091,8 @@ DoPyCommand(const char *cmd, rangeinitializer init_range, runner run, void *arg)
     // PyRun_SimpleString expects a UTF-8 string. Wrong encoding may cause
     // SyntaxError (unicode error).
     cmdstr = PyUnicode_Decode(cmd, strlen(cmd),
-                              (char *)ENC_OPT, CODEC_ERROR_HANDLER);
-    cmdbytes = PyUnicode_AsEncodedString(cmdstr, "utf-8", CODEC_ERROR_HANDLER);
+                              (char *)ENC_OPT, ERRORS_DECODE_ARG);
+    cmdbytes = PyUnicode_AsEncodedString(cmdstr, "utf-8", ERRORS_ENCODE_ARG);
     Py_XDECREF(cmdstr);
     run(PyBytes_AsString(cmdbytes), arg, &pygilstate);
@@ -1745,7 +1748,7 @@ LineToString(const char *str)
     }
     *p = '\0';
-    result = PyUnicode_Decode(tmp, len, (char *)ENC_OPT, CODEC_ERROR_HANDLER);
+    result = PyUnicode_Decode(tmp, len, (char *)ENC_OPT, ERRORS_DECODE_ARG);
     vim_free(tmp);
     return result;
As for the question asked in the relevant todo item:
Yes, the patch works. I tested it against the original example. Without the patch:
vim -Nu NONE -S <(cat <<'EOF'
smap <Esc>@ <A-@>
py3 vim.command('redir => _tmp_smaps | smap | redir END')
py3 vim.eval('_tmp_smaps').splitlines()
EOF
)
Traceback (most recent call last):
File "<string>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in position 18: invalid start byte
With the patch:
./src/vim -Nu NONE -S <(cat <<'EOF'
smap <Esc>@ <A-@>
py3 vim.command('redir => _tmp_smaps | smap | redir END')
py3 vim.eval('_tmp_smaps').splitlines()
EOF
)
# no error
Note that someone has asked a question on vi.stackexchange which I think has the same cause. This command:
:let variable = "\<bs>" | py3 print(vim.eval('variable'))
raises this error:
Traceback (most recent call last):
File "<string>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
I've tried the patch, and unfortunately the error persists; but it changes from UnicodeDecodeError to UnicodeEncodeError:
Traceback (most recent call last):
File "<string>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in position 0: surrogates not allowed
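The switch from a decode error to an encode error is what one would expect if the value is now decoded with surrogateescape but some later step still encodes it strictly. Where that strict encoding happens inside Vim is not established in this thread, so the sketch below only demonstrates the Python-level behaviour; the helper name try_strict_encode is made up.

```python
# The lone 0x80 byte reported in the traceback above.
raw = b"\x80"

# With surrogateescape, decoding succeeds and yields a lone surrogate.
text = raw.decode("utf-8", errors="surrogateescape")


def try_strict_encode(value: str) -> str:
    """Encode as strict UTF-8 and describe the failure, if there is one."""
    try:
        return repr(value.encode("utf-8"))
    except UnicodeEncodeError as exc:
        return f"{type(exc).__name__}: {exc.reason}"


strict_result = try_strict_encode(text)
escaped_result = text.encode("utf-8", errors="surrogateescape")

print(repr(text))
print(strict_result)
print(escaped_result)
```

A str that holds an escaped byte can only be turned back into bytes by an encoder that also uses surrogateescape; any strict encoder on the way out will reject it.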
So it is an unsolved issue now?
It's probably unsolved until the patch has been merged successfully.
Please check for any remaining encoding/decoding issues.
The patch has fixed the original issue. But a similar one persists for a string containing <bs>. This command:
:let variable = "\<bs>" | py3 print(vim.eval('variable'))
Raises:
UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in position 0: surrogates not allowed
Reopened #1053.