The blob2str() function did not properly handle UTF-16/UCS-2/UTF-32/UCS-4 encodings with endianness suffixes (e.g., utf-16le, utf-16be, ucs-2le). The encoding name was canonicalized too aggressively, losing the endianness information that iconv needs.
This pull request includes a few fixes.
"ucs2be" → "ucs-2be", "utf16le" → "utf-16le"
I considered using enc_canonize(), but it converts BE encodings to canonical names without the endianness suffix (e.g., "utf16be" → "utf-16", "ucs2be" → "ucs-2"), which loses the information needed by iconv.
convert_string() cannot handle UTF-16 because it uses string_convert() which expects NUL-terminated strings. UTF-16 contains 0x00 bytes within characters (e.g., "H" = 0x48 0x00), causing premature termination. Therefore, for UTF-16/32 encodings, the fix uses string_convert_ext() with an explicit input length to convert the entire blob at once.
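The premature-termination problem can be reproduced outside Vim. The sketch below does not use string_convert() or string_convert_ext(); utf16le_to_utf8() is a hypothetical, hand-written stand-in for a length-aware conversion, and cstring_length() just shows what a NUL-terminated API would see.

```c
#include <stddef.h>
#include <string.h>

// "Hi" in UTF-16LE: each ASCII character is followed by a 0x00 byte.
const unsigned char hi_utf16le[] = { 0x48, 0x00, 0x69, 0x00 };

// Length a NUL-terminated API would see: it stops at the first 0x00 byte.
size_t cstring_length(const unsigned char *p)
{
    return strlen((const char *)p);
}

// Hypothetical stand-in for a length-aware conversion: turn "len" bytes of
// UTF-16LE into NUL-terminated UTF-8.  Returns the number of UTF-8 bytes
// written (without the NUL), or -1 on malformed input or a too-small buffer.
long utf16le_to_utf8(const unsigned char *in, size_t len,
                     char *out, size_t outlen)
{
    size_t o = 0;

    if (len % 2 != 0 || outlen == 0)
        return -1;
    for (size_t i = 0; i < len; i += 2) {
        unsigned long cp = in[i] | ((unsigned long)in[i + 1] << 8);

        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: must be followed by a low surrogate.
            if (i + 3 >= len)
                return -1;
            unsigned long lo = in[i + 2] | ((unsigned long)in[i + 3] << 8);
            if (lo < 0xDC00 || lo > 0xDFFF)
                return -1;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
            i += 2;
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            return -1;          // stray low surrogate
        }

        size_t need = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (o + need >= outlen) // keep room for the terminating NUL
            return -1;
        if (need == 1) {
            out[o++] = (char)cp;
        } else if (need == 2) {
            out[o++] = (char)(0xC0 | (cp >> 6));
            out[o++] = (char)(0x80 | (cp & 0x3F));
        } else if (need == 3) {
            out[o++] = (char)(0xE0 | (cp >> 12));
            out[o++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (char)(0x80 | (cp & 0x3F));
        } else {
            out[o++] = (char)(0xF0 | (cp >> 18));
            out[o++] = (char)(0x80 | ((cp >> 12) & 0x3F));
            out[o++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (char)(0x80 | (cp & 0x3F));
        }
    }
    out[o] = '\0';
    return (long)o;
}
```

cstring_length(hi_utf16le) reports 1 byte although the buffer holds 4, while the length-aware function consumes all 4 bytes and produces "Hi".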
The code appends two NUL bytes (ga_append(&blob_ga, NUL) twice) because UTF-16 requires a 2-byte NUL terminator (0x00 0x00), not a single-byte NUL.
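For comparison, here is a rough sketch of choosing the terminator width from the encoding name instead of hard-coding two bytes. It is not what this PR does: nul_width() and append_terminator() are hypothetical helpers, they work on a plain byte buffer rather than a garray_T, and they assume the name is already lower-case and hyphenated.

```c
#include <stddef.h>
#include <string.h>

// Hypothetical helper: size in bytes of one NUL code unit for an encoding
// name that is assumed to be lower-case and hyphenated already.
size_t nul_width(const char *enc)
{
    if (strncmp(enc, "utf-32", 6) == 0 || strncmp(enc, "ucs-4", 5) == 0)
        return 4;
    if (strncmp(enc, "utf-16", 6) == 0 || strncmp(enc, "ucs-2", 5) == 0)
        return 2;
    return 1;
}

// Append nul_width(enc) zero bytes after the first "len" bytes of "buf",
// which has room for "cap" bytes in total.
// Returns the new length, or 0 if the terminator does not fit.
size_t append_terminator(unsigned char *buf, size_t len, size_t cap,
                         const char *enc)
{
    size_t w = nul_width(enc);

    if (len > cap || cap - len < w)
        return 0;
    memset(buf + len, 0, w);
    return len + w;
}
```

With this sketch "utf-16le" gets two zero bytes and "utf-32le" gets four; whether the wider terminator is actually required depends on how the converted buffer is consumed afterwards.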
Added from_encoding_raw to preserve endianness, plus special handling for UTF-16/32 and UCS-2/4.
Changed convert_setup_ext() to use == ENC_UNICODE instead of & ENC_UNICODE. The bitwise AND was incorrectly treating UTF-16/UCS-2 (which have ENC_UNICODE + ENC_2BYTE etc.) as UTF-8, causing iconv setup to be skipped.
Closes #19198
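The difference between the == and & checks in convert_setup_ext() can be shown with a few flag bits. The numeric values below are assumptions chosen for illustration; Vim defines its own ENC_* constants, and both function names are hypothetical.

```c
#include <stdbool.h>

// Illustrative flag values only (assumed, not copied from Vim's headers).
enum {
    ENC_UNICODE = 0x04,
    ENC_2BYTE   = 0x40,
    ENC_4BYTE   = 0x80
};

// Old check: true for every Unicode encoding, including UTF-16 and UTF-32.
bool is_utf8_bitwise_and(int props)
{
    return (props & ENC_UNICODE) != 0;
}

// New check: true only when no width flag is set, i.e. plain UTF-8.
bool is_utf8_equality(int props)
{
    return props == ENC_UNICODE;
}
```

For props equal to ENC_UNICODE + ENC_2BYTE, the bitwise check returns true (the bug), while the equality check returns false, so the iconv setup is no longer skipped.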
https://github.com/vim/vim/pull/19246
(4 files)
@h-east commented on this pull request.
In src/strings.c:
> +    // UTF-16 requires 2-byte NUL terminator
> +    ga_append(&blob_ga, NUL);
> +    ga_append(&blob_ga, NUL);
Is it okay to use 2 bytes even in the case of a 4-byte encoding such as 'utf-32'?
In src/strings.c:
> +    // Special handling for UTF-16/UCS-2/UTF-32/UCS-4 encodings: convert entire blob before splitting by newlines
> +    if (from_encoding != NULL && (STRNCMP(from_encoding, "utf-16", 6) == 0
> +        || STRNCMP(from_encoding, "utf16", 5) == 0
> +        || STRNCMP(from_encoding, "ucs-2", 5) == 0
> +        || STRNCMP(from_encoding, "ucs2", 4) == 0
> +        || STRNCMP(from_encoding, "utf-32", 6) == 0
> +        || STRNCMP(from_encoding, "utf32", 5) == 0
> +        || STRNCMP(from_encoding, "ucs-4", 5) == 0
> +        || STRNCMP(from_encoding, "ucs4", 4) == 0))
Isn't from_encoding already canonicalized by enc_canonize()?
Or can it be determined using enc_canon_props()?