[vim/vim] Fix blob2str() with UTF-16 encodings (PR #19246)

mattn

Jan 24, 2026, 8:26:07 AM
to vim/vim, Subscribed

The blob2str() function did not properly handle UTF-16/UCS-2/UTF-32/UCS-4 encodings with endianness suffixes (e.g., utf-16le, utf-16be, ucs-2le). The encoding name was canonicalized too aggressively, which dropped the endianness information that iconv needs.

This pull request includes a few fixes:

  • Preserve the raw encoding name with endianness suffix for iconv calls
  • Normalize encoding names properly: "ucs2be" → "ucs-2be", "utf16le" → "utf-16le"
  • For multi-byte encodings (UTF-16/32, UCS-2/4), convert the entire blob first, then split by newlines
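
The name normalization in the list above can be sketched roughly as follows. This is only an illustration, not the code from the patch: demo_normalize_enc() is a hypothetical helper, and it assumes the only repair needed is a missing hyphen after a "utf" or "ucs" prefix.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not Vim's actual code: lower-case the name and insert
 * a '-' between a "utf"/"ucs" prefix and the digits that follow. Any
 * endianness suffix ("le"/"be") is copied through unchanged.
 */
static void demo_normalize_enc(const char *name, char *out, size_t outlen)
{
    size_t i = 0;   /* read position in name */
    size_t o = 0;   /* write position in out */

    assert(name != NULL && out != NULL && outlen > 0);
    while (name[i] != '\0' && o + 1 < outlen) {
        out[o++] = (char)tolower((unsigned char)name[i]);
        i++;
        /* Right after a 3-letter "utf"/"ucs" prefix, add the missing hyphen. */
        if (i == 3 && o + 1 < outlen
                && (strncmp(out, "utf", 3) == 0 || strncmp(out, "ucs", 3) == 0)
                && isdigit((unsigned char)name[i]))
            out[o++] = '-';
    }
    out[o] = '\0';
}
```

With a 32-byte buffer, demo_normalize_enc("ucs2be", buf, sizeof buf) leaves "ucs-2be" in buf, and a name that already has the hyphen, such as "utf-16le", passes through unchanged.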

I considered using enc_canonize(), but it converts BE encodings to canonical names without the endianness suffix (e.g., "utf16be" → "utf-16", "ucs2be" → "ucs-2"), which loses the information needed by iconv.

convert_string() cannot handle UTF-16 because it uses string_convert(), which expects NUL-terminated strings. UTF-16 text contains 0x00 bytes inside characters (e.g., "H" is 0x48 0x00 in UTF-16LE), so the input is cut off at the first such byte. Therefore, for UTF-16/32 encodings, the fix uses string_convert_ext() with an explicit input length to convert the entire blob at once.
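
The truncation can be reproduced outside Vim with a minimal sketch. The helper names below are hypothetical and only contrast a NUL-terminated view of the bytes with a length-aware one; they are not the Vim functions mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* "Hi" encoded as UTF-16LE: each ASCII character is followed by a 0x00 byte. */
static const unsigned char demo_utf16le_hi[] = { 0x48, 0x00, 0x69, 0x00 };

/* What a NUL-terminated API sees: it stops at the first 0x00 byte. */
static size_t demo_len_as_c_string(const unsigned char *buf)
{
    return strlen((const char *)buf);
}

/* What a length-aware API can see: the number of 2-byte code units. */
static size_t demo_utf16_units(size_t len_in_bytes)
{
    assert(len_in_bytes % 2 == 0);  /* UTF-16 data is a whole number of units */
    return len_in_bytes / 2;
}
```

demo_len_as_c_string(demo_utf16le_hi) returns 1 even though the array holds 4 bytes, while demo_utf16_units(sizeof demo_utf16le_hi) returns 2, one unit per character of "Hi".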

The code appends two NUL bytes (ga_append(&blob_ga, NUL) twice) because UTF-16 requires a 2-byte NUL terminator (0x00 0x00), not a single-byte NUL.
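
As a rough illustration of the terminator width, a NUL code unit is as wide as the encoding's code unit: 2 bytes for UTF-16/UCS-2 and 4 bytes for UTF-32/UCS-4. The sketch below uses hypothetical helpers and a plain byte buffer instead of Vim's growarray; it does not show what the patch itself does for 4-byte encodings.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: bytes per code unit, judged from a normalized name. */
static size_t demo_code_unit_size(const char *enc)
{
    assert(enc != NULL);
    if (strncmp(enc, "utf-32", 6) == 0 || strncmp(enc, "ucs-4", 5) == 0)
        return 4;
    if (strncmp(enc, "utf-16", 6) == 0 || strncmp(enc, "ucs-2", 5) == 0)
        return 2;
    return 1;
}

/*
 * Append one all-zero code unit to buf, which holds len bytes and has room
 * for cap bytes. Returns the new length, or 0 if the terminator does not fit.
 */
static size_t demo_append_terminator(unsigned char *buf, size_t len,
                                     size_t cap, const char *enc)
{
    size_t unit = demo_code_unit_size(enc);

    if (len + unit > cap)
        return 0;
    memset(buf + len, 0, unit);
    return len + unit;
}
```

For "utf-16le" this appends two zero bytes, matching the two ga_append() calls described above; for "utf-32be" it would append four.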

  • src/strings.c: Add from_encoding_raw to preserve endianness, special handling for UTF-16/32 and UCS-2/4
  • src/mbyte.c: Fix convert_setup_ext() to use == ENC_UNICODE instead of & ENC_UNICODE. The bitwise AND was incorrectly treating UTF-16/UCS-2 (which have ENC_UNICODE + ENC_2BYTE etc.) as UTF-8, causing iconv setup to be skipped.
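
The difference between the two checks in convert_setup_ext() can be shown with a small sketch. The DEMO_* flag values are illustrative stand-ins for Vim's ENC_* constants and are not taken from the patch; the helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative property flags; Vim defines its own ENC_* constants. */
enum {
    DEMO_ENC_UNICODE  = 0x04,
    DEMO_ENC_ENDIAN_B = 0x10,
    DEMO_ENC_2BYTE    = 0x40
};

/* Old-style check: true for anything that has the Unicode bit set. */
static bool demo_is_utf8_bitand(int props)
{
    assert(props >= 0);
    return (props & DEMO_ENC_UNICODE) != 0;
}

/* New-style check: true only when the Unicode bit is the sole property. */
static bool demo_is_utf8_equal(int props)
{
    assert(props >= 0);
    return props == DEMO_ENC_UNICODE;
}
```

With props set to DEMO_ENC_UNICODE + DEMO_ENC_ENDIAN_B + DEMO_ENC_2BYTE (a big-endian 2-byte encoding), demo_is_utf8_bitand() returns true and demo_is_utf8_equal() returns false, which is the misclassification the bullet above describes.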

Closes #19198


You can view, comment on, or merge this pull request online at:

  https://github.com/vim/vim/pull/19246

File Changes

(4 files)

Reply to this email directly, view it on GitHub.
You are receiving this because you are subscribed to this thread.
Message ID: <vim/vim/pull/19246@github.com>

h_east

10:21 AM
to vim/vim, Subscribed

@h-east commented on this pull request.


In src/strings.c:

> +	// UTF-16 requires 2-byte NUL terminator
+	ga_append(&blob_ga, NUL);
+	ga_append(&blob_ga, NUL);

Is it okay to use 2 bytes even in the case of 4-byte encoding such as 'utf-32'?


In src/strings.c:

> +    // Special handling for UTF-16/UCS-2/UTF-32/UCS-4 encodings: convert entire blob before splitting by newlines
+    if (from_encoding != NULL && (STRNCMP(from_encoding, "utf-16", 6) == 0
+				   || STRNCMP(from_encoding, "utf16", 5) == 0
+				   || STRNCMP(from_encoding, "ucs-2", 5) == 0
+				   || STRNCMP(from_encoding, "ucs2", 4) == 0
+				   || STRNCMP(from_encoding, "utf-32", 6) == 0
+				   || STRNCMP(from_encoding, "utf32", 5) == 0
+				   || STRNCMP(from_encoding, "ucs-4", 5) == 0
+				   || STRNCMP(from_encoding, "ucs4", 4) == 0))

Isn't from_encoding already canonicalized by enc_canonize()?
Or can it be determined using enc_canon_props()?

