[feature-requests:#1575] cp936/GBK treat 0x80 as valid single byte
Status: open
Group: Initial
Labels: Scintilla encoding dbcs
Created: Sun Jan 18, 2026 02:08 PM UTC by Zufu Liu
Last Updated: Sun Jan 18, 2026 02:08 PM UTC
Owner: Neil Hodgson
According to https://github.com/python/cpython/issues/72530, 0x80 in Windows code page 936 and web GBK maps to the Euro sign € (U+20AC).
The change for IsDBCSValidSingleByte() is simple:
@@ -90,10 +90,13 @@
 bool IsDBCSValidSingleByte(int codePage, int ch) noexcept {
     switch (codePage) {
     case cp932:
+        // Shift_jis
         return ch == 0x80
             || (ch >= 0xA0 && ch <= 0xDF)
             || (ch >= 0xFD);
-
+    case cp936:
+        // GBK
+        return ch == 0x80;
     default:
         return false;
     }
But I am not sure whether it will cause problems on earlier Windows versions or on non-Windows systems.
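For readers who want to try the patched logic outside Scintilla, here is a minimal standalone sketch. It assumes the identifiers cp932 and cp936 hold the Windows code page numbers 932 and 936, and it renames the function with a Sketch suffix and makes it constexpr so the checks run at compile time. Neither change is part of the patch.

```cpp
// Assumption: cp932 and cp936 are taken to be the Windows code page numbers.
constexpr int cp932 = 932;
constexpr int cp936 = 936;

// Standalone copy of the patched predicate (hypothetical name, constexpr added).
constexpr bool IsDBCSValidSingleByteSketch(int codePage, int ch) noexcept {
    switch (codePage) {
    case cp932:
        // Shift_JIS: 0x80, half-width katakana block, and 0xFD..0xFF
        return ch == 0x80
            || (ch >= 0xA0 && ch <= 0xDF)
            || (ch >= 0xFD);
    case cp936:
        // GBK: only 0x80 (Euro sign) is accepted as a single byte
        return ch == 0x80;
    default:
        return false;
    }
}

// Compile-time checks of the behaviour described in the ticket.
static_assert(IsDBCSValidSingleByteSketch(cp936, 0x80), "0x80 is valid in cp936");
static_assert(!IsDBCSValidSingleByteSketch(cp936, 0x81), "0x81 is a lead byte, not a single byte");
static_assert(IsDBCSValidSingleByteSketch(cp932, 0xA0), "cp932 behaviour is unchanged");
```

Code pages without a case fall through to the default and report no valid single bytes above ASCII, as in the diff.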
> not sure whether it will cause problem on earlier or non-Windows systems.
I tested the following code on XP, Vista and Win7:
#include <windows.h>
#include <stdio.h>

int main(void) {
    char chars[2] = {'\x80'};
    wchar_t code[2] = {0};
    int len = MultiByteToWideChar(936, 0, chars, 1, code, 2);
    return printf("len=%d, code=%04X\n", len, code[0]);
}
The output is len=1, code=20AC, the same as on Win 10 and 11.
Committed as [808977]. Works on Linux/GTK and macOS/Cocoa as well. Failures on other platforms can be addressed if they occur.
[feature-requests:#1575] cp936/GBK treat 0x80 as valid single byte
Status: open
Group: Committed
Labels: Scintilla encoding dbcs
Created: Sun Jan 18, 2026 02:08 PM UTC by Zufu Liu
Last Updated: Sun Jan 18, 2026 02:40 PM UTC
Owner: Neil Hodgson
Off-topic: the single-byte range for CP932/Shift-JIS seems to contain EUDC (end-user-defined characters?).
https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit932.txt
vs https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT
EUDC from https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/
| Code Page | EUDC | Control |
|---|---|---|
| 932 | 0xa0, 0xfd - 0xff | |
| 936 | 0xff | |
| 949, 950 | 0xff | 0x80 |
| 1361 | 0xd4 - 0xff | 0x80 - 0x83 |
Should these EUDC also be included?
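To make the question concrete, the EUDC column of the table could be expressed as data. This is only a sketch of what such a check might look like. The struct, the array and the function name are hypothetical and nothing like this exists in the patch. The 932 entries (0xA0 and 0xFD to 0xFF) are already accepted by the existing cp932 case.

```cpp
// Hypothetical record: one inclusive single-byte range for one code page.
struct SingleByteRange {
    int codePage;
    int first;
    int last;
};

// EUDC ranges transcribed from the table above; not verified independently.
constexpr SingleByteRange eudcRanges[] = {
    {932, 0xA0, 0xA0},
    {932, 0xFD, 0xFF},
    {936, 0xFF, 0xFF},
    {949, 0xFF, 0xFF},
    {950, 0xFF, 0xFF},
    {1361, 0xD4, 0xFF},
};

// Returns true when ch falls in a listed single-byte EUDC range for codePage.
constexpr bool IsSingleByteEUDCSketch(int codePage, int ch) noexcept {
    for (const SingleByteRange &range : eudcRanges) {
        if (range.codePage == codePage && ch >= range.first && ch <= range.last) {
            return true;
        }
    }
    return false;
}

static_assert(IsSingleByteEUDCSketch(936, 0xFF), "0xFF is listed as EUDC for 936");
static_assert(!IsSingleByteEUDCSketch(936, 0x80), "0x80 in 936 is the Euro sign, not EUDC");
```

The Control column (0x80 for 949 and 950, 0x80 to 0x83 for 1361) is left out here because the question is only about EUDC.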
[feature-requests:#1575] cp936/GBK treat 0x80 as valid single byte
Status: open
Group: Committed
Labels: Scintilla encoding dbcs
Created: Sun Jan 18, 2026 02:08 PM UTC by Zufu Liu
Last Updated: Mon Jan 19, 2026 05:01 AM UTC
Owner: Neil Hodgson
Without a more specific benefit, such as a report of use of some single-byte EUDC, I think changing behaviour is more likely to produce new problems.
[feature-requests:#1575] cp936/GBK treat 0x80 as valid single byte
Status: open
Group: Committed
Labels: Scintilla encoding dbcs
Created: Sun Jan 18, 2026 02:08 PM UTC by Zufu Liu
Last Updated: Mon Jan 19, 2026 06:13 AM UTC
Owner: Neil Hodgson
OK.
[feature-requests:#1575] cp936/GBK treat 0x80 as valid single byte
Status: open
Group: Committed
Labels: Scintilla encoding dbcs
Created: Sun Jan 18, 2026 02:08 PM UTC by Zufu Liu
Last Updated: Mon Jan 19, 2026 11:21 PM UTC
Owner: Neil Hodgson
0x80 can be omitted from CP932/Shift_JIS, as it just maps to the C1 control character U+0080 in CP932 and is unsupported in the other Japanese encodings:
>>> pages = ['cp932', 'shift_jis', 'shift_jis_2004', 'shift_jisx0213', 'euc_jp', 'euc_jis_2004', 'euc_jisx0213']
>>> [b'\x80'.decode(page, 'backslashreplace') for page in pages]
['\x80', '\\x80', '\\x80', '\\x80', '\\x80', '\\x80', '\\x80']
Omitting it will cause a visual change, though: the byte is currently rendered by the platform as an invisible character, box or question mark, and would instead be rendered by Scintilla as a hex blob.
[feature-requests:#1575] cp936/GBK treat 0x80 as valid single byte
Status: open
Group: Committed
Labels: Scintilla encoding dbcs
Created: Sun Jan 18, 2026 02:08 PM UTC by Zufu Liu
Last Updated: Tue Jan 20, 2026 09:53 AM UTC
Owner: Neil Hodgson