[scintilla:feature-requests] #1575 cp936/GBK treat 0x80 as valid single byte


Zufu Liu

Jan 18, 2026, 9:08:37 AM
to scintill...@googlegroups.com

[feature-requests:#1575] cp936/GBK treat 0x80 as valid single byte

Status: open
Group: Initial
Labels: Scintilla encoding dbcs
Created: Sun Jan 18, 2026 02:08 PM UTC by Zufu Liu
Last Updated: Sun Jan 18, 2026 02:08 PM UTC
Owner: Neil Hodgson

According to https://github.com/python/cpython/issues/72530, 0x80 in Windows code page 936 and in web GBK maps to the Euro sign € (U+20AC).
The change to IsDBCSValidSingleByte() is simple:

@@ -90,10 +90,13 @@
 bool IsDBCSValidSingleByte(int codePage, int ch) noexcept {
    switch (codePage) {
    case cp932:
+       // Shift_jis
        return ch == 0x80
            || (ch >= 0xA0 && ch <= 0xDF)
            || (ch >= 0xFD);
-
+   case cp936:
+       // GBK
+       return ch == 0x80;
    default:
        return false;
    }

However, I am not sure whether this will cause problems on earlier Windows versions or on non-Windows systems.
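
For readers who want to try the change outside the Scintilla tree, here is a minimal standalone sketch of the function as it would look with the diff applied. The `cp932` and `cp936` constants are assumed here to equal the Windows code page numbers; in Scintilla they are defined elsewhere, so treat these definitions as placeholders.

```cpp
#include <cassert>

// Assumption: stand-ins for Scintilla's own code page constants.
constexpr int cp932 = 932;
constexpr int cp936 = 936;

// Sketch of IsDBCSValidSingleByte() with the patch above applied.
bool IsDBCSValidSingleByte(int codePage, int ch) noexcept {
    switch (codePage) {
    case cp932:
        // Shift_JIS
        return ch == 0x80
            || (ch >= 0xA0 && ch <= 0xDF)
            || (ch >= 0xFD);
    case cp936:
        // GBK: only 0x80 is accepted as a single byte
        return ch == 0x80;
    default:
        return false;
    }
}
```

With this sketch, `IsDBCSValidSingleByte(cp936, 0x80)` returns true, while any other byte value for cp936, and every byte for a code page not listed, returns false.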



Zufu Liu

Jan 18, 2026, 9:40:31 AM
to scintill...@googlegroups.com

> not sure whether it will cause problem on earlier or non-Windows systems.

I tested the following code on Windows XP, Vista and 7:

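#include <windows.h>
#include <stdio.h>

int main(void) {
    /* 0x80 is the byte under test; the second element is zero-initialized. */
    const char chars[2] = {'\x80'};
    wchar_t code[2] = {0};
    /* Decode exactly one byte as code page 936 (GBK). */
    int len = MultiByteToWideChar(936, 0, chars, 1, code, 2);
    printf("len=%d, code=%04X\n", len, (unsigned)code[0]);
    return 0;
}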

The output is len=1, code=20AC, the same as on Windows 10 and 11.

Neil Hodgson

Jan 19, 2026, 12:01:09 AM
to scintill...@googlegroups.com
  • Group: Initial --> Committed
  • Comment:

Committed as [808977]. Works on Linux/GTK and macOS/Cocoa as well. Failures on other platforms can be addressed if they occur.


[feature-requests:#1575] cp936/GBK treat 0x80 as valid single byte

Status: open
Group: Committed


Labels: Scintilla encoding dbcs
Created: Sun Jan 18, 2026 02:08 PM UTC by Zufu Liu

Last Updated: Sun Jan 18, 2026 02:40 PM UTC
Owner: Neil Hodgson

Zufu Liu

Jan 19, 2026, 1:13:51 AM
to scintill...@googlegroups.com

Off-topic: the single-byte range for CP932/Shift-JIS seems to contain EUDC (end-user-defined characters?).
Compare https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit932.txt
with https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT

EUDC from https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/

Code Page   EUDC                Control
932         0xa0, 0xfd - 0xff
936         0xff
949, 950    0xff                0x80
1361        0xd4 - 0xff         0x80 - 0x83

Should these EUDC also be included?
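
To make the question concrete, the EUDC column listed above could be encoded as a predicate like the one below. This is only an illustrative sketch: `IsSingleByteEUDC` is a hypothetical name, not a Scintilla function, and the byte ranges are taken from the listing in this message rather than re-checked against the mapping files.

```cpp
#include <cassert>

// Hypothetical helper, for illustration only: reports whether a byte is
// listed as single-byte EUDC for the given code page in the table above.
bool IsSingleByteEUDC(int codePage, int ch) noexcept {
    switch (codePage) {
    case 932:
        return ch == 0xA0 || (ch >= 0xFD && ch <= 0xFF);
    case 936:
    case 949:
    case 950:
        return ch == 0xFF;
    case 1361:
        return ch >= 0xD4 && ch <= 0xFF;
    default:
        return false;
    }
}
```

Under this reading, the CP932 bytes 0xA0 and 0xFD to 0xFF are already accepted by the cp932 case of IsDBCSValidSingleByte(), since 0xA0 falls inside 0xA0..0xDF and the rest satisfy ch >= 0xFD; the entries for the other code pages would be new behaviour.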



Neil Hodgson

Jan 19, 2026, 6:21:22 PM
to scintill...@googlegroups.com

Without a more specific benefit, such as a report of use of some single-byte EUDC, I think changing behaviour is more likely to produce new problems.



Zufu Liu

Jan 20, 2026, 4:53:31 AM
to scintill...@googlegroups.com

OK.



Zufu Liu

Jan 21, 2026, 7:14:30 AM
to scintill...@googlegroups.com

0x80 could be omitted from CP932/Shift_JIS, as it just maps to the C1 control character U+0080 in CP932 and is unsupported in the other Japanese encodings:

>>> pages = ['cp932', 'shift_jis', 'shift_jis_2004', 'shift_jisx0213', 'euc_jp', 'euc_jis_2004', 'euc_jisx0213']
>>> [b'\x80'.decode(page, 'backslashreplace') for page in pages]
['\x80', '\\x80', '\\x80', '\\x80', '\\x80', '\\x80', '\\x80']
>>>

Omitting it would cause a visual change, though: the byte is currently rendered by the platform as an invisible character, a box, or a question-mark block, whereas it would then be rendered by Scintilla as a hex blob.
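
As a rough sketch of what that omission would mean for the CP932 case, the check below drops the 0x80 test and keeps the other two ranges from the existing code. The function name is hypothetical and this is not the committed code, only an illustration of the idea.

```cpp
#include <cassert>

// Hypothetical variant, for illustration only: the CP932 single-byte check
// with the 0x80 test removed and the remaining ranges left unchanged.
bool IsValidSingleByteCP932Without0x80(int ch) noexcept {
    return (ch >= 0xA0 && ch <= 0xDF)
        || (ch >= 0xFD);
}
```

With this variant, 0x80 in a CP932 document would no longer count as a valid single byte, which is what leads to the rendering change described above.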


