Default character set SC_CHARSET_DEFAULT on Win32

163 views
Skip to first unread message

Neil Hodgson

unread,
Aug 22, 2016, 12:43:13 AM8/22/16
to Scintilla mailing list
The default character set in Scintilla is SC_CHARSET_DEFAULT=1. This value is forwarded to Windows when calling APIs like CreateFontIndirect. On English/European installations this means Code Page 1252 for European characters. On Japanese/Korean/Chinese it means code pages 932, 949, 936, or 950.

Some of the Win32 text measuring and display code assumes a single-byte character set for SC_CHARSET_DEFAULT and shows assertion errors on East Asian installations in debug mode as well as incorrect displays.

There appears to be two main ways to fix this:

1) Treat SC_CHARSET_DEFAULT the same way as Win32 would by calling GetACP and treating the text as one of the East Asian code pages if that is the result. This is likely to be the expected result for CJK programmers. The downside is a system call deep within the display code where it will be called many times causing overhead. There may be ways to move this call and so minimize cost.

2) Treat SC_CHARSET_DEFAULT as indicating European text in code page 1252. This would make the results more predictable and its likely what Western programmers expect.

For (1) a first attempt at an implementation looks like this:

UINT CodePageFromCharSet(DWORD characterSet, UINT documentCodePage) {
if (documentCodePage == SC_CP_UTF8) {
return SC_CP_UTF8;
}
if (characterSet == SC_CHARSET_DEFAULT) {
const UINT acp = ::GetACP();
switch (acp) {
case 932:
case 936:
case 949:
case 950:
case 1361:
return acp;
}
}
switch (characterSet) {
case SC_CHARSET_ANSI: return 1252;
case SC_CHARSET_DEFAULT: return documentCodePage;
//…

Neil

johnsonj

unread,
Aug 22, 2016, 10:28:33 AM8/22/16
to scintilla-interest, nyama...@me.com
I think a simple solution might be default "dbcsCodePage = SC_CP_UTF8" in Document::Document().

What if all dbcsCodePage is 0?
If then, I guess the above strategy could not tell ANSI from 8859-1.


Neil Hodgson

unread,
Aug 22, 2016, 7:50:01 PM8/22/16
to scintilla-interest
johnsonj:

> I think a simple solution might be default "dbcsCodePage = SC_CP_UTF8" in Document::Document().

While defaulting to Unicode would have been a good choice to have made early in Scintilla’s development, this is likely to break many downstream projects if done now.

Neil

Mike Lischke

unread,
Aug 23, 2016, 2:49:16 AM8/23/16
to scintilla...@googlegroups.com

>> I think a simple solution might be default "dbcsCodePage = SC_CP_UTF8" in Document::Document().
>
> While defaulting to Unicode would have been a good choice to have made early in Scintilla’s development, this is likely to break many downstream projects if done now.

Downstream projects don't *have* to upgrade. If they don't want a pure Unicode backend they can stay with whatever they have used last. Or in other words: if they don't want to go with the future, they have to stay in the past.

You could start with a new branch and merge in bug fixes in both versions for a while. After a year or so you could stop the older version, giving so downstream projects enough time to adapt. Just a thought.

Mike
--
www.soft-gems.net

Neil Hodgson

unread,
Aug 25, 2016, 1:47:01 AM8/25/16
to scintilla...@googlegroups.com
Me:

> 1) Treat SC_CHARSET_DEFAULT the same way as Win32 would by calling GetACP and treating the text as one of the East Asian code pages if that is the result. This is likely to be the expected result for CJK programmers. The downside is a system call deep within the display code where it will be called many times causing overhead. There may be ways to move this call and so minimize cost.

Changing the system locale causes a reboot so the ACP can not change while Scintilla is running and can be cached at start up. Better patch that sets a code page for the default character set at startup. Only uses this where there is no document code page:

diff -r dd92d66fc96f win32/ScintillaWin.cxx
--- a/win32/ScintillaWin.cxx Thu Aug 25 11:57:48 2016 +1000
+++ b/win32/ScintillaWin.cxx Thu Aug 25 15:18:03 2016 +1000
@@ -159,6 +159,8 @@

static HMODULE commctrl32 = 0;

+static UINT defaultCharsetCodePage = 0;
+
/**
*/
class FormatEnumerator {
@@ -1145,7 +1147,7 @@
}
switch (characterSet) {
case SC_CHARSET_ANSI: return 1252;
- case SC_CHARSET_DEFAULT: return documentCodePage;
+ case SC_CHARSET_DEFAULT: return documentCodePage ? documentCodePage : defaultCharsetCodePage;
case SC_CHARSET_BALTIC: return 1257;
case SC_CHARSET_CHINESEBIG5: return 950;
case SC_CHARSET_EASTEUROPE: return 1250;
@@ -3419,6 +3421,16 @@
// This function is externally visible so it can be called from container when building statically.
// Must be called once only.
int Scintilla_RegisterClasses(void *hInstance) {
+ const UINT acp = ::GetACP();
+ switch (acp) {
+ // Only set defaultCharsetCodePage for DBCS system locales
+ case 932:
+ case 936:
+ case 949:
+ case 950:
+ case 1361:
+ defaultCharsetCodePage = acp;
+ }
Platform_Initialise(hInstance);
bool result = ScintillaWin::Register(static_cast<HINSTANCE>(hInstance));
#ifdef SCI_LEXER

Neil

johnsonj

unread,
Aug 26, 2016, 1:02:16 AM8/26/16
to scintilla-interest, nyama...@me.com
   Neil:

it works well for ime input.
but caret can move inside a character since it moves byte to byte.

I tried to initialize dbcsCodePage with SCI_SETCODEPAGE like SciTEWin, but in vain.
 

Neil Hodgson

unread,
Aug 26, 2016, 5:32:51 AM8/26/16
to scintilla...@googlegroups.com
johnsonj:

> it works well for ime input.
> but caret can move inside a character since it moves byte to byte.

OK, that appears to stop that idea. I’m most interested in avoiding crashes like the one reported by a Chinese user below.

Looks ike treating SC_CHARSET_DEFAULT as the European single byte code page 1252 when there is no explicit code page is safer. That is, in CodePageFromCharSet, something like:

switch (characterSet) {
case SC_CHARSET_ANSI: return 1252;
case SC_CHARSET_DEFAULT: return documentCodePage ? documentCodePage : 1252;

Stack trace of crash:
> SciTE.exe!Platform::Assert(const char * c, const char * file, int line) 行 3152 C++
SciTE.exe!SurfaceD2D::MeasureWidths(Font & font_, const char * s, int len, float * positions) 行 1662 C++
SciTE.exe!PositionCache::MeasureWidths(Surface * surface, const ViewStyle & vstyle, unsigned int styleNumber, const char * s, unsigned int len, float * positions, Document * pdoc) 行 689 C++
SciTE.exe!EditView::LayoutLine(const EditModel & model, int line, Surface * surface, const ViewStyle & vstyle, LineLayout * ll, int width) 行 498 C++
SciTE.exe!EditView::PaintText(Surface * surfaceWindow, const EditModel & model, PRectangle rcArea, PRectangle rcClient, const ViewStyle & vsDraw) 行 1850 C++
SciTE.exe!Editor::Paint(Surface * surfaceWindow, PRectangle rcArea) 行 1756 C++
SciTE.exe!ScintillaWin::WndPaint(unsigned long wParam) 行 819 C++
SciTE.exe!ScintillaWin::WndProc(unsigned int iMessage, unsigned long wParam, long lParam) 行 1241 C++
SciTE.exe!ScintillaWin::SWndProc(HWND__ * hWnd, unsigned int iMessage, unsigned int wParam, long lParam) 行 3409 C++
[外部代码]
SciTE.exe!WinMain(HINSTANCE__ * hInstance, HINSTANCE__ * __formal, char * __formal, int __formal) 行

Even with assertions off, this leads to bad position information which leads to buggy behaviour.

Neil

Neil Hodgson

unread,
Aug 27, 2016, 10:37:54 PM8/27/16
to scintilla...@googlegroups.com
Me:

> switch (characterSet) {
> case SC_CHARSET_ANSI: return 1252;
> case SC_CHARSET_DEFAULT: return documentCodePage ? documentCodePage : 1252;

This has been committed as
https://sourceforge.net/p/scintilla/code/ci/683c955af63eb8450aa7ce875fd466dd61ed0404/

Also improved documentation for character sets and listed character sets supported by Cocoa.

Neil

johnsonj

unread,
Aug 28, 2016, 3:20:09 AM8/28/16
to scintilla-interest, nyama...@me.com
Neil:
This works well for me for both viewing and editing.
Would you please take a review?
-------------------------------------------------------------------------------------------------
CaseFolder *ScintillaWin::CaseFolderForEncoding() {
    // If App does not set user's dbcsCodePage, set it with system codepage.
    if (vs.styles[STYLE_DEFAULT].characterSet == SC_CHARSET_DEFAULT) {
        if ((pdoc->dbcsCodePage == 0) && (ValidCodePage(::GetACP()))) {
            pdoc->SetDBCSCodePage(::GetACP());
        }
    }

    UINT cpDest = CodePageOfDocument();
    if (cpDest == SC_CP_UTF8) {
        return new CaseFolderUnicode();
    } else {
    .....

johnsonj

unread,
Aug 28, 2016, 5:13:26 AM8/28/16
to scintilla-interest, nyama...@me.com
 Neil:

I will take my word back.
It proved randomly work.

I will try to find a hot spot.

Neil Hodgson

unread,
Aug 28, 2016, 6:38:25 PM8/28/16
to scintilla...@googlegroups.com
johnsonj:

> I will take my word back.
> It proved randomly work.

Yes, that looked like a particularly uncertain spot as it only called to set up for a search.

> I will try to find a hot spot.

There is no point at which Scintilla can tell that the application has finished set up. It may, for example, perform initialisation incrementally during idle time in order to avoid blocking. It may also use Scintilla before setting up by, for example, searching for an indicator of the programming language mode in the text. If you set the code page to (say) 949 based on an early version of character set (with Scintilla initialising to SC_CHARSET_DEFAULT) then there will be trouble if the application changes the character set later to SC_CHARSET_RUSSIAN. The user won’t see the Russian characters they want.

Neil

johnsonj

unread,
Aug 30, 2016, 3:57:25 AM8/30/16
to scintilla-interest, nyama...@me.com
Thank you for instructions.
I will keep in mind.

I still think it needs for scintilla to handle SC_CHARSET_DEFAULT.
surely there might be a way.

johnsonj

unread,
Sep 9, 2016, 4:30:57 AM9/9/16
to scintilla-interest, nyama...@me.com
Neil: SCI_GRABFOCUS.

ValidCodePage(::GetACP()) is called only once whenever SCI_GRABFOCUS.

        case SCI_GRABFOCUS:
            ::SetFocus(MainHWND());

            if (vs.styles[STYLE_DEFAULT].characterSet == SC_CHARSET_DEFAULT) {
                if ((pdoc->dbcsCodePage == 0) && (ValidCodePage(::GetACP()))) {
                    pdoc->SetDBCSCodePage(::GetACP());
                }
            }
            break;

It appears not bad yet to me.
but I could not find a way to notify SciTE of documentCodePage change.

johnsonj

unread,
Sep 9, 2016, 6:34:05 AM9/9/16
to scintilla-interest, nyama...@me.com
Neil: ScitillQt tested with Haven
 
       case SCI_GRABFOCUS:
            scrollArea->setFocus(Qt::OtherFocusReason);

            if (vs.styles[STYLE_DEFAULT].characterSet == SC_CHARSET_DEFAULT) {
                if ((pdoc->dbcsCodePage == 0)) {
                    QLocale sysLocale = QLocale::system();
                    if (sysLocale.language() != QLocale::C) {
                        int sysCodePage = 0;
                        switch (sysLocale.language()) {
                        case QLocale::Japanese:
                             sysCodePage = 932;
                        case QLocale::Chinese:
                             sysCodePage = 936; //950
                        case QLocale::Korean:
                             sysCodePage = 949; //1351
                        default:
                             sysCodePage = SC_CP_UTF8;
                        }
                        pdoc->SetDBCSCodePage(sysCodePage);
                    }
                }
            }
            break;

Neil Hodgson

unread,
Sep 9, 2016, 6:40:55 PM9/9/16
to scintilla-interest
johnsonj:

> Neil: SCI_GRABFOCUS.
>
> ValidCodePage(::GetACP()) is called only once whenever SCI_GRABFOCUS.
>
> case SCI_GRABFOCUS:

If you are trying to find a good location that will always be called, SCI_GRABFOCUS doesn’t have to be called. I expect most applications just call the platform function (SetFocus on Windows) so this code will never be executed.

> but I could not find a way to notify SciTE of documentCodePage change.

Which is a problem with this approach: the application should be in charge so should choose the code page and character set. Existing applications aren’t expecting the code page to change unless they do it.

Neil

Reply all
Reply to author
Forward
0 new messages