Get correct Scintilla-Text with UTF8 encoding?

Charly Dante

May 18, 2014, 7:08:58 AM

to scintilla...@googlegroups.com

Hi there,

I'm currently trying to get the correct text from my Scintilla control in UTF-8 encoding, but so far I have failed to do so.

I have used SendMessage(my_sci_window, SCI_SETCODEPAGE, SC_CP_UTF8, 0); to enable all Unicode characters, such as Russian or Chinese, for my edit control. These characters are displayed correctly when entered into the Scintilla control.

However, if I try to get the whole text of the Scintilla control, I only get garbage values for the characters that were Chinese/Russian symbols. I use this code to get the Scintilla text:

size_t text_length = SendMessage(my_sci_window, WM_GETTEXTLENGTH, 0, 0);
char *buffer = new char[text_length + 1];
SendMessage(my_sci_window, WM_GETTEXT, text_length + 1, reinterpret_cast<LPARAM>(buffer));

I tried both WM_GETTEXT and SCI_GETTEXT, but without finding any difference. Also converting the char buffer array to a wchar_t buffer array using mbstowcs_s doesn't help. As soon as the Scintilla Control contains any "non-normal" chars like Chinese/Thai etc., they are translated to some broken chars and I cannot get the correct text of the Scintilla control.

Any idea how to fix this?

Best,
CD

Neil Hodgson

May 19, 2014, 9:09:50 PM

to scintilla...@googlegroups.com

Charly Dante:

> However, if I try to get the whole text of the Scintilla Control, I only get Garbage values for the chars that were Chinese/Russian Symbols.

I was very hesitant to reply to this question because that description is extremely vague. Unless you include more information in your question, no one will have any idea what your problem is, and it is unlikely anyone will reply.

Neil

Charly Dante

May 21, 2014, 2:22:54 PM

to scintilla...@googlegroups.com, nyama...@me.com

Hi,

OK, then I will explain my problem in more detail. Actually, the only thing I want is a function to get the whole text of the Scintilla control and a function to set the whole text of the Scintilla control. I already wrote those functions and they work perfectly fine with normal chars. However, if the code contains Chinese/Russian/etc. chars, I get problems.

If you open SciTE and select "File->Encoding->UTF-8", you can paste basically all of those chars into the SciTE text window and SciTE will display them correctly. Let's take this test string as an example: "test string 俿侶侏佶". It contains several normal chars but also some Chinese characters. If I paste it into SciTE with the encoding set to UTF-8, I have no problems and all chars are displayed correctly.

However, with those chars my functions to get and set the text of the control no longer work. Let's consider the following code:

// Get the text length of the scite control
size_t text_length = SendMessage(scite_hwnd, WM_GETTEXTLENGTH, 0, 0);

// Allocate a buffer for the text
char *buffer = new char[text_length];

// Get the text of the scite control
SendMessage(scite_hwnd, WM_GETTEXT, text_length, (LPARAM)buffer);

// Test: Convert the text to a wchar_t array
size_t newsize = strlen(buffer) + 1;
wchar_t * wcstring = new wchar_t[newsize];
size_t convertedChars = 0;
mbstowcs_s(&convertedChars, wcstring, newsize, buffer, _TRUNCATE);

// Set the text of the scite control: Both buffer (char array) and wcstring (wchar_t array) don't work correct
SendMessage(scite_hwnd, WM_SETTEXT, 0, (LPARAM)wcstring);

I tried WM_SETTEXT with both the char array and the wchar_t array, but neither delivers the result I expect. Basically, I only get the text of the SciTE control and then set it again to exactly the same text, so the text should remain unchanged.

With the wcstring I only get one letter in the control, namely "t". I guess this doesn't work at all because WM_SETTEXT expects a pointer to a char array and not a wchar_t array. However, with a char array it doesn't really work either, because if I use the char array I get:

test string 俿侶侏

followed by two "black boxes". The first black box contains the text xE4 and the second contains xBD; the last char of the test string "test string 俿侶侏佶" is missing.

This is my problem and this is what I want to fix. I hope this makes more sense now :)
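One plausible reading of the single "t" (an assumption, not something confirmed in this thread) is that the receiver reads the wchar_t buffer as a NUL-terminated char string. In UTF-16LE the letter "t" is the bytes 74 00, so the zero high byte ends the string after one character. The sketch below builds the UTF-16LE bytes by hand so it does not depend on the host byte order; `utf16le_bytes` and `length_as_char_string` are names made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: lay out UTF-16 code units as little-endian bytes,
// followed by a two-byte terminator, the way a wchar_t string sits in
// memory on Windows.
std::vector<char> utf16le_bytes(const std::u16string& text) {
    std::vector<char> bytes;
    for (char16_t unit : text) {
        bytes.push_back(static_cast<char>(unit & 0xFF));         // low byte first
        bytes.push_back(static_cast<char>((unit >> 8) & 0xFF));  // then high byte
    }
    bytes.push_back('\0');
    bytes.push_back('\0');
    return bytes;
}

// Length seen by code that reads the same memory as a char string.
std::size_t length_as_char_string(const std::vector<char>& bytes) {
    // Precondition: the buffer is NUL-terminated, as utf16le_bytes guarantees.
    assert(!bytes.empty() && bytes.back() == '\0');
    return std::strlen(bytes.data());
}
```

For any text that starts with an ASCII letter, `length_as_char_string(utf16le_bytes(...))` is 1, which would match the lone "t" described above.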

Best,
CD

Matthew Brush

May 21, 2014, 2:29:12 PM

to scintilla...@googlegroups.com

Hi,

Did you try to pass SC_CP_UTF8 to SCI_SETCODEPAGE message before setting
the buffer with UTF-8 encoded bytes?

Cheers,
Matthew Brush

Ferdinand Prantl

May 21, 2014, 3:02:05 PM

to scintilla...@googlegroups.com, nyama...@me.com

On 21 May 2014 20:22, Charly Dante <dante...@gmail.com> wrote:

> // Test: Convert the text to a wchar_t array
> size_t newsize = strlen(buffer) + 1;
> wchar_t * wcstring = new wchar_t[newsize];
> size_t convertedChars = 0;
> mbstowcs_s(&convertedChars, wcstring, newsize, buffer, _TRUNCATE);

You cannot generally use mbstowcs_s for UTF-8 conversions. It expects the input encoding set up by your current locale (LC_CTYPE), which may not be UTF-8. You can use APIs like MultiByteToWideChar or libiconv, for example. Having the luxury of the Scintilla sources, you can include UniConversion.h and UniConversion.cxx in your project and use them without additional dependencies.

When using mbstowcs_s or similar APIs, you should let the API compute the target length (in this case by calling the function with NULL as the target). You'd probably have no problem in your sample, but in the other direction, from wide characters to UTF-8, you would, because one wide character can need several bytes.
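To make the locale-independent conversion concrete, here is a minimal portable sketch of a UTF-8 to UTF-16 decoder. It is not Scintilla's UniConversion code and not MultiByteToWideChar; the names `decode_one` and `utf8_to_utf16` are invented for this example, and replacing malformed bytes with U+FFFD is just one possible policy. Because the result is a growing std::u16string, no separate length-computing call is needed here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decode one UTF-8 sequence starting at s[i] and advance i past it.
// Malformed input yields U+FFFD and consumes a single byte.
static char32_t decode_one(const std::string& s, std::size_t& i) {
    assert(i < s.size());
    const auto byte = [&](std::size_t k) { return static_cast<unsigned char>(s[k]); };
    const unsigned char b0 = byte(i);
    std::size_t len = 0;
    char32_t cp = 0;
    if (b0 < 0x80)                { len = 1; cp = b0; }
    else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
    else { ++i; return 0xFFFD; }                   // stray continuation or invalid lead byte
    if (i + len > s.size()) { ++i; return 0xFFFD; } // sequence cut off at end of input
    for (std::size_t k = 1; k < len; ++k) {
        const unsigned char b = byte(i + k);
        if ((b & 0xC0) != 0x80) { ++i; return 0xFFFD; }
        cp = (cp << 6) | (b & 0x3F);
    }
    // Reject overlong forms, surrogates, and values above U+10FFFF.
    static const char32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    if (cp < min_cp[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        ++i;
        return 0xFFFD;
    }
    i += len;
    return cp;
}

// Convert UTF-8 bytes to UTF-16 code units, independent of the C locale.
std::u16string utf8_to_utf16(const std::string& utf8) {
    std::u16string out;
    for (std::size_t i = 0; i < utf8.size();) {
        char32_t cp = decode_one(utf8, i);
        if (cp >= 0x10000) {                       // needs a surrogate pair
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<char16_t>(cp));
        }
    }
    return out;
}
```

With this policy, a three-byte sequence that has lost its last byte (for example E4 BD) decodes to two replacement characters, which is in line with the two "black boxes" reported earlier in the thread.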

I'm sorry for the off-topic - I think that Matthew's answer nailed it.

--- Ferda

Charly Dante

May 21, 2014, 3:34:48 PM

to scintilla...@googlegroups.com

Uhm, yes, as described in the initial post, I already did that. Otherwise SciTE would not display the Chinese letters correctly, right?

The option I mentioned "File->Encoding->UTF-8" should already do that, but I also tried it with explicitly sending SC_CP_UTF8 via SCI_SETCODEPAGE with no difference.

I tried MultiByteToWideChar like this:

// Get the correct length of the buffer
int wchars_num = MultiByteToWideChar(CP_UTF8, 0, buffer, -1, NULL, 0);

// Allocate an array of that length
wchar_t * wcstring = new wchar_t[wchars_num + 1];

// Convert the text to a wchar_t array
MultiByteToWideChar(CP_UTF8, 0, buffer, -1, wcstring, wchars_num);

// Try to send it to the scite window -> fail
SendMessage(scite_hwnd, WM_SETTEXT, 0, reinterpret_cast<LPARAM>(wcstring));

but it doesn't work either... So even if I can convert it correctly to a wchar_t array, how can I transmit it correctly to the Scintilla/SciTE edit window? Because SendMessage expects a pointer to a char array, so that doesn't seem to work at all...

I mean this has to be somehow possible, right?

Neil Hodgson

May 21, 2014, 6:24:39 PM

to scintilla...@googlegroups.com

[previous message was empty because I hit the wrong button]

Prefer SCI_* messages over WM_* messages, as SCI_* messages are defined completely by Scintilla and are more likely to be used by others and thus be maintained. WM_* messages are just for compatibility and are in the deprecated section of the documentation.

Charly Dante:

> // Get the text length of the scite control
> size_t text_length = SendMessage(scite_hwnd, WM_GETTEXTLENGTH, 0, 0);

http://msdn.microsoft.com/en-us/library/windows/desktop/ms632628(v=vs.85).aspx

-> The return value is the length of the text in characters, not including the terminating null character.

> // Allocate a buffer for the text
> char *buffer = new char[text_length];

The buffer should have a NUL terminator, so it's new char[text_length+1]

> // Get the text of the scite control
> SendMessage(scite_hwnd, WM_GETTEXT, text_length, (LPARAM)buffer);

http://msdn.microsoft.com/en-us/library/windows/desktop/ms632627(v=vs.85).aspx

-> wParam: The maximum number of characters to be copied, including the terminating null character.

text_length+1
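The effect of the missing + 1 can be reproduced without a window. The sketch below uses `copy_text`, a hypothetical stand-in (not a Scintilla or Win32 function) for a "get text" call whose size argument counts the terminating NUL, as the MSDN text quoted above describes for WM_GETTEXT. Passing the plain length drops the final byte, which cuts a multi-byte UTF-8 character in half.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical stand-in for a "get text" call whose size argument includes
// the terminating NUL: copies at most max_with_nul - 1 bytes, then
// terminates. Returns the number of text bytes copied.
std::size_t copy_text(const std::string& document, char* out, std::size_t max_with_nul) {
    if (max_with_nul == 0) return 0;
    const std::size_t n = std::min(document.size(), max_with_nul - 1);
    std::memcpy(out, document.data(), n);
    out[n] = '\0';
    return n;
}

// Correct sizing: allocate length + 1 and pass length + 1.
std::string get_all_text(const std::string& document) {
    std::vector<char> buffer(document.size() + 1);
    const std::size_t n = copy_text(document, buffer.data(), buffer.size());
    return std::string(buffer.data(), n);
}

// The off-by-one discussed above: pass the length without the + 1.
std::string get_text_off_by_one(const std::string& document) {
    std::vector<char> buffer(document.size() + 1);  // kept large enough to stay safe
    const std::size_t n = copy_text(document, buffer.data(), document.size());
    return std::string(buffer.data(), n);
}
```

For a document ending in the three bytes E4 BD B6 (U+4F76 in UTF-8), `get_text_off_by_one` returns text that ends in E4 BD, the same two stray bytes reported for the test string earlier in the thread.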

In the current release, 3.4.1, both WM_GETTEXT and WM_SETTEXT use UTF-8 (when that is set as the code page).

In release 3.4.2, for compatibility with some general-purpose applications like screen readers, WM_GETTEXT and WM_GETTEXTLENGTH will return UTF-16, but only when the Scintilla window is created as a wide character window (CreateWindowW).

Neil

Charly Dante

May 25, 2014, 4:39:35 PM

to scintilla...@googlegroups.com

Hi again,

I have now tried several things (including all the tips mentioned) and I'm quite desperate already, because I just can't get this to work. To ensure that the problem is not related to things like choosing the wrong length of a buffer, I provide a (very small) minimal example which already demonstrates my problem:

char *buffer = "teststringüü";
SendMessage(scite_hwnd, WM_SETTEXT, 0, (LPARAM)buffer);

This simply doesn't work. The text in the SciTE control is set to "teststring", followed by two black boxes containing "xFC" and not the desired chars "üü". I also tried SCI_SETTEXT (= 2181), but this one doesn't work at all. I don't know why, but the text of the SciTE text control is completely erased with SCI_SETTEXT (note that I'm trying to modify the SciTE text from another, external application; could that cause a problem with SCI_SETTEXT?). However, the main problem also exists within my own Scintilla application.

Both my application and SciTE are set to Unicode encoding (SciTE via "File->Encoding->UTF-8").

The strange thing is, if I copy and paste the test string "teststringüü" into the SciTE control, everything works fine and is displayed correctly. However, if I use the programmatic solution from above, I get black boxes and broken values in the edit control.

I just don't get what I'm doing wrong...

Best,
CD

Neil Hodgson

May 25, 2014, 5:30:37 PM

to scintilla...@googlegroups.com

Charly Dante:

> char *buffer = "teststringüü";
> SendMessage(scite_hwnd, WM_SETTEXT, 0, (LPARAM)buffer);
>
> This simply doesn't work. The text in the Scite Control is set to "teststring" and then followed by two black boxes containing "xFC" and not the desired chars "üü".

'\xFC' is 'ü' in Windows-1252 (and ISO-8859-1), so either your source code is in Windows-1252 or your compiler is converting the literal to Windows-1252.

http://en.wikipedia.org/wiki/Windows-1252
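If the bytes really are single-byte Latin text, they can be re-encoded before being handed to a control that expects UTF-8. The following is a minimal sketch under one stated assumption: the input is ISO-8859-1, where every byte value equals its Unicode code point. Windows-1252 differs from ISO-8859-1 in the range 0x80 to 0x9F, which this sketch does not handle; for 'ü' (0xFC) the two agree. The function name `latin1_to_utf8` is made up for this example.

```cpp
#include <cassert>
#include <string>

// Re-encode ISO-8859-1 bytes as UTF-8. Bytes below 0x80 pass through
// unchanged; every other byte becomes a two-byte UTF-8 sequence.
std::string latin1_to_utf8(const std::string& latin1) {
    std::string utf8;
    utf8.reserve(latin1.size() * 2);
    for (char ch : latin1) {
        const unsigned char b = static_cast<unsigned char>(ch);
        if (b < 0x80) {
            utf8.push_back(static_cast<char>(b));
        } else {
            utf8.push_back(static_cast<char>(0xC0 | (b >> 6)));    // top 2 bits
            utf8.push_back(static_cast<char>(0x80 | (b & 0x3F)));  // low 6 bits
        }
    }
    return utf8;
}
```

Writing the literal with explicit escapes, such as "teststring\xC3\xBC\xC3\xBC", is another way to keep the bytes independent of the source file's encoding.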

> I also tried SCI_SETTEXT (= 2181), but this one doesn't work at all, I don't know why, but the text of the Scite Textcontrol is completely erased with SCI_SETTEXT (note that I'm trying to modify the Scite Text from another, external Application, could that cause a problem with SCI_SETTEXT?). However, the Main-Problem exists also within my own Scintilla Application.

Each application has a separate address space and you cannot, in general, hand an address in your application to another. Windows provides limited interception of some known messages, including WM_SETTEXT, and marshals the string across to the other application's address space. SCI_SETTEXT is not one of those messages, so the pointer it carries is not meaningful in the other process.

Neil

Lex Trotman

May 25, 2014, 8:58:02 PM

to scintilla...@googlegroups.com

On 26 May 2014 06:39, Charly Dante <dante...@gmail.com> wrote:
> Hi again,
>
> I tried now several things (including all the mentioned tips) and I'm quite
> desperate already, because I just can't get that thing to work. To ensure
> that the problem is not related to such things like choosing the wrong
> length of a buffer, I provide a (very small) minimal example which already
> demonstrates my problems:
>
>
> char *buffer = "teststringüü";
> SendMessage(scite_hwnd, WM_SETTEXT, 0, (LPARAM)buffer);
>
>
> This simply doesn't work. The text in the Scite Control is set to
> "teststring" and then followed by two black boxes containing
> "xFC" and not the desired chars "üü". I also tried SCI_SETTEXT (= 2181), but

FC is not the UTF-8 encoding of the character ü; it's the Unicode
code point. The UTF-8 encoding is the two bytes C3 BC. Is your locale UTF-8?
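The relationship between the code point FC and the bytes C3 BC can be checked with a small encoder. This is a minimal sketch of the standard UTF-8 bit layout; `encode_utf8` is a name chosen for this illustration, not part of Scintilla or the Windows API.

```cpp
#include <cassert>
#include <string>

// Encode one Unicode code point as UTF-8 bytes.
// Precondition: cp is a scalar value (at most 0x10FFFF and not a surrogate).
std::string encode_utf8(char32_t cp) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {                                   // 1 byte: 0xxxxxxx
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {                           // 2 bytes: 110xxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {                         // 3 bytes
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {                                           // 4 bytes
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    return out;
}
```

Here `encode_utf8(0xFC)` yields the two bytes C3 BC, while a lone FC byte is not a valid UTF-8 sequence at all, which is why the control shows it as a box.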

Cheers
Lex
