In wxUSE_UNICODE_UTF8 mode, anything passing UTF-8 strings through wxTextBuffer::Translate() can result in broken string/encoding "translations" when the string contains non-wxChar compatible characters. For example, 😊 can be mistranslated to 😊on Windows.
This impacts wxTextCtrl and may affect other widgets/components that also use wxTextFile/wxTextBuffer.
Passing a UTF-8 wxString to wxTextBuffer::Translate while in wxUSE_UNICODE_UTF8 mode should not break the UTF-8 wxString encoding on any system. Since this mode forces wxString to use a UTF-8 encoding, this "newline translation" should be a non-destructive operation. However, on Windows (wxChar==wchar_t==16bit), if a UTF-8 wxString contains 😊 (32bits), the wxTextBuffer::Translate() call uses wxChar incorrectly, mistranslating the codepoint and returning 😊instead.
n/a
To use this snippet, you should apply #25127 (comment) first; otherwise, the SetValue() call will break the string before GetValue() breaks it, which is one known location where this bug can be triggered.
sampleTextCtrl->SetValue("😊"); // source files and manifest code page are UTF-8 wxString test = sampleTextCtrl->GetValue(); // <-- debug here and drill down into wxTextBuffer::Translate() // note that test = `😊`
wxUSE_UNICODE_UTF8 enabled.wxTextCtrl widget (ensure wxUSE_UNICODE_UTF8 is enabled).wxString test = wxTextCtrl->GetValue() in the code; preferably in a wxEVT_TEXT_ENTER event handler or similar.GetValue() call.😊 character into the wxTextCtrl field; press enter to trigger the event.wxTextCtrl::GetValue()->wxTextCtrl::GetRange()->wxTextFile::Translate()->wxTextBuffer::Translate()wxTextBuffer::Translate() will be the broken wxString: 😊.3.2.6wxMSWWindows 11—
Reply to this email directly, view it on GitHub, or unsubscribe.
You are receiving this because you are subscribed to this thread.![]()
As a momentary workaround in my local build, I'm using this patch for textbuf.cpp:
diff --git a/src/common/textbuf.cpp b/src/common/textbuf.cpp --- a/src/common/textbuf.cpp +++ b/src/common/textbuf.cpp @@ -108,7 +108,7 @@ // add to the current line - result += ch; + result += wxUniChar(*i);
Again, I'm not familiar with wxWidgets, much less its nuances. I am not informed about wxUniChar(); this just happened to work for me on a whim. I don't even know if wxString::const_iterator has an existing or better accessor for the current character in this context.
The main thing I noticed here was that on line:68 the code explicitly defines wxString eol = GetEOL(type) as a wxString type that it uses for newline substitutions, which suggests that the attempt to reconstruct the original string after newline substitution on line:109 using ch (wxChar typed) is probably incorrect (and indeed, is insufficient in my use case).
—
Reply to this email directly, view it on GitHub, or unsubscribe.
You are receiving this because you are subscribed to this thread.![]()
I wanted to add a test before fixing the bug but at least in master
diff --git a/tests/textfile/textfiletest.cpp b/tests/textfile/textfiletest.cpp index 1a1347e854..a91c187b51 100644 --- a/tests/textfile/textfiletest.cpp +++ b/tests/textfile/textfiletest.cpp @@ -327,6 +327,15 @@ void TextFileTestCase::ReadBig() f[NUM_LINES - 1] ); } +TEST_CASE("wxTextBuffer::Translate", "[textbuffer]") +{ + // Bytes with the value of LF that are part of an UTF-8 character shouldn't + // be mangled. + const wxString smiley = wxString::FromUTF8("\xf0\x9f\x98\x8a"); // U+1F60A + + CHECK( wxTextBuffer::Translate(smiley, wxTextFileType_Dos) == smiley ); +} + #ifdef __LINUX__ // Check if using wxTextFile with special files, whose reported size doesn't
doesn't show it and I don't see any differences with 3.2. Also, the code in Translate() looks correct to me as it iterates over characters (and not bytes): https://github.com/wxWidgets/wxWidgets/blob/16a07ec7096f0e56cbf51460125678734528b8cb/src/common/textbuf.cpp#L76-L78
I ran out of time for today and will try to get to this later, but if you already understand what's going on here, please let me know.
—
Reply to this email directly, view it on GitHub, or unsubscribe.
You are receiving this because you are subscribed to this thread.![]()
@vadz, I will try to look at the unit test later today if I have time (or Monday if not). My initial hunch is that you ran this test on a system/configuration where wchar_t==char32_t; however, if you retry it on a system/configuration where wchar_t==char16_t, such as on Windows, then the bug will be reproducible because 😊 cannot fit into char16_t==wchar_t==wxChar as a single codepoint.
Let me know, thanks.
—
Reply to this email directly, view it on GitHub, or unsubscribe.
You are receiving this because you are subscribed to this thread.![]()
Yes, you're right, use of wxChar is wrong in the presence of surrogates as it can't represent them.
The fix is trivial:
diff --git a/src/common/textbuf.cpp b/src/common/textbuf.cpp index 17f1bd8967..8c9910b753 100644 --- a/src/common/textbuf.cpp +++ b/src/common/textbuf.cpp @@ -72,11 +72,11 @@ wxString wxTextBuffer::Translate(const wxString& text, wxTextFileType type) // unnecessary relocations result.Alloc(text.Len()); - wxChar chLast = 0; + wxUniChar chLast = 0; for ( wxString::const_iterator i = text.begin(); i != text.end(); ++i ) { - wxChar ch = *i; - switch ( ch ) { + wxUniChar ch = *i; + switch ( ch.GetValue() ) { case wxT('\n'): // Dos/Unix line termination result += eol;
but we now need to find all the other places where we could be using wxChar in the same (wrong) way...
—
Reply to this email directly, view it on GitHub, or unsubscribe.
You are receiving this because you are subscribed to this thread.![]()
Yes, you're right, use of wxChar is wrong in the presence of surrogates as it can't represent them. The fix is trivial:
Excellent!
but we now need to find all the other places where we could be using wxChar in the same (wrong) way...
True, even though textbuf was the first culprit found, there certainly could be more instances. We should try our best to search for other cases of wxChar usage when explicitly/implicitly manipulating wxStrings. I should be able to try to look for these later today.
—
Reply to this email directly, view it on GitHub, or unsubscribe.
You are receiving this because you are subscribed to this thread.![]()
Note that #25136 already includes a few more fixes. But it's difficult to guarantee that there are no other bugs remaining, of course, so if you find anything else, please let me know and/or open new issue for them as I'm planning to merge that PR soon and so this issue will get closed.
—
Reply to this email directly, view it on GitHub, or unsubscribe.
You are receiving this because you are subscribed to this thread.![]()
@vadz, oh, that's very nice! I searched the codebase, and wxChar/wxString appeared within 5 lines of each other around 300 or so times. There are probably too many to check individually for this ticket? However, wxChar/wxString/iterator appears only around 32 times, and you've already addressed many of these 3 days ago (I had no idea). This is the output I was going to look through; I do not have time at the moment to look through (2):
Regarding (1), you found many of them already, so I removed the files you already fixed in #25136, but the rest of this list may warrant a look:
Show wxChar_wxString_iterator matches.include\wx\list.h @@ -1257,11 @@ + class WXDLLIMPEXP_BASE wxStringList : public wxStringListBase { public: + compatibility_iterator Append(wxChar* s) include\wx\string.h @@ -846,11 @@ + // This is implemented by maintaining linked list of iterators for every // string and traversing it in wxUniCharRef::operator=(). Head of the + // list is stored in wxString. (FIXME-UTF8) + class WXDLLIMPEXP_BASE iterator { + WX_STR_ITERATOR_IMPL(iterator, wxChar*, wxUniCharRef); include\wx\stringops.h @@ -70,11 @@ { SingleCharBuffer buf; + buf.data[0] = (wxChar)ch; buf.data[1] = 0; return buf; } include\wx\ustring.h @@ -526,11 @@ + void insert(iterator it, const_iterator first, const_iterator last) { + std::basic_string<wxChar32> *base = this; include\wx\private\jsscriptwrapper.h @@ -36,11 @@ // Adds one escape level. const char *charsNeededToBeEscaped = "\\\"\n\r\v\t\b\f"; m_escapedCode.reserve(js.size()); + for (wxString::const_iterator it = js.begin(); it != js.end(); ++it) { if (wxStrchr(charsNeededToBeEscaped, *it)) { m_escapedCode += '\\'; + switch ((wxChar) *it) include\wx\propgrid\propgriddefs.h @@ -663,11 @@ { public: + wxPGStringTokenizer( const wxString& str, wxChar delimiter ); src\common\cmdline.cpp @@ -42,11 @@ + static wxString GetOptionName(wxString::const_iterator p, + wxString::const_iterator end, + const wxChar *allowedChars); src\common\fileconf.cpp @@ -500,11 @@ + wxString strLine = buffer[n]; + // FIXME-UTF8: rewrite using iterators, without this buffer + wxWxCharBuffer buf(strLine.c_str()); src\common\filesys.cpp @@ -454,11 @@ + wxString loc = MakeCorrectPath(location); unsigned i, ln; + wxChar meta; wxFSFile *s = NULL; + wxList::compatibility_iterator node; src\common\tokenzr.cpp @@ -32,11 @@ + static wxString::const_iterator + find_first_of(const wxChar *delims, size_t len, src\common\translation.cpp @@ -1273,11 @@ if (context.IsEmpty()) + i = m_messages.find(wxString(str) + wxChar(index)); // plural, no context src\common\xtixml.cpp @@ -309,11 @@ + propertyNames.push_back( (const wxChar*)name.c_str() ); + propertyNodes[(const wxChar*)name.c_str()] = children->GetChildren(); children = children->GetNext(); } for ( int i = 0; i <classInfo->GetCreateParamCount(); ++i ) { + const wxChar* paramName = classInfo->GetCreateParamName(i); + PropertyNodes::iterator propiter = propertyNodes.find( paramName ); src\html\htmlpars.cpp @@ -33,11 @@ + const wxChar *wxTRACE_HTML_DEBUG = wxT("htmldebug"); src\html\htmltag.cpp @@ -44,11 @@ + wxString::const_iterator End1, End2; // name of this tag + wxChar *Name; src\html\winpars.cpp @@ -355,11 @@ + m_tmpStrBuf = new wxChar[lng+1]; m_tmpStrBufSize = lng+1; } + wxChar *temp = m_tmpStrBuf; + wxString::const_iterator i = txt.begin(); src\xrc\xmlres.cpp @@ -1585,11 @@ + const wxChar amp_char = (m_handler->m_resource->CompareVersion(2,3,0,1) < 0) ? '$' : '_'; + for ( wxString::const_iterator dt = str1.begin(); dt != str1.end(); ++dt ) tests\regex\regextest.cpp @@ -319,11 @@ + wxString str; + wxArrayString::const_iterator it; + str << (wxChar)m_mode << wxT(" ") << m_id << wxT(" ") << m_flags << wxT(" ")
I hope this is helpful for you.
—
Reply to this email directly, view it on GitHub, or unsubscribe.
You are receiving this because you are subscribed to this thread.![]()
@vadz, I have only glanced over the wxChar/wxString match results in (2) listed here. I do not possess enough familiarity/knowledge of wxWidgets to know which of the 300 cases could use an update or not. I have listed some here seems like they could possibly be updated:
Show a few potential wxChar_wxString matches.include\wx\arrstr.h @@ -474,11 @@ + WXDLLIMPEXP_BASE wxString wxJoin(const wxArrayString& arr, + const wxChar sep, include\wx\regex.h @@ -92,11 @@ + bool Matches(const wxString& text, int flags = 0) const; + bool Matches(const wxChar *text, int flags, size_t len) const include\wx\textbuf.h @@ -54,11 @@ // get the buffer termination string + static const wxChar *GetEOL(wxTextFileType type = typeDefault); include\wx\msw\private.h @@ -197,11 @@ + // Macros for converting wxString to the type expected by API functions. + #define wxMSW_CONV_LPCTSTR(s) static_cast<const wxChar *>((s).t_str()) include\wx\msw\registry.h @@ -87,11 @@ + static const wxChar *GetStdKeyName(size_t key); // get the short name of a standard key + static const wxChar *GetStdKeyShortName(size_t key); include\wx\osx\app.h @@ -109,11 @@ + bool MacSendKeyDownEvent( wxWindow* focus , long keyval , long modifiers , long when , wxChar uniChar ) ; include\wx\osx\dataform.h @@ -15,11 @@ + wxDataFormat(const wxString& rId); + wxDataFormat(const wxChar* pId); include\wx\propgrid\property.h @@ -533,11 @@ + // wxFileProperty/wxImageFileProperty specific, wxChar*, default is include\wx\propgrid\props.h @@ -360,11 @@ + wxEnumProperty( const wxString& label = wxPG_LABEL, + const wxString& name = wxPG_LABEL, + const wxChar* const* labels = NULL, include\wx\protocol\ftp.h @@ -111,11 @@ + bool DoSimpleCommand(const wxChar *command, include\wx\protocol\protocol.h @@ -153,11 @@ + wxProtoInfo(const wxChar *name, + const wxChar *serv_name, protected: + wxString m_protoname; include\wx\richtext\richtextbuffer.h @@ -1973,11 @@ + Sets a property by name and wxChar* value. include\wx\univ\textctrl.h @@ -331,11 @@ + size_t GetPartOfWrappedLine(const wxChar* text, interface\wx\archive.h @@ -400,11 @@ + wxString list; + const wxChar *const *p; interface\wx\filefn.h @@ -177,11 @@ + void wxUnix2DosFilename(wxChar* s); interface\wx\hashmap.h @@ -66,11 @@ + @c wxStringHash for strings ( wxString, wxChar*, char* ), and @c wxPointerHash for interface\wx\hashset.h @@ -55,11 @@ + wxStringHash for strings ( wxString, wxChar*, char* ), and wxPointerHash for interface\wx\stream.h @@ -907,11 @@ + wxString list; + const wxChar *const *p; interface\wx\tokenzr.h @@ -125,11 @@ + wxChar GetLastDelimiter() const; interface\wx\txtstrm.h @@ -355,11 @@ + wxTextOutputStream& PutChar(wxChar c); interface\wx\utils.h @@ -257,11 @@ + wxChar* wxGetenv(const wxString& var); interface\wx\variant.h @@ -181,11 @@ + wxVariant(const wxChar* value, const wxString& name = wxEmptyString); interface\wx\propgrid\advprops.h @@ -147,11 @@ + wxSystemColourProperty( const wxString& label, const wxString& name, + const wxChar* const* labels, const long* values, wxPGChoices* choicesCache, interface\wx\propgrid\props.h @@ -361,11 @@ + wxEnumProperty( const wxString& label = wxPG_LABEL, + const wxString& name = wxPG_LABEL, + const wxChar* const* labels = NULL, interface\wx\richtext\richtextbuffer.h @@ -1809,11 @@ + Sets a property by name and wxChar* value. samples\archive\archive.cpp @@ -86,11 @@ + void AppendAllSupportedOfType(wxString& all, wxStreamProtocolType type) { for (const T* factory = T::GetFirst(); factory; factory = factory->GetNext()) { + const wxChar* const* exts = factory->GetProtocols(type); samples\caret\caret.cpp @@ -384,11 @@ + m_text = (wxChar *)calloc(m_xChars * m_yChars, sizeof(wxChar)); if ( frame && frame->GetStatusBar() ) { + wxString msg; samples\except\except.cpp @@ -178,11 @@ // A trivial exception class + const wxChar *what() const { return m_msg.c_str(); } samples\exec\exec.cpp @@ -1486,11 @@ + wxString data; if ( !file.IsOpened() || !file.ReadAll(&data) ) return; size_t len = data.length(); + const wxChar *pc = data.c_str(); samples\ipc\server.h @@ -72,11 @@ // topic for which we advise the client or empty if none + wxString m_advise; protected: // the data returned by last OnRequest(): we keep it in this buffer to // ensure that the pointer we return from OnRequest() stays valid + wxCharBuffer m_requestData; src\aui\framemanager.cpp @@ -1398,11 @@ + static wxString EscapeDelimiters(const wxString& s) { + wxString result; result.Alloc(s.length()); + const wxChar* ch = s.c_str(); src\common\appbase.cpp @@ -202,11 @@ + bool wxAppConsoleBase::Initialize(int& WXUNUSED(argc), wxChar **WXUNUSED(argv)) src\common\arrstr.cpp @@ -636,11 @@ + wxString wxJoin(const wxArrayString& arr, const wxChar sep, const wxChar escape)
src\common\cmdline.cpp @@ -42,11 @@ + static wxString GetOptionName(wxString::const_iterator p, + wxString::const_iterator end, + const wxChar *allowedChars);
src\common\ctrlcmn.cpp @@ -176,11 @@ + static const wxChar MNEMONIC_PREFIX = wxT('&'); src\common\datetime.cpp @@ -271,11 @@ + wxString wxCallStrftime(const wxString& format, const tm* tm) { + wxChar buf[4096]; src\common\datstrm.cpp @@ -173,11 @@ + wxString wxDataInputStream::ReadString() { + wxString ret; const size_t len = Read32(); if ( len > 0 ) { #if wxUSE_UNICODE + wxCharBuffer tmp(len); src\common\dcsvg.cpp @@ -478,11 @@ // write to the SVG file + const wxCharBuffer buf = s.utf8_str(); src\common\debugrpt.cpp @@ -80,11 @@ + HexProperty(wxXmlNode *node, const wxChar *name, wxUIntPtr value) { + node->AddAttribute(name, wxString::Format(wxT("%#zx"), value)); src\common\encconv.cpp @@ -308,11 @@ + wxString wxEncodingConverter::Convert(const wxString& input) const { if (m_JustCopy) return input; + wxString s; + const wxChar *i; src\common\fileconf.cpp @@ -443,11 @@ wxMemoryText memText; + for ( const wxChar *s = cbuf; ; ++s ) { + const wxChar *e = s; while ( *e != '\0' && *e != '\n' && *e != '\r' ) ++e; // notice that we throw away the original EOL kind here, maybe we // should preserve it? if ( e != s ) + memText.AddLine(wxString(s, e)); src\common\filefn.cpp @@ -137,11 @@ // Add paths e.g. from the PATH environment variable + void wxPathList::AddEnvList (const wxString& envVariable) { // The space has been removed from the tokenizers, otherwise a // path such as "C:\Program Files" would be split into 2 paths: // "C:\Program" and "Files"; this is true for both Windows and Unix
I've seen the code in wxHtmlWinParser and it definitely seems to be suspicious/wrong, but I don't really understand what does it do, so I left it alone. Looking at it again, we probably should be able to just make m_tmpStrBuf a buffer of wxUniChar without changing anything else...
—
Reply to this email directly, view it on GitHub, or unsubscribe.
You are receiving this because you are subscribed to this thread.![]()
... OK, it wasn't as simple as that, but I think I've fixed it in a slightly better way in 744f756. If you can test it, please do! TIA!
—
Reply to this email directly, view it on GitHub, or unsubscribe.
You are receiving this because you are subscribed to this thread.![]()
@vadz, that looks like a much better approach! It does work for 😊 in some cases; however, I am seeing mixed results.
I am trying to track down a use case that is still being mangled, but I'm pretty sure is not related to this code. It seems to be happening elsewhere, but I had trouble tracking it down in my first attempt. I will try to look at it again or provide a use case to reproduce it, probably tomorrow. Thanks
—
Reply to this email directly, view it on GitHub, or unsubscribe.
You are receiving this because you are subscribed to this thread.![]()
I'm going to merge the associated PR, which will close this issue, but please open new ones if/when you find other problems. TIA!
—
Reply to this email directly, view it on GitHub, or unsubscribe.
You are receiving this because you are subscribed to this thread.![]()
Closed #25128 as completed via 8b0696d.
—
Reply to this email directly, view it on GitHub, or unsubscribe.
You are receiving this because you are subscribed to this thread.![]()
@vadz, thanks for pushing this through! I've been sick and need to pivot to another project, but I hope to follow up on this later.
—
Reply to this email directly, view it on GitHub, or unsubscribe.
You are receiving this because you are subscribed to this thread.![]()