Copy and Paste methods for wxSTC break invalid UTF-8 code


Paul K

Nov 18, 2015, 12:43:30 PM
to wx-dev
Hi All,

I've been using AddTextRaw and GetTextRaw methods of wxSTC to load and display invalid UTF-8 code. Everything seems to work fine except Copy/Paste operations.

Let's say I have a text with three characters: '\193' (a single quote, a character with decimal code 193, and another single quote). When it is loaded into wxSTC using SetTextRaw (with the codepage set to SC_CP_UTF8/65001), the text is displayed as '[xC1]', where [xC1] is the way Scintilla displays invalid bytes (usually in inverted color).
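
For reference, byte 193 (0xC1) can never appear in well-formed UTF-8: as a lead byte it could only begin an overlong two-byte encoding of an ASCII character. Below is a minimal standalone C++ sketch of a validity check that shows this. `isValidUtf8` is a hypothetical helper written for illustration; it is not part of wxWidgets or Scintilla. Note that Lua's "\193" is a decimal escape, so it corresponds to "\xC1" in C++.

```cpp
#include <cstddef>
#include <string>

// Returns true if `s` is well-formed UTF-8. Rejects stray continuation bytes,
// overlong forms (including the 0xC0/0xC1 lead bytes), UTF-16 surrogates,
// truncated sequences, and code points above U+10FFFF.
bool isValidUtf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        if (b <= 0x7F) { ++i; continue; }   // plain ASCII

        std::size_t extra = 0;              // continuation bytes expected
        unsigned char lo = 0x80, hi = 0xBF; // allowed range of the first one
        if (b >= 0xC2 && b <= 0xDF)      { extra = 1; }
        else if (b == 0xE0)              { extra = 2; lo = 0xA0; }
        else if (b == 0xED)              { extra = 2; hi = 0x9F; }
        else if (b >= 0xE1 && b <= 0xEF) { extra = 2; }
        else if (b == 0xF0)              { extra = 3; lo = 0x90; }
        else if (b >= 0xF1 && b <= 0xF3) { extra = 3; }
        else if (b == 0xF4)              { extra = 3; hi = 0x8F; }
        else return false; // 0x80..0xC1 and 0xF5..0xFF cannot start a sequence

        if (n - i <= extra) return false;   // sequence is cut off
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if (c < lo || c > hi) return false;
            lo = 0x80; hi = 0xBF; // only the first continuation byte is narrowed
        }
        i += extra + 1;
    }
    return true;
}
```

With this check, `isValidUtf8("'\xC1'")` is false while `isValidUtf8("'x'")` is true, which matches the observation that only the 0xC1 variant is shown as an invalid byte.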

When this fragment is copied and pasted, only two characters are pasted: 'g (an opening single quote and the letter g). Since ScintillaWX::CopyToClipboard uses "wxTextBuffer::Translate(stc2wx(st.Data(), st.Length()))" and "wxTextDataObject(text)", it seems like the conversion happens in one (or both) of those two places. In fact, wxTextDataObject doesn't like invalid text at all (it stores an empty string in those cases). It seems like the original content is lost somewhere in the conversion to/from UTF-8.

Scintilla provides TargetAsUTF8 method, which seems like something that may help in this case, but it's not available in wxSTC:

    # not sure what to do about these yet
    'TargetAsUTF8' :       ( None, 0, 0, 0),
    'SetLengthForEncode' : ( None, 0, 0, 0),
    'EncodedFromUTF8' :    ( None, 0, 0, 0),

Is there a proper way to handle this copying or a workaround?

Paul.

Paul K

Nov 19, 2015, 1:54:43 AM
to wx-dev
Vadim, I tried to figure out what's going on with the clipboard copying and I think there are a couple of issues with the current processing (at least on Windows).

I get strange results from wxTextDataObject; this is Lua code, but it's easily mapped to C++ code:

> tdo = wx.wxTextDataObject()
> tdo:SetData(wx.wxDataFormat(wx.wxDF_TEXT), "1234")
true
> tdo:GetDataHere()
true "1234\0\0"
> tdo:SetData(wx.wxDataFormat(wx.wxDF_TEXT), "123")
true
> tdo:GetDataHere()
true "12\0\0"
> tdo:GetText()
"㈱"
> tdo:GetDataSize()
4

Where are the two trailing zeros in "1234\0\0" coming from, and why is "3" missing from "12\0\0"? In fact, the last character is missing for any odd-length sequence of characters stored in wxDF_TEXT format.
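
One reading that is consistent with all of the numbers above is that the bytes are handled as 16-bit code units: the size is rounded down to whole units and a 16-bit NUL terminator is appended. This is an assumption, not something confirmed against the wxWidgets sources. The standalone C++ sketch below only models that arithmetic; both function names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Reinterpret raw bytes as little-endian 16-bit code units.
// A trailing odd byte does not fill a whole unit and is dropped.
std::vector<std::uint16_t> bytesAsUtf16LE(const std::string& bytes) {
    std::vector<std::uint16_t> units;
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        const std::uint16_t lo = static_cast<unsigned char>(bytes[i]);
        const std::uint16_t hi = static_cast<unsigned char>(bytes[i + 1]);
        units.push_back(static_cast<std::uint16_t>(lo | (hi << 8)));
    }
    return units;
}

// Serialize code units back to bytes and append a 16-bit NUL terminator.
std::string unitsToBytesWithNul(const std::vector<std::uint16_t>& units) {
    std::string out;
    for (std::uint16_t u : units) {
        out.push_back(static_cast<char>(u & 0xFF));
        out.push_back(static_cast<char>(u >> 8));
    }
    out.append(2, '\0');
    return out;
}
```

Under this model "1234" becomes the 6 bytes "1234\0\0", and "123" becomes the 4 bytes "12\0\0" (matching GetDataSize() returning 4). The single remaining code unit is 0x3231, which is U+3231 "㈱", the character GetText() returned.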

Also, even when I explicitly set the data format to wx.wxDF_TEXT, the data object reports GetPreferredFormat as 13 (which is wxDF_UNICODETEXT).

I managed to copy the raw data I'm interested in, but only when I explicitly set the data format to wx.wxDF_TEXT. wxWidgets is compiled with --enable-unicode.

Paul.

Paul K

Nov 19, 2015, 1:42:02 PM
to wx-dev
Vadim,


I got help from one of the users, who provided a snapshot of the clipboard content when copying from SciTE/Notepad (with correct results) and from my wxWidgets app (with not quite correct results). Here is the information:

The test string is a 6-character string containing 3 ASCII characters and 3 German umlauts. (The hex codes in the Windows codepage are 616F75E4F6FC.)

This is put into the clipboard when copying with Notepad/SciTE (in codepage mode):

fmt=13 (CF_UNICODETEXT)
>txt[14]="a\000o\000u\000\228\000\246\000\252\000\000\000"
>hex[14]=61006F007500E400F600FC000000
fmt=16 (CF_LOCALE)
>txt[4]="\007\004\000\000"
>hex[4]=07040000
fmt=1 (CF_TEXT)
>txt[7]="aou\228\246\252\000"
>hex[7]=616F75E4F6FC00
fmt=7 (CF_OEMTEXT)
>txt[7]="aou\132\148\129\000"
>hex[7]=616F7584948100

This is put into the clipboard when copying with my wxWidgets app (wxSTC is populated using SetTextRaw calls):

fmt=49161 (???)
>txt[4]="B\012\002\000"
>hex[4]=420C0200
fmt=13 (CF_UNICODETEXT)
>txt[10]="a\000o\000u\000\252]\000\000"
>hex[10]=61006F007500FC5D0000
fmt=49171 (???)
>txt[120]="\000\000\000\000x\000\000\000\001\000\000\000\001\000\000\000\000\000\000\000\000\000\000\000\r\000\000\000\000\000\000\000\001\000\000\000\255\255\255\255\001\000\000\000\001\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
>hex[120]=0000000078000000010000000100000000000000000000000D0000000000000001000000FFFFFFFF010000000100000000000000000000000000000000000000...
fmt=16 (CF_LOCALE)
>txt[4]="\007\004\000\000"
>hex[4]=07040000
fmt=1 (CF_TEXT)
>txt[5]="aou?\000"
>hex[5]=616F753F00
fmt=7 (CF_OEMTEXT)
>txt[5]="aou?\000"
>hex[5]=616F753F00

Note the differences in both CF_UNICODETEXT and CF_TEXT.
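
To make the CF_UNICODETEXT difference explicit, the hex dumps can be decoded into UTF-16 code units. The following is a small standalone C++ sketch for doing that; `hexValue` and `decodeUnicodeTextDump` are hypothetical helpers written for this comparison, not wxWidgets or Windows APIs.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Convert one hex digit to its value; returns -1 for a non-hex character.
int hexValue(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

// Decode a hex dump of CF_UNICODETEXT data (little-endian UTF-16) into code
// units, stopping at the NUL terminator or at the first malformed digit.
std::vector<std::uint16_t> decodeUnicodeTextDump(const std::string& hex) {
    std::vector<std::uint16_t> units;
    for (std::size_t i = 0; i + 3 < hex.size(); i += 4) {
        int d[4];
        for (int k = 0; k < 4; ++k) {
            d[k] = hexValue(hex[i + k]);
            if (d[k] < 0) return units;
        }
        const std::uint16_t lo = static_cast<std::uint16_t>(d[0] * 16 + d[1]);
        const std::uint16_t hi = static_cast<std::uint16_t>(d[2] * 16 + d[3]);
        const std::uint16_t unit = static_cast<std::uint16_t>(lo | (hi << 8));
        if (unit == 0) break; // terminating NUL
        units.push_back(unit);
    }
    return units;
}
```

Decoding the Notepad/SciTE dump 61006F007500E400F600FC000000 gives U+0061 U+006F U+0075 U+00E4 U+00F6 U+00FC (a, o, u, ä, ö, ü), while the dump from my app, 61006F007500FC5D0000, gives U+0061 U+006F U+0075 U+5DFC: the three umlaut bytes have collapsed into a single unrelated code unit.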

Any idea how this can be fixed? Should I open a ticket for this one?

Paul.

Vadim Zeitlin

Nov 19, 2015, 9:39:48 PM
to wx-...@googlegroups.com
On Wed, 18 Nov 2015 22:54:42 -0800 (PST) Paul K wrote:

PK> Vadim, I tried to figure out what's going on with the clipboard copying and
PK> I think there are couple of issues with the current processing (at least on
PK> Windows).

Sorry, it's really hard to understand what is going on here. For me the ideal
would be to be able to reproduce the problem in an existing C++ sample, e.g. I
believe the dnd sample has some test code for the clipboard already. Right now
I have no idea if the problem is in wxTextDataObject, wxSTC or wxLua.

Regards,
VZ

Paul K

Nov 19, 2015, 11:45:10 PM
to wx-dev
> Sorry, it's really hard to understand what is going on here. For me ideal 
> would be to have to reproduce the problem in an existing C++ sample, e.g. I 
> believe dnd sample has some test code for the clipboard already. Right now 
> I have no idea if the problem is in wxTextDataObject, wxSTC or wxLua. 

I understand; unfortunately I'm not in a position to provide a full example, and none of the examples I looked at that use the clipboard do what I'm interested in.

Let's try a really simple thing: I only need to be able to store and read DF_TEXT data on all three platforms. I got it working on Windows, but I can't get it working on either Linux or OSX.

Here is what I'm trying to do:

text = "--'\193'"
tdo = wx.wxTextDataObject()
tdo:SetData(wx.wxDataFormat(wx.wxDF_TEXT), text)

on Linux:
tdo:GetDataHere() returns ""
tdo:GetDataSize() returns 0

on OSX:
tdo:GetDataHere() returns "-\0-\0'\0\161\0'\0"
tdo:GetDataSize() returns 10

Why is it empty on Linux, and why did it change \193 (0xC1) to \161 (0xA1) and add zeroes on OSX? Neither of these is the correct/expected result.

At least on Windows it returns the expected result "--'\193'"... (with the exception of the issue with the last odd character being replaced by \0).

All I want is to be able to store and retrieve data in wxDF_TEXT format even when I'm using a Unicode build. The format is supported (I checked the results from GetAllFormats()), so I don't understand why this is not working.

Paul.

Paul K

Nov 19, 2015, 11:57:49 PM
to wx-dev
OK, I got the correct result on OSX by passing the explicit DF_TEXT format into GetDataHere.

I still get an empty string on Linux though:

tdo:SetData(wx.wxDataFormat(wx.wxDF_TEXT), "--'x'")
tdo:GetDataHere(wx.wxDataFormat(wx.wxDF_TEXT)) returns "--'x'" as expected

tdo:SetData(wx.wxDataFormat(wx.wxDF_TEXT), "--'\193'") -- this is an invalid UTF-8 string
tdo:GetDataHere(wx.wxDataFormat(wx.wxDF_TEXT)) returns "" (empty string)

Paul.

Vadim Zeitlin

Nov 20, 2015, 2:27:10 PM
to wx-...@googlegroups.com
On Thu, 19 Nov 2015 20:45:09 -0800 (PST) Paul K wrote:

PK> > Sorry, it's really hard to understand what is going on here. For me ideal
PK> > would be to have to reproduce the problem in an existing C++ sample, e.g. I
PK> > believe dnd sample has some test code for the clipboard already. Right now
PK> > I have no idea if the problem is in wxTextDataObject, wxSTC or wxLua.
PK>
PK> I understand; unfortunately I'm no in position to provide a full example
PK> and none of the examples I looked at that use clipboard do what I'm
PK> interested in.

I don't understand this, the dnd sample does put (text) data on clipboard
and retrieve it from it. How is this different from what you're interested
in?

PK> Let's try really simple thing: I only need to be able to store and read
PK> DF_TEXT data on all three platforms. I got it working on Windows, but I
PK> can't get it working on either Linux or OSX.
PK>
PK> Here is what I'm trying to do:
PK>
PK> text = "--'\193'"
PK> tdo = wx.wxTextDataObject()
PK> tdo:SetData(wx.wxDataFormat(wx.wxDF_TEXT), text)
PK>
PK> on Linux:
PK> tdo:GetDataHere() returns ""
PK> tdo:GetDataSize() returns 0

That's almost certainly because your text is not valid UTF-8 and your
locale encoding is UTF-8.

PK> All I want is to be able to store and retrieve data in wxDF_TEXT format
PK> even when I'm using a Unicode build.

This sounds like a very strange idea. Clipboard is for storing text, and
text should be valid Unicode, so you should store it either as
wxDF_UNICODETEXT or wxDF_TEXT using the system encoding (i.e. almost
invariably UTF-8 under Linux). Anything else is, frankly, just masochism.

Regards,
VZ

Paul K

Nov 20, 2015, 3:29:18 PM
to wx-dev
> This sounds like a very strange idea. Clipboard is for storing text, and 
> text should be valid Unicode, so you should store it either as 
> wxDF_UNICODETEXT or wxDF_TEXT using the system encoding (i.e. almost 
> invariably UTF-8 under Linux). Anything else is, frankly, just masochism. 

I won't object strongly to this label ;), but wxSTC allows for storing binary data that may be invalid UTF-8 even when the codepage is set to UTF-8 (using *Raw methods), which makes sense, but that data can no longer be copied using the clipboard.

Somehow I thought that DF_TEXT shouldn't worry about validity (as I'd use DF_UNICODETEXT if I cared about that, right?) and it seems like it doesn't check for that on Win and OSX, so Linux looks like a special case here. Also, clipboard can be used to transfer different types of data and storing just a binary stream wouldn't be a very special case.

Anyway, I think I'll be good for now with the solutions I have for Win/OSX and just allowing for only valid UTF-8 on Linux. Thank you!

Paul.

Vadim Zeitlin

Nov 20, 2015, 5:56:49 PM
to wx-...@googlegroups.com
On Fri, 20 Nov 2015 12:29:18 -0800 (PST) Paul K wrote:

PK> I won't object strongly to this label ;), but wxSTC allows for storing
PK> binary data that may be invalid UTF-8 even when the codepage is set to
PK> UTF-8 (using *Raw methods), which makes sense, but that data can no longer
PK> be copied using the clipboard.

You should copy it using some other format than text, because such data is
not text any more -- unless you interpret it as being in, say, Latin-1, and
then encode it in UTF-8 properly as needed.
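
A minimal sketch of the Latin-1 route mentioned above, in plain C++ with no wxWidgets dependency: every byte is treated as the code point of the same value and encoded as UTF-8, and the inverse recovers the original bytes. `latin1ToUtf8` and `utf8ToLatin1` are hypothetical names used only for illustration; whether this reinterpretation is acceptable for a given application is a separate question.

```cpp
#include <cstddef>
#include <string>

// Encode bytes interpreted as Latin-1 (ISO-8859-1) into UTF-8.
// Every byte value maps to the code point U+0000..U+00FF of the same value.
std::string latin1ToUtf8(const std::string& latin1) {
    std::string out;
    out.reserve(latin1.size() * 2);
    for (char ch : latin1) {
        const unsigned char b = static_cast<unsigned char>(ch);
        if (b < 0x80) {
            out.push_back(static_cast<char>(b));
        } else {
            out.push_back(static_cast<char>(0xC0 | (b >> 6)));
            out.push_back(static_cast<char>(0x80 | (b & 0x3F)));
        }
    }
    return out;
}

// Inverse conversion. Returns false if the input is malformed or contains
// a code point above U+00FF, which has no Latin-1 representation.
bool utf8ToLatin1(const std::string& utf8, std::string& out) {
    out.clear();
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        const unsigned char b = static_cast<unsigned char>(utf8[i]);
        if (b < 0x80) { out.push_back(static_cast<char>(b)); continue; }
        // Only the lead bytes 0xC2 and 0xC3 encode U+0080..U+00FF.
        if ((b != 0xC2 && b != 0xC3) || i + 1 >= utf8.size()) return false;
        const unsigned char c = static_cast<unsigned char>(utf8[++i]);
        if ((c & 0xC0) != 0x80) return false;
        out.push_back(static_cast<char>(((b & 0x03) << 6) | (c & 0x3F)));
    }
    return true;
}
```

For the string from this thread, `latin1ToUtf8("--'\xC1'")` yields the valid UTF-8 bytes `--'` 0xC3 0x81 `'`, and `utf8ToLatin1` applied to that result returns the original five bytes.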

PK> Somehow I thought that DF_TEXT shouldn't worry about validity

wxDF_TEXT works with text, i.e. wxString, which can't be created from just
anything. You can use wxString::From8BitData(), but whether you actually
want to do it is another question.

PK> and it seems like it doesn't check for that on Win and OSX, so Linux
PK> looks like a special case here.

I am not sure about OS X, but MSW uses something like CP1252 in which
there are no invalid strings, unlike in UTF-8.

PK> Also, clipboard can be used to transfer different types of data and storing
PK> just a binary stream wouldn't be a very special case.

No, but don't use wxDF_TEXT for it.

Regards,
VZ