Character encoding problems reading Html from Clipboard

Tim_Mac

Aug 26, 2005, 2:41:36 PM

hi,
i am accessing some html (originating from MS Word) in the clipboard in
my winforms app. i catch it before the paste, clean up the html, set
the clipboard with the cleaned Html, and then paste.

here is my (simplified) code:
string html = Clipboard.GetDataObject().GetData(DataFormats.Html).ToString();

the problem is that the html string gets lots of substituted strange
characters, for example:
a dash - character from the word document gets converted into â€"
a line break gets converted into Â
an apostrophe gets converted into â€˜

this doesn't happen when i just paste as normal into my html editor.
the characters import normally.

is there a way to read from the clipboard without screwing up the
characters? i tried Encoding.ASCII.GetString() but it needs a byte[],
which i don't know how to get from the DataObject.

many thanks for any help.
tim

Michael Phillips, Jr.

Aug 26, 2005, 4:09:10 PM

You need to use the UTF8Encoding class.

string html =
Clipboard.GetDataObject().GetData(DataFormats.Html).ToString();

// Create a UTF-8 encoding.
UTF8Encoding utf8 = new UTF8Encoding();

// Get the encoded html string.
byte[] encodedBytes = utf8.GetBytes(html);

// Decode bytes back to string.
String decodedString = utf8.GetString(encodedBytes);
Console.WriteLine();
Console.WriteLine("Decoded bytes:");
Console.WriteLine(decodedString);

Jon Skeet [C# MVP]

Aug 26, 2005, 4:54:39 PM

Michael Phillips, Jr. <mphil...@nospam.jun0.c0m> wrote:
> You need to use the UTF8Encoding class.
>
> string html =
> Clipboard.GetDataObject().GetData(DataFormats.Html).ToString();
>
> // Create a UTF-8 encoding.
> UTF8Encoding utf8 = new UTF8Encoding();
>
> // Get the encoded html string.
> byte[] encodedBytes = utf8.GetBytes(html);
>
> // Decode bytes back to string.
> String decodedString = utf8.GetString(encodedBytes);
> Console.WriteLine();
> Console.WriteLine("Decoded bytes:");
> Console.WriteLine(decodedString);

I can't see how that would help - it's just encoding and decoding with
the same encoding. As UTF-8 can encode any string, I can't envisage any
situation where html wouldn't be equal to decodedString - could you
give an example of such a situation?

--
Jon Skeet - <sk...@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
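
A minimal sketch of the point made in this reply, not code from the thread: encoding and decoding with the same encoding leaves well-formed text unchanged, whereas decoding UTF-8 bytes with a different code page produces the kind of garbage reported in the original post. The class and method names are hypothetical, and the choice of Windows-1252 as the mismatched code page is an assumption based on the characters reported.

```csharp
using System;
using System.Text;

// Hypothetical helper names, for illustration only.
public static class EncodingDemo
{
    // Encode and decode with the same encoding: a no-op for well-formed text.
    public static string Utf8RoundTrip(string s)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(s);
        return Encoding.UTF8.GetString(bytes);
    }

    // Encode as UTF-8 but decode as Windows-1252 (assumed code page).
    // This mismatch turns one typographic character into three garbage ones.
    public static string DecodeUtf8BytesAs1252(string s)
    {
        // Needed on .NET Core and later, where code page 1252 is not
        // registered by default; on the .NET Framework this line can be dropped.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding cp1252 = Encoding.GetEncoding(1252);
        return cp1252.GetString(Encoding.UTF8.GetBytes(s));
    }
}
```

For example, passing the opening single quote U+2018 to DecodeUtf8BytesAs1252 yields the three characters â€˜, which matches the garbage described for the apostrophe.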

Michael Phillips, Jr.

Aug 26, 2005, 6:04:19 PM

You are correct. I thought erroneously that there was
some type of problem with the character encoding.

My code snippet certainly doesn't solve anything.

Jeffrey Tan[MSFT]

Aug 29, 2005, 7:00:50 AM

Hi Tim_Mac,

Thanks for your post.

Yes, I can reproduce your issue on my side. It seems that this issue
only occurs for localized characters, not for standard English characters.

Also, this issue only occurs with DataFormats.Html, not with
DataFormats.Text etc.

After doing some research, I found that this issue is documented in
our internal database as a known issue. This is not a WinForms-side problem.
When asked for the HTML format, GetData returns an ANSI string, which obviously
does not have enough information to render Chinese script. Currently, I cannot
think of a better workaround for this issue.

Hope this helps.

Best regards,
Jeffrey Tan
Microsoft Online Partner Support
Get Secure! - www.microsoft.com/security
This posting is provided "as is" with no warranties and confers no rights.

Tim_Mac

Aug 29, 2005, 7:50:23 AM

hi Jeffrey,
many thanks for the reply. the word document i've been testing it off
doesn't have localised characters to my knowledge.
to reproduce my situation, create a blank MS word document, and insert
2 apostrophes and 2 double-quotes. you'll see Word changes them to the
open/close versions of the characters. i also added the ` character,
and the ellipsis ... character (auto-corrected by word 2003 when you
type in 3 period characters followed by a space)
i then copy this, and paste it into my application. when i debug, this
is the fragment i get via clip.GetData(DataFormats.Html).ToString();

\r\n\r\n<p class=MsoNormal>â€˜â€â™
â€œ€ </p>\r\n\r\n<p class=MsoNormal>`â€¦
</p>\r\n\r\n

as you can see, there are garbage characters in the middle
corresponding to the characters in the word doc.

interestingly, when i paste the content into WordPad, it preserves the
open/close quote characters etc., but when i then copy and paste from
WordPad, the html string is read correctly in my application. the
open/close apostrophes get demoted back to the normal apostrophe
character, and the ellipsis character gets demoted back to 3 period
characters.

what's a little bit annoying is that this problem only arose when i
attempt to intercept the html in the clipboard before it is pasted.
i'm using the Comzept HtmlEditor control for win-forms (a wrapper for
MSHTML), and it has its own Paste() method, which does not produce
such character problems as i am experiencing. i presume it just calls
the MSHTML Paste() method.

looking forward to your reply
tim

Tim_Mac

Aug 29, 2005, 12:05:04 PM

p.s. you can download my word doc here
http://tim.mackey.ie/stuff/html_char_encoding.doc

Jon Skeet [C# MVP]

Aug 29, 2005, 2:30:58 PM

Tim_Mac <t...@mackey.ie> wrote:
> many thanks for the reply. the word document i've been testing it off
> doesn't have localised characters to my knowledge.

But it *does* have characters which aren't in the ANSI code page.
That's what Jeffrey meant by "not for standard English characters" I
believe.

Jeffrey Tan[MSFT]

Aug 30, 2005, 3:39:01 AM

Hi Tim_Mac,

Thanks for your feedback.

Yes, I just tested '-' in English, which has no problem. However, with the '"'
character, I can reproduce this problem on my side.

After doing some further research, I found that this issue only occurs with
the Word application. If we copy '"' characters from IE, a WinForms application
gets the characters without any problem. Even with Excel, they are
retrieved correctly. So it seems that this issue is on the Word application side.

Because the WinForms Clipboard class is just a wrapper around the underlying
Windows clipboard operations, it seems there is little that can be done on the
WinForms side.

Tim_Mac

Aug 30, 2005, 6:06:39 AM

hi Jeffrey,
thanks again for the reply.
i wonder if Microsoft would consider posting a list of the affected
characters and their distorted equivalents in the clipboard after being
converted into ANSI? this would allow other applications to work
around the problem and map the characters back to their proper
equivalents.

so far i can identify the following mappings:

â€˜ open single quote
â€™ close single quote
â€œ open double quote
â€ close double quote
â€¦ ellipsis
Â two space characters, (as used by some formatting
conventions) after period

thanks
tim
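
A hedged alternative to maintaining a mapping table by hand, not something proposed in the thread: if the garbled string is UTF-8 data that was decoded as Windows-1252 (an assumption, though it is consistent with the mappings listed), the original bytes can often be recovered by re-encoding with that code page and decoding them as UTF-8. The class and method names are hypothetical.

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only.
public static class MojibakeRepair
{
    // Assumes 'garbled' is UTF-8 text that was wrongly decoded as
    // Windows-1252. Re-encoding with 1252 recovers the raw bytes,
    // which are then decoded with the encoding they were written in.
    public static string Repair(string garbled)
    {
        // Needed on .NET Core and later; can be dropped on the .NET Framework.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding cp1252 = Encoding.GetEncoding(1252);
        byte[] originalBytes = cp1252.GetBytes(garbled);
        return Encoding.UTF8.GetString(originalBytes);
    }
}
```

This is only a sketch. Characters that have no Windows-1252 equivalent are replaced during re-encoding, and bytes that are unassigned in that code page (such as 0x9D, which is part of the closing double quote) may not survive on every runtime, so reading the raw clipboard bytes is the more robust route where it is available.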

Jeffrey Tan[MSFT]

Aug 30, 2005, 9:42:04 PM

Hi Tim,

Thanks for your post.

Yes, after doing some more research on this issue, it seems that it is
a WinForms problem after all. I created a Win32 application that uses the
Win32 API to get the clipboard's CF_HTML format, and it returns the text
without garbling. I then converted this Win32 code into managed code with
P/Invoke:

// Requires: using System; using System.Runtime.InteropServices;

[DllImport("user32.dll", SetLastError=true)]
static extern IntPtr GetClipboardData(uint uFormat);
[DllImport("user32.dll", SetLastError=true)]
static extern bool OpenClipboard(IntPtr hWndNewOwner);
[DllImport("user32.dll", SetLastError=true)]
static extern bool CloseClipboard();
[DllImport("user32.dll", SetLastError=true)]
static extern uint RegisterClipboardFormatA(string lpszFormat);
[DllImport("user32.dll", SetLastError=true)]
static extern bool IsClipboardFormatAvailable(uint format);
[DllImport("kernel32.dll", SetLastError=true)]
static extern IntPtr GlobalLock(IntPtr hMem);
[DllImport("kernel32.dll", SetLastError=true)]
static extern UIntPtr GlobalSize(IntPtr hMem);   // SIZE_T is pointer-sized
[DllImport("kernel32.dll", SetLastError=true)]
static extern bool GlobalUnlock(IntPtr hMem);    // returns BOOL, not a handle

private void button1_Click(object sender, System.EventArgs e)
{
    // "HTML Format" is the registered name of the CF_HTML clipboard format.
    uint CF_HTML = RegisterClipboardFormatA("HTML Format");

    if (!IsClipboardFormatAvailable(CF_HTML) || !OpenClipboard(this.Handle))
        return;
    try
    {
        IntPtr hGMem = GetClipboardData(CF_HTML);
        if (hGMem == IntPtr.Zero) return;
        IntPtr pMFP = GlobalLock(hGMem);
        if (pMFP == IntPtr.Zero) return;
        try
        {
            int len = (int)GlobalSize(hGMem).ToUInt32();
            byte[] bytes = new byte[len];
            Marshal.Copy(pMFP, bytes, 0, len);

            // The block can be larger than the data, so stop at the first NUL.
            int end = Array.IndexOf(bytes, (byte)0);
            if (end < 0) end = len;

            // CF_HTML data is UTF-8, so decode the raw bytes as UTF-8.
            string strMFP = System.Text.Encoding.UTF8.GetString(bytes, 0, end);
            this.textBox1.Text = strMFP;
        }
        finally { GlobalUnlock(hGMem); }
    }
    finally { CloseClipboard(); }   // always release the clipboard
}

This works well on my side. Hope this helps.
=================================================================
Thank you for your patience and cooperation. If you have any questions or
concerns, please feel free to post it in the group. I am standing by to be
of assistance.
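
The bytes read this way include the CF_HTML header as well as the markup. As a hedged sketch that is not part of the thread: the header carries StartFragment and EndFragment fields, which are byte offsets into the UTF-8 data, so the copied fragment can be cut out before decoding. The class and method names are hypothetical, and the 512-byte limit on the header scan is an assumption made for brevity.

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only.
public static class CfHtml
{
    // Reads a numeric header field such as "StartFragment:0000000141".
    // Returns -1 if the field is missing or has no digits.
    private static int ReadOffset(string header, string key)
    {
        int i = header.IndexOf(key + ":", StringComparison.Ordinal);
        if (i < 0) return -1;
        i += key.Length + 1;
        int j = i;
        while (j < header.Length && char.IsDigit(header[j])) j++;
        return j > i ? int.Parse(header.Substring(i, j - i)) : -1;
    }

    // Extracts the fragment from raw CF_HTML clipboard bytes.
    // The offsets count bytes, not characters, so slice before decoding.
    public static string ExtractFragment(byte[] data)
    {
        // The header is plain ASCII; scanning a short prefix is assumed
        // to be enough to find both fields.
        string header = Encoding.ASCII.GetString(data, 0, Math.Min(data.Length, 512));
        int start = ReadOffset(header, "StartFragment");
        int end = ReadOffset(header, "EndFragment");
        if (start < 0 || end < start || end > data.Length)
            throw new FormatException("CF_HTML fragment offsets not found or out of range.");
        return Encoding.UTF8.GetString(data, start, end - start);
    }
}
```

Slicing the byte array first matters because a multi-byte character such as an opening quote occupies three bytes but one character, so the header offsets cannot be applied to the decoded string.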

Tim_Mac

Aug 31, 2005, 4:29:35 AM

hi Jeffrey,
that's excellent, it works well so far on my side also. not being a
COM expert, i'm a little bit wary of relying on the user32 or kernel
dlls. will this work on any flavours of windows 2000 and XP with all
the different service packs, IE versions etc?
many thanks for this solution.
tim

Jeffrey Tan[MSFT]

Aug 31, 2005, 5:24:36 AM

Hi Tim,

I am glad my reply makes sense to you.

Yes, I think it will not break on any Win32 version of the OS. Because we are
just using the Win32 API, which is guaranteed to behave consistently on all
Win32 operating systems, our solution should be safe.

Thanks