
How to get WebBrowser.DocumentText with the right encoding


MaartyMan

Jul 17, 2009, 4:13:01 AM
Hi, I am new to C#. Maybe someone can help me with this:
I am writing a web crawler that loads one page at a time into a WebBrowser
control, and I want to get the DocumentText and work with it. Since I don't
know the encoding of the page beforehand, I have to detect the encoding and
then apply it so that I get the correct HTML text (without any "funny"
characters). Any suggestions on the best way of doing this? Thanks in advance.

MaartyMan

Jul 17, 2009, 5:13:01 AM
I read the encoding by string searching in the "META" HtmlElement of the page
loaded in WebBrowser (string charSetStr = "iso-8859-1"). Now I try to fetch
the stream again and decode it with this encoding, but I still get wrong
characters in the extracted HTML (string htmlText). Any ideas why the code
below is not working?

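// Requires: using System.IO; using System.Net; using System.Text;
HttpWebRequest request2 = (HttpWebRequest)WebRequest.Create(url);
request2.UserAgent = "A1 .NET Web Crawler";

// charSetStr holds the charset name read from the META tag, e.g. "iso-8859-1".
Encoding charsetEncoding = Encoding.GetEncoding(charSetStr);

string htmlText;
using (WebResponse response2 = request2.GetResponse())
using (Stream stream2 = response2.GetResponseStream())
using (StreamReader reader = new StreamReader(stream2, charsetEncoding))
{
    // Previously tried without an explicit encoding (StreamReader then assumes UTF-8):
    //StreamReader reader = new StreamReader(stream2);
    htmlText = reader.ReadToEnd();
}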
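
For reference, the "search the META tag for the charset" step described above could be sketched roughly as follows. This is only a minimal sketch, not the code actually used in the crawler: the class name CharsetSniffer, its methods, and the regular expression are hypothetical, and it assumes the raw response bytes have already been read into a byte array (for example by copying the response stream into a MemoryStream).

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original post.
static class CharsetSniffer
{
    // Matches charset=... in both <meta charset="x"> and
    // <meta http-equiv="Content-Type" content="text/html; charset=x">.
    static readonly Regex CharsetPattern = new Regex(
        @"<meta[^>]*?charset\s*=\s*[""']?\s*([A-Za-z0-9_\-:.]+)",
        RegexOptions.IgnoreCase);

    // Returns the declared charset name, or null when no meta tag declares one.
    public static string FindDeclaredCharset(byte[] rawHtml)
    {
        // ISO-8859-1 maps every byte to exactly one char, so this probe
        // decoding cannot fail and the ASCII markup stays readable.
        string probe = Encoding.GetEncoding("iso-8859-1").GetString(rawHtml);
        Match m = CharsetPattern.Match(probe);
        return m.Success ? m.Groups[1].Value : null;
    }

    // Decodes with the declared charset, or with the fallback when the
    // declaration is missing or names an encoding .NET does not know.
    public static string Decode(byte[] rawHtml, Encoding fallback)
    {
        string name = FindDeclaredCharset(rawHtml);
        Encoding encoding = fallback;
        if (name != null)
        {
            try { encoding = Encoding.GetEncoding(name); }
            catch (ArgumentException) { /* unknown charset name: keep fallback */ }
        }
        return encoding.GetString(rawHtml);
    }
}
```

Working on the raw bytes means the page is decoded only once with the final encoding, so a wrong first guess cannot damage the text. The declared charset can still be wrong, which this sketch does not try to detect.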

Mihai N.

Jul 18, 2009, 2:33:28 AM

> Any ideas why this is not working below:?

Looks OK (without trying it).
Can you make sure the page is indeed 8859-1?
Some pages are not tagged correctly.
Or maybe you can post here what you see and what you expect
(even better, describe it in words, e.g. "e with grave accent", as well as
posting it, to make sure nothing got damaged on the way).
Some hex values might also help.
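
One way to produce such hex values is sketched below. This is a minimal sketch only: the class name HexDump and its methods are hypothetical, not an existing API.

```csharp
using System;
using System.Text;

// Hypothetical helper for inspecting suspicious characters.
static class HexDump
{
    // Formats each UTF-16 code unit of the string as U+XXXX.
    public static string CodeUnits(string text)
    {
        var sb = new StringBuilder();
        foreach (char c in text)
        {
            if (sb.Length > 0) sb.Append(' ');
            sb.Append("U+").Append(((int)c).ToString("X4"));
        }
        return sb.ToString();
    }

    // Formats raw bytes as two-digit hex values separated by spaces.
    public static string Bytes(byte[] data)
    {
        // BitConverter.ToString yields "C2-91"; swap the dashes for spaces.
        return BitConverter.ToString(data).Replace('-', ' ');
    }
}
```

Calling HexDump.CodeUnits on the decoded string shows which characters the decoder produced, while HexDump.Bytes on the raw response shows what the server actually sent. Comparing the two usually reveals where the damage happens.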


--
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email

MaartyMan

Jul 20, 2009, 2:53:00 AM
Thanks for the reply. I'm not sure if it is really "8859-1", although I've
checked it is specified that way in the meta tag. It seems it replaces a
single apostrophe (') with 2 hex characters \0xC2 \0x91 in the extracted html
string. I don't know how to check whether it really is encoded in 8859-1?
(don't know much about code pages). Any suggestions? Thanks in advance.

Mihai N.

Jul 20, 2009, 6:23:43 AM
> It seems it replaces a
> single apostrophe (') with 2 hex characters \0xC2 \0x91 in the extracted
> html string.

That "smells" like utf-8.
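
To illustrate why the pair C2 91 points at UTF-8: those two bytes are the UTF-8 encoding of the single code point U+0091. A minimal sketch of one way such a pair can arise is below. It is an assumption for illustration, not a diagnosis of the page discussed here, and the class and method names are hypothetical.

```csharp
using System;
using System.Text;

// Hypothetical demo, not taken from the thread.
static class MojibakeDemo
{
    // Decodes one byte as ISO-8859-1, then re-encodes the resulting
    // character as UTF-8, as happens when decoded text is stored as UTF-8.
    public static byte[] Latin1ThenUtf8(byte b)
    {
        string decoded = Encoding.GetEncoding("iso-8859-1").GetString(new[] { b });
        return Encoding.UTF8.GetBytes(decoded);
    }

    // Returns true when the bytes are exactly the UTF-8 form of U+0091.
    public static bool IsUtf8ForU0091(byte[] data)
    {
        return data.Length == 2 && data[0] == 0xC2 && data[1] == 0x91;
    }
}
```

MojibakeDemo.Latin1ThenUtf8(0x91) yields the bytes C2 91. In Windows-1252 the byte 0x91 is a curly single quote (U+2018), but ISO-8859-1 maps it to the control character U+0091, so a Windows-1252 page tagged as iso-8859-1 would produce exactly this pattern once the text is written out as UTF-8.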

MaartyMan

Jul 31, 2009, 6:12:01 AM
It seems I was looking at entries in the database that I had saved before I
started extracting with the encoding, which is why I was seeing incorrect
characters in the data for the earlier items. When I looked at the latest
entries in the database, they were being saved correctly with the encoding.
Thanks for all the help.