Examples of using charset in Jsoup

Examples of using charset in Jsoup jeff 8/25/10 12:27 AM
Hi, I want to specify a charset but don't know how to apply it when
parsing. Can someone give me an example?

Thanks.

Re: [jsoup] Examples of using charset in Jsoup Jonathan Hedley 8/25/10 1:32 AM
Hi Jeff,

Can you give me some more information as to what you're trying to do? It will help me give a better answer.

If you are parsing content from a URL, jsoup will automatically recognise the charset from the response headers or the HTML meta http-equiv tag, so you don't need to worry about it. Once the HTML is parsed and you are accessing it (via .html(), .text() etc.) you are working with Java Unicode Strings.

If you are parsing from a file on disk, you may need to tell jsoup what the charset is, because there is no HTTP header to help it. It comes down to how you saved the file on disk in the first place. If you're saving as (e.g.) UTF-8, then when you load it you need to specify that. (If you pass in a null charset to the file parse, jsoup will try the meta http-equiv tag, but that is unreliable.)
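To illustrate the point that loading depends on how the file was saved, here is a minimal sketch using only the Java standard library. The class and method names (SavedPage, save, load) are made up for this example and are not part of jsoup. The string that load returns is an ordinary Java String, so it could then be handed to a parser.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of jsoup: shows that the charset used to
// load a saved page has to match the charset used to save it.
class SavedPage {
    // Encodes the HTML with the given charset and writes the bytes to disk.
    static void save(Path file, String html, Charset charset) throws IOException {
        Files.write(file, html.getBytes(charset));
    }

    // Reads the raw bytes back and decodes them with the given charset.
    // If this charset differs from the one used in save(), non-ASCII
    // characters will generally come back wrong.
    static String load(Path file, Charset charset) throws IOException {
        return new String(Files.readAllBytes(file), charset);
    }
}
```

Saving with ISO-8859-1 and loading with ISO-8859-1 returns the original text; loading the same file as UTF-8 generally does not once the text contains non-ASCII characters.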

Anyway, let me know what you're doing and hopefully I can give a clearer example.

Thanks,
Jonathan


--
----------------------------------------------------------------
http://jsoup.org/ Java HTML parser
jsoup discussion list: js...@googlegroups.com
Archive and to unsubscribe: http://groups.google.com/group/jsoup

Re: [jsoup] Examples of using charset in Jsoup Jeff Zhou 8/25/10 8:05 AM
Hi Jonathan,

We have a crawler program that downloads a large number of web pages. Unfortunately I don't know how the crawler saves the HTML files after it crawls, but I notice that some are in UTF-8, while others use other charsets, including ones for foreign languages. When I look at the parsed text from some web pages, some of the text is unreadable, which appears to be caused by the charset. So we want to know if we can use jsoup's charset-related functions to parse the contents correctly.

Hope that clarifies things.

Jeff

Re: [jsoup] Examples of using charset in Jsoup Jonathan Hedley 8/26/10 2:20 AM
Thanks for the info Jeff.

It seems that there are three possibilities here, depending on how your crawler behaves:

1: Crawler is aware of the input charset, decodes correctly, and saves everything as UTF-8. E.g. input may be ASCII, GB2312, UTF-8, etc. If that is the case, you can just pass "UTF-8" as the charset when parsing from a file in jsoup. However, I don't think this is the case per your description.

2: Crawler is aware of the input charset, and saves the output using the input charset. E.g. input = ASCII, output = ASCII; input = GB2312, output = GB2312. I think this is most likely to be the case because it's the easiest and most likely to always work. If it is the case, you need to get the crawler to tell you what the output charset is so that you can parse it correctly. Does the crawler save the response HTTP headers? Most I've seen do. If it does, you need to parse the headers, get the charset, and then use jsoup file parse with that as the input charset.

3: Crawler is not aware of the input content type, assumes it's UTF-8, and outputs as UTF-8. If that's the case, you're hosed, because that's a destructive operation, and you'd need to get it modified to work like 1 or 2. But that seems unlikely.
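For scenario 2, the "parse the headers, get the charset" step might look roughly like the sketch below. It assumes the crawler gives you the Content-Type header value as a plain string; the class name ContentTypeCharset, the method fromContentType and the fallback parameter are invented for illustration and are not jsoup API. It uses a simple regular expression rather than a full header parser.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of jsoup: pulls the charset parameter out
// of a Content-Type value such as "text/html; charset=ISO-8859-1".
class ContentTypeCharset {
    // Matches charset=NAME, with optional spaces and optional double quotes.
    private static final Pattern CHARSET =
        Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    // Returns the charset named in the header, or the fallback when the
    // header is missing, names no charset, or names one this JVM lacks.
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        Matcher m = CHARSET.matcher(contentType);
        if (!m.find()) {
            return fallback;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return fallback;
        }
    }
}
```

The name of the returned Charset (via its name() method) is what you would then pass as the input charset when parsing the saved file.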

So, I think the best approach is to find out exactly how your crawler handles different input charsets, how it saves the pages out, and how it records what the output charset was; and then parse from there. One experiment would be to fetch some of the pages the crawler has saved yourself, identify the charset the server reports, and compare that with the output of the crawler.
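One way to run that kind of experiment on the saved bytes is a strict decode: try a candidate charset and see whether the bytes are valid in it. The sketch below uses only java.nio.charset; the class name CharsetCheck and the method decodesCleanly are made up for this example. Note the limits: a failure is good evidence the bytes are not in that charset, but a success proves little for single-byte charsets such as ISO-8859-1, which accept any byte sequence.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper: checks whether raw bytes are valid in a candidate charset.
class CharsetCheck {
    static boolean decodesCleanly(byte[] raw, Charset candidate) {
        try {
            // REPORT makes the decoder throw instead of silently
            // substituting replacement characters for bad input.
            candidate.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(raw));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

For example, if the server reports ISO-8859-1 for a page but the crawler's saved bytes decode cleanly as UTF-8 and contain multi-byte sequences, that suggests the crawler re-encoded the page rather than saving it as received.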

Hope these thoughts help; any other suggestions from the group are welcome.

Cheers,
Jonathan



Re: [jsoup] Examples of using charset in Jsoup jeff 8/26/10 9:35 PM
I am afraid I am facing the third scenario.

Is it possible to have some functions like el.text(String encoding) and
el.attr(String attributename, String encoding)?

Or is it possible to add such a function, Jsoup(String htmltext, String
encoding)?
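As background on why the third scenario was described as destructive, and why an encoding argument applied after the fact may not help: once non-UTF-8 bytes have been decoded as UTF-8 and saved again, the original bytes are typically replaced by U+FFFD and cannot be recovered. The sketch below simulates that with the standard library only; the class name LossyRoundTrip and its method are invented for illustration, and it is an assumption that a crawler in scenario 3 behaves this way.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical simulation of scenario 3: bytes in some other charset are
// treated as UTF-8 and written back out as UTF-8.
class LossyRoundTrip {
    static byte[] assumeUtf8AndResave(byte[] original) {
        // Bytes that are not valid UTF-8 become U+FFFD here...
        String decoded = new String(original, StandardCharsets.UTF_8);
        // ...and U+FFFD is what gets written out, not the original bytes.
        return decoded.getBytes(StandardCharsets.UTF_8);
    }
}
```

Feeding it the ISO-8859-1 bytes of a word with an accented letter yields output that differs from the input and contains the replacement character where the accented letter was.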



Re: [jsoup] Examples of using charset in Jsoup jeff 8/26/10 10:12 PM
Jonathan,

I just spoke to my colleague about this charset issue. He said the
crawler outputs the source of each HTML page as a byte[], and we can
probably manage to get the HTTP header information. Given that we can
extract the charset from the headers, what would the next steps be?
Kindly note that we are dealing with byte[] variables, not files.

Thanks,
Jeff
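For the byte[] case, one possible next step is to decode the bytes into a Java String with the charset taken from the headers; that String can then be passed to jsoup's Jsoup.parse(String html). A minimal sketch follows, using only the standard library. The class name PageDecoder, the method decode, and the choice of UTF-8 as the fallback are assumptions made for this example, not something the thread or jsoup prescribes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of jsoup: turns a crawled page's raw bytes
// into a String using the charset name reported in the HTTP headers.
class PageDecoder {
    static String decode(byte[] pageBytes, String headerCharset) {
        Charset charset = StandardCharsets.UTF_8; // assumed fallback
        if (headerCharset != null) {
            try {
                charset = Charset.forName(headerCharset.trim());
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported name: keep the fallback.
            }
        }
        return new String(pageBytes, charset);
    }
}
```

This only works if the crawler stored the bytes exactly as the server sent them. jsoup also has a Jsoup.parse overload that takes an InputStream, a charset name and a base URI, so wrapping the byte[] in a ByteArrayInputStream may be an alternative to decoding it yourself.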

