Encoding/Decoding of special characters

zed

Feb 8, 2013, 1:40:29 PM

to tagsoup...@googlegroups.com

Hi,

In my input HTML string, I have the characters "&#39;". I find that, depending on where I run the code, they are output differently in the SAX callbacks.

When the same piece of code runs in TestNG, the characters(ch[], int, int) SAX callback gets an ' (apostrophe). However in Tomcat, the callback gets the original "&#39;". I'd really like Tomcat to be consistent and return the apostrophe consistently.

On the same machine as Tomcat, running it from the command line returns an apostrophe. What am I doing wrong?

I checked the language encoding on Tomcat (using jinfo) and it was set to "UTF-8". I'm using the Parser class and setting a ContentHandler that extends DefaultHandler. Am I supposed to use XMLFilter as well?

Thanks, any pointers will be appreciated.

John Cowan

Feb 8, 2013, 1:54:15 PM

to zed, tagsoup...@googlegroups.com

zed scripsit:

> However in Tomcat, the callback gets the original "&#39;". I'd really
> like Tomcat to be consistent and return the apostrophe consistently.

TagSoup never outputs an ASCII character as a numeric character reference,
so this issue has to be in Tomcat only. Alas, I can't help with that.

> On the same machine as Tomcat, running it from the command line returns an
> apostrophe. What am I doing wrong?

Almost certainly nothing.

> I checked the language encoding on Tomcat (using jinfo) and it was set to
> "UTF-8". I'm using the Parser class and setting a ContentHandler that
> extends DefaultHandler. Am I supposed to use XMLFilter as well?

No need.

--
Principles. You can't say A is made of B or vice versa. All mass
is interaction. --Richard Feynman
John Cowan <co...@ccil.org>
http://www.ccil.org/~cowan

zed

Feb 8, 2013, 2:10:00 PM

to tagsoup...@googlegroups.com, zed, co...@mercury.ccil.org

To clarify, the input string is &#39;, and the parser normally decodes this to an apostrophe in tests.

Is there any other environment variable or difference in underlying libraries that would make it *not* decode the &#39; to ' ?

Could it be something to do with the input or output encodings?

John Cowan

Feb 8, 2013, 2:30:08 PM

to zed, tagsoup...@googlegroups.com

zed scripsit:

> To clarify, the input string is &#39;, and the parser normally decodes this
> to an apostrophe in tests.
>
> Is there any other environment variable or difference in underlying
> libraries that would make it *not* decode the &#39; to ' ?

No. TagSoup always decodes numeric character references, as well as any
of the thousand-odd named character entity references that it understands.
The only references that are left alone are named ones that are not
understood, such as "&#xyz;" or "&##32;".
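
For illustration only, here is a minimal sketch of the decoding rule just described: well-formed decimal and hex numeric character references are replaced by the character they name, and malformed ones are passed through untouched. This is not TagSoup's source; the class name NumericRefDecoder and its regular expression are assumptions made for the example, and named entity references are left out entirely.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of TagSoup: decodes numeric character references only.
final class NumericRefDecoder {
    // Matches well-formed hex (&#x27;) or decimal (&#39;) references and nothing else.
    private static final Pattern REF =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));");

    static String decode(String input) {
        Matcher m = REF.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy the text between the previous match and this one unchanged.
            out.append(input, last, m.start());
            int codePoint = (m.group(1) != null)
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2), 10);
            if (Character.isValidCodePoint(codePoint)) {
                out.appendCodePoint(codePoint);
            } else {
                // Outside the Unicode range: leave the reference as it was written.
                out.append(m.group());
            }
            last = m.end();
        }
        out.append(input, last, input.length());
        return out.toString();
    }
}
```

With this sketch, NumericRefDecoder.decode("it&#39;s") yields the string it's, while "&#xyz;" and "&##32;" come back unchanged because neither matches the pattern.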

On the output side, the <, >, &, ", and ' characters are re-encoded
using the built-in character references when required. If the output
encoding is not a UTF, non-ASCII characters are encoded as hex numeric
character references. So TagSoup will never generate "&#39;".
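
As a rough sketch of the output rule above (again, not TagSoup's actual code), the five markup characters map to the built-in references and non-ASCII characters become hex numeric references only when the output encoding is not a UTF. The class name OutputEscaper is hypothetical, and the sketch simplifies "when required" to "always" for the five markup characters.

```java
import java.nio.charset.Charset;
import java.util.Locale;

// Hypothetical helper, not part of TagSoup: illustrates the described escaping rule.
final class OutputEscaper {
    static String escape(String text, Charset outputCharset) {
        // Assumption: a charset counts as "a UTF" if its canonical name starts with "UTF".
        boolean isUtf = outputCharset.name().toUpperCase(Locale.ROOT).startsWith("UTF");
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            switch (cp) {
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '&':  out.append("&amp;");  break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:
                    if (cp > 0x7F && !isUtf) {
                        // Non-ASCII in a non-UTF encoding: hex numeric reference.
                        out.append("&#x").append(Integer.toHexString(cp)).append(';');
                    } else {
                        out.appendCodePoint(cp);
                    }
            }
        }
        return out.toString();
    }
}
```

Under these assumptions an apostrophe can only come out as ' or as the built-in named reference, never as a decimal numeric reference, which is consistent with the point that a "&#39;" seen in the callback has to originate somewhere other than TagSoup.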

--
Business before pleasure, if not too bloomering long before.
--Nicholas van Rijn
John Cowan <co...@ccil.org>
http://www.ccil.org/~cowan
