Subhabrata Biswas scripsit:
> The Parser::aval() calls Parser::expandEntities() which is a private
> function. This function looks up the entities from the entity string and
> aval() finally returns a string that contains the translated codes. Now,
> when I print this back into my output HTML, I don't have the &...; strings
> any more - it contains the translated codes only. Unfortunately, the
> mapping in the stock library is incorrect, it shows up like an A with a ^
> over it.
Actually the mapping is correct. The output encoding from TagSoup does
not depend on the input encoding. The two-byte sequence for a NBSP in
UTF-8, when interpreted as Latin-1 (your platform default, most likely)
is "Â " (an A with a circumflex, followed by the no-break space itself).
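
To see the effect described above in isolation, here is a minimal sketch
using only the Java standard library. It is not TagSoup code; the class
and method names are made up for illustration. It encodes U+00A0 as
UTF-8 and then deliberately decodes those bytes as Latin-1, which is
presumably what your viewer or terminal is doing.

```java
import java.nio.charset.StandardCharsets;

public class NbspMojibake {
    // Encode the text as UTF-8, then (wrongly) decode the bytes as Latin-1.
    // Hypothetical helper, for illustration only.
    static String utf8ReadAsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // U+00A0 (NBSP) is the two bytes 0xC2 0xA0 in UTF-8.
        String shown = utf8ReadAsLatin1("\u00A0");
        // Print each resulting character as a code point.
        for (char c : shown.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

The two code points printed are U+00C2 (A with circumflex) and U+00A0,
so the bytes are right and only the interpretation is off.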
If you want a different output encoding, use the
--output-encoding=us-ascii switch, and you will get numeric character
references (&#...;) for all non-ASCII characters. Alternatively, call
setOutputProperty(XMLWriter.ENCODING, "us-ascii") on the XMLWriter object,
which is what the switch does.
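
As a rough picture of what an ASCII-only output encoding implies, the
following sketch replaces every non-ASCII character with a numeric
character reference. It is a stand-alone illustration with made-up
names, not the XMLWriter implementation, and the exact form of the
references XMLWriter emits (decimal or hex) may differ.

```java
public class AsciiEscapeSketch {
    // Replace each non-ASCII code point with a decimal character
    // reference such as &#160;. Hypothetical helper, not TagSoup code.
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (cp < 0x80) {
                out.append((char) cp);          // plain ASCII passes through
            } else {
                out.append("&#").append(cp).append(';');
            }
            i += Character.charCount(cp);       // step over surrogate pairs
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeNonAscii("a\u00A0b"));
    }
}
```

Output produced this way is pure ASCII, so it reads the same whether
the consumer assumes Latin-1, UTF-8, or US-ASCII.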
If you absolutely must have the escape sequences in the output appear
exactly as in the input, you can try removing all the "entity" elements
from html.tssl in the source and rebuilding with Ant. I don't guarantee
that this will work, however.
> Now, (a) I don't know how correct the rest of the mappings are,
> (b) this is non-portable in the first place and (c) this could lead
> to issues when my string is, say, &lt;B&gt;.
The data on character entities comes straight from the W3C, and
the five standard XML entities &lt;, &gt;, &amp;, &quot;, and &apos;
will be re-created in the output in any case. So no worries there.
--
John Cowan
http://ccil.org/~cowan co...@ccil.org
There are books that are at once excellent and boring. Those that at
once leap to the mind are Thoreau's Walden, Emerson's Essays, George
Eliot's Adam Bede, and Landor's Dialogues. --Somerset Maugham