Need help with the SAX API

Subhabrata Biswas

Dec 26, 2012, 5:24:13 AM

to tagsoup...@googlegroups.com

Hi,

In my application I have to parse an HTML document and replace the URLs in some LINK, SCRIPT and IMG tags. The rest of the HTML has to remain unchanged. I am trying to achieve this using the SAX parser of TagSoup.
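As an illustration of the task described above, here is a minimal sketch of a SAX filter that rewrites the URL attribute of link, script and img elements and passes everything else through untouched. It uses only the JDK's org.xml.sax classes; the class name, the prefix-replacement rule and the example.* URLs are assumptions made for the example, not part of the original application.

```java
import java.util.Locale;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical filter: swaps one URL prefix for another on selected elements. */
class UrlRewriteFilter extends XMLFilterImpl {
    private final String oldPrefix;
    private final String newPrefix;

    UrlRewriteFilter(String oldPrefix, String newPrefix) {
        this.oldPrefix = oldPrefix;
        this.newPrefix = newPrefix;
    }

    /** Returns the attribute that carries the URL, or null for other elements. */
    static String urlAttributeFor(String element) {
        switch (element.toLowerCase(Locale.ROOT)) {
            case "link":   return "href";
            case "script": return "src";
            case "img":    return "src";
            default:       return null;
        }
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        String urlAttr = urlAttributeFor(localName.isEmpty() ? qName : localName);
        if (urlAttr != null) {
            int i = atts.getIndex(urlAttr);
            if (i >= 0 && atts.getValue(i).startsWith(oldPrefix)) {
                // Copy the attributes, since the incoming object may be read-only.
                AttributesImpl copy = new AttributesImpl(atts);
                copy.setValue(i, newPrefix + atts.getValue(i).substring(oldPrefix.length()));
                atts = copy;
            }
        }
        super.startElement(uri, localName, qName, atts); // forward downstream
    }

    /** Pushes one start tag through the filter and returns the attribute value seen downstream. */
    static String demo(String element, String attr, String value) {
        final String[] seen = new String[1];
        UrlRewriteFilter filter =
                new UrlRewriteFilter("http://old.example/", "http://new.example/");
        filter.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String u, String l, String q, Attributes a) {
                seen[0] = a.getValue(attr);
            }
        });
        AttributesImpl atts = new AttributesImpl();
        atts.addAttribute("", attr, attr, "CDATA", value);
        try {
            filter.startElement("", element, element, atts);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return seen[0];
    }

    public static void main(String[] args) {
        System.out.println(demo("img", "src", "http://old.example/a.png"));
        System.out.println(demo("a", "href", "http://old.example/page"));
    }
}
```

Since TagSoup's Parser is an XMLReader, a filter like this would presumably sit between the parser and whatever content handler writes the output.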

I got the basic application working pretty quickly. The problem is that the HTML I have to parse uses entities liberally, so &nbsp;, &euro; and &lt; are to be expected everywhere, and the Parser class does not make it easy to handle them well.

Parser::aval() calls Parser::expandEntities(), which is a private function. This function looks up the entities in the entity string, and aval() finally returns a string that contains the translated codes. Now, when I print this back into my output HTML, I no longer have the &...; strings; it contains the translated codes only. Unfortunately, the &nbsp; mapping in the stock library is incorrect: it shows up as an A with a ^ over it. Now, (a) I don't know how correct the rest of the mappings are, (b) this is non-portable in the first place and (c) this could lead to issues when my string is, say, &lt;B&gt;.

I want to be able to retain the encoded entity string as is in my output HTML. So, I want &lt; and not < in my final HTML.

I could have done that by overriding Parser::expandEntities(), but that is a private method.
Alternatively, I could override Parser::aval() and copy-paste the body of the super method minus the call to expandEntities(). But then the variables used in that function are all private!

Is there any easy way to achieve my goal?

Thanks,

-- Subhabrata

John Cowan

Dec 26, 2012, 10:45:24 AM

to Subhabrata Biswas, tagsoup...@googlegroups.com

Subhabrata Biswas scripsit:

> The Parser::aval() calls Parser::expandEntities() which is a private
> function. This function looks up the entities from the entity string and
> aval() finally returns a string that contains the translated codes. Now,
> when I print this back into my output HTML, I don't have the &...; strings
> any more - it contains the translated codes only. Unfortunately, the &nbsp;
> mapping in the stock library is incorrect, it shows up like an A with a ^
> over it.

Actually the mapping is correct. The output encoding from TagSoup does
not depend on the input encoding. The two-byte sequence for a NBSP in
UTF-8, when interpreted as Latin-1 (your platform default, most likely)
is "Â " (an Â followed by a non-breaking space).
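The effect is easy to reproduce with nothing but the JDK. The following stand-alone sketch (class and method names are made up for the example) encodes U+00A0 as UTF-8 and then decodes those bytes as Latin-1, which is the mismatch described above:

```java
import java.nio.charset.StandardCharsets;

class NbspMojibake {
    /** Encodes text as UTF-8, then (wrongly) decodes those bytes as Latin-1. */
    static String utf8ReadAsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String garbled = utf8ReadAsLatin1("\u00A0"); // a single NBSP
        for (char c : garbled.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c); // prints U+00C2, then U+00A0
        }
    }
}
```

The NBSP becomes two characters: U+00C2 (the A with a circumflex) and U+00A0.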

If you want a different output encoding, use the
--output-encoding=us-ascii switch, and you will get numeric
character references for all non-ASCII characters. Alternatively, call
setOutputProperty(XMLWriter.ENCODING, "us-ascii") on the XMLWriter object,
which is what the switch does.
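To show what ASCII-only output looks like, here is a small stand-alone sketch that replaces every code point above U+007F with a decimal character reference. It is not TagSoup's code; the class and method names are invented for the example, and the exact form XMLWriter emits may differ.

```java
class AsciiEscaper {
    /** Replaces every code point above U+007F with a decimal character reference. */
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            if (cp > 0x7F) {
                out.append("&#").append(cp).append(';'); // e.g. U+00A0 -> &#160;
            } else {
                out.append((char) cp);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // e-acute, NBSP and the euro sign are escaped; ASCII passes through.
        System.out.println(escapeNonAscii("caf\u00E9\u00A0\u20AC5"));
    }
}
```

The result contains only ASCII bytes, so it reads the same whether the file is later opened as UTF-8 or as Latin-1.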

If you absolutely must have the escape sequences in the output appear
exactly as in the input, you can try removing all the "entity" elements
from html.tssl in the source and rebuilding with Ant. I don't guarantee
that this will work, however.

> Now, (a) I don't know how correct the rest of the mappings are,
> (b) this is non-portable in the first place and (c) this could lead
> to issues when my string is, say, &lt;B&gt;.

The data on character entities comes straight from the W3C, and
the five standard XML entities &lt;, &gt;, &amp;, &quot;, and &apos;
will be re-created in the output in any case. So no worries there.

--
John Cowan http://ccil.org/~cowan co...@ccil.org
There are books that are at once excellent and boring. Those that at
once leap to the mind are Thoreau's Walden, Emerson's Essays, George
Eliot's Adam Bede, and Landor's Dialogues. --Somerset Maugham

Subhabrata Biswas

Dec 27, 2012, 1:50:29 AM

to tagsoup...@googlegroups.com, Subhabrata Biswas, co...@mercury.ccil.org

Thanks a lot for the pointers, John.

I was not using an XMLWriter - I was writing directly into the output file. I have modified the code to use the XMLWriter now and progressed one more step.

And now there is a different problem: the XMLWriter encodes all entities while writing the output file. My HTML has scripts like this:

<script>

</script>

And the XMLWriter is encoding all entities in there:

<script>

</script>

Should I have done something to avoid this problem?

Thanks again and regards,

-- Subhabrata

John Cowan

Dec 27, 2012, 2:15:15 PM

to Subhabrata Biswas, tagsoup...@googlegroups.com

Subhabrata Biswas scripsit:

> And now there is a different problem: the XMLWriter encodes all entities
> while writing the output file. My HTML has scripts like this:

Are you setting the HTML output in the XMLWriter? If you want to generate
HTML, you need to do that.
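In TagSoup this is, as far as I know, done with setOutputProperty(XMLWriter.METHOD, "html") on the XMLWriter. Why it matters for script content can be shown with a toy serializer built only on the JDK's SAX classes. This is a sketch of the general idea, not TagSoup's implementation, and TinySerializer is a made-up name: in XML mode every text node is escaped, while in HTML mode the text inside script and style is written raw.

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

class TinySerializer extends DefaultHandler {
    private final StringBuilder out = new StringBuilder();
    private final boolean htmlMode;
    private boolean rawText = false; // true inside <script>/<style> in HTML mode

    TinySerializer(boolean htmlMode) {
        this.htmlMode = htmlMode;
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        out.append('<').append(qName).append('>'); // attributes omitted for brevity
        if (htmlMode && (qName.equalsIgnoreCase("script") || qName.equalsIgnoreCase("style"))) {
            rawText = true;
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        rawText = false;
        out.append("</").append(qName).append('>');
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        for (int i = start; i < start + length; i++) {
            char c = ch[i];
            if (rawText) out.append(c);           // HTML mode: script text as-is
            else if (c == '<') out.append("&lt;");
            else if (c == '>') out.append("&gt;");
            else if (c == '&') out.append("&amp;");
            else out.append(c);
        }
    }

    /** Serializes a single script element with the given body. */
    static String serializeScript(String body, boolean htmlMode) {
        TinySerializer s = new TinySerializer(htmlMode);
        s.startElement("", "script", "script", new AttributesImpl());
        s.characters(body.toCharArray(), 0, body.length());
        s.endElement("", "script", "script");
        return s.out.toString();
    }

    public static void main(String[] args) {
        String js = "if (a < b && c) run();";
        System.out.println(serializeScript(js, false)); // XML mode: escaped
        System.out.println(serializeScript(js, true));  // HTML mode: untouched
    }
}
```

Escaped script text is still well-formed XML, but a browser parsing the file as HTML does not decode entities inside script, so the code breaks.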

--
John Cowan http://www.ccil.org/~cowan co...@ccil.org
Here lies the Christian, judge, and poet Peter,
Who broke the laws of God and man and metre.

Subhabrata Biswas

Dec 28, 2012, 12:08:05 AM

to tagsoup...@googlegroups.com, Subhabrata Biswas, co...@mercury.ccil.org

Done. I am home and dry :-)

Thanks a lot, John.
