How to keep entities (like &#146;) using the HTML SAX Parser?

Rodrigo Rosenfeld Rosas

Jun 29, 2016, 3:38:03 PM

to nokogiri-talk

It seems Nokogiri can opt out of substituting entities via the NOENT parse option, but I couldn't find how to apply it to the SAX parser:

http://www.nokogiri.org/tutorials/parsing_an_html_xml_document.html

Our application downloads and modifies documents from EDGAR filings, which usually contain entities such as &#146; that we would like to preserve in the modified documents. This used to work at some point, because some of the previously downloaded documents have the expected output. Today, however, while trying to reproduce a bug in my application, I got an error when uploading the related document in my local environment and noticed that this behavior has changed.

The error happens when attempting to convert "\xC2" from ASCII-8BIT to UTF-8, but the root cause is that the document#characters method now receives "\u0092" when it should receive "&#146;", as it used to.
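To make the connection between the two symptoms concrete, here is a minimal plain-Ruby sketch (no Nokogiri involved) of why a substituted "\u0092" can end in that exact conversion error. It assumes the text is treated as binary (ASCII-8BIT) somewhere before being transcoded to UTF-8; where that happens in the application is not shown here, and the variable names are only illustrative.

```ruby
# "\u0092" is what the callback receives once &#146; has been substituted.
substituted = "\u0092"

# In UTF-8 that single code point occupies two bytes: 0xC2 0x92.
utf8_bytes = substituted.bytes.map { |b| format("0x%02X", b) }

# Assumption: the same bytes get tagged as ASCII-8BIT (binary) and are then
# transcoded to UTF-8. The first high byte, 0xC2, has no defined mapping.
error_message =
  begin
    substituted.b.encode("UTF-8")
    nil
  rescue Encoding::UndefinedConversionError => e
    e.message
  end

p utf8_bytes
p error_message
```

The first byte of the UTF-8 encoding is the "\xC2" named in the error, which is consistent with the substituted character being the trigger.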

How can I tell Nokogiri SAX Parser to not substitute entities?

Thanks,

Rodrigo.

Rodrigo Rosenfeld Rosas

Jun 30, 2016, 7:13:40 AM

to nokogiri-talk

I noticed that the same version of the application yields different results on my server and on my local machine. On the server the SAX parser won't substitute entities, while locally it will. I guess the reason may be the version of libxml2:

In the server (Ubuntu):

libxml2-dev:amd 2.9.1+dfsg1- amd64

In my local computer (Debian Sid):

libxml2-dev:amd 2.9.3+dfsg1- amd64

I'm not sure whether that helps, but it would be awesome if I could tell Nokogiri explicitly that I don't want the substitutions to happen in the SAX parser. Would that be possible?

Rodrigo Rosenfeld Rosas

Jun 30, 2016, 7:19:49 AM

to nokogiri-talk

Sorry, this is not exactly correct. For some reason I don't understand, the production server works as expected (keeping entity chars) with the same code. But the result of this irb session is the same on every computer I try:

irb -r nokogiri

class Parser < Nokogiri::XML::SAX::Document
  def characters(v)
    p v
  end
end
Nokogiri::HTML::SAX::Parser.new(Parser.new).parse("a&#146;b")

The result being:

"a"

"\u0092"

"b"

Ideally I should get just "a&#146;b" at once, but I'd be fine with this as well:

"a"

"&#146;"

"b"
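As a stopgap, one possible workaround is to map the substituted characters back to numeric references inside the characters callback. The sketch below is plain Ruby and only an assumption about what would be acceptable here: the helper name restore_c1_references is hypothetical, it only covers the C1 control range (U+0080 to U+009F) that &#146; falls into, and it expects UTF-8 input. It is also lossy, since a literal U+0092 already present in the source would be rewritten as well.

```ruby
# Hypothetical helper: turn C1 control characters (U+0080..U+009F) back into
# decimal character references such as &#146;.
C1_CONTROLS = /[\u0080-\u009F]/

def restore_c1_references(text)
  text.gsub(C1_CONTROLS) { |char| format("&#%d;", char.ord) }
end

# The three chunks the characters callback currently receives.
chunks = ["a", "\u0092", "b"]
restored = chunks.map { |chunk| restore_c1_references(chunk) }.join
puts restored
```

Called from characters(v) on each chunk, this would give the second form of output above without depending on how the parser itself handles entities.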

Any help is appreciated.

Rodrigo Rosenfeld Rosas

Jun 30, 2016, 7:57:00 AM

to nokogiri-talk

I finally found how to use the API to tell Nokogiri not to perform the substitutions, but it doesn't seem to work. Here's the corresponding issue for reference:

https://github.com/sparklemotion/nokogiri/issues/1284
