It seems Nokogiri can opt out of substituting entities through the NOENT parse option, but I couldn't find how to apply it to the SAX parser.
Our application downloads and modifies documents from EDGAR filings. These usually contain entities, such as the one that renders as ’, which we would like to preserve in the modified documents. It used to work this way at some point, because some of the previously downloaded documents have the expected output. Today, while trying to reproduce a bug in my application, I got an error when uploading the related document in my local environment and noticed that this behavior has changed.
The error happens when attempting to convert "\xC2" from ASCII-8BIT to UTF-8. The underlying cause is that the document's #characters method now receives "\u0092", whereas it used to receive the original, unsubstituted entity (the one that renders as "’").
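To illustrate the error itself, here is a minimal sketch in plain Ruby (no Nokogiri involved). It assumes the substituted text ends up tagged as binary somewhere in the upload path; that step and the helper name `conversion_error_message` are assumptions for illustration, not something confirmed in the application. U+0092 is encoded in UTF-8 as the bytes C2 92, which is why "\xC2" is the byte named in the error.

```ruby
# U+0092 is the character the SAX handler receives after substitution.
substituted = "\u0092"

# In UTF-8 it occupies two bytes: C2 92.
hex_bytes = substituted.bytes.map { |b| format("%02X", b) }
puts hex_bytes.inspect # ["C2", "92"]

# Assumption: the same bytes get tagged as binary (ASCII-8BIT) along the way.
binary = substituted.b

# Hypothetical helper: returns the conversion error message, or nil on success.
def conversion_error_message(str)
  str.encode("UTF-8")
  nil
rescue Encoding::UndefinedConversionError => e
  e.message
end

# Transcoding binary data to UTF-8 fails on the first high byte, 0xC2.
message = conversion_error_message(binary)
puts message

# The UTF-8 tagged original needs no conversion and raises nothing.
puts conversion_error_message(substituted).inspect # nil
```

So the conversion failure looks like a symptom of the substituted character reaching the handler, rather than a separate problem.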
How can I tell the Nokogiri SAX parser not to substitute entities?
Thanks,
Rodrigo.