Custom fallback encoding for Nokogiri::HTML::Document.parse with autodetect on files

gabriele renzi Apr 22, 2012 10:21 AM
Posted in group: nokogiri-talk
Hi everyone,

I have a small issue using Nokogiri in one project: I don't seem to be able to both use the HTML charset autodetection feature _and_ fall back to a user-defined encoding.

Basically, I have a bunch of documents with different encodings, but I can assume that they are either UTF-8 or have meta tags specifying their original encoding.

Yet, I cannot do:

    doc = Nokogiri::HTML::Document.parse(io, url, 'utf-8')

because this will make the parser choke on the documents that are not valid UTF-8. At the same time, I cannot do

    doc = Nokogiri::HTML::Document.parse(io, url, nil)

(which allows Nokogiri to consider the meta tags), as this will cause some UTF-8 files to be parsed as... something else, producing mangled strings.

This means that to get a behaviour like "consider meta tags, or fall back to UTF-8" I need to either reimplement EncodingReader or access private/undocumented/internal Nokogiri code.
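
To make the desired behaviour concrete, here is a minimal sketch that lives entirely outside Nokogiri. It assumes the whole document has already been read into a string, and it uses a simple regular expression rather than EncodingReader's logic, so it will miss cases the real detector handles. The names `META_CHARSET`, `sniff_meta_charset` and `encoding_with_fallback` are made up for this example.

```ruby
# Hypothetical helpers; none of these names are part of Nokogiri.
# Matches both <meta charset="..."> and
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
META_CHARSET = /<meta[^>]*?charset\s*=\s*["']?\s*([\w.:-]+)/i

# Return the charset named in a <meta> tag near the start of the
# document, or nil if none is found.
def sniff_meta_charset(html)
  # Work on a binary copy of the first bytes so that invalid byte
  # sequences cannot raise while matching.
  head = html.byteslice(0, 4096).b
  match = META_CHARSET.match(head)
  match && match[1]
end

# "Consider meta tags, or fall back to UTF-8."
def encoding_with_fallback(html, fallback = 'UTF-8')
  sniff_meta_charset(html) || fallback
end
```

The result could then be passed as the explicit encoding, for example `Nokogiri::HTML::Document.parse(html, url, encoding_with_fallback(html))`. That works only for strings, not for streaming IO objects, which is part of why it is not a real substitute for the built-in detection.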

Am I overlooking something?

If not, I have a tiny patch that allows passing an options hash as the value of encoding, which lets me specify a fallback encoding and still get autodetection[1].
This API is rather bad, but at least it lets old code keep working without being aware of it.
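
As an illustration of the idea only (this is not the patch itself), the encoding argument could be normalised roughly as below. The helper name `split_encoding_argument` and the hash keys `:encoding` and `:fallback` are assumptions made up for this sketch.

```ruby
# Hypothetical sketch: accept either a plain encoding (String or nil),
# as old callers pass, or a Hash carrying a fallback encoding.
# Returns [forced_encoding, fallback_encoding].
def split_encoding_argument(encoding)
  if encoding.is_a?(Hash)
    # New-style call: autodetect unless :encoding is given,
    # and use :fallback when detection finds nothing.
    [encoding[:encoding], encoding[:fallback]]
  else
    # Old-style call: behaves exactly as before, no fallback.
    [encoding, nil]
  end
end
```

Old calls such as `parse(io, url, 'utf-8')` or `parse(io, url, nil)` would go through the second branch unchanged, which is the backward compatibility mentioned above.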

Possibly a better approach would be to document EncodingReader and, at the same time, allow instances of it to be passed as options to the HTML::Document methods?
(Or make it an ivar and thus overridable, but that seems to entail modifying global state.)

I also believe the same issue occurs with respect to non-file readable objects, as there is a line:

    encoding ||= EncodingReader.detect_encoding(string_or_io)

which means detection only occurs if encoding is not set, but that is not the issue I faced.
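
The effect of `||=` in that line can be seen in isolation with a stand-in detector. `fake_detect` and `pick_encoding` are hypothetical names, and the stand-in always reports ISO-8859-1 rather than inspecting its input.

```ruby
# Hypothetical stand-in for EncodingReader.detect_encoding.
def fake_detect(_string)
  'ISO-8859-1'
end

# Mirrors the `encoding ||= ...` pattern: the detector is consulted
# only when the caller passed nil (or false) as the encoding.
def pick_encoding(string, encoding = nil)
  encoding ||= fake_detect(string)
  encoding
end
```

With an explicit encoding the detector is never called, so there is no point at which a detected value could take precedence and no point at which a fallback could be applied.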