Custom fallback encoding for Nokogiri::HTML::Document.parse with autodetect on files


gabriele renzi

Apr 22, 2012, 1:21:31 PM
to nokogi...@googlegroups.com
Hi everyone,


I have a small issue using Nokogiri in one project: I don't seem to be able to both use the HTML charset autodetection feature _and_ fall back to a user-defined encoding.


Basically, I have a bunch of documents with different encodings, but I can assume that each one is either UTF-8 or has a meta tag specifying its original encoding.


Yet, I cannot do:


    doc = Nokogiri::HTML::Document.parse(io, url, 'utf-8')


because this will make the parser choke on the documents that are not valid UTF-8. At the same time, I cannot do


    doc = Nokogiri::HTML::Document.parse(io, url, nil)



(which allows Nokogiri to consider the meta tags) as this will cause some UTF-8 files to be parsed as... something else, producing mangled strings.




This means that to get a behaviour like "consider meta tags, or fall back to UTF-8" I need to either reimplement EncodingReader or access private/undocumented/internal Nokogiri code.


Am I overlooking something?


If not, I have a tiny patch that allows passing an options hash as the value of encoding, which lets me specify a fallback encoding and still get autodetection[1].
This API is rather bad, but at least it lets old code keep working without being aware of it.
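
To make the idea concrete, here is a rough sketch of how an encoding argument that may be nil, a string, or an options hash could be resolved. It is not the code from the patch in [1]: the method name `resolve_encoding`, the `:fallback` key, and the `detector` callable (a stand-in for whatever autodetection Nokogiri performs) are all assumptions made up for illustration.

```ruby
# Hypothetical sketch, not Nokogiri code: decide which encoding to use when
# the caller may pass nil, an encoding name, or a Hash with a :fallback key.
# `detector` is any callable that returns a detected encoding name or nil.
def resolve_encoding(encoding, detector)
  case encoding
  when Hash
    # Autodetect first; use the caller's fallback only if nothing was found.
    detector.call || encoding[:fallback]
  when nil
    # Old behaviour: pure autodetection, possibly nil.
    detector.call
  else
    # Old behaviour: an explicit encoding always wins.
    encoding
  end
end

nothing_found = -> { nil }
found_latin1  = -> { 'ISO-8859-1' }

resolve_encoding({ fallback: 'UTF-8' }, nothing_found) # falls back
resolve_encoding({ fallback: 'UTF-8' }, found_latin1)  # detection wins
resolve_encoding('utf-8', found_latin1)                # explicit value wins
```

Existing callers that pass nil or a string would take the same branches as before, which is the backwards-compatibility point made above.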

Possibly a better thing would be to document EncodingReader and at the same time allow instances of it to be passed as options to HTML::Document methods?
(or make it an ivar and thus overridable, but that seems to require modifying global state)

PS
I also believe the same issue occurs with respect to non-file readable objects, as there is a line:

    encoding ||= EncodingReader.detect_encoding(string_or_io)

which means detection only occurs if encoding is not set, but that is not the issue I faced.
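
A tiny self-contained illustration of that `||=` behaviour follows. The `detect_encoding` method here is a made-up stand-in that always returns the same value; it is not Nokogiri's EncodingReader, and `effective_encoding` is a hypothetical name.

```ruby
# Stand-in for EncodingReader.detect_encoding (hypothetical: the real one
# inspects the document; this one always "detects" ISO-8859-1).
def detect_encoding(_string)
  'ISO-8859-1'
end

def effective_encoding(string, encoding = nil)
  # ||= assigns only when encoding is nil (or false), so an explicit
  # encoding from the caller skips detection entirely.
  encoding ||= detect_encoding(string)
  encoding
end

effective_encoding('<html></html>')          # detection runs
effective_encoding('<html></html>', 'utf-8') # detection is skipped
```

With this shape there is no way to say "detect, and use my value only if detection finds nothing": any non-nil encoding disables detection.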

Walter Lee Davis

Apr 23, 2012, 9:42:50 AM
to nokogi...@googlegroups.com
Totally a hack (my speciality) but could you not seek into the text stream some number of bytes and do a regexp to find a content-type meta tag? If you don't find one in the first 1200 bytes or so, give up and use Unicode. 
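
Something along these lines, as a rough sketch only. The helper name `sniff_encoding`, its keyword arguments, and the regexp are made up for illustration, it assumes a rewindable IO, and it takes "Unicode" to mean UTF-8.

```ruby
require 'stringio'

# Hypothetical helper, not part of Nokogiri: look for a charset declaration
# in the first `limit` bytes of a rewindable IO; return its canonical name,
# or `fallback` when none is found or the name is unknown to Ruby.
def sniff_encoding(io, fallback: 'UTF-8', limit: 1200)
  head = io.read(limit).to_s.b # binary copy, so odd bytes cannot break the regexp
  io.rewind                    # leave the stream ready for the real parse
  match = head.match(/<meta[^>]+charset\s*=\s*["']?\s*([\w.:-]+)/i)
  return fallback unless match

  Encoding.find(match[1]).name # raises ArgumentError for unknown names
rescue ArgumentError
  fallback
end

with_meta = StringIO.new(
  '<html><head><meta http-equiv="Content-Type" ' \
  'content="text/html; charset=iso-8859-1"></head></html>'
)
without_meta = StringIO.new('<html><body>plain</body></html>')

sniff_encoding(with_meta)    # name taken from the meta tag
sniff_encoding(without_meta) # falls back to UTF-8
```

The returned name could then be handed to Nokogiri::HTML::Document.parse(io, url, encoding) as the explicit encoding.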

Walter
--
You received this message because you are subscribed to the Google Groups "nokogiri-talk" group.
To view this discussion on the web visit https://groups.google.com/d/msg/nokogiri-talk/-/tGihB-iBRSAJ.
To post to this group, send email to nokogi...@googlegroups.com.
To unsubscribe from this group, send email to nokogiri-tal...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/nokogiri-talk?hl=en.

gabriele renzi

Apr 23, 2012, 10:10:46 AM
to nokogi...@googlegroups.com
On Mon, Apr 23, 2012 at 3:42 PM, Walter Lee Davis <wa...@wdstudio.com> wrote:
> Totally a hack (my speciality) but could you not seek into the text stream
> some number of bytes and do a regexp to find a content-type meta tag? If you
> don't find one in the first 1200 bytes or so, give up and use Unicode.
>
> Walter

Yes, but if I go the hack way then it's easier to copy&paste
EncodingReader out of Nokogiri's sources :)

--
twitter: @riffraff
blog (en, it): www.riffraff.info riffraff.blogsome.com
work: circleme.com

Mike Dalessio

Apr 23, 2012, 10:21:38 AM
to nokogi...@googlegroups.com
Hey there,

This doesn't seem unreasonable at first glance. Can you send a pull request, please?

Thanks!

gabriele renzi

Apr 23, 2012, 10:36:27 AM
to nokogi...@googlegroups.com
>> [1]
>> https://github.com/riffraff/nokogiri/commit/09811dd3723e37c765e7293a0fad62cd7016521e
>>
>>
>
> This doesn't seem unreasonable at first glance. Can you send a pull request,
> please?

If you meant my patch, I've just done it; thanks for the consideration.

I still believe (IMVHO) that documenting EncodingReader, and thus allowing
users to reuse and customize it, could be better. I can try that in
a branch if you prefer.

Thanks!

Mike Dalessio

Apr 23, 2012, 10:45:46 AM
to nokogi...@googlegroups.com
On Mon, Apr 23, 2012 at 10:36 AM, gabriele renzi <rff...@gmail.com> wrote:
> if you meant my patch, just done it, thanks for the consideration.
>
> I still believe (IMVHO) that documenting EncodingReader thus allowing
> users to reuse it and customize it could be better, I can try that in
> a branch if you prefer.

I think you're right, and we should probably do that for Nokogiri 2.0 ... I've added it to the 2.0 Roadmap at https://github.com/tenderlove/nokogiri/blob/master/ROADMAP.md

 

Thanks!
