Problem getting Nokogiri to properly parse a file containing CDATA-wrapped, Base64-encoded images


JD Hendrickson

Jul 22, 2016, 9:22:23 PM
to nokogiri-talk
I am using Nokogiri to parse XML files that contain Base64-encoded images. The images are wrapped in CDATA sections. When I run an XPath query against the document in an XPath tool (pathology on my Mac), I get the correct, expected number of nodes. With Nokogiri, however, I get far fewer results (expecting 203 and getting only 85). If I remove the Base64 data, Nokogiri finds the correct number of nodes.

As far as I can tell, Nokogiri should not be choking on the Base64-encoded strings, but that definitely appears to be what is happening. Can anyone shed some light on this?

Thanks,

JD

Mike Dalessio

Jul 22, 2016, 9:40:01 PM
to nokogiri-talk

How large are your XML files?


--
You received this message because you are subscribed to the Google Groups "nokogiri-talk" group.
To unsubscribe from this group and stop receiving emails from it, send an email to nokogiri-tal...@googlegroups.com.
To post to this group, send email to nokogi...@googlegroups.com.
Visit this group at https://groups.google.com/group/nokogiri-talk.
For more options, visit https://groups.google.com/d/optout.

JD Hendrickson

Jul 25, 2016, 11:57:52 AM
to nokogiri-talk
The largest one has been around 100 MB; the most recent one is around 50 MB.

Mike Dalessio

Feb 10, 2017, 4:58:07 AM
to nokogiri-talk
Sorry for the late reply.

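libxml2 has limits on the length of text nodes and the depth of the tree it will parse, unless you opt out of those limits by using the `huge` parse option. A very large Base64 payload inside a CDATA section can exceed the length limit, which would explain the missing nodes.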

You can learn more about how to use parse options here: http://www.nokogiri.org/tutorials/parsing_an_html_xml_document.html

HTH,
-mike

