Nokogiri 1.8.1 skips last XML tag once file size crosses about 10 MB


Matti Jagula

Nov 16, 2017, 9:35:25 AM
to nokogiri-talk
Hi All,

I've run into an issue where, for certain large XML files, the last tag in the file is not present in the DOM when parsed with Nokogiri 1.8.1. I'm parsing files that contain the following structure: `Root/ModelFile/ProductOccurrence*`, and with certain files the last `ProductOccurrence` tag is missing from the DOM.

A repro script is available here.

Parsing with other XML parsers (xmllint, for example) shows that the XML files are valid. The issue does not seem to affect Nokogiri 1.6.6. It appears on Ruby 2.1, 2.2 and 2.3 on multiple machines.

The problem seems to be loosely related to the size of the XML. Smallish files parse fine, but once the file size reaches around 10 MB the last tag starts to be omitted. The tag attributes also seem to be relevant, as the issue does not reproduce when I leave the attributes out of the XML.

Can someone try to verify the problem before I create an issue?

Thanks!

Matti

Mike Dalessio

Nov 16, 2017, 9:38:23 AM
to nokogiri-talk
Matti,

Thanks for asking this question. There are some default limits set by libxml2 (the underlying parser used by Nokogiri) that can be worked around (with a small performance penalty) by setting the `HUGE` parse option.

Some help with setting parse options can be found here: http://www.nokogiri.org/tutorials/parsing_an_html_xml_document.html#parse_options

Can you try parsing with this option set and see if it addresses the behavior you're seeing?



Matti Jagula

Nov 16, 2017, 9:56:42 AM
to nokogiri-talk
Mike,

Thanks for the pointer, that indeed works around the issue!

It's quite puzzling that even with XML files hundreds of megabytes in size, only the very last record gets dropped. I would expect a limitation in libxml2 to manifest itself a bit more dramatically.

Matti

Mike Dalessio

Nov 16, 2017, 6:47:27 PM
to nokogiri-talk
Great to hear this option helps you! Agreed, libxml2 could be more obvious about when it starts dropping tags during parsing.
