XML files with entities

James Healy

Nov 9, 2009, 10:54:28 PM
to nokogi...@googlegroups.com
Hi folks,

I'm attempting to process an XML file that follows the ONIX standard[1]
using nokogiri 1.4.

Most files work fine, but I ran into one this morning that had a named
entity (&ndash;) in it, which triggered an exception. See [2] for a
sample XML file, test script and output.

The ONIX spec is defined via a DTD, and if you dig through it there's
~1500 named entities that are permitted. Is there currently any way for
me to stop nokogiri raising an exception on files with entities?

cheers

-- James Healy <ji...@deefa.com> Tue, 10 Nov 2009 14:51:43 +1100

[1] http://www.editeur.org/15/Previous-Releases/
[2] http://gist.github.com/230595

Aaron Patterson

Nov 9, 2009, 11:20:59 PM
to nokogi...@googlegroups.com
On Mon, Nov 9, 2009 at 7:54 PM, James Healy <ji...@deefa.com> wrote:
>
> Hi folks,
>
> I'm attempting to process an XML file that follows the ONIX standard[1]
> using nokogiri 1.4.
>
> Most files work fine, but I ran into one this morning that had a named
> entity (&ndash;) in it, which triggered an exception. See [2] for a
> sample XML file, test script and output.

Thank you so much for the sample script! That makes my life much easier! :-D

> The ONIX spec is defined via a DTD, and if you dig through it there's
> ~1500 named entities that are permitted. Is there currently any way for
> me to stop nokogiri raising an exception on files with entities?

The only way to get it to stop complaining is by loading the DTD.
Once you load the DTD, then libxml2 will know how to properly deal
with the named entities.

Do you really need to use the Reader API? It's quite easy to get it
to load the DTD if you're parsing with the DOM api. I'm not so sure
that is the case with the Reader API.

--
Aaron Patterson
http://tenderlovemaking.com/

James Healy

Nov 9, 2009, 11:51:02 PM
to nokogi...@googlegroups.com
Aaron Patterson wrote:
> > The ONIX spec is defined via a DTD, and if you dig through it there's
> > ~1500 named entities that are permitted. Is there currently any way for
> > me to stop nokogiri raising an exception on files with entities?
>
> The only way to get it to stop complaining is by loading the DTD.
> Once you load the DTD, then libxml2 will know how to properly deal
> with the named entities.
>
> Do you really need to use the Reader API? It's quite easy to get it
> to load the DTD if you're parsing with the DOM api. I'm not so sure
> that is the case with the Reader API.

I need to deal with files that range from under 1 kB to over 300 MB, so
the DOM API isn't really the best option. I guess I could go SAX if that
helps, but the Reader API just makes things so easy.

Is there an example of how to load the DTD in the DOM and/or SAX apis?
Maybe with those I can work out if the reader API has similar support.

-- James Healy <ji...@deefa.com> Tue, 10 Nov 2009 15:48:41 +1100

Aaron Patterson

Nov 13, 2009, 9:20:58 PM
to nokogi...@googlegroups.com

Hrm... Not that I know of. I'm going to have to research this.
Would you mind filing a ticket to research this? I've been crazy busy
this week (because of RubyConf). I know I'll forget otherwise, and I
don't want to let my users down. :-)

James Healy

Nov 14, 2009, 1:55:49 AM
to nokogi...@googlegroups.com
Aaron Patterson wrote:
> Hrm... Not that I know of. I'm going to have to research this.
> Would you mind filing a ticket to research this? I've been crazy busy
> this week (because of RubyConf). I know I'll forget otherwise, and I
> don't want to let my users down. :-)

Done, as ticket #165, although it looks like it may be a dup of #104?

Thanks for the support, I appreciate how responsive you are to queries.

-- James Healy <ji...@deefa.com> Sat, 14 Nov 2009 17:54:32 +1100

Aaron Patterson

Dec 6, 2009, 5:47:17 PM
to nokogiri-talk
On Nov 13, 10:55 pm, James Healy <ji...@deefa.com> wrote:
> Aaron Patterson wrote:
> > Hrm...  Not that I know of.  I'm going to have to research this.
> > Would you mind filing a ticket to research this?  I've been crazy busy
> > this week (because of RubyConf).  I know I'll forget otherwise, and I
> > don't want to let my users down.  :-)
>
> Done, as ticket #165, although it looks like it may be a dup of #104?
>
> Thanks for the support, I appreciate how responsive you are to queries.

Finally figured it out. Looks like it's possible with the current
release of nokogiri:

http://gist.github.com/250477

The only crappy part is that it takes time to load the DTD from the
internets. If you're willing to perform some superhacks, you can
trick it into loading the DTD from your filesystem though.