I'm attempting to process an XML file that follows the ONIX standard[1] using nokogiri 1.4.
Most files work fine, but I ran into one this morning that had a named entity (&ndash;) in it, which triggered an exception. See [2] for a sample XML file, test script and output.
The ONIX spec is defined via a DTD, and if you dig through it there's ~1500 named entities that are permitted. Is there currently any way for me to stop nokogiri raising an exception on files with entities?
cheers
-- James Healy <ji...@deefa.com> Tue, 10 Nov 2009 14:51:43 +1100
On Mon, Nov 9, 2009 at 7:54 PM, James Healy <ji...@deefa.com> wrote:
> Hi folks,
> I'm attempting to process an XML file that follows the ONIX standard[1]
> using nokogiri 1.4.
> Most files work fine, but I ran into one this morning that had a named
> entity (&ndash;) in it, which triggered an exception. See [2] for a
> sample XML file, test script and output.
Thank you so much for the sample script! That makes my life much easier! :-D
> The ONIX spec is defined via a DTD, and if you dig through it there's
> ~1500 named entities that are permitted. Is there currently any way for
> me to stop nokogiri raising an exception on files with entities?
The only way to get it to stop complaining is by loading the DTD. Once you load the DTD, then libxml2 will know how to properly deal with the named entities.
Do you really need to use the Reader API? It's quite easy to get it to load the DTD if you're parsing with the DOM api. I'm not so sure that is the case with the Reader API.
Aaron Patterson wrote:
> > The ONIX spec is defined via a DTD, and if you dig through it there's
> > ~1500 named entities that are permitted. Is there currently any way for
> > me to stop nokogiri raising an exception on files with entities?
> The only way to get it to stop complaining is by loading the DTD.
> Once you load the DTD, then libxml2 will know how to properly deal
> with the named entities.
> Do you really need to use the Reader API? It's quite easy to get it
> to load the DTD if you're parsing with the DOM api. I'm not so sure
> that is the case with the Reader API.
I need to deal with files that range from < 1kB to > 300Mb, so the DOM api isn't really the best option. I guess I could go SAX if that helps, the Reader api just makes things so easy though.
Is there an example of how to load the DTD in the DOM and/or SAX apis? Maybe with those I can work out if the reader API has similar support.
-- James Healy <ji...@deefa.com> Tue, 10 Nov 2009 15:48:41 +1100
On Mon, Nov 9, 2009 at 8:51 PM, James Healy <ji...@deefa.com> wrote:
> Aaron Patterson wrote:
>> > The ONIX spec is defined via a DTD, and if you dig through it there's
>> > ~1500 named entities that are permitted. Is there currently any way for
>> > me to stop nokogiri raising an exception on files with entities?
>> The only way to get it to stop complaining is by loading the DTD.
>> Once you load the DTD, then libxml2 will know how to properly deal
>> with the named entities.
>> Do you really need to use the Reader API? It's quite easy to get it
>> to load the DTD if you're parsing with the DOM api. I'm not so sure
>> that is the case with the Reader API.
> I need to deal with files that range from < 1kB to > 300Mb, so the DOM
> api isn't really the best option. I guess I could go SAX if that helps,
> the Reader api just makes things so easy though.
> Is there an example of how to load the DTD in the DOM and/or SAX apis?
> Maybe with those I can work out if the reader API has similar support.
Hrm... Not that I know of. I'm going to have to research this. Would you mind filing a ticket to research this? I've been crazy busy this week (because of RubyConf). I know I'll forget otherwise, and I don't want to let my users down. :-)
Aaron Patterson wrote:
> Hrm... Not that I know of. I'm going to have to research this.
> Would you mind filing a ticket to research this? I've been crazy busy
> this week (because of RubyConf). I know I'll forget otherwise, and I
> don't want to let my users down. :-)
Done, as ticket #165, although it looks like it may be a dup of #104?
Thanks for the support, I appreciate how responsive you are to queries.
-- James Healy <ji...@deefa.com> Sat, 14 Nov 2009 17:54:32 +1100
On Nov 13, 10:55 pm, James Healy <ji...@deefa.com> wrote:
> Aaron Patterson wrote:
> > Hrm... Not that I know of. I'm going to have to research this.
> > Would you mind filing a ticket to research this? I've been crazy busy
> > this week (because of RubyConf). I know I'll forget otherwise, and I
> > don't want to let my users down. :-)
> Done, as ticket #165, although it looks like it may be a dup of #104?
> Thanks for the support, I appreciate how responsive you are to queries.
Finally figured it out. Looks like it's possible with the current
release of nokogiri:
The only crappy part is that it takes time to load the DTD from the
internets. If you're willing to perform some superhacks, you can
trick it into loading the DTD from your filesystem though.