XML Validation against custom DTDs.

Wayne Brissette

Jul 10, 2020, 10:53:17 AM
to nokogiri-talk
I'm not sure how to do this exactly, so I'm hoping somebody can point me
in the right direction.

https://nokogiri.org/tutorials/ensuring_well_formed_markup.html

That page provides no real information on how I would validate against
custom DTDs that reside on a server. Is there a way to tell Nokogiri to
look on some server for the DTD that it needs to validate against? Or is
the DTD declaration in the XML document enough?

What I'm going to do is add a process to our Jenkins pipeline that will
use Nokogiri to validate XML content that has been created elsewhere.
But I'm still unsure if just being able to read the file into Nokogiri
is enough (as shown on the well formed markup webpage).

-Wayne

Mike Dalessio

Jul 10, 2020, 10:59:30 AM
to nokogiri-talk
Hi Wayne,

The page you're linking to is about "well-formedness" (related to the XML spec) which is different from "valid" (related to a domain-specific DTD or schema).

You can see a short code snippet of how to validate against a DTD schema here:


Hopefully this helps?

-m


--
You received this message because you are subscribed to the Google Groups "nokogiri-talk" group.
To unsubscribe from this group and stop receiving emails from it, send an email to nokogiri-tal...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/nokogiri-talk/ff0c75a4-9203-1324-ab8e-1bf44fcc7225%40att.net.

Wayne Brissette

Jul 10, 2020, 11:06:27 AM
to nokogi...@googlegroups.com
Quick question about this. This assumes it has an XSD. Do we need to convert our DTDs into XSD in order to do this?

-Wayne

Jack Royal-Gordon

Jul 10, 2020, 1:44:48 PM
to nokogi...@googlegroups.com
Hi Guys,

Lurker here. The docs say about the schema source “usually an XSD file”. It would be good for the docs to specify what alternatives are available.

Thanks,

Jack

Wayne Brissette

Jul 11, 2020, 10:03:15 AM
to nokogi...@googlegroups.com
As a follow-up to this... there is a DTD option, but it looks like that's strictly for nodes. However, I haven't quite thrown in the towel yet.

I was able to get things to validate in Python using lxml, which is also built on the libxml2 library.

I used:
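import requests
from io import StringIO
from lxml import etree

# dtdpath (URL of the DTD) and allfiles (list of XML paths) are defined elsewhere.
response = requests.get(dtdpath)
dtdtxt = response.text
f = StringIO(dtdtxt)
dtd = etree.DTD(f)
for indv in allfiles:
    xmldoc = etree.parse(indv)
    results = dtd.validate(xmldoc)
    if not results:
        # Print the first validation error for this file.
        print("{}".format(dtd.error_log.filter_from_errors()[0]))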

which gave me:
/Users/waybri01/gitrepos/validation/samples/ext_performance_monitorssummary1.xml:10:0:ERROR:VALID:DTD_CONTENT_MODEL: Element regsumbody content does not follow the DTD, expecting (longdesc? , regsumtable), got (p regsumtable )

Which is what I would have expected since I placed a <p> element in a place we don't allow it in the DTD.

I did find this answer for Nokogiri on Stackoverflow, but something isn't right with it, or Nokogiri has changed syntax slightly since that was used.

https://stackoverflow.com/questions/36890769/how-does-one-properly-validate-an-xml-file-with-a-local-dtd-file-using-nokogiri/36892851#36892851

@Mike, can you provide a bit more insight into whether this can be done in Nokogiri and, if so, what that would look like? For the record, I did try to use oXygen's DTD to XSD conversion tool, but kept getting an error regarding some elements that were referred to from outside the XSD (probably true, since we specialize our DTD from an OASIS DTD).

-Wayne

wbrisett

Jul 12, 2020, 8:38:04 AM
to nokogiri-talk
So I've done more playing around with this issue and an old bug report against Nokogiri actually helped me track down what's going on. 

https://github.com/sparklemotion/nokogiri/issues/440

require 'nokogiri'

xml = <<-EOXML
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE reference PUBLIC "-//OASIS//DTD DITA Reference//EN" "reference.dtd">
<reference id="reference_y1f_bpy_hmb">
    <title>Simple Title</title>
    <shortdesc>Reference SD</shortdesc>
    <refbody> <p>should not be here</p></refbody>
</reference>
EOXML

# DTDLOAD asks libxml2 to load the external DTD named in the DOCTYPE.
options = Nokogiri::XML::ParseOptions::DTDLOAD
# options: 4
doc = Nokogiri::XML::Document.parse(xml, nil, nil, options)
# 2:79: WARNING: failed to load external entity "reference.dtd"
puts doc.validate
# 3:0: ERROR: No declaration for element reference
# 3:0: ERROR: No declaration for attribute id of element reference
# 4:0: ERROR: No declaration for element title
# 5:0: ERROR: No declaration for element shortdesc
# 6:0: ERROR: No declaration for element refbody
# 6:0: ERROR: No declaration for element p
external_subset = doc.external_subset
# nil


So I can clearly see that I'm not loading the external DTD. That is probably to be expected, since oXygen and the DITA OT each have their own local copies of this file.

oXygen says this:
https://www.oxygenxml.com/dita/1.3/specs/non-normative/dtd-public-identifiers.html

So, the question I have is since it can't seem to find the external entity because it's elsewhere, how can I feed that information to Nokogiri?
I can read the DTD in from one of the local DITA OT or oXygen shells (I've tried this with our own customized DTD), but I've not had much luck trying to feed that into Nokogiri.

So, I understand the issue, but the question then becomes, *how* do I properly tell Nokogiri how to ingest the DTD so it can do the validation?

-Wayne 

Mike Dalessio

Jul 15, 2020, 8:26:17 AM
to nokogiri-talk
Hi Wayne,

External entities can be loaded from the local filesystem by using either an absolute or a relative path. If I run your script with and without a `reference.dtd` file, I get:

```
juno ruby-2.7.0 (master)
nokogiri $ ~/baz.rb

2:79: WARNING: failed to load external entity "reference.dtd"
---

3:0: ERROR: No declaration for element reference
3:0: ERROR: No declaration for attribute id of element reference
4:0: ERROR: No declaration for element title
5:0: ERROR: No declaration for element shortdesc
6:0: ERROR: No declaration for element refbody
6:0: ERROR: No declaration for element p

juno ruby-2.7.0 (master)
nokogiri $ touch reference.dtd

juno ruby-2.7.0 (master)
nokogiri $ ~/baz.rb
---

3:0: ERROR: No declaration for element reference
3:0: ERROR: No declaration for attribute id of element reference
4:0: ERROR: No declaration for element title
5:0: ERROR: No declaration for element shortdesc
6:0: ERROR: No declaration for element refbody
6:0: ERROR: No declaration for element p
```

I hope that makes sense.


Wayne Brissette

Jul 15, 2020, 8:40:31 AM
to nokogi...@googlegroups.com
Mike:

The question I have is where does that go? In other words,

options = Nokogiri::XML::ParseOptions::DTDLOAD

That says to load the DTD, but is that where I would specify the external DTD path, or is there some other place? I'm trying to figure out where Nokogiri is looking for that external path and can't seem to work it out from the one or two very old examples I've found online, or from the documentation. I did figure out how to make the XSD validation work, but we don't want to manually convert these since we don't control them.

If you're saying that in each XML file we have to manually point to the location of the DTD in the declaration statement, that's also a bit of an issue for us since the idea is not to manually touch that, but just to validate items before they get processed. 


-Wayne

John Shahid

Jul 15, 2020, 9:05:12 AM
to nokogi...@googlegroups.com, Wayne Brissette

I think the location is coming from the XML document that you pasted
earlier. In other words, the location is "reference.dtd" in:

> <!DOCTYPE reference PUBLIC "-//OASIS//DTD DITA Reference//EN" "reference.dtd">


Wayne Brissette

Jul 15, 2020, 9:25:57 AM
to nokogi...@googlegroups.com


John Shahid wrote on 2020-07-15 08:05:
> I think the location is coming from the XML document that you pasted
> earlier. In other words, the location is "reference.dtd" in:
>
>> <!DOCTYPE reference PUBLIC "-//OASIS//DTD DITA Reference//EN" "reference.dtd">

That's what I'm assuming as well, which is not ideal. I like how Nokogiri
handles XSD validation, but it doesn't seem to have that same ability with
DTDs. Oh well, this has given me some ideas (they aren't ideal, but they
will work... I think). I'll just have to read in the XML file, rewrite
the doctype declaration, and insert the path to the DTDs.

-Wayne
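
The workaround described above (rewriting the doctype declaration in memory so it points at a known DTD location) could be sketched roughly as follows. This is only an illustration in plain Ruby: the helper name `rewrite_system_id`, the regular expression, and the path `/opt/dtds/reference.dtd` are all hypothetical, and the pattern only handles a DOCTYPE with a quoted system identifier and no internal subset.

```ruby
# Matches <!DOCTYPE name PUBLIC "public-id" "system-id" or
# <!DOCTYPE name SYSTEM "system-id", capturing everything before the
# system identifier so that only the identifier is replaced.
DOCTYPE_RE = /(<!DOCTYPE\s+\S+\s+(?:PUBLIC\s+(?:"[^"]*"|'[^']*')|SYSTEM)\s+)(?:"[^"]*"|'[^']*')/

# Returns a copy of the XML text whose DOCTYPE points at dtd_path.
# The file on disk is left untouched.
def rewrite_system_id(xml, dtd_path)
  xml.sub(DOCTYPE_RE) { "#{Regexp.last_match(1)}\"#{dtd_path}\"" }
end

xml = <<~EOXML
  <?xml version="1.0" encoding="UTF-8"?>
  <!DOCTYPE reference PUBLIC "-//OASIS//DTD DITA Reference//EN" "reference.dtd">
  <reference id="reference_y1f_bpy_hmb">
      <title>Simple Title</title>
  </reference>
EOXML

patched = rewrite_system_id(xml, '/opt/dtds/reference.dtd')
puts patched.lines[1]
```

The patched string can then be handed to the same `Nokogiri::XML::Document.parse` call with `DTDLOAD` used earlier in the thread, so the source files never need to be edited by hand.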